Big Data
Dipl. Inform. (FH) Jony Sugianto, M. Comp. Sc.
Hp: 0838-98355491
WA: 0812-13086659
Email: [email protected]
Agenda
● What is Big Data?
● Analytics
● Big Data Platforms
● Questions
What is Big Data?
● The basic idea behind the phrase Big Data is that everything we do increasingly leaves a digital trace (data), which we (and others) can use and analyse
● Big Data therefore refers to our ability to make use of these ever-increasing volumes of data
● Big Data is not about the size of the data; it is about the value within the data
Datafication of the world
● Activities
- Web Browser
- Credit Cards
- E-Commerce
● Conversations
● Photos/Videos
- YouTube
● Sensors
- GPS
● Etc...
Turning Big Data into Value
Datafication of our world
● Activity
● Conversation
● Sensors
● Photo/Video
● Etc...
Analysing Big Data
● Text Analytics
● Sentiment Analysis
● Movement Analytics
● Face/Voice Recognition
● Etc...
Value
Webdata
● Log Data (all users)
- Anonymous ID from Cookie Data
- Login ID (if it exists)
- ArticleId
- Channel / Category
- Browser
- IP
- Etc...
● Registered User Data (10%)
- Login ID
- Name
- Age
- Gender
- Education
- Etc...
Valuable data
● User activeness
● User interest based on reading behaviour
● Personal profiles for all users
Compute
Why use the UA²?
User Activeness and User Interest
How to update the User Activeness?
UA_new = w_history × UA_so_far + w_current × UA_per_week
w_history = 0.75, w_current = 0.25
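The update above is an exponential moving average over weekly activeness scores. A minimal Scala sketch using the weights from the slide (object and method names are illustrative):

```scala
// Weekly user-activeness update: exponential moving average
// with the weights given on the slide (0.75 history, 0.25 current week).
object UserActiveness {
  val wHistory = 0.75
  val wCurrent = 0.25

  // Blend the score accumulated so far with this week's score.
  def update(uaSoFar: Double, uaThisWeek: Double): Double =
    wHistory * uaSoFar + wCurrent * uaThisWeek
}
```

With these weights, a user's history dominates: e.g. `UserActiveness.update(10.0, 20.0)` yields 12.5, moving only a quarter of the way toward the current week's value.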
Assigning personal profiles
Final data
How to define the similarity?
● Linear: |x1 − x2|
● Square: (x1 − x2)^2
● Exponential: 10^f(|x1 − x2|)
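The three measures can be sketched in Scala. Note that the slide leaves the function f in the exponential variant unspecified, so it is taken as a parameter here (the identity default is only a placeholder):

```scala
// The three similarity (distance) measures from the slide.
object Similarity {
  // Linear: |x1 - x2|
  def linear(x1: Double, x2: Double): Double = math.abs(x1 - x2)

  // Square: (x1 - x2)^2
  def square(x1: Double, x2: Double): Double = math.pow(x1 - x2, 2)

  // Exponential: 10^f(|x1 - x2|); f is not specified on the slide,
  // so it is a parameter and the identity default is a placeholder.
  def exponential(x1: Double, x2: Double,
                  f: Double => Double = (x: Double) => x): Double =
    math.pow(10, f(math.abs(x1 - x2)))
}
```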
Complexity Analysis
● Assume 30,000,000 clicks a day
● A week: 210,000,000 clicks
● Size of a log entry: 1 KB
● Total size: 210,000,000,000 bytes = 210 GB
Complexity Analysis
● All users: 10,000,000
● Logged-in users: 1,000,000
● Comparisons per second per CPU: 1,000,000
● Total comparisons: 9,000,000,000,000
● Total time: 9,000,000 seconds = 104 days
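The slide's figures can be reproduced with a short Scala sketch, under the assumption (not stated on the slide) that the 9 × 10^12 comparisons come from matching each of the 9,000,000 anonymous users against each of the 1,000,000 registered users:

```scala
// Back-of-the-envelope check of the numbers on the slide.
object Complexity {
  val allUsers = 10000000L             // 10 million users in total
  val loginUsers = 1000000L            // 1 million registered users
  val comparisonsPerSecPerCpu = 1000000L

  // Assumption: every anonymous user is compared against
  // every registered user.
  val totalComparisons = (allUsers - loginUsers) * loginUsers

  val totalSeconds = totalComparisons / comparisonsPerSecPerCpu
  val totalDays = totalSeconds / 86400 // 86,400 seconds per day
}
```

On a single CPU this is roughly 104 days of work, which is the argument for distributing the computation.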
Big Data Platforms
What is the difference?
What is Hadoop?
● Hadoop:
an open-source framework that supports data-intensive distributed applications, licensed under the Apache v2 license
● Goals:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- High scalability and availability
- Use commodity hardware
- Fault-tolerance
- Move computation rather than data
Hadoop Components
● Hadoop Distributed File System (HDFS)
A distributed file system that provides high-throughput access to application data
● Hadoop YARN
A framework for job scheduling and cluster resource management
● Hadoop MapReduce
A YARN-based system for parallel processing of large data sets
What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop
● Hive stores its data in HDFS
● Hive compiles SQL-like queries into MapReduce jobs
Example Hive Script
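The script from this slide is not in the transcript; the following is a minimal illustrative HiveQL sketch, with table and column names that are hypothetical (loosely based on the web-log fields listed earlier):

```sql
-- Hypothetical click-log table; names are illustrative only.
CREATE TABLE clicks (login_id STRING, article_id STRING, category STRING);

-- Clicks per category; Hive compiles this query into MapReduce jobs.
SELECT category, COUNT(*) AS num_clicks
FROM clicks
GROUP BY category;
```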
What is Pig?
● Pig is a platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs
● Pig generates and compiles a MapReduce program on the fly
Example Pig Script
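The script from this slide is also missing from the transcript; a minimal illustrative word count in Pig Latin (the input path and field names are hypothetical):

```pig
-- Hypothetical word count; Pig compiles this into MapReduce on the fly.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
```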
What is Spark?
● Fast and general purpose cluster computing system
● 10x (on disk) – 100x (in memory) faster than Hadoop MapReduce
● Provides high-level APIs in
- Scala
- Java
- Python
● Can be deployed through Apache Mesos, Hadoop via YARN, or Spark's standalone cluster manager
Resilient Distributed Datasets
● Written in Scala
● Fundamental unit of data in Spark
● Distributed collection of objects
● Resilient: ability to recompute missing partitions (node failure)
● Distributed: split across multiple partitions
● Dataset: can contain any type of Scala/Java/Python object, including user-defined objects
● Operations
- Transformations (map, filter, groupBy, ...)
- Actions (count, collect, save, ...)
Spark Example
// Spark word count
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = sc.parallelize(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
  }
}
What is Flink?
● Written in Java
● An open source platform for distributed stream and batch data processing
● Several APIs in Java/Scala/Python
- DataSet API – Batch processing
- DataStream API – Stream processing
- Table API – Relational queries
Flink Example
// Flink word count
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.print()
  }
}
Questions?