Big Data
Dipl. Inform. (FH) Jony Sugianto, M. Comp. Sc.
Hp: 0838-98355491
WA: 0812-13086659
Email: [email protected]
Agenda
● What is Big Data?
● Analytics
● Big Data Platforms
● Questions
What is Big Data?
● The basic idea behind the phrase Big Data is that everything we do increasingly leaves a digital trace (data), which we (and others) can use and analyse
● Big Data therefore refers to our ability to make use of these ever-increasing volumes of data
● Big Data is not about the size of the data; it is about the value within the data
Datafication of the world
● Activities
- Web Browser
- Credit Cards
- E-Commerce
● Conversations
● Photos/Videos
- YouTube
● Sensors
- GPS
● Etc...
Turning Big Data into Value
Datafication of our world
● Activity
● Conversation
● Sensors
● Photo/Video
● Etc...
Analysing Big Data
● Text Analytics
● Sentiment Analysis
● Movement Analytics
● Face/Voice Recognition
● Etc...
Value
Webdata
● Log Data (all users)
- Anonymous ID from Cookie Data
- Login ID (if it exists)
- ArticleId
- Channel / Category
- Browser
- IP
- Etc...
● Registered User Data (10%)
- Login ID
- Name
- Age
- Gender
- Education
- Etc...
Valuable data
● User activeness
● User interest based on reading behaviour
● Personal profiles for all users
Compute
Why use the UA²?
User Activeness and User Interest
How to update the User Activeness?
UA_new = w_history × UA_so_far + w_current × UA_per_week
w_history = 0.75, w_current = 0.25
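The update above is an exponential moving average over weekly activeness scores. A minimal Scala sketch using the weights from the slide (object and method names are illustrative):

```scala
// Weekly user-activeness update: exponential moving average
// with the weights given on the slide (0.75 history, 0.25 current week).
object UserActiveness {
  val wHistory = 0.75
  val wCurrent = 0.25

  // Blend the score accumulated so far with this week's score.
  def update(uaSoFar: Double, uaThisWeek: Double): Double =
    wHistory * uaSoFar + wCurrent * uaThisWeek
}
```

With these weights, a user's history dominates: e.g. `UserActiveness.update(10.0, 20.0)` yields 12.5, moving only a quarter of the way toward the current week's value.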
Assigning personal profiles
Final data
How to define the similarity?
● Linear: |x1 − x2|
● Square: (x1 − x2)^2
● Exponential: 10^f(|x1 − x2|)
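The three measures can be sketched in Scala. Note that the slide leaves the function f in the exponential variant unspecified, so it is taken as a parameter here (the identity default is only a placeholder):

```scala
// The three similarity (distance) measures from the slide.
object Similarity {
  // Linear: |x1 - x2|
  def linear(x1: Double, x2: Double): Double = math.abs(x1 - x2)

  // Square: (x1 - x2)^2
  def square(x1: Double, x2: Double): Double = math.pow(x1 - x2, 2)

  // Exponential: 10^f(|x1 - x2|); f is not specified on the slide,
  // so it is a parameter and the identity default is a placeholder.
  def exponential(x1: Double, x2: Double,
                  f: Double => Double = (x: Double) => x): Double =
    math.pow(10, f(math.abs(x1 - x2)))
}
```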
Complexity Analysis
● Assume 30,000,000 clicks a day
● A week: 210,000,000 clicks
● Size of a log entry: 1 KB
● Total size: 210,000,000,000 bytes = 210 GB
Complexity Analysis
● All users: 10,000,000
● Logged-in users: 1,000,000
● Comparisons per second per CPU: 1,000,000
● Total comparisons: 9,000,000,000,000
● Total time: 9,000,000 seconds = 104 days
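The slide's figures can be reproduced with a short Scala sketch, under the assumption (not stated on the slide) that the 9 × 10^12 comparisons come from matching each of the 9,000,000 anonymous users against each of the 1,000,000 registered users:

```scala
// Back-of-the-envelope check of the numbers on the slide.
object Complexity {
  val allUsers = 10000000L             // 10 million users in total
  val loginUsers = 1000000L            // 1 million registered users
  val comparisonsPerSecPerCpu = 1000000L

  // Assumption: every anonymous user is compared against
  // every registered user.
  val totalComparisons = (allUsers - loginUsers) * loginUsers

  val totalSeconds = totalComparisons / comparisonsPerSecPerCpu
  val totalDays = totalSeconds / 86400 // 86,400 seconds per day
}
```

On a single CPU this is roughly 104 days of work, which is the argument for distributing the computation.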
Big Data Platforms
What is the difference?
What is Hadoop?
● Hadoop:
an open-source framework that supports data-intensive distributed applications, licensed under the Apache v2 license
● Goals:
- Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
- High scalability and availability
- Use commodity hardware
- Fault-tolerance
- Move computation rather than data
Hadoop Components
● Hadoop Distributed File System (HDFS)
A distributed file system that provides high-throughput access to application data
● Hadoop YARN
A framework for job scheduling and cluster resource management
● Hadoop MapReduce
A YARN-based system for parallel processing of large data sets
What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop
● Hive stores its data in HDFS
● Hive compiles SQL-like queries into MapReduce jobs
Example Hive Script
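The script from this slide is not in the transcript; the following is a minimal illustrative HiveQL sketch, with table and column names that are hypothetical (loosely based on the web-log fields listed earlier):

```sql
-- Hypothetical click-log table; names are illustrative only.
CREATE TABLE clicks (login_id STRING, article_id STRING, category STRING);

-- Clicks per category; Hive compiles this query into MapReduce jobs.
SELECT category, COUNT(*) AS num_clicks
FROM clicks
GROUP BY category;
```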
What is Pig?
● Pig is a platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs
● Pig generates and compiles a MapReduce program on the fly
Example Pig Script
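The script from this slide is also missing from the transcript; a minimal illustrative word count in Pig Latin (the input path and field names are hypothetical):

```pig
-- Hypothetical word count; Pig compiles this into MapReduce on the fly.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;
```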
What is Spark?
● Fast and general purpose cluster computing system
● 10x (on disk) – 100x (in memory) faster than Hadoop MapReduce
● Provides high-level APIs in
- Scala
- Java
- Python
● Can be deployed through Apache Mesos, Hadoop via YARN, or Spark's standalone cluster manager
Resilient Distributed Datasets
● Written in Scala
● Fundamental unit of data in Spark
● Distributed collection of objects
● Resilient: ability to recompute missing partitions (node failure)
● Distributed: split across multiple partitions
● Dataset: can contain any type of Scala/Java/Python object, including user-defined objects
● Operations
- Transformations (map, filter, groupBy, ...)
- Actions (count, collect, save, ...)
Spark Example
// Spark word count
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = sc.parallelize(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
  }
}
What is Flink?
● Written in Java
● An open source platform for distributed stream and batch data processing
● Several APIs in Java/Scala/Python
- DataSet API – Batch processing
- DataStream API – Stream processing
- Table API – Relational queries
Flink Example
// Flink word count
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val grouped = mappedWords.groupBy(0)
    val sum = grouped.sum(1)
    sum.print()
  }
}
Questions?