ICTA Technology Meetup 11 - Big Data
By Crishantha Nanayakkara
Meetup Recap
● 1 – Enterprise Application Integration
● 2 – Enterprise Level High Availability Options
● 3 – SOA Security
● 4 – Towards Hybrid Mobile App Development
● 5 – The Semantic Web and Linked Data
● 6 – Enterprise Application Design Patterns
● 7 – GIS – An Introduction
● 8 – The Future of the Database World
● 9 – The Enterprise Storage Management
● 10 – An Introduction to Content Management with Joomla
The Scope
● Big Data – The Definition
● The Sources of Data
● Structured, Semi-Structured vs Unstructured Data
● Relational Data vs Big Data
● Towards Big Data
● Big Data Adoption in other countries
● Big Data Technologies and the ecosystem
● Big Data Open Source and Commercial Options
Big Data Definition
A new generation of technologies and architectures, designed to economically
extract VALUE from very large VOLUMES of a wide variety of data by enabling high
VELOCITY capture, discovery, and/or analysis.
The Three Vs of Big Data
● Volume – Big
● Variety – From different sources and types
● Velocity – Frequency of its generation: how quickly the data arrives and is stored, and how quickly it can be retrieved
The Sources of Data
● Documents
● Emails
● Images
● Relational Databases
● Logs
● Social Media feeds
● Videos
● Sensor Data
● Click Streams
Structured, Semi-Structured and Unstructured Data
Structured Data
● Structured:
– Information with a high degree of organization
– Seamlessly and readily searchable by straightforward search algorithms or operations
– e.g.: relational databases, spreadsheets, XML
Semi-Structured Data
● Semi-structured:
– A form of structured data that does not conform to an explicit and fixed schema
– The data is inherently self-describing and contains tags or other markers to enforce hierarchies of records and fields within the data
– e.g.: web logs, social media feeds
Unstructured Data
● Unstructured:
– Data in formats that cannot easily be indexed into relational tables for analysis or querying
– e.g.: images, videos
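As a rough illustration of the three categories above, the sketch below handles one small sample of each kind in Python. The sample values, field names and helper functions are made up for illustration only.

```python
import csv
import io
import json

# Structured: fixed columns, so every row can be addressed by column name.
structured = io.StringIO("id,name,city\n1,Nimal,Colombo\n2,Kamala,Kandy\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing tags, but records need not share a schema.
semi_structured = [
    '{"user": "nimal", "action": "login"}',
    '{"user": "kamala", "action": "post", "tags": ["bigdata", "hadoop"]}',
]
events = [json.loads(line) for line in semi_structured]

# Unstructured: just bytes; no fields to query without extra processing.
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])  # first bytes of a PNG image


def cities(records):
    """Query a structured source directly by column name."""
    return [r["city"] for r in records]


def all_keys(records):
    """Collect every field name seen across semi-structured records."""
    return sorted({key for r in records for key in r})


print(cities(rows))
print(all_keys(events))
print(len(unstructured), "raw bytes")
```

The structured rows can be queried by column straight away, the semi-structured events have to be inspected to learn which fields exist, and the image bytes offer nothing to query until some other process extracts features from them.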
Relational Data vs Big Data
● Thinking of Big Data as “just lots more enterprise data” is tempting, but it’s a serious mistake.
● Big Data is commonly generated outside of traditional enterprise applications
● Big Data is often composed of unstructured or semi-structured information types that continually arrive in enormous amounts
● To get maximum value from Big Data, it needs to be associated with traditional enterprise data, automatically or via purpose-built applications, reports, queries, and other approaches
Towards Big Data
The Digital Universe
● From 2005 to 2020, the digital universe will grow from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes.
According to IDC, the Big Data technology and service market was about US$4.8 billion in 2011. The market is projected to grow at a compound annual growth rate (CAGR) of 37.2% between 2011 and 2015. By 2015, the market size is expected to be US$16.9 billion.
[Source: IDC. Worldwide Big Data Technology and Services 2012-2015 Forecast.]
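The IDC figures above can be checked against each other with a few lines of arithmetic. This is only a sketch of the standard compound-growth formula; the function names are illustrative, and the inputs are the numbers quoted on the slide.

```python
def project(start_value, rate, years):
    """Compound a starting value forward at a fixed annual growth rate."""
    return start_value * (1 + rate) ** years


def cagr(start_value, end_value, years):
    """Annual growth rate that takes start_value to end_value in `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


market_2011 = 4.8   # US$ billion, from the slide
market_2015 = 16.9  # US$ billion, from the slide
years = 2015 - 2011

print(f"Projected 2015 market at 37.2% CAGR: {project(market_2011, 0.372, years):.1f}")
print(f"CAGR implied by 4.8 -> 16.9: {cagr(market_2011, market_2015, years):.1%}")
```

Compounding US$4.8 billion at 37.2% for four years gives roughly US$17 billion, so the quoted growth rate and the 2015 market size agree to within rounding.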
Gartner reported that more than 65 billion devices were connected to the Internet by 2010. By 2020, this number will go up to 230 billion.
[Source: https://www.gartner.com/doc/1799626]
The Opportunity for Big Data
● Only a tiny fraction of the digital universe has been explored for analytic value so far.
● By 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed.
● But only if it is tagged and analyzed. That is the opportunity for Big Data.
Source: IDC's Digital Universe Study, 2012
The Candidates for Big Data
● Not all data is necessarily useful for Big Data analytics. However, some data types are particularly good for analysis:
– Surveillance footage
– Embedded medical devices
– Entertainment and social media
– Images and voice data
– Data processing
Source: IDC's Digital Universe Study, 2012
● Over a history that spans more than 30 years, SQL database servers have traditionally held gigabytes of information — and reaching that milestone took a long time.
● In the past 15 years, data warehouses and enterprise analytics expanded these volumes to terabytes.
● And in the last 5 years, the distributed file systems that store Big Data now routinely house petabytes of information.
The Statistics
The Big Data Adoption in the World
Source: http://www.informationweek.com/government/information-management/white-house-shares-200-million-big-data/232700522
http://www.informationweek.com/regulations/federal-standards-body-focuses-on-big-data-cloud/d/d-id/1102703?
Singapore Transport System (Land Transport Authority, LTA)
Source: How Cities using Big Data in Asia? - FutureGov Report
● Data Collection:
– Junction Electronic Eyes
– Green Link Determining System
– Web cams
– Parking Guidance Systems
– Expressway Monitoring Systems
– Traffic Scan
● Data Processing:
– All the data is fed into an integrated i-Transport processing system
– The data is aggregated, integrated and analyzed
● Data Dissemination:
– Via web portals, radio broadcasting, navigation devices, smart phones, etc.
– Certain data elements are given as “open data”
Singapore National Environment Agency
Source: How Cities using Big Data in Asia? - FutureGov Report
● Dengue-related data:
– The data is pulled from dengue cases, public feedback, mosquito inspections and other sources for analysis.
– By using GIS to identify high-risk areas, the agency is also able to prioritize places for checks.
iPlan Project (Urban Redevelopment Authority, URA)
Source: How Cities using Big Data in Asia? - FutureGov Report
● iPLAN is among the world’s first nationwide enterprise GIS systems for urban planning and it contains comprehensive land, building, planning and approval information which is readily available to URA’s planners
Kuala Lumpur Government
Source: How Cities using Big Data in Asia? - FutureGov Report
● The government has created a Big Data Analytics fund to support four government-initiated projects by 2015, focusing on:
– Transport
– Planning
– Environment
– Security
Technologies behind Big Data
Reference: http://www.bdisys.com/27/1/17/BIG%20DATA/HADOOP
Hadoop
Hadoop – An Introduction
● Hadoop is a framework that provides open source libraries for distributed computing, using MapReduce and its own distributed file system, the Hadoop Distributed File System (HDFS)
● Open source, written in Java
● Maintained by the Apache Software Foundation as a top-level project
● Original deployments:
– Yahoo, Facebook, LinkedIn
Hadoop – The Core Components
● The kernel (core) of Hadoop provides:
– Reliable shared storage (HDFS)
– An analysis system (MapReduce)
● There are other components in Hadoop, which together make up the complete Hadoop ecosystem
Hadoop Architecture
● Designed to scale out from a few computing nodes to thousands of machines, each offering local computation and storage
● Leverages massively parallel processing to take advantage of Big Data, generally by using lots of inexpensive commodity servers, and has a high tolerance for hardware failure. In Hadoop, hardware failure is treated as the rule rather than the exception
● Designed to abstract away much of the complexity of distributed processing. This lets developers focus on the task at hand
Reference: Hadoop In Action
Hadoop Distributed File System (HDFS)
Scale Up Vs Scale Out
Reference: http://quickfileaccounting.wordpress.com/2013/07/02/scaleoutvsscaleup/
HDFS
● A fault-tolerant storage system that can store huge amounts of information
● Scales up incrementally and survives storage failure without losing data
● Hadoop clusters are built with inexpensive computers. If one computer (or node) fails, the cluster can continue to operate without losing data or interrupting work by simply redistributing the work to the remaining machines in the cluster
● HDFS manages storage on the cluster by breaking files into blocks and storing duplicated copies of them across the pool of nodes
● In the common case, HDFS stores three complete copies of each file by copying each block to three different servers
● Any two servers can fail and the entire file will still be available
● HDFS notices when a block or a node is lost, and creates a new copy of the missing data from the replicas it manages
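The replication behaviour described above can be sketched as a small in-memory simulation. This is not HDFS code: the function names, node names and the round-robin placement policy are assumptions made for illustration (real HDFS placement also takes racks into account).

```python
import itertools

REPLICATION = 3  # the common case described above


def split_into_blocks(data, block_size):
    """Break a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption that keeps the sketch short.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = {
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        }
    return placement


def file_available(placement, failed_nodes):
    """A file is readable if every block has at least one surviving replica."""
    return all(holders - set(failed_nodes) for holders in placement.values())


def re_replicate(placement, failed_nodes, nodes, replication=REPLICATION):
    """Copy under-replicated blocks onto healthy nodes after a loss."""
    healthy = [n for n in nodes if n not in failed_nodes]
    repaired = {}
    for block_id, holders in placement.items():
        survivors = holders - set(failed_nodes)
        candidates = [n for n in healthy if n not in survivors]
        needed = max(0, replication - len(survivors))
        repaired[block_id] = survivors | set(candidates[:needed])
    return repaired


nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(b"x" * 1000, block_size=128)
placement = place_replicas(len(blocks), nodes)

# Check every possible pair of simultaneous node failures.
survives_two = all(
    file_available(placement, pair) for pair in itertools.combinations(nodes, 2)
)
print("Survives any two failures:", survives_two)
```

Because each block lives on three different nodes, no pair of failures can remove all copies of a block; losing three nodes at once can, which is why the lost replicas are recreated as soon as a failure is noticed.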
● HDFS offers two key advantages over RAID:
– It requires no special hardware, since it can be built from commodity servers
– It can survive more kinds of failure: a disk, a node on the network or a network interface
HDFS
Reference: Hadoop In Action
MapReduce
● Hadoop takes advantage of HDFS’ data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
● Hadoop uses MapReduce for this task.
● A programming framework, created and successfully deployed by Google, that uses the divide-and-conquer method (and lots of commodity servers) to break down complex Big Data problems into small units of work, and then process them in parallel
● MapReduce is built on the proven concept of divide and conquer: it’s much faster to break a massive task into smaller chunks and process them in parallel.
● In simple terms, MapReduce is a programming paradigm for processing data in parallel. A task is distributed across multiple nodes running a "map" function: the problem is split into sub-parts that are sent to different machines so that all the sub-parts can run concurrently. The results from the parallel map functions are collected and distributed to a set of servers running "reduce" functions, which take the results from the sub-parts and recombine them to get a single answer.
Source: http://www.youtube.com/watch?v=HFplUBeBhcM (MapR Demo)
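The map and reduce steps described above can be sketched with the classic word count example. This is a single-process Python imitation rather than Hadoop code: the function names are illustrative, and the "shuffle" step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_pairs):
    """Shuffle: group all values emitted for the same key together."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for a split processed on a different node;
    # here the map calls simply run one after another.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


splits = ["big data is big", "data arrives fast"]
print(word_count(splits))
# {'big': 2, 'data': 2, 'is': 1, 'arrives': 1, 'fast': 1}
```

Since each map call only looks at its own split and each reduce call only looks at one key, both phases can be spread over many machines without the workers having to coordinate with each other.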
Hadoop Ecosystem
● In addition to MapReduce and HDFS, Hadoop also refers to a collection of other software projects that use the MapReduce and HDFS framework:
– HBase
– Hive
– Pig
– Mahout
– ZooKeeper
– Sqoop
The Hadoop Ecosystem
Reference: http://www.bdisys.com/27/1/17/BIG%20DATA/HADOOP
Apache Pig
● A platform for analyzing large data sets, built around a high-level language for expressing data analysis programs
● Those who want a simpler way to write and run MapReduce jobs can use Apache Pig
● It reduces the overhead of learning and writing complex MapReduce jobs, which are written mainly in Java
Apache Pig
[Diagram: HDFS, MapReduce, Pig]
Apache Hive
● Those who prefer an SQL-like query language for working with MapReduce, and who do not like the Apache Pig style of coding, can use Apache Hive
● Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates into MapReduce jobs, for querying the data
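To give a feel for how an SQL-style query can be expressed as a MapReduce job, the sketch below runs a GROUP BY count as a map step and a reduce step in plain Python. The query, the `visits` table and all function names are hypothetical, and this is a conceptual picture only, not the plan that Hive actually generates.

```python
from collections import defaultdict

# Hypothetical query in an SQL-like style:
#   SELECT city, COUNT(*) FROM visits GROUP BY city;
visits = [
    {"user": "u1", "city": "Colombo"},
    {"user": "u2", "city": "Kandy"},
    {"user": "u3", "city": "Colombo"},
    {"user": "u4", "city": "Galle"},
]


def map_row(row):
    """Map: emit the GROUP BY column as the key and 1 as the value."""
    return row["city"], 1


def reduce_group(city, ones):
    """Reduce: COUNT(*) is the sum of the 1s emitted for that key."""
    return city, sum(ones)


def run_group_by_count(rows):
    groups = defaultdict(list)
    for key, value in map(map_row, rows):
        groups[key].append(value)
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))


print(run_group_by_count(visits))
```

The person writing the query only states what result is wanted; deciding which column becomes the key and how the groups are combined is left to the engine.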
Apache Hive
[Diagram: HDFS, MapReduce, Pig, Hive]
Apache HBase
● A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations and real-time queries (random reads).
● Facebook Messages uses Apache HBase for real-time processing
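The idea of random reads over a column-oriented table can be sketched with a toy in-memory model: rows are kept sorted by row key, and each row holds cells named "family:qualifier". This is not the HBase client API; the class, its methods and the sample row keys are hypothetical.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a column-oriented table keyed by row key."""

    def __init__(self):
        self._row_keys = []  # kept sorted, as HBase keeps rows sorted by key
        self._cells = {}     # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        if row_key not in self._cells:
            bisect.insort(self._row_keys, row_key)
            self._cells[row_key] = {}
        self._cells[row_key][column] = value

    def get(self, row_key, column=None):
        """Random read: fetch one row (or one cell) directly by key."""
        row = self._cells.get(row_key, {})
        return row if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Range scan over sorted row keys, start inclusive, stop exclusive."""
        lo = bisect.bisect_left(self._row_keys, start_key)
        hi = bisect.bisect_left(self._row_keys, stop_key)
        return [(k, self._cells[k]) for k in self._row_keys[lo:hi]]


messages = ToyTable()
messages.put("user1#0002", "msg:body", "See you at the meetup")
messages.put("user1#0001", "msg:body", "Hello")
messages.put("user1#0001", "meta:read", "true")
messages.put("user2#0001", "msg:body", "Hi")

print(messages.get("user1#0001", "msg:body"))
print([key for key, _ in messages.scan("user1#", "user1$")])
```

A single message is fetched by its key without touching other rows, and all messages of one user are read with a short range scan because their keys sort next to each other.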
Apache HBase
[Diagram: HDFS, MapReduce, Pig, Hive, HBase]
Apache ZooKeeper
● A distributed, highly available coordination service used by most of the components in the Hadoop ecosystem.
● It also stores some of the metadata of Apache HBase
● ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
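One common way to build a distributed lock on a coordination service is to give every waiting client a sequence number and let the lowest number hold the lock. The sketch below imitates that idea inside a single Python process. It is not the ZooKeeper API: the class, its methods and the worker names are hypothetical, and a real deployment would also need sessions, watches and failure handling.

```python
import itertools


class ToyLockService:
    """In-process imitation of a lock built from sequential entries."""

    def __init__(self):
        self._sequence = itertools.count()
        self._waiting = {}  # sequence number -> client name

    def join(self, client):
        """Register a client and hand back its sequence number."""
        ticket = next(self._sequence)
        self._waiting[ticket] = client
        return ticket

    def holder(self):
        """The client with the lowest sequence number holds the lock."""
        if not self._waiting:
            return None
        return self._waiting[min(self._waiting)]

    def leave(self, ticket):
        """Release the lock, or drop out when a client goes away."""
        self._waiting.pop(ticket, None)


service = ToyLockService()
a = service.join("worker-a")
b = service.join("worker-b")
print(service.holder())  # worker-a joined first
service.leave(a)
print(service.holder())  # worker-b is next in line
```

Ordering clients by sequence number means the lock passes to the next waiter as soon as the current holder leaves, without the clients having to talk to each other directly.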
Apache ZooKeeper
[Diagram: HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper]
Apache Sqoop
● A tool for efficiently moving data between relational databases and HDFS
Hadoop support for GIS: http://esri.github.io/gis-tools-for-hadoop/
Hadoop Distributions
● Open Source:
– Apache Hadoop
● Commercial:
– Cloudera
– Hortonworks
– MapR
– Amazon Elastic MapReduce (AWS)
– Microsoft HDInsight
NoSQL
References
● Big Data Right Now: Five Trendy Open Source Technologies: http://techcrunch.com/2012/10/27/bigdatarightnowfivetrendyopensourcetechnologies/
● An Introduction to NOSQL Data Management for Big Data: http://datainformed.com/introductionnosqldatamanagementbigdata/
● Overview of Big Data and NOSQL Technologies as of January 2013: http://www.syoncloud.com/big_data_technology_overview
● The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in Far East, December 2012, EMC Corporation [IDC Report 2012]
● White House Shares $200 Million Big Data Plan: http://www.informationweek.com/government/information-management/white-house-shares-200-million-big-data/232700522
● Federal Standards Body Focuses On Big Data, Cloud: http://www.informationweek.com/regulations/federal-standards-body-focuses-on-big-data-cloud/d/d-id/1102703?
● The Internet of Things Is Coming: https://www.gartner.com/doc/1799626
● What is Data Science?: http://radar.oreilly.com/2010/06/whatisdatascience.html
● Google Flu Trends: http://www.google.org/flutrends/about/how.html
● Hadoop: The Definitive Guide, Second Edition, by Tom White. Copyright 2011 Tom White, 9781449389734
● Hadoop In Action, by Chuck Lam, 2011, 9781935182191
● MapR Demo on Introduction to MapReduce: http://www.youtube.com/watch?v=HFplUBeBhcM
● Basic Introduction to Apache Hadoop by HortonWorks: http://www.youtube.com/watch?v=OoEpfb6yga8