ICTA Meetup 11 - Big Data

ICTA Technology Meetup 11

By Crishantha Nanayakkara

Meetup Recap

● 1 – Enterprise Application Integration

● 2 – Enterprise Level High Availability Options

● 3 – SOA Security

● 4 – Towards Hybrid Mobile App Development

● 5 – The Semantic Web and Linked Data

● 6 – Enterprise Application Design Patterns

● 7 – GIS – An Introduction

● 8 – The Future of the Database World

● 9 – The Enterprise Storage Management

● 10 – An Introduction to Content Management with Joomla

The Scope

● Big Data – The Definition

● The Sources of Data

● Structured, Semi-Structured vs Unstructured Data

● Relational Data vs Big Data

● Towards Big Data

● Big Data Adoption in other countries

● Big Data Technologies and the ecosystem

● Big Data Open Source and Commercial Options

Big Data Definition

A new generation of technologies and architectures, designed to economically

extract VALUE from very large VOLUMES of a wide variety of data by enabling high

VELOCITY capture, discovery, and/or analysis.

The Three Vs of Big Data


● Volume – Very large amounts of data

● Variety – From different sources and types

● Velocity – Frequency of its generation: how quickly the data arrives and is stored, and how quickly it can be retrieved

The Sources of Data


● Documents

● Emails

● Images

● Relational Databases

● Logs

● Social Media feeds

● Videos

● Sensor Data

● Click Streams

Structured, Semi Structured and Unstructured Data

Structured Data

● Structured:

– Information with a high degree of organization

– Seamlessly and readily searchable by straightforward search algorithms or operations

– e.g.: relational databases, spreadsheets, XML

Semi-Structured Data

● Semi-Structured:

– This is a form of structured data that does not conform to an explicit and fixed schema

– The data is inherently self-describing and contains tags or other markers to enforce hierarchies of records and fields within the data

– e.g.: web logs, social media feeds

Unstructured Data

● Unstructured:

– This type of data consists of formats which cannot easily be indexed into relational tables for analysis or querying

– e.g.: images, videos
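The three categories above can be illustrated with a small Python sketch. The sample records and the `fields_of` helper are hypothetical and only meant to show the difference in shape: a fixed schema, self-describing records with varying fields, and raw bytes with no fields at all.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (hypothetical sample data).
structured = "id,name,city\n1,Nimal,Colombo\n2,Kamala,Kandy\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing tags, but the fields can differ per record.
semi_structured = [
    '{"user": "nimal", "text": "hello", "tags": ["bigdata"]}',
    '{"user": "kamala", "text": "hi", "location": {"city": "Kandy"}}',
]
records = [json.loads(line) for line in semi_structured]

# Unstructured: raw bytes with no fields to query (the first bytes of a PNG file).
unstructured = bytes([0x89, 0x50, 0x4E, 0x47])


def fields_of(record):
    """Return the sorted field names that a record carries."""
    return sorted(record.keys())


print([fields_of(r) for r in rows])      # identical field list for every row
print([fields_of(r) for r in records])   # field lists differ between records
print(len(unstructured), "bytes, no schema")
```

The structured rows can be loaded into a relational table as they are, while the semi-structured records need their fields discovered per record before they can be queried.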

Relational Vs Big Data

Relational Data vs Big Data

● Thinking of Big Data as “just lots more enterprise data” is tempting, but it’s a serious mistake.

● Big Data is commonly generated outside of traditional enterprise applications

● Big Data is often composed of unstructured or semi-structured information types that continually arrive in enormous amounts


● To get maximum value from Big Data, it needs to be associated with traditional enterprise data, automatically or via purpose-built applications, reports, queries, and other approaches

Towards Big Data

The Digital Universe

● From 2005 to 2020, the digital universe will grow from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes.

According to IDC, the Big Data technology and service market was about US$4.8 billion in 2011. The market is projected to grow at a compound annual growth rate (CAGR) of 37.2% between 2011 and 2015. By 2015, the market size is expected to be US$16.9 billion.

[Source: IDC. Worldwide Big Data Technology and Services 2012-2015 Forecast.]
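The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch: the helper names `project` and `implied_cagr` are made up for illustration, and one exabyte is taken as 10^9 gigabytes (decimal units).

```python
EB_IN_GB = 10**9  # assumption: decimal units, 1 exabyte = 10^9 gigabytes


def project(start_value, cagr, years):
    """Compound a starting value forward by a fixed annual growth rate."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that takes start_value to end_value in `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


# Digital universe in 2020: 40,000 exabytes expressed in gigabytes.
universe_2020_gb = 40_000 * EB_IN_GB
print(f"{universe_2020_gb:,} GB")

# IDC market figures: US$4.8 billion in 2011, growing at 37.2% per year.
years = 2015 - 2011
market_2015 = project(4.8, 0.372, years)
print(f"Projected 2015 market: US${market_2015:.1f} billion")
print(f"Growth rate implied by 4.8 -> 16.9: {implied_cagr(4.8, 16.9, years):.1%}")
```

Compounding 4.8 at 37.2% for four years gives roughly US$17 billion, which agrees with the quoted US$16.9 billion once rounding in the published inputs is allowed for.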

Gartner reported that more than 65 billion devices were connected to the internet by 2010. By 2020, this number will go up to 230 billion

[Source: https://www.gartner.com/doc/1799626]

The Opportunity for Big Data

● Only a tiny fraction of the digital universe has been explored for analytic value so far.

● By 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed.

● But only if it is tagged and analyzed. That is the opportunity for Big Data.

Source: IDC's Digital Universe Study, 2012

The Candidates for Big Data

● Not all data is necessarily useful for Big Data analytics. However, some data types are particularly good for analysis:

– Surveillance Footage

– Embedded medical devices

– Entertainment and Social Media

– Images and Voice Data

– Data Processing

Source: IDC's Digital Universe Study, 2012

● Over a history that spans more than 30 years, SQL database servers have traditionally held gigabytes of information — and reaching that milestone took a long time.

● In the past 15 years, data warehouses and enterprise analytics expanded these volumes to terabytes.

● And in the last 5 years, the distributed file systems that store Big Data now routinely house petabytes of information.

The Statistics

The Big Data Adoption in the World

Source: http://www.informationweek.com/government/information-management/white-house-shares-200-million-big-data/232700522

http://www.informationweek.com/regulations/federal-standards-body-focuses-on-big-data-cloud/d/d-id/1102703?

Singapore Transport System (Land Transport Authority, LTA)

Source: How Cities using Big Data in Asia? - FutureGov Report



● Data Collection:

– Junction Electronic Eyes

– Green Link Determining System

– Web cams

– Parking Guidance Systems

– Expressway Monitoring Systems

– Traffic Scan



● Data Processing:

– All the data is fed into the integrated i-Transport Processing System

– The data is aggregated, integrated and analyzed

● Data Dissemination:

– Via web portals, radio broadcasting, navigation devices, smart phones, etc.

– Certain data elements are given as “open data”

Singapore National Environment Agency


● Dengue-related data:

– The data is pulled from dengue cases, public feedback, mosquito inspections and other sources for analysis.

– Making use of GIS to identify high-risk areas, they are also able to prioritize places for checks

iPlan Project (Urban Redevelopment Authority, URA)


● iPLAN is among the world’s first nationwide enterprise GIS systems for urban planning and it contains comprehensive land, building, planning and approval information which is readily available to URA’s planners

Kuala Lumpur Government


● The government has created a Big Data Analytics fund to support four government-initiated projects by 2015, focusing on:

– Transport

– Planning

– Environment

– Security

Technologies behind Big Data

Reference: http://www.bdisys.com/27/1/17/BIG%20DATA/HADOOP

Hadoop

Hadoop – An Introduction

● Hadoop is a framework that provides open source libraries for distributed computing, using the MapReduce programming model and its own distributed file system, the Hadoop Distributed File System (HDFS)

● Open source, written in Java

● Maintained by the Apache Software Foundation as a top-level project

● Original deployments:

– Yahoo, Facebook, LinkedIn

Hadoop – The Core Components

● The kernel (core) of Hadoop provides:

– A reliable shared storage (HDFS)

– An analysis system (MapReduce)

● There are other components in Hadoop, which together make up the complete Hadoop ecosystem

Hadoop Architecture

● Designed to scale out from a few computing nodes to thousands of machines, each offering local computation and storage

● Leverages massively parallel processing to take advantage of Big Data, generally by using lots of inexpensive commodity servers, and has a high tolerance of hardware failure. In Hadoop, hardware failure is treated as the rule rather than the exception

● Designed to abstract away much of the complexity of distributed processing. This lets developers focus on the task at hand

Reference: Hadoop In Action


Hadoop Distributed File System (HDFS)

Scale Up Vs Scale Out

Reference: http://quickfileaccounting.wordpress.com/2013/07/02/scaleoutvsscaleup/


HDFS

● A fault-tolerant storage system that can store huge amounts of information

● Scales up incrementally and survives storage failure without losing data

● Hadoop clusters are built with inexpensive computers. If one computer (or node) fails, the cluster can continue to operate without losing data or interrupting work, by simply redistributing the work to the remaining machines in the cluster

● HDFS manages storage on the cluster by breaking files into blocks and storing duplicated copies of them across the pool of nodes

● In the common case, HDFS stores three complete copies of each file by copying each block to three different servers

● Any two of those servers can fail and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of the missing data from the replicas it manages.
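The block and replica behaviour described above can be sketched as a small in-memory simulation. This is not how HDFS is implemented: the node names, the tiny block size and the round-robin placement are assumptions made for illustration, and real HDFS uses much larger blocks and its own placement policy.

```python
import itertools

BLOCK_SIZE = 4     # bytes; tiny on purpose, real HDFS blocks are far larger
REPLICATION = 3    # the common case described above
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Break a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def file_available(placement, failed):
    """A file is readable if every block still has at least one live replica."""
    return all(
        any(node not in failed for node in holders)
        for holders in placement.values()
    )


def re_replicate(placement, failed, nodes=NODES, replication=REPLICATION):
    """Copy under-replicated blocks onto live nodes that do not hold them yet."""
    live = [n for n in nodes if n not in failed]
    for block, holders in placement.items():
        survivors = [n for n in holders if n not in failed]
        spares = [n for n in live if n not in survivors]
        placement[block] = survivors + spares[:replication - len(survivors)]
    return placement


blocks = split_into_blocks(b"hello big data world")
placement = place_replicas(len(blocks))

# Try every possible pair of failed nodes and check the file stays readable.
survives_all = all(
    file_available(placement, set(pair))
    for pair in itertools.combinations(NODES, 2)
)
print("Survives any two node failures:", survives_all)
```

Because each block sits on three distinct nodes, losing any two nodes always leaves at least one copy, and `re_replicate` shows how the missing copies could be rebuilt from the survivors.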

● HDFS offers two key advantages over RAID:

– It requires no special hardware, since it can be built from commodity servers

– It can survive more kinds of failure: a disk, a node on the network or a network interface

MapReduce


● Hadoop takes advantage of HDFS’ data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.

● Hadoop uses MapReduce for this task.

● A new programming framework, created and successfully deployed by Google, that uses the divide-and-conquer method (and lots of commodity servers) to break down complex Big Data problems into small units of work, and then process them in parallel

● MapReduce is built on the proven concept of divide and conquer: it’s much faster to break a massive task into smaller chunks and process them in parallel.


● MapReduce is a programming paradigm for processing data in parallel. In simple terms, it distributes a task across multiple nodes, each running a "map" function. The input is split into subparts that are sent to different machines, so that all the subparts can be processed concurrently. The results from the parallel map functions are collected and distributed to a set of servers running "reduce" functions, which then take the results from the subparts and recombine them to get a single answer.

Source: http://www.youtube.com/watch?v=HFplUBeBhcM (MapR Demo)
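The map and reduce steps described above can be sketched as a word count in plain Python. This runs in a single process and only mimics the data flow. The function names and the sample chunks are illustrative, and the sketch does not use the Hadoop API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of the input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across every mapper's output."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values for one key into a single result."""
    return key, sum(values)


# Each chunk stands for the part of the input handled by one node.
chunks = ["big data is big", "data arrives fast", "big clusters process data"]

mapped = [map_phase(chunk) for chunk in chunks]   # would run in parallel
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)
```

Each call to `map_phase` depends only on its own chunk, which is what allows a cluster to run the map step on many machines at once.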


Hadoop EcoSystem

● In addition to MapReduce and HDFS, Hadoop also refers to a collection of other software projects that use the MapReduce and HDFS framework:

– HBase

– Hive

– Pig

– Mahout

– ZooKeeper

– Sqoop

The Hadoop EcoSystem


Reference: http://www.bdisys.com/27/1/17/BIG%20DATA/HADOOP

Apache Pig

● This is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs

● Those who want a simpler way to write and run MapReduce jobs can use Apache Pig.

● This reduces the overhead of learning to write complex MapReduce jobs, which are mainly written in Java

[Figure: Apache Pig in the Hadoop stack. Layers shown: HDFS, MapReduce, Pig]

Apache Hive

● Those who prefer an SQL-like query language for working with MapReduce, and who do not like the Apache Pig style of coding, can use Apache Hive

● Hive manages data stored in HDFS and provides a query language based on SQL for querying the data. The runtime engine translates these queries into MapReduce jobs
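To give a feel for how an SQL-style query can map onto MapReduce, the sketch below runs one GROUP BY query with an ordinary SQL engine and then computes the same result with explicit map and reduce steps. It is a conceptual illustration only: `sqlite3` stands in for SQL semantics, the `views` table and its rows are hypothetical, and Hive's actual query planning is far more involved.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view records: (page, bytes_sent).
rows = [("/home", 120), ("/about", 80), ("/home", 200), ("/docs", 50), ("/about", 20)]

QUERY = "SELECT page, SUM(bytes_sent) FROM views GROUP BY page"

# 1. Run the query with an ordinary SQL engine (a stand-in, not Hive itself).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (page TEXT, bytes_sent INTEGER)")
conn.executemany("INSERT INTO views VALUES (?, ?)", rows)
sql_result = dict(conn.execute(QUERY).fetchall())
conn.close()


# 2. Express the same query as map, shuffle and reduce steps.
def map_row(row):
    """Map: the GROUP BY column becomes the key, the summed column the value."""
    page, bytes_sent = row
    return page, bytes_sent


grouped = defaultdict(list)
for key, value in map(map_row, rows):
    grouped[key].append(value)            # shuffle: gather the values per key

# Reduce: SUM() is applied to the values collected for each key.
mr_result = {key: sum(values) for key, values in grouped.items()}

print(sql_result == mr_result)
```

The point of the comparison is that the person writing the query only writes the SQL text, while the grouping and summing are carried out as map and reduce work behind the scenes.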

[Figure: Apache Hive in the Hadoop stack. Layers shown: HDFS, MapReduce, Pig, Hive]

Apache HBase

● A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports not only batch-style computations but real-time queries (random reads) as well.

● Facebook Messages uses Apache HBase for real-time processing
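The idea of random reads over a column-oriented store can be sketched with a toy in-memory model. `TinyColumnStore` and the row-key scheme are hypothetical, and the sketch leaves out much of what HBase does (cell versions, distribution across servers, storage on HDFS). It only shows rows kept sorted by key, with each row free to hold its own set of columns.

```python
import bisect


class TinyColumnStore:
    """Toy model: rows sorted by key, each row maps 'family:qualifier' to a value."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column name -> value}

    def put(self, row_key, column, value):
        """Write one cell, creating the row if it does not exist yet."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Random read: fetch one row (or one cell) directly by its key."""
        row = self._rows.get(row_key, {})
        return row if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Range scan over sorted row keys: start inclusive, stop exclusive."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


store = TinyColumnStore()
store.put("user1#msg001", "msg:body", "hello")
store.put("user1#msg002", "msg:body", "see you at the meetup")
store.put("user2#msg001", "msg:body", "hi")
store.put("user1#msg001", "meta:read", "yes")   # only this row has this column

print(store.get("user1#msg002", "msg:body"))
print([key for key, _ in store.scan("user1#", "user1$")])
```

A single message is fetched by key without touching other rows, and because keys are sorted, all messages of one user can be read with one short range scan.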

[Figure: Apache HBase in the Hadoop stack. Layers shown: HDFS, MapReduce, Pig, Hive, HBase]

Apache ZooKeeper

● A distributed, highly available coordination service used by most of the components in the Hadoop ecosystem.

● It stores some of the metadata of Apache HBase as well

● ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
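A distributed lock of this kind is commonly built by giving every waiting client a sequence number and letting the lowest number hold the lock. The sketch below imitates that idea in memory. `ToyCoordinator` is a hypothetical stand-in, not a ZooKeeper client, and it ignores networking, sessions and notifications.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (not a ZooKeeper client)."""

    def __init__(self):
        self._counter = itertools.count()
        self._waiters = {}   # sequence number -> client name

    def request_lock(self, client):
        """Register a client and hand back its sequence number (its ticket)."""
        seq = next(self._counter)
        self._waiters[seq] = client
        return seq

    def holder(self):
        """The client with the lowest outstanding sequence number holds the lock."""
        if not self._waiters:
            return None
        return self._waiters[min(self._waiters)]

    def release(self, seq):
        """Called on unlock, or when a client goes away (for example, a crash)."""
        self._waiters.pop(seq, None)


coord = ToyCoordinator()
a = coord.request_lock("worker-a")
b = coord.request_lock("worker-b")
c = coord.request_lock("worker-c")

print(coord.holder())   # the first requester holds the lock
coord.release(a)
print(coord.holder())   # the lock passes to the next ticket in line
```

Handing the lock over in ticket order keeps waiting clients from racing each other, and treating a vanished client like a release prevents a crashed worker from holding the lock forever.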

[Figure: Apache ZooKeeper in the Hadoop stack. Layers shown: HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper]

Apache Sqoop

● A tool for efficiently moving data between relational databases and HDFS
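Conceptually, moving a table out of a relational database means reading its rows and writing them out as delimited text, split into several parts so the work can be shared. The sketch below shows that idea only: it uses `sqlite3` and in-memory strings instead of a real database and HDFS, the `customers` table is hypothetical, and it does not use Sqoop itself.

```python
import csv
import io
import sqlite3

# A hypothetical source table in a relational database (sqlite3, in memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Nimal", "Colombo"), (2, "Kamala", "Kandy"), (3, "Sunil", "Galle"),
     (4, "Anne", "Jaffna"), (5, "Ravi", "Matara")],
)


def export_customers(connection, num_splits):
    """Read the table and write its rows into `num_splits` delimited text parts."""
    rows = connection.execute(
        "SELECT id, name, city FROM customers ORDER BY id"
    ).fetchall()
    size = -(-len(rows) // num_splits)   # ceiling division: rows per part
    parts = []
    for i in range(num_splits):
        buffer = io.StringIO()           # stands in for one file in the target store
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerows(rows[i * size:(i + 1) * size])
        parts.append(buffer.getvalue())
    return parts


files = export_customers(conn, num_splits=2)
conn.close()

for index, content in enumerate(files):
    print(f"--- split {index} ---")
    print(content, end="")
```

Each part holds a contiguous range of ids, so separate workers could produce the parts independently and in parallel.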

Hadoop support for GIS: http://esri.github.io/gis-tools-for-hadoop/

Hadoop Distributions

● Open Source:

– Apache Hadoop

● Commercial:

– Cloudera

– Hortonworks

– MapR

– Amazon Elastic MapReduce (EMR)

– Microsoft HDInsight

References

● Big Data Right Now: Five Trendy Open Source Technologies: http://techcrunch.com/2012/10/27/bigdatarightnowfivetrendyopensourcetechnologies/

● An Introduction to NOSQL Data Management for Big Data: http://datainformed.com/introductionnosqldatamanagementbigdata/

● Overview of Big Data and NOSQL Technologies as of January 2013: http://www.syoncloud.com/big_data_technology_overview

● The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in Far East, December 2012, EMC Corporation, [IDC Report 2012]


● White House Shares $200 Million Big Data Plan: http://www.informationweek.com/government/information-management/white-house-shares-200-million-big-data/232700522

● Federal Standards Body Focuses On Big Data, Cloud: http://www.informationweek.com/regulations/federal-standards-body-focuses-on-big-data-cloud/d/d-id/1102703?

● The Internet of Things Is Coming: https://www.gartner.com/doc/1799626

● What is Data Science? http://radar.oreilly.com/2010/06/what-is-data-science.html

● Google Flu Trends : http://www.google.org/flutrends/about/how.html


● Hadoop: The Definitive Guide, Second Edition, by Tom White, 2011, ISBN 9781449389734

● Hadoop In Action, by Chuck Lam, 2011, ISBN 9781935182191

● MapR Demo on Introduction to MapReduce: http://www.youtube.com/watch?v=HFplUBeBhcM

● Basic Introduction to Apache Hadoop by HortonWorks: http://www.youtube.com/watch?v=OoEpfb6yga8
