Understanding Big Data Services ~ Sharada Rao
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.” – Eric Schmidt, CEO, Google
1. The speed at which the world is building services around data demands a robust approach to how data is modeled, understood and served up in a stable, regulated and predictable manner.
2. The unprecedented money being spent in this market allows SMBs, large players, disruptors and products to spawn and proliferate in hordes, making it a vast and expensive ecosystem over the long term.
3. Differentiators will be those who are intelligent about the craft of data services, tailoring it to the enterprise data architecture and custom re-engineering it for each industry.
[Infographic: Big Data use cases by domain, set against headlines of the day such as "Report to the President: Every Federal Agency Needs a 'Big Data' Strategy", "Got Big Data? You're Gonna Need a Faster Network", "Big Data Needs to Think Bigger", "More Customers Exposed as Big Data Breach Grows", "Why We Still Can't Prevent Flash Crashes", "Linking 'Big Weather' to Global Warming", "Big Data: You'll Have It, But Can You Handle It?", "Managing Healthcare's 'Big Data Tsunami'", "World Record in Data Transmission: 26 Terabits per Second on a Single Laser Beam", and "P&G Bind with Social Media Moms – Imagine If This Were Pharma"]
• Transportation: speed, traffic flow, detection
• Defence: counter-intelligence, situational awareness
• Financial services: credit and risk scoring, fraud detection, trade analysis
• Environment and energy: "whole earth" modeling, climate change, alternative energy
• Healthcare: disease surveillance, drug discovery, personalized healthcare
• Consumer: sentiment, consumption, promotions
[Diagram: Advanced Analytics and Core I/T (compute, storage, database) at the center, ringed by Social Networks, Enterprise SOA, Smart Devices, Cloud, Mobile and Big Data]
There are many different measures of this phenomenon. IDC predicts that the digital universe will be 44 times bigger in 2020 than it was in 2009, totaling a staggering 35 zettabytes. EMC reports that the number of customers storing a petabyte or more of data will grow from 1,000 (reached in 2010) to 100,000 before the end of the decade; by 2012 it expects that some customers will be storing exabytes (1,000 petabytes) of information. In 2010 Gartner reported that enterprise data growth will be 650 percent over the next five years, and that 80 percent of that will be unstructured.
Global data volume is predicted to grow 44-fold by 2020. The ask is for mammoth scale-out infrastructures to manage large, distributed sources of data, and for parallel processing capability in data modeling, cleansing, analytics and services.
IaaS: 1. Compute as a Service 2. Storage as a Service 3. Cloud Services – data migration to and fro
DWaaS: 1. Data Management as a Service 2. Data Modeling as a Service 3. DW as a Service
DaaS: 1. Data APIs as a Service 2. Data as a Product per industry
BIaaS: 1. Analytics as a Service 2. Insights as a Service
BPaaS: 1. Data process re-engineering as a Service
To accord rapid and parallel processing speed, scale and flexibility, Google and Amazon have begun to use a 'shared nothing' architecture, wherein each node is self-contained and can crawl and link as the desired outcome requires. The emphasis is hence on the link and on an open-noded, stateless, tri-axial plane of data, application APIs and infrastructure. The link is in effect an arrow, a form of data retrieval, and this blows away legacy relational databases; hence "shared nothing". The largest application of this concept is the W3C's Linked Open Data project. This in turn means a marriage with NoSQL: data that will not fit easily into row/column relational databases. It also means an IDOL framework (Intelligent Data Operating Layer), which can interpret human language semantically, process it into a SQL-like query, cross-reference different kinds of data and integrate several search algorithms to serve up appropriate, logical results, all with parallel processing and speed, crawling several stateless data universes.
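The shared-nothing idea can be sketched in a few lines: each node below owns a disjoint shard of the data and holds no shared state, and a hash of the key routes every request to exactly one self-contained node. The `Node` and `Cluster` classes are illustrative assumptions for this sketch, not any vendor's API.

```python
# Minimal sketch of a shared-nothing lookup: each node is self-contained,
# owns its own shard, and shares no memory or disk with its peers.
import hashlib

class Node:
    """A self-contained node: local store only, no shared state."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

class Cluster:
    """Routes each key to exactly one node by hashing; no coordination."""
    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]

    def _owner(self, key):
        # Hash the key so ownership is deterministic and evenly spread
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).put(key, value)

    def get(self, key):
        return self._owner(key).get(key)

cluster = Cluster(n_nodes=4)
cluster.put("user:42", {"name": "Ada"})
print(cluster.get("user:42"))  # each lookup touches exactly one node
```

Because no node depends on another's state, adding nodes scales capacity linearly, which is what makes the model attractive for large distributed data sources.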
Trading
Insurance Claims
Automotive Design
Genomics and Proteomics
Patient profiling
Disease Surveillance
Retail consumer trends
RFID and SCM
Social Networking
Mobile
Global Climate
Sensors and Energy
Space Research
Defence Intelligence
Security Information
Banking – Risk/Investment
Federal classified information
CERN data project
1. Templatized metadata would be a data 'productization' for the above areas
2. Predictable, industry-specific, regulated real-time data
MapReduce is a patented software framework introduced by Google and used to simplify data processing across massive data sets. As people rapidly increase their online activity and digital footprint, a huge amount of data is generated continuously. This data can be of multiple types (text, rich text, RDBMS, graph, etc.), and organizations are finding it vital to quickly analyze the huge amounts of data generated by their customers and audiences in order to better understand and serve them. MapReduce is the tool that helps those organizations analyze this data quickly and efficiently, bringing business value to the organization.
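The model can be illustrated with a toy, single-process word count; real frameworks such as Hadoop run the same map, shuffle and reduce phases in parallel across many machines. The function names here are illustrative, not part of any framework's API.

```python
# Toy single-process sketch of the MapReduce phases: map, shuffle, reduce.
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big ideas", "data services"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'services': 1}
```

In a distributed run, the map and reduce phases execute on different workers and the shuffle moves data between them over the network; the programming model the developer sees is the same.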
Read More @ Google
Twister4Azure is a distributed, decentralized iterative MapReduce runtime for the Windows Azure cloud, built on Azure cloud infrastructure services. Twister4Azure extends the MapReduce paradigm with extensions and optimizations for iterative MapReduce applications: it supports caching of loop-invariant data, adds a new merge step (map -> reduce -> merge) to the programming model and introduces a novel cache-aware task scheduling mechanism. Twister4Azure running in the Azure cloud outperforms Hadoop on a local cluster by 2 to 4 times.
• Decentralized architecture for clouds
• Avoids single points of failure
• Utilizes highly available and scalable cloud services
• Efficient execution of iterative MapReduce applications
• Extends the MR programming model with iterative extensions
• Multi-level data caching to overcome data access latencies
• Cache-aware hybrid scheduling
• Collective communication primitives for iterative MapReduce
• Support for traditional MapReduce and pleasingly parallel applications
• Ability to execute multiple MR applications inside a single iteration
• Dynamic scheduling achieving better load balancing
• Typical MapReduce fault tolerance, ensuring eventual completion of your computation
• Web-based monitoring console
• Local testing/debugging with the Azure local emulator
We are happy to provide support for scientific application development using Twister4Azure.
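The iterative map -> reduce -> merge loop described above can be sketched as a plain driver function. Everything here is an illustrative assumption, not Twister4Azure's actual API: `static_data` stands in for cached loop-invariant data (loaded once, reused each iteration), and the trivial averaging job exists only to make the loop runnable.

```python
# Sketch of an iterative MapReduce driver with the extra merge step.
def iterative_mapreduce(data, static_data, init, max_iters=50, tol=1e-6):
    """Run map -> reduce -> merge until the merged state converges."""
    state = init
    for _ in range(max_iters):
        # Map: each item is processed with the current state and the
        # cached loop-invariant data (read once, reused every pass).
        mapped = [mapper(item, state, static_data) for item in data]
        reduced = reducer(mapped)
        new_state = merge(state, reduced)   # the extra merge step
        if abs(new_state - state) < tol:    # converged: stop iterating
            return new_state
        state = new_state
    return state

# A trivial instantiation so the loop runs end to end:
def mapper(item, state, static_data):
    return item * static_data           # static_data: loop-invariant weight

def reducer(mapped):
    return sum(mapped) / len(mapped)    # combine mapped values

def merge(old_state, reduced):
    return (old_state + reduced) / 2    # blend previous and new results

print(iterative_mapreduce([1.0, 2.0, 3.0], static_data=1.0, init=0.0))
```

The merge step is what distinguishes the iterative model from classic MapReduce: it folds each round's reduced output back into the state that drives the next round's map phase.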
Big Data needs to be given context or meaning based on 3 parameters
1. Timing
2. Location
3. Situational Intelligence
Such actionable, context-driven data can help in better-informed decision making, be that clinical analytics, geospatial data, flood/tornado reports or sentiment analytics on social media. Examples: FreeBase, Google's Image Labeler, Verbosity, Tag a Tune.
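One way to picture the three parameters is a small record type that carries them alongside the raw value. The field names and the sample flood-gauge event below are illustrative assumptions, not from any real schema.

```python
# Attaching the three context parameters (timing, location, situational
# intelligence) to a raw data point so it becomes actionable.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ContextualEvent:
    value: float        # the raw measurement or signal
    timestamp: datetime # 1. timing
    location: tuple     # 2. location, as (lat, lon)
    situation: str      # 3. situational intelligence

def contextualize(value, lat, lon, situation):
    """Wrap a bare value with the context that gives it meaning."""
    return ContextualEvent(
        value=value,
        timestamp=datetime.now(timezone.utc),
        location=(lat, lon),
        situation=situation,
    )

# e.g. a river-gauge reading becomes a flood-warning signal with context
event = contextualize(4.7, 29.76, -95.37, "river level rising after storm")
print(event.situation)
```

A bare `4.7` means nothing; the same number stamped with when, where and under what circumstances it was observed is what downstream analytics can act on.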
In an ideal world, actionable data and statistical algorithms have to be integrated with human knowledge, so that human intelligence can then supersede the BI of a predictive analytics world.
Then again, sometimes it's best to let the data tell the story.