Gail Z Associates, LLC
DevNexus 2014, Data + Integration
Big Data Technology, Strategy, and Applications
Dr. Gail Zhou
Gail Z Associates, LLC
February 25, 2014
LinkedIn: http://www.linkedin.com/in/gailZhou
Email: [email protected]
Outline
• What is Big Data and why is it such a big deal? Where can we use Big Data?
• Big Data Key Concepts and Technologies, using Hadoop as an example
• Big Data Challenges and Startup Strategy: What are the challenges? How do you get started on Big Data?
• Appendix: Other Big Data Technologies, Integration of Big Data with Existing Applications (an example)
What is Big Data and why is it such a big deal?
A Brief History of Big Data
Sources: Wikipedia, Forbes.com, and other articles
• 1941: The term “Information Explosion” is coined.
• 1963: Physicist and science historian Derek Price concludes that the number of new journals has grown exponentially.
• 1990: Computer scientist Peter J. Denning, “Saving All the Bits”: what machines can we build to monitor, process, and understand the data, its meanings, and patterns? Can we get intelligence out of the data?
• 1999: Steve Bryson et al., “Visually exploring gigabyte data sets in real time”, Communications of the ACM, section “Big Data for Scientific Visualization”.
A Brief History of Big Data Cont’d
Sources: Wikipedia, Forbes.com, and other articles
• 2001: Doug Laney, Meta Group, “3D Data Management: Controlling Data Volume, Velocity, and Variety” (more Vs have been added since: Veracity, Variability, and Value)
A Brief History of Big Data Cont’d
Sources: Wikipedia, Forbes.com, and other articles
• 2001 – 2003: Google grew rapidly as a result of its new revenue model (5 cents per click). Google is now a giant and a big data leader.
• 1994 – Present: Yahoo!, Hadoop Shop (10K Nodes), Genome, Big Data Analytics.
• 1994 – Present: Amazon, AWS Cloud.
• 2003 – Present: Facebook, Twitter, LinkedIn, etc.
• 2013 and beyond: Many others.
Source: Global Education Project
Population Growth Chart: Does it have something to do with Big Data? Machines, satellites, cameras, the Internet, computers, and mobile phones are just “enablers” of Big Data.
Information Explosion: this is just the beginning.
You got mail (too much).
You are embarrassed to admit you don’t know a lot of cool things happening in the world.
Don’t despair. You are not alone.
Source: Newbury College, UK
www.ucg.org
www.spchui.net
Big Data Opportunities
• Medical Research and Healthcare: Massive amounts of collected research and clinical information can be used to predict and prevent diseases, moving us from ‘sick care’ to ‘health care’.
• Telecom: Traffic data and patterns can be used in real time to re-route traffic.
• Defense: Satellite images and other information can be mashed up to identify threats.
• Utilities: Smart meter monitoring.
• Public Safety: Pattern recognition and social media can help predict crimes.
• Financial Industry: Pattern recognition and business rules can flag fraudulent activities.
• Functional Areas: Investigational Search, Pricing Optimization, Risk Analysis, Churn Analysis, Behavior Analysis, Transactions Analysis, Revenue Assurance, Recommendation Engines, etc.
Where Big Data Can Shine
• Traditional (Examples): Financial Transactions, Energy and Infrastructure, Transportation, Life Science and HealthCare
• Big Data (Examples): Advertisements, Search and Indexing, Social Networks, Science Research, Communications
• Notes
– Big Data technology is not a replacement
– Big Data is complementary
– In some cases, Big Data is the only way to get things done
– Big Data has its own challenges
Key Concepts in Big Data – Technology and Architectures
Hadoop HDFS
Blocks (64 MB, 128 MB, etc.) are stored on different nodes with a replication factor (default 3).
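The block and replication arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code: the function names `hdfs_footprint` and `place_replicas` are hypothetical, and the round-robin placement is a deliberate simplification (real HDFS placement also considers racks and node load).

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file.

    Hypothetical helper for illustration only.
    """
    if file_size_mb < 0 or block_size_mb <= 0 or replication <= 0:
        raise ValueError("sizes and replication must be positive")
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies its actual size, so raw storage is
    # the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


def place_replicas(num_blocks, nodes, replication=3):
    """Toy round-robin placement: each block's replicas land on distinct nodes.

    Assumption: this ignores rack awareness and free space.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1000, block_size_mb=128)
    print(blocks, replicas, raw_mb)
    print(place_replicas(blocks, ["n1", "n2", "n3", "n4"]))
```

With a 1000 MB file and 128 MB blocks, the file splits into 8 blocks; at replication factor 3 the cluster holds 24 block replicas.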
Hadoop Logical View
Source: http://nosqlessentials.com, Professor Fernando Rodriguez Olivera
Hadoop Logical View (HDFS + Map Reduce)
Hadoop V1 – Map Reduce Jobs Execution
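The map, shuffle, and reduce stages of a MapReduce job can be sketched with a word count in plain Python. This is a single-process illustration only: it does not use the Hadoop API, and the names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical. In a real cluster the map and reduce tasks run in parallel on the nodes that hold the data blocks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, vs) for k, vs in sorted(grouped.items()))


if __name__ == "__main__":
    print(run_job(["big data", "Big deal"]))
```

Each stage only sees key/value pairs, which is what lets a framework distribute the map and reduce work across many machines.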
Hadoop 2.0 with YARN
YARN Interaction & Sequence
Big Data Challenges, Suggested Startup Strategy
Big Data Startup Challenges
• Business urgency, time-to-market pressures
– Big Data startup needs careful planning
– Big Data needs infrastructure, software stacks, people, and a startup plan
• Lack of Big Data resources, lack of sponsorship (except in some companies)
– Big Data is complex and requires multiple skill sets (mostly new to many companies): Infrastructure, Administration, Security, Programming, Testing, etc.
– Skepticism about Big Data
• Integration with Existing Technologies and Systems
– Cannot develop isolated big data solutions
– Integration with existing systems will be a top challenge (requires both sides to do additional work)
• Open Source: Stability, Maturity, and Security
Suggested Big Data Startup Strategy
• Full business needs and information requirements analysis
– Business drivers: Revenue generation? Cost reduction? Customer retention? Compliance? Process improvement? Fraud detection? Analytics? Dashboard? Solving a tough problem? Retiring/replacing technologies and systems?
• Technology Evaluation and Selection
– Define requirements and objectives first
– Evaluate a variety of technology stacks; develop a framework first
• Executive Support for Startup Resources
• Prototyping, Discovery, and Planning
– Rent infrastructure in the cloud: VMware, Amazon EC2, and others
– Use spare hardware and network bandwidth
– Assessment, proposal, and project/program plan for next steps
– Start small and keep delivering
• Architecture Design, Estimation, Business Case
– Obtain funding and executive sponsorships, owners, etc.
– SDLC: don’t forget hardware, security, testing, etc.
Appendix
Hadoop & Cassandra Based Offerings
Name | Offerings | Notes
Apache Hadoop | Hadoop Core | Enhancement: YARN
Cloudera | Enhanced Hadoop | Leader
DataStax | Enhanced Apache Cassandra | Cassandra is a distributed NoSQL DB
Hortonworks | Hadoop development and support. Hortonworks Data Platform (HDP) | Yahoo funded $23M + others. Major alliances.
MapR | Develops and sells Hadoop-derived software. M3, M5, M7. | Alliance with EMC, Amazon, and Google.
Sqoop | HDFS and SQL Integration |
Hue | Hadoop GUI Tools |
Amazon | AWS, Cloud Hadoop Cluster |
Microsoft | Windows Azure HDInsight |
IBM, Dell, etc. | Hardware, Software, Services |
Hadoop Related Technologies (Examples)
Name | Functions | Notes
Apache Hue | Hadoop GUI | Hadoop has a command line (cmd).
Apache HBase | NoSQL distributed DB, key/value column family store, runs on top of Hadoop | Bigtable-like storage for Hadoop, written in Java.
Apache Pig | High-level programming language for MapReduce | Pig Latin; interoperability with Python, JavaScript, Ruby, and Groovy
Apache Hive | Data warehouse on top of Hadoop. HiveQL | Summaries, queries, and analysis. Open-sourced by Facebook.
Apache ZooKeeper | Hadoop Configuration / Build Tools | Distributed configuration, synchronization, etc.
Apache Sqoop | Move RDBMS data into Hadoop | Command lines
Cassandra
Source: http://nosqlessentials.com, Professor Fernando Rodriguez Olivera
HBase
http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
http://blog.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/