Big Data Open Source Software and Projects Introduction I590 Data Science Curriculum August 20 2014 Geoffrey Fox gcf@indiana.edu http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington



Big Data Open Source Software and Projects
Introduction

I590 Data Science Curriculum
August 20 2014

Geoffrey Fox gcf@indiana.edu http://www.infomall.org

School of Informatics and Computing
Digital Science Center

Indiana University Bloomington


INTRODUCTION

Stress Programming Expertise: Python and Java


Introduction I

• This course examines the software used in many commercial activities to study Big Data. The backdrop for the course is the ~120 software subsystems illustrated at http://hpc-abds.org/kaleidoscope/.

• We will describe the software architecture represented by this collection which we term HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack).

– A paper discussing this can be found at http://arxiv.org/abs/1403.1528

– http://grids.ucs.indiana.edu/ptliupages/publications/nist-hpc-abds.pdf

– http://grids.ucs.indiana.edu/ptliupages/publications/OgrePaperv9.pdf

• and presentations at

– http://www.slideshare.net/Foxsden/microsoft-april302014 and

– http://www.slideshare.net/Foxsden/multifaceted-classification-of-big-data-uses-and-proposed-architecture-integrating-high-performance-computing-and-the-apache-stack

• Copies of this material may be found at http://www.infomall.org/I590ABDSSoftware/Resources/.


Introduction II

• The course covers the following material:

a) The cloud computing architecture underlying ABDS and contrast of this with HPC.

b) The software architecture with its different layers at http://hpc-abds.org/kaleidoscope/ covering broad functionality and rationale for each layer.

c) We will give application examples.

d) Then we will go through selected software systems (about 10% of those in the Kaleidoscope) which have been already deployed on FutureGrid systems using OpenStack and Chef recipes.

e) Students will each choose one other open source member of Kaleidoscope and deploy it as in d).

f) The main activity of the course will be building a significant project using multiple HPC-ABDS subsystems combined with user code and data.

g) Teams of up to 3 students can be formed, with a corresponding increase in scope in activities e) and f).

• Grading will be based on participation (10%), ABDS deployment (30%) and Project (60%). The class will interact with postings on a Google community group. The online section will also interact with Google Hangout or equivalent.

• We will use FutureSystems (FutureGrid) facilities; cloud computing experience is helpful but not essential.

• Good working experience with Java is required, and Python will be used.


DIGITAL DATA AND CLOUD BACKDROP


Gartner Emerging Technology Hype Cycle 2013


Gartner Emerging Technology Hype Cycle 2014


Six Business Era Models in the Digital Business Development Path

• As set out on the Gartner road map to digital business, there are six progressive business era models that enterprises can identify with today and to which they can aspire in the future.

• Last 3 are in Emerging Technologies Hype Cycle

• Stage 1: Analog

• Stage 2: Web

• Stage 3: E-Business

• Stage 4: Digital Marketing

• Stage 5: Digital Business

• Stage 6: Autonomous

• http://www.gartner.com/newsroom/id/2819918?_ga=1.51071721.1904172021.1401730474


Digital Business Development Stage 4: Digital Marketing

• The Digital Marketing stage sees the emergence of the Nexus of Forces (mobile, social, cloud and information).

– Enterprises in this stage focus on new and more sophisticated ways to reach consumers, who are more willing to participate in marketing efforts to gain greater social connection, or product and service value.

– Buyers of products and services have more brand influence than previously, and they see their mobile devices and social networks as preferred gateways. 

– Enterprises at this stage grapple with tapping into buyer influence to grow their business. 

• Digital Marketing tech includes: Software-Defined Anything; Volumetric and Holographic Displays; Neurobusiness; Data Science; Prescriptive Analytics; Complex Event Processing; Big Data; In-Memory DBMS; Content Analytics; Hybrid Cloud Computing; Gamification; Augmented Reality; Cloud Computing; NFC; Virtual Reality; Gesture Control; In-Memory Analytics; Activity Streams; Speech Recognition.


Digital Business Development Stage 5: Digital Business

• Digital Business is the first post-nexus stage on the road map and focuses on the convergence of people, business and things.

– The Internet of Things and the concept of blurring the physical and virtual worlds are strong concepts in this stage.

– Physical assets become digitalized and become equal actors in the business value chain alongside already-digital entities, such as systems and apps.

– 3D printing takes the digitalization of physical items further and provides opportunities for disruptive change in the supply chain and manufacturing.

– The ability to digitalize attributes of people (such as the health vital signs) is also part of this stage.

– Even currency (which is often thought of as digital already) can be transformed (for example, cryptocurrencies).

– Enterprises seeking to go past the Nexus of Forces technologies (stage 4) to become a digital business should look to these additional technologies:

• Digital Business tech includes: Bioacoustic Sensing; Digital Security; Smart Workspace; Connected Home; 3D Bioprinting Systems; Affective Computing; Speech-to-Speech Translation; Internet of Things; Cryptocurrencies; Wearable User Interfaces; Consumer 3D Printing; Machine-to-Machine Communication Services; Mobile Health Monitoring; Enterprise 3D Printing; 3D Scanners; Consumer Telematics.


Digital Business Development Stage 6: Autonomous

• Autonomous represents the final post-nexus stage.

– This stage is defined by an enterprise's ability to leverage technologies that provide humanlike or human-replacing capabilities.

– Using autonomous vehicles to move people or products or using cognitive systems to write texts or answer customer questions are all examples that mark the Autonomous stage.

– Enterprises seeking to reach this stage to gain competitiveness should consider these technologies on the Hype Cycle.

• Autonomous stage tech includes: Virtual Personal Assistants; Human Augmentation; Brain-Computer Interface; Quantum Computing; Smart Robots; Biochips; Smart Advisors; Autonomous Vehicles; Natural-Language Question Answering.


REAL WORLD BIGDATA


http://www.kpcb.com/internet-trends

My research focus is Science Big Data, but note that the largest science data is ~100 petabytes = 0.000025 of the total.

Science should take notice of commodity. Converse not clearly true?

Note 7 ZB (7 × 10^21 bytes) is about a terabyte (10^12 bytes) for each person in the world.
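The data-volume figures above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the world population of about 7 billion is an assumption not stated on the slide, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the data-volume figures quoted on the slide.
TERABYTE = 10**12
PETABYTE = 10**15
ZETTABYTE = 10**21

largest_science_bytes = 100 * PETABYTE  # "~100 petabytes" (from the slide)
science_fraction = 0.000025             # fraction of the total (from the slide)
total_bytes_quoted = 7 * ZETTABYTE      # "7 ZB" (from the slide)
world_population = 7e9                  # ASSUMPTION: roughly 7 billion people

# Total data volume implied by the quoted science fraction, in zettabytes.
implied_total_zb = largest_science_bytes / science_fraction / ZETTABYTE

# Share of the quoted 7 ZB per person, in terabytes.
per_person_tb = total_bytes_quoted / world_population / TERABYTE

print(f"Total implied by the science fraction: {implied_total_zb:.1f} ZB")
print(f"Data per person at 7 ZB: {per_person_tb:.1f} TB")
```

Under these assumptions the fraction 0.000025 corresponds to a total of about 4 ZB, while the per-person estimate uses 7 ZB; the slide does not say which year each figure refers to.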


Hundreds Of Retail Stores Are Closing

No more malls?


Where Are Shoppers Going?


Online!

We Are Here


E-Commerce Is Driving Nearly All Retail Growth In US


1 In 20 Retail Dollars Are Already Online


Even online groceries taking off


BASIC TRENDS AND JOBS


http://www.kpcb.com/internet-trends

Note that translates NOW into smaller devices.

In PAST translated into faster devices of same form factor.


Jobs



Jobs v. Countries

http://www.microsoft.com/en-us/news/features/2012/mar12/03-05CloudComputingJobs.aspx


McKinsey Institute on Big Data Jobs

• There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

• At IU, Informatics is aimed at the 1.5 million jobs. Computer Science covers the 140,000 to 190,000.

http://www.mckinsey.com/mgi/publications/big_data/index.asp