data about tech and data trends

BIG DATA APPLICATIONS & ANALYTICS MOTIVATION: BIG DATA AND THE CLOUD; CENTERPIECES OF THE FUTURE ECONOMY 11/26/2014 Course Motivation 1 Geoffrey Fox November 25 2014 [email protected] http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington


Page 1: data about tech and data trends

BIG DATA APPLICATIONS & ANALYTICS MOTIVATION:

BIG DATA AND THE CLOUD; CENTERPIECES OF THE FUTURE ECONOMY

11/26/2014 Course Motivation 1

Geoffrey Fox

November 25 2014

[email protected]

http://www.infomall.org

School of Informatics and Computing

Digital Science Center

Indiana University Bloomington

Page 2: data about tech and data trends

INTRODUCTION

General remarks, including hype curves

Page 3: data about tech and data trends

ABSTRACT

• The amount of data grows without end as we record every transaction between people and their environment (whether shopping or on a social networking site), while smartphones, smart homes, ubiquitous cities, smart power grids, and intelligent vehicles deploy sensors that record even more.

• Science with satellites and accelerators is giving data on transactions of particles and photons at the microscopic scale.

• These data are and will be stored in immense clouds with co-located storage and computing that perform "analytics", transforming data into information and then into wisdom and decisions; data mining finds the proverbial knowledge diamonds in the data rough.

• This disruptive transformation is driving the economy and creating millions of jobs in the emerging area of "data science".

• We discuss this revolution and its implications for universities and society.

Page 4: data about tech and data trends

SOME TRENDS

• The Data Deluge is a clear trend from Commercial (Amazon, e-commerce), Community (Facebook, Search) and Scientific applications

• Smaller (Intel/ARM/AMD) chips drive
  • Multicore (i.e. more computing) on shared servers
  • Smaller, lightweight clients, from smartphones and tablets to sensors (i.e. more clients)

• Clouds with cheaper, greener, easier-to-use IT for applications

• New jobs associated with new curricula
  • Clouds as a distributed system (changing a classic CS course)
  • Data Science (new area)

Page 5: data about tech and data trends

48 technologies are listed in this year’s hype cycle, the highest number in the last ten years.

2008 had the fewest (27).

Gartner said in 2012: “We are at an interesting moment, a time when the scenarios we’ve been talking about for a long time are almost becoming reality.”

Page 6: data about tech and data trends

Private Cloud Computing is off the chart

http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf

Page 7: data about tech and data trends

GARTNER EMERGING TECHNOLOGY HYPE CYCLE 2014

Page 8: data about tech and data trends


http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf

2013

Page 9: data about tech and data trends


2013

http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf

Page 10: data about tech and data trends

Note the number of “analytics” areas

http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf

Page 11: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 12: data about tech and data trends

ISSUES OF IMPORTANCE

• Economic Imperative: There are a lot of data and a lot of jobs

• Computing Model: Industry adopted clouds, which are attractive for data analytics

• Research Model: 4th Paradigm; From Theory to Data-driven science?

• Research/Business opportunities in advancing computing technologies and algorithms

• Research/Business opportunities in X-Informatics: applying the 4th paradigm (more here!)

• Development in Data Science Education: opportunities at universities

Page 13: data about tech and data trends

DATA DELUGE

Page 14: data about tech and data trends

http://www.kpcb.com/internet-trends

My research focus is Science Big Data, but note:

• Largest science data ~100 petabytes = 0.000025 of total

• 7 ZB (7 × 10^21 bytes) is about a terabyte (10^12 bytes) for each person in the world

• Zettabyte ~ 10^10 × typical local storage (100 gigabytes)

Zettabyte = 1000 Exabytes

Exabyte = 1000 Petabytes

Petabyte = 1000 Terabytes

Terabyte = 1000 Gigabytes

Gigabyte = 1000 Megabytes
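The unit arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the world population of 7 billion is an assumption (the slide says only "each person in world"), and the "implied total" is plain division of the two numbers quoted on the slide, not a figure the slide states.

```python
# Decimal (SI) byte units, as listed on the slide.
MEGABYTE = 10**6
GIGABYTE = 1000 * MEGABYTE
TERABYTE = 1000 * GIGABYTE
PETABYTE = 1000 * TERABYTE
EXABYTE = 1000 * PETABYTE
ZETTABYTE = 1000 * EXABYTE

WORLD_POPULATION = 7 * 10**9  # assumption: roughly 7 billion people

# 7 ZB shared equally among everyone in the world.
per_person_bytes = (7 * ZETTABYTE) // WORLD_POPULATION
print(per_person_bytes // TERABYTE, "TB per person")

# How many 100 GB local disks make up one zettabyte.
local_disks_per_zb = ZETTABYTE // (100 * GIGABYTE)
print(f"{local_disks_per_zb:.0e} local disks per ZB")

# Total data volume implied by "100 PB = 0.000025 of total" (simple division).
implied_total_zb = (100 * PETABYTE / 0.000025) / ZETTABYTE
print(f"implied total: about {implied_total_zb:.1f} ZB")
```

Integer arithmetic is used where the numbers divide exactly, so the first two results are not affected by floating-point rounding.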

Page 15: data about tech and data trends

Page 16: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 17: data about tech and data trends


Meeker/Wu May 29 2013 Internet Trends D11 Conference

20 hours

https://www.youtube.com/yt/press/statistics.html

Page 18: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 19: data about tech and data trends

http://cs.metrostate.edu/~sbd/ Oracle

Page 20: data about tech and data trends

Page 21: data about tech and data trends

“TAMING THE BIG DATA TIDAL WAVE” 2012 (BILL FRANKS, CHIEF ANALYTICS OFFICER, TERADATA)

• Web Data (“the original big data”)
  • Analyze customer web browsing of an e-commerce site to see topics looked at, etc.

• Auto Insurance (telematics monitoring driving)
  • Equip cars with sensors

• Text data in multiple industries
  • Sentiment analysis, identifying common issues (as in eBay lamp example), natural language processing

• Time and location (GPS) data
  • Track trucks (delivery), vehicles (track), people (tell them about nearby goodies)

• Retail and manufacturing: RFID
  • Asset and inventory management

• Utility industry: Smart Grid
  • Sensors allow dynamic optimization of power

• Gaming industry: Casino chip tracking (RFID)
  • Track individual players, detect fraud, identify patterns

• Industrial engines and equipment: sensor data
  • See GE engine

• Video games: telemetry
  • This is like monitoring web browsing, but monitors actions in a game instead

• Telecommunication and other industries: Social Network data
  • Connections make this big data
  • Use connections to find new customers with similar interests

Page 22: data about tech and data trends

Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html

Page 23: data about tech and data trends

Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html

MM = Million

Page 24: data about tech and data trends

SOME SCIENCE/TECHNICAL DATA SIZES

• LHC Particle Physics: 15 petabytes per year

• Radiology: 69 petabytes per year

• Square Kilometer Array Telescope: will be 0.5 zettabytes per year of raw data in ~2022

• Earth Observation: becoming ~4 petabytes per year

• Earthquake Science: a few terabytes total today

• PolarGrid radar studies of glaciers: 100s of terabytes per year

• Exascale simulation data dumps: ~0.1 zettabyte per year

Page 25: data about tech and data trends

http://www.kpcb.com/internet-trends

http://www.genome.gov/sequencingcosts/

Need cost-effective computing!

Sequencing every newborn by 2019 gives 100 petabytes/year

Page 26: data about tech and data trends

The Long Tail of Science

80-20 rule: 20% of users generate 80% of the data, but not necessarily 80% of the knowledge

Collectively, “long tail” science is generating a lot of data, estimated at over 1 PB per year and growing fast.

CSTI Meeting, October 2012, Dennis Gannon

Page 27: data about tech and data trends

DATA INTENSIVE ACTIVITIES (Bag = Space)

• Particle Physics LHC (bag of events of particles)

• Information Retrieval or web search (bag of words)

• e-commerce (bag of items with properties or users with rankings)

• Social Networking (bag of people with links & properties)

• Health Informatics (bag of health records, gene sequences)

• Sensors: web cams, self-driving cars, etc. (bag of pixels)

• Using
  • Statistics (histograms, chi-squared)
  • Deep Learning (Machine Learning)
  • Image Analysis (including internet uploaded images)
  • Recommender Engines (bag of ratings or properties)
  • Patterns or anomaly detection in graphs (linked data)

• On Clouds using MapReduce etc.

Page 28: data about tech and data trends

BIG DATA ECOSYSTEM IN ONE SENTENCE

• Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X)

• X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness, with more fields (physics) defined implicitly

• Spans Industry and Science (research)

• Education: Data Science; see recent New York Times articles
  • http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/

Page 29: data about tech and data trends

Social Informatics

Visual & Decision Informatics

Page 30: data about tech and data trends

JOBS

Data Science, Clouds and Computer Science

Page 31: data about tech and data trends

JOBS V. COUNTRIES

http://www.microsoft.com/en-us/news/features/2012/mar12/03-05CloudComputingJobs.aspx

Page 32: data about tech and data trends

MCKINSEY INSTITUTE ON BIG DATA JOBS

• There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

• Informatics is aimed at the 1.5 million jobs; Computer Science covers the 140,000 to 190,000.

http://www.mckinsey.com/mgi/publications/big_data/index.asp

Page 33: data about tech and data trends

Tom Davenport, Harvard Business School

http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html Nov 2012

Page 34: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 35: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 36: data about tech and data trends

INDUSTRY TRENDS

Many technology trends

Page 37: data about tech and data trends

http://www.kpcb.com/internet-trends

Note that this trend NOW translates into smaller devices

In the PAST it translated into faster devices of the same form factor

Page 38: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 39: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 40: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 41: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 42: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 43: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 44: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 45: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 46: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 47: data about tech and data trends

http://www.kpcb.com/internet-trends

Page 48: data about tech and data trends

Page 49: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 50: data about tech and data trends

Page 51: data about tech and data trends

Page 52: data about tech and data trends

THE PAST?

Displaced by Digital Disruption

Page 53: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 54: data about tech and data trends

http://sephlawless.com/black-friday-2014 No more malls?

Page 55: data about tech and data trends

No more malls?

Page 56: data about tech and data trends

WHERE ARE SHOPPERS GOING?

Page 57: data about tech and data trends

We Are Here 2014-2015

Page 58: data about tech and data trends

E-COMMERCE IS DRIVING NEARLY ALL RETAIL GROWTH IN US

Page 59: data about tech and data trends

1 IN 20 RETAIL DOLLARS ARE ALREADY ONLINE

Page 60: data about tech and data trends

Even online groceries taking off

Page 61: data about tech and data trends

COMPUTING MODEL

Industry adopted clouds, which are attractive for data analytics

Page 62: data about tech and data trends

Cloud Computing has been Transformational for the last 5 years, and Big Data for the last 2 years

Note that in 2013 Big Data moves to the 5-10 year slot

Page 63: data about tech and data trends

AMAZON CLOUD AWS MAKING MONEY

• It took Amazon Web Services (AWS) eight years to hit $650 million in revenue, according to Citigroup in 2010.

• Just three years later, Macquarie Capital analyst Ben Schachter estimates that AWS will top $3.8 billion in 2013 revenue, up from $2.1 billion in 2012 (estimated), valuing the AWS business at $19 billion.

• AWS was the first public cloud computing supplier, building on the many cloud systems used to run Amazon, Google, Bing, eBay, …

Page 64: data about tech and data trends

OPERATIONALLY CLOUDS ARE CLEAR

• A bunch of computers in an efficient data center with an excellent Internet connection

• They were produced to meet the needs of public-facing Web 2.0 e-Commerce/Social Networking sites

• They can be considered as an “optimal giant data center” plus an Internet connection

• Note that enterprises use private clouds, which are giant data centers but are not optimized for Internet access

Page 65: data about tech and data trends

THE MICROSOFT CLOUD IS BUILT ON DATA CENTERS

Quincy, WA; Chicago, IL; San Antonio, TX; Dublin, Ireland; Generation 4 DCs

~100 Globally Distributed Data Centers

Range in size from “edge” facilities to megascale (100K to 1M servers). Giant data centers with ~ 200-1000 to a shipping container with Internet access.

CSTI Meeting, October 2012, Dennis Gannon

Page 66: data about tech and data trends

DATA CENTERS CLOUDS & ECONOMIES OF SCALE

Range in size from “edge” facilities to megascale.

Economies of scale: approximate costs for a small center (1K servers) and a larger, 50K-server center.

Technology      | Cost in small-sized data center | Cost in large data center   | Ratio
Network         | $95 per Mbps/month              | $13 per Mbps/month          | 7.1
Storage         | $2.20 per GB/month              | $0.40 per GB/month          | 5.7
Administration  | ~140 servers/administrator      | >1000 servers/administrator | 7.1

http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf

Each data center is 11.5 times the size of a football field

2 Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon

Such centers use 20MW-200MW (future), each with 150 watts per CPU

Save money from large size, positioning with cheap power, and access with Internet

Page 67: data about tech and data trends

VIRTUALIZATION MADE SEVERAL THINGS MORE CONVENIENT

• Virtualization = abstraction; run a job – you know not where

• Virtualization = use hypervisor to support “images”
  • Allows you to define a complete job as an “image” – OS + application

• Efficient packing of multiple applications into one server, as they don’t interfere (much) with each other if in different virtual machines
  • They interfere if put as two jobs in the same machine, as for example they must have the same OS and the same OS services

• Also, the security model between VMs is more robust than between processes

Page 68: data about tech and data trends

MICROSOFT SERVER CONSOLIDATION

• http://research.microsoft.com/pubs/78813/AJ18_EN.pdf

• Typical data center CPU has 9.75% utilization

• Take 5000 SQL servers and rehost on virtual machines with 6:1 consolidation

• 60% saving

Page 69: data about tech and data trends

THE GOOGLE GMAIL EXAMPLE

• http://www.google.com/green/pdfs/google-green-computing.pdf

• Clouds win by efficient resource use and efficient data centers

Business type | Number of users | # servers | IT power per user | PUE (Power Usage Effectiveness) | Total power per user | Annual energy per user
Small         | 50              | 2         | 8 W               | 2.5                             | 20 W                 | 175 kWh
Medium        | 500             | 2         | 1.8 W             | 1.8                             | 3.2 W                | 28.4 kWh
Large         | 10000           | 12        | 0.54 W            | 1.6                             | 0.9 W                | 7.6 kWh
Gmail (Cloud) |                 |           | < 0.22 W          | 1.16                            | < 0.25 W             | < 2.2 kWh
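The columns of the Gmail table are related by simple arithmetic: total power per user is IT power per user times PUE, and annual energy is total power times the hours in a year. The sketch below reproduces the Small, Medium and Large rows under that reading; the function names are illustrative and a year of 8760 hours (non-leap) is assumed.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; assumes a non-leap year

def total_power_w(it_power_w, pue):
    """Total power per user (W) = IT power per user (W) x PUE."""
    return it_power_w * pue

def annual_energy_kwh(power_w):
    """Energy used in one year (kWh) by a constant load of power_w watts."""
    return power_w * HOURS_PER_YEAR / 1000.0

# (IT power per user in W, PUE) taken from the table rows.
rows = {"Small": (8.0, 2.5), "Medium": (1.8, 1.8), "Large": (0.54, 1.6)}

for name, (it_w, pue) in rows.items():
    total = total_power_w(it_w, pue)
    print(f"{name}: {total:.2f} W per user, {annual_energy_kwh(total):.1f} kWh per year")
```

The computed values agree with the table after rounding (for example 20 W and about 175 kWh for the small business), which suggests this is how the table was derived.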

Page 70: data about tech and data trends

CLOUDS OFFER FROM DIFFERENT POINTS OF VIEW

• Features from NIST:
  • On-demand service (elastic);
  • Broad network access;
  • Resource pooling;
  • Flexible resource allocation;
  • Measured service

• Economies of scale in performance and electrical power (Green IT)

• Powerful new software models
  • Platform as a Service is not an alternative to Infrastructure as a Service; it is instead an incredible value added
  • Amazon is as much PaaS as Azure

• They are cheaper than classic clusters unless the latter are 100% utilized

Page 71: data about tech and data trends

BPM = Business Process Management

IaaS: Hardware, e.g. Server

PaaS: Systems Services, e.g. MapReduce, Database

SaaS: Applications, e.g. Recommender System

BPaaS: Particular Application Set

Page 72: data about tech and data trends

http://www.gartner.com/technology/reprints.do?id=1-1UKQQA6&ct=140528&st=sb

Gartner Magic Quadrant for Cloud Infrastructure as a Service

AWS in lead, followed by Azure

Page 73: data about tech and data trends

RESEARCH MODEL

4th Paradigm; From Theory to Data-driven science?

Page 74: data about tech and data trends

http://www.wired.com/wired/issue/16-07 September 2008


Page 75: data about tech and data trends

THE 4 PARADIGMS OF SCIENTIFIC RESEARCH

1. Theory

2. Experiment or Observation
   • E.g. Newton observed apples falling to design his theory of mechanics

3. Simulation of theory or model (Supercomputers)

4. Data-driven (Big Data) or The Fourth Paradigm: Data-Intensive Scientific Discovery (aka Data Science)
   • http://research.microsoft.com/en-us/collaboration/fourthparadigm/ A free book
   • More data; fewer models

Page 76: data about tech and data trends

MORE DATA USUALLY BEATS BETTER ALGORITHMS


• Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!

• Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database(IMDB). Guess which team did better?

• Anand Rajaraman is Senior Vice President at Walmart Global eCommerce, where he heads up the newly created @WalmartLabs.

• http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

• 20120117berkeley1.pdf Jeff Hammerbacher

Page 77: data about tech and data trends

DATA SCIENCE PROCESS

DIKW: Data → Information → Knowledge → Wisdom → Decisions

Page 78: data about tech and data trends

DIKW PROCESS

• Data becomes

• Information becomes

• Knowledge becomes

• Wisdom or Decisions
  • Community acceptance of results or approach is important here

• Volume of bits and bytes decreases as we proceed down the DIKW pipeline

Page 79: data about tech and data trends

[Figure: workflow through multiple filter/discovery clouds. Sensors or data interchange services (SS) feed filter clouds, discovery clouds, a storage cloud, a compute cloud, a Hadoop cluster, a distributed grid, a database, and other clouds, grids and services, along the pipeline Raw Data → Data → Information → Knowledge → Wisdom → Decisions]

SS: Sensor or Data Interchange Service

Data Deluge is also Information/Knowledge/Wisdom/Decision Deluge?

Page 80: data about tech and data trends

EXAMPLE OF GOOGLE MAPS/NAVIGATION

• Data comes from traditional maps (US Geological Survey), satellites (overlays) and street cams

• Information is presented by the basic Google Maps web page

• Knowledge is a particular optimized route

• Decisions (Wisdom) come from deciding to drive a particular route

Page 81: data about tech and data trends

PHYSICS-INFORMATICS: LOOKING FOR HIGGS PARTICLE WITH LARGE HADRON COLLIDER (LHC)

Application Example

Page 82: data about tech and data trends

The LHC produces some 15 petabytes of data per year of all varieties, with the exact value depending on the duty factor of the accelerator (which is reduced simply to cut electricity cost, but also due to malfunction of one or more of the many complex systems) and of the experiments. The raw data produced by the experiments is processed on the LHC Computing Grid, which has some 200,000 cores arranged in a three-level structure: Tier-0 is CERN itself, Tier-1 are national facilities, and Tier-2 are regional systems. For example, one LHC experiment (CMS) has 7 Tier-1 and 50 Tier-2 facilities.

The analysis chain raw data → reconstructed data → AOD and TAGS → physics is performed on the multi-tier LHC Computing Grid. Note that every event can be analyzed independently, so many events can be processed in parallel, with some concentration operations such as those that gather entries in a histogram. This implies that both Grid and Cloud solutions work with this type of data, although Grids are the only implementation today.

http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf

Note LHC lies in a tunnel 27 kilometres (17 mi) in circumference

[Images: Higgs Event; ATLAS Expt]
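The pattern described here, independent per-event analysis followed by a "concentration" step that gathers histogram entries, can be sketched in a few lines of Python. Everything below is illustrative: the event records, the `mass` field, the bin width, and the use of local threads in place of grid or cloud worker nodes are all assumptions, not the LHC software.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BIN_WIDTH = 10.0  # hypothetical histogram bin width

def analyze_event(event):
    """Stand-in for per-event analysis: map one event to a histogram bin."""
    return int(event["mass"] // BIN_WIDTH)

def histogram_chunk(events):
    """Analyze a chunk of events independently of every other chunk."""
    return Counter(analyze_event(e) for e in events)

def parallel_histogram(events, n_chunks=4):
    """Process chunks in parallel, then gather the partial histograms."""
    chunks = [events[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(histogram_chunk, chunks))
    total = Counter()
    for partial in partials:  # the "concentration" step
        total.update(partial)
    return total

events = [{"mass": m} for m in (91.2, 125.1, 124.8, 90.7, 126.3, 45.0)]
hist = parallel_histogram(events)
print(sorted(hist.items()))
```

Because no event depends on any other, the chunks could equally be sent to separate machines; only the small partial histograms need to be brought together at the end.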

Page 83: data about tech and data trends

http://www.interactions.org/cms/?pid=1032811

The inside of the RHIC (Relativistic Heavy Ion Collider) tunnel, a 2.4-mile high-tech particle racetrack at Brookhaven National Laboratory.

Page 84: data about tech and data trends

Model

http://www.quantumdiaries.org/2012/09/07/why-particle-detectors-need-a-trigger/atlasmgg/


Page 85: data about tech and data trends

PERSONAL NOTE

• As a naïve undergraduate in 1964, I was told by a professor (who later left the university to enter the church) that bumps like [figure] were particles. I was amazed and found this more intriguing than anything else I had heard about, so I decided to do a PhD in particle physics.

• I later decided computing was moving faster than physics, so I went into Informatics

• Also I was alarmed by the size and time scale of physics activities

• Note ATLAS is 45 metres long, 25 metres in diameter, and weighs about 7,000 tons. The experiment is a collaboration involving roughly 3,000 physicists at 175 institutions in 38 countries

• The US version of the LHC, the Superconducting Super Collider (SSC), discussed in 1983, was cancelled in 1993 after $2B had been spent

Page 86: data about tech and data trends

http://www.sciencedirect.com/science/article/pii/S037026931200857X

Page 87: data about tech and data trends

RECOMMENDER SYSTEMS I

Technology Example

Page 88: data about tech and data trends

OVERVIEW OF MANY INFORMATICS AREAS

• In many cases, one needs personalized matching of items to people or perhaps collections of items to collections of people
  • People to products: Online and Offline Commerce
  • People to People: Social Networking
  • People to Jobs or Employers: Job Sites
  • People+Queries to the Web: Information Retrieval (search as in Bing/Google)

Page 89: data about tech and data trends

RECOMMENDER SYSTEMS IN MORE DETAIL

• A large number of online and offline commerce activities, plus basic Internet site personalization, rely on “recommender systems”

• Given a real-time action by a user, immediately suggest new actions (as in Amazon buy recommendations on the web)

• Based on past actions of users (and others), suggest movies to look at, restaurants to eat at, events to go to, books and music to buy

• Based on a mix of explicit user choice and grouping of Internet sites, present a customized Google News page

• Given sales statistics, decide on discounts at “real” supermarkets and on placement of products related by analysis of buying habits

• Identify possible colleagues at Social Networking sites like LinkedIn

• Identify matches between employers and employees at sites like CareerBuilder and Monster

Page 90: data about tech and data trends

EVERYTHING IS AN OPTIMIZATION PROBLEM?

• Fit Model to Data
  • Higgs + Background

• Match User to Jobs or Books or Other Users?

• Classification is optimizing the assignment of members of an ontology (list of categories) to data

• Typically minimize some function (or maximize the negative of the function)
  • An interesting feature of these problems is the ingenious choice of function
  • Note Physics minimizes (free) energy

• Often involves thinking of people and/or items as points in a space (not always a traditional vector space)
  • The space is called a “bag” in the “bag of words” model for information retrieval
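As a small illustration of "fit model to data" as minimization, the sketch below fits data to a sum of a signal shape and a background shape by least squares, i.e. it chooses amplitudes a and b to minimize the sum of (data - a*signal - b*background)^2. The template shapes and numbers are invented for the example, and real Higgs analyses use far more elaborate likelihood fits.

```python
def fit_two_templates(data, signal, background):
    """Least-squares amplitudes (a, b) for data ~ a*signal + b*background.

    Solves the 2x2 normal equations obtained by setting the derivatives
    of sum((d - a*s - b*g)**2) with respect to a and b to zero.
    """
    ss = sum(s * s for s in signal)
    gg = sum(g * g for g in background)
    sg = sum(s * g for s, g in zip(signal, background))
    ds = sum(d * s for d, s in zip(data, signal))
    dg = sum(d * g for d, g in zip(data, background))
    det = ss * gg - sg * sg
    if det == 0:
        raise ValueError("signal and background shapes are not independent")
    a = (ds * gg - dg * sg) / det
    b = (dg * ss - ds * sg) / det
    return a, b

# Hypothetical binned shapes: a bump on top of a falling background.
signal = [0, 1, 4, 1, 0]
background = [5, 4, 3, 2, 1]
data = [2 * s + 3 * g for s, g in zip(signal, background)]

a, b = fit_two_templates(data, signal, background)
print(a, b)
```

Here the function being minimized is the sum of squared residuals; choosing a different function (for example a likelihood) gives a different fit, which is the "ingenious choice of function" the slide refers to.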

Page 91: data about tech and data trends

NETFLIX ON PERSONALIZATION

http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial

Page 92: data about tech and data trends

NETFLIX ON RECOMMENDATIONS

http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial

Page 93: data about tech and data trends

APRIL 2013: THE LAST TWO QUARTERS HAVE EACH BROUGHT MORE THAN 2 MILLION NEW STREAMING SUBSCRIBER SIGNUPS. THAT GIVES NETFLIX A CURRENT TOTAL OF NEARLY 29.2 MILLION SUBSCRIBERS

http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial

Page 94: data about tech and data trends

http://www.ifi.uzh.ch/ce/teaching/spring2012/16-Recommender-Systems_Slides.pdf

Page 95: data about tech and data trends

NETFLIX ON DATA SCIENCE

Note Netflix and others run tests all the time on subsets of customers

http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial

Page 96: data about tech and data trends

RECOMMENDER SYSTEMS II

Use of Data bags and spaces

Page 97: data about tech and data trends

DISTANCES IN FUNNY SPACES I

• In user-based collaborative filtering, we can think of users in a space of dimension N where there are N items and M users.
  • Let i run over items and u over users

• Then each user is represented as a vector U_i(u) in “item-space” where ratings are vector components. We are looking for users u and u’ that are near each other in this space as measured by some distance between U_i(u) and U_i(u’)

• If u and u’ rate all items then these are “real” vectors, but almost always they each rate only a small fraction of items and the number in common is even smaller

• The “Pearson coefficient” is just one distance measure that can be used
  • Only sum over items i rated by both u and u’

Last.fm uses this for songs, as do Amazon and Netflix
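A minimal sketch of the Pearson coefficient between two users, summing only over items both have rated, might look as follows. The dictionary layout (item name to rating) and the example users are assumptions for illustration; taking the means over the co-rated items is one common convention, and strictly the coefficient is a similarity (1 means identical taste), so a distance could be taken as, say, 1 minus this value.

```python
from math import sqrt

def pearson(ratings_u, ratings_v):
    """Pearson correlation of two users over the items both have rated.

    Each argument is a dict mapping item -> rating. Returns 0.0 when
    there are fewer than two co-rated items or no variation in ratings.
    """
    common = set(ratings_u) & set(ratings_v)
    n = len(common)
    if n < 2:
        return 0.0
    mean_u = sum(ratings_u[i] for i in common) / n
    mean_v = sum(ratings_v[i] for i in common) / n
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / sqrt(var_u * var_v)

# Hypothetical users; bob rates everything one star below alice.
alice = {"A": 5, "B": 3, "C": 4}
bob = {"A": 4, "B": 2, "C": 3, "D": 5}
carol = {"A": 1, "B": 5}

print(pearson(alice, bob))    # same taste on co-rated items
print(pearson(alice, carol))  # opposite taste on co-rated items
```

Note that item "D" plays no role in the alice/bob comparison because only bob rated it, which is exactly the "only sum over i rated by both" rule.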

Page 98: data about tech and data trends

DISTANCES IN FUNNY SPACES II

• In item-based collaborative filtering, we can think of items in a space of dimension M where there are N items and M users.
  • Let i run over items and u over users

• Then each item is represented as a vector R_u(i) in “user-space” where ratings are vector components. We are looking for items i and i’ that are near each other in this space as measured by some distance between R_u(i) and R_u(i’)

• If i and i’ are rated by all users then these are “real” vectors, but almost always they are each rated by only a small fraction of users and the number in common is even smaller

• The “Cosine measure” is just one distance measure that can be used
  • Only sum over users u rating both i and i’
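The cosine measure between two items, restricted to users who rated both, can be sketched as below. The data layout (user name to rating) and the example items are hypothetical, and restricting the norms as well as the dot product to the co-rating users is one possible reading of the slide.

```python
from math import sqrt

def cosine_measure(item_i, item_j):
    """Cosine of the angle between two items in "user-space".

    Each argument is a dict mapping user -> rating; only users who rated
    both items contribute. Returns 0.0 if there are no such users.
    """
    common = set(item_i) & set(item_j)
    if not common:
        return 0.0
    dot = sum(item_i[u] * item_j[u] for u in common)
    norm_i = sqrt(sum(item_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(item_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Hypothetical items rated by users u1..u3.
item_a = {"u1": 4, "u2": 2, "u3": 5}
item_b = {"u1": 2, "u2": 1}   # proportional to item_a on the shared users
item_c = {"u1": 1, "u2": 5}   # rated very differently by the same users

print(cosine_measure(item_a, item_b))
print(cosine_measure(item_a, item_c))
```

Items whose ratings are proportional on the shared users score close to 1; a smaller value means the shared users disagree more about the two items.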

Page 99: data about tech and data trends

DISTANCES IN FUNNY SPACES III

• In content-based recommender systems, we can think of items in a space of dimension M where there are N items and M properties.
  • Let i run over items and p over properties

• Then each item is represented as a vector P_p(i) in “property-space” where values of properties are vector components. We are looking for items i and i’ that are near each other in this space as measured by some distance between P_p(i) and P_p(i’)

• Properties could be “reading age” or “character highlighted” or “author” for books

• Properties can be genre or artist for songs and video

• Properties can characterize pixel structure for images used in face recognition, driving, etc.

Pandora uses this for songs (Music Genome), as do Amazon and Netflix

Page 100: data about tech and data trends

• Much of (eCommerce/LifeStyle) Informatics involves “points”

• Events in LHC analysis

• Users (people) or items (books, jobs, music, other people)

• These points can be thought of being in a “space” or “bag”

• Set of all books

• Set of all physics reactions

• Set of all Internet users

• However as in recommender systems where a given user only rates some items, we don’t know “full position”

• However we can nearly always define a useful distance d(a,b) between points

• Always d(a,b) >= 0

• Usually d(a,b) = d(b,a)

• Rarely d(a,b) + d(b,c) >= d(a,c) (the Triangle Inequality)
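As a small worked check of these three properties, the sketch below uses the cosine distance 1 - cos(a, b), which is non-negative and symmetric but does not satisfy the triangle inequality. The three 2-D points are made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance 1 - cos(angle between a and b) for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Three illustrative points: b lies "between" a and c in angle.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

d_ab = cosine_distance(a, b)
d_bc = cosine_distance(b, c)
d_ac = cosine_distance(a, c)

nonnegative = min(d_ab, d_bc, d_ac) >= 0.0
symmetric = math.isclose(d_ab, cosine_distance(b, a))
triangle_holds = d_ab + d_bc >= d_ac  # False here: the detour via b is "shorter"
```

Here d(a,b) and d(b,c) are each about 0.29 while d(a,c) is 1, so the triangle inequality fails even though the distance is still useful for ranking neighbors.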

DO WE NEED “REAL” SPACES?


Page 101: data about tech and data trends

• The simplest way to use distances is “nearest neighbor algorithms” – given one point, find a set of points near it – cut off by number of identified nearby points and/or distance to initial point

• Here point is either user or item

• Another approach is to divide space into regions (topics, latent factors) consisting of nearby points

• This is clustering

• Also other algorithms like Gaussian mixture models or Latent Semantic Analysis or Latent Dirichlet Allocation which use a more sophisticated model
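A nearest neighbor search with both kinds of cutoff can be sketched as below. It needs only a distance function, not coordinates in a "real" space. The function names, the Euclidean distance and the toy points are assumptions for the example; any d(a,b) could be passed in.

```python
def nearest_neighbors(query, points, distance, k=None, max_distance=None):
    """Return names of points near `query`, closest first.

    The result is cut off by number of neighbors (k) and/or by distance
    to the initial point (max_distance); either cutoff may be omitted.
    """
    scored = sorted(
        ((distance(query, p), name) for name, p in points.items()),
        key=lambda pair: pair[0],
    )
    if max_distance is not None:
        scored = [(d, name) for d, name in scored if d <= max_distance]
    if k is not None:
        scored = scored[:k]
    return [name for _, name in scored]


def euclidean(a, b):
    """One possible distance; any non-negative d(a, b) works here."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Toy points (users or items) and a query point.
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (0.0, 2.0), "D": (5.0, 5.0)}
query = (0.0, 0.5)

two_nearest = nearest_neighbors(query, points, euclidean, k=2)
within_two = nearest_neighbors(query, points, euclidean, max_distance=2.0)
both_cutoffs = nearest_neighbors(query, points, euclidean, k=2, max_distance=1.0)
```

This brute-force version compares the query with every point; large systems would normally use an index or approximate search instead.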

USING DISTANCES


Page 102: data about tech and data trends

Another Example

WEB SEARCH AND INFORMATION RETRIEVAL

Page 103: data about tech and data trends

• Get the digital data (from web or from scanning)

• Need to crawl web (a solved “engineering” problem?)

• Preprocess data to get searchable things (words, positions)

• Form Inverted Index mapping words to documents

• Typically use TF-IDF (term frequency, Inverse Document frequency) to quantify importance of word match

• Rank relevance of documents: PageRank

• Lots of technology for advertising, “reverse engineering” and “preventing reverse engineering”

• Clustering of documents into topics (as in Google News)
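The inverted index and TF-IDF steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: whitespace tokenization, raw term counts for TF, and log(N / document frequency) for IDF, which is one common variant among several. The documents and function names are made up.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def tf_idf_scores(query, index, n_docs):
    """Score documents for a query as the sum over query words of tf * idf."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rarer words weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (made up).
docs = {
    "d1": "big data and the cloud",
    "d2": "cloud computing for science",
    "d3": "data mining finds knowledge in data",
}
index = build_inverted_index(docs)
ranking = tf_idf_scores("data mining", index, len(docs))
```

Here "mining" appears in only one document and so contributes more than "data", which appears in two. Link-based ranking such as PageRank would be a separate step on top of this word match score.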

“WEB DATA ANALYTICS”


Page 104: data about tech and data trends

Size of face proportional to PageRank


Page 105: data about tech and data trends

• Goal (Function to Optimize – Long Term dollars)

• Serve the right item to a user in a given context to optimize long-term business objectives

• A scientific discipline that involves

• Large scale Machine Learning & Statistics

• Offline Models (capture global & stable characteristics)

• Online Models (incorporates dynamic components)

• Explore/Exploit (active and adaptive experimentation)

• Multi-Objective Optimization

• Click-rates (CTR), Engagement, advertising revenue, diversity, etc

• Inferring user interest

• Constructing User Profiles

• Natural Language Processing to understand content

• Topics, “aboutness”, entities, follow-up of something, breaking news,…
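The explore/exploit idea can be illustrated with an epsilon-greedy rule, one of the simplest such schemes; it is used here only as an example and is not claimed to be what Yahoo deploys. The click-rates, the function name and the simulated users are all made up.

```python
import random


def epsilon_greedy(true_ctr, rounds, epsilon=0.1, seed=0):
    """Simulate serving one item per round with an epsilon-greedy policy.

    true_ctr holds hypothetical click probabilities, unknown to the policy.
    Returns (shows, clicks) per item.
    """
    rng = random.Random(seed)
    n_items = len(true_ctr)
    shows = [0] * n_items
    clicks = [0] * n_items
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: show a random item to keep learning about all of them.
            item = rng.randrange(n_items)
        else:
            # Exploit: show the item with the best estimated click-rate so far.
            estimates = [c / s if s else 0.0 for c, s in zip(clicks, shows)]
            item = max(range(n_items), key=estimates.__getitem__)
        shows[item] += 1
        if rng.random() < true_ctr[item]:  # simulated user click
            clicks[item] += 1
    return shows, clicks


# Deliberately exaggerated click-rates so the effect is easy to see.
shows, clicks = epsilon_greedy([0.1, 0.5, 0.9], rounds=5000)
best_item = shows.index(max(shows))
```

Most traffic ends up on the item with the highest click-rate, while the small exploration fraction keeps the estimates for the other items up to date.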

MODERN RECOMMENDATION SYSTEMS (FROM YAHOO)

http://pages.cs.wisc.edu/~beechung/icml11-tutorial/

Page 106: data about tech and data trends

[Figure annotations: Recommend applications; Recommend search queries; Recommend news article; Recommend packages (image; title, summary; links to other pages): pick 4 out of a pool of K, K = 20 ~ 40; dynamic; routes traffic to other pages]

http://pages.cs.wisc.edu/~beechung/icml11-tutorial/

Page 107: data about tech and data trends

• Simple version

• I have a content module on my page, content inventory is obtained from a third party source which is further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve overall click-rate (CTR) on this module

• More advanced

• I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks?

• Highly advanced

• There are multiple modules running on my webpage. How do I perform a simultaneous optimization?

SOME EXAMPLES FROM CONTENT OPTIMIZATION

http://pages.cs.wisc.edu/~beechung/icml11-tutorial/

Page 108: data about tech and data trends


Science Clouds

Internet of Things

CLOUD APPLICATIONS IN RESEARCH


Page 109: data about tech and data trends

• Large Scale Supercomputers – Multicore nodes linked by high performance low latency network

• Increasingly with GPU enhancement

• Suitable for highly parallel simulations

• High Throughput Systems such as European Grid Initiative EGI or Open Science Grid OSG typically aimed at pleasingly parallel jobs

• Can use “cycle stealing”

• Classic example is LHC data analysis

• Grids federate resources as in EGI/OSG or enable convenient access to multiple backend systems including supercomputers

• Use Services (SaaS)

• Portals make access convenient and

• Workflow integrates multiple processes into a single job

SCIENCE COMPUTING ENVIRONMENTS


Page 110: data about tech and data trends

• Synchronization/communication performance: Grids > Clouds > Classic HPC Systems

• Clouds naturally execute Grid workloads effectively, but their suitability for closely coupled HPC applications is less clear

• Classic HPC machines as MPI engines offer highest possible performance on closely coupled problems

• The 4 forms of MapReduce/MPI

1) Map Only – pleasingly parallel

2) Classic MapReduce as in Hadoop; single Map followed by reduction with fault tolerant use of disk

3) Iterative MapReduce used for data mining such as Expectation Maximization in clustering etc.; caches data in memory between iterations and supports the large collective communication (Reduce, Scatter, Gather, Multicast) used in data mining

4) Classic MPI: supports small point-to-point messaging efficiently, as used in partial differential equation solvers
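Form 3, Iterative MapReduce, can be sketched with k-means clustering, a hard-assignment relative of Expectation Maximization. This is a single-process illustration with made-up 1-D points: the function names are hypothetical, and a real system such as Twister, Harp or Spark would distribute the map calls and implement the reduction as a collective operation.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda k: abs(point - centroids[k]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce: sum points and counts for one centroid and return the new mean."""
    total = sum(p for p, _ in values)
    count = sum(n for _, n in values)
    return total / count


def iterative_kmeans(points, centroids, max_iterations=20):
    """Repeat map + reduce until the centroids stop changing.

    `points` stays in memory across iterations, which is the caching that
    distinguishes iterative MapReduce from rereading disk on every step.
    """
    for _ in range(max_iterations):
        groups = defaultdict(list)
        for point in points:  # map phase (data parallel)
            key, value = kmeans_map(point, centroids)
            groups[key].append(value)
        new_centroids = [  # reduce phase (one global sum per centroid)
            kmeans_reduce(groups[k]) if groups[k] else centroids[k]
            for k in range(len(centroids))
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


final_centroids = iterative_kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

Only the small list of centroids changes between iterations; this is the data that would be broadcast to every map task in a distributed run.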

CLOUDS HPC AND GRIDS


Page 111: data about tech and data trends

• Pleasingly (moving to modestly) parallel applications of all sorts with roughly independent data or spawning independent simulations

• Long tail of science and integration of distributed sensors

• Commercial and Science Data analytics that can use MapReduce (some of such apps) or its iterative variants (most other data analytics apps)

• Which science applications are using clouds?

• Venus-C (Azure in Europe): 27 applications not using Scheduler, Workflow or MapReduce (except roll your own)

• 50% of applications on FutureGrid were from Life Science

• Locally Lilly corporation is commercial cloud user (for drug discovery) but not IU Biology

• But overall very little science use of clouds yet

WHAT APPLICATIONS WORK IN CLOUDS


Page 112: data about tech and data trends

• “Long tail of science” can be an important usage mode of clouds.

• In some areas like particle physics and astronomy, i.e. “big science”, there are just a few major instruments now generating petascale data driving discovery in a coordinated fashion.

• In other areas such as genomics and environmental science, there are many “individual” researchers with distributed collection and analysis of data whose total data and processing needs can match the size of big science.

• Clouds can provide convenient, scalable resources for this important aspect of science.

• Can be map only use of MapReduce if different usages naturally linked e.g. exploring docking of multiple chemicals or alignment of multiple DNA sequences

• Collecting together (summarizing) multiple “maps” is a Reduction

PARALLELISM OVER USERS AND USAGES


Page 113: data about tech and data trends

• It is projected that there will be 24-75 billion devices on the Internet by 2020. Most will be small sensors that send streams of information into the cloud where it will be processed and integrated with other streams and turned into knowledge that will help our lives in a multitude of small and big ways.

• The cloud will become increasingly important as a controller of and resource provider for the Internet of Things.

• As well as today’s use for smart phone and gaming console support, “Intelligent River” “smart homes and grid” and “ubiquitous cities” build on this vision and we could expect a growth in cloud supported/controlled robotics.

• Some of these “things” will be supporting science

• Natural parallelism over “things”

• “Things” are distributed and so form a Grid

INTERNET OF THINGS AND THE CLOUD


Page 114: data about tech and data trends

• We will look at several streaming examples later but most of the use cases involve this. Streaming can be seen in many ways

• There are devices – The Internet of Things and various MEMS in smartphones

• There are people tweeting or logging in to the cloud to search. These interactions are streaming

• Apache Storm is critical software here to gather and integrate multiple erratic streams

• There are also important algorithmic challenges in quickly updating analyses with streamed data
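One simple instance of updating an analysis as data streams in is a running mean and variance that never stores or rescans past values. The sketch below uses Welford's online algorithm. The class name and sample readings are made up, and it is unrelated to the Apache Storm API.

```python
class RunningStats:
    """Mean and variance updated one streamed value at a time (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics in O(1) time and memory."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


# Pretend these sensor readings arrive one by one.
stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
```

More complex analyses such as clustering need their own incremental formulations, which is where the algorithmic challenge lies.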

STREAMING CATEGORY

http://www.kpcb.com/internet-trends

Page 115: data about tech and data trends

HOME DEVICES

Page 116: data about tech and data trends

SENSORS (THINGS) AS A SERVICE

[Figure labels: Sensors as a Service; Sensor Processing as a Service (could use MapReduce); A larger sensor ………; Output Sensor]

Open Source Sensor (IoT) Cloud: https://sites.google.com/site/opensourceiotcloud/

Page 117: data about tech and data trends


Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 118: data about tech and data trends


Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 119: data about tech and data trends


Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 120: data about tech and data trends

Software Ecosystems

PARALLEL COMPUTING AND MAPREDUCE

Page 121: data about tech and data trends


http://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014

Page 122: data about tech and data trends

There are a lot of Big Data and HPC Software systems

Challenge! Manage environment offering these different components


Page 123: data about tech and data trends

• We don’t need 266 software packages so can choose e.g.

• Workflow: IPython, Pegasus or Kepler (replaced by tools like Tez?)

• Data Analytics: Mahout, R, ImageJ, ScaLAPACK

• High level Programming: Hive, Pig

• Parallel Programming model: Hadoop, Spark, Giraph (Twister4Azure, Harp), MPI; Storm, Kafka or RabbitMQ (Sensors)

• In-memory: Memcached

• Data Management: HBase, MongoDB, MySQL or Derby

• Distributed Coordination: Zookeeper

• Cluster Management: Yarn, Slurm

• File Systems: HDFS, Lustre

• DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler

• IaaS: Amazon, Azure, OpenStack, Libcloud

• Monitoring: Inca, Ganglia, Nagios

MAYBE A BIG DATA INITIATIVE WOULD INCLUDE


Page 124: data about tech and data trends

• HPC: Typically SPMD (Single Program Multiple Data) “maps” typically processing particles or mesh points interspersed with a multitude of low latency messages supported by specialized networks such as InfiniBand and technologies like MPI

• Often run large capability jobs with 100K (going to 1.5M) cores on same job

• National DoE/NSF/NASA facilities run at 100% utilization

• Fault fragile and cannot tolerate “outlier maps” taking longer than others

• Clouds: MapReduce has asynchronous maps typically processing data points with results saved to disk. Final reduce phase integrates results from different maps

• Fault tolerant and does not require map synchronization

• Map only useful special case

• HPC + Clouds: Iterative MapReduce caches results between “MapReduce” steps and supports SPMD parallel computing with large messages as seen in parallel kernels (linear algebra) in clustering and other data mining

CLASSIC PARALLEL COMPUTING


Page 125: data about tech and data trends

MAPREDUCE “FILE/DATA REPOSITORY” PARALLELISM

[Figure labels: Instruments; Disks; Map1, Map2, Map3; Reduce; Communication; Portals/Users; Iterative MapReduce: Map Map Map Map, Reduce Reduce Reduce]

Map = (data parallel) computation reading and writing data

Reduce = Collective/Consolidation phase e.g. forming multiple global sums as in histogram

Page 126: data about tech and data trends

SAM’S PROBLEM

http://www.slideshare.net/esaliya/mapreduce-in-simple-terms

• Sam thought of “drinking” the apple

He used a [image] to cut the [image] and a [image] to make juice.

Page 127: data about tech and data trends

CREATIVE SAM


(<a’, > , <o’, > , <p’, > )

• Implemented a parallel version of his innovation

(<a, > , <o, > , <p, > , …)

Each input to a map is a list of <key, value> pairs

Each output of slice is a list of <key, value> pairs

Grouped by key

Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism) e.g. <ao, ( …)>

Reduced into a list of values

The idea of Map Reduce in Data Intensive Computing

A list of <key, value> pairs mapped into another list of <key, value> pairs which gets grouped by the key and reduced into a list of values
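The map, group-by-key and reduce steps described above can be sketched in plain Python with word counting as the example. This is a single-process illustration: the helper names are hypothetical, the fruit documents are made up, and a framework such as Hadoop would run the map and reduce calls in parallel and perform the grouping through its shuffle.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every <key, value> pair and collect its output pairs."""
    pairs = []
    for key, value in records:
        pairs.extend(mapper(key, value))
    return pairs


def group_by_key(pairs):
    """Group the mapped pairs into <key, value-list> entries."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Reduce each <key, value-list> to a single value."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc_id, text):
    """Map: emit <word, 1> for every word in the document."""
    return [(word, 1) for word in text.split()]


def count_reducer(word, ones):
    """Reduce: sum the 1s emitted for a word."""
    return sum(ones)


records = [("doc1", "apple orange apple"), ("doc2", "pear apple")]
counts = reduce_phase(group_by_key(map_phase(records, word_mapper)), count_reducer)
```

The result is kept as a dictionary from key to reduced value so the output is easy to inspect; dropping the keys gives the plain list of values mentioned above.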

Page 128: data about tech and data trends


Opportunities at Universities

see New York Times articles

http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/

DATA SCIENCE EDUCATION


Page 129: data about tech and data trends

• Broad Range of Topics from Policy to curation to applications and algorithms, programming models, data systems, statistics, and broad range of CS subjects such as Clouds, Programming, HCI

• Plenty of Jobs and broader range of possibilities than computational science but similar cosmic issues

• What type of degree (Certificate, minor, track, “real” degree)

• What implementation (department, interdisciplinary group supporting education and research program)

DATA SCIENCE EDUCATION


Page 130: data about tech and data trends

• Have set up certificate and Masters degree in data science

• Joint between 3 units in School of Informatics and Computing: Computer Science, Informatics, Information & Library Science, and Statistics department in COAS College

• Certificate online

• Masters online or residential

• Looking at version with Kelley with Business data analytics flavor

• Attractive to offer online as few universities have this and so a potentially large audience outside IU

AT INDIANA UNIVERSITY


Page 131: data about tech and data trends


Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 132: data about tech and data trends


Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 133: data about tech and data trends

• MOOC’s are very “hot” these days with Udacity and Coursera as start-ups; perhaps over 100,000 participants

• Relevant to Data Science as this is a new field with few courses at most universities

• Typical model is collection of short prerecorded segments (talking head over PowerPoint) of length 3-15 minutes

• This is Boredom limit http://blog.coursera.org/post/49750392396/on-the-topic-of-boredom

• These “lesson objects” can be viewed as “songs”

• Google Course Builder (python open source) builds customizable MOOC’s as “playlists” of “songs”

• Tells you to capture all material as “lesson objects”

• We are aiming to build a repository of many “songs”; used in many ways – tutorials, classes …

MASSIVE OPEN ONLINE COURSES (MOOC)


Page 134: data about tech and data trends

http://x-informatics.appspot.com/course

One of 2 original IU SOIC MOOCs

Page 135: data about tech and data trends

Seven ~10-minute lesson objects in this lecture

Started using closed captions but gave up

Could be useful if in other languages

Page 136: data about tech and data trends

• We could teach one class to 100,000 students or 2,000 classes to 50 students each

• The 2,000 class choice has 2 useful features

• One can use the usual (electronic) mentoring/grading technology

• One can customize each of 2,000 classes for a particular audience given their level and interests

• One can even allow students to customize – that’s what one does in making play lists in iTunes

• Both models can be supported by a repository of lesson objects (10-15 minute video segments) in the cloud

• The teacher can choose from existing lesson objects and add their own to produce a new customized course with new lessons contributed back to repository

CUSTOMIZABLE MOOC’S I


Page 137: data about tech and data trends

http://iucloudsummerschool.appspot.com/preview

Unit ~1 hour with ~6 lessons; total 115 lesson objects

SCIENCE CLOUD MOOC REPOSITORY

Page 138: data about tech and data trends

• The 3-15 minute Video over PowerPoint of MOOC lesson objects is easy to re-use

• Qiu (IU) and Hayden (ECSU, Elizabeth City State University, a small HBCU, Historically Black College or University) will customize a module

• Starting with Qiu’s cloud computing course at IU

• Adding material on use of Cloud Computing in Remote Sensing (area covered by ECSU course)

• This is a model for adding cloud curricula material to wide set of universities where faculty are not able to teach it

• Defining how to support computing labs associated with MOOC’s with clouds or VM’s on clients

• Appliances scale as they are downloaded to each student’s client

CUSTOMIZABLE MOOC’S II


Page 139: data about tech and data trends


Can of course build many different interfaces

Songs stored on YouTube

Songs prepared with Adobe Presenter on Laptop

http://cloudmooc.soic.indiana.edu/

Page 140: data about tech and data trends

• High volume courses (CS/Ph/Chem/Bio101…) where scalability of MOOC’s makes them attractive to reach a lot of students

• Niche areas where there is some student interest but either no faculty expertise or not enough students to justify traditional courses

• Offer to many institutions simultaneously

TWO LIMITS WHERE MOOC’S ARE COMPELLING


Page 141: data about tech and data trends

• I proposed in 1999 (misjudging pace of online education): One can place faculty in their favorite location (e.g. remote cave) and universities provide structure to give “credentials” while faculty teach

• Maybe some populate repository; others actually deliver courses

• Note Google Course Builder or Microsoft Mix have no need for local resources except for faculty client (laptop) – entire course stored in the cloud

• So we can radically change the university system with a major cross-institution virtual education component as well as a research component

• Enabled by cloud plus high performance reliable networking

• Let’s set up a community to define and build the virtual university MOOC repository and other needed activities

HERMIT’S CAVE VIRTUAL UNIVERSITY


Page 142: data about tech and data trends

• We should aim at simplicity; attractive at the moment is a mix of multi-topic forum plus more interactive Hangouts or equivalent (5-12 people)

MENTORING / GRADING

https://www.youtube.com/watch?v=M3jcSCA9_hM


Page 143: data about tech and data trends

Meeker/Wu May 29 2013 Internet Trends D11 Conference

Page 144: data about tech and data trends

• Use clouds in faculty, graduate student and undergraduate research

• Clouds for Scientific data analysis

• Core cloud technologies (MapReduce, Virtualization)

• Teach clouds as it involves areas of Information Technology with lots of job opportunities

• Use clouds to support distributed learning and research environment

• A cloud backend for course materials and collaboration as in MOOC repository

• Green environmentally friendly computing infrastructure

MANY SYNERGIES CLOUDS AND UNIVERSITIES


Page 145: data about tech and data trends

CONCLUSIONS

Page 146: data about tech and data trends

• Clouds are here to stay and one should plan on exploiting them

• Data Intensive studies in business and research continue to grow in importance

• Data Analytics: Everything is an optimization problem in a funny space

• Growing employment opportunities in clouds and data related activities and so popular with students

• Enabling many of the most important companies from Facebook/Google to General Electric

• Need community discussion of data science education

• Agree on curricula; is such a degree attractive?

• MOOC’s interesting for

• Disseminating new curricula

• Managing course fragments that can be assembled into custom courses for particular interdisciplinary students

CONCLUSIONS


Page 147: data about tech and data trends

• Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics educated in data science

• X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness with more fields (physics) defined implicitly

• Spans Industry and Science (research)

BIG DATA ECOSYSTEM IN ONE SENTENCE
