Knowledge Discovery from Mining Big Data


Post on 08-Jan-2016


Page 1: Knowledge Discovery from Mining Big Data

Knowledge Discovery from Mining Big Data

Kirk Borne
@KirkDBorne

George Mason University
School of Physics, Astronomy, & Computational Sciences

http://classweb.gmu.edu/kborne/

Data Literacy for all! (the Borne Ultimatum)

(the Borne Identity)

Page 2: Knowledge Discovery from Mining Big Data

The Big Data Manifesto (the Borne Ultimatum)

• More data is not just more data … more is different!
• Discover the unknown unknowns.
• Address the massive Data-to-Knowledge (D2K) challenge.
• Data Literacy for all!

Page 3: Knowledge Discovery from Mining Big Data

Ever since we first began to explore our world…

Page 4: Knowledge Discovery from Mining Big Data

… humans have asked questions and … we have collected evidence (data) to help answer those questions.

Page 5: Knowledge Discovery from Mining Big Data

… humans have asked questions and … we have collected evidence (data) to help answer those questions.

The journey from traditional science to …

Page 6: Knowledge Discovery from Mining Big Data

… Data-intensive Science is a Big Challenge


Page 9: Knowledge Discovery from Mining Big Data

Promising News: Big Data leads to Big Insights and New Discoveries

http://news.nationalgeographic.com/news/2010/11/photogalleries/101103-nasa-space-shuttle-discovery-firsts-pictures/

Page 10: Knowledge Discovery from Mining Big Data

Good News: Big Data is Sexy


http://dilbert.com/strips/comic/2012-09-05/

http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1

Page 11: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Computing power doubles every 18 months (Moore’s Law) ...
  – 100x improvement in 10 years
• I/O bandwidth increases ~10% / year
  – <3x improvement in 10 years.
• The amount of data doubles every year ...
  – 1000x in 10 years, and 1,000,000x in 20 yrs.

How much data are there in the world?
From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data. In 2011 the same amount was created every two days. In 2013, the same amount is created every 10 minutes.

http://money.cnn.com/gallery/technology/2012/09/10/big-data.fortune/index.html
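The growth factors quoted on this slide follow from compound growth. The sketch below only checks that arithmetic; the function names are illustrative, and the rates are the ones stated on the slide, not independent measurements.

```python
# Compound growth over a time span, using the rates quoted on the slide.


def growth_from_doubling(doubling_time_years: float, span_years: float) -> float:
    """Growth factor when a quantity doubles every `doubling_time_years`."""
    return 2.0 ** (span_years / doubling_time_years)


def growth_from_annual_rate(rate: float, span_years: float) -> float:
    """Growth factor for a fixed fractional increase per year (0.10 = 10%)."""
    return (1.0 + rate) ** span_years


compute_10y = growth_from_doubling(1.5, 10)   # Moore's Law: doubling every 18 months
io_10y = growth_from_annual_rate(0.10, 10)    # I/O bandwidth: ~10% per year
data_10y = growth_from_doubling(1.0, 10)      # data volume: doubling every year
data_20y = growth_from_doubling(1.0, 20)

print(f"computing power, 10 years: {compute_10y:.0f}x")
print(f"I/O bandwidth,   10 years: {io_10y:.1f}x")
print(f"data volume,     10 years: {data_10y:.0f}x")
print(f"data volume,     20 years: {data_20y:.0f}x")
```

Under these rates the data volume outgrows computing power by roughly a factor of ten per decade, and outgrows I/O bandwidth by far more, which is the gap the slide is pointing at.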

Page 12: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Computing power doubles every 18 months (Moore’s Law) ...
  – 100x improvement in 10 years
• The amount of data doubles every year (or faster!) ...
  – 1000x in 10 years, and 1,000,000x in 20 yrs.
• I/O bandwidth increases ~10% / year
  – <3x improvement in 10 years.

Page 13: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Computing power doubles every 18 months (Moore’s Law) ...
  – 100x improvement in 10 years
• The amount of data doubles every year ...
  – 1000x in 10 years, and 1,000,000x in 20 yrs.
• I/O bandwidth increases ~10% / year
  – <3x improvement in 10 years.

Moore’s Law of Slacking will not help!
http://arxiv.org/abs/astro-ph/9912202

Page 14: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Big quantities of data are acquired everywhere.
• It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, national security, media, etc.

Page 15: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Big quantities of data are acquired everywhere.
• It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, national security, media, etc.

LSST project (www.lsst.org):
• 20 Terabytes of astronomical imaging every night
• 100-200 Petabyte image archive after 10 years
• 20-40 Petabyte database
• 2-10 million new sky events nightly that need to be characterized and classified – potential new discoveries!

Page 16: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Job opportunities are sky-rocketing
• Extremely high demand for Data Science skills
• Demand will continue to increase
• Old: “100 applicants per job”. New: “100 jobs per applicant”

Page 17: Knowledge Discovery from Mining Big Data

Characteristics of Big Data – 1234

• Job opportunities are sky-rocketing
• Extremely high demand for Data Science skills
• Demand will continue to increase
• Old: “100 applicants per job”. New: “100 jobs per applicant”

McKinsey Report (2011**):
• Big Data is the new “gold rush”, the “new oil”
• Shortage of 1.5 million skilled data scientists within 5 years

** http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

Page 18: Knowledge Discovery from Mining Big Data

Data Sciences: A National Imperative

1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data (1997), http://www.nap.edu/catalog.php?record_id=5504
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries (2003), downloaded from http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education (2003), downloaded from http://www.ncar.ucar.edu/cyber/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005), downloaded from http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda (2005), downloaded from http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure (2005), downloaded from http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe (2006), downloaded from http://www.arl.org/bm~doc/digdatarpt.pdf
8. NSF report: Cyberinfrastructure Vision for 21st Century Discovery (2007), downloaded from http://www.nsf.gov/od/oci/ci_v5.pdf
9. JISC/NSF Workshop report on Data-Driven Science & Repositories (2007), downloaded from http://www.sis.pitt.edu/~repwkshop/NSF-JISC-report.pdf
10. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale (2007), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
11. DOE report: Mathematics for Analysis of Petascale Data Workshop Report (2008), downloaded from http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
12. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society (2009), downloaded from http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
13. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (2009), downloaded from http://www.nap.edu/catalog.php?record_id=12615
14. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences (2010), downloaded from http://www.cra.org/ccc/docs/reports/DES-report_final.pdf
15. National Big Data Research and Development Initiative (2012), downloaded from http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf

Page 19: Knowledge Discovery from Mining Big Data

The Fourth Paradigm: Data-Intensive Scientific Discovery
http://research.microsoft.com/en-us/collaboration/fourthparadigm/

The 4 Scientific Paradigms:

1. Experiment (sensors)
2. Theory (modeling)
3. Simulation (HPC)
4. Data Exploration (KDD)

Page 20: Knowledge Discovery from Mining Big Data

• The emergence of Data Science and Data-Oriented Science (the 4th paradigm of science). “Computational literacy and data literacy are critical for all.” - Kirk Borne


Characteristics of Big Data – 1234

Page 21: Knowledge Discovery from Mining Big Data

• The emergence of Data Science and Data-Oriented Science (the 4th paradigm of science). “Computational literacy and data literacy are critical for all.” - Kirk Borne

• A complete data collection on any complex domain (e.g., Earth, or the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered. “Somewhere, something incredible is waiting to be known.” - Carl Sagan


Characteristics of Big Data – 1234

Page 22: Knowledge Discovery from Mining Big Data

• The emergence of Data Science and Data-Oriented Science (the 4th paradigm of science). “Computational literacy and data literacy are critical for all.” - Kirk Borne

• A complete data collection on any complex domain (e.g., Earth, or the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered. “Somewhere, something incredible is waiting to be known.” - Carl Sagan

• We call this “X-Informatics”: addressing the D2K (Data-to-Knowledge) Challenge in any discipline X using Data Science.

• Examples: Astroinformatics, Bioinformatics, Geoinformatics, Climate Informatics, Ecological Informatics, Biodiversity Informatics, Environmental Informatics, Health Informatics, Medical Informatics, Neuroinformatics, Crystal Informatics, Cheminformatics, Discovery Informatics, and more.

Characteristics of Big Data – 1234

Page 23: Knowledge Discovery from Mining Big Data

Characterizing the Big Data Hype

If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.


Page 24: Knowledge Discovery from Mining Big Data

Characterizing the Big Data Hype

If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.

Big Data characteristics: the 3+n V’s =
1. Volume (lots of data = “Tonnabytes”)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. V
5. V
6. V
7. V
8. V

Page 25: Knowledge Discovery from Mining Big Data

Characterizing the Big Data Hype

If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.

Big Data characteristics: the 3+n V’s =
1. Volume (lots of data = “Tonnabytes”)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. Veracity
5. Variability
6. Venue
7. Vocabulary
8. Value

Page 26: Knowledge Discovery from Mining Big Data

Characterizing the Big Data Hype

If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.

Big Data characteristics: the 3+n V’s =
1. Volume (lots of data = “Tonnabytes”)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. Veracity (verifying inference-based models from comprehensive data collections)
5. Variability
6. Venue
7. Vocabulary
8. Value

Page 27: Knowledge Discovery from Mining Big Data

Characterizing the Big Data Hype

If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.

Big Data characteristics: the 3+n V’s =
1. Volume (lots of data = “Tonnabytes”)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. Veracity (verifying inference-based models from comprehensive data collections) … as I said earlier:
   A complete data collection on any complex domain (e.g., Earth, or the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered.
5. Variability
6. Venue
7. Vocabulary
8. Value

Page 28: Knowledge Discovery from Mining Big Data

Characterizing the Big Data Hype

If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.

Big Data characteristics: the 3+n V’s =
1. Volume
2. Variety: this one helps us to discriminate subtle new classes (= Class Discovery)
3. Velocity
4. Veracity
5. Variability
6. Venue
7. Vocabulary
8. Value

Big Data Example:

Page 29: Knowledge Discovery from Mining Big Data

Insufficient Variety: stars & galaxies are not separated in this parameter


Page 30: Knowledge Discovery from Mining Big Data

Sufficient Variety: stars & galaxies are separated in this parameter


Page 31: Knowledge Discovery from Mining Big Data

4 Categories of Scientific KDD (Knowledge Discovery in Databases)

• Class Discovery
  – Finding new classes of objects and behaviors
  – Learning the rules that constrain the class boundaries
• Novelty Discovery
  – Finding new, rare, one-in-a-million(billion)(trillion) objects and events
• Correlation Discovery
  – Finding new patterns and dependencies, which reveal new natural laws or new scientific principles
• Association Discovery
  – Finding unusual (improbable) co-occurring associations

Page 32: Knowledge Discovery from Mining Big Data

This graphic says it all …

Graphic provided by Professor S. G. Djorgovski, Caltech

• Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization !

• Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify !

• Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise !
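The three operations in the bullets above can be sketched on a toy data set. This is a minimal illustration, not the method behind the graphic: it assumes k-means for the clustering step, nearest-centroid assignment for classification, and a distance cutoff (twice the largest distance seen in the training data) for outliers. The synthetic point clouds and the `tolerance` parameter are made up for the example.

```python
import math
import random


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


def kmeans(points, k, iterations=25, seed=0):
    """Clustering: plain k-means (Lloyd's algorithm), returns k centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


# Synthetic data: two well-separated clouds of 2-D points.
rng = random.Random(42)
cloud_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
cloud_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]
data = cloud_a + cloud_b

# Characterization: find the clusters without knowing what the items are.
centroids = sorted(kmeans(data, k=2))
# Largest distance from any training point to its own centroid.
radius = max(nearest(p, centroids)[1] for p in data)


def label(point, tolerance=2.0):
    """Classify a new item, or flag it as a surprise if it fits no cluster."""
    index, distance = nearest(point, centroids)
    return "outlier" if distance > tolerance * radius else f"class {index}"


print(label((9.5, 10.2)))
print(label((0.3, -0.4)))
print(label((50.0, -40.0)))
```

The cutoff is the weakest part of the sketch: what counts as "far from every known cluster" depends on the data and usually needs a statistical justification.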


Page 33: Knowledge Discovery from Mining Big Data

Scientists have been doing Data Mining for centuries

“The data are mine, and you can’t have them!”

• Seriously ...
• Scientists love to classify things ... (Supervised Learning, e.g., classification)
• Scientists love to characterize things ... (Unsupervised Learning, e.g., clustering)
• And we love to discover new things ... (Semi-supervised Learning, e.g., outlier detection)

Page 34: Knowledge Discovery from Mining Big Data

Data-Driven Discovery: Scientific KDD (Knowledge Discovery from Data)

1. Class Discovery

2. Novelty Discovery

3. Correlation Discovery

4. Association Discovery

• Benefits of very large datasets:
  – best statistical analysis of “typical” events
  – automated search for “rare” events

Graphic from S. G. Djorgovski

Page 35: Knowledge Discovery from Mining Big Data

Scientific Data-to-Knowledge Problem 1-a

• The Class Discovery Problem: (clustering)
  – Find distinct clusters of multivariate scientific parameters that separate objects within a data set.
  – Find new classes of objects or new behaviors.
  – What is the significance of the clusters (statistically and scientifically)?
  – What is the optimal algorithm for finding friends-of-friends or nearest neighbors in very high dimensions (complex data with Variety)?
    • N is > 10^10, so what is the most efficient way to sort?
    • Number of dimensions > 1000 – therefore, we have an enormous subspace search problem
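For reference, a brute-force friends-of-friends grouping can be sketched as below. This is only an illustration of the idea, assuming Euclidean distance and a fixed `linking_length` (both choices are assumptions of this sketch). It compares every pair of points, so its cost grows as N squared, which is exactly what becomes infeasible for N > 10^10 and what motivates the question on the slide.

```python
import math
from itertools import combinations


def friends_of_friends(points, linking_length):
    """Group points so that any two points within `linking_length` of each
    other, directly or through a chain of such pairs, share a group."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute force: every pair is compared once, O(N^2) distance evaluations.
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= linking_length:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0), (20.0, 20.0)]
groups = friends_of_friends(points, linking_length=0.6)
print(groups)
```

Points 0 and 2 are 1.0 apart, farther than the linking length, yet they land in the same group because point 1 links them. Spatial indexes such as k-d trees reduce the pair search in low dimensions, but they lose most of their advantage as the number of dimensions grows.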

Page 36: Knowledge Discovery from Mining Big Data

Scientific Data-to-Knowledge Problem 1-b

• The superposition / decomposition problem:
  – Finding the parameters or combinations of parameters (out of 100’s or 1000’s) that most cleanly and optimally (parsimoniously) distinguish different object classes
  – What if there are 10^10 objects that overlap in a 10^3-D parameter space?
  – What is the optimal way to separate and accurately classify the different unique classes of objects?

Page 37: Knowledge Discovery from Mining Big Data

Class Discovery: feature separation and discrimination of classes

The separation of classes improves when the “correct” features are chosen for investigation, as in the following star-galaxy discrimination test: the “Star-Galaxy Separation” Problem

Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

[Figure: two panels, labeled “Not good” and “Good”]
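One simple way to quantify whether a feature separates two classes is a Fisher-style score: the squared difference of the class means divided by the sum of the class variances. The sketch below applies it to a synthetic catalog. The feature names ("brightness", "concentration") and their distributions are invented for illustration and are not taken from the referenced paper.

```python
import random
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Squared difference of class means over the sum of class variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    return (mean(values_a) - mean(values_b)) ** 2 / spread


rng = random.Random(1)
# Synthetic, illustrative catalog: (brightness, concentration) per object.
# Both classes share the same brightness distribution; only concentration differs.
stars = [(rng.gauss(20, 2), rng.gauss(0.9, 0.05)) for _ in range(500)]
galaxies = [(rng.gauss(20, 2), rng.gauss(0.5, 0.10)) for _ in range(500)]

scores = {}
for column, name in enumerate(["brightness", "concentration"]):
    scores[name] = fisher_score(
        [s[column] for s in stars], [g[column] for g in galaxies]
    )

for name, score in scores.items():
    print(f"{name}: separation score {score:.2f}")
```

A score near zero corresponds to the "not good" case, where the two classes overlap in that parameter; a large score corresponds to the "good" case.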

Page 38: Knowledge Discovery from Mining Big Data

Scientific Data-to-Knowledge Problem 1234

• The Novelty Discovery Problem:
  – Anomaly Detection, Deviation Detection, Surprise Discovery, Novelty Discovery: Finding objects and events that are outside the bounds of our expectations (outside known clusters)
  – Finding new, rare, one-in-a-million(billion)(trillion) objects and events
  – Finding the Unknown Unknowns
  – These may be real scientific discoveries or garbage
  – Outlier detection is therefore useful for:
    • Anomaly Detection – is the detector system working?
    • Data Quality Assurance – is the data pipeline working?
    • Novelty Discovery – is my Nobel prize waiting?
  – How does one optimally find outliers in 10^3-D parameter space? or in interesting subspaces (in lower dimensions)?
  – How do we measure their “interestingness”?
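One common, simple proxy for "interestingness" is the distance from a point to its k-th nearest neighbor: isolated points score high. The sketch below is one possible approach under that assumption, not the method used in the talk. The 10-dimensional synthetic catalog and the planted anomaly are made up for the example, and the brute-force search would not scale to the catalog sizes discussed here.

```python
import math
import random


def knn_outlier_scores(points, k=5):
    """Score each point by its distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


rng = random.Random(7)
# Synthetic catalog: 300 ordinary objects in a 10-dimensional parameter space.
catalog = [tuple(rng.gauss(0, 1) for _ in range(10)) for _ in range(300)]
# One planted anomaly, far from everything else (index 300).
catalog.append(tuple(8.0 for _ in range(10)))

scores = knn_outlier_scores(catalog, k=5)
# Rank objects from most to least isolated.
ranked = sorted(range(len(catalog)), key=scores.__getitem__, reverse=True)
print("most interesting candidate:", ranked[0])
```

A high score says only that the object is isolated. Whether it is a discovery, a detector fault, or a pipeline error still has to be decided by follow-up.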

Page 39: Knowledge Discovery from Mining Big Data

Novelty Discovery: Improved Discovery of Rare Objects or Events across Multiple Data Sources

Page 40: Knowledge Discovery from Mining Big Data

Scientific Data-to-Knowledge Problem 1234

• The Correlation Discovery Problem = Dimension Reduction Problem:
  – Finding new correlations and “fundamental planes” of parameters.
  – Such correlations, patterns, and dependencies may reveal new physics or new scientific relations.
  – Number of attributes can be hundreds or thousands = The Curse of High Dimensionality!
  – Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
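The eigenvector question can be made concrete with Principal Component Analysis (PCA). In the sketch below, three synthetic attributes lie close to a plane, so the first two principal components capture almost all of the variance. The data, the plane coefficients, and the use of power iteration with deflation (chosen only to keep the example dependency-free) are assumptions of this illustration.

```python
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector, by power iteration."""
    d = len(matrix)
    vector = [1.0 / d ** 0.5] * d
    for _ in range(iterations):
        product = [sum(matrix[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    product = [sum(matrix[a][b] * vector[b] for b in range(d)) for a in range(d)]
    value = sum(vector[a] * product[a] for a in range(d))  # Rayleigh quotient
    return value, vector


def principal_variances(rows):
    """Variance along each principal component, largest first."""
    matrix = covariance_matrix(rows)
    d = len(matrix)
    variances = []
    for _ in range(d):
        value, vector = top_eigenpair(matrix)
        variances.append(value)
        # Deflation: remove the component just found before the next pass.
        matrix = [
            [matrix[a][b] - value * vector[a] * vector[b] for b in range(d)]
            for a in range(d)
        ]
    return variances


rng = random.Random(3)
rows = []
for _ in range(2000):
    x, y = rng.gauss(0, 3), rng.gauss(0, 1)
    # The third attribute is nearly a linear combination of the first two.
    rows.append((x, y, 0.5 * x + 2.0 * y + rng.gauss(0, 0.1)))

variances = principal_variances(rows)
captured = (variances[0] + variances[1]) / sum(variances)
print(f"variance captured by PC1+PC2: {100 * captured:.2f}%")
```

When a small number of components captures nearly all the variance, the attributes are strongly correlated and the data effectively occupy a lower-dimensional surface, which is what a "fundamental plane" means.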

Page 41: Knowledge Discovery from Mining Big Data

Fundamental Plane for 156,000 Elliptical Galaxies: plot shows variance captured by first 2 Principal Components as a function of local galaxy density.

[Figure axes: x = Local Galaxy Density (low to high); y = % of variance captured by PC1+PC2]

Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008


Page 42: Knowledge Discovery from Mining Big Data

Scientific Data-to-Knowledge Problem 1234

• The Association Discovery Problem: Link Analysis – Network Analysis – Graph Mining
  – Identify connections between different events (or objects)
  – Find unusual (improbable) co-occurring combinations of data attribute values
  – Find data items that have much fewer than “6 degrees of separation”

Identifying such connectivity in our scientific databases and knowledge repositories can lead to new insights, new knowledge, new discoveries.
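For the co-occurrence part of this problem, one standard measure is lift: how much more often two attribute values appear together than they would if they were independent. The sketch below uses a tiny set of records with invented labels. It covers only pairwise co-occurrence, not the link-analysis or graph-mining side of the problem.

```python
from collections import Counter
from itertools import combinations


def pair_lift(records):
    """Lift of each co-occurring pair: P(a and b) / (P(a) * P(b))."""
    n = len(records)
    single = Counter(item for record in records for item in set(record))
    pairs = Counter(
        pair for record in records for pair in combinations(sorted(set(record)), 2)
    )
    return {
        pair: (count / n) / ((single[pair[0]] / n) * (single[pair[1]] / n))
        for pair, count in pairs.items()
    }


# Hypothetical records: each one lists the attribute values observed for one object.
records = (
    [("red", "steady")] * 6
    + [("blue", "steady")] * 2
    + [("blue", "flaring")] * 2
)

lifts = pair_lift(records)
strongest = max(lifts, key=lifts.get)
print("strongest association:", strongest, round(lifts[strongest], 2))
```

A lift well above 1 marks a combination that occurs together more often than chance would predict. In a real catalog the counts would also need a significance test, since rare values can produce a high lift by accident.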

Page 43: Knowledge Discovery from Mining Big Data

http://siliconangle.com/blog/2012/07/13/big-data-nightmares/

There are many technologies associated with Big Data


Page 44: Knowledge Discovery from Mining Big Data

http://www.bigdatabytes.com/wp-content/uploads/2012/01/big-data.jpg

One approach to Big Data: Computational Science (Hadoop, Map/Reduce)

Page 45: Knowledge Discovery from Mining Big Data

Another approach to Big Data: Data Science (Informatics)


Page 46: Knowledge Discovery from Mining Big Data

A third approach to Big Data: Citizen Science (crowdsourcing)


Page 47: Knowledge Discovery from Mining Big Data

Galaxy Zoo: example of Citizen Science (crowdsourcing)

http://astrophysics.gsfc.nasa.gov/outreach/podcast/wordpress/index.php/2010/10/08/saras-blog-be-a-scientist/

http://www.zooniverse.org


Page 48: Knowledge Discovery from Mining Big Data

Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and Education”, Journal of Earth Science Informatics, vol. 3, pp. 5-17.

See also http://arxiv.org/abs/0909.3892

Astroinformatics research paper available! Addresses the data science challenges, research agenda, application areas, use cases, and recommendations for the new science of Astroinformatics.

Page 49: Knowledge Discovery from Mining Big Data

LSST = Large Synoptic Survey Telescope
http://www.lsst.org/

8.4-meter diameter primary mirror = 10 square degrees!

Hello!

– 100-200 Petabyte image archive
– 20-40 Petabyte database catalog

Page 50: Knowledge Discovery from Mining Big Data

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (~2021-2031), with time domain sampling in log(time) intervals (to capture dynamic range of transients).

• LSST (Large Synoptic Survey Telescope):
  – Ten-year time series imaging of the night sky – mapping the Universe!
  – ~2,000,000 events each night – anything that goes bump in the night!
  – Cosmic Cinematography! The New Sky! @ http://www.lsst.org/

Page 51: Knowledge Discovery from Mining Big Data

The LSST Informatics and Statistics Science Collaboration (ISSC) Research Team

• The ISSC team focuses on several research areas:
  – Statistics
  – Data & Information Visualization
  – Data mining (machine learning)
  – Data-intensive computing & analysis
  – Large-scale scientific data management
• These areas represent Statistics and the science of Informatics (Astroinformatics) = Data-intensive Science = the 4th Paradigm of Scientific Research
  – Addressing the LSST “Data to Knowledge” challenge
  – Helping to discover the unknown unknowns

Page 52: Knowledge Discovery from Mining Big Data

The LSST ISSC Research Team

• Chairperson: K. Borne, GMU
• Core team: 3 astronomers + 2 =
  – K. Borne (astroinformatics)
  – Eric Feigelson, Tom Loredo (astrostatistics)
  – Jogesh Babu (statistics)
  – Alex Gray (computer science, data mining)
• Full team: 41 scientists
  – ~60% astronomers
  – ~30% statisticians
  – ~10% data mining, machine learning computer scientists

http://www.lsstcorp.org/ScienceCollaborators/ScienceMembers.php

http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf

Page 53: Knowledge Discovery from Mining Big Data

Some key astronomy problems that require informatics and statistical techniques … Astroinformatics & Astrostatistics!

• Probabilistic Cross-Matching of objects from different catalogues
• The distance problem (e.g., Photometric Redshift estimators)
• Star-Galaxy separation; QSO-Star separation
• Cosmic-Ray Detection in images
• Supernova Detection and Classification
• Morphological Classification (galaxies, AGN, gravitational lenses, ...)
• Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
• Dimension Reduction = Correlation Discovery
• Learning Rules for improved classifiers
• Classification of massive data streams
• Real-time Classification of Astronomical Events
• Clustering of massive data collections
• Novelty, Anomaly, Outlier Detection in massive databases

http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf

Page 54: Knowledge Discovery from Mining Big Data

ISSC “current topics”

• Advancing the field = Community-building:
  – Astroinformatics + Astrostatistics (several workshops this year!!)
  – Education, education, education! (Citizen Science, undergrad+grad ed…)
• LSST Event Characterization vs. Classification
• Sparse time series and the LSST observing cadence
• Challenge Problems, such as the Photo-z challenge and the Supernova Photometric Classification challenge
• Testing algorithms on the LSST simulations: images/catalogs PLUS observing cadence – can we recover known classes of variability?
• Generating and/or accumulating training samples of numerous classes (especially variables and transients)
• Proposing a mini-survey during the science verification year (Science Commissioning):
  – e.g., high-density and evenly-spaced observations of extragalactic and Galactic test fields are obtained, to generate training sets for variability classification and assessment thereof
• Science Data Quality Assessment (SDQA): R&D efforts to support LSST Data Management team

http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf

Page 55: Knowledge Discovery from Mining Big Data

Agenda of ISSC Workshop at LSST Project Meeting, August 2012

Brief talks by team members:
– Kirk Borne: Outlier Detection for Surprise Discovery in Big Data
– Jogesh Babu: Statistical Resources
– Nathan De Lee: The VIDA Astroinformatics Portal
– Matthew Graham: Characterizing and Classifying CRTS
– Joseph Richards: Time-Domain Discovery and Classification
– Sam Schmidt: Upcoming Challenges for Photometric Redshifts
– Lior Shamir: Automatic Analysis of Galaxy Morphology
– John Wallin: Citizen Science and Machine Learning
– Jake Vanderplas: AstroML – Machine Learning for Astronomy

http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf

Page 56: Knowledge Discovery from Mining Big Data

Why do all of this? … for 4 very simple reasons:

• (1) Any real data collection may consist of millions, or billions, or trillions of sampled data points.
• (2) Any real data set will probably have many hundreds (or thousands) of measured attributes (features, dimensions).
• (3) Humans can make mistakes when staring for hours at long lists of numbers, especially in a dynamic data stream.
• (4) The use of a data-driven model provides an objective, scientific, rational, and justifiable test of a hypothesis.

Page 57: Knowledge Discovery from Mining Big Data

Why do all of this? … for 4 very simple reasons:

• (1) Any real data collection may consist of millions, or billions, or trillions of sampled data points.
  – Volume
• (2) Any real data set will probably have many hundreds (or thousands) of measured attributes (features, dimensions).
  – Variety
• (3) Humans can make mistakes when staring for hours at long lists of numbers, especially in a dynamic data stream.
  – Velocity
• (4) The use of a data-driven model provides an objective, scientific, rational, and justifiable test of a hypothesis.
  – Veracity

Page 58: Knowledge Discovery from Mining Big Data

Why do all of this? … for 4 very simple reasons:

• (1) Any real data collection may consist of millions, or billions, or trillions of sampled data points.
  – Volume: It is too much!
• (2) Any real data set will probably have many hundreds (or thousands) of measured attributes (features, dimensions).
  – Variety: It is too complex!
• (3) Humans can make mistakes when staring for hours at long lists of numbers, especially in a dynamic data stream.
  – Velocity: It keeps on coming!
• (4) The use of a data-driven model provides an objective, scientific, rational, and justifiable test of a hypothesis.
  – Veracity: Can you prove your results?

Page 59: Knowledge Discovery from Mining Big Data

Knowledge Discovery from Mining Big Data – 12

• By collecting a thorough set of parameters (high-dimensional data) for a complete set of items within our domain of study, we should be able to build a “perfect” statistical model for that domain.

• In other words, Big Data becomes the model for a domain X = we call this X-informatics.

• Anything we want to know about that domain is specified and encoded within the data.

• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.

• Example: Big-Data Vision? Whole-population analytics

Page 60: Knowledge Discovery from Mining Big Data

• By collecting a thorough set of parameters (high-dimensional data) for a complete set of items within our domain of study, we should be able to build a “perfect” statistical model for that domain.

• In other words, Big Data becomes the model for a domain X = we call this X-informatics.

• Anything we want to know about that domain is specified and encoded within the data.

• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.

• Recall what we said before …

Knowledge Discovery from Mining Big Data – 12

Page 61: Knowledge Discovery from Mining Big Data

Knowledge Discovery from Mining Big Data – 12

• … one of the two major benefits of BIG DATA is to provide the best statistical analysis ever(!) for the domain of study.


Benefits of very large datasets:

1. best statistical analysis of “typical” events

2. automated search for “rare” events
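Both benefits can be put in numbers with elementary statistics. The sketch below assumes independent samples and an illustrative one-in-a-million event rate. It shows how the uncertainty on a "typical" value (the standard error of a mean) shrinks with sample size, and how the chance of catching a "rare" event grows.

```python
import math


def standard_error(sigma, sample_size):
    """Uncertainty on a sample mean: sigma / sqrt(N)."""
    return sigma / math.sqrt(sample_size)


def expected_rare_events(sample_size, rate):
    """Expected number of events that occur with probability `rate` per item."""
    return sample_size * rate


def chance_of_at_least_one(sample_size, rate):
    """P(at least one event) = 1 - (1 - rate)^N, computed in a stable way."""
    return -math.expm1(sample_size * math.log1p(-rate))


rate = 1e-6  # illustrative one-in-a-million event
for n in (10**4, 10**6, 10**9):
    print(
        f"N = {n:>13,}: standard error = {standard_error(1.0, n):.5f} sigma, "
        f"expected rare events = {expected_rare_events(n, rate):g}, "
        f"P(at least one) = {chance_of_at_least_one(n, rate):.3f}"
    )
```

With ten thousand items a one-in-a-million event is almost certainly absent; with a billion items about a thousand of them are expected, enough to study as a population rather than as a curiosity.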

Page 62: Knowledge Discovery from Mining Big Data