MATH36032 Problem Solving by Computer: Introduction to Data Science · 2019-03-19

MATH36032 Problem Solving by Computer

Introduction to Data Science

[Figure: number of jobs on Jobsite [1], Jan 2016 to Jan 2018. Vertical axis: no. of jobs (0 to 12000). Series: MATLAB, Data Related.]

[1] http://www.jobsite.co.uk/

What is Data Science?

(from Wiki) an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, ...

What is Data Science?

- Math & Stat knowledge: calculus, statistics/probability, linear algebra

- Hacking skills: number/text manipulation and (vectorized) manipulation, algorithmic thinking, ...

- Substantive expertise/domain knowledge: knowledge related to specific facts

Data Science/Big Data: why now?

We are generating more data than ever before

Technology for data collection and storage is improving

Applications: Social media and search engine

Personalised webpage

How does Amazon know which items to recommend?

Applications: Retailers

Personalised promotion offers

Who should get what kind of offer?

Personal experience: Credit Card Fraud Detection?

Do these transactions look normal?

This is my credit card statement. The transactions were made within two hours after I lost the card in Montreal in the summer of 2015.

Personal experience: Car insurance

More applications

- Insurance/Actuarial Science: how much to charge your customer

- Weather/climate forecasting: long term prediction

- Finance: better prediction of the stock prices

- ...

Big data leaked and generated

Are the 2.6 terabytes of the Panama Papers big data?

How about the 1 billion accounts leaked from Yahoo's database?

Over the past few days, information on about 50 million Facebook users was reported to have been misused by Cambridge Analytica.

Data generated daily:
- More than 40 terabytes for the Met Office
- More than 1000 terabytes of data processed by Facebook
- More than 50 petabytes (5.0 × 10^16 bytes) handled by Google
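
To compare these figures it helps to put them on one scale. The sketch below converts the quoted sizes to bytes, assuming decimal (SI) prefixes, which is consistent with the slide's own conversion of 50 petabytes to 5.0 × 10^16 bytes. The variable names are illustrative only.

```matlab
% Assumption: decimal (SI) prefixes, 1 TB = 1e12 bytes and 1 PB = 1e15 bytes.
TB = 1e12;
PB = 1e15;

panamaPapers = 2.6  * TB;   % Panama Papers leak (total size)
metOffice    = 40   * TB;   % Met Office, per day (lower bound from the slide)
facebook     = 1000 * TB;   % Facebook, per day (lower bound from the slide)
google       = 50   * PB;   % Google, per day (lower bound from the slide)

% Express everything in bytes or petabytes so the sizes are comparable.
fprintf('Google per day:   %.1e bytes\n', google);
fprintf('Facebook per day: %g PB\n', facebook / PB);
fprintf('Met Office per day: %g PB\n', metOffice / PB);
fprintf('Google per day / Panama Papers: about %.0f times\n', ...
        google / panamaPapers);
```

On this scale the Panama Papers leak is small next to what the largest companies handle in a single day, which is one way to approach the question the slide asks.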

Three "V"s of Big Data

- Volume: large quantity of data, big size of datasets

- Variety: many different types and forms of data, e.g. transactional data from ATMs, social media sites, emails, demographic data, tracking data from cell phones, etc.

- Velocity: data that is coming in at a very fast pace

These require many new software tools and technologies.

MATLAB's support for big data sets now extends across its major technologies (mapreduce, datastore and other toolboxes)
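
As a rough illustration of the datastore idea, the sketch below reads a file in chunks and accumulates a running mean, so the whole file never has to be in memory at once. The file here is a small, hypothetical CSV written to a temporary location purely for demonstration; the column names and chunk size are assumptions, not part of the lecture.

```matlab
% Create a small toy CSV file (hypothetical data, for demonstration only).
rng(0);                                   % make the random values repeatable
T = table((1:1000)', rand(1000, 1), 'VariableNames', {'id', 'value'});
fname = [tempname '.csv'];
writetable(T, fname);

% A datastore points at the file without loading it into memory.
ds = datastore(fname);
ds.ReadSize = 250;                        % number of rows per chunk (assumed)

% Process the file chunk by chunk, keeping only running totals.
total = 0;
n = 0;
while hasdata(ds)
    chunk = read(ds);                     % next block of rows, as a table
    total = total + sum(chunk.value);
    n = n + height(chunk);
end
meanValue = total / n;

fprintf('Read %d rows in chunks; mean value = %.4f\n', n, meanValue);
delete(fname);                            % remove the temporary file
```

For a file this small the chunking is unnecessary; the same loop is what makes it possible to handle a file too large to load in one go.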

Big Data Landscape

The first data science application?

Kepler’s three laws of planetary motion
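
One way to see Kepler's work as data science is to recover his third law from a table of numbers. The sketch below fits a straight line to log(period) against log(semi-major axis). The planetary values are approximate textbook figures supplied here as an assumption; they are not data from the lecture.

```matlab
% Approximate textbook values (an assumption; not data from the lecture):
% semi-major axis a in astronomical units and orbital period T in years
% for Mercury, Venus, Earth, Mars, Jupiter and Saturn.
a = [0.387 0.723 1.000 1.524  5.203  9.537];
T = [0.241 0.615 1.000 1.881 11.862 29.457];

% Fit log(T) = k*log(a) + c by least squares. Kepler's third law
% (T^2 proportional to a^3) corresponds to an exponent k = 3/2.
p = polyfit(log(a), log(T), 1);
k = p(1);

fprintf('Fitted exponent k = %.3f (third law predicts 1.5)\n', k);
```

The fitted exponent comes out very close to 3/2, which is the relationship between period and orbit size stated by the third law.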

Make a TV show using data?

Both Amazon and Netflix made TV shows based on customer ratings!

TV made by Amazon and Netflix using big data

Alpha House was not as successful as expected. Netflix also ran open competitions for the best algorithms to predict user ratings of films (now discontinued for privacy and other reasons).

Google Flu Trends

Google made a big splash in the news in 2008

and five years later

The dark side of data science

Data, if used in the right way, can greatly improve our lives (as in the concept of smart cities), but ...

- Privacy: how are the data collected from you being used?

- Biased data: polls before Brexit or the US presidential election in 2016

- Biased interpretation:

Data show that people who shop at Waitrose have a longer life span than those who shop at Aldi and Asda.

Data Science Workflow (in business)

Data Science using MATLAB

MATLAB is not the best tool for data science. Certain tasks, such as text processing, are better done with Python or R.

More tools and data types have been introduced in MATLAB over the past few years, mainly to meet the growing needs of data science.

MATLAB Data Types [2]

We have already seen these (type whos in the Command Window):

- Numerical types: double (double-precision floating-point number), uint8 (images), int32, int64, ...

- Symbolic: defined by syms

- Logical: true, false

- Characters and strings: A = 'string'

- Function handles: @ as in integral(@(x) sin(x), ...)

Other data types, several of them added in recent releases: structures (struct), cell arrays (cell), time series (timeseries, from 2006b), tables (table, from 2013b), categorical arrays (categorical, from 2013b), dates and times (datetime, from 2014b), ...

[2] http://uk.mathworks.com/help/matlab/data-types_data-types.html
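
A minimal sketch of the types listed above, creating one variable of each kind and then calling whos to inspect them. All values are made-up toy data. The symbolic type is left out because it needs the Symbolic Math Toolbox, and the newer types assume a reasonably recent MATLAB release.

```matlab
% Toy values for illustration only; none of them come from the lecture.
a    = 3.14;                       % double (the default numeric type)
img  = uint8([0 128 255]);         % uint8, as used for image pixels
flag = true;                       % logical
name = 'string';                   % character array
f    = @(x) sin(x);                % function handle
area = integral(f, 0, pi);         % integrates sin over [0, pi]

s  = struct('module', 'MATH36032', 'week', 9);   % structure
c  = {1, 'two', [3 4 5]};                        % cell array (mixed contents)
ts = timeseries([1; 2; 3], [0; 1; 2]);           % time series (data, time)
d  = datetime(2019, 3, 19);                      % date and time
grade = categorical({'A'; 'B'; 'A'});            % categorical array
T  = table([1; 2; 3], grade, ...
           'VariableNames', {'id', 'grade'});    % table with named columns

whos                               % list the variables with class and size
```

The Class column in the output of whos shows which type each variable has, which is a quick way to check what a function has returned.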

The plan for the rest of the semester

- In Week 9 (Wednesday)
    - Review (and introduce) a few new data structures (mainly character strings)
    - Read (and write) different data formats (csv, excel, image, ...)

- Specific topics (some are related to the last Project):
    - Regression and classification
    - Dimension reduction/Low rank approximation
    - Random simulation
    - Ranking