math36032 problem solving by computer introduction to data science · 2019-03-19 · data science...
TRANSCRIPT
MATH36032 Problem Solving by Computer
Introduction to Data Science
[Figure: number of jobs listed on Jobsite (footnote 1), Jan 2016 to Jan 2018. Series: MATLAB, Data Related. Vertical axis: no. of jobs, 0 to 12000.]
1 http://www.jobsite.co.uk/
What is Data Science?
(from Wiki) an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, ...
What is Data Science?
- Math & Stat Knowledge: calculus, statistics/probability, linear algebra
- Hacking skills: number/text manipulation and (vectorized) manipulation, algorithmic thinking, ...
- Substantive expertise/domain knowledge: knowledge related to specific facts
Data Science/Big Data: why now?
We are generating more data than ever before
Technologies for data collection and storage are improving
Applications: Social media and search engine
Personalised webpage
How does Amazon know which items to recommend?
Applications: Retailers
Personalised promotion offers
Who should get what kind of offer?
Personal experience: Credit Card Fraud Detection
Do these transactions look normal?
This is my credit card statement. The transactions were made within two hours after I lost the card in Montreal in the summer of 2015.
Personal experience: Car insurance
More applications
- Insurance/Actuarial Science: how much do you charge your customer?
- Weather/climate forecasting: long term prediction
- Finance: better prediction of the stock prices
- ...
Big data leaked and generated
Is the 2.6 terabytes Panama Papers big data?
How about the 1 billion accounts leaked from Yahoo’s database?
Over the past few days, information on 500 million Facebook users was reported to have been misused by Cambridge Analytica.
Data generated daily
- More than 40 terabytes for the Met Office
- More than 1000 terabytes of data processed by Facebook
- More than 50 petabytes (5.0 × 10^16 bytes) handled by Google
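The sizes quoted above can be compared with a few lines of MATLAB. This is only a sketch: it assumes decimal (SI) units, so that 1 terabyte is 10^12 bytes and 1 petabyte is 10^15 bytes, and the variable names are made up for illustration.

```matlab
% Decimal (SI) storage units, in bytes (an assumption; binary units differ slightly).
TB = 1e12;   % terabyte
PB = 1e15;   % petabyte

% Figures quoted on the slide
panamaPapers   = 2.6  * TB;
metOfficeDaily = 40   * TB;
facebookDaily  = 1000 * TB;
googleDaily    = 50   * PB;

% Express everything in bytes and relative to the Panama Papers
fprintf('Google, bytes per day:          %.1e\n', googleDaily);
fprintf('Facebook, petabytes per day:    %g\n', facebookDaily / PB);
fprintf('Google day / Panama Papers:     %.0f\n', googleDaily / panamaPapers);
fprintf('Met Office day / Panama Papers: %.1f\n', metOfficeDaily / panamaPapers);
```

On these numbers the whole Panama Papers leak is small next to what the larger organisations handle in a single day.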
Three "V"s of Big Data
- Volume: large quantity of data, big size of datasets
- Variety: many different types and forms of data, e.g. transactional data from ATMs, social media sites, emails, demographics data, tracking data from cell phones, etc.
- Velocity: data that is coming in at a very fast pace
Big data requires much new software and technology:
Support for big data sets has been extended to all major technologies in MATLAB (mapreduce, datastore and other toolboxes)
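As a rough illustration of the datastore idea mentioned above, the sketch below processes a file block by block instead of loading it whole. The file name, the column names and the block size are all made up for this example, and the file is tiny so that the code runs anywhere; a real use case would point the datastore at files too large for memory.

```matlab
% Write a small CSV file to stand in for a dataset too large for memory.
% File name and column names are hypothetical.
fname = fullfile(tempdir, 'bigdata_sketch.csv');
n = 1000;
T = table((1:n)', sqrt((1:n)'), 'VariableNames', {'id', 'value'});
writetable(T, fname);

% A datastore reads the file in blocks rather than all at once.
ds = datastore(fname);
ds.ReadSize = 250;            % upper limit on rows returned per call to read

total = 0;
nBlocks = 0;
while hasdata(ds)
    block = read(ds);         % a table with at most ReadSize rows
    total = total + sum(block.value);
    nBlocks = nBlocks + 1;
end

fprintf('Sum of "value" accumulated over %d blocks: %.4f\n', nBlocks, total);
delete(fname);                % tidy up the temporary file
```

Only one block is held in memory at a time, which is what makes the same loop usable on much larger files.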
Big Data Landscape
The first data science application?
⇓
Kepler’s three laws of planetary motion
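To see why Kepler's laws can be read as an early data science result, one can try to recover the third law (the square of the period is proportional to the cube of the semi-major axis) from a small table of observations. The sketch below is not from the slides: it uses commonly quoted, rounded values for six planets and fits a straight line on a log-log scale, so the fitted slope should come out close to 3/2.

```matlab
% Commonly quoted, rounded values for Mercury to Saturn (not from the slides).
a = [0.387 0.723 1.000 1.524 5.203 9.537];     % semi-major axis, in AU
T = [0.241 0.615 1.000 1.881 11.862 29.457];   % orbital period, in years

% If T = C * a^k, then log(T) = k*log(a) + log(C): a straight line in log-log scale.
p = polyfit(log(a), log(T), 1);
k = p(1);        % fitted exponent
C = exp(p(2));   % fitted constant

fprintf('Fitted exponent k = %.3f (Kepler''s third law predicts 1.5)\n', k);
fprintf('Fitted constant C = %.3f\n', C);
```

With units of AU and years the constant is close to 1, since the Earth has a = 1 and T = 1.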
Make a TV show using data?
Both Amazon and Netflix made TV shows based on customer ratings!
TV made by Amazon and Netflix using big data
Alpha House was not as successful as expected. Netflix also had open competitions for best algorithms to predict user ratings for films (discontinued now for privacy and other reasons).
Google flu trends
Google made a big splash in the news in 2008
and five years later
The dark side of data science
Data, if used in the right way, can greatly facilitate our lives (as in the concept of smart cities), but ...
- Privacy: how are the data collected from you being used?
- Biased data: polls before Brexit or the US presidential election in 2016
- Biased interpretation:
Data show that people who shop at Waitrose have a longer life span than those who shop at Aldi and Asda.
Data Science Workflow (in business)
Data Science using MATLAB
MATLAB is not the best tool for data science. Certain tasks, like text processing, are better done with Python or R.
More tools and data types have been introduced in MATLAB over the past few years, mainly to cope with the increased demand from data science.
MATLAB Data Types (footnote 2)
We have already seen these (type whos in the command window):
- Numerical types: double (double-precision floating-point number), uint8 (images), int32, int64, ...
- Symbolic: defined by syms
- Logical: true, false
- Characters and strings: A = 'string'
- Function handles: @ as in integral(@(x) sin(x), ...)
Other, mostly more recent, data types: structures (struct), cell arrays (cell), time series (timeseries, from R2006b), tables (table, from R2013b), categorical arrays (categorical, from R2013b), dates and times (datetime, from R2014b), ...
2 http://uk.mathworks.com/help/matlab/data-types_data-types.html
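The data types listed above can be tried out in a short script. This is a minimal sketch with made-up variable names and values; the symbolic type is left out because syms needs the Symbolic Math Toolbox, which may not be installed.

```matlab
% Types seen earlier in the course
x    = 3.14;                  % double, the default numeric type
n    = int32(42);             % 32-bit signed integer
px   = uint8(200);            % 8-bit unsigned integer, as used for image pixels
ok   = true;                  % logical
s    = 'string';              % character array
f    = @(t) sin(t);           % function handle
area = integral(f, 0, pi);    % numerical integration; the exact value is 2

% Containers for mixed data (names and values are made up)
student = struct('name', 'Ada', 'mark', 72);   % structure with named fields
c = {1, 'two', [3 4 5]};                        % cell array: any type per cell

% Types aimed at data analysis
grade = categorical({'A'; 'B'; 'A'});           % labels from a finite set
when  = datetime(2019, 3, 19) + days(0:2)';     % three consecutive dates
T = table(when, grade, [72; 65; 80], ...
          'VariableNames', {'Date', 'Grade', 'Mark'});

whos                          % lists each variable with its size and class
disp(T)
```

A table keeps columns of different types side by side, which is why it tends to be the natural container for data read from spreadsheets or CSV files.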
The plan for the rest of the semester
- In Week 9 (Wednesday)
  - Review (and introduce) a few new data structures (mainly character strings)
  - Read (and write) different data formats (csv, excel, image, ...)
- Specific topics (some are related to the last Project):
  - Regression and classification
  - Dimension reduction/Low rank approximation
  - Random simulation
  - Ranking
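As a preview of reading and writing different data formats, here is a hedged sketch of a round trip for a CSV file and for an image. The file names and the contents are invented for the example, and the files are written to the temporary folder and deleted afterwards.

```matlab
% Hypothetical file names in the temporary folder
csvFile = fullfile(tempdir, 'marks_sketch.csv');
pngFile = fullfile(tempdir, 'gradient_sketch.png');

% Tabular data: write a table to CSV and read it back
T = table({'Ada'; 'Bob'; 'Cy'}, [72; 65; 80], ...
          'VariableNames', {'Name', 'Mark'});
writetable(T, csvFile);
T2 = readtable(csvFile);

% Image data: an 8-bit greyscale gradient, saved as PNG and read back
img = uint8(repmat(0:255, 64, 1));   % 64-by-256 matrix, values 0 to 255
imwrite(img, pngFile);
img2 = imread(pngFile);

fprintf('Mean mark read from file: %.2f\n', mean(T2.Mark));
fprintf('Image size read from file: %d x %d\n', size(img2, 1), size(img2, 2));

delete(csvFile);
delete(pngFile);
```

writetable and readtable choose the format from the file extension, so the same calls with an .xlsx name should handle Excel files where spreadsheet support is available.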