DAMI: Introduction to Data Mining
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University


Page 1: DAMI: Introduction to Data Mining

DAMI: Introduction to Data Mining

Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University

Page 2

Short Bio
• BSc: University of Ioannina, Greece, 2003

Page 3

Short Bio
• PhD: Boston University, USA, 2009

Page 4

Short Bio
• 2009 - 2012: Aalto University, Finland
• Postdoc: Data Mining Group


Page 6

Short Bio
• 2012 - 2013: Birkbeck, University of London, UK
• Lecturer and Director of the ITApps Programme

Page 7

Short Bio
• September 2013: Senior Lecturer at DSV

Page 8

Course logistics
• Course webpage:
  - https://ilearn2.dsv.su.se/course/view.php?id=225
• Schedule:
  - Lectures: Nov 4 - Dec 17
  - Exercise sessions: Nov 17, Dec 1, Dec 15
  - Written Exam: Jan 14
  - Re-exam: Feb 23
• Instructors:
  - Panagiotis Papapetrou: [email protected]
  - Lars Asker: [email protected]
  - Henrik Boström: [email protected]
• Course Assistant:
  - Jing Zhao: [email protected]
• Office hours: by appointment only

Page 9

ILearn2  

Page 10

Topics to be covered
• Association Rules
• Clustering
• Data Representation
• Classification
• Similarity Matching
• Model Evaluation
• Time Series Analysis
• Ranking

Page 11

Syllabus
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

Page 12

Course workload
• Homeworks: 3 hp
• Written Exam: 4.5 hp
• Online quizzes

Page 13

Homework Assignments
• Three assignments (30 pts each, total 90 pts)
• 3-5 online quizzes (total of 10 + 20 pts)
• To be done individually
• Will involve some programming in R
• Three in-class exercise sessions
• Submissions:
  - Before each exercise session
  - No submissions allowed after that!
• Grade scheme: A-F

Page 14

Quizzes
• 3-5 short online quizzes
• Material to be examined:
  - The latest lecture
• Available at the end of each lecture, to be completed before the next lecture
• Only one attempt per quiz
• Will offer:
  - 10 pts towards the Homework Assignments
  - 20 pts BONUS towards the Homework Assignments
• No make-up quizzes are possible

Page 15

To Pass the Course
• Pass the Homework Assignments
  - at least 50/100 pts (including the BONUS pts from the quizzes)
• Pass the Written Exam
  - at least 50/100 pts
• Ask questions
• Enjoy it :)

Page 16

Learning Objectives
• Become familiar with fundamental data mining algorithms
• Be able to identify a correct algorithmic solution to a given data mining problem
• Be able to apply these algorithmic solutions to solve practical problems
• Be able to perform basic data mining tasks on real data using the R tool

Page 17

Textbooks
Main:
  Data Mining: Practical Machine Learning Tools and Techniques, Third Edition
  Publisher: Morgan Kaufmann, Year: 2011, ISBN: 978-0123748560

Additional:
  An Introduction to Statistical Learning with Applications in R
  Publisher: Springer, Year: 2013, ISBN: 978-1-4614-7138-7
  URL: http://www-bcf.usc.edu/~gareth/ISL/

  Research papers (pointers will be provided)

Page 18

Recommended prerequisites
• Basic algorithms: sorting, set manipulation, hashing
• Analysis of algorithms: O-notation and its variants, NP-hardness
• Programming: some programming knowledge, ability to do small experiments reasonably quickly
• Probability: concepts of probability and conditional probability, expectations, random walks
• Some linear algebra: e.g., eigenvector and eigenvalue computations

Page 19

Above all
• The goal of the course is to learn and enjoy
• The basic principle is to ask questions when you don't understand
• Say when things are unclear; not everything can be clear from the beginning
• Participate in the class as much as possible

Page 20

Introduction to data mining
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining tasks

Page 21

Why do we need data analysis?
• Huge amounts of raw data!
  - Moore's law: more efficient processors, larger memories
  - Communications have improved too
  - Measurement technologies have improved dramatically
  - It is possible to store and collect lots of raw data
  - The data analysis methods are lagging behind
• Need to analyze the raw data to extract knowledge

Page 22

The data is also very complex
• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Large number of different variables
• Lots of observations → large datasets

Page 23

Example: transaction data
• Billions of real-life customers:
  - COOP, ICA
  - Tele2
• Billions of online customers:
  - amazon
  - expedia

Page 24

Example: document data
• Web as a document repository: 50 billion web pages
• Wikipedia: 4 million articles (and counting)
• Online collections of scientific articles

Page 25

Example: network data
• Web: 50 billion pages linked via hyperlinks
• Facebook: 200 million users
• MySpace: 300 million users
• Instant messenger: 1 billion users
• Blogs: 250 million blogs worldwide

Page 26

Example: genomic sequences
• http://www.1000genomes.org/page.php
• Full sequence of 1000 individuals
• 3 billion nucleotides per person
• Lots more data in fact: medical history of the persons, gene expression data

Page 27

Example: environmental data
• Climate data (just an example): http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• "a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center"
• "6000 temperature stations, 7500 precipitation stations, 2000 pressure stations"

Page 28

We have large datasets… so what?
• Goal: obtain useful knowledge from large masses of data
• "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst"
• Tell me something interesting about the data; describe the data

Page 29

What can data-mining methods do?
• Extract frequent patterns
  - There are lots of documents that contain the phrases "Stockholm", "Housing" and "^#@$&^#$@"
• Extract association rules
  - 80% of the ICA customers who buy beer and sausage also buy mustard
• Extract rules
  - If occupation = PhD student, then Salary < 30,000 SEK

Page 30

What can data-mining methods do?
• Rank web-query results
  - What are the most relevant web pages for the query "Student housing Stockholm University"?
• Find good recommendations for users
  - Recommend amazon customers new books
  - Recommend facebook users new friends/groups
• Find groups of entities that are similar (clustering)
  - Find groups of facebook users that have similar friends/interests
  - Find groups of amazon users that buy similar products
  - Find groups of ICA customers that buy similar products

Page 31

Goal of this course
• Describe some problems that can be solved using data-mining methods
• Discuss the intuition behind data mining methods that solve these problems
• Illustrate the theoretical underpinnings of these methods (this is very important!!)
• Show how these methods can be used in real application scenarios (this is also very important!!)

Page 32

Data mining and related areas
• How does data mining relate to machine learning?
• How does data mining relate to statistics?
• Other related areas?

Page 33

Data mining vs. machine learning
• Machine learning methods are used for data mining
  - Classification, clustering
• Amount of data makes the difference
  - Data mining deals with much larger datasets, and scalability becomes an issue
• Data mining has more modest goals
  - Automating tedious discovery tasks
  - Helping users, not replacing them

Page 34

Data mining vs. statistics
• "Tell me something interesting about this data" - what else is this than statistics?
  - The goal is similar
  - Different types of methods
  - In data mining one investigates lots of possible hypotheses
  - Data mining is more exploratory data analysis
  - In data mining there are much larger datasets → algorithmics/scalability is an issue

Page 35

Data mining and databases
• Ordinary database usage: deductive
• Knowledge discovery: inductive
• New requirements for database management systems
• Novel data structures, algorithms and architectures are needed

Page 36

Machine learning
The machine learning area deals with artificial systems that are able to improve their performance with experience.

Supervised learning (predictive data mining)
  - Experience: objects that have been assigned class labels
  - Performance: typically concerns the ability to classify new (previously unseen) objects

Unsupervised learning (descriptive data mining)
  - Experience: objects for which no class labels have been given
  - Performance: typically concerns the ability to output useful characterizations (or groupings) of objects

Page 37

Examples of supervised learning
• Email classification (spam or not)
• Customer classification (will leave or not)
• Credit card transactions (fraud or not)
• Molecular properties (toxic or not)

Page 38

Examples of unsupervised learning
• find useful email categories
• find interesting purchase patterns
• describe normal credit card transactions
• find groups of molecules with similar properties

Page 39

Data mining: input
• Standard requirement: each case is represented by one row in one table
• Possible additional requirements
  - only numerical variables
  - all variables have to be normalized
  - only categorical variables
  - no missing values
• Possible generalizations
  - multiple tables
  - recursive data types (sequences, trees, etc.)

Page 40

An example: email classification
Features (attributes) as columns, examples (observations) as rows:

Ex. | All caps | No. excl. marks | Missing date | No. digits in From: | Image fraction | Spam
e1  | yes      | 0               | no           | 3                   | 0              | yes
e2  | yes      | 3               | no           | 0                   | 0.2            | yes
e3  | no       | 0               | no           | 0                   | 1              | no
e4  | no       | 4               | yes          | 4                   | 0.5            | yes
e5  | yes      | 0               | yes          | 2                   | 0              | no
e6  | no       | 0               | no           | 0                   | 0              | no

Page 41

Data mining: output
(decision-tree figure; leaves labeled Spam = yes / Spam = no)

Page 42

Data  mining:  output  

Page 43

Data mining: output
• Interpretable representation of findings
  - equations, rules, decision trees, clusters

  y = 0.52 + 4.5x1 - 2.2x2 + 3.1x3

  if x1 > 3.0 & x2 ≤ 1.8 then y = 1.0

  BuysMilk & BuysCereal → BuysJuices [Support: 0.05, Confidence: 0.85]

Page 44

The Knowledge Discovery Process
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases", AI Magazine 17(3): 37-54 (1996)

Page 45

CRISP-DM: CRoss Industry Standard Process for Data Mining

Shearer C., "The CRISP-DM model: the new blueprint for data mining", Journal of Data Warehousing 5 (2000) 13-22 (see also www.crisp-dm.org)

Page 46

CRISP-DM
• Business Understanding
  - understand the project objectives and requirements from a business perspective
  - convert this knowledge into a data mining problem definition
  - create a preliminary plan to achieve the objectives

Page 47

CRISP-DM
• Data Understanding
  - initial data collection
  - get familiar with the data
  - identify data quality problems
  - discover first insights
  - detect interesting subsets
  - form hypotheses for hidden information


Page 49

CRISP-DM
• Data Preparation
  - construct the final dataset to be fed into the machine learning algorithm
  - tasks here include: table, record, and attribute selection, data transformation and cleaning


Page 51

CRISP-DM
• Modeling
  - various data mining techniques are selected and applied
  - parameters are learned
  - some methods may have specific requirements on the form of input data
  - going back to the data preparation phase may be needed


Page 53

CRISP-DM
• Evaluation
  - the current model should have high quality from a data mining perspective
  - before final deployment, it is important to test whether the model achieves all business objectives

Page 54

CRISP-DM
• Deployment
  - just creating the model is not enough
  - the new knowledge should be organized and presented in a usable way
  - generate a report
  - implement a repeatable data mining process for the user or the analyst

Page 55

Tools
• Many data mining tools are freely available
• Some options are:

Tool                  | URL
WEKA                  | www.cs.waikato.ac.nz/ml/weka/
Rule Discovery System | www.compumine.com
R                     | www.r-project.org/
RapidMiner            | rapid-i.com/

More options can be found at www.kdnuggets.com

Page 56

Some simple data-analysis tasks
• Given a stream or set of numbers (identifiers, etc.)
• How many numbers are there?
• How many distinct numbers are there?
• What are the most frequent numbers?
• How many numbers appear at least K times?
• How many numbers appear only once?
• etc.
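When the data fits in memory, all of these questions fall out of one frequency table. A small Python sketch (the course itself uses R; the stream below is made up):

```python
from collections import Counter

# A made-up stream of identifiers.
stream = [7, 3, 7, 1, 3, 7, 9, 1, 7]

counts = Counter(stream)                     # one pass over the stream
total = len(stream)                          # how many numbers are there?
distinct = len(counts)                       # how many distinct numbers?
most_frequent = counts.most_common(1)[0]     # the most frequent number and its count
at_least_k = [x for x, c in counts.items() if c >= 3]   # appear at least K = 3 times
singletons = [x for x, c in counts.items() if c == 1]   # appear only once

print(total, distinct, most_frequent, at_least_k, singletons)
```

This uses O(distinct) memory; the next slides ask what can still be answered with only a constant number of memory locations.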

Page 57

Finding the majority element
• Given a stream of labeled elements, e.g.,
  {C, B, C, C, A, C, C, A, B, C}
• Identify the majority element: the element that occurs more than 50% of the time
• How can you find it?
• … using no more than a few memory locations?

Page 58

Counting sort
• Given a stream of labeled elements, e.g.,
  {C, B, C, C, A, C, C, A, B, C}
• Count the number of objects that have each distinct key value
• Complexity: O(N + k)
  - N: number of items
  - k: range of items (largest - smallest)
• Impractical when N << k: the count array alone needs O(k) space
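A minimal counting-sort sketch in Python (the course's programming language is R; this is only to illustrate the O(N + k) idea for non-negative integer keys):

```python
def counting_sort(items, k):
    """Sort non-negative integers < k in O(N + k) time using O(k) extra space."""
    counts = [0] * k
    for x in items:          # one pass: tally how often each key value occurs
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):   # emit each value as many times as seen
        out.extend([value] * c)
    return out

print(counting_sort([4, 1, 3, 1, 0, 4, 4], k=5))  # → [0, 1, 1, 3, 4, 4, 4]
```

The count array has one slot per possible key, which is why the method is wasteful when only a few of the k possible values actually occur.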

Page 59

Finding the majority element (Moore's Voting Algorithm)
• Complexity: O(N)
  - N: number of items
• Can we do better?
  - No! Unless we skip reading some items
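The voting algorithm itself is not spelled out on the slide; a minimal Python sketch of the idea, which needs only two memory cells (a candidate and a counter):

```python
def majority_element(stream):
    """Moore's voting algorithm: one pass, O(1) memory."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:             # no standing candidate: adopt this element
            candidate, count = x, 1
        elif x == candidate:       # a vote for the candidate
            count += 1
        else:                      # a vote against: cancel one vote
            count -= 1
    return candidate

print(majority_element(["C", "B", "C", "C", "A", "C", "C", "A", "B", "C"]))  # → C
```

If a majority element exists, it is the surviving candidate; if its existence is not guaranteed, a second pass is needed to verify that the candidate really occurs more than 50% of the time.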

Page 60

The Set Cover Problem
• A trickier data mining task…
• A common algorithmic problem…
• One of the MOST USEFUL problems in CS!

Page 61

The Set Cover Problem
• The mayor of a city wants to place fire stations so as to cover each neighborhood
• Each fire station covers:
  - its own neighborhood
  - all adjacent ones

Challenge:
• Where shall we place the fire stations so as to minimize the city's expenses?
• Each fire station costs X SEK per month

Page 62

The  Set  Cover  Problem  

Page 63

The Set Cover Problem
• A set of objects
• Some sets T that cover the objects
• Find a collection of Ts that covers all objects!
• Find the smallest such collection!

Page 64

Formal Definition
• Setting:
  - Universe of N elements U = {U1, …, UN}
  - A set of n sets T = {T1, …, Tn}
  - Find a collection C of sets in T (C subset of T) such that the union of the sets in C contains all elements of U

Page 65

Formal Definition
• Set-cover problem: Find the smallest collection C of sets from T such that all elements in the universe U are covered
• Solution?

Page 66

Trivial algorithm
• Try all sub-collections of T
• Select the smallest one that covers all the elements in U
• The running time of the trivial algorithm is O(2^|T| · |U|)
• This is way too slow
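To make the exponential cost concrete, here is a brute-force sketch in Python. The instance is made up (the slides' T1…T6 live in a figure, so their contents are assumptions); enumerating sub-collections smallest-first lets it stop at the first cover found:

```python
from itertools import combinations

# A hypothetical instance: universe of 6 elements, five candidate sets.
U = {1, 2, 3, 4, 5, 6}
T = [{1, 2, 3}, {3, 4}, {4, 5}, {5, 6}, {2, 4, 6}]

def brute_force_cover(U, T):
    """Try every sub-collection of T, smallest first: O(2^|T| * |U|) time."""
    for size in range(1, len(T) + 1):
        for combo in combinations(T, size):
            if set().union(*combo) >= U:   # does this sub-collection cover U?
                return list(combo)
    return None  # U is not coverable by the sets in T

print(brute_force_cover(U, T))
```

With |T| = 5 this is instant, but each extra set doubles the number of sub-collections, which is exactly why the slide calls it way too slow.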

Page 67

Formal Definition
• Set-cover problem: Find the smallest collection C of sets from T such that all elements in the universe U are covered
• The set cover problem is NP-hard
• Simple approximation algorithms with provable properties are available and very useful in practice

Page 68

Greedy algorithm for set cover
• Select first the largest-cardinality set t from T
• Remove the elements of t from U
• Recompute the sizes of the remaining sets in T
• Go back to the first step

Page 69

The Greedy algorithm
• X = U
• C = {}
• while X is not empty do
  - For all t ∈ T let a_t = |t ∩ X|
  - Let t be such that a_t is maximal
  - C = C ∪ {t}
  - X = X \ t
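A direct, runnable transcription of this pseudocode in Python (the instance below is made up, since the slides' actual sets are shown only in a figure):

```python
def greedy_cover(U, T):
    """Greedy set cover: repeatedly pick the set covering most uncovered elements."""
    X = set(U)   # X = U: elements still uncovered
    C = []       # C = {}: chosen sets
    while X:     # while X is not empty
        t = max(T, key=lambda s: len(s & X))   # a_t = |t ∩ X| maximal
        if not (t & X):
            raise ValueError("U is not coverable by the sets in T")
        C.append(t)                            # C = C ∪ {t}
        X -= t                                 # X = X \ t
    return C

# Hypothetical instance (same assumed sets as in the brute-force sketch):
U = {1, 2, 3, 4, 5, 6}
T = [{1, 2, 3}, {3, 4}, {4, 5}, {5, 6}, {2, 4, 6}]
print(greedy_cover(U, T))
```

On this instance greedy happens to use three sets, which is optimal; the example on the following slides shows that greedy can also return more sets than necessary.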

Page 70

Recall…  •  We  want  to  find  a  set  of  Ts  such  that  we  cover  all  the  objects  

•  What  would  the  greedy  algorithm  find?  

Page 71

Example
• Select the biggest set: T1
• Remove all elements covered by T1

Current solution: X = {T1}

Page 72

Example
• Select the next biggest set: T4
• Remove all elements covered by T4

Current solution: X = {T1}

Page 73

Example
• Select the next biggest set: T5
• Remove all elements covered by T5

Current solution: X = {T1, T4}

Page 74

Example
• Select the next biggest set: T5
• Remove all elements covered by T5

Current solution: X = {T1, T4, T5}

Page 75

Example
• Select the next biggest set: T6
• Done!

Current solution: X = {T1, T4, T5, T6}

Page 76

Example
• What is the optimal solution?
• Recall: we want the smallest possible collection!

Greedy solution: X = {T1, T4, T5, T6}
An optimal solution: X* = {T3, T4, T5}

Page 77

How can this go wrong?
• No global consideration of how good or bad a selected set is going to be…
• How good is the proposed greedy algorithm?

Page 78

Do  your  best  then.  

NP-hardness

Page 79

Approximation Algorithms
Find an algorithm that will return solutions that are guaranteed to be close to an optimal solution.

Constant factor approximation algorithms:

  SOL ≤ f · OPT, for some constant f

• OPT: value of an optimal solution
• SOL: value of the solution that our algorithm returns

Page 80

Approximation Algorithms
• For an NP-hard problem, we cannot compute an optimal solution in polynomial time
• The key to designing a polytime approximation algorithm is to obtain a good (lower or upper) bound on the optimal solution
• The general strategy (for an optimization problem) is:
  - minimization: OPT ≤ SOL ≤ f · OPT, for some f > 1
  - maximization: f · OPT ≤ SOL ≤ OPT, for some f < 1

Page 81

How good is the greedy algorithm for the Set Cover Problem?
• Consider a solution I:
  - Let a(I) be the cost of the approximate solution
  - Let a*(I) be the cost of the optimal solution
  - e.g., a*(I) is the minimum number of sets in T that cover all elements in U
• An algorithm for a minimization problem has approximation factor f if for all instances I we have that
  a(I) ≤ f · a*(I)

Page 82

How about the set cover greedy algorithm?
• The greedy algorithm for set cover has approximation factor:
  - f = H(|s_max|) ≈ ln |s_max|, where s_max is the largest set in T
• Proof: see CLR, "Introduction to Algorithms"
• Set cover cannot be approximated with f better than O(log |s_max|)
• What does that mean?

Page 83

Today…
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining prototype problems

Page 84

Next time…
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

Page 85

Next time…
• Association rules

Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Diaper, Coke}
{Beer, Bread} → {Milk}
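As a preview of next lecture's vocabulary, the support and confidence of a rule can be computed directly from the five transactions above. A small Python sketch (the course itself uses R):

```python
# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Beer"}))       # 3 of the 5 transactions
print(confidence({"Diaper"}, {"Beer"}))  # 3 of the 4 Diaper transactions
```

So {Diaper} → {Beer} has support 0.6 and confidence 0.75 on this data.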

Page 86

TODOs
• Online R-tutorial:
  - Install R
  - Learn how to load files
  - Learn how to use the help command
  - Learn how to install packages
  - Learn how to print basic data statistics

http://dist.stat.tamu.edu/pub/rvideos/