DAMI: Introduction to Data Mining
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University


Page 1: DAMI: Introduction to Data Mining

DAMI: Introduction to Data Mining

Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University

Page 2

Short Bio
• BSc: University of Ioannina, Greece, 2003

Page 3

Short Bio
• PhD: Boston University, USA, 2009

Page 4

Short Bio
• 2009 - 2012: Aalto University, Finland
• Postdoc: Data Mining Group


Page 6

Short Bio
• 2012 - 2013: Birkbeck, University of London, UK
• Lecturer and Director of the ITApps Programme

Page 7

Short Bio
• September 2013: Senior Lecturer at DSV

Page 8

Course logistics
• Course webpage:
  - https://ilearn2.dsv.su.se/course/view.php?id=225
• Schedule:
  - Lectures: Nov 4 - Dec 17
  - Exercise sessions: Nov 17, Dec 1, Dec 15
  - Written Exam: Jan 14
  - Re-exam: Feb 23
• Instructors:
  - Panagiotis Papapetrou: [email protected]
  - Lars Asker: [email protected]
  - Henrik Boström: [email protected]
• Course Assistant:
  - Jing Zhao: [email protected]
• Office hours: by appointment only

Page 9

ILearn2  

Page 10

Topics to be covered
• Association Rules
• Clustering
• Data Representation
• Classification
• Similarity Matching
• Model Evaluation
• Time Series Analysis
• Ranking

Page 11

Syllabus
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

Page 12

Course workload
• Homeworks: 3 hp
• Written Exam: 4.5 hp
• Online quizzes

Page 13

Homework Assignments
• Three assignments (30 pts each, total 90 pts)
• 3-5 online quizzes (total of 10 + 20 pts)
• To be done individually
• Will involve some programming in R
• Three in-class exercise sessions
• Submissions:
  - Before each exercise session
  - No submissions allowed after that!
• Grade scheme: A-F

Page 14

Quizzes
• 3-5 short online quizzes
• Material to be examined:
  - The latest lecture
• Available at the end of each lecture, to be completed before the next lecture
• Only one attempt per quiz
• Will offer:
  - 10 pts towards the Homework Assignments
  - 20 pts BONUS towards the Homework Assignments
• No make-up quizzes are possible

Page 15

To Pass the Course
• Pass the Homework Assignments
  - at least 50/100 pts (including the BONUS pts from the quizzes)
• Pass the Written Exam
  - at least 50/100 pts
• Ask questions
• Enjoy it :)

Page 16

Learning Objectives
• Become familiar with fundamental data mining algorithms
• Be able to identify a correct algorithmic solution to a given data mining problem
• Be able to apply these algorithmic solutions to solve practical problems
• Be able to perform basic data mining tasks on real data using the R tool

Page 17

Textbooks
Main:
  Data Mining: Practical Machine Learning Tools and Techniques, Third Edition
  Publisher: Morgan Kaufmann, Year: 2011, ISBN: 978-0123748560

Additional:
  An Introduction to Statistical Learning with Applications in R
  Publisher: Springer, Year: 2013, ISBN: 978-1-4614-7138-7
  URL: http://www-bcf.usc.edu/~gareth/ISL/

  Research papers (pointers will be provided)

Page 18

Recommended prerequisites
• Basic algorithms: sorting, set manipulation, hashing
• Analysis of algorithms: O-notation and its variants, NP-hardness
• Programming: some programming knowledge, ability to do small experiments reasonably quickly
• Probability: concepts of probability and conditional probability, expectations, random walks
• Some linear algebra: e.g., eigenvector and eigenvalue computations

Page 19

Above all
• The goal of the course is to learn and enjoy
• The basic principle is to ask questions when you don't understand
• Say when things are unclear; not everything can be clear from the beginning
• Participate in the class as much as possible

Page 20

Introduction to data mining
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining tasks

Page 21

Why do we need data analysis?
• Huge amounts of raw data!
  - Moore's law: more efficient processors, larger memories
  - Communications have improved too
  - Measurement technologies have improved dramatically
  - It is possible to store and collect lots of raw data
  - The data analysis methods are lagging behind
• Need to analyze the raw data to extract knowledge

Page 22

The data is also very complex
• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Large number of different variables
• Lots of observations → large datasets

Page 23

Example: transaction data
• Billions of real-life customers:
  - COOP, ICA
  - Tele2
• Billions of online customers:
  - amazon
  - expedia

Page 24

Example: document data
• Web as a document repository: 50 billion web pages
• Wikipedia: 4 million articles (and counting)
• Online collections of scientific articles

Page 25

Example: network data
• Web: 50 billion pages linked via hyperlinks
• Facebook: 200 million users
• MySpace: 300 million users
• Instant messenger: 1 billion users
• Blogs: 250 million blogs worldwide

Page 26

Example: genomic sequences
• http://www.1000genomes.org/page.php
• Full sequence of 1000 individuals
• 3 billion nucleotides per person
• Lots more data in fact: medical history of the persons, gene expression data

Page 27

Example: environmental data
• Climate data (just an example): http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• "a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center"
• "6000 temperature stations, 7500 precipitation stations, 2000 pressure stations"

Page 28

We have large datasets… so what?
• Goal: obtain useful knowledge from large masses of data
• "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst"
• Tell me something interesting about the data; describe the data

Page 29

What can data-mining methods do?
• Extract frequent patterns
  - There are lots of documents that contain the phrases "Stockholm", "Housing" and "^#@$&^#$@"
• Extract association rules
  - 80% of the ICA customers who buy beer and sausage also buy mustard
• Extract rules
  - If occupation = PhD student, then Salary < 30,000 SEK

Page 30

What can data-mining methods do?
• Rank web-query results
  - What are the most relevant web pages for the query "Student housing Stockholm University"?
• Find good recommendations for users
  - Recommend amazon customers new books
  - Recommend facebook users new friends/groups
• Find groups of entities that are similar (clustering)
  - Find groups of facebook users that have similar friends/interests
  - Find groups of amazon users that buy similar products
  - Find groups of ICA customers that buy similar products

Page 31

Goal of this course
• Describe some problems that can be solved using data-mining methods
• Discuss the intuition behind data mining methods that solve these problems
• Illustrate the theoretical underpinnings of these methods (this is very important!!)
• Show how these methods can be used in real application scenarios (this is also very important!!)

Page 32

Data mining and related areas
• How does data mining relate to machine learning?
• How does data mining relate to statistics?
• Other related areas?

Page 33

Data mining vs. machine learning
• Machine learning methods are used for data mining
  - Classification, clustering
• Amount of data makes the difference
  - Data mining deals with much larger datasets, and scalability becomes an issue
• Data mining has more modest goals
  - Automating tedious discovery tasks
  - Helping users, not replacing them

Page 34

Data mining vs. statistics
• "Tell me something interesting about this data" - what else is this than statistics?
  - The goal is similar
  - Different types of methods
  - In data mining one investigates lots of possible hypotheses
  - Data mining is more exploratory data analysis
  - In data mining there are much larger datasets → algorithmics/scalability is an issue

Page 35

Data mining and databases
• Ordinary database usage: deductive
• Knowledge discovery: inductive
• New requirements for database management systems
• Novel data structures, algorithms and architectures are needed

Page 36

Machine learning
The machine learning area deals with artificial systems that are able to improve their performance with experience.

Supervised learning (predictive data mining)
  - Experience: objects that have been assigned class labels
  - Performance: typically concerns the ability to classify new (previously unseen) objects

Unsupervised learning (descriptive data mining)
  - Experience: objects for which no class labels have been given
  - Performance: typically concerns the ability to output useful characterizations (or groupings) of objects

Page 37

Examples of supervised learning
• Email classification (spam or not)
• Customer classification (will leave or not)
• Credit card transactions (fraud or not)
• Molecular properties (toxic or not)

Page 38

Examples of unsupervised learning
• find useful email categories
• find interesting purchase patterns
• describe normal credit card transactions
• find groups of molecules with similar properties

Page 39

Data mining: input
• Standard requirement: each case is represented by one row in one table
• Possible additional requirements
  - only numerical variables
  - all variables have to be normalized
  - only categorical variables
  - no missing values
• Possible generalizations
  - multiple tables
  - recursive data types (sequences, trees, etc.)

Page 40

An example: email classification
Features (attributes) as columns, examples (observations) as rows:

Ex. | All caps | No. excl. marks | Missing date | No. digits in From: | Image fraction | Spam
e1  | yes      | 0               | no           | 3                   | 0              | yes
e2  | yes      | 3               | no           | 0                   | 0.2            | yes
e3  | no       | 0               | no           | 0                   | 1              | no
e4  | no       | 4               | yes          | 4                   | 0.5            | yes
e5  | yes      | 0               | yes          | 2                   | 0              | no
e6  | no       | 0               | no           | 0                   | 0              | no

Page 41

Data mining: output
(decision-tree figure; leaves labeled Spam = yes / Spam = no)

Page 42

Data  mining:  output  

Page 43

Data mining: output
• Interpretable representation of findings
  - equations, rules, decision trees, clusters

  y = 0.52 + 4.5x1 - 2.2x2 + 3.1x3

  if x1 > 3.0 & x2 ≤ 1.8 then y = 1.0

  BuysMilk & BuysCereal → BuysJuices [Support: 0.05, Confidence: 0.85]

Page 44

The Knowledge Discovery Process
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases", AI Magazine 17(3): 37-54 (1996)

Page 45

CRISP-DM: CRoss Industry Standard Process for Data Mining

Shearer C., "The CRISP-DM model: the new blueprint for data mining", Journal of Data Warehousing 5 (2000) 13-22 (see also www.crisp-dm.org)

Page 46

CRISP-DM
• Business Understanding
  - understand the project objectives and requirements from a business perspective
  - convert this knowledge into a data mining problem definition
  - create a preliminary plan to achieve the objectives

Page 47

CRISP-DM
• Data Understanding
  - initial data collection
  - get familiar with the data
  - identify data quality problems
  - discover first insights
  - detect interesting subsets
  - form hypotheses for hidden information


Page 49

CRISP-DM
• Data Preparation
  - construct the final dataset to be fed into the machine learning algorithm
  - tasks here include: table, record, and attribute selection, data transformation and cleaning


Page 51

CRISP-DM
• Modeling
  - various data mining techniques are selected and applied
  - parameters are learned
  - some methods may have specific requirements on the form of input data
  - going back to the data preparation phase may be needed


Page 53

CRISP-DM
• Evaluation
  - the current model should have high quality from a data mining perspective
  - before final deployment, it is important to test whether the model achieves all business objectives

Page 54

CRISP-DM
• Deployment
  - just creating the model is not enough
  - the new knowledge should be organized and presented in a usable way
  - generate a report
  - implement a repeatable data mining process for the user or the analyst

Page 55

Tools
• Many data mining tools are freely available
• Some options are:

Tool                  | URL
WEKA                  | www.cs.waikato.ac.nz/ml/weka/
Rule Discovery System | www.compumine.com
R                     | www.r-project.org/
RapidMiner            | rapid-i.com/

More options can be found at www.kdnuggets.com

Page 56

Some simple data-analysis tasks
• Given a stream or set of numbers (identifiers, etc.)
• How many numbers are there?
• How many distinct numbers are there?
• What are the most frequent numbers?
• How many numbers appear at least K times?
• How many numbers appear only once?
• etc.
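When the data fits in memory, all of these questions fall out of one frequency table. A small Python sketch (the course itself uses R; the stream below is made up):

```python
from collections import Counter

# A made-up stream of identifiers.
stream = [7, 3, 7, 1, 3, 7, 9, 1, 7]

counts = Counter(stream)                     # one pass over the stream
total = len(stream)                          # how many numbers are there?
distinct = len(counts)                       # how many distinct numbers?
most_frequent = counts.most_common(1)[0]     # the most frequent number and its count
at_least_k = [x for x, c in counts.items() if c >= 3]   # appear at least K = 3 times
singletons = [x for x, c in counts.items() if c == 1]   # appear only once

print(total, distinct, most_frequent, at_least_k, singletons)
```

This uses O(distinct) memory; the next slides ask what can still be answered with only a constant number of memory locations.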

Page 57

Finding the majority element
• Given a stream of labeled elements, e.g.,
  {C, B, C, C, A, C, C, A, B, C}
• Identify the majority element: the element that occurs more than 50% of the time
• How can you find it?
• … using no more than a few memory locations?

Page 58

Counting sort
• Given a stream of labeled elements, e.g.,
  {C, B, C, C, A, C, C, A, B, C}
• Count the number of objects that have each distinct key value
• Complexity: O(N + k)
  - N: number of items
  - k: range of items (largest - smallest)
• Impractical when N << k: the count array alone needs O(k) space
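A minimal counting-sort sketch in Python (the course's programming language is R; this is only to illustrate the O(N + k) idea for non-negative integer keys):

```python
def counting_sort(items, k):
    """Sort non-negative integers < k in O(N + k) time using O(k) extra space."""
    counts = [0] * k
    for x in items:          # one pass: tally how often each key value occurs
        counts[x] += 1
    out = []
    for value, c in enumerate(counts):   # emit each value as many times as seen
        out.extend([value] * c)
    return out

print(counting_sort([4, 1, 3, 1, 0, 4, 4], k=5))  # → [0, 1, 1, 3, 4, 4, 4]
```

The count array has one slot per possible key, which is why the method is wasteful when only a few of the k possible values actually occur.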

Page 59

Finding the majority element (Moore's Voting Algorithm)
• Complexity: O(N)
  - N: number of items
• Can we do better?
  - No! Unless we skip reading some items
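The voting algorithm itself is not spelled out on the slide; a minimal Python sketch of the idea, which needs only two memory cells (a candidate and a counter):

```python
def majority_element(stream):
    """Moore's voting algorithm: one pass, O(1) memory."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:             # no standing candidate: adopt this element
            candidate, count = x, 1
        elif x == candidate:       # a vote for the candidate
            count += 1
        else:                      # a vote against: cancel one vote
            count -= 1
    return candidate

print(majority_element(["C", "B", "C", "C", "A", "C", "C", "A", "B", "C"]))  # → C
```

If a majority element exists, it is the surviving candidate; if its existence is not guaranteed, a second pass is needed to verify that the candidate really occurs more than 50% of the time.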

Page 60

The Set Cover Problem
• A trickier data mining task…
• A common algorithmic problem…
• One of the MOST USEFUL problems in CS!

Page 61

The Set Cover Problem
• The mayor of a city wants to place fire stations so as to cover each neighborhood
• Each fire station covers:
  - its own neighborhood
  - all adjacent ones

Challenge:
• Where shall we place the fire stations so as to minimize the city's expenses?
• Each fire station costs X SEK per month

Page 62

The  Set  Cover  Problem  

Page 63

The Set Cover Problem
• A set of objects
• Some sets T that cover the objects
• Find a collection of Ts that covers all objects!
• Find the smallest such collection!

Page 64

Formal Definition
• Setting:
  - Universe of N elements U = {U1, …, UN}
  - A set of n sets T = {T1, …, Tn}
  - Find a collection C of sets in T (C subset of T) such that the union of the sets in C contains all elements of U

Page 65

Formal Definition
• Set-cover problem: Find the smallest collection C of sets from T such that all elements in the universe U are covered
• Solution?

Page 66

Trivial algorithm
• Try all sub-collections of T
• Select the smallest one that covers all the elements in U
• The running time of the trivial algorithm is O(2^|T| · |U|)
• This is way too slow
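To make the exponential cost concrete, here is a brute-force sketch in Python. The instance is made up (the slides' T1…T6 live in a figure, so their contents are assumptions); enumerating sub-collections smallest-first lets it stop at the first cover found:

```python
from itertools import combinations

# A hypothetical instance: universe of 6 elements, five candidate sets.
U = {1, 2, 3, 4, 5, 6}
T = [{1, 2, 3}, {3, 4}, {4, 5}, {5, 6}, {2, 4, 6}]

def brute_force_cover(U, T):
    """Try every sub-collection of T, smallest first: O(2^|T| * |U|) time."""
    for size in range(1, len(T) + 1):
        for combo in combinations(T, size):
            if set().union(*combo) >= U:   # does this sub-collection cover U?
                return list(combo)
    return None  # U is not coverable by the sets in T

print(brute_force_cover(U, T))
```

With |T| = 5 this is instant, but each extra set doubles the number of sub-collections, which is exactly why the slide calls it way too slow.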

Page 67

Formal Definition
• Set-cover problem: Find the smallest collection C of sets from T such that all elements in the universe U are covered
• The set cover problem is NP-hard
• Simple approximation algorithms with provable properties are available and very useful in practice

Page 68

Greedy algorithm for set cover
• Select first the largest-cardinality set t from T
• Remove the elements of t from U
• Recompute the sizes of the remaining sets in T
• Go back to the first step

Page 69

The Greedy algorithm
• X = U
• C = {}
• while X is not empty do
  - For all t ∈ T let a_t = |t ∩ X|
  - Let t be such that a_t is maximal
  - C = C ∪ {t}
  - X = X \ t
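A direct, runnable transcription of this pseudocode in Python (the instance below is made up, since the slides' actual sets are shown only in a figure):

```python
def greedy_cover(U, T):
    """Greedy set cover: repeatedly pick the set covering most uncovered elements."""
    X = set(U)   # X = U: elements still uncovered
    C = []       # C = {}: chosen sets
    while X:     # while X is not empty
        t = max(T, key=lambda s: len(s & X))   # a_t = |t ∩ X| maximal
        if not (t & X):
            raise ValueError("U is not coverable by the sets in T")
        C.append(t)                            # C = C ∪ {t}
        X -= t                                 # X = X \ t
    return C

# Hypothetical instance (same assumed sets as in the brute-force sketch):
U = {1, 2, 3, 4, 5, 6}
T = [{1, 2, 3}, {3, 4}, {4, 5}, {5, 6}, {2, 4, 6}]
print(greedy_cover(U, T))
```

On this instance greedy happens to use three sets, which is optimal; the example on the following slides shows that greedy can also return more sets than necessary.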

Page 70

Recall…  •  We  want  to  find  a  set  of  Ts  such  that  we  cover  all  the  objects  

•  What  would  the  greedy  algorithm  find?  

Page 71

Example
• Select the biggest set: T1
• Remove all elements covered by T1

Current solution: X = {T1}

Page 72

Example
• Select the next biggest set: T4
• Remove all elements covered by T4

Current solution: X = {T1}

Page 73

Example
• Select the next biggest set: T5
• Remove all elements covered by T5

Current solution: X = {T1, T4}

Page 74

Example
• Select the next biggest set: T5
• Remove all elements covered by T5

Current solution: X = {T1, T4, T5}

Page 75

Example
• Select the next biggest set: T6
• Done!

Current solution: X = {T1, T4, T5, T6}

Page 76

Example
• What is the optimal solution?
• Recall: we want the smallest possible collection!

Greedy solution: X = {T1, T4, T5, T6}
An optimal solution: X* = {T3, T4, T5}

Page 77

How can this go wrong?
• No global consideration of how good or bad a selected set is going to be…
• How good is the proposed greedy algorithm?

Page 78

Do  your  best  then.  

NP-hardness

Page 79

Approximation Algorithms
Find an algorithm that will return solutions that are guaranteed to be close to an optimal solution.

Constant factor approximation algorithms:

  SOL ≤ f · OPT, for some constant f

• OPT: value of an optimal solution
• SOL: value of the solution that our algorithm returns

Page 80

Approximation Algorithms
• For an NP-hard problem, we cannot compute an optimal solution in polynomial time
• The key to designing a polytime approximation algorithm is to obtain a good (lower or upper) bound on the optimal solution
• The general strategy (for an optimization problem) is:
  - minimization: OPT ≤ SOL ≤ f · OPT, for some f > 1
  - maximization: f · OPT ≤ SOL ≤ OPT, for some f < 1

Page 81

How good is the greedy algorithm for the Set Cover Problem?
• Consider a solution I:
  - Let a(I) be the cost of the approximate solution
  - Let a*(I) be the cost of the optimal solution
  - e.g., a*(I) is the minimum number of sets in T that cover all elements in U
• An algorithm for a minimization problem has approximation factor f if for all instances I we have that
  a(I) ≤ f · a*(I)

Page 82

How about the set cover greedy algorithm?
• The greedy algorithm for set cover has approximation factor:
  - f = H(|s_max|) ≈ ln |s_max|, where s_max is the largest set in T
• Proof: see CLR, "Introduction to Algorithms"
• Set cover cannot be approximated with f better than O(log |s_max|)
• What does that mean?

Page 83

Today…
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining prototype problems

Page 84

Next time…
Nov 4: Introduction to data mining
Nov 5: Association Rules
Nov 10, 14: Clustering and Data Representation
Nov 17: Exercise session 1 (Homework 1 due)
Nov 19: Classification
Nov 24, 26: Similarity Matching and Model Evaluation
Dec 1: Exercise session 2 (Homework 2 due)
Dec 3: Combining Models
Dec 8, 10: Time Series Analysis
Dec 15: Exercise session 3 (Homework 3 due)
Dec 17: Ranking
Jan 13: Review
Jan 14: EXAM
Feb 23: Re-EXAM

Page 85

Next time…
• Association rules

Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Diaper, Coke}
{Beer, Bread} → {Milk}
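As a preview of next lecture's vocabulary, the support and confidence of a rule can be computed directly from the five transactions above. A small Python sketch (the course itself uses R):

```python
# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Beer"}))       # 3 of the 5 transactions
print(confidence({"Diaper"}, {"Beer"}))  # 3 of the 4 Diaper transactions
```

So {Diaper} → {Beer} has support 0.6 and confidence 0.75 on this data.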

Page 86

TODOs
• Online R-tutorial:
  - Install R
  - Learn how to load files
  - Learn how to use the help command
  - Learn how to install packages
  - Learn how to print basic data statistics

http://dist.stat.tamu.edu/pub/rvideos/