
Page 1:

NLP Demystified

Noah Smith
Language Technologies Institute
Machine Learning Department
School of Computer Science
Carnegie Mellon University

[email protected]

Page 2:

NLP?  

Page 3:

Outline  

1. Automatically categorizing documents
2. Decoding sequences of words
3. Clustering documents and/or words

Page 4:

Categorizing Documents: Examples

• Mosteller and Wallace (1964): authorship of the Federalist Papers
• News categories: U.S., world, sports, religion, business, technology, entertainment, ...
• How positive or negative is a review of a film or restaurant?
• Is a given email message spam?
• What is the reading level of a piece of text?
• How influential will a research paper be?
• Will a congressional bill pass committee?

Page 5:

The Vision

• Human experts label some data
• Feed the data to a learning algorithm that constructs an automatic labeling function
• Apply that function to as much data as you want!

Page 6:

Basic Recipe for Document Categorization

1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function h_w from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.

Page 7:

1. Obtain Categorized Documents

Spinoza, 17th-century rationalist

Page 8:

2. Define the Feature Vector Function

• Simplest choice: one dimension per word, and let [f(d)]_j be the count of w_j in d.
• Twists:
  – Monotonic transforms, like dividing by the length of d or taking a log.
  – Increase the weights of words that occur in fewer documents ("inverse document frequency").
  – n-grams.
  – Count specially defined groupings of words.
  – Statistical tests to select words likely to be informative.
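A minimal sketch of one common combination of these twists (log-scaled counts weighted by inverse document frequency); the function name and the precomputed document-frequency table `df` are illustrative assumptions, not anything from the slides:

```python
import math
from collections import Counter

def featurize(doc_tokens, df, n_docs):
    """Bag-of-words feature vector with two of the 'twists' above:
    a log transform of the raw count and an inverse-document-frequency weight.
    df maps each word to the number of training documents containing it."""
    counts = Counter(doc_tokens)
    return {word: (1.0 + math.log(c)) * math.log(n_docs / (1.0 + df.get(word, 0)))
            for word, c in counts.items()}
```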

Page 9:

Basic Recipe for Document Categorization

1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function h_w from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.

Page 10:

3. Define a Function from Feature Vectors to Categories

• Simplest choice: linear model

  $h_w(d) = \arg\max_c \; w_c^{\top} f(d) + w_c^{\text{bias}}$

  w_c is the vector of coefficients associating each feature with class c (can be positive or negative).
  – Advantage: interpretability
  – Advantage: computational efficiency

• Some alternatives: k-nearest neighbors, decision trees, neural networks, ...
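A small sketch of this prediction rule with dictionaries as sparse vectors; `weights` and `bias` are hypothetical names for the per-class parameters w_c and w_c^bias:

```python
def classify(feature_vec, weights, bias):
    """Return the argmax over classes c of w_c . f(d) + bias_c.
    weights[c] is a dict of per-feature coefficients for class c; bias[c] its intercept."""
    def score(c):
        return bias[c] + sum(weights[c].get(f, 0.0) * v for f, v in feature_vec.items())
    return max(weights, key=score)
```

For example, `classify(featurize(tokens, df, n_docs), weights, bias)` would pick the highest-scoring category for one document.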

Page 11:

4. Select Parameters using Data

• Also known as "machine learning."
• Many learning options for linear classifiers!

[Diagram: learning options for linear classifiers arranged by whether they have a probabilistic interpretation and whether they are discriminative: NB (naive Bayes), LR (logistic regression), SVM (support vector machine), perceptron.]

Page 12:

4. Select Parameters using Data

Optimization view of learning:

  $w = \arg\min_w \; R(w) + \frac{1}{|D_{\text{train}}|} \sum_{d \in D_{\text{train}}} L(d; w)$

  where R(w) is "regularization" to avoid overfitting, and the second term is the "empirical risk": the average loss over the training data.

Typical loss functions for linear models are convex and can be efficiently optimized using online or batch iterative algorithms with convergence guarantees.
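For concreteness, here is a rough sketch of the online (stochastic gradient) route for one convex case, L2-regularized logistic loss on a binary problem; the function name, learning rate, and dense-vector representation are all assumptions made for illustration:

```python
import math
import random

def sgd_logistic(data, dim, reg=0.1, lr=0.1, epochs=10):
    """SGD on R(w) + average logistic loss, with R(w) = (reg/2) * ||w||^2.
    data: list of (feature vector as list of floats, label in {-1, +1})."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # d/dw of log(1 + exp(-y w.x)) is -y * x / (1 + exp(y w.x));
            # the exponent is capped only to avoid float overflow.
            g = -y / (1.0 + math.exp(min(margin, 30.0)))
            w = [wi - lr * (g * xi + reg * wi) for wi, xi in zip(w, x)]
    return w
```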

Page 13:

4. Select Parameters using Data

Considerations:
• Do you want posterior probabilities, or just labels?
• What methods do you understand well enough to explain in your paper?
• What methods will your readers understand?
• What implementations are available?
  – Cost, scalability, programming language, compatibility with your workflow, ...
• How well does it work (on held-out data)?

Page 14:

5. Estimate Performance

• Always, always, always use held-out data.
  – Multiple rounds of tests? Fresh testing data!
• Consider the "most frequent class" baseline.
• Consider inter-annotator agreement.
• What to measure?
  – Accuracy
  – When one class is special: precision/recall
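A small sketch of these measurements on held-out predictions (the names are illustrative, not from the slides):

```python
def precision_recall(gold, predicted, positive_class):
    """Precision and recall for one class of interest, plus plain accuracy."""
    tp = sum(1 for g, p in zip(gold, predicted) if p == positive_class and g == positive_class)
    fp = sum(1 for g, p in zip(gold, predicted) if p == positive_class and g != positive_class)
    fn = sum(1 for g, p in zip(gold, predicted) if p != positive_class and g == positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)
    return precision, recall, accuracy
```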

Page 15:

5. Estimate Performance

[Figure: precision and recall illustrated for the classifier $h_w(d) = \arg\max_c \; w_c^{\top} f(d) + w_c^{\text{bias}}$.]

Page 16:

Basic Recipe for Document Categorization

1. Obtain a pool of correctly categorized documents D.
2. Define a function f from documents to feature vectors.
3. Define a parameterized function h_w from feature vectors to categories.
4. Select h's parameters w using a training sample from D.
5. Estimate performance on a held-out sample from D.

Page 17:

Outline

✓ Automatically categorizing documents
2. Decoding sequences of words
3. Clustering documents and/or words

Page 18:

Decoding Word Sequences: Examples

• Categorizing each word by its part-of-speech or semantic class
• Recognizing mentions of named entities
• Segmenting a document into parts
• Parsing a sentence into a grammatical or semantic structure

Page 19:

High-Level View

[Diagram: classification maps a document d to a single category c; structured prediction maps an input x to a whole sequence of outputs y_1, y_2, y_3, ..., y_N.]

Page 20:

Possible Lines of Attack

1. Transform into a sequence of classification problems (see part 1).
2. Transform into a sequence labeling problem and use a variant of the Viterbi algorithm.
3. Design a representation, prediction algorithm, and learning algorithm for your particular problem.

Page 21:

Shameless Self-Promotion

Linguistic Structure Prediction
Noah A. Smith, Carnegie Mellon University

Synthesis Lectures on Human Language Technologies (Series Editor: Graeme Hirst, University of Toronto)
Morgan & Claypool Publishers, www.morganclaypool.com
ISBN: 978-1-60845-405-1; Series ISSN: 1947-4040

"A major part of natural language processing now depends on the use of text data to build linguistic analyzers. We consider statistical, computational approaches to modeling linguistic structure. We seek to unify across many approaches and many kinds of linguistic structures. Assuming a basic understanding of natural language processing and/or machine learning, we seek to bridge the gap between the two fields. Approaches to decoding (i.e., carrying out linguistic structure prediction) and supervised and unsupervised learning of models that predict discrete structures as outputs are the focus. We also survey natural language processing problems to which these methods are being applied, and we address related topics in probabilistic inference, optimization, and experimental methodology."

$56.43 on amazon.com; possibly free in electronic form, through your university's library

Page 22:

Lines of Attack

1. Reduce to a sequence of classification problems (see part 1).
2. Reduce to a sequence labeling problem and use a variant of the Viterbi algorithm.
3. Design a representation, prediction algorithm, and learning algorithm for your problem.

Page 23:

Sequence Labeling

• Input: sequence of symbols x_1 x_2 ... x_L
• Output: sequence of labels y_1 y_2 ... y_L, each ∈ Λ

Prediction rule:

  $h_w(x) = \arg\max_y \; w^{\top} f(x_1 \ldots x_L,\ y_1 \ldots y_L)$

Problem: there are O(|Λ|^L) choices for y_1 y_2 ... y_L!

Page 24:

Sequence Labeling with Local Features

A key assumption about f allows us to solve the problem exactly, in O(|Λ|^2 L) time and O(|Λ| L) space.

  $h_w(x) = \arg\max_y \; w^{\top} f(x_1 \ldots x_L,\ y_1 \ldots y_L) = \arg\max_y \; w^{\top} \left( \sum_{\ell=1}^{L-1} f_{\text{local}}(x_1 \ldots x_L,\ y_\ell\, y_{\ell+1}) \right)$

Page 25:

If I knew the best label sequence for x_1 ... x_{L-1}, then y_L would be easy. That decision would depend only on state L – 1.

I don't know that best sequence, but there are only |Λ| options at L – 1. So I only need the score of the best sequence up to L – 1, for each possible label at L – 1. Call this V[L – 1, y] for y ∈ Λ. From this, I can score each label at L, for each hypothetical label at L – 1.

The score of the best sequences up to L – 1 relies similarly on the score of the best sequences up to L – 2. Ditto, at every other timestep L – 2, L – 3, ..., 1.

  $y^*_L = \arg\max_{y_L \in \Lambda} \; w^{\top}\left(\sum_{\ell=1}^{L-2} f_{\text{local}}(x_1 \ldots x_L,\ y^*_\ell\, y^*_{\ell+1})\right) + w^{\top} f_{\text{local}}(x_1 \ldots x_L,\ y^*_{L-1}\, y_L)$

  $\phantom{y^*_L} = w^{\top}\left(\sum_{\ell=1}^{L-2} f_{\text{local}}(x_1 \ldots x_L,\ y^*_\ell\, y^*_{\ell+1})\right) + \arg\max_{y_L \in \Lambda} \; w^{\top} f_{\text{local}}(x_1 \ldots x_L,\ y^*_{L-1}\, y_L)$

Page 26:

(Featurized) Viterbi Algorithm

• Precompute V[·, ·] from left to right. V[1, ·] = 0. For ℓ = 2 to L, for each y in Λ:

  $V[\ell, y] = \max_{y' \in \Lambda} V[\ell-1, y'] + w^{\top} f_{\text{local}}(x_1 \ldots x_L,\ y'y)$

  $B[\ell, y] = \arg\max_{y' \in \Lambda} V[\ell-1, y'] + w^{\top} f_{\text{local}}(x_1 \ldots x_L,\ y'y)$

• Backtrack and select the labels from right to left:

  $y^*_L = \arg\max_{y} V[L, y]$

  For ℓ = L – 1 down to 1:

  $y^*_\ell = B[\ell+1, y^*_{\ell+1}]$
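The same algorithm as straightforward Python; the signature of `local_features` (returning the features that fire on a transition y' → y at position ℓ) is an assumption about how f_local might be represented, not the slides' notation:

```python
def viterbi_decode(x, labels, w, local_features):
    """Featurized Viterbi sketch.  x: list of tokens; labels: the tag set Λ;
    w: dict of feature weights; local_features(x, l, y_prev, y): iterable of
    feature names firing on the transition y_prev -> y at position l.
    Runs in O(|Λ|^2 L) time and O(|Λ| L) space."""
    L = len(x)

    def score(l, y_prev, y):
        return sum(w.get(f, 0.0) for f in local_features(x, l, y_prev, y))

    V = [{y: 0.0 for y in labels}]        # V[0][y] = 0, as on the slide
    B = [dict()]                          # backpointers
    for l in range(1, L):
        V.append({})
        B.append({})
        for y in labels:
            best = max(labels, key=lambda yp: V[l - 1][yp] + score(l, yp, y))
            V[l][y] = V[l - 1][best] + score(l, best, y)
            B[l][y] = best
    # Backtrack from the best final label.
    y_hat = [max(labels, key=lambda y: V[L - 1][y])]
    for l in range(L - 1, 0, -1):
        y_hat.append(B[l][y_hat[-1]])
    return list(reversed(y_hat))
```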

Page 27:

Part-of-Speech Tagging

After paying the medical bills , Frances was nearly broke .
RB    VBG    DT   JJ      NNS   ,  NNP     VBZ  RB     JJ   .

• Adverb (RB)
• Verb (VBG, VBZ, and others)
• Determiner (DT)
• Adjective (JJ)
• Noun (NN, NNS, NNP, and others)
• Punctuation (., ,, and others)

Page 28:

Named Entity Recognition

With Commander Chris Ferguson at the helm ,

Atlantis touched down at Kennedy Space Center .

Page 29:

Named Entity Recognition

With/O Commander/B-person Chris/I-person Ferguson/I-person at/O the/O helm/O ,/O

Atlantis/B-space-shuttle touched/O down/O at/O Kennedy/B-place Space/I-place Center/I-place ./O

Page 30:

Word Alignment

Mr. President , Noah's ark was filled not with production factors , but with living creatures.

NULL Noahs Arche war nicht voller Produktionsfaktoren , sondern Geschöpfe .

[Figure: alignment links drawn between the English and German words, added one at a time across the original slides.]

Page 40:

Basic  Recipe  for  Document  Categoriza*on  

1.  Obtain  a  pool  of  correctly  labeled  sequences  D.  2.  Define  a  locally  factored  func*on  f  from  

sequences  and  labelings  to  feature  vectors.    3.  Define  a  parameterized  func*on  hw  from  

feature  vectors  to  labelings.  4.  Select  h’s  parameters  w  using  a  training  sample  

from  D.  5.  Es*mate  performance  on  a  held-­‐out  sample  

from  D.  

Sequence  Labeling  

Page 41:

Structured Learners Generalize Linear Classification Learners!

• hidden Markov models ⟵ naïve Bayes
• conditional random fields ⟵ logistic regression
• structured perceptron ⟵ perceptron
• structured SVM ⟵ support vector machine
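As an illustration of the perceptron → structured perceptron arrow, a sketch that reuses the `viterbi_decode` and `local_features` conventions from the Viterbi sketch above (again, assumed names, not the slides'):

```python
from collections import defaultdict

def sentence_features(x, y, local_features):
    """Sum the local features over all adjacent label pairs in y."""
    feats = defaultdict(float)
    for l in range(1, len(x)):
        for f in local_features(x, l, y[l - 1], y[l]):
            feats[f] += 1.0
    return feats

def train_structured_perceptron(data, labels, local_features, epochs=5):
    """data: list of (token list, gold label list).  Decode with the current
    weights; when the prediction is wrong, move the weights toward the gold
    labeling's features and away from the prediction's."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, gold in data:
            pred = viterbi_decode(x, labels, w, local_features)
            if pred != gold:
                for f, v in sentence_features(x, gold, local_features).items():
                    w[f] += v
                for f, v in sentence_features(x, pred, local_features).items():
                    w[f] -= v
    return dict(w)
```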

Page 42:

Additional Notes

• Outputs that are trees, graphs, logical forms, other strings, ...
  – parse trees (phrase structure, dependencies)
  – coreference relationships among entity mentions (and pronouns)
  – a huge range of semantic analyses
• Evaluation?

Page 43:

Dependency Parse

Page 44:

Frame-Semantic Parse

Page 45:

Run our Parsers!

http://demo.ark.cs.cmu.edu/parse

Page 46:

Outline

✓ Automatically categorizing documents
✓ Decoding sequences of words
3. Clustering documents and/or words

Page 47:

Clustering Real Data

Page 48:

K-Means

Given: points {x_1, …, x_N}, K (number of clusters)

1. Arbitrarily select μ_1, …, μ_K.
2. Assign each x_i to the nearest μ_j.
3. Select each μ_j to be the mean of all x_i assigned to it.
4. If all μ_j have converged, stop; else go to 2.
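The four steps above as a plain-Python sketch (squared Euclidean distance; the function name and the iteration cap are assumptions):

```python
import random

def k_means(points, K, max_iters=100):
    """points: list of equal-length numeric feature vectors."""
    centroids = [list(p) for p in random.sample(points, K)]   # 1. arbitrary initial means
    for _ in range(max_iters):
        # 2. assign each point to the nearest centroid
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K), key=lambda k: sum((a - b) ** 2
                                                for a, b in zip(p, centroids[k])))
            clusters[j].append(p)
        # 3. recompute each centroid as the mean of its assigned points
        new_centroids = [[sum(dim) / len(c) for dim in zip(*c)] if c else centroids[j]
                         for j, c in enumerate(clusters)]
        # 4. stop once the means no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```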

Page 49:

K-Means, Visualized

[Figure: K-means iterations on example data, shown step by step across the original slides.]

Page 55:

K-Means for Text?

• Documents
  – Use the same f we might use for classification.
• Words
  – Use "context" vectors ...
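With an off-the-shelf toolkit, clustering documents this way takes only a few lines; a hedged sketch assuming scikit-learn is installed and `docs` is a list of raw document strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# docs: list of raw document strings (assumed to exist).
X = TfidfVectorizer().fit_transform(docs)         # the same style of f used for classification
cluster_ids = KMeans(n_clusters=10).fit_predict(X)
```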

Page 56:

Where's the beef?

Page 57:

chicken  

Page 58:

Hypothetical Counts based on Syntactic Dependencies

             Modified-by-ferocious(adj)  Subject-of-devour(v)  Object-of-pet(v)  Modified-by-African(adj)  Modified-by-big(adj)
Lion         15                          5                     0                 6                         15
Dog          7                           3                     8                 0                         12
Cat          1                           1                     6                 1                         9
Elephant     0                           0                     0                 10                        15
…
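Counts like the table above can be turned into "context" vectors and fed to the K-means sketch from earlier; the (word, context) pair representation here is an assumed preprocessing of dependency-parsed text:

```python
from collections import Counter, defaultdict

def context_vectors(word_context_pairs, contexts):
    """word_context_pairs: iterable of (word, context) pairs, e.g.
    ("lion", "Modified-by-ferocious(adj)").  contexts fixes the dimensions,
    so every word gets a vector of the same length."""
    counts = defaultdict(Counter)
    for word, ctx in word_context_pairs:
        counts[word][ctx] += 1
    return {w: [float(c[ctx]) for ctx in contexts] for w, c in counts.items()}
```

The resulting vectors could then go straight into `k_means(list(vectors.values()), K)`.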

Page 59:

Brown Clustering

Given: corpus of length N, K

1. Assign each word to its own cluster (V clusters).
2. Repeat V – K times:
   • Find the single merge (c_j, c_k) that results in a new clustering with the highest Quality score.
   • Prepend c_j's bitstring with 0 and c_k's with 1 (and the same for all their descendants).
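A deliberately naive sketch of the greedy procedure above (it exhaustively re-scores every candidate merge, which the real algorithm avoids); the Quality score here is the average mutual information between adjacent clusters, one standard choice, and all names are illustrative:

```python
import math
from collections import Counter
from itertools import combinations

def brown_cluster(corpus, K):
    """corpus: list of tokens.  Returns final clusters and per-word bitstrings."""
    vocab = sorted(set(corpus))
    clusters = {i: {w} for i, w in enumerate(vocab)}      # 1. one cluster per word
    word2c = {w: i for i, w in enumerate(vocab)}
    bitstring = {w: "" for w in vocab}

    bigrams = Counter(zip(corpus, corpus[1:]))
    total = sum(bigrams.values())

    def quality(assignment):
        # Average mutual information of adjacent cluster pairs.
        joint, left, right = Counter(), Counter(), Counter()
        for (a, b), n in bigrams.items():
            ca, cb = assignment[a], assignment[b]
            joint[ca, cb] += n
            left[ca] += n
            right[cb] += n
        return sum((n / total) * math.log((n / total) /
                   ((left[a] / total) * (right[b] / total)))
                   for (a, b), n in joint.items())

    while len(clusters) > K:                              # 2. V - K greedy merges
        _, cj, ck = max((quality({w: (cj if c == ck else c) for w, c in word2c.items()}), cj, ck)
                        for cj, ck in combinations(clusters, 2))
        for w in clusters[cj]:
            bitstring[w] = "0" + bitstring[w]
        for w in clusters[ck]:
            bitstring[w] = "1" + bitstring[w]
            word2c[w] = cj
        clusters[cj] |= clusters.pop(ck)
    return clusters, bitstring
```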

Page 60:

Mini-Example

Bitstrings that share a prefix are in the same cluster, at some level of granularity.

Page 61:

Clusters from Brown et al. (1992)

Page 62:

Clusters from Owoputi et al. (2013) (56M Tweets)

acronyms for laughter: lmao lmfao lmaoo lmaooo hahahahaha lool c�u rofl loool lmfaoo lmfaooo lmaoooo lmbo lololol

onomatopoeic laughter: haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah kk hahaa ahah

affirmative: yes yep yup nope yess yesss yessss ofcourse yeap likewise yepp yesh yw yuup yus

negative: yeah yea nah naw yeahh nooo yeh noo noooo yeaa ikr nvm yeahhh nahh nooooo

metacomment: smh jk #fail #random #fact sm� #smh #winning #realtalk smdh #dead #justsaying

Page 63:

Clusters from Owoputi et al. (2013) (56M Tweets)

second person pronoun: u yu yuh yhu uu yuu yew y0u yuhh youh yhuu iget yoy yooh yuo yue juu dya youz yyou

prepositions: w fo fa fr fro ov fer fir whit abou ar serie fore fah fuh w/her w/that fron isn agains

"contractions": tryna gon finna bouta trynna boutta gne fina gonn tryina fenna qone trynaa qon

going to: gonna gunna gona gna guna gnna ganna qonna gonnna gana qunna gonne goona

so+: soo sooo soooo sooooo soooooo sooooooo soooooooo sooooooooo soooooooooo

Page 64:

Clusters from Owoputi et al. (2013) (56M Tweets)

mischievous: ;) :p :-) xd ;-) ;d (; :3 ;p =p :-p =)) ;] xdd #gno xddd >:) ;-p >:d 8-) ;-d

happy: :) (: =) :)) :] :') =] ^_^ :))) ^.^ [: ;)) ((: ^__^ (= ^-^ :))))

sad: :( :/ -_- -.- :-( :'( d: :| :s -__- =( =/ >.< -___- :-/ </3 :\ -____- ;( /: :(( >_< =[ :[ #fml

love: <3 xoxo <33 xo <333 #love s2 <URL-twi**on.com> #neversaynever <3333

F-word + ing: fucking fuckin freaking bloody freakin friggin effin effing fuckn fucken frickin fukin f'n fckn flippin �n motherfucking fckin f*cking fricken fukn fuccin fcking fukkin

Page 65:

Browse our Twitter Clusters!

http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html

Page 66:

Additional Notes

• Soft clustering allows items to have mixed membership in different clusters.
  – Typically accomplished with probabilistic models
  – Latent Dirichlet allocation is a popular and Bayesian model
• Evaluation?
• One view of clusters: feature creation!

Page 67:

Summary

• supervised classification: 5 steps (data, features, prediction function, learning, evaluation)
• structured prediction: local factoring + dynamic programming
• unsupervised clustering: alternating or greedy optimization