Greed is Good if Randomized: New Inference for Dependency Parsing (people.csail.mit.edu/yuanzh/papers/greed.pdf)


Greed is Good if Randomized: New Inference for Dependency Parsing

Yuan Zhang, CSAIL, MIT

Joint work with Tao Lei, Regina Barzilay, and Tommi Jaakkola

Inference vs. Scoring

[Chart: existing parsers plotted along two axes, inference (approximate vs. exact) and scoring function (limited vs. expressive): Minimum Spanning Tree, Reranking, Dual Decomposition]

• Reranking: incorporate arbitrary features
• Dual Decomposition: search in the full space

Parsing Complexity

• High-order parsing is NP-hard (McDonald et al., 2006)
• Hypothesis: parsing is easy on average
• Many NP-hard problems are easy on average
  - MAX-SAT (Resende et al., 1997)
  - Set cover (Hochbaum, 1982)

We show:
• an analysis of average-case parsing complexity
• a simple inference algorithm based on the analysis

Our Approach

[Chart: the same inference vs. scoring chart, now with Our Approach added alongside Minimum Spanning Tree, Reranking, and Dual Decomposition]

Core Idea

• Climb to the optimal tree in a few small greedy steps

Randomized Hill-climbing:
  For k = 1 to K:
    1) Randomly sample a dependency tree
    2) Greedily improve the tree one edge at a time
    3) Repeat (2) until convergence
  Select the tree with the highest score

That's it! (A code sketch of the procedure follows below.)
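A minimal sketch of the procedure in Python, assuming a tree is stored as a head array (heads[i] is the head of word i, with index 0 = ROOT) and that a scoring function score(sentence, heads) is supplied. The helpers below (random_tree, is_valid_tree) and the word-by-word sweep in the greedy step are illustrative choices, not the RBGParser implementation.

```python
import random

def is_valid_tree(heads):
    # Every word must reach ROOT (index 0) without revisiting a node.
    for i in range(1, len(heads)):
        seen, j = set(), i
        while j != 0:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

def random_tree(n):
    # Simple random initializer: attach words in random order to a node
    # that is already attached (a valid random tree, not exactly uniform).
    heads = [0] * n
    attached = [0]
    for i in random.sample(range(1, n), n - 1):
        heads[i] = random.choice(attached)
        attached.append(i)
    return heads

def hill_climb(sentence, heads, score):
    # Greedily re-attach one word at a time until no single change helps.
    n = len(heads)
    improved = True
    while improved:
        improved = False
        for m in range(1, n):
            best_h, best_s = heads[m], score(sentence, heads)
            for h in range(n):
                if h in (m, heads[m]):
                    continue
                cand = heads[:]
                cand[m] = h
                if not is_valid_tree(cand):
                    continue
                s = score(sentence, cand)
                if s > best_s:
                    best_h, best_s, improved = h, s, True
            heads[m] = best_h
    return heads

def randomized_greedy(sentence, score, K=300):
    # sentence[0] is an artificial <ROOT> token; sentence[1:] are the words.
    n = len(sentence)
    best, best_s = None, float("-inf")
    for _ in range(K):                     # K random restarts
        tree = hill_climb(sentence, random_tree(n), score)
        s = score(sentence, tree)
        if s > best_s:
            best, best_s = tree, s
    return best
```

Each restart climbs to a local optimum; the best tree over the K restarts is returned.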

It Works!

[Chart: parsing performance on the CoNLL datasets (UAS): Our Full 89.44% vs. Turbo (Dual Decomposition) 88.73%]

Example

Sentence: "I ate an apple today"

[Figure sequence: starting from a random initial tree over ROOT, I, ate, an, apple, today, the parse is transformed into the target tree by re-attaching one word at a time, with every intermediate structure remaining a valid tree]

Reachability: any tree can be transformed into any other tree
• maintaining a valid tree structure at every point
• using as few as d steps (d: number of head differences, i.e. the Hamming distance)

[Figure: the initial tree y(0) is transformed step by step into y(T); a code sketch of one such transformation follows below]
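A small sketch of this head-array view: re-attaching words in order of increasing depth in the target tree changes exactly one arc per step and never creates a cycle. The ordering is one constructive choice used here for illustration (not necessarily the construction in the paper), and the initial tree below is a hypothetical chain, not the one drawn in the slides.

```python
def depth(heads, i):
    # Depth of word i in the tree given by the head array (ROOT = index 0).
    d = 0
    while i != 0:
        i, d = heads[i], d + 1
    return d

def transform(heads, target):
    # Walk from one valid tree to another in exactly d single-arc steps,
    # where d is the number of words whose heads differ (Hamming distance).
    heads = heads[:]
    steps = [heads[:]]
    for m in sorted(range(1, len(heads)), key=lambda i: depth(target, i)):
        if heads[m] != target[m]:
            heads[m] = target[m]          # re-attach one word; the tree stays valid
            steps.append(heads[:])
    return steps

# "I ate an apple today": ROOT=0, I=1, ate=2, an=3, apple=4, today=5
target  = [0, 2, 0, 4, 2, 2]   # ate<-ROOT, I<-ate, an<-apple, apple<-ate, today<-ate
initial = [0, 5, 1, 2, 3, 0]   # a hypothetical starting tree, not the slides' initial tree
print(transform(initial, target))   # 5 differing heads -> 5 steps, 6 trees in total
```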

Why Greedy Has a Chance to Work

[Figure: the example sequence y(0), ..., y(T), reachable through single-arc changes]

Greedy Hill-climbing

[Figure: the same sequence, with the score S(x, y(t)) increasing at every step]

• Arbitrary features in the scoring function (a sketch of such a function follows below)
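Because the climb only ever evaluates complete trees, the scoring function may use arbitrary features of the whole structure. A hedged sketch of such a function: the particular features (arc, binned arc length, grandparent chain, a global root-children count) and the linear form are illustrative assumptions, not the paper's exact feature set.

```python
from collections import Counter

def features(words, heads):
    # Feature counts over the whole tree; nothing restricts them to single arcs.
    f = Counter()
    for m in range(1, len(words)):
        h = heads[m]
        f[("arc", words[h], words[m])] += 1                         # first-order arc
        f[("dist", min(abs(h - m), 5))] += 1                        # binned arc length
        if h != 0:
            f[("grand", words[heads[h]], words[h], words[m])] += 1  # grandparent chain
    f[("root-children", sum(1 for m in range(1, len(words)) if heads[m] == 0))] += 1  # global
    return f

def score(weights, words, heads):
    # Linear scoring S(x, y) = w . phi(x, y) over the complete tree.
    return sum(weights.get(k, 0.0) * v for k, v in features(words, heads).items())
```

Something like score_fn = lambda s, y: score(weights, s, y) plugs directly into the randomized greedy sketch above.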

Challenge: Local Optimum

[Figure: the score S over trees y, marking a local optimum and the global optimum]

Hill-climbing with Restarts

Overcome local optima via restarts
• random initialization (e.g. uniform)

[Figure: several random initial trees y(0) are each hill-climbed to a local optimum y(T); the highest-scoring result is kept (max)]

Learning Algorithm

• Follow the common max-margin framework:

  ∀y ∈ T(x):  S(x, ỹ) ≥ S(x, y) + |ỹ - y| - ξ,   where ỹ is the gold tree

• Adopt the passive-aggressive online learning framework (Crammer et al., 2006); a sketch of one such update follows below
• Decode with our randomized greedy algorithm
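A sketch of one passive-aggressive (PA-I style) update under this constraint, assuming a sparse feature map phi(sentence, tree) and a decode callable (e.g. the randomized greedy decoder). The exact update and cost-augmentation used in the paper may differ; this only illustrates the framework.

```python
def hamming(y_gold, y):
    # Number of words whose heads differ (the margin term |y_gold - y|).
    return sum(1 for a, b in zip(y_gold[1:], y[1:]) if a != b)

def pa_update(weights, phi, sentence, y_gold, decode, C=1.0):
    # One PA-I step enforcing S(x, y_gold) >= S(x, y) + |y_gold - y| - xi
    # on the currently highest-scoring tree.
    y_hat = decode(sentence, weights)
    f_gold, f_hat = phi(sentence, y_gold), phi(sentence, y_hat)
    s = lambda f: sum(weights.get(k, 0.0) * v for k, v in f.items())
    loss = s(f_hat) - s(f_gold) + hamming(y_gold, y_hat)
    if loss > 0:
        delta = {k: f_gold.get(k, 0) - f_hat.get(k, 0) for k in set(f_gold) | set(f_hat)}
        norm2 = sum(v * v for v in delta.values())
        if norm2 > 0:
            tau = min(C, loss / norm2)        # PA-I step size
            for k, v in delta.items():
                weights[k] = weights.get(k, 0.0) + tau * v
    return weights
```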

Analysis

[Table: theoretical and empirical analysis for first-order and high-order models; the theoretical entry for the high-order case is marked "?"]

Search Space Complexity: First-order

10 words  →  ≈ 2 billion trees  →  < 512 local optima
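These counts can be roughly reproduced, assuming trees are unrestricted spanning arborescences over ROOT and the 10 words (whether ROOT may take several children changes the exact number, so this is an assumption):

```python
n = 10
print((n + 1) ** (n - 1))   # 2357947691: trees over ROOT + 10 words by the matrix-tree
                            # theorem, about 2.4 billion, the ballpark of "2 billion"
print(2 ** (n - 1))         # 512: the worst-case number of local optima (theorem below)
```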

Search Space Complexity: First-order

Theorem: For any first-order scoring function:
• there are at most 2^(n-1) locally optimal trees
• this upper bound is tight

2^(n-1) is still a lot, but it is the worst case. What about the average case? (A brute-force check for tiny sentences is sketched below; the efficient counting method follows.)
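Before the efficient counting method, here is a brute-force sanity check for tiny sentences: enumerate every tree, score it with a random first-order model, and count the trees from which no single head change improves the score. It is exponential and only meant to show what is being counted; the first-order arc scores are made up.

```python
import itertools, random

def is_valid_tree(heads):
    # Same helper as in the earlier sketch: every word must reach ROOT (index 0).
    for i in range(1, len(heads)):
        seen, j = set(), i
        while j != 0:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

def count_local_optima(n, arc_score):
    # n words; arc_score[m][h] = first-order score of attaching word m to head h.
    def total(heads):
        return sum(arc_score[m][heads[m]] for m in range(1, n + 1))

    trees = [(0,) + h for h in itertools.product(range(n + 1), repeat=n)
             if is_valid_tree((0,) + h)]

    def is_local_opt(heads):
        # No single valid head change may improve the score.
        for m in range(1, n + 1):
            for h in range(n + 1):
                if h not in (m, heads[m]):
                    cand = list(heads)
                    cand[m] = h
                    if is_valid_tree(cand) and total(cand) > total(heads):
                        return False
        return True

    return len(trees), sum(is_local_opt(t) for t in trees)

random.seed(0)
n = 4
arc_score = [[random.random() for _ in range(n + 1)] for _ in range(n + 1)]
print(count_local_optima(n, arc_score))   # (125 trees, a few local optima, <= 2**(n-1) = 8)
```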

Algorithm for Counting Local Optima

The method is based on the Chu-Liu-Edmonds algorithm:
• Select the best head independently for each word (this first step is sketched in code below)
• Contract the cycle and recursively count the local optima
  - any local optimum reassigns exactly one edge in the cycle

[Figure: a weighted example graph over root, John, saw, Mary; selecting the best heads forms a cycle, which is contracted, and the count splits into two smaller problems joined by a "+"]
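A sketch of just that first step, greedy best-head selection plus cycle detection; the contraction and the recursive count described above are not shown here.

```python
def best_heads(arc_score):
    # arc_score[m][h]: first-order score of attaching word m to head h (0 = ROOT).
    n = len(arc_score) - 1
    return [0] + [max((h for h in range(n + 1) if h != m), key=lambda h: arc_score[m][h])
                  for m in range(1, n + 1)]

def find_cycle(heads):
    # Return the nodes of a cycle in the greedy head graph, or None if it is a tree.
    for start in range(1, len(heads)):
        seen, i = [], start
        while i != 0 and i not in seen:
            seen.append(i)
            i = heads[i]
        if i != 0:                        # a node was revisited: report the cycle
            return seen[seen.index(i):]
    return None
```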

Empirical Results: First-order

How many local optima are there in real data?

[Chart: number of local optima per sentence on the English dataset (cumulative over % of sentences): 50% of sentences at 21, 70% at 121, 90% at 2000]

Empirical Results: First-order

Does the hill-climbing find the argmax? (Finding the global optimum on English)
• sentence length ≤ 15: 100%
• sentence length > 15: 99.3%

An easy search space leads to successful decoding.

Empirical Results: High-order

Does the hill-climbing find the argmax?

Comparison on English given a DD certificate:
• dual decomposition (Koo et al., 2010) provides a certificate for 94.5% of sentences
• on those sentences, S_DD = S_HC for 99.8%

Overall comparison on English:
• S_DD = S_HC: 98.7%
• S_DD < S_HC: 1.0%
• S_DD > S_HC: 0.3%

Experimental Setup

Implementation
• adaptive restarting strategy with K = 300

Datasets
• 14 languages in the CoNLL 2006 & 2008 shared tasks

Features
• up to 3rd-order (three-arc) features used in the MST/Turbo parsers
• global features used in reranking

Baselines and Evaluation Measure

Baselines:
• Turbo Parser: dual decomposition with 3rd-order features (Martins et al., 2013)
• Sampling-based Parser: MCMC sampling with global features (Zhang et al., 2014)

Evaluation Measure:
• Unlabeled Attachment Score (UAS), excluding punctuation (a small sketch follows below)
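UAS is head accuracy over non-punctuation tokens. A small sketch, assuming gold and predicted head arrays plus a per-token punctuation flag (how punctuation is identified, e.g. by POS tag, is a dataset-specific assumption):

```python
def uas(gold_heads, pred_heads, is_punct):
    # Unlabeled Attachment Score over non-punctuation tokens (index 0 = ROOT).
    pairs = [(g, p) for g, p, punct in zip(gold_heads[1:], pred_heads[1:], is_punct[1:])
             if not punct]
    return 100.0 * sum(g == p for g, p in pairs) / max(1, len(pairs))
```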

Comparing with Baselines

[Chart: UAS comparison on the CoNLL datasets between Our Full model (w/o Tensor), Our 3rd-order model, the Sampling-based (MCMC) parser, and the Turbo (DD) parser; the reported values are 89.24%, 89.23%, 88.66%, and 88.73%, and a "Global Features" annotation marks the models that use global features]

Impact of Initialization

[Chart: UAS (%) with different initializations: Uniform 88.0 vs. Rnd-1st 88.1]

Impact of Restarts

[Chart: UAS (%) with no restart 85.4 vs. 300 restarts 88.1]

Convergence Property

• Score normalized by the highest score found in 3000 restarts

[Plot: convergence analysis on English; normalized score (0.994 to 1) vs. number of restarts (0 to 500), with separate curves for sentences of length ≤ 15 and > 15]

Trade-off between Speed and Performance

[Plot: decoding speed on English; UAS (88 to 94) vs. seconds per token (2 to 10 ×10^-3, fast to slow) for the 3rd-order model and the full model]

Conclusion

• Analysis: we investigate the average-case complexity of parsing
• Algorithm: we introduce a simple randomized greedy inference algorithm

Source code available at: https://github.com/taolei87/RBGParser

Thank You!
