Greed is Good if Randomized: New Inference for Dependency Parsing (people.csail.mit.edu/yuanzh/papers/greed.pdf)


Greed is Good if Randomized: New Inference for Dependency Parsing

Yuan Zhang, CSAIL, MIT

Joint work with Tao Lei, Regina Barzilay, and Tommi Jaakkola

Inference vs. Scoring

[Chart: existing parsers plotted along two axes, inference (approximate vs. exact) and scoring function (limited vs. expressive): Minimum Spanning Tree, Reranking, Dual Decomposition]

• Reranking: incorporate arbitrary features
• Dual Decomposition: search in the full space

Parsing Complexity

• High-order parsing is NP-hard (McDonald et al., 2006)
• Hypothesis: parsing is easy on average
• Many NP-hard problems are easy on average
  - MAX-SAT (Resende et al., 1997)
  - Set cover (Hochbaum, 1982)

We show:
• an analysis of average-case parsing complexity
• a simple inference algorithm based on the analysis

Our Approach

[Chart: the same inference vs. scoring chart, now with Our Approach added alongside Minimum Spanning Tree, Reranking, and Dual Decomposition]

Core Idea

• Climb to the optimal tree in a few small greedy steps

Randomized Hill-climbing:
  For k = 1 to K:
    1) Randomly sample a dependency tree
    2) Greedily improve the tree one edge at a time
    3) Repeat (2) until convergence
  Select the tree with the highest score

That's it! (A code sketch of the procedure follows below.)
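A minimal sketch of the procedure in Python, assuming a tree is stored as a head array (heads[i] is the head of word i, with index 0 = ROOT) and that a scoring function score(sentence, heads) is supplied. The helpers below (random_tree, is_valid_tree) and the word-by-word sweep in the greedy step are illustrative choices, not the RBGParser implementation.

```python
import random

def is_valid_tree(heads):
    # Every word must reach ROOT (index 0) without revisiting a node.
    for i in range(1, len(heads)):
        seen, j = set(), i
        while j != 0:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

def random_tree(n):
    # Simple random initializer: attach words in random order to a node
    # that is already attached (a valid random tree, not exactly uniform).
    heads = [0] * n
    attached = [0]
    for i in random.sample(range(1, n), n - 1):
        heads[i] = random.choice(attached)
        attached.append(i)
    return heads

def hill_climb(sentence, heads, score):
    # Greedily re-attach one word at a time until no single change helps.
    n = len(heads)
    improved = True
    while improved:
        improved = False
        for m in range(1, n):
            best_h, best_s = heads[m], score(sentence, heads)
            for h in range(n):
                if h in (m, heads[m]):
                    continue
                cand = heads[:]
                cand[m] = h
                if not is_valid_tree(cand):
                    continue
                s = score(sentence, cand)
                if s > best_s:
                    best_h, best_s, improved = h, s, True
            heads[m] = best_h
    return heads

def randomized_greedy(sentence, score, K=300):
    # sentence[0] is an artificial <ROOT> token; sentence[1:] are the words.
    n = len(sentence)
    best, best_s = None, float("-inf")
    for _ in range(K):                     # K random restarts
        tree = hill_climb(sentence, random_tree(n), score)
        s = score(sentence, tree)
        if s > best_s:
            best, best_s = tree, s
    return best
```

Each restart climbs to a local optimum; the best tree over the K restarts is returned.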

It Works!

[Chart: parsing performance on the CoNLL datasets (UAS): Our Full 89.44% vs. Turbo (Dual Decomposition) 88.73%]

Example

Sentence: "I ate an apple today"

[Figure sequence: starting from a random initial tree over ROOT, I, ate, an, apple, today, the parse is transformed into the target tree by re-attaching one word at a time, with every intermediate structure remaining a valid tree]

Reachability: any tree can be transformed into any other tree
• maintaining a valid tree structure at every point
• using as few as d steps (d: number of head differences, i.e. the Hamming distance)

[Figure: the initial tree y(0) is transformed step by step into y(T); a code sketch of one such transformation follows below]
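A small sketch of this head-array view: re-attaching words in order of increasing depth in the target tree changes exactly one arc per step and never creates a cycle. The ordering is one constructive choice used here for illustration (not necessarily the construction in the paper), and the initial tree below is a hypothetical chain, not the one drawn in the slides.

```python
def depth(heads, i):
    # Depth of word i in the tree given by the head array (ROOT = index 0).
    d = 0
    while i != 0:
        i, d = heads[i], d + 1
    return d

def transform(heads, target):
    # Walk from one valid tree to another in exactly d single-arc steps,
    # where d is the number of words whose heads differ (Hamming distance).
    heads = heads[:]
    steps = [heads[:]]
    for m in sorted(range(1, len(heads)), key=lambda i: depth(target, i)):
        if heads[m] != target[m]:
            heads[m] = target[m]          # re-attach one word; the tree stays valid
            steps.append(heads[:])
    return steps

# "I ate an apple today": ROOT=0, I=1, ate=2, an=3, apple=4, today=5
target  = [0, 2, 0, 4, 2, 2]   # ate<-ROOT, I<-ate, an<-apple, apple<-ate, today<-ate
initial = [0, 5, 1, 2, 3, 0]   # a hypothetical starting tree, not the slides' initial tree
print(transform(initial, target))   # 5 differing heads -> 5 steps, 6 trees in total
```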

Why Greedy Has a Chance to Work

[Figure: the example sequence y(0), ..., y(T), reachable through single-arc changes]

Greedy Hill-climbing

[Figure: the same sequence, with the score S(x, y(t)) increasing at every step]

• Arbitrary features in the scoring function (a sketch of such a function follows below)
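Because the climb only ever evaluates complete trees, the scoring function may use arbitrary features of the whole structure. A hedged sketch of such a function: the particular features (arc, binned arc length, grandparent chain, a global root-children count) and the linear form are illustrative assumptions, not the paper's exact feature set.

```python
from collections import Counter

def features(words, heads):
    # Feature counts over the whole tree; nothing restricts them to single arcs.
    f = Counter()
    for m in range(1, len(words)):
        h = heads[m]
        f[("arc", words[h], words[m])] += 1                         # first-order arc
        f[("dist", min(abs(h - m), 5))] += 1                        # binned arc length
        if h != 0:
            f[("grand", words[heads[h]], words[h], words[m])] += 1  # grandparent chain
    f[("root-children", sum(1 for m in range(1, len(words)) if heads[m] == 0))] += 1  # global
    return f

def score(weights, words, heads):
    # Linear scoring S(x, y) = w . phi(x, y) over the complete tree.
    return sum(weights.get(k, 0.0) * v for k, v in features(words, heads).items())
```

Something like score_fn = lambda s, y: score(weights, s, y) plugs directly into the randomized greedy sketch above.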

Challenge: Local Optimum

[Figure: the score S over trees y, marking a local optimum and the global optimum]

Hill-climbing with Restarts

Overcome local optima via restarts
• random initialization (e.g. uniform)

[Figure: several random initial trees y(0) are each hill-climbed to a local optimum y(T); the highest-scoring result is kept (max)]

Learning Algorithm

• Follow the common max-margin framework:

  ∀y ∈ T(x):  S(x, ỹ) ≥ S(x, y) + |ỹ - y| - ξ,   where ỹ is the gold tree

• Adopt the passive-aggressive online learning framework (Crammer et al., 2006); a sketch of one such update follows below
• Decode with our randomized greedy algorithm
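A sketch of one passive-aggressive (PA-I style) update under this constraint, assuming a sparse feature map phi(sentence, tree) and a decode callable (e.g. the randomized greedy decoder). The exact update and cost-augmentation used in the paper may differ; this only illustrates the framework.

```python
def hamming(y_gold, y):
    # Number of words whose heads differ (the margin term |y_gold - y|).
    return sum(1 for a, b in zip(y_gold[1:], y[1:]) if a != b)

def pa_update(weights, phi, sentence, y_gold, decode, C=1.0):
    # One PA-I step enforcing S(x, y_gold) >= S(x, y) + |y_gold - y| - xi
    # on the currently highest-scoring tree.
    y_hat = decode(sentence, weights)
    f_gold, f_hat = phi(sentence, y_gold), phi(sentence, y_hat)
    s = lambda f: sum(weights.get(k, 0.0) * v for k, v in f.items())
    loss = s(f_hat) - s(f_gold) + hamming(y_gold, y_hat)
    if loss > 0:
        delta = {k: f_gold.get(k, 0) - f_hat.get(k, 0) for k in set(f_gold) | set(f_hat)}
        norm2 = sum(v * v for v in delta.values())
        if norm2 > 0:
            tau = min(C, loss / norm2)        # PA-I step size
            for k, v in delta.items():
                weights[k] = weights.get(k, 0.0) + tau * v
    return weights
```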

Analysis

[Table: theoretical and empirical analysis for first-order and high-order models; the theoretical entry for the high-order case is marked "?"]

Search Space Complexity: First-order

10 words  →  ≈ 2 billion trees  →  < 512 local optima
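These counts can be roughly reproduced, assuming trees are unrestricted spanning arborescences over ROOT and the 10 words (whether ROOT may take several children changes the exact number, so this is an assumption):

```python
n = 10
print((n + 1) ** (n - 1))   # 2357947691: trees over ROOT + 10 words by the matrix-tree
                            # theorem, about 2.4 billion, the ballpark of "2 billion"
print(2 ** (n - 1))         # 512: the worst-case number of local optima (theorem below)
```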

Search Space Complexity: First-order

Theorem: For any first-order scoring function:
• there are at most 2^(n-1) locally optimal trees
• this upper bound is tight

2^(n-1) is still a lot, but it is the worst case. What about the average case? (A brute-force check for tiny sentences is sketched below; the efficient counting method follows.)
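Before the efficient counting method, here is a brute-force sanity check for tiny sentences: enumerate every tree, score it with a random first-order model, and count the trees from which no single head change improves the score. It is exponential and only meant to show what is being counted; the first-order arc scores are made up.

```python
import itertools, random

def is_valid_tree(heads):
    # Same helper as in the earlier sketch: every word must reach ROOT (index 0).
    for i in range(1, len(heads)):
        seen, j = set(), i
        while j != 0:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

def count_local_optima(n, arc_score):
    # n words; arc_score[m][h] = first-order score of attaching word m to head h.
    def total(heads):
        return sum(arc_score[m][heads[m]] for m in range(1, n + 1))

    trees = [(0,) + h for h in itertools.product(range(n + 1), repeat=n)
             if is_valid_tree((0,) + h)]

    def is_local_opt(heads):
        # No single valid head change may improve the score.
        for m in range(1, n + 1):
            for h in range(n + 1):
                if h not in (m, heads[m]):
                    cand = list(heads)
                    cand[m] = h
                    if is_valid_tree(cand) and total(cand) > total(heads):
                        return False
        return True

    return len(trees), sum(is_local_opt(t) for t in trees)

random.seed(0)
n = 4
arc_score = [[random.random() for _ in range(n + 1)] for _ in range(n + 1)]
print(count_local_optima(n, arc_score))   # (125 trees, a few local optima, <= 2**(n-1) = 8)
```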

Algorithm for Counting Local Optima

The method is based on the Chu-Liu-Edmonds algorithm:
• Select the best head independently for each word (this first step is sketched in code below)
• Contract the cycle and recursively count the local optima
  - any local optimum reassigns exactly one edge in the cycle

[Figure: a weighted example graph over root, John, saw, Mary; selecting the best heads forms a cycle, which is contracted, and the count splits into two smaller problems joined by a "+"]
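A sketch of just that first step, greedy best-head selection plus cycle detection; the contraction and the recursive count described above are not shown here.

```python
def best_heads(arc_score):
    # arc_score[m][h]: first-order score of attaching word m to head h (0 = ROOT).
    n = len(arc_score) - 1
    return [0] + [max((h for h in range(n + 1) if h != m), key=lambda h: arc_score[m][h])
                  for m in range(1, n + 1)]

def find_cycle(heads):
    # Return the nodes of a cycle in the greedy head graph, or None if it is a tree.
    for start in range(1, len(heads)):
        seen, i = [], start
        while i != 0 and i not in seen:
            seen.append(i)
            i = heads[i]
        if i != 0:                        # a node was revisited: report the cycle
            return seen[seen.index(i):]
    return None
```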

Empirical Results: First-order

How many local optima are there in real data?

[Chart: number of local optima per sentence on the English dataset (cumulative over % of sentences): 50% of sentences at 21, 70% at 121, 90% at 2000]

Empirical Results: First-order

Does the hill-climbing find the argmax? (Finding the global optimum on English)
• sentence length ≤ 15: 100%
• sentence length > 15: 99.3%

An easy search space leads to successful decoding.

Empirical Results: High-order

Does the hill-climbing find the argmax?

Comparison on English given a DD certificate:
• dual decomposition (Koo et al., 2010) provides a certificate for 94.5% of sentences
• on those sentences, S_DD = S_HC for 99.8%

Overall comparison on English:
• S_DD = S_HC: 98.7%
• S_DD < S_HC: 1.0%
• S_DD > S_HC: 0.3%

Experimental Setup

Implementation
• adaptive restarting strategy with K = 300

Datasets
• 14 languages in the CoNLL 2006 & 2008 shared tasks

Features
• up to 3rd-order (three-arc) features used in the MST/Turbo parsers
• global features used in reranking

Baselines and Evaluation Measure

Baselines:
• Turbo Parser: dual decomposition with 3rd-order features (Martins et al., 2013)
• Sampling-based Parser: MCMC sampling with global features (Zhang et al., 2014)

Evaluation Measure:
• Unlabeled Attachment Score (UAS), excluding punctuation (a small sketch follows below)
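UAS is head accuracy over non-punctuation tokens. A small sketch, assuming gold and predicted head arrays plus a per-token punctuation flag (how punctuation is identified, e.g. by POS tag, is a dataset-specific assumption):

```python
def uas(gold_heads, pred_heads, is_punct):
    # Unlabeled Attachment Score over non-punctuation tokens (index 0 = ROOT).
    pairs = [(g, p) for g, p, punct in zip(gold_heads[1:], pred_heads[1:], is_punct[1:])
             if not punct]
    return 100.0 * sum(g == p for g, p in pairs) / max(1, len(pairs))
```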

Comparing with Baselines

[Chart: UAS comparison on the CoNLL datasets between Our Full model (w/o Tensor), Our 3rd-order model, the Sampling-based (MCMC) parser, and the Turbo (DD) parser; the reported values are 89.24%, 89.23%, 88.66%, and 88.73%, and a "Global Features" annotation marks the models that use global features]

Impact of Initialization

[Chart: UAS (%) with different initializations: Uniform 88.0 vs. Rnd-1st 88.1]

Impact of Restarts

[Chart: UAS (%) with no restart 85.4 vs. 300 restarts 88.1]

Convergence Property

• Score normalized by the highest score found in 3000 restarts

[Plot: convergence analysis on English; normalized score (0.994 to 1) vs. number of restarts (0 to 500), with separate curves for sentences of length ≤ 15 and > 15]

Trade-off between Speed and Performance

[Plot: decoding speed on English; UAS (88 to 94) vs. seconds per token (2 to 10 ×10^-3, fast to slow) for the 3rd-order model and the full model]

Conclusion

• Analysis: we investigate the average-case complexity of parsing
• Algorithm: we introduce a simple randomized greedy inference algorithm

Source code available at: https://github.com/taolei87/RBGParser

Thank You!
