The CHAOS Project: Theory and Practice Fabio Massimo Zanzotto Department of Computer Science, Systems and Production University of Roma “Tor Vergata”

Post on 26-Mar-2015

TRANSCRIPT

Page 1:

The CHAOS Project: Theory and Practice

Fabio Massimo Zanzotto
Department of Computer Science, Systems and Production
University of Roma “Tor Vergata”

Page 2:

People

INVESTIGATORS: Roberto Basili, Fabio Massimo Zanzotto, Maria Teresa Pazienza

FORMER CONTRIBUTORS: Daniele Pighin, Daniele Previtali, Alessandro Bahgat, Marco Pennacchiotti, Massimo Di Nanni, Michele Vindigni, Luigi Mazzucchelli, Paola Velardi, Paolo Zirilli, Alessandro Cucchiarelli, Alessandro Marziali, Fabrizio Grisoli, Gianluca De Rossi

Page 3:

Outline

Theory: Customizable parsing architectures
  XDG: eXtended Dependency Graph
  Task oriented parsing design

Practice: System Implementation and Use
  A component-based approach
  An object-oriented platform
    Linguistic data
    Processing modules
  How to use the parser in an application

Demo!!!

Page 4:

Theory

Customizable parsing architectures

Page 5:

Motivation

The Chaos Project unofficially began in ’96 … on the long tradition of ARIOSTO (Basili, Pazienza, Velardi) @ the University of Rome “Tor Vergata” (RTV)

Aim: building robust parsers for Italian and for English that
  use verb sub-categorization (syntactic) lexicons induced from corpora
  can be used in applications

Constraints: use the long tradition @ RTV

“Social” background
  Microtheories for microphenomena
  Language analysis can be reduced to a cascade of modules (e.g., FSA)
  Application-oriented language analysis (e.g., IE)
  Robust (formerly, shallow) parsing approaches

Page 6:

Motivation

Inf(S2)

Inf(S1)

[Mr. Gaubert] [contributed] [real estate] [valued] [at $ 25 million] [to the assets] [of Independent American]

contribute-NP-PP(to)
value-NP-PP(at)

Page 7:

Motivation

Different NLP applications have different performance constraints in terms of:
  Accuracy
  Throughput

Customizable parsing architectures are reusable in different application scenarios if:
  the architectural design supports performance control

Page 8:

Customizable parsing architectures

Modularization
  clarifies the interdependency between different kinds of syntactic information (grammatical/lexicalized)
  allows control of
    throughput, via eliciting modules
    quality, via a clear relation between modules (prerequisites/contributions)

Page 9:

Modular approach

Syntactic parser: $SP(S, K) = I$, or simply $SP(S) = I$

Syntactic parsing module: $P_i(S_i, K_i) = S_{i+1}$, or simply $P_i(S_i) = S_{i+1}$

Modular syntactic parser: $SP = P_n \circ \dots \circ P_2 \circ P_1$
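
A modular parser built as a chain of modules can be sketched in Java as follows. This is only an illustration, not the CHAOS implementation: the class name `ModularParserSketch`, the helpers `module` and `compose`, and the use of a list of annotation labels as the sentence representation are all assumptions made for the example. The knowledge used by each module is left implicit inside it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: a sentence representation is reduced to a list of
// annotation labels, and each parsing module maps one representation to the next.
final class ModularParserSketch {

    // Builds a toy module that copies its input and appends one annotation layer.
    static UnaryOperator<List<String>> module(String layer) {
        return input -> {
            List<String> output = new ArrayList<>(input);
            output.add(layer);
            return output;
        };
    }

    // Chains the modules: the first module in the list is applied first,
    // and each module receives the output of the previous one.
    static UnaryOperator<List<String>> compose(List<UnaryOperator<List<String>>> modules) {
        return input -> {
            List<String> current = input;
            for (UnaryOperator<List<String>> p : modules) {
                current = p.apply(current);
            }
            return current;
        };
    }

    public static void main(String[] args) {
        UnaryOperator<List<String>> sp =
                compose(List.of(module("POS"), module("Chunks"), module("Dependencies")));
        System.out.println(sp.apply(List.of("tokens")));
        // prints [tokens, POS, Chunks, Dependencies]
    }
}
```

Dropping a module from the list passed to `compose` yields a cheaper parser that produces a shallower analysis.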

Page 10:

Modular approach

To push a modular approach we need:
  a suitable annotation scheme
  a classification of the processing modules

Page 11:

A suitable annotation scheme

Requirements:
  Modularization: a stable representation of partially analyzed structures
  Lexicalization: a clear representation of the (semantic) head of a given structure, able to activate the lexicalized rule

Page 12:

XDG: Extended Dependency Graph

XDG combines constituency and dependency based formalisms

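XDG = (C, D)
$C = \{(c, t, h) \mid c \subseteq S,\ h \in c\}$, where $t$ is the type of the constituent
$D = \{(c_1, c_2, t) \mid c_1, c_2 \in C\}$, where $t$ is the type of the dependency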

Nice property: it allows persistent ambiguity to be stored (for interpretations projected by the same nodes)

Page 13:

XDG: Extended Dependency Graph

C are constituents
  syntactic head
  potential semantic governor

D are dependencies among constituents
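
A graph of constituents and dependencies of this kind can be held in a small Java structure. The sketch below is an assumption-laden illustration, not the CHAOS object model: the names `XdgSketch`, `Constituent`, `Dependency`, and `attachmentsOf` are hypothetical, the reading of the first constituent of a dependency as governor is assumed, and the head words chosen in `main` are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of XDG = (C, D). All class, field, and method names are
// assumptions made for illustration; they are not the CHAOS API.
final class XdgSketch {

    // A constituent (c, t, h): the words it spans, its type, and its head word.
    static final class Constituent {
        final List<String> words;
        final String type;
        final String head;

        Constituent(List<String> words, String type, String head) {
            if (!words.contains(head)) {
                throw new IllegalArgumentException("the head must be one of the constituent's words");
            }
            this.words = words;
            this.type = type;
            this.head = head;
        }
    }

    // A dependency (c1, c2, t); reading c1 as governor and c2 as dependent is an assumption.
    static final class Dependency {
        final Constituent governor;
        final Constituent dependent;
        final String type;

        Dependency(Constituent governor, Constituent dependent, String type) {
            this.governor = governor;
            this.dependent = dependent;
            this.type = type;
        }
    }

    final List<Constituent> constituents = new ArrayList<>();
    final List<Dependency> dependencies = new ArrayList<>();

    Constituent addConstituent(String type, String head, String... words) {
        Constituent c = new Constituent(Arrays.asList(words), type, head);
        constituents.add(c);
        return c;
    }

    void addDependency(Constituent governor, Constituent dependent, String type) {
        if (!constituents.contains(governor) || !constituents.contains(dependent)) {
            throw new IllegalArgumentException("both constituents must belong to C");
        }
        dependencies.add(new Dependency(governor, dependent, type));
    }

    // Competing attachments of one dependent are stored side by side,
    // which is one way to keep ambiguity persistent in the graph.
    List<Dependency> attachmentsOf(Constituent dependent) {
        List<Dependency> result = new ArrayList<>();
        for (Dependency d : dependencies) {
            if (d.dependent == dependent) {
                result.add(d);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        XdgSketch xdg = new XdgSketch();
        Constituent np = xdg.addConstituent("NPK", "estate", "real", "estate");
        Constituent vp = xdg.addConstituent("VPK", "valued", "valued");
        Constituent pp = xdg.addConstituent("PPK", "million", "at", "$", "25", "million");
        xdg.addDependency(vp, pp, "VP_PP"); // one reading: the PP depends on the verb
        xdg.addDependency(np, pp, "NP_PP"); // competing reading: the PP depends on the noun
        System.out.println(xdg.attachmentsOf(pp).size()); // prints 2
    }
}
```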

Page 14:

Classification of parsing modules

$P_i(XDG_i, K_i) = P_i(XDG_i) = XDG_{i+1}$

The classification is performed according to:
  the type of information K used
  how they manipulate the sentence representation

Page 15:

Task oriented parsing design

Given:
  The NLP application requirements R
  The test-bed T
  A pool of parsing modules PM

The design activity is the search for a combination of the parsing modules in PM that fits R on T

Page 16:

NLP application requirements

Target phenomena: e.g. VP_PP, NP_PP, etc.

Metrics:
  Recall R per sentence
  Precision P per sentence
  F-measure per sentence
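
The three metrics can be computed per sentence by comparing the set of links a parser proposes with the set of links in the test-bed annotation. The sketch below assumes the standard definitions and a balanced F-measure (harmonic mean of precision and recall); the slides do not state the weighting, and the class name `SentenceMetricsSketch` and the string encoding of links are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: per-sentence scores over sets of dependency links,
// each link encoded as a plain string for simplicity.
final class SentenceMetricsSketch {

    // Number of predicted links that also occur in the gold annotation.
    static int correct(Set<String> gold, Set<String> predicted) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(gold);
        return common.size();
    }

    // Fraction of predicted links that are correct (0 when nothing is predicted).
    static double precision(Set<String> gold, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0 : (double) correct(gold, predicted) / predicted.size();
    }

    // Fraction of gold links that were found (0 when the gold set is empty).
    static double recall(Set<String> gold, Set<String> predicted) {
        return gold.isEmpty() ? 0.0 : (double) correct(gold, predicted) / gold.size();
    }

    // Balanced F-measure; the equal weighting of P and R is an assumption.
    static double fMeasure(Set<String> gold, Set<String> predicted) {
        double p = precision(gold, predicted);
        double r = recall(gold, predicted);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<String> gold = Set.of("d1", "d2", "d3", "d4");
        Set<String> predicted = Set.of("d1", "d2");
        System.out.println(precision(gold, predicted));
        System.out.println(recall(gold, predicted));
        System.out.println(fMeasure(gold, predicted));
    }
}
```

In this toy sentence the parser proposes two links, both correct, out of four gold links, so precision is 1.0, recall is 0.5, and the F-measure is about 0.67.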

Page 17:

CHAOS: Levels of Analysis

Levels: POS, Chunks, Clauses, Dependencies

Example: Strategies to use with questions you cannot answer
POS: NNS TO VB IN NNS PRP MD VB
Chunks: NPK VPK PPK NPK VPK

Page 18:

Verb dependencies and Clause Boundaries

Inf(S2)

Inf(S1)

[Mr. Gaubert] [contributed] [real estate] [valued] [at $ 25 million] [to the assets] [of Independent American]

contribute-NP-PP(to)
value-NP-PP(at)

Page 21:

Verb dependencies and Clause Boundaries

The algorithm:

Initial hypotheses:
  Minimal boundaries of the clauses in the sentence
  Derived hierarchy

Until all verbs have been analyzed:
  Take the rightmost verb v not yet analyzed
  Take the lexicalized rules R(v) for the verb v
  Find the dependencies of v
  Augment the clause boundaries
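
The right-to-left loop can be illustrated on the Gaubert sentence with a much simplified Java sketch. It is not the CHAOS algorithm: it assumes that a verb only looks for arguments to its right, that a frame slot is matched by comparing chunk labels, and that a clause boundary is just the index of the rightmost attached chunk. The derived hierarchy is not modeled, and all names (`ClauseBoundarySketch`, `Chunk`, `analyze`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified sketch of verb-driven clause boundary extension.
final class ClauseBoundarySketch {

    static final class Chunk {
        final String text;  // surface form of the chunk
        final String label; // "NP", "PP(to)", ... or "V" for a verbal chunk
        final String lemma; // verb lemma, null for non-verbal chunks

        Chunk(String text, String label, String lemma) {
            this.text = text;
            this.label = label;
            this.lemma = lemma;
        }

        boolean isVerb() {
            return lemma != null;
        }
    }

    // Visits verbs from right to left; each verb claims, for every slot of its
    // frame, the nearest unclaimed chunk to its right that carries that label.
    static List<String> analyze(List<Chunk> chunks, Map<String, List<String>> frames) {
        boolean[] claimed = new boolean[chunks.size()];
        List<String> found = new ArrayList<>();
        for (int v = chunks.size() - 1; v >= 0; v--) {
            Chunk verb = chunks.get(v);
            if (!verb.isVerb()) {
                continue;
            }
            int clauseEnd = v; // minimal boundary: the verb itself
            for (String slot : frames.getOrDefault(verb.lemma, List.of())) {
                for (int j = v + 1; j < chunks.size(); j++) {
                    Chunk c = chunks.get(j);
                    if (!claimed[j] && !c.isVerb() && c.label.equals(slot)) {
                        claimed[j] = true;
                        found.add(verb.text + " -> " + c.text);
                        clauseEnd = Math.max(clauseEnd, j); // augment the boundary
                        break;
                    }
                }
            }
            found.add("clause(" + verb.text + ") ends at chunk " + clauseEnd);
        }
        return found;
    }

    static List<Chunk> exampleChunks() {
        return List.of(
                new Chunk("Mr. Gaubert", "NP", null),
                new Chunk("contributed", "V", "contribute"),
                new Chunk("real estate", "NP", null),
                new Chunk("valued", "V", "value"),
                new Chunk("at $ 25 million", "PP(at)", null),
                new Chunk("to the assets", "PP(to)", null),
                new Chunk("of Independent American", "PP(of)", null));
    }

    static Map<String, List<String>> exampleFrames() {
        return Map.of(
                "contribute", List.of("NP", "PP(to)"),
                "value", List.of("NP", "PP(at)"));
    }

    public static void main(String[] args) {
        for (String line : analyze(exampleChunks(), exampleFrames())) {
            System.out.println(line);
        }
    }
}
```

Because "valued" is processed first and claims "at $ 25 million", the later search for the PP(to) slot of "contributed" skips that chunk and reaches "to the assets". The NP slot of "value" stays unfilled here, since this sketch never looks to the left of a verb.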

Page 22:

Practice

System Implementation and Use

Page 23:

A Computational Framework

Object-oriented backbone
  Objects for the different data
  Objects for the different sub-processes

Linguistic sub-processors as libraries

Coexisting languages: Java, C++, C, Prolog

Page 24:

System implementation

A component-based approach

An object-oriented platform
  Linguistic data
    Textual entities: Text, Paragraphs
    XDG
  Linguistic processors

Page 25:

A Component-based Approach

Advantages:
  Computational efficiency
  Rapid prototyping
  Integration of different technologies
  Easy reuse

Page 26:

Linguistic processors

Page 27:

Linguistic processors

Tokenizer, Complex Tokenizer
Dictionary lookup modules
  Yellow page look-up
  Morphology analyzer
Named Entity Recognition
Part-of-speech tagging
Chunker
Verb shallow analyzer
Shallow analyzer

Page 28:

Linguistic modules

Each process is encapsulated in an object
  initialize(): load lexicons and rules (general or domain specific)
  finalize(): dismiss the process rules and lexicons
  run(): enrich the input with the contributions of the process
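
The three-step life cycle can be pictured as a Java interface with a toy implementation. Everything in this sketch is hypothetical and does not reproduce the CHAOS classes: the interface name `LinguisticModule`, the token-list input, and the two-word lexicon are assumptions. The closing step is named `dispose()` here instead of `finalize()` only to avoid a clash with the deprecated `Object.finalize()` of current Java versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical life-cycle contract for a linguistic processing module.
interface LinguisticModule {
    void initialize();             // load lexicons and rules
    void run(List<String> tokens); // enrich the input in place
    void dispose();                // release lexicons and rules
}

// Toy module: marks the tokens that are found in a tiny in-memory lexicon.
final class ToyLookupModule implements LinguisticModule {
    private Set<String> lexicon; // null until initialize() has been called

    @Override
    public void initialize() {
        lexicon = new HashSet<>(Arrays.asList("compra", "comprano"));
    }

    @Override
    public void run(List<String> tokens) {
        if (lexicon == null) {
            throw new IllegalStateException("initialize() must be called before run()");
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (lexicon.contains(tokens.get(i))) {
                tokens.set(i, tokens.get(i) + "/known");
            }
        }
    }

    @Override
    public void dispose() {
        lexicon = null;
    }
}

final class LinguisticModuleDemo {
    public static void main(String[] args) {
        LinguisticModule module = new ToyLookupModule();
        List<String> tokens = new ArrayList<>(Arrays.asList("Maria", "compra"));
        module.initialize();
        module.run(tokens);
        module.dispose();
        System.out.println(tokens); // prints [Maria, compra/known]
    }
}
```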

Page 29:

Linguistic processors

Microtheories for microphenomena

Each processor implements its own theory:
  It has its own language for describing rules
  It is written in its own programming language

Page 30:

Processor: Yellow page look-up, Morphology analyzer

compra comprare d(a) v.tran.sempl 2.sing.imper.pres ~:u:~
compra comprare d(a) v.tran.sempl 3.sing.ind.pres ~:u:~
comprai comprare d(a) v.tran.sempl 1.sing.ind.pass_rem ~:u:~
comprammo comprare d(a) v.tran.sempl 1.plur.ind.pass_rem ~:u:~
compran comprare d(a) v.tran.sempl 3.plur.ind.pres ~:u:~
comprando comprare d(a) v.tran.sempl geru.pres ~:u:~
comprano comprare d(a) v.tran.sempl 3.plur.ind.pres ~:u:~

Dictionary

Page 31:

Processor: Chunker

…
constituent_class([_cst1, _cst2, _cst3], 'VerFin', _mor, 1, 3) :-
    verb_finite(_cst1),
    verb_to_have(_cst1),
    verb_past_particle(_cst2),
    verb_to_be(_cst2),
    verb_past_particle(_cst3),
    common_morfology(_cst1, _mor).

Rules

Page 32:

Processor: Verb Shallow Analyser

…
pattern(comprare,[
    [(oggetto,Post),(per,Post)],
    [(oggetto,Post),(da,Post),(per,Post)],
    [(oggetto,Post),(a,Post),(per,Post)],
    [(oggetto,Post)]]).
pattern(comprendere,[[(oggetto,Post)],[],[(oggetto,Post)]]).
pattern(comprimere,[[(oggetto,Post)],[(oggetto,Post)]]).
pattern(compromettere,[[(con,Post)],[(oggetto,Post)]]).
pattern(comunicare,[[],
    [(con,Post)],
    [(a,Post)],
    [(oggetto,Post),(a,Post)],
    [(oggetto,Post)]]).

Sub-categorization lexicon

Page 33:

Implemented Italian Shallow Grammar

Constituent Categories
  Part-of-Speech Tags
  Chunk Types

Dependency Categories
  Dependency Categories over Chunk Types

Page 34:

A survival user guide

Stand-alone version: chaosparser -h

Client-server version: chaosserver -h, chaosclient -h

XDG editor and actual GUI: choasgui

Page 35:

Using CHAOS in applications

In Java applications:

ConfigurationHandler.initialize();

ConfigurationHandler.parseKBPropFile("LANGUAGE","KB");

Parser ms = new Parser();

ms.initialize();

In non-Java applications, using one of the possible output forms:
  XDG in XML
  XDG in Prolog
  XDG in QLF (in Prolog)

Page 36:

Perspective

Building a statistical Italian parser
  Increasing the Italian annotated corpora
  Reusing existing corpora: TUT, SITAL, VIT

Page 37:

Tools

XDG editor
DEMO!!!!
Syntactic annotation transformer
