Information Extraction from Text

Upload: thom

Post on 25-Feb-2016

Page 1: Information Extraction from Text

Information Extraction from Text

Page 2: Information Extraction from Text

Why study it?

• It is a particularly complex application.
• It draws on most of the resources used in analysis tasks.
• Studying it therefore gives a good overview of the problems and technologies involved in natural language analysis.

Page 3: Information Extraction from Text

What is Information Extraction from Text?

• Information retrieval (IR): searching for information in texts in response to specific requests.
• Passage retrieval: searching for and finding passages (paragraphs, sentences) within a text that can answer given questions.
• Information extraction (IE): finding information that can fill predefined schemas (templates).
• Question answering: answering general questions posed by a user: IE + IR.
• Text comprehension: modeling how humans understand texts.

Page 4: Information Extraction from Text

Type of questions

• IR
• Passage retrieval
• IE
• Question answering
• Text comprehension

Predefined: fixed aspects of the textual information

Page 5: Information Extraction from Text

What is “Information Extraction”

Filling slots in a database from sub-segments of text.

As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME    TITLE    ORGANIZATION

Page 6: Information Extraction from Text

What is “Information Extraction”

Filling slots in a database from sub-segments of text.

As a task (text → IE → table):

[Same news article as on Page 5]

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Page 7: Information Extraction from Text

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:

[Same news article as on Page 5]

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

aka “named entity extraction”

Page 8: Information Extraction from Text

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:

[Same news article as on Page 5]

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

Page 10: Information Extraction from Text

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:

[Same news article as on Page 5]

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
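The four stages named on these slides can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how a real IE system works: the hand-made `LEXICON` stands in for a trained segmenter and classifier, the sentences are abridged from the news article, association simply groups the mentions found in one sentence, and clustering maps each mention to the longest lexicon entry containing its words. All function names are hypothetical.

```python
import re

# Assumption: a tiny hand-made lexicon replaces a learned model.
LEXICON = {
    "Microsoft Corporation": "ORGANIZATION",
    "Microsoft": "ORGANIZATION",
    "Free Software Foundation": "ORGANIZATION",
    "Bill Gates": "NAME",
    "Gates": "NAME",
    "Bill Veghte": "NAME",
    "Richard Stallman": "NAME",
    "CEO": "TITLE",
    "VP": "TITLE",
    "founder": "TITLE",
}

# Sentences abridged from the example article.
SENTENCES = [
    "For years, Microsoft Corporation CEO Bill Gates railed against the "
    "economic philosophy of open-source software.",
    "Gates himself says Microsoft will gladly disclose its crown jewels.",
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP.',
    "Richard Stallman, founder of the Free Software Foundation, countered saying",
]

# Longest entries first, so "Microsoft Corporation" wins over "Microsoft".
_PATTERN = re.compile("|".join(
    r"\b" + re.escape(m) + r"\b"
    for m in sorted(LEXICON, key=len, reverse=True)))


def segment(sentence):
    """Segmentation: find the text spans that are candidate slot fillers."""
    return [m.group(0) for m in _PATTERN.finditer(sentence)]


def classify(mentions):
    """Classification: label each span as NAME, TITLE or ORGANIZATION."""
    return [(m, LEXICON[m]) for m in mentions]


def associate(labeled):
    """Association: link the spans of one sentence into a single record."""
    slots = {label: mention for mention, label in labeled}
    if {"NAME", "TITLE", "ORGANIZATION"} <= set(slots):
        return slots["NAME"], slots["TITLE"], slots["ORGANIZATION"]
    return None  # incomplete record, e.g. a sentence without a TITLE


def canonical(mention):
    """Clustering: map a mention to the longest same-label entry containing it."""
    label = LEXICON[mention]
    words = set(mention.split())
    same_entity = [m for m, l in LEXICON.items()
                   if l == label and words <= set(m.split())]
    return max(same_entity, key=len)


def extract(sentences):
    rows = []
    for sentence in sentences:
        record = associate(classify(segment(sentence)))
        if record is not None:
            rows.append(tuple(canonical(m) for m in record))
    return rows


for row in extract(SENTENCES):
    print(row)
```

The three printed rows correspond to the rows of the NAME / TITLE / ORGANIZATION table, with both Microsoft mentions merged under the single name "Microsoft Corporation" by the clustering step.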

Page 11: Information Extraction from Text

An example: FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990
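The two linked templates can be viewed as a small data structure. The sketch below uses Python dataclasses; the class and field names are hypothetical (they mirror the slot names on the slide), and the amount is taken from the source sentence ("20 million new Taiwan dollars").

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Activity:
    activity: str
    company: str
    product: str
    start_date: str


@dataclass
class TieUp:
    relationship: str
    entities: List[str]
    joint_venture_company: str
    activity: Activity      # the "Activity: ACTIVITY-1" slot points to another template
    amount_ntd: int         # amount in new Taiwan dollars


activity_1 = Activity(
    activity="PRODUCTION",
    company="Bridgestone Sports Taiwan Co.",
    product="iron and 'metal wood' clubs",
    start_date="DURING: January 1990",
)

tie_up_1 = TieUp(
    relationship="TIE-UP",
    entities=["Bridgestone Sports Co.", "a local concern", "a Japanese trading house"],
    joint_venture_company="Bridgestone Sports Taiwan Co.",
    activity=activity_1,
    amount_ntd=20_000_000,  # "20 million new Taiwan dollars" in the text
)

print(tie_up_1)
```

Modeling the activity as a nested object makes the cross-reference between the two templates explicit: the producing company of ACTIVITY-1 is the joint venture company of TIE-UP-1.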

Page 15: Information Extraction from Text

How FASTUS works

1. Complex words and proper names, e.g. “set up”, “new Taiwan dollars”
2. Basic phrases: noun groups, verb groups, particles, e.g. “a Japanese trading house”, “had set up”
3. Complex phrases, e.g. “production of 20,000 iron and metal wood clubs”
4. Relevant events: building simple templates, e.g. [company] [set up] [Joint-Venture] with [company]
5. Merging templates when they carry information about the same event
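The idea of a cascade, where early stages turn text into typed phrases and a later stage matches event patterns over those types, can be sketched in a few lines of Python. This is a heavily simplified illustration, not the FASTUS implementation: the phrase patterns in `PHRASES` are hand-written for this one sentence and collapse stages 1 to 3 into a single step, template merging (stage 5) is left out, and all names are hypothetical.

```python
import re

SENTENCE = ("Bridgestone Sports Co. said Friday it had set up a joint venture "
            "in Taiwan with a local concern and a Japanese trading house "
            "to produce golf clubs to be supplied to Japan.")

# Stages 1-3, collapsed: assumed phrase patterns, one per phrase type.
PHRASES = [
    ("COMPANY", r"Bridgestone Sports Co\.|a local concern|a Japanese trading house"),
    ("SET_UP", r"(?:had )?set up"),
    ("JOINT_VENTURE", r"a joint venture"),
    ("WITH", r"\bwith\b"),
    ("AND", r"\band\b"),
]
CHUNKER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PHRASES))

# Stage 4: the event pattern [company][set up][Joint-Venture] with [company],
# extended here to allow further companies joined by "and".
EVENT = re.compile(r"COMPANY SET_UP JOINT_VENTURE WITH COMPANY(?: AND COMPANY)*")


def chunk(sentence):
    """Turn the sentence into a sequence of (phrase_type, text) pairs."""
    return [(m.lastgroup, m.group(0)) for m in CHUNKER.finditer(sentence)]


def tie_up(chunks):
    """Match the event pattern over phrase types and build a simple template."""
    types = " ".join(t for t, _ in chunks)
    match = EVENT.search(types)
    if match is None:
        return None
    start = types[:match.start()].count(" ")    # index of first matched chunk
    length = match.group(0).count(" ") + 1      # number of matched chunks
    matched = chunks[start:start + length]
    return {
        "Relationship": "TIE-UP",
        "Entities": [text for t, text in matched if t == "COMPANY"],
    }


print(tie_up(chunk(SENTENCE)))
```

Because the event pattern only sees phrase types, the same pattern would match any sentence whose phrases reduce to the same type sequence, which is the point of layering the stages.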

Page 17: Information Extraction from Text

Another example: a wrong template

… Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. …

Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X? Country: German
(2) Name: Y? Country: China

Wrong template

Page 18: Information Extraction from Text

Correct template

A German vehicle-firm executive was stabbed to death …
… Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. …

Crime-Type: Murder
Type: Stabbing
The killed:
Name: Jurgen Pfrang
Age: 51
Profession: Deputy general manager
Location: Nanjing, China

Page 19: Information Extraction from Text

Who performs the interpretation?

(1) IR: user
(2) Passage retrieval: user
(3) IE: system
(4) Question answering: system
(5) Text comprehension: system

Page 20: Information Extraction from Text

[Diagram: IR system. Labels: collection of texts, characterization of the texts, request]

Page 21: Information Extraction from Text

[Diagram: IR system. Labels: characterization of the texts, request, interpretation, knowledge, collection of texts]

Page 22: Information Extraction from Text

[Diagram: passage retrieval and IR. Labels: characterization of the texts, request, interpretation, knowledge, collection of texts]

Page 23: Information Extraction from Text

[Diagram: IE system. Labels: characterization of the texts, queries, interpretation, knowledge, texts, template, natural language processing, collection of texts, passage retrieval, IR]

Page 24: Information Extraction from Text

[Diagram: IE system. Labels: interpretation, knowledge, texts, templates]

Page 25: Information Extraction from Text

[Diagram: IE. Labels: interpretation, knowledge, texts, templates, predefined]

General approach to natural language processing/understanding

IE: a pragmatic approach to NLP

Page 26: Information Extraction from Text

Performance evaluation

(1) IR

(2) Passage retrieval

(3) IE

(5) Text comprehension

(4) Question answering

Clear methodology

Unclear methodology

Clear methodology

Fairly vague methodology

Very vague methodology

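Page 27: Information Extraction from Text

[Diagram: given a question, within the set of documents the correct documents (N) and the retrieved documents (M) overlap in C]

N: correct documents
M: retrieved documents
C: retrieved documents that are correct

Precision: $P = \frac{C}{M}$

Recall: $R = \frac{C}{N}$

F-Value: $F = \frac{2 P \cdot R}{P + R}$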
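These three measures are easy to compute once the correct and retrieved documents are represented as sets. The following Python sketch assumes exactly that; the function name and the document identifiers are made up for illustration, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision_recall_f(correct, retrieved):
    """Return (P, R, F) for a set of correct and a set of retrieved documents."""
    c = len(correct & retrieved)      # C: retrieved documents that are correct
    m = len(retrieved)                # M: retrieved documents
    n = len(correct)                  # N: correct documents
    p = c / m if m else 0.0
    r = c / n if n else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: 4 correct documents, 3 retrieved, 2 in common.
correct = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r, f = precision_recall_f(correct, retrieved)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Here C = 2, M = 3 and N = 4, so P = 2/3, R = 1/2 and F = 4/7.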

Page 28: Information Extraction from Text

[Diagram: given a question, within the set of documents the correct templates (N) and the retrieved templates (M) overlap in C]

N: correct templates
M: retrieved templates
C: correct templates that were retrieved

Precision: $P = \frac{C}{M}$

Recall: $R = \frac{C}{N}$

F-Value: $F = \frac{2 P \cdot R}{P + R}$

The whole picture is more complicated because templates may be only partially filled.
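One way to handle partially filled templates, assumed here for illustration and not stated on the slide, is to count individual slot fills instead of whole templates: N becomes the number of slots in the reference template, M the number of slots the system filled, and C the number of filled slots whose value matches. The function name and the system output below are hypothetical.

```python
def slot_scores(gold, system):
    """Slot-level (P, R, F) for one reference template and one system template."""
    gold_fills = set(gold.items())
    system_fills = set(system.items())
    c = len(gold_fills & system_fills)   # slots filled with the correct value
    m = len(system_fills)                # slots the system filled
    n = len(gold_fills)                  # slots in the reference template
    p = c / m if m else 0.0
    r = c / n if n else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Reference: the ACTIVITY-1 template from the FASTUS example.
gold = {
    "Activity": "PRODUCTION",
    "Company": "Bridgestone Sports Taiwan Co.",
    "Product": "iron and 'metal wood' clubs",
    "Start Date": "DURING: January 1990",
}
# Hypothetical system output: one wrong slot, one missing slot.
system = {
    "Activity": "PRODUCTION",
    "Company": "Bridgestone Sports Taiwan Co.",
    "Product": "golf balls",
}

p, r, f = slot_scores(gold, system)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

The wrong Product value lowers precision, while the missing Start Date lowers only recall, so a partially filled template gets partial credit instead of counting as entirely wrong.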