Linked Data for Czech Legislation - 2nd year of our project


Post on 24-Jan-2015


DESCRIPTION

The slides present our approach to representing legal documents as Linked Data. We extract various kinds of structured data from semi-structured legal documents using natural language processing techniques and represent them in RDF according to Linked Data principles. We show how the resulting database, which consists of various kinds of legal documents in RDF linked to other kinds of data, can be queried using SPARQL.

TRANSCRIPT

Page 1: Linked Data for Czech Legislation - 2nd year of our project

Linked Data for Czech Legislation

Martin Nečaský, Ph.D. [email protected]

Matematicko-fyzikální fakulta Univerzity Karlovy

http://www.xrg.cz

http://www.opendata.cz


Project Motivation

There are many documents/entities published by public bodies which refer to particular legal acts or their parts.

People need to find which documents/entities refer to what acts or their parts.

[Diagram: acts and the kinds of documents/entities that refer to them: court decisions, inspection results, agendas, permissions]


Project Motivation

Legal acts define concepts and relationships between them.

People need to find relationships of a given concept with other concepts. They also need to refer to that concept from their documents/entities.

[Diagram: the concept "Accounting entity" with hasDefinition and hasObligation relationships, tied to the Accounting Act]


Data processing workflow


Project Objectives

1. Find a common data model (language) that can represent all this data
– Publish the data on the web in a standard way so that other data sources on the web can link to it

2. Get consolidated expressions of Czech acts
– We can buy them or reconstruct them on our own.
– We reconstructed them! (Many thanks to Charles University student Karel Klíma.)


Project Objectives

3. Use machine-learning methods to recognize references to acts that appear in documents
– Currently, recognition works for court decisions (by our Ph.D. student Vincent Kríž)

4. Use NLP methods to extract concepts and the relationships between them from consolidated expressions of Czech acts, with the following constraints:
– Only within a specified domain
– An initial list of important concepts, constructed manually, serves as input


UFAL + KSI (+ students) cooperation

– Gathering data (code of law, court decisions, …)
– Consolidated acts
– Extraction of act references in text
– Extraction of concepts and relationships
– Representation in a common data model
– Linking with other data sources
– Application development


Common Data Model – Linked Data


RDF + Linked Data principles

1. Use URLs to identify your things.

2. When someone looks up your URL of an entity, provide useful data about the entity.

3. Use RDF as a data format, enable querying with SPARQL.

4. Provide links to other related things as part of the provided data, also in RDF.


Common Data Model - URLs

Act no. 235/2004 (Value Added Tax Act)
http://linked.opendata.cz/resource/legislation/cz/act/2004/235-2004

When a client requests this URL over HTTP, data about Act no. 235/2004 is returned in RDF.

The RDF data model has several serialization formats; the format returned depends on the request (content negotiation is applied).
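As a minimal sketch of content negotiation, the Python snippet below prepares (but does not send) HTTP requests for the act's URL with different Accept headers. The media types are standard RDF serializations; which of them the server actually supports is an assumption, and `rdf_request` is a hypothetical helper name.

```python
from urllib.request import Request

ACT_URL = "http://linked.opendata.cz/resource/legislation/cz/act/2004/235-2004"


def rdf_request(url, media_type="text/turtle"):
    """Build a GET request that asks for one RDF serialization."""
    # The Accept header is how the client states its preferred format.
    return Request(url, headers={"Accept": media_type})


# Same resource URL, two requested serializations (server support assumed).
turtle_request = rdf_request(ACT_URL)
rdfxml_request = rdf_request(ACT_URL, "application/rdf+xml")
```

Passing either object to `urllib.request.urlopen` would perform the actual lookup.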


Common Data Model - SPARQL

All sections of Act no. 235/2004

PREFIX frbr: <http://purl.org/vocab/frbr/core#>

SELECT DISTINCT ?section
WHERE {
  ?section frbr:partOf+ <http://linked.opendata.cz/resource/legislation/cz/act/2004/235-2004> ;
           a frbr:Work .
}
ORDER BY ?section
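To make the property path frbr:partOf+ concrete, here is a small plain-Python sketch of "one or more partOf steps" over an in-memory list of (child, parent) pairs. The identifiers such as `part-1` are hypothetical, and the `a frbr:Work` type check from the query is not modelled.

```python
def part_of_plus(part_of, root):
    """Return every node that reaches `root` via one or more partOf links."""
    children = {}
    for child, parent in part_of:
        children.setdefault(parent, []).append(child)
    found, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in found:  # guard against revisiting nodes
                found.add(child)
                stack.append(child)
    return sorted(found)


# Hypothetical fragment of an act's structure as (child, parent) pairs.
PART_OF = [
    ("part-1", "act"),
    ("section-1", "part-1"),
    ("section-2", "part-1"),
    ("paragraph-1", "section-1"),
    ("section-9", "other-act"),
]
sections = part_of_plus(PART_OF, "act")
```

The root itself is not returned, which mirrors the difference between `+` (one or more steps) and `*` (zero or more steps).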


Common Data Model - SPARQL

How many consolidated versions does each section of Act no. 235/2004 have?

PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?section (COUNT(DISTINCT(?text)) AS ?cnt)
WHERE {
  ?expression frbr:realizationOf ?section ;
              dcterms:description ?text .
  ?section frbr:partOf+ <http://linked.opendata.cz/resource/legislation/cz/act/2004/235-2004> .
}
GROUP BY ?section
ORDER BY DESC(?cnt)


Common Data Model - SPARQL

Are there any court decisions citing Act no. 235/2004 or any of its sections?

PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX sao: <http://salt.semanticauthoring.org/ontologies/sao#>
PREFIX sdo: <http://salt.semanticauthoring.org/ontologies/sdo#>

SELECT DISTINCT ?decision ?decisionTitle ?sectionOrAct
WHERE {
  ?annotation sao:hasTopic ?sectionOrAct .
  ?sectionOrAct frbr:partOf* <http://linked.opendata.cz/resource/legislation/cz/act/2004/235-2004> .
  ?decisionExpr sdo:hasSection/sdo:hasParagraph/sdo:hasTextChunk/sdo:hasAnnotation ?annotation ;
                frbr:realizationOf ?decision .
  ?decision dcterms:title ?decisionTitle .
}
ORDER BY ?sectionOrAct


Common Data Model - SPARQL

What kinds of entities/documents are linked to Act no. 235/2004?

SELECT DISTINCT ?p ?t
WHERE {
  ?s ?p <http://linked.opendata.cz/resource/legislation/cz/act/2004/235-2004> ;
     a ?t .
}


Linked Data Representation of Extracted Concepts and Relationships


Representation of Concepts and Relationships

Example sentence: "K návrhu je navrhovatel povinen připojit listiny, kterých se v návrhu dovolává." ("The petitioner is obliged to attach to the petition the documents that the petition invokes.")

[Diagram: the sentence is split into three lingv:TextChunk nodes labelled subject ("navrhovatel", petitioner), predicate ("povinen", obliged) and object ("připojit listiny, kterých se v návrhu dovolává", to attach the documents invoked in the petition). Via lingv:subject and lingv:object the chunks are tied to lexc:Concept nodes "Navrhovatel (under Act no. NN/YYYY)" and "Připojit listiny, kterých se v návrhu dovolává (under Act no. NN/YYYY)", with lexc:hasObligation between concepts. Further labels: lexc:hasDefinition pointing to the extracted definition text, another lexc:hasObligation to a further lexc:Concept "… (under Act no. NN/YYYY)", frbr:partOf links to a section (§ C) and to the act (Act no. NN/YYYY), and court decisions (Judikáty).]


Legal Concepts Ontology

Each extracted concept is represented as an instance of class lexc:Concept.

[Diagram: classes lexc:Concept, lexc:ConceptVersion, frbr:Expression, lex:Act and rdfs:Literal; properties frbr:partOf, lexc:hasObligation, lexc:hasRight and lexc:hasDefinition]


Concept “Zaměstnavatel” (employer)

http://linked.opendata.cz/resource/legislation/cz/expression/2006/262-2006/version/cz/2006-04-21/concept/ucetni-pojem/zaměstnavatel

(see http://internal.opendata.cz:8890/describe/?url=http://linked.opendata.cz/resource/legislation/cz/expression/2006/262-2006/version/cz/2006-04-21/concept/ucetni-pojem/zam%C4%9Bstnavatel)



Obligations of “Zaměstnavatel”

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX lingv: <http://purl.org/lingv/ontology#>
PREFIX lexc: <http://purl.org/lex/ontology/concepts#>

SELECT ?obligationChunk ?obligationLabel
WHERE {
  <http://linked.opendata.cz/resource/legislation/cz/expression/2006/262-2006/version/cz/2006-04-21/concept/ucetni-pojem/zaměstnavatel>
      lexc:hasObligation ?obligation .
  ?obligation ^oa:hasBody/oa:hasTarget ?obligationChunk .
  ?obligationChunk lingv:hasForm/skos:prefLabel ?obligationLabel .
}



Linguistic Ontology

[Diagram: classes lexc:ConceptVersion, oa:Annotation, lingv:TextChunk, lingv:Form and lingv:DependencyTree; properties oa:hasBody, oa:hasTarget, lingv:hasForm, lingv:hasLemma and lingv:hasTree]


Next steps

● Improve NLP extraction (see the next part of the presentation)
– queries
● Better linking of concepts
– to particular sections of acts
– to other data sources (e.g., life situations, agendas of public bodies, fines imposed by public bodies, etc.)
● Develop web applications which
– enable users to work with the extracted concepts and relationships
– enable users to explore links between extracted concepts and other data sources


Vincent Kríž, Barbora Hladká

RExtractor – Entity Relation Extraction from Unstructured Texts

Intelligent library (INTLIB, TA02010182)

Seminar of formal linguistics, 2014-05-12

Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague
Czech Republic

{kriz,hladka}@ufal.mff.cuni.cz
http://ufal.mff.cuni.cz/intlib


Kríž, Hladká: RExtractor – Entity Relation Extraction from Unstructured Texts SFL, 2014-05-12

Motivation

Typical search approaches
– full-text search
– metadata search

Our approach
– building a knowledge base
– semantic representation of documents
– entities and their relations
– represented in the Resource Description Framework (RDF)


Data processing workflow


RExtractor Architecture

● Domain independent


Conversion Component

● converts various input formats into a unified representation (XML)


NLP Component

● Prague Dependency Treebank framework
● Tools
– segmentation & tokenization
– lemmatization & morphology
– syntactic parsing
– deep syntactic parsing
– Treex
● http://ufal.mff.cuni.cz/pdt3.0
● http://ufal.mff.cuni.cz/treex


Entity Detection Component

● Database of Entities
– entities specified by domain experts

● PML-TQ

● http://ufal.mff.cuni.cz/tools/pml-tq


Relation Extraction Component

● Database of Queries
– queries formulated by domain experts
– their formulation in the form of PML-TQ queries on dependency trees
● RDF-ready output:

Subject | Predicate | Object
Entity | hasToCreate | Something
Accounting units | create | fixed items
Accounting units | create | reserves

● Example of user query: accounting units' obligations
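As a rough illustration of what "RDF-ready" can mean, the sketch below turns the subject/predicate/object rows into N-Triples lines. The `http://example.org/` namespace and the `slug` and `to_ntriple` helpers are hypothetical; the slides do not show the project's real identifiers or serialization code.

```python
BASE = "http://example.org/"  # hypothetical namespace, not the project's


def slug(text):
    """Lower-case a label and join its words with hyphens."""
    return "-".join(text.lower().split())


def to_ntriple(subject, predicate, obj):
    """Serialize one extracted relation as a single N-Triples line."""
    return "<{0}entity/{1}> <{0}relation/{2}> <{0}entity/{3}> .".format(
        BASE, slug(subject), slug(predicate), slug(obj)
    )


# Rows taken from the slide's example table.
rows = [
    ("Accounting units", "create", "fixed items"),
    ("Accounting units", "create", "reserves"),
]
lines = [to_ntriple(*row) for row in rows]
```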


Case study on legislative domain


Legal texts
– specialized texts operating in legal settings
– they should transmit legal norms to their recipients
– they need to be clear, explicit and precise

Sentences
– simple sentences are very rare
– usually long and very complex

Legal texts are “generally considered very difficult to read and understand” (Tiersma, 2010).


RExtractor Architecture

Adaptation for legislative domain


Conversion component

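HLAVA I
ÚVODNÍ USTANOVENÍ

§ 1
Předmět úpravy

Tato vyhláška zapracovává příslušné předpisy Evropské unie a upravuje: a) způsob vymezení hydrogeologických rajonů, vymezení útvarů podzemních vod, b) způsob hodnocení stavu podzemních vod a c) náležitosti programů zjišťování a hodnocení stavu podzemních vod.

(English gloss of the Czech input: Title I, Introductory provisions. § 1, Subject matter. "This decree incorporates the relevant European Union regulations and governs: a) the method of delimiting hydrogeological zones and groundwater bodies, b) the method of assessing groundwater status, and c) the requirements for programmes for monitoring and assessing groundwater status.")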

<head id="11" label="HLAVA I">
  <title>ÚVODNÍ USTANOVENÍ</title>
  <section id="12" label="§ 1">
    <title>Předmět úpravy</title>
    <text>Tato vyhláška zapracovává příslušné předpisy Evropské unie a upravuje:</text>
    <section id="13" label="a)">
      <text>způsob vymezení hydrogeologických rajonů, vymezení útvarů podzemních vod,</text>
    </section>
    <section id="14" label="b)">
      <text>způsob hodnocení stavu podzemních vod a</text>
    </section>
    <section id="15" label="c)">
      <text>náležitosti programů zjišťování a hodnocení stavu podzemních vod.</text>
    </section>
  </section>
</head>
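The conversion step can be sketched in a few lines of Python. This is not RExtractor's converter: it assumes the input is already split into lines, handles a single "HLAVA" heading, and numbers ids from 1 rather than reproducing the ids on the slide. The regular expressions and the `convert` function are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET
from itertools import count

HEAD = re.compile(r"^HLAVA\s+[IVXLC]+$")    # e.g. "HLAVA I"
PARAGRAPH = re.compile(r"^§\s*\d+[a-z]?$")  # e.g. "§ 1"
ITEM = re.compile(r"^([a-z]\))\s+(.+)$")    # e.g. "a) ..."


def convert(lines):
    """Build a nested <head>/<section> tree from pre-split lines."""
    ids = count(1)  # ids are arbitrary in this sketch
    head = section = awaiting_title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        item = ITEM.match(line)
        if awaiting_title is not None:
            # The line after a heading or "§ n" label is its title.
            ET.SubElement(awaiting_title, "title").text = line
            awaiting_title = None
        elif HEAD.match(line):
            head = ET.Element("head", id=str(next(ids)), label=line)
            awaiting_title = head
        elif PARAGRAPH.match(line):
            section = ET.SubElement(head, "section", id=str(next(ids)), label=line)
            awaiting_title = section
        elif item:
            sub = ET.SubElement(section, "section", id=str(next(ids)), label=item.group(1))
            ET.SubElement(sub, "text").text = item.group(2)
        else:
            ET.SubElement(section, "text").text = line
    return head


SAMPLE = [
    "HLAVA I",
    "ÚVODNÍ USTANOVENÍ",
    "§ 1",
    "Předmět úpravy",
    "Tato vyhláška zapracovává příslušné předpisy Evropské unie a upravuje:",
    "a) způsob vymezení hydrogeologických rajonů, vymezení útvarů podzemních vod,",
    "b) způsob hodnocení stavu podzemních vod a",
    "c) náležitosti programů zjišťování a hodnocení stavu podzemních vod.",
]
tree = convert(SAMPLE)
xml_text = ET.tostring(tree, encoding="unicode")
```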


NLP Component

Corpus of Czech legal texts (CCLT)
– The Accounting Act (563/1991 Coll.)
– Decree on Double-entry Accounting for undertakers (500/2002 Coll.)
– automatically parsed, then manually checked
  ● 1,133 manually annotated a-trees
  ● 35,085 tokens
  ● Credit to Zdeňka Urešová


NLP Component

Corpus of Czech legal texts (CCLT)
– enumerations and lists as one tree
– manual annotation guidelines
  ● split sentence according to formal markers
  ● use links for dependencies between partial trees
– automatic procedure merges partial annotations into a final tree

Pipeline visualization available online at ufal.mff.cuni.cz/intlib


NLP Component

Automatic parsers for Czech
– trained on newspaper texts
– we verify whether the parser trained on newspaper texts can be used as is or whether modifications are needed
– MST parser

Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič (2005): Non-projective Dependency Parsing using Spanning Tree Algorithms. In: Proceedings of HLT/EMNLP, Vancouver, British Columbia.


NLP Component

Sentence splitting
– We replace long lists and enumerations with several shorter sentences

Original sentence:
(2) Veřejným rozpočtem se pro účely tohoto zákona rozumí a) státní rozpočet b) rozpočet státního fondu, c) rozpočet Evropské unie, nebo d) rozpočet, o němž to stanoví zákon.
("(2) For the purposes of this Act, a public budget means a) the state budget, b) the budget of a state fund, c) the budget of the European Union, or d) a budget designated as such by law.")

New sentences:
Veřejným rozpočtem se pro účely tohoto zákona rozumí státní rozpočet.
Veřejným rozpočtem se pro účely tohoto zákona rozumí rozpočet státního fondu.
Veřejným rozpočtem se pro účely tohoto zákona rozumí rozpočet Evropské unie.
Veřejným rozpočtem se pro účely tohoto zákona rozumí rozpočet, o němž to stanoví zákon.
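The splitting idea can be sketched with regular expressions. This is an illustrative approximation, not the project's procedure: it assumes items are marked "a)", "b)", … and that the text before the first marker is a shared stem; the names `NUMBERING`, `MARKER`, `TRAILING` and `split_enumeration` are hypothetical.

```python
import re

NUMBERING = re.compile(r"^\(\d+\)\s*")   # leading paragraph number, e.g. "(2) "
MARKER = re.compile(r"\s*\b[a-z]\)\s*")  # item markers "a)", "b)", ...
# Trailing punctuation and connectives ("nebo" = or, "a" = and) of an item.
TRAILING = re.compile(r"[\s,;.]*(?:\b(?:nebo|a)\b)?[\s,;.]*$")


def split_enumeration(sentence):
    """Turn 'stem a) x, b) y, nebo c) z.' into one sentence per item."""
    body = NUMBERING.sub("", sentence.strip())
    stem, *items = MARKER.split(body)
    stem = stem.rstrip(" :")
    return [f"{stem} {TRAILING.sub('', item)}." for item in items]


ORIGINAL = (
    "(2) Veřejným rozpočtem se pro účely tohoto zákona rozumí a) státní rozpočet "
    "b) rozpočet státního fondu, c) rozpočet Evropské unie, nebo "
    "d) rozpočet, o němž to stanoví zákon."
)
new_sentences = split_enumeration(ORIGINAL)
```

A sentence without item markers yields an empty list, so a caller would keep the original sentence in that case.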


NLP Component

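Re-tokenization

Účetní jednotky tvoří opravné položky podle ustanovení § 16, 26, 31, 55 a 57 a neoceňují majetek podle § 27, § 14, 39, § 51 až 55, § 58, 60 a 69
("Accounting units create adjusting entries under the provisions of § 16, 26, 31, 55 and 57 and do not value assets under § 27, § 14, 39, § 51 to 55, § 58, 60 and 69")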
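The slide does not spell out the re-tokenization rule. One plausible reading, assumed here, is that a whole list of section references is collapsed into a single token so that the parser sees it as one unit. The `REFERENCE` pattern and the `retokenize` function are hypothetical and cover only the forms that occur in the example ("," separators, "a" = and, "až" = to).

```python
import re

# A "§ n" followed by further numbers joined by ",", "a" (and) or "až" (to).
REFERENCE = re.compile(r"§\s*\d+(?:\s*(?:,|až|a)\s*(?:§\s*)?\d+)*")


def retokenize(sentence):
    """Whitespace-tokenize, but keep each reference list as one token."""
    tokens, position = [], 0
    for match in REFERENCE.finditer(sentence):
        tokens.extend(sentence[position:match.start()].split())
        tokens.append(match.group(0))
        position = match.end()
    tokens.extend(sentence[position:].split())
    return tokens


SENTENCE = (
    "Účetní jednotky tvoří opravné položky podle ustanovení § 16, 26, 31, 55 a 57 "
    "a neoceňují majetek podle § 27, § 14, 39, § 51 až 55, § 58, 60 a 69"
)
tokens = retokenize(SENTENCE)
```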


Entity Detection Component

Entities in CCLT
– Accounting subdomain
– Entities manually annotated by Sysnet, Ltd.
  ● Decree on Double-entry Accounting for undertakers (500/2002 Coll.)

Sample


Entity Detection Component

Initializing the Database of Entities (DBE) with entities from CCLT
– Each (unique) entity is parsed automatically by the MST parser
– An automatic procedure takes an entity dependency tree and creates a PML-TQ query


Entity Detection Component

Experiment
– identify entities in gold standard trees in CCLT
  ● with re-tokenized tokens and (very) long sentences
– identify entities in trees created by MST
  ● with re-tokenized tokens and split sentences

Results
– high number of false positives
– the automatic parser has little influence on detection

Parsing method | Extracted | TP | FP | FN | Precision (%) | Recall (%)
Manual | 16428 | 9549 | 6879 | 628 | 58.1 | 93.8
Automatic | 16160 | 9278 | 6882 | 838 | 57.4 | 91.7
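For reference, the precision and recall figures follow from the TP, FP and FN counts in the usual way. The short sketch below recomputes them; the function name `precision_recall` is ours.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) in percent, rounded to one decimal."""
    precision = 100.0 * tp / (tp + fp)  # share of extracted items that are correct
    recall = 100.0 * tp / (tp + fn)     # share of gold items that were found
    return round(precision, 1), round(recall, 1)


# Counts from the table above.
manual = precision_recall(tp=9549, fp=6879, fn=628)
automatic = precision_recall(tp=9278, fp=6882, fn=838)
```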


Relation Extraction Component

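Types of relations
– Definitions (D) – entities are defined or explained
  ● Náhradním ubytováním se rozumí byt o jedné místnosti nebo pokoj ve svobodárně nebo podnájem v zařízené nebo nezařízené části bytu jiného nájemce. ("Substitute accommodation means a one-room flat, a room in a hostel for single persons, or a sublease in a furnished or unfurnished part of another tenant's flat.")
– Obligations (O) – an entity is obligated to do something
  ● K návrhu je navrhovatel povinen připojit listiny, kterých se v návrhu dovolává. ("The petitioner is obliged to attach to the petition the documents that the petition invokes.")
– Rights (R) – an entity has the right to do something
  ● Nabyvatel může uplatňovat nárok z odpovědnosti za vady u soudu jen tehdy, vytkl-li vady bez zbytečného odkladu po té, kdy měl možnost věc prohlédnout. ("The acquirer may assert a claim arising from liability for defects in court only if they gave notice of the defects without undue delay after having had the opportunity to inspect the item.")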


Relation Extraction Component

Manual design of queries
● Strategy: cover the maximum of relations with the minimum of queries
● tree query expert
– observes typical constructions for a given type of relation
– designs a query for the most frequent construction
– goes through the matches and redesigns the query if needed


Relation Extraction Component

Query design & evaluation on CCLT
● Query design
– on The Accounting Act (563/1991 Coll.)
– 5 queries for Definitions
– 4 queries for Rights
– 2 queries for Obligations
● Evaluation
– on Decree on Double-entry Accounting for undertakers (500/2002 Coll.)


Relation Extraction Component

Results

Measure | D | R | O | Total
# of queries | 5 | 4 | 2 | 11
Gold standard | 97 | 308 | 62 | 467
Extracted | 70 | 255 | 41 | 366
True positive | 53 | 206 | 36 | 295
False negative | 44 | 102 | 26 | 172
False positive | 17 | 49 | 5 | 71
Precision (%) | 75.7 | 80.8 | 87.8 | 80.6
Recall (%) | 54.6 | 66.9 | 58.1 | 63.2


Relation Extraction Component

Error analysis

Results

– errors in automatic parsing

– query design

Error | # of errors | Ratio
Parser | 145 | 59.7%
Query | 93 | 38.3%
Entity | 5 | 2.1%
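The ratios are each error source's share of all analysed errors. A quick check in Python also shows that the 243 errors equal the sum of false negatives (172) and false positives (71) reported in the results for the relation extraction evaluation; the variable names are ours.

```python
# Error counts from the table above.
errors = {"Parser": 145, "Query": 93, "Entity": 5}
total_errors = sum(errors.values())

# Share of each error source, in percent with one decimal.
shares = {source: round(100.0 * n / total_errors, 1) for source, n in errors.items()}

# False negatives plus false positives from the relation extraction results.
fn_plus_fp = 172 + 71
```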


Relation Extraction Component

Experiment with more data

– 28 laws from accounting subdomain

– 27,808 sentences

– 745,137 tokens

D | R | O
D1: 36 | R1: 240 | O1: 183
D2: 287 | R2: 470 | O2: 37
D3: 35 | R3: 127 |
D4: 466 | R4: 6 |
D5: 46 | |
Total: 1580 | Total: 843 | Total: 220


Relation Extraction Component

Query example – Definition
– Náhradním ubytováním se rozumí byt o jedné místnosti nebo pokoj ve svobodárně nebo podnájem v zařízené nebo nezařízené části bytu jiného nájemce.


Relation Extraction Component

Query example – Obligation
– K návrhu je navrhovatel povinen připojit listiny, kterých se v návrhu dovolává.


Relation Extraction Component

Query example – Right
– Nabyvatel může uplatňovat nárok z odpovědnosti za vady u soudu jen tehdy, vytkl-li vady bez zbytečného odkladu po té, kdy měl možnost věc prohlédnout.


Future Work

Legislative domain
– Parsing
  ● evaluation and adaptation
– Entity detection
  ● automatic entity detection based on a sample of manually annotated entities
– Relation extraction
  ● automatic query design


Case study on environmental domain


● What are the environmental consequences of a project?

● Environmental Impact Assessment (EIA) weighs a project's environmental impacts when deciding whether or not to proceed with it.

● In the Czech Republic, CENIA administers the EIA information system.


EIA system


Example

● Amazon's plan to build a distribution center in Brno, Czech Republic (no, no, no, then yes from Brno councilors)

● May 9, 2014: a new project proposal ("intention") posted in the EIA system by CTP Invest


Mining EIA documentation

● Sysnet, Ltd. specified what entities and relations to extract, e.g.
– Title (Section B.I.1)
– Category, type (Section B.I.1)
– Capacity, size (Section B.I.2, B.I.6)
– Location (Section B.I.3)
– Scheduling (Section B.I.7)
– ...


Focus on section B.I.2

● Example

Vlastní areál bude sestávat z halového objektu o ploše cca 96 000 m², který bude uvnitř rozdělen na 3 haly … Předpokládají se 2 krytá stání pro jízdní kola a 1150 parkovacích stání pro osobní vozidla … Součástí záměru je realizace sadových úprav, která zahrnuje výsadbu více než 250 ks vzrostlých stromů

– The site will consist of a hall building with an area of approx. 96,000 m² that will be divided inside into 3 halls … 2 covered bicycle stands and 1,150 parking spaces for passenger cars are expected … The project includes landscaping, which involves planting more than 250 mature trees


Using RExtractor

● queries by regular expressions


Regular expressions

Dále je provozována produkční stáj VKK pro 336 ks dojnic (403,2 DJ). (In addition, a VKK production barn is operated for 336 dairy cows (403.2 livestock units).)

(Adj Nom)? (Noun Nom) (number) (unit) (Noun Gen)
(attribute) (entity) (number) (unit) (entity)
(production) (barn) (336) (pcs) (cow)

Credit to Ivana Lukšová
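A pattern over part-of-speech tags like the one above can be sketched with Python's `re` module by encoding each token's tag as one letter and matching on the resulting string. The tag names, the one-letter codes, the allowance of up to three untagged tokens between the noun and the number (needed for "VKK pro" in the example), and the field names are all assumptions for illustration.

```python
import re

# Hypothetical coarse tags, one letter each so a tag sequence becomes a string.
TAG_CODES = {"ADJ_NOM": "A", "NOUN_NOM": "N", "NUM": "D", "UNIT": "U", "NOUN_GEN": "G"}
# (Adj Nom)? (Noun Nom) [up to 3 other tokens] (number) (unit) (Noun Gen)
PATTERN = re.compile(r"(A)?(N)\.{0,3}(D)(U)(G)")
FIELDS = ("attribute", "entity", "number", "unit", "counted")


def extract(tagged):
    """Return one record per pattern match over (form, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, ".") for _, tag in tagged)
    records = []
    for match in PATTERN.finditer(codes):
        record = {}
        for group, field in enumerate(FIELDS, start=1):
            index = match.start(group)  # -1 when the optional group is absent
            record[field] = tagged[index][0] if index >= 0 else None
        records.append(record)
    return records


# The example sentence, tagged by hand for this sketch.
SENTENCE = [
    ("Dále", "OTHER"), ("je", "OTHER"), ("provozována", "OTHER"),
    ("produkční", "ADJ_NOM"), ("stáj", "NOUN_NOM"), ("VKK", "OTHER"),
    ("pro", "OTHER"), ("336", "NUM"), ("ks", "UNIT"), ("dojnic", "NOUN_GEN"),
]
records = extract(SENTENCE)
```

Because each token maps to exactly one character, a match position in the code string is also the index of the token in the sentence.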


● Evaluation
– Developers vs. users
– Gold standard data vs. practical use cases
– Experience vs. expectation
– Scientific contribution vs. “making life easier”

Both the legislative and the environmental domain