Keynote Exploring and Exploiting Official Publications

PoliticalMashup 1 PoliticalMashup Open Official Documents: Requirements and Opportunities Maarten Marx Universiteit van Amsterdam Istanbul, EEOP (@LREC), 2012-05-27

Upload: maartenmarx

Post on 19-Jan-2015


Page 1: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 1

PoliticalMashup

Open Official Documents: Requirements and Opportunities

Maarten Marx

Universiteit van Amsterdam

Istanbul, EEOP (@LREC), 2012-05-27

PoliticalMashup 2

Content

• Official Documents: zoom in on a specific official publications dataset
• Opportunities: what makes official publications data valuable?
• Requirements: what is needed to make official publications data reusable and interoperable?

PoliticalMashup 3

Our Leading Research Question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? [Marx et al. 2010]

PoliticalMashup 4

W3C recommendations on Open Government Data

• Make data both machine and human readable
• Link data, make data linkable: provide permanent identifiers for each government object and data item
• Provide metadata using common standards (e.g. Dublin Core)
• Make the data as easy to reuse (e.g. in mashups) as possible

Goal of this talk: make this concrete.
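As one way of making the identifier and metadata recommendations concrete, here is a minimal Python sketch that attaches Dublin Core elements to a record for one meeting. The Dublin Core namespace and the element names `identifier`, `title`, `date` and `language` are standard; the wrapper element `metadata`, the function name and the example.org identifier are made up for illustration.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
ET.register_namespace("dc", DC)


def dublin_core_record(identifier: str, title: str, date: str, language: str) -> ET.Element:
    """Build a small metadata record using Dublin Core element names."""
    record = ET.Element("metadata")       # hypothetical wrapper element
    for name, value in (("identifier", identifier), ("title", title),
                        ("date", date), ("language", language)):
        ET.SubElement(record, f"{{{DC}}}{name}").text = value
    return record


record = dublin_core_record(
    identifier="http://example.org/id/proceedings/2012-05-27",  # made-up permanent identifier
    title="Proceedings of the meeting of 2012-05-27",
    date="2012-05-27",
    language="nl",
)
print(ET.tostring(record, encoding="unicode"))
```

The same record is readable by humans and parseable by machines, and the identifier gives other datasets something stable to link to.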

PoliticalMashup 5

Value of a large data corpus

• Consider a 200-year corpus of temperature and humidity readings in one location
• Value is not in the individual "documents"
• Value is not in the corpus as a whole
• Value is in the relation between the "documents"

PoliticalMashup 6

Documents related by publication date

Google Books Ngram Viewer

PoliticalMashup 7

Properties of our Parliamentary Proceedings Dataset

PoliticalMashup 8

Longitudinal data

• Weekly measurements for over 150 years
• Very stable measurement procedure and data model

PoliticalMashup 9

Data about human behaviour

PoliticalMashup 10

Often rather boring

PoliticalMashup 11

But sometimes full of drama and excitement

PoliticalMashup 12

Loads of measurement points

24,000 days, 450,000 topics, 75 million speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

• Very sparse available metadata
• Very rich "metadata" sits hidden inside the raw data
• Rich data model:
  • Meeting (1 day)
  • Topic
  • Stage direction
  • Scene
  • Stage direction
  • Speech
  • Paragraph
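The data model listed above can be sketched as nested Python dataclasses. The nesting used here (a meeting holds topics, a topic holds scenes, a scene holds speeches, a speech holds paragraphs, with stage directions allowed at topic and scene level) is an assumption read off the order of the list. All class and field names are hypothetical and are not taken from the actual PoliticalMashup schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Speech:
    speaker: str                      # who said it
    party: str                        # party at the time of speaking
    role: str                         # function, e.g. "mp", "chair", "minister"
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Scene:
    # Assumed meaning: a scene groups the speeches around one main speaker.
    speeches: List[Speech] = field(default_factory=list)
    stage_directions: List[str] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    scenes: List[Scene] = field(default_factory=list)
    stage_directions: List[str] = field(default_factory=list)


@dataclass
class Meeting:
    day: date                         # one meeting covers one day
    topics: List[Topic] = field(default_factory=list)


def count_paragraphs(meeting: Meeting) -> int:
    """Walk the whole hierarchy and count the paragraphs at the leaves."""
    return sum(
        len(speech.paragraphs)
        for topic in meeting.topics
        for scene in topic.scenes
        for speech in scene.speeches
    )


# Toy meeting with invented content.
meeting = Meeting(
    day=date(2012, 5, 27),
    topics=[Topic(
        title="Budget",
        scenes=[Scene(speeches=[
            Speech("A", "Party X", "mp", ["First paragraph.", "Second paragraph."]),
            Speech("B", "Party Y", "mp", ["An interruption."]),
        ])],
    )],
)
print(count_paragraphs(meeting))
```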

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament, the following facts are known at the time of the speech act and can often be extracted from the written proceedings:

1) when it was said

2) who said it

3) in what function

4) speaking on behalf of which party

5) in which context and

6) who was actively present during the speech act
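A small sketch of what "metadata for each word" could look like once the structure is explicit: every word is emitted as a record carrying the six facts listed above. The dictionary layout of the toy meeting is hypothetical, and "actively present" is approximated here by "spoke under the same topic", which is an assumption of this sketch.

```python
from datetime import date
from typing import Dict, Iterator, List

# Toy meeting in a hypothetical dictionary layout (not the real corpus format).
meeting = {
    "date": date(2012, 5, 27),
    "topics": [{
        "title": "Budget",
        "speeches": [
            {"speaker": "A", "role": "mp", "party": "Party X", "text": "we need more data"},
            {"speaker": "B", "role": "minister", "party": "Party Y", "text": "data is available"},
        ],
    }],
}


def word_records(meeting: Dict) -> Iterator[Dict]:
    """Yield one record per spoken word, carrying the six facts listed above."""
    for topic in meeting["topics"]:
        # Assumption: everyone who speaks under the topic counts as actively present.
        present: List[str] = sorted({s["speaker"] for s in topic["speeches"]})
        for speech in topic["speeches"]:
            for word in speech["text"].split():
                yield {
                    "word": word.lower(),
                    "when": meeting["date"],        # 1) when it was said
                    "who": speech["speaker"],       # 2) who said it
                    "function": speech["role"],     # 3) in what function
                    "party": speech["party"],       # 4) on behalf of which party
                    "context": topic["title"],      # 5) in which context
                    "present": present,             # 6) who was actively present
                }


records = list(word_records(meeting))
print(len(records), records[0])
```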

PoliticalMashup 16

How to exploit the extra metadata and structure?

• Let's consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

• For every word we know both the date and the speaker
• Every speaker belongs to a political party
• 3D n-gram viewer: political spectrum vs. time vs. word count
• Use: topic ownership, agenda setting, framing

PoliticalMashup 18

Political n-gram viewer: requirements

Documents:

1. Metadata: date of the meeting

2. Document structure: for every spoken word, who said it

Linked Data: speakers' names are disambiguated, normalized and mapped to a database with temporal party information

Completeness and correctness: few missing or wrong data, also for data from long ago
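Once these requirements are met, the counting behind a political n-gram viewer is simple. The sketch below is not the PoliticalMashup implementation. It assumes speeches are already available as hypothetical (year, party, text) tuples, and computes the relative frequency of an n-gram per (year, party) cell, which corresponds to the time and party axes of the viewer.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical speech records: (year, party, text). The content is invented.
speeches = [
    (1998, "Party X", "the budget and the climate"),
    (1998, "Party Y", "climate climate climate"),
    (2004, "Party X", "the budget"),
]


def ngram_counts(speeches: Iterable[Tuple[int, str, str]],
                 n: int = 1) -> Dict[Tuple[int, str], Counter]:
    """Count n-grams per (year, party) cell."""
    cells: Dict[Tuple[int, str], Counter] = {}
    for year, party, text in speeches:
        tokens = text.lower().split()
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        cells.setdefault((year, party), Counter()).update(grams)
    return cells


def relative_frequency(cells: Dict[Tuple[int, str], Counter],
                       ngram: str) -> Dict[Tuple[int, str], float]:
    """Share of each cell's n-grams taken up by `ngram`."""
    return {
        key: counter[ngram] / sum(counter.values())
        for key, counter in cells.items()
        if sum(counter.values()) > 0
    }


cells = ngram_counts(speeches)
freq = relative_frequency(cells, "climate")
for (year, party), value in sorted(freq.items()):
    print(year, party, round(value, 2))
```

Normalizing per cell rather than plotting raw counts keeps parties with many seats, and therefore many speeches, from dominating the picture.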

PoliticalMashup 19

Is Linked (Open) Data the solution?

• Link speaker's name to Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph and [Spitkovsky, Chang, LREC 2012]
• DBpedia extracts link between person and party affiliation from Wikipedia infobox
• Timestamped triple:

  Geert Wilders is party member of VVD from 1998-08-25 until 2004-09-02
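A database with temporal party information has to answer one question: which party did this speaker belong to on the day of the meeting? A minimal sketch, using only the membership fact quoted above. The `Membership` class and the `party_on` function are hypothetical names, and treating the interval as closed at both ends is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Membership:
    person: str
    party: str
    start: date
    end: Optional[date]   # None means "still a member"


# The single membership fact quoted on the slide.
memberships: List[Membership] = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]


def party_on(person: str, day: date, facts: List[Membership]) -> Optional[str]:
    """Return the party the person belonged to on `day`, or None if unknown."""
    for fact in facts:
        if fact.person != person:
            continue
        # Assumption: the interval is closed at both ends.
        if fact.start <= day and (fact.end is None or day <= fact.end):
            return fact.party
    return None


print(party_on("Geert Wilders", date(2001, 1, 1), memberships))
print(party_on("Geert Wilders", date(2010, 1, 1), memberships))
```

A date outside every known interval returns `None` rather than a guess, so missing data stays visible.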

PoliticalMashup 20

DBpedia not yet reliable

• Data extraction is difficult, even from the infobox, even from complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes.

Timestamped facts are difficult to extract and difficult to represent in RDF triples.
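Why timestamps are awkward in triples can be shown without any RDF library. A triple has exactly three slots, so the validity interval has nowhere to go. A common workaround is an intermediate node that stands for the membership itself. The sketch below uses plain Python tuples; all node and property names are hypothetical and are not DBpedia's vocabulary.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# A plain triple has no room for the validity interval.
plain: List[Triple] = [("Geert_Wilders", "party", "VVD")]

# Workaround: an intermediate "membership" node that carries the person,
# the party and both dates as separate triples.
# All property names here are hypothetical.
nary: List[Triple] = [
    ("membership_1", "member", "Geert_Wilders"),
    ("membership_1", "party", "VVD"),
    ("membership_1", "from", "1998-08-25"),
    ("membership_1", "until", "2004-09-02"),
]


def party_at(person: str, day: str, triples: List[Triple]) -> Optional[str]:
    """Find the party of `person` on ISO date `day` using the n-ary pattern."""
    nodes = {s for s, p, o in triples if p == "member" and o == person}
    for node in sorted(nodes):
        props = {p: o for s, p, o in triples if s == node}
        # ISO 8601 dates compare correctly as plain strings.
        if props["from"] <= day <= props["until"]:
            return props["party"]
    return None


print(party_at("Geert_Wilders", "2000-01-01", nary))
print(party_at("Geert_Wilders", "2012-05-27", nary))
```

One fact now costs four triples and an extra node, and an extractor has to get all four right. This is one way to read the remark that such facts are hard both to extract and to represent.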

PoliticalMashup 21

Lesson learned: requirement on metadata and relations

• One cannot rely on Linked Open Data for good quality metadata
• Official documents should be self-describing, also for facts which are obvious at publication time
• Compare speaker's data in original (OCRed) data and XMLified and enriched version:
  • Original
  • Part of it in XML
  • And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009]
• Main research questions:
  How to collect information on entities?
  How to model an entity?
  How to rank entities?
• (Parsimonious) language models work well as models [Balog et al. 2009] [Hiemstra et al. 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
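To illustrate how a parsimonious language model can serve as an entity profile, here is a minimal sketch of the EM estimation as commonly described for Hiemstra et al. 2004: the E-step computes how much of each term count the entity model explains next to a background model, and the M-step renormalizes and prunes. This is a sketch of that formulation, not the code behind the sites above. The function name, parameter defaults and toy numbers are all invented.

```python
from collections import Counter
from typing import Dict


def parsimonious_lm(doc_tf: Counter, background: Dict[str, float],
                    lam: float = 0.1, iterations: int = 20,
                    threshold: float = 1e-4) -> Dict[str, float]:
    """Estimate P(t|D) with EM; mass moves to terms the background explains poorly."""
    total = sum(doc_tf.values())
    # Start from the maximum-likelihood estimate.
    p_doc = {t: tf / total for t, tf in doc_tf.items()}
    for _ in range(iterations):
        # E-step: expected count of each term generated by the document model.
        expected = {
            t: doc_tf[t] * lam * p / ((1 - lam) * background[t] + lam * p)
            for t, p in p_doc.items()
        }
        # M-step: normalize, drop terms below the threshold, renormalize.
        norm = sum(expected.values())
        kept = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        kept_mass = sum(kept.values())
        p_doc = {t: p / kept_mass for t, p in kept.items()}
    return p_doc


# Toy term counts for all speeches of one (imaginary) speaker.
doc_tf = Counter({"the": 10, "of": 8, "asylum": 4, "immigration": 3})
# Toy background probabilities; every term in doc_tf needs a nonzero entry.
background = {"the": 0.5, "of": 0.4, "asylum": 0.05, "immigration": 0.05}

model = parsimonious_lm(doc_tf, background)
for term, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(term, round(prob, 3))
```

Common words that the background already explains lose probability, so what remains is a compact profile of the terms that set this entity apart.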

PoliticalMashup 24

Content and structure search

• Usual advanced search combines keyword search with metadata search
• Extra fields are just extra filters on the returned documents
• With structured documents we can search on content and structure
• Most useful task: rank best entry points in large documents
• Compare two search systems on the same data:
  on flat text
  on an XML representation
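The difference between filtering documents and ranking entry points can be sketched with the standard library XML parser. Instead of returning a whole meeting, the function below returns the individual speeches that match, optionally restricted by a structural condition on the speech element. The element and attribute names in the toy document are hypothetical, and raw substring counts stand in for a real retrieval model.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Toy document; element and attribute names are hypothetical.
XML = """
<meeting date="2012-05-27">
  <topic title="Budget">
    <speech speaker="A" party="Party X"><p>The budget is tight.</p></speech>
    <speech speaker="B" party="Party Y"><p>Climate policy needs budget.</p><p>Climate first.</p></speech>
  </topic>
  <topic title="Climate">
    <speech speaker="A" party="Party X"><p>Climate, climate, climate.</p></speech>
  </topic>
</meeting>
"""


def entry_points(root: ET.Element, term: str, party: str = "") -> List[Tuple[int, str, str]]:
    """Rank speeches (not whole documents) by how often they mention `term`.

    The optional `party` argument is a structural constraint on the speech element.
    """
    hits = []
    for topic in root.iter("topic"):
        for speech in topic.iter("speech"):
            if party and speech.get("party") != party:
                continue
            text = " ".join(speech.itertext()).lower()
            score = text.count(term.lower())   # substring count as a crude score
            if score:
                hits.append((score, topic.get("title"), speech.get("speaker")))
    return sorted(hits, reverse=True)


root = ET.fromstring(XML)
print(entry_points(root, "climate"))
print(entry_points(root, "climate", party="Party Y"))
```

On flat text the same query could only say that this meeting mentions climate. With the structure explicit it can point to the speech, the topic and the speaker.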

PoliticalMashup 25

Lesson learned: requirement on structure

• Make semantically important structure of documents explicit in XML markup
• Publish for machine readability
• Publish generic data, not data prepared for one use case

PoliticalMashup 26

Application of structure: interruption graph (Attackogram)

• MP A interrupts B ⇔ A speaks during the block of B

combined with entity profiling

http://debat.politiekinzicht.com
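The definition above translates almost directly into code. In this sketch each block belongs to one main speaker and lists everyone who speaks inside it; anyone other than the owner counts as an interrupter. The block layout, the speaker labels and the decision to ignore the chair are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each block belongs to one main speaker and lists, in order, everyone who
# speaks inside it. This layout and the "chair" label are hypothetical.
blocks: List[Dict] = [
    {"owner": "B", "speakers": ["B", "A", "B", "A", "B", "chair"]},
    {"owner": "A", "speakers": ["A", "C", "A"]},
    {"owner": "B", "speakers": ["B", "C", "B"]},
]


def interruption_graph(blocks: List[Dict],
                       ignore: Tuple[str, ...] = ("chair",)) -> Counter:
    """Count directed edges (interrupter, interrupted) over all blocks."""
    edges: Counter = Counter()
    for block in blocks:
        owner = block["owner"]
        for speaker in block["speakers"]:
            # Anyone other than the owner who speaks in the block interrupts the owner.
            if speaker != owner and speaker not in ignore:
                edges[(speaker, owner)] += 1
    return edges


graph = interruption_graph(blocks)
for (src, dst), weight in graph.most_common():
    print(f"{src} -> {dst}: {weight}")
```

The edge weights give a directed, weighted graph that can be drawn directly or aggregated per party.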

PoliticalMashup 27

Exploring and exploiting official documents

• We saw what can be done with one well-curated collection
• What are the key infrastructural and research questions?

In what direction and how to scale this up?

1. in time

2. in breadth

3. in links

PoliticalMashup 28

Scale diachronically

• Stable data model and measurement procedure make this data very valuable for diachronic comparisons
• Towards the past:
  • OCR
  • consistency in structure
  • more missing data to link to
• Towards the future:
  • remain up to date
  • legacy decisions

PoliticalMashup 29

Scale in breadth, e.g. parliamentary proceedings of all European countries

• All describe the same "script", so all fit in one schema
• Main question: how to connect the data from different countries?

Common structure and annotation: use the same Relax NG schema

Common values on certain attributes:

• Entities: normalize to Wikipedia concepts
• Controlled vocabulary keywords: normalize to Eurovoc
• Language: machine translate to English
• Events: normalize to EMM NewsExplorer query, Wikinews query

PoliticalMashup 30

Scale in breadth: link to related datasets

• Link on time, entities, events, topics
• Other official publications
• News
• User-generated content
• (In our case) promises of political actors: election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications
• Preprocessing and interlinking with other datasets is difficult and does not scale well
• High precision and recall are needed for many applications
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors
• High investment, but if done in a general way, high return and impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner?

Lessons learned:

• Common, open, standardized, self-describing, machine readable
• Not tied to a single application
• Linked, linked, linked
• Not only shared attributes
• but, more importantly, shared data values
• Also store utterly obvious facts (10 years later they aren't)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in the UK: open government data; in Holland: all parliamentary data is now in XML)
• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification)
• Emphasize the importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications are they

or

  • 1 Open Official Documents Requirements and Opportunities
  • 2 Content
  • 3 Our Leading Research Question
  • 4 W3C recommendations on Open Government Data
  • 5 Value of a large data corpus
  • 6 Documents related by publication date
  • 7 Properties of our Parliamentary Proceedings Dataset
  • 8 Longitudinal data
  • 9 Data about human behaviour
  • 10 Often rather boring
  • 11 But sometimes full of drama and excitement
  • 12 Loads of measurement points
  • 13 Digitally available
  • 14 About this collection
  • 15 Very rich metadata for each word
  • 16 How to exploit the extra metadata and structure
  • 17 Political n-gram viewer
  • 18 Political n-gram viewer requirements
  • 19 Is Linked (Open) Data the solution
  • 20 DBpedia not yet reliable
  • 21 Lesson learned requirement on metadata and relations
  • 22 A few more applications
  • 23 Entity Profiling and Entity Search
  • 24 Content and structure search
  • 25 Lesson learned requirement on structure
  • 26 Application of structure Interruption graph (Attackogram)
  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth eg parlproceedings of all European countries
  • 30 Scale in breadth link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question
Page 2: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 2

Content

bull Official Documents Zoom in on a specific official publications

dataset

bull Opportunities What makes official publications data valuable

bull Requirements What is needed to make official publications data

reusable and interoperable

PoliticalMashup 3

Our Leading Research Question

What is the best data format for publishing both legacy and current

parliamentary proceedings in a digital sustainable manner [Marx et

al 2010]

PoliticalMashup 4

W3C recommendations on Open Government Data

bull make data both machine and human readable

bull link data make data linkable provide permanent identifiers for

each government object and data item

bull provide metadata using common standards (eg Dublin Core)

bull make the data as easy to reuse (eg in mashups) as possible

Goal of this talk make this concrete

PoliticalMashup 5

Value of a large data corpus

bull Consider a 200 year corpus of temperature and humidity readings

in one location

bull Value is not in the individual ldquodocumentsrdquo

bull Value is not in the corpus as a whole

bull Value is in the relation between the ldquodocumentsrdquo

PoliticalMashup 6

Documents related by publication date

Google books Ngram viewer

PoliticalMashup 7

Properties of our Parliamentary ProceedingsDataset

PoliticalMashup 8

Longitudinal data

bull weakly measurement for over 150 years

bull very stable measurement procedure and data model

PoliticalMashup 9

Data about human behaviour

PoliticalMashup 10

Often rather boring

PoliticalMashup 11

But sometimes full of drama and excitement

PoliticalMashup 12

Loads of measurement points

24000 days 450000 topics 75 miljoen speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

bull very sparse available metadata

bull very rich ldquometadatardquo sits hidden inside the raw data

bull Rich data model

bull Meeting (1 Day)

bull Topic

bull Stage direction

bull Scene

bull Stage direction

bull Speech

bull Paragraph

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament the following facts are known

at the time of the speech act and can often be extracted from the

written proceedings

1) when it was said

2) who said it

3) in what function

4) speaking on behalf of which party

5) in which context and

6) who was actively present during the speech act

PoliticalMashup 16

How to exploit the extra metadata and structure

bull Letrsquos consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

bull From every word we know both the date and the speaker

bull Every speaker belongs to a political party

bull 3D n-gram viewer political spectrum vs time vs word-count

bull Use topic ownership agenda setting framing

PoliticalMashup 18

Political n-gram viewer requirements

documents

1 metadata date of the meeting

2 document structure for every spoken word who said it

Linked Data Speakers names are disambiguated normalized and

mapped to a database with temporal party information

Completeness and correctness Few missing or wrong data also for

long time ago

PoliticalMashup 19

Is Linked (Open) Data the solution

bull Link speakers name to WikipediaDBpedia page (named entity

disambiguation and resolution) See also Google Knowledge

Graph and [Spitkovsky Chang LREC 2012]

bull DBpedia extracts link between person and party affiliation from

Wikipedia infobox

bull Timestamped triple

Geert Wilders is partymember of VVD

from 1998-08-25 until 2004-09-02

PoliticalMashup 20

DBpedia not yet reliable

bull Data extraction is difficult even from the infobox even from

complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to

represent in RDF triples

PoliticalMashup 21

Lesson learned requirement on metadata andrelations

bull One cannot rely on Linked Open Data for good quality metadata

bull Official documents should be self-describing also for facts which

are obvious at publication time

bull Compare speakerrsquos data in original (OCRed) data and XMLified

and enriched version

bull Original

bull Part of it in XML

bull And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

bull Users search for entities not for documents [TREC Entity Track]

[Balog et al 2009]

bull Main research questions

How to collect information on entities

how to model an entity

how to rank entities

bull (Parsimonious) language models work well as models [Balog et

al 2009][Hiemstra et al 2004]

bull Entity profiling httpwwwpolitiekinzichtcom

bull Entity search httpikkieswijzernl

PoliticalMashup 24

Content and structure search

bull Usual advanced search combines keyword search with metadata

search

bull Extra fields are just extra filters on the returned documents

bull With structured documents we can do search on content and

structure

bull Most useful task rank best entry points in large documents

bull Compare two search systems on the same data

on flat text

on an XML representation

PoliticalMashup 25

Lesson learned requirement on structure

bull Make semantically important structure of documents explicit in

XML markup

bull Publish for machine readability

bull Publish generic data not data prepared for one use-case

PoliticalMashup 26

Application of structure Interruption graph(Attackogram)

bull MP A interrupts B lArrrArr A speaks during the block of B

combined with entity profiling

httpdebatpolitiekinzichtcom

PoliticalMashup 27

Exploring and exploiting official documents

bull We saw what can be done with one well-curated collection

bull What are the key infrastructural and research questions

In what direction and how to scale this up

1 in time

2 in breadth

3 in links

PoliticalMashup 28

Scale diachronically

bull Stable data model and measurement procedure make this data

very valuable for diachronic comparisons

bull towards the past

bull OCR

bull consistency in structure

bull more missing data to link to

bull towards the future

bull remain up to date

bull legacy decisions

PoliticalMashup 29

Scale in breadth eg parlproceedings of allEuropean countries

bull All describe the same ldquoscriptrdquo so all fit in one schema

bull Main question how to connect the data from different countries

Common structure and annotation use the same Relax NG

schema

Common values on certain attributesbull Entities Normalize to Wikipedia concepts

bull Controlled vocabulary keywords Normalize to Eurovoc

bull Language Machine translate to English

bull Events Normalize to EMM Newsexplorer query Wikinews

query

PoliticalMashup 30

Scale in breadth link to related datasets

bull Link on time entities events topics

bull Other official publications

bull News

bull User generated content

bull (In our case) promisses of political actors election manifestos

PoliticalMashup 31

Conclusions

bull There are ample opportunities for exploiting Official Publications

bull Preprocessing and interlinking with other datasets is difficult and

does not scale well

bull High precision and recall is needed for many applications

bull Many text analysis and data-mapping tasks [MUC TAC]

bull Every format needs an own transformer

bull Linked Open Data knowledge bases are not (yet) good enough

create special purpose knowledge extractors

bull High investment but if done in a general way high return and

impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current

parliamentary proceedings in a digital sustainable manner

Lessons learned

bull Common open standardized self-describing machine readable

bull not tied to a single application

bull linked linked linked

bull Not only shared attributes

bull but more importantly shared data values

bull also store utterly obvious facts (10 years later they arenrsquot)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

bull Push at the source (in UK open government data in Holland all

parliamentary data is now in XML )

bull Help reduce dumb cut-and-paste annotation work so annotators

can concentrate on tasks which are hard for machines (eg

text-classification)

bull Emphasize importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications are they

or

  • 1 Open Official Documents Requirements and Opportunities
  • 2 Content
  • 3 Our Leading Research Question
  • 4 W3C recommendations on Open Government Data
  • 5 Value of a large data corpus
  • 6 Documents related by publication date
  • 7 Properties of our Parliamentary Proceedings Dataset
  • 8 Longitudinal data
  • 9 Data about human behaviour
  • 10 Often rather boring
  • 11 But sometimes full of drama and excitement
  • 12 Loads of measurement points
  • 13 Digitally available
  • 14 About this collection
  • 15 Very rich metadata for each word
  • 16 How to exploit the extra metadata and structure
  • 17 Political n-gram viewer
  • 18 Political n-gram viewer requirements
  • 19 Is Linked (Open) Data the solution
  • 20 DBpedia not yet reliable
  • 21 Lesson learned requirement on metadata and relations
  • 22 A few more applications
  • 23 Entity Profiling and Entity Search
  • 24 Content and structure search
  • 25 Lesson learned requirement on structure
  • 26 Application of structure Interruption graph (Attackogram)
  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth eg parlproceedings of all European countries
  • 30 Scale in breadth link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question
Page 3: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 3

Our Leading Research Question

What is the best data format for publishing both legacy and current

parliamentary proceedings in a digital sustainable manner [Marx et

al 2010]

PoliticalMashup 4

W3C recommendations on Open Government Data

bull make data both machine and human readable

bull link data make data linkable provide permanent identifiers for

each government object and data item

bull provide metadata using common standards (eg Dublin Core)

bull make the data as easy to reuse (eg in mashups) as possible

Goal of this talk make this concrete

PoliticalMashup 5

Value of a large data corpus

bull Consider a 200 year corpus of temperature and humidity readings

in one location

bull Value is not in the individual ldquodocumentsrdquo

bull Value is not in the corpus as a whole

bull Value is in the relation between the ldquodocumentsrdquo

PoliticalMashup 6

Documents related by publication date

Google books Ngram viewer

PoliticalMashup 7

Properties of our Parliamentary ProceedingsDataset

PoliticalMashup 8

Longitudinal data

bull weakly measurement for over 150 years

bull very stable measurement procedure and data model

PoliticalMashup 9

Data about human behaviour

PoliticalMashup 10

Often rather boring

PoliticalMashup 11

But sometimes full of drama and excitement

PoliticalMashup 12

Loads of measurement points

24000 days 450000 topics 75 miljoen speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

bull very sparse available metadata

bull very rich ldquometadatardquo sits hidden inside the raw data

bull Rich data model

bull Meeting (1 Day)

bull Topic

bull Stage direction

bull Scene

bull Stage direction

bull Speech

bull Paragraph

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament the following facts are known

at the time of the speech act and can often be extracted from the

written proceedings

1) when it was said

2) who said it

3) in what function

4) speaking on behalf of which party

5) in which context and

6) who was actively present during the speech act

PoliticalMashup 16

How to exploit the extra metadata and structure

bull Letrsquos consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

bull From every word we know both the date and the speaker

bull Every speaker belongs to a political party

bull 3D n-gram viewer political spectrum vs time vs word-count

bull Use topic ownership agenda setting framing

PoliticalMashup 18

Political n-gram viewer requirements

documents

1 metadata date of the meeting

2 document structure for every spoken word who said it

Linked Data Speakers names are disambiguated normalized and

mapped to a database with temporal party information

Completeness and correctness Few missing or wrong data also for

long time ago

PoliticalMashup 19

Is Linked (Open) Data the solution

bull Link speakers name to WikipediaDBpedia page (named entity

disambiguation and resolution) See also Google Knowledge

Graph and [Spitkovsky Chang LREC 2012]

bull DBpedia extracts link between person and party affiliation from

Wikipedia infobox

bull Timestamped triple

Geert Wilders is partymember of VVD

from 1998-08-25 until 2004-09-02

PoliticalMashup 20

DBpedia not yet reliable

bull Data extraction is difficult even from the infobox even from

complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to

represent in RDF triples

PoliticalMashup 21

Lesson learned requirement on metadata andrelations

bull One cannot rely on Linked Open Data for good quality metadata

bull Official documents should be self-describing also for facts which

are obvious at publication time

bull Compare speakerrsquos data in original (OCRed) data and XMLified

and enriched version

bull Original

bull Part of it in XML

bull And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

bull Users search for entities not for documents [TREC Entity Track]

[Balog et al 2009]

bull Main research questions

How to collect information on entities

how to model an entity

how to rank entities

bull (Parsimonious) language models work well as models [Balog et

al 2009][Hiemstra et al 2004]

bull Entity profiling httpwwwpolitiekinzichtcom

bull Entity search httpikkieswijzernl

PoliticalMashup 24

Content and structure search

bull Usual advanced search combines keyword search with metadata

search

bull Extra fields are just extra filters on the returned documents

bull With structured documents we can do search on content and

structure

bull Most useful task rank best entry points in large documents

bull Compare two search systems on the same data

on flat text

on an XML representation

PoliticalMashup 25

Lesson learned requirement on structure

bull Make semantically important structure of documents explicit in

XML markup

bull Publish for machine readability

bull Publish generic data not data prepared for one use-case

PoliticalMashup 26

Application of structure Interruption graph(Attackogram)

bull MP A interrupts B lArrrArr A speaks during the block of B

combined with entity profiling

httpdebatpolitiekinzichtcom

PoliticalMashup 27

Exploring and exploiting official documents

bull We saw what can be done with one well-curated collection

bull What are the key infrastructural and research questions

In what direction and how to scale this up

1 in time

2 in breadth

3 in links

PoliticalMashup 28

Scale diachronically

bull Stable data model and measurement procedure make this data

very valuable for diachronic comparisons

bull towards the past

bull OCR

bull consistency in structure

bull more missing data to link to

bull towards the future

bull remain up to date

bull legacy decisions

PoliticalMashup 29

Scale in breadth eg parlproceedings of allEuropean countries

bull All describe the same ldquoscriptrdquo so all fit in one schema

bull Main question how to connect the data from different countries

Common structure and annotation use the same Relax NG

schema

Common values on certain attributesbull Entities Normalize to Wikipedia concepts

bull Controlled vocabulary keywords Normalize to Eurovoc

bull Language Machine translate to English

bull Events Normalize to EMM Newsexplorer query Wikinews

query

PoliticalMashup 30

Scale in breadth link to related datasets

bull Link on time entities events topics

bull Other official publications

bull News

bull User generated content

bull (In our case) promisses of political actors election manifestos

PoliticalMashup 31

Conclusions

bull There are ample opportunities for exploiting Official Publications

bull Preprocessing and interlinking with other datasets is difficult and

does not scale well

bull High precision and recall is needed for many applications

bull Many text analysis and data-mapping tasks [MUC TAC]

bull Every format needs an own transformer

bull Linked Open Data knowledge bases are not (yet) good enough

create special purpose knowledge extractors

bull High investment but if done in a general way high return and

impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current

parliamentary proceedings in a digital sustainable manner

Lessons learned

bull Common open standardized self-describing machine readable

bull not tied to a single application

bull linked linked linked

bull Not only shared attributes

bull but more importantly shared data values

bull also store utterly obvious facts (10 years later they arenrsquot)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

bull Push at the source (in UK open government data in Holland all

parliamentary data is now in XML )

bull Help reduce dumb cut-and-paste annotation work so annotators

can concentrate on tasks which are hard for machines (eg

text-classification)

bull Emphasize importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications are they

or

  • 1 Open Official Documents Requirements and Opportunities
  • 2 Content
  • 3 Our Leading Research Question
  • 4 W3C recommendations on Open Government Data
  • 5 Value of a large data corpus
  • 6 Documents related by publication date
  • 7 Properties of our Parliamentary Proceedings Dataset
  • 8 Longitudinal data
  • 9 Data about human behaviour
  • 10 Often rather boring
  • 11 But sometimes full of drama and excitement
  • 12 Loads of measurement points
  • 13 Digitally available
  • 14 About this collection
  • 15 Very rich metadata for each word
  • 16 How to exploit the extra metadata and structure
  • 17 Political n-gram viewer
  • 18 Political n-gram viewer requirements
  • 19 Is Linked (Open) Data the solution
  • 20 DBpedia not yet reliable
  • 21 Lesson learned requirement on metadata and relations
  • 22 A few more applications
  • 23 Entity Profiling and Entity Search
  • 24 Content and structure search
  • 25 Lesson learned requirement on structure
  • 26 Application of structure Interruption graph (Attackogram)
  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth eg parlproceedings of all European countries
  • 30 Scale in breadth link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question
Page 4: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 4

W3C recommendations on Open Government Data

bull make data both machine and human readable

bull link data make data linkable provide permanent identifiers for

each government object and data item

bull provide metadata using common standards (eg Dublin Core)

bull make the data as easy to reuse (eg in mashups) as possible

Goal of this talk make this concrete

PoliticalMashup 5

Value of a large data corpus

bull Consider a 200 year corpus of temperature and humidity readings

in one location

bull Value is not in the individual ldquodocumentsrdquo

bull Value is not in the corpus as a whole

bull Value is in the relation between the ldquodocumentsrdquo

PoliticalMashup 6

Documents related by publication date

Google books Ngram viewer

PoliticalMashup 7

Properties of our Parliamentary ProceedingsDataset

PoliticalMashup 8

Longitudinal data

bull weakly measurement for over 150 years

bull very stable measurement procedure and data model

PoliticalMashup 9

Data about human behaviour

PoliticalMashup 10

Often rather boring

PoliticalMashup 11

But sometimes full of drama and excitement

PoliticalMashup 12

Loads of measurement points

24000 days 450000 topics 75 miljoen speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

bull very sparse available metadata

bull very rich ldquometadatardquo sits hidden inside the raw data

bull Rich data model

bull Meeting (1 Day)

bull Topic

bull Stage direction

bull Scene

bull Stage direction

bull Speech

bull Paragraph

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament, the following facts are known at the time of the speech act and can often be extracted from the written proceedings:

1) when it was said,

2) who said it,

3) in what function,

4) speaking on behalf of which party,

5) in which context, and

6) who was actively present during the speech act.

PoliticalMashup 16

How to exploit the extra metadata and structure?

• Let’s consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

• For every word we know both the date and the speaker.

• Every speaker belongs to a political party.

• 3D n-gram viewer: political spectrum vs. time vs. word count

• Use: topic ownership, agenda setting, framing
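
The counting behind such a viewer can be sketched in a few lines. This is only an illustration under simplifying assumptions: the toy speeches, party names, and whitespace tokenisation are hypothetical, and a real viewer would also normalise counts by the total number of words per party and year.

```python
from collections import Counter

# Hypothetical toy records: (date, speaker, party, text of the speech).
speeches = [
    ("1999-03-01", "Speaker A", "Party X", "taxes taxes and security"),
    ("1999-06-10", "Speaker B", "Party Y", "security first"),
    ("2000-02-02", "Speaker A", "Party X", "security and taxes"),
]


def ngram_counts(speeches, n=1):
    """Count n-grams per (party, year, n-gram)."""
    counts = Counter()
    for date, _speaker, party, text in speeches:
        year = int(date[:4])
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            counts[(party, year, ngram)] += 1
    return counts


counts = ngram_counts(speeches)
print(counts[("Party X", 1999, "taxes")])
print(counts[("Party Y", 1999, "security")])
```

Each key is one point in the party, time, word-count space, which is why both the date and a reliable speaker-to-party mapping are needed for every word.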

PoliticalMashup 18

Political n-gram viewer requirements

Documents:

1. Metadata: date of the meeting

2. Document structure: for every spoken word, who said it

Linked Data: Speakers' names are disambiguated, normalized, and mapped to a database with temporal party information.

Completeness and correctness: Few missing or wrong data, also for the distant past.

PoliticalMashup 19

Is Linked (Open) Data the solution?

• Link the speaker's name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph and [Spitkovsky & Chang, LREC 2012].

• DBpedia extracts the link between person and party affiliation from the Wikipedia infobox.

• Timestamped triple:

Geert Wilders is partymember of VVD

from 1998-08-25 until 2004-09-02
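
To show why the timestamps matter, here is a minimal sketch of resolving a speaker's party on the day of a speech. The only data row is the example from the slide; the table layout, the function name `party_on`, and the choice to treat both interval ends as inclusive are assumptions.

```python
from datetime import date

# (person, party, start, end); end=None would mean "still a member".
# The single row is the example from the slide; a real table needs many more.
memberships = [
    ("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]


def party_on(person, day, memberships):
    """Return the party a person belonged to on a given day, or None if unknown."""
    for name, party, start, end in memberships:
        if name == person and start <= day and (end is None or day <= end):
            return party
    return None


print(party_on("Geert Wilders", date(2001, 1, 1), memberships))
print(party_on("Geert Wilders", date(2010, 1, 1), memberships))
```

Without the interval, every speech by the same person would be attributed to a single party, regardless of when it was given.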

PoliticalMashup 20

DBpedia not yet reliable

• Data extraction is difficult, even from the infobox, even from complete data.

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes.

Timestamped facts are difficult to extract and difficult to represent in RDF triples.
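
The representation problem can be illustrated without any RDF library. A triple has three slots, so a validity interval has to be attached to an extra node that stands for the membership itself (the n-ary relation pattern). The sketch below uses plain tuples; every identifier and property name in it is hypothetical and is not DBpedia's actual vocabulary.

```python
# A plain triple has exactly three slots, so the interval has nowhere to go.
plain = ("Geert_Wilders", "party", "VVD")

# Workaround (n-ary relation): introduce a node for the membership itself.
# All identifiers here are hypothetical, not DBpedia's actual vocabulary.
triples = [
    ("Geert_Wilders", "hasMembership", "membership_1"),
    ("membership_1", "party", "VVD"),
    ("membership_1", "from", "1998-08-25"),
    ("membership_1", "until", "2004-09-02"),
]


def memberships_of(person, triples):
    """Collect the properties of every membership node attached to a person."""
    nodes = [o for s, p, o in triples if s == person and p == "hasMembership"]
    return [{p: o for s, p, o in triples if s == node} for node in nodes]


result = memberships_of("Geert_Wilders", triples)
print(result)
```

One fact has become four triples, and a consumer has to know the pattern in order to put them back together, which is part of what makes such data hard to extract and to query.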

PoliticalMashup 21

Lesson learned: requirement on metadata and relations

• One cannot rely on Linked Open Data for good-quality metadata.

• Official documents should be self-describing, also for facts which are obvious at publication time.

• Compare the speaker’s data in the original (OCRed) data and the XMLified and enriched version:

  • Original

  • Part of it in XML

  • And now for human consumption
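
The slides showed the comparison as images. As a stand-in, here is a hedged sketch of what a self-describing speech element could look like and how a program reads it. The element names, attribute names, and the identifier value are hypothetical and are not taken from the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only.
speech_xml = """
<speech speaker="Speaker A" speaker-id="nl.m.00001" party="Party X"
        role="mp" date="1999-03-01">
  <p>First paragraph of the speech.</p>
  <p>Second paragraph of the speech.</p>
</speech>
"""

speech = ET.fromstring(speech_xml)
# Everything a later reader needs travels with the speech itself.
print(speech.get("speaker"), speech.get("party"), speech.get("date"))
print(len(speech.findall("p")))
```

Party and role were obvious to everyone in the room when the speech was given; writing them into the document is what keeps them available later.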

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009].

• Main research questions:

  How to collect information on entities?

  How to model an entity?

  How to rank entities?

• (Parsimonious) language models work well as models [Balog et al. 2009] [Hiemstra et al. 2004].

• Entity profiling: http://www.politiekinzicht.com

• Entity search: http://ikkieswijzer.nl
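
For the parsimonious language models mentioned above, the following is a rough sketch of the usual EM estimation: terms that the background corpus already explains well lose probability mass, so the entity model concentrates on what is specific to the entity. The mixing weight `lam`, the iteration count, the pruning threshold, the probability floor for unseen terms, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def parsimonious_lm(doc_tokens, background, lam=0.1, iterations=20, threshold=1e-4):
    """Estimate a parsimonious term distribution for one document (or entity) via EM."""
    tf = Counter(doc_tokens)
    total = sum(tf.values())
    p_doc = {t: c / total for t, c in tf.items()}  # start from maximum likelihood
    for _ in range(iterations):
        # E-step: share of each term's occurrences explained by the document model.
        expected = {}
        for t, p in p_doc.items():
            denom = (1 - lam) * background.get(t, 1e-9) + lam * p
            expected[t] = tf[t] * lam * p / denom
        # M-step: renormalise, pruning terms whose probability became negligible.
        norm = sum(expected.values())
        p_doc = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
    # Final renormalisation, in case the last pruning step removed some mass.
    norm = sum(p_doc.values())
    return {t: p / norm for t, p in p_doc.items()}


# Hypothetical background probabilities and a toy "entity document".
background = {"the": 0.5, "of": 0.3, "pension": 0.01, "reform": 0.01}
tokens = "the the the of of pension pension reform".split()
model = parsimonious_lm(tokens, background)
print(sorted(model.items(), key=lambda kv: -kv[1]))
```

In the toy example, "the" is the most frequent term in the document but ends up with far less probability than "pension", because the background model already accounts for it.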

PoliticalMashup 24

Content and structure search

• Usual advanced search combines keyword search with metadata search.

• Extra fields are just extra filters on the returned documents.

• With structured documents we can search on both content and structure.

• Most useful task: rank the best entry points in large documents.

• Compare two search systems on the same data:

  on flat text

  on an XML representation
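
The difference between the two kinds of search can be sketched on a toy document. Flat-text search can only report that the document matches; search on the XML can return the individual speech as an entry point and restrict it by structure. The markup, names, and both helper functions below are hypothetical, and ranking is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element and attribute names are illustrative.
doc = ET.fromstring("""
<meeting date="1999-03-01">
  <topic title="Budget">
    <speech speaker="Speaker A" party="Party X"><p>We must lower taxes.</p></speech>
    <speech speaker="Speaker B" party="Party Y"><p>Taxes pay for schools.</p></speech>
  </topic>
  <topic title="Defence">
    <speech speaker="Speaker A" party="Party X"><p>Security comes first.</p></speech>
  </topic>
</meeting>
""")


def search_flat(doc, keyword):
    """Flat-text search: can only say whether the whole document matches."""
    text = " ".join(doc.itertext()).lower()
    return keyword.lower() in text


def search_structured(doc, keyword, party):
    """Content-and-structure search: return the matching speeches as entry points."""
    hits = []
    for topic in doc.findall("topic"):
        for speech in topic.findall(f"speech[@party='{party}']"):
            text = " ".join(speech.itertext()).lower()
            if keyword.lower() in text:
                hits.append((topic.get("title"), speech.get("speaker")))
    return hits


print(search_flat(doc, "taxes"))
print(search_structured(doc, "taxes", "Party Y"))
```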

PoliticalMashup 25

Lesson learned: requirement on structure

• Make the semantically important structure of documents explicit in XML markup.

• Publish for machine readability.

• Publish generic data, not data prepared for one use case.

PoliticalMashup 26

Application of structure: Interruption graph (Attackogram)

• MP A interrupts B ⇔ A speaks during the block of B

combined with entity profiling

http://debat.politiekinzicht.com
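
The definition of an interruption translates almost directly into code once speaker blocks are explicit in the markup. The sketch below assumes a hypothetical input in which each block names its owner and lists everyone who spoke during it; a real version would probably also exclude the chair.

```python
from collections import Counter

# Hypothetical input: each block belongs to one MP (the owner) and lists,
# in order, everyone who spoke during it.
blocks = [
    {"owner": "B", "speakers": ["B", "A", "B", "A", "B"]},
    {"owner": "A", "speakers": ["A", "C", "A"]},
    {"owner": "B", "speakers": ["B", "C", "B"]},
]


def interruption_graph(blocks):
    """Return weighted edges (interrupter, interrupted) -> number of interruptions."""
    edges = Counter()
    for block in blocks:
        owner = block["owner"]
        for speaker in block["speakers"]:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges


edges = interruption_graph(blocks)
print(edges[("A", "B")])
print(edges[("C", "A")])
```

The weighted edges are what a graph drawing of the debate would be built from.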

PoliticalMashup 27

Exploring and exploiting official documents

• We saw what can be done with one well-curated collection.

• What are the key infrastructural and research questions?

In what direction, and how, to scale this up?

1. in time

2. in breadth

3. in links

PoliticalMashup 28

Scale diachronically

• Stable data model and measurement procedure make this data very valuable for diachronic comparisons.

• towards the past

  • OCR

  • consistency in structure

  • more missing data to link to

• towards the future

  • remain up to date

  • legacy decisions

PoliticalMashup 29

Scale in breadth, e.g. parl. proceedings of all European countries

• All describe the same “script”, so all fit in one schema.

• Main question: how to connect the data from different countries?

Common structure and annotation: use the same Relax NG schema.

Common values on certain attributes:

• Entities: normalize to Wikipedia concepts

• Controlled vocabulary keywords: normalize to Eurovoc

• Language: machine translate to English

• Events: normalize to EMM NewsExplorer query, Wikinews query

PoliticalMashup 30

Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User-generated content

• (In our case) promises of political actors: election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications.

• Preprocessing and interlinking with other datasets is difficult and does not scale well.

• High precision and recall are needed for many applications.

• Many text analysis and data-mapping tasks [MUC, TAC]

• Every format needs its own transformer.

• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors.

• High investment, but if done in a general way, high return and impact.

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digital, sustainable manner?

Lessons learned

• Common, open, standardized, self-describing, machine-readable

• not tied to a single application

• linked, linked, linked

  • Not only shared attributes

  • but, more importantly, shared data values

• also store utterly obvious facts (10 years later they aren’t)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in the UK: open government data; in Holland all parliamentary data is now in XML).

• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification).

• Emphasize the importance of using shared standards.

Future researchers will love you.

PoliticalMashup 34

Last Question

Official Publications are they

or

  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth eg parlproceedings of all European countries
  • 30 Scale in breadth link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question
Page 8: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 8

Longitudinal data

• weekly measurement for over 150 years

• very stable measurement procedure and data model

PoliticalMashup 9

Data about human behaviour

PoliticalMashup 10

Often rather boring

PoliticalMashup 11

But sometimes full of drama and excitement

PoliticalMashup 12

Loads of measurement points

24000 days, 450000 topics, 75 million speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

• very sparse available metadata

• very rich “metadata” sits hidden inside the raw data

• Rich data model:

  • Meeting (1 day)

  • Topic

  • Stage direction

  • Scene

  • Stage direction

  • Speech

  • Paragraph

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament, the following facts are known at the time of the speech act, and can often be extracted from the written proceedings:

1) when it was said,

2) who said it,

3) in what function,

4) speaking on behalf of which party,

5) in which context, and

6) who was actively present during the speech act.

PoliticalMashup 16

How to exploit the extra metadata and structure

• Let’s consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

• From every word we know both the date and the speaker

• Every speaker belongs to a political party

• 3D n-gram viewer: political spectrum vs. time vs. word count

• Use: topic ownership, agenda setting, framing
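
The counting behind such a viewer can be sketched as follows. This is a minimal illustration, not the PoliticalMashup implementation: the `Speech` record, the `party_year_counts` function, and the sample speeches are all hypothetical, and matching is naive whitespace tokenisation.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Speech:
    """Hypothetical record: one speech plus the metadata known for its words."""
    day: date
    speaker: str
    party: str
    text: str


def party_year_counts(speeches, term):
    """Count occurrences of `term` per (party, year) pair."""
    counts = Counter()
    needle = term.lower()
    for s in speeches:
        # Naive tokenisation: lowercase and split on whitespace.
        hits = s.text.lower().split().count(needle)
        if hits:
            counts[(s.party, s.day.year)] += hits
    return counts


# Invented sample data, only to show the shape of the result.
speeches = [
    Speech(date(2001, 3, 1), "Speaker A", "Party X",
           "Europe needs reform and Europe needs trust"),
    Speech(date(2001, 5, 9), "Speaker B", "Party Y",
           "the budget comes first"),
    Speech(date(2002, 1, 15), "Speaker B", "Party Y",
           "Europe is on the agenda"),
]
counts = party_year_counts(speeches, "europe")
```

Each key of `counts` is one cell of the party-by-time grid; ordering the parties along a political spectrum would give the third axis mentioned on the slide.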

PoliticalMashup 18

Political n-gram viewer requirements

Documents:

1. metadata: date of the meeting

2. document structure: for every spoken word, who said it

Linked Data: speakers’ names are disambiguated, normalized and mapped to a database with temporal party information

Completeness and correctness: little missing or wrong data, also for the distant past

PoliticalMashup 19

Is Linked (Open) Data the solution

• Link speaker’s name to Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph and [Spitkovsky, Chang, LREC 2012]

• DBpedia extracts link between person and party affiliation from Wikipedia infobox

• Timestamped triple:

Geert Wilders is partymember of VVD from 1998-08-25 until 2004-09-02
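
One way to hold such a timestamped fact outside plain RDF triples is a record with an explicit validity interval. The sketch below is only an illustration: the `Membership` class and `party_on` function are hypothetical names, the single row is the example from the slide, and treating the end date as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Membership:
    """Hypothetical record for a party affiliation with a validity interval."""
    person: str
    party: str
    start: date
    end: Optional[date]  # None would mean the membership is still open


# The only row is the timestamped triple quoted on the slide.
MEMBERSHIPS = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]


def party_on(person, day, memberships=MEMBERSHIPS):
    """Return the party of `person` on `day`, or None if no record covers it."""
    for m in memberships:
        # Assumption: both interval ends are inclusive.
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None
```

With this table, `party_on("Geert Wilders", date(2000, 1, 1))` returns `"VVD"`, while a date after 2004-09-02 returns `None` because the table holds no later record.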

PoliticalMashup 20

DBpedia not yet reliable

• Data extraction is difficult, even from the infobox, even from complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to represent in RDF triples

PoliticalMashup 21

Lesson learned: requirement on metadata and relations

• One cannot rely on Linked Open Data for good quality metadata

• Official documents should be self-describing, also for facts which are obvious at publication time

• Compare speaker’s data in original (OCRed) data and XMLified and enriched version

• Original

• Part of it in XML

• And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009]

• Main research questions:

How to collect information on entities?

How to model an entity?

How to rank entities?

• (Parsimonious) language models work well as models [Balog et al. 2009][Hiemstra et al. 2004]

• Entity profiling: http://www.politiekinzicht.com

• Entity search: http://ikkieswijzer.nl

PoliticalMashup 24

Content and structure search

• Usual advanced search combines keyword search with metadata search

• Extra fields are just extra filters on the returned documents

• With structured documents we can do search on content and structure

• Most useful task: rank best entry points in large documents

• Compare two search systems on the same data:

on flat text

on an XML representation
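
A small sketch of what searching on content and structure can look like on an XML representation. The element and attribute names (`meeting`, `topic`, `speech`, `p`, `speaker`, `party`) and the sample document are assumptions made for illustration, not the actual PoliticalMashup schema; the query uses the XPath subset of Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Invented sample document with a hypothetical structure.
DOC = """
<meeting date="2011-01-01">
  <topic title="Budget">
    <speech speaker="Speaker A" party="Party X">
      <p>The budget must be balanced.</p>
    </speech>
    <speech speaker="Speaker B" party="Party Y">
      <p>Schools need more funding.</p>
      <p>The budget allows it.</p>
    </speech>
  </topic>
</meeting>
""".strip()


def paragraphs_by(root, speaker, keyword):
    """Return paragraphs containing `keyword` inside speeches by `speaker`."""
    hits = []
    # Structure condition: only speeches by the given speaker.
    for speech in root.findall(f".//speech[@speaker='{speaker}']"):
        for p in speech.findall("p"):
            # Content condition: case-insensitive keyword match.
            if keyword.lower() in (p.text or "").lower():
                hits.append(p.text)
    return hits


root = ET.fromstring(DOC)
result = paragraphs_by(root, "Speaker B", "budget")
```

The result is the single matching paragraph rather than the whole meeting, which is the "entry point" idea: on flat text the same keyword query could only return the document.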

PoliticalMashup 25

Lesson learned: requirement on structure

• Make semantically important structure of documents explicit in XML markup

• Publish for machine readability

• Publish generic data, not data prepared for one use-case

PoliticalMashup 26

Application of structure: Interruption graph (Attackogram)

• MP A interrupts B ⇐⇒ A speaks during the block of B

combined with entity profiling

http://debat.politiekinzicht.com
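
The interruption rule above can be turned into a weighted directed graph with very little code. The sketch below assumes a hypothetical input shape, a list of `(block_owner, speakers_in_order)` pairs, and counts every turn by someone other than the owner as one interruption; both are assumptions for illustration.

```python
from collections import Counter


def interruption_edges(blocks):
    """Count edges (interrupter, interrupted) from speaker blocks.

    `blocks` is a list of (owner, speakers) pairs, where `speakers` lists
    who spoke, in order, during the block that belongs to `owner`.
    """
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            # A speaks during the block of B, so A interrupts B.
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges


# Invented example: A interrupts B twice, B interrupts C once.
blocks = [
    ("B", ["B", "A", "B", "A", "B"]),
    ("C", ["C", "B", "C"]),
]
edges = interruption_edges(blocks)
```

The edge weights can then be drawn as arrows between MPs or aggregated per party.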

PoliticalMashup 27

Exploring and exploiting official documents

• We saw what can be done with one well-curated collection

• What are the key infrastructural and research questions?

In what direction, and how, to scale this up?

1. in time

2. in breadth

3. in links

PoliticalMashup 28

Scale diachronically

• Stable data model and measurement procedure make this data very valuable for diachronic comparisons

• towards the past

  • OCR

  • consistency in structure

  • more missing data to link to

• towards the future

  • remain up to date

  • legacy decisions

PoliticalMashup 29

Scale in breadth, e.g. parliamentary proceedings of all European countries

• All describe the same “script”, so all fit in one schema

• Main question: how to connect the data from different countries?

Common structure and annotation: use the same Relax NG schema

Common values on certain attributes:

• Entities: normalize to Wikipedia concepts

• Controlled vocabulary keywords: normalize to Eurovoc

• Language: machine translate to English

• Events: normalize to EMM Newsexplorer query, Wikinews query
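
To make "common values" concrete: once entity mentions from different countries are mapped to the same concept identifier, records can be joined on that identifier. The lookup table and function names below are hypothetical and tiny; a real system would produce the mapping with entity linking rather than a hand-written dictionary.

```python
# Hypothetical lookup: (language, surface form) -> Wikipedia page title.
ENTITY_TO_CONCEPT = {
    ("nl", "Europese Unie"): "European_Union",
    ("de", "Europäische Union"): "European_Union",
    ("en", "European Union"): "European_Union",
}


def normalize_entity(language, surface_form):
    """Return the shared concept id for a mention, or None if unknown."""
    return ENTITY_TO_CONCEPT.get((language, surface_form))


def share_concept(mention_a, mention_b):
    """True if both (language, surface form) mentions map to the same concept."""
    concept = normalize_entity(*mention_a)
    return concept is not None and concept == normalize_entity(*mention_b)


same = share_concept(("nl", "Europese Unie"), ("de", "Europäische Union"))
```

Unknown mentions deliberately never match, so missing links show up as gaps instead of false joins.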

PoliticalMashup 30

Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User generated content

• (In our case) promises of political actors: election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications

• Preprocessing and interlinking with other datasets is difficult and does not scale well

• High precision and recall are needed for many applications

• Many text analysis and data-mapping tasks [MUC, TAC]

• Every format needs its own transformer

• Linked Open Data knowledge bases are not (yet) good enough: create special purpose knowledge extractors

• High investment, but if done in a general way, high return and impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digital sustainable manner?

Lessons learned:

• Common, open, standardized, self-describing, machine readable

• not tied to a single application

• linked, linked, linked

• Not only shared attributes

• but more importantly shared data values

• also store utterly obvious facts (10 years later they aren’t)
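
As an illustration of "self-describing" and "store utterly obvious facts", a speech element can carry the date, speaker, party and function explicitly instead of leaving them to context. The element and attribute names and all values below are hypothetical, not the format used by the author.

```python
import xml.etree.ElementTree as ET


def make_speech(day, speaker, party, function, text):
    """Build a speech element that states its own context explicitly."""
    speech = ET.Element("speech", {
        "date": day,           # obvious on publication day, not ten years later
        "speaker": speaker,
        "party": party,        # party at the time of the speech
        "function": function,  # role in which the speaker spoke
    })
    ET.SubElement(speech, "p").text = text
    return speech


# Invented example values.
speech = make_speech("2011-01-01", "Speaker A", "Party X",
                     "member of parliament", "Example paragraph.")
xml_text = ET.tostring(speech, encoding="unicode")
```

Because every attribute travels with the speech, a later reader does not need an external knowledge base to recover who spoke for which party on that day.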

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in the UK: open government data; in Holland all parliamentary data is now in XML)

• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification)

• Emphasize importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications: are they [image] or [image]?

  • 1 Open Official Documents Requirements and Opportunities
  • 2 Content
  • 3 Our Leading Research Question
  • 4 W3C recommendations on Open Government Data
  • 5 Value of a large data corpus
  • 6 Documents related by publication date
  • 7 Properties of our Parliamentary Proceedings Dataset
  • 8 Longitudinal data
  • 9 Data about human behaviour
  • 10 Often rather boring
  • 11 But sometimes full of drama and excitement
  • 12 Loads of measurement points
  • 13 Digitally available
  • 14 About this collection
  • 15 Very rich metadata for each word
  • 16 How to exploit the extra metadata and structure
  • 17 Political n-gram viewer
  • 18 Political n-gram viewer requirements
  • 19 Is Linked (Open) Data the solution
  • 20 DBpedia not yet reliable
  • 21 Lesson learned requirement on metadata and relations
  • 22 A few more applications
  • 23 Entity Profiling and Entity Search
  • 24 Content and structure search
  • 25 Lesson learned requirement on structure
  • 26 Application of structure Interruption graph (Attackogram)
  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth eg parlproceedings of all European countries
  • 30 Scale in breadth link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question
Page 11: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 11

But sometimes full of drama and excitement

PoliticalMashup 12

Loads of measurement points

24000 days 450000 topics 75 miljoen speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

bull very sparse available metadata

bull very rich ldquometadatardquo sits hidden inside the raw data

bull Rich data model

bull Meeting (1 Day)

bull Topic

bull Stage direction

bull Scene

bull Stage direction

bull Speech

bull Paragraph

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament the following facts are known

at the time of the speech act and can often be extracted from the

written proceedings

1) when it was said

2) who said it

3) in what function

4) speaking on behalf of which party

5) in which context and

6) who was actively present during the speech act

PoliticalMashup 16

How to exploit the extra metadata and structure

bull Letrsquos consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

bull From every word we know both the date and the speaker

bull Every speaker belongs to a political party

bull 3D n-gram viewer political spectrum vs time vs word-count

bull Use topic ownership agenda setting framing

PoliticalMashup 18

Political n-gram viewer requirements

documents

1 metadata date of the meeting

2 document structure for every spoken word who said it

Linked Data Speakers names are disambiguated normalized and

mapped to a database with temporal party information

Completeness and correctness Few missing or wrong data also for

long time ago

PoliticalMashup 19

Is Linked (Open) Data the solution

bull Link speakers name to WikipediaDBpedia page (named entity

disambiguation and resolution) See also Google Knowledge

Graph and [Spitkovsky Chang LREC 2012]

bull DBpedia extracts link between person and party affiliation from

Wikipedia infobox

bull Timestamped triple

Geert Wilders is partymember of VVD

from 1998-08-25 until 2004-09-02

PoliticalMashup 20

DBpedia not yet reliable

bull Data extraction is difficult even from the infobox even from

complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to

represent in RDF triples

PoliticalMashup 21

Lesson learned requirement on metadata andrelations

bull One cannot rely on Linked Open Data for good quality metadata

bull Official documents should be self-describing also for facts which

are obvious at publication time

bull Compare speakerrsquos data in original (OCRed) data and XMLified

and enriched version

bull Original

bull Part of it in XML

bull And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

bull Users search for entities not for documents [TREC Entity Track]

[Balog et al 2009]

bull Main research questions

How to collect information on entities

how to model an entity

how to rank entities

bull (Parsimonious) language models work well as models [Balog et

al 2009][Hiemstra et al 2004]

bull Entity profiling httpwwwpolitiekinzichtcom

bull Entity search httpikkieswijzernl

PoliticalMashup 24

Content and structure search

bull Usual advanced search combines keyword search with metadata

search

bull Extra fields are just extra filters on the returned documents

bull With structured documents we can do search on content and

structure

bull Most useful task rank best entry points in large documents

bull Compare two search systems on the same data

on flat text

on an XML representation

PoliticalMashup 25

Lesson learned requirement on structure

bull Make semantically important structure of documents explicit in

XML markup

bull Publish for machine readability

bull Publish generic data not data prepared for one use-case

PoliticalMashup 26

Application of structure: Interruption graph (Attackogram)

• MP A interrupts B ⟺ A speaks during the block of B

combined with entity profiling

http://debat.politiekinzicht.com
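The definition above translates almost directly into code once the blocks are explicit in the markup. The sketch below assumes each block is given as the MP who holds the floor plus the ordered list of speakers inside it. That input shape and the function name are assumptions for illustration, and a real version would probably also have to exclude the chair.

```python
from collections import Counter
from typing import Dict, List, Tuple


def interruption_graph(blocks: List[Tuple[str, List[str]]]) -> Dict[Tuple[str, str], int]:
    """Build weighted edges (interrupter, interrupted): A interrupts B
    whenever A speaks during a block that belongs to B."""
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return dict(edges)


# Toy debate (hypothetical): B holds the floor first, then A.
BLOCKS = [
    ("B", ["B", "A", "B", "A", "C"]),
    ("A", ["A", "B"]),
]
GRAPH = interruption_graph(BLOCKS)
```

The edge weights can then be drawn as a directed graph between MPs or aggregated per party.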

PoliticalMashup 27

Exploring and exploiting official documents

• We saw what can be done with one well-curated collection

• What are the key infrastructural and research questions?

In what direction, and how, to scale this up?

1. in time

2. in breadth

3. in links

PoliticalMashup 28

Scale diachronically

• A stable data model and measurement procedure make this data very valuable for diachronic comparisons

• towards the past

  • OCR

  • consistency in structure

  • more missing data to link to

• towards the future

  • remain up to date

  • legacy decisions

PoliticalMashup 29

Scale in breadth: e.g. parliamentary proceedings of all European countries

• All describe the same “script”, so all fit in one schema

• Main question: how to connect the data from different countries?

Common structure and annotation: use the same Relax NG schema

Common values on certain attributes:

• Entities: normalize to Wikipedia concepts

• Controlled vocabulary keywords: normalize to Eurovoc

• Language: machine translate to English

• Events: normalize to EMM NewsExplorer query, Wikinews query

PoliticalMashup 30

Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User-generated content

• (In our case) promises of political actors: election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications

• Preprocessing and interlinking with other datasets is difficult and does not scale well

• High precision and recall are needed for many applications

• Many text analysis and data-mapping tasks [MUC, TAC]

• Every format needs its own transformer

• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors

• High investment, but if done in a general way, high return and impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digital sustainable manner?

Lessons learned

• Common, open, standardized, self-describing, machine readable

• not tied to a single application

• linked, linked, linked

• Not only shared attributes

• but, more importantly, shared data values

• also store utterly obvious facts (10 years later they aren't)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in the UK: open government data; in Holland all parliamentary data is now in XML)

• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification)

• Emphasize the importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications are they

or

  • 1 Open Official Documents Requirements and Opportunities
  • 2 Content
  • 3 Our Leading Research Question
  • 4 W3C recommendations on Open Government Data
  • 5 Value of a large data corpus
  • 6 Documents related by publication date
  • 7 Properties of our Parliamentary Proceedings Dataset
  • 8 Longitudinal data
  • 9 Data about human behaviour
  • 10 Often rather boring
  • 11 But sometimes full of drama and excitement
  • 12 Loads of measurement points
  • 13 Digitally available
  • 14 About this collection
  • 15 Very rich metadata for each word
  • 16 How to exploit the extra metadata and structure
  • 17 Political n-gram viewer
  • 18 Political n-gram viewer requirements
  • 19 Is Linked (Open) Data the solution
  • 20 DBpedia not yet reliable
  • 21 Lesson learned requirement on metadata and relations
  • 22 A few more applications
  • 23 Entity Profiling and Entity Search
  • 24 Content and structure search
  • 25 Lesson learned requirement on structure
  • 26 Application of structure Interruption graph (Attackogram)
  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth eg parlproceedings of all European countries
  • 30 Scale in breadth link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question

PoliticalMashup 12

Loads of measurement points

24000 days, 450000 topics, 75 million speeches

PoliticalMashup 13

Digitally available

PoliticalMashup 14

About this collection

• very sparse available metadata

• very rich “metadata” sits hidden inside the raw data

• Rich data model

  • Meeting (1 Day)

  • Topic

  • Stage direction

  • Scene

  • Stage direction

  • Speech

  • Paragraph

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament the following facts are known at the time of the speech act and can often be extracted from the written proceedings:

1) when it was said

2) who said it

3) in what function

4) speaking on behalf of which party

5) in which context, and

6) who was actively present during the speech act

PoliticalMashup 16

How to exploit the extra metadata and structure

• Let's consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

• For every word we know both the date and the speaker

• Every speaker belongs to a political party

• 3D n-gram viewer: political spectrum vs. time vs. word count

• Use: topic ownership, agenda setting, framing
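The counting step behind such a viewer could look roughly like this, assuming every speech has already been linked to a date and to the speaker's party at that date. The tuple layout, the whitespace tokenisation and the toy speeches are assumptions for illustration.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple

Speech = Tuple[date, str, str]  # (date of the meeting, party of the speaker, text)


def ngram_counts(speeches: List[Speech], n: int = 1) -> Dict[Tuple[str, int, str], int]:
    """Count n-grams per (party, year): the three axes of the viewer are
    party (political spectrum), year (time) and the count itself."""
    counts = Counter()
    for day, party, text in speeches:
        tokens = text.lower().split()  # naive tokenisation, enough for a sketch
        for i in range(len(tokens) - n + 1):
            counts[(party, day.year, " ".join(tokens[i:i + n]))] += 1
    return dict(counts)


# Toy input (hypothetical parties and texts).
SPEECHES = [
    (date(2009, 5, 1), "P1", "pension reform now"),
    (date(2009, 6, 1), "P1", "pension age"),
    (date(2010, 1, 1), "P2", "tax reform"),
]
UNIGRAMS = ngram_counts(SPEECHES)
BIGRAMS = ngram_counts(SPEECHES, n=2)
```

Plotting these counts per party over the years for a chosen term is what would make topic ownership or a shift in framing visible. For comparisons across years the counts would normally be divided by the total number of words per party and year.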

PoliticalMashup 18

Political n-gram viewer requirements

Documents:

1. metadata: date of the meeting

2. document structure: for every spoken word, who said it

Linked Data: speakers' names are disambiguated, normalized and mapped to a database with temporal party information

Completeness and correctness: few missing or wrong data, also for data from long ago

PoliticalMashup 19

Is Linked (Open) Data the solution?

• Link the speaker's name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph and [Spitkovsky, Chang, LREC 2012]

• DBpedia extracts the link between person and party affiliation from the Wikipedia infobox

• Timestamped triple:

Geert Wilders is a party member of the VVD

from 1998-08-25 until 2004-09-02

Page 15: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 15

Very rich metadata for each word

For every word spoken in parliament the following facts are known

at the time of the speech act and can often be extracted from the

written proceedings

1) when it was said

2) who said it

3) in what function

4) speaking on behalf of which party

5) in which context and

6) who was actively present during the speech act

PoliticalMashup 16

How to exploit the extra metadata and structure

bull Letrsquos consider a simple killer app

PoliticalMashup 17

Political n-gram viewer

bull From every word we know both the date and the speaker

bull Every speaker belongs to a political party

bull 3D n-gram viewer political spectrum vs time vs word-count

bull Use topic ownership agenda setting framing

PoliticalMashup 18

Political n-gram viewer requirements

documents

1 metadata date of the meeting

2 document structure for every spoken word who said it

Linked Data Speakers names are disambiguated normalized and

mapped to a database with temporal party information

Completeness and correctness Few missing or wrong data also for

long time ago

PoliticalMashup 19

Is Linked (Open) Data the solution

bull Link speakers name to WikipediaDBpedia page (named entity

disambiguation and resolution) See also Google Knowledge

Graph and [Spitkovsky Chang LREC 2012]

bull DBpedia extracts link between person and party affiliation from

Wikipedia infobox

bull Timestamped triple

Geert Wilders is partymember of VVD

from 1998-08-25 until 2004-09-02

PoliticalMashup 20

DBpedia not yet reliable

bull Data extraction is difficult even from the infobox even from

complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to

represent in RDF triples

PoliticalMashup 21

Lesson learned requirement on metadata andrelations

bull One cannot rely on Linked Open Data for good quality metadata

bull Official documents should be self-describing also for facts which

are obvious at publication time

bull Compare speakerrsquos data in original (OCRed) data and XMLified

and enriched version

bull Original

bull Part of it in XML

bull And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009]

• Main research questions:

How to collect information on entities

How to model an entity

How to rank entities

• (Parsimonious) language models work well as models [Balog et al. 2009] [Hiemstra et al. 2004]

• Entity profiling: http://www.politiekinzicht.com

• Entity search: http://ikkieswijzer.nl
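
To make "language models as entity models" concrete, here is a minimal sketch that ranks entities by smoothed query likelihood, where an entity's model is built from all text associated with it (for an MP: everything they said). It uses plain Jelinek-Mercer smoothing, which is a simpler stand-in and not the parsimonious model cited above; the function name, the `lam` parameter, and the sample texts are assumptions.

```python
import math
from collections import Counter


def rank_entities(query, entity_texts, lam=0.5):
    """Rank entities by Jelinek-Mercer smoothed query likelihood.

    entity_texts maps an entity name to the text associated with it.
    """
    models = {e: Counter(t.lower().split()) for e, t in entity_texts.items()}
    background = Counter()
    for tf in models.values():
        background.update(tf)
    total = sum(background.values())

    scores = {}
    for entity, tf in models.items():
        length = sum(tf.values())
        score = 0.0
        for term in query.lower().split():
            p_bg = background[term] / total if total else 0.0
            p_ent = tf[term] / length if length else 0.0
            p = lam * p_ent + (1 - lam) * p_bg
            if p == 0.0:  # term occurs nowhere in the collection
                score = float("-inf")
                break
            score += math.log(p)
        scores[entity] = score
    return sorted(scores, key=scores.get, reverse=True)


# Invented sample texts, for illustration only.
texts = {
    "MP A": "taxes taxes budget",
    "MP B": "schools teachers budget",
}
print(rank_entities("taxes", texts))
```

Mixing in the background model keeps an entity from dropping to zero just because one query term is missing from its own text.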

PoliticalMashup 24

Content and structure search

• Usual advanced search combines keyword search with metadata search

• Extra fields are just extra filters on the returned documents

• With structured documents we can search on content and structure

• Most useful task: rank the best entry points in large documents

• Compare two search systems on the same data:

on flat text

on an XML representation
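
The difference between the two systems can be sketched on a toy document. The markup is hypothetical (a `speech` element with `id` and `party` attributes), and the sketch only filters, it does not rank: flat-text search can say that the document matches, while content-and-structure search can return the speeches by a given party that match, i.e. the entry points.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and invented content, for illustration only.
DOC = """
<proceedings>
  <speech id="s1" party="Party X"><p>We should lower taxes.</p></speech>
  <speech id="s2" party="Party Y"><p>Party X wants to talk about taxes again.</p></speech>
  <speech id="s3" party="Party X"><p>Schools need more teachers.</p></speech>
</proceedings>
"""
root = ET.fromstring(DOC)


def flat_search(root, keyword):
    """Keyword search on flat text: only tells whether the document matches."""
    text = " ".join(root.itertext()).lower()
    return keyword.lower() in text


def structured_search(root, keyword, party):
    """Content-and-structure search: ids of speeches by `party` containing `keyword`."""
    hits = []
    for speech in root.findall(f".//speech[@party='{party}']"):
        text = " ".join(speech.itertext()).lower()
        if keyword.lower() in text:
            hits.append(speech.get("id"))
    return hits


print(flat_search(root, "taxes"))
print(structured_search(root, "taxes", "Party X"))
```

Note that speech `s2` mentions both "Party X" and "taxes" in its text, so a flat keyword query for the two would match it, although it was spoken by Party Y.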

PoliticalMashup 25

Lesson learned: requirement on structure

• Make semantically important structure of documents explicit in XML markup

• Publish for machine readability

• Publish generic data, not data prepared for one use case

PoliticalMashup 26

Application of structure: Interruption graph (Attackogram)

• MP A interrupts B ⇔ A speaks during the block of B

combined with entity profiling

http://debat.politiekinzicht.com
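
Given markup in which each speaker's block is explicit, the interruption relation is a short traversal. The sketch below assumes a hypothetical structure in which a `block` element names its owner and contains `speech` elements; this is an illustration and not the actual PoliticalMashup schema.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical markup: a block belongs to one speaker, others may speak inside it.
DOC = """
<debate>
  <block speaker="B">
    <speech speaker="B">...</speech>
    <speech speaker="A">...</speech>
    <speech speaker="B">...</speech>
    <speech speaker="C">...</speech>
  </block>
  <block speaker="A">
    <speech speaker="A">...</speech>
    <speech speaker="B">...</speech>
  </block>
</debate>
"""


def interruption_edges(root):
    """Count edges (interrupter, interrupted): X speaks during the block of Y."""
    edges = Counter()
    for block in root.iter("block"):
        owner = block.get("speaker")
        for speech in block.iter("speech"):
            who = speech.get("speaker")
            if who != owner:
                edges[(who, owner)] += 1
    return edges


edges = interruption_edges(ET.fromstring(DOC))
for (src, dst), n in sorted(edges.items()):
    print(f"{src} -> {dst}: {n}")
```

The resulting weighted, directed edges are what a graph drawing of the debate would be built from.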

PoliticalMashup 27

Exploring and exploiting official documents

• We saw what can be done with one well-curated collection

• What are the key infrastructural and research questions?

In what direction and how to scale this up?

1. in time

2. in breadth

3. in links

PoliticalMashup 28

Scale diachronically

• Stable data model and measurement procedure make this data very valuable for diachronic comparisons

• towards the past

  • OCR

  • consistency in structure

  • more missing data to link to

• towards the future

  • remain up to date

  • legacy decisions

PoliticalMashup 29

Scale in breadth: e.g., parliamentary proceedings of all European countries

• All describe the same “script”, so all fit in one schema

• Main question: how to connect the data from different countries?

Common structure and annotation: use the same Relax NG schema

Common values on certain attributes:

• Entities: Normalize to Wikipedia concepts

• Controlled vocabulary keywords: Normalize to Eurovoc

• Language: Machine translate to English

• Events: Normalize to EMM Newsexplorer query, Wikinews query

PoliticalMashup 30

Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User generated content

• (In our case) promises of political actors: election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications

• Preprocessing and interlinking with other datasets is difficult and does not scale well

• High precision and recall are needed for many applications

• Many text analysis and data-mapping tasks [MUC, TAC]

• Every format needs its own transformer

• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors

• High investment, but if done in a general way, high return and impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner?

Lessons learned

• Common, open, standardized, self-describing, machine readable

• not tied to a single application

• linked, linked, linked

• Not only shared attributes

• but, more importantly, shared data values

• also store utterly obvious facts (10 years later they aren't)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in the UK: open government data; in Holland all parliamentary data is now in XML)

• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g., text classification)

• Emphasize importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications are they

or

  • 1 Open Official Documents: Requirements and Opportunities
  • 2 Content
  • 3 Our Leading Research Question
  • 4 W3C recommendations on Open Government Data
  • 5 Value of a large data corpus
  • 6 Documents related by publication date
  • 7 Properties of our Parliamentary Proceedings Dataset
  • 8 Longitudinal data
  • 9 Data about human behaviour
  • 10 Often rather boring
  • 11 But sometimes full of drama and excitement
  • 12 Loads of measurement points
  • 13 Digitally available
  • 14 About this collection
  • 15 Very rich metadata for each word
  • 16 How to exploit the extra metadata and structure
  • 17 Political n-gram viewer
  • 18 Political n-gram viewer requirements
  • 19 Is Linked (Open) Data the solution?
  • 20 DBpedia not yet reliable
  • 21 Lesson learned: requirement on metadata and relations
  • 22 A few more applications
  • 23 Entity Profiling and Entity Search
  • 24 Content and structure search
  • 25 Lesson learned: requirement on structure
  • 26 Application of structure: Interruption graph (Attackogram)
  • 27 Exploring and exploiting official documents
  • 28 Scale diachronically
  • 29 Scale in breadth: e.g., parliamentary proceedings of all European countries
  • 30 Scale in breadth: link to related datasets
  • 31 Conclusions
  • 32 Back to our research question
  • 33 How we can help (ourselves)
  • 34 Last Question
Page 19: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 19

Is Linked (Open) Data the solution

bull Link speakers name to WikipediaDBpedia page (named entity

disambiguation and resolution) See also Google Knowledge

Graph and [Spitkovsky Chang LREC 2012]

bull DBpedia extracts link between person and party affiliation from

Wikipedia infobox

bull Timestamped triple

Geert Wilders is partymember of VVD

from 1998-08-25 until 2004-09-02

PoliticalMashup 20

DBpedia not yet reliable

bull Data extraction is difficult even from the infobox even from

complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to

represent in RDF triples

PoliticalMashup 21

Lesson learned requirement on metadata andrelations

bull One cannot rely on Linked Open Data for good quality metadata

bull Official documents should be self-describing also for facts which

are obvious at publication time

bull Compare speakerrsquos data in original (OCRed) data and XMLified

and enriched version

bull Original

bull Part of it in XML

bull And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

bull Users search for entities not for documents [TREC Entity Track]

[Balog et al 2009]

bull Main research questions

How to collect information on entities

how to model an entity

how to rank entities

bull (Parsimonious) language models work well as models [Balog et

al 2009][Hiemstra et al 2004]

bull Entity profiling httpwwwpolitiekinzichtcom

bull Entity search httpikkieswijzernl

PoliticalMashup 24

Content and structure search

bull Usual advanced search combines keyword search with metadata

search

bull Extra fields are just extra filters on the returned documents

bull With structured documents we can do search on content and

structure

bull Most useful task rank best entry points in large documents

bull Compare two search systems on the same data

on flat text

on an XML representation

PoliticalMashup 25

Lesson learned requirement on structure

bull Make semantically important structure of documents explicit in

XML markup

bull Publish for machine readability

bull Publish generic data not data prepared for one use-case

PoliticalMashup 26

Application of structure Interruption graph(Attackogram)

bull MP A interrupts B lArrrArr A speaks during the block of B

combined with entity profiling

httpdebatpolitiekinzichtcom

PoliticalMashup 27

Exploring and exploiting official documents

bull We saw what can be done with one well-curated collection

bull What are the key infrastructural and research questions

In what direction and how to scale this up

1 in time

2 in breadth

3 in links

PoliticalMashup 28

Scale diachronically

bull Stable data model and measurement procedure make this data

very valuable for diachronic comparisons

bull towards the past

bull OCR

bull consistency in structure

bull more missing data to link to

bull towards the future

bull remain up to date

bull legacy decisions

PoliticalMashup 29

Scale in breadth eg parlproceedings of allEuropean countries

bull All describe the same ldquoscriptrdquo so all fit in one schema

bull Main question how to connect the data from different countries

Common structure and annotation use the same Relax NG

schema

Common values on certain attributesbull Entities Normalize to Wikipedia concepts

bull Controlled vocabulary keywords Normalize to Eurovoc

bull Language Machine translate to English

bull Events Normalize to EMM Newsexplorer query Wikinews

query

PoliticalMashup 30

Scale in breadth link to related datasets

bull Link on time entities events topics

bull Other official publications

bull News

bull User generated content

bull (In our case) promisses of political actors election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications

• Preprocessing and interlinking with other datasets is difficult and does not scale well

• High precision and recall are needed for many applications

• Many text analysis and data-mapping tasks [MUC, TAC]

• Every format needs its own transformer

• Linked Open Data knowledge bases are not (yet) good enough: create special purpose knowledge extractors

• High investment, but if done in a general way, high return and impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digital sustainable manner?

Lessons learned:

• Common, open, standardized, self-describing, machine readable

• not tied to a single application

• linked, linked, linked

• Not only shared attributes

• but more importantly shared data values

• also store utterly obvious facts (10 years later they aren’t)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in UK: open government data; in Holland all parliamentary data is now in XML)

• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification)

• Emphasize importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications: are they [image] or [image]?


PoliticalMashup 20

DBpedia not yet reliable

• Data extraction is difficult, even from the infobox, even from complete data

Wikipedia page of Geert Wilders

DBpedia information about Geert Wilders

Notice the values of the party and the office attributes

Timestamped facts are difficult to extract and difficult to represent in RDF triples
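The representation problem can be illustrated with a small sketch. It assumes a simple record type with a validity interval; the class, the identifiers `mp:X`, `party:A`, `party:B`, and the dates are hypothetical and not taken from DBpedia or Wikipedia.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class TimedFact:
    subject: str
    predicate: str
    obj: str
    start: date
    end: Optional[date] = None  # None means the fact is still valid


# Hypothetical facts about a hypothetical MP.
facts = [
    TimedFact("mp:X", "party", "party:A", date(1998, 1, 1), date(2004, 6, 30)),
    TimedFact("mp:X", "party", "party:B", date(2004, 7, 1)),
]


def value_at(facts, subject, predicate, when):
    """Return the object that holds for (subject, predicate) on a given day."""
    for f in facts:
        if (f.subject, f.predicate) != (subject, predicate):
            continue
        if f.start <= when and (f.end is None or when <= f.end):
            return f.obj
    return None


# Dropping the interval leaves two plain triples for the same subject and
# predicate, with no way to tell which one holds on which date.
flat_triples = {(f.subject, f.predicate, f.obj) for f in facts}

print(value_at(facts, "mp:X", "party", date(2000, 1, 1)))
print(value_at(facts, "mp:X", "party", date(2010, 1, 1)))
print(sorted(flat_triples))
```

A plain subject-predicate-object triple has no slot for the interval, so a triple store needs an extra construct (for instance an intermediate "membership" node) to keep the dates attached to the fact.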

PoliticalMashup 21

Lesson learned: requirement on metadata and relations

• One cannot rely on Linked Open Data for good quality metadata

• Official documents should be self-describing, also for facts which are obvious at publication time

• Compare speaker’s data in original (OCRed) data and XMLified and enriched version

  • Original

  • Part of it in XML

  • And now for human consumption

PoliticalMashup 22

A few more applications

PoliticalMashup 23

Entity Profiling and Entity Search

• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009]

• Main research questions:

How to collect information on entities?

How to model an entity?

How to rank entities?

• (Parsimonious) language models work well as models [Balog et al. 2009][Hiemstra et al. 2004]

• Entity profiling: http://www.politiekinzicht.com

• Entity search: http://ikkieswijzer.nl
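As a rough illustration of how a parsimonious language model could serve as an entity profile, the sketch below follows the EM formulation usually given for such models: the E-step credits each term with the share of its count explained by the entity model rather than the background, and the M-step renormalizes and prunes. The function name, the parameter values, and the toy background model are assumptions, not the setup used in the talk.

```python
from collections import Counter


def parsimonious_lm(tokens, background, lam=0.1, iterations=20, threshold=1e-4):
    """Estimate P(t|D) that keeps only terms not explained by the background.

    tokens: all words attributed to one entity (e.g. the speeches of one MP)
    background: dict with P(t|C), the collection-wide term probabilities
    lam: weight of the entity model in the mixture
    """
    tf = Counter(tokens)
    total = sum(tf.values())
    p_doc = {t: c / total for t, c in tf.items()}  # maximum-likelihood start
    for _ in range(iterations):
        # E-step: expected count of each term generated by the entity model
        expected = {}
        for t, p in p_doc.items():
            p_bg = background.get(t, 1e-9)  # tiny floor for unseen terms
            expected[t] = tf[t] * (lam * p) / ((1 - lam) * p_bg + lam * p)
        norm = sum(expected.values())
        # M-step: renormalize and drop terms below the threshold
        kept = {t: e / norm for t, e in expected.items() if e / norm >= threshold}
        z = sum(kept.values())
        p_doc = {t: p / z for t, p in kept.items()}
    return p_doc


# Toy data: common words dominate the background model.
background = {"the": 0.5, "of": 0.3, "budget": 0.001, "pension": 0.001}
tokens = "the the the of of budget pension pension".split()
model = parsimonious_lm(tokens, background)
for term, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {prob:.3f}")
```

The resulting distribution concentrates on the terms that distinguish the entity from the collection, which is what makes it usable both as a displayed profile and as a ranking model.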

PoliticalMashup 24

Content and structure search

• Usual advanced search combines keyword search with metadata search

• Extra fields are just extra filters on the returned documents

• With structured documents we can do search on content and structure

• Most useful task: rank best entry points in large documents

• Compare two search systems on the same data:

on flat text

on an XML representation
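The difference between the two kinds of search can be sketched as follows. The markup is invented for illustration; the element and attribute names (`topic`, `speech`, `speaker`, `party`) are assumptions and not the actual PoliticalMashup schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup of one meeting.
XML = """
<proceedings>
  <topic title="Budget">
    <speech speaker="A" party="P1"><p>We must cut the deficit.</p></speech>
    <speech speaker="B" party="P2"><p>Pensions come first.</p></speech>
  </topic>
  <topic title="Health">
    <speech speaker="A" party="P1"><p>Hospitals need pensions reform too.</p></speech>
  </topic>
</proceedings>
"""

root = ET.fromstring(XML)


def flat_search(root, keyword):
    """Flat text: the whole document either matches or it does not."""
    text = " ".join(root.itertext()).lower()
    return keyword.lower() in text


def structured_search(root, keyword, party=None):
    """Return (topic, speaker) entry points whose speech contains the keyword."""
    hits = []
    for topic in root.iter("topic"):
        for speech in topic.iter("speech"):
            if party is not None and speech.get("party") != party:
                continue
            text = " ".join(speech.itertext()).lower()
            if keyword.lower() in text:
                hits.append((topic.get("title"), speech.get("speaker")))
    return hits


print(flat_search(root, "pensions"))
print(structured_search(root, "pensions"))
print(structured_search(root, "pensions", party="P1"))
```

On flat text the answer is the whole document; on the XML representation the answer is a list of entry points inside it, and structural conditions such as the speaker's party become part of the query instead of a filter applied afterwards.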

PoliticalMashup 25

Lesson learned: requirement on structure

• Make semantically important structure of documents explicit in XML markup

• Publish for machine readability

• Publish generic data, not data prepared for one use-case

Page 25: Keynote Exploring and Exploiting Official Publications

PoliticalMashup 25

Lesson learned requirement on structure

bull Make semantically important structure of documents explicit in

XML markup

bull Publish for machine readability

bull Publish generic data not data prepared for one use-case

PoliticalMashup 26

Application of structure Interruption graph(Attackogram)

bull MP A interrupts B lArrrArr A speaks during the block of B

combined with entity profiling

httpdebatpolitiekinzichtcom

PoliticalMashup 27

Exploring and exploiting official documents

• We saw what can be done with one well-curated collection

• What are the key infrastructural and research questions?

In what direction, and how, can this be scaled up?

1. in time

2. in breadth

3. in links

PoliticalMashup 28

Scale diachronically

• A stable data model and measurement procedure make this data very valuable for diachronic comparisons

• Towards the past

  • OCR

  • consistency in structure

  • more missing data to link to

• Towards the future

  • remain up to date

  • legacy decisions

PoliticalMashup 29

Scale in breadth, e.g. parliamentary proceedings of all European countries

• All describe the same “script”, so all fit in one schema

• Main question: how to connect the data from different countries?

Common structure and annotation: use the same Relax NG schema

Common values on certain attributes:

• Entities: normalize to Wikipedia concepts

• Controlled-vocabulary keywords: normalize to Eurovoc

• Language: machine-translate to English

• Events: normalize to an EMM NewsExplorer query or a Wikinews query
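
A minimal sketch of what "common values" buys: once country-specific keywords are mapped to one shared identifier, documents from different parliaments can be grouped with a plain lookup. The mapping table, the `concept:` identifiers, and the record fields below are all hypothetical placeholders; a real setup would use actual Eurovoc descriptors or Wikipedia concepts.

```python
from collections import defaultdict

# Hypothetical lookup table: (country, local keyword) -> shared concept id.
KEYWORD_TO_CONCEPT = {
    ("nl", "landbouw"): "concept:agriculture",
    ("de", "landwirtschaft"): "concept:agriculture",
    ("nl", "onderwijs"): "concept:education",
}


def normalize(record):
    """Return a copy of the record with a shared 'concept' value (None if unmapped)."""
    key = (record["country"], record["keyword"].lower())
    return {**record, "concept": KEYWORD_TO_CONCEPT.get(key)}


def group_by_concept(records):
    """Group document ids from all countries under their shared concept."""
    groups = defaultdict(list)
    for rec in map(normalize, records):
        if rec["concept"] is not None:
            groups[rec["concept"]].append(rec["doc"])
    return dict(groups)


# Made-up records from two parliaments.
records = [
    {"country": "nl", "keyword": "Landbouw", "doc": "nl-001"},
    {"country": "de", "keyword": "Landwirtschaft", "doc": "de-042"},
    {"country": "nl", "keyword": "Onderwijs", "doc": "nl-007"},
    {"country": "de", "keyword": "Verkehr", "doc": "de-050"},
]

linked = group_by_concept(records)
print(linked)
```

The unmapped German keyword is dropped here, which shows why coverage of the shared vocabulary matters as much as the shared schema.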

PoliticalMashup 30

Scale in breadth: link to related datasets

• Link on time, entities, events, topics

• Other official publications

• News

• User-generated content

• (In our case) promises of political actors: election manifestos

PoliticalMashup 31

Conclusions

• There are ample opportunities for exploiting Official Publications

• Preprocessing and interlinking with other datasets is difficult and does not scale well

• High precision and recall are needed for many applications

• Many text analysis and data-mapping tasks [MUC, TAC]

• Every format needs its own transformer

• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors

• High investment, but if done in a general way, high return and impact

PoliticalMashup 32

Back to our research question

What is the best data format for publishing both legacy and current parliamentary proceedings in a digital, sustainable manner?

Lessons learned

• Common, open, standardized, self-describing, machine-readable

• Not tied to a single application

• Linked, linked, linked

• Not only shared attributes

• but, more importantly, shared data values

• Also store utterly obvious facts (10 years later they aren’t)

PoliticalMashup 33

How we can help (ourselves)

Help improve input data at the source

• Push at the source (in the UK: open government data; in Holland, all parliamentary data is now in XML)

• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks that are hard for machines (e.g. text classification)

• Emphasize the importance of using shared standards

Future researchers will love you

PoliticalMashup 34

Last Question

Official Publications are they

or
