Keynote: Exploring and Exploiting Official Publications
PoliticalMashup 1
PoliticalMashup
Open Official Documents: Requirements and Opportunities
Maarten Marx
Universiteit van Amsterdam
Istanbul EEOP (LREC) 2012-05-27
PoliticalMashup 2
Content
• Official Documents: zoom in on a specific official publications dataset
• Opportunities: what makes official publications data valuable?
• Requirements: what is needed to make official publications data reusable and interoperable?
PoliticalMashup 3
Our Leading Research Question
What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner? [Marx et al. 2010]
PoliticalMashup 4
W3C recommendations on Open Government Data
• Make data both machine- and human-readable.
• Link data: make data linkable; provide permanent identifiers for each government object and data item.
• Provide metadata using common standards (e.g. Dublin Core).
• Make the data as easy to reuse (e.g. in mashups) as possible.
Goal of this talk: make this concrete.
PoliticalMashup 5
Value of a large data corpus
• Consider a 200-year corpus of temperature and humidity readings in one location.
• Value is not in the individual “documents”.
• Value is not in the corpus as a whole.
• Value is in the relation between the “documents”.
PoliticalMashup 6
Documents related by publication date
Google Books Ngram Viewer
PoliticalMashup 7
Properties of our Parliamentary Proceedings Dataset
PoliticalMashup 8
Longitudinal data
• Weekly measurements for over 150 years
• Very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24,000 days, 450,000 topics, 75 million speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
• Very sparse available metadata
• Very rich “metadata” sits hidden inside the raw data
• Rich data model:
  • Meeting (1 day)
  • Topic
  • Stage direction
  • Scene
  • Stage direction
  • Speech
  • Paragraph
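The data model above can be sketched as a small set of nested types. This is only an illustration: the nesting (a meeting contains topics, a topic contains scenes and stage directions, a scene contains speeches and stage directions, a speech contains paragraphs) is an assumption, and all class and field names are hypothetical, not the schema used by the project.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Union


@dataclass
class Paragraph:
    text: str


@dataclass
class StageDirection:
    text: str  # a non-spoken note in the proceedings


@dataclass
class Speech:
    speaker: str
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Scene:
    # Speeches and stage directions, kept in document order (assumed nesting).
    items: List[Union[Speech, StageDirection]] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    items: List[Union[Scene, StageDirection]] = field(default_factory=list)


@dataclass
class Meeting:
    day: date
    topics: List[Topic] = field(default_factory=list)


def count_words(meeting: Meeting) -> int:
    """Count the words in all paragraphs of all speeches of one meeting."""
    total = 0
    for topic in meeting.topics:
        for item in topic.items:
            if isinstance(item, Scene):
                for part in item.items:
                    if isinstance(part, Speech):
                        total += sum(len(p.text.split()) for p in part.paragraphs)
    return total


# Hypothetical toy meeting with one topic, one scene and one speech.
meeting = Meeting(
    day=date(2012, 5, 27),
    topics=[Topic("Budget", [Scene([
        StageDirection("The chair opens the debate"),
        Speech("Speaker A", [Paragraph("We need open data"), Paragraph("Now")]),
    ])])],
)
word_total = count_words(meeting)
```

Stage directions are kept as separate items, so they never count as spoken words.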
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament, the following facts are known at the time of the speech act and can often be extracted from the written proceedings:
1) when it was said,
2) who said it,
3) in what function,
4) speaking on behalf of which party,
5) in which context, and
6) who was actively present during the speech act.
PoliticalMashup 16
How to exploit the extra metadata and structure?
• Let’s consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
• For every word we know both the date and the speaker.
• Every speaker belongs to a political party.
• 3D n-gram viewer: political spectrum vs. time vs. word count
• Use: topic ownership, agenda setting, framing
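The counting behind such a viewer can be sketched in a few lines. This is a minimal illustration, not the PoliticalMashup implementation: the toy speeches, the party names and the function `ngram_counts` are all hypothetical, and tokenization is a plain whitespace split.

```python
from collections import Counter
from datetime import date

# Hypothetical toy data: (date of meeting, party of speaker, speech text).
speeches = [
    (date(2010, 3, 1), "Party A", "Open data is important"),
    (date(2010, 6, 9), "Party B", "Data protection matters more than open data"),
    (date(2011, 2, 2), "Party A", "We publish data"),
]


def ngram_counts(speeches, term):
    """Count occurrences of `term` (a word or phrase) per (party, year)."""
    target = term.lower().split()
    n = len(target)
    counts = Counter()
    for day, party, text in speeches:
        tokens = text.lower().split()
        # Slide a window of n tokens over the speech and compare with the term.
        hits = sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)
        if hits:
            counts[(party, day.year)] += hits
    return counts


open_data = ngram_counts(speeches, "open data")
data_only = ngram_counts(speeches, "data")
```

Each key of the result is one cell of the party-by-time grid; the count is the third dimension of the viewer.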
PoliticalMashup 18
Political n-gram viewer: requirements
Documents:
1. Metadata: date of the meeting
2. Document structure: for every spoken word, who said it
Linked Data: speakers’ names are disambiguated, normalized, and mapped to a database with temporal party information.
Completeness and correctness: few missing or wrong data, also for the distant past.
PoliticalMashup 19
Is Linked (Open) Data the solution?
• Link the speaker’s name to a Wikipedia/DBpedia page (named entity disambiguation and resolution). See also Google Knowledge Graph and [Spitkovsky & Chang, LREC 2012].
• DBpedia extracts the link between person and party affiliation from the Wikipedia infobox.
• Timestamped triple:
  Geert Wilders is party member of VVD
  from 1998-08-25 until 2004-09-02
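A timestamped fact like the one above is a triple plus a validity interval. The sketch below stores it as a plain record and looks up a speaker's party on a given meeting date. The names `Membership` and `party_on` are hypothetical, and treating both end dates as inclusive is an assumption.

```python
from datetime import date
from typing import NamedTuple, Optional


class Membership(NamedTuple):
    person: str
    party: str
    start: date
    end: Optional[date]  # None means the membership has no known end date


# The single fact from the slide; further rows would be added the same way.
MEMBERSHIPS = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]


def party_on(person: str, day: date) -> Optional[str]:
    """Return the party `person` belonged to on `day`, or None if unknown."""
    for m in MEMBERSHIPS:
        # Both interval bounds are treated as inclusive (an assumption).
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None


during = party_on("Geert Wilders", date(2000, 1, 1))
after = party_on("Geert Wilders", date(2005, 1, 1))
```

A lookup outside every stored interval returns `None` rather than a guess, which keeps missing data visible.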
PoliticalMashup 20
DBpedia not yet reliable
• Data extraction is difficult, even from the infobox, even from complete data.
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes.
Timestamped facts are difficult to extract and difficult to represent in RDF triples.
PoliticalMashup 21
Lesson learned: requirement on metadata and relations
• One cannot rely on Linked Open Data for good-quality metadata.
• Official documents should be self-describing, also for facts which are obvious at publication time.
• Compare the speaker’s data in the original (OCRed) data and in the XMLified and enriched version:
  • Original
  • Part of it in XML
  • And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009].
• Main research questions:
  How to collect information on entities?
  How to model an entity?
  How to rank entities?
• (Parsimonious) language models work well as models [Balog et al. 2009] [Hiemstra et al. 2004].
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
PoliticalMashup 24
Content and structure search
• The usual advanced search combines keyword search with metadata search.
• Extra fields are just extra filters on the returned documents.
• With structured documents we can search on content and structure.
• Most useful task: rank the best entry points in large documents.
• Compare two search systems on the same data:
  on flat text
  on an XML representation
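The difference between the two can be illustrated with Python's standard `xml.etree.ElementTree`. The markup below is hypothetical (element and attribute names are chosen for the example, not taken from the real schema), and the sketch only retrieves entry points without ranking them.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only.
XML = """
<meeting date="2012-05-27">
  <topic title="Budget">
    <speech speaker="A" party="P1"><p>Taxes must go down.</p></speech>
    <speech speaker="B" party="P2"><p>Taxes pay for schools.</p><p>Schools matter.</p></speech>
  </topic>
  <topic title="Education">
    <speech speaker="A" party="P1"><p>Schools need teachers.</p></speech>
  </topic>
</meeting>
""".strip()


def search(root, keyword, party=None):
    """Return (topic title, speaker, paragraph) for paragraphs containing `keyword`.

    The optional `party` argument restricts matches to speeches by that party,
    which is a constraint on structure rather than on content.
    """
    hits = []
    for topic in root.iter("topic"):
        for speech in topic.iter("speech"):
            if party is not None and speech.get("party") != party:
                continue
            for p in speech.iter("p"):
                if keyword.lower() in (p.text or "").lower():
                    hits.append((topic.get("title"), speech.get("speaker"), p.text))
    return hits


root = ET.fromstring(XML)
content_only = search(root, "schools")
content_and_structure = search(root, "schools", party="P1")
```

On flat text only the first query is possible, and it can return no more than the whole document. With markup, each hit is a paragraph inside a known speech and topic, that is, an entry point.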
PoliticalMashup 25
Lesson learned: requirement on structure
• Make semantically important structure of documents explicit in XML markup.
• Publish for machine readability.
• Publish generic data, not data prepared for one use case.
PoliticalMashup 26
Application of structure: Interruption graph (Attackogram)
• MP A interrupts B ⇔ A speaks during the block of B
combined with entity profiling
http://debat.politiekinzicht.com
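The definition above translates directly into an edge count. In this sketch the input format (a block owner plus the ordered list of speakers inside the block) is an assumption; a real pipeline would presumably also need to exclude the presiding chair, which is not handled here.

```python
from collections import Counter

# Hypothetical input: each block is (owner of the block, speakers in order).
blocks = [
    ("B", ["B", "A", "B", "A", "B"]),
    ("A", ["A", "C", "A"]),
    ("B", ["B"]),
]


def interruption_graph(blocks):
    """Return a Counter mapping (interrupter, interrupted) to a count."""
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            # Anyone other than the owner who speaks inside the block interrupts.
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges


graph = interruption_graph(blocks)
```

The result is a weighted directed graph: the edge `("A", "B")` with weight 2 says that A interrupted B twice.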
PoliticalMashup 27
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection.
• What are the key infrastructural and research questions?
In what direction, and how, to scale this up?
1. in time
2. in breadth
3. in links
PoliticalMashup 28
Scale diachronically
• A stable data model and measurement procedure make this data very valuable for diachronic comparisons.
• Towards the past:
  • OCR
  • consistency in structure
  • more missing data to link to
• Towards the future:
  • remain up to date
  • legacy decisions
PoliticalMashup 29
Scale in breadth: e.g. parliamentary proceedings of all European countries
• All describe the same “script”, so all fit in one schema.
• Main question: how to connect the data from different countries?
Common structure and annotation: use the same Relax NG schema.
Common values on certain attributes:
• Entities: normalize to Wikipedia concepts
• Controlled vocabulary keywords: normalize to Eurovoc
• Language: machine translate to English
• Events: normalize to EMM NewsExplorer query, Wikinews query
PoliticalMashup 30
Scale in breadth: link to related datasets
• Link on time, entities, events, topics
• Other official publications
• News
• User-generated content
• (In our case) promises of political actors: election manifestos
PoliticalMashup 31
Conclusions
• There are ample opportunities for exploiting Official Publications.
• Preprocessing and interlinking with other datasets is difficult and does not scale well.
• High precision and recall are needed for many applications.
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer.
• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors.
• High investment, but if done in a general way, high return and impact.
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current parliamentary proceedings in a digitally sustainable manner?
Lessons learned:
• Common, open, standardized, self-describing, machine-readable
• not tied to a single application
• linked, linked, linked
• Not only shared attributes
• but, more importantly, shared data values
• also store utterly obvious facts (10 years later they aren’t)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source:
• Push at the source (in the UK: open government data; in Holland all parliamentary data is now in XML).
• Help reduce dumb cut-and-paste annotation work, so annotators can concentrate on tasks which are hard for machines (e.g. text classification).
• Emphasize the importance of using shared standards.
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
PoliticalMashup 2
Content
bull Official Documents Zoom in on a specific official publications
dataset
bull Opportunities What makes official publications data valuable
bull Requirements What is needed to make official publications data
reusable and interoperable
PoliticalMashup 3
Our Leading Research Question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner [Marx et
al 2010]
PoliticalMashup 4
W3C recommendations on Open Government Data
bull make data both machine and human readable
bull link data make data linkable provide permanent identifiers for
each government object and data item
bull provide metadata using common standards (eg Dublin Core)
bull make the data as easy to reuse (eg in mashups) as possible
Goal of this talk make this concrete
PoliticalMashup 5
Value of a large data corpus
bull Consider a 200 year corpus of temperature and humidity readings
in one location
bull Value is not in the individual ldquodocumentsrdquo
bull Value is not in the corpus as a whole
bull Value is in the relation between the ldquodocumentsrdquo
PoliticalMashup 6
Documents related by publication date
Google books Ngram viewer
PoliticalMashup 7
Properties of our Parliamentary ProceedingsDataset
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days 450000 topics 75 miljoen speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 3
Our Leading Research Question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner [Marx et
al 2010]
PoliticalMashup 4
W3C recommendations on Open Government Data
bull make data both machine and human readable
bull link data make data linkable provide permanent identifiers for
each government object and data item
bull provide metadata using common standards (eg Dublin Core)
bull make the data as easy to reuse (eg in mashups) as possible
Goal of this talk make this concrete
PoliticalMashup 5
Value of a large data corpus
bull Consider a 200 year corpus of temperature and humidity readings
in one location
bull Value is not in the individual ldquodocumentsrdquo
bull Value is not in the corpus as a whole
bull Value is in the relation between the ldquodocumentsrdquo
PoliticalMashup 6
Documents related by publication date
Google books Ngram viewer
PoliticalMashup 7
Properties of our Parliamentary ProceedingsDataset
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days 450000 topics 75 miljoen speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 4
W3C recommendations on Open Government Data
bull make data both machine and human readable
bull link data make data linkable provide permanent identifiers for
each government object and data item
bull provide metadata using common standards (eg Dublin Core)
bull make the data as easy to reuse (eg in mashups) as possible
Goal of this talk make this concrete
PoliticalMashup 5
Value of a large data corpus
bull Consider a 200 year corpus of temperature and humidity readings
in one location
bull Value is not in the individual ldquodocumentsrdquo
bull Value is not in the corpus as a whole
bull Value is in the relation between the ldquodocumentsrdquo
PoliticalMashup 6
Documents related by publication date
Google books Ngram viewer
PoliticalMashup 7
Properties of our Parliamentary ProceedingsDataset
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days 450000 topics 75 miljoen speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned: requirement on metadata and relations
• One cannot rely on Linked Open Data for good quality metadata
• Official documents should be self-describing, also for facts which
are obvious at publication time
• Compare speaker's data in original (OCRed) data and XMLified
and enriched version
  • Original
  • Part of it in XML
  • And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track]
[Balog et al. 2009]
• Main research questions:
  • How to collect information on entities?
  • How to model an entity?
  • How to rank entities?
• (Parsimonious) language models work well as models [Balog et
al. 2009] [Hiemstra et al. 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
PoliticalMashup 24
Content and structure search
• The usual advanced search combines keyword search with metadata
search
• Extra fields are just extra filters on the returned documents
• With structured documents we can search on content and
structure
• Most useful task: rank the best entry points in large documents
• Compare two search systems on the same data:
  • on flat text
  • on an XML representation
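The difference between the two kinds of search can be sketched with Python's standard `xml.etree.ElementTree`. The element and attribute names (`meeting`, `topic`, `speech`, `p`, `speaker`, `party`) and the sample text are invented for illustration and are not the project's schema. The structured search returns entry points (topic, speaker, paragraph) without ranking them; ranking is the harder task named above.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Invented sample document; not real proceedings.
SAMPLE = """<meeting date="2012-05-27">
  <topic title="Budget">
    <speech speaker="A" party="X"><p>We must cut taxes.</p></speech>
    <speech speaker="B" party="Y"><p>Taxes fund schools.</p><p>Schools matter.</p></speech>
  </topic>
</meeting>"""


def flat_search(xml_text: str, keyword: str) -> bool:
    """Flat-text search: can only say whether the document matches."""
    text = " ".join(ET.fromstring(xml_text).itertext())
    return keyword.lower() in text.lower()


def structured_search(
    xml_text: str, keyword: str, party: str
) -> List[Tuple[str, str, str]]:
    """Content-and-structure search: paragraphs containing `keyword`,
    restricted to speeches made on behalf of `party`."""
    root = ET.fromstring(xml_text)
    hits = []
    for topic in root.iter("topic"):
        for speech in topic.iter("speech"):
            if speech.get("party") != party:
                continue
            for p in speech.iter("p"):
                if keyword.lower() in (p.text or "").lower():
                    hits.append((topic.get("title"), speech.get("speaker"), p.text))
    return hits
```

On the flat text, the party restriction cannot be expressed at all unless someone prepared a metadata field for it in advance; on the XML it is just another condition on the document structure.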
PoliticalMashup 25
Lesson learned: requirement on structure
• Make semantically important structure of documents explicit in
XML markup
• Publish for machine readability
• Publish generic data, not data prepared for one use case
PoliticalMashup 26
Application of structure: Interruption graph (Attackogram)
• MP A interrupts B ⇔ A speaks during the block of B
• combined with entity profiling
http://debat.politiekinzicht.com
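The definition above translates almost directly into code once speech blocks are explicit in the markup. The following is a minimal sketch, assuming each block is given as its owner plus the ordered list of everyone who speaks inside it; the function name and input shape are hypothetical, and a real system would probably exclude the chair, who also speaks during other members' blocks.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical input shape: (owner of the block, speakers inside it, in order).
Block = Tuple[str, List[str]]


def interruption_graph(blocks: Iterable[Block]) -> Counter:
    """Return weighted directed edges (interrupter, interrupted) -> count.

    A interrupts B if and only if A speaks during the block of B.
    """
    edges: Counter = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges
```

For example, the blocks `[("B", ["B", "A", "B", "A", "B"]), ("A", ["A", "C", "A"])]` give two edges: A interrupts B twice, and C interrupts A once.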
PoliticalMashup 27
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection
• What are the key infrastructural and research questions?
In what direction, and how, to scale this up?
1. in time
2. in breadth
3. in links
PoliticalMashup 28
Scale diachronically
• Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to
• towards the future
  • remain up to date
  • legacy decisions
PoliticalMashup 29
Scale in breadth: e.g. parliamentary proceedings of all European countries
• All describe the same "script", so all fit in one schema
• Main question: how to connect the data from different countries?
Common structure and annotation: use the same Relax NG
schema
Common values on certain attributes:
• Entities: Normalize to Wikipedia concepts
• Controlled vocabulary keywords: Normalize to Eurovoc
• Language: Machine translate to English
• Events: Normalize to EMM Newsexplorer query, Wikinews
query
PoliticalMashup 30
Scale in breadth: link to related datasets
• Link on time, entities, events, topics
• Other official publications
• News
• User-generated content
• (In our case) promises of political actors: election manifestos
PoliticalMashup 31
Conclusions
• There are ample opportunities for exploiting Official Publications
• Preprocessing and interlinking with other datasets is difficult and
does not scale well
• High precision and recall are needed for many applications
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough:
create special-purpose knowledge extractors
• High investment, but if done in a general way, high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digitally sustainable manner?
Lessons learned:
• Common, open, standardized, self-describing, machine-readable
• not tied to a single application
• linked, linked, linked
  • Not only shared attributes
  • but, more importantly, shared data values
• also store utterly obvious facts (10 years later they aren't)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
• Push at the source (in the UK: open government data; in Holland all
parliamentary data is now in XML)
• Help reduce dumb cut-and-paste annotation work, so annotators
can concentrate on tasks which are hard for machines (e.g.
text classification)
• Emphasize the importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
PoliticalMashup 5
Value of a large data corpus
bull Consider a 200 year corpus of temperature and humidity readings
in one location
bull Value is not in the individual ldquodocumentsrdquo
bull Value is not in the corpus as a whole
bull Value is in the relation between the ldquodocumentsrdquo
PoliticalMashup 6
Documents related by publication date
Google books Ngram viewer
PoliticalMashup 7
Properties of our Parliamentary ProceedingsDataset
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days 450000 topics 75 miljoen speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 6
Documents related by publication date
Google books Ngram viewer
PoliticalMashup 7
Properties of our Parliamentary ProceedingsDataset
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days 450000 topics 75 miljoen speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 7
Properties of our Parliamentary ProceedingsDataset
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days 450000 topics 75 miljoen speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 8
Longitudinal data
bull weakly measurement for over 150 years
bull very stable measurement procedure and data model
PoliticalMashup 9
Data about human behaviour
PoliticalMashup 10
Often rather boring
PoliticalMashup 11
But sometimes full of drama and excitement
PoliticalMashup 12
Loads of measurement points
24000 days, 450000 topics, 75 million speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
• very sparse available metadata
• very rich "metadata" sits hidden inside the raw data
• Rich data model
  • Meeting (1 day)
  • Topic
  • Stage direction
  • Scene
  • Stage direction
  • Speech
  • Paragraph
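The data model listed above can be sketched as a handful of Python dataclasses. This is only one plausible reading of the flattened list (a meeting holds topics, a topic holds scenes, a scene holds speeches, a speech holds paragraphs, with stage directions at topic and scene level). All class and field names are illustrative assumptions, not the PoliticalMashup schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class StageDirection:
    text: str


@dataclass
class Speech:
    speaker: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Scene:
    # Speeches and stage directions, kept in document order (assumed nesting).
    items: List[Union[Speech, StageDirection]] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    # Scenes and topic-level stage directions (assumed nesting).
    items: List[Union[Scene, StageDirection]] = field(default_factory=list)


@dataclass
class Meeting:
    date: str  # ISO date; one meeting covers one day
    topics: List[Topic] = field(default_factory=list)


def count_paragraphs(meeting: Meeting) -> int:
    """Walk the hierarchy and count every spoken paragraph."""
    total = 0
    for topic in meeting.topics:
        for item in topic.items:
            if isinstance(item, Scene):
                for sub in item.items:
                    if isinstance(sub, Speech):
                        total += len(sub.paragraphs)
    return total
```

Keeping stage directions as first-class items, rather than dropping them, preserves the document order that later applications rely on.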
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament, the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings:
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context, and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
• Let's consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
• For every word we know both the date and the speaker
• Every speaker belongs to a political party
• 3D n-gram viewer: political spectrum vs. time vs. word count
• Use: topic ownership, agenda setting, framing
PoliticalMashup 18
Political n-gram viewer: requirements
Documents:
1. metadata: date of the meeting
2. document structure: for every spoken word, who said it
Linked Data: Speakers' names are disambiguated, normalized, and
mapped to a database with temporal party information
Completeness and correctness: Few missing or wrong data, also for
long ago
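Once those requirements are met, the counting behind such a viewer is small. The sketch below is a minimal illustration, assuming speeches have already been reduced to (ISO date, party, text) tuples; the function names and the whitespace tokenizer are assumptions made for this example, not part of the PoliticalMashup code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

SpeechRecord = Tuple[str, str, str]  # (ISO date, party, text), assumed input shape


def ngram_counts(speeches: Iterable[SpeechRecord], n: int = 1) -> Dict[Tuple[str, int], Counter]:
    """Count n-grams per (party, year) bucket."""
    counts: Dict[Tuple[str, int], Counter] = {}
    for date, party, text in speeches:
        year = int(date[:4])
        tokens = text.lower().split()  # naive tokenizer, for illustration only
        grams = zip(*(tokens[i:] for i in range(n)))
        bucket = counts.setdefault((party, year), Counter())
        bucket.update(" ".join(g) for g in grams)
    return counts


def relative_frequency(counts: Dict[Tuple[str, int], Counter],
                       ngram: str, party: str, year: int) -> float:
    """Share of a party's n-grams in a given year that equal `ngram`."""
    bucket = counts.get((party, year))
    if not bucket:
        return 0.0
    return bucket[ngram] / sum(bucket.values())
```

Plotting `relative_frequency` over years, with parties ordered along the political spectrum, gives the three axes named on the previous slide.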
PoliticalMashup 19
Is Linked (Open) Data the solution?
• Link speaker's name to Wikipedia/DBpedia page (named entity
disambiguation and resolution). See also Google Knowledge
Graph and [Spitkovsky & Chang, LREC 2012]
• DBpedia extracts the link between person and party affiliation from
the Wikipedia infobox
• Timestamped triple:
  Geert Wilders is partymember of VVD
  from 1998-08-25 until 2004-09-02
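A database with temporal party information boils down to interval lookups. The following is a minimal sketch using only the one membership fact given on the slide; the `Membership` record, the `party_on` helper, and the choice to treat the end date as inclusive are assumptions made for illustration.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class Membership(NamedTuple):
    person: str
    party: str
    start: date
    end: Optional[date]  # None means the membership is still open


def party_on(memberships: Iterable[Membership], person: str, day: date) -> Optional[str]:
    """Return the party a person belonged to on a given day, if known."""
    for m in memberships:
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None  # unknown: the fact is missing from the database


# The single fact from the slide; the end date is assumed to be inclusive.
MEMBERSHIPS = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]
```

Returning `None` for dates outside any known interval makes missing data visible instead of silently assigning a speech to the wrong party.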
PoliticalMashup 20
DBpedia not yet reliable
• Data extraction is difficult, even from the infobox, even from
complete data
[Screenshot: Wikipedia page of Geert Wilders]
[Screenshot: DBpedia information about Geert Wilders]
Notice the values of the party and the office attributes.
Timestamped facts are difficult to extract and difficult to
represent in RDF triples.
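The representation problem can be made concrete: a triple has three slots, so a fact with a validity interval needs an intermediate node. The sketch below uses plain Python tuples rather than an RDF library, and the `ex:` property names are a made-up vocabulary for illustration only.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def membership_triples(node_id: str, person: str, party: str,
                       start: str, end: str) -> List[Triple]:
    """Encode one timestamped membership as triples around an intermediate node.

    A single (person, memberOf, party) triple has no room for the interval,
    so the membership itself becomes a resource that carries the dates.
    The ex: names are hypothetical.
    """
    return [
        (node_id, "rdf:type", "ex:Membership"),
        (node_id, "ex:member", person),
        (node_id, "ex:party", party),
        (node_id, "ex:from", start),
        (node_id, "ex:until", end),
    ]


triples = membership_triples(
    "ex:membership1", "ex:Geert_Wilders", "ex:VVD", "1998-08-25", "2004-09-02"
)
```

One human-readable fact turns into five triples, and any query for party affiliation has to know about the intermediate node.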
PoliticalMashup 21
Lesson learned: requirement on metadata and relations
• One cannot rely on Linked Open Data for good-quality metadata
• Official documents should be self-describing, also for facts which
are obvious at publication time
• Compare speaker's data in the original (OCRed) data and the XMLified
and enriched version:
  • Original
  • Part of it in XML
  • And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track]
[Balog et al. 2009]
• Main research questions:
  How to collect information on entities?
  How to model an entity?
  How to rank entities?
• (Parsimonious) language models work well as models [Balog et
al. 2009][Hiemstra et al. 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
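To give a feel for language-model-based entity ranking, here is a minimal sketch that scores entities by query likelihood under a unigram model with Jelinek-Mercer smoothing. It is a simpler stand-in, not the parsimonious language model cited above, and the profile texts, function name, and `lam` parameter are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def rank_entities(profiles: Dict[str, str], query: str,
                  lam: float = 0.5) -> List[Tuple[str, float]]:
    """Rank entities by log P(query | entity), smoothed with the collection model."""
    docs = {entity: Counter(text.lower().split()) for entity, text in profiles.items()}
    background: Counter = Counter()
    for counts in docs.values():
        background.update(counts)
    bg_total = sum(background.values())

    scores: Dict[str, float] = {}
    for entity, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p_doc = counts[term] / total if total else 0.0
            p_bg = background[term] / bg_total if bg_total else 0.0
            p = lam * p_doc + (1 - lam) * p_bg
            if p == 0.0:
                # Term never occurs in the collection at all.
                score = float("-inf")
                break
            score += math.log(p)
        scores[entity] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Here each entity profile is simply the concatenated text of everything that entity said, which is exactly what speaker-level markup makes available.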
PoliticalMashup 24
Content and structure search
• Usual advanced search combines keyword search with metadata
search
• Extra fields are just extra filters on the returned documents
• With structured documents we can search on content and
structure
• Most useful task: rank best entry points in large documents
• Compare two search systems on the same data:
  on flat text
  on an XML representation
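The difference shows up as soon as the structure is queryable. The sketch below runs a combined content-and-structure query with Python's standard `xml.etree.ElementTree`; the element and attribute names in the sample (`meeting`, `topic`, `speech`, `p`, `speaker`, `party`) are invented for this example and are not the actual PoliticalMashup markup.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hypothetical markup, for illustration only.
SAMPLE = """
<meeting date="2011-05-12">
  <topic title="Budget">
    <speech speaker="A" party="P1"><p>We must cut spending.</p></speech>
    <speech speaker="B" party="P2"><p>Spending on care must grow.</p><p>Care matters.</p></speech>
  </topic>
</meeting>
"""


def search(root: ET.Element, keyword: str,
           party: Optional[str] = None) -> List[Tuple[str, int]]:
    """Return (speaker, term frequency) per matching speech, best first.

    The structural part (restrict to speeches of one party) is an XPath
    filter; the content part is a plain term count inside each speech.
    """
    xpath = ".//speech" if party is None else f".//speech[@party='{party}']"
    hits = []
    for speech in root.iterfind(xpath):
        text = " ".join(p.text or "" for p in speech.iter("p"))
        tf = re.findall(r"\w+", text.lower()).count(keyword.lower())
        if tf:
            hits.append((speech.get("speaker"), tf))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


root = ET.fromstring(SAMPLE)
```

Because the result unit is a speech rather than a whole meeting, each hit is already an entry point into the large document.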
PoliticalMashup 25
Lesson learned: requirement on structure
• Make semantically important structure of documents explicit in
XML markup
• Publish for machine readability
• Publish generic data, not data prepared for one use case
PoliticalMashup 26
Application of structure: Interruption graph (Attackogram)
• MP A interrupts B ⇔ A speaks during the block of B
combined with entity profiling:
http://debat.politiekinzicht.com
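The interruption definition translates directly into an edge count. A minimal sketch, assuming each block has been reduced to its owner plus the ordered list of people who spoke during it (the input shape and function name are assumptions for illustration):

```python
from collections import Counter
from typing import Iterable, List, Tuple

Block = Tuple[str, List[str]]  # (block owner, speakers in order), assumed input shape


def interruption_graph(blocks: Iterable[Block]) -> Counter:
    """Count directed edges (interrupter, interrupted).

    A interrupts B whenever A speaks during a block owned by B.
    """
    edges: Counter = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges
```

In practice one would probably filter out procedural turns by the chair before counting; that step is left out here.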
PoliticalMashup 27
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection
• What are the key infrastructural and research questions?
In what direction, and how, to scale this up?
1. in time
2. in breadth
3. in links
PoliticalMashup 28
Scale diachronically
• Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to
• towards the future
  • remain up to date
  • legacy decisions
PoliticalMashup 29
Scale in breadth: e.g. parl. proceedings of all European countries
• All describe the same "script", so all fit in one schema
• Main question: how to connect the data from different countries?
Common structure and annotation: use the same Relax NG
schema
Common values on certain attributes:
• Entities: Normalize to Wikipedia concepts
• Controlled vocabulary keywords: Normalize to Eurovoc
• Language: Machine translate to English
• Events: Normalize to EMM NewsExplorer query, Wikinews
query
PoliticalMashup 30
Scale in breadth: link to related datasets
• Link on time, entities, events, topics
• Other official publications
• News
• User-generated content
• (In our case) promises of political actors: election manifestos
PoliticalMashup 31
Conclusions
• There are ample opportunities for exploiting Official Publications
• Preprocessing and interlinking with other datasets is difficult and
does not scale well
• High precision and recall are needed for many applications
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough:
create special-purpose knowledge extractors
• High investment, but if done in a general way, high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner?
Lessons learned:
• Common, open, standardized, self-describing, machine readable
• not tied to a single application
• linked, linked, linked
  • Not only shared attributes
  • but more importantly shared data values
• also store utterly obvious facts (10 years later they aren't)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
• Push at the source (in the UK: open government data; in Holland all
parliamentary data is now in XML)
• Help reduce dumb cut-and-paste annotation work, so annotators
can concentrate on tasks which are hard for machines (e.g.
text classification)
• Emphasize the importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
PoliticalMashup 12
Loads of measurement points
24000 days, 450000 topics, 75 million speeches
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
• very sparse available metadata
• very rich “metadata” sits hidden inside the raw data
• Rich data model
  • Meeting (1 day)
    • Topic
      • Stage direction
      • Scene
        • Stage direction
        • Speech
          • Paragraph
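The data model above can be sketched as a small set of nested record types. This is only an illustration: the slide lists the levels but not how they nest, so the nesting used here (a topic contains stage directions and scenes, a scene contains stage directions and speeches, a speech contains paragraphs) is an assumption. All class names, field names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class StageDirection:
    text: str


@dataclass
class Speech:
    speaker: str
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Scene:
    # Assumed nesting: a scene mixes stage directions and speeches.
    children: List[Union[StageDirection, Speech]] = field(default_factory=list)


@dataclass
class Topic:
    title: str
    # Assumed nesting: a topic mixes stage directions and scenes.
    children: List[Union[StageDirection, Scene]] = field(default_factory=list)


@dataclass
class Meeting:
    date: str  # ISO date; one meeting covers one day
    topics: List[Topic] = field(default_factory=list)


def count_paragraphs(meeting: Meeting) -> int:
    """Count all paragraphs spoken in a meeting by walking the hierarchy."""
    total = 0
    for topic in meeting.topics:
        for child in topic.children:
            if isinstance(child, Scene):
                for item in child.children:
                    if isinstance(item, Speech):
                        total += len(item.paragraphs)
    return total


# Hypothetical sample meeting, made up for illustration.
example = Meeting(
    date="2000-01-01",
    topics=[
        Topic(
            title="Example topic",
            children=[
                StageDirection("The chair opens the meeting."),
                Scene(
                    children=[
                        Speech("Speaker A", ["First paragraph.", "Second paragraph."]),
                        StageDirection("Applause."),
                        Speech("Speaker B", ["A reply."]),
                    ]
                ),
            ],
        )
    ],
)
```

Because each level is explicit, a question such as "how many paragraphs were spoken in this meeting" becomes a plain traversal rather than a text-mining task.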
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament, the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings:
1) when it was said,
2) who said it,
3) in what function,
4) speaking on behalf of which party,
5) in which context, and
6) who was actively present during the speech act.
PoliticalMashup 16
How to exploit the extra metadata and structure
• Let's consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
• For every word we know both the date and the speaker
• Every speaker belongs to a political party
• 3D n-gram viewer: political spectrum vs. time vs. word count
• Use: topic ownership, agenda setting, framing
PoliticalMashup 18
Political n-gram viewer: requirements
Documents
1. metadata: date of the meeting
2. document structure: for every spoken word, who said it
Linked Data: speakers' names are disambiguated, normalized, and
mapped to a database with temporal party information
Completeness and correctness: few missing or wrong data, also for
data from long ago
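The requirements above boil down to three fields per speech: date, party, and text. A minimal sketch of the counting step behind such a viewer follows. It assumes the speaker-to-party mapping has already been resolved for the date of each speech; the tuple layout, function names, and sample speeches are hypothetical and not taken from the PoliticalMashup code.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout: (ISO date, party, speech text).
SpeechRecord = Tuple[str, str, str]


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield all consecutive n-token windows."""
    return zip(*(tokens[i:] for i in range(n)))


def count_by_party_and_year(speeches: Iterable[SpeechRecord], phrase: str) -> Counter:
    """Count occurrences of `phrase` per (party, year)."""
    target = tuple(phrase.lower().split())
    n = len(target)
    counts: Counter = Counter()
    for iso_date, party, text in speeches:
        tokens = text.lower().split()  # naive whitespace tokenization
        year = int(iso_date[:4])
        hits = sum(1 for gram in ngrams(tokens, n) if gram == target)
        if hits:
            counts[(party, year)] += hits
    return counts


# Made-up sample data for illustration only.
speeches = [
    ("1999-03-01", "Party A", "climate change is real"),
    ("1999-06-10", "Party B", "we debate climate change and climate change policy"),
    ("2005-01-20", "Party A", "no mention here"),
]
counts = count_by_party_and_year(speeches, "climate change")
```

The resulting (party, year) counts are the data behind a plot with parties on one axis, time on another, and frequency on the third. A real viewer would also normalize by the total number of words per party and year.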
PoliticalMashup 19
Is Linked (Open) Data the solution?
• Link speaker's name to Wikipedia/DBpedia page (named entity
disambiguation and resolution). See also Google Knowledge
Graph and [Spitkovsky, Chang, LREC 2012]
• DBpedia extracts the link between person and party affiliation from
the Wikipedia infobox
• Timestamped triple:
Geert Wilders is party member of VVD
from 1998-08-25 until 2004-09-02
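A timestamped fact like the one above is easy to hold as a record with a validity interval, which is the form the n-gram viewer needs when it asks which party a speaker belonged to on a given day. The sketch below is an illustration only: the record type and lookup function are hypothetical, and treating the end date as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Membership:
    person: str
    party: str
    start: date
    end: Optional[date]  # None means the membership is still open


def party_on(memberships: List[Membership], person: str, day: date) -> Optional[str]:
    """Return the party `person` belonged to on `day`, or None if unknown."""
    for m in memberships:
        # Assumption: both interval ends are inclusive.
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None


# The single fact given on the slide.
memberships = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]
```

For example, `party_on(memberships, "Geert Wilders", date(2000, 1, 1))` returns `"VVD"`, while a date outside the interval returns `None`. A plain subject-predicate-object triple has no slot for the interval, which is why such facts are awkward to express in RDF.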
PoliticalMashup 20
DBpedia not yet reliable
• Data extraction is difficult, even from the infobox, even from
complete data
[Screenshot: Wikipedia page of Geert Wilders]
[Screenshot: DBpedia information about Geert Wilders]
Notice the values of the party and the office attributes.
Timestamped facts are difficult to extract and difficult to
represent in RDF triples.
PoliticalMashup 21
Lesson learned: requirement on metadata and relations
• One cannot rely on Linked Open Data for good-quality metadata
• Official documents should be self-describing, also for facts which
are obvious at publication time
• Compare speaker's data in the original (OCRed) data and in the XMLified
and enriched version
  • Original
  • Part of it in XML
  • And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track]
[Balog et al. 2009]
• Main research questions:
  How to collect information on entities?
  How to model an entity?
  How to rank entities?
• (Parsimonious) language models work well as models [Balog et
al. 2009][Hiemstra et al. 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
PoliticalMashup 24
Content and structure search
• Usual advanced search combines keyword search with metadata
search
• Extra fields are just extra filters on the returned documents
• With structured documents we can search on content and
structure
• Most useful task: rank best entry points in large documents
• Compare two search systems on the same data:
  on flat text
  on an XML representation
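The difference between the two systems can be illustrated with a small content-and-structure query. This sketch uses Python's standard `xml.etree.ElementTree`; the element and attribute names (`meeting`, `topic`, `scene`, `speech`, `p`, `party`, `id`) and the sample text are hypothetical and not the actual PoliticalMashup schema. The query returns paragraphs, not whole documents, as entry points.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical XML layout, made up for illustration.
SAMPLE = """
<meeting date="2000-01-01">
  <topic title="Budget">
    <scene>
      <speech id="s1" speaker="Speaker A" party="Party A">
        <p>The budget needs more room for education.</p>
      </speech>
      <speech id="s2" speaker="Speaker B" party="Party B">
        <p>Education is already well funded.</p>
        <p>Let us talk about roads.</p>
      </speech>
    </scene>
  </topic>
</meeting>
"""


def find_paragraphs(root: ET.Element, keyword: str, party: str) -> List[Tuple[str, str]]:
    """Return (speech id, paragraph text) for paragraphs that mention
    `keyword` inside speeches by members of `party`."""
    hits = []
    # Structure condition: only speeches of the given party.
    for speech in root.findall(f".//speech[@party='{party}']"):
        for p in speech.findall("p"):
            # Content condition: case-insensitive keyword match.
            if keyword.lower() in (p.text or "").lower():
                hits.append((speech.get("id"), p.text))
    return hits


root = ET.fromstring(SAMPLE)
hits = find_paragraphs(root, "education", "Party B")
```

On flat text the same keyword would match the whole meeting and the user would have to scroll to the relevant passage. With the structure explicit, the party condition and the paragraph-level result come for free.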
PoliticalMashup 25
Lesson learned: requirement on structure
• Make semantically important structure of documents explicit in
XML markup
• Publish for machine readability
• Publish generic data, not data prepared for one use case
PoliticalMashup 26
Application of structure: Interruption graph (Attackogram)
• MP A interrupts B ⇔ A speaks during the block of B
combined with entity profiling
http://debat.politiekinzicht.com
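The definition above translates almost directly into code once the blocks are explicit in the markup. A minimal sketch follows, assuming each block is available as the MP who holds the floor plus the ordered list of everyone who speaks during it. This input layout and the function name are hypothetical, and a real system would probably exclude the chair.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical layout: (block owner, speakers in order within the block).
Block = Tuple[str, List[str]]


def interruption_graph(blocks: Iterable[Block]) -> Counter:
    """Build weighted directed edges (interrupter, interrupted).

    A interrupts B whenever A speaks during the block of B.
    """
    edges: Counter = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges


# Made-up sample debate for illustration.
blocks = [
    ("B", ["B", "A", "B", "A", "B"]),  # A interrupts B twice
    ("A", ["A", "C", "A"]),            # C interrupts A once
]
edges = interruption_graph(blocks)
```

The edge weights can then be drawn as a directed graph, with node details taken from the entity profiles.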
PoliticalMashup 27
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection
• What are the key infrastructural and research questions?
In what direction, and how, to scale this up?
1. in time
2. in breadth
3. in links
PoliticalMashup 28
Scale diachronically
• Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to
• towards the future
  • remain up to date
  • legacy decisions
PoliticalMashup 29
Scale in breadth, e.g. parliamentary proceedings of all European countries
• All describe the same “script”, so all fit in one schema
• Main question: how to connect the data from different countries?
Common structure and annotation: use the same Relax NG
schema
Common values on certain attributes:
• Entities: normalize to Wikipedia concepts
• Controlled vocabulary keywords: normalize to Eurovoc
• Language: machine translate to English
• Events: normalize to EMM Newsexplorer query, Wikinews
query
PoliticalMashup 30
Scale in breadth: link to related datasets
• Link on time, entities, events, topics
• Other official publications
• News
• User-generated content
• (In our case) promises of political actors: election manifestos
PoliticalMashup 31
Conclusions
• There are ample opportunities for exploiting Official Publications
• Preprocessing and interlinking with other datasets is difficult and
does not scale well
• High precision and recall are needed for many applications
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough:
create special-purpose knowledge extractors
• High investment, but if done in a general way, high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digitally sustainable manner?
Lessons learned
• Common, open, standardized, self-describing, machine-readable
• not tied to a single application
• linked, linked, linked
• Not only shared attributes
• but, more importantly, shared data values
• also store utterly obvious facts (10 years later they aren't)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
• Push at the source (in the UK: open government data; in Holland all
parliamentary data is now in XML)
• Help reduce dumb cut-and-paste annotation work, so annotators
can concentrate on tasks which are hard for machines (e.g.
text classification)
• Emphasize the importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents: Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution?
- 20 DBpedia not yet reliable
- 21 Lesson learned: requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned: requirement on structure
- 26 Application of structure: Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth, e.g. parliamentary proceedings of all European countries
- 30 Scale in breadth: link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
PoliticalMashup 13
Digitally available
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 14
About this collection
bull very sparse available metadata
bull very rich ldquometadatardquo sits hidden inside the raw data
bull Rich data model
bull Meeting (1 Day)
bull Topic
bull Stage direction
bull Scene
bull Stage direction
bull Speech
bull Paragraph
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution
- 20 DBpedia not yet reliable
- 21 Lesson learned requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned requirement on structure
- 26 Application of structure Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth eg parlproceedings of all European countries
- 30 Scale in breadth link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
-
PoliticalMashup 15
Very rich metadata for each word
For every word spoken in parliament the following facts are known
at the time of the speech act and can often be extracted from the
written proceedings
1) when it was said
2) who said it
3) in what function
4) speaking on behalf of which party
5) in which context and
6) who was actively present during the speech act
PoliticalMashup 16
How to exploit the extra metadata and structure
bull Letrsquos consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
bull From every word we know both the date and the speaker
bull Every speaker belongs to a political party
bull 3D n-gram viewer political spectrum vs time vs word-count
bull Use topic ownership agenda setting framing
PoliticalMashup 18
Political n-gram viewer requirements
documents
1 metadata date of the meeting
2 document structure for every spoken word who said it
Linked Data Speakers names are disambiguated normalized and
mapped to a database with temporal party information
Completeness and correctness Few missing or wrong data also for
long time ago
PoliticalMashup 19
Is Linked (Open) Data the solution
bull Link speakers name to WikipediaDBpedia page (named entity
disambiguation and resolution) See also Google Knowledge
Graph and [Spitkovsky Chang LREC 2012]
bull DBpedia extracts link between person and party affiliation from
Wikipedia infobox
bull Timestamped triple
Geert Wilders is partymember of VVD
from 1998-08-25 until 2004-09-02
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents: Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure?
- 17 Political n-gram viewer
- 18 Political n-gram viewer: requirements
- 19 Is Linked (Open) Data the solution?
- 20 DBpedia not yet reliable
- 21 Lesson learned: requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned: requirement on structure
- 26 Application of structure: Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth, e.g. parliamentary proceedings of all European countries
- 30 Scale in breadth: link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
PoliticalMashup 16
How to exploit the extra metadata and structure?
• Let's consider a simple killer app
PoliticalMashup 17
Political n-gram viewer
• For every word we know both the date and the speaker
• Every speaker belongs to a political party
• 3D n-gram viewer: political spectrum vs. time vs. word count
• Use: topic ownership, agenda setting, framing
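The counting behind such a viewer can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the PoliticalMashup implementation: the `Speech` record, its field names, the party labels, and the sample texts are all hypothetical, and it assumes every speech already carries a meeting date and a resolved party. It handles single-word terms only.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Speech:
    # Hypothetical record: one speech with its meeting date and the speaker's party.
    day: date
    party: str
    text: str


def term_counts(speeches, term):
    """Count occurrences of a single word per (party, year)."""
    counts = Counter()
    needle = term.lower()
    for s in speeches:
        # Naive whitespace tokenisation; a real system would use a proper tokenizer.
        hits = s.text.lower().split().count(needle)
        if hits:
            counts[(s.party, s.day.year)] += hits
    return counts


# Made-up sample data, for illustration only.
speeches = [
    Speech(date(2010, 3, 1), "Party A", "integration and integration policy"),
    Speech(date(2010, 5, 9), "Party B", "the budget and integration"),
    Speech(date(2011, 1, 20), "Party A", "the budget"),
]
counts = term_counts(speeches, "integration")
```

Each key of `counts` is one point in the party vs. time plane, and its value is the word count, which is the third axis of the viewer.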
PoliticalMashup 18
Political n-gram viewer: requirements
Documents:
1. metadata: date of the meeting
2. document structure: for every spoken word, who said it
Linked Data: speakers' names are disambiguated, normalized, and
mapped to a database with temporal party information
Completeness and correctness: few missing or wrong data, also for
data from long ago
PoliticalMashup 19
Is Linked (Open) Data the solution?
• Link each speaker's name to a Wikipedia/DBpedia page (named entity
disambiguation and resolution). See also Google Knowledge
Graph and [Spitkovsky, Chang, LREC 2012]
• DBpedia extracts the link between person and party affiliation from
the Wikipedia infobox
• Timestamped triple:
Geert Wilders is party member of VVD
from 1998-08-25 until 2004-09-02
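A timestamped fact like the one above is easy to hold in an ordinary record, which is what the n-gram viewer needs when it asks "which party did this speaker belong to on the day of the meeting?". The following is a minimal sketch: the `Membership` class and `party_on` function are hypothetical names, and treating the end date as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Membership:
    # A timestamped fact: `person` was a member of `party` from `start` to `end`.
    person: str
    party: str
    start: date
    end: Optional[date]  # None means "no end date recorded"


def party_on(memberships, person, day):
    """Return the party `person` belonged to on `day`, or None if unknown."""
    for m in memberships:
        # Assumption: both interval bounds are inclusive.
        if m.person == person and m.start <= day and (m.end is None or day <= m.end):
            return m.party
    return None


# The single fact given on the slide; a real table would hold many rows.
memberships = [
    Membership("Geert Wilders", "VVD", date(1998, 8, 25), date(2004, 9, 2)),
]
```

A lookup for a date outside every known interval returns `None` rather than a guess, which keeps missing data visible instead of silently wrong.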
PoliticalMashup 20
DBpedia not yet reliable
• Data extraction is difficult, even from the infobox, even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes.
Timestamped facts are difficult to extract and difficult to
represent in RDF triples.
PoliticalMashup 21
Lesson learned: requirement on metadata and relations
• One cannot rely on Linked Open Data for good-quality metadata
• Official documents should be self-describing, also for facts which
are obvious at publication time
• Compare the speaker's data in the original (OCRed) data and in the XMLified
and enriched version
• Original
• Part of it in XML
• And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track],
[Balog et al. 2009]
• Main research questions:
How to collect information on entities?
How to model an entity?
How to rank entities?
• (Parsimonious) language models work well as models [Balog et
al. 2009], [Hiemstra et al. 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
PoliticalMashup 24
Content and structure search
• The usual advanced search combines keyword search with metadata
search
• Extra fields are just extra filters on the returned documents
• With structured documents we can search on content and
structure
• Most useful task: rank the best entry points in large documents
• Compare two search systems on the same data:
on flat text
on an XML representation
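The difference between filtering whole documents and ranking entry points can be illustrated with a small sketch. The markup below is hypothetical: the element and attribute names (`proceedings`, `topic`, `speech`, `speaker`, `party`) are assumptions for illustration, not the PoliticalMashup schema, and the scoring is a plain substring count rather than a real retrieval model.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element and attribute names are assumptions.
DOC = """
<proceedings date="2011-06-01">
  <topic title="Budget">
    <speech id="s1" speaker="A" party="P1"><p>The budget needs reform.</p></speech>
    <speech id="s2" speaker="B" party="P2"><p>Budget, budget, and again the budget.</p></speech>
  </topic>
  <topic title="Education">
    <speech id="s3" speaker="B" party="P2"><p>Schools need teachers.</p></speech>
  </topic>
</proceedings>
"""


def entry_points(root, keyword, party=None):
    """Rank speeches (not whole documents) by how often they mention `keyword`."""
    keyword = keyword.lower()
    ranked = []
    for speech in root.iter("speech"):
        # Structure condition: optionally restrict to speeches by one party.
        if party is not None and speech.get("party") != party:
            continue
        # Content condition: case-insensitive substring count in the speech text.
        text = "".join(speech.itertext()).lower()
        score = text.count(keyword)
        if score:
            ranked.append((score, speech.get("id")))
    # Highest score first; ties broken by id for a stable order.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [speech_id for _, speech_id in ranked]


root = ET.fromstring(DOC)
```

On flat text the same query could only return the whole meeting; with the structure explicit, the result is the list of speeches to jump to.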
PoliticalMashup 25
Lesson learned: requirement on structure
• Make the semantically important structure of documents explicit in
XML markup
• Publish for machine readability
• Publish generic data, not data prepared for one use case
PoliticalMashup 26
Application of structure: Interruption graph (Attackogram)
• MP A interrupts B ⇔ A speaks during the block of B
combined with entity profiling:
http://debat.politiekinzicht.com
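The definition "A interrupts B if and only if A speaks during the block of B" translates directly into a weighted directed graph. A minimal sketch, assuming a hypothetical input format in which every block names its owner (the MP who has the floor) and lists everyone who speaks inside it:

```python
from collections import Counter

# Hypothetical input: each block belongs to one MP (the one who has the floor)
# and lists everyone who speaks inside it, in order.
blocks = [
    ("B", ["B", "A", "B", "C", "B"]),
    ("A", ["A", "B", "A"]),
    ("B", ["B", "A", "B"]),
]


def interruption_graph(blocks):
    """Return weighted edges (interrupter, interrupted) -> number of interruptions."""
    edges = Counter()
    for owner, speakers in blocks:
        for speaker in speakers:
            # Anyone other than the owner speaking inside the block interrupts the owner.
            if speaker != owner:
                edges[(speaker, owner)] += 1
    return edges


graph = interruption_graph(blocks)
```

The edge weights can then be drawn as arrow thickness, and the nodes enriched with entity profiles, to obtain the kind of graph the slide describes.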
PoliticalMashup 27
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection
• What are the key infrastructural and research questions?
In what direction, and how, to scale this up?
1. in time
2. in breadth
3. in links
PoliticalMashup 28
Scale diachronically
• A stable data model and measurement procedure make this data
very valuable for diachronic comparisons
• towards the past
  • OCR
  • consistency in structure
  • more missing data to link to
• towards the future
  • remain up to date
  • legacy decisions
PoliticalMashup 29
Scale in breadth, e.g. parliamentary proceedings of all European countries
• All describe the same "script", so all fit in one schema
• Main question: how to connect the data from different countries?
Common structure and annotation: use the same Relax NG
schema
Common values on certain attributes:
• Entities: normalize to Wikipedia concepts
• Controlled vocabulary keywords: normalize to Eurovoc
• Language: machine translate to English
• Events: normalize to EMM NewsExplorer query, Wikinews
query
PoliticalMashup 30
Scale in breadth: link to related datasets
• Link on time, entities, events, topics
• Other official publications
• News
• User-generated content
• (In our case) promises of political actors: election manifestos
PoliticalMashup 31
Conclusions
• There are ample opportunities for exploiting Official Publications
• Preprocessing and interlinking with other datasets is difficult and
does not scale well
• High precision and recall are needed for many applications
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough:
create special-purpose knowledge extractors
• High investment, but if done in a general way, high return and
impact
PoliticalMashup 20
DBpedia not yet reliable
bull Data extraction is difficult even from the infobox even from
complete data
Wikipedia page of Geert Wilders
DBpedia information about Geert Wilders
Notice the values of the party and the office attributes
Timestamped facts are difficult to extract and difficult to
represent in RDF triples
PoliticalMashup 21
Lesson learned requirement on metadata andrelations
bull One cannot rely on Linked Open Data for good quality metadata
bull Official documents should be self-describing also for facts which
are obvious at publication time
bull Compare speakerrsquos data in original (OCRed) data and XMLified
and enriched version
bull Original
bull Part of it in XML
bull And now for human consumption
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
bull Users search for entities not for documents [TREC Entity Track]
[Balog et al 2009]
bull Main research questions
How to collect information on entities
how to model an entity
how to rank entities
bull (Parsimonious) language models work well as models [Balog et
al 2009][Hiemstra et al 2004]
bull Entity profiling httpwwwpolitiekinzichtcom
bull Entity search httpikkieswijzernl
PoliticalMashup 24
Content and structure search
bull Usual advanced search combines keyword search with metadata
search
bull Extra fields are just extra filters on the returned documents
bull With structured documents we can do search on content and
structure
bull Most useful task rank best entry points in large documents
bull Compare two search systems on the same data
on flat text
on an XML representation
PoliticalMashup 25
Lesson learned requirement on structure
bull Make semantically important structure of documents explicit in
XML markup
bull Publish for machine readability
bull Publish generic data not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
• There are ample opportunities for exploiting Official Publications
• Preprocessing and interlinking with other datasets is difficult and does not scale well
• High precision and recall are needed for many applications
• Many text analysis and data-mapping tasks [MUC, TAC]
• Every format needs its own transformer
• Linked Open Data knowledge bases are not (yet) good enough: create special-purpose knowledge extractors
• High investment, but if done in a general way, high return and impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current parliamentary proceedings in a digital sustainable manner?
Lessons learned
• Common, open, standardized, self-describing, machine readable
• not tied to a single application
• linked, linked, linked
• Not only shared attributes
• but, more importantly, shared data values
• also store utterly obvious facts (10 years later they aren’t)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
• Push at the source (in UK: open government data; in Holland all parliamentary data is now in XML)
• Help reduce dumb cut-and-paste annotation work so annotators can concentrate on tasks which are hard for machines (e.g. text-classification)
• Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents: Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer: requirements
- 19 Is Linked (Open) Data the solution?
- 20 DBpedia not yet reliable
- 21 Lesson learned: requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned: requirement on structure
- 26 Application of structure: Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth, e.g. parl. proceedings of all European countries
- 30 Scale in breadth: link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
PoliticalMashup 21
Lesson learned: requirement on metadata and relations
• One cannot rely on Linked Open Data for good quality metadata
• Official documents should be self-describing, also for facts which are obvious at publication time
• Compare speaker’s data in original (OCRed) data and XMLified and enriched version
  • Original
  • Part of it in XML
  • And now for human consumption
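As a rough illustration of "self-describing", the Python sketch below stores the facts that were obvious at publication time (who spoke, for which party, in which role, on which date) as explicit attributes on each speech. The element and attribute names (`speech`, `speaker`, `party`, `role`, `date`) and the sample values are assumptions for this example, not the PoliticalMashup schema.

```python
import xml.etree.ElementTree as ET


def make_speech(text, speaker, party, role, date):
    """Build a speech element that carries its own context as attributes."""
    speech = ET.Element(
        "speech",
        {"speaker": speaker, "party": party, "role": role, "date": date},
    )
    paragraph = ET.SubElement(speech, "p")
    paragraph.text = text
    return speech


# Hypothetical speaker and party names.
speech = make_speech(
    "Mr. Chairman, I object.",
    speaker="Jansen",
    party="Party X",
    role="mp",
    date="1950-03-14",
)

xml_text = ET.tostring(speech, encoding="unicode")
print(xml_text)

# A later reader needs no external knowledge base to recover the context.
parsed = ET.fromstring(xml_text)
print(parsed.get("speaker"), parsed.get("party"), parsed.get("date"))
```

The OCRed original typically carries only a surname. The party and role were obvious to readers at the time, which is exactly why they were never written down.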
PoliticalMashup 22
A few more applications
PoliticalMashup 23
Entity Profiling and Entity Search
• Users search for entities, not for documents [TREC Entity Track] [Balog et al. 2009]
• Main research questions:
  How to collect information on entities?
  How to model an entity?
  How to rank entities?
• (Parsimonious) language models work well as models [Balog et al. 2009][Hiemstra et al. 2004]
• Entity profiling: http://www.politiekinzicht.com
• Entity search: http://ikkieswijzer.nl
PoliticalMashup 24
Content and structure search
• Usual advanced search combines keyword search with metadata search
• Extra fields are just extra filters on the returned documents
• With structured documents we can do search on content and structure
• Most useful task: rank best entry points in large documents
• Compare two search systems on the same data:
  on flat text
  on an XML representation
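The difference between the two systems can be sketched in Python on a toy document. This is only an illustration under stated assumptions: the XML layout (`proceedings`, `topic`, `speech`), the raw term-count scoring, and the function names are invented for the example, and real systems use proper retrieval models.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature proceedings document.
DOC = """<proceedings date="2011-01-01">
  <topic title="Budget">
    <speech speaker="A"><p>The budget deficit is growing.</p></speech>
    <speech speaker="B"><p>We should discuss education first.</p></speech>
  </topic>
  <topic title="Education">
    <speech speaker="B"><p>Education needs budget, more budget.</p></speech>
  </topic>
</proceedings>"""


def tokens(text):
    """Lowercased words with trailing punctuation stripped."""
    return [word.strip(".,").lower() for word in text.split()]


def flat_search(xml_text, term):
    """Flat text: one score for the whole document, no entry point."""
    root = ET.fromstring(xml_text)
    return tokens(" ".join(root.itertext())).count(term.lower())


def entry_points(xml_text, term, speaker=None):
    """Structured: rank individual speeches, optionally for one speaker."""
    root = ET.fromstring(xml_text)
    hits = []
    for topic in root.iter("topic"):
        for speech in topic.iter("speech"):
            if speaker is not None and speech.get("speaker") != speaker:
                continue
            score = tokens(" ".join(speech.itertext())).count(term.lower())
            if score:
                hits.append((score, topic.get("title"), speech.get("speaker")))
    return sorted(hits, reverse=True)


print(flat_search(DOC, "budget"))
print(entry_points(DOC, "budget"))
print(entry_points(DOC, "budget", speaker="B"))
```

On flat text the user learns only that the document matches. With explicit structure the system can point to the speech where reading should start, and structural constraints such as the speaker become part of the query rather than a filter on whole documents.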
PoliticalMashup 25
Lesson learned: requirement on structure
• Make semantically important structure of documents explicit in XML markup
• Publish for machine readability
• Publish generic data, not data prepared for one use-case
PoliticalMashup 26
Application of structure Interruption graph(Attackogram)
bull MP A interrupts B lArrrArr A speaks during the block of B
combined with entity profiling
httpdebatpolitiekinzichtcom
PoliticalMashup 27
Exploring and exploiting official documents
bull We saw what can be done with one well-curated collection
bull What are the key infrastructural and research questions
In what direction and how to scale this up
1 in time
2 in breadth
3 in links
PoliticalMashup 28
Scale diachronically
bull Stable data model and measurement procedure make this data
very valuable for diachronic comparisons
bull towards the past
bull OCR
bull consistency in structure
bull more missing data to link to
bull towards the future
bull remain up to date
bull legacy decisions
PoliticalMashup 29
Scale in breadth eg parlproceedings of allEuropean countries
bull All describe the same ldquoscriptrdquo so all fit in one schema
bull Main question how to connect the data from different countries
Common structure and annotation use the same Relax NG
schema
Common values on certain attributesbull Entities Normalize to Wikipedia concepts
bull Controlled vocabulary keywords Normalize to Eurovoc
bull Language Machine translate to English
bull Events Normalize to EMM Newsexplorer query Wikinews
query
PoliticalMashup 30
Scale in breadth link to related datasets
bull Link on time entities events topics
bull Other official publications
bull News
bull User generated content
bull (In our case) promisses of political actors election manifestos
PoliticalMashup 31
Conclusions
bull There are ample opportunities for exploiting Official Publications
bull Preprocessing and interlinking with other datasets is difficult and
does not scale well
bull High precision and recall is needed for many applications
bull Many text analysis and data-mapping tasks [MUC TAC]
bull Every format needs an own transformer
bull Linked Open Data knowledge bases are not (yet) good enough
create special purpose knowledge extractors
bull High investment but if done in a general way high return and
impact
PoliticalMashup 32
Back to our research question
What is the best data format for publishing both legacy and current
parliamentary proceedings in a digital sustainable manner
Lessons learned
bull Common open standardized self-describing machine readable
bull not tied to a single application
bull linked linked linked
bull Not only shared attributes
bull but more importantly shared data values
bull also store utterly obvious facts (10 years later they arenrsquot)
PoliticalMashup 33
How we can help (ourselves)
Help improve input data at the source
bull Push at the source (in UK open government data in Holland all
parliamentary data is now in XML )
bull Help reduce dumb cut-and-paste annotation work so annotators
can concentrate on tasks which are hard for machines (eg
text-classification)
bull Emphasize importance of using shared standards
Future researchers will love you
PoliticalMashup 34
Last Question
Official Publications are they
or
- 1 Open Official Documents Requirements and Opportunities
- 2 Content
- 3 Our Leading Research Question
- 4 W3C recommendations on Open Government Data
- 5 Value of a large data corpus
- 6 Documents related by publication date
- 7 Properties of our Parliamentary Proceedings Dataset
- 8 Longitudinal data
- 9 Data about human behaviour
- 10 Often rather boring
- 11 But sometimes full of drama and excitement
- 12 Loads of measurement points
- 13 Digitally available
- 14 About this collection
- 15 Very rich metadata for each word
- 16 How to exploit the extra metadata and structure
- 17 Political n-gram viewer
- 18 Political n-gram viewer requirements
- 19 Is Linked (Open) Data the solution?
- 20 DBpedia: not yet reliable
- 21 Lesson learned: requirement on metadata and relations
- 22 A few more applications
- 23 Entity Profiling and Entity Search
- 24 Content and structure search
- 25 Lesson learned: requirement on structure
- 26 Application of structure: Interruption graph (Attackogram)
- 27 Exploring and exploiting official documents
- 28 Scale diachronically
- 29 Scale in breadth, e.g. parl. proceedings of all European countries
- 30 Scale in breadth: link to related datasets
- 31 Conclusions
- 32 Back to our research question
- 33 How we can help (ourselves)
- 34 Last Question
PoliticalMashup 27
Exploring and exploiting official documents
• We saw what can be done with one well-curated collection
• What are the key infrastructural and research questions?
In what direction and how to scale this up?
1. in time
2. in breadth
3. in links
PoliticalMashup 28
Scale diachronically
• A stable data model and measurement procedure make this data very valuable for diachronic comparisons
• Towards the past
  • OCR
  • consistency in structure
  • more missing data to link to
• Towards the future
  • remain up to date
  • legacy decisions
PoliticalMashup 29
Scale in breadth, e.g. parl. proceedings of all European countries
• All describe the same “script”, so all fit in one schema
• Main question: how to connect the data from different countries?
Common structure and annotation: use the same Relax NG schema
Common values on certain attributes:
• Entities: normalize to Wikipedia concepts
• Controlled vocabulary keywords: normalize to Eurovoc
• Language: machine translate to English
• Events: normalize to EMM NewsExplorer query, Wikinews query
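The idea of common values on certain attributes can be sketched as a lookup that maps country-specific surface forms to one shared identifier. This is only an illustration under stated assumptions: the mapping table, the function name and the Wikipedia-style identifier are hypothetical, and a real system would need a far larger, curated resource.

```python
# Hypothetical mapping from (country, local surface form) to a shared
# concept identifier, written here in the style of a Wikipedia page title.
SHARED_CONCEPTS = {
    ("nl", "europese unie"): "European_Union",
    ("de", "europäische union"): "European_Union",
    ("uk", "european union"): "European_Union",
}


def normalize_entity(country, surface_form):
    """Map a local entity mention to a shared concept, or None if unknown."""
    key = (country, surface_form.strip().lower())
    return SHARED_CONCEPTS.get(key)


mentions = [
    ("nl", "Europese Unie"),
    ("de", "Europäische Union"),
    ("uk", "European Union"),
    ("nl", "Onbekend"),  # not in the table, so it stays unlinked
]
normalized = [normalize_entity(country, form) for country, form in mentions]
print(normalized)
```

Once mentions from different parliaments carry the same value, a query on that one value retrieves speeches from all countries, which is the point of sharing values and not only attribute names.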
PoliticalMashup 30
Scale in breadth: link to related datasets
• Link on: time, entities, events, topics
• Other official publications
• News
• User generated content
• (In our case) promises of political actors: election manifestos
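Linking on time is the simplest of these joins. A minimal sketch, assuming each speech and each news item carries a publication date; the record identifiers and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import date


def link_on_date(speeches, news_items):
    """Pair each speech id with the ids of news items published the same day."""
    news_by_day = defaultdict(list)
    for news_id, day in news_items:
        news_by_day[day].append(news_id)
    # .get() avoids creating empty entries for days without news.
    return {speech_id: news_by_day.get(day, []) for speech_id, day in speeches}


# Hypothetical records: (identifier, publication date).
speeches = [
    ("speech-1", date(2012, 5, 24)),
    ("speech-2", date(2012, 5, 25)),
]
news_items = [
    ("news-a", date(2012, 5, 24)),
    ("news-b", date(2012, 5, 24)),
    ("news-c", date(2012, 5, 26)),
]

links = link_on_date(speeches, news_items)
print(links)
```

Joins on entities, events or topics follow the same pattern, but only work once both datasets use shared values for those attributes.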