Using entity extraction extension with OpenRefine and dataTXT APIs: food for thought

Upload: spaziodati

Post on 08-Sep-2014

Category: Technology

DESCRIPTION

Food for thought on why you need entity extraction capabilities inside OpenRefine, with some examples and scenarios.

TRANSCRIPT

Page 1: Using entity extraction extension with OpenRefine and dataTXT APIs

Using entity extraction extension with OpenRefine and dataTXT APIs

food for thought

Page 2: Using entity extraction extension with OpenRefine and dataTXT APIs

What we are talking about

OpenRefine www.openrefine.org

NER extension (http://freeyourmetadata.org/named-entity-extraction/) integrated with dataTXT-NEX API (dandelion.eu)

Page 3: Using entity extraction extension with OpenRefine and dataTXT APIs

What industries are using OpenRefine?

https://groups.google.com/d/msg/openrefine/vA75Ac_XODo/AfG8IRlEfSAJ

Page 4: Using entity extraction extension with OpenRefine and dataTXT APIs

data journalists

metadata curators

museums

libraries

research labs

SEO folks

data scientists

enterprises

universities

patent attorneys

Open Data hackers

Social Media specialists

civil servants

Page 5: Using entity extraction extension with OpenRefine and dataTXT APIs

What does OpenRefine offer that other data-parsing tools don't?

http://opendata.stackexchange.com/questions/515/what-does-openrefine-offer-that-other-data-parsing-tools-dont

Page 6: Using entity extraction extension with OpenRefine and dataTXT APIs

reconciliation of text data against reference data services containing strong identifiers (Freebase, OpenCorporates, any SPARQL or RDF, etc.)

simple linking of reconciled entities to other info sources like Wikipedia, MusicBrainz, IMDB, etc.

[…]

[…]

Page 7: Using entity extraction extension with OpenRefine and dataTXT APIs

How are we using it at SpazioDati?

Page 8: Using entity extraction extension with OpenRefine and dataTXT APIs

OpenRefine is inside our data curation controller

dandelion.eu

Page 9: Using entity extraction extension with OpenRefine and dataTXT APIs

normalize, clean and extract data from different sources

reconcile against internal reconciliation services (administrative regions, names and telephone numbers…)

apply rules and transformations to data, and align it with our internal ontologies

Page 10: Using entity extraction extension with OpenRefine and dataTXT APIs

A look at OpenRefine & reconciliation

Page 11: Using entity extraction extension with OpenRefine and dataTXT APIs

Why is reconciliation useful?

Instruments

bla bla bla

bla bla bla bla

what kind of instruments?

Page 12: Using entity extraction extension with OpenRefine and dataTXT APIs

reconciliation identifies keywords in flowing text and gives them a URL

from strings to things
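
As a rough illustration of "from strings to things", the sketch below reconciles free-text cell values against a tiny reference table. Everything in it is an assumption made for the example: the reference table, the example.org URLs and the `reconcile` helper are hypothetical, and the fuzzy matching relies on Python's standard `difflib` rather than on whatever a real reconciliation service does.

```python
from difflib import get_close_matches

# Hypothetical reference data: canonical label -> identifier URL.
REFERENCE = {
    "musical instruments": "http://example.org/id/musical-instruments",
    "measuring instruments": "http://example.org/id/measuring-instruments",
    "aeronautical instruments": "http://example.org/id/aeronautical-instruments",
}


def reconcile(value, reference=REFERENCE, cutoff=0.8):
    """Return (label, url) for the closest reference label, or None."""
    # Lowercase and collapse whitespace before comparing.
    normalized = " ".join(value.lower().split())
    matches = get_close_matches(normalized, list(reference), n=1, cutoff=cutoff)
    if not matches:
        return None
    label = matches[0]
    return label, reference[label]


# A messy column: inconsistent case and spacing, a typo, and free text.
column = ["Musical  Instruments", "measuring instrumnts", "bla bla bla"]
reconciled = [reconcile(cell) for cell in column]
```

Cells that match a reference label closely enough get an identifier; cells that do not (the third one here) stay unreconciled and can be reviewed by hand.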

Page 13: Using entity extraction extension with OpenRefine and dataTXT APIs

instruments data column (“Instruments”: bla bla bla, bla bla bla bla)

musical instruments: URL

measuring instruments: URL

aeronautical instruments: URL

Page 14: Using entity extraction extension with OpenRefine and dataTXT APIs

reconciliation works great for those fields in your dataset that contain single terms

names of people, countries, works of art […]

Page 15: Using entity extraction extension with OpenRefine and dataTXT APIs

and what if we have a column with unstructured texts, like this one?

Page 16: Using entity extraction extension with OpenRefine and dataTXT APIs

we need a new step in the data curation workflow…

a new data column, labelled “dataTXT”

extract named entities using

NER extension + dataTXT API

data column with some texts

Page 17: Using entity extraction extension with OpenRefine and dataTXT APIs

in this column, there are named concepts, linked to Wikipedia

label + URI

“Collective action” + http://en.wikipedia.org/wiki/Collective_action
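
To make the "label + URI" pairing concrete, here is a minimal sketch of what one annotation could look like as data. It does not call the dataTXT API: the `Annotation` type, the `KNOWN_CONCEPTS` lookup table and `extract_annotations` are hypothetical stand-ins that simply search for known labels in a cell.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    label: str  # surface form found in the text
    uri: str    # identifier the label is linked to
    start: int  # character offset where the label begins
    end: int    # character offset just past the label


# Hypothetical lookup table standing in for a real entity extraction service.
KNOWN_CONCEPTS = {
    "collective action": "http://en.wikipedia.org/wiki/Collective_action",
}


def extract_annotations(text, concepts=KNOWN_CONCEPTS):
    """Find every known label in the text and pair it with its URI."""
    annotations = []
    for label, uri in concepts.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(Annotation(m.group(0), uri, m.start(), m.end()))
    return sorted(annotations, key=lambda a: a.start)


cell = "The paper studies collective action in online communities."
annotations = extract_annotations(cell)
```

A real extraction service also disambiguates labels by context, which a plain lookup like this one cannot do.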

Page 18: Using entity extraction extension with OpenRefine and dataTXT APIs

make a text filter

looking for a concept

classify and categorize the content …

things, not strings
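
The difference between filtering on things and filtering on strings can be sketched as follows. The rows and their annotations are made up for the example, and the two helper functions are hypothetical; the point is only that a concept filter works on the URIs stored in the “dataTXT” column, not on the wording of the text.

```python
# Hypothetical rows: original text plus the URIs stored in the "dataTXT" column.
rows = [
    {"text": "Workers organised a strike to demand better pay.",
     "dataTXT": ["http://en.wikipedia.org/wiki/Collective_action",
                 "http://en.wikipedia.org/wiki/Strike_action"]},
    {"text": "A class action lawsuit was filed yesterday.",
     "dataTXT": ["http://en.wikipedia.org/wiki/Class_action"]},
    {"text": "Collective action problems are common in economics.",
     "dataTXT": ["http://en.wikipedia.org/wiki/Collective_action"]},
]


def filter_by_concept(rows, uri):
    """Keep rows whose extracted entities include the given URI."""
    return [row for row in rows if uri in row["dataTXT"]]


def filter_by_string(rows, needle):
    """Keep rows whose raw text contains the given substring."""
    return [row for row in rows if needle.lower() in row["text"].lower()]


CONCEPT = "http://en.wikipedia.org/wiki/Collective_action"
by_concept = filter_by_concept(rows, CONCEPT)            # first and third rows
by_string = filter_by_string(rows, "collective action")  # third row only
```

The concept filter also returns the row about the strike, which never uses the words "collective action".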

Page 19: Using entity extraction extension with OpenRefine and dataTXT APIs

some scenarios

Page 20: Using entity extraction extension with OpenRefine and dataTXT APIs

Real issues from the Open Data community

Using OpenRefine + NER extension with dataTXT-NEX

extract meaningful information from CVs, such as names, organizations, skills, …

http://opendata.stackexchange.com/search?page=3&tab=relevance&q=extraction

normalize organization names cited in texts

Page 21: Using entity extraction extension with OpenRefine and dataTXT APIs

Data journalists

Using OpenRefine + NER extension with dataTXT-NEX

extract news relevant to a specific topic (a person, a brand or a company)

write a summary of a politician's speech, starting from the main concepts extracted from the text

mine specific information in judicial decisions (judge's name, court, area of law and neutral citation number)

Page 22: Using entity extraction extension with OpenRefine and dataTXT APIs

Using OpenRefine + NER extension with dataTXT-NEX

Text mining on tweets: easily extract brands, places and concepts from a Twitter stream related to an event

Text mining on website content: easily extract concepts and places from a webpage, to improve the website's SEO ranking

Social media specialists

Page 23: Using entity extraction extension with OpenRefine and dataTXT APIs

Using OpenRefine + NER extension with dataTXT-NEX

Understand your own bank account statements: extract useful information, like brands and places, to categorize and classify your own expenses

“Quantified Self” movement

Analytics on Personal Data

Page 24: Using entity extraction extension with OpenRefine and dataTXT APIs

#dataTXT #refine #ner

Do you know other use cases? Tell us on Twitter!

@spaziodati

dandelion.eu