linked open data and rdf stores meet search based applications · 2012-02-20 · classification...

17
Linked Open Data and RDF Stores meet Search Based Applications PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX Dr. Christoph Goller, Director Research, IntraFind AG

Upload: others

Post on 14-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Linked Open Data and RDF Stores meet

Search Based Applications

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX

Dr. Christoph Goller, Director Research, IntraFind AG

Page 2: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Outline

IntraFind

Search Based Applications (SBA) / Enterprise Search

Natural Language Processing (NLP): extract semantics from text

RDF Stores and Linked Open Data (LOD)

SBAs, RDF Stores and LOD: bringing it all together

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 2

Page 3: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

IntraFind Software AG

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 3

Page 4: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

IntraFind Business Model

Founding of the company: October 2000

More than 700 customers mainly in Germany, Austria, and Switzerland

Partner Network (> 30 VAR & embedding partners)

Employees: 25

Lucene Committers: B. Messer, C. Goller

Our Open Source Search Business:

Product Company: iFinder, Topic Finder, Knowledge Map, Tagging Service, …

Products are a combination of Open Source Components and in-house Development

Support (up to 7x24), Services, Training, Stable API

Automatic Generation of Semantics

Linguistic Analyzers for most European Languages

Named Entity Recognition

Text Classification

Semantic Search

Clustering

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 4

Page 5: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Search Based Applications (SBA)

“The elements that make search powerful are not necessarily the search box, but the ability to

bring together multiple types of information quickly and understandably, in real time, and at

massive scale. Databases have been the underpinning for most of the current generation of

enterprise applications; search technologies may well be the software backbone of the future”

---Susan Feldman, IDC LINK, June 9, 2010

bring together data from multiple sources (structured and unstructured)

Massive Scalability of Web Search (fulltext indexes are NoSQL DBs)

Agile: days vs. months

Semantic NLP Technologies to analyze, enrich, and aggregate source data

Usability of Web Search

Search Box (natural language)

Facets / Dynamic Clustering (refining navigation / web shops)

ad hoc reporting and analysis (dashboard tools, information visualization)

build BI / decision tools on structured and unstructured information

E.g. Social Media Opinion Mining

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 5

Page 6: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

iFinder

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 6

Unique

Search Box

Result from

Structured Data

Result from

Unstructured Data

Facets

(correlated terms)

Facets

(named entities)

Facets

(other meta-data)

Page 7: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

iFinder Knowledge Map

Tab browsing

Full-text search combined with Knowledge Maps

Content sources provide

filtering capabilities within

the Knowledge Map

Search-while-you-type

based on filters (in this

example: filter „author“)

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 7

Page 8: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Morphological Analyzer vs. Stemming

Morphological Analyzer:

Lemmatizer: maps words to their base forms

Decomposer: decomposes words into their compounds

Kinderbuch (children‘s book) -> Kind (Noun) | Buch (Noun)

Versicherungsvertrag (contract of insurance) -> Versicherung (Noun) | Vertrag (Noun)

Holztisch (wooden table), Glastisch (table made of glass)

Stemmer: usually simple algorithm

going -> go king -> k ???????????

Messer -> mess ??????

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 8

English German

going -> go (Verb) lief -> laufen (Verb)

bought -> buy (Verb) rannte -> rennen (Verb)

bags -> bag (Noun) Bücher -> Buch (Noun)

bacteria -> bacterium (Noun) Taschen -> Tasche (Noun)

Page 9: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Advantages of Morphological Analysis

Combines high Recall with high Precision for Search Applications

Improves subsequent statistical methods

Better suited as descriptions for semantic search / faceting / clustering than

artificial stems, categories used for phrase detection

Reliable Lookup in Lexicon Resources

Thesaurus / Ontologies

Cross-lingual search

Available Languages:

German, English, Spanish, French, Italian, Dutch, Russian, Polish,

Serbo-Croatian, Greek, Chinese, Arabian, Pasthu

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 9

Page 10: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 10

Named Entity Recognition (NER)

Automated extraction of information from unstructured data

People names

Company names

Brands from product lists

Technical key figures from technical data

(raw materials, product types, order IDs,

process numbers, eClass categories)

Names of streets and locations

Currency and accounting values

Dates

Phone numbers, email addresses, hyperlinks

Search query: Which people / companies

strongly correlate with each other?

10

Page 11: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Named Entities: Applications in Search

Applications:

Facets

Search for „Experts“

Additional Query Types

Index Structure: Additional Tokens on the same position:

N_PersonName

N_Peter Müller

N_Peter

N_Müller

Search for a person named „Brown“

Search for a company near „founded“ and „Bill Gates“

Question Answering / Natural Language Queries (Demo)

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 11

Page 12: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Topic Learner Classifier

Rules (Parameters)

Definition of

Topics 1……N

Topic Classifier

Topic Association

Learning Phase

Classification Phase

New Document

Example Documents

Text Classification

Automatic Assignment of Documents to Topics based on their content

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 12

Page 13: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Applications of Text Classification

News: Newsletter-Management System

Spam-Filtering

Mail / Email Classification

Product Classification (Online Shops), ECLASS /UNSPSC

Subject Area Assignment Libraries & Publishing Companies

Opinion Mining / Sentiment Detection

Part of our Tagging Services

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 13

Page 14: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

RDF Stores & Linked Open Data

RDF Stores:

Originally designed as Meta-Data model: machine-readable information

Triples: subject-predicate-object

General Data Model for Knowledge Representation

Query and Inference Languages: SPARQL

Linked Open Data:

method of publishing structured data so that it can be interlinked

Uses RDF

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 14

Page 15: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Linked Open Data

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 15

Page 16: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

SBAs, RDF, and LOD: bringing it all together

Use RDF Stores to store structured data within SBAs

LOD (Ontologies / Thesauri) can be used for tagging, topic identification, disambiguation, Auto-Complete, Did-You-Mean, Semantic Search, Navigation / Facetts, Cross-Lingual

Use LOD/knowledge to better link structured and unstructured data

Use LOD/knwoledge (not only statistics) to better determine query-document matching

Use LOD-Resources for improving / developing Information Extraction NLP components

RDF Stores may benefit from SBAs in terms of scalability

(e.g. Project Sindice: Offloading RDF Content into a Lucene Store)

Other NoSQL DBs

RDF stores and LOD are designed for machine-usability but human usability is needed

Vision:

combine Fulltext Search an RDF Stores into a single (Enterprise) Content Repository

provide a simple unified intuitive user interface for accessing all (Enterprise)

knowledge, structured and unstructured

Current and Future Needs for SBAs: more Semantics, more NLP

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 16

Page 17: Linked Open Data and RDF Stores meet Search Based Applications · 2012-02-20 · Classification Phase New Document Example Documents Text Classification ... Offloading RDF Content

Questions?

Dr. Christoph Goller

Director Research

Phone: +49 89 3090446-0

Fax: +49 89 3090446-29

Email: [email protected]

Web: www.intrafind.com

IntraFind Software AG

Fraunhofer Strasse 15

82152 Planegg

Germany

PlanetData Call 2 Strategic Workshop, 08-Feb-12, LUX 17