Topic/S – A Topic and Trend Recognition Approach in News-Media, I-Semantics13
Information extraction, modelling, and storage of semantic data to recognize trending topics for journalism and newspaper offices
Sächsische AufbauBank, Research and Development Project Funding, project number 99457/2677
Michael Aleythe, Martin Voigt, Peter Wehner
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Conclusion
Friday, 06.09.2013 Topic/S Slide 1
Motivation
Newsroom
Source: ringier.com
Problem
Overwhelming amount of data
e.g., WAZ: 5,000 articles/day from agencies and in-house production
News agencies: DPA, Reuters, KNA, …
Web, social media: Blogs, …
In-house production: Archive, Online
Vision
Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor
Figure: media assets (MA1, MA2) are linked to named entities (E1 to E7) during pre-processing; post-processing groups the named entities into topics (T1 to T3).
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Information Extraction
– Storage
– Topic Detection
Demo
Conclusion
Workflow
Workflow: Preprocessor
Language Recognition (Ger/Eng): rule based
Named Entity Extraction: word list + statistics
Keyword Extraction: lemmatization, word list
Categorisation: source based
Source: onelanguageoneposter.com
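The slides only say that language recognition is rule based and covers German and English. One simple rule of that kind is to count language-specific function words; the sketch below illustrates the idea. The word lists and the `detect_language` function are assumptions for illustration, not the rules Topic/S actually uses.

```python
import re

# Hypothetical function-word lists; the slides do not say which rules Topic/S uses.
GERMAN_WORDS = {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "mit", "für"}
ENGLISH_WORDS = {"the", "and", "is", "not", "of", "with", "for", "to", "are", "this"}


def detect_language(text: str) -> str:
    """Return 'de', 'en', or 'unknown' by counting function-word hits."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    german_hits = sum(token in GERMAN_WORDS for token in tokens)
    english_hits = sum(token in ENGLISH_WORDS for token in tokens)
    if german_hits == english_hits:
        # No evidence either way (or a tie): do not guess.
        return "unknown"
    return "de" if german_hits > english_hits else "en"


print(detect_language("Die Kanzlerin ist nicht mit dem Ergebnis zufrieden."))
print(detect_language("The chancellor is not happy with the result."))
```

With only two target languages, a rule like this is cheap and easy to audit, which may be why a rule-based approach was sufficient here.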
Semantic Model
Semantic Facts
Named Entities required but no lists available
Stored preferred and alternative names
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller
Triples without SemItems: 27.6 million

| SemItem | Number (with alt. names) |
| --- | --- |
| Person | 1,504,341 (2,499,962) |
| Organization | 63,332 (98,127) |
| Place | 89,702 (95,178) |
| Keyword | 1,351 |
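Storing preferred and alternative names under one ID makes it possible to resolve any spelling of a name to the same SemItem. The following sketch shows one way such a lookup could work, using the ID and names from the slide. The index structure, the diacritic folding, and the function names are assumptions; in the real fact base a name may also map to several IDs, which this sketch ignores.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Fold accented characters to their base letters, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Preferred and alternative names as listed on the slide.
FACTS = {
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller": [
        "Rene Muller", "Rene Müller", "René Muller", "René Müller",
    ],
}


def build_name_index(facts: dict) -> dict:
    """Map every stored name, and its diacritic-free form, to the SemItem ID."""
    index = {}
    for sem_item_id, names in facts.items():
        for name in names:
            index[name.casefold()] = sem_item_id
            index[strip_diacritics(name).casefold()] = sem_item_id
    return index


NAME_INDEX = build_name_index(FACTS)


def lookup(surface_form: str):
    """Return the SemItem ID for a surface form, or None if it is unknown."""
    return NAME_INDEX.get(surface_form.casefold())


print(lookup("René Müller"))
print(lookup("rene muller"))
```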
Storage of Semantic Data
Using Oracle 11gR2
Pros
– Already available, existing knowledge
– Integrated querying of relational and semantic data
Cons
– Inference
– Incomplete SPARQL 1.1 support
– Limited custom rule support
Benchmark of triple stores [Voigt2012]
Workflow: Topic Detection
Clustering
Example SemItems in the clustering figure: Merkel, Politics, Highway, Traffic, Audi, Obama
Clustering (top clusters, 25.08.2013)

| Articles | Name | HotTopic |
| --- | --- | --- |
| 43 | Bundesliga, Fußball, Spieltag, 1. FC Union Berlin, SC Paderborn 07 eV, FC Augsburg, FSV Frankfurt | Yes |
| 25 | Euro, SPD, Berlin, Griechenland, FDP, CDU, Deutschland | Yes |
| 19 | Bericht, Diplomat, Google Inc, Anbieter, Berlin, Deutschland, Auto | Yes |
| 18 | Veranstaltung, Bernd Lucke, Angreifer, Berlin, Polizei, Angriff, Deutschland | Yes |
| 15 | Gericht, Prozess, Bo Xilai, Christian Wulff, Anklage, Verfahren, Mord | Yes |
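The slides do not say which clustering algorithm Topic/S uses. As a rough illustration of how articles could be grouped by the SemItems they share, the sketch below uses a greedy single-pass clustering with Jaccard similarity and names each cluster after its most frequent SemItems. The algorithm, the threshold, and the sample articles are all assumptions, and the HotTopic flag is not modelled.

```python
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Share of SemItems that two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.3) -> list:
    """Greedy single-pass clustering: an article joins the first cluster whose
    pooled SemItems are similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"articles": [...], "items": Counter}
    for article_id, sem_items in articles.items():
        for cluster in clusters:
            if jaccard(set(cluster["items"]), sem_items) >= threshold:
                cluster["articles"].append(article_id)
                cluster["items"].update(sem_items)
                break
        else:
            clusters.append({"articles": [article_id], "items": Counter(sem_items)})
    return clusters


def cluster_name(cluster: dict, size: int = 3) -> list:
    """Name a cluster after its most frequent SemItems."""
    return [item for item, _ in cluster["items"].most_common(size)]


# Hypothetical articles, reusing the SemItems from the clustering figure.
ARTICLES = {
    "a1": {"Merkel", "Politics", "Obama"},
    "a2": {"Merkel", "Politics"},
    "a3": {"Highway", "Traffic", "Audi"},
    "a4": {"Traffic", "Highway"},
}

clusters = cluster_articles(ARTICLES)
for cluster in clusters:
    print(len(cluster["articles"]), cluster_name(cluster))
```

A single-pass scheme like this depends on the order in which articles arrive; a production system would more likely use an order-independent method.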
Live Demo
Sum it up!
Result
Identifying topics and pushing them to the editor
Lessons learned
NER: weak for non-English text; a combination of word list and statistics is required
Model needs to be optimized for queries
Dedicated user interface required
Outlook
prediction of topics with causal/temporal relations
Source: ooltapulta.com
Source: business-strategy-innovation.com
Thanks! Questions?
Workflow: Preprocessor
Named Entity Recognition
Word list
– Tool: LingPipe + Extension
– Sources: LOD (DBpedia, GeoNames, YAGO2, GND)
– Advantages: controlled vocabulary, guaranteed recognition of listed entities
Statistics
– Tool: Stanford NLP
– Source: pre-trained model
– Advantage: recognition of unknown entities
Source: churchthought.com
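The two recognisers complement each other: the word list reliably finds known entities, while the statistical model can find entities that are missing from the list. A minimal sketch of how their results might be merged is shown below. It does not call LingPipe or Stanford NLP; the gazetteer, the stand-in tagger output, and the rule that word-list hits win on conflicts are assumptions for illustration.

```python
# Hypothetical word list; Topic/S builds its lists from LOD sources.
GAZETTEER = {
    "Angela Merkel": "Person",
    "Berlin": "Place",
    "SPD": "Organization",
}


def word_list_entities(text: str) -> dict:
    """Find known entities by exact substring match against the word list."""
    return {name: kind for name, kind in GAZETTEER.items() if name in text}


def combine(text: str, statistical_entities: dict) -> dict:
    """Merge both recognisers; word-list hits win on conflicts because they
    come from a controlled vocabulary."""
    merged = dict(statistical_entities)  # may contain entities unknown to the list
    merged.update(word_list_entities(text))
    return merged


text = "Angela Merkel met Bernd Lucke in Berlin."
# Stand-in for the output of a statistical tagger (not a real API call).
statistical = {"Bernd Lucke": "Person", "Berlin": "Organization"}

entities = combine(text, statistical)
print(entities)
```

Here "Bernd Lucke" survives only because the statistical result is kept, and the type of "Berlin" is corrected by the word list.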
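Categorization
Figure: an article is routed by its source to a source-specific categoriser (Categoriser OTS, Categoriser DPA, Categoriser Reuters), which assigns an IPTC Media Topic; in the example, a DPA article is assigned Politics.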
Categorization - Quality

| News agency | Accuracy |
| --- | --- |
| KNA | 80.3% |
| DPA | 94.4% |
| EPD | 80.3% |
| Reuters | 90.8% |
| OTS | 93.5% |
| AFP | 86% |

| Method | Accuracy |
| --- | --- |
| One categoriser for all agencies | 85% |
| One categoriser per agency | 87.5% |
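The per-agency method figure appears to be consistent with the unweighted mean of the six per-agency accuracies, which comes to about 87.55%. This is an observation about the reported numbers, not something the slides state; the authors may have weighted by article count or computed the figure differently. The check can be reproduced as follows:

```python
# Per-agency categorisation accuracy in percent, as reported on the slide.
AGENCY_ACCURACY = {
    "KNA": 80.3,
    "DPA": 94.4,
    "EPD": 80.3,
    "Reuters": 90.8,
    "OTS": 93.5,
    "AFP": 86.0,
}

# Unweighted mean: every agency counts equally, regardless of article volume.
mean_accuracy = sum(AGENCY_ACCURACY.values()) / len(AGENCY_ACCURACY)
print(f"Unweighted mean over agencies: {mean_accuracy:.2f} %")
```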
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article
Source: hugdaily.org
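The keyword steps above can be sketched as a small pipeline: lemmatize the tokens, keep the lemmas that appear in the word list, and additionally report the most frequent terms of the article. The lemma table, the keyword list, and the function names below are tiny hypothetical stand-ins; a real system would use a full lemmatizer and would filter stop words before counting frequent terms.

```python
import re
from collections import Counter

# Hypothetical lemma table and keyword list; real ones would be far larger.
LEMMAS = {"wahlen": "wahl", "parteien": "partei", "elections": "election"}
KEYWORDS = {"wahl", "partei", "election"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its lemma; unknown tokens stay unchanged."""
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3):
    """Return (word-list keywords, most frequent lemmas) for one article."""
    tokens = re.findall(r"\w+", text.lower())
    lemmas = [lemmatize(token) for token in tokens]
    counts = Counter(lemmas)
    keywords = sorted(lemma for lemma in counts if lemma in KEYWORDS)
    frequent = [lemma for lemma, _ in counts.most_common(top_n)]
    return keywords, frequent


keywords, frequent = extract_keywords("Die Parteien vor der Wahl: Wahlen, Wahlen, Wahlen")
print(keywords)
print(frequent)
```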
Disambiguation
Source: fansshare.com
Source: lounge.espdisk.com
Source: de.wikipedia.org
Problem: not all SemItems available in the LOD
Internal Facts: Michael Jackson (Beer)
External Facts (DBpedia, etc.): Michael Jackson (Beer, Whiskey); Michael Jackson (Music, King of Pop)
Identification of Entity Cluster
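One plausible reading of the Michael Jackson example is that an ambiguous name is resolved by comparing the context terms known internally with the context terms of each external candidate. The sketch below picks the candidate with the largest overlap. The candidate IDs, the overlap scoring, and the function name are assumptions for illustration; the slides do not describe the actual matching method.

```python
# Candidate entities from external facts, with context terms from the slide.
# The candidate IDs are hypothetical placeholders.
EXTERNAL_CANDIDATES = {
    "Michael Jackson": {
        "ext:Michael_Jackson_(writer)": {"Beer", "Whiskey"},
        "ext:Michael_Jackson_(singer)": {"Music", "King of Pop"},
    },
}


def disambiguate(name: str, context: set):
    """Pick the candidate whose context terms overlap most with the
    internal context; return None if nothing overlaps."""
    candidates = EXTERNAL_CANDIDATES.get(name, {})
    best_id, best_overlap = None, 0
    for candidate_id, terms in candidates.items():
        overlap = len(terms & context)
        if overlap > best_overlap:
            best_id, best_overlap = candidate_id, overlap
    return best_id


# The internal facts associate this "Michael Jackson" with "Beer".
match = disambiguate("Michael Jackson", {"Beer"})
print(match)
```

Returning `None` when nothing overlaps reflects the problem stated on the slide: not every SemItem has a counterpart in the LOD, so a forced match would be wrong.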