From keyword searching to discourse mining Pim Huijnen, Juliette Lonij DH2016, Kraków 15 July 2016


Page 1: From keyword searching to discourse mining

From keyword searching to discourse mining

Pim Huijnen, Juliette Lonij

DH2016, Kraków 15 July 2016

Page 2

From: The oasis, 13 April 1912, p.9. Chronicling America: Historic American Newspapers. Lib. of Congress.

Page 3

From: The oasis, 13 April 1912, p.9. Chronicling America: Historic American Newspapers. Lib. of Congress.

Page 4

Tangherlini, T. R. and Leonard, P. (2013). Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research. Poetics, 41: 725-749.

Van den Hoven, M., Van den Bosch, A. and Zervanou, K. (2010). Beyond Reported History: Strikes That Never Happened. Proceedings of the First International AMICUS Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts, Vienna: 20-28.

Wiedemann, G. and Niekler, A. (2014). Document Retrieval for Large Scale Content Analysis using Contextualized Dictionaries. Terminology and Knowledge Engineering, Berlin, June 2014: https://hal.archives-ouvertes.fr/hal-01005879.

Page 5

Goals researcher-in-residence project

Using extensive and context-specific word lists (‘dictionaries’) to replace the contingency of single keywords

Developing a script to extract dictionaries from literature based on topic modeling

Experimenting with tools to visualise results of dictionary searching in kranten.delpher.nl

Page 6

KB researcher-in-residence project

Flexibility (evaluation based on human expertise)

Transparency (avoiding black-boxing)

Practicality (available for the wider public)

Page 7

Script to extract dictionaries

[Diagram; recoverable labels: A, B, Topic modeling, TF-IDF]
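The slide names TF-IDF as one of the methods behind the dictionary script. The block below is only a minimal sketch of that idea and not the authors' script: the function names `tokenize` and `tfidf_keywords`, the split into "target" and "background" documents, and the scoring details are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def tfidf_keywords(target_docs, background_docs, top_n=10):
    """Rank words from the target documents by a TF-IDF-style score.

    Term frequency is counted over the target documents only; document
    frequency is counted over target and background documents together.
    (Hypothetical helper, not part of any published tool.)
    """
    target_tokens = [tokenize(doc) for doc in target_docs]
    all_tokens = target_tokens + [tokenize(doc) for doc in background_docs]
    n_docs = len(all_tokens)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in all_tokens:
        df.update(set(tokens))

    # Term frequency over the target documents.
    tf = Counter(tok for tokens in target_tokens for tok in tokens)
    total = sum(tf.values())

    scores = {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Words that occur in every document (function words, for instance) receive a score of zero, so they drop to the bottom of the ranking. The resulting list is a candidate dictionary that still needs evaluation by a human expert.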

Page 8

Script to extract dictionaries

[Diagram; recoverable labels: B, C]

Page 9

Visualising results of dictionary searches in Delpher

Use OR-query to search KB’s newspaper corpus

Visualise results on the basis of Solr’s relevancy-score (min. no. of words)

(arbeid* OR bedrij* OR beheer OR controle* OR factor* OR functie* OR kost* OR leiding* OR loon* OR maatregel* OR management OR methode* OR model* OR norm* OR organisatie* OR plannen OR prijs OR productie OR rationeel OR rendement OR reorganisatie OR statistiek OR taylor OR tijd OR werkbesparing OR werkverdeeling)
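As a rough illustration of the two steps on this slide, the sketch below assembles an OR-query from a word list and checks locally whether a text contains a minimum number of distinct dictionary terms. It does not reproduce Solr's relevancy scoring; the function names and the treatment of a trailing `*` as a prefix wildcard are assumptions.

```python
import re


def build_or_query(terms):
    """Join dictionary terms into a single parenthesised OR-query string."""
    return "(" + " OR ".join(terms) + ")"


def term_to_regex(term):
    """Turn a term into a whole-word regex; a trailing * matches any suffix."""
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


def matched_terms(text, terms):
    """Return the dictionary terms that occur at least once in the text."""
    return [t for t in terms if term_to_regex(t).search(text)]


def meets_minimum(text, terms, min_terms=3):
    """True if the text contains at least min_terms distinct dictionary terms."""
    return len(matched_terms(text, terms)) >= min_terms
```

For example, `build_or_query(["arbeid*", "loon*", "tijd"])` returns `(arbeid* OR loon* OR tijd)`. The minimum-terms check is a stand-in for thresholding on a relevancy score: an article that matches only one common word such as "tijd" is filtered out.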

Page 10

kbresearch.nl/dictionary

Page 11

Challenges

Running an OR-query of 25+ (or, preferably, more) words on a dataset of 90,000,000+ documents

Accounting for particularities of the corpus:
* number of newspaper titles per year
* changes in newspaper titles over the years
* changes in article length over the years
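One common way to account for a corpus that grows and shrinks over time is to report hits relative to the total number of articles per year instead of raw counts. The sketch below assumes both counts are already available as dictionaries keyed by year; the function name and the per-1,000 scaling are illustrative choices, and the same idea could be applied per word instead of per article to correct for changing article length.

```python
def relative_frequency(hits_per_year, articles_per_year):
    """Return hits per 1,000 articles for every year with a known total.

    Both arguments map a year to a count. Years without a positive
    total are skipped rather than guessed.
    """
    relative = {}
    for year, hits in sorted(hits_per_year.items()):
        total = articles_per_year.get(year, 0)
        if total > 0:
            relative[year] = 1000 * hits / total
    return relative
```

With made-up numbers, 50 hits among 10,000 articles in one year and 50 hits among 50,000 articles in another give the same raw count but relative frequencies of 5.0 and 1.0 per 1,000 articles.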

Getting an idea of the exact combination of words in the visualised results
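A simple way to see which combinations of dictionary words lie behind the results is to count the exact set of matched terms per article. This is a sketch under the assumption that the matched terms for each article have already been collected; `combination_counts` is a hypothetical helper.

```python
from collections import Counter


def combination_counts(matches_per_article):
    """Count how often each exact set of matched dictionary terms occurs.

    matches_per_article is an iterable of term lists, one list per article.
    Order and repetition within a list are ignored.
    """
    return Counter(frozenset(matches) for matches in matches_per_article)
```

Calling `.most_common(10)` on the result lists the ten most frequent combinations, which shows whether the hits are driven by a few generic words or by the more specific ones.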

Page 12

Thank you!

https://github.com/jlonij/keyword_generator

http://blog.kbresearch.nl/

http://www.pimhuijnen.com