Karnov – Power Your Search with Text Analytics (Findability Day 2014)
Karnov's presentation on text analytics at Findability Day 2014. Transcript:
Power your Search with Text Analytics – a pragmatic approach
Rasmus Schepelern Hentze Chief Innovation Officer, Karnov Group
A note on Karnov Group
• Karnov Group is the leading provider of legal and tax & accounting information to businesses and professionals in Denmark and Sweden.
• We provide legislation, case law, regulatory information, news and continued legal education.
Our Users and What they Need
• Lawyers really don’t like missing out on important information.
  • Recall, recall, recall.
• Lawyers are really busy and not very patient with irrelevant information.
  • Precision, precision, precision.
• The Challenge:
  • Users need metadata to interact efficiently with the growing volume of information.
  • Manual extraction/creation of the needed metadata is not feasible.
Stemming & Decompounding – an Example
Stemming & Decompounding – The Challenge
• The user is both in a hurry and paranoid about missing out on relevant results.
• Optimize recall without bothering the user?
  • Stemming – match all forms of the query terms
  • Decompounding – match parts of compound words
• Argh, my new search engine doesn’t come with the right language components…
Stemming & Decompounding – The Solution
• Implemented using OSS components, search logs, and a dictionary.
• Modified versions of the standard Porter Stemmer built into Solr/Lucene.
• Decompounder based on simple TeX hyphenation patterns that plug directly into the standard Solr/Lucene interface.
• Trained on the top 1,000 queries from each language.
• Total development effort: 4 FTE weeks.
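A chain like this can be expressed directly in a Solr schema. The fragment below is a minimal sketch, not Karnov's actual configuration: the pattern and dictionary file names are placeholders, and the stock Snowball stemmer stands in for their modified Porter stemmer.

```xml
<!-- Sketch of a Swedish analysis chain in Solr's schema.xml.
     File names (hyph/sv.xml, dict/sv-words.txt) are placeholders. -->
<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Decompounding driven by TeX hyphenation patterns -->
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
            hyphenator="hyph/sv.xml"
            dictionary="dict/sv-words.txt"
            minSubwordSize="4"
            onlyLongestMatch="false"/>
    <!-- Stock Snowball stemmer; a custom stemmer would plug in here -->
    <filter class="solr.SnowballPorterFilterFactory" language="Swedish"/>
  </analyzer>
</fieldType>
```

Both filters ship with Solr/Lucene, which is what makes the approach "100% integrated" with the standard architecture.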
Stemming & Decompounding – The Result
• Yes, it overgenerates.
• No, it’s not a grammatically correct analysis.
• No, the users don’t care about the underlying analysis.
• Yes, the users are happy about the search quality.
• No, it doesn’t require additional licensing. ☺
• Yes, it’s 100% integrated into the standard Solr/Lucene architecture.
Subject Classification – The Case
• We receive 350,000 new Swedish court cases per year
  • …in flat PDF files ☹
• Without subject classification and basic metadata, the content is of very limited value to the user:
  • Facets for navigation
  • News alerts & monitoring
• Manual classification of 1000+ documents a day is not feasible.
Subject Classification – a case for Machine Learning
• Let’s train a classifier and automate the process…
  • What software do we use?
  • How do we compile training data?
  • How do we measure the quality?
  • When is the quality good enough?
• Oh my, before we start we also need
  • a part-of-speech tagger for Swedish
  • a lemmatizer optimized for Swedish legal texts
• Is it really that complex?
Subject classification – Machine Learning Results
• Software: Python scikit-learn
  • Production-grade OSS
  • Well documented
  • Integrated well with our existing content pipeline
• No NLP pre-processors needed – a simple “bag of words” sufficed
• Legacy data served as training data as-is
• Quality (F1 measure): ca. 0.85+ for most of the topics
• Not too bad, but…
  • What about the subjects that didn’t work?
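A bag-of-words classifier of this kind takes only a few lines in scikit-learn. The sketch below is illustrative: the toy documents, labels, and model choice (TF-IDF plus a linear SVM) are assumptions, not Karnov's actual training data or pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set of (document text, subject label) pairs -- in the real
# setup, legacy manually classified documents served as training data.
train = [
    ("uppsagning av anstallning arbetsgivare kollektivavtal", "labour"),
    ("arbetsdomstolen anstallningsskydd lon arbetstagare", "labour"),
    ("direktiv fri rorlighet unionsmedborgare medlemsstat", "eu"),
    ("kommissionen direktiv genomforande unionsratt", "eu"),
    ("skatteverket inkomstskatt avdrag deklaration", "tax"),
    ("moms skattetillagg taxering skattskyldig", "tax"),
]
texts, labels = zip(*train)

# "Bag of words" + linear classifier: no POS tagging or lemmatization.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["arbetsgivare sade upp arbetstagare"])[0])
```

On real data, quality would be measured with `sklearn.metrics.f1_score` on a held-out split, matching the F1 figures above.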
Subject Classification – coupling Machine Learning and naïve heuristics
• Hey, if it’s from the labour court it’s probably labour law!
  • Use all available context
• If it goes on about EU directives it’s probably EU law
  • /direktiv \d{4}\/\d+\/EG/
• 200+ simple expressions improved classification to production quality.
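Layering such a heuristic on top of the classifier can be sketched as below. The directive pattern is the one from the slide; the override logic and the label names are illustrative assumptions about how the two stages might be combined.

```python
import re

# High-precision context rule: explicit EU-directive citations mark EU law.
EU_DIRECTIVE = re.compile(r"direktiv \d{4}/\d+/EG")

def classify(text: str, ml_label: str) -> str:
    """Let a simple heuristic override the statistical model's label."""
    if EU_DIRECTIVE.search(text):
        return "eu"
    return ml_label

print(classify("i enlighet med direktiv 2004/38/EG", "labour"))  # -> eu
```

Because each rule only fires on a narrow, high-precision pattern, 200+ of them can correct the classifier's weak spots without degrading the topics that already work.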
Subject Classification – combined results
• A developer-only team would still be training that classifier…
• Early involvement of local domain expertise was crucial.
• Combining tools & tasks pairwise into smaller separate pipeline components provided technical flexibility and helped counter over-engineering.
Key Learnings
• Rich metadata are key for a quality user experience.
• Manual metadata processing is only feasible for small data sets.
• Text analytics are a cost-effective means to more metadata.
• There is so much good analytics software out there.
• Be wary of over-engineering.
• Keep the focus on the user experience, not the technology.
• Get started!
• Know your data!