Karnov – Power Your Search with Text Analytics (Findability Day 2014)
Karnov's presentation on text analytics at Findability Day 2014. Transcript:
Power your Search with Text Analytics – a pragmatic approach
Rasmus Schepelern Hentze Chief Innovation Officer, Karnov Group
A note on Karnov Group
• Karnov Group is the leading provider of legal and tax & accounting information to businesses and professionals in Denmark and Sweden.
• We provide legislation, case law, regulatory information, news and continued legal education.
Our Users and What they Need
• Lawyers really don’t like missing out on important information.
  • Recall, recall, recall.
• Lawyers are really busy and not very patient with irrelevant information.
  • Precision, precision, precision.
• The Challenge:
  • Users need metadata to interact efficiently with the growing volume of information.
  • Manual extraction/creation of the needed metadata is not feasible.
Stemming & Decompounding – an Example
Stemming & Decompounding – The Challenge
• The user is both in a hurry and paranoid about missing out on relevant results.
• Optimize recall without bothering the user?
  • Stemming – match all forms of the query terms
  • Decompounding – match parts of compound words
• Argh, my new search engine doesn’t come with the right language components…
Stemming & Decompounding – The Solution
• Implemented using OSS components, search logs, and a dictionary.
• Modified versions of the standard Porter Stemmer built into Solr/Lucene.
• Decompounder based on simple TeX hyphenation patterns that plug directly into the standard Solr/Lucene interface.
• Trained on the top 1,000 queries from each language.
• Total development effort: 4 FTE weeks.
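A chain like this can be expressed directly in a Solr schema. The fragment below is a minimal sketch, not Karnov's actual configuration: the pattern and dictionary file names are placeholders, and the stock Snowball stemmer stands in for their modified Porter stemmer.

```xml
<!-- Sketch of a Swedish analysis chain in Solr's schema.xml.
     File names (hyph/sv.xml, dict/sv-words.txt) are placeholders. -->
<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Decompounding driven by TeX hyphenation patterns -->
    <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
            hyphenator="hyph/sv.xml"
            dictionary="dict/sv-words.txt"
            minSubwordSize="4"
            onlyLongestMatch="false"/>
    <!-- Stock Snowball stemmer; a custom stemmer would plug in here -->
    <filter class="solr.SnowballPorterFilterFactory" language="Swedish"/>
  </analyzer>
</fieldType>
```

Both filters ship with Solr/Lucene, which is what makes the approach "100% integrated" with the standard architecture.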
Stemming & Decompounding – The Result
• Yes, it overgenerates.
• No, it’s not a grammatically correct analysis.
• No, the users don’t care about the underlying analysis.
• Yes, the users are happy about the search quality.
• No, it doesn’t require additional licensing. ☺
• Yes, it’s 100% integrated into the standard Solr/Lucene architecture.
Subject Classification – The Case
• We receive 350,000 new Swedish court cases per year
  • …in flat PDF files ☹
• Without subject classification and basic metadata, the content is of very limited value to the user:
  • Facets for navigation
  • News alerts & monitoring
• Manual classification of 1000+ documents a day is not feasible.
Subject Classification – a case for Machine Learning
• Let’s train a classifier and automate the process…
  • What software do we use?
  • How do we compile training data?
  • How do we measure the quality?
  • When is the quality good enough?
• Oh my, before we start we also need
  • a part-of-speech tagger for Swedish
  • a lemmatizer optimized for Swedish legal texts
• Is it really that complex?
Subject classification – Machine Learning Results
• Software: Python scikit-learn
  • Production-grade OSS
  • Well documented
  • Integrated well with our existing content pipeline
• No NLP pre-processors needed – a simple “bag of words” sufficed
• Legacy data served as training data as-is
• Quality (F1 measure): ca. 0.85+ for most of the topics
• Not too bad, but…
  • What about the subjects that didn’t work?
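A bag-of-words classifier of this kind takes only a few lines in scikit-learn. The sketch below is illustrative: the toy documents, labels, and model choice (TF-IDF plus a linear SVM) are assumptions, not Karnov's actual training data or pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training set of (document text, subject label) pairs -- in the real
# setup, legacy manually classified documents served as training data.
train = [
    ("uppsagning av anstallning arbetsgivare kollektivavtal", "labour"),
    ("arbetsdomstolen anstallningsskydd lon arbetstagare", "labour"),
    ("direktiv fri rorlighet unionsmedborgare medlemsstat", "eu"),
    ("kommissionen direktiv genomforande unionsratt", "eu"),
    ("skatteverket inkomstskatt avdrag deklaration", "tax"),
    ("moms skattetillagg taxering skattskyldig", "tax"),
]
texts, labels = zip(*train)

# "Bag of words" + linear classifier: no POS tagging or lemmatization.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["arbetsgivare sade upp arbetstagare"])[0])
```

On real data, quality would be measured with `sklearn.metrics.f1_score` on a held-out split, matching the F1 figures above.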
Subject Classification – coupling Machine Learning and naïve heuristics
• Hey, if it’s from the labour court it’s probably labour law!
  • Use all available context
• If it goes on about EU directives it’s probably EU law
  • /direktiv \d{4}\/\d+\/EG/
• 200+ simple expressions improved classification to production quality.
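Layering such a heuristic on top of the classifier can be sketched as below. The directive pattern is the one from the slide; the override logic and the label names are illustrative assumptions about how the two stages might be combined.

```python
import re

# High-precision context rule: explicit EU-directive citations mark EU law.
EU_DIRECTIVE = re.compile(r"direktiv \d{4}/\d+/EG")

def classify(text: str, ml_label: str) -> str:
    """Let a simple heuristic override the statistical model's label."""
    if EU_DIRECTIVE.search(text):
        return "eu"
    return ml_label

print(classify("i enlighet med direktiv 2004/38/EG", "labour"))  # -> eu
```

Because each rule only fires on a narrow, high-precision pattern, 200+ of them can correct the classifier's weak spots without degrading the topics that already work.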
Subject Classification – combined results
• A developer-only team would still be training that classifier…
• Early involvement of local domain expertise was crucial.
• Combining tools & tasks pairwise into smaller separate pipeline components provided technical flexibility and helped counter over-engineering.
Key Learnings
• Rich metadata are key for a quality user experience.
• Manual metadata processing is only feasible for small data sets.
• Text analytics are a cost-effective means to more metadata.
• There is so much good analytics software out there.
• Be wary of over-engineering.
• Keep the focus on the user experience, not the technology.
• Get started!
• Know your data!