David Baehrens
Large-Scale Patent Classification
at the European Patent Office
ABOUT AVERBIS
Founded: 2007
Location: Freiburg im Breisgau
Team: Domain & IT-Experts
Focus: Leverage structured & unstructured information
Current Sectors: Pharma, Health, Automotive, Publishers & Libraries
PORTFOLIO
Solutions
Libraries Pharma Patents Healthcare Social Media
Terminology Management Text Mining
Search & Analytics NoSQL
Categorization & Clustering
Automotive
TERMINOLOGY MANAGEMENT
Terminology management software
Provision of terminologies
Mappings between terminologies
Building terminology-based applications
Synonyms: dimethyl sulfoxide, dimethylsulfoxide, Domoso, Infiltrina
Hierarchies: cancer, carcinoma, melanoma, lymphoma, glioblastoma…
Patterns: dates, citations, mail addresses…
Rule-based extraction of all different kinds of complex information
Persons, Locations, Genes, …
Co-occurrences, Typed Relations, e.g. Genes / Diseases / Modification Type
TEXT MINING
Term Detection
Regular Expressions
Rule Engine
Named Entities
Relations
Sentences, Tokens, POS-Tags, Chunks, Paragraphs, Sections, Stemming, Decompounding… Syntax Detection
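As a rough illustration of term detection and pattern matching, the following minimal sketch maps the synonyms from the example above to one preferred term and finds dates and mail addresses with regular expressions. The dictionary layout, the function names and the two simplified patterns are assumptions made for this example, not the product's implementation.

```python
import re

# Hypothetical synonym dictionary: lower-cased surface form -> preferred term.
SYNONYMS = {
    "dimethyl sulfoxide": "dimethyl sulfoxide",
    "dimethylsulfoxide": "dimethyl sulfoxide",
    "domoso": "dimethyl sulfoxide",
    "infiltrina": "dimethyl sulfoxide",
}

# Simplified, illustrative patterns (ISO-style dates and plain mail addresses only).
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "mail": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_terms(text):
    """Return sorted (start, end, preferred_term) tuples for every dictionary hit."""
    # Offsets refer to the lower-cased text; this assumes lower() keeps the
    # string length unchanged, which holds for ASCII input.
    lowered = text.lower()
    hits = []
    for surface, preferred in SYNONYMS.items():
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", lowered):
            hits.append((m.start(), m.end(), preferred))
    return sorted(hits)


def detect_patterns(text):
    """Return sorted (start, end, pattern_name, matched_text) tuples."""
    hits = []
    for name, regex in PATTERNS.items():
        for m in regex.finditer(text):
            hits.append((m.start(), m.end(), name, m.group()))
    return sorted(hits)
```

Both functions return character offsets, so later stages (rules, named entities, relations) could build on the same annotations.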
RULE ENGINE
1. NAME OF THE MEDICINAL PRODUCT
Desloratadine ratiopharm 5 mg film-coated tablets
Primary Field Name   Secondary Field Name         Field Value
MedicalProductName   coveredText                  Desloratadine ratiopharm 5 mg film-coated tablets
                     inventedPartName             DESLORATADINE
                     strengthPart                 5 mg
                     pharmaceuticalDoseFormPart   FILM-COATED TABLET
Text
Rule
Result
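The step from text to structured result can be sketched with a single rule. The regular expression below is a hypothetical stand-in for a rule-engine rule; it assumes the line has the shape "invented name, optional company name, strength, dose form" and only knows the units mg, g and ml.

```python
import re

# Hypothetical rule: <invented name> [<company>] <strength> <dose form>
RULE = re.compile(
    r"^(?P<invented>\w+)\s+(?:.*?\s+)?"
    r"(?P<strength>\d+(?:[.,]\d+)?\s*(?:mg|g|ml))\s+"
    r"(?P<form>.+?)s?$",
    re.IGNORECASE,
)


def extract_product_name(line):
    """Split a product name line into the fields shown in the result table."""
    m = RULE.match(line.strip())
    if m is None:
        return None  # the rule does not apply to this line
    return {
        "coveredText": m.string,
        "inventedPartName": m.group("invented").upper(),
        "strengthPart": m.group("strength"),
        # The trailing plural "s" is dropped by the rule before upper-casing.
        "pharmaceuticalDoseFormPart": m.group("form").upper(),
    }
```

Applied to "Desloratadine ratiopharm 5 mg film-coated tablets", this sketch yields the four field values listed in the table.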
SEARCH & NOSQL
Free text + concept-based search
Text mining integration
Guided navigation / facets
NoSQL functionalities
Multi- & cross-lingual search
Related documents
Based on Apache Solr
• Extended Query Syntax
• JSON-API
• Scalability
…
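A combined free-text and concept search with facets could look roughly like this at the Solr level. The sketch uses only standard Solr request parameters (q, fq, facet, facet.field, rows, wt), not the extended query syntax or the JSON-API mentioned above. The core URL, the field names concept and department, and the concept identifier are assumptions.

```python
from urllib.parse import urlencode


def build_select_url(base_url, text, concept_ids=(), facet_fields=(), rows=10):
    """Build a URL for Solr's standard /select handler (no request is sent)."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    for concept_id in concept_ids:
        # Hypothetical field "concept" holding concept IDs found by text mining.
        params.append(("fq", f'concept:"{concept_id}"'))
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Usage with a hypothetical local core named "patents".
url = build_select_url(
    "http://localhost:8983/solr/patents",
    "film-coated tablet",
    concept_ids=["C1"],
    facet_fields=["department"],
)
```

The filter queries restrict the result set by concept, while the facet fields drive the guided navigation.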
DOCUMENT CLASSIFICATION
Hotel Reviews
Patents
SEARCH & NOSQL
INFORMATION DISCOVERY
Terminology Management Text Mining
Search & Analytics NoSQL
Categorization & Clustering
Delivery / Deployment / Runtime Environment
Integration Tests / Continuous Integration
Extensive Documentation
Common Architecture / Application Design
User & Role Management, Security
Communication Bus
Project Management
PATENT CLASSIFICATION AT EPO
Tender No. 1585
1) Pre-Classification of unpublished patents into departments
2) Re-Classification of published patents when the category system changes
ABOUT EPO
• The European Patent Office (EPO) grants European patents for the Contracting States to the European Patent Convention
• Second largest intergovernmental institution in Europe
• Not an EU institution
• Self-financing, i.e. revenue from fees covers operating and capital expenditure
NUMBER OF STAFF
Status: December 2008
PATENT APPLICATIONS
http://www.epo.org/about-us/annual-reports-statistics/annual-report/2014.html
COOPERATIVE PATENT CLASSIFICATION
• Patent classification system based on ECLA / IPC
• Jointly developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO)
• Used by both the EPO and the USPTO since 1 January 2013
• Currently contains about 250,000 classes
EXAMPLE CPC CLASS
GRANTED PATENT
EARLY PATENT
PATENT CLASSIFICATION AT EPO
Tender No. 1585
1) Pre-Classification of unpublished patents into departments
Our Motivation:
• Great Classification Use-Case
– Big Data (80 million patents available)
– Large-scale category system (>250,000 CPC codes)
– Strict constraints on classification quality and response time
• Text Mining Success Story
OLD CLASSIFICATION PROCESS
PATENTS CLASSIFICATION DEPARTMENTS
CLASSIFICATION COMPLEXITY
~250,000 CPC Codes
~1,500 Ranges
250 Departments
CLASSIFICATION PROCESS
PATENTS CLASSIFICATION DEPARTMENTS
NEW CLASSIFICATION PROCESS
PATENTS CLASSIFICATION DEPARTMENTS
SOME FACTS
• About 650k training documents from 2005-2013
• Supervised learning: lightweight and fast linear support vector machine
• Training time (16 cores, 128 GB RAM)
– Feature extraction: ~1 hour
– Training of classifiers: ~1 hour
– 90/10 tests with a look-ahead of 3 levels and reporting the 3 best candidates: ~1 hour
• Prediction: 5 docs in 5 sec
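The 90/10 test with three reported candidates can be pictured as a hold-out split plus a top-3 hit rate. This is a minimal sketch under assumptions: the function names, the fixed random seed and the input format (one ranked candidate list per test document) are illustrative, and the classifier itself is left out.

```python
import random


def split_90_10(docs, seed=0):
    """Shuffle the documents reproducibly and split them 90% / 10%."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.9)
    return shuffled[:cut], shuffled[cut:]


def top_k_hit_rate(ranked_predictions, gold_labels, k=3):
    """Fraction of documents whose gold label is among the k best candidates."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)
```

A document counts as correct if its true class appears anywhere in the three best candidates, which is more lenient than requiring the first candidate to be right.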
HIERARCHICAL CLASSIFICATION
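One common way to classify into a hierarchy is to descend it top-down and keep only the best few candidates per level. The sketch below shows that general idea. It is not confirmed to be the method used in this project, and it does not model the look-ahead mentioned above. The scoring function, the toy keyword sets and the department names are assumptions; in practice the score would come from trained classifiers.

```python
def classify_top_down(features, children, score, roots, beam=3):
    """Descend the hierarchy level by level, keeping the `beam` best nodes."""
    frontier = sorted(roots, key=lambda n: score(features, n), reverse=True)[:beam]
    while any(children.get(node) for node in frontier):
        candidates = []
        for node in frontier:
            # Inner nodes are replaced by their children; leaves stay candidates.
            candidates.extend(children.get(node) or [node])
        frontier = sorted(
            candidates, key=lambda n: score(features, n), reverse=True
        )[:beam]
    return frontier


# Toy hierarchy with hypothetical departments above a few CPC symbols.
CHILDREN = {
    "Dept A": ["A01", "A61"],
    "Dept B": ["H04"],
    "A61": ["A61K", "A61P"],
}
# Hypothetical keyword sets standing in for one trained model per node.
KEYWORDS = {
    "Dept A": {"tablet", "plant"}, "Dept B": {"signal"},
    "A01": {"plant"}, "A61": {"tablet", "drug"}, "H04": {"signal"},
    "A61K": {"tablet"}, "A61P": {"therapy"},
}


def keyword_score(features, node):
    """Toy score: number of document features shared with the node's keywords."""
    return len(features & KEYWORDS[node])


best = classify_top_down({"tablet", "drug"}, CHILDREN, keyword_score,
                         roots=["Dept A", "Dept B"], beam=2)
```

Pruning per level means only a small part of the roughly 250,000 codes has to be scored for each document, at the risk of discarding the correct branch early.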
STATUS & OUTLOOK
• Range-specific quality evaluation
• Going live with best ranges
• Continuous optimization
PATENT CLASSIFICATION AT EPO
Tender No. 1585
2) Re-Classification of published patents when the category system changes
Challenges and Facts:
– 250,000 CPC codes, regular changes/refinements
– Several re-classification projects at any one time with great variation in size; a class is split into 5-20(?) subclasses
– No training material available
NEW RE-CLASSIFICATION PROCESS
Training Data
• A human annotator starts by labeling about 20% of the documents with the new subclasses
Statistical Models
• are generated on the fly, and
• cross-validation tests are carried out
Threshold
• If cross-validation reaches a certain threshold (e.g. 90%), the remaining documents are classified fully automatically without further review
• Otherwise, more training data is generated
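The process above amounts to a label, validate, decide loop, sketched here with hypothetical callbacks. The names ask_annotator, train and cross_validate, the batch size of 20% per round and the use of hashable document IDs are assumptions for illustration.

```python
def reclassify(documents, ask_annotator, train, cross_validate,
               threshold=0.9, step=0.2):
    """Label documents in batches until cross-validation reaches the threshold.

    Returns (manual_labels, automatic_labels), both mapping document -> subclass.
    """
    labeled = {}
    unlabeled = list(documents)
    batch = max(1, int(len(unlabeled) * step))
    while unlabeled:
        # A human annotator labels the next batch with the new subclasses.
        for doc in unlabeled[:batch]:
            labeled[doc] = ask_annotator(doc)
        unlabeled = unlabeled[batch:]
        # Models are rebuilt on the fly and checked by cross-validation.
        if cross_validate(labeled) >= threshold:
            model = train(labeled)
            return labeled, {doc: model(doc) for doc in unlabeled}
    # Threshold never reached: every document ended up labeled manually.
    return labeled, {}
```

If the threshold is reached after the first batch, about 80% of the documents are classified without review; every failed round moves another batch to the annotator.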
STATUS & OUTLOOK
• Currently in evaluation phase
• Going live in the next few weeks
…NOT ONLY PATENTS
Solutions
Libraries Pharma Patents Healthcare Social Media
Terminology Management Text Mining
Search & Analytics NoSQL
Categorization & Clustering
Automotive