dev ops-presentation
Text mining
Fuzzy document classification
Using Elasticsearch
Lev Ozeryansky
Identity Card
• Merger of Ankor and We!
• Owned by Hilan (publicly traded on the Tel Aviv Stock Exchange)
• Fast growing IT integration company
• Over 2000 systems installed and maintained
• Over 1000 leading customers - Hi-tech, Industry, Academy, Banks, Insurance
• Strong technological team – over 45 engineers, professional services and project managers
• Over 120 employees
• Four main divisions – Infrastructure, Big Data, Cloud, Cyber
Technology Edge
What is classification
• Document classification as document categorization.
• Using classification.
• Our classification data source.
• What do we do with it?
  • Java programmer.
  • .NET programmer.
Data source
The mathematics
• Let $C$ be the set of classes
• Let $D$ be the set of documents
• Classification function $f : D \to C$
Classification method
• Cosine similarity
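• Function: $\cos(\theta) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ for two term vectors $A$ and $B$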
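As a small illustration of the cosine similarity used here, the sketch below computes it in plain Python for two vectors of equal length. The function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for this example, not something stated on the slides.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

A value close to 1 means the two vectors point in nearly the same direction; 0 means they share no terms.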
Build document class vector
• Java programmer
  • Java
  • 5
  • Hibernate
• .NET programmer
  • C#
  • 5
  • NHibernate
Let's index the classifiers
• Add weight manually.
• For Java programmer:
  • Java = 0.7
  • 5 = 0.5
  • Hibernate = 0.3
• For .NET programmer:
  • C# = 0.7
  • 5 = 0.5
  • NHibernate = 0.3
DEMO
w-shingling
• In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document. (Wikipedia)
• Tokenization
• Elasticsearch analyze mechanism
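To make the idea of w-shingling concrete, here is a minimal pure-Python sketch: a simple tokenizer followed by a function that collects the unique runs of `w` consecutive tokens. The tokenizer rule (lowercase, keep letters, digits, `#`, `+` and `.`) is an assumption for this example and is not the analyzer used in the demo. The `ANALYZE_BODY` dictionary shows roughly what a request body for the Elasticsearch `_analyze` API with a shingle token filter could look like; the sample text and shingle sizes are illustrative only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    raw = re.findall(r"[a-z0-9#+.]+", text.lower())
    # Drop sentence-ending dots but keep leading ones, as in ".net".
    return [t.rstrip(".") for t in raw if t.rstrip(".")]


def shingles(tokens, w):
    """Return the set of unique contiguous token runs of length w."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# Illustrative body for POST /_analyze (values are assumptions).
ANALYZE_BODY = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2},
    ],
    "text": "Senior Java programmer",
}
```

For the sentence "a rose is a rose is a rose" and `w = 4`, the five overlapping runs collapse to three unique shingles, because two of them repeat.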
DEMO
Classification process
• Tokens array.
• Classification query.
  • Use a terms query in which the terms array is the tokens array.
• Two vectors
  • Vector of filtered tokens
  • Classification vector
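The steps above can be sketched as follows, under stated assumptions: the classifier terms are indexed in a field called `term` (a hypothetical name), and the token vector is binary, with 1.0 for each class term that appears among the document tokens. Neither detail is confirmed by the slides. `build_terms_query` only builds the request body for a terms query and does not contact a cluster.

```python
def build_terms_query(tokens, field="term"):
    """Request body for a terms query matching any of the document tokens."""
    return {"query": {"terms": {field: sorted(set(tokens))}}}


def aligned_vectors(tokens, class_weights):
    """Return (token_vector, class_vector) laid out over the class terms.

    token_vector holds 1.0 where a class term occurs in the document tokens
    and 0.0 otherwise; class_vector holds the class weights in the same order.
    """
    terms = sorted(class_weights)
    present = set(tokens)
    token_vec = [1.0 if term in present else 0.0 for term in terms]
    class_vec = [class_weights[term] for term in terms]
    return token_vec, class_vec
```

For example, the tokens `["senior", "java", "programmer", "5", "years"]` compared against the Java programmer weights give a token vector of `[1.0, 0.0, 1.0]` and a class vector of `[0.5, 0.3, 0.7]`, in the order `5`, `hibernate`, `java`.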
DEMO
Classification process
• SciPy to calculate distance.
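A minimal sketch of the final step, assuming the distance in question is the cosine distance (1 minus the cosine similarity) from `scipy.spatial.distance.cosine`. The pure-Python fallback is only there so the example still runs where SciPy is not installed. The `classify` function and its input layout are hypothetical.

```python
import math

try:
    from scipy.spatial.distance import cosine as cosine_distance
except ImportError:
    # Fallback so the sketch runs without SciPy installed.
    def cosine_distance(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        return 1.0 - dot / (norm_u * norm_v)


def classify(candidates):
    """Pick the class whose vector is closest to the token vector.

    candidates maps a class name to a (token_vector, class_vector) pair.
    Returns (class_name, distance), or None if no class shares a term.
    """
    scored = []
    for name, (token_vec, class_vec) in candidates.items():
        if not any(token_vec):
            continue  # no term in common, distance is undefined
        scored.append((name, float(cosine_distance(token_vec, class_vec))))
    return min(scored, key=lambda pair: pair[1]) if scored else None
```

Usage, with vectors for a document that mentions "java" and "5":

```python
result = classify({
    "Java programmer": ([1.0, 0.0, 1.0], [0.5, 0.3, 0.7]),
    ".NET programmer": ([1.0, 0.0, 0.0], [0.5, 0.7, 0.3]),
})
# result[0] is "Java programmer", the class with the smaller distance
```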
Q&A