dev ops-presentation

Text mining Fuzzy document classification Using Elasticsearch Lev Ozeryansky

Upload: lev-ozeryansky

Post on 14-Jul-2015


TRANSCRIPT

Page 1: Dev ops-presentation

Text mining

Fuzzy document classification

Using Elasticsearch

Lev Ozeryansky

Page 2: Dev ops-presentation

Identity Card

• Merger of Ankor and We!

• Owned by Hilan (publicly traded on the Tel Aviv Stock Exchange)

• Fast growing IT integration company

• Over 2000 systems installed and maintained

• Over 1000 leading customers - Hi-tech, Industry, Academia, Banks, Insurance

• Strong technological team – over 45 engineers, professional services and project managers

• Over 120 employees

• Four main divisions – Infrastructure, Big Data, Cloud, Cyber

Page 3: Dev ops-presentation

Technology Edge

Page 4: Dev ops-presentation

What is classification

• Document classification as document categorization.

• Using classification.

• Our classification data source.

• What do we do with it?

• Java programmer.

• .NET programmer.

Page 5: Dev ops-presentation

Data source

Page 6: Dev ops-presentation

The mathematics

• Let there be a set of classes

• Let there be a set of documents

• Classification function
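The slide's notation is not reproduced here. One conventional way to write this setup is sketched below. The symbol names C, D and f are assumptions chosen for illustration, as is the fuzzy range [0, 1].

```latex
C = \{c_1, \dots, c_m\} \quad \text{(class set)}, \qquad
D = \{d_1, \dots, d_n\} \quad \text{(document set)}, \qquad
f : D \times C \to [0, 1]
```

Under this reading, f(d, c) is the degree to which document d belongs to class c. A crisp (non-fuzzy) classifier would instead map each document in D to a single class in C.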

Page 7: Dev ops-presentation

Classification method

• Cosine similarity

• Function
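For reference, the standard definition of cosine similarity between two vectors A and B with n components (the symbol names are chosen here, not taken from the slide):

```latex
\operatorname{sim}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
```

For vectors with non-negative components, such as term weights, the value lies between 0 (no shared terms) and 1 (same direction).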

Page 8: Dev ops-presentation

Build document class vector

• Java programmer

• Java

• 5

• Hibernate

• .NET programmer

• C#

• 5

• NHibernate

Page 9: Dev ops-presentation

Let's index the classifiers

• Add weight manually.

• For Java programmer:

• Java = 0.7

• 5 = 0.5

• Hibernate = 0.3

• For .NET programmer:

• C# = 0.7

• 5 = 0.5

• NHibernate = 0.3
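The manually weighted classifiers above could be turned into documents ready for indexing. This is a minimal sketch: the document layout (the `class`, `term` and `weight` fields) and the lowercasing of terms are assumptions, not taken from the talk, and the step of sending the documents to Elasticsearch is left out.

```python
# Manually assigned term weights per class, copied from the slide.
CLASS_WEIGHTS = {
    "Java programmer": {"Java": 0.7, "5": 0.5, "Hibernate": 0.3},
    ".NET programmer": {"C#": 0.7, "5": 0.5, "NHibernate": 0.3},
}


def build_classifier_docs(class_weights):
    """Flatten the weights into one document per (class, term) pair.

    The field names "class", "term" and "weight" are hypothetical.
    Terms are lowercased so they can match lowercased tokens later.
    """
    docs = []
    for class_name, weights in class_weights.items():
        for term, weight in weights.items():
            docs.append(
                {"class": class_name, "term": term.lower(), "weight": weight}
            )
    return docs


classifier_docs = build_classifier_docs(CLASS_WEIGHTS)
for doc in classifier_docs:
    print(doc)
```

One document per (class, term) pair keeps each weight individually addressable, so a later query on the `term` field returns exactly the weights that matter for a given text.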

Page 10: Dev ops-presentation

DEMO

Page 11: Dev ops-presentation

w-shingling

• In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document. (Wikipedia)

• Tokenization

• Elasticsearch analyze mechanism
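To make the definition concrete, here is a small pure-Python sketch of w-shingling. The whitespace tokenizer is a deliberately naive stand-in for a real analyzer, and the function names `tokenize` and `shingles` are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (naive tokenizer)."""
    return text.lower().split()


def shingles(tokens, w):
    """Return the set of unique shingles: contiguous runs of w tokens."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


tokens = tokenize("a rose is a rose is a rose")
for shingle in sorted(shingles(tokens, 4)):
    print(" ".join(shingle))
```

The eight tokens produce five windows of width 4, but only three of them are distinct, which is why the result is a set rather than a list.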

Page 12: Dev ops-presentation

DEMO

Page 13: Dev ops-presentation

Classification process

• Tokens array.

• Classification query.

• Use terms query when terms array == tokens array

• Two vectors

• Vector of filtered tokens

• Classification vector
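The steps above can be sketched in plain Python. The field name `term`, the helper names, and the choice of a 0/1 token vector over the class's own terms are all assumptions for illustration; the talk does not specify how the two vectors are laid out.

```python
def build_terms_query(tokens, field="term"):
    """Build a terms query body whose terms array is the tokens array.

    The field name "term" is a hypothetical mapping choice.
    """
    return {"query": {"terms": {field: list(tokens)}}}


def build_vectors(tokens, class_weights):
    """Return (token_vector, class_vector) over one class's terms.

    token_vector holds 1.0 where a class term occurs among the tokens
    and 0.0 otherwise (the "filtered tokens"); class_vector holds the
    manually assigned weight of each class term, in the same order.
    """
    terms = sorted(class_weights)
    present = set(tokens)
    token_vector = [1.0 if term in present else 0.0 for term in terms]
    class_vector = [class_weights[term] for term in terms]
    return token_vector, class_vector


tokens = ["senior", "java", "developer", "hibernate"]
java_weights = {"java": 0.7, "5": 0.5, "hibernate": 0.3}

query = build_terms_query(tokens)
token_vector, class_vector = build_vectors(tokens, java_weights)
print(query)
print(token_vector, class_vector)
```

Sorting the class terms fixes a shared dimension order, so that position i means the same term in both vectors.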

Page 14: Dev ops-presentation

DEMO

Page 15: Dev ops-presentation

Classification process

• SciPy to calculate distance.
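A minimal sketch of this step with SciPy follows. `scipy.spatial.distance.cosine` returns the cosine distance, that is 1 minus the cosine similarity. The shared vocabulary order and the example document vector are assumptions made for illustration.

```python
from scipy.spatial.distance import cosine

# Assumed shared dimension order: java, c#, 5, hibernate, nhibernate
class_vectors = {
    "Java programmer": [0.7, 0.0, 0.5, 0.3, 0.0],
    ".NET programmer": [0.0, 0.7, 0.5, 0.0, 0.3],
}

# Hypothetical document that mentions "java" and "hibernate" only.
document_vector = [1.0, 0.0, 0.0, 1.0, 0.0]


def similarity(u, v):
    """Cosine similarity derived from SciPy's cosine distance."""
    return 1.0 - cosine(u, v)


scores = {
    name: similarity(document_vector, vector)
    for name, vector in class_vectors.items()
}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Because the scores are graded rather than yes/no, a document can be reported as partly matching several classes, which fits the fuzzy framing of the talk.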

Page 16: Dev ops-presentation

Q&A