Smart Crawler: Using Committee Machines for Web Pages Continuous Classification
Luiz Henrique Zambom Santana, Prof. Dr. Ronaldo dos Santos Mello and Prof. Dr. Mauro Roisenberg
Federal University of Santa Catarina – Florianópolis/SC
WebMedia – Manaus, 2015


Page 1: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Luiz Henrique Zambom Santana, Prof. Dr. Ronaldo dos Santos Mello and Prof. Dr. Mauro Roisenberg

Federal University of Santa Catarina – Florianópolis/SC

WebMedia – Manaus, 2015

Page 2: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Agenda
• Goals
• Motivation
• Model
• Architecture
• Implementation
• Experiments
• Conclusions

Page 3: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Goals
• Idea:
  • If:
    • www.infomoney.com.br = Finance
    • www.lance.com.br = Football
    • www.4rodas.com.br = Cars
  • Then:
    • www.valor.com.br = Finance
    • placar.abril.com.br = Football
    • revistaautoesporte.globo.com = Cars

Page 4: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Motivation
• If we know the category of a page, then:
  • We can parse it better
  • We can provide better search results
  • We can customize the user experience
• Classify web page contents to generate a dataset

Page 5: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Motivation
• Using ML techniques seemed a good idea, but:
  • We need to scale, so Matlab was not an option
  • We need to collect and classify pages continuously, so we need to index the pages
• After finding the right tools, we had the following question:
  • What is the best ML technique to use? We tried:
    • Naive Bayes, but the degree of class overlap is not small in our case
    • SVM, but a single SVM can only separate two classes
• We decided to create a committee machine of SVM models
  • Better generalization
  • Could be very slow

Page 6: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Model

Page 7: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Implementation
• Cloud-ready technologies:
  • Apache Spark
  • Elasticsearch
• Java frameworks:
  • Crawler4J
  • Apache Lucene
  • Jsoup: parsing

Page 8: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Support vector machine (SVM)
• Non-probabilistic binary linear classifier
• The number of iterations can be parametrized
• Slow!
• “One Vs. All” approach with committee [1, 2]
• The model that receives the most votes is the winner

[1] e Silva, Sergio Roberto de Lima, and Mauro Roisenberg. "Continuous authentication by keystroke dynamics using committee machines." Intelligence and Security Informatics. Springer Berlin Heidelberg, 2006. 686-687.
[2] Sun, Bing-Yu, et al. "Support vector machine committee for classification." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 648-653.

Models in the committee:
• Finance vs. Sport
• Finance vs. Movies
• Finance vs. Cars
• Sport vs. Movies
• Sport vs. Cars
• Movies vs. Cars
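
The voting step can be illustrated with a minimal Java sketch. It follows the six class pairs listed on this slide (one binary model per pair), even though the slide labels the approach "One Vs. All". Everything here is an assumption made for illustration: the names `CommitteeSketch`, `PairModel`, `buildToyCommittee`, and `classify` are hypothetical, and a keyword test stands in for a trained SVM decision function. It is not the code from the project repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Sketch of a voting committee of binary classifiers (all names are hypothetical). */
class CommitteeSketch {

    /** One binary model: votes for classA when the decision is true, otherwise for classB. */
    static final class PairModel {
        final String classA;
        final String classB;
        final Predicate<String> prefersA; // stand-in for a trained SVM decision function

        PairModel(String classA, String classB, Predicate<String> prefersA) {
            this.classA = classA;
            this.classB = classB;
            this.prefersA = prefersA;
        }

        String vote(String text) {
            return prefersA.test(text) ? classA : classB;
        }
    }

    /** Builds one model per unordered pair of classes; keyword presence is a toy decision rule. */
    static List<PairModel> buildToyCommittee(Map<String, String> keywordByClass) {
        List<String> classes = new ArrayList<>(keywordByClass.keySet());
        List<PairModel> committee = new ArrayList<>();
        for (int i = 0; i < classes.size(); i++) {
            for (int j = i + 1; j < classes.size(); j++) {
                String keyword = keywordByClass.get(classes.get(i));
                committee.add(new PairModel(classes.get(i), classes.get(j),
                        text -> text.contains(keyword)));
            }
        }
        return committee;
    }

    /** Returns the class with the most votes; on a tie, the class that reached that count first. */
    static String classify(List<PairModel> committee, String text) {
        Map<String, Integer> votes = new HashMap<>();
        String winner = null;
        int best = 0;
        for (PairModel model : committee) {
            String choice = model.vote(text);
            int count = votes.merge(choice, 1, Integer::sum);
            if (count > best) {
                best = count;
                winner = choice;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // Illustrative keywords only; real models would be trained on crawled pages.
        Map<String, String> keywords = new LinkedHashMap<>();
        keywords.put("finance", "market");
        keywords.put("sport", "goal");
        keywords.put("movies", "film");
        keywords.put("cars", "engine");
        List<PairModel> committee = buildToyCommittee(keywords);
        System.out.println(committee.size() + " pairwise models");
        System.out.println(classify(committee, "stock market shares"));
    }
}
```

With k classes this layout needs k(k-1)/2 binary models, which is one reason a committee can become slow as the number of classes grows.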

Page 9: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Architecture

Page 10: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Implementation details - Training
1. A set of pages is used as input to the models

// Seed pages per class, used as labeled training input
String[] pagesCars = {"http://g1.globo.com/carros/index.html", "http://quatrorodas.abril.com.br/"};
String[] pagesFinance = {"http://www.valor.com.br/", "http://www.infomoney.com.br/", "http://exame.abril.com.br/"};
String[] pagesSport = {"http://globoesporte.globo.com/", "http://oledobrasil.com.br/", "http://espn.uol.com.br"};
String[] pagesMovies = {"http://www.imdb.com/list/ls002231878/", "http://www.adorocinema.com/", "http://www.filmeb.com.br/", "http://www.revistabula.com/3165-lista-dos-100-melhores-filmes-de-todos-os-tempos-segundo-hollywood/"};

2. A set of pages is used as input to the models

Page 11: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Implementation details - Training
3. Clean the page and calculate the feature vector using HashingTF
• Get only the page text (i.e., exclude HTML tags)
• Use Lucene to remove stopwords, symbols, numbers, and other meaningless parts
• Calculate the term frequency and create a feature vector
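
The cleaning and hashing steps can be sketched in plain Java. This is a simplified illustration of the hashing trick behind a term-frequency vector, not Spark's HashingTF or Lucene's analyzers. The class name `HashingTfSketch`, the tiny stopword list, and the use of `String.hashCode` are assumptions, so the bucket indices will not match what Spark produces.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch of a hashed term-frequency vector (hypothetical helper, not Spark's HashingTF). */
class HashingTfSketch {

    // Tiny illustrative stopword list (assumption); the slides use Lucene for this step.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "to", "in"));

    /** Maps each remaining term to a bucket by hash and counts occurrences per bucket. */
    static double[] featureVector(String pageText, int numFeatures) {
        double[] vector = new double[numFeatures];
        // Splitting on non-letters drops symbols, digits, and punctuation.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue;
            }
            // floorMod keeps the index non-negative even when hashCode is negative.
            int index = Math.floorMod(token.hashCode(), numFeatures);
            vector[index] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] vector = featureVector("The price of the stock and the stock market in 2015", 16);
        System.out.println(Arrays.toString(vector));
    }
}
```

A small `numFeatures` makes collisions between unrelated terms likely; a real setup would pick a much larger vector size.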

Page 12: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Implementation details - Training
4. Test the data against the models

16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.filmeb.com.br/ the model predicts movies
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.valor.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.infomoney.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://exame.abril.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://oledobrasil.com.br/ the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://espn.uol.com.br the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://sportv.globo.com/site/ the model predicts movies

Page 13: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Experiments
• First dataset
  • Classes: Finance (Infomoney), Sports (Lance), Movies (IMDB), and Cars (4 Rodas)
• Second dataset
  • Classes: Lifestyle, Soap opera, Technology
• Most of the documents are correctly classified, but there was also a lot of ambiguity:

Page 14: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Other problems
• Templates in portals (headers and footers)
• Documents with little information (e.g., "assine já", i.e., "subscribe now")
• Documents with too much information (e.g., the main page)

Page 15: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Focused crawler
• With 100 labeled pages of each kind, we ran the focused crawler with the categories Carreira, Mercados, Onde Investir, and Negócios
• The page structure is easier to test and provides much better results:
• The errors are due to texts that belong to more than one category, for instance:

Page 16: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Performance experiments
• Three experiments:
  • 1: 200,000 classifications in 30 minutes, with 4 classes
  • 2: 180,000 classifications, with 8 classes
  • 3: Focused crawler, with 3 classes

Page 17: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Current version
• eCrawler
• Disambiguation
• Pipeline of methods:

Page 18: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Conclusions
• Cloud-ready technologies, such as Apache Spark and Elasticsearch, enable the Smart Crawler to scale according to the application's needs;
• SVM, a traditional machine learning method, implemented as a committee machine, can improve the generalization power of the classification components;
• The architecture is designed to be general-purpose, so it can be used to crawl different domains and make their content available for transformation, search, and retrieval operations.
• The source code is available at:
  • https://github.com/lhzsantana/smart-crawler

Page 19: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification

Thank you!