Introduction to Information Retrieval · cir.dcs.uni-pannon.hu/cikkek/IR_seminar1.pdf

Introduction to Information Retrieval 1. seminar IR architecture, document processing, indexing, weighting University of Pannonia Tamás Kiezer, Miklós Erdélyi

Post on 17-Jan-2020


Introduction to Information Retrieval

1. seminar

IR architecture, document processing, indexing, weighting

University of Pannonia

Tamás Kiezer, Miklós Erdélyi

Review (1)

• IR architecture overview

Review (2)

• Document processing workflow

– Parsing

– Tokenization

– Stopword removal

– Stemming

– Inverted file building (indexing)

Parsing

• Stored information is available in diverse formats (HTML, PDF, DOC, etc.)

• Must convert them to a "canonical" format (i.e. plain text)

• Many open source tools are available to do parsing in practice

– NekoHTML, pdftohtml, PDFBox, wvWare, etc.

• Metadata (DCMI)

• Examples
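
A minimal sketch of the conversion step for the HTML case, assuming only Python's standard-library `html.parser` module (the tools named above are not used here). The names `TextExtractor` and `html_to_text` are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


page = "<html><head><style>p{}</style></head><body><p>new <b>drug</b></p></body></html>"
print(html_to_text(page))
```

Real documents would also need character-encoding detection and metadata extraction, which this sketch leaves out.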

Tokenization (segmentation)

• Chopping the document unit up into pieces called tokens

• Language-specific (needs language identification)

• How do we recognize word boundaries?

– -, /, ., ?, !, …

– e.g. by non-alphanumeric characters

• How do we handle numbers? (index size!)

• Non-trivial for East Asian languages such as Japanese and Chinese, which are written without spaces between words

• Examples
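
The boundary rule above (split at non-alphanumeric characters) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: tokens are lowercased, and the `keep_numbers` switch is a hypothetical parameter added to show the numbers question; it is not a recommendation from the seminar.

```python
import re

# One or more alphanumeric characters; "\W" and "_" both act as boundaries.
TOKEN_PATTERN = re.compile(r"[^\W_]+")


def tokenize(text, keep_numbers=True):
    """Split text at non-alphanumeric characters and lowercase the tokens."""
    tokens = [token.lower() for token in TOKEN_PATTERN.findall(text)]
    if not keep_numbers:
        # Dropping purely numeric tokens keeps the index smaller.
        tokens = [token for token in tokens if not token.isdigit()]
    return tokens


print(tokenize("State-of-the-art drugs, approved in 2020!"))
print(tokenize("State-of-the-art drugs, approved in 2020!", keep_numbers=False))
```

This rule does not work for text written without word separators; such languages need a dedicated segmenter.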

Stoplisting (1)

• Idea: too frequent or too rare words do not convey useful information

– Throw away these words during preprocessing using a stoplist

• Example English stoplist:

a ab about above ac according across ads ae af after afterwards against albeit all almost alone along already also although always among amongst an and another any anybody anyhow anyone … with within without worse worst would wow www x y ye year yet yippee you your yours yourself yourselves

Stoplisting (2)

• Automatic generation of a stoplist from the word frequency distribution
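
One possible way to derive a stoplist from the frequency distribution is sketched below. The cut-offs `top_k` and `min_count` are assumptions made for illustration; in practice they would be tuned to the collection.

```python
from collections import Counter


def build_stoplist(tokens, top_k=2, min_count=2):
    """Return the top_k most frequent words plus all words rarer than min_count."""
    frequencies = Counter(tokens)
    too_frequent = {term for term, _ in frequencies.most_common(top_k)}
    too_rare = {term for term, count in frequencies.items() if count < min_count}
    return too_frequent | too_rare


# Hypothetical token stream: "the" x5, "of" x4, "drug" x3, "new" x2, "hope" x1.
tokens = ["the"] * 5 + ["of"] * 4 + ["drug"] * 3 + ["new"] * 2 + ["hope"]
stoplist = build_stoplist(tokens)
print(sorted(stoplist))
print([token for token in ["the", "new", "drug"] if token not in stoplist])
```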

Stemming

• Idea: reduce lexicon size, improve retrieval efficiency

• Language-specific methods

– Properly handling agglutinative languages such as Hungarian is difficult

• Stemming methods

– Brute force, lemmatization, suffix stripping, affix stripping

• Over-stemming, under-stemming

• Normalization (equivalence classing of terms)

Stemming – Porter’s method

• Suffix stripping method

• Well established for stemming English texts

• 5-step algorithm

– Step 1 deals with plurals and past participles.

– Steps 2-3 remove adjective/noun formative syllables.

– Step 4 removes noun formative syllables.

– Step 5 tidies up.

• Example
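
To give a feel for suffix stripping, the sketch below covers only the plural-handling part of Step 1 (usually called Step 1a in Porter's published rules: SSES → SS, IES → I, SS → SS, S → nothing). It is not a full Porter stemmer; the past-participle rules and Steps 2-5 are left out, and the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply the first matching plural rule; rules are tried longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]  # S -> (removed)
    return word


for word in ["caresses", "ponies", "caress", "cats", "drug"]:
    print(word, "->", porter_step_1a(word))
```

Note that the output ("poni") is a stem, not necessarily a dictionary word; this is acceptable as long as queries are stemmed the same way.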

Example: Porter’s stemming rules (excerpt)

Example: Hunspell for stemming Hungarian text (too)

• Hunspell: general library for morphological analysis and stemming

• Affix stripper (does prefix and suffix stripping) with a dictionary of base words

• Example rules:

Inverted file structure – review

• Stores the postings list for each term

• Eases answering queries - how?

Inverted index construction

• Example:
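
A small sketch of inverted index construction and of how postings lists help answer a conjunctive (AND) query. The toy collection is assumed to be already tokenized, stoplisted and stemmed, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


def intersect(first, second):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    result = []
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            result.append(first[i])
            i += 1
            j += 1
        elif first[i] < second[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, term_a, term_b):
    """Return the ids of documents containing both terms."""
    return intersect(index.get(term_a, []), index.get(term_b, []))


docs = {
    1: ["breakthrough", "drug", "schizophrenia"],
    2: ["new", "schizophrenia", "drug"],
    3: ["new", "approach", "treatment", "schizophrenia"],
    4: ["new", "hope", "schizophrenia", "patient"],
}
index = build_inverted_index(docs)
print(index["drug"])
print(and_query(index, "new", "drug"))
```

Only the postings of the query terms are read, so the cost depends on the lengths of those lists and not on the size of the whole collection.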

Weighting methods – review

• Binary weighting:

• Frequency weighting:

• Max-normalized (max-tf):

• Length-normalized (norm-tf):

• Term frequency inverse document frequency (tf-idf):

• Length-normalized term frequency inverse document frequency (norm-tf-idf):
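
For reference, common textbook forms of these schemes are given below. They are an assumption, not necessarily the seminar's exact definitions. The notation is also assumed: $f_{ij}$ is the number of occurrences of term $i$ in document $j$, $N$ is the number of documents, $F_i$ is the number of documents containing term $i$, and the base of the logarithm is a matter of convention.

```latex
\begin{align*}
\text{binary:} \quad & w_{ij} = \begin{cases} 1 & \text{if } f_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\
\text{frequency:} \quad & w_{ij} = f_{ij} \\
\text{max-tf:} \quad & w_{ij} = \frac{f_{ij}}{\max_{k} f_{kj}} \\
\text{norm-tf:} \quad & w_{ij} = \frac{f_{ij}}{\sqrt{\sum_{k} f_{kj}^{2}}} \\
\text{tf-idf:} \quad & w_{ij} = f_{ij} \cdot \log\frac{N}{F_{i}} \\
\text{norm-tf-idf:} \quad & w_{ij} = \frac{f_{ij} \cdot \log\frac{N}{F_{i}}}{\sqrt{\sum_{k} \left( f_{kj} \cdot \log\frac{N}{F_{k}} \right)^{2}}}
\end{align*}
```

With the norm-tf form above, a document with three distinct terms, each occurring once, gets the weight $1/\sqrt{3} \approx 0.57735$ for each term.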

Exercise: building a TD matrix

• Let us consider the following simple document collection:

Doc 1 breakthrough drug for schizophrenia

Doc 2 new schizophrenia drug

Doc 3 new approach for treatment of schizophrenia

Doc 4 new hopes for schizophrenia patients

• Build a frequency weighted TD matrix

• Build a norm-tf weighted TD matrix

• Build a norm-tf-idf weighted TD matrix

Solution: tf weighted TD matrix

Terms/Documents   Doc1   Doc2   Doc3   Doc4
approach          0      0      1      0
breakthrough      1      0      0      0
drug              1      1      0      0
hope              0      0      0      1
new               0      1      1      1
patient           0      0      0      1
schizophrenia     1      1      1      1
treatment         0      0      1      0

Solution: norm-tf weighted TD matrix

Terms/Documents   Doc1      Doc2      Doc3   Doc4
approach          0         0         0.5    0
breakthrough      0.57735   0         0      0
drug              0.57735   0.57735   0      0
hope              0         0         0      0.5
new               0         0.57735   0.5    0.5
patient           0         0         0      0.5
schizophrenia     0.57735   0.57735   0.5    0.5
treatment         0         0         0.5    0
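
The tf and norm-tf weights of the exercise can be recomputed with a short script. This is a sketch under stated assumptions: the documents are entered already stoplisted ("for", "of" removed) and stemmed ("hopes" → "hope", "patients" → "patient"), and norm-tf is taken to mean division by the Euclidean length of the document's tf vector.

```python
import math

# Exercise collection after stoplisting and stemming (entered by hand).
DOCS = {
    "Doc1": ["breakthrough", "drug", "schizophrenia"],
    "Doc2": ["new", "schizophrenia", "drug"],
    "Doc3": ["new", "approach", "treatment", "schizophrenia"],
    "Doc4": ["new", "hope", "schizophrenia", "patient"],
}


def tf_matrix(docs):
    """Return {term: {doc: raw term frequency}} with terms in alphabetical order."""
    terms = sorted({term for tokens in docs.values() for term in tokens})
    return {
        term: {doc: tokens.count(term) for doc, tokens in docs.items()}
        for term in terms
    }


def norm_tf_matrix(docs):
    """Divide each tf value by the Euclidean length of its document vector."""
    tf = tf_matrix(docs)
    lengths = {
        doc: math.sqrt(sum(tf[term][doc] ** 2 for term in tf)) for doc in docs
    }
    return {
        term: {doc: tf[term][doc] / lengths[doc] for doc in docs} for term in tf
    }


for term, row in norm_tf_matrix(DOCS).items():
    print(f"{term:15}", "  ".join(f"{weight:.5f}" for weight in row.values()))
```

Each column of the norm-tf matrix has unit Euclidean length, which gives a quick way to check a hand-computed solution.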

Example: Terrier IR Platform

Terrier: Indexing

Terrier: Search results

Questions?