Text Processing Overview

Lecture 2

CS 510

Information Retrieval on the Internet

Text Processing Pipeline

Documents: “<p>Friends, Romans &amp; countrymen</p>”
    ↓ Document Type ID & Text Extraction
Text String: “Friends, Romans & countrymen”
    ↓ Tokenization
Token Stream: “Friends” “Romans” “countrymen”
    ↓ Linguistic Modules
Terms: “friend” “roman” “countryman”
    ↓ Dictionary Lookup
Postings: (term6, doc3); (term23, doc3); (term18, doc3)

IR 2010
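
The pipeline stages can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production design: the function names, the regex-based tag stripping, and the tiny `LEMMAS` table are all assumptions made for the example, and term ids are simply assigned in order of first appearance (so they differ from the term6/term23/term18 ids on the slide).

```python
import html
import re

# Toy lemma table standing in for the "linguistic modules" stage (assumption).
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def extract_text(raw):
    """Strip tags and decode entities such as &amp; (handles simple HTML only)."""
    return html.unescape(re.sub(r"<[^>]+>", "", raw))

def tokenize(text):
    """Split the text string into a stream of alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def to_term(token):
    """Lower-case the token and map it to a base form if one is known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)

def postings_for(doc_id, raw, dictionary):
    """Return (term_id, doc_id) pairs, assigning a new term id on first sight."""
    postings = []
    for token in tokenize(extract_text(raw)):
        term_id = dictionary.setdefault(to_term(token), len(dictionary) + 1)
        postings.append((term_id, doc_id))
    return postings

dictionary = {}
postings = postings_for(3, "<p>Friends, Romans &amp; countrymen</p>", dictionary)
print(dictionary)
print(postings)
```

Each function corresponds to one arrow in the pipeline, so any stage can be swapped out (for example, a real HTML parser in place of the regex) without touching the others.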

Text Processing Goals

• Enable high quality retrieval results
• Represent document content and characteristics efficiently
  – Storage space
  – Efficient matching of queries to documents

Terminology

Token: Occurrence of a string of characters
Token Type: All tokens with the same characters
Term: Equivalence class of tokens

Goetz, Götz, gotz, Gotz

Recall vs. Precision
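
The three definitions can be made concrete by counting over the name variants. The sketch below is illustrative only: the `to_term` rule (drop diacritics, lower-case, rewrite "oe" as "o") is an assumed toy equivalence rule chosen so that all four spellings collapse, and the repeated "Gotz" is added to show that tokens and types differ.

```python
import unicodedata

def to_term(token):
    """Toy equivalence-class rule (assumption): no diacritics, lower case, oe -> o."""
    decomposed = unicodedata.normalize("NFD", token)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return bare.lower().replace("oe", "o")

tokens = ["Goetz", "Götz", "gotz", "Gotz", "Gotz"]  # five occurrences
types = set(tokens)                                 # distinct character strings
terms = {to_term(t) for t in tokens}                # equivalence classes

print(len(tokens), len(types), len(terms))
```

A more aggressive rule merges more tokens into one term, which tends to help recall and hurt precision; a more conservative rule does the opposite.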

Accessing Document Content

• Identify document types
• Handle each appropriately

Accessing Document Content

Challenge                                   Possible approach
Scanned documents                           OCR
Compressed, encrypted, zipped files         Uncompress, decrypt, unzip
Word processed and other                    Application-specific handler
  application-specific documents
Complex HTML pages:                         Parse HTML to extract content
  • Frames                                  Site-specific wrapper?
  • Dynamic content
  • Scripts embedded in HTML
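
One common way to identify a document type before choosing a handler is to look at the first bytes of the file. The sketch below is a hedged illustration: the function name `sniff_type` and the returned labels are made up for this example, and the signature list covers only a few well-known formats.

```python
# (leading bytes, label) pairs; the labels are arbitrary names for this sketch.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy Word/Excel/PowerPoint container
    (b"PK\x03\x04", "zip"),                          # also .docx and other zipped formats
    (b"\x1f\x8b", "gzip"),
]

def sniff_type(head):
    """Guess a document type from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    # Fall back to a crude textual check for HTML.
    start = head[:256].lstrip().lower()
    if start.startswith(b"<!doctype html") or start.startswith(b"<html"):
        return "html"
    return "unknown"

print(sniff_type(b"%PDF-1.4\n..."))
print(sniff_type(b"<html><body>hi</body></html>"))
```

The label would then select the handler: a PDF text extractor, an application-specific converter, an unzip step followed by re-sniffing, and so on.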


Excerpt from a PDF file

(Part-of-Speech $POS$ Tagging)
endobj
89 0 obj
<< /S /GoTo /D (Outline7.8.56) >>
endobj
92 0 obj
(Chunking and Parsing)
endobj
93 0 obj
<< /S /GoTo /D (Outline7.9.64) >>
endobj
96 0 obj
(Semantics)
endobj
97 0 obj
<< /S /GoTo /D (Outline7.10.71) >>
endobj
100 0 obj
(Pragmatics: Co-reference resolution)
endobj
101 0 obj
<< /S /GoTo /D (Outline8) >>
endobj
104 0 obj
(Performance Evaluation)
endobj

Excerpt from a Word file

[binary data: file header and formatting structures, unreadable as text]
HYPERLINK "http://www.ultraseek.com/support/docs/UltraseekAdmin/wwhelp/wwhimpl/js/html/starthelp.htm"
http://www.ultraseek.com/support/docs/UltraseekAdmin/wwhelp/wwhimpl/js/html/starthelp.htm

Importing from Ultraseek

Ultraseek collection configurations are text files named configuration that are located in the directory you specified during installation to hold search data. This section explains how to import these files so that Ultraseek can re-create the described collection.

Identifying Document Structure

• To help extract text
• To exploit structure for searching
• Examples
  – Complex web pages
  – Title, metadata, headings
  – Chapter, section, paragraph, sentence
    • What is a sentence? How do you recognize one?
  – Email: header, body, attachment
  – Tables

How to automatically recognize a sentence?

Washington D.C. is the capital of the U.S. and Salem O.R. is a state capital. The D.C. metro is a convenient subway. At clerk.house.gov you can find a listing of members of the U.S. congress including Peter A. DeFazio, John Conyers Jr. and Jim McDermott, who is also sometimes known as Dr. Jim McDermott because he has an M.D. degree.
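
The passage shows why "split on every period" is not enough. As a rough sketch, the code below compares a naive splitter with a rule-based one that only breaks at a period followed by whitespace and an upper-case letter, and that skips known abbreviations and single-letter initials. The abbreviation list and the function name are assumptions tailored to this passage; a general solution needs a much larger list or a trained model.

```python
import re

TEXT = (
    "Washington D.C. is the capital of the U.S. and Salem O.R. is a state capital. "
    "The D.C. metro is a convenient subway. At clerk.house.gov you can find a "
    "listing of members of the U.S. congress including Peter A. DeFazio, John "
    "Conyers Jr. and Jim McDermott, who is also sometimes known as Dr. Jim "
    "McDermott because he has an M.D. degree."
)

# Hand-picked for this passage (assumption).
ABBREVIATIONS = {"dr.", "jr.", "d.c.", "u.s.", "o.r.", "m.d."}

def split_sentences(text):
    """Break at '. ' + capital letter unless the period ends an abbreviation or initial."""
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        end = match.start() + 1                      # keep the period
        last_token = text[start:end].split()[-1].lower()
        is_initial = re.fullmatch(r"[a-z]\.", last_token) is not None
        if last_token in ABBREVIATIONS or is_initial:
            continue                                 # not a sentence boundary
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

naive = re.split(r"(?<=\.)\s+", TEXT)    # breaks after every period + space
better = split_sentences(TEXT)
print(len(naive), len(better))
for sentence in better:
    print("-", sentence)
```

The rule-based version still fails when an abbreviation really does end a sentence and the next one starts with a capital letter, which is one reason sentence recognition remains a judgment call.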

Document structure: email

Header
  – source
  – recipient(s)
  – date
  – subject
Body
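
The header/body split can be recovered with Python's standard `email` package. The message below is invented for the example (addresses, date, and subject are placeholders), and the field names in the returned dictionary are simply the labels from the slide.

```python
from email import message_from_string

# Made-up message for illustration only.
RAW = """\
From: alice@example.org
To: bob@example.org
Date: Mon, 04 Jan 2010 09:30:00 -0800
Subject: Lecture notes

The slides for lecture 2 are attached.
"""

def email_fields(raw):
    """Split a simple (non-multipart) message into indexable fields."""
    msg = message_from_string(raw)
    return {
        "source": msg["From"],
        "recipients": msg["To"],
        "date": msg["Date"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

fields = email_fields(RAW)
print(fields["subject"])
print(fields["body"])
```

Keeping the fields separate lets a search restrict a query to, say, the subject line. Messages with attachments are multipart and need `walk()` over the parts, which this sketch does not handle.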

Deciding What and How to Index

• HTML tags?
• URL? Page title? Javascript?
• Ads?
• Navigation bars?
• Text in tables?
• Text associated with images?
• Maintain information about document structure?
  – Index document parts as separate fields?
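
One possible set of answers to these questions is sketched below with the standard `html.parser` module: keep the page title, visible body text, and image alt text as separate fields, and drop script and style content. The class name, field names, and sample page are assumptions for illustration; which parts to keep is exactly the design decision the slide asks about.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect title, body text, and image alt text; skip script/style content."""

    def __init__(self):
        super().__init__()
        self.title_text, self.body_text, self.alt_text = [], [], []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_text.append(alt)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        (self.title_text if self._in_title else self.body_text).append(text)

PAGE = (
    "<html><head><title>Tsunami facts</title>"
    "<script>var x = 1;</script></head>"
    '<body><h1>Tsunamis</h1><img src="wave.png" alt="Large wave">'
    "<p>Tsunamis approach land.</p></body></html>"
)

extractor = FieldExtractor()
extractor.feed(PAGE)
extractor.close()
print(extractor.title_text, extractor.body_text, extractor.alt_text)
```

Ads and navigation bars cannot be recognized by tag name alone, so they usually need site-specific rules on top of a parser like this.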

What should you index on this page?

How would you extract it from all the other stuff?

Lexical analysis

• Parse the text into words
  – What is a word?
• Challenges:
  – Punctuation
  – Special characters
  – Hyphens
  – Case
  – Digits

Lexical analysis: Punctuation

• Most can just be eliminated
• Periods
  – End of sentence
  – Abbreviations and acronyms
  – “Dots” in URLs?
• Apostrophes
  – How much do they change the meaning?
    • Grants versus Grant’s
  – Exact match only?
  – Eliminate all in both query and index?
  – Eliminate some and not others?
  – Retain and use some algorithm for matching?
• What do current search engines do?

Lexical analysis: Special Chars

• What to do with &, $, !?
  – Are they different if they occur alone vs. part of a word?
• Characters used in other languages
  – Exact match only?
  – Automatically match common translations?
    • el nino and el niño?
• What do current web search engines do?
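
If the choice is to match el nino with el niño, one simple approach is to fold accented characters to their base letters in both the index and the query. The sketch below uses Unicode decomposition from the standard library; the function name `fold_accents` is made up for this example.

```python
import unicodedata

def fold_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("el niño"))
print(fold_accents("el niño") == fold_accents("el nino"))
```

Folding loses information: in languages where the accent distinguishes two different words, it merges them, so exact match may still be wanted as an option.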

Lexical analysis: Hyphens

• Usage often varies
• May or may not change meaning
  – world wide, world-wide, worldwide
  – client server, client-server, client/server, ClientServer
  – A-Boy, a boy
• Generally want to conflate different forms that refer to the same concept

Hyphens in Index vs. Query

Handle differently in indexing and query

• Indexing
    world wide      worldwide
    world-wide      world-wide
    worldwide       worldwide
• Querying
    world-wide      (world AND wide) OR (world-wide) OR (worldwide)
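
The query-side rewrite for a hyphenated term can be sketched as a small string transformation. The function name and the textual Boolean syntax are assumptions for illustration; a real engine would build a query tree rather than a string.

```python
def expand_hyphenated(term):
    """Rewrite a hyphenated query term into its split, hyphenated, and joined forms."""
    if "-" not in term:
        return term
    parts = term.split("-")
    return "({}) OR ({}) OR ({})".format(" AND ".join(parts), term, "".join(parts))

print(expand_hyphenated("world-wide"))
print(expand_hyphenated("subway"))
```

Doing the expansion at query time keeps the index faithful to what authors wrote, at the cost of a slightly more expensive query.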

Lexical analysis: Case

• Upper case, lower case, mixed
  – Case can be useful for distinguishing proper nouns and acronyms
  – Most search engines appear to convert all text to lower or upper case
    • Grant == grant on Google, Yahoo, MSN
    • Bush == bush on Google, Yahoo, MSN
    • Or do they?
      – Look further down in the Google results; order differs slightly
  – Ultraseek (commercial software for website/intranet use)
    • If query is lower case, matching is case-insensitive
    • If query has any upper case characters, matching is exact only
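
The two-part Ultraseek rule described on the slide is easy to state in code. This is only a sketch of the rule as worded above, not Ultraseek's actual implementation; the function name is hypothetical.

```python
def case_rule_match(query, token):
    """Lower-case query: ignore case. Any upper-case character in query: exact match."""
    if query == query.lower():
        return query == token.lower()
    return query == token

print(case_rule_match("grant", "Grant"))   # lower-case query, case ignored
print(case_rule_match("Grant", "grant"))   # mixed-case query, exact match required
```

Supporting this rule means the index has to keep the original case of each token, so it costs more space than simply lower-casing everything.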

Lexical analysis: Digits

Some early sources suggest that digits are generally not very useful

What about addresses, phone numbers, numbers with significant meaning? (e.g. 911)

Major search engines do index digits

Stemming

• Purpose is to conflate inflexional variations on a word that refer to the same concept
  – singular and plural forms of nouns
  – different tenses of the same verb
  – query rain cat dog should match “raining cats and dogs”
• Okay if stemmed form is not meaningful
  – User doesn’t see or care that computing in query and computed in text were both stemmed to comput
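
A deliberately crude suffix stripper is enough to reproduce the two examples on this slide. It is not the Porter algorithm or any published stemmer: the three suffixes and the minimum stem length of three characters are assumptions chosen for the illustration.

```python
SUFFIXES = ("ing", "ed", "s")   # toy rule set (assumption)

def strip_suffix(word):
    """Remove the first matching suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

query_stems = {strip_suffix(w) for w in "rain cat dog".split()}
text_stems = {strip_suffix(w) for w in "raining cats and dogs".split()}
print(query_stems <= text_stems)
print(strip_suffix("computing"), strip_suffix("computed"))
```

Because the same function is applied to both query and text, it does not matter that a stem such as comput is not a real word.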

Stemming strategies (1)

• Algorithmic
  – Affix removal
    • Usually only remove suffixes since prefixes often change meaning of the word (e.g. do, undo)
  – Successor variety
    • Finds boundaries between morphemes (word segments)
    • Morphemes are the smallest linguistic units with semantic meaning (e.g. unreadable → un-read-able)
    • Successor variety of a string is the number of different characters that follow it in words in some body of text.
      – Text: read, readable, reading, reads, red, rope, ripe
      – R 3, RE 2, REA 1, READ 3, READA 1, READAB 1, READABL 1, READABLE 1
    • Look for peak and plateau to break words
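
The successor variety counts for READABLE can be recomputed directly from the seven-word text. In this sketch only letters are counted as successors, so the complete word READABLE gets 0; the slide's value of 1 for that last entry presumably treats the end of the word as a successor. Function and variable names are made up for the example.

```python
CORPUS = ["read", "readable", "reading", "reads", "red", "rope", "ripe"]

def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    followers = {
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)

word = "readable"
profile = [
    (word[:i].upper(), successor_variety(word[:i], CORPUS))
    for i in range(1, len(word) + 1)
]
for prefix, count in profile:
    print(prefix, count)
```

The count jumps back up to 3 after READ (followed by a, i, and s), and that peak is what suggests a morpheme boundary between read and able.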

Stemming strategies (2)

• Algorithmic, cont.
  – N-grams
    • An N-gram is a sub-sequence of length n from a sequence
    • Uses character sequences within words
    • Clusters words based on number of shared N-grams
    • Language-independent
• Dictionary
  – Table lookup
• Mixed
  – Use algorithm with table lookup for exceptions
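
A common way to turn "number of shared N-grams" into a similarity score is Dice's coefficient over the sets of character bigrams. The slide does not name a specific measure, so the use of Dice and of unique bigrams here is an assumption; the example words are also chosen just for illustration.

```python
def bigrams(word):
    """Set of distinct adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

print(dice("statistics", "statistical"))
print(dice("statistics", "rope"))
```

Words whose pairwise scores exceed a chosen threshold are clustered together and treated as one term. Nothing in the method depends on English morphology, which is why it is language-independent.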

Stemming

Porter Stemmer Output

Word          Stem
a             a
abase         abas
abate         abat
abated        abat
abatement     abat
abbess        abbess
abbey         abbei
abide         abid
abides        abid
abjectly      abjectli

Stemming

• Effect on retrieval is controversial
• Some claim it improves recall but may degrade precision
• Probably depends on many factors:
  – Language
  – Stemming method
  – Document collection
  – Queries
  – Nature of user task

Stemming summary

Advantages
• Searcher doesn’t have to anticipate author’s exact usage
  – tsunami can match:
    • What happens to a tsunami as it approaches land?
    • Tsunamis have been historically referred to as tidal waves because ...
• May increase recall
• Results in more compact indexing vocabulary

Disadvantages
• May decrease precision
  – Might want to match specific word
  – Might be okay for query tired to match other forms of to tire, but not automobile tire
• Stemming can change meaning
  – bushing: an insulating liner in an opening through which conductors pass
  – bush: a low shrub with many branches
• Stemming algorithms are imperfect and occasionally lead to odd results

Lemmatization

Lemma = linguistic “base word”
  made → make
  saw → see

Stemming is a syntactic process
Lemmatization is a semantic process
Requires a look-up table in general
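
The look-up table idea, combined with an algorithmic fallback for words not in the table, might look like the sketch below. The two table entries come from the slide; the function name, the optional `fallback` parameter, and the "-ed" rule used in the usage lines are assumptions.

```python
LEMMA_TABLE = {"made": "make", "saw": "see"}   # entries from the slide

def lemmatize(word, fallback=None):
    """Look the word up; otherwise apply the fallback rule if one is given."""
    key = word.lower()
    if key in LEMMA_TABLE:
        return LEMMA_TABLE[key]
    return fallback(key) if fallback else key

def strip_ed(word):
    """Toy fallback rule (assumption): drop a trailing 'ed'."""
    return word[:-2] if word.endswith("ed") else word

print(lemmatize("made"), lemmatize("Saw"))
print(lemmatize("walked", fallback=strip_ed))
```

A plain table cannot tell the verb saw from the noun saw (the tool); deciding between them needs context, which is the sense in which lemmatization is a semantic process.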

Token Normalization

Equivalence class of tokens → term
  {Co-worker, co-worker, Coworker, coworker} → coworker

How to implement?
Implicit: Use rules, such as hyphen removal
  – apply for indexing and querying
  – only index target term
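
Implicit normalization amounts to running one rule function over every token at indexing time and over every query token at search time. The sketch below uses lower-casing plus hyphen removal as the rule; the three tiny documents and the function names are invented for the example.

```python
from collections import defaultdict

def normalize(token):
    """Rule-based equivalence classing: lower-case and remove hyphens."""
    return token.lower().replace("-", "")

# Made-up documents for illustration.
docs = {1: "My Co-worker left", 2: "a coworker arrived", 3: "Coworker lunch"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[normalize(token)].add(doc_id)   # only the target term is indexed

def search(query_token):
    """Apply the same rule to the query before looking it up."""
    return sorted(index.get(normalize(query_token), set()))

print(search("co-worker"))
print(search("COWORKER"))
```

Since only the target term is stored, a searcher can no longer ask for the hyphenated form specifically; the same rule also merges co-op with coop.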

Token Normalization 2

Explicit: index tokens separately
  coworker, co-worker each has a list
• Apply during indexing
  coworker is indexed under coworker and co-worker
• Apply during query
  coworker → (coworker OR co-worker)
  allows suppression: “coworker”

Stopwords

• Very common words are often thought not to be very good discriminators between documents
  – Common words often convey little of what a document is about
  – Articles, conjunctions, prepositions
  – Does the or of tell you what a document is about?
• Eliminating common words (stopwords) reduces the size of the index

Stopwords

• Typically implemented as a set of words for filtering both document text and queries
• Size of list may vary, e.g.
  – MEDLARS stoplist only 7 words
    • and, an, by, from, of, the, with
  – Other published lists have 250, 471 words
• May want a collection-specific stopword list

Example stopword list applied to a document

a, about, an, and, are, as, by, for, from, had, have, he, his, him, in, into, of, on, or, that, the, this, to, was, with, were

Here were the servants of your adversary,
And yours, close fighting ere I did approach:
I drew to part them: in the instant came
The fiery Tybalt, with his sword prepared,
Which, as he breathed defiance to my ears,
He swung about his head and cut the winds,
Who nothing hurt withal hiss'd him in scorn:
While we were interchanging thrusts and blows,
Came more and more and fought on part and part,
Till the prince came, who parted either part.
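
Applying the list is a set-membership filter over the token stream. The sketch below uses the stopword list from this slide and runs it on two lines of the passage; the tokenizing regex (lower-case letters and apostrophes) is an assumption.

```python
import re

STOPWORDS = {
    "a", "about", "an", "and", "are", "as", "by", "for", "from", "had", "have",
    "he", "his", "him", "in", "into", "of", "on", "or", "that", "the", "this",
    "to", "was", "with", "were",
}

def remove_stopwords(text):
    """Lower-case, tokenize, and drop every token found in the stopword set."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("Here were the servants of your adversary,"))
print(remove_stopwords("He swung about his head and cut the winds,"))
```

The same function would be applied to queries, so that a query word removed from the index is never searched for.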

Stopwords

• Common words may be important
  – Phrases
    • Query for “to be or not to be”
  – Special meaning in context
    • Vitamin A
  – Abbreviations and acronyms
    • OR as abbreviation (ORegon) or acronym (Operating Room)

Phrases containing common words

• Index all words
• Ignore stopwords in indexing, but account for them in calculating position
  – use position information for “significant” words
    “the cat in the hat”: cat at p, hat at p+3
  – Accept risk of false matches, or
  – Filter matching docs by examining full text
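
The second option can be sketched as a positional index that skips stopwords but still counts them when numbering positions. The stopword set, the three sample documents, and the function names below are assumptions for illustration.

```python
from collections import defaultdict

STOPWORDS = {"the", "in", "a", "of", "to"}   # small set for the example

def index_positions(doc_id, text, index):
    """Record (doc, position) for non-stopwords; stopwords still consume a position."""
    for position, token in enumerate(text.lower().split()):
        if token not in STOPWORDS:
            index[token].append((doc_id, position))

def phrase_match(index, first, second, gap):
    """Docs where `second` occurs exactly `gap` positions after `first`."""
    firsts = set(index.get(first, []))
    return sorted(doc for doc, pos in index.get(second, []) if (doc, pos - gap) in firsts)

index = defaultdict(list)
index_positions(1, "the cat in the hat", index)
index_positions(2, "a cat sat on my hat", index)
index_positions(3, "my cat ate your hat", index)

# Query "the cat in the hat" becomes: cat at p, hat at p+3.
print(phrase_match(index, "cat", "hat", 3))
```

Document 3 is returned even though it does not contain the phrase: that is the false match the slide mentions, and checking the full text of the candidates would remove it.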

to and be are not eliminated in ad search

to and be do appear to be eliminated in web search

Index term selection

A difficult balance
  – Conflating tokens risks bringing in irrelevant results: co-op vs. coop
  – If you don’t, could miss relevant documents: co-worker vs. coworker

Might be possible to adjust the balance with weights
  co-op to co-op is a “stronger” match than co-op to coop
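
One way to picture the weighting idea is to score an exact token match higher than a match that only holds after normalization. The weights 1.0 and 0.5, the normalization rule, and the function names are arbitrary assumptions for this sketch; the slide does not prescribe any particular scheme.

```python
def normalize(token):
    """Conflation rule used for the weaker match (assumption): lower-case, no hyphens."""
    return token.lower().replace("-", "")

def match_weight(query_token, doc_token, exact=1.0, conflated=0.5):
    """Full weight for identical tokens, reduced weight for conflated ones."""
    if query_token == doc_token:
        return exact
    if normalize(query_token) == normalize(doc_token):
        return conflated
    return 0.0

print(match_weight("co-op", "co-op"))
print(match_weight("co-op", "coop"))
print(match_weight("co-op", "coworker"))
```

With weights like these, documents about chicken coops can still be retrieved for co-op but should rank below documents that use the hyphenated form.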