
Upload: kaitlyn-powell

Post on 27-Mar-2015


Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 6: Text and Multimedia Languages and Properties Alexander Gelbukh

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 6: Text and Multimedia Languages and Properties

Alexander Gelbukh

www.Gelbukh.com

Page 2

Previous chapter: Conclusions

Query operations:

Relevance feedback
o Simple, understandable
o Needs user attention
o Term re-weighting

Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive

Global analysis
o Worse results. What is good for the collection is not necessarily good for a query
o Linguistic methods, dictionaries, ontologies, stemming, ...

Page 3

Previous chapter: Trends and research topics

Interactive interfaces
o Graphical, 2D or 3D

Refining global analysis techniques

Application of linguistic methods. Stemming. Ontologies

Local analysis for the Web (now too expensive)

Combine the three techniques (feedback, local, global)

Page 4

Anatomy of a document...

We search for documents

What is a document?

Page 5

Characteristics of a document

Syntax is a device that “plays” the document, producing semantics (a kind of presentation)
o Like a CD drive plays a CD to produce music
o Knowing Korean + paper with glyphs → meaning

Page 6

...Anatomy of a document

Queries are conditions on the semantics/presentation, not on the (binary?) data of the document

Thus we need to know the syntax
o Example: search in PS or PDF

How to describe it formally?

Page 7

Metadata

Info about the organization of data
o Data about the data

Descriptive vs. semantic metadata
o Descriptive: about creation: author, date, ...
o Semantic: about meaning: keywords, subject codes, ... Ontologies
o Others: who may use it and how. E.g.: adult, confidential, signature

Standards (many)
o Dublin Core Metadata Element Set: 15 fields. Descriptive.
o Machine-Readable Cataloging (MARC) record: bibliographic

Web: very important
o Many projects on Web ontologies. Semantic Web.
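
As a rough illustration of descriptive metadata, the sketch below lists the 15 element names of the Dublin Core Metadata Element Set and checks a record against them. The helper `unknown_fields` and the sample record are hypothetical and only meant to show the idea; the record values are taken from the title page of these slides.

```python
# The 15 element names of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def unknown_fields(record):
    """Return the keys of `record` that are not Dublin Core element names."""
    return sorted(k for k in record if k.lower() not in DC_ELEMENTS)

# Hypothetical record describing these slides (illustration only).
record = {
    "title": "Chapter 6: Text and Multimedia Languages and Properties",
    "creator": "Alexander Gelbukh",
    "subject": "information retrieval",  # a keyword
    "author": "Alexander Gelbukh",       # not a Dublin Core element name
}

print(unknown_fields(record))  # ['author']
```

A fixed, small vocabulary of field names is what lets different systems exchange such records without agreeing on anything else.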

Page 8

Text

Encoding. 7-bit and 8-bit ASCII. Unicode: oriental languages

Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS

Compression. ZIP, ARJ

Binary in ASCII: uuencode

To predict the behavior of tools and systems, we need to model text

Entropy: the limit of compression, degree of chaos

Statistics of the letters and words
o Very skewed
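
To make the entropy point concrete, here is a minimal sketch that computes the zero-order empirical entropy of a string, in bits per character. It assumes characters are modeled independently, so it bounds only compressors that work that way; the function name and the two sample strings are illustrative.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text):
    """Zero-order empirical entropy: -sum over symbols of p * log2(p)."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

uniform = "abcd" * 100        # four symbols, equally frequent
skewed = "a" * 397 + "bcd"    # same alphabet, very skewed frequencies

print(entropy_bits_per_symbol(uniform))  # 2 bits per character
print(entropy_bits_per_symbol(skewed))   # far below 2 bits per character
```

The skewed string needs far fewer bits per character than the uniform one, which is why skewed letter and word statistics make natural-language text compressible.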

Page 9

Zipf law

Page 10

Zipf law, etc.

The i-th most frequent word appears k/i^θ times, θ = 1.5 – 2

Mandelbrot form: k/(c+i)^θ

50% of the text is made up of a few hundred words

Most of them are stopwords: the, of, and, a, to, in...

Not indexed → smaller indices

Distribution of words by docs
o p, k depend on collection and word
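
The rank-frequency relation above can be checked on any word list. The sketch below is one possible way to do it, not the method of the lecture: it ranks word counts and estimates θ as the negated least-squares slope of log(frequency) against log(rank). The synthetic corpus (50 made-up words with counts near 10000/i^1.5) is an assumption used only to show that the fit recovers the exponent.

```python
import math
from collections import Counter

def rank_frequencies(words):
    """Word counts sorted from the most to the least frequent word."""
    return [n for _, n in Counter(words).most_common()]

def fit_zipf_exponent(freqs):
    """Estimate theta in freq ~ k / rank**theta by a log-log least-squares fit.

    Needs at least two ranks.
    """
    xs = [math.log(i) for i in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Synthetic corpus: word w_i occurs about 10000 / i**1.5 times.
words = []
for i in range(1, 51):
    words.extend([f"w{i}"] * round(10000 / i ** 1.5))

theta = fit_zipf_exponent(rank_frequencies(words))
print(round(theta, 2))
```

On real text the fit is usually done after tokenizing and lowercasing, and the lowest ranks often deviate from the straight line, which is what the Mandelbrot form with the extra constant c accounts for.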

Page 11

Heaps’ law

Page 12

Heaps’ law, etc.

# of distinct words (size of vocabulary) = K·n^β for a text of n words, β = 0.4 ... 0.6 (roughly the square root)

Applies to collections, including the WWW

Average length of a word. English:
o 4.8 ... 5.3 letters per word, averaged over the text
o 6 ... 7 without stopwords
o 8 ... 9 averaged over the vocabulary
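
Heaps’ law can be observed by counting distinct words while reading a text. The following sketch records the vocabulary size after each token and, separately, evaluates the prediction K·n^β. The values K = 10 and β = 0.5 in the usage lines are assumptions chosen for illustration, not parameters measured on any collection.

```python
def vocabulary_growth(tokens):
    """Return (n, V) pairs: V distinct words seen among the first n tokens."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth

def heaps_estimate(n, K, beta):
    """Vocabulary size predicted by Heaps' law for a text of n words."""
    return K * n ** beta

tokens = "to be or not to be that is the question".split()
print(vocabulary_growth(tokens)[-1])  # (10, 8): 10 tokens, 8 distinct words

# With the assumed K = 10 and beta = 0.5, a million-word text would have
# a vocabulary of about ten thousand words.
print(heaps_estimate(1_000_000, K=10, beta=0.5))
```

Because the vocabulary grows sublinearly, the dictionary of an index stays small compared with the text it indexes.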

Page 13

Similarity between strings

Distance: symmetric; triangle inequality: dist (a,c) ≤ dist (a,b) + dist (b,c)

Hamming: # of different positions. Also for sets.

Soundex: phonetic similarity

Levenshtein: min # of insertions, deletions, substitutions
o dist (survey, surgery) = 2
o A very good measure

Longest common subsequence: survey, surgery → surey

Various metrics to compare whole docs
o E.g., consider strings as symbols, or similarity of strings, etc.
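
The three measures above are short to implement. The sketch below uses the textbook dynamic-programming formulations; the function names are illustrative, and the second Hamming example pair is not from the slides.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def lcs(a, b):
    """One longest common subsequence of a and b."""
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(levenshtein("survey", "surgery"))  # 2
print(lcs("survey", "surgery"))          # surey
print(hamming("karolin", "kathrin"))     # 3
```

Levenshtein keeps only two rows of the table because it returns just the distance; the LCS function keeps the full table because it has to reconstruct the subsequence.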

Page 14

Markup languages

“Our documents do not belong to us but to Bill Gates!”

Extra textual syntax to describe formatting, structure, ...

Marks are called tags. Initial and ending tags surround the marked text.

Standard metalanguage: SGML (Standard Generalized Markup Language)
o XML (eXtensible), its subset: new metalanguage for the Web
o HTML is an instance of SGML

Page 15

SGML

Provides rules for defining tags

A document consists of:
o Definitions of tags: Document Type Declaration (DTD); informal comments or an additional description
o Text with tags

Tags: <tag>text</tag>

Mostly defines semantics, not printing format
o The printing format is defined in other languages

Page 16

HTML

1992; 4.0: 1997

Instance of SGML
o A DTD exists, but is usually not used

Also does not define (much of) the formatting. Thus: Cascading Style Sheets (CSS)
o define aspects of formatting
o can be combined (cascaded)
o not well supported by browsers

Does NOT (unlike generic SGML, which is too expensive)
o allow specifying new tags
o support nesting structures
o support validity checks

Page 17

XML (eXtensible ...)

More flexible than HTML, simpler than SGML

Simplified subset of SGML
o Much simpler to implement

Allows for human- and machine-readable markup
o Good for development of Web docs
o Allows doing things that are now done with JavaScript

Using a DTD is optional; the parser can discover the tags

Extensible Stylesheet Language (like CSS in HTML)
o Like macros in a word processor

Extensible Linking Language
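
The point that a parser can discover tags without a DTD can be seen with Python's standard `xml.etree.ElementTree` module. In the sketch below the tag names (`lecture`, `title`, `topic`) are invented for the example and are not declared anywhere; the parser only requires the document to be well-formed.

```python
import xml.etree.ElementTree as ET

# A small document with made-up tags and no DTD.
doc = """<lecture number="6">
  <title>Text and Multimedia Languages and Properties</title>
  <topic>Metadata</topic>
  <topic>Markup languages</topic>
</lecture>"""

root = ET.fromstring(doc)

# Discover which tags occur, without any prior declaration.
tags = sorted({element.tag for element in root.iter()})
print(tags)  # ['lecture', 'title', 'topic']

# Query by structure rather than by position in the raw text.
topics = [topic.text for topic in root.findall("topic")]
print(topics)  # ['Metadata', 'Markup languages']
```

Without a DTD the parser checks well-formedness only; it cannot tell whether a `topic` is allowed inside a `lecture`, which is what validity checking adds.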

Page 18

Uses of XML

MathML: Mathematical Markup Language
o Not only presentation but also meaning of expressions!

SMIL: Synchronized Multimedia Integration Language
o Declarative language to specify positions and timing

RDF: Resource Description Framework
o Metadata for XML

Trend: HTML evolves to model and describe the structure of data, not presentation details

Page 19

Multimedia

Text, sound, images, video

Image formats. BMP. Compression:
o GIF. Good for few colors
o JPG. Lossy compression. Parametric: can be controlled
o TIFF is used for exchange; can contain metadata

Moving images:
o MPEG: Moving Picture Experts Group. Encodes changes

Textual images. Compression. Retrieval:
o Metadata, keywords
o OCR. Many typos; keyword search should be approximate
o Treat as a sequence of images, convert query similarly
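
One way to make keyword search tolerant of OCR typos is approximate substring matching: find the smallest edit distance between the keyword and any substring of the text, and accept the hit if it is below a threshold. The sketch below uses the standard dynamic-programming variant in which a match may start anywhere in the text; the function name and the sample OCR string are made up for illustration.

```python
def min_errors_in_text(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # col[i] = best distance between pattern[:i] and some substring of the
    # text that ends at the position processed so far.
    col = list(range(len(pattern) + 1))
    best = col[-1]
    for ch in text:
        new = [0]  # the empty prefix matches at no cost: start anywhere
        for i, pc in enumerate(pattern, start=1):
            new.append(min(col[i] + 1,                # extra character in text
                           new[i - 1] + 1,            # pattern character missing
                           col[i - 1] + (pc != ch)))  # substitute or match
        col = new
        best = min(best, col[-1])
    return best

ocr_text = "inforrnation retrievel systems"  # made-up OCR output with typos

print(min_errors_in_text("retrieval", ocr_text))    # 1
print(min_errors_in_text("information", ocr_text))  # 2
```

A search for "information" with at most two errors allowed would therefore still find this text, even though the OCR read "m" as "rn".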

Page 20

Taxonomy of Web languages

Page 21

Conclusions

Modeling of text helps predict the behavior of systems
o Zipf law, Heaps’ law

Formally describing the structure of documents allows part of their meaning to be handled automatically, e.g., in search

Languages to describe document syntax
o SGML, too expensive
o HTML, too simple
o XML, good combination

Page 22

Thank you!

Till November 6

The class of Oct 30 is cancelled