Special Topics in Computer Science
The Art of Information Retrieval
Chapter 6: Text and Multimedia Languages and Properties
Alexander Gelbukh
www.Gelbukh.com
2
Previous chapter: Conclusions
Query operations:
Relevance feedback
o Simple, understandable
o Needs user attention
o Term re-weighting
Local analysis for query expansion
o Co-occurrences in the retrieved docs
o Usually gives better results than global analysis
o Computationally expensive
Global analysis
o Worse results. What is good for the collection is not necessarily good for a query
o Linguistic methods, dictionaries, ontologies, stemming, ...
3
Previous chapter: Trends and research topics
Interactive interfaces
o Graphical, 2D or 3D
Refining global analysis techniques
Application of linguistic methods. Stemming. Ontologies
Local analysis for the Web (now too expensive)
Combine the three techniques (feedback, local, global)
4
Anatomy of a document...
We search for documents
What is a document?
5
Characteristics of a document
Syntax is a device that “plays” the document, producing semantics (a kind of presentation)
Like a CD drive plays a CD to produce music
Knowing Korean + paper with glyphs → meaning
6
...Anatomy of a document
Queries are conditions on semantics/presentation, not on (binary?) data of the document
Thus need to know syntax
o Example: search in PS or PDF
How to describe formally?
7
Metadata
Info about the organization of data
o Data about the data
Descriptive vs. semantic metadata
o Descriptive: about creation: author, date, ...
o Semantic: about meaning: keywords, subject codes, ...
Ontologies
o Others: who may use the document and how. E.g.: adult, confidential, signature
Standards (many)
o Dublin Core Metadata Element Set: 15 fields. Descriptive.
o Machine-Readable Cataloging (MARC) record: bibliographic
Web: very important
o Many projects on Web ontologies. Semantic Web.
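A descriptive metadata record can be sketched as a plain mapping from field names to values. The block below is only an illustration: it lists the 15 element names of the Dublin Core Metadata Element Set and checks a record against them. The record values are copied from the title slide of this lecture, and the helper name `unknown_fields` is hypothetical, not part of any standard.

```python
# The 15 element names of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def unknown_fields(record: dict) -> set:
    """Return the keys of `record` that are not Dublin Core elements.

    Hypothetical helper, used here only for illustration.
    """
    return set(record) - DUBLIN_CORE_ELEMENTS


# Illustrative record; the values come from the title slide of this lecture.
record = {
    "title": "Chapter 6: Text and Multimedia Languages and Properties",
    "creator": "Alexander Gelbukh",
}
```

A record that uses a field such as "author" instead of "creator" would be reported by `unknown_fields`, which is one simple way a fixed element set helps tools exchange metadata.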
8
Text
Encoding. 7-bit and 8-bit ASCII. Unicode: Oriental languages
Format. Binary vs. ASCII (better). DOC, RTF, PDF, PS
Compression. ZIP, ARJ
Binary in ASCII: uuencode
To predict the behavior of tools and systems, we need to model text
Entropy: the limit of compression, degree of chaos
Statistics of the letters and words
o Very skewed
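As a rough illustration of entropy as a compression limit, the sketch below computes the Shannon entropy of the character distribution of a string, in bits per character. It assumes the simplest possible text model, in which every character is drawn independently (a zero-order model); real compressors exploit context and can do better than this figure. The function name is an assumption made for this example.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(text: str) -> float:
    """Zero-order Shannon entropy of the characters in `text`, in bits per character."""
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the observed relative frequencies p.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A text made of one repeated character has entropy 0 (fully predictable), while four equally frequent characters need 2 bits each. A skewed distribution, as in natural language, falls in between, which is why text compresses well.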
9
Zipf law
10
Zipf law, etc.
i-th most frequent word appears $k / i^{\theta}$ times, $\theta$ = 1.5 – 2
Mandelbrot form: $k / (c + i)^{\theta}$
A few hundred words account for 50% of the text
Most of them are stopwords: the, of, and, a, to, in...
Not indexed → smaller indices
Distribution of words by docs
o p, k depend on collection and word
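The rank-frequency relation can be checked on any tokenized text. The sketch below assumes the frequency of the word of rank i is modeled as k / (c + i)^theta, with c = 0 giving the plain Zipf form; the function names and the parameter name `theta` are assumptions made for this example, and k, c and theta have to be fitted to the collection at hand.

```python
from collections import Counter


def rank_frequency(words: list) -> list:
    """Word frequencies sorted from the most to the least frequent word."""
    return sorted(Counter(words).values(), reverse=True)


def zipf_mandelbrot(i: int, k: float, theta: float, c: float = 0.0) -> float:
    """Predicted frequency of the word of rank i (1-based); c = 0 is plain Zipf."""
    return k / (c + i) ** theta
```

Comparing `rank_frequency(tokens)` with `zipf_mandelbrot(i, k, theta)` for i = 1, 2, 3, ... shows how skewed the distribution is: the predicted frequency drops very quickly with the rank.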
11
Heaps’ law
12
Heaps’ law, etc.
# of distinct words (size of vocabulary) $V = K n^{\beta}$, $\beta$ = 0.4 ... 0.6 (roughly the square root of the text size $n$)
Applies to collections, and to the WWW
Average length of word. English:
o 4.8 ... 5.3 letters per word, average by text
o 6 ... 7 without stopwords
o 8 ... 9 average by vocabulary
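Vocabulary growth can be measured directly and compared with the prediction of Heaps' law. The sketch below assumes the law has the form V = K * n^beta, where n is the number of words read so far; the function names are assumptions made for this example, and K and beta must be fitted to the collection.

```python
def vocabulary_growth(words: list) -> list:
    """Number of distinct words seen after reading each successive word."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


def heaps_estimate(n: int, K: float, beta: float) -> float:
    """Vocabulary size predicted by Heaps' law for a text of n words."""
    return K * n ** beta
```

With beta around 0.5 the vocabulary grows roughly as the square root of the text size, so a text 100 times longer has only about 10 times more distinct words.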
13
Similarity between strings
Distance: symmetric; triangle inequality: dist (a,c) ≤ dist (a,b) + dist (b,c)
Hamming: # of different positions. Also for sets.
Soundex: phonetic similarity
Levenshtein: min # insertions, deletions, substitutions
o dist (survey, surgery) = 2
o A very good measure
Longest common subsequence: survey, surgery → surey
Various metrics to compare whole docs
o E.g., consider strings as symbols, or similarity of strings, etc.
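The three string measures named on this slide can be sketched with textbook dynamic programming. This is a minimal illustration, not an optimized implementation; the function names are assumptions made for this example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def longest_common_subsequence(a: str, b: str) -> str:
    """One longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

On the slide's own example, `levenshtein("survey", "surgery")` gives 2 (substitute v with g, insert r) and `longest_common_subsequence("survey", "surgery")` gives "surey".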
14
Markup languages
“Our documents do not belong to us but to Bill Gates!”
Extra textual syntax to describe formatting, structure, ...
Marks are called tags. Initial and ending tags surround the marked text.
Standard metalanguage: SGML (Standard Generalized Markup Language)
o XML (eXtensible), its subset: new metalanguage for Web
o HTML is an instance of SGML
15
SGML
Provides rules for defining tags
A document consists of:
o Definitions of tags: Document Type Declaration (DTD), plus informal comments or an additional description
o Text with tags
Tags: <tag>text</tag>
Mostly defines semantics, not printing format
o Printing format is defined in other languages
16
HTML
1992; 4.0: 1997
Instance of SGML
o A DTD exists, but is usually not used
Also does not define (much of) formatting. Thus: Cascading Style Sheets (CSS)
o define aspects of formatting
o can be combined (cascaded)
o not well supported by browsers
Does NOT (unlike generic SGML, which is too expensive)
o allow specifying new tags
o support nested structures
o support validity checks
17
XML (eXtensible ...)
More flexible than HTML, simpler than SGML
Simplified subset of SGML
o Much simpler in implementation
Allows for human- and machine-readable markup
o Good for development of Web docs
o Allows doing things that are now done with JavaScript
Using a DTD is optional; the parser can discover tags
Extensible Stylesheet Language (like CSS in HTML)
o Like macros in a word processor
Extensible Linking Language
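That a DTD is optional can be seen with any non-validating XML parser. The sketch below uses `xml.etree.ElementTree` from the Python standard library to parse a small well-formed document and list the tags found in it; the sample document and the function name `discover_tags` are made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up document with author-defined tags and no DTD.
SAMPLE = """<lecture>
  <title>Text and Multimedia Languages and Properties</title>
  <topic>Zipf law</topic>
  <topic>Heaps' law</topic>
</lecture>"""


def discover_tags(xml_text: str) -> list:
    """Parse well-formed XML and return the sorted set of tag names it uses."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()})
```

The parser only needs the document to be well-formed (every tag closed and properly nested); it learns the tag names from the text itself.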
18
Uses of XML
MathML: Mathematical Markup Language
o Not only presentation but also meaning of expressions!
SMIL: Synchronized Multimedia Integration Language
o Declarative language to specify positions and timing
Resource Description Framework (RDF)
o Metadata for XML
Trend: HTML evolves to model and describe the structure of data, not presentation details
19
Multimedia
Text, sound, images, video
Image formats. BMP. Compression:
o GIF. Good for few colors
o JPG. Lossy compression. Parametric: can be controlled
o TIFF is used for exchange; can contain metadata
Moving images:
o MPEG: Moving Picture Experts Group. Encodes changes
Textual images. Compression. Retrieval:
o Metadata, keywords
o OCR. Many typos; keyword search should be approximate
o Treat as a sequence of images, convert query similarly
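Approximate keyword search over OCR output can be sketched with `difflib.get_close_matches` from the Python standard library. Note the assumption: `difflib` scores similarity with its own ratio (based on matching blocks), not with the Levenshtein distance, so this only illustrates the idea of tolerating typos. The function name, the sample tokens and the cutoff of 0.8 are all made up for this example.

```python
import difflib


def approximate_hits(query: str, ocr_tokens: list, cutoff: float = 0.8) -> list:
    """Return the OCR tokens whose similarity to `query` is at least `cutoff`."""
    # get_close_matches requires n > 0, hence the fallback for an empty list.
    return difflib.get_close_matches(
        query, ocr_tokens, n=len(ocr_tokens) or 1, cutoff=cutoff
    )
```

For instance, searching for "retrieval" among the tokens "informatlon", "retrleval" and "system" finds "retrleval", where OCR confused the letters i and l.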
20
Taxonomy of Web languages
21
Conclusions
Modeling of text helps predict behavior of systems
o Zipf law, Heaps’ law
Formally describing the structure of documents makes it possible to process part of their meaning automatically, e.g., in search
Languages to describe document syntax
o SGML, too expensive
o HTML, too simple
o XML, good combination
22
Thank you!
Till November 6
The class of Oct 30 is cancelled