Modern Information Retrieval

Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3

Upload: ull

Post on 08-Feb-2016


TRANSCRIPT

Page 1: Modern Information Retrieval

Modern Information Retrieval

Chap. 06: Text and Multimedia Languages and Properties

(Introduction, Metadata and Text) 6.1, 6.2, 6.3

Page 2: Modern Information Retrieval

Introduction
• Text
  – the main form of communicating knowledge
• Document
  – loosely defined; denotes a single unit of information
  – can be any physical unit
    • a file
    • an email
    • a Web page

Page 3: Modern Information Retrieval

Introduction
• Document
  – Syntax and structure
  – Semantics
  – Information about itself

Page 4: Modern Information Retrieval

Introduction
• Document Syntax
  – Implicit, or expressed in a language (e.g., TeX)
  – Powerful languages: easier to parse, difficult to convert to other formats
  – Open languages are better (interchange)
  – Semantics of texts in natural language are not easy for a computer to understand
  – Trend: languages that provide information on structure, format and semantics while being readable by humans and computers

Page 5: Modern Information Retrieval

Introduction
• New applications are pushing for formats in which information can be represented independently of style.
• Style: defined by the author, but the reader may decide part of it
• Style can include treatment of other media

Page 6: Modern Information Retrieval

Metadata
• “Data about the data”
  – e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc.
• Descriptive Metadata
  – Author, source, length
  – Dublin Core Metadata Element Set
• Semantic Metadata
  – Characterizes the subject matter within the document contents
  – MEDLINE

Page 7: Modern Information Retrieval

Metadata
• Metadata information on Web documents
  – cataloging, content rating, property rights, digital signatures
• New standard: Resource Description Framework
  – description of Web resources to facilitate automated processing of information
  – nodes and attached attribute/value pairs
• Metadescription of non-textual objects
  – keywords can be used to search the objects

Page 8: Modern Information Retrieval

RDF Model
• A model is a collection of statements
• Statement := (predicate, subject, object)
• Predicate is a resource
• Subject is a resource
• Object is either a resource or a literal

[Figure: a Statement linking a Subject to an Object through a Predicate]
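
A minimal sketch of the model described above, written in Python: a statement is a (predicate, subject, object) tuple and a model is a set of statements. The URIs and the `objects_of` helper are hypothetical and only serve as illustration.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    predicate: str  # a resource, identified by a URI
    subject: str    # a resource, identified by a URI
    obj: str        # a resource URI or a literal value


# Hypothetical URIs, for illustration only.
COMPANY = "http://example.org/company"
SELLS = "http://example.org/terms/sells"
LOCATED_IN = "http://example.org/terms/locatedIn"

# A model is a collection (here: a set) of statements.
model = {
    Statement(SELLS, COMPANY, "batteries"),                   # object is a literal
    Statement(LOCATED_IN, COMPANY, "http://example.org/city"),  # object is a resource
}


def objects_of(model, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return {s.obj for s in model if s.subject == subject and s.predicate == predicate}


print(objects_of(model, COMPANY, SELLS))
```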

Page 9: Modern Information Retrieval

Example shown in triples view

Page 10: Modern Information Retrieval

RDF model and natural language
• Subject. In grammar, this is the noun or noun phrase that is the doer of the action. In the sentence “The company sells batteries,” the subject is “the company.”
• Predicate. In grammar, this is the part of a sentence that modifies the subject and includes the verb phrase. In our sentence, the predicate is the phrase “sells.”
• Object. In grammar, this is a noun that is acted upon by the verb. In our sentence, the object is the noun “batteries.”

Page 11: Modern Information Retrieval

XML vs. RDF
• RDF is not just an XML dialect.
  – XML:
    • Has a tree structure data model.
    • Only nodes are labeled.
  – RDF:
    • Has a graph structure data model.
    • Both edges (properties) and nodes (subjects/objects) are labeled.

Page 12: Modern Information Retrieval

Linking Statements
• The subject of one statement can be the object of another
• Such collections of statements form a directed, labeled graph

[Figure: directed, labeled graph with edges Ganji -studentOF-> CE, CE -departmentOF-> Sharif, CE -hasHomePage-> http://ce.sharif.edu]
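
The linking of statements can be sketched in Python by storing the statements of the figure as (subject, predicate, object) tuples and following edges from one node to the next. The tuple order and the helper names `build_graph` and `reachable` are assumptions made for this illustration.

```python
from collections import defaultdict

# Statements read off the figure, as (subject, predicate, object).
triples = [
    ("Ganji", "studentOF", "CE"),
    ("CE", "departmentOF", "Sharif"),
    ("CE", "hasHomePage", "http://ce.sharif.edu"),
]


def build_graph(statements):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(triples)
print(sorted(reachable(graph, "Ganji")))
```

Because CE is the object of the first statement and the subject of the other two, everything about CE becomes reachable from Ganji.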

Page 13: Modern Information Retrieval

RDF Graph: ‘anonymous’ nodes

[Figure: RDF graph with anonymous nodes. Recoverable labels: node Person12345; literals Jonathan and Borden; edge labels person.name, first, last, value (twice); column headings Person, PersonName, Literal]

Page 14: Modern Information Retrieval

How can RDF be implemented
• Usually RDF/XML syntax
• However, other notations are possible
  – e.g. Notation3:
    • Buddy Belden owns a business.
    • The business has a Web site accessible at http://www.c2i2.com/~budstv.
    • Buddy is the father of Lynne.

    • <#Buddy> <#owns> <#business>.
    • <#business> <#has-website> <http://www.c2i2.com/~budstv>.
    • <#Buddy> <#father-of> <#Lynne>.
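
As a rough illustration of how such statements map to triples, the sketch below reads only the simplest form used above, three angle-bracketed terms followed by a period. It is not a Notation3 parser; the function name `parse_simple_n3` and the regular expression are assumptions of this example.

```python
import re

# Matches exactly: <term> <term> <term> .
TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_simple_n3(text):
    """Return (subject, predicate, object) tuples for lines of the form above."""
    triples = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        triples.append(match.groups())
    return triples


source = """
<#Buddy> <#owns> <#business>.
<#business> <#has-website> <http://www.c2i2.com/~budstv>.
<#Buddy> <#father-of> <#Lynne>.
"""

for triple in parse_simple_n3(source):
    print(triple)
```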

Page 15: Modern Information Retrieval

Converting N3 to RDF
• The Jena toolkit can perform this conversion

Page 16: Modern Information Retrieval

XML Syntax for RDF
• RDF has an XML syntax that has a specific meaning:
• Every Description element describes a resource
• Every attribute or nested element inside a Description is a property of that resource
• We can refer to resources by using URIs

<rdf:Description about="some.uri/person/ganji">
  <studentOf resource="some.uri/Sharif/CE"/>
</rdf:Description>
<rdf:Description about="some.uri/Sharif/CE">
  <hasHomePage>http://ce.sharif.edu</hasHomePage>
  <departmentOf resource="some.uri/~Sharif"/>
</rdf:Description>

Page 17: Modern Information Retrieval

RDF type
• RDF predefined property
• Its value: a resource that represents a category or class
• Its subject: an instance of that category or class

prefix ex: URI: http://www.example.org/terms

Page 18: Modern Information Retrieval

Containers
• Containers are collections
  – they allow grouping of resources (or literal values)
• It is possible to make statements about the container (as a whole) or about its members individually
• It is also possible to create collections based on URI patterns
  – for example, all files in a particular web site

Page 19: Modern Information Retrieval

RDF containers
• Bag: (A resource having type rdf:Bag)
  – Represents an unordered list of resources or literals
  – Duplicate values are permitted
• Sequence: (A resource having type rdf:Seq)
  – Represents an ordered list of resources or literals
  – Duplicate values are permitted
• Alternatives: (A resource having type rdf:Alt)
  – Represents a group of resources or literals that are alternatives

Page 20: Modern Information Retrieval

Sequence example

[Figure: the resource http://www.w3.org/TR/REC-rdf-syntax has a dc:Creator node of rdf:Type rdf:Seq, whose members are rdf:_1 “Ora Lassila” and rdf:_2 “Ralph Swick”]

Page 21: Modern Information Retrieval

Bag example

Page 22: Modern Information Retrieval

RDF Schema (RDFS)
• RDF gives a formalism for metadata annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type
• RDF Schema allows you to define vocabulary terms and the relations between those terms
  – it gives “extra meaning” to particular RDF predicates and resources
  – this “extra meaning”, or semantics, specifies how a term should be interpreted

Page 23: Modern Information Retrieval

Core Classes & Properties

Core Classes
• rdfs:Resource
• rdfs:Literal
• rdf:XMLLiteral
• rdfs:Class
• rdf:Property

Core Properties
• rdf:type
• rdfs:subClassOf
• rdfs:subPropertyOf
• rdfs:domain
• rdfs:range
• rdfs:label
• rdfs:comment

Page 24: Modern Information Retrieval

RDFS Examples

<Person, type, Class>
<hasColleague, type, Property>
<Professor, subClassOf, Person>
<Carole, type, Professor>
<hasColleague, range, Person>
<hasColleague, domain, Person>
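
One consequence of the "extra meaning" RDFS gives to subClassOf can be sketched in Python: if x has type C and C is a subClassOf D, then x also has type D. The sketch below applies only this single rule to the example triples, repeating until nothing new is derived; the function name `infer_types` is an assumption of this example, and real RDFS reasoning covers more rules than this one.

```python
def infer_types(triples):
    """Add (x, type, D) whenever (x, type, C) and (C, subClassOf, D) hold."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new statement is derived
        changed = False
        subclass_pairs = {(s, o) for s, p, o in inferred if p == "subClassOf"}
        for s, p, o in list(inferred):
            if p != "type":
                continue
            for sub, sup in subclass_pairs:
                if o == sub and (s, "type", sup) not in inferred:
                    inferred.add((s, "type", sup))
                    changed = True
    return inferred


examples = {
    ("Person", "type", "Class"),
    ("hasColleague", "type", "Property"),
    ("Professor", "subClassOf", "Person"),
    ("Carole", "type", "Professor"),
    ("hasColleague", "range", "Person"),
    ("hasColleague", "domain", "Person"),
}

closure = infer_types(examples)
print(("Carole", "type", "Person") in closure)
```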

Page 25: Modern Information Retrieval

RDF/RDFS “Liberality”
• No distinction between classes and instances (individuals)
  <Species, type, Class>
  <Lion, type, Species>
  <Leo, type, Lion>
• Properties can themselves have properties
  <hasDaughter, subPropertyOf, hasChild>
  <hasDaughter, type, familyProperty>
• No distinction between language constructors and ontology vocabulary, so constructors can be applied to themselves/each other
  <type, range, Class>
  <Property, type, Class>
  <type, subPropertyOf, subClassOf>

Page 26: Modern Information Retrieval

Problems with RDFS
• RDFS is too weak to describe resources in sufficient detail
  – No localised range and domain constraints
    • Can’t say that the range of hasChild is person when applied to persons and elephant when applied to elephants
  – No existence/cardinality constraints
    • Can’t say that all instances of person have a mother that is also a person, or that persons have exactly 2 parents
  – No transitive, inverse or symmetrical properties
    • Can’t say that isPartOf is a transitive property, that hasPart is the inverse of isPartOf or that touches is symmetrical
  – …
• Difficult to provide reasoning support
  – No “native” reasoners for non-standard semantics
  – May be possible to reason via first-order (FO) axiomatisation

Page 27: Modern Information Retrieval

RDF(S) tools
• Read RDF data
  – Parsers: Jena, Redland, SWI-Prolog
  – Validators: W3C RDF validation service
  – Editors: IsaViz, RDF Author, RDFEd, InferEd
• Store RDF data (XML format, triples or relational/OO DB)
  – Sesame, RSSDB, RDFLib
• Use RDF data (applications, RSS news, etc.)
• Manipulate RDF data (inference, query, etc.)
  – Jena RDQL, etc.
  – Example:

    SELECT ?person, ?knows
    WHERE (?x <http://xmlns.com/foap/knows> ?z),
          (?x <http://xmlns.com/foap/name> ?person),
          (?z <http://xmlns.com/foap/name> ?knows)

Page 28: Modern Information Retrieval

RDF Validators
• RDF Validation Service
  – http://www.w3.org/RDF/Validator/
• In general, all RDF parsers do some kind of validation

Page 29: Modern Information Retrieval

References
• RDF Resource Guide:
  – http://www.ilrt.bris.ac.uk/discovery/rdf/resources/
• http://www.w3.org/RDF
• http://www.w3.org/RDF/Validator/

Page 30: Modern Information Retrieval

Text
• Text coding in bits
  – EBCDIC, ASCII
    • Initially, 7 bits. Later, 8 bits
  – Unicode
    • 16 bits, to accommodate oriental languages

Page 31: Modern Information Retrieval

Text
• Formats
  – No single format exists
  – An IR system should retrieve information from different formats
  – Past: IR systems converted the documents
  – Today: IR systems use filters

Page 32: Modern Information Retrieval

Text
• Formats
  – Formats for document interchange (RTF)
  – Formats for displaying (PDF, PostScript)
  – Formats for encoding email (MIME)
  – Compressed or encoded files
    • uuencode/uudecode, binhex

Page 33: Modern Information Retrieval

Text
• Information Theory
  – Amount of information is related to the distribution of symbols in the document.
  – Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $p_i$ is the probability of the $i$-th of the $\sigma$ symbols
  – Definition of entropy depends on the probabilities of each symbol.
  – Text models are used to obtain those probabilities

Page 34: Modern Information Retrieval

Text
• Example - Entropy
  – 001001011011
    $E = -\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}\right) = 1$

Page 35: Modern Information Retrieval

Text
• Example - Entropy
  – 111111111111
    $E = -\left(1\log_2 1 + 0\log_2 0\right) = 0$, taking $0\log_2 0 = 0$
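
The two entropy examples can be checked with a short Python sketch. It assumes the simplest text model, in which each symbol's probability is its relative frequency in the string, and it uses the equivalent form $p \log_2 (1/p)$ of each term; the function name `entropy` is an assumption of this example.

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, using relative frequencies as probabilities.

    Expects a non-empty string. Symbols that never occur contribute nothing,
    which matches the convention 0 * log2(0) = 0.
    """
    total = len(text)
    counts = Counter(text)
    # -p * log2(p) is written as p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(entropy("001001011011"))  # six 0s and six 1s
print(entropy("111111111111"))  # a single repeated symbol
```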

Page 36: Modern Information Retrieval

Text
• Modeling Natural Language
  – Symbols: separate words or belong to words
  – Symbols are not uniformly distributed
    • binomial model
  – Dependency on previous symbols
    • k-order Markovian model
  – We can take words as symbols
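
A k-order Markovian model can be sketched in Python as a table that counts, for every context of k symbols, which symbol follows it. This is only an illustration with characters as symbols; the names `train_markov` and `probability` and the sample string are assumptions of this example.

```python
from collections import Counter, defaultdict


def train_markov(text, k):
    """Count which symbol follows each context of k symbols."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        context = text[i:i + k]
        model[context][text[i + k]] += 1
    return model


def probability(model, context, symbol):
    """Estimate P(symbol | context) from the counts; 0.0 for unseen contexts."""
    counts = model.get(context)
    if not counts:
        return 0.0
    return counts[symbol] / sum(counts.values())


model = train_markov("abababac", k=1)
print(probability(model, "a", "b"))  # "a" is followed by "b" three times out of four
print(probability(model, "b", "a"))  # "b" is always followed by "a"
```

Passing a list of words instead of a string would require tuple contexts, which this sketch does not cover.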

Page 37: Modern Information Retrieval

Text
• Modeling Natural Language
  – Word distribution inside documents
  – Zipf’s Law: the $i$-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word; hence, in a text of $n$ words with a vocabulary of $V$ words, the $i$-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$
  – Real data fits better with $\theta$ between 1.5 and 2.0

Page 38: Modern Information Retrieval

Text
• Modeling Natural Language
  – Example - word distribution (Zipf’s Law)
    • V=1000, θ = 2
    • most frequent word: n=300
    • 2nd most frequent: n=76
    • 3rd most frequent: n=33
    • 4th most frequent: n=19
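
The relation behind this example can be sketched in Python. The functions below follow the usual statement of Zipf's Law, in which the $i$-th most frequent word occurs $n / (i^{\theta} H_V(\theta))$ times with $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$; the function names are assumptions of this example. Dividing the top count of 300 by $i^2$ gives values close to the rounded figures listed above.

```python
def harmonic(V, theta):
    """H_V(theta): sum of 1 / j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_count(n, i, theta, V):
    """Expected occurrences of the i-th most frequent word in a text of n words."""
    return n / (i ** theta * harmonic(V, theta))


def count_from_top(top_count, i, theta):
    """Expected occurrences of rank i, given the count of the most frequent word."""
    return top_count / i ** theta


for rank in range(1, 5):
    print(rank, count_from_top(300, rank, theta=2))
```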

Page 39: Modern Information Retrieval

Text
• Modeling Natural Language
  – Number of distinct words
  – Heaps’ Law: $V = K n^{\beta}$
  – The set of different words in a language is bounded by a constant, but the limit is so high that growth with text size is the more useful model

Page 40: Modern Information Retrieval

Text
• Modeling Natural Language
  – Heaps’ Law example
    • K between 10 and 100, β is less than 1
    • example: n=400000, β = 0.5
      – K=25, V=15811
      – K=35, V=22135
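
The figures in this example follow directly from $V = K n^{\beta}$, as the short Python sketch below shows. The function name `heaps_vocabulary` is an assumption of this example, and the result is truncated to an integer to match the values above.

```python
def heaps_vocabulary(n, K, beta):
    """Vocabulary size predicted by Heaps' Law for a text of n words."""
    return K * n ** beta


for K in (25, 35):
    print(K, int(heaps_vocabulary(400000, K, 0.5)))
```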

Page 41: Modern Information Retrieval

Text
• Modeling Natural Language
  – Length of the words
    • defines total space needed for vocabulary
  – Heaps’ Law: length increases logarithmically with text size.
  – In practice, a finite-state model is used
    • space has p=0.2
    • space cannot appear twice in a row
    • there are 26 letters

Page 42: Modern Information Retrieval

Text
• Similarity Models
  – Distance Function
    • Should be symmetric and satisfy the triangle inequality
  – Hamming Distance
    • number of positions that have different characters
      reverse / receive (the strings differ at positions 3, 5 and 6)
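
The Hamming distance for the example pair can be computed with a few lines of Python. The sketch assumes both strings have the same length, which the definition requires; the function name `hamming` is an assumption of this example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(hamming("reverse", "receive"))
```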

Page 43: Modern Information Retrieval

Text
• Similarity Models
  – Edit (Levenshtein) Distance
    • minimum number of operations needed to make strings equal
      survey / surgery
    • superior for modeling syntactic errors
    • extensions: weights, transpositions, etc.
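
A minimal Python sketch of the edit distance, using the standard dynamic-programming formulation with unit costs for insertion, deletion and substitution. The weighted and transposition variants mentioned above are not covered; the function name `levenshtein` is an assumption of this example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("survey", "surgery"))
```

For the pair above, one substitution (v to g) and one insertion (r) are enough.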

Page 44: Modern Information Retrieval

Text
• Similarity Models
  – Longest Common Subsequence (LCS)
      survey - surgery   LCS: surey
  – Documents: lines as symbols (diff in Unix)
    • time consuming
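
The LCS of the example pair can be recovered with the standard dynamic-programming table followed by a backtracking pass, sketched here in Python for character strings. The function name `lcs` is an assumption of this example; the quadratic table is one reason the line-based comparison of whole documents is time consuming.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back from the bottom-right corner to collect the symbols.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(result))


print(lcs("survey", "surgery"))
```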

Page 45: Modern Information Retrieval

Conclusions
• Text is the main form of communicating knowledge.
• Documents have syntax, structure and semantics
• Metadata: information about data
• Formats of text
• Modeling Natural Language
  – Entropy
  – Distribution of symbols
• Similarity