progress report (concept extraction) presented by: mohsen kamyar

Progress Report (Concept Extraction)

Presented by: Mohsen Kamyar

Outline

Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology

Extraction” “Concept Extraction” Problems Main ideas

Motivations of “Semantic Web” Computation History:

W3C definition: “Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries”.

Huge Computations

Gathering Human Data

Sharing Data

Processing Huge Data

First Computers First ApplicationsInventing Web

Semantic Web

Motivations of “Semantic Web” Some Efforts:

http://dbpedia.org/ (RDF enabled Wikipedia Data) http://www.foaf-project.org/ (Friend of a Friend) http://sioc-project.org/ (Semantically-Interlinked

Online Communities) http://openguid.net/ http://simile.mit.edu/ (Semantic Interoperability of

Metadata and Information in unLike Environments)

Motivations of “Semantic Web”

Different Ontologies and Metadata and Their Overlaps

Motivations of “Semantic Web”

Different Projects and Their Overlaps

Outline



“Semantic Web” structure

W3C Suggestion for Semantic Web Stack

“Semantic Web” structure

Road from “zero” to Semantic Web

Having an Ontology

Annotating Documents

Submit Query on Semantic Enabled Data

Having an Ontology

Generating Data According to Ontology

Submit Query on Semantic Enabled Data

Existing Data

Existing Data

Outline



Motivations of “Automatic Ontology Extraction” Ontology is domain specific In each domain we need a team of experts to

create an ontology Existing ontologies cover small ranges of

knowledge. WordNet Data about members, projects and … in

Universities DBLP, …

Motivations of “Automatic Ontology Extraction”

DailyMed for drugs FOAF MySpace BBC Widely used one is AudioSrobbler Based on event

ontology with 600 million entry but it is a simple ontology.

Problem: Representing Knowledge in a Machine Readable Format.

Outline



“Concept Extraction”

An ontology is a set of “Concepts”, properties of concepts (and constraints on them) and “Rules” between “Concepts”.

If we can’t generate a general ontology in a domain, “Concepts” can be used in determining the relevance of documents and their ranks.

Outline



Main Components in existing methods are as below: Similarity Measures

Euclidean Measures: Cosine Distance Problem: If we have a document “B” that is the subset of

a document “A” then the cosine distance will be large

Problems

Problems

Importance Measures Frequency: Has big false positive error (on technical

words) and false negative error (on common words) TFiDF (Term Frequency . Inverse Document

Frequency): Has more accuracy But if we have two semantic correlated words this measure

will make mistake on main purpose of the document. Also for long documents it will fail Information about order of words will be lost If we consider stems and synonyms or other related set of

words it can work fine, otherwise many problems.

Problems

LSI (Latent Semantic Indexing/Analysis) This method uses SVD (singular value decomposition) After applying SVD we have term×concept (U matrix),

eigenvalues, concept×document (V matrix). Matrices U and V are eigenvectors of terms correlation

matrix and document correlation matrix. Eigenvectors have property of clustering the nodes of a

graph (for example in PageRank algorithm we use same property)

But there are problems.

Problems

This method assumes some probabilistic properties for term-document matrix that may be not hold, for example Gaussian Model properties.

This method is not customized for sparse matrices This method is offline This method has big computation complexity

Some efforts have been made such as SDD This method again is based on cosine distance and similar

measures

Outline



Main ideas

Main ideas on this process are as below Using more general spaces instead of Euclidian

space, such as “Banach Spaces” and defining a new distance measure that can serve desirable properties of term-document space. Banach spaces have interesting properties for high

dimensional problems because of their generlity. Using Linguistics theorems in determining the

importance of a term in a document instead of TFiDF

Main ideas

Customizing LSI for sparse matrices and creating a framework for any method for matrix decomposing.

Giving an online version of LSI LSI (SVD) basically is an approximation for term-

document matrix but also have big computation complexity and can be approximated to.

We use clustering properties of eigenvectors in LSI, so we can substitute it with any clustering that is not based on Euclidian as mentioned.

progress report (concept extraction) presented by: mohsen kamyar

Documents

web semantic web slide

semantic web stack slide

semantic web structure

event ontology

general ontology

simple ontology

ontology existing ontologies

wikipedia data http