Information Filtering, by Hadi Mohammadzadeh
Post on 31-May-2015
1
Hadi Mohammadzadeh Information Filtering (IF) & Text Classification (TC)
By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm
Seminar on
Information Filtering (IF) & Text Classification (TC)
2
Content:
• Information Filtering
– Definition and Terminology
– General Framework
– Performance Evaluation
– Background and State of the Art
– Comparison of Related Tasks
– Summary
3
Definition and Terminology (IF)
• What is the objective of IF? To reduce the user's information load with respect to their areas of interest.
• Is there a general framework for an IF system? Yes. We will describe it and also review some measures for performance evaluation.
4
Definition and Terminology (IF), continued
• Definition (IF): Assume a space of documents $D$. With respect to a specific long-term user interest, we define IF as a mapping $f : D \to \{0,1\}$, where 0 means rejecting a document and 1 means accepting it.
• Another definition of IF: we define IF as a mapping $f : D \to [0,1]$. This notation requires a threshold on the relevance score.
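How the two definitions relate can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the function name `decide`, the example scores, and the threshold value 0.5 are assumptions.

```python
def decide(score: float, threshold: float = 0.5) -> int:
    """Turn a relevance score in [0, 1] into a binary filtering decision.

    Returns 1 (accept the document) if the score reaches the threshold,
    otherwise 0 (reject it). The default threshold is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("relevance score must lie in [0, 1]")
    return 1 if score >= threshold else 0


# Hypothetical relevance scores for three documents.
scores = {"doc_a": 0.92, "doc_b": 0.10, "doc_c": 0.50}
decisions = {name: decide(score) for name, score in scores.items()}
```

Whether a score exactly at the threshold is accepted is a design choice; here it is.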
5
Definition and Terminology (IF), continued
• IF is a process actively conducted by humans, with or without the assistance of a machine, in order to cope with information overload.
Process:
1- Collecting documents
2- Detecting relevant documents
3- Presenting the results to the user
The goal of an IF system (IFS) is to automate this process.
• Definition (IFS): An IFS automates the process of IF with the goal of reducing information overload.
6
Definition and Terminology (IF), continued
• In any specific IFS, four basic points need to be analyzed:
– Input: a stream of textual documents, email, newsgroups, images
– Output: online, batch processing
– Profile construction: human, machine design
– The concept of relevance: content-based filtering, collaborative filtering, economic filtering
7
General Framework (IF)
[Figure: the human judgment $\chi_h$ relates an information need from the user interest space $N$ to a document from the document space $D$ and yields a value in $[0,1]^l$.]
The human judgment is considered as a comparison function
$\chi_h : N \times D \to [0,1]^l$
that expresses the human judgment of the relationship between the user's interest and a document.
Because the user's interest may have many different parts, the judgment can be based on $l$ different aspects, each of which is measured on a numeric scale.
8
General Framework (IF), continued
Based on the result of the comparison function $\chi_h$, the user can decide whether to reject or to accept a document. For this we define the decision function
$\delta_h : [0,1]^l \to \{0,1\}$
Final function: for a given information need $\nu \in N$ and any document $d \in D$, information filtering is defined as
$f(d) = \delta_h(\chi_h(\nu, d))$
9
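General Framework (IF), continued
[Figure: general framework of an IFS. Human side: an information need from the user interest space $N$ and a document from the document space $D$ are compared by human judgment (subscript $h$), giving a value in $[0,1]^l$. System side: the profile acquisition function maps $N$ to the user interest representation space $P$ (the profile), the document representation function maps $D$ to the document representation space $R$, and the comparison function (subscript $s$) maps a profile and a document representation to $[0,1]^l$.]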
10
General Framework (IF), continued
To automate the core of an IFS, every approach must have four basic components:
1- A technique for representing documents, denoted as the document representation function $\rho : D \to R$
2- A technique for representing the user's information need, denoted as the profile acquisition function $\pi : N \to P$
3- A technique for matching the representation of the user's information need against the document representation, denoted as the comparison function $\chi_s : P \times R \to [0,1]^l$
4- A technique for using the results of this comparison, denoted as the system decision function $\delta_s : [0,1]^l \to \{0,1\}$
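The four components listed above can be sketched as four small Python functions. This is a minimal sketch under stated assumptions, not the method of the slides: term counts as the representation, cosine similarity as the comparison with $l = 1$, the threshold 0.3, and all function names are illustrative choices.

```python
from typing import Dict, List

Vector = Dict[str, float]


def represent_document(text: str) -> Vector:
    """Document representation function D -> R (term counts, an assumption)."""
    vector: Vector = {}
    for token in text.lower().split():
        vector[token] = vector.get(token, 0.0) + 1.0
    return vector


def acquire_profile(information_need: str) -> Vector:
    """Profile acquisition function N -> P (same representation, an assumption)."""
    return represent_document(information_need)


def compare(profile: Vector, document: Vector) -> List[float]:
    """Comparison function P x R -> [0,1]^l with l = 1 (cosine similarity)."""
    dot = sum(weight * document.get(term, 0.0) for term, weight in profile.items())
    norm_p = sum(w * w for w in profile.values()) ** 0.5
    norm_d = sum(w * w for w in document.values()) ** 0.5
    if norm_p == 0.0 or norm_d == 0.0:
        return [0.0]
    # Clamp to 1.0 to guard against floating-point rounding.
    return [min(1.0, dot / (norm_p * norm_d))]


def system_decision(scores: List[float], threshold: float = 0.3) -> int:
    """System decision function [0,1]^l -> {0,1}: accept if every aspect passes."""
    return 1 if all(score >= threshold for score in scores) else 0


def filter_document(information_need: str, text: str) -> int:
    """Chain the four components: decide(compare(profile, representation))."""
    profile = acquire_profile(information_need)
    representation = represent_document(text)
    return system_decision(compare(profile, representation))


accepted = filter_document("machine learning text",
                           "text classification with machine learning")
rejected = filter_document("machine learning text", "football results today")
```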
11
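General Framework (IF), continued
Conclusion:
An obvious objective for a filtering system (FS) is that the system decision agrees with the human judgment:
$\delta_h(\chi_h(\nu, d)) = \delta_s(\chi_s(\pi(\nu), \rho(d))), \quad \forall \nu \in N,\ d \in D$
Here $\chi$ and $\delta$ denote the comparison and decision functions on the human ($h$) and system ($s$) side, $\pi$ the profile acquisition function, and $\rho$ the document representation function.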
12
Performance Evaluation (IF)
Performance evaluation (PE) covers two aspects:
• Effectiveness: accept relevant documents and reject non-relevant documents.
• Efficiency: the resources that are consumed to produce the filtering output, such as computation time and labeled training data.
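Effectiveness as described above can be made concrete by counting how often the system accepts relevant and rejects non-relevant documents. The sketch below is an assumption-laden illustration: the function name, the count labels, and the sample data are hypothetical, and the slides do not prescribe a particular measure at this point.

```python
from typing import Dict, Sequence


def effectiveness_counts(decisions: Sequence[int],
                         relevant: Sequence[int]) -> Dict[str, int]:
    """Count the four outcomes of binary filtering decisions.

    decisions[i] is 1 if the system accepted document i, else 0.
    relevant[i] is 1 if document i is truly relevant, else 0.
    """
    if len(decisions) != len(relevant):
        raise ValueError("both sequences must have the same length")
    counts = {
        "accepted_relevant": 0,       # desired outcome
        "rejected_nonrelevant": 0,    # desired outcome
        "accepted_nonrelevant": 0,    # error
        "rejected_relevant": 0,       # error
    }
    for decided, is_relevant in zip(decisions, relevant):
        if decided == 1 and is_relevant == 1:
            counts["accepted_relevant"] += 1
        elif decided == 0 and is_relevant == 0:
            counts["rejected_nonrelevant"] += 1
        elif decided == 1:
            counts["accepted_nonrelevant"] += 1
        else:
            counts["rejected_relevant"] += 1
    return counts


# Hypothetical decisions and relevance judgments for five documents.
counts = effectiveness_counts([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])
```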
13
Content:
• Text Classification
– Definition and Terminology
– Text Representation
    • Text Normalization
    • Term Extraction
    • Dimensionality Reduction
    • Vector Generation
– Text Learning
    • Learning Algorithms
    • Ensembles of Classifiers
14
Definition and Terminology (TC)
• In this section we will:
1. Define the task of TC.
2. Show the relationship between TC and the task of IF.
• Our goal: to use machine learning techniques for automatic TC.
• We also examine:
1. Representing textual documents (text representation) in a way that is appropriate as input for machine learning algorithms.
2. Applying the machine learning algorithms themselves.
15
Definition and Terminology (TC), continued
• TC is a mapping $h : D \to C$ from the space of textual documents $D$ to a fixed set of classes $C = \{c_1, \dots, c_k\}$.
• So the task of TC is to classify documents into a fixed number of predefined classes.
• Information filtering then means the classification of documents as either relevant or non-relevant. Each document is assigned to exactly one class.
16
Definition and Terminology (TC), continued
• What is supervised learning (SL), and what is its task?
o SL is a machine learning (ML) technique for learning a function from training data.
o The task of SL is to predict the value of the function for any valid input object after having seen a number of training examples.
• So assume:
– A set of $n$ labeled training documents $\{d_1, \dots, d_n\} \subseteq D$
– A fixed set of $k$ classes $C = \{c_1, \dots, c_k\}$
– A target function $T : D \to C$ that assigns each training document its true class label
Now the objective of the text learning (TL) task is to induce a classifier $h : D \to C$.
17
Definition and Terminology (TC), continued
Summary of automatic TC so far
• The problem of automatically classifying documents falls into two phases:
1. Learning phase: apply a supervised learning algorithm (SLA) to construct a classifier.
I. Preprocess the text to prepare it as input for the machine learning algorithm (MLA).
II. Apply a text learning algorithm (TLA).
2. Classification phase: use this classifier to predict the class labels of new documents.
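The two phases can be illustrated with a deliberately simple classifier. The learning rule used here (per-class term counts, scored by the share of a class's training tokens that match the new document) is an assumption chosen for brevity, not an algorithm named in the slides; the function names and training data are hypothetical as well.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Very simple preprocessing: lower-case and split on whitespace."""
    return text.lower().split()


def learn(training_docs: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Learning phase: build per-class term counts from (text, label) pairs."""
    term_counts: Dict[str, Counter] = defaultdict(Counter)
    for text, label in training_docs:
        term_counts[label].update(tokenize(text))
    return dict(term_counts)


def classify(classifier: Dict[str, Counter], text: str) -> str:
    """Classification phase: predict the class label of a new document."""
    tokens = tokenize(text)

    def score(label: str) -> float:
        counts = classifier[label]
        total = sum(counts.values())
        if total == 0:
            return 0.0
        # Share of the class's training tokens matched by the new document.
        return sum(counts[token] for token in tokens) / total

    # sorted() makes tie-breaking deterministic (alphabetical order).
    return max(sorted(classifier), key=score)


training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
classifier = learn(training)
label_1 = classify(classifier, "buy cheap pills")
label_2 = classify(classifier, "monday meeting")
```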
18
Text Representation (TC)
• Aim of text representation (TR): to transform a textual document into a format that is suitable as input for an MLA.
Definition: VSM (Vector Space Model)
Let $d \in D$ be a textual document. The representation of $d$ is the document vector $\rho(d) = (w_1, \dots, w_m)^T \in \mathbb{R}^m$, where each dimension corresponds to a distinct term in the document collection and $w_i$ denotes the weight of the $i$-th term. The set of these $m$ index terms, $V = \{t_1, \dots, t_m\}$, is referred to as the vocabulary.
19
Text Representation (TR), continued
[Figure: the document representation function maps each document $d \in D$ of a specific document collection from the document space $D$ to a document vector (feature vector) $(w_1, \dots, w_i, \dots, w_m)^T \in \mathbb{R}^m$. Each dimension corresponds to a distinct index term $IT_1, IT_2, \dots, IT_i, \dots, IT_m$. At this point we suppose that index terms are plain text.]
20
Text Representation (TR), continued
Steps involved in transforming training documents and new documents into feature vectors:
1. Text Normalization (TN): transforms any type of document into a sequence of word tokens.
2. Term Extraction (TE): all distinct index terms of the training documents are merged to generate a set of candidate index terms that may potentially be used as the vocabulary.
3. Dimensionality Reduction (DR): since the resulting set of index terms tends to be very large, this step aims at reducing the size of the vocabulary.
4. Vector Generation (VG): evaluates the weights of all index terms for any given document.
21
Text Representation (TC), continued
22
Text Representation (TC), continued
Text Normalization (TN)
Output: a sequence of normalized tokens
There are two steps in the text normalization process:
1. Parse the textual components to produce a sequence of tokens.
2. Normalize the tokens, depending on the application and based on domain-specific knowledge:
• All letters are converted to lower case.
• Punctuation marks at the end of tokens are removed.
• Tokens that contain any non-alphanumeric character may be deleted.
• Even tokens containing numeric characters are often omitted.
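The normalization rules above can be sketched as follows. Splitting on whitespace as the parsing step, the function name `normalize`, and the `drop_numeric` switch are assumptions made for this example; a real application would choose its rules from domain knowledge.

```python
import string
from typing import List


def normalize(text: str, drop_numeric: bool = True) -> List[str]:
    """Parse text into tokens and normalize them.

    Step 1 (parsing) splits on whitespace, which is an assumption.
    Step 2 applies the four normalization rules listed above.
    """
    tokens: List[str] = []
    for raw in text.split():
        # Lower-case and remove punctuation marks at the end of the token.
        token = raw.lower().rstrip(string.punctuation)
        if not token:
            continue
        # Delete tokens that still contain a non-alphanumeric character.
        if not token.isalnum():
            continue
        # Optionally omit tokens containing numeric characters.
        if drop_numeric and any(ch.isdigit() for ch in token):
            continue
        tokens.append(token)
    return tokens


sample = "The Cat sat, on the mat. It's 2 o'clock at room B12!"
strict_tokens = normalize(sample)
lenient_tokens = normalize(sample, drop_numeric=False)
```

With `drop_numeric=False` the tokens `2` and `b12` are kept, which shows why this rule depends on the application.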
23
Text Representation (TC), continued
Term Extraction (TE)
Output: a sequence of index terms (IT) based on tokens
At this point we face two situations:
1. No vocabulary: the outputs of TE are merged to generate a set of distinct index terms, the candidate vocabulary $\hat{V} = \{\hat{t}_1, \dots, \hat{t}_{\hat{m}}\}$.
2. Vocabulary exists: if an IT is not in the vocabulary, it is omitted. Otherwise it contributes to the term frequency vector (TFV) of document $d$,
$d^{tf} = (tf_d(t_1), \dots, tf_d(t_m))^T$
where $tf_d(t)$ denotes the number of times term $t \in V$ appears in document $d$.
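Both situations can be sketched in Python. The helper names `build_vocabulary` and `term_frequency_vector` are hypothetical, and sorting the vocabulary alphabetically is an assumption made only to fix the order of the vector dimensions.

```python
from collections import Counter
from typing import List


def build_vocabulary(tokenized_docs: List[List[str]]) -> List[str]:
    """Situation 1 (no vocabulary): merge all distinct index terms."""
    return sorted({term for tokens in tokenized_docs for term in tokens})


def term_frequency_vector(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Situation 2 (vocabulary exists): count in-vocabulary terms only.

    Terms that are not in the vocabulary are omitted. Entry i of the
    result is the number of times vocabulary[i] occurs in the document.
    """
    known = set(vocabulary)
    counts = Counter(term for term in tokens if term in known)
    return [counts[term] for term in vocabulary]


training_docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
vocabulary = build_vocabulary(training_docs)

# "barked" is not in the vocabulary and is therefore omitted.
new_document = ["the", "dog", "barked", "the"]
tf_vector = term_frequency_vector(new_document, vocabulary)
```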
24
Text Representation (TC), continued
Dimensionality Reduction (DR)
• Curse of dimensionality: the problem of having too many features.
• When the DR step is applied: only when the vocabulary needs to be constructed, i.e. when the training documents are being processed.
• Objective: to reduce the number of features that are finally used to represent a document.
The term extraction step on the training data yields the candidate vocabulary $\hat{V} = \{\hat{t}_1, \dots, \hat{t}_{\hat{m}}\}$, which is very large. After DR, the vocabulary is $V = \{t_1, \dots, t_m\}$, where $m \le \hat{m}$.
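One simple way to shrink a candidate vocabulary is to keep only terms that occur in a minimum number of training documents. This selection criterion is an assumption used for illustration; the slides do not state at this point which reduction technique is applied. The function name and the parameter `min_doc_freq` are hypothetical.

```python
from collections import Counter
from typing import List


def reduce_vocabulary(tokenized_docs: List[List[str]],
                      candidate_vocabulary: List[str],
                      min_doc_freq: int = 2) -> List[str]:
    """Keep candidate terms that appear in at least min_doc_freq documents."""
    doc_freq: Counter = Counter()
    for tokens in tokenized_docs:
        # set() ensures each document counts a term at most once.
        doc_freq.update(set(tokens))
    return [term for term in candidate_vocabulary
            if doc_freq[term] >= min_doc_freq]


training_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
candidates = sorted({term for tokens in training_docs for term in tokens})
reduced = reduce_vocabulary(training_docs, candidates)
```

The reduced vocabulary is never larger than the candidate vocabulary, and it is computed from the training documents only.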