text based information retrieval - text mining pkb - antonie
Post on 22-Dec-2015
230 views
TRANSCRIPT
Background
• Human dificults to process huge information• Computer can do better with matemathics
– why don’t also use computer to process huge information?
• A Large text to find:– Terrorist attack on 1995?– Terrorist movement and bomb relation?
• Relates to Information Retreival, Data Mining and Text Mining
Terminology• Data Mining
A step in the knowledge discovery process consisting of particular algorithms (methods), produces a particular enumeration of patterns (models) over the data.
• Data Mining is a process of discovering advantageous patterns in data.
• Knowledge Discovery ProcessThe process of using data mining methods (algorithms) to extract (identify) what is knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
What kind of data in Data Mining?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems– Object-Relational– Multimedia– Text– Heterogeneous and
Distributed– WWW
Data Mining Application:• Market analysis• Risk analysis and management• Fraud detection and detection of unusual patterns (outliers)• Text mining (news group, email, documents) and Web mining• Stream data mining
What Is Text Mining?“The objective of Text Mining is to exploit information contained in textual
documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
“The non trivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”.
An exploration and analysis of textual (natural-language) datatextual (natural-language) data by
automatic and semi automatic means to discover new knowledge.
Text Mining (2)
• What is ““previously unknown”previously unknown” information ?
– Strict definition• Information that not even the writer knows.
– Lenient (lunak) definition• Rediscover the information that the author
encoded in the text• e.g., Automatically extracting a product’s name
from a web-page.
• Information Retrieval– Indexing and retrieval of textual documents
• Information Extraction– Extraction of partial knowledgepartial knowledge in the text
• Web Mining– Indexing and retrieval of textual documents and
extraction of partial knowledge using the web
• Clustering– Generating collections of similar text documents
Text Mining Methods
Text Mining Application
• Email: Spam filtering• News Feeds: Discover what is interesting• Medical: Identify relationships and link
information from different medical fields• Marketing: Discover distinct groups of potential
buyers and make suggestions for other products• Industry: Identifying groups of competitors web
pages• Job Seeking: Identify parameters in searching
for jobs
Information Retrieval (1)
• Given:– A source of textual documents– A well defined limited query (text based)
• Find:– Sentences with relevantrelevant information– Extract the relevant information and ignore non-relevant information (important!) – Link related information and output in a predetermined format
• Example: news stories, e-mails, web pages, photograph, music, statistical data, biomedical data, etc.
• Information items can be in the form of text, image, video, audio, numbers, etc.
Information Retrieval (2)
• 2 basic information retrieval (IR) process:– Browsing or navigation system
• User skims document collection by jumping from one document to the other via hypertext or hypermedia links until relevant document found
– Classical IR system: question answering system
• Query: question in natural language• Answer: directly extracted from text of document
collection• Text Based Information Retrieval:
– Information item (document) :• Text format (written/spoken) or has textual description
– Information need (query):• Usually in text format
Intelligent Information Retrieval
• meaning of words – Synonyms “buy” / “purchase”– Ambiguity “bat” (baseball vs. mammal)
• order of words in the query– hot dog stand in the amusement park – hot amusement stand in the dog park
Why Mine the Web?• Enormous wealth of textual information on the Web.
– Book/CD/Video stores (e.g., Amazon)– Restaurant information (e.g., Zagats)– Car prices (e.g., Carpoint)
• Lots of data on user access patterns – Web logs contain sequence of URLs accessed by users
• Possible to retrieve “previously unknown” information– People who ski also frequently break their leg.– Restaurants that serve sea food in California are likely to be
outside San-Francisco
Mining the Web
IR / IESystem
Query
Documentssource
RankedDocuments
1. Doc12. Doc2
3. Doc3 . .
Web Spider
What is Web Clustering ?
• Given:– A source of textual
documents– Similarity measure
• e.g., how many words are common in these documents
ClusteringSystem
Similarity measure
Documentssource
DocDo
cDoc
Doc
Doc
DocDoc
Doc
DocDoc
• Find:• Several clusters of
documents that are relevant to each other
Text characteristics
• Large textual data base– Efficiency consideration
• over 2,000,000,000 web pages• almost all publications are also in electronic form
• High dimensionality (Sparse input)– Consider each word/phrase as a dimension
• Dependency– relevant information is a complex conjunction of
words/phrases• e.g., Document categorization.Pronoun disambiguation
Text characteristics
• Ambiguity– Word ambiguity
• Pronouns (he, she …)• “buy”, “purchase”
– Semantic ambiguity• The king saw the rabbit with his glasses. (? meanings)
• Noisy data• Example: Spelling mistakes
• Not well structured text– Chat rooms
• “r u available ?”• “Hey whazzzzzz up”
– Speech
Text mining process• Text preprocessing
– Syntactic/Semantic text analysis
• Features Generation – Bag of words
• Features Selection– Simple counting– Statistics
• Text/Data Mining– Classification-
Supervised learning
– Clustering- Unsupervised learning
• Analyzing results
• Part Of Speech (pos) tagging• Find the corresponding pos for each word
e.g., John (noun) gave (verb) the (det) ball (noun)
• Word sense disambiguation• Context basedContext based or proximity basedproximity based• Very accurate
• Parsing• Generates a parse treeparse tree (graph) for each sentence• Each sentence is a stand alone graph
Syntactic / Semantic text analysis
Feature Generation: Bag of words• Text document is represented by the words it contains
(and their occurrences)– e.g., “Lord of the rings” {“the”, “Lord”, “rings”, “of”}– Highly efficient– Makes learning far simpler and easier– Order of words is not that important for certain applications
• Stemming: identifies a word by its root– Reduce dimensionality– e.g., flying, flew fly– Use Porter Algorithm
• Stop words: The most common words are unlikely to help text mining– e.g., “the”, “a”, “an”, “you” …
Feature selection
• Reduce dimensionality– Learners have difficulty addressing tasks with
high dimensionality
• Irrelevant features– Not all features help!
• e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”
• Use Weightening
• Given: a collection of labeled records (training settraining set)– Each record contains a set of features (attributesattributes), and
the true class (labellabel)• Find: a modelmodel for the class as a function of the
values of the features• Goal: previously unseen records should be
assigned a class as accurately as possible– A test settest set is used to determine the accuracy of the
model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
Text Mining: Classification definition
Similarity Measures:• Euclidean DistanceEuclidean Distance if attributes are continuous• Other Problem-specific Measures
• e.g., how many words are common in these documents
• Given: a set of documents and a similarity similarity measuremeasure among documents
• Find: clusters such that:– Documents in one cluster are more similar to one
another– Documents in separate clusters are less similar to one
another• Goal:
– Finding a correctcorrect set of documents
Text Mining: Clustering definition
• Supervised learning (classification)– Supervision: The training data (observations, measurements,
etc.) are accompanied by labelslabels indicating the class of the
observations
– New data is classified based on the training set
• Unsupervised learning (clustering)– The class labels of training data is unknown
– Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Supervised vs. Unsupervised Learning
• Correct classification: The known label of test sample is identical with the class class resultresult from the classification model
• Accuracy ratio: the percentage of test set samples that are correctly classified by the model
• A distance measuredistance measure between classes can be used– e.g., classifying “football” document as a
“basketball” document is not as bad as classifying it as “crime”.
Evaluation:What Is Good Classification?
• Good clustering method: produce high quality clusters with . . .– high intra-classintra-class similarity– low inter-classinter-class similarity
• The qualityquality of a clustering method is also measured by its ability to discover some or all of the hiddenhidden patterns
Evaluation: What Is Good Clustering?
Text Classification: An Example
Ex# Hooligan
1 An English football fan …
Yes
2 During a game in Italy …
Yes
3 England has been beating France …
Yes
4 Italian football fans were cheering …
No
5 An average USA salesman earns 75K
No
6 The game in London was horrific
Yes
7 Manchester city is likely to win the championship
Yes
8 Rome is taking the lead in the football league
Yes 10
class
Training Set
ModelLearn
Classifier
text
TestSet
Hooligan
A Danish football fan ?
Turkey is playing vs. France. The Turkish fans …
? 10
Ex# Hooligan
1 An English football fan …
Yes
2 During a game in Italy …
Yes
3 England has been beating France …
Yes
4 Italian football fans were cheering …
No
5 An average USA salesman earns 75K
No
6 The game in London was horrific
Yes
7 Manchester city is likely to win the championship
Yes
8 Rome is taking the lead in the football league
Yes 10
classte
xt
Decision Tree: A Text Example
YesEnglish
Yes
No
MarSt
NO
Married Single, Divorced
Splitting Attributes
Income
YESNO
> 80K < 80K
The splitting attribute at a node is
determined based on a specific
Attribute selection algorithm
• Decision tree – A flow-chart-like tree structure– Internal node denotes a test on an attribute– Branch represents an outcome of the test– Leaf nodes represent class labels or class distribution
• Decision tree generation consists of two phases:– Tree construction– Tree pruning
• Identify and remove branches that reflect noisenoise or outliersoutliers
• Use of decision tree: Classifying an unknown sample– Test the attribute of the sample against the decision
tree
Classification by DT Induction
• Text is tricky to process, but “ok” results are easily
achieved
• There exist several text mining systemstext mining systems
– e.g., D2K - Data to Knowledge
– http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
• Additional IntelligenceIntelligence can be integrated with text
mining
– One may play with any phase of the text mining process
Summary
Summary
• There are many other scientific and statistical text mining scientific and statistical text mining
methodsmethods developed but not covered in this talk.
– http://www.cs.utexas.edu/users/pebronia/text-mining/
– http://filebox.vt.edu/users/wfan/text_mining.html
• Also, it is important to study theoretical foundationstheoretical foundations of
data mining.
– Data Mining Concepts and Techniques / J.Han & M.Kamber
– Machine Learning, / T.Mitchell