fypca4

Mining User’s Opinions in Hotel

National University Of Singapore

TEY JUN HONG

U095074X

1. Background

2. Formulating the problem

3. Data Mining Process

4. Techniques

5. Analysis

Content

• Extraction of meaningful, useful, and interesting patterns from large volumes of data

• In this project, the source is a large volume of web hotel review data

• Data mining is one of the top ten emerging technologies

What is Data Mining?

MIT’s TECHNOLOGY REVIEW 2004

• Process of exploration and analysis

• By automatic or semi-automatic means

• With little or no human interaction

• To discover meaningful patterns and rules

What is Data Mining?

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

• Increase in social media and web users

• Increase in valuable opinion-oriented data on hotels due to web expansion

• Identify potential hotels to stay at by looking at their aspects

• Overall sentiments on hotels are widely sought on the web for sentiment analysis

User’s Opinions in Hotel

• Identify best prospects (ASPECTS), and retain customers

• Predict what ASPECTS customers like and promote accordingly

• Learn parameters influencing trends in sales and margins

• Identification of opinions for customers

Sentiment Analysis !!!

What can Data Mining do?

• Exponential growth of users’ opinions

• Limitations of human analysis

• Accuracy of human analysis

Machines can be trained to take over human analysis with advanced computer technology, and at LOW COST

What are the problems?

• Unable to read like a human

• No emotions

• Cannot detect sarcasm

• Expression of sentiments in different topics and domains

• Polarity analysis

• Facts vs. opinions

Some Limitations of machines

• “The service is as good as none”. The negation is not obvious to a machine.

• “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.

• “The room is warmer than the lobby”. Comparisons are hard to classify.

Some machine limitation examples
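The first two examples can be reproduced with a minimal sketch of a naive lexicon lookup. The three-word lexicon and the `naive_polarity` function are hypothetical illustrations, not the project's code; they only show why summing word polarities misreads negation and context.

```python
# Hypothetical toy lexicon (illustrative entries only, not the project's lexicon).
LEXICON = {"good": 1, "big": 1, "comfort": 1}

def naive_polarity(sentence):
    """Sum word polarities with no handling of negation, context, or comparison."""
    words = sentence.lower().replace(".", " ").replace(",", " ").split()
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for text in [
    "The service is as good as none",                     # a complaint, read as praise
    "Swimming pool is big enough to swim with comfort",   # "big" is praise here
    "There is a big crowd at the counter complaining",    # "big" is not praise here
    "The room is warmer than the lobby",                  # comparison, no lexicon hit
]:
    print(naive_polarity(text), "<-", text)
```

A lookup like this treats "good" and "big" identically in every sentence, which is the limitation the slide describes.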

• Machine learning

• Pattern recognition

• Statistics

• Databases

Sentiment Analysis

• A tool for data mining and intelligent decision support

• Application of computer algorithms that improve automatically through experience

Machine Learning

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

• Supervised Learning

• A training set is provided (data with correct answers), which is used to mine for known patterns

• Unsupervised Learning

• Data are provided with no prior knowledge of the hidden patterns that they contain

• Semi-Supervised Learning

Types of Machine learning

• Rule Mining and Rule learning

• Bayesian Networks

• Support Vector Machine

Supervised Learning techniques

• Prediction of sentence polarity

• Classification of polarity for sentiment lexicon

• Detection of relations

Project Objective

• Large data set

• Relevant prior knowledge of the domain, in our case the hotel domain

• E.g. rating

• Sentiment lexicon for sentiment analysis

• Data selection for reliability and standards

Pre-requisite

Data Mining Process

• Frequent problem: data inconsistencies

• Duplicate data

• Spelling Errors != Trim from data

• Foreign accents and characters

• Singular / plural conversion

• Punctuation removal / replacement

• Noise and incomplete data

• Naming conventions misused: same name but different meaning

Cleaning the “Dirty” Data (60% of effort)
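A few of the cleaning steps listed above can be sketched with the Python standard library. This is an assumed illustration, not the project's pipeline: the function names are hypothetical, spelling correction is left out, and the plural stripping is deliberately crude (it would also shorten words such as "this").

```python
import re
import unicodedata

def clean_review(text):
    """Fold accents, replace punctuation with spaces, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def singularize(word):
    """Very rough plural stripping; a real system would use a lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def clean_corpus(reviews):
    """Clean every review and drop empty entries and duplicates, keeping first-seen order."""
    seen, cleaned = set(), []
    for review in reviews:
        normalized = " ".join(singularize(w) for w in clean_review(review).split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned

print(clean_corpus(["The caf\u00e9 rooms were GREAT!!!", "the cafe room were great", ""]))
```

Deduplicating after normalization, rather than before, lets reviews that differ only in case, accents, or punctuation count as the same entry.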

• Part-of-speech (POS) tagging using the Brill tagger

• Polarity tagging using sentiment lexicon

Data Preprocessing (Laundering)
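Polarity tagging with a lexicon amounts to a dictionary lookup per candidate word. The sketch below assumes the sentiment words have already been picked out by the POS step; the four-entry lexicon, the "unknown" label, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical mini lexicon; a real sentiment lexicon has thousands of entries.
LEXICON = {
    "clean": "positive",
    "friendly": "positive",
    "dirty": "negative",
    "small": "neutral",
}

def tag_polarity(words, lexicon=LEXICON):
    """Pair each word with its lexicon polarity, or 'unknown' when it is missing."""
    return [(word, lexicon.get(word.lower(), "unknown")) for word in words]

def label_shares(tagged):
    """Fraction of tagged words carrying each label."""
    counts = Counter(label for _, label in tagged)
    total = len(tagged)
    return {label: count / total for label, count in counts.items()} if total else {}

tagged = tag_polarity(["clean", "Friendly", "cosy", "small"])
print(tagged)
print(label_shares(tagged))
```

Tracking the share of "unknown" and "neutral" labels gives a direct measure of how well a lexicon covers a given review corpus.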

• Part-of-speech (POS) tagging using the Brill tagger - NO PROBLEM

- 95% accuracy in POS tagging of words after data cleaning

Findings

• Polarity tagging using sentiment lexicon – BIG PROBLEM

- 40% of sentiment words are not found in the sentiment lexicon

- 10% of sentiment words with a positive or negative polarity are found in the neutral section of the sentiment lexicon

Findings

• The sentiment lexicon is not comprehensive enough for the machine learning technique adopted

• Domain-dependent sentiment words are found in the neutral section of the sentiment lexicon

• The polarity of sentiment words can also change within the domain even though they are domain dependent

EXPANSION OF LEXICON !!!

Problems

• Classify the polarity of unlabeled sentiment words using rule-based mining

• Classify domain-dependent sentiment words

• Establish word relations between labeled and unlabeled sentiment words

Solution

• Rule-based mining using conjunctions and punctuation

Data Processing

Polarity Assignment Rules

Polarity   Pattern
Same       Adj – AND/OR – Adj
Opposite   Neg-Adj – AND/OR – Adj / Adj – AND/OR – Neg-Adj
Same       Neg-Adj – AND/OR – Neg-Adj
Opposite   Adj – BUT/NOR – Adj
Same       Neg-Adj – BUT/NOR – Adj / Adj – BUT/NOR – Neg-Adj
Opposite   Neg-Adj – BUT/NOR – Neg-Adj
Same       Adj , Adj
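The rules above follow a parity pattern: BUT/NOR contributes one flip, each negated adjective contributes one flip, and an odd number of flips means opposite polarity. A minimal sketch of that reading follows; the function name and the +1/-1 encoding are hypothetical, and applying negation flips to the comma pattern is an assumption, since the rules list the comma case only without negation.

```python
SAME_LINKS = {"and", "or", ","}   # links that keep polarity
FLIP_LINKS = {"but", "nor"}       # links that reverse polarity

def infer_polarity(known_polarity, link, known_negated=False, unknown_negated=False):
    """Polarity (+1 or -1) of an unlabeled adjective joined to a labeled one.

    known_polarity: +1 or -1 for the adjective already in the lexicon.
    link: the conjunction or comma between the two adjectives.
    *_negated: whether a negation word precedes that adjective.
    """
    link = link.lower()
    if link in SAME_LINKS:
        flips = 0
    elif link in FLIP_LINKS:
        flips = 1
    else:
        raise ValueError(f"unsupported link: {link!r}")
    flips += int(known_negated) + int(unknown_negated)
    return known_polarity if flips % 2 == 0 else -known_polarity

# "clean and cosy": "cosy" inherits the polarity of "clean".
print(infer_polarity(+1, "and"))
# "clean but noisy": "noisy" takes the opposite polarity.
print(infer_polarity(+1, "but"))
# "not clean and smelly": Neg-Adj AND Adj gives the opposite polarity.
print(infer_polarity(+1, "and", known_negated=True))
```

Counting flips keeps all seven rows in one expression instead of seven separate branches.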

• Relation Network – Aspect – Sentiment word pair

Data Processing

• Using the expanded sentiment lexicon, we analyze sentiment polarity by doing a sentiment lookup using a Bayesian network

Analysis

• To determine polarity of sentiments

P(X | Y) = P(X) P(Y | X) / P(Y)

• Probability that a sentiment is positive or negative, given its contents

• Assumption: there is no link between words

• P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence)

Bayesian
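Under the stated assumption that words are not linked, P(sentence | sentiment) factors into a product of per-word probabilities, which is the classifier usually called naive Bayes. The sketch below is an illustration under that reading, not the project's implementation; the toy training data, the function names, and the add-one smoothing are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(labeled_sentences):
    """labeled_sentences: iterable of (list_of_words, sentiment_label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_sentences:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(words, model):
    """Return the label that maximizes P(sentiment) * product of P(word | sentiment)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # P(sentence) is identical for every label, so it is dropped from the comparison.
        score = math.log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for word in words:
            # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    (["clean", "friendly"], "positive"),
    (["dirty", "rude"], "negative"),
    (["clean", "quiet"], "positive"),
    (["noisy", "dirty"], "negative"),
])
print(classify(["clean", "quiet"], model))
print(classify(["dirty"], model))
```

Working in log space avoids numeric underflow when a sentence contains many words.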

• Precision = N (agree & found) / N (found)

• High precision means most of the sentiment words found by the system are correctly labeled

• Recall = N (agree & found) / N (agree)

• High recall means most of the correct sentiment words are found by the system

Validation
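The two ratios can be computed directly from sets. This sketch assumes that "found" is the set of (word, polarity) labels the system produced and "agree" is the set the human judges agreed on; that interpretation, the function name, and the sample data are hypothetical.

```python
def precision_recall(found, agreed):
    """Return (precision, recall) for system labels versus judge-agreed labels.

    found:  (word, polarity) pairs output by the system.
    agreed: (word, polarity) pairs the human judges agreed on.
    """
    found, agreed = set(found), set(agreed)
    hits = len(found & agreed)  # N (agree & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agreed) if agreed else 0.0
    return precision, recall

system_output = {("cosy", "+"), ("noisy", "-"), ("big", "+"), ("cheap", "-")}
judge_labels = {("cosy", "+"), ("noisy", "-"), ("big", "-"), ("cheap", "+"), ("spacious", "+")}
print(precision_recall(system_output, judge_labels))
```

In this made-up sample the system produces four labels, two of which match the judges' five, so precision is 2/4 and recall is 2/5.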

• Out of the 350 aspect-unlabelled sentiment word pairs, only 194 are found by the method. Thus, the precision is about 57%.

• The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.

Validation Results

• The results will improve if more rules are applied, such as the inclusion of more adverbs like “excessively” as negation words.

• There might not be enough data for the system to work on. There are only 350 aspect-unlabelled sentiment word pairs for the application to work with.

• This, however, requires more human judges to validate the data

Discussion

• A comprehensive sentiment lexicon is a simple yet effective solution to sentiment analysis as it does not require prior training

• The current sentiment lexicon does not capture the domain and context sensitivities of sentiment expressions

Conclusion

• This leads to poor coverage

• Thus, expanding the general sentiment lexicon to capture domain and context sensitivities of sentiment expressions is advocated

Conclusion

DEMO

Questions?
