Mining User’s Opinions in Hotel National University Of Singapore TEY JUN HONG U095074X

Post on 13-Jul-2015

Page 1: Fypca4

Mining User’s

Opinions in

Hotel

National University Of Singapore

TEY JUN HONG

U095074X

Page 2: Fypca4

1. Background

2. Formulating the problem

3. Data Mining Process

4. Techniques

5. Analysis

Content

Page 3: Fypca4

• Extraction of meaningful / useful / interesting patterns from a large volume of data sources

• In this project, the source is a large volume of WEB HOTEL REVIEWS data

• Data mining is one of the top ten emerging technologies

What is Data Mining?

MIT’s TECHNOLOGY REVIEW 2004

Page 4: Fypca4

• Process of exploration and analysis

• By automatic / semi-automatic means

• With little or no human interaction

• To discover meaningful patterns and rules

What is Data Mining?

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

Page 5: Fypca4

• Increase in social media and web users

• Increase in valuable opinion-oriented data on hotels due to web expansion

• Identify potential hotels to stay in by looking at their aspects

• Overall sentiments on hotels are greatly sought on the web for Sentiment Analysis

User’s Opinions in Hotel

Page 6: Fypca4

• Identify the best prospects (ASPECTS) and retain customers

• Predict which ASPECTS customers like and promote accordingly

• Learn parameters influencing trends in sales and margins

• Identification of opinions for customers

Sentiment Analysis !!!

What can Data Mining do?

Page 7: Fypca4

• Exponential growth of users’ opinions

• Limitations of human analysis

• Accuracy of human analysis

With advanced computer technology, machines can be trained to take over human analysis at LOW COST

What are the problems?

Page 8: Fypca4

• Unable to read like a human

• No emotions

• Cannot detect sarcasm

• Expression of sentiments in different topics and domains

• Polarity analysis

• Facts vs. opinions

Some Limitations of machines

Page 9: Fypca4

• “The service is as good as none”. Negation is not obvious to a machine.

• “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.

• “The room is warmer than the lobby”. Comparisons are hard to classify.

Examples of machine limitations

Page 10: Fypca4

• Machine learning

• Pattern recognition

• Statistics

• Databases

Sentiment Analysis

Page 11: Fypca4

• A tool for data mining and intelligent decision support

• Application of computer algorithms that improve automatically through experience

Machine Learning

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

Page 12: Fypca4

• Supervised Learning

  • A training set (data with correct answers) is provided and used to mine for known patterns

• Unsupervised Learning

  • Data are provided with no prior knowledge of the hidden patterns they contain

• Semi-Supervised Learning

Types of Machine learning

Page 13: Fypca4

• Rule Mining and Rule learning

• Bayesian Networks

• Support Vector Machine

Supervised Learning techniques

Page 14: Fypca4

• Prediction of sentence polarity

• Classification of polarity for sentiment lexicon

• Detection of relations

Project Objective

Page 15: Fypca4

• Large data set

• Relevant prior knowledge of the domain, in our case the hotel domain

  • E.g. rating

• Sentiment lexicon for sentiment analysis

• Data selection for reliability and standards

Prerequisites

Page 16: Fypca4

Data Mining Process

Page 17: Fypca4

• Frequent problem: data inconsistencies

• Duplicate data

• Spelling Errors != Trim from data

• Foreign accents and characters

• Singular / plural conversion

• Punctuation removal / replacement

• Noise and incomplete data

• Naming conventions misused: same name but different meaning

Cleaning the “Dirty” Data (60% of effort)
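
A minimal sketch of some of the cleaning steps listed above, written in Python with only the standard library. The function names `clean_review`, `singularize` and `dedupe` are hypothetical, and the crude plural rule is an assumption made for illustration, not the method used in the project.

```python
import re
import unicodedata


def clean_review(text):
    """Normalise one review: strip accents, drop punctuation, lower-case."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse repeated whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", ascii_text)
    return re.sub(r"\s+", " ", no_punct).strip().lower()


def singularize(word):
    """Very crude plural-to-singular rule (assumption: regular English plurals)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def dedupe(reviews):
    """Remove duplicate reviews after cleaning, keeping first-seen order."""
    seen = set()
    unique = []
    for review in reviews:
        cleaned = clean_review(review)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```

Comparing reviews only after normalisation means that copies differing in case or punctuation are still treated as duplicates.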

Page 18: Fypca4

• Part of Speech Tagging (POS) using Brill Tagger

• Polarity tagging using sentiment lexicon

Data Preprocessing (Laundering)
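
The polarity-tagging step can be sketched as a plain lexicon lookup in Python. This sketch assumes the sentence has already been POS-tagged and that adjectives carry Penn Treebank style tags beginning with "JJ". The names `TOY_LEXICON`, `tag_polarity` and `coverage` and the lexicon entries are hypothetical and stand in for the real sentiment lexicon.

```python
# Toy sentiment lexicon (hypothetical entries, for illustration only).
TOY_LEXICON = {
    "good": "positive",
    "clean": "positive",
    "dirty": "negative",
    "rude": "negative",
    "big": "neutral",
}


def tag_polarity(tagged_tokens, lexicon=TOY_LEXICON):
    """Attach a polarity label to every adjective in a POS-tagged sentence.

    tagged_tokens: list of (word, pos) pairs; adjectives are assumed to have
    tags starting with "JJ". Words missing from the lexicon get "unknown".
    """
    result = []
    for word, pos in tagged_tokens:
        if pos.startswith("JJ"):
            result.append((word, lexicon.get(word.lower(), "unknown")))
    return result


def coverage(tagged):
    """Fraction of tagged adjectives that were found in the lexicon."""
    if not tagged:
        return 0.0
    found = sum(1 for _, label in tagged if label != "unknown")
    return found / len(tagged)
```

A low `coverage` value on real reviews would indicate that many sentiment words are missing from the lexicon.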

Page 19: Fypca4

• Part of Speech Tagging (POS) using Brill Tagger: NO PROBLEM

  - 95% accuracy in POS tagging of words after data cleaning

Findings

Page 20: Fypca4

• Polarity tagging using sentiment lexicon – BIG PROBLEM

  - 40% of sentiment words are not found in the sentiment lexicon

  - 10% of sentiment words with a positive or negative polarity are found in the neutral section of the sentiment lexicon

Findings

Page 21: Fypca4

• The sentiment lexicon is not comprehensive enough for the machine learning technique adopted

• Sentiment words whose polarity is domain dependent are found in the neutral section of the sentiment lexicon

• The polarity of sentiment words can also change within a domain, even though they are domain dependent

EXPANSION OF LEXICON !!!

Problems

Page 22: Fypca4

• Classify the polarity of unlabeled sentiment words using rule-based mining

• Classify domain-dependent sentiment words

• Establish word relations between labeled and unlabeled sentiment words

Solution

Page 23: Fypca4

• Rule-based mining using conjunctions and punctuation

Data Processing

Polarity Assignment Rules

Polarity   Pattern
Same       Adj – AND/OR – Adj
Opposite   Neg-Adj – AND/OR – Adj  /  Adj – AND/OR – Neg-Adj
Same       Neg-Adj – AND/OR – Neg-Adj
Opposite   Adj – BUT/NOR – Adj
Same       Neg-Adj – BUT/NOR – Adj  /  Adj – BUT/NOR – Neg-Adj
Opposite   Neg-Adj – BUT/NOR – Neg-Adj
Same       Adj , Adj
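
The polarity assignment rules above reduce to one observation: AND, OR and a comma keep the polarity, BUT and NOR reverse it, and each negation flips the result once more. A minimal Python sketch under that reading follows. The function names `relation` and `infer_polarity` are hypothetical, and the sketch assumes negation has already been detected for each adjective.

```python
SAME_CONNECTORS = {"and", "or", ","}
OPPOSITE_CONNECTORS = {"but", "nor"}


def relation(connector, negated_left=False, negated_right=False):
    """Return "same" or "opposite" for two adjectives joined by a connector."""
    connector = connector.lower()
    if connector in SAME_CONNECTORS:
        opposite = False
    elif connector in OPPOSITE_CONNECTORS:
        opposite = True
    else:
        raise ValueError(f"unsupported connector: {connector!r}")
    # Each negation flips the relation once.
    if negated_left:
        opposite = not opposite
    if negated_right:
        opposite = not opposite
    return "opposite" if opposite else "same"


def infer_polarity(known_polarity, connector,
                   negated_left=False, negated_right=False):
    """Infer the polarity of an unlabelled adjective from a labelled neighbour."""
    flip = {"positive": "negative", "negative": "positive"}
    if relation(connector, negated_left, negated_right) == "same":
        return known_polarity
    return flip[known_polarity]
```

For example, in "clean but noisy", a known positive label on "clean" gives `infer_polarity("positive", "but")`, which returns "negative" for the unlabelled word.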

Page 24: Fypca4

• Relation Network – Aspect – Sentiment word pair

Data Processing

Page 25: Fypca4

• Relation Network – Aspect – Sentiment word pair

Data Processing

Page 26: Fypca4

• Using the expanded sentiment lexicon, we analyze sentiment polarity by doing a sentiment lookup with a Bayesian Network

Analysis

Page 27: Fypca4

• To determine the polarity of sentiments

P(X | Y) = P(X) P(Y | X) / P(Y)

• Probability that a sentiment is positive or negative, given its contents

• Assumption: there is no link between words

• P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence)

Bayesian
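
A small Python sketch of how the formula above can be applied under the no-link-between-words assumption, i.e. a naive Bayes classifier over words. The class name `NaiveBayesSentiment` is hypothetical, and add-one smoothing is an assumption added here so that unseen words do not produce a zero probability; the slides do not say how the project handled that case.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Tiny word-level naive Bayes classifier (illustrative only)."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(sentence.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, sentence):
        """log P(sentiment) + sum of log P(word | sentiment) for each class.

        P(sentence) is identical for every class, so it is left out.
        """
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in sentence.lower().split():
                # Add-one smoothing (assumption) avoids log(0) for unseen words.
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)
```

Working with log probabilities instead of raw products avoids numeric underflow on long sentences.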

Page 28: Fypca4

• Precision = N (agree & found) / N (found)

• High precision means most of the sentiment words found by the system are correctly labeled

• Recall = N (agree & found) / N (agree)

• High recall means most of the correct sentiment words are found by the system

Validation
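
The two ratios defined above can be computed directly from sets, as in this minimal Python sketch. The function name `precision_recall` and its arguments are hypothetical: `found` is assumed to hold the items the system labelled, and `agreed` the items the human judges accept as correct.

```python
def precision_recall(found, agreed):
    """Compute precision and recall from two collections of items.

    found:  items the system labelled.
    agreed: items the human judges agree are correct.
    """
    found, agreed = set(found), set(agreed)
    hits = len(found & agreed)  # N (agree & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agreed) if agreed else 0.0
    return precision, recall
```

Empty inputs return 0.0 rather than raising a division error, which is a design choice made for this sketch.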

Page 29: Fypca4

• Out of the 350 aspect-unlabelled sentiment word pairs, only 194 are found by the methods. Thus, the precision is about 57%.

• The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.

Validation Results

Page 30: Fypca4

• The results should improve if more rules are applied, such as treating more adverbs (e.g. “excessively”) as negation words.

• There might not be enough data for the system to work on: only 350 aspect-unlabelled sentiment word pairs are available.

• This, however, requires more human judges to validate the data.

Discussion

Page 31: Fypca4

• A comprehensive sentiment lexicon is a simple yet effective solution to sentiment analysis, as it does not require prior training

• The current sentiment lexicon does not capture the domain and context sensitivities of sentiment expressions

Conclusion

Page 32: Fypca4

• This leads to poor coverage

• Thus, expanding the general sentiment lexicon to capture the domain and context sensitivities of sentiment expressions is advocated

Conclusion

Page 33: Fypca4

DEMO

Questions?