Text Mining in Business Intelligence, by Assoc. Prof. Dr. Ohm Sornil
The First NIDA Business Analytics and Data Sciences Contest/Conference
1-2 September 2016, Navamindradhiraj Building, National Institute of Development Administration (NIDA)
https://businessanalyticsnida.wordpress.com
https://www.facebook.com/BusinessAnalyticsNIDA/
By Assoc. Prof. Dr. Ohm Sornil, Data Science Program, Faculty of Applied Statistics, National Institute of Development Administration
Text Mining in Business Intelligence
How is text mining done, and what are its principles? Can text mining be applied to Thai-language text?
How can we apply text mining to business? Do you need to know how to program in order to do text mining?
What knowledge can be gained from text mining?
Navamindradhiraj 3003, 1 September 2016, 9:30-10:00
TEXT MINING IN BUSINESS INTELLIGENCE
OHM SORNIL, Ph.D. Department of Computer Science, NIDA
BUSINESS INTELLIGENCE
“the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.”
(H. P. Luhn, 1958)
“a set of techniques and tools for the acquisition and transformation of raw data into meaningful and useful information for business analysis purposes.”
(D. M. Turner, 2016)
UNSTRUCTURED DATA
◉ Unstructured data includes text, video, and voice recordings of customer service transactions
◉ A generally accepted maxim is that structured data represents only 20% of all data. The rest is unstructured.
◉ If it can be counted, it can be analyzed.
◉ If it can be analyzed, it can be interpreted.
Source: http://www.csc.com/insights/flxwd/78931-big_data_universe_beginning_to_explode
JUST MARKETING TERMS
◉ Text mining = Text analytics = Natural language processing (NLP)
◉ A move from university research to real-world business problems
SOURCES OF TEXTUAL DATA
Internal
◉ Company documents
◉ Emails
◉ Reports
◉ Media releases
◉ Customer records and communication
External
◉ News
◉ Websites
◉ Blogs
◉ Social media posts
CHALLENGES
◉ Text is generally unstructured
◉ Large quantities, increasing rapidly
◉ Noisy (e.g., typos, slang, informal words, etc.)
◉ Synonymy and polysemy
TEXT MINING
◉ Process of extracting interesting information or patterns from unstructured text
◉ An interdisciplinary field: computational linguistics, statistics, and machine learning
◉ Can lead to the development of new opportunities in business
Business Applications
CUSTOMER RELATIONSHIP MANAGEMENT (CRM)
Input
◉ Text documents produced from a variety of sources in contact centers
Output
◉ Contents of clients’ messages
◉ Routing specific requests to the appropriate service
◉ Supplying immediate answers to the most frequently asked questions
OPINION ANALYSIS
Input
◉ Customers’ messages on websites, blogs, Twitter, Facebook, etc.
Output
◉ The frequency with which a word is mentioned is an indicator of concept salience, e.g., “unbreakable”, “fragile”
◉ The frequency of co-occurrence represents the strength of the connection in the customer’s mind, e.g., <“Samsung”, “camera”>, <“iPhone”, “expensive”>
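The two frequency measures on this slide can be sketched in a few lines. This is only an illustration: the messages are made up, and counting once per message (rather than once per occurrence) is an assumption, not something the slide specifies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical customer messages, already lowercased and tokenized.
messages = [
    ["samsung", "camera", "great"],
    ["iphone", "expensive"],
    ["samsung", "camera", "fragile"],
    ["iphone", "camera", "expensive"],
]

# Word frequency: number of messages mentioning each word (concept salience).
word_freq = Counter(word for msg in messages for word in set(msg))

# Co-occurrence frequency: number of messages mentioning both words of a pair.
# Pairs are stored in alphabetical order so each pair has a single key.
pair_freq = Counter(
    pair for msg in messages for pair in combinations(sorted(set(msg)), 2)
)

print(word_freq["camera"])                 # 3
print(pair_freq[("camera", "samsung")])    # 2
print(pair_freq[("expensive", "iphone")])  # 2
```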
MEDICAL RECORD ANALYSIS
Input
◉ Doctors’ comments
Output
◉ An early warning regarding specific diseases
If “lungs” or “breathing” appears more than 45 times in the last 30 days for a given ZIP code or region, it can be a clue to adverse environmental conditions that are causing respiratory problems. A proactive intervention can then be activated to remedy the situation.
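A rule like this one could be sketched as follows. The keywords, the threshold of 45, and the 30-day window come from the slide; the record layout (date, region, text), the function names, and the sample data are hypothetical.

```python
from datetime import date, timedelta

KEYWORDS = {"lungs", "breathing"}  # from the slide
THRESHOLD = 45                     # from the slide: more than 45 appearances
WINDOW_DAYS = 30                   # from the slide: the last 30 days


def keyword_count(comments, region, today):
    """Count keyword occurrences in one region's comments within the window."""
    start = today - timedelta(days=WINDOW_DAYS)
    total = 0
    for comment_date, comment_region, text in comments:
        if comment_region == region and start <= comment_date <= today:
            tokens = text.lower().split()
            total += sum(1 for t in tokens if t.strip(".,;:!?") in KEYWORDS)
    return total


def needs_alert(comments, region, today):
    """True when the keyword count exceeds the threshold for the region."""
    return keyword_count(comments, region, today) > THRESHOLD


# Hypothetical data: 46 recent comments in one region, one old comment,
# and a single comment in another region.
today = date(2016, 9, 1)
comments = [
    (today - timedelta(days=i % 30), "10240", "Patient reports breathing difficulty.")
    for i in range(46)
]
comments.append((today - timedelta(days=60), "10240", "Fluid in the lungs."))
comments.append((today, "10110", "Shortness of breath, lungs clear."))

print(needs_alert(comments, "10240", today))  # True
print(needs_alert(comments, "10110", today))  # False
```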
SENTIMENT ANALYSIS
Input
◉ Customers’ messages on websites, blogs, Twitter, Facebook, etc.
Output
◉ Positive, negative, or neutral opinions/feelings (polarity) expressed by a writer in a document collection
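One simple way to assign polarity is to count words from a sentiment lexicon. The sketch below assumes a tiny hand-made lexicon; it is not the method used in the talk, and it ignores negation, sarcasm, and context.

```python
# Hypothetical, very small polarity lexicon for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "expensive"}


def polarity(message):
    """Label a message by comparing counts of positive and negative words."""
    tokens = [t.strip(".,;:!?").lower() for t in message.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("I love this camera, great pictures!"))  # positive
print(polarity("Too expensive and poor battery."))      # negative
print(polarity("It arrived on Monday."))                # neutral
```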
SENTIMENT ANALYSIS (FEATURE-BASED)
EMOTIONAL STATE CLASSIFICATION
SOURCE: http://emotion-research.net/toolbox/toolboxlabellingtool.2006-09-26.9095478150
https://annaszymanska1324161.wordpress.com/2014/04/28/very-emotional-research/
HUMAN RESOURCE MANAGEMENT
Input
◉ Staff’s opinions
◉ CVs from applicants
Output
◉ Level of employee satisfaction
◉ Selection of new personnel
INSURANCE CLAIM DIAGNOSIS
Input
◉ Notes on all the details related to the claim/health issues, in the form of a brief description
Output
◉ Identification of common groups of problems
CORPORATE FINANCE
Input
◉ Publicly available descriptions of startups' businesses: products/services, investors, and social links between individuals in two firms
Output
◉ Targets for mergers and acquisitions
Source: http://phys.org/news/2016-07-text-mining-intelligence-startups.html#jCp
INVESTMENT
Input
◉ Securities-related news feeds
Output
◉ A model to predict market movements for everything from government bonds to commodities
MEANING
The key is to capture the meaning of text.
TEXT MINING PROCESS
Text Sources → Preprocessing → Modeling → Presentation (Visualization/Browsing)
COMMON PREPROCESSING
◉ Extracting text
◉ Tokenization
◉ Stopword elimination: is, am, are, the, of, for, … (http://www.ranks.nl/stopwords/thai-stopwords)
◉ Stemming: run, runs, ran, running → run
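The preprocessing steps above can be sketched for English text as follows. The stopword list is the one on the slide; the suffix rules are a toy assumption and not a real stemming algorithm such as Porter's. Irregular forms like “ran” need a dictionary lookup, and Thai text, which has no spaces between words, would need a word segmenter in place of the regular expression used here.

```python
import re

STOPWORDS = {"is", "am", "are", "the", "of", "for"}  # from the slide
SUFFIXES = ("ning", "ing", "s")  # toy rules, checked in this order


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(preprocess("The runs of the team are running"))  # ['run', 'team', 'run']
```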
TEXT REPRESENTATION FOR MINING
INVERSE DOCUMENT FREQUENCY
SOURCE: http://nlp.stanford.edu/IR-book/pdf/06vect.pdf
TF-IDF TERM WEIGHTING
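The formulas on these slides did not survive in the transcript. As a hedged sketch, the code below assumes the formulation in the cited Stanford IR book chapter: idf(t) = log10(N / df(t)), and the weight of term t in document d is tf(t, d) × idf(t). The documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [
    ["camera", "good", "camera"],
    ["battery", "good"],
    ["camera", "expensive"],
    ["screen", "good"],
]

N = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

# Inverse document frequency: rare terms receive higher values.
idf = {term: math.log10(N / df[term]) for term in df}


def tfidf(doc):
    """Return a sparse vector (term -> tf * idf) for one document."""
    tf = Counter(doc)
    return {term: tf[term] * idf[term] for term in tf}


weights = tfidf(docs[0])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 3))
```

In this toy collection “camera” outweighs “good” in the first document because it occurs twice there and appears in fewer documents overall.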
REAL-VALUED VECTOR
COSINE SIMILARITY BETWEEN 2 VECTORS
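Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch, assuming documents are stored as sparse dictionaries of term weights (a representation chosen here for illustration):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Hypothetical document vectors.
a = {"camera": 1.0, "good": 1.0}
b = {"camera": 1.0}
c = {"battery": 2.0}

print(cosine_similarity(a, b))  # about 0.707: one shared term
print(cosine_similarity(a, c))  # 0.0: no shared terms
```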
WORD CO-OCCURRENCE STRENGTH
◉ Mutual Information (MI) between words x and y
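The slide's formula is missing from the transcript. A common form, assumed here, is pointwise mutual information: MI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with probabilities estimated as the fraction of documents containing the word or the pair. The documents below are hypothetical.

```python
import math


def mutual_information(docs, x, y):
    """Pointwise MI of words x and y, estimated from document counts."""
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        return float("-inf")  # the words never co-occur
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))


# Hypothetical documents, each reduced to its set of words.
docs = [
    {"iphone", "expensive"},
    {"iphone", "expensive", "camera"},
    {"samsung", "camera"},
    {"samsung", "screen"},
]

print(mutual_information(docs, "iphone", "expensive"))  # 1.0
print(mutual_information(docs, "samsung", "camera"))    # 0.0
```

A positive value means the two words co-occur more often than they would if they were independent; zero means they co-occur about as often as chance predicts.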
ADD-ON COMPONENTS
◉ WordNet
◉ Feature selection/reduction
WordNet
◉ WordNet is essentially a dictionary plus a thesaurus
◉ Relations: hyponymy, meronymy, antonymy
TASK SPECIFIC COMPONENTS
◉ Part-of-Speech (POS) tagging
◉ SentiWordNet: results of automatic annotation of all WordNet synsets according to the notions of “positivity”, “negativity”, and “neutrality”
◉ Emoticons
MINING ALGORITHMS
◉ General machine learning algorithms are applicable
Classification
Naïve Bayes
Support Vector Machine
Bayesian Network
Neural Network
Logistic Regression
etc.
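As an illustration of the first classifier in this list, here is a minimal multinomial Naïve Bayes sketch with add-one (Laplace) smoothing, written in plain Python. The training messages, the labels, and the function names are hypothetical; in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-class document counts, term counts, and the vocabulary."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary


def classify(tokens, model):
    """Return the class with the highest log-probability for the tokens."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token in vocabulary:  # words unseen in training are skipped
                smoothed = (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical contact-center messages, already preprocessed.
training = [
    (["refund", "late", "delivery"], "complaint"),
    (["broken", "refund"], "complaint"),
    (["price", "catalog"], "inquiry"),
    (["price", "opening", "hours"], "inquiry"),
]
model = train_naive_bayes(training)

print(classify(["refund", "broken"], model))  # complaint
print(classify(["price"], model))             # inquiry
```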
Clustering
K-means
Fuzzy C-means
Hierarchical Clustering
Self-Organizing Map
etc.
Association Analysis and Sequence Analysis
Apriori
Generalized Rule Induction
Influential Apriori
FP-Growth
etc.
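To give a feel for association analysis on text, the sketch below runs only the first two passes of the Apriori idea: find frequent single words, then count only pairs built from those frequent words. Treating each document as a set of words, the sample documents, and the minimum support value are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return word pairs that appear together in at least min_support documents."""
    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {item for item, count in item_counts.items() if count >= min_support}

    # Pass 2: a pair can only be frequent if both of its items are frequent,
    # so candidate pairs are built from frequent items alone.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= min_support}


# Hypothetical documents reduced to sets of words.
documents = [
    {"battery", "drain", "update"},
    {"battery", "drain"},
    {"screen", "crack"},
    {"battery", "update"},
]

print(frequent_pairs(documents, min_support=2))
```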
Analysis Tasks
GENERAL DATA MINING TASKS
◉ Classification
◉ Clustering
◉ Association Analysis
◉ Prediction
◉ Sequence Analysis
INFORMATION EXTRACTION
Analytics Tools with Text Mining Capabilities
OPEN SOURCED SOFTWARE
SOURCE: http://www.predictiveanalyticstoday.com/top-free-software-for-text-analysis-text-mining-text-analytics/
R package tm
COMMERCIAL SOFTWARE
SOURCE: http://www.predictiveanalyticstoday.com/top-free-software-for-text-analysis-text-mining-text-analytics/