Large-scale topic modeling without Big Data: this technique uses memory-mapped files to bypass the limitations of Java heap memory.

Page 1: Large Scale Topic Modeling

July 7th, 2013

Large Scale Topic Modeling

By - Sameer Wadkar

© 2013 Axiomine LLC

Page 2: Large Scale Topic Modeling

What is Topic Modeling

• The technique is called Latent Dirichlet Allocation (LDA)
• An excellent explanation is available in the following blog article by Edwin Chen from Google: http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/
• This presentation borrows heavily from that article to explain the basics of Topic Modeling


Page 3: Large Scale Topic Modeling

Brief Overview of LDA

• What can LDA do?
  • LDA extracts key topics and themes from a large corpus of text
  • Each topic is an ordered list of representative words (order is based on the importance of the word to the topic)
  • LDA describes each document in the corpus by its allocation to the extracted topics
• It is an Unsupervised Learning technique
  • No extensive preparation is needed to create a training dataset
  • Easy to apply for exploratory analysis


Page 4: Large Scale Topic Modeling

LDA – A Quick Example

Given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent it as 75% about music (a topic containing the words Bieber, Gaga, and radio) and 25% about cars (a topic containing the words driving and car).
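This allocation can be made concrete: in sampled LDA each word carries a topic assignment, and a document's topic mixture is simply the normalized counts of those assignments. A minimal, hypothetical Java sketch (class and method names are illustrative, not from any toolkit):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: derive a document's topic mixture from per-word topic assignments. */
public class TopicMixture {
    /** Returns the percentage allocation per topic id, given one sampled topic per word. */
    public static Map<Integer, Double> mixture(int[] wordTopics, int numTopics) {
        int[] counts = new int[numTopics];
        for (int t : wordTopics) counts[t]++;            // count assignments per topic
        Map<Integer, Double> alloc = new LinkedHashMap<>();
        for (int t = 0; t < numTopics; t++) {
            if (counts[t] > 0) alloc.put(t, 100.0 * counts[t] / wordTopics.length);
        }
        return alloc;
    }

    public static void main(String[] args) {
        // 16 words: 12 assigned to topic 0 (music), 4 to topic 1 (cars)
        int[] assignments = {0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1};
        System.out.println(mixture(assignments, 2)); // prints {0=75.0, 1=25.0}
    }
}
```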


Page 5: Large Scale Topic Modeling

Sarah Palin Email Corpus

• In June 2011 several thousand emails from Sarah Palin’s time as governor of Alaska were released (http://sunlightfoundation.com/blog/2011/06/15/sarahs-inbox/)
• The emails were not organized in any form
• The Edwin Chen blog article discusses how LDA was used to organize these emails into categories discovered from the corpus itself


Page 6: Large Scale Topic Modeling

LDA Analysis Results

• LDA analysis of Sarah Palin’s emails discovered the following topics (notice the ordered word lists):

Wildlife / BP Corrosion:
game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, fishing, ...

Energy / Fuel / Oil / Mining:
energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, ...

Trig / Family / Inspiration:
family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, ...

Gas:
gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, ...

Education / Waste:
school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, ...

Presidential Campaign / Elections:
mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o., ...


Page 7: Large Scale Topic Modeling

LDA Sample from the Wildlife Topic


Page 8: Large Scale Topic Modeling

LDA Sample from Multiple Topics

LDA classification of the above email:

Topic                              Allocation Percentage
Presidential Campaign/Elections    10%
Wildlife                           90%


Page 9: Large Scale Topic Modeling

Types of Analysis LDA can perform

• Similarity Analysis
  • Which topics are similar?
  • Which documents are similar, based on their Topic Allocations?
  • LDA can distinguish business articles about “Mergers” from those about “Quarterly Earnings”, which leads to more potent Similarity Analysis
  • LDA determines Topic Allocation based on the collocation of word groups. Hence “IBM” and “Microsoft” documents can be discovered to be similar if they discuss similar computing topics
• Similarity Analysis based on LDA is very accurate because LDA converts the high-dimensional and noisy space of Word/Document allocations into low-dimensional Topic/Document allocations
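One common way to realize this: once each document is reduced to a low-dimensional topic-allocation vector, cosine similarity over those vectors compares documents by theme rather than by raw word overlap. A minimal sketch (the class name and toy mixtures are hypothetical):

```java
/** Sketch: document similarity via cosine similarity over topic-allocation vectors. */
public class TopicSimilarity {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Hypothetical 3-topic mixtures: [Mergers, Tech, Quarterly Earnings]
        double[] mergersDoc1  = {0.80, 0.15, 0.05};
        double[] mergersDoc2  = {0.75, 0.20, 0.05};
        double[] earningsDoc  = {0.05, 0.10, 0.85};
        System.out.printf("%.3f%n", cosine(mergersDoc1, mergersDoc2)); // near 1 (similar)
        System.out.printf("%.3f%n", cosine(mergersDoc1, earningsDoc)); // low (dissimilar)
    }
}
```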


Page 10: Large Scale Topic Modeling

Brief Overview of LDA

• Topic Co-occurrence
  • Do certain topics occur together in documents?
  • Analysis of software resumes will reveal that “Object Oriented Language” skills typically co-occur with “SQL and RDBMS” skills
• Does Topic Co-occurrence change with time?
  • A resume corpus would reveal that “Java” skills were highly correlated with “Flash Development” skills in 2007. By 2013 the correlation had shifted to “Java” and “HTML5”, though not as strongly as in 2007, indicating that HTML5 is a more specialized skill
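Co-occurrence counts of this kind are straightforward to compute once each document is tagged with its dominant topics. A minimal sketch (the input format, a list of dominant topic ids per document, is an assumption about how the allocations would be thresholded):

```java
import java.util.Arrays;

/** Sketch: count how often pairs of topics co-occur in the same document,
 *  given each document's dominant topic ids. */
public class TopicCooccurrence {
    public static int[][] cooccurrence(int[][] docTopics, int numTopics) {
        int[][] counts = new int[numTopics][numTopics];
        for (int[] topics : docTopics) {
            for (int i = 0; i < topics.length; i++)
                for (int j = i + 1; j < topics.length; j++) {
                    counts[topics[i]][topics[j]]++;   // symmetric pair count
                    counts[topics[j]][topics[i]]++;
                }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three resumes; topic 0 = OO languages, 1 = SQL/RDBMS, 2 = HTML5
        int[][] resumes = {{0, 1}, {0, 1}, {0, 2}};
        int[][] c = cooccurrence(resumes, 3);
        System.out.println(Arrays.deepToString(c)); // topics 0 and 1 co-occur twice
    }
}
```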


Page 11: Large Scale Topic Modeling

Brief Overview of LDA

• Time-based Analysis
  • For a corpus of documents spanning a period of time, do certain topics appear over time?
  • How does the appearance of new topics affect the distribution of other topics?
  • Analysis of articles from the journal Science (1880-2002) reveals this process
    • http://topics.cs.princeton.edu/Science/
    • The browser is at http://topics.cs.princeton.edu/Science/browser/
    • A 75-topic model
    • Demonstrates how topics gain/lose prominence over time
    • Demonstrates how a topic's composition changes over time
    • Demonstrates how new topics appear
      • Ex.: “Laser” made an appearance in its topic only in 1980


Page 12: Large Scale Topic Modeling

Example based on Sarah Palin’s email corpus

• Analyze the emails which belong to the Trig/Family/Inspiration topic

• Spike in April 2008 – remarkably (for Topic Modeling) and unsurprisingly (for common sense), this was exactly the month Trig was born

• Topic Modeling can discover such patterns in a large text corpus without requiring a human to read the entire corpus


Page 13: Large Scale Topic Modeling

Topic Modeling Toolkits

• Several Open Source options exist:

Library Name      Description
Mallet            MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text
R-based library   An R-based library to perform Topic Modeling
Apache Mahout     A Big Data solution for Topic Modeling

• Why is a Big Data solution needed?
  • Topic Modeling is computationally expensive
  • It requires large amounts of memory
  • It requires considerable computational power
  • Memory is the bigger constraint
  • Most implementations run out of memory when applied to even a modest number of documents (50,000 to 100,000)
  • If they do not run out of memory, they slow to a crawl due to frequent Garbage Collection (in Java-based environments)
  • A Big Data based approach is needed!


Page 14: Large Scale Topic Modeling

Mahout for Big LDA

• Apache Mahout
  • A Hadoop MapReduce based suite of Machine Learning procedures
  • Implements several Machine Learning routines based on Bayesian techniques (e.g., Generative Algorithms)
  • Generative Algorithms are iterative, and the iterations converge to a solution
    • Each iteration needs the results produced by the previous iteration, so iterations cannot be executed in parallel
    • Several iterations (a few thousand) are needed to converge to a solution
  • Mahout uses Map-Reduce to parallelize a single iteration
    • Each iteration is a separate Map-Reduce job
    • Inter-iteration communication goes through HDFS, leading to high I/O
    • The high I/O is compounded by the multi-iteration nature of the algorithm
• Mahout-based LDA
  • Each iteration is slower in order to accommodate the large memory requirements
  • Typically 1000 iterations are needed; this takes too long to run and is unsuitable for exploratory analysis
  • Fewer iterations lead to a sub-optimal solution
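The sequential-iteration constraint is visible in the shape of a collapsed Gibbs sampler for LDA: each sweep resamples every word's topic from counts that the previous sweep just updated, so sweeps cannot run in parallel even though the work within one sweep can be partitioned. A simplified, illustrative sampler (not Mahout's or Mallet's implementation; the hyperparameters and structure are assumptions):

```java
import java.util.Random;

/** Minimal collapsed Gibbs sampling loop for LDA (illustrative sketch). */
public class GibbsLda {
    /** docs: each document is an array of word ids; V: vocab size; K: topics. */
    public static int[][] run(int[][] docs, int V, int K, int iters, long seed) {
        double alpha = 0.1, beta = 0.01;                 // assumed hyperparameters
        Random rnd = new Random(seed);
        int[][] z = new int[docs.length][];              // topic assignment per word
        int[][] docTopic = new int[docs.length][K];      // n(d, k)
        int[][] topicWord = new int[K][V];               // n(k, w)
        int[] topicTotal = new int[K];                   // n(k)
        for (int d = 0; d < docs.length; d++) {          // random initialization
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                int k = rnd.nextInt(K);
                z[d][i] = k;
                docTopic[d][k]++; topicWord[k][docs[d][i]]++; topicTotal[k]++;
            }
        }
        double[] p = new double[K];
        for (int it = 0; it < iters; it++) {             // sweeps are strictly sequential
            for (int d = 0; d < docs.length; d++)
                for (int i = 0; i < docs[d].length; i++) {
                    int w = docs[d][i], old = z[d][i];
                    docTopic[d][old]--; topicWord[old][w]--; topicTotal[old]--;
                    double sum = 0;
                    for (int k = 0; k < K; k++) {        // full conditional p(z = k | rest)
                        p[k] = (docTopic[d][k] + alpha)
                             * (topicWord[k][w] + beta) / (topicTotal[k] + V * beta);
                        sum += p[k];
                    }
                    double u = rnd.nextDouble() * sum;   // sample the new topic
                    int k = 0;
                    while ((u -= p[k]) > 0 && k < K - 1) k++;
                    z[d][i] = k;
                    docTopic[d][k]++; topicWord[k][w]++; topicTotal[k]++;
                }
        }
        return docTopic;                                 // per-document topic counts
    }

    public static void main(String[] args) {
        int[][] docs = {{0, 1, 0, 1}, {2, 3, 2, 3}};     // tiny toy corpus, vocab = 4
        int[][] dt = run(docs, 4, 2, 200, 42);
        System.out.println(dt[0][0] + dt[0][1]);         // counts sum to doc length: 4
    }
}
```

Because the count tables feed the very next sample, the only parallelism available without approximation is within a sweep, which is exactly why Mahout is forced to spend one Map-Reduce job per iteration.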


Page 15: Large Scale Topic Modeling

Parallel LDA based on Mallet

• The Parallel LDA in Mallet is based on Newman, Asuncion, Smyth and Welling, “Distributed Algorithms for Topic Models”, JMLR (2009), with the SparseLDA sampling scheme and data structure from Yao, Mimno and McCallum, “Efficient Methods for Topic Model Inference on Streaming Document Collections”, KDD (2009)
• It is still memory intensive
  • A large corpus leads to frequent Garbage Collection
  • Executing Mallet's ParallelTopicModel on an 8 GB, Intel i7 quad-core processor with 500,000 US Patent abstracts takes 400 minutes of processing for 1000 iterations
  • The application makes no progress for 1 million patents: it eventually runs out of memory or stalls due to frequent Garbage Collection


Page 16: Large Scale Topic Modeling

Axiomine Solution – Big LDA without Hadoop

• Map-Reduce is unsuitable for LDA-type algorithms
  • Hadoop is complex and unsuited for ad-hoc analysis
  • The large number of sequential iterations only allows Map-Reduce to be used at the iteration level, leading to too many short Map-Reduce jobs
• Large-scale LDA without Big Data
  • LDA is a memory-intensive process
  • Off-heap memory based on Java NIO allows processes to use memory without incurring the GC penalty
    • The trade-off is slightly lower performance
    • Exploit OS page-caching to use off-heap memory
  • LDA operates on text data, but storing text is orders of magnitude more expensive than storing numbers
    • Massive off-heap indexes that map words to numbers allow a significant reduction in memory usage
  • Reorganizing the Mallet implementation steps achieved significant performance gains and memory savings
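The off-heap idea rests on standard Java NIO facilities: direct buffers and memory-mapped files live outside the garbage-collected heap, so large count tables stop inflating GC pauses, and mapped files let the OS page cache manage the data. A minimal sketch of both (the file name and buffer sizes are arbitrary; this is not the Axiomine implementation):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: keeping large count arrays off the Java heap via NIO. */
public class OffHeapCounts {
    public static void main(String[] args) throws IOException {
        // Off-heap (direct) buffer holding one million int counters outside the heap;
        // the GC never scans this memory, so it cannot cause GC pauses.
        IntBuffer counts = ByteBuffer.allocateDirect(1_000_000 * 4).asIntBuffer();
        counts.put(42, counts.get(42) + 7);      // increment-style access by index
        System.out.println(counts.get(42));      // 7

        // Memory-mapped file: the OS page cache backs the data, not the heap.
        Path tmp = Files.createTempFile("lda-counts", ".bin");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            IntBuffer mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096)
                                 .asIntBuffer();
            mapped.put(0, 123);
            System.out.println(mapped.get(0));   // 123
        }
        Files.deleteIfExists(tmp);
    }
}
```

The same pattern extends to the word-to-number indexes mentioned above: fixed-width numeric slots in a mapped file are far cheaper to store and scan than heap-resident Strings.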


Page 17: Large Scale Topic Modeling

Axiomine Solution – Performance Numbers

Machine Type: Single 8 GB, Intel i7 quad-core machine
Corpus: 500,000 US Patent Abstracts, 600 topics
Performance: 1000 iterations completed in 2 hours

Machine Type: Amazon AWS hs1.8xlarge machine (http://aws.amazon.com/ec2/instance-types/)
Corpus: 2.1 Million US Patent Abstracts, 600 topics, using 5 CPU threads
Performance: 1000 iterations completed in approximately 5 hours

• High Points
  • Scaling is practically linear, unlike other implementations
  • Each iteration takes between 7 and 15 seconds
  • We contemplated Apache HAMA to achieve parallelism without incurring the disk I/O cost of Hadoop Map-Reduce
    • But network I/O would ensure worse intra-iteration performance than we could achieve on a single machine!
  • Big Topic Modeling without Big Data!!
• At Axiomine we intend to port more such popular algorithms based on lessons learned while porting LDA
• We want to bring Large Scale Exploratory Analysis at low complexity


Page 18: Large Scale Topic Modeling

Conclusion – Large Scale Analysis without Big Data

• The Axiomine LDA implementation has the following benefits:
  • Scaling is practically linear, unlike other implementations
  • Each iteration takes between 7 and 15 seconds
  • We contemplated Apache HAMA to achieve parallelism without incurring the disk I/O cost of Hadoop Map-Reduce
    • But network I/O would ensure worse intra-iteration performance than we could achieve on a single machine!
  • Big Topic Modeling without Big Data!!
• At Axiomine we intend to port more such popular algorithms based on lessons learned while porting LDA
• We want to bring Large Scale Exploratory Analysis at low complexity
• Open Source: https://github.com/sameerwadkar/largelda
