Daniel Krasner - High Performance Text Processing with Rosetta

(Easy), High Performance Text Processing with Python's Rosetta
Daniel Krasner
KFit Solutions, Columbia University
Nov 22, 2014
(IDSE) 1 / 52


DESCRIPTION

This talk covers rapid prototyping of a high-performance, scalable text-processing pipeline in Python. We demonstrate how Python modules, in particular from the Rosetta library, can be used to analyze, clean, extract features from, and finally perform machine learning tasks such as classification or topic modeling on millions of documents. Our style is to build small and simple modules (each with a command line interface) that use very little memory and are parallelized with the multiprocessing library.

TRANSCRIPT

Page 1

(Easy), High Performance Text Processing with Python's Rosetta

Daniel Krasner

KFit Solutions, Columbia University

Nov 22, 2014

Page 2

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 3

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 4

Motivation: Current Text Processing Projects

The Declassification Engine
- Full stack Digital Archive.
  - Collections structuring/parsing.
  - Backend, API, UI.
  - Statistical analysis.
- Organizers
  - David Madigan - Statistics Department chair (CU)
  - Matthew Connelly - Professor of international and global history (CU)
- For more info see http://www.declassification-engine.org/.

Page 5

Motivation: Current Projects Continued

eDiscovery
- The legal world is overwhelmed with documents, in both pre- and post-production review.
- Most technologies rely heavily on keyword search, which is not efficient.
- "Predictive coding" solutions are generally archaic and inaccurate.

Other
- Human - text/document interaction.
- Semantic filtering solutions.

Page 6

Motivation: Data Structuring/Information Extraction

Text can come in many formats, encodings, and degrees of structure.

Figure: raw xml

Page 7

Motivation: Data Structuring/Information Extraction

Many tasks involve initial structuring of the data.

Figure: structured api response

Page 8

Motivation: Network Analysis

Kissinger telcons can be analyzed for frequency. Even a simple version of this type of analysis requires the data to be structured, or an information extraction process in place.

Figure:

Page 9

Motivation: Semantic Modeling

State department cables from embassies can be analyzed for topics.

Moscow is predominantly topic 12:

    soviet     0.133910
    moscow     0.128717
    october    0.090400
    joint      0.052875
    ussr       0.044190
    soviets    0.042493
    ur         0.027686
    mfa        0.025786
    refs       0.023871
    prague     0.021268

London is predominantly topic 13:

    london     0.114568
    bonn       0.083748
    rome       0.074385
    uk         0.051367
    frg        0.050235
    berlin     0.035972
    usmission  0.031757
    british    0.029836
    european   0.027203
    brussels   0.025023

Page 10

Motivation: Classification

Determine which documents are relevant to a legal case.

Figure: eDiscovery processing pipeline

Page 11

Feature Extraction

Figure: Metadata + body text ⇒ features

The metadata features can be used in any classifier

Page 12

Tasks

What are some typical text processing goals?

Structuring/Information Extraction
- Metadata extraction
  - Geo-tagging
  - Named-entity identification/disambiguation
- Text cleaning/text body extraction

Machine Learning
- Classification
  - logistic regression, random forests, etc.
- Sentiment analysis
- Recommendation systems
- Anomaly detection
- Understanding underlying semantic structure (e.g. LDA/LSI modeling)
- Communication dynamics

Note: this is far from a complete list!

Page 13

General Flow

Figure: General Flow

Page 14

What’s so hard about text processing?

- Must use sparse data structures
- Data usually doesn't fit in memory
- It's language . . .
- Text structure can change from collection to collection (parsing/feature extraction can be tricky)
- Can be difficult to convert to nice machine-readable formats
- HUGE data drives much of software development, which leads to products too complicated for most applications
- No simple solution (no equivalent of the "sklearn/statsmodels/pandas/numpy" stack)

Page 15

What’s so fun about text processing?

- You get to care about memory and processing speed
- It's language
  - spelling, stemming, etc.
  - parsing
  - tokenization
  - domain knowledge
- Unix plays nicely with text
- Python plays nicely with text
- You get to step outside the Python ecosystem

Page 16

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 17

Cluster? No.

One powerful machine and memory/CPU conscious code can handle many tasks with 1TB of text.

- System76 laptop, pre-loaded with Ubuntu: 16GB memory, 4 cores, 1TB SSD. $1,823
  - Can also get a MacBook Pro for about $3,600
- You can spend ~$7,000 and get 20 cores, 64GB memory (upgradeable to 256GB), and 1-2TB of RAID SSD. Stick it in a closet with lots of fans, or in a server center for $250/month.
- Single machine on AWS ($1-2k per year) or on Digital Ocean (they have nice SSDs).

Page 18

Back-of-the-envelope memory calculations

Text:
- Same as on disk if the file is large enough (e.g. 10MB)
- Can be much more if you load many small files and append to a list.

Numbers:
- 1 double = 8 bytes ⇒ 1,000,000 doubles ≈ 8MB.
- You can save space by using types "int8", "float16", "float32", etc. See the numpy dtype docs.
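The back-of-the-envelope numbers above are easy to verify with numpy's nbytes attribute; a quick sketch (Python 3, numpy assumed installed):

```python
import numpy as np

# 1,000,000 doubles at 8 bytes each is about 8 MB
doubles = np.ones(1_000_000, dtype=np.float64)
print(doubles.nbytes)  # 8000000

# Downcasting to float32 halves the footprint; int8 cuts it to 1/8
floats32 = doubles.astype(np.float32)
int8s = doubles.astype(np.int8)
print(floats32.nbytes, int8s.nbytes)  # 4000000 1000000
```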

Page 19

Monitoring Memory with HTOP

- 4 cores, 4 virtual cores, all in use
- 2821/15946 MB memory in use
- processes listed below

For Macs, htop doesn't necessarily work. If it doesn't, try iStat.

Page 20

Don't blow up memory, stream!

Process files (text) line-by-line:

    with open(infile, 'r') as f, open(outfile, 'w') as g:
        for line in f:
            line_output = process_line(line)
            # Write output NOW
            g.write(line_output)

Process directories one file at a time:

    from rosetta.text.filefilter import get_paths

    my_paths_iter = get_paths(MYDIR, get_iter=True)

    for path in my_paths_iter:
        output = process_file(path)
        # Now write the output to file

Page 21

Rosetta Text File Streamer

Set up a text file streamer class for processing.

    # stream from a file sys directory
    stream = TextFileStreamer(text_base_path=MYDIR,
                              tokenizer=MyTokenizer)

    # call info_stream, which will return a dict with
    # doc_id, text, tokens, etc.
    for item in stream.info_stream():
        # print the document text
        print item['text']     # 'This is my text.'
        # print the tokens
        print item['tokens']   # ['this', 'is', 'my', 'text']

Page 22

Rosetta MySQL Streamer

Set up a database streamer class for processing.

    # stream from a DB
    stream = MySQLStreamer(db_setup=DBCONFIG,
                           tokenizer=MyTokenizer)

    # Convert to a scipy sparse matrix
    # and cache some data along the way
    sparse_mat = stream.to_scipysparse(
        cache_list=['doc_id', 'date'])

    # grab the cached doc_ids and dates
    doc_ids = stream.__dict__['doc_id_cache']
    dates = stream.__dict__['date_cache']

Page 23

(Online) Stochastic Gradient Descent

Combine the above with an online learning algorithm. To minimize the empirical loss

    sum_{i=1}^n |y_i - w · x_i|^2

update the coefficient w one training example at a time:

    w^(t+1) := w^(t) - η_t ∇_w |y_t - w^(t) · x_t|^2

Note:
- The learning rate η_t decays to 0
- We can cycle through the training examples, updating the weights more than n times
- We only need to load one single example into memory at a time
- Converges faster for cases of many data points and coefficients
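The update rule above can be sketched in a few lines of plain Python; this is an illustrative toy (one-dimensional x, made-up function and variable names), not Rosetta or scikit-learn code:

```python
def sgd_squared_loss(data, n_passes=200, eta0=0.1):
    """Minimize sum_i |y_i - w*x_i|^2 one example at a time."""
    w = 0.0
    t = 1
    for _ in range(n_passes):                 # cycle through examples > n times
        for x, y in data:
            eta = eta0 / t                    # learning rate decays to 0
            grad = -2.0 * (y - w * x) * x     # gradient of |y - w*x|^2 in w
            w = w - eta * grad
            t += 1
    return w

# Recover w = 2 from examples of y = 2x; only one example is
# ever "in memory" at a time.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = sgd_squared_loss(data)
```

In a real pipeline, `data` would itself be a streamed iterator over documents rather than an in-memory list.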

See Bottou, scikit-learn's SGD module, and Vowpal Wabbit.

Page 24

Dealing with limited memory: Summary

- Monitor memory usage
- Stream process
  - and cache what you need along the way
- Deal with huge feature counts by...
  - Use a sparse data structure, stochastic gradient descent. Or...
  - Reduce the number of features to something that fits into a dense matrix

Page 25

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 26

Goal: Tokenization

    # Steps: split on whitespace, set to lowercase,
    # remove non-letters, punctuation, and stopwords
    stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
    punct = ['|', "'", '[', ']', '{', '}', '(', ')']

    # Here's the function call/result we want
    text = "Let's do a deal: Trade 55 Euros for 75 euros"
    tokens = tokenize(text, punct, stopwords_list)
    print tokens
    # ['deal', 'trade', 'euros', 'euros']

Page 27

First hack: For loop

    def tokenize_for(text, punct, stopwords):
        tokens = []
        # Split the text on whitespace.
        for token in text.lower().split():
            # Remove punctuation.
            clean_token = ''
            for char in token:
                if char not in punct:
                    clean_token += char
            # Remove stopwords.
            if clean_token.isalpha() and len(clean_token) > 1 \
                    and (clean_token not in stopwords):
                tokens.append(clean_token)
        return tokens

Page 28

Profile: Time your code

regex_test.py:

    def main():
        stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
        text = ...  # Something pretty typical for your application
        for i in range(10000):
            tokens = tokenize_for(text, punct, stopwords_list)

    if __name__ == '__main__':
        main()

    $ time python regex_test.py

    real 0m0.128s
    user 0m0.120s
    sys  0m0.008s

Page 29

Profile: Switch to a regex

    bad_char_pattern = r"\||'|\[|\]|\{|\}|\(|\)"

    def tokenize_regex_1(text, bad_char_pattern, stopwords):
        # Substitute the empty string for the bad characters
        text = re.sub(bad_char_pattern, '', text).lower()
        # Split on whitespace, keeping strings of length > 1
        split_text = re.findall(r"[a-z]+", text)
        tokens = []
        for word in split_text:
            if word not in stopwords:
                tokens.append(word)
        return tokens

    $ time python regex_test.py

    real 0m0.091s
    user 0m0.083s
    sys  0m0.008s

Page 30

Profile: Line-by-line readout

Add an @profile decorator to your tokenize_regex function, and:

    pip install line_profiler
    kernprof.py -l regex_test.py
    python -m line_profiler regex_test.py.lprof | less

Figure: line_profiler output shows the for loop and if statement are slow.

Page 31

Profile: Use a set

regex_test.py:

    def main():
        stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
        stopwords_set = set(stopwords_list)
        for i in range(1000):
            text = ...  # Something pretty typical for your application
            tokens = tokenize_regex_1(text, bad_char_pattern, stopwords_set)

    if __name__ == '__main__':
        main()

- Reduces time from 0.091 to 0.043
- Testing item in my_set requires a hash function computation
- Testing item in my_list requires looking at every item in my_list.
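The set-vs-list claim is easy to check with the timeit module; a small sketch (sizes and numbers are arbitrary):

```python
import timeit

items = list(range(100000))
as_list = items
as_set = set(items)

# Worst case for the list: the item is at the end, so every element
# is examined. The set does a single hash lookup.
t_list = timeit.timeit(lambda: 99999 in as_list, number=100)
t_set = timeit.timeit(lambda: 99999 in as_set, number=100)
```

On any reasonable machine `t_set` comes out orders of magnitude smaller than `t_list`.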

Page 32

Profile: IPython timeit

Figure: Set lookup is O(1). So don't ever test item in my_list for long lists.

Page 33

Profile: Line-by-line profile again

Switching to a set sped up the if statement. The for loop can still be faster.

Page 34

Profile: Switch to a list comprehension

- Reduced time from 0.033s to 0.025s
- A list comprehension is essentially a for loop with a fast append
- Looks nicer in this case
- Be sure to use time python regex_test.py for the total time!
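The slide doesn't reproduce the final code, but a plausible list-comprehension version of the tokenizer looks like the sketch below (the name tokenize_regex_2 and the extra 'for' stopword are assumptions; written in Python 3):

```python
import re

BAD_CHAR_PATTERN = r"\||'|\[|\]|\{|\}|\(|\)"

def tokenize_regex_2(text, stopwords):
    """Regex cleanup + a list comprehension instead of an explicit loop."""
    text = re.sub(BAD_CHAR_PATTERN, "", text).lower()
    return [w for w in re.findall(r"[a-z]+", text)
            if len(w) > 1 and w not in stopwords]

stopwords = set("lets,do,a,the,and,for".split(","))
tokens = tokenize_regex_2("Let's do a deal: Trade 55 Euros for 75 euros",
                          stopwords)
# tokens == ['deal', 'trade', 'euros', 'euros']
```

The comprehension replaces the explicit loop-and-append, which is where the remaining time went.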

Page 35

Data structures

Think about the data structures (and associated methods) you are using for the task at hand!

- Many data-analysis-friendly languages (e.g. Python, R) have very convenient built-in data structures.
- These can come with significant lookup and operation overhead.
  - example: set lookup vs list lookup, as above
  - example: Python does not allocate a contiguous memory block for dictionaries, making them slower than a data structure which tells the interpreter how much space will be needed
- You can (easily) create your own data structure for the task at hand. See Saulius Lukauskas' nice post, Why Python Runs Slow, Part 1: Data Structures.

Page 36

Parallelization

Much of text processing is embarrassingly parallel.

Figure: Word counts for individual documents can be computed independently.

Page 37

Parallelization: Basic mapping

Serial mapping:

    >>> def func(x):
    ...     return 2 * x
    >>> iterable = range(3)
    >>> map(func, iterable)
    [0, 2, 4]

Parallel mapping:

    >>> from multiprocessing import Pool
    >>> def func(x):
    ...     return 2 * x
    >>> my_pool = Pool(processes=4)
    >>> iterable = range(3)
    >>> my_pool.map(func, iterable)
    [0, 2, 4]

1. Spawns 4 subprocesses
2. Pickles func and iterable and pipes them to the subprocesses
3. The subprocesses compute their results
4. The subprocesses pickle/pipe the results back to the mother process

Page 38

Parallelization: Basic mapping issues

Serial mapping:

    >>> def func(x):
    ...     return 2 * x
    >>> map(func, range(3))
    [0, 2, 4]

Parallel mapping:

    >>> from multiprocessing import Pool
    >>> def func(x):
    ...     return 2 * x
    >>> my_pool = Pool(processes=4)
    >>> iterable = range(3)
    >>> my_pool.map(func, iterable)
    [0, 2, 4]

Issues:
- What about functions of more than one variable?
- Pickling is not possible for every function
- Can't step a debugger into pool calls
- Traceback is uninterpretable
- Can't exit with Ctrl-C
- The entire result is computed at once ⇒ memory blow-up!

Page 39

Parallelization: Mapping functions of more than one var

    >>> from multiprocessing import Pool
    >>> from functools import partial

    >>> def func(a, x):
    ...     return 2 * a * x

    >>> a = 3
    >>> func_a = partial(func, a)  # func_a(x) = func(a, x)

    >>> my_pool = Pool(processes=4)
    >>> my_pool.map(func_a, range(3))
    [0, 6, 12]

Page 40

Parallelization: Dealing with map issues

    def map_easy(func, iterable, n_jobs):
        if n_jobs == 1:
            return map(func, iterable)
        else:
            _trypickle(func)
            pool = Pool(n_jobs)
            timeout = 1000000
            return pool.map_async(func, iterable).get(timeout)

- _trypickle(func) tries to pickle the func before mapping
- n_jobs = 1 ⇒ serial (debuggable/traceable) execution
- pool.map_async(func, iterable).get(timeout) allows exit with Ctrl-C

Page 41

Parallelization: Limiting memory usage

Send out/return jobs in chunks:

    def imap_easy(func, iterable, n_jobs, chunksize,
                  ordered=True):
        if n_jobs == 1:
            results_iter = itertools.imap(func, iterable)
        else:
            _trypickle(func)
            pool = Pool(n_jobs)
            if ordered:
                results_iter = pool.imap(func, iterable,
                                         chunksize=chunksize)
            else:
                results_iter = pool.imap_unordered(
                    func, iterable, chunksize=chunksize)
        return results_iter

Note: Exit with Ctrl-C is more difficult. See rosetta.parallel

Page 42

Making things faster: Summary

- Use regular expressions
- Use the right data structure
  - Numpy/Pandas for numbers (use built-in functions/numba/cython, NOT for loops)
  - sets if you will test some_item in my_set
- Profile your code
  - time python myscript.py
  - timeit in IPython
  - line_profiler (using kernprof.py)
- Use multiprocessing.Pool
  - A number of the Rosetta streamer methods have multiprocessing built in (see rosetta.text.streamers)

NOTE: the above examples are in Python for convenience, but are relevant in many (most) other scenarios

Page 43

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 44

LDA (in a slide)

Latent Dirichlet Allocation, by Blei, Ng and Jordan, is a hierarchical Bayesian model which describes the underlying semantic structure of a document corpus via a set of latent distributions of the vocabulary.

- The latent semantic distributions are referred to as "topics."
- Each document is modeled as a mixture of these topics.
  - The number of topics is chosen a priori.
- Words in a document are drawn by
  - choosing a topic, given the document's mixture weights,
  - sampling from that topic.
- Hyperparameters:
  - lda_alpha: prior which controls the topic probabilities/weights, e.g. lda_alpha 0.1: θ_d ~ Dirichlet(α)
  - lda_rho: prior which controls the word probabilities, e.g. lda_rho 0.1: β_k ~ Dirichlet(ρ)
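In symbols, the generative process just described is (one common parameterization, using the slide's α and ρ):

```latex
\beta_k \sim \mathrm{Dirichlet}(\rho), \quad k = 1, \dots, K
\qquad
\theta_d \sim \mathrm{Dirichlet}(\alpha)
z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)
\qquad
w_{d,n} \mid z_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
```

Here K is the number of topics (chosen a priori), θ_d is document d's topic mixture, and β_k is topic k's distribution over the vocabulary.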

Page 45

Vowpal Wabbit: What/Why

Can you build a topic model with 1,000,000 documents using gensim? Sure... if you have 10 hours or so to kill.

Better solution: Vowpal Wabbit
- Online stochastic gradient descent ⇒ memory independent, optimal for huge data sets
- Highly optimized C++ ⇒ fast

However...
- The interface is a CLI, and the input/output files are not very usable

Page 46

Vowpal Wabbit: Python to the rescue

Principles:
- Make getting data into/out of VW easy
- Don't wrap the VW CLI (or if you do, use the subprocess module to make calls, not os.system)

    # Convert text files in a directory structure to vw format
    stream = TextFileStreamer(
        text_base_path='bodyfiles', tokenizer=my_tokenizer)
    stream.to_vw('myfiles.vw', n_jobs=-1)

    # Explore token counts and filter tokens in a DataFrame
    sff = SFileFilter(VWFormatter()).load_sfile('myfiles.vw')

    # Create a data-frame representation
    df = sff.to_frame()

Page 47

Vowpal Wabbit: Python to the rescue

    # Create a filtered version of your sparse file
    sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
    sff.compactify().filter_sfile('myfiles.vw', 'myfiles_filtered.vw')

    # Back to bash to run VW
    vw --lda 5 --cache_file ddrs.cache --passes 10 \
        -p prediction.dat --readable_model topics.dat \
        --bit_precision 16 myfiles_filtered.vw

    # Look at the results in DataFrames
    lda = LDAResults(
        'topics.dat', 'prediction.dat', num_topics, sff)
    lda.print_topics()

See rosetta/examples/vw_helpers.md

Page 48

Steps with VW

Step 1: Convert files to VW input format

    1 0000BC34| saying:1 antunes:4 goncalves:3 scientist:1 ...
    1 0000C1AE| shot:1 help:1 september:2 luxembourg:1 ...
    1 0000BBA7| raised:1 chinese:1 winston:1 authority:1 ...

Step 2: View the tokens in a DataFrame

             doc_freq
    tokens
    war            58
    china          77
    ...

Step 3: Filter tokens and hash them

    1 0000BC34| 3423211:1 111:4 43454:3 989794:1 ...
    1 0000C1AE| 338:1 3123:1 19393:2 3232321:1 ...
    1 0000BBA7| 1191:1 69830:1 398:1 974949:1 ...
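Step 1's input format is plain text and easy to produce by hand; the helper below is a hypothetical sketch (not part of Rosetta) that emits one document per line:

```python
from collections import Counter

def to_vw_lda_line(doc_id, tokens):
    """One VW/LDA input line: '1 <doc_id>| token:count token:count ...'"""
    counts = Counter(tokens)
    feats = " ".join("%s:%d" % (tok, n)
                     for tok, n in sorted(counts.items()))
    return "1 %s| %s" % (doc_id, feats)

line = to_vw_lda_line("0000BC34", ["war", "china", "war"])
# line == '1 0000BC34| china:1 war:2'
```

In practice Rosetta's stream.to_vw does this conversion for you, in parallel.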

Page 49

Steps with VW

Step 4: Run VW

    vw --lda 5 --cache_file ddrs.cache --passes 10 ...

Step 5: View the results

            topic_0  topic_1
    tokens
    war         0.2      0.8
    china       0.4      0.6

See rosetta/examples/vw_helpers.md

Page 50

Summary

- Pay attention to memory
- Pay attention to data structures
- Profile for performance
- Parallelization is easy for many text-processing tasks
- Use Python to make stepping outside the Python world easier
- Also, don't forget the CLI and UNIX

Page 51

Bibliography

M. Connelly et al.
Declassification Engine. Ongoing project at Columbia University.
http://www.declassification-engine.org/
https://github.com/declassengine/declass

D. Krasner and I. Langmore
Applied Data Science, lecture notes.
http://columbia-applied-data-science.github.io/appdatasci.pdf

The Rosetta team
Tools for data science with a focus on text processing.
https://github.com/columbia-applied-data-science/rosetta

Clone, submit issues on github, fork, contribute!

Page 52

THANK YOU!

contact: [email protected]

Rosetta
https://github.com/columbia-applied-data-science/rosetta
Open Source Python Text Processing Library

Feel free to use, fork, submit issues and contribute!
