Daniel Krasner - High Performance Text Processing with Rosetta

(Easy), High Performance Text Processing with Python's Rosetta
Daniel Krasner
KFit Solutions, Columbia University
Nov 22, 2014
(IDSE) 1 / 52


DESCRIPTION

This talk covers rapid prototyping of a high-performance, scalable text-processing pipeline in Python. We demonstrate how Python modules, in particular from the Rosetta library, can be used to analyze, clean, extract features from, and finally perform machine learning tasks such as classification or topic modeling on millions of documents. Our style is to build small and simple modules (each with a command line interface) that use very little memory and are parallelized with the multiprocessing library.

TRANSCRIPT

Page 1

(Easy), High Performance Text Processing with Python's Rosetta

Daniel Krasner

KFit Solutions, Columbia University

Nov 22, 2014

Page 2

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 3

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 4

Motivation: Current Text Processing Projects

The Declassification Engine
- Full stack Digital Archive.
  - Collections structuring/parsing.
  - Backend, API, UI.
  - Statistical analysis.
- Organizers
  - David Madigan - Statistics Department chair (CU)
  - Matthew Connelly - Professor of international and global history (CU)
- For more info see http://www.declassification-engine.org/.

Page 5

Motivation: Current Projects Continued

eDiscovery
- The legal world is overwhelmed with documents, in both pre- and post-production review.
- Most technologies rely heavily on keyword search, which is not efficient.
- "Predictive coding" solutions are generally archaic and inaccurate.

Other
- Human - text/document interaction.
- Semantic filtering solutions.

Page 6

Motivation: Data Structuring/Information Extraction

Text can come in many formats, encodings, and degrees of structure.

Figure: raw xml

Page 7

Motivation: Data Structuring/Information Extraction

Many tasks involve initial structuring of the data.

Figure: structured api response

Page 8

Motivation: Network Analysis

Kissinger telcons can be analyzed for frequency. Even a simple version of this type of analysis requires the data to be structured, or an information extraction process in place.

Figure:

Page 9

Motivation: Semantic Modeling

State department cables from embassies can be analyzed for topics.

Moscow is predominantly topic 12:

    soviet     0.133910
    moscow     0.128717
    october    0.090400
    joint      0.052875
    ussr       0.044190
    soviets    0.042493
    ur         0.027686
    mfa        0.025786
    refs       0.023871
    prague     0.021268

London is predominantly topic 13:

    london     0.114568
    bonn       0.083748
    rome       0.074385
    uk         0.051367
    frg        0.050235
    berlin     0.035972
    usmission  0.031757
    british    0.029836
    european   0.027203
    brussels   0.025023

Page 10

Motivation: Classification

Determine which documents are relevant to a legal case.

Figure: eDiscovery processing pipeline

Page 11

Feature Extraction

Figure: Metadata + body text ⇒ features

The metadata features can be used in any classifier

Page 12

Tasks

What are some typical text processing goals?

Structuring/Information Extraction
- Metadata extraction
  - Geo-tagging
  - Named-entity identification/disambiguation
- Text cleaning/text body extraction

Machine Learning
- Classification
  - logistic regression, random forests, etc.
- Sentiment analysis
- Recommendation systems
- Anomaly detection
- Understanding underlying semantic structure (e.g. LDA/LSI modeling)
- Communication dynamics

Note: this is far from a complete list!

Page 13

General Flow

Figure: General Flow

Page 14

What’s so hard about text processing?

- Must use sparse data structures
- Data usually doesn't fit in memory
- It's language . . .
- Text structure can change from collection to collection (parsing/feature extraction can be tricky)
- Can be difficult to convert to nice machine-readable formats
- HUGE data drives much of software development, which leads to products too complicated for most applications
- No simple solution (no equivalent of the "sklearn/statsmodels/pandas/numpy" stack)

Page 15

What’s so fun about text processing?

- You get to care about memory and processing speed
- It's language
  - spelling, stemming, etc.
  - parsing
  - tokenization
  - domain knowledge
- Unix plays nicely with text
- Python plays nicely with text
- You get to step outside the Python ecosystem

Page 16

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 17

Cluster? No.

One powerful machine and memory/CPU conscious code can handle many tasks with 1TB of text.

- System76 laptop, pre-loaded with Ubuntu: 16GB memory, 4 cores, 1TB SSD. $1,823
  - Can also get a MacBook Pro for about $3,600
- You can spend ~$7,000 and get 20 cores, 64GB memory (upgradeable to 256GB), and 1-2TB of RAID SSD. Stick it in a closet with lots of fans, or in a server center for $250/month.
- Single machine on AWS ($1-2k per year) or on Digital Ocean (they have nice SSDs).

Page 18

Back-of-the-envelope memory calculations

Text:
- Same as on disk if the file is large enough (e.g. 10MB)
- Can be much more if you load many small files and append to a list.

Numbers:
- 1 double = 8 bytes ⇒ 1,000,000 doubles ≈ 8MB.
- You can save space by using types "int8", "float16", "float32", etc. See the numpy dtype docs.
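The back-of-the-envelope numbers above are easy to verify with numpy's nbytes attribute; a quick sketch (Python 3, numpy assumed installed):

```python
import numpy as np

# 1,000,000 doubles at 8 bytes each is about 8 MB
doubles = np.ones(1_000_000, dtype=np.float64)
print(doubles.nbytes)  # 8000000

# Downcasting to float32 halves the footprint; int8 cuts it to 1/8
floats32 = doubles.astype(np.float32)
int8s = doubles.astype(np.int8)
print(floats32.nbytes, int8s.nbytes)  # 4000000 1000000
```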

Page 19

Monitoring Memory with HTOP

- 4 cores, 4 virtual cores, all in use
- 2821/15946 MB memory in use
- processes listed below

For Macs, htop doesn't necessarily work. If it doesn't, try iStat.

Page 20

Don't blow up memory, stream!

Process files (text) line-by-line:

    with open(infile, 'r') as f, open(outfile, 'w') as g:
        for line in f:
            line_output = process_line(line)
            # Write output NOW
            g.write(line_output)

Process directories one file at a time:

    from rosetta.text.filefilter import get_paths

    my_paths_iter = get_paths(MYDIR, get_iter=True)

    for path in my_paths_iter:
        output = process_file(path)
        # Now write the output to file

Page 21

Rosetta Text File Streamer

Set up a text file streamer class for processing.

    # stream from a file sys directory
    stream = TextFileStreamer(text_base_path=MYDIR,
                              tokenizer=MyTokenizer)

    # call info_stream, which will return a dict with
    # doc_id, text, tokens, etc.
    for item in stream.info_stream():
        # print the document text
        print item['text']     # 'This is my text.'
        # print the tokens
        print item['tokens']   # ['this', 'is', 'my', 'text']

Page 22

Rosetta MySQL Streamer

Set up a database streamer class for processing.

    # stream from a DB
    stream = MySQLStreamer(db_setup=DBCONFIG,
                           tokenizer=MyTokenizer)

    # Convert to a scipy sparse matrix
    # and cache some data along the way
    sparse_mat = stream.to_scipysparse(
        cache_list=['doc_id', 'date'])

    # grab the cached doc_ids and dates
    doc_ids = stream.__dict__['doc_id_cache']
    dates = stream.__dict__['date_cache']

Page 23

(Online) Stochastic Gradient Descent

Combine the above with an online learning algorithm. To minimize the empirical loss

    sum_{i=1}^n |y_i - w · x_i|^2

update the coefficient w one training example at a time:

    w^(t+1) := w^(t) - η_t ∇_w |y_t - w^(t) · x_t|^2

Note:
- The learning rate η_t decays to 0
- We can cycle through the training examples, updating the weights more than n times
- We only need to load one single example into memory at a time
- Converges faster for cases of many data points and coefficients
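The update rule above can be sketched in a few lines of plain Python; this is an illustrative toy (one-dimensional x, made-up function and variable names), not Rosetta or scikit-learn code:

```python
def sgd_squared_loss(data, n_passes=200, eta0=0.1):
    """Minimize sum_i |y_i - w*x_i|^2 one example at a time."""
    w = 0.0
    t = 1
    for _ in range(n_passes):                 # cycle through examples > n times
        for x, y in data:
            eta = eta0 / t                    # learning rate decays to 0
            grad = -2.0 * (y - w * x) * x     # gradient of |y - w*x|^2 in w
            w = w - eta * grad
            t += 1
    return w

# Recover w = 2 from examples of y = 2x; only one example is
# ever "in memory" at a time.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = sgd_squared_loss(data)
```

In a real pipeline, `data` would itself be a streamed iterator over documents rather than an in-memory list.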

See Bottou, scikit-learn's SGD module, and Vowpal Wabbit.

Page 24

Dealing with limited memory: Summary

- Monitor memory usage
- Stream process
  - and cache what you need along the way
- Deal with huge feature counts by...
  - Use a sparse data structure, stochastic gradient descent. Or...
  - Reduce the number of features to something that fits into a dense matrix

Page 25

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 26

Goal: Tokenization

    # Steps: split on whitespace, set to lowercase,
    # remove non-letters, punctuation, and stopwords
    stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
    punct = ['|', "'", '[', ']', '{', '}', '(', ')']

    # Here's the function call/result we want
    text = "Let's do a deal: Trade 55 Euros for 75 euros"
    tokens = tokenize(text, punct, stopwords_list)
    print tokens
    # ['deal', 'trade', 'euros', 'euros']

Page 27

First hack: For loop

    def tokenize_for(text, punct, stopwords):
        tokens = []
        # Split the text on whitespace.
        for token in text.lower().split():
            # Remove punctuation.
            clean_token = ''
            for char in token:
                if char not in punct:
                    clean_token += char
            # Remove stopwords.
            if clean_token.isalpha() and len(clean_token) > 1 \
                    and (clean_token not in stopwords):
                tokens.append(clean_token)
        return tokens

Page 28

Profile: Time your code

regex_test.py:

    def main():
        stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
        text = ...  # Something pretty typical for your application
        for i in range(10000):
            tokens = tokenize_for(text, punct, stopwords_list)

    if __name__ == '__main__':
        main()

    $ time python regex_test.py

    real 0m0.128s
    user 0m0.120s
    sys  0m0.008s

Page 29

Profile: Switch to a regex

    bad_char_pattern = r"\||'|\[|\]|\{|\}|\(|\)"

    def tokenize_regex_1(text, bad_char_pattern, stopwords):
        # Substitute the empty string for the bad characters
        text = re.sub(bad_char_pattern, '', text).lower()
        # Split on whitespace, keeping strings of length > 1
        split_text = re.findall(r"[a-z]+", text)
        tokens = []
        for word in split_text:
            if word not in stopwords:
                tokens.append(word)
        return tokens

    $ time python regex_test.py

    real 0m0.091s
    user 0m0.083s
    sys  0m0.008s

Page 30

Profile: Line-by-line readout

Add an @profile decorator to your tokenize_regex function, and:

    pip install line_profiler
    kernprof.py -l regex_test.py
    python -m line_profiler regex_test.py.lprof | less

Figure: line_profiler output shows the for loop and if statement are slow.

Page 31

Profile: Use a set

regex_test.py:

    def main():
        stopwords_list = 'lets,do,a,the,and'.split(',')  # And more
        stopwords_set = set(stopwords_list)
        for i in range(1000):
            text = ...  # Something pretty typical for your application
            tokens = tokenize_regex_1(text, bad_char_pattern, stopwords_set)

    if __name__ == '__main__':
        main()

- Reduces time from 0.091 to 0.043
- Testing item in my_set requires a hash function computation
- Testing item in my_list requires looking at every item in my_list.
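The set-vs-list claim is easy to check with the timeit module; a small sketch (sizes and numbers are arbitrary):

```python
import timeit

items = list(range(100000))
as_list = items
as_set = set(items)

# Worst case for the list: the item is at the end, so every element
# is examined. The set does a single hash lookup.
t_list = timeit.timeit(lambda: 99999 in as_list, number=100)
t_set = timeit.timeit(lambda: 99999 in as_set, number=100)
```

On any reasonable machine `t_set` comes out orders of magnitude smaller than `t_list`.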

Page 32

Profile: IPython timeit

Figure: Set lookup is O(1). So don't ever test item in my_list for long lists.

Page 33

Profile: Line-by-line profile again

Switching to a set sped up the if statement. The for loop can still be faster.

Page 34

Profile: Switch to a list comprehension

- Reduced time from 0.033s to 0.025s
- A list comprehension is essentially a for loop with a fast append
- Looks nicer in this case
- Be sure to use time python regex_test.py for the total time!
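The slide doesn't reproduce the final code, but a plausible list-comprehension version of the tokenizer looks like the sketch below (the name tokenize_regex_2 and the extra 'for' stopword are assumptions; written in Python 3):

```python
import re

BAD_CHAR_PATTERN = r"\||'|\[|\]|\{|\}|\(|\)"

def tokenize_regex_2(text, stopwords):
    """Regex cleanup + a list comprehension instead of an explicit loop."""
    text = re.sub(BAD_CHAR_PATTERN, "", text).lower()
    return [w for w in re.findall(r"[a-z]+", text)
            if len(w) > 1 and w not in stopwords]

stopwords = set("lets,do,a,the,and,for".split(","))
tokens = tokenize_regex_2("Let's do a deal: Trade 55 Euros for 75 euros",
                          stopwords)
# tokens == ['deal', 'trade', 'euros', 'euros']
```

The comprehension replaces the explicit loop-and-append, which is where the remaining time went.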

Page 35

Data structures

Think about the data structures (and associated methods) you are using for the task at hand!

- Many data-analysis-friendly languages (e.g. Python, R) have very convenient built-in data structures.
- These can come with significant lookup and operation overhead.
  - example: set lookup vs list lookup, as above
  - example: Python does not allocate a contiguous memory block for dictionaries, making them slower than a data structure which tells the interpreter how much space will be needed
- You can (easily) create your own data structure for the task at hand. See Saulius Lukauskas' nice post, Why Python Runs Slow, Part 1: Data Structures.

Page 36

Parallelization

Much of text processing is embarrassingly parallel.

Figure: Word counts for individual documents can be computed independently.

Page 37

Parallelization: Basic mapping

Serial mapping:

    >>> def func(x):
    ...     return 2 * x
    >>> iterable = range(3)
    >>> map(func, iterable)
    [0, 2, 4]

Parallel mapping:

    >>> from multiprocessing import Pool
    >>> def func(x):
    ...     return 2 * x
    >>> my_pool = Pool(processes=4)
    >>> iterable = range(3)
    >>> my_pool.map(func, iterable)
    [0, 2, 4]

1. Spawns 4 subprocesses
2. Pickles func and iterable and pipes them to the subprocesses
3. The subprocesses compute their results
4. The subprocesses pickle/pipe the results back to the mother process

Page 38

Parallelization: Basic mapping issues

Serial mapping:

    >>> def func(x):
    ...     return 2 * x
    >>> map(func, range(3))
    [0, 2, 4]

Parallel mapping:

    >>> from multiprocessing import Pool
    >>> def func(x):
    ...     return 2 * x
    >>> my_pool = Pool(processes=4)
    >>> iterable = range(3)
    >>> my_pool.map(func, iterable)
    [0, 2, 4]

Issues:
- What about functions of more than one variable?
- Pickling is not possible for every function
- Can't step a debugger into pool calls
- Traceback is uninterpretable
- Can't exit with Ctrl-C
- The entire result is computed at once ⇒ memory blow-up!

Page 39

Parallelization: Mapping functions of more than one var

    >>> from multiprocessing import Pool
    >>> from functools import partial

    >>> def func(a, x):
    ...     return 2 * a * x

    >>> a = 3
    >>> func_a = partial(func, a)  # func_a(x) = func(a, x)

    >>> my_pool = Pool(processes=4)
    >>> my_pool.map(func_a, range(3))
    [0, 6, 12]

Page 40

Parallelization: Dealing with map issues

    def map_easy(func, iterable, n_jobs):
        if n_jobs == 1:
            return map(func, iterable)
        else:
            _trypickle(func)
            pool = Pool(n_jobs)
            timeout = 1000000
            return pool.map_async(func, iterable).get(timeout)

- _trypickle(func) tries to pickle the func before mapping
- n_jobs = 1 ⇒ serial (debuggable/traceable) execution
- pool.map_async(func, iterable).get(timeout) allows exit with Ctrl-C

Page 41

Parallelization: Limiting memory usage

Send out/return jobs in chunks:

    def imap_easy(func, iterable, n_jobs, chunksize,
                  ordered=True):
        if n_jobs == 1:
            results_iter = itertools.imap(func, iterable)
        else:
            _trypickle(func)
            pool = Pool(n_jobs)
            if ordered:
                results_iter = pool.imap(func, iterable,
                                         chunksize=chunksize)
            else:
                results_iter = pool.imap_unordered(
                    func, iterable, chunksize=chunksize)
        return results_iter

Note: Exit with Ctrl-C is more difficult. See rosetta.parallel

Page 42

Making things faster: Summary

- Use regular expressions
- Use the right data structure
  - Numpy/Pandas for numbers (use built-in functions/numba/cython, NOT for loops)
  - sets if you will test some_item in my_set
- Profile your code
  - time python myscript.py
  - timeit in IPython
  - line_profiler (using kernprof.py)
- Use multiprocessing.Pool
  - A number of the Rosetta streamer methods have multiprocessing built in (see rosetta.text.streamers)

NOTE: the above examples are in Python for convenience, but are relevant in many (most) other scenarios

Page 43

Outline

1 Introduction

2 Dealing with Limited Memory

3 Making Things Faster!

4 Stepping Outside of Python (time permitting)

Page 44

LDA (in a slide)

Latent Dirichlet Allocation, by Blei, Ng and Jordan, is a hierarchical Bayesian model which describes the underlying semantic structure of a document corpus via a set of latent distributions of the vocabulary.

- The latent semantic distributions are referred to as "topics."
- Each document is modeled as a mixture of these topics.
  - The number of topics is chosen a priori.
- Words in a document are drawn by
  - choosing a topic, given the document's mixture weights,
  - sampling from that topic.
- Hyperparameters:
  - lda_alpha: prior which controls the topic probabilities/weights, e.g. lda_alpha 0.1: θ_d ~ Dirichlet(α)
  - lda_rho: prior which controls the word probabilities, e.g. lda_rho 0.1: β_k ~ Dirichlet(ρ)
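In symbols, the generative process just described is (one common parameterization, using the slide's α and ρ):

```latex
\beta_k \sim \mathrm{Dirichlet}(\rho), \quad k = 1, \dots, K
\qquad
\theta_d \sim \mathrm{Dirichlet}(\alpha)
z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)
\qquad
w_{d,n} \mid z_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})
```

Here K is the number of topics (chosen a priori), θ_d is document d's topic mixture, and β_k is topic k's distribution over the vocabulary.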

Page 45

Vowpal Wabbit: What/Why

Can you build a topic model with 1,000,000 documents using gensim? Sure... if you have 10 hours or so to kill.

Better solution: Vowpal Wabbit
- Online stochastic gradient descent ⇒ memory independent, optimal for huge data sets
- Highly optimized C++ ⇒ fast

However...
- The interface is a CLI, and the input/output files are not very usable

Page 46

Vowpal Wabbit: Python to the rescue

Principles:
- Make getting data into/out of VW easy
- Don't wrap the VW CLI (or if you do, use the subprocess module to make calls, not os.system)

    # Convert text files in a directory structure to vw format
    stream = TextFileStreamer(
        text_base_path='bodyfiles', tokenizer=my_tokenizer)
    stream.to_vw('myfiles.vw', n_jobs=-1)

    # Explore token counts and filter tokens in a DataFrame
    sff = SFileFilter(VWFormatter()).load_sfile('myfiles.vw')

    # Create a data-frame representation
    df = sff.to_frame()

Page 47

Vowpal Wabbit: Python to the rescue

    # Create a filtered version of your sparse file
    sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
    sff.compactify().filter_sfile('myfiles.vw', 'myfiles_filtered.vw')

    # Back to bash to run VW
    vw --lda 5 --cache_file ddrs.cache --passes 10 \
        -p prediction.dat --readable_model topics.dat \
        --bit_precision 16 myfiles_filtered.vw

    # Look at the results in DataFrames
    lda = LDAResults(
        'topics.dat', 'prediction.dat', num_topics, sff)
    lda.print_topics()

See rosetta/examples/vw_helpers.md

Page 48

Steps with VW

Step 1: Convert files to VW input format

    1 0000BC34| saying:1 antunes:4 goncalves:3 scientist:1 ...
    1 0000C1AE| shot:1 help:1 september:2 luxembourg:1 ...
    1 0000BBA7| raised:1 chinese:1 winston:1 authority:1 ...

Step 2: View the tokens in a DataFrame

             doc_freq
    tokens
    war            58
    china          77
    ...

Step 3: Filter tokens and hash them

    1 0000BC34| 3423211:1 111:4 43454:3 989794:1 ...
    1 0000C1AE| 338:1 3123:1 19393:2 3232321:1 ...
    1 0000BBA7| 1191:1 69830:1 398:1 974949:1 ...
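Step 1's input format is plain text and easy to produce by hand; the helper below is a hypothetical sketch (not part of Rosetta) that emits one document per line:

```python
from collections import Counter

def to_vw_lda_line(doc_id, tokens):
    """One VW/LDA input line: '1 <doc_id>| token:count token:count ...'"""
    counts = Counter(tokens)
    feats = " ".join("%s:%d" % (tok, n)
                     for tok, n in sorted(counts.items()))
    return "1 %s| %s" % (doc_id, feats)

line = to_vw_lda_line("0000BC34", ["war", "china", "war"])
# line == '1 0000BC34| china:1 war:2'
```

In practice Rosetta's stream.to_vw does this conversion for you, in parallel.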

Page 49

Steps with VW

Step 4: Run VW

    vw --lda 5 --cache_file ddrs.cache --passes 10 ...

Step 5: View the results

            topic_0  topic_1
    tokens
    war         0.2      0.8
    china       0.4      0.6

See rosetta/examples/vw_helpers.md

Page 50

Summary

- Pay attention to memory
- Pay attention to data structures
- Profile for performance
- Parallelization is easy for many text-processing tasks
- Use Python to make stepping outside the Python world easier
- Also, don't forget the CLI and UNIX

Page 51

Bibliography

M. Connelly et al.
Declassification Engine. Ongoing project at Columbia University.
http://www.declassification-engine.org/
https://github.com/declassengine/declass

D. Krasner and I. Langmore
Applied Data Science, lecture notes.
http://columbia-applied-data-science.github.io/appdatasci.pdf

The Rosetta team
Tools for data science with a focus on text processing.
https://github.com/columbia-applied-data-science/rosetta

Clone, submit issues on github, fork, contribute!

Page 52

THANK YOU!

contact: [email protected]

Rosetta
https://github.com/columbia-applied-data-science/rosetta
Open Source Python Text Processing Library

Feel free to use, fork, submit issues and contribute!
