USING THE PANDAS PYTHON DATA TOOLKIT Tiffany A. Timbers, Ph.D. Banting Postdoctoral Fellow Simon Fraser University Burnaby, BC

Upload: tiffany-timbers

Post on 17-Jan-2017



OUTLINE

• A bit about my science

• My journey to using Python’s Pandas

• Highlights of Pandas that make this library super intuitive and welcoming for scientists

A BIT ABOUT MY SCIENCE

Ph.D. in Neuroscience from 2005 - 2012

Thesis: Genetics of learning & memory in a microscopic nematode, Caenorhabditis elegans

(Image scale bar: 1 mm)

1990 - 2010: DATA COLLECTION BOTTLENECK

• Recorded a single worm at a time (5 - 30 min each)

• Re-watched videos to hand score

• Scanned into computer and measured with ImageJ

• Used box-stats program to analyze and visualize data

Tap Habituation in C. elegans


Very small datasets and time consuming!

THE MULTI-WORM TRACKER

Stimulus delivery, image extraction, post-experiment analysis (Swierczek & Giles et al. Nature Methods 2011)

The .dat files from the tracker give you a row for each frame captured by the camera.

25 frames/sec x 300 sec x 100 worms = 750,000 rows and up to ~ 30 columns for each experiment!

2010: DATA ANALYSIS BOTTLENECK

SOLUTION = PROGRAMMING LANGUAGES

• Initially started with Matlab

• Moved to R for access to data frames and advanced statistical capabilities

• Still work in R a lot, but moving into Python because of its culture, ease of learning and readability of code

PANDAS: A BRIEF HISTORY

• First released to open source by Wes McKinney (AQR) in 2009

• Wanted R statistical capabilities and better data-munging abilities in a more readable language

• Inspired by Jonathan Taylor’s (Stanford) port of R’s MASS package to Python

PANDAS TODAY:

• Data frame data structures: database-like tables with rows and columns

• Easy time series manipulations

• Quick visualizations based on Matplotlib

• Very flexible import and export of data

• Data munging: remove duplicates, manage missing values, automatically join tables by index

• SQL-like operations: join, aggregate (group by)

Pandas is a library that makes analysis of complex tabular data easy!
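As a rough illustration of the munging and SQL-like bullets above, here is a minimal sketch on a made-up table. The column names and values (`worm_id`, `strain`, `speed`, `plate`) are hypothetical and are not taken from the tracker data used later in the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one duplicated row and missing speed values.
worms = pd.DataFrame({
    "worm_id": [1, 2, 2, 3],
    "strain": ["N2", "N2", "N2", "mutant"],
    "speed": [0.36, np.nan, np.nan, 0.08],
})

# Data munging: drop exact duplicate rows, then fill missing speeds
# with the mean of the observed speeds.
deduped = worms.drop_duplicates()
filled = deduped.fillna({"speed": deduped["speed"].mean()})

# SQL-like aggregate: mean speed per strain (a GROUP BY).
mean_speed = filled.groupby("strain")["speed"].mean()

# Join two tables automatically on their shared index (the strain name).
plates = pd.DataFrame({"plate": ["A", "B"]}, index=["N2", "mutant"])
summary = mean_speed.to_frame().join(plates)
print(summary)
```

Each step returns a new object rather than modifying `worms` in place, so the intermediate tables can be inspected one at a time.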

Using the Pandas Python Data Toolkit

Today we will highlight some very useful and cool features of the Pandas library in Python while playing with some nematode worm behaviour data collected from the multi-worm-tracker (Swierczek et al., 2011).

Specifically, we will explore:

1. Loading data

2. Dataframe data structures

3. Element-wise mathematics

4. Working with time series data

5. Quick and easy visualization

Some initial setup

In [1]: ## load libraries
%matplotlib inline
import pandas as pd
import numpy as np

from pandas import set_option
set_option("display.max_rows", 4)

## magic to time cells in ipython notebook
%install_ext https://raw.github.com/cpcloud/ipython-autotime/master/autotime.py
%load_ext autotime

Installed autotime.py. To use it, type:

%load_ext autotime

1. Loading data from a local text file

For more details, see http://pandas.pydata.org/pandas-docs/stable/io.html

Let's first load some behaviour data from a collection of wild-type worms.
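Since the file data/behav.dat is not distributed with these slides, the same kind of call can be tried on a small inline sample. This is only a sketch: the three rows below are a hypothetical excerpt in the same whitespace-delimited layout, with fewer columns than the real tracker output.

```python
import io

import pandas as pd

# Hypothetical excerpt: a header line, then one row per captured frame.
sample = io.StringIO(
    "plate time strain frame speed\n"
    "20141118_131037 5.065 N2 126 0.3600\n"
    "20141118_131037 5.109 N2 127 0.3600\n"
    "20141118_132717 249.048 N2 6158 0.0792\n"
)

# sep=r"\s+" splits each line on runs of whitespace (spaces or tabs).
demo = pd.read_table(sample, sep=r"\s+")

print(demo.shape)
print(demo.dtypes)
```

The first line of the text is used as the column names, and each column gets its own data type inferred from its values.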

In [2]: filename = 'data/behav.dat'
behav = pd.read_table(filename, sep = r'\s+')
behav

Out[2]:

        plate            time     strain  frame  area      speed   angular_speed  aspect  midline  morphwidth  kink
0       20141118_131037  5.065    N2      126    0.094770  0.3600  0.8706         0.0822  12.1000  1.0000      0.000
1       20141118_131037  5.109    N2      127    0.094770  0.3600  0.8630         0.0819  5.9000   1.0000      0.007
...     ...              ...      ...     ...    ...       ...     ...            ...     ...      ...         ...
249997  20141118_132717  249.048  N2      6158   0.108621  0.0792  0.5000         0.1470  0.9943   0.0906      41.200
249998  20141118_132717  249.093  N2      6159   0.107892  0.0693  0.6000         0.1520  1.0019   0.0903      42.900

249999 rows × 13 columns

time: 642 ms

2. Dataframe data structures

For more details, see http://pandas.pydata.org/pandas-docs/stable/dsintro.html

Pandas provides access to data frame data structures. These tabular data objects allow you to mix and match arrays of different data types in one "table".
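To make the idea of mixed column types concrete, here is a minimal sketch that builds a tiny data frame by hand. The values are hypothetical; only the column names are borrowed from the notebook.

```python
import pandas as pd

# Each column is its own array with its own data type.
df = pd.DataFrame({
    "plate": ["20141118_131037", "20141118_131037"],  # text
    "frame": [126, 127],                               # integers
    "speed": [0.36, 0.36],                             # floats
})

print(df.dtypes)

# A single column comes back as a Series; a single row mixes the types.
speeds = df["speed"]
first_row = df.iloc[0]
print(first_row)
```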

In [3]: print(behav.dtypes)

plate          object
time          float64
...
bias          float64
pathlength    float64
dtype: object

time: 4.85 ms

3. Element-wise mathematics

Suppose we want to add a new column that is a combination of two columns in our dataset. Similar to numpy, Pandas lets us do this easily and handles the math between columns on an element-by-element basis. For example, we are interested in the ratio of the morphwidth to the midline length (morphwidth divided by midline) to look at whether worms are crawling in a straight line or curling back on themselves (e.g., during a turn).

In [4]: ## vectorization takes 49.3 ms
behav['mid_width_ratio'] = behav['morphwidth']/behav['midline']
behav[['morphwidth', 'midline', 'mid_width_ratio']].head()

Out[4]:

     morphwidth  midline  mid_width_ratio
0    1           12.1     0.082645
1    1           5.9      0.169492
...  ...         ...      ...
3    1           14.9     0.067114
4    1           6.3      0.158730

5 rows × 3 columns

time: 57.6 ms
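The same element-by-element behaviour applies to scalars and to NumPy functions. A small sketch on hypothetical values (the derived column names `midline_doubled` and `log_ratio` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical values in the same two columns used above.
toy = pd.DataFrame({
    "morphwidth": [1.0, 1.0, 0.09],
    "midline": [12.1, 5.9, 0.99],
})

# Column / column: each row is divided independently.
toy["mid_width_ratio"] = toy["morphwidth"] / toy["midline"]

# Column * scalar: the scalar is applied to every row.
toy["midline_doubled"] = toy["midline"] * 2

# NumPy functions also operate on every element of a column.
toy["log_ratio"] = np.log(toy["mid_width_ratio"])

print(toy)
```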

In [ ]: ## looping takes 1 min 44s
mid_width_ratio = np.empty(len(behav['morphwidth']), dtype='float64')

## start at row 0 so that every element of the empty array gets filled
for i in range(len(behav['morphwidth'])):
    mid_width_ratio[i] = behav.loc[i,'morphwidth']/behav.loc[i,'midline']

behav['mid_width_ratio'] = mid_width_ratio
behav[['morphwidth', 'midline', 'mid_width_ratio']].head()
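To check that the loop and the vectorized division really give the same answer, and to get a feel for the speed difference on your own machine, a sketch along these lines can be used. The data are random and hypothetical, and the table is kept small (2000 rows) so the loop finishes quickly; the timings will differ from those quoted in the notebook.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical random measurements, not the tracker data.
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "morphwidth": rng.uniform(0.05, 1.0, n),
    "midline": rng.uniform(0.5, 15.0, n),
})

# Row-by-row loop, as in the cell above.
start = time.perf_counter()
looped = np.empty(n, dtype="float64")
for i in range(n):
    looped[i] = toy.loc[i, "morphwidth"] / toy.loc[i, "midline"]
loop_seconds = time.perf_counter() - start

# Vectorized division over whole columns.
start = time.perf_counter()
vectorized = (toy["morphwidth"] / toy["midline"]).to_numpy()
vector_seconds = time.perf_counter() - start

print("same result:", np.allclose(looped, vectorized))
print("loop: %.4f s, vectorized: %.4f s" % (loop_seconds, vector_seconds))
```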

apply()

For more details, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

Another bonus of using Pandas is the apply function, which lets you apply any function to selected column(s) or row(s) of a dataframe, or across the entire dataframe.
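As a small illustration of the column-versus-row choice, here is a sketch using a hypothetical helper, `value_range`, on made-up numbers. By default apply hands each column to the function; passing axis=1 hands it each row instead.

```python
import pandas as pd

def value_range(values):
    """Hypothetical helper: spread between the largest and smallest value."""
    return values.max() - values.min()

# Made-up numbers; the column names are borrowed from the notebook.
toy = pd.DataFrame({
    "speed": [1.0, 2.0, 3.0],
    "kink": [10.0, 20.0, 60.0],
})

per_column = toy.apply(value_range)        # one result per column
per_row = toy.apply(value_range, axis=1)   # one result per row

print(per_column)
print(per_row)
```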

In [5]: ## custom function to center data
def center(data):
    return data - data.mean()

time: 1.49 ms

In [6]: ## center all data on a column basis
behav.iloc[:,4:].apply(center).head()

Out[6]:

     area       speed     angular_speed  aspect     midline    morphwidth  kink        bias  pathlength  mid_width_ratio
0    -0.002280  0.249039  -6.313001      -0.219804  11.004384  0.904059    -43.962917  NaN   NaN         -0.029877
1    -0.002280  0.249039  -6.320601      -0.220104  4.804384   0.904059    -43.955917  NaN   NaN         0.056970
...  ...        ...       ...            ...        ...        ...         ...         ...   ...         ...
3    -0.000093  0.229039  -6.279701      -0.220304  13.804384  0.904059    -43.942917  NaN   NaN         -0.045408
4    0.000636   0.221039  -6.257501      -0.217504  5.204384   0.904059    -43.935917  NaN   NaN         0.046208

5 rows × 10 columns

time: 55.3 ms

4. Working with time series data

Indices

For more details, see http://pandas.pydata.org/pandas-docs/stable/indexing.html

Given that this is time series data, we will want to set the index to time. We can do this while we read in the data.

In [7]: behav = pd.read_table(filename, sep = r'\s+', index_col='time')
behav

Out[7]:

         plate            strain  frame  area      speed   angular_speed  aspect  midline  morphwidth  kink    bias
time
5.065    20141118_131037  N2      126    0.094770  0.3600  0.8706         0.0822  12.1000  1.0000      0.000   NaN
5.109    20141118_131037  N2      127    0.094770  0.3600  0.8630         0.0819  5.9000   1.0000      0.007   NaN
...      ...              ...     ...    ...       ...     ...            ...     ...      ...         ...     ...
249.048  20141118_132717  N2      6158   0.108621  0.0792  0.5000         0.1470  0.9943   0.0906      41.200  1
249.093  20141118_132717  N2      6159   0.107892  0.0693  0.6000         0.1520  1.0019   0.0903      42.900  1

249999 rows × 12 columns

time: 609 ms

To use the functions built into Pandas for time series data, let's convert our time index to datetime objects using the to_datetime() function.

In [8]: behav.index.dtype

Out[8]: dtype('float64')

time: 2.53 ms
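The two steps, reading with a time index and then converting it, can be tried together on a small inline sample. This is a sketch on a hypothetical three-row excerpt; with unit="s" the numbers are read as seconds counted from 1970-01-01, which is why the converted index shows dates in 1970.

```python
import io

import pandas as pd

# Hypothetical excerpt with time in seconds since the start of recording.
sample = io.StringIO(
    "time strain speed\n"
    "5.065 N2 0.3600\n"
    "5.109 N2 0.3600\n"
    "249.048 N2 0.0792\n"
)

# index_col="time" makes the time column the row index while reading.
demo = pd.read_table(sample, sep=r"\s+", index_col="time")
original_dtype = demo.index.dtype
print(original_dtype)

# Interpret the float seconds as offsets from 1970-01-01 00:00:00.
demo.index = pd.to_datetime(demo.index, unit="s")
print(demo.index.dtype)
print(demo)
```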

In [9]: behav.index = pd.to_datetime(behav.index, unit='s')
print(behav.index.dtype)
behav

datetime64[ns]

Out[9]:

                         plate            strain  frame  area      speed   angular_speed  aspect  midline  morphwidth  kink
1970-01-01 00:00:05.065  20141118_131037  N2      126    0.094770  0.3600  0.8706         0.0822  12.1000  1.0000      0.000
1970-01-01 00:00:05.109  20141118_131037  N2      127    0.094770  0.3600  0.8630         0.0819  5.9000   1.0000      0.007
...                      ...              ...     ...    ...       ...     ...            ...     ...      ...         ...
1970-01-01 00:04:09.048  20141118_132717  N2      6158   0.108621  0.0792  0.5000         0.1470  0.9943   0.0906      41.200
1970-01-01 00:04:09.093  20141118_132717  N2      6159   0.107892  0.0693  0.6000         0.1520  1.0019   0.0903      42.900

249999 rows × 12 columns

time: 394 ms

Now that our index is a datetime object, we can use the resample function to get time intervals. With this function you can choose the time interval as well as how to downsample (mean, sum, etc.).
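Here is a minimal sketch of downsampling on synthetic data: one hypothetical speed measurement per second for a minute, reduced to 10-second bins. Note that the `how=` keyword used in this notebook was deprecated in pandas 0.18 and later removed; in current versions the aggregation is chained as a method, as shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one measurement per second for 60 seconds.
index = pd.to_datetime(np.arange(60), unit="s")
toy = pd.DataFrame({"speed": np.arange(60, dtype="float64")}, index=index)

# Group the rows into 10-second bins, then choose how to summarize each bin.
ten_second_mean = toy.resample("10s").mean()
ten_second_max = toy.resample("10s").max()

print(ten_second_mean)
```

Each bin is labelled by its start time, so the first row of the result covers seconds 0 through 9.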

In [10]: behav_resampled = behav.resample('10s', how=('mean'))

behav_resampled

Out[10]:

                     frame        area      speed     angular_speed  aspect    midline    morphwidth  kink       bias      pathlength
1970-01-01 00:00:00  158.970096   0.099870  0.172162  9.021929       0.271491  3.385725   0.156643    37.793984  0.987238  0.379338
1970-01-01 00:00:10  362.347271   0.098067  0.166863  11.942732      0.319444  1.880583   0.107296    44.520299  0.969474  1.080874
...                  ...          ...       ...       ...            ...       ...        ...         ...        ...       ...
1970-01-01 00:04:00  5924.536608  0.097678  0.127150  5.646088       0.242850  1.785435   0.098452    35.647127  0.889103  9.590879
1970-01-01 00:04:10  6041.902439  0.098643  0.255963  0.910815       0.088282  34.607500  0.025641    10.396449  NaN       NaN

26 rows × 10 columns

time: 103 ms

5. Quick and easy visualization

For more details, see: http://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html

In [11]: behav_resampled['angular_speed'].plot()

Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x10779b650>

time: 183 ms

In [12]: behav_resampled.plot(subplots=True, figsize = (10, 12))

Out[12]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x112675890>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10768a650>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10770f150>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x108002610>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x109859790>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x108039610>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10a16af10>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10a1fc0d0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10a347e90>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10ab0ce50>], dtype=object)

time: 1.69 s

In [13]: behav_resampled[['speed', 'angular_speed', 'bias']].plot(subplots = True, figsize = (10,8))

Out[13]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x10bc4e250>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10c981c50>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x10e7a1110>], dtype=object)

time: 541 ms
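The plot calls above return Matplotlib axes objects, which can be customized further. The sketch below uses random, hypothetical data in place of behav_resampled and selects the non-interactive "Agg" backend so that it also runs outside a notebook; inside a notebook the `%matplotlib inline` line from the setup cell does that job.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed

import numpy as np
import pandas as pd

# Hypothetical resampled data: one row every 10 seconds.
index = pd.to_datetime(np.arange(0, 250, 10), unit="s")
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "speed": rng.uniform(0.1, 0.3, len(index)),
    "angular_speed": rng.uniform(1.0, 12.0, len(index)),
}, index=index)

# One column: a single line plot; the returned axes can be labelled.
ax = toy["speed"].plot()
ax.set_ylabel("speed")

# Several columns: subplots=True gives one panel per column.
axes = toy.plot(subplots=True, figsize=(10, 8))
print(len(axes))
```

To keep a figure, `ax.figure.savefig("speed.png")` writes it to a file; the filename here is only an example.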

Summary

Pandas is an extremely useful and efficient tool for scientists, or anyone who needs to wrangle, analyze and visualize data!

Pandas is particularly attractive to scientists with minimal programming experience because:

• It has a strong, welcoming and growing community

• It is readable

• Its idioms match intuition

To learn more about Pandas see:

• Pandas Documentation (http://pandas.pydata.org/)

• ipython notebook tutorial (http://nsoontie.github.io/2015-03-05-ubc/novice/python/Pandas-Lesson.html) by Nancy Soontiens (Software Carpentry)

• Video tutorial (https://www.youtube.com/watch?v=0CFFTJUZ2dc&list=PLYx7XA2nY5Gcpabmu61kKcToLz0FapmHu&index=12) from SciPy 2015 by Jonathan Rocher

• History of Pandas (https://www.youtube.com/watch?v=kHdkFyGCxiY) by Wes McKinney