Open-Source Python Tools for Environmental Data Processing, Analysis, and Visualization

Michael J. Murphy, Environmental Data Analyst/Data Scientist and Senior Staff Geologist, Terraphase Engineering Inc.

9/17/2019


Page 1: Open–Source Python Tools for Environmental Data Processing

Open-Source Python Tools for Environmental Data Processing, Analysis, and Visualization

9/17/2019

Michael J. Murphy, Environmental Data Analyst/Data Scientist and Senior Staff Geologist

Terraphase Engineering Inc.

Page 2: Open–Source Python Tools for Environmental Data Processing

OVERVIEW

• The Python programming language

• Why Python?

• Python data analysis basics

• Case studies

• Review

Page 3: Open–Source Python Tools for Environmental Data Processing

PYTHON* IS A MODERN PROGRAMMING LANGUAGE THAT IS ESPECIALLY USEFUL FOR QUANTITATIVE DATA ANALYSIS AND SCIENTIFIC PROGRAMMING

*Python is named after Monty Python, not Pythonidae.

Page 4: Open–Source Python Tools for Environmental Data Processing

WHY PYTHON INSTEAD OF EXCEL?

Page 5: Open–Source Python Tools for Environmental Data Processing

BASIC PYTHON DATA TOOLS

• pandas

– DataFrame table structures

– Access, process, analyze, export, and visualize data

• NumPy

– Vast collection of functions for array operations

– Quantitative analysis

• Matplotlib

– High-level plotting functions

– Can produce publication-quality data visualizations

– Works seamlessly with Python data tools such as NumPy and pandas

Page 6: Open–Source Python Tools for Environmental Data Processing

EXAMPLE: PANDAS DATAFRAMES

[Figure: a pandas DataFrame annotated with its index, keys (column labels), and values]
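The three parts labeled on this slide can be inspected directly on a small DataFrame. The following is a minimal sketch; the well IDs, column names, and concentrations are made up for illustration and are not from the presentation.

```python
import pandas as pd

# Hypothetical groundwater results; well IDs and concentrations are made up.
df = pd.DataFrame(
    {
        "benzene_ug_L": [1.2, 0.5, 3.4],
        "toluene_ug_L": [0.8, 0.0, 2.1],
    },
    index=["MW-1", "MW-2", "MW-3"],
)

print(df.index.tolist())    # row labels (the index)
print(df.columns.tolist())  # column labels (the keys)
print(df.to_numpy())        # the underlying values as a NumPy array

# Look up a single value by index label and column key.
value = df.loc["MW-3", "benzene_ug_L"]
print(value)
```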

Page 7: Open–Source Python Tools for Environmental Data Processing

EXAMPLE: PANDAS DATAFRAMES

Original wide-format table:

Melted long-format table:
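A wide-to-long reshape of this kind can be sketched with `DataFrame.melt`. The table below is a hypothetical stand-in (the sample IDs, depths, and metals results are invented), so treat it as an illustration of the operation rather than the table shown on the slide.

```python
import pandas as pd

# Hypothetical wide-format table: one row per sample, one column per analyte.
wide = pd.DataFrame(
    {
        "sample_id": ["S-1", "S-2"],
        "depth_ft": [5.0, 10.0],
        "arsenic": [12.0, 8.5],
        "lead": [40.0, 22.0],
    }
)

# Melt to long format: one row per sample/analyte pair.
long = wide.melt(
    id_vars=["sample_id", "depth_ft"],
    var_name="analyte",
    value_name="result",
)
print(long)
```

The long format is usually the more convenient input for grouped statistics and for plotting libraries that map a column to color or facet.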

Page 8: Open–Source Python Tools for Environmental Data Processing

EXAMPLE: NUMPY OPERATIONS

Vectorize function:

Draw random values:

Define function to calculate ion balance:
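The three steps named on this slide can be sketched as follows. The slide's own code is not in the transcript, so the ion balance formula used here is an assumption: the common charge-balance error, 100 × (cations − anions) / (cations + anions), with both totals in meq/L. The function name and the random input ranges are hypothetical.

```python
import numpy as np


def ion_balance(cations_meq, anions_meq):
    """Charge-balance error in percent from total cations and anions (meq/L)."""
    total = cations_meq + anions_meq
    if total == 0:
        return float("nan")  # undefined when no ions are reported
    return 100.0 * (cations_meq - anions_meq) / total


# Draw random totals (meq/L) for 1,000 hypothetical samples.
rng = np.random.default_rng(seed=0)
cations = rng.uniform(1.0, 10.0, size=1000)
anions = rng.uniform(1.0, 10.0, size=1000)

# np.vectorize lets the scalar function accept whole arrays.
ion_balance_vec = np.vectorize(ion_balance)
errors = ion_balance_vec(cations, anions)

print(errors.shape)
```

`np.vectorize` is a convenience wrapper around a Python loop, not a speed-up; it is useful here because the scalar function contains an `if` branch that plain array arithmetic would not handle.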

Page 9: Open–Source Python Tools for Environmental Data Processing

EXAMPLE: PLOTTING WITH MATPLOTLIB

A very basic example of a Matplotlib plot:
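The plot from the slide is not in the transcript; a comparably basic Matplotlib figure might look like the sketch below. The sine curve and the labels are placeholders, and the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 10.0, 50)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A basic Matplotlib line plot")
ax.legend()
# Call fig.savefig("plot.png") or plt.show() to view the result.
```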

Page 10: Open–Source Python Tools for Environmental Data Processing

CASE STUDIES

Page 11: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: MACHINE LEARNING ANALYSIS OF VOC DISTRIBUTION IN FRACTURED AQUIFER

Page 12: Open–Source Python Tools for Environmental Data Processing

ML STUDY PT. 1 – PRINCIPAL COMPONENT ANALYSIS

Page 13: Open–Source Python Tools for Environmental Data Processing

DATA CONSISTS OF DEPTH, DIP AND STRIKE, VOCS, AND DESCRIPTION

Page 14: Open–Source Python Tools for Environmental Data Processing

DATA IS CONVERTED TO A NUMERICAL ARRAY OR MATRIX

Original data table:

Numerical data in array:

Categorical data as target values:
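One way to split a table into a numerical array and integer-coded targets is sketched below. The column names and values are hypothetical stand-ins for the depth, dip, strike, VOC, and description fields; the slide's actual table is not in the transcript.

```python
import pandas as pd

# Hypothetical fracture log; columns and values are made up for illustration.
df = pd.DataFrame(
    {
        "depth_ft": [25.0, 40.5, 61.2, 88.0],
        "dip_deg": [12.0, 75.0, 20.0, 68.0],
        "strike_deg": [110.0, 45.0, 130.0, 30.0],
        "vocs_ug_L": [950.0, 15.0, 700.0, 40.0],
        "description": ["fracture", "joint", "fracture", "bedding"],
    }
)

# Numerical columns become a 2-D float array (one row per logged feature).
X = df[["depth_ft", "dip_deg", "strike_deg", "vocs_ug_L"]].to_numpy()

# The categorical description is encoded as integer target values.
y, target_names = pd.factorize(df["description"])

print(X.shape)
print(y, list(target_names))
```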

Page 15: Open–Source Python Tools for Environmental Data Processing

DATA IS SCALED, THEN DECOMPOSED INTO TWO PRINCIPAL COMPONENTS (EIGENVECTORS)

Scale data:

Transform data with PCA:
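The scaling and decomposition steps can be sketched with scikit-learn, assuming that is the library used (the transcript does not show the code). The input array here is random and only mimics the shape of a depth/dip/strike/VOC matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical array: four columns that might hold
# depth, dip, strike, and VOCs. The values are random, not the site data.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(100, 4)) * np.array([50.0, 20.0, 90.0, 500.0])

# Scale each column to zero mean and unit variance so no variable dominates.
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto its first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

Scaling first matters because PCA is driven by variance: without it, a column in large units (such as VOC concentration) would dominate the components regardless of its actual importance.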

Page 16: Open–Source Python Tools for Environmental Data Processing

COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED

Page 17: Open–Source Python Tools for Environmental Data Processing

COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED

Target categories show clustering:

Page 18: Open–Source Python Tools for Environmental Data Processing

ML STUDY PT. 2 – K-MEANS CLUSTERING

Page 19: Open–Source Python Tools for Environmental Data Processing

DIP AND VOCS ARE SELECTED FROM SAME DATA AS AN ARRAY

Same dataset as PCA example:

Select dip and VOC columns as array:

Page 20: Open–Source Python Tools for Environmental Data Processing

DATA IS SCALED AND A NEW VECTOR CREATED WITH FOUR K-MEANS CLUSTERS

Create K-means pipeline:

Fit and predict four clusters:
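A scaling-plus-K-means pipeline of the kind described can be sketched with scikit-learn, again assuming that library. The dip and VOC values below are random; the row count of 178 matches the sum of the cluster counts reported on the later slides (71 + 30 + 29 + 48), but nothing else here reproduces the site data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dip (degrees) and VOC (ug/L) columns; values are random.
rng = np.random.default_rng(seed=1)
dip = rng.uniform(0.0, 90.0, size=178)
vocs = rng.lognormal(mean=3.0, sigma=1.0, size=178)
X = np.column_stack([dip, vocs])

# Chain scaling and clustering so both steps are applied together.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

# Fit the pipeline and return one cluster label (0 to 3) per row.
labels = pipeline.fit_predict(X)
print(np.bincount(labels))
```

Fixing `random_state` makes the cluster assignment repeatable from run to run, which helps when the labels feed later tables or plots.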

Page 21: Open–Source Python Tools for Environmental Data Processing

LABELS ARE ADDED, AND EACH CLUSTER LABEL COUNTED

Original data:

Two new columns with cluster labels:

Counts of cluster labels:
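Adding label columns and counting them might look like the sketch below. The label array stands in for the output of a clustering step, and the descriptive cluster names are hypothetical; the slide's actual column names are not in the transcript.

```python
import numpy as np
import pandas as pd

# Hypothetical data and cluster IDs (as would come from a clustering step).
df = pd.DataFrame(
    {
        "dip_deg": [10, 15, 70, 80, 12, 75],
        "vocs_ug_L": [900, 850, 30, 20, 40, 950],
    }
)
labels = np.array([0, 0, 1, 1, 2, 3])

# Hypothetical descriptive names for each cluster ID.
names = {
    0: "low angle, high VOCs",
    1: "high angle, low VOCs",
    2: "low angle, low VOCs",
    3: "high angle, high VOCs",
}

# Two new columns: the numeric label and a readable name.
df["cluster"] = labels
df["cluster_name"] = df["cluster"].map(names)

# Count how many rows fall in each cluster.
counts = df["cluster_name"].value_counts()
print(counts)
```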

Page 22: Open–Source Python Tools for Environmental Data Processing

CLUSTERS ARE PLOTTED WITH DEPTH

Clusters plotted with depth:

Page 23: Open–Source Python Tools for Environmental Data Processing

THE CLUSTER COUNTS INDICATE THAT A HIGHER PERCENTAGE OF LOW-ANGLE FEATURES HAVE HIGH VOCS

Count = 71

Count = 30

Count = 29

Count = 48

Page 24: Open–Source Python Tools for Environmental Data Processing

BOTH UNSUPERVISED METHODS INDICATE CLUSTERING BETWEEN VARIABLES

Count = 71

Count = 30

Count = 29

Count = 48

Page 25: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: WELL TRANSDUCER DATA PROCESSING AND VISUALIZATION

Page 26: Open–Source Python Tools for Environmental Data Processing

DATA FROM A SINGLE TRANSDUCER CONTAINS NEARLY 70,000 DATA POINTS

Data collected every 1 minute; ~70,000 records:

Page 27: Open–Source Python Tools for Environmental Data Processing

SEVERAL LARGE OUTLIERS ARE PRESENT WHERE TRANSDUCER WAS MOVED, ETC.

Large artificial outliers:

Page 28: Open–Source Python Tools for Environmental Data Processing

OUTLIERS ARE ITERATIVELY REPLACED WITH INTERPOLATED VALUES

Interpolate and impute outliers:
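The transcript does not include the outlier code, so the following is only one plausible sketch of iterative replacement: flag points that sit far from a rolling median, blank them, interpolate in time, and repeat until nothing is flagged. The function name, window size, threshold, and synthetic water levels are all assumptions.

```python
import numpy as np
import pandas as pd


def replace_outliers(series, window=61, threshold=0.5, max_iter=10):
    """Iteratively replace outliers with time-interpolated values.

    A point is flagged when it differs from the centered rolling median
    by more than `threshold` (in the units of the data).
    """
    cleaned = series.copy()
    for _ in range(max_iter):
        median = cleaned.rolling(window, center=True, min_periods=1).median()
        outliers = (cleaned - median).abs() > threshold
        if not outliers.any():
            break  # nothing left to replace
        cleaned[outliers] = np.nan
        cleaned = cleaned.interpolate(method="time", limit_direction="both")
    return cleaned


# Synthetic one-minute water levels (ft) with three injected spikes.
index = pd.date_range("2019-01-01", periods=1440, freq="min")
rng = np.random.default_rng(seed=0)
signal = 10.0 + 0.2 * np.sin(np.linspace(0.0, 2.0 * np.pi, 1440))
level = pd.Series(signal + rng.normal(0.0, 0.01, size=1440), index=index)
level.iloc[[200, 201, 900]] += 5.0

cleaned = replace_outliers(level)
n_replaced = int((cleaned != level).sum())
print(n_replaced)
```

Repeating the pass matters when spikes come in clusters: the first pass removes the largest excursions, which lets the rolling median settle so that smaller artifacts nearby can be caught on the next pass.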

Page 29: Open–Source Python Tools for Environmental Data Processing

THE RESULTING HYDROGRAPH IS MUCH MORE READABLE

Hydrograph with outliers replaced:

Page 30: Open–Source Python Tools for Environmental Data Processing

THE PROCESSED DATASET IS THEN RESAMPLED TO THE MEAN HOURLY VALUE

Resample dataset to the hourly mean:

Data is reduced by ~98% to ~1,200 records:
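The reduction quoted above follows from the sampling interval: 70,000 one-minute readings span about 1,167 hours. A minimal sketch with pandas `resample`, using a synthetic series in place of the transducer record:

```python
import numpy as np
import pandas as pd

# Synthetic one-minute record of 70,000 points standing in for the transducer data.
index = pd.date_range("2019-01-01", periods=70_000, freq="min")
rng = np.random.default_rng(seed=0)
level = pd.Series(10.0 + rng.normal(0.0, 0.01, size=len(index)), index=index)

# Resample to the mean of each 60-minute bin (the "h" alias is equivalent
# in recent pandas versions).
hourly = level.resample("60min").mean()

reduction = 1.0 - len(hourly) / len(level)
print(len(hourly), f"{reduction:.1%}")
```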

Page 31: Open–Source Python Tools for Environmental Data Processing

THE PROCESSED AND REDUCED DATASET CONTAINS THE ESSENTIAL INFORMATION

Hydrograph of reduced dataset:

Page 32: Open–Source Python Tools for Environmental Data Processing

BY ITERATIVELY REPLACING ARTIFICIAL OUTLIERS AND REDUCING THE SIZE OF THE DATASET, WHAT WOULD TAKE SEVERAL HOURS IN EXCEL IS ACCOMPLISHED IN LESS THAN 1 MINUTE

Page 33: Open–Source Python Tools for Environmental Data Processing

REVIEW AND FURTHER TOPICS

• Python presents a comprehensive, free, open-source toolkit for data processing, analysis, and visualization

– Python libraries and tools such as NumPy, pandas, Matplotlib, Seaborn, and Jupyter Notebook

– Developed by universities, scientists, independent developers

– Libraries can be used together to process, analyze, and visualize groundwater and environmental data

– More accurate, powerful, and repeatable than Excel, etc.

– Range of applications from simple EDA to complex ML studies

• Python can also be used to interact with other software and codes

– FloPy: MODFLOW library

– PhreeqPy: PHREEQC library

– ArcPy, Python API: interact with and script functions within ESRI’s ArcGIS suite

Page 34: Open–Source Python Tools for Environmental Data Processing

THANK YOU! QUESTIONS?

Please feel free to contact me at [email protected] if you have any questions we cannot get to, or find me around the conference.

Page 35: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

• Metals data from soil samples

– Potentially contaminated site

– Arsenic is primary COC

– Examine distribution of As with depth to determine possible outliers

• EDA with the Seaborn Python library and Jupyter Notebooks

– Seaborn: high-level statistical visualization tools built on Matplotlib

– Works seamlessly with pandas DataFrames

– Allows rapid EDA

– Produces publication-quality visualizations

– Jupyter Notebooks: edit and run code, view inline plots

Page 36: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Wide-format table of metals data:

Page 37: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Plot all metals against depth:

Page 38: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Page 39: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Highlight arsenic results:

Page 40: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Page 41: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Box and scatter plot of arsenic vs. depth:
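A combined box and scatter plot of this kind can be sketched with Seaborn by overlaying a strip plot on a box plot. The depth intervals and arsenic values below are synthetic, and the column names are hypothetical; the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic arsenic results (mg/kg) grouped by hypothetical depth intervals.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(
    {
        "depth_interval": np.repeat(["0-2 ft", "2-5 ft", "5-10 ft"], 20),
        "arsenic_mg_kg": rng.lognormal(mean=2.0, sigma=0.5, size=60),
    }
)

fig, ax = plt.subplots()
# The box plot summarizes the distribution within each depth interval.
sns.boxplot(data=df, x="depth_interval", y="arsenic_mg_kg", color="lightgray", ax=ax)
# Overlaying the individual results keeps possible outliers visible.
sns.stripplot(data=df, x="depth_interval", y="arsenic_mg_kg", color="black", size=3, ax=ax)
ax.set_xlabel("Depth interval")
ax.set_ylabel("Arsenic (mg/kg)")
```

Swapping `sns.boxplot` for `sns.violinplot` gives the violin variant with the same overlay.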

Page 42: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Page 43: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Violin and scatter plot of arsenic vs. depth:

Page 44: Open–Source Python Tools for Environmental Data Processing

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
