Post on 20-May-2022

Open–Source Python Tools for Environmental Data Processing, Analysis, and Visualization

9/17/2019

Michael J. Murphy

Environmental Data Analyst/Data Scientist and Senior Staff Geologist

Terraphase Engineering Inc.

OVERVIEW

• The Python programming language

• Why Python?

• Python data analysis basics

• Case studies

• Review

PYTHON* IS A MODERN PROGRAMMING LANGUAGE THAT IS ESPECIALLY USEFUL FOR QUANTITATIVE DATA ANALYSIS AND SCIENTIFIC PROGRAMMING

*Python is named after Monty Python, not Pythonidae.

WHY PYTHON INSTEAD OF EXCEL?

BASIC PYTHON DATA TOOLS

• pandas

– DataFrame table structures

– Access, process, analyze, export, and visualize data

• NumPy

– Vast collection of functions for array operations

– Quantitative analysis

• Matplotlib

– High-level plotting functions

– Can produce publication-quality data visualizations

– Works seamlessly with Python data tools such as NumPy and pandas

EXAMPLE: PANDAS DATAFRAMES

[Figure: example pandas DataFrame annotated with the labels "Key", "Values", and "Index"]

Original wide-format table:

Melted long-format table:
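
A wide-to-long reshape of this kind can be sketched with the pandas `melt` method. The well IDs, analyte columns, and values below are made up for illustration and are not the tables shown on the slide.

```python
import pandas as pd

# Hypothetical wide-format table: one row per well, one column per analyte.
wide_df = pd.DataFrame(
    {
        "well_id": ["MW-1", "MW-2", "MW-3"],
        "arsenic": [5.2, 11.0, 3.4],
        "lead": [1.1, 0.8, 2.5],
    }
)

# Melt to long format: one row per (well, analyte) pair.
long_df = wide_df.melt(
    id_vars="well_id",
    var_name="analyte",
    value_name="result",
)

print(long_df)
```

The long format is usually the easier one to filter, group, and plot, because every result sits in a single column.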

EXAMPLE: NUMPY OPERATIONS

Vectorize function:

Draw random values:

Define function to calculate ion balance:
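
The three steps named on this slide might look roughly like the sketch below. The slide's own function is not reproduced in the transcript, so this assumes the commonly used charge-balance error, 100 × (cations − anions) / (cations + anions), with both sums in meq/L. The random inputs are placeholders.

```python
import numpy as np


def ion_balance(cations_meq, anions_meq):
    """Charge-balance error in percent from total cations and anions (meq/L)."""
    return 100.0 * (cations_meq - anions_meq) / (cations_meq + anions_meq)


# Draw random totals for illustration only (seeded so the run is repeatable).
rng = np.random.default_rng(42)
cations = rng.uniform(5.0, 10.0, size=1000)
anions = rng.uniform(5.0, 10.0, size=1000)

# np.vectorize wraps a scalar function so it accepts arrays. This particular
# function already works on arrays, so calling it directly gives the same result.
ion_balance_vec = np.vectorize(ion_balance)
cbe = ion_balance_vec(cations, anions)

print(cbe[:5])
```

Note that `np.vectorize` is a convenience wrapper rather than a speed-up; writing the function with array arithmetic, as here, is generally the faster route.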

EXAMPLE: PLOTTING WITH MATPLOTLIB

A very basic example of a Matplotlib plot:
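
The plot from the slide is not preserved in this transcript. A minimal stand-in, using a made-up sine curve rather than the presenter's data, could be:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: a sine curve standing in for any x-y series.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A basic Matplotlib plot")
ax.legend()
```

Call `plt.show()` to display the figure in a script, or `fig.savefig(...)` to write it to a file.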

CASE STUDIES

CASE STUDY: MACHINE LEARNING ANALYSIS OF VOC DISTRIBUTION IN FRACTURED AQUIFER

ML STUDY PT. 1 – PRINCIPAL COMPONENT ANALYSIS

DATA CONSISTS OF DEPTH, DIP AND STRIKE, VOCS, AND DESCRIPTION

DATA IS CONVERTED TO A NUMERICAL ARRAY OR MATRIX

Original data table:

Numerical data in array:

Categorical data as target values:
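
One way this conversion could be done with pandas is sketched below. The column names and values are hypothetical; only the kinds of fields (depth, dip, strike, VOCs, description) come from the slides.

```python
import pandas as pd

# Hypothetical fracture log; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "depth_ft": [52.1, 60.4, 75.8, 88.0],
        "dip_deg": [12.0, 65.0, 8.0, 71.0],
        "strike_deg": [140.0, 35.0, 150.0, 40.0],
        "voc_ugl": [850.0, 12.0, 1200.0, 5.0],
        "description": ["bedding", "joint", "bedding", "joint"],
    }
)

# Numerical columns become a 2-D NumPy array (rows = features logged).
numeric_cols = ["depth_ft", "dip_deg", "strike_deg", "voc_ugl"]
X = df[numeric_cols].to_numpy()

# Text descriptions become integer codes; `names` maps codes back to text.
y, names = pd.factorize(df["description"])

print(X.shape, y, list(names))
```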

DATA IS SCALED, THEN DECOMPOSED INTO TWO PRINCIPAL COMPONENTS (EIGENVECTORS)

Scale data:

Transform data with PCA:
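
The slide's code is not preserved here. A minimal sketch with scikit-learn, assuming standardization (zero mean, unit variance) as the scaling step and a random stand-in for the data array, would be:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the numerical array (rows = observations, 4 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Scale each column to zero mean and unit variance (an assumed choice of scaler).
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto its first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component.
print(pca.explained_variance_ratio_)
```

Scaling first matters because depth, orientation, and concentration are on very different numeric ranges, and PCA would otherwise be dominated by the largest one.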

COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED

Target categories show clustering:
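
A plot of this kind could be produced along the following lines. The scores, target codes, and category names are placeholders, not the study's results.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in PCA scores and integer target codes; not the study's data.
rng = np.random.default_rng(1)
X_pca = rng.normal(size=(60, 2))
y = rng.integers(0, 3, size=60)
target_names = ["bedding", "joint", "fault"]  # hypothetical categories

fig, ax = plt.subplots()
for code, name in enumerate(target_names):
    mask = y == code  # rows belonging to this target category
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1], label=name)
ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend(title="Target")
```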

ML STUDY PT. 2 – K-MEANS CLUSTERING

DIP AND VOCS ARE SELECTED FROM SAME DATA AS AN ARRAY

Same dataset as PCA example:

Select dip and VOC columns as array:

DATA IS SCALED AND A NEW VECTOR CREATED WITH FOUR K-MEANS CLUSTERS

Create K-means pipeline:

Fit and predict four clusters:
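
A scikit-learn pipeline for these two steps might be set up as follows. The input array is a random stand-in for the dip and VOC columns, and the choice of `StandardScaler`, `n_init`, and `random_state` are assumptions rather than the presenter's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the dip and VOC columns (two values per row).
rng = np.random.default_rng(0)
X = np.column_stack(
    [rng.uniform(0, 90, size=200), rng.lognormal(3, 1, size=200)]
)

# Chain scaling and clustering so both are applied in one call.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

# One integer cluster label (0 to 3) per row.
labels = pipeline.fit_predict(X)

print(np.bincount(labels))
```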

LABELS ARE ADDED, AND EACH CLUSTER LABEL COUNTED

Original data:

Two new columns with cluster labels:

Counts of cluster labels:
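
Adding the labels and counting them could look like the sketch below. It assumes the two new columns are a numeric cluster label and a descriptive name; the names, labels, and data are invented for illustration.

```python
import pandas as pd

# Made-up data and labels standing in for the K-means output.
df = pd.DataFrame(
    {"dip_deg": [10, 70, 15, 65, 12], "voc_ugl": [900, 8, 1100, 5, 20]}
)
labels = [0, 1, 0, 1, 2]

# Hypothetical descriptive names for each cluster number.
cluster_names = {
    0: "low dip, high VOC",
    1: "high dip, low VOC",
    2: "low dip, low VOC",
}

df["cluster"] = labels
df["cluster_name"] = df["cluster"].map(cluster_names)

# Number of rows assigned to each cluster.
counts = df["cluster_name"].value_counts()
print(counts)
```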

CLUSTERS ARE PLOTTED WITH DEPTH

Clusters plotted with depth:

THE CLUSTER COUNTS INDICATE THAT A HIGHER PERCENTAGE OF LOW-ANGLE FEATURES HAVE HIGH VOCS

[Figure: cluster plot annotated with cluster counts of 71, 30, 29, and 48]

BOTH UNSUPERVISED METHODS INDICATE CLUSTERING BETWEEN VARIABLES

[Figure: cluster plot annotated with cluster counts of 71, 30, 29, and 48]

CASE STUDY: WELL TRANSDUCER DATA PROCESSING AND VISUALIZATION

DATA FROM A SINGLE TRANSDUCER CONTAINS NEARLY 70,000 DATA POINTS

Data collected every minute; ~70,000 records:

SEVERAL LARGE OUTLIERS ARE PRESENT WHERE TRANSDUCER WAS MOVED, ETC.

Large artificial outliers:

OUTLIERS ARE ITERATIVELY REPLACED WITH INTERPOLATED VALUES

Interpolate and impute outliers:
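
The presenter's routine is not shown in the transcript. One possible approach, offered only as a sketch, flags points that sit far from a rolling median, blanks them, fills the gaps by time-based interpolation, and repeats until nothing more is flagged. The function name, window, threshold, and synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd


def replace_outliers(series, window=61, threshold=0.5, max_iter=10):
    """Iteratively replace points far from a rolling median with interpolated values."""
    cleaned = series.copy()
    for _ in range(max_iter):
        median = cleaned.rolling(window, center=True, min_periods=1).median()
        is_outlier = (cleaned - median).abs() > threshold
        if not is_outlier.any():
            break  # nothing left to replace
        cleaned[is_outlier] = np.nan
        cleaned = cleaned.interpolate(method="time", limit_direction="both")
    return cleaned


# Synthetic one-minute water levels with two artificial spikes.
index = pd.date_range("2019-01-01", periods=1000, freq="min")
level = pd.Series(10 + 0.1 * np.sin(np.arange(1000) / 100), index=index)
level.iloc[200:205] += 3.0
level.iloc[600] -= 5.0

cleaned = replace_outliers(level)
print(cleaned.iloc[[200, 600]])
```

The threshold has to be tuned to the record: too small and real water-level changes are flattened, too large and step changes from moving the transducer survive.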

THE RESULTING HYDROGRAPH IS MUCH MORE READABLE

Hydrograph with outliers replaced:

THE PROCESSED DATASET IS THEN RESAMPLED TO THE MEAN HOURLY VALUE

Resample dataset to the hourly mean:

Data is reduced by ~98% to ~1,200 records:
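
With a datetime-indexed pandas Series, the hourly resampling is essentially one line. The series below is synthetic, sized to match the roughly 70,000 one-minute readings described above.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute record of 70,000 readings (about 49 days).
index = pd.date_range("2019-01-01", periods=70_000, freq="min")
level = pd.Series(np.linspace(10.0, 12.0, len(index)), index=index)

# Group readings into hourly bins and take the mean of each bin.
hourly = level.resample("h").mean()

# Fraction of records removed by the resampling.
reduction = 1 - len(hourly) / len(level)
print(len(hourly), f"{reduction:.1%}")
```

Sixty one-minute readings collapse into each hourly value, which is where the reduction of about 98% comes from.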

THE PROCESSED AND REDUCED DATASET CONTAINS THE ESSENTIAL INFORMATION

Hydrograph of reduced dataset:

BY ITERATIVELY REPLACING ARTIFICIAL OUTLIERS AND REDUCING THE SIZE OF THE DATASET, WHAT WOULD TAKE SEVERAL HOURS IN EXCEL IS ACCOMPLISHED IN LESS THAN 1 MINUTE

REVIEW AND FURTHER TOPICS

• Python presents a comprehensive and free open-source toolkit for data processing, analysis, and visualization

– Python libraries and tools, such as NumPy, pandas, Matplotlib, Seaborn, and Jupyter Notebook

– Developed by universities, scientists, independent developers

– Libraries can be used together to process, analyze, and visualize groundwater and environmental data

– More accurate, powerful, and repeatable than Excel, etc.

– Range of applications from simple EDA to complex ML studies

• Python can also be used to interact with other software and codes

– FloPy: MODFLOW library

– PhreeqPy: PHREEQC library

– ArcPy, Python API: interact with and script functions within ESRI’s ArcGIS suite

THANK YOU! QUESTIONS?

Please feel free to contact me at michael.murphy@terraphase.com if you have any questions we cannot get to, or find me around the conference.

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

• Metals data from soil samples

– Potentially contaminated site

– Arsenic is primary COC

– Examine distribution of As with depth to determine possible outliers

• EDA with ‘Seaborn’ Python library, Jupyter Notebooks

– Seaborn: High-level statistical visualization tools built on Matplotlib

– Works seamlessly with pandas DataFrames

– Allows rapid EDA

– Produce publication-quality visualizations

– Jupyter Notebooks: Edit and compile code, view inline plots

Wide-format table of metals data:

Plot all metals against depth:
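
The slide's figure is not preserved. One way to get a comparable plot with Seaborn is to melt the wide table and color the points by metal, as sketched here with made-up concentrations and hypothetical column names.

```python
import pandas as pd
import seaborn as sns

# Hypothetical wide-format soil results (mg/kg); values are made up.
metals = pd.DataFrame(
    {
        "depth_ft": [0.5, 2.0, 5.0, 10.0],
        "arsenic": [12.0, 8.5, 30.0, 4.1],
        "lead": [45.0, 20.0, 15.0, 9.0],
        "zinc": [80.0, 64.0, 55.0, 50.0],
    }
)

# Long format lets Seaborn color the points by metal.
long_df = metals.melt(id_vars="depth_ft", var_name="metal", value_name="result")

ax = sns.scatterplot(data=long_df, x="result", y="depth_ft", hue="metal")
ax.invert_yaxis()  # put the shallowest samples at the top
ax.set_xlabel("Result (mg/kg)")
ax.set_ylabel("Depth (ft)")
```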

Highlight arsenic results:

Box and scatter plot of arsenic vs. depth:
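
A combined box and scatter plot can be built by overlaying `sns.stripplot` on `sns.boxplot`. The depth intervals and arsenic values below are invented for illustration, and the styling is an assumption.

```python
import pandas as pd
import seaborn as sns

# Hypothetical arsenic results (mg/kg) grouped by sample depth interval.
arsenic = pd.DataFrame(
    {
        "depth_interval": ["0-2 ft"] * 3 + ["4-6 ft"] * 3 + ["9-11 ft"] * 3,
        "arsenic": [10.0, 12.5, 48.0, 7.0, 8.2, 9.1, 3.5, 4.0, 4.4],
    }
)

# Boxes summarize each interval; the overlaid points show individual samples.
ax = sns.boxplot(data=arsenic, x="arsenic", y="depth_interval", color="lightgray")
sns.stripplot(data=arsenic, x="arsenic", y="depth_interval", color="black", ax=ax)
```

Plotting the individual points over the boxes makes a single high result, such as the 48 mg/kg value in this made-up set, easy to spot as a possible outlier.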

Violin and scatter plot of arsenic vs. depth:
