Open–Source Python Tools for Environmental Data Processing ...

Page 1: Open–Source Python Tools for Environmental Data Processing ...

Open–Source Python Tools for Environmental Data Processing, Analysis, and Visualization

9/17/2019

Michael J. Murphy

Environmental Data Analyst/Data Scientist and Senior Staff Geologist

Terraphase Engineering Inc.

Page 2: Open–Source Python Tools for Environmental Data Processing ...

OVERVIEW

• The Python programming language

• Why Python?

• Python data analysis basics

• Case studies

• Review

Page 3: Open–Source Python Tools for Environmental Data Processing ...

PYTHON* IS A MODERN PROGRAMMING LANGUAGE THAT IS ESPECIALLY USEFUL FOR QUANTITATIVE DATA ANALYSIS AND SCIENTIFIC PROGRAMMING

*Python is named after Monty Python, not Pythonidae.

Page 4: Open–Source Python Tools for Environmental Data Processing ...

WHY PYTHON INSTEAD OF EXCEL?

Page 5: Open–Source Python Tools for Environmental Data Processing ...

BASIC PYTHON DATA TOOLS

• pandas

– DataFrame table structures

– Access, process, analyze, export and visualize data

• NumPy

– Vast collection of functions for array operations

– Quantitative analysis

• Matplotlib

– High-level plotting functions

– Can produce publication-quality data visualizations

– Works seamlessly with Python data tools such as NumPy and pandas

Page 6: Open–Source Python Tools for Environmental Data Processing ...

EXAMPLE: PANDAS DATAFRAMES

[Figure: a pandas DataFrame with its keys (column labels), values, and index labeled]
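
A minimal sketch of the three DataFrame parts named on this slide (keys, values, index). The table of well data is hypothetical; the column names and numbers are illustrative only and are not from the presentation.

```python
import pandas as pd

# Hypothetical monitoring-well table; all names and values are placeholders.
df = pd.DataFrame(
    {
        "depth_ft": [12.5, 30.0, 18.2],
        "tce_ug_l": [4.1, 0.8, 15.3],
    },
    index=["MW-1", "MW-2", "MW-3"],
)

print(df.index)    # row labels (the index)
print(df.columns)  # column labels (the keys)
print(df.values)   # the underlying 2-D array of values

# Look up a single value by index label and key
print(df.loc["MW-2", "tce_ug_l"])
```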

Page 7: Open–Source Python Tools for Environmental Data Processing ...

EXAMPLE: PANDAS DATAFRAMES

Original wide-format table:

Melted long-format table:
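
One way to go from a wide-format table to a long-format table is `DataFrame.melt`. This is a sketch on a small made-up table; the sample IDs, analytes, and results are assumptions for illustration.

```python
import pandas as pd

# Wide format: one row per sample, one column per analyte (hypothetical data)
wide = pd.DataFrame(
    {
        "sample_id": ["S-1", "S-2"],
        "arsenic": [5.2, 7.9],
        "lead": [12.0, 30.5],
    }
)

# Long format: one row per sample/analyte pair
long = wide.melt(id_vars="sample_id", var_name="analyte", value_name="result")
print(long)
```

The long format is usually easier to group, filter, and pass to plotting libraries that expect one observation per row.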

Page 8: Open–Source Python Tools for Environmental Data Processing ...

EXAMPLE: NUMPY OPERATIONS

Vectorize function:

Draw random values:

Define function to calculate ion balance:
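
The slide's code is not preserved in this transcript, so the following is only a sketch of the three steps it names. It assumes the common charge-balance definition, 100 × (cations − anions) / (cations + anions) with both sums in meq/L; the function name and the random input ranges are hypothetical.

```python
import numpy as np


def ion_balance(cations_meq, anions_meq):
    """Charge-balance error in percent for one sample (inputs in meq/L)."""
    total = cations_meq + anions_meq
    if total == 0:
        return float("nan")
    return 100.0 * (cations_meq - anions_meq) / total


# np.vectorize wraps the scalar function so it accepts whole arrays
ion_balance_vec = np.vectorize(ion_balance)

# Draw random cation and anion sums (placeholder values, not site data)
rng = np.random.default_rng(seed=0)
cations = rng.uniform(5.0, 10.0, size=5)
anions = rng.uniform(5.0, 10.0, size=5)

balance = ion_balance_vec(cations, anions)
print(balance)
```

`np.vectorize` is a convenience wrapper rather than a speed-up; for plain arithmetic like this, applying the formula directly to the arrays gives the same result.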

Page 9: Open–Source Python Tools for Environmental Data Processing ...

EXAMPLE: PLOTTING WITH MATPLOTLIB

A very basic example of a Matplotlib plot:
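
The original plot is not preserved here. A comparable minimal example, using a sine curve as stand-in data, might look like this:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: a sine curve
x = np.linspace(0, 10, 50)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Basic line plot")
ax.legend()
```

Calling `fig.savefig("plot.png", dpi=300)` afterwards would write the figure to a file.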

Page 10: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDIES

Page 11: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: MACHINE LEARNING ANALYSIS OF VOC DISTRIBUTION IN FRACTURED AQUIFER

Page 12: Open–Source Python Tools for Environmental Data Processing ...

ML STUDY PT. 1 – PRINCIPAL COMPONENT ANALYSIS

Page 13: Open–Source Python Tools for Environmental Data Processing ...

DATA CONSISTS OF DEPTH, DIP AND STRIKE, VOCS, AND DESCRIPTION

Page 14: Open–Source Python Tools for Environmental Data Processing ...

DATA IS CONVERTED TO A NUMERICAL ARRAY OR MATRIX

Original data table:

Numerical data in array:

Categorical data as target values:
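
A sketch of this conversion, assuming a table with the columns named on the previous slide (depth, dip, strike, VOCs, description). The column names and values are hypothetical, and `pd.factorize` is one of several ways to encode the categories.

```python
import pandas as pd

# Hypothetical fracture log; column names and values are placeholders.
df = pd.DataFrame(
    {
        "depth_ft": [25.0, 40.5, 61.2, 77.8],
        "dip_deg": [10.0, 65.0, 15.0, 80.0],
        "strike_deg": [120.0, 45.0, 200.0, 310.0],
        "voc_ppb": [850.0, 12.0, 430.0, 5.0],
        "description": ["fracture", "bedding", "fracture", "vein"],
    }
)

# Numerical columns become a 2-D float array (one row per feature)
feature_cols = ["depth_ft", "dip_deg", "strike_deg", "voc_ppb"]
X = df[feature_cols].to_numpy(dtype=float)

# factorize maps each category to an integer code in order of first appearance
y, target_names = pd.factorize(df["description"])

print(X.shape)
print(y, list(target_names))
```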

Page 15: Open–Source Python Tools for Environmental Data Processing ...

DATA IS SCALED, THEN DECOMPOSED INTO TWO PRINCIPAL COMPONENTS (EIGENVECTORS)

Scale data:

Transform data with PCA:
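
The two steps can be sketched with scikit-learn, assuming that is the library used (the transcript does not preserve the code). The input here is synthetic random data standing in for the four numerical columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical array (50 rows, 4 columns)
rng = np.random.default_rng(seed=1)
X = rng.normal(size=(50, 4))

# Scale each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto its first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```

Scaling first matters because PCA is sensitive to units; without it, the column with the largest numeric range would tend to dominate the components.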

Page 16: Open–Source Python Tools for Environmental Data Processing ...

COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED

Page 17: Open–Source Python Tools for Environmental Data Processing ...

COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED

Target categories show clustering:
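
A plot of this kind could be drawn with one scatter call per target category. The sketch below uses random placeholder arrays in place of the PCA output and hypothetical category names, so it will not show the clustering seen in the real data.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=2)
components = rng.normal(size=(30, 2))           # placeholder for the PCA output
y = rng.integers(0, 3, size=30)                 # placeholder integer target codes
target_names = ["fracture", "bedding", "vein"]  # hypothetical category names

fig, ax = plt.subplots()
for code, name in enumerate(target_names):
    mask = y == code
    # Component 1 on the x-axis, component 2 on the y-axis
    ax.scatter(components[mask, 0], components[mask, 1], label=name)

ax.set_xlabel("Component 1")
ax.set_ylabel("Component 2")
ax.legend(title="Description")
```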

Page 18: Open–Source Python Tools for Environmental Data Processing ...

ML STUDY PT. 2 – K-MEANS CLUSTERING

Page 19: Open–Source Python Tools for Environmental Data Processing ...

DIP AND VOCS ARE SELECTED FROM SAME DATA AS AN ARRAY

Same dataset as PCA example:

Select dip and VOC columns as array:

Page 20: Open–Source Python Tools for Environmental Data Processing ...

DATA IS SCALED AND A NEW VECTOR CREATED WITH FOUR K-MEANS CLUSTERS

Create K-means pipeline:

Fit and predict four clusters:
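
A sketch of a scaling plus K-means pipeline in scikit-learn, assuming that library was used. The dip and VOC values are synthetic, and `n_init` and `random_state` are choices made here for reproducibility, not settings from the presentation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the dip and VOC columns
rng = np.random.default_rng(seed=3)
dip = rng.uniform(0, 90, size=100)
voc = rng.lognormal(mean=3.0, sigma=1.0, size=100)
X = np.column_stack([dip, voc])

# Scaling and clustering chained into one estimator
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

# One integer cluster label (0 to 3) per row of X
labels = pipeline.fit_predict(X)
print(labels[:10])
```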

Page 21: Open–Source Python Tools for Environmental Data Processing ...

LABELS ARE ADDED, AND EACH CLUSTER LABEL COUNTED

Original data:

Two new columns with cluster labels:

Counts of cluster labels:
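
The transcript does not say what the two new columns contain. One plausible arrangement, assumed here, is a numeric cluster label plus a descriptive name mapped from it; the data, labels, and names below are all hypothetical.

```python
import pandas as pd

# Hypothetical data and cluster labels (placeholders for the K-means output)
df = pd.DataFrame(
    {
        "dip_deg": [5.0, 70.0, 12.0, 85.0, 8.0],
        "voc_ppb": [900.0, 10.0, 20.0, 700.0, 850.0],
    }
)
labels = [2, 0, 1, 3, 2]

# Hypothetical descriptive name for each numeric label
cluster_names = {
    0: "high dip, low VOC",
    1: "low dip, low VOC",
    2: "low dip, high VOC",
    3: "high dip, high VOC",
}

df["cluster"] = labels
df["cluster_name"] = df["cluster"].map(cluster_names)

# Number of rows assigned to each cluster
counts = df["cluster_name"].value_counts()
print(counts)
```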

Page 22: Open–Source Python Tools for Environmental Data Processing ...

CLUSTERS ARE PLOTTED WITH DEPTH

Clusters plotted with depth:

Page 23: Open–Source Python Tools for Environmental Data Processing ...

THE CLUSTER COUNTS INDICATE THAT A HIGHER PERCENTAGE OF LOW-ANGLE FEATURES HAVE HIGH VOCS

Count = 71

Count = 30

Count = 29

Count = 48

Page 24: Open–Source Python Tools for Environmental Data Processing ...

BOTH UNSUPERVISED METHODS INDICATE CLUSTERING BETWEEN VARIABLES

Count = 71

Count = 30

Count = 29

Count = 48

Page 25: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: WELL TRANSDUCER DATA PROCESSING AND VISUALIZATION

Page 26: Open–Source Python Tools for Environmental Data Processing ...

DATA FROM A SINGLE TRANSDUCER CONTAINS NEARLY 70,000 DATA POINTS

Data collected every 1 minute; ~70,000 records:

Page 27: Open–Source Python Tools for Environmental Data Processing ...

SEVERAL LARGE OUTLIERS ARE PRESENT WHERE TRANSDUCER WAS MOVED, ETC.

Large artificial outliers:

Page 28: Open–Source Python Tools for Environmental Data Processing ...

OUTLIERS ARE ITERATIVELY REPLACED WITH INTERPOLATED VALUES

Interpolate and impute outliers:
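
The presentation's outlier criterion is not preserved in this transcript. The sketch below assumes one simple approach: flag points that deviate from a centered rolling median by more than a fixed threshold, replace them with time-interpolated values, and repeat until none remain. The function name, window, threshold, and test series are all hypothetical.

```python
import numpy as np
import pandas as pd


def replace_outliers(series, window=61, threshold=0.5, max_iter=10):
    """Iteratively replace points far from a rolling median with interpolated values."""
    cleaned = series.copy()
    for _ in range(max_iter):
        median = cleaned.rolling(window, center=True, min_periods=1).median()
        is_outlier = (cleaned - median).abs() > threshold
        if not is_outlier.any():
            break  # nothing left to replace
        cleaned[is_outlier] = np.nan
        # Fill the gaps from neighboring readings, weighted by time
        cleaned = cleaned.interpolate(method="time").bfill().ffill()
    return cleaned


# Synthetic one-minute water-level record with two artificial disturbances
index = pd.date_range("2019-01-01", periods=1000, freq="min")
level = pd.Series(10.0 + 0.2 * np.sin(np.linspace(0, 6 * np.pi, 1000)), index=index)
level.iloc[200:205] = 25.0  # simulated spike, e.g. transducer pulled from the well
level.iloc[700] = -3.0      # simulated single bad reading

cleaned = replace_outliers(level)
print(level.max(), cleaned.max())
```

The window and threshold would need tuning for real data, since a threshold that is too tight can flatten genuine water-level changes.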

Page 29: Open–Source Python Tools for Environmental Data Processing ...

THE RESULTING HYDROGRAPH IS MUCH MORE READABLE

Hydrograph with outliers replaced:

Page 30: Open–Source Python Tools for Environmental Data Processing ...

THE PROCESSED DATASET IS THEN RESAMPLED TO THE MEAN HOURLY VALUE

Resample dataset to the hourly mean:

Data is reduced by ~98% to ~1,200 records:
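
With a datetime-indexed pandas Series, hourly averaging is a single `resample` call. The sketch below uses a synthetic one-minute record of 70,000 readings to show the reduction in size; the start date and values are made up.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute record with 70,000 readings
index = pd.date_range("2019-01-01", periods=70_000, freq="min")
rng = np.random.default_rng(seed=4)
level = pd.Series(10.0 + rng.normal(scale=0.01, size=len(index)), index=index)

# Mean of all readings within each clock hour
hourly = level.resample("h").mean()

reduction = 1 - len(hourly) / len(level)
print(len(level), len(hourly), f"{reduction:.1%}")
```

Sixty one-minute readings collapse into each hourly value, which is where the reduction of roughly 98% comes from.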

Page 31: Open–Source Python Tools for Environmental Data Processing ...

THE PROCESSED AND REDUCED DATASET CONTAINS THE ESSENTIAL INFORMATION

Hydrograph of reduced dataset:

Page 32: Open–Source Python Tools for Environmental Data Processing ...

BY ITERATIVELY REPLACING ARTIFICIAL OUTLIERS AND REDUCING THE SIZE OF THE DATASET, WHAT WOULD TAKE SEVERAL HOURS IN EXCEL IS ACCOMPLISHED IN LESS THAN 1 MINUTE

Page 33: Open–Source Python Tools for Environmental Data Processing ...

REVIEW AND FURTHER TOPICS

• Python presents a comprehensive and free open-source toolkit for data processing, analysis and visualization

– Python libraries and tools such as NumPy, pandas, Matplotlib, Seaborn, and Jupyter Notebook

– Developed by universities, scientists, independent developers

– Libraries can be used together to process, analyze, and visualize groundwater and environmental data

– More accurate, powerful, and repeatable than Excel, etc.

– Range of applications from simple EDA to complex ML studies

• Python can also be used to interact with other software and codes

– FloPy: MODFLOW library

– PhreeqPy: PHREEQC library

– ArcPy, Python API: interact with and script functions within ESRI’s ArcGIS suite

Page 34: Open–Source Python Tools for Environmental Data Processing ...

THANK YOU! QUESTIONS?

Please feel free to contact me at [email protected] if you have any questions we cannot get to, or find me around the conference.

Page 35: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

• Metals data from soil samples

– Potentially contaminated site

– Arsenic is primary COC

– Examine distribution of As with depth to determine possible outliers

• EDA with ‘Seaborn’ Python library, Jupyter Notebooks

– Seaborn: High-level statistical visualization tools built on Matplotlib

– Works seamlessly with pandas DataFrames

– Allows rapid EDA

– Produce publication-quality visualizations

– Jupyter Notebooks: Edit and run code, view inline plots

Page 36: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Wide-format table of metals data:

Page 37: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Plot all metals against depth:
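
A sketch of one way to produce such a plot with Seaborn: melt the wide table to long format, then color points by metal. The table, column names, and the log-scaled concentration axis are assumptions for illustration, not the presenter's settings.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical wide-format soil results (mg/kg); values are placeholders.
wide = pd.DataFrame(
    {
        "depth_ft": [0.5, 0.5, 2.0, 2.0, 5.0, 5.0],
        "arsenic": [12.0, 9.5, 6.1, 30.2, 4.0, 3.8],
        "lead": [40.0, 55.0, 22.0, 18.0, 10.0, 12.0],
        "chromium": [25.0, 28.0, 30.0, 27.0, 26.0, 24.0],
    }
)

# One row per depth/metal pair so Seaborn can color by metal
long = wide.melt(id_vars="depth_ft", var_name="metal", value_name="result_mg_kg")

fig, ax = plt.subplots()
sns.scatterplot(data=long, x="result_mg_kg", y="depth_ft", hue="metal", ax=ax)
ax.set_xscale("log")
ax.invert_yaxis()  # depth increases downward
```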

Page 38: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Page 39: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Highlight arsenic results:

Page 40: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Page 41: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Box and scatter plot of arsenic vs. depth:
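
A combined box and scatter plot can be approximated in Seaborn by drawing `boxplot` and `stripplot` on the same axes. The arsenic values and depth intervals below are made up, and the styling is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical arsenic results at three sample depths (placeholder values)
arsenic = pd.DataFrame(
    {
        "depth_ft": [0.5, 0.5, 0.5, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0],
        "arsenic_mg_kg": [12.0, 9.5, 11.2, 6.1, 30.2, 7.4, 4.0, 3.8, 4.6],
    }
)

fig, ax = plt.subplots()
# Boxes summarize the distribution at each depth
sns.boxplot(data=arsenic, x="depth_ft", y="arsenic_mg_kg", color="lightgray", ax=ax)
# Individual results overlaid so possible outliers stay visible
sns.stripplot(data=arsenic, x="depth_ft", y="arsenic_mg_kg", color="black", ax=ax)
```

Swapping `sns.boxplot` for `sns.violinplot` with the same arguments gives a violin version of the same figure.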

Page 42: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Page 43: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH

Violin and scatter plot of arsenic vs. depth:

Page 44: Open–Source Python Tools for Environmental Data Processing ...

CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
