Open-Source Python Tools for Environmental Data Processing, Analysis, and Visualization
9/17/2019
Michael J. Murphy
Environmental Data Analyst/Data Scientist and Senior Staff Geologist
Terraphase Engineering Inc.
OVERVIEW
• The Python programming language
• Why Python?
• Python data analysis basics
• Case studies
• Review
PYTHON* IS A MODERN PROGRAMMING LANGUAGE THAT IS ESPECIALLY USEFUL FOR QUANTITATIVE DATA ANALYSIS AND SCIENTIFIC PROGRAMMING
*Python is named after Monty Python, not Pythonidae.
WHY PYTHON INSTEAD OF EXCEL?
BASIC PYTHON DATA TOOLS
• pandas
– DataFrame table structures
– Access, process, analyze, export, and visualize data
• NumPy
– Vast collection of functions for array operations
– Quantitative analysis
• Matplotlib
– High-level plotting functions
– Can produce publication-quality data visualizations
– Works seamlessly with Python data tools such as NumPy and pandas
EXAMPLE: PANDAS DATAFRAMES
[Figure: pandas DataFrame with its index, keys, and values labeled]
Original wide-format table:
Melted long-format table:
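The wide-to-long reshaping can be sketched as follows. This is a minimal illustration, not the table shown on the slide: the well IDs, analyte columns, and values are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical wide-format table: one row per well, one column per analyte.
wide = pd.DataFrame(
    {
        "well": ["MW-1", "MW-2"],
        "TCE": [5.0, 12.0],
        "PCE": [1.5, 0.8],
    }
)

# Melt to long format: one row per (well, analyte) pair.
long_df = wide.melt(id_vars="well", var_name="analyte", value_name="result")
print(long_df)
```

Long format is usually the more convenient shape for grouping, filtering, and plotting by analyte.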
EXAMPLE: NUMPY OPERATIONS
Vectorize function:
Draw random values:
Define function to calculate ion balance:
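The three steps named on this slide might look like the sketch below. The slide's own code is not in the transcript, so this assumes the common charge-balance-error formula, 100 × (cations − anions) / (cations + anions), with both sums in meq/L. The function name, value ranges, and seed are illustrative.

```python
import numpy as np


def ion_balance(cations, anions):
    """Charge-balance error in percent for scalar cation and anion sums (meq/L)."""
    total = cations + anions
    if total == 0:  # scalar-only branch: this line would fail on whole arrays
        return float("nan")
    return 100.0 * (cations - anions) / total


# Vectorize the scalar function so it can be applied element-wise to arrays.
# np.vectorize is a convenience wrapper, not a performance optimization.
ion_balance_vec = np.vectorize(ion_balance, otypes=[float])

# Draw hypothetical cation and anion sums from a seeded random generator.
rng = np.random.default_rng(seed=0)
cations = rng.uniform(4.0, 6.0, size=5)
anions = rng.uniform(4.0, 6.0, size=5)

balance = ion_balance_vec(cations, anions)
print(balance.round(2))
```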
EXAMPLE: PLOTTING WITH MATPLOTLIB
A very basic example of a Matplotlib plot:
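The slide's code is not preserved in the transcript, so here is a stand-in of similar scope. The plotted data and the output file name are arbitrary choices.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A very basic Matplotlib plot")
ax.legend()
fig.savefig("basic_plot.png", dpi=150)
```

In a Jupyter Notebook the backend line can be dropped and the figure appears inline.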
CASE STUDIES
CASE STUDY: MACHINE LEARNING ANALYSIS OF VOC DISTRIBUTION IN FRACTURED AQUIFER
ML STUDY PT. 1 – PRINCIPAL COMPONENT ANALYSIS
DATA CONSISTS OF DEPTH, DIP AND STRIKE, VOCS, AND DESCRIPTION
DATA IS CONVERTED TO A NUMERICAL ARRAY OR MATRIX
Original data table:
Numerical data in array:
Categorical data as target values:
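One way to make this conversion with pandas is sketched below. The column names and values are hypothetical, and `pd.factorize` is only one of several ways to encode the description column.

```python
import pandas as pd

# Hypothetical fracture log; names and values are placeholders.
df = pd.DataFrame(
    {
        "depth_ft": [52.1, 60.4, 75.0, 88.3],
        "dip_deg": [12.0, 65.0, 8.0, 70.0],
        "strike_deg": [110.0, 245.0, 95.0, 260.0],
        "voc_ugl": [340.0, 15.0, 410.0, 22.0],
        "description": ["bedding", "joint", "bedding", "joint"],
    }
)

# Numerical columns become a 2-D float array (rows = features, columns = variables).
X = df[["depth_ft", "dip_deg", "strike_deg", "voc_ugl"]].to_numpy(dtype=float)

# The categorical description becomes integer target codes plus the category names.
y, target_names = pd.factorize(df["description"])
print(X.shape, y, list(target_names))
```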
DATA IS SCALED, THEN DECOMPOSED INTO TWO PRINCIPAL COMPONENTS (EIGENVECTORS)
Scale data:
Transform data with PCA:
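With scikit-learn, the two steps could be written as below. This assumes `StandardScaler` for the scaling step, which the slide does not specify. The random array stands in for the depth, dip, strike, and VOC columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 50 features with 4 numerical variables each.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(50, 4))

# Scale each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled data onto its first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

Plotting `components[:, 1]` against `components[:, 0]`, colored by target category, gives the kind of scatter plot the next slide describes.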
COMPONENT 2 IS PLOTTED AGAINST COMPONENT 1, AND TARGET NAMES ARE ASSIGNED
Target categories show clustering:
ML STUDY PT. 2 – K-MEANS CLUSTERING
DIP AND VOCS ARE SELECTED FROM SAME DATA AS AN ARRAY
Same dataset as PCA example:
Select dip and VOC columns as array:
DATA IS SCALED AND A NEW VECTOR CREATED WITH FOUR K-MEANS CLUSTERS
Create K-means pipeline:
Fit and predict four clusters:
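A scikit-learn pipeline for this step might look like the sketch below. It assumes `StandardScaler` followed by `KMeans`, and the `n_init` and `random_state` settings are illustrative choices, not values from the presentation. The two-column array is a synthetic stand-in for dip and VOC concentration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder two-column array: dip (degrees) and VOC concentration.
rng = np.random.default_rng(seed=0)
X = np.column_stack([rng.uniform(0, 90, 120), rng.lognormal(3, 1, 120)])

# Chain scaling and clustering so both are applied in one call.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)

# Fit the pipeline and return one cluster label (0-3) per row.
labels = pipeline.fit_predict(X)
print(labels[:10])
```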
LABELS ARE ADDED, AND EACH CLUSTER LABEL COUNTED
Original data:
Two new columns with cluster labels:
Counts of cluster labels:
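Adding the labels and counting them takes only a few lines of pandas. In this sketch the label list is a hard-coded placeholder for the clustering output, and the descriptive cluster names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dip_deg": [10.0, 12.0, 70.0, 65.0, 8.0],
        "voc_ugl": [300.0, 20.0, 15.0, 350.0, 280.0],
    }
)
labels = [0, 1, 2, 3, 0]  # placeholder for the output of a clustering step

# Hypothetical descriptive names for the four numeric cluster labels.
names = {
    0: "low dip / high VOC",
    1: "low dip / low VOC",
    2: "high dip / low VOC",
    3: "high dip / high VOC",
}

# Two new columns: the numeric label and its descriptive name.
df["cluster"] = labels
df["cluster_name"] = df["cluster"].map(names)

# Count how many rows fall in each cluster.
counts = df["cluster_name"].value_counts()
print(counts)
```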
CLUSTERS ARE PLOTTED WITH DEPTH
Clusters plotted with depth:
THE CLUSTER COUNTS INDICATE THAT A HIGHER PERCENTAGE OF LOW-ANGLE FEATURES HAVE HIGH VOCS
Count = 71
Count = 30
Count = 29
Count = 48
BOTH UNSUPERVISED METHODS INDICATE CLUSTERING BETWEEN VARIABLES
CASE STUDY: WELL TRANSDUCER DATA PROCESSING AND VISUALIZATION
DATA FROM A SINGLE TRANSDUCER CONTAINS NEARLY 70,000 DATA POINTS
Data collected every minute; ~70,000 records:
SEVERAL LARGE OUTLIERS ARE PRESENT WHERE TRANSDUCER WAS MOVED, ETC.
Large artificial outliers:
OUTLIERS ARE ITERATIVELY REPLACED WITH INTERPOLATED VALUES
Interpolate and impute outliers:
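The presenter's routine is not included in the transcript. The sketch below shows one plausible approach, labeled here as an assumption: flag points that deviate from a centered rolling median by more than a threshold, blank them, fill the gaps by time-based interpolation, and repeat until nothing is flagged. The function name, window, threshold, and synthetic record are all hypothetical.

```python
import numpy as np
import pandas as pd


def replace_outliers(series, window=61, threshold=1.0, max_passes=5):
    """Iteratively replace spikes with time-interpolated values (illustrative method)."""
    cleaned = series.copy()
    for _ in range(max_passes):
        # Local baseline: centered rolling median, robust to short spikes.
        baseline = cleaned.rolling(window, center=True, min_periods=1).median()
        is_outlier = (cleaned - baseline).abs() > threshold
        if not is_outlier.any():
            break
        # Blank the flagged points, then fill them from their neighbors in time.
        cleaned = cleaned.mask(is_outlier)
        cleaned = cleaned.interpolate(method="time", limit_direction="both")
    return cleaned


# Synthetic one-minute water-level record with two artificial spikes.
index = pd.date_range("2019-01-01", periods=600, freq="min")
level = pd.Series(10.0 + 0.1 * np.sin(np.arange(600) / 50.0), index=index)
level.iloc[100:105] += 5.0
level.iloc[400] -= 8.0

cleaned = replace_outliers(level)
print(level.max(), cleaned.max())
```

The window and threshold would need tuning to the real record, for example so that genuine pumping responses are not treated as outliers.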
THE RESULTING HYDROGRAPH IS MUCH MORE READABLE
Hydrograph with outliers replaced:
THE PROCESSED DATASET IS THEN RESAMPLED TO THE MEAN HOURLY VALUE
Resample dataset to the hourly mean:
Data is reduced by ~98% to ~1,200 records:
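Hourly resampling is a single call in pandas. This sketch uses a synthetic three-day, one-minute record in place of the transducer data. The series name is a placeholder.

```python
import numpy as np
import pandas as pd

# Synthetic one-minute record covering three days (4,320 readings).
index = pd.date_range("2019-01-01", periods=3 * 24 * 60, freq="min")
level = pd.Series(
    10.0 + 0.1 * np.sin(np.arange(index.size) / 500.0),
    index=index,
    name="water_level",
)

# Resample to the mean of each 60-minute bin.
hourly = level.resample("60min").mean()

reduction = 1 - hourly.size / level.size
print(level.size, "->", hourly.size, f"({reduction:.0%} fewer records)")
```

Going from one-minute to hourly values keeps 1 record in 60, which matches the roughly 98% reduction reported on the slide.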
THE PROCESSED AND REDUCED DATASET CONTAINS THE ESSENTIAL INFORMATION
Hydrograph of reduced dataset:
BY ITERATIVELY REPLACING ARTIFICIAL OUTLIERS AND REDUCING THE SIZE OF THE DATASET, WHAT WOULD TAKE SEVERAL HOURS IN EXCEL IS ACCOMPLISHED IN LESS THAN 1 MINUTE
REVIEW AND FURTHER TOPICS
• Python presents a comprehensive and free open-source toolkit for data processing, analysis, and visualization
– Python libraries, such as NumPy, pandas, Matplotlib, Seaborn, and Jupyter Notebook
– Developed by universities, scientists, and independent developers
– Libraries can be used together to process, analyze, and visualize groundwater and environmental data
– More accurate, powerful, and repeatable than Excel, etc.
– Range of applications from simple EDA to complex ML studies
• Python can also be used to interact with other software and codes
– FloPy: MODFLOW library
– PHREEQPY: PHREEQC library
– ArcPy, Python API: interact with and script functions within ESRI’s ArcGIS suite
THANK YOU! QUESTIONS?
Please feel free to contact me at [email protected] if you have
any questions we cannot get to, or find me around the conference.
CASE STUDY: EDA TO VISUALIZE METALS DISTRIBUTIONS WITH DEPTH
• Metals data from soil samples
– Potentially contaminated site
– Arsenic is primary COC
– Examine distribution of As with depth to determine possible outliers
• EDA with the Seaborn Python library and Jupyter Notebooks
– Seaborn: high-level statistical visualization tools built on Matplotlib
– Works seamlessly with pandas DataFrames
– Allows rapid EDA
– Produces publication-quality visualizations
– Jupyter Notebooks: edit and run code, view inline plots
Wide-format table of metals data:
Plot all metals against depth:
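A plot of this kind can be sketched with pandas and Seaborn as follows. The metals table, column names, and units are hypothetical, and the single-axes scatter plot is just one possible layout, not necessarily the one used in the presentation.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical wide-format metals table: one row per sample depth.
metals = pd.DataFrame(
    {
        "depth_ft": [0.5, 2.0, 5.0, 10.0],
        "As": [12.0, 8.5, 30.0, 6.1],
        "Pb": [40.0, 22.0, 18.0, 15.0],
        "Zn": [85.0, 60.0, 55.0, 50.0],
    }
)

# Melt to long format so Seaborn can color the points by metal.
long_df = metals.melt(id_vars="depth_ft", var_name="metal", value_name="conc_mg_kg")

fig, ax = plt.subplots()
sns.scatterplot(data=long_df, x="conc_mg_kg", y="depth_ft", hue="metal", ax=ax)
ax.invert_yaxis()  # depth increases downward, as on a boring log
fig.savefig("metals_vs_depth.png", dpi=150)
```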
Highlight arsenic results:
Box and scatter plot of arsenic vs. depth:
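One way to build a combined box and scatter plot in Seaborn is to draw `boxplot` and `stripplot` on the same axes, as in this sketch. The depth intervals and arsenic values are invented for illustration, including the single elevated shallow result.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical arsenic results grouped by depth interval.
arsenic = pd.DataFrame(
    {
        "depth_interval": ["0-2 ft"] * 4 + ["2-5 ft"] * 4 + ["5-10 ft"] * 4,
        "As_mg_kg": [12.0, 9.5, 14.0, 55.0, 8.0, 7.5, 10.0, 9.0, 6.0, 5.5, 7.0, 6.5],
    }
)

fig, ax = plt.subplots()
# Boxes summarize each interval; fliers are hidden because the points are drawn next.
sns.boxplot(
    data=arsenic, x="As_mg_kg", y="depth_interval",
    color="lightgray", showfliers=False, ax=ax,
)
# Overlay every individual result so possible outliers stay visible.
sns.stripplot(
    data=arsenic, x="As_mg_kg", y="depth_interval",
    color="black", size=4, ax=ax,
)
fig.savefig("arsenic_box_scatter.png", dpi=150)
```

`sns.violinplot` accepts the same `data`, `x`, and `y` arguments, so the box layer can be swapped for a violin.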
Violin and scatter plot of arsenic vs. depth: