John Porter, Sheng Shan Lu, M. Gastil Gastil-Buhl
With special thanks to Chau-Chin Lin and Chi-Wen Hsaio
Maximizing the potential of LTER data to be used to make new ecological discoveries
Moving from the era of single datasets to large-scale data integration
Tens to hundreds of datasets
A first step to achieving this goal is to automate the mechanical processes associated with data ingestion into analytical software
We want to:
Identify a dataset in the LTER Network Information System
Download it
Write an R statistical program to read the data
Produce basic statistical summaries of the ingested data
How long should that process take?
With our tools we can do that in less than 1 minute!
Tool: TFRI – R module
Description: Web-form-based system takes you through a multistep process to ingest data, do a basic quality assurance analysis and simple analyses
Works with Metacat: Manual data download
Works with PASTA: Manual data download

Tool: StatProg
Description: Web-form-based system that generates R, SAS, SPSS or Matlab programs that can be edited to process data
Works with Metacat: Manual data download
Works with PASTA: Manual data download

Tool: PASTAprog
Description: Web service that returns a ready-to-use R, SAS or Matlab program. Can be run directly from inside R for 1-minute analyses!
Works with Metacat: Variable (some automated, some manual)
Works with PASTA: Fully automated download
Note: You do NOT need to have R installed on your PC to use this. It is entirely web-based.
Don’t be worried by the buttons! A fully English version is available at the URL above
Metadata Display
Statistical Functions
Raw Data Upload
Select number type of the field
Include the field in R code (select at least one)
EML metadata is transformed into HTML by an XSL stylesheet
No field header
Upload
Data Check Functions (only for numerical attributes!)
Correct domain (real, integer)
Range Checks
Action Options:
Edit records with bad values
Set all the bad values to missing (NA)
Eliminate all the records with bad values
Ignore all the range check problems (just for value range errors)
Data Type Error
Value Range Error: Select the 'Set all the bad values to missing (NA)' option
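The non-interactive action options above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code the TFRI R module actually runs: the function name, the action names and the example limits are all assumptions made for this sketch, and NaN stands in for R's NA.

```python
import math

def range_check(values, low, high, action="set_missing"):
    """Apply a range check to a list of numeric values.

    The action names are hypothetical labels for the form's options:
      "set_missing": replace out-of-range values with NaN (stand-in for NA)
      "drop":        eliminate the records with out-of-range values
      "ignore":      leave every value untouched
    """
    def is_bad(value):
        # NaN also fails this comparison, so it stays missing.
        return not (low <= value <= high)

    if action == "ignore":
        return list(values)
    if action == "drop":
        return [v for v in values if not is_bad(v)]
    if action == "set_missing":
        return [math.nan if is_bad(v) else v for v in values]
    raise ValueError(f"unknown action: {action}")

# Made-up air temperatures; -999.0 is a typical "bad value" placeholder.
temps = [12.5, 14.0, -999.0, 13.2]
cleaned = range_check(temps, low=-50.0, high=60.0, action="set_missing")
dropped = range_check(temps, low=-50.0, high=60.0, action="drop")
```

Setting bad values to missing keeps the record count unchanged, which matters if the rows are later merged with other datasets by position or date.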
The message shown for a "No data" error
This line cannot be modified
Rest of the R program CAN be modified to reflect your analyses
Select program type
Specify Metadata Document to Use
You can get the Package ID from the LTER Metadata catalog. Download a copy of the data, while you are there!
Or, you can specify a metadata document on a site server by giving the full URL
Importantly, you need to edit the program to point to where the data is stored on YOUR computer, so the program can find it!
The previous form-based programs have been available for several years
Their performance has improved as metadata has gotten better
But they still can be slower to use than we would like, requiring manual editing and steps
The advent of the LTER PASTA system makes possible truly automated ingestion and analysis using a web service
R “source” function specifying the web service URL and that we want to “echo” our commands to the screen
Package ID from the PASTA Data Portal
DONE! Our analysis has been run, and basic statistical summaries have been created for each of the attributes.
You can now add additional commands to generate graphics etc. or merge to other datasets
Base URL: http://www.vcrlter.virginia.edu/webservice/PASTAprog/
Plus a Package ID (available on the PASTA portal)
E.g., knb-lter-vcr.26.14 (Scope: knb-lter-vcr, ID: 26, Revision: 14)
Plus a suffix indicating the type of program you want (.r, .sas, .spss or .m for R, SAS, SPSS or Matlab)
http://www.vcrlter.virginia.edu/webservice/PASTAprog/knb-lter-vcr.26.14.r
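The URL recipe above (base URL, plus Package ID, plus suffix) can be written out as a small helper. This is a sketch in Python for illustration only. The base URL, the example Package ID and the suffixes come from the slides; the function names and the assumption that a Package ID always has the form scope.ID.revision with numeric ID and revision are ours.

```python
import re

# Base URL and suffixes as given in the slides.
BASE_URL = "http://www.vcrlter.virginia.edu/webservice/PASTAprog/"
SUFFIXES = {"R": ".r", "SAS": ".sas", "SPSS": ".spss", "Matlab": ".m"}

def parse_package_id(package_id):
    """Split a Package ID such as 'knb-lter-vcr.26.14' into its parts.

    Assumes the form scope.ID.revision with numeric ID and revision.
    """
    match = re.fullmatch(r"(?P<scope>.+)\.(?P<id>\d+)\.(?P<revision>\d+)",
                         package_id)
    if match is None:
        raise ValueError(f"not a scope.ID.revision Package ID: {package_id!r}")
    return match["scope"], int(match["id"]), int(match["revision"])

def pastaprog_url(package_id, program_type="R"):
    """Build the web service URL for a Package ID and program type."""
    parse_package_id(package_id)  # raises if the ID is malformed
    return BASE_URL + package_id + SUFFIXES[program_type]

url = pastaprog_url("knb-lter-vcr.26.14", "R")
parts = parse_package_id("knb-lter-vcr.26.14")
```

The resulting URL is the string you would hand to the R “source” function with echo turned on, or paste into a web browser.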
You can also use the web service URL in a web browser to get a text copy of your program
Note: There are other options that will let you use the web service for data OUTSIDE PASTA by specifying the URL of the EML metadata separately
Problems with Metadata
Lead to lack of congruency between the description of the data and the data itself*
Bad practices in metadata, e.g., using special characters, spaces or mathematical operations as part of the attribute names
Links to data in the metadata may not properly lead directly to data*
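As an illustration of the attribute-name problem, the Python sketch below flags names that contain spaces, special characters or mathematical operators, and proposes a plainer replacement. The rule used here (letters, digits and underscores only, starting with a letter) is our assumption of a conservative naming rule, not something the tools themselves enforce, and the example names are made up.

```python
import re

# Hypothetical "safe" rule: a letter followed by letters, digits or underscores.
_SIMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_simple_name(name):
    """True if the attribute name needs no quoting in most statistical software."""
    return _SIMPLE_NAME.fullmatch(name) is not None

def safe_name(name):
    """Replace runs of problem characters with '_' and force a leading letter."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")
    if not cleaned or not cleaned[0].isalpha():
        cleaned = "x_" + cleaned
    return cleaned

# Made-up attribute names of the kind that cause trouble.
names = ["site", "air temp (C)", "N+P", "2nd_reading"]
flagged = [n for n in names if not is_simple_name(n)]
renamed = [safe_name(n) for n in names]
```

A check like this is best run on the metadata before it is published, so that generated programs and the data headers agree on the names.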
Problems with Data
Inconsistent coding (character data where numbers are expected) – causes conversion of numerical data into R “factors”
Dates – often are handled in different ways
????? – these systems need additional testing on a wide array of data – and you can help!
* Much improved by PASTA system over earlier Metacat
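The inconsistent-coding problem can be caught before the data reach R. The Python sketch below lists the entries in a supposedly numeric column that will not parse as numbers. The function name, the list of missing-value codes and the sample column are assumptions for illustration; real datasets document their own missing-value codes in the metadata.

```python
def non_numeric_entries(values, missing_codes=("", "NA")):
    """Return (row number, raw text) for entries that are not numbers.

    Entries matching one of the missing-value codes are skipped.
    Row numbers start at 1 to match what a data manager sees in the file.
    """
    problems = []
    for row, raw in enumerate(values, start=1):
        text = raw.strip()
        if text in missing_codes:
            continue
        try:
            float(text)
        except ValueError:
            problems.append((row, raw))
    return problems

# Made-up column: "trace" and a decimal comma are typical offenders.
column = ["3.2", "4.1", "trace", "NA", "5,0", "6"]
problems = non_numeric_entries(column)
```

Even one such entry is enough to make a whole column arrive as non-numeric, so reporting the row numbers makes the fix quick.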