Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos...
![Page 1: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/1.jpg)
Field Based Data Validation: a very real experience in wrangling data, taxonomic
names, and photos
Moorea Biocode Project, supported by the Gordon and Betty Moore Foundation
Presentation by John Deck, University of California at Berkeley
![Page 2: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/2.jpg)
Outline
• Part 1: Background on Moorea Biocode Project
• Part 2: bioValidator: field based data validation
• Part 3: A case study in handling taxonomic names in a field based client application
![Page 3: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/3.jpg)
Part 1: Background on Moorea Biocode Project
![Page 4: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/4.jpg)
Moorea Biocode: The Collecting
![Page 5: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/5.jpg)
The Sorting
Moorea Biocode: Sorting Specimens
![Page 6: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/6.jpg)
Moorea Biocode: Tissue Sampling
![Page 7: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/7.jpg)
Moorea Biocode LIMS: Binning, Trimming, & Assembly of Sequence Data
![Page 8: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/8.jpg)
Challenges Facing the Moorea Biocode Project IT Team
• Multiple taxa & a different team for each group. Various cultures and workflow for each team.
• Everyone in a hurry, non-technical biologists entering data
• Specimens (& metadata) ultimately owned by multiple host institutions.
• Multiple labs processing genetic data (w/ different equipment, processes, and workflows).
• Final taxonomic determination made using Lab and/or Host Institution (often much later than collecting event)
• No internet or bad internet in field.
• *Need to associate photos/standardized higher taxonomy in the field (before accession into any db)
![Page 9: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/9.jpg)
Field Based System Requirements
• Spreadsheets for data entry
• Extensible validation rules (each project or sub-project has its own requirements)
• Match specimen data to photos
• Tag photos and load to external system (e.g. Flickr)
• Query multiple taxonomic authorities (each TaxonTeam selects its own authority)
• Update online database periodically.
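One way to picture the "extensible validation rules" requirement is a small rule interface that each project or sub-project fills with its own checks. The sketch below is illustrative only: `Rule`, `RowValidator`, and the two example rule factories are hypothetical names and are not taken from bioValidator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical rule contract: return an error message, or null if the row passes.
interface Rule {
    String check(Map<String, String> row);
}

// Hypothetical validator that runs whatever rules a project registers.
class RowValidator {
    private final List<Rule> rules = new ArrayList<>();

    RowValidator add(Rule rule) {
        rules.add(rule);
        return this;
    }

    // Collects every failure instead of stopping at the first one,
    // so a whole spreadsheet row can be fixed in one pass.
    List<String> validate(Map<String, String> row) {
        List<String> errors = new ArrayList<>();
        for (Rule rule : rules) {
            String message = rule.check(row);
            if (message != null) {
                errors.add(message);
            }
        }
        return errors;
    }

    // Example rule: the named column must be present and non-blank.
    static Rule required(String column) {
        return row -> {
            String value = row.get(column);
            return (value == null || value.trim().isEmpty())
                    ? "Missing required column: " + column
                    : null;
        };
    }

    // Example rule: the named column, if filled in, must be a number in [min, max].
    static Rule numberInRange(String column, double min, double max) {
        return row -> {
            String value = row.get(column);
            if (value == null || value.trim().isEmpty()) {
                return null; // blank values are the concern of required()
            }
            try {
                double d = Double.parseDouble(value.trim());
                // Written as a negated range test so that NaN is rejected too.
                return !(d >= min && d <= max)
                        ? column + " out of range: " + value
                        : null;
            } catch (NumberFormatException e) {
                return column + " is not a number: " + value;
            }
        };
    }
}
```

A project would then assemble its own rule list, for example `new RowValidator().add(RowValidator.required("Specimen_Num_Collector"))`, where the column name is again only an assumed example.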
![Page 10: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/10.jpg)
Part 2: bioValidator: a Field based Data Validation Tool
• Validate data using extensible validation rules
• Search multiple taxonomies built in Lucene
• Specimen to photo matching
• Upload to Flickr using machine tags
• No internet required
• Java based
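Specimen-to-photo matching and machine tagging can be sketched in a few lines. The example assumes a file-naming convention of `<specimenId>_<suffix>.<ext>`, which is a made-up convention for illustration and not necessarily the one bioValidator uses; `PhotoTagger` and the `biocode` namespace are likewise hypothetical. The only external fact relied on is Flickr's `namespace:predicate=value` machine-tag syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PhotoTagger {
    // Assumed convention: "<specimenId>_<suffix>.<ext>", e.g. "MBIO1234_dorsal.jpg".
    // Falls back to the name without its extension when there is no underscore.
    static String specimenIdOf(String fileName) {
        int cut = fileName.indexOf('_');
        if (cut < 0) {
            cut = fileName.lastIndexOf('.');
        }
        return cut < 0 ? fileName : fileName.substring(0, cut);
    }

    // Groups photo file names under the specimen they belong to.
    // Photos whose derived ID is not in the spreadsheet are left out.
    static Map<String, List<String>> match(Set<String> specimenIds, List<String> fileNames) {
        Map<String, List<String>> bySpecimen = new LinkedHashMap<>();
        for (String name : fileNames) {
            String id = specimenIdOf(name);
            if (specimenIds.contains(id)) {
                bySpecimen.computeIfAbsent(id, k -> new ArrayList<>()).add(name);
            }
        }
        return bySpecimen;
    }

    // Builds a machine tag in namespace:predicate=value form.
    // The value is quoted when it contains a space, because tags are space-delimited.
    static String machineTag(String namespace, String predicate, String value) {
        String v = value.contains(" ") ? "\"" + value + "\"" : value;
        return namespace + ":" + predicate + "=" + v;
    }
}
```

For instance, `PhotoTagger.machineTag("biocode", "specimen", "MBIO1234")` yields `biocode:specimen=MBIO1234`, which could then be attached to each matched photo at upload time.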
![Page 11: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/11.jpg)
![Page 12: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/12.jpg)
![Page 13: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/13.jpg)
![Page 14: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/14.jpg)
![Page 15: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/15.jpg)
Part 3: A case study in handling taxonomic names in a field based client application
![Page 16: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/16.jpg)
Why Lucene?
• Java-based, cross platform
• Indexes can be delivered to client apps (can run offline)
• Ability to build a standardized interface to multiple taxonomies.
![Page 17: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/17.jpg)
Higher Taxonomic Name Handling in the Field
• Initial Spreadsheet: Just assign the lowest taxon name and lowest taxon level.
• bioValidator: Suggest a higher taxonomy based on the name and level provided.
• Revised Spreadsheet: update with suggested higher taxonomic hierarchy.
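The three-step workflow above can be illustrated without Lucene by using a plain in-memory list as a stand-in for the taxonomy index. Everything here is an assumption made for the sketch: the class name `HigherTaxonomy`, the shortened rank list, and the choice to return every distinct candidate hierarchy so that ambiguous names can be resolved by the user.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HigherTaxonomy {
    // Ranks from highest to lowest; a shortened list assumed for the example.
    static final List<String> RANKS =
            Arrays.asList("kingdom", "phylum", "class", "order", "family", "genus");

    // In-memory stand-in for a taxonomy index: one rank->name map per taxonomic unit.
    private final List<Map<String, String>> units = new ArrayList<>();

    // Convenience builder: names are given in RANKS order, highest rank first.
    static Map<String, String> unit(String... names) {
        Map<String, String> u = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < RANKS.size(); i++) {
            u.put(RANKS.get(i), names[i]);
        }
        return u;
    }

    void addUnit(Map<String, String> unit) {
        units.add(unit);
    }

    // Given the lowest taxon level and name from the initial spreadsheet,
    // returns each distinct hierarchy from the top rank down to that level.
    // More than one result means the name is ambiguous at that level.
    List<Map<String, String>> suggest(String lowestLevel, String lowestName) {
        List<Map<String, String>> candidates = new ArrayList<>();
        int depth = RANKS.indexOf(lowestLevel.trim().toLowerCase(Locale.ROOT));
        if (depth < 0) {
            return candidates; // unknown level: nothing to suggest
        }
        for (Map<String, String> unit : units) {
            String name = unit.get(RANKS.get(depth));
            if (name == null || !name.equalsIgnoreCase(lowestName.trim())) {
                continue;
            }
            Map<String, String> hierarchy = new LinkedHashMap<>();
            for (int i = 0; i <= depth; i++) {
                String rank = RANKS.get(i);
                if (unit.containsKey(rank)) {
                    hierarchy.put(rank, unit.get(rank));
                }
            }
            if (!candidates.contains(hierarchy)) {
                candidates.add(hierarchy);
            }
        }
        return candidates;
    }
}
```

With a single candidate, the revised spreadsheet row can be filled in directly from the returned map; with several, the user would be asked to pick one before the row is updated.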
![Page 18: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/18.jpg)
Lucene Indexer Implementation for Taxonomy
| Taxonomic Concept | Lucene Class |
|---|---|
| Unit | Document |
| Rank | Field |
| Taxonomic Database | IndexWriter |
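    String sql = "SELECT tsn FROM taxonomic_units";
    // … obtain resultset …
    while (resultset.next()) {
        // One Lucene Document per taxonomic unit
        Document doc = new Document();
        // itisRanks is a class that abstracts the ITIS schema
        itisRanks ir = new itisRanks(resultset.getString("tsn"));
        while (ir.next()) {
            // One Field per rank (simplified: a real Field also needs store/index options)
            doc.add(new Field(ir.rank, ir.name));
        }
        // Add each unit's document inside the loop, once its ranks are collected
        indexWriter.addDocument(doc);
    }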
Example of Lucene Index built on ITIS
![Page 19: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/19.jpg)
Lucene Search Implementation for Taxonomy
    public static Hashtable<String, String> searchIndex(String taxonLevel, String taxonName) {
        Hashtable<String, String> map = new Hashtable<String, String>();
        // Construct query (simplified: a real QueryParser also takes an Analyzer and parses the name)
        Query query = new QueryParser(taxonLevel, taxonName);
        // Possible multiple matches (searcher is an IndexSearcher opened on the taxonomy index)
        TopFieldDocs hits = searcher.search(query);
        // Loop through each taxonomic unit
        for (int taxonUnit = 0; taxonUnit < hits.scoreDocs.length; taxonUnit++) {
            Document doc = searcher.doc(hits.scoreDocs[taxonUnit].doc);
            // Loop over each rank to assign to map
            for (int rank = 0; rank < taxonLevels.getNumLevels(); rank++) {
                String level = taxonLevels.getLevel(rank);
                String value = doc.get(level);
                // Populate a simple table with taxon ranks & values (Hashtable rejects null values)
                if (value != null) {
                    map.put(level, value);
                }
            }
        }
        return map;
    }
Example of (a simplified!) Lucene Search:
![Page 20: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/20.jpg)
![Page 21: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/21.jpg)
Further Work
• Standardization in validation protocols (expand on CRIA work). As we push the envelope in field-based data collection this will become more of an issue.
• Network of Lucene indexes for taxonomies?
• GUID implementation in spreadsheets?
• How to track and update data as it changes in dependent systems (LIMS, GenBank, BOLD, CalPhotos). See BiSciCol Grant (NSF)
![Page 22: Field Based Data Validation: a very real experience in wrangling data, taxonomic names, and photos Moorea Biocode Project, supported by the Gordon and.](https://reader036.fdocuments.us/reader036/viewer/2022062517/56649f1e5503460f94c35ded/html5/thumbnails/22.jpg)
More Information
• John Deck ([email protected])
• Moorea Biocode Project – http://mooreabiocode.org/
• bioValidator – http://biovalidator.sourceforge.net/
• bioTaxonomy (Lucene index/search) – http://biotaxonomy.sourceforge.net/