INCOFISH WP3 - Campinas, April 2006
Alexandre [email protected]
Centro de Referência em Informação Ambiental, CrIA
WEB Tools and Data Cleaning
These tools were developed within the scope of the speciesLink project, so in some cases they depend entirely on the architecture, the local database, and the libraries developed by CRIA.
Data Cleaning started as an idea without a very clear direction and became a very particular system.
The speciesLink project is being funded by FAPESP (São Paulo state agency) from October 2001 to October 2005.
Different data sources, software and systems
[Figure: a search program interface facing five collections (Col 1 to Col 5) that run on different platforms: Win2000/Brahms, Linux/MySQL, Win98/Access, Win98/biota and FreeBSD/PostgreSQL. Question marks stand between the search interface and the collections.]
Protocol and Content Schema
• DiGIR protocol (Distributed Generic Information Retrieval)
Potential to be globally accepted
• DiGIR software (Java Portal & PHP Provider)
Collaborative development
• DarwinCore v.2
Covers the basic content elements (taxonomic identification, location and date of collecting event)
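The three groups of basic content elements can be pictured as a simple record type. This is only a sketch: the class `OccurrenceRecord`, its field names and the helper `missing_basic_elements` are hypothetical and are not the element names defined by the DarwinCore schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class OccurrenceRecord:
    """Hypothetical record holding the basic content elements."""
    # Taxonomic identification
    scientific_name: Optional[str] = None
    # Location
    country: Optional[str] = None
    state_province: Optional[str] = None
    locality: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    # Date of the collecting event
    year_collected: Optional[int] = None
    month_collected: Optional[int] = None
    day_collected: Optional[int] = None

    def missing_basic_elements(self) -> List[str]:
        """Return the names of the elements that are still empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# A record with only the taxonomic identification filled in.
record = OccurrenceRecord(scientific_name="Trachurus trachurus")
print(record.missing_basic_elements())
```

A provider could use a check like this to see which basic elements a collection's records still lack before serving them.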
System's Architecture
[Figure: architecture diagram. Components shown: speciesLink site (presentation layer); DiGIR Portal (Java); Perl. Fast and stable connectivity: Collection A, with a Collection Management System, SQL, data and a PHP Provider. Slow or unstable connectivity: Collections B and C, each with a Collection Management System, SQL, a Data Repository and a SOAP client; a Mirror Server with a SOAP Server, SQL, Postgres data and a PHP Provider.]
~40 connected collections
~940,000 on-line records
March 2006
JBRJ
speciesLink network
WEB Tools
• geoLoc
• spOutlier
• infoXY
• conversor
• speciesMapper
• data cleaning
About geoLoc
to assist biological collections in geo-referencing their data
the database includes approximately 110 thousand names of Brazilian localities, obtained from:
- Brazilian Institute of Geography and Statistics (IBGE)
- GEOnet Names Server (GNS)
- speciesLink/Fapesp
algorithm based on concepts in the Egaz program (Shattuck 1997) capable of calculating a coordinate for a distance and direction
Tools
Example input: 26, Noroeste-NW, Campinas, São Paulo
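Calculating a coordinate from a distance and a direction can be sketched as follows. This is not the Egaz or geoLoc code: it is a minimal sketch that uses the standard spherical destination-point formula and assumes a spherical Earth of radius 6371 km, a distance given in kilometres, and approximate coordinates for Campinas. The function name `offset_coordinate` and the `BEARINGS` table are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth

# Compass directions mapped to bearings in degrees, clockwise from north.
BEARINGS = {"N": 0, "NE": 45, "E": 90, "SE": 135,
            "S": 180, "SW": 225, "W": 270, "NW": 315}


def offset_coordinate(lat, lon, distance_km, direction):
    """Return (lat, lon) in decimal degrees of the point reached by
    travelling distance_km from (lat, lon) in the given direction."""
    bearing = math.radians(BEARINGS[direction])
    lat1 = math.radians(lat)
    lon1 = math.radians(lon)
    delta = distance_km / EARTH_RADIUS_KM  # angular distance in radians

    lat2 = math.asin(math.sin(lat1) * math.cos(delta)
                     + math.cos(lat1) * math.sin(delta) * math.cos(bearing))
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


# Approximate coordinates assumed for Campinas, São Paulo.
CAMPINAS = (-22.9056, -47.0608)
print(offset_coordinate(CAMPINAS[0], CAMPINAS[1], 26, "NW"))
```

A spherical model keeps the sketch short; a production tool would more likely work on the ellipsoid of the chosen datum.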
Tools
About spOutlier
to assist biological collections in identifying possible suspect points in existing records
uses techniques modified from Chapman 1999 to detect outliers in latitude, longitude and altitude
allows users to indicate their data set as either terrestrial or marine
useful to biologists around the world who wish to identify possible errors in their data
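The idea of flagging suspect values in latitude, longitude and altitude can be sketched as follows. This is not the spOutlier algorithm or the technique from Chapman 1999: as a stand-in it uses a simple modified z-score based on the median absolute deviation, with a conventional cut-off of 3.5. The records are the eight sample rows shown on the slide, read here as id, longitude, latitude, altitude, which is an assumption about the column order. The function name `suspect_ids` is hypothetical.

```python
from statistics import median

# Sample rows from the slide, read as (id, longitude, latitude, altitude).
RECORDS = [
    (1, -63.25, -4.916666667, 795),
    (2, -67.05, -10.96666667, 805),
    (3, -68.0125, -12.66666667, 809),
    (4, -68.75, -13.60111111, 815),
    (5, -68.9102, -13.83333, 810),
    (6, -72.3666, -14.36611111, 790),
    (7, -78.3166, -14.38916667, 801),
    (8, -72.137, -11.8647, 700),
]


def suspect_ids(records, column, threshold=3.5):
    """Return the ids of records whose value in `column` has a modified
    z-score (based on the median absolute deviation) above `threshold`."""
    values = [rec[column] for rec in records]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as suspect
    return [rec[0] for rec in records
            if 0.6745 * abs(rec[column] - med) / mad > threshold]


for name, column in (("longitude", 1), ("latitude", 2), ("altitude", 3)):
    print(name, suspect_ids(RECORDS, column))
```

As in the tool itself, the sketch only reports suspect ids and leaves the records unchanged.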
1, -63.25, -4.916666667, 795
2, -67.05, -10.96666667, 805
3, -68.0125, -12.66666667, 809
4, -68.75, -13.60111111, 815
5, -68.9102, -13.83333, 810
6, -72.3666, -14.36611111, 790
7, -78.3166, -14.38916667, 801
8, -72.137, -11.8647, 700
marine
1, -63.25, -4.91667
2, 34.3239, 67.9836
aus, 150.0417, -34.9081
3, -68.0125, -12.6667
4, -22.0400, 63.9514
id_teste, -45, -22
6, -75.3667, -14.3661
7, 71.37, -19.37
eua, -80.8011, 26.0506
9, -120.7642, 58.7217
10, 26.0089, -29.5197
11, -95.3781, 16.7639
Input/Output:
- degrees, min, sec
- decimal degrees
- UTM
DATUM:
- WGS84 (World)
- SAD69 (Brazil)
- Córrego Alegre (SP)
Decimal degrees:
-3.5800 , 52.0633
34.3239 , 67.9836
-45 , -22
Degrees, min, sec:
03d34'47"W , 52d3'47"N
34d19'23"E , 67d59'0"N
44d59'58"W , 21d59'58"S
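The conversion from decimal degrees to degrees, minutes and seconds can be sketched as follows. This is a minimal sketch, not the tool's code: it performs only the unit conversion, with no datum transformation and no UTM support, so its results can differ by a few seconds from the values above. The function name `to_dms` and the `axis` parameter are hypothetical; the output notation imitates the one shown in the slide.

```python
def to_dms(value, axis):
    """Convert decimal degrees to a string such as 45d0'0"W.

    `axis` is "lon" for longitude (hemisphere E/W) or "lat" for
    latitude (hemisphere N/S). Seconds are rounded to whole numbers.
    """
    if axis == "lon":
        hemisphere = "W" if value < 0 else "E"
    elif axis == "lat":
        hemisphere = "S" if value < 0 else "N"
    else:
        raise ValueError("axis must be 'lon' or 'lat'")

    # Work in whole seconds so that rounding never yields 60 seconds.
    total_seconds = round(abs(value) * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{degrees}d{minutes}'{seconds}\"{hemisphere}"


for lon, lat in ((-3.58, 52.0633), (34.3239, 67.9836), (-45, -22)):
    print(to_dms(lon, "lon"), ",", to_dms(lat, "lat"))
```

Rounding the total number of seconds first, and splitting it afterwards, avoids outputs such as 59'60".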
Plot georeferenced points on a map.
Available layers:
- World
- South and Central America
- Brazil
- São Paulo State
-95.6 -39.5166
-70.2833 -4.2
-70.033333 -4.35
-69.914889 0.274694
-69.7333 -4.2333
-69.6661 -3.908333
...
Trachurus trachurus
Pteroscion pele
Gaidropsarus biscayensis
[Figure: using the tools. Components shown: spOutlier and geoLoc; a SOAP Web service handling jobs (job1, job2); data in PostgreSQL; maps in PostGIS.]
Tools
About Data Cleaning
Aims to help curators identify possible errors and standardize data
Records are not modified
The system just presents "suspect" records
How Data Cleaning Works
[Figure: components shown: national collections (Col 1, Col 2, Col 3, ... Col n) and international collections (Col 1, Col 2, ...); speciesLink Portal (Java); Detect Suspect Records (Perl); Local Database (PostgreSQL); Tables of Suspect Records (dc_tax, dc_geo; PostgreSQL); chart.pm (Perl); Web.]
Demonstration on-line