TDWG- Lisbon Oct 2003
Data Cleaning Tools and Methodologies
Arthur D. Chapman
Australia / Brazil
Centro de Referência em Informação Ambiental
Background
• ERIN/CRIA
• speciesLink
• FAPESP/Biota
Species Data
• Museum/Herbarium
• Observation
• Survey
Data Error
• Names
• Geocode
• Altitude
• Collectors
• Dates
Data quality - fitness for use
Methods for geocode validation
• Internal Database Checks
• Outliers in Geographic Space - GIS
• Outliers in Environmental Space - Models
• Statistical outliers
Internal Database Checks
• Internal inconsistencies
• Checking one field against another
  – Text location vs geocode
• Checking one database against another
  – Gazetteers
  – DEM
  – Collectors
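A minimal sketch of checking the text location against the geocode via a gazetteer. The gazetteer entries, the field names (`locality`, `lat`, `lon`) and the 50 km tolerance are assumptions chosen for illustration, and the coordinates are approximate.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical gazetteer: place name -> approximate (latitude, longitude)
GAZETTEER = {
    "campinas": (-22.91, -47.06),
    "manaus": (-3.12, -60.02),
}

def check_locality(record, max_km=50.0):
    """Return 'ok', 'mismatch' or 'unknown place' for one specimen record."""
    place = record["locality"].strip().lower()
    if place not in GAZETTEER:
        return "unknown place"
    glat, glon = GAZETTEER[place]
    distance = haversine_km(record["lat"], record["lon"], glat, glon)
    return "ok" if distance <= max_km else "mismatch"

records = [
    {"id": 1, "locality": "Campinas", "lat": -22.90, "lon": -47.05},
    # The geocode below sits near Campinas although the text says Manaus
    {"id": 2, "locality": "Manaus", "lat": -22.90, "lon": -47.05},
]
results = {r["id"]: check_locality(r) for r in records}
print(results)
```

A "mismatch" only says the two fields disagree; either the text or the geocode may be the one in error.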
Geographic outliers - GIS
• Country, State, named district, etc.
Geographic Outliers - GIS
• Collectors – location vs date
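One way to test a collector's location against date is to order each collector's records by date and look at the travel implied between consecutive records. The sketch below assumes hypothetical field names and an arbitrary limit of 100 km per day; a realistic limit depends on the period and the means of transport.

```python
import math
from datetime import date

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    r = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_itinerary(records, max_km_per_day=100.0):
    """Return ids of records that imply implausible travel from the collector's previous record."""
    by_collector = {}
    for rec in records:
        by_collector.setdefault(rec["collector"], []).append(rec)
    flagged = []
    for recs in by_collector.values():
        recs.sort(key=lambda rec: rec["date"])
        for prev, cur in zip(recs, recs[1:]):
            days = max((cur["date"] - prev["date"]).days, 1)  # same-day records count as one day
            km = haversine_km(prev["lat"], prev["lon"], cur["lat"], cur["lon"])
            if km / days > max_km_per_day:
                flagged.append(cur["id"])  # the later record is flagged; either one may be wrong
    return flagged

# Made-up records for two hypothetical collectors
records = [
    {"id": 1, "collector": "Smith", "date": date(1900, 3, 1), "lat": -22.90, "lon": -47.06},
    {"id": 2, "collector": "Smith", "date": date(1900, 3, 2), "lat": -22.95, "lon": -47.10},
    {"id": 3, "collector": "Smith", "date": date(1900, 3, 3), "lat": -3.12, "lon": -60.02},
    {"id": 4, "collector": "Jones", "date": date(1900, 3, 3), "lat": -3.10, "lon": -60.00},
]
flagged = flag_itinerary(records)
print(flagged)
```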
Environmental Outliers
• Cumulative Frequency Curves
Reverse Jack-knife
[Figure: Acacia orites, 19 records, 9 temperature parameters. Vertical axis: Temperature (°C), 0 to 35. Parameter labels recovered: tann, tmncm, tmxwm, tspan, tclq, twmq, twetq, tdryq.]
Outliers in climate space
(T=0.95(√n)+0.2)
where ‘n’ is the number of records
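As a worked example of the threshold, the sketch below evaluates T for a given number of records. The function name is hypothetical. Only the threshold is computed, because the slide does not state the statistic that is compared against T.

```python
import math

def critical_value(n):
    """Threshold T = 0.95 * sqrt(n) + 0.2, where n is the number of records."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 0.95 * math.sqrt(n) + 0.2

# 19 records, as in the Acacia orites example
t_19 = critical_value(19)
print(round(t_19, 3))  # 4.341
```

The threshold grows with the square root of n, so larger data sets need a proportionally larger value before a record is treated as an outlier.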
FloraMap
• CIAT (Colombia)
• PCA
• Cluster Analysis
• $US100
• Modelling
• 10-minute grids
Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Principal Components Analysis to identify an outlier in Rauvolfia littoralis specimen data.
A. Principal Components Analysis. B. Specimen record. C. Mapped specimen. D. Climate profile.
Cluster Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001) showing use of Cluster Analysis to identify an outlier in Rauvolfia littoralis specimen data.
A. Cluster Analysis. B. Principal Components Analysis. C. Mapped specimen. D. Climate profile. E. Specimen record.
Diva-GIS
• Free
• Simple GIS
• Modelling (BIOCLIM/Domain)
• Data Cleaning Tools
Diva-GIS – Coordinate Check
Using Diva-GIS to check coordinates by comparing a file of point specimen records (red) against a polygon of Bolivian provinces. Input dialogue box is shown at A, where it can be seen that “STATE” in the point file has been set to the equivalent “DEPARTMENT” in the polygon file (Hijmans et al. 2003).
Points outside Polygon – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing point records that fall outside all polygons in the Bolivian provinces polygon file. The highlighted record shows the link between the results dialogue box and the mapped record.
Mismatched Provinces – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing point records that do not match the set relationships between the specimen point file and the polygon file of Bolivian provinces. The highlighted record is one where the geocoding on the specimen record causes it to fall in the wrong province.
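The coordinate check can be sketched as a point-in-polygon test followed by a comparison of the recorded administrative unit with the polygon the point falls in. This is an illustrative sketch, not the Diva-GIS implementation. The rectangular "departments", the field names and the ray-casting test are assumptions.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical administrative polygons as (lon, lat) vertices
DEPARTMENTS = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

def check_record(rec):
    """Classify a record as 'match', 'mismatch' or 'outside all polygons'."""
    found = [name for name, poly in DEPARTMENTS.items()
             if point_in_polygon(rec["lon"], rec["lat"], poly)]
    if not found:
        return "outside all polygons"
    return "match" if rec["state"] in found else "mismatch"

records = [
    {"id": 1, "state": "A", "lon": 5, "lat": 5},
    {"id": 2, "state": "A", "lon": 15, "lat": 5},   # recorded as A, plots in B
    {"id": 3, "state": "B", "lon": 25, "lat": 5},   # plots outside every polygon
]
results = {r["id"]: check_record(r) for r in records}
print(results)
```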
Assign Coordinates – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing point records with geocodes automatically assigned. A. Unambiguous geocodes found by the program and assigned. B. Ambiguous geocodes identified. C. Appropriate geocodes not found.
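The three outcomes (A, B, C) can be illustrated with a simple gazetteer lookup. This is a sketch under stated assumptions, not the Diva-GIS routine. The place names and coordinates are made up, and the function name is hypothetical.

```python
# Hypothetical gazetteer: a place name may map to several candidate coordinates
GAZETTEER = {
    "villa uno": [(-17.40, -66.20)],
    "san pedro": [(-17.80, -63.20), (-14.50, -67.50)],
}

def assign_geocode(locality):
    """Return (status, candidates) where status is 'assigned', 'ambiguous' or 'not found'."""
    candidates = GAZETTEER.get(locality.strip().lower(), [])
    if len(candidates) == 1:
        return "assigned", candidates      # A: unambiguous geocode
    if len(candidates) > 1:
        return "ambiguous", candidates     # B: several credible alternatives
    return "not found", []                 # C: no gazetteer entry

statuses = {name: assign_geocode(name)[0]
            for name in ["Villa Uno", "San Pedro", "El Desconocido"]}
print(statuses)
```

Ambiguous cases are returned with all candidates so that a person can choose between them using other fields in the record.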
Multiple possibilities – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing alternate geocodes for a record where use of the Gazetteer has produced a number of credible alternatives.
Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space, B1 and B2 the corresponding mapped records. The blue lines represent the 97.5th percentile.
Bioclimatic Envelope – Diva-GIS
Results from Diva-GIS (Hijmans et al. 2003) showing the use of the Bioclimatic Envelope from BIOCLIM to identify outliers in climate space. In this case the percentile cut-off is set at 95. Red points on the envelope correspond with red points on the map; green points in the envelope correspond with yellow points on the map.
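A percentile envelope can be sketched as follows: for each climate variable, take the central 95 percent of the recorded values and flag any record that falls outside that range on at least one variable. The variable names, the data and the interpolation rule are assumptions. BIOCLIM and Diva-GIS may compute percentiles differently.

```python
def percentile(values, p):
    """Percentile (p in 0..100) with linear interpolation between closest ranks."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def envelope_outliers(records, variables, cutoff=95.0):
    """Return ids of records outside the central `cutoff` percent on any variable."""
    tail = (100.0 - cutoff) / 2.0
    limits = {}
    for var in variables:
        values = [rec[var] for rec in records]
        limits[var] = (percentile(values, tail), percentile(values, 100.0 - tail))
    return [rec["id"] for rec in records
            if any(not (limits[var][0] <= rec[var] <= limits[var][1]) for var in variables)]

# Made-up annual mean temperature (C) and annual rainfall (mm) for ten records
tann = [18.0, 18.2, 18.4, 18.6, 18.8, 19.0, 19.2, 19.4, 19.6, 27.0]
rain = [400, 1000, 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080]
records = [{"id": i + 1, "tann": t, "rain": r} for i, (t, r) in enumerate(zip(tann, rain))]

flagged = envelope_outliers(records, ["tann", "rain"])
print(flagged)
```

Because the limits come from the sample itself, the most extreme record on each variable always falls outside the envelope. Flagged records are therefore candidates for checking, not confirmed errors.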
ANUCLIM
• $AUD1000 (with data files)
• Modelling (BIOCLIM / ESOCLIM)
• Cumulative Frequency Curves
• Parameter Extremes
Cumulative Frequency - ANUCLIM
Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the species accumulation curve with an identified outlier (labelled “bad”). Information from the “bad” record is displayed at the top of the log file (from Houlder et al. 2000).
Parameter extremes - ANUCLIM
Log file of Eucalyptus fastigata from ANUCLIM Version 5.1 (Houlder et al. 2002) showing the parameter extremes (top) and associated species accumulation curve (bottom) (from Houlder et al. 2000).
Statistical Tests
• Outliers in Latitude
• Outliers in Altitude
• Outliers in collectors' range per day or week
  – Especially 17th, 18th and 19th century collections
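The slide does not name a particular statistical test. As one common choice, and purely as an assumption, the sketch below flags altitude values outside Tukey's fences (1.5 times the interquartile range beyond the quartiles). The same function could be applied to latitude. The altitude values are made up.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up altitudes in metres for one species; 2900 could be feet entered as metres
altitudes_m = [310, 320, 335, 340, 350, 355, 360, 2900]
suspect = tukey_outliers(altitudes_m)
print(suspect)
```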
Thank You…
Questions?