Tony Rees: TAXAMATCH poster May 2009

Post on 04-Jul-2015


TAXAMATCH (fuzzy matching for scientific names of organisms) poster presented at e-Biosphere conference, London, May 2009

Tony Rees, CSIRO Marine and Atmospheric Research, Australia

contact: Tony Rees; phone: +61 3 6232 5318; email: tony.rees@csiro.au; web: www.cmar.csiro.au/datacentre/

Taxon scientific names are key identifiers in the world of biodiversity, yet for informatics applications they often fail to provide the required cross linkages on account of minor (or not so minor) differences in spelling arising from keying or phonetic errors, OCR (optical character recognition) and transcription errors, emendations, gender endings of species epithets, differences in diacritical marks, and more.

For example, data on the fish genus Coelorinchus (present “correct” spelling) might be stored under variant spellings Caelorinchus (previously considered correct), Coelorhinchus, Coelorhynchus, Caelorhynchus, and so on, while the potential for random or semi-random keystroke, OCR or transcription errors is almost limitless. If such potential variant spellings cannot be reconciled, some or even all of the desired data may not be retrieved.

This poster introduces TAXAMATCH, a “fuzzy” or near match algorithm developed at CSIRO Marine and Atmospheric Research (Australia), with the specific purpose of providing optimal fuzzy matching for genus and species scientific names in real world situations, and capable of deployment over a remote reference database of spellings deemed correct, or incorporation into any local system to suit a user’s particular needs.

TAXAMATCH comprises a suite of custom filters and tests used in succession on genus, species epithet, plus authority where supplied, to return candidate near or “fuzzy” matches in a reference set of taxon names to any supplied input name. The actual tests employed include the following:

• An exact match test, both before and after minor normalisation

• A phonetic match test, using a custom algorithm “tuned” to the characteristics of taxon scientific names

• A custom “Modified Damerau-Levenshtein Distance” (MDLD) algorithm which looks for possible omitted, inserted, substituted and transposed characters and character blocks

• A modified n-gram comparison of author names and dates where supplied, including expansion of selected known abbreviations of author names as appropriate.
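Two of the tests listed above can be illustrated with textbook stand-ins. The sketch below is not the TAXAMATCH code: `osa_distance` is the standard single-character Damerau-Levenshtein distance (optimal string alignment variant), without the character-block handling that MDLD adds, and `bigram_similarity` is a plain Dice coefficient over character bigrams rather than the poster's modified n-gram comparison. Both function names are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Standard Damerau-Levenshtein distance (optimal string alignment).

    Counts single-character insertions, deletions, substitutions and
    transpositions of adjacent characters. Assumption: this is a generic
    stand-in, not the MDLD algorithm, which also handles character blocks.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "cr" <-> "rc"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over sets of character bigrams (case-insensitive).

    Assumption: a generic n-gram measure for author strings; it does not
    expand author abbreviations as TAXAMATCH is described as doing.
    """
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # strings too short to form bigrams: fall back to equality
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


if __name__ == "__main__":
    for pair in [("Acropaginula", "Arcopaginula"), ("Peneus", "Penaeus"),
                 ("abrohlensis", "abrolhensis"), ("Hombo", "Homo")]:
        print(pair, osa_distance(*pair))
    print(bigram_similarity("Linnaeus, 1758", "Linnaeus 1758"))
```

Each name pair used here comes from the poster itself and differs by a single edit under this measure, so a low distance threshold would already catch them.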

TAXAMATCH operating principles

The custom filtering that has been developed for TAXAMATCH at both genus and species epithet levels comprises:

• Genus and species pre-filters, which serve to speed up the algorithm execution by excluding names deemed to be almost certain not to match from being tested

• Genus and species post-filters, which apply a set of rules to assist in the discrimination of likely “true” from “false” near matches

• A genus cosmetic filter, which presents only a subset of “genus near match” search results to the human web interface, while passing a wide range of genera through to the species stage for further testing

• A final result shaping stage (which can be switched out if desired), which masks more distant near matches in the presence of closer ones, but opens automatically to show them when the latter are absent.

A schematic of overall TAXAMATCH operation is shown in Fig. 1, below.
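The pre-filter and result shaping ideas can be sketched in a few lines. The poster does not state the actual filter rules, so everything here is an assumption for illustration: `length_prefilter` uses a simple length-difference rule, `shape_results` keeps only the closest tier of matches, and the distances passed in are illustrative edit distances.

```python
from typing import List, Tuple


def length_prefilter(query: str, names: List[str], max_diff: int = 2) -> List[str]:
    """Hypothetical pre-filter: drop names whose length differs from the
    query by more than max_diff, so they are never passed to the
    (more expensive) near-match test."""
    return [n for n in names if abs(len(n) - len(query)) <= max_diff]


def shape_results(scored: List[Tuple[str, int]],
                  shaping: bool = True) -> List[Tuple[str, int]]:
    """Hypothetical result shaping over (name, distance) pairs.

    With shaping on, only the closest tier of matches is returned, which
    masks more distant matches whenever closer ones exist. If no close
    match exists, the distant ones form the closest tier and are shown.
    With shaping off, all matches are returned, ranked by distance.
    """
    ranked = sorted(scored, key=lambda pair: (pair[1], pair[0]))
    if not shaping or not ranked:
        return ranked
    best = ranked[0][1]
    return [pair for pair in ranked if pair[1] == best]


if __name__ == "__main__":
    print(length_prefilter("Hombo", ["Homo", "Coelorinchus", "Penaeus"]))
    # distances below are for the input "Caelorinchus"
    hits = [("Coelorhynchus", 3), ("Coelorinchus", 1)]
    print(shape_results(hits))                  # closest tier only
    print(shape_results(hits, shaping=False))   # everything, ranked
    print(shape_results([("Coelorhynchus", 3)]))  # distant match now shown
```

Keeping the shaping step as a switchable final stage, as the poster describes, means the earlier stages can stay permissive without cluttering the results shown to a user.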

TAXAMATCH reference implementation

The reference installation of TAXAMATCH currently runs over the IRMNG (Interim Register of Marine and Nonmarine Genera) database hosted at CSIRO Marine and Atmospheric Research, available via the access point www.cmar.csiro.au/datacentre/irmng/, which (at mid 2009) contains over 1.4 million species names from the Catalogue of Life and other sources, together with over 400,000 genus names. TAXAMATCH is automatically invoked when single genus + species, or genus queries are made, so as to display not only exact matches but also any near matches in the IRMNG database to any user-supplied input name.

Figs. 2 and 3 illustrate how TAXAMATCH will return a match of the correctly spelled name “Homo sapiens” in response to an incorrectly spelled input name “Hombo sapient”. Note that in this instance, operation of the genus and species pre-filters means that only 325 of the 445,004 genera, and 31 of the 1,459,171 species presently in the reference database are actually required to be tested, which contributes significantly to the relatively short execution time for the query (around 1 to a few seconds per input name, or less when conducted without the web interface and ancillary information presented).
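To put the pre-filter figures quoted for the “Hombo sapient” query in proportion, the short calculation below expresses the tested names as a percentage of the reference set. The counts are taken from the text above; the variable names are only for illustration.

```python
# Counts quoted for the "Hombo sapient" example query (mid 2009)
genera_total, genera_tested = 445_004, 325
species_total, species_tested = 1_459_171, 31

genus_pct = 100 * genera_tested / genera_total
species_pct = 100 * species_tested / species_total

print(f"genera tested:  {genus_pct:.3f}% of the reference genera")
print(f"species tested: {species_pct:.4f}% of the reference species")
```

In other words, roughly 0.07% of the genera and 0.002% of the species reach the comparatively expensive near-match tests for this query.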

Figure 2: Web accessible IRMNG / TAXAMATCH search entry point www.cmar.csiro.au/datacentre/irmng/

Figure 4: Sample IRMNG search result for a batch of multiple species names to be checked, showing option presented for “fuzzy search” on names that do not have an exact match to any current target name in the IRMNG database at this time.

Figure 3: Result of above search for the entered term “Hombo sapient” against the IRMNG database

TAXAMATCH use cases

A range of use cases can be envisaged for TAXAMATCH, including the following:

• Matching a (web or other) user’s entered text against stored biodiversity information, where either the input or stored name may be misspelled or a variant spelling

• Checking of names on a “List A” that do not match entries on an equivalent “List B” (but may potentially include the same entities under variant spellings)

• Query expansion – for distributed data searches (where all name variants can be indexed in advance), as would be applicable to (e.g.) OBIS, GBIF, etc.

• Deduplication of stored lists – especially those constructed by aggregation of names from multiple sources

• “As you type” spell correction

• Application in taxonomic name recognition software, e.g. via OCR of scanned specimen labels, or detection of taxonomic names in mixed text streams (biological publications, etc.)

The web accessible IRMNG / TAXAMATCH search entry point also currently supports the input of batches of up to approximately 2,500 genus names or 1,200 genus + species names for automated checking, as shown in Fig. 4, and larger batches of names can be checked via alternative mechanisms as desired.
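A “List A” against “List B” batch check of the kind described above can be approximated locally with Python's standard library. This is a minimal sketch under stated assumptions: `difflib.get_close_matches` uses a generic similarity ratio, not the TAXAMATCH tests, and the function name `cross_match` and the 0.8 cutoff are hypothetical choices.

```python
import difflib
from typing import Dict, List


def cross_match(list_a: List[str], list_b: List[str],
                cutoff: float = 0.8) -> Dict[str, List[str]]:
    """For every name on list A with no exact match on list B, report up
    to three near matches from list B (an empty list if there are none).

    Assumption: difflib's similarity ratio stands in for a real
    taxon-name matcher purely for illustration.
    """
    known = set(list_b)
    report: Dict[str, List[str]] = {}
    for name in list_a:
        if name in known:
            continue  # exact match, nothing to reconcile
        report[name] = difflib.get_close_matches(name, list_b, n=3, cutoff=cutoff)
    return report


if __name__ == "__main__":
    list_a = ["Peneus", "Meosarmatium", "Coelorinchus", "Homo"]
    list_b = ["Penaeus", "Neosarmatium", "Arcopaginula", "Coelorinchus"]
    for name, candidates in cross_match(list_a, list_b).items():
        print(f"{name}: {candidates or 'no near match'}")
```

The same loop run with a list against itself (skipping identical entries) gives a rough starting point for the deduplication use case.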

Conclusion

TAXAMATCH appears to offer a good solution to the problems of near matching genus and/or species scientific names, whether for matching users’ misspelled query terms to correctly stored target data (or vice versa), list cross-matching or internal deduplication, or as a prototype web accessible taxonomic spell checking service. Several development areas for TAXAMATCH are currently under active consideration, and interested potential users or developers are encouraged to contact the author at the address shown below or to visit the TAXAMATCH web page www.cmar.csiro.au/datacentre/taxamatch.htm.

Acknowledgements

I thank Miroslaw Ryba, CSIRO Marine and Atmospheric Research, for programming and database assistance, and Barbara Boehmer, USA, for assistance with modifying her original Oracle® Levenshtein Distance implementation for TAXAMATCH use.

Photographs courtesy of Karen Gowlett-Holmes.

Fuzzy matching of taxon names for biodiversity informatics applications

Acropaginula <> Arcopaginula
Meosarmatium <> Neosarmatium
Peneus <> Penaeus
faveolata <> flaveolata
capricornicus <> capricornensis
abrohlensis <> abrolhensis

Figure 1: Schematic of TAXAMATCH operation [flow diagram not reproduced. Inputs: input genus + species (+ auth.); available genus + species names (+ auth’s). Labelled stages: parsing and normalisation; genus pre-filter; genus test; genus post-filter; genus cosmetic filter; species pre-filter; species test; species post-filter; auth. comparator; ranking + result shaping. Outputs: genus near matches displayed; species near matches displayed.]

Poster design by Lea Crosswell – Communication Group, CSIRO Marine and Atmospheric Research – May 2009