Extraction and Visualization of Geographical Names in Text
ZHANG [email protected]
Key Laboratory of Virtual Geographical Environment, Ministry of Education, Nanjing Normal University
Nov. 18, 2009
Content
1 Background
2 Extraction of geographical names
3 Applications
Resolution of geographical names
Generation of geographical names
1.1 Disciplines concerned with geographic space
GIS
Geography
spatial model of the earth
Information and Library Sciences
Computer Science
Natural Language Processing
Computational linguistics
Human Computer Interaction
Cognitive Psychology
Medicine
Political and social sciences
Geophysics
Biology (botany/zoology/ecology)
Archeology
……
1.2 What is a geographical name?
Location designator
Location
Geographical named entity: a named entity expressed as a noun or a location expression.
Place name: the name by which a geographical place is known.
Toponym: a named point of reference in both the physical and cultural landscape on the Earth's surface.
Geographical name: essentially a label that distinguishes one part of the earth's surface from another.
1.3 Main tasks
Recognition: identify geospatial names in a text span and classify them into predefined geographical feature categories.
Resolution: look up candidate referents and use algorithms to pick the correct referent for each recognized geographical name.
1.4 Basic processing architecture
[Figure: architecture diagram. Components: natural language text, dataset, natural language processing and machine learning, extraction, formalization, representation, applications, geospatial information, Geographical Information System]
1.5 Statistical models-ME
Maximum Entropy (applied to natural language processing since 1996)
√ no assumption of a normal distribution
√ no limits on context features
√ learning cost of its parameters
√ considering single situations
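For reference, the conditional maximum entropy model is usually written in the standard textbook form below. The feature functions f_i, weights λ_i, and normalizer Z are generic symbols assumed here for illustration; they are not notation taken from these slides.

```latex
p_{\lambda}(y \mid x) = \frac{1}{Z_{\lambda}(x)} \exp\!\Big( \sum_{i} \lambda_i f_i(x, y) \Big),
\qquad
Z_{\lambda}(x) = \sum_{y'} \exp\!\Big( \sum_{i} \lambda_i f_i(x, y') \Big)
```

Because each f_i may inspect any part of the context x, the model places no restriction on which context features are combined, which matches the second point above.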
1.5 Statistical Models-HMM
Hidden Markov Model
Markov property
Markov chain model: For observable state sequences (state is known from data).
Hidden Markov Model: For non-observable states
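As an illustration of how hidden states are recovered from observations, here is a minimal sketch of Viterbi decoding for an HMM. The two labels (LOC, O) and all probabilities are made-up toy values, assumed only for this example; they do not come from the slides.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (hypothetical values): LOC = part of a place name, O = other.
states = ["LOC", "O"]
start_p = {"LOC": 0.3, "O": 0.7}
trans_p = {"LOC": {"LOC": 0.5, "O": 0.5}, "O": {"LOC": 0.2, "O": 0.8}}
emit_p = {
    "LOC": {"Harbin": 0.6, "city": 0.3, "gifts": 0.1},
    "O": {"Harbin": 0.05, "city": 0.15, "gifts": 0.8},
}
path = viterbi(["Harbin", "city", "gifts"], states, start_p, trans_p, emit_p)
print(path)
```

In the Markov chain case the states are read directly from the data, so no decoding step of this kind is needed.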
HMM in Computational Linguistics
Speech recognition
Part-of-speech tagging
Handwriting recognition
Machine translation
1.6 Statistical Models-CRF
Conditional Random Field
Much like a Markov random field
An HMM: a CRF with very specific feature functions
A CRF: a generalization of an HMM
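The relation between the two models can be made concrete with the standard linear-chain CRF definition. The symbols below (feature functions f_k, weights λ_k, HMM transition probabilities a_{ij}, emission probabilities b_j(o)) are generic textbook notation assumed for illustration, not notation from these slides.

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big)

% An HMM is recovered by restricting the features to two indicator families:
f_{ij}(y_{t-1}, y_t, \mathbf{x}, t) = \mathbf{1}[y_{t-1} = i,\; y_t = j], \quad \lambda_{ij} = \log a_{ij}
f_{jo}(y_{t-1}, y_t, \mathbf{x}, t) = \mathbf{1}[y_t = j,\; x_t = o], \quad \lambda_{jo} = \log b_j(o)
```

With only these two families the score of a label sequence equals the HMM joint log-probability (up to the start term); a general CRF may add arbitrary overlapping features of the whole observation sequence.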
2.1 Diagram of CRF based recognition
[Figure: flow diagram. Components: dataset, linguistic characteristics, label granularity, feature template, CRF training, CRF test, simple geographical names, CCRF training, CCRF test, combined geographical names]
2.2 Linguistic characteristics
Language, history and culture
Special characters
Combined named units
Spatial relations
2.3 Label granularity
Granularity: 1-gram, 2-gram, …, word, phrase, sentence, paragraph, discourse
1-gram: sparse data
Word segmentation
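To make the finest granularity concrete, the sketch below assigns one label per character (1-gram). The slides do not state which tagging scheme is used, so the BIO scheme and the helper name `spans_to_bio` are assumptions made for this illustration.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, kind) character spans into per-character BIO labels.

    `end` is exclusive. The BIO scheme is an assumption of this sketch.
    """
    labels = ["O"] * len(text)
    for start, end, kind in spans:
        labels[start] = f"B-{kind}"          # first character of the name
        for k in range(start + 1, end):
            labels[k] = f"I-{kind}"          # remaining characters of the name
    return labels


text = "位于黑龙江省"                           # "located in Heilongjiang Province"
labels = spans_to_bio(text, [(2, 6, "SGN")])  # 黑龙江省 marked as one name
for ch, lab in zip(text, labels):
    print(ch, lab)
```

Labeling at the character level avoids a separate word segmentation step, at the cost of sparser evidence per unit.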
2.4 CCRF (cascaded CRF)
The upper recognition model
CT1 CT2 … CTi … CTn
The lower recognition model
ST1 ST2 … STi … STn
W1 W2 … Wi … Wn
2.5 Feature template
Context: observable windows
$(w_{-n}, w_{-(n-1)}, \ldots, w_0, \ldots, w_{n-1}, w_n)$
n: the window size; it affects training time and test performance
Feature type              Relative position
Front neighbor feature    W-n, W-(n-1), …, W-1
Back neighbor feature     W1, …, Wn
Current feature           W0
Front combined feature    W-1 W0
Back combined feature     W0 W1
Transition state          Label of the first front neighbor feature
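The feature types in the table can be sketched as a small extraction function. The function name `window_features`, the padding token, and the feature key names are assumptions for illustration; the actual template syntax depends on the CRF toolkit, which the slides do not name.

```python
def window_features(units, i, n=2, prev_label=None):
    """Build the features for position i from an observation window of size n.

    `units` is the sequence of labeling units (characters or words).
    """
    pad = "<PAD>"  # hypothetical token for positions outside the sentence

    def w(k):
        j = i + k
        return units[j] if 0 <= j < len(units) else pad

    feats = {}
    # Front neighbors W-n..W-1, current unit W0, back neighbors W1..Wn.
    for k in range(-n, n + 1):
        feats["W0" if k == 0 else f"W{k:+d}"] = w(k)
    # Combined features join two adjacent units.
    feats["W-1|W0"] = w(-1) + "|" + w(0)
    feats["W0|W+1"] = w(0) + "|" + w(1)
    # Transition state: label of the first front neighbor.
    feats["T-1"] = prev_label if prev_label is not None else "<START>"
    return feats


feats = window_features(list("黑龙江省"), 0, n=1)
print(feats)
```

A larger n exposes more context but multiplies the number of distinct features, which is why n influences both training time and test performance.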
2.6 An example
位于黑龙江省哈尔滨市的哈尔滨市儿童公园为孩子们准备了特殊的贺岁礼物。
Harbin Children Park in the Harbin city of Heilongjiang Province prepared special new year gifts for children.
Simple geographical names (SGN):
Harbin Children Park/SGN in the Harbin city/SGN of Heilongjiang Province/SGN prepared special new year gifts for children.
Combined geographical names (CGN):
Harbin Children Park/SGN in the Harbin city of Heilongjiang Province/CGN prepared special new year gifts for children.
2.7 Experimental performance
Dataset (Train)   Dataset (Test)   Precision   Recall   F-1     Number of recognized geographical names
PER (1-5)         PER (1)          94.01       94.91    94.46   26185
PER (1-5)         PER (6)          94.30       94.35    94.33   30126
PER (1-5)         MSRA             73.40       73.10    73.25   2674
MSRA              MSRA             93.23       87.78    90.43   3211
MSRA              PER (1)          73.61       67.84    70.61   18718
MSRA              PER (6)          71.90       69.68    70.77   22249
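The F-1 column is the harmonic mean of precision and recall. The short check below recomputes it from the reported precision and recall values; the function name `f1_score` is chosen for this sketch only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-1) as listed in the results above.
rows = [
    (94.01, 94.91, 94.46),
    (94.30, 94.35, 94.33),
    (73.40, 73.10, 73.25),
    (93.23, 87.78, 90.43),
    (73.61, 67.84, 70.61),
    (71.90, 69.68, 70.77),
]
for p, r, reported in rows:
    print(f"P={p:.2f} R={r:.2f} F1={f1_score(p, r):.2f} (reported {reported:.2f})")
```

The recomputed values agree with the reported ones to within rounding of the two-decimal inputs.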
2.8 Resolution approach
[Figure: resolution workflow. Components: matching, gazetteer, candidate referents, reference disambiguation, cognitive salience model, intended referents]
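The matching step can be sketched as a longest-match lookup of text spans in a gazetteer, returning every candidate referent for each matched name. The gazetteer entries and referent identifiers below are hypothetical, and longest-match is an assumed strategy; the slides do not specify how matching is done.

```python
def match_gazetteer(text, gazetteer):
    """Scan text left to right, preferring the longest gazetteer name at each position.

    Returns a list of (start, end, name, candidate_referents) tuples.
    """
    max_len = max(map(len, gazetteer), default=0)
    matches = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            name = text[i:i + length]
            if name in gazetteer:
                matches.append((i, i + length, name, gazetteer[name]))
                i += length          # continue after the matched name
                break
        else:
            i += 1                   # no name starts here; move one character on
    return matches


# Hypothetical toy gazetteer: name -> list of candidate referent ids.
gazetteer = {
    "Riverton": ["riverton_north", "riverton_south"],
    "Riverton Park": ["riverton_park"],
}
matches = match_gazetteer("Visit Riverton Park near Riverton.", gazetteer)
for start, end, name, candidates in matches:
    print(start, end, name, candidates)
```

Names with more than one candidate, such as the second match here, are the ones that the disambiguation step still has to resolve.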
2.9 Cognitive salience model
Geographic references that appear close together in a text tend to be highly correlated in space.
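The spatial-correlation idea can be illustrated with a simple proximity heuristic: among all combinations of candidate referents, pick the one whose members lie closest to each other. This is a simplified stand-in, not the cognitive salience model itself, and the place names and coordinates are hypothetical.

```python
import math
from itertools import product


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def resolve_by_proximity(candidates):
    """Choose one referent per name so that the summed pairwise distance is minimal.

    Exhaustive search: fine for a few names, exponential in general.
    """
    names = list(candidates)
    best, best_cost = None, float("inf")
    for combo in product(*(candidates[n] for n in names)):
        cost = sum(haversine_km(combo[i], combo[j])
                   for i in range(len(combo))
                   for j in range(i + 1, len(combo)))
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(names, best))


# Hypothetical names that co-occur in one passage, each with candidate (lat, lon).
candidates = {
    "Riverton": [(10.0, 10.0), (50.0, 50.0)],
    "Lakeside": [(10.5, 10.2)],
    "Hillview": [(11.0, 9.8), (-30.0, 100.0)],
}
resolved = resolve_by_proximity(candidates)
print(resolved)
```

The unambiguous name anchors the passage, and the ambiguous names are pulled toward the candidates nearest to it.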
2.10 Problems
Ancient geographical names
Spatio-temporal changes
Limits of statistical models
Limits of gazetteers
……
GeoChunk: an annotation system
TextMAP: an integrated system for text and maps
CGeoCoder: an address geocoding system
SRAnnotation