2010.04.05 - SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley...
2010.04.05 - SLIDE 1IS 240 – Spring 2010
Prof. Ray Larson University of California, Berkeley
School of Information
Principles of Information Retrieval
Lecture 19: DLs and GIR
2010.04.05 - SLIDE 2IS 240 – Spring 2010
Today
• Digital Libraries and IR
• Image Retrieval in DL
• From paper presented at the 1999 ASIS Annual Meeting
• More on Geographic Information Retrieval
2010.04.05 - SLIDE 3IS 240 – Spring 2010
UCB Digital Library Project: Research Agenda
• Funded by NSF/NASA/DARPA Digital Library Initiative (Phases I and II) ~1993-2004
• Research agenda
– Understand user needs.
– Extend functionality of documents.
• “Enliven” legacy documents.
– Improve access to information.
– Scale to large systems.
– Re-Invent Scholarly Information Access and Use
2010.04.05 - SLIDE 4IS 240 – Spring 2010
Testbed: An Environmental Digital Library
• Collection: Diverse material relevant to California’s key habitats.
• Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries.
• Potential: Impact on state-wide environmental system (CERES)
2010.04.05 - SLIDE 5IS 240 – Spring 2010
The Environmental Library - Users/Contributors
• California Resources Agency, California Environment Resources Evaluation System (CERES)
• California Department of Water Resources
• The California Department of Fish & Game
• SANDAG
• UC Water Resources Center Archives
• New Partners: CDL and SDSC
2010.04.05 - SLIDE 6IS 240 – Spring 2010
The Environmental Library - Contents
• Environmental technical reports, bulletins, etc.
• County general plans
• Aerial and ground photography
• USGS topographic maps
• Land use and other special purpose maps
• Sensor data
• “Derived” information
• Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
• Supporting 3-D, economic, traffic, etc. models
• Videos collected by the California Resources Agency
2010.04.05 - SLIDE 7IS 240 – Spring 2010
The Environmental Library - Contents
• As of mid 1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases.
2010.04.05 - SLIDE 8IS 240 – Spring 2010
Botanical Data:
• The CalFlora Database contains taxonomical and distribution information for more than 8000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.
2010.04.05 - SLIDE 9IS 240 – Spring 2010
Geographical Data:
• Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 TIGER records of S.F. Bay Area streets along with the 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.
2010.04.05 - SLIDE 10IS 240 – Spring 2010
Documents:
• Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.
2010.04.05 - SLIDE 11IS 240 – Spring 2010
Documents - cont.
• The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.
2010.04.05 - SLIDE 12IS 240 – Spring 2010
Photographs:
• The photo collection includes 17,000 images of California natural resources from the state Department of Water Resources; several hundred aerial photos; 17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others; a small collection of California animals; and 40,000 Corel stock photos.
2010.04.05 - SLIDE 13IS 240 – Spring 2010
Testbed Success Stories
• LUPIN: CERES’ Land Use Planning Information Network
– California County General Plans and other environmental documents.
– Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server.
• California flood relief efforts
– High demand for some data sets only available on our server (created by document recognition).
• CalFlora: Creation and interoperation of repositories pertaining to plant biology.
• Cloning of services at Cal State Library, FBI
2010.04.05 - SLIDE 14IS 240 – Spring 2010
Research Highlights
• Documents
– Multivalent Document prototype
• Page images, structured documents, GIS data, photographs
• Intelligent Access to Content
– Document recognition
– Vision-based Image Retrieval: stuff, thing, scene retrieval
– Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces
2010.04.05 - SLIDE 15IS 240 – Spring 2010
User Interface Paradigms: Multivalent Documents
• An approach to new document types and their authoring.
• Supports active, distributed, composable transformations of multimedia documents.
• Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.
2010.04.05 - SLIDE 16IS 240 – Spring 2010
Multivalent Documents
[Figure: a multivalent document shown as stacked layers over a scanned page image (“History of The Classical World”): Cheshire Layer, OCR Layer, OCR Mapping Layer, GIS Layer, Table Layer, and Network Protocols & Resources.]
Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). (Webster’s 7th Collegiate Dictionary)
2010.04.05 - SLIDE 18IS 240 – Spring 2010
GIS in the MVD Framework
• Layers are georeferenced data sets.
• Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
• Written in Java (to be merged with MVD-1 code line?)
2010.04.05 - SLIDE 19IS 240 – Spring 2010
GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html
2010.04.05 - SLIDE 20IS 240 – Spring 2010
Overview of Cheshire II
• The Cheshire II system is intended to provide an easy-to-use, standards-compliant system capable of retrieving any type of information in a wide variety of settings.
2010.04.05 - SLIDE 21IS 240 – Spring 2010
Overview of Cheshire II
• It supports SGML and XML.
• It is a client/server application.
• Uses the Z39.50 Information Retrieval Protocol.
• Server supports a Relational Database Gateway.
• Supports Boolean searching of all servers.
• Supports probabilistic ranked retrieval in the Cheshire search engine.
• Search engine supports “nearest neighbor” searches and relevance feedback.
• GUI interface on X window displays.
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire.
• Image Content retrieval using BlobWorld
• Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search and as Z39.50 Gateway
2010.04.05 - SLIDE 22IS 240 – Spring 2010
Cheshire II Searching
[Figure: Cheshire II searching local and remote collections (images, scanned text) over the Internet via Z39.50.]
2010.04.05 - SLIDE 23IS 240 – Spring 2010
Current Usage of Cheshire II
• Web clients for:
– NSF/NASA/ARPA Digital Library
• Includes support for full-text and page-level search.
• Experimental Blob-World image search
– SunSite
– University of Liverpool.
– University of Essex, HDS (part of AHDS)
– California Sheet Music Project
– Cha-Cha (Berkeley Intranet Search Engine)
– Univ. of Virginia
• Cheshire ranking algorithm is basis for Inktomi (i.e., Yahoo, Hotbot, MSN? and others)
2010.04.05 - SLIDE 24IS 240 – Spring 2010
Image Retrieval Research
• Finding “Stuff” vs “Things”
• BlobWorld
• Other Vision Research
2010.04.05 - SLIDE 25IS 240 – Spring 2010
Blobworld: use regions for retrieval
• We want to find general objects, so we represent images based on coherent regions
2010.04.05 - SLIDE 26IS 240 – Spring 2010
Outline
• Why regions?
• Creating Blobworld: segmentation and description
• Using Blobworld: query experiments
• Indexing blobs for faster querying
• Conclusions
2010.04.05 - SLIDE 27IS 240 – Spring 2010
Creating and using Blobworld
[Figure: pipeline. Create: extract features, segment image, describe regions. Use: query.]
2010.04.05 - SLIDE 28IS 240 – Spring 2010
Extract features for each pixel
• Color
– Take average color (L*a*b*) at the selected scale, so as to ignore local color variations due to texture
– “zebra = gray horse + stripes”
• Texture
– Find contrast, anisotropy, polarity at the selected scale
• Position
2010.04.05 - SLIDE 29IS 240 – Spring 2010
Find groups in feature space
• Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)
2010.04.05 - SLIDE 30IS 240 – Spring 2010
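Find regions in the image
• Label each pixel based on its Gaussian cluster
• Find connected components; these become the regions
[Figure: image with pixels labeled by cluster number (1 to 4) and grouped into connected regions]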
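The region-finding step can be sketched as a flood fill over the per-pixel cluster labels. This is a minimal illustration, not the Blobworld implementation: the function name, the 4-connectivity choice, and the tiny label grid are all assumptions made for the example.

```python
from collections import deque

def connected_regions(labels):
    """Group 4-connected pixels that share a cluster label into regions.

    `labels` is a 2-D list of per-pixel cluster ids (hypothetical input).
    Returns a grid of region ids (0, 1, 2, ...) with the same shape.
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # already assigned by an earlier flood fill
            # Flood-fill every pixel reachable through same-label neighbors.
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region

# Two pixels share cluster 1 but are not adjacent, so they end up in
# different regions even though their cluster label is the same.
grid = [[1, 2, 1],
        [3, 2, 4]]
regions = connected_regions(grid)
```

The point of the example is that one Gaussian cluster can yield several regions: cluster membership is decided in feature space, while regions are decided by adjacency in the image.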
2010.04.05 - SLIDE 31IS 240 – Spring 2010
Describe regions by color, texture, shape
• Color
– Color histogram within region
– Quadratic distance: encode similarity between color bins
  $d^2_{\mathrm{hist}}(x, y) = (x - y)^{T} A (x - y)$
• Texture
– Mean contrast and anisotropy: stripes vs. spots vs. smooth
• (Basic) Shape
– Fourier descriptors of contour
2010.04.05 - SLIDE 32IS 240 – Spring 2010
Select appropriate scale for processing
• Polarity: do all the gradient vectors point in the same direction?
• Choose the scale where polarity stabilizes, so that the window includes one approximate period
2010.04.05 - SLIDE 33IS 240 – Spring 2010
Initialize means using image data
• Before, we picked random initialization
• Now, choose initial means based on image tiles
• Add noise to means and restart EM (4 runs per K)
[Figure: results for K = 2, K = 3, K = 4, and K = 5]
2010.04.05 - SLIDE 34IS 240 – Spring 2010
Grouping: Expectation-Maximization
• Given class characteristics (μ, Σ), find class membership
• Given class membership, find class characteristics (μ, Σ)
• Iterate
[Figure: loop alternating between “update labels” and “update μ, Σ”]
2010.04.05 - SLIDE 35IS 240 – Spring 2010
How many Gaussians?
• Model selection: Minimum Description Length
– Prefer fewer Gaussians if performance is comparable
[Figure: side-by-side comparison of three alternative models]
2010.04.05 - SLIDE 36IS 240 – Spring 2010
Find groups in feature space
• Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)
2010.04.05 - SLIDE 37IS 240 – Spring 2010
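EM math
Probability density:
$$f(x \mid \Theta) = \sum_{i=1}^{K} \alpha_i f_i(x \mid \theta_i)$$
$$f_i(x \mid \theta_i) = \frac{1}{(2\pi)^{d/2} \det(\Sigma_i)^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)\right)$$
Update equations:
$$\alpha_i^{\mathrm{new}} = \frac{1}{N} \sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})$$
$$\mu_i^{\mathrm{new}} = \frac{\sum_{j=1}^{N} x_j \, p(i \mid x_j, \Theta^{\mathrm{old}})}{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})}$$
$$\Sigma_i^{\mathrm{new}} = \frac{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}}) \, (x_j - \mu_i^{\mathrm{new}})(x_j - \mu_i^{\mathrm{new}})^{T}}{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})}$$
where
$$p(i \mid x_j, \Theta^{\mathrm{old}}) = \frac{\alpha_i f_i(x_j \mid \theta_i)}{\sum_{k=1}^{K} \alpha_k f_k(x_j \mid \theta_k)}$$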
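The alternation between computing memberships and re-estimating parameters can be sketched in one dimension. This is a simplified illustration under stated assumptions, not the Blobworld code: it uses a scalar variance in place of a covariance matrix, a deterministic initialization spread over the data range (the slides use image tiles plus noise), and a variance floor added here only to keep the toy example numerically stable. The function name and data are hypothetical.

```python
import math

def em_1d(xs, k=2, iterations=50, var_floor=1e-3):
    """EM for a 1-D mixture of k >= 2 Gaussians.

    Simplified from the multivariate case: each component has a scalar
    variance rather than a covariance matrix.
    """
    n = len(xs)
    lo, hi = min(xs), max(xs)
    # Assumed initialization: means spread evenly over the data range.
    mu = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    mean = sum(xs) / n
    var = [max(sum((x - mean) ** 2 for x in xs) / n, var_floor)] * k
    alpha = [1.0 / k] * k

    def density(x, m, v):
        return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

    for _ in range(iterations):
        # E-step: membership probability of each point in each component.
        resp = []
        for x in xs:
            w = [alpha[i] * density(x, mu[i], var[i]) for i in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: re-estimate mixing weight, mean, and variance.
        for i in range(k):
            ni = sum(resp[j][i] for j in range(n))
            alpha[i] = ni / n
            mu[i] = sum(resp[j][i] * xs[j] for j in range(n)) / ni
            spread = sum(resp[j][i] * (xs[j] - mu[i]) ** 2 for j in range(n)) / ni
            var[i] = max(spread, var_floor)
    return alpha, mu, var

# Two well-separated groups of values, one near 1 and one near 5.
alpha, mu, var = em_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])
```

On this toy data the two means settle near the centers of the two groups and the mixing weights near one half each. A production version would also guard against components that receive no weight.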
2010.04.05 - SLIDE 38IS 240 – Spring 2010
Encode similarity between color bins
• Quadratic distance
• Distance between histograms x and y:
$d^2_{\mathrm{hist}}(x, y) = (x - y)^{T} A (x - y)$
• Aij is based on the similarity between bins i and j– Neighboring bins have Aij = 0.5
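A small worked example of the quadratic distance, as a sketch only. The slide gives 0.5 for neighboring bins; the values 1 on the diagonal and 0 for non-neighboring bins are assumptions made here, as are the three-bin histograms and the function name.

```python
def quadratic_distance(x, y, a):
    """Return (x - y)' A (x - y) for histograms x, y and similarity matrix a."""
    d = [xi - yi for xi, yi in zip(x, y)]
    size = len(d)
    return sum(d[i] * a[i][j] * d[j] for i in range(size) for j in range(size))

# Assumed similarity matrix for three bins in a row:
# identical bins 1.0, neighboring bins 0.5, all other pairs 0.0.
A = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]]

# All mass in bin 0 versus all mass in the neighboring bin 1.
near = quadratic_distance([1, 0, 0], [0, 1, 0], A)
# All mass in bin 0 versus all mass in the non-neighboring bin 2.
far = quadratic_distance([1, 0, 0], [0, 0, 1], A)
```

With these inputs `near` is 1.0 and `far` is 2.0, so a histogram whose mass has shifted to a similar color bin is judged closer than one whose mass has shifted to a dissimilar bin. A plain Euclidean distance would score both cases the same.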
2010.04.05 - SLIDE 39IS 240 – Spring 2010
Fourier descriptors for shape
• [Zahn & Roskies ’72, Kuhl & Giardina ’82]
• Find (x,y) representation of outer contour
• Find Fourier series of (x,y)
– Coefficients specify an ellipse (4 parameters):
– major axis, minor axis, orientation, starting point
• Remove starting point ambiguity
• Store first ten Fourier coefficients
2010.04.05 - SLIDE 40IS 240 – Spring 2010
Creating and using Blobworld
[Figure: pipeline. Create: extract features, segment image, describe regions. Use: query.]
2010.04.05 - SLIDE 41IS 240 – Spring 2010
Querying: let user see the representation
• Current systems are unsatisfying
– User can’t see what the computer sees
– Unclear how parameters relate to the image
• User should interact with the representation
– Helps in query formulation
– Makes results understandable
– Minimizes disappointment
• http://elib.cs.berkeley.edu/photos/blobworld
2010.04.05 - SLIDE 48IS 240 – Spring 2010
Query experiments
• Collection of 10,000 Corel stock photos
• Five query images in each of ten categories (e.g., cheetahs, polar bears, airplanes)
• Compare Blobworld to global histogram queries
• Precision (% of retrieved images that are correct) vs. Recall (% of correct images that are retrieved)
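The two measures can be computed for a single query as follows. This is a minimal sketch; the function name and the image identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 images returned, 5 images are actually correct,
# and 2 of the returned images are among the correct ones.
p, r = precision_recall(
    ["img1", "img2", "img3", "img4"],
    ["img2", "img4", "img7", "img8", "img9"],
)
```

Here precision is 2/4 = 0.5 and recall is 2/5 = 0.4. Plotting precision against recall as the number of returned images grows gives the kind of curve used to compare two retrieval methods.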
2010.04.05 - SLIDE 49IS 240 – Spring 2010
Distinctive objects
• Tigers, cheetahs, and zebras:
– Blobworld does better than global histograms
[Figure panels: cheetahs, zebras]
2010.04.05 - SLIDE 50IS 240 – Spring 2010
Distinctive objects and backgrounds
• Eagles and black bears:
– Blobworld does better than global histograms
[Figure panel: black bears]
2010.04.05 - SLIDE 51IS 240 – Spring 2010
Distinctive scenes
• Airplanes and brown bears:
– Global histograms do better than Blobworld
– But Blobworld has room to grow (shape, etc.)
[Figure panel: airplanes]
2010.04.05 - SLIDE 52IS 240 – Spring 2010
Index to search huge collections
• Indexing is trickier than for traditional data
• We can afford some mistakes: even with full search, we’ll miss some tigers and include some pumpkins
• Two approaches we have tried:
– Store terms and treat image as a document
– Store features and index using a tree
• Final (“correct”) ranking of images from index
2010.04.05 - SLIDE 53IS 240 – Spring 2010
Index using conventional IR methods
• Treat each database blob as a document
– Store “terms” (bins) for color, texture, location, and shape
– Repeat color terms based on histogram weights
• Index using Cheshire II
• Treat each query blob as a document
– Repeat “terms” according to query weights
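One way to picture the blob-as-document idea is to turn quantized features into repeated tokens that a text indexer can count. The sketch below is an assumption-laden illustration: the term naming scheme, the `scale` parameter, and the bin numbers are invented here and are not taken from Cheshire II or Blobworld.

```python
def blob_to_terms(color_hist, texture_bin, location_bin, shape_bin, scale=10):
    """Turn one blob's quantized features into a bag of pseudo-words.

    color_hist maps a color bin id to its histogram weight (0..1).
    Each color term is repeated in proportion to its weight, so that
    term frequency in the "document" mirrors the histogram.
    """
    terms = []
    for bin_id, weight in sorted(color_hist.items()):
        terms.extend([f"color_{bin_id}"] * round(weight * scale))
    # Texture, location, and shape each contribute a single term here.
    terms += [f"texture_{texture_bin}",
              f"location_{location_bin}",
              f"shape_{shape_bin}"]
    return terms

# Hypothetical blob: 70% of its pixels fall in color bin 3, 30% in bin 12.
doc = blob_to_terms({3: 0.7, 12: 0.3}, texture_bin=1, location_bin=4, shape_bin=2)
```

The resulting list holds seven copies of `color_3`, three copies of `color_12`, and one term each for texture, location, and shape. It can then be fed to any term-based index as if it were a short text.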
2010.04.05 - SLIDE 54IS 240 – Spring 2010
Indexing and Retrieval with Cheshire II
• Originally used the same probabilistic algorithm used for text
– Blobs are not distributed like text words or stems
• Now using a weighting based on coordination level match with a minimum threshold (must have at least half of the characteristics of the query cluster).
• Still eyeballing data, but seems much better for many types of queries
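Coordination level matching ranks items by how many distinct query terms they share. A minimal sketch with the one-half threshold follows; the function name, the tie-breaking rule, and the sample blobs are assumptions made for illustration and do not reproduce the Cheshire II weighting itself.

```python
def coordination_rank(query_terms, blobs, min_fraction=0.5):
    """Rank blobs by the number of distinct query terms they contain.

    A blob is kept only if it matches at least `min_fraction` of the
    query's distinct terms. Ties are broken by blob id (an assumption).
    """
    query = set(query_terms)
    scored = []
    for blob_id, terms in blobs.items():
        overlap = len(query & set(terms))
        if overlap >= min_fraction * len(query):
            scored.append((overlap, blob_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [blob_id for _, blob_id in scored]

query = ["color_3", "texture_1", "location_4", "shape_2"]
blobs = {
    "a": ["color_3", "texture_1", "location_4", "shape_2"],  # 4 of 4 match
    "b": ["color_3", "texture_1", "location_9", "shape_7"],  # 2 of 4 match
    "c": ["color_8", "texture_5", "location_9", "shape_2"],  # 1 of 4 match
}
ranking = coordination_rank(query, blobs)
```

Blob `c` falls below the threshold and is dropped, so the ranking is `a` followed by `b`.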
2010.04.05 - SLIDE 59IS 240 – Spring 2010
Conclusions
• Image retrieval in general collections requires region segmentation and description
• Blobworld yields high precision in queries for distinctive objects
• Blobworld can be indexed to allow fast querying
2010.04.05 - SLIDE 60IS 240 – Spring 2010
User Interface Paradigms: Multivalent Documents
• An approach to new document types and their authoring.
• Supports active, distributed, composable transformations of multimedia documents.
• Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.
2010.04.05 - SLIDE 61IS 240 – Spring 2010
Multivalent Documents
[Figure: a multivalent document shown as stacked layers over a scanned page image (“History of The Classical World”): Cheshire Layer, OCR Layer, OCR Mapping Layer, GIS Layer, Table Layer, and Network Protocols & Resources.]
Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate). (Webster’s 7th Collegiate Dictionary)
2010.04.05 - SLIDE 62IS 240 – Spring 2010
Image Retrieval Research
• Finding “Stuff” vs “Things”
• BlobWorld
2010.04.05 - SLIDE 64IS 240 – Spring 2010
Cheshire II Searching
[Figure: Cheshire II searching local and remote collections (images, scanned text) over the Internet via Z39.50.]
2010.04.05 - SLIDE 65IS 240 – Spring 2010
GIS in the MVD Framework
• Layers are georeferenced data sets.
• Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
• Written in Java
2010.04.05 - SLIDE 66IS 240 – Spring 2010
GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html
2010.04.05 - SLIDE 67IS 240 – Spring 2010
Geographic Information Retrieval and Spatial Browsing
Ray R. Larson
School of Library and Information Studies
University of California, Berkeley
2010.04.05 - SLIDE 68IS 240 – Spring 2010
Concerns for Digital Libraries
• Excellent summary in Distributed Geolibraries from NRC.
– Distributed resources
– Distributed users
– Distributed services
• Access for a broad population is critical for many Digital Libraries
2010.04.05 - SLIDE 69IS 240 – Spring 2010
Concerns for Digital Libraries
• Georeferenced Information (geoinformation) provides one organizational perspective
• Other common perspectives include Topical Classification schemes, Temporal/Historical organization (ECAI)
• DLs can provide multiple views of the same information
2010.04.05 - SLIDE 70IS 240 – Spring 2010
Concerns for Digital Libraries
• Most DLs are intended for a broad user base:
– varying levels of expertise in the contents
– varying requirements for access methods
– simple expressions of interest in natural language should be supported
– Mapping NL to controlled vocabularies (including Digital Gazetteers)
2010.04.05 - SLIDE 71IS 240 – Spring 2010
Digital Library Needs
• Geographic and Spatial Querying
• Spatial Browsing
• Geographic and Spatial Indexing
• (Berkeley DL contents and examples)
2010.04.05 - SLIDE 72IS 240 – Spring 2010
Overview
• What is Geographic Information Retrieval?
• Geographic and Spatial Querying and Browsing.
• Geographic and Spatial Indexing.
• Examples of GIR Systems and Geographically Indexed Information.
2010.04.05 - SLIDE 73IS 240 – Spring 2010
Introduction
• What is Geographic Information Retrieval?
– GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval.
– It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.
2010.04.05 - SLIDE 74IS 240 – Spring 2010
Introduction
• The need for Geographic and Spatial Information Retrieval.
– Digital Libraries
• Sequoia 2000
• UC Berkeley NSF/NASA/ARPA Digital Library Project
• UC Santa Barbara Alexandria Project
• NSDI - National Spatial Data Infrastructure
– Next-Generation Online Catalogs
• Cheshire II
2010.04.05 - SLIDE 75IS 240 – Spring 2010
Geographic and Spatial Querying
• Both imply querying on relationships within a particular coordinate system
• Spatial querying is the more general term
• Can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space
2010.04.05 - SLIDE 76IS 240 – Spring 2010
Geographic and Spatial Querying
• Geographical coordinates are geometric relationships (distance and direction can be measured on a continuous scale)
– E.g. “5.21 miles north of Champaign”
• Spatial relations may be both geometric and topological (spatially related but without measurable distance or absolute direction)
– E.g.: “inside the city limits”
– “left side of Beckman Institute”
2010.04.05 - SLIDE 77IS 240 – Spring 2010
Geographic and Spatial Querying
• Types of spatial queries
– Point-in-polygon: “What do we have at this X,Y point?”
– Region Queries: “What do we have in this region?”
• Which point-encoded items lie within the region
• What lines (borders, etc.) lie within or cross the region
• What areas overlap the region area
[Figure: a query point and a query region plotted on X and Y axes]
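The first two query types can be sketched with plain coordinate tests. This is an illustrative sketch only: it uses the standard ray-casting test for point-in-polygon and a rectangular region for the region query, with made-up coordinates and function names. A real system would use a spatial index rather than a linear scan.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon?

    `polygon` is a list of (x, y) vertices in order. A horizontal ray is
    cast to the right of the point; an odd number of edge crossings
    means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def points_in_region(points, min_x, min_y, max_x, max_y):
    """Region query over point-encoded items, for a rectangular region."""
    return [name for name, (px, py) in points.items()
            if min_x <= px <= max_x and min_y <= py <= max_y]

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
items = {"p1": (1, 1), "p2": (5, 5), "p3": (3, 2)}

hit = point_in_polygon(2, 2, square)       # point inside the square
miss = point_in_polygon(5, 2, square)      # point to the right of it
found = points_in_region(items, 0, 0, 4, 4)
```

The line and area variants of the region query need intersection tests between geometries, which are omitted here.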
2010.04.05 - SLIDE 78IS 240 – Spring 2010
Geographic and Spatial Querying
• Types of spatial queries, cont.
– Distance and Buffer Zone Queries
• What cities lie within 40 miles of the border of Northern and Southern Ireland?
• What wetlands lie within 50 miles of London?
– Path Queries
• What is the shortest route from San Francisco to Los Angeles?
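A distance query over point locations can be sketched with the haversine great-circle formula. The sketch assumes a spherical Earth of radius 6371 km and measures from a single point; a true buffer-zone query around a border or polygon needs distance-to-geometry computation, which is not shown. Function names and the small place table are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def within_distance(places, lat, lon, max_km):
    """Return the names of places no farther than max_km from (lat, lon)."""
    return [name for name, (plat, plon) in places.items()
            if haversine_km(lat, lon, plat, plon) <= max_km]

# Approximate city-center coordinates, used only as sample data.
places = {
    "San Francisco": (37.7749, -122.4194),
    "Los Angeles": (34.0522, -118.2437),
    "Oakland": (37.8044, -122.2712),
}
nearby = within_distance(places, 37.7749, -122.4194, 50)
```

With a 50 km radius around San Francisco the query returns San Francisco and Oakland but not Los Angeles. This is straight-line distance over the sphere; a path query such as the shortest driving route needs a road network and a graph search instead.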
2010.04.05 - SLIDE 79IS 240 – Spring 2010
Geographic and Spatial Querying
• Types of spatial queries, cont.
– Multimedia Queries: Use non-map georeferenced information.
• What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties?
[Figure: map regions labeled p123 and p127]
2010.04.05 - SLIDE 80IS 240 – Spring 2010
Spatial Browsing
• Combines ad hoc spatial querying with interactive displays
• HyperMap concept
• Pseudo-HyperMaps
2010.04.05 - SLIDE 81IS 240 – Spring 2010
Spatial Browsing
• Advantages:
– May not need the accuracy of a full GIS
– Comprehensible searching metaphor for many materials
• Problems:
– Clutter and differing scales.
– Requires good (and preferably accurate) geographical indexing
– Assumes that the user knows some geography
2010.04.05 - SLIDE 82IS 240 – Spring 2010
Geographic and Spatial Indexing
• Traditional geographic indexing involves using place names from LCSH and name authorities. These have some problems:
– Names are not unique
– The places referred to change size, shape and names over time
– Spelling variations
– Some places are temporary conventions (study areas, etc.)
2010.04.05 - SLIDE 83IS 240 – Spring 2010
Digital Gazetteers
• Geographic names are and will remain the primary Entry Vocabulary for DL spatial queries
– The gazetteer must support as many variant forms of the name as possible
• Including temporal ranges for particular names
– Querying must support spatial reasoning based on gazetteer and other geographic and temporal information in the system or accessible by network access
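A gazetteer that carries variant names with temporal ranges can be pictured as a table of (name, valid-from, valid-to, coordinates) entries. The sketch below is hypothetical throughout: the place names, years, and coordinates are made up, and the lookup function is not drawn from any particular gazetteer system.

```python
# Each entry: (name, first_year, last_year or None if still current, (lat, lon)).
# The entries are invented sample data describing one place with two names.
GAZETTEER = [
    ("Old Harbor", 1800, 1900, (10.0, 20.0)),
    ("New Harbor", 1900, None, (10.0, 20.0)),
]

def lookup(name, year=None):
    """Return coordinates for a name, optionally restricted to a year.

    Matching is case-insensitive. When a year is given, only entries
    whose temporal range covers that year are returned.
    """
    matches = []
    for entry_name, start, end, coords in GAZETTEER:
        if entry_name.lower() != name.lower():
            continue
        if year is not None:
            if year < start or (end is not None and year > end):
                continue
        matches.append(coords)
    return matches

historic = lookup("old harbor", 1850)   # name was in use that year
expired = lookup("Old Harbor", 1950)    # name no longer in use
current = lookup("New Harbor")          # no year restriction
```

Both names resolve to the same coordinates, which is what lets a query phrased with a historical name reach materials indexed under the modern one.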
2010.04.05 - SLIDE 85IS 240 – Spring 2010
Geographic and Spatial Indexing
• Geographic coordinates have some advantages over names:
– They are persistent regardless of name, political boundary or other changes
– They can be simply connected to spatial browsing interfaces and GIS data.
– They provide a consistent framework for GIR applications and spatial queries.
• However, the geographic extents and boundaries of entities also change over time
– This may be the primary interest of historical scholarship
2010.04.05 - SLIDE 86IS 240 – Spring 2010
Geographic and Spatial Indexing
• GIPSY: Automatic georeferencing of texts (Geographic Info Processing System)
– The work of Allison Woodruff and Christian Plaunt; later DBMS-based version by Jolly Chen; new version planned
– Designed to operate on the full text of documents
– Extracts geographic terms and attempts to identify the coordinates of the places discussed in the text using a combination of evidence
2010.04.05 - SLIDE 87IS 240 – Spring 2010
Geographic and Spatial Indexing
• GIPSY cont.
– Used the USGS Geographic Names Information System (GNIS) and Geographic Information Retrieval and Analysis System (GIRAS) to associate names with coordinates of named places, geographic features and land use characteristics.
2010.04.05 - SLIDE 88IS 240 – Spring 2010
Geographic and Spatial Indexing
• GIPSY cont.
– Identified places are added as “elevations” with each place adding a weight based on its frequency in the text and database characteristics
– The resulting map is analysed to identify the most likely locations, and coordinates for those locations are extracted
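The “elevation” idea can be sketched on a coarse grid: each place mentioned in the text raises the cells it covers, and the highest cells mark the most likely location. This is a loose illustration, not the GIPSY algorithm itself. The grid, the rectangular extents, the per-place weights, and the place names are all hypothetical.

```python
def accumulate_elevations(places, term_counts, rows, cols):
    """Stack weighted place extents onto a grid.

    places maps a name to (row0, col0, row1, col1, weight), an inclusive
    rectangle of grid cells plus a per-place weight (assumed to stand in
    for database characteristics). term_counts maps a name to how often
    it occurs in the text.
    """
    grid = [[0.0] * cols for _ in range(rows)]
    for name, count in term_counts.items():
        if name not in places:
            continue  # term not found in the gazetteer
        r0, c0, r1, c1, weight = places[name]
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] += weight * count
    return grid

def peak_cells(grid):
    """Return the cells with the highest accumulated elevation."""
    top = max(max(row) for row in grid)
    return [(r, c) for r, row in enumerate(grid)
            for c, value in enumerate(row) if value == top]

# Hypothetical extents: a broad region mentioned twice and a smaller,
# more specific place mentioned once with a higher weight.
places = {"region_a": (0, 0, 1, 1, 1.0), "town_b": (1, 1, 2, 2, 2.0)}
counts = {"region_a": 2, "town_b": 1}
grid = accumulate_elevations(places, counts, rows=3, cols=3)
best = peak_cells(grid)
```

The two extents overlap only at cell (1, 1), which accumulates 4.0 and becomes the single peak. Overlapping evidence from several place names is what narrows the estimate.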
2010.04.05 - SLIDE 89IS 240 – Spring 2010
Geographic and Spatial Indexing
• GIPSY Map Overlay
“The proposed project is the construction of a new State Water Project facility, the coastal branch... by water purveyors of northern Santa Barbara County... delivering water to San Luis Obispo ...”
2010.04.05 - SLIDE 90IS 240 – Spring 2010
Geographic and Spatial Indexing
• To be useful for the range of cultural and humanities materials being collected in digital libraries, the GIPSY gazetteer must
– Support many different time ranges, location and boundary changes
– Support synonymous and variant names with differing locations for the same entity
– Support names in multiple languages, scripts and usages
2010.04.05 - SLIDE 91IS 240 – Spring 2010
ECAI
• The Electronic Cultural Atlas Initiative is a collaboration between IT professionals and humanities scholars
• ECAI is developing a globally distributed spatio-temporal library of cultural and historical resources with a centralized metadata catalogue and a GIS viewer
• Currently the ECAI consortium includes over 250 projects
2010.04.05 - SLIDE 92IS 240 – Spring 2010
ECAI
• Projects range from small works by individual scholars to large nationally and internationally funded efforts. E.g.:
– geography of Greco-Roman culture (Perseus project)
– toponym locations for over 300,000 images of Buddhist art and architecture
– Seals of the Sasanian Empire
– historical trade routes of Eurasia
– the map of Hideyoshi’s invasion of Korea
– historical GIS projects for China, Great Britain, the United States, the Black Sea and Tibet
2010.04.05 - SLIDE 95IS 240 – Spring 2010
Opening shot of the Sasanian Empire ECAI project, showing a map with diverse resources, a timeline, and a menu of available map layers.
2010.04.05 - SLIDE 96IS 240 – Spring 2010
Users may zoom in to see resources that are only visible at a higher level of detail.
2010.04.05 - SLIDE 97IS 240 – Spring 2010
Spatial objects on the map are linked to a table of attributes, which may include any information about the objects. Note that this is a scholarly tool. By creating a “name quality” field, the author has noted that there is disagreement about the locations and names of places in the Sasanian Empire.
2010.04.05 - SLIDE 98IS 240 – Spring 2010
Sites on the map may be linked to resources elsewhere on the internet. In this case, important archaeological sites on the map are linked to web-based tours.
2010.04.05 - SLIDE 99IS 240 – Spring 2010
The map interface may be used to show change over time. The “Sasanian Empire ca. 270s” resource is highlighted, and the “Sasanian Empire ca. 570s” is greyed out. If a user slides the timeline bar, the new boundary of the empire will appear.
2010.04.05 - SLIDE 100IS 240 – Spring 2010
In a different time range, not only do the boundaries of the empire appear different, but the sites that were active during the earlier era (the red dots) have moved as well.
2010.04.05 - SLIDE 101IS 240 – Spring 2010
TimeMap is a user authoring tool, not merely a viewer. Users can control the look of the icons, the map layers that comprise a project, and, as shown here, the map scale at which different layers will become visible.
2010.04.05 - SLIDE 102IS 240 – Spring 2010
This screen displays the metadata for a part of the Sasanian Empire project. The metadata includes functional (tm.) metadata to enable connection to the map interface in addition to cataloguing (dc. and ecai.) metadata. Using the menu on the left, users may choose to map individual map layers or packaged projects.
2010.04.05 - SLIDE 106IS 240 – Spring 2010
Prof. Ray Larson University of California, Berkeley
School of Information
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2007
http://courses.ischool.berkeley.edu/i240/s07
Principles of Information Retrieval
Lecture 23: GIR Continued
2010.04.05 - SLIDE 107IS 240 – Spring 2010
Today
• Review– Geographic Information Retrieval
• Parts of this lecture were presented at the invitational conference “The ‘I’ in Geographic Information Science”, Manchester, U.K., July 2001
• GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
2010.04.05 - SLIDE 108IS 240 – Spring 2010
Introduction
• What is Geographic Information Retrieval?
– GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval.
– It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.
2010.04.05 - SLIDE 109IS 240 – Spring 2010
Introduction
• The need for Geographic and Spatial Information Retrieval.
– Digital Libraries
• Sequoia 2000
• UC Berkeley NSF/NASA/ARPA Digital Library Project
• UC Santa Barbara Alexandria Project
• NSDI - National Spatial Data Infrastructure
– Next-Generation Online Catalogs
• Cheshire II
2010.04.05 - SLIDE 110IS 240 – Spring 2010
Geographic and Spatial Querying
• Both imply querying on relationships within a particular coordinate system
• Spatial querying is the more general term
• Can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space
2010.04.05 - SLIDE 111IS 240 – Spring 2010
Geographic and Spatial Querying
• Geographical coordinates are geometric relationships (distance and direction can be measured on a continuous scale)
– E.g. “5.21 miles north of Champaign”
• Spatial relations may be both geometric and topological (spatially related but without measurable distance or absolute direction)
– E.g.: “inside the city limits”
– “left side of Beckman Institute”
2010.04.05 - SLIDE 112IS 240 – Spring 2010
Geographic and Spatial Querying
• Types of spatial queries
– Point-in-polygon: “What do we have at this X,Y point?”
– Region Queries: “What do we have in this region?”
• Which point-encoded items lie within the region
• What lines (borders, etc.) lie within or cross the region
• What areas overlap the region area
[Figure: a query point and a query region plotted on X and Y axes]
2010.04.05 - SLIDE 113IS 240 – Spring 2010
Geographic and Spatial Querying
• Types of spatial queries, cont.
– Distance and Buffer Zone Queries
• What cities lie within 40 miles of the border of Northern and Southern Ireland?
• What wetlands lie within 50 miles of London?
– Path Queries
• What is the shortest route from San Francisco to Los Angeles?
2010.04.05 - SLIDE 114IS 240 – Spring 2010
Geographic and Spatial Querying
• Types of spatial queries, cont.
– Multimedia Queries: Use non-map georeferenced information.
• What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties?
[Figure: map regions labeled p123 and p127]
2010.04.05 - SLIDE 115IS 240 – Spring 2010
Spatial Browsing
• Combines ad hoc spatial querying with interactive displays
• HyperMap concept
• Pseudo-HyperMaps
2010.04.05 - SLIDE 116IS 240 – Spring 2010
Geographic and Spatial Indexing
• GIPSY Map Overlay
“The proposed project is the construction of a new State Water Project facility, the coastal branch... by water purveyors of northern Santa Barbara County... delivering water to San Luis Obispo ...”
2010.04.05 - SLIDE 117IS 240 – Spring 2010
Geographic and Spatial Indexing
• To be useful for the range of cultural and humanities materials being collected in digital libraries, the GIPSY gazetteer must
– Support many different time ranges, location and boundary changes
– Support synonymous and variant names with differing locations for the same entity
– Support names in multiple languages, scripts and usages
2010.04.05 - SLIDE 118IS 240 – Spring 2010
The map interface may be used to show change over time. The “Sasanian Empire ca. 270s” resource is highlighted, and the “Sasanian Empire ca. 570s” is greyed out. If a user slides the timeline bar, the new boundary of the empire will appear.