…and postgis & full text search & fuzzy comparisons.
POSTGRESQL
…and PostGIS & full text search & fuzzy comparisons
POSTGIS
For manipulating 2D/3D spatial data
Points, lines, and polygons formed from points and lines
Can perform union and intersection operations
Can project shapes into 2D areas
Has a 3D geometry type (relatively new)
Can calculate accurate distances in meters
Works with an open source server that allows folks to share geospatial data
Command line interface
Also supports some forms of raster data
Provides spatial indices
Has a notion of a geometric column
QUERIES
SELECT superhero.name FROM city, superhero WHERE ST_Contains(city.geom, superhero.geom) AND city.name = 'Gotham';
SELECT ST_AsBinary(r.the_geom) AS wkb_geometry FROM river AS r, state AS s WHERE ST_Intersects(r.the_geom, s.the_geom);
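To make the first query concrete, here is a minimal Python sketch of what a containment test does for a point and a simple polygon. It uses the textbook ray-casting method, which is an assumption made for illustration and not a description of how PostGIS implements ST_Contains. The city outline and superhero coordinates are hypothetical sample data.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical data standing in for the city and superhero tables.
cities = {"Gotham": [(0, 0), (4, 0), (4, 4), (0, 4)]}
superheroes = [("Batman", (2, 2)), ("Superman", (5, 2))]


def heroes_in_city(city_name):
    """Rough analogue of the first query: names of heroes located inside the city."""
    outline = cities[city_name]
    return [name for name, location in superheroes if point_in_polygon(location, outline)]
```

Calling heroes_in_city("Gotham") on this sample data returns only the hero whose point lies inside the square. Points exactly on the boundary are not handled carefully here.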
MAPNIK
Used for OSM (OpenStreetMap) data, and uses PostGIS
Mapnik is an open source system for rendering maps
Used to design maps
Written in C++
It renders maps from PostGIS databases
NEXT: FULL TEXT AND APPROXIMATE TEXT SEARCH
But first, not to be confused with:
The LIKE operator, which uses % as the wildcard
Or regular expressions for character string comparison
FULL TEXT SEARCH
First, you index the words in a document and create an array of lexemes
Second, specify a boolean phrase using and, or, not, and parens
We typically don't index "stop" words like and, or, the, etc.
Dictionaries are used to find roots of related words, like dead and dying
Thesaurus dictionaries are used for recognition of domain-specific and similar words
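The two steps above can be sketched in a few lines of Python. This is a toy under stated assumptions: the stop-word list and the suffix-stripping stem function are hypothetical stand-ins for PostgreSQL's dictionaries, and the boolean phrase is written as an ordinary Python expression.

```python
import re

# Hypothetical stop-word list; real configurations ship much longer ones.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "in", "to", "is", "are"}


def stem(word):
    """Very crude stemmer: strip one common English suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_lexemes(text):
    """Step 1: split into words, drop stop words, reduce each word to a root, deduplicate."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({stem(w) for w in words if w not in STOP_WORDS})


lexemes = to_lexemes("The cats and the dog are jumping")

# Step 2: a boolean phrase, here the equivalent of (cat and dog) and not bird.
matches = ("cat" in lexemes and "dog" in lexemes) and not ("bird" in lexemes)
```

With this sample sentence, lexemes holds the three roots cat, dog, and jump, and matches is true.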
DOCUMENTS
A document is a text attribute in a row of a table
Often we use part of a document or concatenate various parts of documents
DETAILS: DICTIONARIES
Define stop words that should not be indexed
Map synonyms to a single word
Map phrases to a single word using a thesaurus
Map different variations of a word to a canonical form
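As a rough illustration of those four jobs, the Python sketch below chains them in one normalization pass. Every table in it (STOP_WORDS, SYNONYMS, THESAURUS, CANONICAL) is hypothetical sample data, and the ordering of the steps is an assumption made for simplicity, not a statement about how PostgreSQL chains its dictionaries.

```python
# All four tables are hypothetical examples.
STOP_WORDS = {"the", "a", "of", "in"}
SYNONYMS = {"postgres": "postgresql", "pgsql": "postgresql"}
THESAURUS = {("full", "text", "search"): "fts"}  # phrase -> single word
CANONICAL = {"dying": "die", "dead": "die", "died": "die"}


def normalize(text):
    """Turn raw text into a list of normalized words using the tables above."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        # Try to replace a whole phrase first.
        for phrase, replacement in THESAURUS.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(replacement)
                i += len(phrase)
                break
        else:
            word = tokens[i]
            i += 1
            if word in STOP_WORDS:
                continue  # stop words are not indexed
            word = SYNONYMS.get(word, word)          # synonyms collapse to one word
            out.append(CANONICAL.get(word, word))    # variations collapse to one form
    return out
```

For example, normalize("The dying art of full text search in pgsql") yields die, art, fts, and postgresql, in that order.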
SEARCHING
Uses a match operator: @@
Basic search consists of asking about the relationship of a vector of words to a given document, which is also a vector
The query vector can have and, or, etc. in it
tsvector: the document, as normalized lexemes
tsquery: the query
EXAMPLES
SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
Selects the ten most recent documents that contain create and table in the title or body
Results can be ranked
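The logic of that query can be mimicked in plain Python to show what each clause contributes. This is a sketch only: to_vector skips stemming and stop words, the rows are hypothetical, and the matching is a simple subset test standing in for the @@ operator with an and-only query.

```python
import re
from datetime import date


def to_vector(text):
    """Simplified stand-in for to_tsvector: lowercase words as a set, no stemming."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(rows, required, limit=10):
    """Keep rows whose title plus body contain every required word, newest first."""
    hits = [r for r in rows if required <= to_vector(r["title"] + " " + r["body"])]
    hits.sort(key=lambda r: r["last_mod_date"], reverse=True)
    return [r["title"] for r in hits[:limit]]


# Hypothetical rows standing in for the pgweb table.
rows = [
    {"title": "Create table basics", "body": "How to define columns.",
     "last_mod_date": date(2020, 1, 5)},
    {"title": "Indexes", "body": "Create an index on a table.",
     "last_mod_date": date(2021, 3, 1)},
    {"title": "Vacuum", "body": "Reclaiming storage.",
     "last_mod_date": date(2022, 7, 9)},
]

titles = search(rows, {"create", "table"})
```

On this data, titles lists the two matching documents with the more recently modified one first; the third row never matches because it contains neither word.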
RECENT ADDITION: FUZZINESS
soundex(text) returns text
Converts a string to its Soundex code
Based on pronunciation
difference(text, text) returns int
Converts two strings to their Soundex codes and then reports the number of matching code positions
0 is no match, 4 is a full match
Def: Soundex is a phonetic coding system intended to suppress spelling variation and determine the relationship between two (similar) words
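To show how such a code is built, here is a Python sketch of the classic American Soundex rules together with a difference function that counts matching code positions. It is written from the textbook description of the algorithm, so PostgreSQL's own soundex may differ on edge cases such as non-ASCII input.

```python
# Letter groups of classic Soundex; vowels and H, W, Y carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code (first letter plus three digits)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (out + "000")[:4]


def difference(a, b):
    """Number of positions (0 to 4) at which the two Soundex codes agree."""
    return sum(1 for x, y in zip(soundex(a), soundex(b)) if x == y)
```

Under these rules Robert and Rupert share the code R163, so their difference is 4, while Anne (A500) and Andrew (A536) agree in only two positions.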
LEVENSHTEIN
Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words
E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL');
E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10;
Used, in particular, to detect nicknames
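For readers who want to see the metric itself, the Python sketch below computes it with the standard dynamic-programming recurrence (unit cost for insert, delete, and substitute) and then mirrors the filter-and-sort pattern of the second query. The closest helper and its sample codes are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def closest(codes, target, max_distance=3, limit=10):
    """Codes within max_distance of target, nearest first, at most limit results."""
    scored = [(levenshtein(code, target), code) for code in codes]
    scored = [pair for pair in scored if pair[0] <= max_distance]
    scored.sort(key=lambda pair: pair[0])
    return [code for _, code in scored[:limit]]
```

GUMBO becomes GAMBOL with one substitution (U to A) and one insertion (L), so levenshtein('GUMBO', 'GAMBOL') is 2.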
METAPHONE
E.g., metaphone(text source, int max_output_length) returns text
Similar to soundex
Used to classify words according to their English pronunciation
Apparently better for non-English languages, compared to soundex
E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2);
This should detect similarity to the word William