…and postgis & full text search & fuzzy comparisons.

POSTGRESQL …and postgis & full text search & fuzzy comparisons

POSTGIS
For manipulating 2D/3D spatial data
Points, lines, and polygons formed from points and lines
Can perform union and intersection operations
Can project shapes into 2D areas
Has a 3D geometry type (relatively new)
Can calculate accurate distances in meters
Works with an open source server that allows folks to share geospatial data
Command line interface
Also supports some forms of raster data
Provides spatial indices
Has a notion of a geometry column
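A minimal sketch of the geometry column, spatial index, distance in meters, and union/intersection operations listed above. It assumes the postgis extension is installed; the landmark table, its names, and its coordinates are made up for illustration.

```sql
-- Assumes PostGIS is available on the server.
CREATE EXTENSION IF NOT EXISTS postgis;

-- Hypothetical table with a geometry column holding WGS 84 (SRID 4326) points.
CREATE TABLE landmark (
    id   serial PRIMARY KEY,
    name text NOT NULL,
    geom geometry(Point, 4326)
);

-- A GiST index is the usual spatial index on a geometry column.
CREATE INDEX landmark_geom_idx ON landmark USING GIST (geom);

-- ST_MakePoint takes (longitude, latitude); these coordinates are illustrative.
INSERT INTO landmark (name, geom) VALUES
    ('A', ST_SetSRID(ST_MakePoint(-74.0060, 40.7128), 4326)),
    ('B', ST_SetSRID(ST_MakePoint(-73.9352, 40.7306), 4326));

-- Casting to geography makes ST_Distance return meters
-- instead of planar distance in degrees.
SELECT a.name, b.name,
       ST_Distance(a.geom::geography, b.geom::geography) AS meters
FROM landmark AS a
JOIN landmark AS b ON a.id < b.id;

-- Intersection and union of two overlapping buffers drawn around the points.
SELECT ST_AsText(ST_Intersection(ST_Buffer(a.geom, 0.1), ST_Buffer(b.geom, 0.1))) AS overlap,
       ST_AsText(ST_Union(ST_Buffer(a.geom, 0.1), ST_Buffer(b.geom, 0.1)))        AS merged
FROM landmark AS a
JOIN landmark AS b ON a.id < b.id;
```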

QUERIES
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
  AND city.name = 'Gotham';

SELECT ST_AsBinary(r.the_geom) AS wkb_geometry
FROM river AS r, state AS s
WHERE ST_Intersects(r.the_geom, s.the_geom);

MAPNIK
Used for OSM (OpenStreetMap) data and uses PostGIS
Mapnik is an open source system for rendering maps
Used to design maps
Written in C++
It renders maps from PostGIS databases

NEXT: FULL TEXT AND APPROXIMATE TEXT SEARCH
But first, not to be confused with the LIKE operator, which uses % as the wildcard
Or with regular expressions for character string comparison
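For contrast with what follows, here is a small sketch of plain pattern matching in PostgreSQL. The literals are arbitrary examples; none of these operators know anything about word roots or stop words.

```sql
-- LIKE: % matches any run of characters, _ matches exactly one character.
SELECT 'PostgreSQL' LIKE 'Post%';        -- true
SELECT 'PostgreSQL' LIKE 'Post_reSQL';   -- true

-- ILIKE is the case-insensitive variant (a PostgreSQL extension).
SELECT 'PostgreSQL' ILIKE 'postgres%';   -- true

-- ~ is a POSIX regular expression match; ~* ignores case.
SELECT 'PostgreSQL' ~ '^Post.*SQL$';     -- true
SELECT 'PostgreSQL' ~* 'sql$';           -- true

-- Pattern matching is purely character based: 'cat' does not match 'cats' here.
SELECT 'The cats sat' LIKE '%cat sat%';  -- false
```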

FULL TEXT SEARCH
First, you index the words in a document and create an array of lexemes
Second, specify a boolean phrase using and, or, not, and parens
We typically don’t index “stop” words like and, or, the, etc.
Dictionaries are used to find roots of related words, like dead and dying
Thesaurus dictionaries are used for recognition of domain-specific and similar words
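The two steps above can be sketched with PostgreSQL's built-in functions. This assumes the stock 'english' text search configuration; the sample sentences are arbitrary.

```sql
-- Step 1: turn a document into lexemes. Stop words are dropped and
-- the remaining words are reduced to their roots, with positions kept.
SELECT to_tsvector('english', 'The Fat Rats');
-- 'fat':2 'rat':3

-- Step 2: match a boolean query against it.
-- & is and, | is or, ! is not; parentheses group.
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat');             -- true (rats is stemmed to rat)

SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', '(cat | dog) & !mouse');  -- true

SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'cat & !rat');            -- false
```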

DOCUMENTS
A document is a text attribute in a row of a table
Often we use part of a document or concatenate various parts of documents
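A sketch of building one searchable document out of several columns. The pgweb table and its columns are assumed here for illustration; coalesce guards against a NULL column turning the whole concatenation into NULL.

```sql
-- Hypothetical table; the column names are assumptions for this sketch.
CREATE TABLE IF NOT EXISTS pgweb (
    title         text,
    body          text,
    last_mod_date date
);

-- Concatenate title and body into a single document before indexing it.
SELECT title
FROM pgweb
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('english', 'create & table');

-- Alternatively, label each part with a weight (A is highest) so that
-- a later ranking step can favor title matches over body matches.
SELECT setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
       setweight(to_tsvector('english', coalesce(body, '')), 'B') AS document
FROM pgweb;
```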

DETAILS: DICTIONARIES
Define stop words that should not be indexed
Map synonyms to a single word
Map phrases to a single word using a thesaurus
Map different variations of a word to a canonical form
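Dictionaries can be probed one word at a time with ts_lexize, which is a handy way to see the behavior described above. This sketch uses the built-in english_stem dictionary and defines a small dictionary named simple_dict (a name chosen for this example) on top of the stock English stop word list.

```sql
-- The Snowball stemmer maps variations of a word to a canonical form.
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- An empty array means the word is a stop word and will not be indexed.
SELECT ts_lexize('english_stem', 'a');       -- {}

-- A dictionary built from the "simple" template: it lower-cases words
-- and discards anything found in the English stop word file.
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE  = pg_catalog.simple,
    STOPWORDS = english
);

SELECT ts_lexize('public.simple_dict', 'YeS');  -- {yes}
SELECT ts_lexize('public.simple_dict', 'The');  -- {}
```

Synonym and thesaurus dictionaries are created the same way from the synonym and thesaurus templates, but they read mapping files that must already exist on the server, so they are not shown here.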

SEARCHING
Uses a match operator: @@
Basic search consists of asking about the relationship of a vector of words to a given document, which is also a vector
The query vector can have and, or, etc. in it
tsvector: the document, as normalized lexemes
tsquery: the query

EXAMPLES
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;
Selects the ten most recent documents that contain create and table in the title or body
Results can be ranked
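One way to rank results is ts_rank, which scores each matching document against the query. The sketch below reuses the pgweb table from the example, with assumed column types, and orders by relevance instead of by date; the index name is arbitrary.

```sql
-- Hypothetical table; the column types are assumptions for this sketch.
CREATE TABLE IF NOT EXISTS pgweb (
    title         text,
    body          text,
    last_mod_date date
);

-- Ten best matches for "create AND table", most relevant first.
SELECT title,
       ts_rank(to_tsvector('english', title || ' ' || body), query) AS rank
FROM pgweb,
     to_tsquery('english', 'create & table') AS query
WHERE to_tsvector('english', title || ' ' || body) @@ query
ORDER BY rank DESC
LIMIT 10;

-- An expression GIN index lets the @@ match avoid rescanning every row.
-- The indexed expression must match the one used in the WHERE clause.
CREATE INDEX pgweb_fts_idx ON pgweb
    USING GIN (to_tsvector('english', title || ' ' || body));
```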

RECENT ADDITION: FUZZINESS
soundex(text) returns text
  Converts a string to its Soundex code
  Based on pronunciation
difference(text, text) returns int
  Converts two strings to their Soundex codes and then reports the number of matching code positions
  0 is no match
  4 is a full match
Def (Soundex): A phonetic coding system intended to suppress spelling variation and determine the relationship between two (similar) words
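A short sketch of both functions, assuming the fuzzystrmatch extension, which provides them, can be installed. The names compared are arbitrary.

```sql
-- soundex and difference live in the fuzzystrmatch extension.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Same Soundex code, so all four code positions match.
SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');
-- A500 | A500 | 4

-- Codes agree in the first two positions only.
SELECT soundex('Anne'), soundex('Andrew'), difference('Anne', 'Andrew');
-- A500 | A536 | 2

-- No positions in common.
SELECT soundex('Anne'), soundex('Margaret'), difference('Anne', 'Margaret');
-- A500 | M626 | 0
```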

LEVENSHTEIN
Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words
E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL');
E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10;
Used, in particular, to detect nicknames
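To make the metric concrete: the distance counts the single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch assumes the fuzzystrmatch extension; some_table and its code column come from the example above and are created here with an assumed type.

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- GUMBO -> GAMBO (substitute U with A) -> GAMBOL (insert L): two edits.
SELECT levenshtein('GUMBO', 'GAMBOL');           -- 2

-- Optional costs for insert, delete, substitute; here an insert costs 2.
SELECT levenshtein('GUMBO', 'GAMBOL', 2, 1, 1);  -- 3

-- Hypothetical table matching the query above.
CREATE TABLE IF NOT EXISTS some_table (code text);

-- levenshtein_less_equal may stop early once the distance exceeds the bound,
-- which can be cheaper when only near matches are of interest.
SELECT *
FROM some_table
WHERE levenshtein_less_equal(code, 'AB123-lHdfj', 3) <= 3
ORDER BY levenshtein(code, 'AB123-lHdfj')
LIMIT 10;
```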

METAPHONE
E.g., metaphone(text source, int max_output_length) returns text
Similar to soundex
Used to classify words according to their English pronunciation
Apparently better for non-English languages, compared to soundex
E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2);
Should detect similarity to the word William
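A small sketch for experimenting with metaphone codes directly, assuming the fuzzystrmatch extension. The second argument caps the length of the returned code, so a small value makes the comparison coarser.

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- A metaphone code truncated to at most four characters.
SELECT metaphone('GUMBO', 4);   -- KM

-- Compare the two-character codes of the names from the example above.
SELECT metaphone('Willem', 2)  AS willem_code,
       metaphone('William', 2) AS william_code,
       metaphone('Willem', 2) = metaphone('William', 2) AS same_code;
```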
