…and postgis & full text search & fuzzy comparisons.

13
POSTGRESQL …and postgis & full text search & fuzzy comparisons

Transcript of …and postgis & full text search & fuzzy comparisons.

Page 1: …and postgis & full text search & fuzzy comparisons.

POSTGRESQL…and postgis &

full text search &fuzzy comparisons

Page 2: …and postgis & full text search & fuzzy comparisons.

POSTGIS For manipulating 2D/3D spatial data

Points, lines, and polygons formed from points and lines

Can perform union, intersection, operations Can project shapes into 2D areas Has a 3D geometry type (relatively new)

Can calculate accurate distances in meters Works with an open source server that allows folks

to share geospatial data Command line interface Also supports some forms of raster data Provides spatial indices Has a notion of a geometric column

Page 3: …and postgis & full text search & fuzzy comparisons.

QUERIESSELECT superhero.name FROM city, superhero WHERE ST_Contains(city.geom, superhero.geom) and city.name = 'Gotham';

SELECT AsBinary(the_geom) as wkb_geometry FROM river AS r, state AS s WHERE intersects(r.the_geom, s.the_geom)

Page 4: …and postgis & full text search & fuzzy comparisons.

MAPNIK Used for OSM (open street map) data

and uses postgisMapnik is an open source system for

rendering mapsUsed to design mapsWritten in C++ It renders maps from postgis databases

Page 5: …and postgis & full text search & fuzzy comparisons.

NEXT: FULL TEXT AND APPROXIMATE TEXT SEARCH But first, not to be confused with the

Like operatorUsed % as the wild card

Or with regular expressions for character string comparison

Page 6: …and postgis & full text search & fuzzy comparisons.

FULL TEXT SEARCH First, you index the words in a document

and create an array of lexemes Second, specify a boolean phrase using

and, or, not, and parens We typically don’t index “stop” words

like and, or, the, etc. Dictionaries are used to find roots of

related words, like dead and dying Thesauruses dictionaries are used to for

recognition of domain-specific and similar words

Page 7: …and postgis & full text search & fuzzy comparisons.

DOCUMENTS A document is a text attribute in a row

of a table Often we use part of a document or

concatenate various parts of documents

Page 8: …and postgis & full text search & fuzzy comparisons.

DETAILS: DICTIONARIES Define stop words that should not

be indexed Map synonyms to a single word. Map phrases to a single word using

a thesaurus. Map different variations of a word

to a canonical form

Page 9: …and postgis & full text search & fuzzy comparisons.

SEARCHING Uses a match operator - @@ Basic search consists of asking about

the relationship to a vector of words to a given document, which is also a vectorThe vector can have and, or, etc. in it tsvector – document – normalized lexemes tsquery – query

Page 10: …and postgis & full text search & fuzzy comparisons.

EXAMPLESSELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;

select the ten most recent documents that contain create and table in the title or body

Results can be ranked

Page 11: …and postgis & full text search & fuzzy comparisons.

RECENT ADDITION: FUZZINESS soundex(text) returns text

Converts a string to its Soundex code Based on pronunciation

difference(text, text) returns int converts two strings to their Soundex codes

and then reports the number of matching code positions 0 is a no match 4 is a full match

Def: A phonetic coding system intended to suppress spelling variation and determining the relationship between two (similar) words

Page 12: …and postgis & full text search & fuzzy comparisons.

LEVENSHTEIN Levenshtein distance is a metric for

evaluating the difference between two sequences, in particular, words

E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL');

E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10

Used in particular, to detect nicknames

Page 13: …and postgis & full text search & fuzzy comparisons.

METAPHONE E.g., metaphone(text source, int

max_output_length) returns text Similar to soundex Used to classify words according to their

english pronunciation Apparently better for non-english

languages, compared to soundex E.g.: SELECT * FROM users WHERE

METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect similarity to word William