Databases for text storage

Jonathan Ronen

New York University

[email protected]

December 1, 2014

Jonathan Ronen (NYU) databases December 1, 2014 1 / 24


Overview

1 Introduction

2 PostgreSQL

3 MongoDB


Why Databases?

Structured way to store your data

Accessible, shareable

Manage growing volumes of data

You cannot keep all of your data in working memory...

indexing


Basic issues with databases

Inserting data

Schema

Querying

Indexing


I’ll show you how to do this in

PostgreSQL

MongoDB


PostgreSQL

Relational DB

Which means we define tables with columns and relations

Queried using Structured Query Language

ES-QUE-ELL, or SEQUEL, but not SQUEAL

Open source, free, very fast, with advanced text search capabilities

Friendly elephant logo

Basics of SQL

SELECT statement

SELECT * FROM tweets WHERE user_id = 2170941466;

SELECT statement with time range

SELECT * FROM tweets WHERE timestamp > '2014-12-2';

SELECT statement with LIKE

SELECT * FROM tweets WHERE lower(text) LIKE '%obama%';
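A minimal sketch of these three queries, run from Python against an in-memory SQLite database. SQLite is only a stand-in for PostgreSQL here (an assumption made so the example runs without a server), and the sample rows and the `ids` helper are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "timestamp TEXT, text TEXT)"
)

# Hypothetical sample rows, invented for illustration.
rows = [
    (1, 2170941466, "2014-12-01 09:00:00", "Obama speaks tonight"),
    (2, 42, "2014-12-02 15:30:00", "Lunch was great"),
    (3, 2170941466, "2014-12-03 08:15:00", "Watching the OBAMA interview"),
]
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)


def ids(query, params=()):
    """Run a query and return the matching tweet ids in ascending order."""
    return sorted(row[0] for row in conn.execute(query, params))


# Filter on an exact column value.
by_user = ids("SELECT * FROM tweets WHERE user_id = ?", (2170941466,))

# SQLite compares these timestamps as strings, so the date is zero-padded.
since = ids("SELECT * FROM tweets WHERE timestamp > ?", ("2014-12-02",))

# Case-insensitive substring match on the tweet text.
mentions = ids("SELECT * FROM tweets WHERE lower(text) LIKE ?", ("%obama%",))

print(by_user, since, mentions)
```

The `?` placeholders pass values separately from the SQL string, which is the usual way to avoid quoting problems when a query is built in code.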


Indexing

Imagine searching through a table:

id  user_id  timestamp            text
1   1        2014-11-30 10:23:40  I love the biebsssss!
2   2        2014-11-30 11:33:44  Bieberboy make me a baby!
3   1        2014-11-30 10:23:23  God if biebs dont come i shoot myself!!!
4   3        2014-11-30 9:12:11   I love bieber so much i have bieber sandwiches
5   2        2014-11-30 12:33:10  RT if you love biebsbs as much ias me!! or you die!!!!

Find me all tweets since noon.


Indexing

Imagine searching through a table:

id  user_id  timestamp            text
4   3        2014-11-30 9:12:11   I love bieber so much i have bieber sandwiches
3   1        2014-11-30 10:23:23  God if biebs dont come i shoot myself!!!
1   1        2014-11-30 10:23:40  I love the biebsssss!
2   2        2014-11-30 11:33:44  Bieberboy make me a baby!
5   2        2014-11-30 12:33:10  RT if you love biebsbs as much ias me!! or you die!!!!

Easy! Sort by time!


Indexing

An index is a sorted copy of a column.

timestamp            id
2014-11-30 9:12:11   4
2014-11-30 10:23:23  3
2014-11-30 10:23:40  1
2014-11-30 11:33:44  2
2014-11-30 12:33:10  5

(Or really, it’s usually a B-tree...)
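The "sorted copy of a column" idea can be sketched in a few lines of Python. This is an illustration of the principle only, not how PostgreSQL implements its indexes: it uses a sorted list with binary search instead of a B-tree, and the `ids_since` helper is a hypothetical name. Timestamps are zero-padded so that string order matches time order.

```python
from bisect import bisect_left

# The example table in insertion order: (id, user_id, timestamp).
tweets = [
    (1, 1, "2014-11-30 10:23:40"),
    (2, 2, "2014-11-30 11:33:44"),
    (3, 1, "2014-11-30 10:23:23"),
    (4, 3, "2014-11-30 09:12:11"),
    (5, 2, "2014-11-30 12:33:10"),
]

# The "index": a sorted copy of the timestamp column, paired with row ids.
index = sorted((ts, tweet_id) for tweet_id, _, ts in tweets)


def ids_since(index, cutoff):
    """Return ids of rows with timestamp >= cutoff, found by binary search."""
    # A 1-tuple sorts before any 2-tuple that starts with the same timestamp,
    # so this lands on the first entry at or after the cutoff.
    start = bisect_left(index, (cutoff,))
    return [tweet_id for _, tweet_id in index[start:]]


since_noon = ids_since(index, "2014-11-30 12:00:00")
print(since_noon)
```

The lookup inspects only a handful of index entries to find the starting point, instead of scanning every row of the table.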


Text search in PostgreSQL

SELECT statement using PG text search

SELECT * FROM tweets WHERE to_tsvector('english', text) @@ to_tsquery('obama');

to_tsvector

to_tsquery

(show these in the terminal...)


Text indexing

CREATE INDEX statement

CREATE INDEX text_idx ON tweets USING gin(to_tsvector('english', text));

SELECT statement using text index

SELECT * FROM tweets WHERE to_tsvector('english', text) @@ to_tsquery('obama');


Aggregation

GROUP BY statement

SELECT user_id, count(*) FROM tweets GROUP BY user_id;
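A small runnable sketch of this GROUP BY query, using the (id, user_id) pairs from the example table on the indexing slides. An in-memory SQLite database stands in for PostgreSQL here, which is an assumption made to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER, user_id INTEGER)")

# (id, user_id) pairs taken from the example table on the indexing slides.
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 1), (4, 3), (5, 2)],
)

# One output row per distinct user_id, with the number of tweets in that group.
counts = dict(
    conn.execute("SELECT user_id, count(*) FROM tweets GROUP BY user_id")
)
print(counts)
```

Each result row is a (user_id, count) pair, so the cursor can be fed straight into `dict`.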


MongoDB

Document store

NoSQL doesn’t mean the query language isn’t structured (but it’s different...)

Open source, free, really fast (sometimes)


JSON documents

{
  "created_at": "Wed Aug 13 15:20:46 +0000 2014",
  "lang": "en",
  "retweet_count": 0,
  "text": "Pennsylvania USA Philadelphia \u00bb MikeBrown 545 Mike Brown: St. Louis Police Shoot amp Kill Unarmed 18-Year-Old -- S\u2026 http://t.co/RgDpM8M881",
  "user": {
    "name": "Jeff",
    "screen_name": "jeffersondol",
    "statuses_count": 207845,
    "description": "#android, #androidgames, #iphone, #iphonegames, #ipad, #ipadgames, #app",
    "followers_count": 810,
    "lang": "en",
    "geo_enabled": false,
    "location": "Florida"
  }
}


MongoDB is a document database

MongoDB lets you store these documents directly

No need to flatten to tabular form!

Comes with its own query syntax

Also uses indexing to speed queries

SQL       MongoDB
Database  Database
Table     Collection
Row       Document
Index     Index


MongoDB Query Syntax

Regex matching

db.collection.find({'text': /obama/})

Date range

db.collection.find({timestamp: {$gt: new Date(2014, 10, 6)}})


Text search in MongoDB

Creating a text index

db.tweets.ensureIndex({text: "text"})

Using text search

db.tweets.findOne({$text: {$search: "obama"}})


Aggregation in MongoDB

Aggregation framework

db.tweets.aggregate({$group: {_id: "$user.screen_name", number: {$sum: 1}}})
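To make clear what this $group stage computes, here is a plain-Python imitation over a few in-memory documents. It does not talk to MongoDB at all: the sample documents and the `group_count` helper are hypothetical, and the results are sorted only to make the output deterministic (MongoDB itself does not promise an order here).

```python
from collections import Counter

# Hypothetical tweet documents, trimmed to the one nested field used here.
tweets = [
    {"text": "first", "user": {"screen_name": "alice"}},
    {"text": "second", "user": {"screen_name": "bob"}},
    {"text": "third", "user": {"screen_name": "alice"}},
]


def group_count(docs):
    """Mimic {$group: {_id: "$user.screen_name", number: {$sum: 1}}}."""
    # "$user.screen_name" reaches into the nested user document of each tweet;
    # {$sum: 1} adds one per document, which amounts to counting.
    counts = Counter(doc["user"]["screen_name"] for doc in docs)
    # One result document per distinct _id value, sorted for a stable output.
    return [{"_id": name, "number": n} for name, n in sorted(counts.items())]


result = group_count(tweets)
print(result)
```

This is the document-store counterpart of the SQL GROUP BY count, except that the grouping key comes from a nested field rather than a flat column.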


SMAPP

Some info on the smapp backend:

MongoDB with index on tweet id, timestamp, random number (for sampling)

No text index (yet!)

New!: multiple collections for smaller indexes (smapptoolkit)


The End
