Big Data Database


Transcript of Big Data Database

Page 1: Big Data Database

• Cloud
• Hadoop
• Others

Credit to Ekapop

Page 2: Big Data Database

Outlines

• Query on big database: (1) internal & (2) external tables
• Pros/cons of external tables
• Sample databases supporting external tables
• BigQuery External Table (BigTable vs. BigQuery)
• Parquet vs. ORC
• Partition
• Next generation of distributed databases

Page 3: Big Data Database

Query on Big Data database

• There are two options to map data into a "table" in order to query a Big Data database:
• 1) Internal table
• 2) External table

Page 4: Big Data Database

1) Managed/Internal Table (dropping the table also deletes the data)

• Stores data inside the database engine
• Performance depends on the database's performance
• Easy to use database utilities (cache, …) to improve performance
• Easy to guarantee ACID (especially Consistency)

[Diagram: TABLE and DATA both live inside the database]

Page 5: Big Data Database

2) External Table (dropping the table deletes only the table, NOT the data)

• Stores data in an external system
  • File system
  • Other databases
• Performance depends on the external (source) system
• Some external systems do NOT guarantee ACID (especially Consistency)
• Flexible
  • No need to load data into a table
  • No need to create a table structure (ad-hoc)
• There are two types of external tables:
  • Permanent External Table
  • Ad-hoc External Table (defined & deleted once the job is done)

[Diagram: the TABLE lives in the database; the DATA stays in the external system]

Page 6: Big Data Database

2) External Table (cont)

• 2.1) Permanent External Table
• Create the external table, then use it

[Diagram: TABLE in the database, DATA in external storage]

CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv']
)

SELECT * FROM mydataset.sales

Blue = create table & orange = query (color coding on the original slide)

Page 7: Big Data Database

2) External Table (cont.)

• 2.2) Ad-hoc External Table
• No need to create an external table
• Query on the fly
• Not supported by all data warehouses
• Google BigQuery can support this mode using its own command, NOT DDL (SQL)
• Azure Synapse also supports this (e.g., via OPENROWSET)

[Diagram: DATA queried directly through a temporary TABLE definition]
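As a minimal sketch of the ad-hoc idea (in plain Python rather than any warehouse's own syntax), the snippet below filters and aggregates a CSV "on the fly" without ever creating a table or declaring a schema. The file contents and column names are illustrative, not from any real dataset:

```python
import csv
import io

# Stand-in for an external CSV file (e.g., one sitting in object storage).
sales_csv = io.StringIO(
    "region,amount\n"
    "north,100\n"
    "south,250\n"
    "north,75\n"
)

# Equivalent of an ad-hoc "SELECT SUM(amount) WHERE region = 'north'",
# evaluated directly against the file with no table in between.
total = sum(
    int(row["amount"])
    for row in csv.DictReader(sales_csv)
    if row["region"] == "north"
)
print(total)  # 175
```

The trade-off mirrors the slides: nothing is loaded or indexed, so every query re-reads the source file.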


Page 8: Big Data Database

Pros of External Tables

• Dynamic schema
  • Flexible; can read data on the fly (no need to define a table schema)
• Can read data from other databases directly, so it can capture changes without setting up Change Data Capture (CDC)
• Convenient; no need to move data
• For example:
  • IoT data is stored in MongoDB; however, it is difficult to join it with other tables from MongoDB.
  • In this case, we can create an external table in Azure and directly query the data in MongoDB; thus, it can be joined with other tables easily.
  • Note that Azure automatically duplicates the data from MongoDB into a SQL DB.

Page 9: Big Data Database

Cons of External Tables

• It does not guarantee consistency for some external data sources.
• It is not suitable for large-scale data, since it doesn't utilize database features, e.g., indexing.
• Query performance of external tables may not be as fast as querying data in a native BigQuery table.
• Since the source data can be a file, the whole file must be processed instead of only part of the data.
• It doesn't utilize the cluster's performance due to the bottleneck at the data source.
• For an external table, the amount of processed data can't be determined until the actual query completes.
• We don't know whether or not the query will work until it runs, e.g., an incompatible number of columns.
• You may have to pay extra for querying data from external tables.
• Location: when you query data in an external data source such as Cloud Storage, the data you're querying must be in the same location as your BigQuery data.

Page 10: Big Data Database

There are many modern databases supporting external tables:

• Hadoop Hive
• IBM DB2 (new version)
• Presto
• Snowflake
• Google BigQuery
• Amazon Redshift
• Microsoft Azure Synapse

Page 11: Big Data Database

Hadoop (Hive; DW)


Internal/External concepts are supported by all modern databases.

Page 12: Big Data Database

Presto (DW)


Page 13: Big Data Database

Snowflake (DW Cloud; not open-source)


Page 14: Big Data Database

Google BigQuery (DW)


Page 15: Big Data Database

AWS Redshift (DW)


Page 16: Big Data Database

Azure Synapse (DW)


Page 17: Big Data Database

BigQuery External Table

Cloud Storage:
• Comma-separated values (CSV)
• Newline-delimited JSON
• Avro files
• Datastore export files
• ORC files
• Parquet files
• Firestore export files

Google Drive:
• Comma-separated values (CSV)
• Newline-delimited JSON
• Avro files
• Google Sheets

Cloud Bigtable

Page 18: Big Data Database

BigTable: OLTP (update)
• Insert, delete in real time
• NoSQL → key-value database
• Support SQL
• Like HBase
• Cell (key-value); sparse
• Column family

BigQuery: OLAP (query)
• Query a lot of data
• Immutable (slow update)
• Support SQL
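The two access patterns above can be sketched in a few lines of Python; the keys and amounts are illustrative:

```python
# OLTP style (BigTable-like): rows addressed by key, updated in real time.
kv_store = {}
events = [("user:1", 10), ("user:2", 5), ("user:1", 7)]

for key, amount in events:
    # In-place update by key — cheap, one record at a time.
    kv_store[key] = kv_store.get(key, 0) + amount

point_read = kv_store["user:1"]  # OLTP: fetch exactly one key

# OLAP style (BigQuery-like): scan and aggregate over all records.
total = sum(amount for _, amount in events)

print(point_read, total)  # 17 22
```

The point: OLTP engines optimize the keyed lookup/update path, while OLAP engines optimize the full-table scan and aggregate path, which is why updates there are slow or the data is treated as immutable.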

Page 19: Big Data Database

Row Based vs Column Based (recap)

• SELECT DISTINCT firstname
• Row-based → must read every full row
• Column-based → reads only the firstname column
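A tiny sketch of this recap, using illustrative names and an in-memory "table" in both layouts:

```python
# Row store: the table is a list of whole records. To answer
# "SELECT DISTINCT firstname", every full row must be read.
rows = [
    {"firstname": "Ann", "lastname": "Lee", "age": 30},
    {"firstname": "Bob", "lastname": "Kim", "age": 41},
    {"firstname": "Ann", "lastname": "Ng", "age": 25},
]
distinct_row_store = {row["firstname"] for row in rows}

# Column store: the same table kept column-by-column. Only the
# firstname column is touched; lastname and age are never read.
columns = {
    "firstname": ["Ann", "Bob", "Ann"],
    "lastname": ["Lee", "Kim", "Ng"],
    "age": [30, 41, 25],
}
distinct_column_store = set(columns["firstname"])

print(distinct_row_store == distinct_column_store)  # True — same answer, less I/O
```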

Page 20: Big Data Database

Parquet

• Columnar storage
• Embedded data types
• Compression
• Developed by Cloudera and Twitter, based on Google's Dremel
• Widely used in Cloudera Impala and Spark

https://databricks.com/glossary/what-is-parquet

Compression = smaller data
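One reason columnar formats like Parquet compress so well is dictionary encoding, which Parquet applies to low-cardinality columns: the column stores a small dictionary of distinct values plus compact integer codes. A minimal sketch with illustrative values:

```python
# A column with few distinct values, as a column store would hold it.
column = ["setosa", "setosa", "virginica", "setosa", "versicolor"]

# Dictionary-encode: distinct values once, plus an integer code per row.
dictionary = sorted(set(column))               # ['setosa', 'versicolor', 'virginica']
codes = [dictionary.index(v) for v in column]  # [0, 0, 2, 0, 1]

# Decoding restores the original column exactly.
decoded = [dictionary[c] for c in codes]
print(decoded == column)  # True
```

Storing three short strings and five small integers takes far less space than five full strings, and the gain grows with column length.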

Page 21: Big Data Database

ORC

• Optimized Row Columnar
• Indexing
• Embedded data types
• Compression
• Column-level aggregates (for each chunk)
  • count, min, max, and sum
  • Can skip data (chunks)
• Developed by Facebook and Hortonworks
• Widely used in Hive
  • Lets Hive support ACID tables (while Impala can't)
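The chunk-skipping idea can be sketched in Python: keep min/max/count/sum per chunk (as ORC does per stripe), and skip any chunk whose statistics rule it out. The data and chunk size are illustrative:

```python
# 100 sorted values split into 4 chunks of 25, each with its statistics.
data = list(range(100))
CHUNK = 25
chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
stats = [
    {"min": min(c), "max": max(c), "count": len(c), "sum": sum(c)}
    for c in chunks
]

def values_greater_than(threshold):
    """Return values > threshold, skipping chunks whose max rules them out."""
    hits, chunks_scanned = [], 0
    for chunk, st in zip(chunks, stats):
        if st["max"] <= threshold:
            continue  # entire chunk skipped using only its statistics
        chunks_scanned += 1
        hits.extend(v for v in chunk if v > threshold)
    return hits, chunks_scanned

hits, chunks_scanned = values_greater_than(80)
print(chunks_scanned)  # 1 — only the chunk holding 75..99 was actually read
```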

Page 22: Big Data Database

Parquet vs. ORC


Page 23: Big Data Database

Parquet vs. ORC


Page 24: Big Data Database

Partition

• Splits data into multiple files based on the partitioning column
• For example, if the iris class is the partition column, there will be 3 partitions
• Select data without reading the whole table

Page 25: Big Data Database

Partition (cont.)

• Iris (unpartitioned)
• <iris_path>/data.csv

Page 26: Big Data Database

Partition (cont.)

• Iris (partitioned by class)
• <iris_path>/class=Iris-setosa/data.csv

• <iris_path>/class=Iris-versicolor/data.csv

• <iris_path>/class=Iris-virginica/data.csv
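The layout above can be produced with a short Python sketch that writes each row into a `class=<value>/data.csv` directory, Hive-style, so a query on one class touches only one file. The rows here are a tiny illustrative subset, not the real iris dataset:

```python
import csv
import os
import tempfile

rows = [
    {"sepal_length": "5.1", "class": "Iris-setosa"},
    {"sepal_length": "7.0", "class": "Iris-versicolor"},
    {"sepal_length": "6.3", "class": "Iris-virginica"},
    {"sepal_length": "4.9", "class": "Iris-setosa"},
]

iris_path = tempfile.mkdtemp()  # stands in for <iris_path>
for row in rows:
    # One directory per distinct value of the partition column.
    part_dir = os.path.join(iris_path, f"class={row['class']}")
    os.makedirs(part_dir, exist_ok=True)
    data_file = os.path.join(part_dir, "data.csv")
    is_new = not os.path.exists(data_file)
    with open(data_file, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            # The partition column itself is encoded in the path, not the file.
            writer.writerow(["sepal_length"])
        writer.writerow([row["sepal_length"]])

partitions = sorted(os.listdir(iris_path))
print(partitions)
# ['class=Iris-setosa', 'class=Iris-versicolor', 'class=Iris-virginica']
```

Reading only the Iris-setosa rows now means opening only `class=Iris-setosa/data.csv`, which is exactly the "select data without reading the whole table" benefit.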


Page 27: Big Data Database

Old Distributed Data Warehouse

• Redshift with the DS2 instance type (master, slave)
• Minimize data movement over the network

Page 28: Big Data Database

Network Throughput Trend

Network bandwidth is now scaling very fast.

Page 29: Big Data Database

Next Gen Distributed Data Warehouse

Nowadays data is no longer stored in the compute nodes.

• Redshift with the RA3 instance type
• Separate storage and compute
• High performance
• Easy to add more compute nodes
• Easy to scale storage
• Data is stored in one place, so it is easier to optimize

Page 30: Big Data Database

Next Gen Distributed Data Warehouse

• CPU Usage


Page 31: Big Data Database

Next Gen Distributed Data Warehouse

• Latency


Page 32: Big Data Database

Next Gen Distributed Data Warehouse: Google BigQuery

https://panoply.io/data-warehouse-guide/bigquery-architecture/

Page 33: Big Data Database

Next Gen Distributed Data Warehouse: Azure Synapse