Big Data Database

• Cloud
• Hadoop
• Others

Credit to Ekapop

Outline

• Query on big database: (1) internal & (2) external tables
• Pro/con of external tables
• Sample of databases supporting external tables
• BigQuery External Table (Bigtable vs. BigQuery)
• Parquet vs. ORC
• Partition
• Next generation of distributed database

Query on Big Data database

• There are two options for mapping data to a “table” in order to query it from a Big Data database:
  • 1) Internal table
  • 2) External table

1) Managed/Internal Table
Dropping the table also deletes the data.

• Data is stored inside the database engine
• Performance depends on database performance
• Easy to use database utilities (cache, …) to improve performance
• Easy to guarantee ACID
  • (Especially consistency)

2) External Table
Dropping the table deletes only the table definition, NOT the data.

• Data is stored in an external system
  • File system
  • Other databases
• Performance depends on the external system (source system)
• Some external systems do NOT guarantee ACID
  • (Especially consistency)
• Flexible
  • No need to load data into a table
  • No need to create a table structure (ad hoc)
• There are two types of external tables
  • Permanent external table
  • Ad-hoc external table (define it, then delete it once the work is done)
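The difference in drop behaviour between the two table types can be sketched in a few lines of Python. The `Catalog` class, its method names, and the file names are hypothetical, invented only for this illustration; a real engine handles this inside its metastore.

```python
import os
import tempfile


class Catalog:
    """Toy metastore: maps table names to (data file path, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, path, external):
        self.tables[name] = (path, external)

    def drop_table(self, name):
        path, external = self.tables.pop(name)
        # A managed (internal) table owns its data, so the data is deleted too.
        # An external table only loses its catalog entry.
        if not external and os.path.exists(path):
            os.remove(path)


def make_file(directory, filename):
    """Create a small made-up CSV file and return its path."""
    path = os.path.join(directory, filename)
    with open(path, "w") as f:
        f.write("id,amount\n1,10\n")
    return path


workdir = tempfile.mkdtemp()
managed_path = make_file(workdir, "managed.csv")
external_path = make_file(workdir, "external.csv")

catalog = Catalog()
catalog.create_table("sales_managed", managed_path, external=False)
catalog.create_table("sales_external", external_path, external=True)

catalog.drop_table("sales_managed")
catalog.drop_table("sales_external")

managed_data_survives = os.path.exists(managed_path)    # False
external_data_survives = os.path.exists(external_path)  # True
```

Both tables are gone from the catalog afterwards, but only the external table's file is still on disk.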

2) External Table (cont.)

• 2.1) Permanent External Table
  • Create the external table
  • Then use it

Create the table:

CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv']
);

Query the table:

SELECT * FROM mydataset.sales;

2) External Table (cont.)

• 2.2) Ad-hoc External Table
  • No need to create an external table
  • Query on the fly
  • Not supported by all data warehouses
  • Google BigQuery supports this mode through its own command, NOT DDL (SQL).

[Figure: Azure Synapse]

Pro of External Tables

• Dynamic schema
  • Flexible and can read data on the fly (no need to define a table schema)
• Can read data from other databases directly, so it can capture changes without setting up Change Data Capture (CDC)
• Convenient, with no need to move data
• For example
  • IoT data is stored in MongoDB; however, it is difficult to join it with other tables from within MongoDB.
  • In this case, we can create an external table in Azure and query the data in MongoDB directly; thus, it can be joined with other tables easily.
  • Note that Azure automatically duplicates the data from MongoDB and creates a SQL DB.
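The "read data on the fly" idea can be illustrated with a small Python sketch. The function name `query_on_the_fly` and the sample CSV contents are assumptions made up for this example; no schema is declared in advance, and the column names come from the file's header row at query time.

```python
import csv
import io

# Stand-in for a file in external storage; the contents are made up.
SALES_CSV = "region,amount\nnorth,10\nsouth,25\nnorth,5\n"


def query_on_the_fly(source, where):
    """Scan the source at query time; the schema comes from the header row."""
    reader = csv.DictReader(io.StringIO(source))
    return [row for row in reader if where(row)]


rows = query_on_the_fly(SALES_CSV, lambda r: r["region"] == "north")
total = sum(int(r["amount"]) for r in rows)  # 15
```

Every call re-reads the whole source, which is the price paid for not loading the data into a table first.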

Con of External Tables

• Consistency is not guaranteed for some external data sources.
• It is not suitable for large-scale data since it doesn’t use database features, e.g., indexing.
• Query performance of external tables may not be as fast as querying data in a native BigQuery table.
• Since the source data can be a file, the whole file must be processed instead of only part of the data.
• It doesn’t use the cluster’s performance due to the bottleneck at the data source.
• With an external table, the amount of processed data can’t be determined until the actual query is completed.
• We don’t know whether or not the query works until it runs, e.g., when the number of columns is incompatible.
• You may have to pay extra for querying data from external tables.
• Location: When you query data in an external data source such as Cloud Storage, the data you're querying must be in the same location as your BigQuery data.

There are many modern databases supporting external tables.

• Hadoop Hive
• IBM DB2 (new version)
• Presto
• Snowflake
• Google BigQuery
• Amazon Redshift
• Microsoft Azure Synapse

Hadoop (Hive; DW)

Internal/External concepts are supported by all modern databases.

Presto (DW)

Snowflake (DW Cloud; not open-source)

Google BigQuery (DW)

AWS Redshift (DW)

Azure Synapse (DW)

BigQuery External Table

Cloud Storage
• Comma-separated values (CSV)
• Newline-delimited JSON
• Avro files
• Datastore export files
• ORC files
• Parquet files
• Firestore export files

Google Drive
• Comma-separated values (CSV)
• Newline-delimited JSON
• Avro files
• Google Sheets

Cloud Bigtable

Bigtable: OLTP (update)
- Insert and delete in real time
- NoSQL → key-value database
- Supports SQL
- Like HBase
  - Cell (key-value); sparse
  - Column family

BigQuery: OLAP (query)
- Queries a lot of data
- Immutable (slow updates)
- Supports SQL

Row Based vs Column Based (recap)

• select distinct firstname
  • Row -> reads every whole row
  • Column -> reads only the firstname column
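The recap above can be made concrete with a toy comparison in Python. The sample records and the two helper functions are made up for illustration; real engines read disk blocks rather than Python objects, but the access pattern is the same.

```python
# Made-up sample records with three columns: firstname, lastname, age.
records = [
    ("Ann", "Lee", 31),
    ("Bob", "Kim", 45),
    ("Ann", "Wong", 28),
]

# Row-based layout: one tuple per record.
row_store = list(records)

# Column-based layout: one list per column.
column_store = {
    "firstname": [r[0] for r in records],
    "lastname": [r[1] for r in records],
    "age": [r[2] for r in records],
}


def distinct_firstname_rows(store):
    """SELECT DISTINCT firstname on a row store: every whole row is read."""
    values_read = 0
    seen = set()
    for row in store:
        values_read += len(row)
        seen.add(row[0])
    return seen, values_read


def distinct_firstname_columns(store):
    """The same query on a column store: only one column is read."""
    column = store["firstname"]
    return set(column), len(column)


row_result, row_values_read = distinct_firstname_rows(row_store)
col_result, col_values_read = distinct_firstname_columns(column_store)
```

Both layouts return the same answer, but the row store touches 9 values while the column store touches only 3.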

Parquet

• Columnar storage
• Embeds data types
• Compression
• Developed by Cloudera and Twitter, based on Google's Dremel paper
• Widely used in Cloudera, Impala, Spark

Compress = smaller data

https://databricks.com/glossary/what-is-parquet
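One reason columnar storage compresses well is that the values of a single column sit next to each other and often repeat. The sketch below shows plain run-length encoding on a made-up column; it is a simplified illustration of the idea, not Parquet's actual on-disk encoding, and the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Made-up column with long runs, as often seen in sorted or low-cardinality data.
country = ["TH"] * 4 + ["JP"] * 3 + ["US"] * 2

encoded = run_length_encode(country)  # [("TH", 4), ("JP", 3), ("US", 2)]
decoded = run_length_decode(encoded)
```

Nine stored values shrink to three pairs, and decoding restores the column exactly.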

ORC

• Optimized Row Columnar
• Indexing
• Embeds data types
• Compression
• Column-level aggregates
  • (for each chunk)
  • count, min, max, and sum
  • Can skip data (chunk)
• Developed by Facebook and Hortonworks
• Widely used in Hive
  • Lets Hive support ACID tables (while Impala can’t)
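The per-chunk aggregates and chunk skipping described for ORC can be sketched as follows. This is a minimal illustration under assumed names (`build_chunks`, `select_greater_than`) and made-up data, not the actual ORC file layout.

```python
def build_chunks(values, chunk_size):
    """Split a column into chunks and record count/min/max/sum for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        data = values[start:start + chunk_size]
        chunks.append({
            "data": data,
            "count": len(data),
            "min": min(data),
            "max": max(data),
            "sum": sum(data),
        })
    return chunks


def select_greater_than(chunks, threshold):
    """Return matching values, skipping chunks whose max rules them out."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] <= threshold:
            continue  # no value in this chunk can match, so skip its data
        chunks_read += 1
        matches.extend(v for v in chunk["data"] if v > threshold)
    return matches, chunks_read


amounts = [1, 3, 2, 10, 12, 11, 4, 5, 6]  # made-up column values
chunks = build_chunks(amounts, chunk_size=3)
matches, chunks_read = select_greater_than(chunks, threshold=9)
total = sum(chunk["sum"] for chunk in chunks)  # answered from statistics alone
```

Here only one of the three chunks has to be read for the filter, and the overall sum is computed without touching the data at all.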

Parquet vs. ORC


Partition

• Split the data into multiple files based on the partitioning column
  • For example, if iris-class is the partition column, there will be 3 partitions.
• Select data without reading the whole table

Partition (cont.)

• Iris (not partitioned)
  • <iris_path>/data.csv

Partition (cont.)

• Iris (partitioned by class)
  • <iris_path>/class=Iris-setosa/data.csv
  • <iris_path>/class=Iris-versicolor/data.csv
  • <iris_path>/class=Iris-virginica/data.csv
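The directory layout above can be reproduced with a short Python sketch. The sample rows, the `sepal_length` column, and the helper names are assumptions for illustration; a temporary directory stands in for `<iris_path>`.

```python
import csv
import os
import tempfile

# A few made-up iris rows: (sepal_length, class).
rows = [
    (5.1, "Iris-setosa"),
    (4.9, "Iris-setosa"),
    (7.0, "Iris-versicolor"),
    (6.3, "Iris-virginica"),
]


def write_partitioned(base_path, rows):
    """Write one data.csv per class under <base_path>/class=<value>/."""
    by_class = {}
    for sepal_length, iris_class in rows:
        by_class.setdefault(iris_class, []).append(sepal_length)
    for iris_class, lengths in by_class.items():
        part_dir = os.path.join(base_path, f"class={iris_class}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "data.csv"), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["sepal_length"])
            writer.writerows([length] for length in lengths)


def read_partition(base_path, iris_class):
    """Read only the file of one partition; other directories are untouched."""
    path = os.path.join(base_path, f"class={iris_class}", "data.csv")
    with open(path, newline="") as f:
        return [float(r["sepal_length"]) for r in csv.DictReader(f)]


iris_path = tempfile.mkdtemp()
write_partitioned(iris_path, rows)
partitions = sorted(os.listdir(iris_path))
setosa = read_partition(iris_path, "Iris-setosa")
```

A query filtered on the partition column only opens the matching directory, which is how partitioning avoids reading the whole table.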

Old Distributed Data Warehouse

• Redshift with DS2 instance type (master, slave)
• Minimize data movement on the network

Network Throughput Trend

Network bandwidth is now scaling very fast.

Next Gen Distributed Data Warehouse

Nowadays, data is no longer stored in the compute nodes.

• Redshift with RA3 instance type
• Separate storage and compute
  • High performance
  • Easy to add more compute nodes
  • Easy to scale storage
• Data is stored in one place, so it is easier to optimize.

Next Gen Distributed Data Warehouse

• CPU Usage

Next Gen Distributed Data Warehouse

• Latency

Next Gen Distributed Data Warehouse: Google BigQuery

https://panoply.io/data-warehouse-guide/bigquery-architecture/

Next Gen Distributed Data Warehouse: Azure Synapse
