Big Data Database
Cloud
Hadoop
Others
Credit to Ekapop
Outline
• Querying a big database: (1) internal and (2) external tables
• Pros and cons of external tables
• Examples of databases that support external tables
• BigQuery external tables (Bigtable vs. BigQuery)
• Parquet vs. ORC
• Partitioning
• Next generation of distributed databases
Querying a Big Data database
• There are two ways to map data to a "table" so that it can be queried in a Big Data database:
  • 1) Internal table
  • 2) External table
1) Managed/Internal Table
Dropping the table also deletes the data.
• Data is stored inside the database engine
• Performance depends on the database's performance
• Database utilities (cache, …) are easy to use to improve performance
• ACID is easy to guarantee (especially consistency)
[Diagram: TABLE, DATA]
2) External Table
Dropping the table deletes only the table definition, NOT the data.
• Data is stored in an external system
  • File system
  • Other databases
• Performance depends on the external (source) system
• Some external systems do NOT guarantee ACID (especially consistency)
• Flexible
  • No need to load data into a table
  • No need to create a table structure (ad hoc)
• There are two types of external tables
  • Permanent external table
  • Ad-hoc external table (defined for a query and discarded once it is done)
[Diagram: DATA, TABLE]
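The difference in drop behavior between managed and external tables can be sketched with a toy catalog. This is only an illustration using local CSV files; `ToyCatalog`, its methods, and the file names are hypothetical and do not correspond to any real database engine.

```python
import csv
import os
import tempfile


class ToyCatalog:
    """Toy catalog (hypothetical): table name -> (kind, path of the data file)."""

    def __init__(self, warehouse_dir):
        self.warehouse_dir = warehouse_dir
        self.tables = {}

    def create_managed(self, name, rows):
        # Managed/internal: the engine copies the rows into its own storage.
        path = os.path.join(self.warehouse_dir, name + ".csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        self.tables[name] = ("managed", path)

    def create_external(self, name, path):
        # External: only a pointer to an existing file is recorded.
        self.tables[name] = ("external", path)

    def select_all(self, name):
        _, path = self.tables[name]
        with open(path, newline="") as f:
            return list(csv.reader(f))

    def drop(self, name):
        kind, path = self.tables.pop(name)
        if kind == "managed":
            os.remove(path)  # dropping a managed table also deletes its data
        # For an external table, only the catalog entry is removed.


def demo():
    with tempfile.TemporaryDirectory() as root:
        warehouse = os.path.join(root, "warehouse")
        os.mkdir(warehouse)
        source = os.path.join(root, "sales.csv")  # the "external system"
        rows = [["id", "amount"], ["1", "10"], ["2", "25"]]
        with open(source, "w", newline="") as f:
            csv.writer(f).writerows(rows)

        catalog = ToyCatalog(warehouse)
        catalog.create_managed("sales_internal", rows)
        catalog.create_external("sales_external", source)
        managed_path = catalog.tables["sales_internal"][1]
        same_rows = (catalog.select_all("sales_internal")
                     == catalog.select_all("sales_external"))

        catalog.drop("sales_internal")
        catalog.drop("sales_external")
        # (both tables returned the same rows, managed file exists, source file exists)
        return same_rows, os.path.exists(managed_path), os.path.exists(source)
```

Calling `demo()` shows that both tables answer the same query, but after the drops only the external source file is still on disk.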
2) External Table (cont.)
• 2.1) Permanent External Table
  • Create the external table
  • Then use it in queries
[Diagram: DATA, TABLE]

-- Create the external table
CREATE OR REPLACE EXTERNAL TABLE mydataset.sales
OPTIONS (
  format = 'CSV',
  uris = ['gs://mybucket/sales.csv']
);

-- Query it
SELECT * FROM mydataset.sales;
2) External Table (cont.)
• 2.2) Ad-hoc External Table
  • No need to create an external table
  • Query on the fly
  • Not supported by all data warehouses
  • Google BigQuery supports this mode through its own command, NOT DDL (SQL)
[Diagram: DATA, TABLE]
Azure Synapse
Pros of External Tables
• Dynamic schema
  • Flexible: data can be read on the fly (no need to define a table schema)
• Can read data from other databases directly, so changes are captured without setting up Change Data Capture (CDC)
• Convenient: no need to move data
• For example
  • IoT data is stored in MongoDB; however, it is difficult to join it with other tables from within MongoDB.
  • In this case, we can create an external table in Azure and query the data in MongoDB directly; thus, it can easily be joined with other tables.
  • Note that Azure automatically duplicates data from MongoDB and creates SQL DB.
Cons of External Tables
• Consistency is not guaranteed for some external data sources.
• Not suitable for large-scale data, since it does not use database features, e.g., indexing.
• Query performance on external tables may not be as fast as querying data in a native BigQuery table.
• Since the source data can be a file, the whole file must be processed instead of only part of the data.
• It does not exploit cluster performance because the data source becomes the bottleneck.
• For an external table, the amount of data processed cannot be determined until the actual query completes.
• We do not know whether the query works until it runs, e.g., when the number of columns is incompatible.
• Querying data from external tables may incur additional charges.
• Location: When you query data in an external data source such as Cloud Storage, the data you're querying must be in the same location as your BigQuery data.
Many modern databases support external tables.
• Hadoop Hive
• IBM Db2 (new version)
• Presto
• Snowflake
• Google BigQuery
• Amazon Redshift
• Microsoft Azure Synapse
Hadoop (Hive; DW)
Internal/External concepts are supported by all modern databases.
Presto (DW)
Snowflake (DW Cloud; not open-source)
Google BigQuery (DW)
AWS Redshift (DW)
Azure Synapse (DW)
BigQuery External Table
Cloud Storage
• Comma-separated values (CSV)
• Newline-delimited JSON
• Avro files
• Datastore export files
• ORC files
• Parquet files
• Firestore export files
Google Drive
• Comma-separated values (CSV)
• Newline-delimited JSON
• Avro files
• Google Sheets
Cloud Bigtable
Bigtable: OLTP (update)
- Real-time insert and delete
- NoSQL → key-value database
- Supports SQL
- Like HBase
- Cell (key-value); sparse
- Column family
BigQuery: OLAP (query)
- Queries a lot of data
- Immutable (slow updates)
- Supports SQL
Row Based vs Column Based (recap)
• select distinct firstname
  • Row -> reads every row in full
  • Column -> reads only the firstname column
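The recap above can be made concrete with a small in-memory sketch. The sample records and the "cells read" counter are illustrative assumptions, not measurements of a real storage engine.

```python
# Hypothetical sample data, stored row by row.
rows = [
    {"firstname": "Ann", "lastname": "Lee", "age": 31},
    {"firstname": "Bob", "lastname": "Ray", "age": 45},
    {"firstname": "Ann", "lastname": "Kim", "age": 28},
]

# The same data in a column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def distinct_firstname_row_store(rows):
    """SELECT DISTINCT firstname on a row layout: every row is read in full."""
    cells_read = 0
    seen = set()
    for row in rows:
        cells_read += len(row)  # all cells of the row are fetched
        seen.add(row["firstname"])
    return seen, cells_read


def distinct_firstname_column_store(columns):
    """The same query on a column layout: only one column is read."""
    values = columns["firstname"]
    return set(values), len(values)
```

Both functions return the same set of names; the row layout touches every cell of every row, while the column layout touches only the cells of the `firstname` column.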
Parquet
• Columnar storage
• Embedded data types
• Compression
• Developed by Cloudera and Twitter, based on the Google Dremel design
• Widely used in Cloudera, Impala, Spark
https://databricks.com/glossary/what-is-parquet
Compression = smaller data
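A rough illustration of why a column of similar values compresses well. This sketch uses `zlib` from the Python standard library as a stand-in codec and a made-up column of class labels; it does not read or write real Parquet files.

```python
import zlib

# Hypothetical column: 150 class labels stored contiguously, as in a columnar file.
labels = (["Iris-setosa"] * 50
          + ["Iris-versicolor"] * 50
          + ["Iris-virginica"] * 50)

raw = "\n".join(labels).encode("utf-8")
compressed = zlib.compress(raw)  # stand-in codec, chosen only for illustration


def compression_summary():
    """Return (raw size in bytes, compressed size in bytes)."""
    return len(raw), len(compressed)
```

Because the values of one column are stored together and repeat often, the compressed column is much smaller than the raw bytes, and decompressing it returns the original data unchanged.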
ORC
• Optimized Row Columnar
• Indexing
• Embedded data types
• Compression
• Column-level aggregates (for each chunk)
  • count, min, max, and sum
  • Can skip data (chunks)
• Developed by Facebook and Hortonworks
• Widely used in Hive
  • Lets Hive support ACID tables (while Impala can't)
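The idea behind per-chunk aggregates can be sketched in a few lines. The `Chunk` class and the helper functions below are hypothetical and only mimic the concept of keeping count, min, max, and sum per chunk; they are not the ORC file format or its API.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A block of column values plus its precomputed aggregates."""
    values: list
    count: int
    minimum: int
    maximum: int
    total: int


def build_chunks(values, chunk_size):
    """Split a column into chunks and store count/min/max/sum for each one."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append(Chunk(part, len(part), min(part), max(part), sum(part)))
    return chunks


def select_greater_than(chunks, threshold):
    """Return values > threshold and the number of chunks actually scanned."""
    matches, scanned = [], 0
    for chunk in chunks:
        if chunk.maximum <= threshold:
            continue  # no value in this chunk can match, so skip it entirely
        scanned += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, scanned


def total_from_stats(chunks):
    """Answer SUM(column) from the stored aggregates without reading any values."""
    return sum(chunk.total for chunk in chunks)
```

For example, with the values 1 to 100 split into chunks of 10, a filter for values greater than 85 only needs to scan the last two chunks, and the column sum can be answered from the stored aggregates alone.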
Parquet vs. ORC
Partition
• Split the data into multiple files based on the partitioning column
• For example, if the iris class is the partition column, there will be 3 partitions.
• Select data without reading the whole table
Partition (cont.)
• Iris
  • <iris_path>/data.csv

Partition (cont.)
• Iris
  • <iris_path>/class=Iris-setosa/data.csv
  • <iris_path>/class=Iris-versicolor/data.csv
  • <iris_path>/class=Iris-virginica/data.csv
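The folder layout above can be reproduced with a short sketch. The helper names, the sample rows, and the choice to leave the partition column out of each data.csv are assumptions made for illustration; a real engine manages this layout itself.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; "class" is the partition column.
rows = [
    {"sepal_length": "5.1", "class": "Iris-setosa"},
    {"sepal_length": "7.0", "class": "Iris-versicolor"},
    {"sepal_length": "6.3", "class": "Iris-virginica"},
    {"sepal_length": "4.9", "class": "Iris-setosa"},
]


def write_partitioned(root, rows, partition_column):
    """Write rows to <root>/<column>=<value>/data.csv, one folder per value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[partition_column], []).append(row)
    for value, group in groups.items():
        folder = os.path.join(root, f"{partition_column}={value}")
        os.makedirs(folder, exist_ok=True)
        # Assumption: the partition value lives in the folder name only.
        fields = [key for key in group[0] if key != partition_column]
        with open(os.path.join(folder, "data.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
    return sorted(os.listdir(root))


def read_partition(root, partition_column, value):
    """Read a single partition; the other folders are never opened."""
    path = os.path.join(root, f"{partition_column}={value}", "data.csv")
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def demo():
    with tempfile.TemporaryDirectory() as root:
        folders = write_partitioned(root, rows, "class")
        setosa = read_partition(root, "class", "Iris-setosa")
        return folders, setosa
```

A query that filters on the partition column only needs to open the matching folder, which is how data can be selected without reading the whole table.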
Old Distributed Data Warehouse
• Redshift with DS2 instance type (master, slave)
• Minimize data movement over the network
Network Throughput Trend
Network bandwidth is now scaling very fast.
Next Gen Distributed Data Warehouse
Nowadays, data is no longer stored on the compute nodes.
• Redshift with RA3 instance type
• Separate storage and compute
• High performance
• Easy to add more compute nodes
• Easy to scale storage
• Data is stored in one place, so it is easier to optimize.
[Figure: CPU usage]
[Figure: Latency]
Next Gen Distributed Data Warehouse: Google BigQuery
https://panoply.io/data-warehouse-guide/bigquery-architecture/
Next Gen Distributed Data Warehouse: Azure Synapse