Keeping Data in Sync with Syncsort

What’s New and Performance Tips

Paige Roberts, Big Data Product Marketing Manager

Ashwin Ramachandran, Big Data Product Manager

Agenda

What’s New and Coming Soon in Big Data

• What’s New in DMX/DMX-h version 9.5

• New Product: DMX Change Data Capture – Now GA in version 9.5!

• DataFunnel GUI – Now in beta!

• Lineage

• Big Data Quality

• DMX CDC and MIMIX Share

Strategies for Change Data Capture

• Advantages and Disadvantages of Various Strategies

– Versions, Dates

– Triggers

– Snapshot

– Log

How to Do Change Data Capture with Syncsort Software

• Snapshot-Based CDC with DMX/DMX-h

• Log-Based CDC with DMX Change Data Capture

Where to Find More Info on CDC

Syncsort Confidential and Proprietary - do not copy or distribute

WHAT’S NEW IN DMX/DMX-H


Combine batch and streaming data sources

Single Interface for Streaming & Batch

Spark 2!

Easy development in GUI

No need to write Scala, C or Java code

Now supports cluster mode!


Simplify Streaming Data Integration


Progress Monitoring

Track the progress of DMX/DMX-h jobs as they’re running!

Settable time intervals

See exactly how fast jobs are running

Know how much memory and CPU jobs use at any point

Know when there’s a problem, even in the middle of long-running jobs


C:\PROGRAM FILES\DMEXPRESS\PROGRAMS\dmsmonitor.exe /jobid J_readVSAM_20171006_001743_13572 /task T_readVSAM /interactive 2 /logdir .

Timestamp: 2017-10-06 00:19:09

Status: RUNNING for 00:01:28

User: aramachandran

Data directory: C:\Users\aramachandran\Documents\Projects\CompanyName\VSAM_test

Memory: 32MB

CPU: 12%

/MVS/WWCDMX/AZR.VSM (Source): 7689557 records [1689372 records/sec], 246065824 bytes [5405992 bytes/sec]

Vsam_out.dat (Target): 7685704 records [1687590 records/sec], 245942528 bytes [54002880 bytes/sec]

C:\PROGRAM FILES\DMEXPRESS\PROGRAMS\dmsmonitor.exe /jobid J_readVSAM_20171006_001743_13572 /task T_readVSAM /interactive 2 /logdir .

Timestamp: 2017-10-06 00:19:11

Status: RUNNING for 00:01:30

User: aramachandran

Data directory: C:\Users\aramachandran\Documents\Projects\CompanyName\VSAM_test

Memory: 32MB

CPU: 12%

/MVS/WWCDMX/AZR.VSM (Source): 10718776 records [1514609 records/sec], 343000832 bytes [48467504 bytes/sec]

Vsam_out.dat (Target): 10716748 records [1515522 records/sec], 342935936 bytes [48496704 bytes/sec]
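The progress lines above are regular enough to be parsed by a script, for example to chart throughput over time. The sketch below is only an illustration: the pattern is inferred from the sample output on this slide, not from any documented dmsmonitor format, and the function name parse_progress_line is hypothetical.

```python
import re

# Pattern inferred only from the sample monitor output shown above;
# it is an assumption, not a documented output specification.
LINE_RE = re.compile(
    r"^(?P<name>.+?) \((?P<role>Source|Target)\): "
    r"(?P<records>\d+) records \[(?P<rps>\d+) records/sec\], "
    r"(?P<bytes>\d+) bytes \[(?P<bps>\d+) bytes/sec\]$"
)


def parse_progress_line(line):
    """Return a dict of counters for one Source/Target line, or None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    info = {"name": match["name"], "role": match["role"]}
    for field in ("records", "rps", "bytes", "bps"):
        info[field] = int(match[field])
    return info


sample = (
    "Vsam_out.dat (Target): 10716748 records [1515522 records/sec], "
    "342935936 bytes [48496704 bytes/sec]"
)
info = parse_progress_line(sample)
# Average record size in bytes, derived from the two counters.
bytes_per_record = info["bytes"] // info["records"]
print(info["role"], info["records"], bytes_per_record)
```

Lines that do not match the pattern (timestamps, status, memory) return None, so the same function can be run over every line of a captured log.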

Access and Integration of Mainframe Data … We’re Simply the Best


Save MIPS by processing mainframe data on Hadoop

Read and write Mainframe record formats

– Fixed record length, variable record length, & variable record length with block descriptor

– Handle complex array structures like ODOs (OCCURS DEPENDING ON), even nested

– Interpret complex copybooks automatically

Write files to local or remote open systems via FTP, SFTP, Connect:Direct or HDFS

– Connect to external mainframe metadata like copybooks right on the mainframe with Connect:Direct

Store an unmodified archive copy for compliance and lineage tracking

Hive Enhancements

Improvements to Hive support

JDBC connectivity

Support for partitioned tables: ORC, Parquet, Avro, HDFS

Support for Truncate and Insert

Automatic creation of Hive and other HCat-supported tables

Direct distributed processing of Hive

Update of Hive statistics

Use Hive tables for lookups


Keybreak Processing Made Easy


• Running Totals

• Counters

• Group Numbering
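For readers unfamiliar with the term, keybreak processing computes values that reset or advance whenever the sort key changes. The following is a minimal sketch of the three operations listed above in plain Python; it is not how DMX implements them, and the field names cust and amount are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def keybreak(rows, key, value):
    """Annotate rows (already sorted by `key`) with keybreak results.

    group_no      increases by one at every key change (group numbering)
    counter       restarts at 1 within each group (counter)
    running_total accumulates `value` within each group (running total)
    """
    grouped = groupby(rows, key=itemgetter(key))
    for group_no, (_, group) in enumerate(grouped, start=1):
        running_total = 0
        for counter, row in enumerate(group, start=1):
            running_total += row[value]
            yield {
                **row,
                "group_no": group_no,
                "counter": counter,
                "running_total": running_total,
            }


rows = [
    {"cust": "A", "amount": 10},
    {"cust": "A", "amount": 5},
    {"cust": "B", "amount": 7},
]
result = list(keybreak(rows, key="cust", value="amount"))
for row in result:
    print(row)
```

The input must be sorted on the key first, because a key change is detected only between adjacent records.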

DATAFUNNEL


Get Your Database Data into Hadoop at the Press of a Button

• Funnel hundreds of tables at once into your data lake

‒ Extract, map and move whole DB schemas in one invocation

‒ Extract from Oracle, DB2/z, MS SQL Server, Teradata, Netezza and Redshift

‒ To SQL Server, Postgres, Hive, HDFS, Redshift and Amazon S3

‒ Automatically create target Hive and HCat tables

• Process multiple funnels in parallel on edge node or data nodes

‒ Order data flows by dependencies

‒ Leverage DMX-h high performance data processing engine

• Extract only the data you want

‒ Data type filtering

‒ Table, record or column exclusion / inclusion

• In-flight transformations and cleansing

• User specified access methods: Native, ODBC or JDBC


DMX DataFunnel™

Move thousands of tables in days, not weeks!

New User Experience for DataFunnel



New UI Wizard Flow Creation



LINEAGE


Integration with Cloudera Navigator from Source to Cluster


BIG DATA QUALITY


First, we configure DMX to access and ingest data from a JSON source.

Second, DMX ingests data from a mainframe in EBCDIC format.

Finally, DMX ingests data from an XML source.

DMX then merges these files into one consistent format.

At the same stage, DMX produces two exports:

• one simple text/CSV output

• a first write to a Hive database.

DMX then invokes TSS to perform the Data Quality processing.

Comments

All of these source files have different field structures too.
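To make the merge step concrete, here is a small Python sketch of the same idea: three sources with different structures are mapped to one record layout and written as CSV. It only illustrates the concept and is not how DMX does it. The field names, the 10-byte EBCDIC record layout, the XML element names and the use of code page 037 are all assumptions made for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def from_json(text):
    # Assumed layout: a JSON array of objects with "id" and "name".
    return [{"id": str(r["id"]), "name": r["name"]} for r in json.loads(text)]


def from_ebcdic(data, record_len=10):
    # Assumed layout: fixed-length records, 4-byte id + 6-byte name,
    # encoded in EBCDIC code page 037.
    rows = []
    for start in range(0, len(data), record_len):
        record = data[start:start + record_len].decode("cp037")
        rows.append({"id": record[:4].strip(), "name": record[4:].strip()})
    return rows


def from_xml(text):
    # Assumed layout: <customers><customer id="..."><name>..</name></customer>
    root = ET.fromstring(text)
    return [
        {"id": c.get("id"), "name": c.findtext("name")}
        for c in root.iter("customer")
    ]


json_src = '[{"id": 1, "name": "Ada"}]'
ebcdic_src = "0002Grace ".encode("cp037")  # stand-in for mainframe bytes
xml_src = '<customers><customer id="3"><name>Edsger</name></customer></customers>'

# Merge the three differently structured sources into one consistent format.
merged = from_json(json_src) + from_ebcdic(ebcdic_src) + from_xml(xml_src)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()
writer.writerows(merged)
csv_text = out.getvalue()
print(csv_text)
```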

Trillium Quality for Big Data


Easily Create Data Quality Workflows Without MapReduce or Spark Coding

Intelligent Execution enables deployment to Hadoop MapReduce and Spark

Verify and enrich global postal addresses using global postal reference sources

Enrich data from external, third-party sources to create comprehensive, unified records, enabling 360-degree views of the customer and other key business entities

Identify records that belong to the same domain (i.e., household or business)

Parse data values to their correct fields and standardize for better matching

Match like records and eliminate duplicates
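The standardize-then-match idea can be shown in a few lines of Python. This is a deliberately simple sketch: it matches on an exact normalized key, whereas a data quality product would typically use far richer parsing and fuzzy matching. The field names and normalization rules are assumptions for illustration only.

```python
import re


def standardize(record):
    """Build a match key: collapse whitespace, upper-case the name,
    and keep only the digits of the phone number."""
    name = re.sub(r"\s+", " ", record["name"]).strip().upper()
    phone = re.sub(r"\D", "", record["phone"])
    return (name, phone)


def deduplicate(records):
    """Keep the first record seen for each standardized match key."""
    survivors = {}
    for record in records:
        survivors.setdefault(standardize(record), record)
    return list(survivors.values())


records = [
    {"name": "Jane  Doe", "phone": "(555) 010-2000"},
    {"name": " jane doe", "phone": "555-010-2000"},
    {"name": "John Roe", "phone": "555 010 3000"},
]
unique = deduplicate(records)
print(len(unique))
```

The first two records differ only in spacing, case and punctuation, so they share a match key and only one survives.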

DMX CHANGE DATA CAPTURE


Keep Mainframe and Hadoop Data in Sync in Real-Time

Keeps Hadoop data in sync with mainframe changes in real-time

• without overloading networks

• without incurring a high MIPS cost

• without affecting source database performance

• without coding or tuning

Dependable – Reliable transfer of data even during loss of mainframe connection or Hadoop cluster failure. Continue from failure point.

Fast – Both Hive data and table statistics are updated in real-time. Performs fast updates and inserts, even on Hive tables that don’t natively support them.

Flexible – Works with all Hive tables, including those backed by text, ORC, Parquet or Avro.

Diagram labels: DB2; DMX Change Data Capture

MIMIX Share Replicates Data in Real Time

Transforms and enhances data during replication

Minimizes bandwidth usage with LAN/WAN friendly replication

Ensures data integrity with conflict resolution and collision monitoring

Enables tracking and auditing of transactions for compliance

Diagram labels: Source Database; Real-Time Replication with Transformation; Change Data Capture (CDC); Conflict Resolution, Collision Monitoring, Tracking and Auditing; Target Database


STRATEGIES FOR CHANGE DATA CAPTURE


Why Do Change Data Capture?

Change Data Capture (CDC) is the process that ensures that changes made over time in one dataset are automatically transferred to another dataset.

Common data management scenarios where CDC is important:

Enterprise Data Warehouse (EDW)

Business Intelligence (BI)

EDW and/or Mainframe Optimization

Master Data Management

Data Quality


Different CDC Strategies

Timestamps or Version Numbers

Table Triggers

Snapshot or Table Comparison

Log Scraping


Advantages and Disadvantages of Timestamp or Version-Based CDC

Advantages

Simple

Nearly every database can query with a WHERE clause.


Disadvantages

Must be built into database

Bloats database size

Query requires considerable compute resources in source database

Not always reliable
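A minimal sketch of timestamp-based CDC, using Python's built-in sqlite3 module as a stand-in for the source database. The table, column names and integer timestamps are hypothetical. It also shows one way the approach can miss changes: a deleted row leaves nothing behind for the WHERE clause to find.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The last_modified column must be designed into the source table.
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, last_modified INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100), (3, "Edsger", 100)],
)

last_sync = 100  # high-water mark saved by the previous CDC run

# Changes made in the source after the last sync.
conn.execute("UPDATE customer SET name = 'Ada L.', last_modified = 150 WHERE id = 1")
conn.execute("INSERT INTO customer VALUES (4, 'Barbara', 160)")
conn.execute("DELETE FROM customer WHERE id = 3")

# The whole capture step is one query with a WHERE clause.
changed = conn.execute(
    "SELECT id, name FROM customer WHERE last_modified > ? ORDER BY id",
    (last_sync,),
).fetchall()
print(changed)
```

The query returns the updated row and the inserted row, but the deletion of row 3 is invisible to it.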

Advantages and Disadvantages of Trigger-Based CDC

Advantages

Very reliable and detailed

Changes can be captured almost as fast as they are made – real-time CDC.


Disadvantages

Significant drag on database resources, both compute and storage.

Requires that the database have the capability.

Negative impact on performance of applications that depend on the source database.
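A minimal sketch of trigger-based CDC, again using sqlite3 as a stand-in for the source database. The table and trigger names are hypothetical. Every write to the monitored table causes a second write to a change table, which is where the extra compute and storage cost comes from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

    -- Change table filled by the triggers below.
    CREATE TABLE customer_changes (
        seq  INTEGER PRIMARY KEY AUTOINCREMENT,
        op   TEXT,
        id   INTEGER,
        name TEXT
    );

    CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
        INSERT INTO customer_changes (op, id, name) VALUES ('I', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
        INSERT INTO customer_changes (op, id, name) VALUES ('U', NEW.id, NEW.name);
    END;

    CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
        INSERT INTO customer_changes (op, id, name) VALUES ('D', OLD.id, OLD.name);
    END;
    """
)

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("UPDATE customer SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")

changes = conn.execute(
    "SELECT op, id, name FROM customer_changes ORDER BY seq"
).fetchall()
print(changes)
```

Unlike the timestamp approach, every intermediate change is recorded, including the delete.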

Advantages and Disadvantages of Snapshot-Based CDC

Advantages

Relatively easy to implement with good ETL software.

Requires no specialized knowledge of the source database.

Very dependable and accurate.


Disadvantages

Requires repeatedly moving all data in monitored tables. May impact target or staging system resources and network bandwidth.

Moving lots of data can be slow and may not meet SLAs.

Joining, comparing, and finding changes also takes time, which makes the process even slower.

Not a complete record of intermediate changes between snapshot captures.
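The comparison at the heart of snapshot-based CDC can be sketched as follows. Snapshots are modeled here as Python dicts keyed by primary key, which is an assumption for the example; a real job would join two full table extracts.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns a list of (op, key, row) tuples where op is
    'I' (inserted), 'U' (updated) or 'D' (deleted).
    """
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("I", key, row))
        elif previous[key] != row:
            changes.append(("U", key, row))
    for key, row in previous.items():
        if key not in current:
            changes.append(("D", key, row))
    return changes


previous = {1: ("Ada",), 2: ("Grace",), 3: ("Edsger",)}
current = {1: ("Ada L.",), 2: ("Grace",), 4: ("Barbara",)}

changes = diff_snapshots(previous, current)
print(changes)
```

Both snapshots must be read in full on every run, and a row that changed twice between snapshots shows up as a single update.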

Advantages and Disadvantages of Log-Based CDC

Advantages

Very reliable and detailed.

Virtually no impact on database or application performance.

Changes captured in real-time.

No database bloat.


Disadvantages

Every RDBMS has a different log format, often not documented.

Log formats often change between RDBMS versions.

Log files are frequently archived by the database. CDC software must read them before they’re archived, or be able to go read the archived logs.

Requires specialized CDC software. Cannot be easily accomplished with ETL software.

TWO WAYS SYNCSORT DOES CDC


How Change Data Capture in DMX/DMX-h Works – Snapshot-based CDC


1. Capture: DMX or DMX-h pulls all data from tables that are being monitored for change. Syncsort high performance engine joins new data with previous snapshot and finds the data changes.

2. Process: On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data, and splits it into parallel load streams for the cluster.

3. Apply: DMX-h applies the changes to Hive tables, and updates Hive statistics to facilitate queries on the new data.

Diagram labels: Edge Node or Server; Source Database; Staged Data; Snapshot
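The Process step, splitting one delta stream into parallel load streams, can be sketched like this. The slides do not say how the CDC Reader partitions the data; hashing on the record key is an assumption chosen here because it keeps all changes for one key in the same stream and in their original order. The function name split_stream is hypothetical.

```python
import zlib


def split_stream(changes, n_streams):
    """Split a single stream of (op, key, row) changes into n_streams lists.

    A stable hash of the key picks the stream, so every change for a
    given key lands in the same stream, in arrival order.
    """
    streams = [[] for _ in range(n_streams)]
    for op, key, row in changes:
        index = zlib.crc32(str(key).encode("utf-8")) % n_streams
        streams[index].append((op, key, row))
    return streams


delta = [
    ("I", 4, ("Barbara",)),
    ("U", 1, ("Ada L.",)),
    ("U", 4, ("Barbara L.",)),
    ("D", 3, ("Edsger",)),
]
streams = split_stream(delta, n_streams=2)
for number, stream in enumerate(streams):
    print(number, stream)
```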

How DMX Change Data Capture Works – Log-based CDC


1. Capture: DMX CDC engine scrapes the DB2 logs and stores only the delta, the data that has changed, and flags it as Updated, Deleted or Inserted. Virtually no MIPS usage.

2. Process: On an edge node in DMX-h, a CDC Reader consumes a single raw data stream of the delta data, and splits it into parallel load streams for the cluster.

3. Apply: DMX-h applies the changes to Hive tables, and updates Hive statistics to facilitate queries on the new data.
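The Apply step can be illustrated with a small sketch in which each captured change carries an Inserted, Updated or Deleted flag. A Python dict stands in for the target Hive table, and the single-letter flags are an assumption for the example; this is not how DMX-h writes to Hive.

```python
def apply_changes(target, changes):
    """Apply flagged changes, in order, to a target keyed by primary key.

    'I' and 'U' are both handled as an upsert; 'D' removes the row.
    """
    for op, key, row in changes:
        if op in ("I", "U"):
            target[key] = row
        elif op == "D":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown change flag: {op!r}")
    return target


target = {1: ("Ada",), 2: ("Grace",), 3: ("Edsger",)}
delta = [
    ("U", 1, ("Ada L.",)),
    ("I", 4, ("Barbara",)),
    ("D", 3, ("Edsger",)),
]
apply_changes(target, delta)
print(target)
```

Treating inserts and updates as the same upsert means replaying a change after a failure does not create duplicates.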

What Next?


Find out more about DMX Change Data Capture

http://www.syncsort.com/en/Products/BigData/DMX-Change-Data-Capture

Contact Syncsort sales to get the latest info: http://www.syncsort.com/en/ContactSales


Questions

