SQL Server Integration Services Best Practices
Thomas Kejser, Denny Lee
Microsoft Corporation
DBP408
SQL Customer Advisory Team, SQL Server Design Wins Program
SQLCAT engages with the largest customer deployments:
US: NASDAQ, USDA, Verizon, Raymond James…
Europe: London Stock Exchange, Barclays Capital
Asia and Pacific: Korea Telecom, Western Digital, Japan Railways East
ISVs: Horizontals (SAP, PeopleSoft, Siebel, JDE, SharePoint, MBS) and Verticals (Healthcare, Retail, Financials, Manufacturing)
Our site: http://SQLCAT.com. All things SQLCAT can be found on this site.
Top 10 Lists: summary lists of best practices and recommendations
Technical Notes: deep-level technical short papers
Technical Spotlights: technical end-to-end customer case studies
Tools: tool kits and useful scripts
Presentations: multimedia and PPT presentations
Searchable and tagger friendly; post comments and provide feedback!
Search all of our SQLCAT best practices whitepapers; easily find our SQLCAT blogs and other materials.
Session Objectives And Takeaways
Session objectives:
Learn what to measure before you design
Learn how to performance-tune individual Integration Services data flows
Understand how to design scalable ETL solutions
Takeaways:
Integration Services as a high-performance ETL platform
Understand common pitfalls
Agenda
Measure twice, cut once: server characteristics, speed of source
Baseline the package: measure speed, memory, CPU, and I/O
Tuning the data flow: a bag of tricks
Designing for parallelism
Measure Twice, Cut Once
What are we running on here?
Understand And Measure Hardware
Seek to understand the limits of your system.
Questions about the target platform for your infrastructure team:
How many CPU cores?
How much memory?
How fast is the I/O subsystem? Use SQLIO to measure; understand how spindles map to LUNs.
How fast is the network? How many NICs do you have available? What is the network topology?
Limits of the Source System
Stating the obvious: all data flows extract data from somewhere, and put it somewhere else. You cannot transform data faster than you can read it, or write it.
Before you start tuning, measure the extract speed of the source. Significant gains can be had from simple things:
Better drivers
Driver configuration
I/O and network optimization
Understand limits early in design.
Measure Speed per Connection
Read data from the source, with a Row Count transform as the destination. Run DTEXEC and measure the time taken, using the Integration Services log output for the time values.
Rows/sec for the data flow = Row Count / Time. For example, 10,000,000 rows read in 100 seconds is 100,000 rows/sec.
Total Speed of Source
Sources have a limited number of rows/sec:
Limit of the network
Limit of the driver
Limit of the hardware at the source system
You can often overcome the driver limit by starting multiple connections to the source. If the network is the bottleneck, multiple connections allow multiple NICs to be used.
Measure Total Speed of Source
Gradually start up several copies of your simple package. Consider partitioning the source if locking/blocking is an issue. Use Perfmon to measure:
CPU load: Processor / % Processor Time (_Total)
NIC load: Network Interface / Current Bandwidth and Network Interface / Bytes Total/sec
I/O latency: Logical Disk / Avg. Disk sec/Transfer
If possible, measure at both the source and the executor.
Baseline the Package
How does it run?
Set Up Perfmon
You should trace your package with Perfmon to get a baseline of resource consumption. Perfmon counters to use:
Logical Disk: Avg. Disk sec/Transfer; Read and Write Bytes/sec
Processor: % Processor Time (_Total)
Process (measure both DTEXEC and SQL Server): Private Bytes, Working Set
SQL Server: Memory Manager / Total Server Memory
Questions You Can Now Answer
How much memory does my package use? Plan your memory accordingly; Integration Services assumes transformations fit in memory!
What is the I/O rate while the package runs? Plan your I/O capacity.
What network throughput does the package drive?
How is CPU usage distributed between SQL Server and Integration Services? How much CPU can I consume with one package? More about this later.
Tuning the Data Flow
… A bag of tricks for you to use
Optimize the SQL Data Source
Use the NOLOCK hint to remove locking overhead; it improves the speed of large table scans (at the cost of possibly reading uncommitted data).
SELECT only the columns you need.
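The two tips above combine in the source query itself. A minimal sketch, with a hypothetical fact table and column names:

```sql
-- Illustrative source query: project only the columns the data flow needs,
-- and read with NOLOCK (accepts dirty reads) to avoid shared-lock overhead.
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.SourceOrders WITH (NOLOCK);
```

Paste a query like this into the source component instead of selecting the whole table from the drop-down.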
Optimize the Lookup Transformation
Change the SELECT statement to use only the columns you need; this optimizes memory usage. Consider adding NOLOCK.
In SSIS 2008, use the Shared Lookup Cache.
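Rather than pointing the Lookup at the whole dimension table, a narrowed query such as the following (table and column names are illustrative) keeps the lookup cache small:

```sql
-- Only the business key (join column) and the surrogate key are cached
-- by the Lookup; all other dimension columns stay out of memory.
SELECT CustomerBusinessKey, CustomerSK
FROM dbo.DimCustomer WITH (NOLOCK);
```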
Network Tuning
Change the network packet size in the connection manager; higher values typically yield faster throughput. Maximum value: 32767.
Experiment with Shared Memory vs. TCP/IP.
If using Windows Server 2008, consider network affinity.
Enable jumbo frames on the network, and consult your network specialists.
Data Types
Don't use INT when SMALLINT will do.
Don't use nchar/DT_WSTR when char/DT_STR will do.
When SQL Server is the destination, Money/DT_CY instead of Decimal can yield good benefits. Measure.
Make data types as narrow as possible.
Optimize the SQL Destination
Use the SQL Server Destination instead of the OLE DB Destination, but be aware of its limitations: it can only run if Integration Services is on the same box as SQL Server.
Commit size 0 is fastest; if you cannot use 0, use the highest possible value.
Heap inserts are typically faster than clustered-index inserts.
Drop indexes and rebuild them if you are changing a large part of the table.
Use partitions and partition SWITCH.
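The SWITCH technique can be sketched as follows; table names and the partition number are hypothetical, and the staging table must be empty-target-compatible (same schema, same filegroup, appropriate constraints):

```sql
-- Bulk-load into a staging table that matches the partitioned fact table,
-- then switch it in as a partition. SWITCH is a metadata-only operation,
-- so the "insert" into the big table is nearly instantaneous.
ALTER TABLE dbo.FactSales_Stage
    SWITCH TO dbo.FactSales PARTITION 5;
```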
Change the Design
Don't sort unless absolutely necessary; use SQL Server indexes instead, and mark the source as sorted via an ORDER BY statement in the source query.
Sometimes T-SQL is faster: a set-based UPDATE statement instead of row-by-row OLE DB commands, or large aggregations (GROUP BY/SUM). Use the right tool for the right job.
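As a sketch of the set-based alternative to a row-by-row OLE DB Command (all names are hypothetical): land the changed rows in a staging table, then issue one joined UPDATE:

```sql
-- One set-based statement replaces millions of singleton updates.
UPDATE f
SET    f.Amount = s.Amount
FROM   dbo.FactSales AS f
JOIN   dbo.FactSales_Stage AS s
       ON s.OrderID = f.OrderID;
```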
Delta detection is sometimes not worth doing; just reload. Rule of thumb: if the delta is greater than 10%, reload!
Do minimally logged operations if possible: run the data flow in bulk mode, use TRUNCATE instead of DELETE, use SWITCH and partitioning.
Designing for Parallelism
How to really speed things up!
The Tenets of Scalable Computing
Partition the problem, preferably into equal-sized pieces.
Eliminate the need for common resources; favor a stateless design.
Schedule and distribute the work correctly: make the best of the Gantt chart, and try not to let the longest task dominate the runtime.
Partition the Problem
Partition the source data into smaller piles of equal size:
Range partitions, e.g. daily or by geography
Hash partitions, e.g. modulo on an IDENTITY(1,1) column
Use partitioning on the target table; the SWITCH command is your friend!
Let the package take parameters: @Partition configures which partition to process.
Start multiple copies of the package, e.g. with the START command.
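A hedged sketch of the hash-partition idea: each package copy receives a @Partition value (and the total count) and extracts only its slice of the source. Table and column names are illustrative:

```sql
-- Each of the @PartitionCount package copies reads a disjoint slice of
-- the source, hashed by modulo on the IDENTITY column, so the copies
-- never compete for the same rows.
SELECT OrderID, CustomerID, Amount
FROM   dbo.SourceOrders
WHERE  OrderID % @PartitionCount = @Partition;
```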
Eliminate Common Resources
Many connections inserting into the same table will eventually cause contention; partition the tables.
Other tricks:
Build a proper I/O system; isolating readers from writers is often beneficial.
Design to stay in memory: don't page memory, and let every package have enough.
Don't land all connections on the same NUMA node.
http://msdn.microsoft.com/en-us/library/ms345346.aspx
Schedule It Correctly
Create a (priority) queue for your packages; a SQL table is good for this purpose.
Each package includes a loop: the loop takes one item from the queue and does the work, repeating until the queue is empty.
[Diagram: a priority queue (P1…Pn) feeds multiple DTEXEC instances; each instance runs a Get Task / Do Work loop until the queue is empty. Using a queue to control parallelism.]
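A minimal sketch of such a queue table and an atomic dequeue, assuming a hypothetical EtlQueue table. The DELETE … OUTPUT pattern with READPAST lets many DTEXEC instances pull work concurrently without double-processing an item:

```sql
-- Hypothetical work queue: lower Priority value = more urgent.
CREATE TABLE dbo.EtlQueue
(
    QueueID   int IDENTITY(1,1) PRIMARY KEY,
    Priority  int NOT NULL,
    Partition int NOT NULL
);

-- Atomically take the highest-priority item; each package loops on this
-- statement until no row comes back (queue empty).
DELETE q
OUTPUT deleted.Partition
FROM dbo.EtlQueue AS q WITH (ROWLOCK, READPAST)
WHERE q.QueueID =
      (SELECT TOP (1) QueueID
       FROM dbo.EtlQueue WITH (READPAST)
       ORDER BY Priority, QueueID);
```

READPAST skips rows locked by other dequeuing packages instead of blocking on them, which is what keeps the instances from serializing on the queue.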
Demo
Integration Services vs. SQL: Lab Test Setup
Transform fact data with surrogate-key lookups: 5 dimension tables, 100K rows each; partitioned fact table, total of 320M rows.
Test the speed of hash joins.
Test 2: Raw Join

  Configuration          Time (s)   Krows/s
  SSIS 2008                   144      2222
  SQL MAXDOP = 0              158      2025
  SQL MAXDOP = 1 x 32         162      1975

Test 3: Join and Write

  Configuration          Time (s)   Krows/s
  SQL MAXDOP = 1 x 32         246      1301
  SSIS 2008                   278      1151
  SQL MAXDOP = 0             1927       166

Integration Services lookup join is comparable in speed to T-SQL!
ICE 4.0: Security Analysis
Single DB instance: 40 TB (up from 27 TB in ICE 3.0)
Complex transformations over 1.4 TB/day: 700 GB firewall data, 700 GB web proxy data
Hardware: Integration Services box with 4 GB RAM, 4 procs (memory usage 1 GB to 3.5 GB); database box with 32 GB RAM, 8 procs, CX-700 SAN
Results of prototype test

  Method                # of SSIS Instances   Input Log File Size (GB)   Number of Rows   Duration (minutes)
  Partition Switch-In   2                     23.0, 21.1                 31M, 29M         25
  Direct Insert         2                     23.0, 20.3                 31M, 28M         60
ETL World Record
[Screenshot: Task Manager during the ETL World Record run on an ES/7000-one]
Related Content
http://sqlcat.com website: watch out for Top 10 Integration Services Best Practices
ICE 3.0 whitepaper: http://technet.microsoft.com/en-us/library/bb961995.aspx
TPC-H ETL World Record: http://blogs.msdn.com/sqlperf/archive/2008/02/27/etl-world-record.aspx
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after
the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.