SQL Server Integration Services Best Practices
Thomas Kejser, Denny Lee
Microsoft Corporation
DBP408
SQL Customer Advisory Team, SQL Server Design Wins Program
SQLCAT engages with the largest customer deployments:
US: NASDAQ, USDA, Verizon, Raymond James…
Europe: London Stock Exchange, Barclays Capital
Asia and Pacific: Korea Telecom, Western Digital, Japan Railways East
ISVs: Horizontals (SAP, PeopleSoft, Siebel, JDE, SharePoint, MBS) and Verticals (Healthcare, Retail, Financials, Manufacturing)
Our site: http://SQLCAT.com. All things SQLCAT can be found on this site.
Top 10 Lists: summary lists of best practices and recommendations
Technical Notes: deep-level technical short papers
Technical Spotlights: technical end-to-end customer case studies
Tools: tool kits and useful scripts
Presentations: multimedia and PPT presentations
Searchable and tagger friendly; post comments and provide feedback!
Search all of our SQLCAT best practices whitepapers; easily find our SQLCAT blogs and other materials.
Session Objectives And Takeaways
Session objectives:
Learn what to measure before you design
Learn how to performance-tune individual Integration Services data flows
Understand how to design scalable ETL solutions
Takeaways:
Integration Services as a high-performance ETL platform
Understand common pitfalls
Agenda
Measure twice, cut once: server characteristics, speed of source
Baseline the package: measure speed, memory, CPU, and I/O
Tuning the data flow: a bag of tricks
Designing for parallelism
Measure Twice, Cut Once
What are we running on here?
Understand And Measure Hardware
Seek to understand the limits of your system.
Questions about the target platform for your infrastructure team:
How many CPU cores?
How much memory?
How fast is the I/O subsystem? Use SQLIO to measure; understand how spindles map to LUNs.
How fast is the network? How many NICs do you have available? What is the network topology?
Limits of the Source System
Stating the obvious: all data flows extract data from somewhere, and put it somewhere else. You cannot transform data faster than you can read it, or write it.
Before you start tuning, measure the extract speed of the source. Significant gains can be had from simple things:
Better drivers
Driver configuration
I/O and network optimization
Understand limits early in design.
Measure Speed per Connection
Read data from the source, with a Row Count transform as the destination. Run DTEXEC and measure the time taken, using the Integration Services log output for the time values.
Rows/sec for the data flow = Row Count / Time. For example, 10,000,000 rows read in 100 seconds is 100,000 rows/sec.
Total Speed of Source
Sources have a limited number of rows/sec:
Limit of the network
Limit of the driver
Limit of the hardware at the source system
You can often overcome the driver limit by starting multiple connections to the source. If the network is the bottleneck, multiple connections allow multiple NICs to be used.
Measure Total Speed of Source
Gradually start up several copies of your simple package. Consider partitioning the source if locking/blocking is an issue. Use Perfmon to measure:
CPU load: Processor / % Processor Time (_Total)
NIC load: Network Interface / Current Bandwidth and Network Interface / Bytes Total/sec
I/O latency: Logical Disk / Avg. Disk sec/Transfer
If possible, measure at both the source and the executor.
Baseline the Package
How does it run?
Set Up Perfmon
You should trace your package with Perfmon to get a baseline of resource consumption. Perfmon counters to use:
Logical Disk: Avg. Disk sec/Transfer; Read and Write Bytes/sec
Processor: % Processor Time (_Total)
Process (measure both DTEXEC and SQL Server): Private Bytes, Working Set
SQL Server: Memory Manager / Total Server Memory
Questions You Can Now Answer
How much memory does my package use? Plan your memory accordingly; Integration Services assumes transformations fit in memory!
What is the I/O rate while the package runs? Plan your I/O capacity.
What network throughput does the package drive?
How is CPU usage distributed between SQL Server and Integration Services? How much CPU can I consume with one package? More about this later.
Tuning the Data Flow
… A bag of tricks for you to use
Optimize the SQL Data Source
Use the NOLOCK hint to remove locking overhead; it improves the speed of large table scans (at the cost of possibly reading uncommitted data).
SELECT only the columns you need.
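The two tips above combine in the source query itself. A minimal sketch, with a hypothetical fact table and column names:

```sql
-- Illustrative source query: project only the columns the data flow needs,
-- and read with NOLOCK (accepts dirty reads) to avoid shared-lock overhead.
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.SourceOrders WITH (NOLOCK);
```

Paste a query like this into the source component instead of selecting the whole table from the drop-down.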
Optimize the Lookup Transformation
Change the SELECT statement to use only the columns you need; this optimizes memory usage. Consider adding NOLOCK.
In SSIS 2008, use the Shared Lookup Cache.
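Rather than pointing the Lookup at the whole dimension table, a narrowed query such as the following (table and column names are illustrative) keeps the lookup cache small:

```sql
-- Only the business key (join column) and the surrogate key are cached
-- by the Lookup; all other dimension columns stay out of memory.
SELECT CustomerBusinessKey, CustomerSK
FROM dbo.DimCustomer WITH (NOLOCK);
```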
Network Tuning
Change the network packet size in the connection manager; higher values typically yield faster throughput. Maximum value: 32767.
Experiment with Shared Memory vs. TCP/IP.
If using Windows Server 2008, consider network affinity.
Enable jumbo frames on the network, and consult your network specialists.
Data Types
Don't use INT when SMALLINT will do.
Don't use nchar/DT_WSTR when char/DT_STR will do.
When SQL Server is the destination, Money/DT_CY instead of Decimal can yield good benefits. Measure.
Make data types as narrow as possible.
Optimize the SQL Destination
Use the SQL Server Destination instead of the OLE DB Destination, but be aware of its limitations: it can only run if Integration Services is on the same box as SQL Server.
Commit size 0 is fastest; if you cannot use 0, use the highest possible value.
Heap inserts are typically faster than clustered-index inserts.
Drop indexes and rebuild them if you are changing a large part of the table.
Use partitions and partition SWITCH.
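The SWITCH technique can be sketched as follows; table names and the partition number are hypothetical, and the staging table must be empty-target-compatible (same schema, same filegroup, appropriate constraints):

```sql
-- Bulk-load into a staging table that matches the partitioned fact table,
-- then switch it in as a partition. SWITCH is a metadata-only operation,
-- so the "insert" into the big table is nearly instantaneous.
ALTER TABLE dbo.FactSales_Stage
    SWITCH TO dbo.FactSales PARTITION 5;
```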
Change the Design
Don't sort unless absolutely necessary; use SQL Server indexes instead, and mark the source as sorted via an ORDER BY statement in the source query.
Sometimes T-SQL is faster: a set-based UPDATE statement instead of row-by-row OLE DB commands, or large aggregations (GROUP BY/SUM). Use the right tool for the right job.
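As a sketch of the set-based alternative to a row-by-row OLE DB Command (all names are hypothetical): land the changed rows in a staging table, then issue one joined UPDATE:

```sql
-- One set-based statement replaces millions of singleton updates.
UPDATE f
SET    f.Amount = s.Amount
FROM   dbo.FactSales AS f
JOIN   dbo.FactSales_Stage AS s
       ON s.OrderID = f.OrderID;
```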
Delta detection is sometimes not worth doing; just reload. Rule of thumb: if the delta is greater than 10%, reload!
Do minimally logged operations if possible: run the data flow in bulk mode, use TRUNCATE instead of DELETE, use SWITCH and partitioning.
Designing for Parallelism
How to really speed things up!
The Tenets of Scalable Computing
Partition the problem, preferably into equal-sized pieces.
Eliminate the need for common resources; favor a stateless design.
Schedule and distribute the work correctly: make the best of the Gantt chart, and try not to let the longest task dominate the runtime.
Partition the Problem
Partition the source data into smaller piles of equal size:
Range partitions, e.g. daily or by geography
Hash partitions, e.g. modulo on an IDENTITY(1,1) column
Use partitioning on the target table; the SWITCH command is your friend!
Let the package take parameters: @Partition configures which partition to process.
Start multiple copies of the package, e.g. with the START command.
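A hedged sketch of the hash-partition idea: each package copy receives a @Partition value (and the total count) and extracts only its slice of the source. Table and column names are illustrative:

```sql
-- Each of the @PartitionCount package copies reads a disjoint slice of
-- the source, hashed by modulo on the IDENTITY column, so the copies
-- never compete for the same rows.
SELECT OrderID, CustomerID, Amount
FROM   dbo.SourceOrders
WHERE  OrderID % @PartitionCount = @Partition;
```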
Eliminate Common Resources
Many connections inserting into the same table will eventually cause contention; partition the tables.
Other tricks:
Build a proper I/O system; isolating readers from writers is often beneficial.
Design to stay in memory: don't page memory, and let every package have enough.
Don't land all connections on the same NUMA node.
http://msdn.microsoft.com/en-us/library/ms345346.aspx
Schedule It Correctly
Create a (priority) queue for your packages; a SQL table is good for this purpose.
Each package includes a loop: the loop takes one item from the queue and does the work, repeating until the queue is empty.
[Diagram: a priority queue (P1…Pn) feeds multiple DTEXEC instances; each instance runs a Get Task / Do Work loop until the queue is empty. Using a queue to control parallelism.]
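A minimal sketch of such a queue table and an atomic dequeue, assuming a hypothetical EtlQueue table. The DELETE … OUTPUT pattern with READPAST lets many DTEXEC instances pull work concurrently without double-processing an item:

```sql
-- Hypothetical work queue: lower Priority value = more urgent.
CREATE TABLE dbo.EtlQueue
(
    QueueID   int IDENTITY(1,1) PRIMARY KEY,
    Priority  int NOT NULL,
    Partition int NOT NULL
);

-- Atomically take the highest-priority item; each package loops on this
-- statement until no row comes back (queue empty).
DELETE q
OUTPUT deleted.Partition
FROM dbo.EtlQueue AS q WITH (ROWLOCK, READPAST)
WHERE q.QueueID =
      (SELECT TOP (1) QueueID
       FROM dbo.EtlQueue WITH (READPAST)
       ORDER BY Priority, QueueID);
```

READPAST skips rows locked by other dequeuing packages instead of blocking on them, which is what keeps the instances from serializing on the queue.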
Demo
Integration Services vs. SQL: Lab Test Setup
Transform fact data with surrogate-key lookups: 5 dimension tables, 100K rows each; partitioned fact table, total of 320M rows.
Test the speed of hash joins.
Test 2: Raw Join

  Configuration          Time (s)   Krows/s
  SSIS 2008                   144      2222
  SQL MAXDOP = 0              158      2025
  SQL MAXDOP = 1 x 32         162      1975

Test 3: Join and Write

  Configuration          Time (s)   Krows/s
  SQL MAXDOP = 1 x 32         246      1301
  SSIS 2008                   278      1151
  SQL MAXDOP = 0             1927       166

Integration Services lookup join is comparable in speed to T-SQL!
ICE 4.0: Security Analysis
Single DB instance: 40 TB (up from 27 TB in ICE 3.0)
Complex transformations over 1.4 TB/day: 700 GB firewall data, 700 GB web proxy data
Hardware: Integration Services box with 4 GB RAM, 4 procs (memory usage 1 GB to 3.5 GB); database box with 32 GB RAM, 8 procs, CX-700 SAN
Results of prototype test

  Method                # of SSIS Instances   Input Log File Size (GB)   Number of Rows   Duration (minutes)
  Partition Switch-In   2                     23.0, 21.1                 31M, 29M         25
  Direct Insert         2                     23.0, 20.3                 31M, 28M         60
ETL World Record
[Screenshot: Task Manager during the ETL World Record run on an ES/7000-one]
Related Content
http://sqlcat.com website: watch out for Top 10 Integration Services Best Practices
ICE 3.0 whitepaper: http://technet.microsoft.com/en-us/library/bb961995.aspx
TPC-H ETL World Record: http://blogs.msdn.com/sqlperf/archive/2008/02/27/etl-world-record.aspx
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after
the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.