Fast parallel data loading with the bulk API
Salesforce API Series
Fast Parallel Data Loading with the Bulk API
July 15, 2014
#forcewebinar
Speaker
Hervé Maleville
Platform Specialist - France
Safe Harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.
Follow Developer Force for the Latest News
@forcedotcom / #forcewebinar
How fast can you load data into Salesforce?
How many records can you load into Salesforce in 1 hour?
Data load throughput
[Chart: data load throughput in records/hour for three cases labeled OK, Fast, and Faster; axis marked up to 25,000,000 records/hour.]
Parallel processing
A parallel processing analogy: digging a ditch
Serial processing
Parallel processing
Degree of Parallelism
The number of processes or threads associated with an operation.
Optimal parallel processing
[Diagram: time taken by a serial load of 20M records compared with a parallel load split into four loads of 5M records each.]
Sub-optimal parallel processing
[Diagram: time taken by a serial load of 20M records compared with a sub-optimal parallel load split into four loads of 5M records each.]
Throughput inhibitors
Locks, exceptions, triggers, relationships, …
[Diagram: time taken by a serial load of 20M records compared with a parallel load of four 5M-record loads slowed by throughput inhibitors.]
Data load case studies
• Get hands on with the Salesforce Bulk API
• Contrast serial data loads vs. parallel data loads
• Measure degrees of parallelism and throughput
• Identify and avoid throughput inhibitors
• Achieve maximum throughput
Prep work
Salesforce Bulk API
• Asynchronous data loading
• Optimized for large data sets
• REST API
• Powers many tools
• Use to build custom tools with any programming language (Java, etc.)
Demo schema
Bulk API Loads that … RIP
Realize, Investigate, and Plan
Case Studies
Case Study: Serial Data Load
Serial load: Expected plan
[Diagram: a row of threads plotted against time.]
• One job
• 100 batches
• 10,000 records/batch
• 1M total records
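The plan above (one job, 100 batches of 10,000 records) comes down to slicing the load file into fixed-size chunks. A minimal sketch of that step, assuming the records are already in memory; the function name `split_into_batches` and the integer stand-in rows are illustrative and not taken from the demo:

```python
from typing import Iterator, List, Sequence


def split_into_batches(records: Sequence, batch_size: int = 10_000) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


# Stand-in for 1M rows of a load file (hypothetical data, not the demo data set).
records = range(1_000_000)
batches = list(split_into_batches(records))
print(len(batches), len(batches[0]))  # 100 10000
```

Each resulting chunk would then be submitted as one batch of the job.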
Serial load: Job configuration
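The configuration shown on this slide did not survive in the transcript. As a rough sketch only: a Bulk API job is described by a `jobInfo` XML document, and the concurrency mode is one of its fields. The code below builds such a payload without sending it. The object name `Account` is a placeholder (the demo schema is not reproduced here), and the element order follows the documented `jobInfo` layout as best recalled, so check it against the current Bulk API reference before relying on it.

```python
import xml.etree.ElementTree as ET

# Namespace used by Bulk API job and batch documents.
ASYNC_NS = "http://www.force.com/2009/06/asyncapi/dataload"


def build_job_info(sobject: str, concurrency_mode: str = "Serial",
                   operation: str = "insert", content_type: str = "CSV") -> str:
    """Return a jobInfo XML payload as a string (nothing is sent over the network)."""
    if concurrency_mode not in ("Serial", "Parallel"):
        raise ValueError("concurrency_mode must be 'Serial' or 'Parallel'")
    ET.register_namespace("", ASYNC_NS)
    root = ET.Element(f"{{{ASYNC_NS}}}jobInfo")
    fields = (
        ("operation", operation),
        ("object", sobject),
        ("concurrencyMode", concurrency_mode),
        ("contentType", content_type),
    )
    for tag, value in fields:
        ET.SubElement(root, f"{{{ASYNC_NS}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")


# "Account" is a placeholder object name, not the object used in the demo.
job_xml = build_job_info("Account", concurrency_mode="Serial")
print(job_xml)
```

Switching `concurrency_mode` to `"Parallel"` gives the payload for the parallel case studies.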
Serial load: Batch creation
Serial load: Batch run
Demo: Serial load
Serial load summary
Concurrency Mode: Serial
Records Loaded: 1 million
Records Failed: 0
Run Time: 77 minutes
Work Completed: 75 minutes
Throughput: 13,000 records per minute
Degree of Parallelism: 0.97
Key Problem: Degree of parallelism explicitly limited to ~1.
Solution: Explore parallel load for increased throughput.
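The slide does not spell out how these two metrics are derived. One reading that reproduces the serial figures is throughput = records loaded / run time and degree of parallelism = work completed / run time. The sketch below applies that assumed reading; both function names are illustrative.

```python
def throughput_per_minute(records_loaded: int, run_minutes: float) -> float:
    """Records loaded per minute of wall-clock run time."""
    return records_loaded / run_minutes


def degree_of_parallelism(work_minutes: float, run_minutes: float) -> float:
    """Assumed definition: total processing time divided by wall-clock run time."""
    return work_minutes / run_minutes


# Figures from the serial load summary.
serial_throughput = throughput_per_minute(1_000_000, 77)
serial_dop = degree_of_parallelism(75, 77)
print(round(serial_throughput), round(serial_dop, 2))  # 12987 0.97
```

The result is close to the 13,000 records per minute and 0.97 reported for the serial run. The run times and work totals on the other summary slides appear to be rounded, so the same ratios only approximate their reported values.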
Parallelism vs. Throughput of a Single Job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Serial run: low degree of parallelism. Plotted runs: Serial.]
Case Study: Parallel data loads
Parallel load: Expected plan
[Diagram: a row of threads plotted against time.]
• One job
• 100 batches
• 10,000 records/batch
• 1M total records
Parallel load: Job configuration
Demo: Parallel 1
Things to watch for
Locks can significantly affect parallel loads
– Wasted processing capacity
– Reduced throughput
– Failures
Retry logic is not all it's cracked up to be
Parallel load 1 summary
Concurrency Mode: Parallel
Records Loaded: 396,600
Records Failed: 603,400
Run Time: 17 minutes
Work Completed: 3 hours 15 minutes
Throughput: 22,000 records per minute
Degree of Parallelism: 11.5
Key Problem: Lock exceptions. Server worked significantly harder but no increase in throughput.
Solution: Run the load in serial mode or manage locks.
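To see how much of a failed load is due to locks, the per-record batch results can be tallied by error type. The sketch below assumes CSV results with `Id`, `Success`, `Created`, and `Error` columns and lock failures reported with the `UNABLE_TO_LOCK_ROW` status code; the sample rows and IDs are invented for illustration and are not output from the demo.

```python
import csv
import io
from collections import Counter


def summarize_results(result_csv: str) -> Counter:
    """Count successes, lock failures, and other failures in a batch result CSV."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(result_csv)):
        if row["Success"].lower() == "true":
            counts["succeeded"] += 1
        elif "UNABLE_TO_LOCK_ROW" in row["Error"]:
            counts["lock_failures"] += 1
        else:
            counts["other_failures"] += 1
    return counts


# Invented sample rows; the column layout is an assumption about the result format.
sample = (
    '"Id","Success","Created","Error"\n'
    '"001xx0000000001","true","true",""\n'
    '"","false","false","UNABLE_TO_LOCK_ROW:unable to obtain exclusive access to this record:--"\n'
    '"","false","false","REQUIRED_FIELD_MISSING:Required fields are missing: [Name]:Name --"\n'
)
counts = summarize_results(sample)
print(dict(counts))
```

A high share of lock failures, as in this run, suggests fixing the cause of the locks rather than resubmitting the failed rows.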
Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Parallel Run 1: high degree of parallelism, low throughput due to locks. Plotted runs: Serial, Parallel 1.]
Time to optimize
Let’s make your data load RIP
• Realize
– Locks inhibit parallelism and throughput
• Investigate
– What is causing the locks
• Plan
– Manage the locks
Demo: Parallel load 2
Eliminate Locks by Modifying Schema
Parallel load: Sample results
Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 3 minutes and 30 seconds
Work Completed: 1 hour
Throughput: 320,000 records per minute
Degree of Parallelism: 19
Key Problem: None
Solution: n/a
Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Parallel Run 2: high degree of parallelism, high throughput. Plotted runs: Serial, Parallel 1, Parallel 2.]
Locks can be managed by
• Elimination
• Ordering load file
Demo: Parallel load 3
Avoid Locks with Ordered Data
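One way to read "ordered data": sort the load file by the parent record each row points to, so that rows sharing a parent fall into the same batch and concurrent batches are less likely to compete for the same parent lock. The sketch below illustrates the idea on toy data. The field name `ParentId` and both helper functions are hypothetical, and the transcript does not say which key the demo ordered on.

```python
from typing import Dict, List, Set


def order_and_batch(rows: List[Dict[str, str]], parent_key: str,
                    batch_size: int) -> List[List[Dict[str, str]]]:
    """Sort rows by their parent reference, then cut them into fixed-size batches."""
    ordered = sorted(rows, key=lambda row: row[parent_key])
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def parents_shared_across_batches(batches: List[List[Dict[str, str]]],
                                  parent_key: str) -> Set[str]:
    """Return the parents that are referenced from more than one batch."""
    first_seen: Dict[str, int] = {}
    shared: Set[str] = set()
    for index, batch in enumerate(batches):
        for parent in {row[parent_key] for row in batch}:
            if first_seen.setdefault(parent, index) != index:
                shared.add(parent)
    return shared


# Toy data: 12 child rows spread round-robin over 4 parents (hypothetical values).
rows = [{"Name": f"Child {i}", "ParentId": f"P{i % 4}"} for i in range(12)]
unordered = [rows[i:i + 3] for i in range(0, len(rows), 3)]
ordered = order_and_batch(rows, "ParentId", batch_size=3)
print(len(parents_shared_across_batches(unordered, "ParentId")),
      len(parents_shared_across_batches(ordered, "ParentId")))
```

In real files a parent's rows can still straddle a batch boundary, so ordering reduces lock contention without necessarily removing it.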
Managing locks … a discussion while we load
• Master-detail relationships
• Lookup relationships
• Roll-up summary fields
• Triggers
• Workflow rules
• Group membership locks*
Parallel load: Sample results
Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 4 minutes
Work Completed: 1 hour
Throughput: 250,000 records per minute
Degree of Parallelism: 16.5
Key Problem: Minimal overhead due to locks
Solution: Remove all unnecessary locks
Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Parallel Run 3: high degree of parallelism, high throughput. Plotted runs: Serial, Parallel 1, Parallel 2, Parallel 3.]
Case Study: Controlled feed/parallel data loads
Controlled feed load methodology
• Explicit throttling on parallelism and throughput
– Parallel extraction and loading
– Prioritization of asynchronous processing capacity
• Manage inhibitors in complex jobs
– Data Skews
– Multiple Locks
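The transcript does not describe how the controlled feed was implemented. As one possible client-side interpretation, the sketch below caps how many batches are in flight at once with a bounded thread pool. `run_controlled_feed` and `FakeLoader` are hypothetical names, and `FakeLoader.submit` only simulates a batch submission so that the cap can be observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_controlled_feed(batches: Sequence[list],
                        submit_batch: Callable[[list], int],
                        max_parallel: int) -> List[int]:
    """Submit batches with at most max_parallel of them in flight at any time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(submit_batch, batches))


class FakeLoader:
    """Stand-in for a real batch submission; records the peak concurrency seen."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def submit(self, batch: list) -> int:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # simulate the time a batch takes to process
        with self._lock:
            self.active -= 1
        return len(batch)


loader = FakeLoader()
batches = [list(range(10)) for _ in range(20)]
loaded = run_controlled_feed(batches, loader.submit, max_parallel=4)
print(sum(loaded), loader.peak <= 4)  # 200 True
```

Lowering `max_parallel` trades peak throughput for fewer concurrent batches, which is one way to limit contention in jobs with data skew or multiple locks.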
Parallelism vs. throughput of a single job
[Chart: throughput (records/min, 0 to 350,000) against degree of parallelism (1 to 20). Controlled Feed Run: reduced parallelism, expected throughput. Plotted runs: Serial, Parallel 1, Parallel 2, Parallel 3, Controlled Feed.]
Related wiki article and Architect Core Resources
developer.salesforce.com/architect
http://bit.ly/bulkapi-repo
Recap
Make your parallel data loads RIP
• Realize
– Locks inhibit parallelism and throughput
• Investigate
– What is causing the locks
• Plan
– Manage the locks
Q & A
Hervé Maleville
Platform Specialist - France