Fast parallel data loading with the bulk API

Salesforce API Series Fast Parallel Data Loading with the Bulk API July 15, 2014

description

Can you load 20 million records into Salesforce in under an hour? If not, this webinar is for you.

Transcript of Fast parallel data loading with the bulk API

Page 1: Fast parallel data loading with the bulk API

Salesforce API Series: Fast Parallel Data Loading with the Bulk API, July 15, 2014

Page 2: Fast parallel data loading with the bulk API

#forcewebinar

Speaker

Hervé Maleville, Platform Specialist - France

Page 3: Fast parallel data loading with the bulk API

Safe Harbor

Safe harbor statement under the Private Securities Litigation Reform Act of 1995:

This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services.

The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our quarterly report on Form 10-Q for the most recent fiscal quarter. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site.

Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Page 4: Fast parallel data loading with the bulk API

Follow Developer Force for the Latest News

@forcedotcom / #forcewebinar

Developer Force – Force.com Community

+Developer Force – Force.com Community

Developer Force

Developer Force Group

Page 5: Fast parallel data loading with the bulk API

How fast can you load data into Salesforce?

Page 6: Fast parallel data loading with the bulk API

How many records can you load into Salesforce in 1 hour?

Page 7: Fast parallel data loading with the bulk API

Data load throughput

[Chart: data load throughput in records/hour, on a scale up to 25,000,000, with ranges labeled OK, Fast, and Faster]

Page 8: Fast parallel data loading with the bulk API

Parallel processing

Page 9: Fast parallel data loading with the bulk API

A parallel processing analogy: digging a ditch

Page 10: Fast parallel data loading with the bulk API

Serial processing

Page 11: Fast parallel data loading with the bulk API

Parallel processing

Page 12: Fast parallel data loading with the bulk API

Degree of Parallelism

The number of processes or threads associated with an operation.

Page 13: Fast parallel data loading with the bulk API

Optimal parallel processing

[Diagram: time axis comparing a serial load of 20M records with a parallel load of four 5M-record streams]

Page 14: Fast parallel data loading with the bulk API

Sub-optimal parallel processing

[Diagram: time axis comparing a serial load of 20M records with a parallel load of four 5M-record streams]

Page 15: Fast parallel data loading with the bulk API

Locks, exceptions, triggers, relationships, …

[Diagram: time axis comparing a serial load of 20M records with a parallel load of four 5M-record streams, labeled "Throughput inhibitors"]

Page 16: Fast parallel data loading with the bulk API

Data load case studies

• Get hands on with the Salesforce Bulk API
• Contrast serial data loads vs. parallel data loads
• Measure degrees of parallelism and throughput
• Identify and avoid throughput inhibitors
• Achieve maximum throughput

Page 17: Fast parallel data loading with the bulk API

Prep work

Page 18: Fast parallel data loading with the bulk API

Salesforce Bulk API

• Asynchronous data loading
• Optimized for large data sets
• REST API
• Powers many tools
• Use to build custom tools with any programming language (Java, etc.)

Page 19: Fast parallel data loading with the bulk API

Demo schema

Page 20: Fast parallel data loading with the bulk API

Bulk API Loads that … RIP

Realize, Investigate, and Plan

Page 21: Fast parallel data loading with the bulk API

Case Studies

Page 22: Fast parallel data loading with the bulk API

Case Study: Serial Data Load

Page 23: Fast parallel data loading with the bulk API

Serial load: Expected plan

[Diagram: batches scheduled across a row of server threads over time]

• One job
• 100 batches
• 10,000 records/batch
• 1M total records

Page 24: Fast parallel data loading with the bulk API

Serial load: Job configuration
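
The job configuration shown on this slide did not survive in the transcript. As a rough sketch only, assuming the XML job format of the Bulk API, a job request body with an explicit concurrency mode could be built as below. The helper name `build_job_info` and the object name `Account` are hypothetical; the namespace, element names, and element order should be checked against the Bulk API reference before use.

```python
from xml.etree import ElementTree as ET

# Namespace assumed for Bulk API job documents; verify against the API reference.
BULK_NS = "http://www.force.com/2009/06/asyncapi/dataload"


def build_job_info(sobject: str, concurrency_mode: str = "Parallel",
                   operation: str = "insert", content_type: str = "CSV") -> str:
    """Return a jobInfo XML body (hypothetical helper, not part of any SDK)."""
    if concurrency_mode not in ("Serial", "Parallel"):
        raise ValueError("concurrency_mode must be 'Serial' or 'Parallel'")
    root = ET.Element("jobInfo", xmlns=BULK_NS)
    # Element order follows the assumed schema order.
    for tag, value in (
        ("operation", operation),
        ("object", sobject),
        ("concurrencyMode", concurrency_mode),
        ("contentType", content_type),
    ):
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


# "Account" is a placeholder object name; the demo schema is not in the transcript.
serial_job = build_job_info("Account", concurrency_mode="Serial")
parallel_job = build_job_info("Account", concurrency_mode="Parallel")
print(serial_job)
```

The only difference between the serial and parallel case studies, under this assumption, is the value of the `concurrencyMode` element.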

Page 25: Fast parallel data loading with the bulk API

Serial load: Batch creation
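
The batch creation step is also missing from the transcript. The sketch below shows one way to split a load file into fixed-size CSV batches, in the spirit of the plan above (100 batches of 10,000 records). The function name `csv_batches` is hypothetical, and repeating the header row in every batch is an assumption about how each batch is submitted. The example is scaled down to 1,000 records in batches of 10 so it runs instantly.

```python
import csv
import io
from typing import Iterable, Iterator, List


def csv_batches(header: List[str], rows: Iterable[List[str]],
                batch_size: int = 10_000) -> Iterator[str]:
    """Yield CSV documents of at most batch_size records, each with the header row."""

    def render(chunk: List[List[str]]) -> str:
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(chunk)
        return buf.getvalue()

    chunk: List[List[str]] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == batch_size:
            yield render(chunk)
            chunk = []
    if chunk:  # final partial batch, if any
        yield render(chunk)


# Scaled-down stand-in for the 1M-record plan: 1,000 records, 10 per batch.
rows = ([f"Name {i}"] for i in range(1000))
batches = list(csv_batches(["Name"], rows, batch_size=10))
print(len(batches))  # 100
```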

Page 26: Fast parallel data loading with the bulk API

Serial load: Batch run

Page 27: Fast parallel data loading with the bulk API

Demo: Serial load

Page 28: Fast parallel data loading with the bulk API

Serial load summary

Concurrency Mode: Serial
Records Loaded: 1 million
Records Failed: 0
Run Time: 77 minutes
Work Completed: 75 minutes
Throughput: 13,000 records per minute
Degree of Parallelism: 0.97
Key Problem: Degree of parallelism explicitly limited to ~1.
Solution: Explore parallel load for increased throughput.
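
The serial figures are consistent with two simple ratios: throughput as records loaded divided by run time, and degree of parallelism as work completed divided by run time. That reading is inferred from the numbers on the slide, not a definition stated in the deck, and the function name `load_metrics` is hypothetical.

```python
def load_metrics(records_loaded: int, run_minutes: float, work_minutes: float) -> dict:
    """Derive throughput and degree of parallelism from a job summary.

    Assumed formulas (inferred from the slide figures):
      throughput            = records loaded / run time
      degree of parallelism = work completed / run time
    """
    return {
        "throughput_per_min": records_loaded / run_minutes,
        "degree_of_parallelism": work_minutes / run_minutes,
    }


# Serial load summary: 1M records, 77 minutes run time, 75 minutes of work.
serial = load_metrics(1_000_000, run_minutes=77, work_minutes=75)
print(round(serial["throughput_per_min"]))        # 12987, about 13,000 per minute
print(round(serial["degree_of_parallelism"], 2))  # 0.97
```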

Page 29: Fast parallel data loading with the bulk API

Parallelism vs. Throughput of a Single Job

[Chart: throughput (records/min, 0 to 350,000) plotted against degree of parallelism (1 to 20). Series: Serial.]

Serial Run
• Low degree of parallelism

Page 30: Fast parallel data loading with the bulk API

Case Study: Parallel data loads

Page 31: Fast parallel data loading with the bulk API

Parallel load: Expected plan

[Diagram: batches scheduled across a row of server threads over time]

• One job
• 100 batches
• 10,000 records/batch
• 1M total records

Page 32: Fast parallel data loading with the bulk API

Parallel load: Job configuration

Page 33: Fast parallel data loading with the bulk API

Demo: Parallel 1

Page 34: Fast parallel data loading with the bulk API

Things to watch for

Locks can significantly affect parallel loads
– Wasted processing capacity
– Reduced throughput
– Failures

Retry logic is not all it's cracked up to be

Page 35: Fast parallel data loading with the bulk API

Parallel load 1 summary

Concurrency Mode: Parallel
Records Loaded: 396,600
Records Failed: 603,400
Run Time: 17 minutes
Work Completed: 3 hours 15 minutes
Throughput: 22,000 records per minute
Degree of Parallelism: 11.5
Key Problem: Lock exceptions. Server worked significantly harder but no increase in throughput.
Solution: Run the load in serial mode or manage locks.

Page 36: Fast parallel data loading with the bulk API

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) plotted against degree of parallelism (1 to 20). Series: Serial, Parallel 1.]

Parallel Run 1
• High degree of parallelism
• Low throughput due to locks

Page 37: Fast parallel data loading with the bulk API

Time to optimize

Let’s make your data load RIP

Realize
– Locks inhibit parallelism and throughput
Investigate
– What is causing the locks
Plan
– Manage the locks

Page 38: Fast parallel data loading with the bulk API

Demo: Parallel load 2

Eliminate Locks by Modifying Schema

Page 39: Fast parallel data loading with the bulk API

Parallel load: Sample results

Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 3 minutes and 30 seconds
Work Completed: 1 hour
Throughput: 320,000 records per minute
Degree of Parallelism: 19
Key Problem: None
Solution: n/a

Page 40: Fast parallel data loading with the bulk API

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) plotted against degree of parallelism (1 to 20). Series: Serial, Parallel 1, Parallel 2.]

Parallel Run 2
• High degree of parallelism
• High throughput

Page 41: Fast parallel data loading with the bulk API

Locks can be managed by

• Elimination
• Ordering load file

Page 42: Fast parallel data loading with the bulk API

Demo: Parallel load 3

Avoid Locks with Ordered Data
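
A minimal sketch of ordering a load file, assuming the lock contention comes from child records in different parallel batches that share the same parent record. Sorting by the parent key keeps each parent's children contiguous, so they fall into as few batches as possible. The field name `parent_id`, the helper names, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def batches_per_parent(rows: List[dict], batch_size: int,
                       key: str = "parent_id") -> Dict[str, Set[int]]:
    """Map each parent key to the set of batch numbers containing its children."""
    touched: Dict[str, Set[int]] = defaultdict(set)
    for index, row in enumerate(rows):
        touched[row[key]].add(index // batch_size)
    return touched


def max_batches_sharing_a_parent(rows: List[dict], batch_size: int) -> int:
    """Worst case: how many batches could contend for the same parent."""
    return max(len(b) for b in batches_per_parent(rows, batch_size).values())


# Hypothetical load file: 1,000 child records spread round-robin over 10 parents.
rows = [{"name": f"Child {i}", "parent_id": f"P{i % 10}"} for i in range(1000)]

unordered = max_batches_sharing_a_parent(rows, batch_size=100)
ordered_rows = sorted(rows, key=lambda r: r["parent_id"])
ordered = max_batches_sharing_a_parent(ordered_rows, batch_size=100)

print(unordered)  # 10: every parent appears in every batch
print(ordered)    # 1: each parent's children sit in a single batch
```

When a parent has more children than fit in one batch, ordering still limits the contention to the few adjacent batches that hold them.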

Page 43: Fast parallel data loading with the bulk API

Managing locks … a discussion while we load

• Master-detail relationships
• Lookup relationships
• Roll-up summary fields
• Triggers
• Workflow rules
• Group membership locks*

Page 44: Fast parallel data loading with the bulk API

Parallel load: Sample results

Concurrency Mode: Parallel
Records Loaded: 1 million
Records Failed: 0
Run Time: 4 minutes
Work Completed: 1 hour
Throughput: 250,000 records per minute
Degree of Parallelism: 16.5
Key Problem: Minimal overhead due to locks
Solution: Remove all unnecessary locks

Page 45: Fast parallel data loading with the bulk API

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) plotted against degree of parallelism (1 to 20). Series: Serial, Parallel 1, Parallel 2, Parallel 3.]

Parallel Run 3
• High degree of parallelism
• High throughput

Page 46: Fast parallel data loading with the bulk API

Case Study: Controlled feed/parallel data loads

Page 47: Fast parallel data loading with the bulk API

Controlled feed load methodology

Explicit throttling on parallelism and throughput
– Parallel extraction and loading
– Prioritization of asynchronous processing capacity

Manage inhibitors in complex jobs
– Data Skews
– Multiple Locks
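
The transcript does not describe how the controlled feed was implemented. One possible client-side reading of "explicit throttling on parallelism" is a bounded worker pool that never has more than a chosen number of batches in flight. The sketch below makes that assumption; `run_controlled_feed` and `fake_submit` are hypothetical names, and `fake_submit` only simulates work instead of calling Salesforce.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_controlled_feed(batches: List[list], submit: Callable[[list], int],
                        max_parallel: int) -> Tuple[List[int], int]:
    """Submit batches with at most max_parallel in flight.

    Returns the per-batch results and the peak concurrency actually observed.
    """
    in_flight = 0
    peak = 0
    lock = threading.Lock()

    def worker(batch: list) -> int:
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return submit(batch)
        finally:
            with lock:
                in_flight -= 1

    # The pool size is the throttle: it caps the client-side degree of parallelism.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(worker, batches))
    return results, peak


def fake_submit(batch: list) -> int:
    """Stand-in for a real batch submission; sleeps briefly and reports the record count."""
    time.sleep(0.01)
    return len(batch)


batches = [[0] * 10 for _ in range(20)]  # 20 hypothetical batches of 10 records
results, peak = run_controlled_feed(batches, fake_submit, max_parallel=4)
print(sum(results), peak)
```

Lowering `max_parallel` trades peak throughput for fewer batches competing for the same locks, which matches the "reduced parallelism, expected throughput" outcome reported for this run.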

Page 48: Fast parallel data loading with the bulk API

Parallelism vs. throughput of a single job

[Chart: throughput (records/min, 0 to 350,000) plotted against degree of parallelism (1 to 20). Series: Serial, Parallel 1, Parallel 2, Parallel 3, Controlled Feed.]

Controlled Feed Run
• Reduced parallelism
• Expected throughput

Page 49: Fast parallel data loading with the bulk API

Related wiki article and Architect Core Resources

developer.salesforce.com/architect

http://bit.ly/bulkapi-repo

Page 50: Fast parallel data loading with the bulk API

Recap

Make your parallel data loads RIP

Realize
– Locks inhibit parallelism and throughput
Investigate
– What is causing the locks
Plan
– Manage the locks

Page 51: Fast parallel data loading with the bulk API

Q & A

Hervé Maleville, Platform Specialist - France