Batching and Java EE (jdk.io)

Post on 19-Jan-2017


BATCHING AND JAVA EE

Ryan Cuprak

What is Batch Processing?

Batch jobs are typically:

• Bulk-oriented

• Non-interactive

• Potentially compute intensive

• May require parallel execution

• May be invoked ad hoc, on a schedule, on demand, etc.

Batching Examples

• Monthly reports/statements

• Daily data cleanup

• One-time data migrations

• Data synchronization

• Data analysis

• Portfolio rebalancing

Introducing Java EE Batching

• Introduced in Java EE 7

• JSR 352 - https://jcp.org/en/jsr/detail?id=352

• Reference implementation: https://github.com/WASdev/standards.jsr352.jbatch

• Batch Framework:
  • Batch container for execution of jobs
  • XML Job Specification Language
  • Batch annotations and interfaces
  • Supporting classes and interfaces for interacting with the container

• Depends on CDI

Java EE Batching Overview

[Diagram: JobOperator, Job, Step and JobRepository. A Job has one or more Steps (1 to *); a Step has one ItemReader, one ItemProcessor and one ItemWriter.]

Java EE Batching Overview

• Job – EndOfDayJob

• JobInstance (many per Job) – EndOfDayJob for 9/1/2016

• JobExecution (many per JobInstance) – First attempt at EndOfDay job for 9/1/2016

Java EE Batching Features

• Fault-tolerant – checkpoints and job persistence

• Transactions - chunks execute within a JTA transaction

• Listeners – notification of job status/completion/etc.

• Resource management – limits concurrent jobs

• Starting/stopping/restarting – job control API

Java EE Batching Deployment

Deploy batch jobs in: WAR, EAR, JAR

Manage jobs – split application into modules:

• Server A – frontend.war

• Server B – app.war: End of Day Job, Cleanup Job

• Server C – app2.war: Analytics Job

Batchlet

Exit Codes

Code        Description
STARTING    Job has been submitted to the runtime.
STARTED     Batch job has started executing.
STOPPING    Job termination has been requested.
STOPPED     Job has been stopped.
FAILED      Job has thrown an error, or a failure was triggered by <failure>.
COMPLETED   Job has completed normally.
ABANDONED   Job cannot be restarted.

Basic Layout

CDI Configuration

Job Configuration

Batchlet

Job Configuration

META-INF/batch-jobs/<job-name>.xml

Batch Runtime

Batchlet with Termination

Batchlets should implement stop() and terminate when requested!
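The idea of a batchlet that honors a stop request can be sketched in plain Java. This is a minimal illustration, not the code from the original slide: the local `Batchlet` interface is a stand-in that copies the two method names of `javax.batch.api.Batchlet` so the example compiles without a Java EE container, and `CleanupBatchlet` with its "units of work" is hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for javax.batch.api.Batchlet (same two method names), declared
// locally so the sketch compiles without a Java EE container.
interface Batchlet {
    String process() throws Exception;
    void stop() throws Exception;
}

// Hypothetical batchlet that works through units of work and checks a
// stop flag between units.
class CleanupBatchlet implements Batchlet {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);
    private final int totalUnits;
    private int unitsDone = 0;

    CleanupBatchlet(int totalUnits) {
        this.totalUnits = totalUnits;
    }

    @Override
    public String process() {
        while (unitsDone < totalUnits) {
            if (stopRequested.get()) {
                return "STOPPED";   // exit status reported when stopped early
            }
            unitsDone++;            // placeholder for one unit of real work
        }
        return "COMPLETED";
    }

    @Override
    public void stop() {
        // Invoked from a different thread than process(); it only sets the flag.
        stopRequested.set(true);
    }

    int unitsDone() {
        return unitsDone;
    }
}
```

The design choice here is cooperative cancellation: `stop()` never interrupts the work itself, so the batchlet can finish its current unit and leave data in a consistent state before returning.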

Batching & Resources

Concurrent Resources

IDs and Names

instanceId
• ID that represents an instance of a job.
• Created when the JobOperator start method is invoked.

executionId
• ID that represents the next attempt to run a particular job instance.
• Created when a job is started/restarted.
• Only one execution of a job instance can be running at a time.

stepExecutionId
• ID for an attempt to execute a particular step in a job.

jobName
• Name of the job from the XML (actually the id attribute): <job id="">

jobXMLName
• Name of the config file in META-INF/batch-jobs

JobInstance vs. JobExecution

One JobInstance (1) has many JobExecutions (*).

JobInstance
• instanceId
• jobName

JobExecution
• batchStatus
• createTime
• endTime
• executionId
• exitStatus
• jobName
• jobParameters
• lastUpdatedTime
• startTime
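The relationship between the two IDs can be illustrated with a small in-memory model. This is only a sketch of the bookkeeping described above: `JobIdModel` is hypothetical, it is not the JobRepository of any real implementation, and its `start` and `restart` methods merely mirror the names of the corresponding JobOperator methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of instanceId/executionId bookkeeping.
class JobIdModel {
    private long nextInstanceId = 1;
    private long nextExecutionId = 1;
    private final Map<Long, String> jobNameOfInstance = new HashMap<>();
    // executionId -> instanceId, kept sorted by executionId
    private final Map<Long, Long> instanceOfExecution = new TreeMap<>();

    // Like JobOperator.start: creates a new instance and its first execution.
    long start(String jobName) {
        long instanceId = nextInstanceId++;
        jobNameOfInstance.put(instanceId, jobName);
        return newExecution(instanceId);
    }

    // Like JobOperator.restart: same instance, new execution.
    long restart(long executionId) {
        Long instanceId = instanceOfExecution.get(executionId);
        if (instanceId == null) {
            throw new IllegalArgumentException("Unknown execution " + executionId);
        }
        return newExecution(instanceId);
    }

    long instanceIdOf(long executionId) {
        return instanceOfExecution.get(executionId);
    }

    // All executions (attempts) recorded for one instance, oldest first.
    List<Long> executionsOf(long instanceId) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : instanceOfExecution.entrySet()) {
            if (entry.getValue() == instanceId) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    private long newExecution(long instanceId) {
        long executionId = nextExecutionId++;
        instanceOfExecution.put(executionId, instanceId);
        return executionId;
    }
}
```

Starting "EndOfDayJob" and then restarting it yields two execution IDs that map to the same instance ID, while a second start yields a different instance.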

Managing Jobs

• JobOperator – interface for operating on batch jobs.
  • BatchRuntime.getJobOperator()

• JobOperator:
  • Provides information on current and completed jobs
  • Used to start/stop/restart/abandon jobs
  • Security is implementation dependent
  • JobOperator interacts with JobRepository

• JobRepository
  • Implementation is outside the scope of the JSR
  • No API for deleting old jobs
  • Reference implementation provides no API for cleanup!

JobOperator Methods

Type                  Method
void                  abandon(long executionId)
JobExecution          getJobExecution(long executionId)
List<JobExecution>    getJobExecutions(JobInstance instance)
JobInstance           getJobInstance(long executionId)
int                   getJobInstanceCount(String jobName)
List<JobInstance>     getJobInstances(String jobName, int start, int count)
Set<String>           getJobNames()
Properties            getParameters(long executionId)
List<Long>            getRunningExecutions(String jobName)
List<StepExecution>   getStepExecutions(long jobExecutionId)
long                  restart(long executionId, Properties restartParams)
long                  start(String jobXMLName, Properties jobParams)
void                  stop(long executionId)

Listing Batch Jobs

Chunking

• Chunking is the primary pattern for batch processing in JSR-352.

• Encapsulates the ETL pattern:
  • Pieces: Reader/Processor/Writer
  • Reader/Processor invoked until an entire chunk of data is processed.
  • Output is written atomically

• Implementation:
  • Interfaces: ItemReader/ItemWriter/ItemProcessor
  • Classes: AbstractItemReader/AbstractItemWriter (ItemProcessor has no abstract base class)

[Diagram: Reader → Processor → Writer]
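The read/process/write cycle described above can be sketched as a plain loop. This is a simplified simulation, not the batch container's implementation: the three interfaces are local stand-ins that reuse the method names of the `javax.batch.api.chunk` interfaces (the real methods also declare `throws Exception` and the reader and writer have additional lifecycle methods), and `ChunkLoop` is hypothetical. Transactions and checkpoints are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins reusing method names from javax.batch.api.chunk.
interface ItemReader { Object readItem(); }                    // null = no more input
interface ItemProcessor { Object processItem(Object item); }   // null = filter item out
interface ItemWriter { void writeItems(List<Object> items); }

class ChunkLoop {
    // Reads and processes up to itemCount items, then writes them in one call.
    // Repeats until the reader is exhausted; returns the number of chunks written.
    static int run(ItemReader reader, ItemProcessor processor,
                   ItemWriter writer, int itemCount) {
        int chunks = 0;
        boolean endOfInput = false;
        while (!endOfInput) {
            List<Object> buffer = new ArrayList<>();
            int read = 0;
            while (read < itemCount) {
                Object item = reader.readItem();
                if (item == null) {
                    endOfInput = true;
                    break;
                }
                read++;
                Object result = processor.processItem(item);
                if (result != null) {
                    buffer.add(result);
                }
            }
            if (!buffer.isEmpty()) {
                // In a container this write happens inside the chunk's transaction.
                writer.writeItems(buffer);
                chunks++;
            }
        }
        return chunks;
    }
}
```

With 25 input items and an item count of 10, the writer is called three times, with 10, 10 and 5 items.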

Chunking

Chunk Configuration

Parameter           Description
checkpoint-policy   Possible values: item or custom.
item-count          Number of items to be processed per chunk. Default is 10.
time-limit          Time in seconds before taking a checkpoint. Default is 0 (no time limit; a checkpoint is taken after each chunk).
skip-limit          Number of exceptions a step will skip if there are configured skippable exceptions.
retry-limit         Number of times a step will be retried if it has thrown a retryable exception.

Skippable Exceptions
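The effect of skip-limit can be sketched in plain Java. This is an illustration of the counting rule only, under the assumption that a single exception type is configured as skippable; `SkipLimitSketch`, `BadRecordException` and `parse` are hypothetical names, and a real batch container applies the rule to the reader, processor and writer rather than to a simple function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class SkipLimitSketch {
    // Hypothetical exception type that the step is configured to skip.
    static class BadRecordException extends RuntimeException {
        BadRecordException(String message) {
            super(message);
        }
    }

    // Hypothetical processor: turns text into a number or rejects the record.
    static Integer parse(String text) {
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            throw new BadRecordException("not a number: " + text);
        }
    }

    // Processes every item. Up to skipLimit BadRecordExceptions are skipped;
    // the next one is rethrown, which corresponds to the step failing.
    static <I, O> List<O> processAll(List<I> items, Function<I, O> processor, int skipLimit) {
        List<O> results = new ArrayList<>();
        int skipped = 0;
        for (I item : items) {
            try {
                results.add(processor.apply(item));
            } catch (BadRecordException e) {
                if (skipped >= skipLimit) {
                    throw e;
                }
                skipped++;
            }
        }
        return results;
    }
}
```

For the input "1", "x", "2", "y", "3" a skip limit of 2 produces the three parsed numbers, while a skip limit of 1 fails on the second bad record.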

Chunking

[Sequence diagram: the Step calls read() on the ItemReader and process(item) on the ItemProcessor for each item, then write(items) on the ItemWriter; execute() returns an ExitStatus.]

Chunking: ItemReader
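A reader that supports checkpoints and restart might look roughly like the sketch below. It is not the code from the original slide: `ListItemReader` is hypothetical and deliberately does not implement the real interface, so that it compiles without a container. Its four methods copy the names and parameter types of `javax.batch.api.chunk.ItemReader` (whose methods additionally declare `throws Exception`).

```java
import java.io.Serializable;
import java.util.List;

// Hypothetical reader over an in-memory list, with restart support.
class ListItemReader {
    private final List<String> records;
    private int position;

    ListItemReader(List<String> records) {
        this.records = records;
    }

    // checkpoint is null on a first start; on a restart it is the last
    // value that checkpointInfo() returned.
    public void open(Serializable checkpoint) {
        position = (checkpoint == null) ? 0 : (Integer) checkpoint;
    }

    // Returns the next record, or null when the input is exhausted.
    public Object readItem() {
        return position < records.size() ? records.get(position++) : null;
    }

    // Position of the next unread record, saved at each checkpoint.
    public Serializable checkpointInfo() {
        return Integer.valueOf(position);
    }

    public void close() {
        // Nothing to release for an in-memory list.
    }
}
```

A reader opened with the checkpoint taken after two records resumes at the third record instead of starting over.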

Chunking: ItemProcessor

Chunking: ItemWriter

Demo

Runtime Parameters

Set Property

Retrieve Property

Pre-Defined Properties

Set Property

Property Injected

Step Exceptions

• Parallel running instances (partition) complete before the job completes.

• Batch status transitions to FAILED

Job Listener Configuration

Listener Config

Job Listener Implementation

Step Listener Configuration

Listener Config

Step Listener Implementation

Partition Configuration

Partition Implementation

Decision Configuration

Decision

What next?

Decision Implementation

Dependency Injection!

Flow & Splits JSL

• <flow> element is used to implement process workflows.

• <split> element is used to run flows in parallel.

[Diagram: a split containing updateExisting and processNewStorms; other labels shown: retrieveTracking, processDecider, stormReader, stormProcessor, stormWriter, updateExistingStorms.]
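The difference between a flow and a split can be sketched with plain Java concurrency. This is an analogy rather than the container's implementation: `SplitSketch` is hypothetical, a "step" is reduced to a name appended to a log, and the step names inside the second flow in the usage below (`loadNewStorms`, `storeNewStorms`) are invented for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

class SplitSketch {
    // A flow: its steps run one after another, in the order given.
    static Runnable flow(List<String> log, String... steps) {
        return () -> {
            for (String step : steps) {
                log.add(step);   // placeholder for executing the step
            }
        };
    }

    // A split: starts every flow concurrently and waits until all have finished.
    static void split(Runnable... flows) {
        CompletableFuture<?>[] running = new CompletableFuture<?>[flows.length];
        for (int i = 0; i < flows.length; i++) {
            running[i] = CompletableFuture.runAsync(flows[i]);
        }
        CompletableFuture.allOf(running).join();
    }
}
```

A usage example, assuming a thread-safe log such as `CopyOnWriteArrayList`: `SplitSketch.split(SplitSketch.flow(log, "updateExistingStorms"), SplitSketch.flow(log, "loadNewStorms", "storeNewStorms"))`. The order between the two flows is not defined, but within the second flow `loadNewStorms` is always logged before `storeNewStorms`.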

Flows & Splits

Checkpoint Algorithm Configuration

Checkpoint Algorithm Implementation

Hadoop Overview

• Massively scalable storage and batch data processing system

• Written in Java

• Huge ecosystem

• Meant for massive data processing jobs

• Horizontally scalable

• Uses MapReduce programming model

• Handles processing of petabytes of data

• Started at Yahoo! in 2005.

Hadoop

• MapReduce (Distributed Computation)

• HDFS (Distributed Storage)

• YARN Framework

• Common Utilities

Hadoop

Typically Hadoop is used when:

• Analysis is performed on unstructured datasets

• Data is stored across multiple servers (HDFS)

• Non-Java processes are fed data and managed

Ex. https://svi.nl/HuygensSoftware

Spring vs. Java EE Batching

• Spring Batch 3.0 implements JSR-352!

• Batch artifacts developed against JSR-352 won’t work within a traditional Spring Batch Job

• Same two processing models as Spring Batch:
  • Item – aka chunking
  • Task – aka Batchlet

Terminology Comparison

JSR-352          Spring Batch
Job              Job
Step             Step
Chunk            Chunk
Item             Item
ItemReader       ItemReader/ItemStream
ItemProcessor    ItemProcessor
ItemWriter       ItemWriter/ItemStream
JobInstance      JobInstance
JobExecution     JobExecution
StepExecution    StepExecution
JobListener      JobExecutionListener
StepListener     StepExecutionListener

Scaling Batch Jobs

• Traditional Spring Batch Scaling:
  • Split – running multiple steps in parallel
  • Multiple threads – executing a single step via multiple threads
  • Partitioning – dividing data up for parallel processing
  • Remote Chunking – executing the processor logic remotely

• JSR-352 Job Scaling
  • Split – running multiple steps in parallel
  • Partitioning – dividing data up – implementation slightly different.
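Partitioning, in the sense of dividing data up for parallel processing, can be sketched with a plain thread pool. This is only an illustration of the idea: JSR-352 expresses partitions through its own partition plan and mapper artifacts, which are not used here, and `PartitionSketch` with its `plan` and `sum` methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionSketch {
    // Splits the index range [0, total) into contiguous {start, end} pairs,
    // spreading any remainder over the first partitions.
    static List<int[]> plan(int total, int partitions) {
        List<int[]> ranges = new ArrayList<>();
        int base = total / partitions;
        int extra = total % partitions;
        int start = 0;
        for (int p = 0; p < partitions; p++) {
            int size = base + (p < extra ? 1 : 0);
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    // Sums the data with one task per partition, then combines the partial sums.
    static long sum(int[] data, int partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int[] range : plan(data.length, partitions)) {
                parts.add(pool.submit(() -> {
                    long partial = 0;
                    for (int i = range[0]; i < range[1]; i++) {
                        partial += data[i];
                    }
                    return partial;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Partition failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Splitting 10 items into 3 partitions gives the ranges 0 to 4, 4 to 7 and 7 to 10; each partition works only on its own range, so no coordination is needed until the partial results are combined.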

JSR-352/Spring/Hadoop

Hadoop
• Massively parallel / large jobs
• Processing petabytes of data (BIG DATA)

JSR-352/Spring
• Traditional batch processing jobs
• Structured data/business processes

JSR-352 vs. Spring
• Java EE versus Spring containers
• Spring has better job scaling capabilities

JSR-352 Implementations

• JBeret
  • http://tinyurl.com/z4qx3wo
  • WildFly/JBoss

• jbatch (reference)
  • http://tinyurl.com/jk6vcb8
  • WebSphere/WebLogic/Payara

• SpringBatch
  • http://tinyurl.com/mt8v3k7

Best Practices

• Package/deploy batch jobs separately

• Implement logic to clean up old jobs

• Implement logic for auto-restart

• Test restart and checkpoint logic

• Configure database to store jobs

• Configure thread pool for batch jobs

• Only invoke batch jobs from logic that is secured (@Role etc.)

Resources

• JSR-352
  https://jcp.org/en/jsr/detail?id=352

• Java EE Support
  http://javaee.support/contributors/

• Spring Batch
  http://docs.spring.io/spring-batch/reference/html/spring-batch-intro.html

• Spring JSR-352 Support
  http://docs.spring.io/spring-batch/reference/html/jsr-352.html

Resources

• Java EE 7 Batch Processing and World of Warcraft
  http://tinyurl.com/gp8yls8

• Three Key Concepts for Understanding JSR-352
  http://tinyurl.com/oxe2dhu

• Java EE Tutorial
  https://docs.oracle.com/javaee/7/tutorial/batch-processing.htm

Q&A

Email: rcuprak@gmail.com

Twitter: @ctjava