Data array processing with Java language

Data array processing

Vitalii Tymchyshyn, [email protected]

http://michaelgr.com/

mailto:[email protected]

Data array processing

What will we be talking about

Real life Example

Main problems to be solved

Configurable task processing

Clustering solutions

Grid vs Client-Server

Controlling CPU usageVitalii Tymchyshyn, [email protected]


What will we be talking about

You have a large set of tasks or task stream

Each task is relatively large, it's processing is CPU intensive and require multiple algorithms to be used

The goal is to maximize overall performance, not minimize processing time of single task

Real life Example

The task is web crawling with content analysis

Complex Artificial Intelligence algorithms are used, each release changes resource consumption schema

Some algorithms take single page, others take domain as a whole

Target: 1000s of domains, 100000s of pages per day

Main problems to be solved

Make single task processing be configurable, so that algorithms are independent and easily extended/replaced

Make solution scalable & solid

Make it use the equipment fully, but without overload

1. Configurable task processing

The most popular way is IoC container, e.g. Spring

Another option is data flow – beans do not call each other layer by layer, but are called by container one by one, taking input and producing output

Let's compare the options and check if second option has any benefits with “Hello, world”

“Hello, world” with IoC

Daytime retrieverGreeting text reader

Greeting printer Answer printer

User input reader

Hello World runner

“Hello, world” graph dataflow

Read greeting textCheck day time

Print greeting

Read user input

Print answer

Graph dataflow Pros & Cons

Pros

Full decoupling

Easy parallel processing, clustering & savepoints

Automatic flow management

Single call to get data needed in many places

Data types instead of interfaces

Graph dataflow Pros & Cons

Cons

IoC is still good to use to manage common resources and complex nodes :)

Our graph vs BPM

Lighweight

Connections is data, no central storage

Targeted on small (minutes to hours) automated CPU intensive jobs, subtasks from millisecons to minutes

Highly configurable clustering

Conslusion

With graph dataflow we have algorithm parts as independent blocks

Time to use this block to fill our equipment efficiently.

2. Scalability with Clustering

Simple way is:

to have multiple Vms

each fully does it's set of tasks

each task has it's working set on it's hand

2. Scalability with Clustering

But:

Each algorithm initialization data takes memory while only one algorithm is running

Algorithm may require only small part of task data

Task processing at some point may be split and processed in parallel

Solution: Clustering

Example: web domain processing

Get domain data

Mark cities(needs world city index)

Detect addresses

Define primary address

Example: cluster setup

Primary processor, reads task &

Performs primary address detection

City mark processor,Needs memory for city index,

Works page by page,fast

Address detector,Works page by page,

Slow but you can have manyof this because of low

Memory footprint

Cluster in focus

10-20 hardware nodes

FreeBSD OS, jails in use, so no multicast

Oracle Grid Engine (formely SGE) as cluster processes controller

Complex, memory consuming tasks, with JNI (crashes, long GC)

Two faces of cluster Janus

Data cluster

Is good for task data to be stored in

Can be replaced with central data warehouse, but scalability will suffer

You would better separate it from computing VM if computing is complex

Can perform Computing cluster functions

Computing cluster

Is good for running tasks from multiple task producers

Can be grid-based or client-server

Multiple small clusters may be better then one large

Hazelast

One of few free Data Grids

Has built-in Computing Grid abstractions

Good support from developers

but

Bugs in non-usual usages

Simply did not work reliably in our environment

Grid Gain

May fit like a glove

You'd better not make mitten out of glove

Heavy annotations use – problems with runtime configuration

Weakened interfaces – here are shotgun, you have your foot

Unsafe vendor support

ZooKeeper

The thing we are using now

Low level primitives, yet there are 3rd party libraries with high level

Client-Server architecture

Clustered servers for stability and read scalability.

No write scalability

Part of Hadoop

HDFS

Has Single Point of Failure

Name node memory requirements are linear from number of files

Uses TCP (don't forget to tune OS tcp stack)

Much like unix file system

Again, part of hadoop

Two types of The Time

Wall Time = CPU Time + Wait + LatencyWall Time = CPU Time + Wait + Latency

External wait is managed with cooperative multitasking (discussed later) Latency is vital for interactive services, but has low priority for data processing


Grid Client-Server


Latency is two times less

A lot more connections

Everyone is watching

Complex cluster membership change procedure

More robust

Servers can be clustered

Central point for debugging

No “watching deads” overhead for everyone

Conclusion

Now our tasks are spread on our equipment.

Time to prevent resource overload!

3. Controlling CPU usage

Low load means processing power not used

High load means that:Parallel tasks unnecessary take memory

High System time because of context switches

CPU caches are split on different switching tasks

Lower total throughput

Parallel vs Sequential on single CPU

Task 1 Task 3Task 2 Task 4

Task 1

Task 4

Task 3

Task 2

Sequential (LA=1):

Average finish time = (1+2+3+4)/4 = 2.5

Parallel (LA=4):

Average finish time = (4+4+4+4)/4 = 4

Thread-pool:

Multiple tasks are processed at one time by different threads

There should be enough threads to use CPU while someone's blocked

There should not be too much threads for not to overload CPU

Event-based:

One task is processed at given time

There must be no blocking

Any blockable call must be replaced with callback / event

Popular options to process tasks

Thread-pool vs event-based

Thread-pool:

Pros

Simple procedural model

A lot of libraries & frameworks

Cons

Context-switch storms

Per-thread memory

Average speed

Event-based:

Pros

Optimal pool size

No waiting threads memory overhead

Cons

More complex event-based programming

Little supporting libraries & frameworks

Introducing cooperative multitask

It's much like thread-pool variant, but:

Each wait (IO) is signaled to multitask coordinator

Only one thread can be in no-wait state, another thread exiting wait will block on a mutex.

If system is overloaded (mutex is always taken), new tasks are not run.

Cooperative multitasking features

Still simple procedural modelControlled CPU usage

Low waiting thread memory usage because of no layered calls in graph dataflow

All regular frameworks & libraries are availableDynamic thread pool size

Author contacts

Vitalii Tymchyshyn [email protected]

Skyрe: vittymchyshyn


Data array processing with Java language

Technology

Transcript of Data array processing with Java language