Data array processing with Java language
Transcript of Data array processing with Java language
![Page 2: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/2.jpg)
Data array processing
What will we be talking about
Real life Example
Main problems to be solved
Configurable task processing
Clustering solutions
Grid vs Client-Server
Controlling CPU usage
Vitalii Tymchyshyn, [email protected]
![Page 3: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/3.jpg)
What will we be talking about
You have a large set of tasks or task stream
Each task is relatively large; its processing is CPU-intensive and requires multiple algorithms
The goal is to maximize overall performance, not minimize processing time of single task
![Page 4: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/4.jpg)
Real life Example
The task is web crawling with content analysis
Complex artificial intelligence algorithms are used; each release changes the resource consumption pattern
Some algorithms take single page, others take domain as a whole
Target: 1000s of domains, 100000s of pages per day
![Page 5: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/5.jpg)
Main problems to be solved
Make single-task processing configurable, so that algorithms are independent and easily extended or replaced
Make the solution scalable & solid
Make it use the equipment fully, but without overloading it
![Page 6: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/6.jpg)
1. Configurable task processing
The most popular way is an IoC container, e.g. Spring
Another option is data flow: beans do not call each other layer by layer, but are called by the container one by one, taking input and producing output
Let's compare the two options using “Hello, world” and check whether the second one has any benefits
![Page 7: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/7.jpg)
“Hello, world” with IoC
Daytime retriever
Greeting text reader
Greeting printer
Answer printer
User input reader
Hello World runner
![Page 8: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/8.jpg)
“Hello, world” graph dataflow
Read greeting text
Check day time
Print greeting
Read user input
Print answer
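The dataflow idea on this slide can be sketched in a few lines of Java. This is only an illustration under assumptions: `FlowNode`, `DataflowSketch`, and the string keys (`greetingText`, `dayTime`, and so on) are hypothetical names, not the API of the container used in the talk. Each node declares the data it needs and the data it produces, and a tiny runner calls nodes as soon as their inputs exist, so nodes never call each other.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical node: declares the data items it needs and the one it produces.
final class FlowNode {
    final String output;
    final List<String> inputs;
    final Function<Map<String, Object>, Object> body;

    FlowNode(String output, List<String> inputs, Function<Map<String, Object>, Object> body) {
        this.output = output;
        this.inputs = inputs;
        this.body = body;
    }
}

final class DataflowSketch {
    // Repeatedly runs every pending node whose inputs are already available.
    static Map<String, Object> run(List<FlowNode> nodes, Map<String, Object> initial) {
        Map<String, Object> data = new HashMap<>(initial);
        List<FlowNode> pending = new ArrayList<>(nodes);
        boolean progress = true;
        while (progress && !pending.isEmpty()) {
            progress = false;
            for (FlowNode node : new ArrayList<>(pending)) {
                if (data.keySet().containsAll(node.inputs)) {
                    data.put(node.output, node.body.apply(data));
                    pending.remove(node);
                    progress = true;
                }
            }
        }
        if (!pending.isEmpty()) {
            throw new IllegalStateException("some nodes never received their inputs");
        }
        return data;
    }

    // "Hello, world" graph; the declaration order deliberately does not match the run order.
    static List<FlowNode> helloWorld() {
        return List.of(
            new FlowNode("greeting", List.of("greetingText", "dayTime"),
                d -> d.get("greetingText") + ", good " + d.get("dayTime") + "!"),
            new FlowNode("greetingText", List.of(), d -> "Hello"),
            new FlowNode("dayTime", List.of("hour"),
                d -> ((Integer) d.get("hour")) < 12 ? "morning" : "evening"),
            new FlowNode("answer", List.of("greeting", "userInput"),
                d -> "Nice to meet you, " + d.get("userInput")));
    }
}
```

Calling `DataflowSketch.run(DataflowSketch.helloWorld(), Map.of("hour", 9, "userInput", "Ann"))` fills in `greeting` and `answer` without any node knowing about the others; the runner alone decides the order.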
![Page 9: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/9.jpg)
Graph dataflow Pros & Cons
Pros
Full decoupling
Easy parallel processing, clustering & savepoints
Automatic flow management
Single call to get data needed in many places
Data types instead of interfaces
![Page 10: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/10.jpg)
Graph dataflow Pros & Cons
Cons
IoC is still good to use to manage common resources and complex nodes :)
![Page 11: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/11.jpg)
Our graph vs BPM
Lightweight
Connections are data, no central storage
Targeted at small (minutes to hours), automated, CPU-intensive jobs, with subtasks lasting from milliseconds to minutes
Highly configurable clustering
![Page 12: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/12.jpg)
Conclusion
With graph dataflow we have algorithm parts as independent blocks
Time to use these blocks to load our equipment efficiently.
![Page 13: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/13.jpg)
2. Scalability with Clustering
The simple way is:
to have multiple VMs
each fully processes its own set of tasks
each task has its working set at hand
![Page 14: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/14.jpg)
2. Scalability with Clustering
But:
Each algorithm's initialization data takes memory, even though only one algorithm is running at a time
Algorithm may require only small part of task data
Task processing at some point may be split and processed in parallel
Solution: Clustering
![Page 15: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/15.jpg)
Example: web domain processing
Get domain data
Mark cities (needs world city index)
Detect addresses
Define primary address
![Page 16: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/16.jpg)
Example: cluster setup
Primary processor: reads the task and performs primary address detection
City mark processor: needs memory for the city index, works page by page, fast
Address detector: works page by page, slow, but you can have many of these because of the low memory footprint
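The sizing idea behind this setup can be shown with a single-JVM stand-in. This is an assumption-laden sketch, not the cluster described in the talk: `StagePoolsSketch`, the pool sizes, and the string-appending "processing" are all hypothetical. It only shows that the memory-heavy but fast stage gets one worker, while the slow, small-footprint stage gets several.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class StagePoolsSketch {
    static List<String> process(List<String> pages) {
        // Memory-heavy but fast stage: a single worker holds the (imaginary) city index.
        ExecutorService cityMarker = Executors.newFixedThreadPool(1);
        // Slow stage with a small footprint: several workers in parallel.
        ExecutorService addressDetectors = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> results = pages.stream()
                .map(page -> CompletableFuture
                    .supplyAsync(() -> page + "+cities", cityMarker)
                    .thenApplyAsync(marked -> marked + "+addresses", addressDetectors))
                .collect(Collectors.toList());
            // Wait for every page; results keep the input order.
            return results.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            cityMarker.shutdown();
            addressDetectors.shutdown();
        }
    }
}
```

In a real cluster each pool would be a separate process or node type, and the hand-off between stages would go through the data cluster rather than through in-memory futures.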
![Page 17: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/17.jpg)
Cluster in focus
10-20 hardware nodes
FreeBSD OS, jails in use, so no multicast
Oracle Grid Engine (formerly SGE) as cluster process controller
Complex, memory consuming tasks, with JNI (crashes, long GC)
![Page 18: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/18.jpg)
Two faces of cluster Janus
Data cluster
Is good for task data to be stored in
Can be replaced with central data warehouse, but scalability will suffer
You had better separate it from the computing VM if computing is complex
Can perform Computing cluster functions
Computing cluster
Is good for running tasks from multiple task producers
Can be grid-based or client-server
Multiple small clusters may be better than one large
![Page 19: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/19.jpg)
Hazelcast
One of few free Data Grids
Has built-in Computing Grid abstractions
Good support from developers
but
Bugs in unusual use cases
Simply did not work reliably in our environment
![Page 20: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/20.jpg)
GridGain
May fit like a glove
You'd better not make a mitten out of a glove
Heavy use of annotations, which causes problems with runtime configuration
Weakened interfaces: here is a shotgun, and you have your foot
Unsafe vendor support
![Page 21: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/21.jpg)
ZooKeeper
The thing we are using now
Low-level primitives, though there are third-party libraries with high-level ones
Client-Server architecture
Clustered servers for stability and read scalability.
No write scalability
Part of Hadoop
![Page 22: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/22.jpg)
HDFS
Has Single Point of Failure
Name node memory requirements grow linearly with the number of files
Uses TCP (don't forget to tune the OS TCP stack)
Much like a Unix file system
Again, part of Hadoop
![Page 23: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/23.jpg)
Two types of The Time
Wall Time = CPU Time + Wait + Latency
External wait is managed with cooperative multitasking (discussed later)
Latency is vital for interactive services, but has low priority for data processing
![Page 24: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/24.jpg)
Grid vs Client-Server
Grid Client-Server
![Page 25: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/25.jpg)
Grid vs Client-Server
Grid:
Latency is half as much
A lot more connections
Everyone is watching
Complex cluster membership change procedure
Client-Server:
More robust
Servers can be clustered
Central point for debugging
No “watching deads” overhead for everyone
![Page 26: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/26.jpg)
Conclusion
Now our tasks are spread on our equipment.
Time to prevent resource overload!
![Page 27: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/27.jpg)
3. Controlling CPU usage
Low load means processing power goes unused
High load means that:
Parallel tasks take memory unnecessarily
System time is high because of context switches
CPU caches are split between the switching tasks
Total throughput is lower
![Page 28: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/28.jpg)
Parallel vs Sequential on single CPU
[Diagram: Tasks 1 to 4, each taking one time unit on a single CPU, run one after another (sequential) and time-sliced together (parallel)]
Sequential (LA=1):
Average finish time = (1+2+3+4)/4 = 2.5
Parallel (LA=4):
Average finish time = (4+4+4+4)/4 = 4
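The two averages above can be reproduced, and tried with other task lengths, using a small calculation. This is a sketch under an idealized assumption: "parallel" means perfectly fair time slicing on one CPU with zero context-switch cost, which real systems do not achieve. `FinishTimeSketch` and its method names are hypothetical.

```java
import java.util.Arrays;

final class FinishTimeSketch {
    // Sequential: task i finishes when the sum of the first i durations has elapsed.
    static double sequentialAverage(double[] durations) {
        double clock = 0;
        double total = 0;
        for (double d : durations) {
            clock += d;
            total += clock;
        }
        return total / durations.length;
    }

    // Parallel on one CPU with ideal fair sharing: while k tasks are alive,
    // each of them advances at 1/k of full speed.
    static double parallelAverage(double[] durations) {
        double[] sorted = durations.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double clock = 0;
        double previous = 0;
        double total = 0;
        for (int i = 0; i < n; i++) {
            // The remaining (n - i) tasks share the CPU until the i-th shortest one ends.
            clock += (sorted[i] - previous) * (n - i);
            previous = sorted[i];
            total += clock;
        }
        return total / n;
    }
}
```

For four tasks of one time unit each, `sequentialAverage` gives 2.5 and `parallelAverage` gives 4.0, matching the slide. Both variants finish the whole batch at time 4; the sequential one simply delivers the early results sooner.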
![Page 29: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/29.jpg)
Popular options to process tasks
Thread-pool:
Multiple tasks are processed at the same time by different threads
There should be enough threads to use the CPU while some of them are blocked
There should not be so many threads that the CPU is overloaded
Event-based:
One task is processed at any given time
There must be no blocking
Any call that may block must be replaced with a callback / event
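The difference between the two styles is easiest to see side by side. The following is a minimal sketch using only `java.util.concurrent`; `TaskStylesSketch`, `fetchBlocking`, and the pool size of 4 are hypothetical stand-ins, and the sleep merely imitates blocking I/O.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TaskStylesSketch {
    // Hypothetical blocking call standing in for network or disk I/O.
    static String fetchBlocking(String url) {
        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "content of " + url;
    }

    // Thread-pool style: plain procedural code; the worker thread blocks during the fetch.
    static int threadPoolStyle(String url) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Callable<Integer> job = () -> fetchBlocking(url).length();
            return pool.submit(job).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Event-based style: no thread waits; the continuation runs when the fetch completes.
    static CompletableFuture<Integer> eventStyle(CompletableFuture<String> fetch) {
        return fetch.thenApply(String::length);
    }
}
```

In the event-based variant the caller is expected to complete `fetch` from a non-blocking I/O callback, which is exactly the part that makes this style harder to combine with ordinary blocking libraries.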
![Page 30: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/30.jpg)
Thread-pool vs event-based
Thread-pool:
Pros
Simple procedural model
A lot of libraries & frameworks
Cons
Context-switch storms
Per-thread memory
Average speed
Event-based:
Pros
Optimal pool size
No waiting threads memory overhead
Cons
More complex event-based programming
Few supporting libraries & frameworks
![Page 31: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/31.jpg)
Introducing cooperative multitasking
It's much like the thread-pool variant, but:
Each wait (IO) is signaled to the multitask coordinator
Only one thread can be in the no-wait state; another thread exiting a wait blocks on a mutex.
If the system is overloaded (the mutex is always taken), new tasks are not started.
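The rules above can be sketched with a counting semaphore in place of the mutex. This is an illustrative guess at the mechanism, not the coordinator from the talk: `CooperativeCoordinator`, its method names, and the configurable `cpuSlots` parameter are assumptions (with `cpuSlots = 1` it behaves like the single mutex described on the slide).

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical coordinator: at most cpuSlots threads may be in the no-wait (computing) state.
final class CooperativeCoordinator {
    private final Semaphore cpu;

    CooperativeCoordinator(int cpuSlots) {
        this.cpu = new Semaphore(cpuSlots);
    }

    // Runs CPU-bound work while holding a slot; waits if all slots are taken.
    <T> T compute(Supplier<T> work) {
        cpu.acquireUninterruptibly();
        try {
            return work.get();
        } finally {
            cpu.release();
        }
    }

    // Must be called from inside compute(): signals a wait by giving the slot away
    // for the duration of the blocking call, then queues to get a slot back.
    <T> T blocking(Supplier<T> io) {
        cpu.release();
        try {
            return io.get();
        } finally {
            cpu.acquireUninterruptibly();
        }
    }

    // Overload protection: a new task is started only if a slot is free right now.
    boolean tryStart(Runnable task) {
        if (!cpu.tryAcquire()) {
            return false;
        }
        try {
            task.run();
        } finally {
            cpu.release();
        }
        return true;
    }
}
```

Task code stays procedural: it wraps each I/O call as `coordinator.blocking(() -> readPage(url))`, where `readPage` is whatever blocking call the task already uses, and the coordinator ensures that a thread returning from I/O does not compete for the CPU until a slot is free.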
![Page 32: Data array processing with Java language](https://reader033.fdocuments.us/reader033/viewer/2022052908/5595d3021a28ab062c8b46db/html5/thumbnails/32.jpg)
Cooperative multitasking features
Still simple procedural model
Controlled CPU usage
Low waiting thread memory usage because of no layered calls in graph dataflow
All regular frameworks & libraries are available
Dynamic thread pool size