Scalable Group Communication in Heterogeneous Cluster
Filip Hanik
Apache Software Foundation
June 30th, 2006

Who am I
• [email protected]
• Tomcat Committer / ASF member
• Responsible for session replication and clustering
• Been involved with ASF since 2001

What we will cover
• Introduction to group communication
• Challenges in group/cluster communication
• Today’s Solutions
• Detailed Tribes overview
• Tribes – design/configuration/usage
• Problems and their solutions
• Q & A

What is Group Communication
• 1-to-n communication between software/hardware nodes
• Designed to reduce packets compared to 1-to-1 (point-to-point) communication
• Also referred to as broadcasting and/or multicasting
• broadcast != multicast
  • broadcast – all nodes receive
  • multicast – interested (subscribed) nodes receive
• Popular academic research topic! Lots of information available

Challenges in Group Communication
• Multicast is most commonly used
• Group consistency and leadership
• Delivery guarantee
• Group delivery guarantee
• Ordering and total ordering
• Flow control
• Multiple networks

Today’s Solutions
• Dozens if not hundreds of academic products
  • Not maintained, not supported, proprietary
• Many open source projects
  • Appia, Spread, Erlang, JGroups… the list goes on
• Most are multicast based, to solve the 1-to-n packet reduction problem

What is the uniform group model?
• Nodes are identical
• All nodes process, send and receive messages in the same way
• All nodes have the same applications
• Total ordering is based on the complete group
• Note: Not the official definition of uniformity in a group setting

When isn’t uniformity enough?
• When processes on each node are dynamic: activate, passivate, short and long lived
  • Example: Tomcat webapps
  • Example: heterogeneous hardware environments
• Application management vs. application data replication
• Messages with different priorities
  • Example: a session attribute being replicated vs. a 25MB war file being transferred
• Need different guarantee levels
• When most messages are 1-to-m, where m < n

Challenges in heterogeneous clusters
• Same challenges as in homogeneous environments
• Node attributes change at runtime
• Nodes carry different responsibilities
• Total order for messages that are sent 1-to-m, where m < n

What is Tribes?
• Tribes is a messaging framework with group communication capabilities
• 100% Java, Apache Licensed (2.0)
• Born out of the cluster/session replication code from Tomcat 5.0-5.5 in early 2006
• Currently alpha, will become the communication framework for Tomcat’s next cluster implementation
• Ideas from 2001

Why Tribes?
• Many frameworks are not flexible enough
• Not enough features
• Messages were guaranteed, without delivery feedback
• Static configurations for message delivery
• Based on 1-to-m delivery, where m < n
• License, license, license…

Why Tribes?
• Research gap: platforms are proprietary and often suggest protocols that are not standard
• Opportunities for httpd & Tomcat and other ASF software integration for more advanced and intelligent clusters
• Separation of communication layer
• Did I say Apache License?

Why not Tribes
• TCP is connection based
• When you always want to send 1-to-n
• Unique scenario where a highly customized solution might be the best fit
• It's not the one-size-fits-all solution, if such exists

Goals
• Simplify peer-to-peer and peer-to-group communication for distributed applications
• Flexible enough to support a wide range of applications under one runtime configuration
• Provide instant feedback on message delivery
• Concurrent message delivery, even between two nodes
• Parallel delivery to multiple nodes
• Clean, intuitive and easy to use, even for complex tasks
• All this with low overhead

Feature Overview
• Pluggable Modules
• Guaranteed Messaging
• Different Guarantee Levels
• Per message delivery semantics(!)
• Pluggable Interceptors (runtime)
• Delivery feedback – even for async
• Concurrent and parallel delivery
• Fixed node hierarchy

Feature: Pluggable Modules
• All major components can be swapped out, simple interfaces defined
• Needed when customization is required for lower level IO operations
• Example
  • Multicast not available
  • Proprietary network protocols
  • SSL
• Goal: Default implementation to be enough for 80% of applications that require messaging

Feature: Guaranteed Msg Delivery
• Assume 1-to-m delivery (m < n)
• Default implementation is TCP based
  • java.io & java.nio
• In most cases, TCP (Java) will outperform UDP once flow control and ack/nack for guaranteed delivery are implemented
• java.io support for platforms with poor NIO implementations
• java.nio preferred

Feature: Guarantee Levels
• By default supports 3 levels
• NO_ACK – message was sent
  • Relies on TCP to deliver without node feedback
• ACK – message was received
  • Remote node replies with an ACK
• SYNC_ACK – message was processed
  • Remote node replies with ACK/FAIL_ACK when message has been processed
  • Allows for message process feedback
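The three levels differ in what the sender knows when the send call returns. The sketch below is a small simulation of that difference, not Tribes code: `GuaranteeLevel`, `SendResult`, `GuaranteeSketch` and the `processor` parameter are hypothetical names introduced only for illustration.

```java
import java.util.function.Predicate;

/** Hypothetical names for illustration; this is not the Tribes API. */
enum GuaranteeLevel { NO_ACK, ACK, SYNC_ACK }

/** What the sender knows once the send call returns. */
enum SendResult { SENT, RECEIVED, PROCESSED, PROCESSING_FAILED }

final class GuaranteeSketch {
    /**
     * Simulates one send. 'processor' stands in for the remote application
     * and returns false when it fails to process the message.
     */
    static SendResult send(String msg, GuaranteeLevel level, Predicate<String> processor) {
        switch (level) {
            case NO_ACK:
                // Bytes are handed to TCP; the remote node sends nothing back.
                processor.test(msg); // outcome never reaches the sender
                return SendResult.SENT;
            case ACK:
                // Remote node confirms receipt; processing outcome stays unknown.
                processor.test(msg);
                return SendResult.RECEIVED;
            default: // SYNC_ACK
                // Remote node processes first, then answers ACK or FAIL_ACK.
                return processor.test(msg) ? SendResult.PROCESSED
                                           : SendResult.PROCESSING_FAILED;
        }
    }
}
```

Only the SYNC_ACK branch lets a processing failure on the remote node travel back to the sender.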

Feature: Per message delivery semantics
• Most unique feature, what makes Tribes really stand out
• Allows for each message to be delivered differently
  • Per message guarantee level
  • Sync vs. async
  • Not ordered, ordered, totally ordered
• 27 flags - 2ⁿ (n=27) combinations
  • Based on interceptors configured
• Each message with its own unique delivery guarantee
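Per-message semantics of this kind are commonly encoded as bit flags in an integer that travels with the message. The sketch below shows that encoding and the 2ⁿ arithmetic. The flag names and bit positions are hypothetical and are not the constants Tribes defines.

```java
final class DeliveryFlags {
    // Hypothetical flags, one bit each; not the values Tribes uses.
    static final int USE_ACK      = 1 << 0;
    static final int SYNC_ACK     = 1 << 1;
    static final int ASYNCHRONOUS = 1 << 2;
    static final int TOTAL_ORDER  = 1 << 3;

    /** True when every bit of 'flag' is set in 'options'. */
    static boolean isSet(int options, int flag) {
        return (options & flag) == flag;
    }

    /** Number of distinct combinations of n independent on/off flags: 2^n. */
    static long combinations(int n) {
        return 1L << n;
    }
}
```

A sender would combine flags with bitwise OR, for example `DeliveryFlags.USE_ACK | DeliveryFlags.ASYNCHRONOUS`, and with n = 27 `combinations` gives 134,217,728.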

Feature: Pluggable Interceptors
• React on message attributes (flags)
• If not modifying message bytes, can be inserted at runtime
• Intercept any events through defined methods
• ChannelInterceptorBase available to minimize redundant code for non-intercepted methods
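A minimal sketch of the pattern described above, assuming a simplified single-method interceptor. `InterceptorSketch`, `FlaggedInterceptor` and `InterceptorDemo` are hypothetical stand-ins, not the real ChannelInterceptorBase, which has more methods. The base class forwards every call, so a concrete interceptor overrides only what it cares about and acts only when its flag is present.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for an interceptor base class. */
class InterceptorSketch {
    private InterceptorSketch next;

    void setNext(InterceptorSketch next) { this.next = next; }

    /** Default behaviour: pass the message down the chain untouched. */
    void sendMessage(byte[] data, int options, List<String> trace) {
        if (next != null) next.sendMessage(data, options, trace);
    }
}

/** Acts only on messages that carry its flag; everything else falls through. */
class FlaggedInterceptor extends InterceptorSketch {
    private final String name;
    private final int flag;

    FlaggedInterceptor(String name, int flag) {
        this.name = name;
        this.flag = flag;
    }

    @Override
    void sendMessage(byte[] data, int options, List<String> trace) {
        if ((options & flag) == flag) trace.add(name); // react to the flag
        super.sendMessage(data, options, trace);       // always continue the chain
    }
}

final class InterceptorDemo {
    /** Builds a two-element chain and records which interceptors reacted. */
    static List<String> run(int options) {
        FlaggedInterceptor order = new FlaggedInterceptor("order", 1);
        FlaggedInterceptor async = new FlaggedInterceptor("async", 2);
        order.setNext(async);
        List<String> trace = new ArrayList<>();
        order.sendMessage(new byte[0], options, trace);
        return trace;
    }
}
```

Because neither interceptor in the sketch touches the message bytes, either one could be linked into or out of the chain without affecting the other.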

Feature: Delivery Feedback
• Tribes aims to deliver feedback for each message and each delivery semantic
  • NO_ACK, ACK, SYNC_ACK
• Synchronous and asynchronous delivery
  • Asynchronous gets feedback through callback
• Example: recoverable transactions can now be implemented since we always know if the remote node received the message
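Feedback for an asynchronous send can be sketched with a callback that fires once the outcome is known. Everything named here (`DeliveryCallback`, `AsyncSender`, the `transport` stand-in) is hypothetical and only illustrates the idea. It is not how Tribes wires its callbacks.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Predicate;

/** Hypothetical callback: told whether a member got the message. */
interface DeliveryCallback {
    void onResult(String member, boolean delivered);
}

final class AsyncSender {
    /**
     * Sends on a background thread and reports the outcome to the callback.
     * 'transport' stands in for the network and returns true when the
     * member acknowledged the message.
     */
    static CompletableFuture<Void> sendAsync(String member,
                                             Predicate<String> transport,
                                             DeliveryCallback callback) {
        return CompletableFuture.runAsync(() -> {
            boolean delivered;
            try {
                delivered = transport.test(member);
            } catch (RuntimeException e) {
                delivered = false; // a transport error counts as not delivered
            }
            callback.onResult(member, delivered);
        });
    }
}
```

The caller returns immediately and the callback later records success or failure per member, which is the information a recoverable transaction would need.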

Feature: Concurrent & Parallel Delivery
• Concurrent
  • More than one message sent or received at any point in time
  • No “message blocking”, i.e. a 10MB message with SYNC_ACK will not stop a 10KB NO_ACK
• Parallel
  • Able to send a message to multiple destinations in parallel using one thread (NIO)
• Prioritized
  • Future feature

Feature: Fixed Node Hierarchy
• Absolute Order Algorithm
  • Always be able to determine leadership
  • No message exchanges (chat free)
  • Non coordinated
• Also provides “Coordination” algorithm
  • Chatty, but efficient
  • Auto merge groups
  • Enhance node discovery where multicast might glitch
  • Can connect different subnets when used together with the StaticMembershipInterceptor
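One way leadership can be determined with no message exchange is for every node to sort the same member list with the same deterministic comparator and take the first entry. The sketch below assumes an ordering by unsigned address bytes and then by port. The slide does not state the actual ordering rule, so that rule and the `NodeId` and `AbsoluteOrderSketch` names are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical member descriptor: raw address bytes plus port. */
final class NodeId {
    final byte[] host;
    final int port;

    NodeId(byte[] host, int port) {
        this.host = host.clone();
        this.port = port;
    }

    @Override
    public String toString() {
        return Arrays.toString(host) + ":" + port;
    }
}

final class AbsoluteOrderSketch {
    /** Assumed rule: unsigned host bytes first, then port. */
    static final Comparator<NodeId> ORDER = (a, b) -> {
        int len = Math.min(a.host.length, b.host.length);
        for (int i = 0; i < len; i++) {
            int diff = (a.host[i] & 0xFF) - (b.host[i] & 0xFF);
            if (diff != 0) return diff;
        }
        if (a.host.length != b.host.length) return a.host.length - b.host.length;
        return Integer.compare(a.port, b.port);
    };

    /** First member in sorted order is the leader; expects a non-empty list. */
    static NodeId leader(List<NodeId> members) {
        List<NodeId> sorted = new ArrayList<>(members);
        sorted.sort(ORDER);
        return sorted.get(0);
    }
}
```

Since the result depends only on the membership view, two nodes that see the same members agree on the leader regardless of the order in which they discovered them.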

Feature: Absolute Failure Detection
• Simple interceptor TcpFailureDetector
• Instant feedback on member down
  • No need to wait for timeout
  • No risk of node pings getting stuck on a busy network
• Verifies timeouts against “false positives”
• 3 levels
  • Connect
  • Send
  • Read
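The first of the three levels, a connect check, can be sketched with a plain TCP connection attempt: before a member is declared dead after a timeout, try to reach it directly. This is a hypothetical helper (`ConnectCheck`) for illustration only. It is not the TcpFailureDetector implementation and it does not cover the send and read levels.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class ConnectCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }
}
```

A membership layer could call such a check when heartbeats stop arriving, and only report the member as gone if the connection attempt also fails.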

Feature: RPC messaging
• Ability to collect responses to a message
• NO_REPLY, FIRST_REPLY, MAJORITY_REPLY & ALL_REPLY
• Absence reply(!) – rather than timeout
• Callback left over delivery
• Support for multiple RPC channels on top of one Tribes channel
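The four reply options can be read as a rule for how many responses a caller waits for. The sketch below assumes that MAJORITY_REPLY means a strict majority (more than half of the destinations), which the slide does not spell out. `ReplyMode` and `ReplyPolicy` are hypothetical names.

```java
/** Hypothetical enum mirroring the option names on the slide. */
enum ReplyMode { NO_REPLY, FIRST_REPLY, MAJORITY_REPLY, ALL_REPLY }

final class ReplyPolicy {
    /** Replies to wait for before the call returns, given the number of destinations. */
    static int repliesNeeded(ReplyMode mode, int destinations) {
        switch (mode) {
            case NO_REPLY:
                return 0;
            case FIRST_REPLY:
                return Math.min(1, destinations);
            case MAJORITY_REPLY:
                // Assumption: strict majority, i.e. more than half.
                return destinations == 0 ? 0 : destinations / 2 + 1;
            default: // ALL_REPLY
                return destinations;
        }
    }
}
```

For five destinations this gives 0, 1, 3 and 5 replies respectively.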

Feature – JNDI Channel
• Ability to bind a channel into a JNDI tree
• Share the channel between objects
• Ideal for J2EE messaging
• Coming soon:
  • Ability to download client stub
  • Out of process invocation
• Not yet implemented…

Architecture - Overview
[Figure: architecture diagram. Labels, top to bottom: four Application boxes with Tipi and RpcChannel; Channel; Interceptor, Interceptor; Coordinator; Membership, Sender, Receiver (RX/TX).]

Architecture - Channel
• 1 instance per Tribes runtime setup
• Is the first interceptor
• Holds a list of one or more ChannelListeners & MembershipListeners
• Serializes and deserializes messages
• Supports ByteMessage for transfer of pure byte[] data
• RpcChannel instanceof ChannelListener

Architecture - Interceptors
• Linked list invocation
• Strongly typed – one method per event
• No events need to travel through the stack to coordinate interceptors
• Examples
  • Failure detection
  • Static membership
  • Total order or per member order
  • Throughput measurements and statistics
  • Leadership election
  • Message data encryption
  • Message dispatch – asynchronous messaging
  • All or none delivery guarantee

Architecture - Interceptors
• Trigger on ChannelData.getOptions()
• Pass through a ChannelData object
• Using XByteBuffer – optimized byte[] handling
• Membership & Message interceptions
• Threadless

Architecture - Coordinator
• Last interceptor
• Coordinates IO components
  • Sender
  • Receiver
  • Membership
• Receiver uses thread pool
• Sender piggybacks on application thread
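The threading split on this slide can be illustrated with a small sketch: outgoing work runs on whichever application thread calls send, while incoming work is handed to a pool. `CoordinatorSketch` and the pool size of 4 are assumptions for illustration, not the Tribes coordinator.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of the two threading models; not the Tribes coordinator. */
final class CoordinatorSketch implements AutoCloseable {
    // Pool size is an arbitrary choice for the sketch.
    private final ExecutorService receiverPool = Executors.newFixedThreadPool(4);

    /** Outgoing path: runs on the calling application thread. Returns that thread's name. */
    String send(byte[] data) {
        return Thread.currentThread().getName();
    }

    /** Incoming path: handed to a pool thread. The future yields the worker thread's name. */
    Future<String> receive(byte[] data) {
        return receiverPool.submit(() -> Thread.currentThread().getName());
    }

    @Override
    public void close() {
        receiverPool.shutdown();
    }
}
```

Sending on the caller's thread avoids a hand-off for outgoing messages, while the pool keeps a slow message handler from blocking other incoming messages.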

Code Structure
• org.apache.catalina.tribes
  • Application and Component interfaces
  • group – default implementation
  • transport – RX/TX components
  • membership – membership service
  • group.interceptors – supplied interceptors
  • io – protocol utilities and optimizations
  • tipis – utilities on top of Tribes core
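
Quick Start
import java.io.Serializable;
import org.apache.catalina.tribes.*;
import org.apache.catalina.tribes.group.GroupChannel;

// Create the channel; MyMessageListener, MyMemberListener and MyMessage are application classes
Channel myChannel = new GroupChannel();
ChannelListener msgListener = new MyMessageListener();
MembershipListener mbrListener = new MyMemberListener();
myChannel.addMembershipListener(mbrListener);
myChannel.addChannelListener(msgListener);
myChannel.start(Channel.DEFAULT); // start the channel

// Send a message to all current members
Serializable myMsg = new MyMessage();
Member[] group = myChannel.getMembers();
myChannel.send(group, myMsg, Channel.SEND_OPTIONS_DEFAULT);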

Data Replication
• ReplicatedMap – one to all replication
• LazyReplicatedMap – primary/backup replication
• Cookie based replication map
  • Ideal for HTTP session replication
  • Backup location stored in cookies
• Versioned delta replication
• Example: org.apache.catalina.ha

Tribes Demos
• Demo
• Code Example
• Discussion around common problems and how Tribes could solve them

Future Work
• Security - SSL support and node authentication
• Many processes – one channel
• Language independent
• WAN membership discovery
• TCP based multicaster for large clusters
  • 2*n packet reduction for the sender, not total
• Intelligent membership broadcasting
• httpd as a load balancer

Q & A
• [email protected]
• http://people.apache.org/~fhanik/tribes
• Tomcat SVN repository
• Interested to use?
• Interested to help?