WXS Preloading Explained - Google Drive


Efficiently preload large quantities of data into IBM WebSphere eXtreme Scale

Preloading data from a database into IBM WebSphere eXtreme Scale

The first wrong way

The second wrong way

The third wrong way

The right way

Duplicate reference data in EVERY partition

Collocating the master and child objects in a single partition

Multiple Maps means multiple Loaders

Preloading the Maps ONE by ONE is more efficient

Conclusion

Author: Billy Newport, April 10, 2011




Preloading data from a database into IBM WebSphere eXtreme Scale

This is one of the most common use cases. The customer wants to place an IBM WebSphere eXtreme Scale (WXS) grid in front of a database to allow better scaling. Usually, the data involves several tables and the customer wants to use several WXS maps with relationships between them.

The first wrong way

Let's say we have two tables in the database: a customer table and a customer address table. A customer can have many addresses. We decide to use the built-in JPA Loader, create a Customer entity and an Address entity, and define a one-to-many relationship between them. The customer uses a single WXS Map for the Customer entity POJO. The key is the customer key and the value is an object graph consisting of one Customer POJO that has a collection of Address objects.

The customer then tries to write a preloader application. This runs one or more JPA queries to fetch the Customer/Address graphs using the JPA implementation. This usually requires the database to execute a SELECT statement with a join in which the Addresses are grouped together for a single customer. This query is usually very expensive. The customer is loading the Customer graphs into the grid in parallel using agents, but the preload still takes a very long time because the bottleneck is the database.

The second wrong way

Convinced that the JPA implementation is the problem, the customer writes a JDBC Loader to do the same thing. Ultimately, the issue remains: executing the SELECT query with the group by is still the major bottleneck. This approach suffers from more or less the same problem as the first wrong way.

The third wrong way

Here, the graph is split into two maps: a customer map and an address map. The customer map has the customer key as its key and the Customer POJO as its value. The address map has a key that contains the customer key, and the value is one address. Loaders are written using JPA or JDBC to fetch and store data in these maps independently. Preloading the grid can now be made efficient, as a simple table scan is sufficient to fetch all rows from both tables and store them in the grid using a bulk putAll (wxsutils).

However, a major performance problem is discovered when accessing the data in the grid after the preload. Some of the customers have several addresses, and the customer and address objects for the same customer are stored in different partitions. The client application is usually executing a get for each customer and each address, which results in a lot of gets for each logical operation. If a customer has 10 addresses, then 11 get operations are required to fetch a single customer. The team will likely state that the grid is slower than the database, as fetching the data from the DBMS can be done with a single RPC and it's hard to do 11 RPCs faster than one.

The right way

The above approaches are flawed in two ways:

1. Preloading the data using SELECT statements that do joins, group bys or order bys will never work. It's too slow, and the DBMS will struggle to execute this SQL query, especially when there are a lot of rows involved. I've seen use cases with 1.6 billion rows in the child table. I don't care which database you have, sorting 1.6 billion rows is expensive.

2. The master and child objects are not stored together in a single partition. This means that working with the customer and the associated addresses is very expensive, as many RPCs are needed both to read the data and to write changes to the grid.

We will now examine what needs to be done to solve these problems.

Duplicate reference data in EVERY partition

Let's suppose our Address objects have a reference to the State. The state map is typically very small, 52 entries in the USA. We should make a Map for the states and then preload it with the same data in EVERY partition. This is easily done as follows. First, run a query to fetch all the states from the database. Next, write an Agent that, when executed on every partition, inserts the states into a Map on that partition. This means that if business logic later requires data stored in the State map for a particular address or customer, it's instantly available in the same partition. Clients will not be able to access the state map directly because such a map is not routable: the key to the Map does not determine which partition the entry will be found in.
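The idea of copying the same reference data into every partition can be sketched without any WXS classes. The following is a toy model only: the grid is a plain list of per-partition maps, and `ReferenceDataPreload`, `loadStates` and `preloadEveryPartition` are hypothetical names invented for illustration. In a real grid, the body of `loadStates` would be the work done by an agent executed on each partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a grid: each partition holds its own private copy of the state map.
// This does not use the WXS API; all names here are hypothetical.
class ReferenceDataPreload {

    // Stand-in for the agent logic that runs once inside each partition.
    static void loadStates(Map<String, String> partitionStateMap, Map<String, String> states) {
        partitionStateMap.putAll(states);
    }

    // Runs the "agent" against every partition so each one ends up with all states.
    static List<Map<String, String>> preloadEveryPartition(int partitions, Map<String, String> states) {
        List<Map<String, String>> grid = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            Map<String, String> local = new HashMap<>();
            loadStates(local, states);
            grid.add(local);
        }
        return grid;
    }

    public static void main(String[] args) {
        // In practice this map would be filled from a single query against the database.
        Map<String, String> states = new HashMap<>();
        states.put("MN", "Minnesota");
        states.put("NY", "New York");

        List<Map<String, String>> grid = preloadEveryPartition(4, states);
        for (int p = 0; p < grid.size(); p++) {
            System.out.println("partition " + p + " has " + grid.get(p).size() + " states");
        }
    }
}
```

Because every partition has the full state map, logic running next to a customer or address never needs a remote call to resolve a state.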

Collocating the master and child objects in a single partition

This is absolutely HUGE from a performance point of view. The best news is that its extremely

easy to do. A WXS Map has a key and a value. The customer Map has a key, lets say its a

customer id. The address Map has a key and a value also. The key is a composite POJO that

includes the customer key, the customer id.

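So, the maps look like:

Customer: Map<String, Customer>

Address: Map<AddressKey, Address>

import java.io.Serializable;

class AddressKey implements Serializable
{
    String customerId;
    int addressId;

    // Combine both fields so that equal keys hash alike.
    public int hashCode() { return 31 * customerId.hashCode() + addressId; }

    public boolean equals(Object o) {
        if (!(o instanceof AddressKey)) return false;
        AddressKey k = (AddressKey) o;
        return addressId == k.addressId && customerId.equals(k.customerId);
    }
}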

With this key, the address entries for a single customer will be spread across many partitions. We need to modify the application so that WXS stores ALL address entries in the same partition as the one used for the master customer entry.

import java.io.Serializable;
import com.ibm.websphere.objectgrid.plugins.PartitionableKey;

class AddressKey implements Serializable, PartitionableKey
{
    String customerId;
    int addressId;

    // Route by customer id so addresses land in the customer's partition.
    public Object ibmGetPartition()
    {
        return customerId;
    }

    // hashCode and equals as before
}

This version of the key accomplishes that with very little work. When WXS needs to calculate the partition for a specific key, it typically just calls the hashCode method on the key and derives the partition from the result. The initial version of AddressKey would almost always produce a different hashCode than the one produced by the customer key. The new version implements the PartitionableKey interface. If WXS sees a key that implements this interface, it instead uses the hashCode of the object returned from ibmGetPartition on that key. Our improved AddressKey returns the customerId string as the result. This guarantees that WXS will place the Address entries for a specific customer in the same partition as the associated parent Customer entry, because it calculates the hashCode in exactly the same way. This should be done for all Maps that are one-to-many children of a master map. The child keys should all employ the same technique.
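To make the routing effect concrete, here is a small standalone sketch. It does not use the WXS API: `RoutedKey` is a hypothetical stand-in for PartitionableKey, and `partitionFor` assumes a simplified rule (hash of the routing object modulo the number of partitions), which may differ in detail from what WXS does internally. It only aims to show why a plain composite key scatters addresses while a key routed by customer id keeps them with the customer.

```java
// Hypothetical stand-in for the WXS PartitionableKey interface.
interface RoutedKey {
    Object routingObject();
}

// Plain composite key: routed by its own hashCode.
class PlainAddressKey {
    final String customerId;
    final int addressId;

    PlainAddressKey(String customerId, int addressId) {
        this.customerId = customerId;
        this.addressId = addressId;
    }

    @Override public int hashCode() { return 31 * customerId.hashCode() + addressId; }

    @Override public boolean equals(Object o) {
        return o instanceof PlainAddressKey
            && ((PlainAddressKey) o).customerId.equals(customerId)
            && ((PlainAddressKey) o).addressId == addressId;
    }
}

// Collocated key: same fields, but routed by the customer id instead.
class CollocatedAddressKey extends PlainAddressKey implements RoutedKey {
    CollocatedAddressKey(String customerId, int addressId) { super(customerId, addressId); }

    @Override public Object routingObject() { return customerId; }
}

class PartitionRouter {
    // Assumed, simplified routing rule: hash of the routing object modulo partition count.
    static int partitionFor(Object key, int partitions) {
        Object routed = (key instanceof RoutedKey) ? ((RoutedKey) key).routingObject() : key;
        return Math.floorMod(routed.hashCode(), partitions);
    }

    public static void main(String[] args) {
        int partitions = 13;
        String customerId = "C42";
        System.out.println("customer -> " + partitionFor(customerId, partitions));
        for (int i = 1; i <= 3; i++) {
            System.out.println("plain address " + i + " -> "
                + partitionFor(new PlainAddressKey(customerId, i), partitions)
                + ", collocated -> "
                + partitionFor(new CollocatedAddressKey(customerId, i), partitions));
        }
    }
}
```

Running it prints the partition of the customer next to the partitions chosen for each address under both key designs; the collocated column always matches the customer's partition.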

This will massively improve performance, as the application can now fetch a customer AND all of its addresses in a single operation using a very simple Agent call. Our previous example needed 11 RPCs to accomplish this. Now we only use one, and WXS will easily outperform the database.

Typically agents are written for all operations for the highest speed. These agents can be thought

of as stored procedures to fetch or manipulate customer data in some particular way using a

single RPC. This also improves application performance enormously.

The Customer object will also typically have a Collection of Address keys. This collection allows the associated address objects to be retrieved efficiently without using an index to do the same thing. Both cost memory, so I typically use a Collection in the Customer object, but then I have to make sure that the agents that add or remove Addresses also maintain this list in the associated Customer object. This is an extra cost but, performance-wise, it's usually very cheap to do. The preloading code clearly needs to make sure the Collection is set correctly as well.
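The bookkeeping described above can be sketched with plain Java collections. This is a model of a single partition only, not WXS code: `CustomerPartition` and its methods are hypothetical names, address keys are simplified to strings of the form `customerId/addressId`, and each method stands in for an agent that would run inside the partition as a single RPC.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of one partition holding collocated customer and address entries.
class CustomerPartition {

    static class Customer {
        final String id;
        final List<String> addressKeys = new ArrayList<>(); // keys into addressMap
        Customer(String id) { this.id = id; }
    }

    final Map<String, Customer> customerMap = new HashMap<>();
    final Map<String, String> addressMap = new HashMap<>(); // key: customerId + "/" + addressId

    // "Agent" that adds an address and keeps the customer's key collection in step.
    void addAddress(String customerId, int addressId, String address) {
        Customer c = customerMap.get(customerId);
        if (c == null) throw new IllegalArgumentException("unknown customer " + customerId);
        String key = customerId + "/" + addressId;
        if (addressMap.put(key, address) == null) {
            c.addressKeys.add(key); // only record the key the first time it is stored
        }
    }

    // "Agent" that removes an address and drops its key from the customer.
    void removeAddress(String customerId, int addressId) {
        String key = customerId + "/" + addressId;
        if (addressMap.remove(key) != null) {
            customerMap.get(customerId).addressKeys.remove(key);
        }
    }

    // "Agent" that returns all addresses of a customer in one call, without an index.
    List<String> fetchAddresses(String customerId) {
        List<String> result = new ArrayList<>();
        Customer c = customerMap.get(customerId);
        if (c == null) return result;
        for (String key : c.addressKeys) {
            result.add(addressMap.get(key));
        }
        return result;
    }
}
```

The extra work per add or remove is one list update on an object that is already local, which is why maintaining the collection is usually cheap compared with the RPCs it saves.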

Multiple Maps means multiple Loaders

Sometimes, the application will decide to keep a single Map whose value includes the Customer and a Collection of Address objects. If the data is read-only as far as WXS is concerned, then this is a workable solution. It's fast: you can fetch the customer and addresses in a single operation without resorting to agents. But if a Loader is used, then the Loader is more complex. It has to figure out how to apply the List of Addresses in the customer object to the Address table in the database. Typically, this means reading all the addresses from the DBMS, comparing them to the list in memory, and then issuing insert, update and delete SQL statements to account for any differences. This is expensive. WXS is unable to track the address changes because you are storing everything in a single Map and WXS does change detection map by map.





When a Loader is used, a better approach is to use separate maps as described above. Each Map gets its own Loader. WXS can now track changes to the Address Map and instruct the Loader to insert, update or delete Addresses, typically without needing to run a query first. This is simpler to code and more efficient in terms of SQL.

The single Map approach is still more expensive even if write-behind is enabled. The SQL is still complex.

Preloading the Maps ONE by ONE is more efficient

As explained earlier, trying to fetch the objects using a single query with joins and group/order bys is almost always too expensive in practice. Splitting the object into separate maps allows the following approach to be used.

First, preload the customer map. Do a simple SELECT * from Customer in blocks and use putAll to write the Customer objects to the Customer Map with empty Address collections. This is typically very fast.

Next, preload the address map. Do a simple SELECT * from Address in blocks. There is no guarantee that you'll get the Addresses for one customer at a time, but that's OK. We will write an agent that takes a block of addresses and splits it into a block for each partition. Then, for each partition, the agent merges the new addresses for a particular customer into the existing set by updating the Collection of address keys in the customer as well as inserting the Address objects themselves.

This is much faster than before, as the SQL query is very efficient and we still bulk add the data into the grid. The key is using the merge Agent to add the unordered addresses to the existing customers efficiently.

A great way to think about this is that we're moving the join operations out of the database and into the grid.
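The two-step preload with a merge agent can be sketched as follows. This is a simulation with plain Java collections rather than WXS code: `MergePreload`, `Partition` and `partitionFor` are hypothetical names, partition routing is assumed to be the customer id hash modulo the partition count, and address keys are simplified to `customerId/addressId` strings. The `merge` method stands in for the agent that would run inside each partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergePreload {

    static class Address {
        final String customerId;
        final int addressId;
        final String text;
        Address(String customerId, int addressId, String text) {
            this.customerId = customerId;
            this.addressId = addressId;
            this.text = text;
        }
        String key() { return customerId + "/" + addressId; }
    }

    static class Partition {
        // customerId -> collection of address keys held by that customer
        final Map<String, List<String>> customerAddressKeys = new HashMap<>();
        final Map<String, Address> addressMap = new HashMap<>();

        // Merge step that would run inside the partition as an agent.
        void merge(List<Address> block) {
            for (Address a : block) {
                List<String> keys = customerAddressKeys.get(a.customerId);
                if (keys == null) continue; // customer not preloaded; skipped in this sketch
                if (addressMap.put(a.key(), a) == null) keys.add(a.key());
            }
        }
    }

    // Assumed routing rule: customer id hash modulo the number of partitions.
    static int partitionFor(String customerId, int partitions) {
        return Math.floorMod(customerId.hashCode(), partitions);
    }

    // Step 1: customers go in first, with empty address collections.
    static List<Partition> preloadCustomers(List<String> customerIds, int partitions) {
        List<Partition> grid = new ArrayList<>();
        for (int p = 0; p < partitions; p++) grid.add(new Partition());
        for (String id : customerIds) {
            grid.get(partitionFor(id, partitions)).customerAddressKeys.put(id, new ArrayList<>());
        }
        return grid;
    }

    // Step 2: split an unordered block by partition, then merge each sub-block.
    static void preloadAddressBlock(List<Partition> grid, List<Address> block) {
        Map<Integer, List<Address>> perPartition = new HashMap<>();
        for (Address a : block) {
            perPartition
                .computeIfAbsent(partitionFor(a.customerId, grid.size()), k -> new ArrayList<>())
                .add(a);
        }
        for (Map.Entry<Integer, List<Address>> e : perPartition.entrySet()) {
            grid.get(e.getKey()).merge(e.getValue());
        }
    }
}
```

The address rows can arrive in any order, since each one is routed by its customer id and attached to a customer entry that already exists. No sort or join is needed on the database side.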

Conclusion

Following the steps outlined in this article solves 99% of all the preloading performance issues that I've seen in customer situations. To summarize:

1. Use wxsutils for putAll or bulk implementations.

2. Map per table.

3. Collocate related map entries using PartitionableKey.

4. Preload a table at a time.

5. Then, use a merge agent to preload child tables.

6. Try preloading the master table using parallel chunks. Multi-thread it so that you use N threads, with each thread fetching an exclusive range of records from the database.

7. Try preloading the child tables using parallel chunks as well.

8. DO NOT try to fetch the child data at the same time as the master data from the DBMS.

9. Join data in the grid, not in the database.

10. Duplicate reference data in every partition.

11. Write agents to manipulate the main data using a single RPC, for reads, puts and deletes.
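For the parallel chunks in points 6 and 7, one way to give each thread an exclusive range is to split the key space up front. The sketch below assumes the table has a numeric key column with known minimum and maximum values, which the article does not state; `ChunkRanges` and `split` are hypothetical names, and the SQL in the comment is only an illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkRanges {

    // Split [minId, maxId] into at most n contiguous, non-overlapping ranges {low, high}.
    // n must be greater than zero.
    static List<long[]> split(long minId, long maxId, int n) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long size = (total + n - 1) / n; // ceiling division
        for (long low = minId; low <= maxId; low += size) {
            ranges.add(new long[] { low, Math.min(low + size - 1, maxId) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(1, 1000, 4)) {
            // Each thread would run its own simple scan over its range, for example:
            // SELECT * FROM Customer WHERE id BETWEEN ? AND ?
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Because the ranges do not overlap, the threads never fetch the same row twice, and each query stays a simple range scan with no join or sort.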





Following these techniques will ensure you preload data from a DBMS as efficiently as possible

with the least load on the DBMS.


