
    Efficiently preload large quantities of data in to IBM WebSphere eXtreme Scale

    Preloading data from a database in to IBM WebSphere eXtreme Scale
    The first wrong way
    The second wrong way
    The third wrong way
    The right way
    Duplicate reference data in EVERY partition
    Collocating the master and child objects in a single partition
    Multiple Maps means multiple Loaders
    Preloading the Maps ONE by ONE is more efficient
    Conclusion

    Author: Billy Newport April 10, 2011 Page: 1/2


    Preloading data from a database in to IBM WebSphere eXtreme Scale

    This is one of the most common use cases. The customer wants to place an IBM WebSphere eXtreme Scale (WXS) grid in front of a database to allow better scaling. Usually, the data involves several tables and the customer wants to use several WXS maps with relationships between them.

    The first wrong way

    Let's say we have two tables in the database: a customer table and a customer address table. A customer can have many addresses. We decide to use the JPA built-in Loader to create a Customer entity and an Address entity, with a one-to-many relationship between them. The customer uses a single WXS Map for the Customer entity POJO. The key is the customer key and the value is an object graph consisting of one Customer POJO that has a collection of Address objects.

    The customer then tries to write a preloader application. This will run one or more JPA queries to fetch the Customer/Address graphs using the JPA implementation. This usually requires the database to execute a SELECT statement with a join in which the Addresses are grouped together for a single customer. This query is usually very expensive. The preload is slow and the bottleneck is the database. The customer is loading the Customer graphs into the grid in parallel using agents, but the process takes a very long time because of the database.

    The second wrong way

    Convinced that the JPA implementation is the problem, the customer writes a JDBC Loader to do the same thing. Ultimately, the issue remains: executing the SELECT query with the group by is still the major bottleneck. This suffers from more or less the same issue as the first wrong way.

    The third wrong way

    Here, the graph is split into two maps: a customer map and an address map. The customer map has the customer key and the value is the Customer POJO. The Address map has a key that contains the customer key and the value is one address. Loaders are written using JPA or JDBC to fetch and store data in these maps independently. Preloading the grid here can be made efficient, as a simple table scan is sufficient to fetch all rows from both tables and store them in the grid using a bulk putAll (wxsutils). However, a major performance problem is discovered when accessing the data in the grid after preload. Some of the customers have several addresses. The customer and address objects for the same customer are stored in different partitions. The client application is usually executing a get for each customer and each address. This results in a lot of gets for each logical operation. If a customer has 10 addresses then 11 get operations are required to fetch a single customer. The team will likely state that the grid is slower than the database, as fetching the data from the DBMS can be done with a single RPC and it's hard to do 11 RPCs faster than one.

    The right way

    The above approaches are flawed in two ways:


    1. Preloading the data using SELECT statements that do joins, group bys or order bys will never work. It's too slow and the DBMS will struggle to execute this SQL query, especially when there are a lot of rows involved. I've seen use cases with 1.6 billion rows in the child table. I don't care which database you have, sorting 1.6 billion rows is expensive.

    2. The master and child objects are not stored together in a single partition. This means that working with the customer and the associated addresses is very expensive, as many RPCs are needed both to read the data and to write changes to the grid.

    We will now examine what needs to be done to solve these problems.

    Duplicate reference data in EVERY partition

    Let's suppose our Address objects have a reference to the State. The state map is typically very small, 52 entries in the USA. We should make a Map for the state and then preload it with the same data in EVERY partition. This can be readily done as follows. First, run a query to fetch all the states from the database. Next, write an Agent that is executed on every partition and inserts the states into a Map on that partition. This means that later, if business logic requires data stored in the State map for a particular address or customer, it's instantly available in the same partition. Clients will not be able to access the state map directly because such a map is not routable: the key to the Map does not determine which partition the entry will be found in.
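The two steps above can be sketched without any WXS classes. This is a toy model under stated assumptions: each "partition" is a plain in-memory map, and `ReferenceDataSketch`, `newGrid` and `loadStatesEverywhere` are hypothetical names used only for illustration. In a real grid the copy loop would be the body of an agent that WXS runs on every partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: each "partition" is a plain in-memory map, not a WXS container.
class ReferenceDataSketch {

    /** Builds an in-memory stand-in for a grid with the given number of partitions. */
    static List<Map<String, String>> newGrid(int partitions) {
        List<Map<String, String>> grid = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            grid.add(new HashMap<>());
        }
        return grid;
    }

    /** Plays the role of the agent: visits each partition and copies in every state. */
    static void loadStatesEverywhere(List<Map<String, String>> grid, Map<String, String> states) {
        for (Map<String, String> partitionLocalStateMap : grid) {
            partitionLocalStateMap.putAll(states);
        }
    }

    public static void main(String[] args) {
        // In a real preloader these rows would come from one query against the state table.
        Map<String, String> states = new HashMap<>();
        states.put("NY", "New York");
        states.put("TX", "Texas");

        List<Map<String, String>> grid = newGrid(4);
        loadStatesEverywhere(grid, states);

        // Every partition can now resolve a state code without leaving the partition.
        for (int p = 0; p < grid.size(); p++) {
            System.out.println("partition " + p + " -> " + grid.get(p).get("TX"));
        }
    }
}
```

The memory cost is the size of the reference table multiplied by the number of partitions, which is why this only suits small tables such as the state map.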

    Collocating the master and child objects in a single partition

    This is absolutely HUGE from a performance point of view. The best news is that it's extremely easy to do. A WXS Map has a key and a value. The customer Map has a key, let's say it's a customer id. The address Map has a key and a value also. The key is a composite POJO that includes the customer key, the customer id.

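    So, the maps look like:

    Customer: Map<String, Customer>
    Address: Map<AddressKey, Address>

    class AddressKey implements java.io.Serializable
    {
        String customerId;
        int addressId;

        @Override
        public int hashCode() { return java.util.Objects.hash(customerId, addressId); }

        @Override
        public boolean equals(Object o) {
            // Two keys are equal when both the customer id and the address id match.
            return o instanceof AddressKey
                && addressId == ((AddressKey) o).addressId
                && java.util.Objects.equals(customerId, ((AddressKey) o).customerId);
        }
    }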

    With this key, the address entries for a single customer will be spread across many partitions. We need to modify the application so WXS will store ALL address entries in the same partition as the one used for the master customer entry.

    class AddressKey implements Serializable, PartitionableKey
    {
        String customerId;
        int addressId;

        // hashCode() and equals() as before

        public Object ibmGetPartition()
        {
            // Route on the customer id only, so addresses follow their customer.
            return customerId;
        }
    }

    This version of the key accomplishes that with very little work. When WXS needs to calculate the partition for a specific key, it typically just calls the hashCode method and then figures out the partition. The initial version of AddressKey would generally produce a different hashCode than the customer key, so the address would land in a different partition. The new version implements the PartitionableKey interface. If WXS sees a key that implements this interface then it will instead use the hashCode of the object returned from the ibmGetPartition method on that key. Our improved AddressKey returns the customerId string as the result. This guarantees that WXS will place the Address entries for a specific customer in the same partition as the associated parent Customer entry because it calculates the hashCode the exact same way. This should be done for all Maps that are one-to-many children of a master map. The child keys should all employ the same technique.
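The routing rule described above can be illustrated with a small stand-alone model. This is only a sketch: the `RoutedKey` interface and the `partitionFor` method are hypothetical stand-ins, and the hash-modulo step is an assumption about how a partition number could be derived, not the actual WXS implementation.

```java
import java.util.Objects;

// Toy model of key-to-partition routing; not the WXS implementation.
class PartitionRoutingSketch {

    /** Hypothetical stand-in for the role PartitionableKey plays in WXS. */
    interface RoutedKey {
        Object routingObject();
    }

    /** Composite address key that routes on the customer id alone. */
    static final class AddressKey implements RoutedKey {
        final String customerId;
        final int addressId;

        AddressKey(String customerId, int addressId) {
            this.customerId = customerId;
            this.addressId = addressId;
        }

        @Override
        public Object routingObject() {
            return customerId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(customerId, addressId);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof AddressKey)) {
                return false;
            }
            AddressKey other = (AddressKey) o;
            return addressId == other.addressId && customerId.equals(other.customerId);
        }
    }

    /** Hashes the routing object when the key supplies one, otherwise the key itself. */
    static int partitionFor(Object key, int partitionCount) {
        Object routed = (key instanceof RoutedKey) ? ((RoutedKey) key).routingObject() : key;
        return Math.floorMod(routed.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 13;
        System.out.println("customer C42 -> partition " + partitionFor("C42", partitions));
        for (int addressId = 1; addressId <= 3; addressId++) {
            AddressKey key = new AddressKey("C42", addressId);
            System.out.println("address " + addressId + " -> partition " + partitionFor(key, partitions));
        }
    }
}
```

Because the address key hands back the same string that is used as the customer key, both hash to the same value and therefore to the same partition, whatever the address id is.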

    This will massively improve performance, as the application can now fetch a customer AND all addresses in a single operation using a very simple Agent call. Our previous example needed 11 RPCs to accomplish this. Now we only use one, and WXS will outperform the database easily.

    Typically, agents are written for all operations for the highest speed. These agents can be thought of as stored procedures that fetch or manipulate customer data in some particular way using a single RPC. This also improves application performance enormously.
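To make the 11-versus-1 comparison concrete, here is a toy model in which one object stands in for a remote partition and counts round trips. All names (`AgentSketch`, `Partition`, `callAgent`, `fetchAddresses`) are hypothetical, and the customer entry is reduced to its list of address keys to keep the sketch short. It does not use the WXS agent API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model: one object stands in for a remote partition and counts round trips.
class AgentSketch {

    static final class Partition {
        final Map<String, List<String>> addressKeysByCustomer = new HashMap<>();
        final Map<String, String> addresses = new HashMap<>();
        int remoteCalls = 0;

        /** Client-side get of the customer entry (reduced here to its address keys). */
        List<String> getAddressKeys(String customerId) {
            remoteCalls++;
            return addressKeysByCustomer.getOrDefault(customerId, Collections.emptyList());
        }

        /** Client-side get of a single address entry. */
        String getAddress(String addressKey) {
            remoteCalls++;
            return addresses.get(addressKey);
        }

        /** One round trip for logic that then runs next to the data. */
        <R> R callAgent(Function<Partition, R> agent) {
            remoteCalls++;
            return agent.apply(this);
        }
    }

    /** Agent body: runs inside the partition, so these lookups are local. */
    static List<String> fetchAddresses(Partition local, String customerId) {
        List<String> result = new ArrayList<>();
        for (String key : local.addressKeysByCustomer.getOrDefault(customerId, Collections.emptyList())) {
            result.add(local.addresses.get(key));
        }
        return result;
    }

    /** Builds a partition holding one customer with the given number of addresses. */
    static Partition sample(String customerId, int addressCount) {
        Partition p = new Partition();
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= addressCount; i++) {
            String key = customerId + "#" + i;
            keys.add(key);
            p.addresses.put(key, "address " + i);
        }
        p.addressKeysByCustomer.put(customerId, keys);
        return p;
    }

    public static void main(String[] args) {
        Partition viaGets = sample("C42", 10);
        for (String key : viaGets.getAddressKeys("C42")) {
            viaGets.getAddress(key);
        }
        System.out.println("round trips with gets:  " + viaGets.remoteCalls);

        Partition viaAgent = sample("C42", 10);
        viaAgent.callAgent(local -> fetchAddresses(local, "C42"));
        System.out.println("round trips with agent: " + viaAgent.remoteCalls);
    }
}
```

The agent version only works as a single call because the customer and its addresses are collocated; without that, the agent itself would have to reach into other partitions.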

    The Customer object will also typically have a Collection of Address keys. This collection allows the associated address objects to be retrieved efficiently, rather than using an index to do the same thing. Both cost memory, so I will typically use a Collection in the Customer object, but then I have to make sure that the agents that add/remove Addresses also maintain this list in the associated Customer object. This is extra work but, performance wise, it's usually very cheap to do. The preloading code clearly also needs to make sure the Collection is set correctly.
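A minimal sketch of that bookkeeping follows, assuming the two maps are visible side by side inside one partition. The class and method names are hypothetical, the composite key is flattened to a string for brevity, and in a real grid both updates would run inside one agent call and one transaction.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the partition-local work an add/remove agent would do.
class AddressMaintenanceSketch {

    static final class Customer {
        final String id;
        final Set<Integer> addressIds = new LinkedHashSet<>();

        Customer(String id) {
            this.id = id;
        }
    }

    /** The two maps as seen from inside one partition. */
    static final class Partition {
        final Map<String, Customer> customers = new HashMap<>();
        final Map<String, String> addresses = new HashMap<>();
    }

    /** Flattened composite key, used only to keep the sketch short. */
    static String addressKey(String customerId, int addressId) {
        return customerId + "#" + addressId;
    }

    /** Inserts the address entry and records its id on the owning customer. */
    static void addAddress(Partition local, String customerId, int addressId, String address) {
        Customer customer = local.customers.get(customerId);
        if (customer == null) {
            throw new IllegalArgumentException("unknown customer " + customerId);
        }
        local.addresses.put(addressKey(customerId, addressId), address);
        customer.addressIds.add(addressId);
    }

    /** Removes the address entry and drops its id from the owning customer. */
    static void removeAddress(Partition local, String customerId, int addressId) {
        local.addresses.remove(addressKey(customerId, addressId));
        Customer customer = local.customers.get(customerId);
        if (customer != null) {
            customer.addressIds.remove(addressId);
        }
    }

    public static void main(String[] args) {
        Partition local = new Partition();
        local.customers.put("C42", new Customer("C42"));
        addAddress(local, "C42", 1, "1 Main Street");
        addAddress(local, "C42", 2, "2 Side Street");
        removeAddress(local, "C42", 1);
        System.out.println("address ids on customer: " + local.customers.get("C42").addressIds);
    }
}
```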

    Multiple Maps means multiple Loaders

    Sometimes, the application will decide to keep a single Map where the value includes the Customer and a Collection of Address objects. If the data is read only as far as WXS is concerned then this is a workable solution. It's fast: you can fetch the customer and addresses in a single operation without resorting to agents. But if a Loader is used then the Loader is more complex. It has to figure out how to apply the List of Addresses in the customer object to the Address table in the database. Typically, this means reading all the addresses from the DBMS, comparing them to the list in memory and then issuing insert, delete and update SQL statements to account for any differences. This is expensive. WXS is unable to track the address changes because you are storing everything in a single Map and WXS does change detection map by map.
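The compare step that a single-Map Loader is forced into might look like the sketch below. It assumes addresses are identified by an integer id and compared by value; `AddressDiffSketch` and `diff` are hypothetical names. The `inDatabase` argument is the part that costs an extra query per customer before any write can happen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Sketch of the reconciliation a single-Map Loader has to do for one customer.
class AddressDiffSketch {

    /** Address ids that need an INSERT, UPDATE or DELETE, each in ascending order. */
    static final class Diff {
        final List<Integer> inserts = new ArrayList<>();
        final List<Integer> updates = new ArrayList<>();
        final List<Integer> deletes = new ArrayList<>();
    }

    /**
     * Compares the rows read back from the database with the addresses held in
     * the customer object. Both maps are keyed by address id.
     */
    static Diff diff(Map<Integer, String> inDatabase, Map<Integer, String> inMemory) {
        Diff result = new Diff();
        for (Map.Entry<Integer, String> entry : new TreeMap<>(inMemory).entrySet()) {
            Integer id = entry.getKey();
            if (!inDatabase.containsKey(id)) {
                result.inserts.add(id);
            } else if (!Objects.equals(entry.getValue(), inDatabase.get(id))) {
                result.updates.add(id);
            }
        }
        for (Integer id : new TreeMap<>(inDatabase).keySet()) {
            if (!inMemory.containsKey(id)) {
                result.deletes.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> inDatabase = new TreeMap<>();
        inDatabase.put(1, "1 Main Street");
        inDatabase.put(2, "2 Side Street");

        Map<Integer, String> inMemory = new TreeMap<>();
        inMemory.put(2, "2 Side Street, Apt 4");
        inMemory.put(3, "3 New Road");

        Diff d = diff(inDatabase, inMemory);
        System.out.println("insert " + d.inserts + ", update " + d.updates + ", delete " + d.deletes);
    }
}
```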


    When a Loader is used, a better approach is to use separate maps as described above. Each Map gets its own Loader. WXS can now track changes to the Address Map and instruct the Loader to insert, update or delete Addresses, typically without needing to run a query first. This is simpler to code and more efficient in terms of SQL.

    The single Map approach is still more expensive even if write behind is enabled. The SQL is still complex.

    Preloading the Maps ONE by ONE is more efficient

    As explained earlier, trying to fetch the objects using a single query with joins and group/order bys is almost always too expensive in practice. Splitting the object into separate maps allows the following approach to be used.

    First, preload the customer map. Do a simple SELECT * FROM Customer in blocks and use putAll to write the Customer objects to the Customer Map with empty Address collections. This is typically very fast.

    Next, preload the address map. Do a simple SELECT * FROM Address in blocks. There is no guarantee that you'll get the Addresses for one customer at a time, but that's ok. We will write an agent that takes a block of addresses and splits it into a block for each partition. Then, for each partition, the agent merges the new addresses for a particular customer into the existing set by updating the Collection of address keys in the customer as well as inserting the Address objects themselves.

    This is much faster than before as the SQL query is very efficient and we still bulk add the data

    into the grid. The key is using the merge Agent to add the unordered addresses to the existing

    customers efficiently.
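The two passes can be sketched end to end with plain collections standing in for the grid. Everything here is a toy model under stated assumptions: the class and method names are hypothetical, the hash-modulo routing is an assumption, address keys are flattened to strings, and rows whose customer was never preloaded are simply skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the two-pass preload; plain collections stand in for the grid.
class MergePreloadSketch {

    static final class AddressRow {
        final String customerId;
        final int addressId;
        final String text;

        AddressRow(String customerId, int addressId, String text) {
            this.customerId = customerId;
            this.addressId = addressId;
            this.text = text;
        }
    }

    static final class Customer {
        final String id;
        final List<Integer> addressIds = new ArrayList<>();

        Customer(String id) {
            this.id = id;
        }
    }

    static final class Partition {
        final Map<String, Customer> customers = new HashMap<>();
        final Map<String, String> addresses = new HashMap<>();
    }

    static List<Partition> newGrid(int partitions) {
        List<Partition> grid = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            grid.add(new Partition());
        }
        return grid;
    }

    /** Assumed routing rule: both maps route on the customer id. */
    static int partitionFor(String customerId, int partitionCount) {
        return Math.floorMod(customerId.hashCode(), partitionCount);
    }

    /** Pass 1: store customers with empty address collections. */
    static void preloadCustomers(List<Partition> grid, List<String> customerIds) {
        for (String id : customerIds) {
            grid.get(partitionFor(id, grid.size())).customers.put(id, new Customer(id));
        }
    }

    /** Client side of pass 2: group one unordered block of rows by target partition. */
    static Map<Integer, List<AddressRow>> splitByPartition(List<AddressRow> block, int partitionCount) {
        Map<Integer, List<AddressRow>> byPartition = new HashMap<>();
        for (AddressRow row : block) {
            int partition = partitionFor(row.customerId, partitionCount);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(row);
        }
        return byPartition;
    }

    /** Agent side of pass 2: merge rows into the partition-local maps. */
    static void merge(Partition local, List<AddressRow> rows) {
        for (AddressRow row : rows) {
            Customer customer = local.customers.get(row.customerId);
            if (customer == null) {
                continue; // assumption: rows without a preloaded customer are skipped
            }
            local.addresses.put(row.customerId + "#" + row.addressId, row.text);
            customer.addressIds.add(row.addressId);
        }
    }

    /** Sends each per-partition sub-block to its partition, one call per partition. */
    static void preloadAddressBlock(List<Partition> grid, List<AddressRow> block) {
        for (Map.Entry<Integer, List<AddressRow>> e : splitByPartition(block, grid.size()).entrySet()) {
            merge(grid.get(e.getKey()), e.getValue());
        }
    }

    public static void main(String[] args) {
        List<Partition> grid = newGrid(4);
        List<String> customerIds = new ArrayList<>();
        customerIds.add("A");
        customerIds.add("B");
        preloadCustomers(grid, customerIds);

        // Rows arrive in table order, not grouped by customer.
        List<AddressRow> block = new ArrayList<>();
        block.add(new AddressRow("A", 1, "a1"));
        block.add(new AddressRow("B", 1, "b1"));
        block.add(new AddressRow("A", 2, "a2"));
        preloadAddressBlock(grid, block);

        Customer a = grid.get(partitionFor("A", grid.size())).customers.get("A");
        System.out.println("customer A address ids: " + a.addressIds);
    }
}
```

The database only ever sees two plain table scans. The grouping that a join would have done happens in `merge`, one partition at a time.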

    A great way to think about this is that we're moving the join operations out of the database and into the grid.

    Conclusion

    Following the steps outlined in this article solves 99% of all the preloading performance issues that I've seen in customer situations. To summarize:

    1. Use wxsutils for putAll or bulk implementations.
    2. Map per table.
    3. Collocate related map entries using PartitionableKey.
    4. Preload a table at a time.
    5. Then, use a merge agent to preload child tables.
    6. Try preloading the master table using parallel chunks. Multi-thread it so that you use N threads, with each of the threads fetching an exclusive range of records from the database.
    7. Try preloading the child tables also using parallel chunks.
    8. DO NOT try to fetch the child data at the same time as the master data from the DBMS.
    9. Join data in the grid, not in the database.
    10. Duplicate reference data in every partition.
    11. Write agents to manipulate the main data using a single RPC, for reads, puts and deletes.
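Items 6 and 7 can be sketched as follows, assuming the table has a numeric key whose minimum and maximum are known up front. `ParallelChunkSketch`, `split`, `loadRange` and `loadAll` are hypothetical names, and `loadRange` is only a placeholder for a range-bounded SELECT followed by a bulk put into the grid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of preloading one table with N threads over exclusive key ranges.
class ParallelChunkSketch {

    static final class Range {
        final long fromInclusive;
        final long toExclusive;

        Range(long fromInclusive, long toExclusive) {
            this.fromInclusive = fromInclusive;
            this.toExclusive = toExclusive;
        }
    }

    /** Splits [minId, maxId] into at most n contiguous, non-overlapping ranges. */
    static List<Range> split(long minId, long maxId, int n) {
        if (n < 1 || maxId < minId) {
            throw new IllegalArgumentException("need n >= 1 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long chunk = (total + n - 1) / n; // ceiling division, so no ids are left over
        List<Range> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += chunk) {
            ranges.add(new Range(start, Math.min(start + chunk, maxId + 1)));
        }
        return ranges;
    }

    /** Placeholder for "SELECT ... WHERE id >= ? AND id < ?" followed by a bulk put. */
    static long loadRange(Range range) {
        return range.toExclusive - range.fromInclusive;
    }

    /** Loads every range on its own worker thread and returns the number of ids covered. */
    static long loadAll(long minId, long maxId, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (Range range : split(minId, maxId, threads)) {
                Callable<Long> task = () -> loadRange(range);
                pending.add(pool.submit(task));
            }
            long loaded = 0;
            for (Future<Long> future : pending) {
                loaded += future.get();
            }
            return loaded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("preload interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed to load", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("ids covered: " + loadAll(1, 1_000_000, 8));
    }
}
```

Equal-width ranges only balance the work when ids are spread fairly evenly. With large gaps in the key space, ranges based on row counts would be a better fit.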


    Following these techniques will ensure you preload data from a DBMS as efficiently as possible

    with the least load on the DBMS.
