Efficiently preload large quantities of data in to IBM WebSphere eXtreme Scale
Preloading data from a database into IBM WebSphere eXtreme Scale
The first wrong way
The second wrong way
The third wrong way
The right way
Duplicate reference data in EVERY partition
Collocating the master and child objects in a single partition
Multiple Maps means multiple Loaders
Preloading the Maps ONE by ONE is more efficient
Conclusion
Author: Billy Newport April 10, 2011 Page: 1/2
Preloading data from a database into IBM WebSphere eXtreme Scale
This is one of the most common use cases. The customer wants to place an IBM WebSphere
eXtreme Scale (WXS) grid in front of a database to allow better scaling. Usually, the data
involves several tables and the customer wants to use several WXS maps with relationships
between them.
The first wrong way
Let's say we have two tables in the database: a customer table and a customer address table. A
customer can have many addresses. We decide to use the JPA built-in Loader to create a
Customer entity and an Address entity with a one-to-many relationship between them.
The customer uses a single WXS Map for the Customer entity POJO. The key is the customer
key and the value is an object graph consisting of one Customer POJO that has a collection of
Address objects.
The customer then tries to write a preloader application. This runs one or more JPA queries
to fetch the Customer/Address graphs using the JPA implementation. This usually requires the
database to execute a SELECT statement with a join in which the Addresses are
grouped together for a single customer. This query is usually very expensive. The preload is
slow and the bottleneck is the database. The customer is loading the Customer graphs into
the grid in parallel using agents, but the process takes a very long time because of the database.
The second wrong way
Convinced that the JPA implementation is the problem, the customer writes a JDBC Loader to do the
same thing. Ultimately, executing the SELECT query with the GROUP BY is still
the major bottleneck, so this suffers from more or less the same issue as the first wrong way.
The third wrong way
Here, the graph is split into two maps: a customer map and an address map. The
customer map has the customer key and the value is the Customer POJO. The Address map
has a key that contains the customer key and the value is one address. Loaders are written using
JPA or JDBC to fetch and store data in these maps independently. Preloading the grid here can
be made efficient, as a simple table scan is sufficient to fetch all rows from both tables and store
them in the grid using a bulk putAll (wxsutils). However, a major performance problem is
discovered when accessing the data in the grid after the preload. Some of the customers have
several addresses, and the customer and address objects for the same customer are stored in
different partitions. The client application usually executes a get for each customer and each
address, which results in a lot of gets for each logical operation. If a customer has 10 addresses
then 11 get operations are required to read a single customer. The team will likely state that the
grid is slower than the database, as fetching the data from the DBMS can be done with a single RPC
and it's hard to do 11 RPCs faster than one.
The right way
The above approaches are flawed in two ways:
1. Preloading the data using SELECT statements that do joins, GROUP BYs or ORDER BYs does
not scale. It's too slow and the DBMS will struggle to execute this SQL query,
especially when there are a lot of rows involved. I've seen use cases with 1.6 billion
rows in the child table. I don't care which database you have, sorting 1.6 billion rows
is expensive.
2. The master and child objects are not stored together in a single partition. This means
that working with the customer and the associated addresses is very expensive, as many
RPCs are needed both to read the data and to write changes to the grid.
We will now examine what needs to be done to solve these problems.
Duplicate reference data in EVERY partition
Let's suppose our Address objects have a reference to a State. The state map is typically very
small, around 50 entries for the USA. We should make a Map for the states and then preload it
with the same data in EVERY partition. This is readily done as follows. First, run a query to
fetch all the states from the database. Next, write an Agent that, when executed on every
partition, inserts the states into a Map in that partition. This means that later, if business logic
requires data stored in the State map for a particular address or customer, it's instantly
available in the same partition. Clients will not be able to access the state map directly
because such a map is not routable: the key to the Map does not determine which partition the
entry will be found in.
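As a rough illustration of the idea, the sketch below simulates a grid with plain Java collections rather than the WXS API. The class ReferenceDataSketch, its fields and its method names are all hypothetical. It only shows the shape of the technique: the same reference set is copied into a local map in every partition, so code running inside any partition can read it without a remote call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a grid: each partition holds its own local State map.
// None of these names come from the WXS API.
class ReferenceDataSketch {
    // One local map per partition (state code -> state name).
    final List<Map<String, String>> statePerPartition = new ArrayList<>();

    ReferenceDataSketch(int numPartitions) {
        for (int p = 0; p < numPartitions; p++) {
            statePerPartition.add(new HashMap<>());
        }
    }

    // Plays the role of the agent: the same logic runs once per partition
    // and copies the full reference set into that partition's local map.
    void preloadStates(Map<String, String> statesFromDatabase) {
        for (Map<String, String> localStateMap : statePerPartition) {
            localStateMap.putAll(statesFromDatabase);
        }
    }

    // Business logic running inside a partition reads its local copy.
    String stateName(int partition, String stateCode) {
        return statePerPartition.get(partition).get(stateCode);
    }
}
```

The cost is one copy of the reference data per partition, which is acceptable only because the data set is tiny and rarely changes.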
Collocating the master and child objects in a single partition
This is absolutely HUGE from a performance point of view. The best news is that it's extremely
easy to do. A WXS Map has a key and a value. The customer Map has a key, let's say it's a
customer id. The address Map has a key and a value also. Its key is a composite POJO that
includes the customer key, the customer id.
So, the two maps look like:
Customer: Map<String, Customer>
Address: Map<AddressKey, Address>
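import java.io.Serializable;
import java.util.Objects;

class AddressKey implements Serializable
{
    String customerId;
    int addressId;
    public int hashCode() { return Objects.hash(customerId, addressId); }
    public boolean equals(Object o) {
        if (!(o instanceof AddressKey)) return false;
        AddressKey k = (AddressKey) o;
        return addressId == k.addressId && Objects.equals(customerId, k.customerId);
    }
}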
With this key, the address entries for a single customer will be spread across many partitions. We
need to modify the application so that WXS stores ALL address entries in the same partition as
the one used for the master customer entry.
class AddressKey implements Serializable, PartitionableKey
{
    String customerId;
    int addressId;
    // hashCode() and equals() are unchanged from the first version

    // WXS hashes the returned object, not the whole key, to pick the partition
    public Object ibmGetPartition()
    {
        return customerId;
    }
}
This version of the key accomplishes that with very little work. When WXS needs to calculate
the partition for a specific key, it typically just calls the hashCode method and then works out
the partition from that. The initial version of AddressKey would almost always produce a different
hashCode than the one for the Customer key. The new version implements the PartitionableKey
interface. If WXS sees a key that implements this interface then it instead uses the hashCode
of the object returned from the interface's ibmGetPartition method on that key. Our improved
AddressKey returns the customerId string as the result. This guarantees that WXS will place the
Address entries for a specific customer in the same partition as the associated parent Customer
entry, because it calculates the hashCode in exactly the same way. This should be done for all
Maps that are one-to-many children of a master map. The child keys should all employ the same
technique.
This massively improves performance, as the application can now fetch a customer AND all of its
addresses in a single operation using a very simple Agent call. Our previous example needed
11 RPCs to accomplish this. Now we only use one, and WXS can easily outperform the
database for this operation.
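The routing behaviour described above can be sketched without the WXS libraries. In the example below, RoutedKey is a hypothetical stand-in for PartitionableKey, and partitionFor assumes a simple "hash modulo partition count" rule. The exact calculation WXS uses internally is not given in this article, so treat that rule as an assumption made only to show why the child key lands next to its parent.

```java
import java.util.Objects;

class RoutingSketch {
    // Hypothetical stand-in for PartitionableKey; not the WXS interface itself.
    interface RoutedKey {
        Object routingValue();
    }

    static final class AddressKey implements RoutedKey {
        final String customerId;
        final int addressId;

        AddressKey(String customerId, int addressId) {
            this.customerId = customerId;
            this.addressId = addressId;
        }

        @Override public int hashCode() { return Objects.hash(customerId, addressId); }

        @Override public boolean equals(Object o) {
            if (!(o instanceof AddressKey)) return false;
            AddressKey other = (AddressKey) o;
            return addressId == other.addressId && customerId.equals(other.customerId);
        }

        // Route by the parent customer's id rather than by the whole key.
        @Override public Object routingValue() { return customerId; }
    }

    // Assumed routing rule: non-negative hash of the routing value, modulo the partition count.
    static int partitionFor(Object key, int numPartitions) {
        Object routed = (key instanceof RoutedKey) ? ((RoutedKey) key).routingValue() : key;
        return Math.floorMod(routed.hashCode(), numPartitions);
    }
}
```

Under this rule, partitionFor("C42", n) and partitionFor(new AddressKey("C42", anyId), n) always agree, because both end up hashing the same customerId string.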
Typically, agents are written for all operations to get the highest speed. These agents can be
thought of as stored procedures that fetch or manipulate customer data in some particular way
using a single RPC. This also improves application performance enormously.
The Customer object will also typically have a Collection of Address keys. This collection
allows the associated address objects to be retrieved efficiently rather than using an index to do
the same thing. Both cost memory, so typically I use a Collection in the Customer
object, but then I have to make sure that the agents that add or remove Addresses also maintain
this list in the associated Customer object. This is extra work but, performance-wise, it's usually
very cheap to do. The preloading code clearly needs to make sure the Collection is set
correctly as well.
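A minimal sketch of that bookkeeping follows, using plain Java maps to stand in for the two maps inside one partition. CustomerPartitionSketch and its methods are hypothetical names, and the "customerId#addressId" string key is a simplification of the composite key. The point is only that the add and remove operations touch both the Address entry and the key collection in the Customer, so a later read can follow the collection instead of scanning or indexing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of one partition holding both maps for the same customers.
class CustomerPartitionSketch {
    static final class Customer {
        final String id;
        final Set<Integer> addressIds = new LinkedHashSet<>(); // keys of the child entries
        Customer(String id) { this.id = id; }
    }

    final Map<String, Customer> customers = new HashMap<>();
    final Map<String, String> addresses = new HashMap<>(); // simplified composite key -> address

    static String addressKey(String customerId, int addressId) {
        return customerId + "#" + addressId;
    }

    void addCustomer(String customerId) {
        customers.put(customerId, new Customer(customerId));
    }

    // Add agent: insert the Address entry and record its key in the Customer.
    void addAddress(String customerId, int addressId, String address) {
        Customer c = customers.get(customerId);
        if (c == null) throw new IllegalArgumentException("unknown customer " + customerId);
        addresses.put(addressKey(customerId, addressId), address);
        c.addressIds.add(addressId);
    }

    // Remove agent: delete the Address entry and drop its key from the Customer.
    void removeAddress(String customerId, int addressId) {
        Customer c = customers.get(customerId);
        if (c == null) return;
        addresses.remove(addressKey(customerId, addressId));
        c.addressIds.remove(addressId);
    }

    // Read agent: follow the key collection, no index or scan needed.
    List<String> addressesOf(String customerId) {
        List<String> result = new ArrayList<>();
        Customer c = customers.get(customerId);
        if (c == null) return result;
        for (int addressId : c.addressIds) {
            result.add(addresses.get(addressKey(customerId, addressId)));
        }
        return result;
    }
}
```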
Multiple Maps means multiple Loaders
Sometimes, the application will decide to keep a single Map whose value includes the
Customer and a Collection of Address objects. If the data is read-only as far as WXS is
concerned then this is a workable solution. It's fast: you can fetch the customer and addresses
in a single operation without resorting to agents. But if a Loader is used then the Loader is more
complex. It has to figure out how to apply the List of Addresses in the customer object to the
Address table in the database. Typically, this means reading all the addresses from the DBMS,
comparing them to the list in memory and then issuing insert, delete and update SQL statements
to account for any differences. This is expensive. WXS is unable to track the address changes
because you are storing everything in a single Map and WXS does change detection map by map.
When a Loader is used, a better approach is to use separate maps as described above. Each
Map gets its own Loader. WXS can now track changes to the Address Map and instruct the
Loader to insert, update or delete Addresses, typically without needing to query the database
first. This is simpler to code and more efficient in terms of SQL.
The single Map approach is still more expensive even if write-behind is enabled. The SQL is still
complex.
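To make the extra work of the single-Map Loader concrete, here is a hedged sketch of the reconciliation it has to perform for every changed customer. AddressDiffSketch and reconcile are hypothetical names and addresses are reduced to an id and a string. A real Loader would read the database side with a query and then turn each list into SQL statements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

class AddressDiffSketch {
    // Address ids that need an INSERT, UPDATE or DELETE to bring the table in line.
    static final class Diff {
        final List<Integer> inserts = new ArrayList<>();
        final List<Integer> updates = new ArrayList<>();
        final List<Integer> deletes = new ArrayList<>();
    }

    // Both maps are addressId -> address text for ONE customer.
    // inDatabase is what a SELECT just returned; inMemory is the list held in the Customer value.
    static Diff reconcile(Map<Integer, String> inDatabase, Map<Integer, String> inMemory) {
        Diff diff = new Diff();
        for (Map.Entry<Integer, String> e : inMemory.entrySet()) {
            if (!inDatabase.containsKey(e.getKey())) {
                diff.inserts.add(e.getKey());            // new in memory
            } else if (!Objects.equals(inDatabase.get(e.getKey()), e.getValue())) {
                diff.updates.add(e.getKey());            // changed in memory
            }
        }
        for (Integer id : inDatabase.keySet()) {
            if (!inMemory.containsKey(id)) {
                diff.deletes.add(id);                    // removed in memory
            }
        }
        return diff;
    }
}
```

With one Map per table none of this is needed, since the grid already knows which individual Address entries were inserted, updated or deleted.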
Preloading the Maps ONE by ONE is more efficient
As explained earlier, trying to fetch the objects using a single query with joins and GROUP BYs or
ORDER BYs is almost always too expensive in practice. Splitting the object into separate maps
allows the following approach to be used.
First, preload the customer map. Do a simple SELECT * FROM Customer in blocks and use
putAll to write the Customer objects to the Customer Map with empty Address collections.
This is typically very fast.
Next, preload the address map. Do a simple SELECT * FROM Address in blocks. There is no
guarantee that you'll get the Addresses for one customer at a time, but that's OK. We write
an agent that takes a block of addresses and splits it into a block for each partition. Then, for
each partition, the agent merges the new addresses for a particular customer into the existing
set by updating the Collection of address keys in the customer as well as inserting the Address
objects themselves.
This is much faster than before, as the SQL query is very efficient and we still bulk add the data
into the grid. The key is using the merge Agent to add the unordered addresses to the existing
customers efficiently.
A great way to think about this is that we're moving the join operations out of the
database and into the grid.
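The two-pass preload can be sketched as follows. This is a simulation with in-memory collections, not WXS code: MergePreloadSketch, Partition, merge and the modulo-based partitionFor are all hypothetical. What it does show is the order of operations: customers first with empty collections, then unordered address blocks that are grouped by partition and merged in place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MergePreloadSketch {
    static final class Address {
        final String customerId; final int addressId; final String text;
        Address(String customerId, int addressId, String text) {
            this.customerId = customerId; this.addressId = addressId; this.text = text;
        }
    }

    static final class Customer {
        final String id;
        final Set<Integer> addressIds = new LinkedHashSet<>();
        Customer(String id) { this.id = id; }
    }

    static final class Partition {
        final Map<String, Customer> customers = new HashMap<>();
        final Map<String, Address> addresses = new HashMap<>();

        // The merge step: runs next to the data and needs no ordering of the block.
        void merge(List<Address> block) {
            for (Address a : block) {
                Customer c = customers.get(a.customerId);
                if (c == null) continue; // orphan row; a real preloader would report it
                addresses.put(a.customerId + "#" + a.addressId, a);
                c.addressIds.add(a.addressId);
            }
        }
    }

    final Partition[] partitions;

    MergePreloadSketch(int numPartitions) {
        partitions = new Partition[numPartitions];
        for (int i = 0; i < numPartitions; i++) partitions[i] = new Partition();
    }

    // Assumed routing rule, for illustration only.
    int partitionFor(String customerId) {
        return Math.floorMod(customerId.hashCode(), partitions.length);
    }

    // Pass 1: customers with empty address collections.
    void loadCustomers(List<String> customerIds) {
        for (String id : customerIds) {
            partitions[partitionFor(id)].customers.put(id, new Customer(id));
        }
    }

    // Pass 2: split one unordered block by partition, then merge each sub-block.
    void loadAddressBlock(List<Address> block) {
        Map<Integer, List<Address>> byPartition = new HashMap<>();
        for (Address a : block) {
            byPartition.computeIfAbsent(partitionFor(a.customerId), p -> new ArrayList<>()).add(a);
        }
        for (Map.Entry<Integer, List<Address>> e : byPartition.entrySet()) {
            partitions[e.getKey()].merge(e.getValue());
        }
    }
}
```

Each call to loadAddressBlock costs at most one merge call per partition, no matter how the rows of the block are interleaved between customers.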
Conclusion
Following the steps outlined in this article solves 99% of the preloading performance issues
that I've seen in customer situations. To summarize:
1. Use wxsutils for putAll or bulk implementations.
2. Use one Map per table.
3. Collocate related map entries using PartitionableKey.
4. Preload one table at a time.
5. Then use a merge agent to preload child tables.
6. Try preloading the master table using parallel chunks.
Multi-thread it so that you use N threads, with each thread fetching an exclusive
range of records from the database.
7. Try preloading the child tables using parallel chunks as well.
8. DO NOT try to fetch the child data at the same time as the master data from the
DBMS.
9. Join data in the grid, not in the database.
10. Duplicate reference data in every partition.
11. Write agents to manipulate the main data using a single RPC, for reads, puts and
deletes.
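The parallel-chunk steps in the list above can be sketched like this. ParallelRangeSketch, splitRange and RangeLoader are hypothetical names. The sketch assumes the table has a numeric id column whose minimum and maximum are known, and it leaves the actual "SELECT ... WHERE id BETWEEN ? AND ?" plus putAll work to the loader callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeSketch {
    // Loads all rows with fromId <= id <= toId and returns how many were loaded.
    interface RangeLoader {
        long load(long fromId, long toId) throws Exception;
    }

    // Split [minId, maxId] into at most n contiguous, non-overlapping {from, to} ranges.
    static List<long[]> splitRange(long minId, long maxId, int n) {
        List<long[]> ranges = new ArrayList<>();
        long total = maxId - minId + 1;
        long base = total / n;
        long extra = total % n;
        long from = minId;
        for (int i = 0; i < n; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) continue;
            ranges.add(new long[] { from, from + size - 1 });
            from += size;
        }
        return ranges;
    }

    // Run one loader call per range on its own thread and sum the loaded row counts.
    static long preload(long minId, long maxId, int threads, RangeLoader loader) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] range : splitRange(minId, maxId, threads)) {
                Callable<Long> task = () -> loader.load(range[0], range[1]);
                results.add(pool.submit(task));
            }
            long loaded = 0;
            for (Future<Long> result : results) {
                loaded += result.get(); // rethrows any loader failure
            }
            return loaded;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the ranges never overlap, no two threads fetch the same row, and the thread count can be tuned to whatever the DBMS tolerates.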
Following these techniques will ensure you preload data from a DBMS as efficiently as possible
with the least load on the DBMS.