
Two Generations of Client / Server Performance Testing

Steven J. Oubre MedicaLogic, Inc.

20500 NW Evergreen Parkway Hillsboro, OR 97124

phone: (503) 531-7000, fax: (503) 531-7134 email: [email protected]


Introduction

MedicaLogic, Inc. is a software development company that specializes in Electronic Medical Records (EMR) software products for the medical clinic environment. The Windows version of this product is called Logician.

Because Logician is designed to work with an Oracle database in a client/server configuration, a network and server are required in addition to PC workstations.

The Problem

MedicaLogic's customers have varying numbers of users. As more users run Logician, a faster server and a network with sufficient bandwidth are required.

Another complicating factor is that clinics typically have a long lead-time for purchasing hardware and software. They may need to know what to purchase six months to a year in advance. So it was important for MedicaLogic to produce configuration data prior to a given release of the product.

The Solution

In the interest of customer satisfaction, MedicaLogic saw a need to invest time and money in determining server, workstation, and network requirements for various numbers of users of its products. This investment turned into what has been called Scalability testing.

First Generation Scalability Testing

The first requirement was to put together a Scalability Lab with an appropriate amount and configuration of hardware and software. At that time, MedicaLogic's customers and potential customers were not very large, and we thought that the ability to scale to around 100 users would be sufficient.

We didn't want to equip our lab with 100 PC workstations. This would have been prohibitive, both in cost and in space requirements. So we decided that if we could automate the execution of Logician at a faster rate than a human would normally use it, we would be sending and receiving data across the network to the server at a level equivalent to some number of users greater than the number of computers in our lab.
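The paper's test harness is not reproduced here, so the following is only a back-of-the-envelope sketch, in C, of the reasoning above. The 2.5x pacing factor is an assumption, chosen only because it is consistent with the 24 lab PCs and the roughly 60 simulated users reported under The Results below.

    /*
     * Illustrative sketch (not from the original paper): estimating how many
     * "effective" users a fixed number of lab PCs can represent when scripts
     * run faster than a human operator would work.  The 2.5x speed-up factor
     * is an assumption consistent with the 24 PCs and ~60 simulated users
     * reported later in this paper.
     */
    #include <stdio.h>

    int main(void)
    {
        const int    lab_pcs = 24;    /* physical workstations in the lab */
        const double speedup = 2.5;   /* assumed script-vs-human pace ratio */

        /* Each PC generates the client/server traffic of `speedup` users. */
        double effective_users = lab_pcs * speedup;

        printf("%d PCs at %.1fx human pace ~= %.0f simulated users\n",
               lab_pcs, speedup, effective_users);
        return 0;
    }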

The Lab

Through some fairly extensive research and analysis, we decided to equip our test lab with twenty-four PCs and three Intel-based servers. The technology available at the time limited us to mostly Intel 486-based systems, with one 90 MHz Pentium server and five 75 MHz Pentium PCs. Lab furniture, a patch panel, hubs, and wiring were purchased and installed. The first version of MedicaLogic's lab used a 10Base-T Ethernet network. Two servers ran Novell NetWare 4.1 and one server ran Microsoft NT Server 3.51. The PCs ran Microsoft Windows 3.1.

The Test Environment

Initial investigation of software test automation products indicated that Microsoft Test would be able to automate the execution of Logician. Not only that, but Microsoft Test allowed execution of tests on any number of computers without any additional cost. It also had the ability to control other PCs from one PC. We felt that this feature would allow us centralized control over all twenty-four PCs.

Scripts were written that made Logician behave as if various clinic personnel were using it. Wait times were also incorporated to make the behavior as similar as possible to that of real clinic personnel.
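The original Microsoft Test scripts are not included in the paper, so the sketch below, written in C rather than Microsoft Test's scripting language, only illustrates the think-time idea described above: pause for a randomized interval between scripted actions so that a simulated user paces itself somewhat like clinic staff. The action names and the 5-15 second range are illustrative assumptions.

    /*
     * Minimal sketch of randomized "think time" between scripted user
     * actions.  Not the original Microsoft Test script; the action names
     * and the 5-15 second delay range are illustrative assumptions.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>   /* POSIX sleep(); on Windows one would use Sleep() */

    /* Stand-in for driving one step of the application under test. */
    static void perform_action(const char *action)
    {
        printf("performing: %s\n", action);
    }

    /* Pause for a random number of seconds in [min_s, max_s]. */
    static void think_time(int min_s, int max_s)
    {
        int pause = min_s + rand() % (max_s - min_s + 1);
        sleep((unsigned int)pause);
    }

    int main(void)
    {
        const char *workflow[] = { "open patient chart", "enter vitals",
                                   "write prescription", "sign encounter" };
        srand((unsigned int)time(NULL));

        for (size_t i = 0; i < sizeof workflow / sizeof workflow[0]; i++) {
            perform_action(workflow[i]);
            think_time(5, 15);   /* assumed clinic-like pause between steps */
        }
        return 0;
    }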

Measurements of server performance were made with Microsoft Performance Monitor on the NT server and Novell's monitor program on the NetWare server. Response times were measured with another Microsoft Test script that recorded the time it took to perform various actions within Logician. Network performance was measured with Novell's LANalyzer.
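The timing script itself is not reproduced in the paper; the sketch below, again in C, only illustrates the measurement pattern described: record a wall-clock timestamp before and after an action and report the elapsed time. The timespec_get call is standard C11, and the action being timed is a hypothetical stand-in for a Logician operation.

    /*
     * Minimal sketch of response-time measurement: capture wall-clock time
     * before and after a scripted action and report the difference.
     */
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for a user-visible Logician operation. */
    static void open_patient_chart(void)
    {
        /* ... drive the application here ... */
    }

    static double elapsed_seconds(struct timespec start, struct timespec end)
    {
        return (double)(end.tv_sec - start.tv_sec)
             + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec start, end;

        timespec_get(&start, TIME_UTC);   /* C11 wall-clock timestamp */
        open_patient_chart();
        timespec_get(&end, TIME_UTC);

        printf("open patient chart: %.3f s\n", elapsed_seconds(start, end));
        return 0;
    }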

Problems Encountered

1. Automating in this fashion does not take into account multiple workstations running Oracle. Since Oracle uses memory differently as more users log on, testing Logician at a faster rate to simulate more users did not accurately show processor utilization for the equivalent number of users.

2. Logician uses non-standard objects in its graphical user interface. Because of this, it was very difficult to automate with Microsoft Test. Screen coordinates had to be used instead of object recognition.

3. At the time, Logician was undergoing massive changes, which caused the Microsoft Test scripts to fail whenever a new internal release was received.

4. Because MedicaLogic did not have very many customers who had been using Logician for any length of time, it was hard to judge how much data would be needed. As it turned out, too little data was used in the test database.

The Results

Our scalability testing with this system allowed simulation of only around 60 users. With this high-end number and some educated guesswork, we were able to extrapolate what size server would be needed for any number of users between 1 and 100.

Second Generation Scalability Testing

As MedicaLogic became a leader in Electronic Medical Records software development, it became clear that we needed to be able to demonstrate scalability to more than 100 users. Our goal then became to simulate up to 500 users.

However, with the automated system we had, we would have needed a minimum of 250 PCs to simulate 500 users. Again, this was not an option due to expense and space requirements.

We looked at a number of options, from load-testing tools to outside labs. Since we wanted to perform load testing frequently and build expertise in-house, we eliminated the outside-lab option; it would have been very expensive and time consuming over the long run.

After looking at several load-testing tools, we selected Compuware’s QALoad.

The Lab

Our lab was updated with extra memory and disk space for our QALoad Pentium and 486 PCs, and with a 100Base-T network including a 100Base-T switch. Additionally, PC manufacturers loaned servers to us.

The Test Environment

Compuware's QALoad tool consists of several parts:

1. A program to capture SQL (Structured Query Language) statements from our application to the Oracle server.

2. A program to convert the raw captured SQL to a “C” source file.

3. A “Player” program that runs the compiled “C” program as one or more simulated users.

4. A “Conductor” program that controls the operation of multiple “Player” systems.

We captured the SQL from Logician while performing the same steps that typically occur in a clinic during a patient encounter (including the roles of front desk staff, nurse, doctor, etc.).

After converting the raw SQL to a "C" source file, we modified it to make it more general and to accept data from a data pool. This way, each simulated user would have unique data to work with, thereby avoiding constraint errors from Oracle.
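QALoad's generated code and player API are not shown in the paper, so the following is only a rough sketch of the idea described above, in plain C: each virtual user draws a unique row from a shared data pool and substitutes it into the captured SQL so that concurrent inserts do not collide on unique constraints. The table, columns, and exec_sql() helper are made-up stand-ins, not QALoad or OCI calls.

    /*
     * Rough sketch of generalizing captured SQL with a data pool so that
     * each simulated user works with unique values.  exec_sql() and the
     * data-pool contents are illustrative stand-ins.
     */
    #include <stdio.h>

    /* Stand-in for the load tool's "send this SQL to Oracle" call. */
    static void exec_sql(const char *sql)
    {
        printf("SQL> %s\n", sql);
    }

    /* One record of test data per virtual user (the "data pool"). */
    struct pool_row {
        long        patient_id;
        const char *last_name;
    };

    static const struct pool_row data_pool[] = {
        { 100001, "ALPHA" },
        { 100002, "BRAVO" },
        { 100003, "CHARLIE" },
    };

    /* Script body executed by one virtual user. */
    static void virtual_user(int user_index)
    {
        char sql[256];
        const struct pool_row *row =
            &data_pool[user_index % (int)(sizeof data_pool / sizeof data_pool[0])];

        /* The captured INSERT, generalized: the hard-coded values recorded
         * during capture are replaced with values unique to this user. */
        snprintf(sql, sizeof sql,
                 "INSERT INTO patient (pid, last_name) VALUES (%ld, '%s')",
                 row->patient_id, row->last_name);
        exec_sql(sql);
    }

    int main(void)
    {
        for (int user = 0; user < 3; user++)   /* the Player would run these concurrently */
            virtual_user(user);
        return 0;
    }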

Intel’s LanDesk Server Manager was used to measure CPU loading on both a NetWare server and NT server.

The Results

Our Scalability testing with this system allowed simulation of around 400 users. This was more than the servers and operating systems (OS) we tested could support with our application and Oracle.

Since each simulated user actually logged in to Oracle, each was seen as a separate user on a separate system. This made our tests much closer to reality. In addition, we populated our test database with the amount and type of data that would typically be found in a clinic using Logician.

The chart below shows how CPU utilization varied over time with 300 Logician users on a Quad Pentium Pro NT server.

Problems Encountered

1. The 486 PCs weren't powerful enough to use, so we had to rent additional PCs for Scalability testing.

2. The “C” source code that was produced by the QALoad tool was too difficult to modify and maintain in a timely manner.

3. We use Oracle's OCI layer, rather than ODBC, for communicating with Oracle. Because of this and the concurrency capability of Logician, our load-testing tool, QALoad, does not exactly duplicate what our application does. We believe this problem would be encountered with any load-test tool currently available, so we decided to add functionality to our load-test tool to work around some of this deficiency.


Third (Current) Generation Scalability Testing

As MedicaLogic has grown and become better known, some of its current and potential customers need support for a larger number of Logician users. A number of larger sites require support for well over 1,000 active Logician users.

The Lab

To support over 1,000 users, the Scalability lab needed additional QALoad player systems with greater processing capability. Fortunately, current Pentium Pro and Pentium II dual-processor systems provide that capability. We have purchased a mix of these types of systems, each able to simulate several hundred users.

In addition to our 100 Mbit switched network, we have added 100 Mbit full-duplex hubs that allow us to create multiple subnets and distribute the network load.

The only remaining items required are the server systems. Current Intel-based servers cannot support this many Logician users. Therefore, we have purchased several UNIX-based server systems to provide this capability.

[Chart: "300 User; 4P6/200; NT 3.51; DTS; Full Data and Pictures" (full time range). X axis: Time, 15:11:30 to 17:11:39; Y axis: Value (CPU utilization), 0 to 90. Annotations: All Virtual Users Active; DTS Start - Scan Inbox; DTS - Begin Imports; DTS Finished. Average CPU utilization while DTS is active: 37.41%. Maximum CPU utilization: 85%.]


The Test Environment

The test tools will remain the same. However, we are developing new tools to help in modifying and maintaining the "C" source code. In addition, we are making the load test environment more representative of production, in terms of both software functionality and database content.


The Results

Scalability testing with this system revealed numerous impediments to achieving sustained loads much greater than 500 simultaneous users.

Problems Encountered

1. The current disk configuration became a bottleneck to database performance in terms of data read and write efficiency.

2. Network segment traffic started to saturate above 300 simultaneous users. This required us to install additional network cards in the server in order to segment the network into multiple subnets, which allowed us to distribute the network load and minimize network bottlenecks.

3. Both the Oracle database and Logician required better optimization to handle the increased user loads.

These issues affect not only in-house performance and scalability testing but also overall application performance for our customers. With this in mind, we will now discuss some of them.

Issues Affecting Performance in Larger Client / Server Environments

The Problem

The issue we face here is one of tradeoffs, more specifically cost vs. performance. One of MedicaLogic's goals is to provide our customers with recommendations that will give them the best performance at a reasonable price. With customers looking to increase the number of licensed users in their organizations to 1,000+ users in some cases, the identification and resolution of the performance issues a customer may encounter becomes increasingly important.

The Solution

There is no single solution that will resolve all the performance issues our customers will encounter. We can, however, make recommendations and product code changes, based on performance testing done at MedicaLogic, about local- and wide-area network (LAN/WAN) sizing, server and workstation configurations, application tuning at both the client and server levels, and so on. We will describe some of the more common (and not so common) issues we continue to encounter in performance testing at MedicaLogic.

Disk configuration for improved database performance

This is becoming a larger issue as customers want to roll out Logician to greater numbers of users. With the larger user base comes a larger patient population, which translates into a larger database being accessed by more users. Improving database I/O is critical to the success of our customers and Logician.

Splitting database files across multiple drives

For quite some time now it has been known that splitting Oracle data files over multiple disk devices can improve performance. However, it is not immediately clear how one should deploy different Oracle data files on different disk devices. While it is clear how to actually put data files on different disk devices, it is not clear what the distribution of the data files should be. The key to this issue is to minimize head contention for reads and writes on the drives.

Our Oracle database consists of the following data files, each with its own I/O characteristics.

1. System data file. This file holds information about the system catalogs, and since this information is read frequently during database operation, most of it should be cached in memory.

2. Rollback data file. This file holds the rollback information while users are in the middle of transactions.

3. Temporary data file. This file holds sort information when a sort will not fit into memory; typically this is a report with a large amount of data that needs to be sorted in a particular order.

4. Logician data file. This file holds the clinical data for the production system.

5. Logician index data file. This file holds the information about the indexes on the data in the Logician data file.

6. Logician read only data file. This data file holds information that is read only, mainly knowledge-base information that the user does not change.

7. Logician read only index data file. This data file holds the indexes on the data in the Logician read only data file.

8. Tutorial data file. This file holds the information about the tutorial database and its indexes.

9. Maintenance data file. This data file holds the information about growth statistics on tables in the production system.

10. L3 data file. This data file holds the information that LinkLogic uses for setup and temporary storage processing for data imports.

11. Photo data file. This data file holds the patient photos.

12. Redo log files. These files hold a before and an after image of each transaction. When one fills up, it is copied off to an archive area.

13. Archive log files. These files are produced as a result of the redo logs filling up.

14. Control files. These files hold some basic information about the database and are quite small.

Using our standard scaling model, we looked at how much input and output (I/O) was occurring for each file. The important observation here is how much relative file I/O was occurring. This information is necessary to come up with a general model of how to split the files across multiple disk devices. A further refinement of the model would indicate, in terms of providers and users, how best to split files across multiple disk devices. That is, a particular disk device has a limit on how much data can be read from and written to it during a unit of time.
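The paper does not give the model itself, so the sketch below is only one plausible illustration, in C, of the kind of calculation described: take the relative I/O of each data file (the figures here are the read + write operation counts from Table 1) and assign the busiest files, in order, to the least-loaded disk device so that no single device carries a disproportionate share of the I/O. The three-device count is an arbitrary assumption.

    /*
     * Illustrative sketch (not the paper's actual model): distribute
     * tablespaces across disk devices by assigning each file, busiest
     * first, to the device with the least accumulated I/O.  The I/O
     * figures are the read + write operation counts from Table 1; the
     * number of devices (3) is an arbitrary assumption.
     */
    #include <stdio.h>

    #define NUM_FILES   6
    #define NUM_DEVICES 3

    struct file_io {
        const char *name;
        long        ops;     /* physical reads + writes, from Table 1 */
    };

    int main(void)
    {
        /* Ordered from busiest to least busy. */
        struct file_io files[NUM_FILES] = {
            { "Logician_data",  183087L + 243136L },
            { "Redo Logs",      0L      + 171749L },
            { "Logician_index", 13447L  + 35341L  },
            { "Rollback",       12L     + 44133L  },
            { "Temporary",      11386L  + 12559L  },
            { "L3",             364L    + 6311L   },
        };
        long device_load[NUM_DEVICES] = { 0 };

        for (int f = 0; f < NUM_FILES; f++) {
            /* Pick the device with the smallest load so far. */
            int best = 0;
            for (int d = 1; d < NUM_DEVICES; d++)
                if (device_load[d] < device_load[best])
                    best = d;

            device_load[best] += files[f].ops;
            printf("%-16s -> device %d (device total now %ld ops)\n",
                   files[f].name, best, device_load[best]);
        }
        return 0;
    }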

It should be noted that this model must be validated in the real world. To validate the data we would need to run a couple of simple SQL scripts (utlbstat and utlestat, supplied by Oracle) at a variety of client sites. These scripts would not affect the current operation of the client's system, nor would they reveal any confidential data. If we obtained similar results from several client sites, we could come up with some general rules for deployment. This information would be invaluable for server planning, performance, and scalability.

Model Results

Table 1 shows results from running our model on a 300-user database with a DTS (Data Transfer Station) running. The table lists the number of actual I/O operations that occurred on each file; this is the number of I/O operations, not the size of each I/O.

As you can see, the redo logs account for much of the write I/O. This is expected, since the redo logs are only read from during recovery. The Logician_data tablespace has the most write and read activity, since this is where the clinical data resides.

Table 2 shows results from the same 300-user run as Table 1. This table lists the number of 8K blocks read and written for each file.

In Table 2 you can see that the redo logs account for the majority of the data written, while in terms of data read the Logician_data tablespace has the most by far. The direction of these results seems reasonable.

Table 1. Physical I/O Operations

Tablespace            Reads    % reads     Writes    % writes   % total I/O
System                  541      0.25%         74       0.01%         0.08%
Rollback                 12      0.01%     44,133       8.60%         6.08%
Temporary            11,386      5.36%     12,559       2.45%         3.30%
Logician_data       183,087     86.14%    243,136      47.37%        58.72%
Logician_index       13,447      6.33%     35,341       6.89%         6.72%
Logician_data_ro      3,572      1.68%          0       0.00%         0.49%
Logician_index_ro       124      0.06%          0       0.00%         0.02%
Tutorial                  8      0.00%          0       0.00%         0.00%
Maint                     0      0.00%          0       0.00%         0.00%
L3                      364      0.17%      6,311       1.23%         0.92%
Photos                    0      0.00%          0       0.00%         0.00%
Redo Logs                 0      0.00%    171,749      33.46%        23.66%
Total               212,541               513,303                  725,844

Table 2. Physical Block I/O

Tablespace            Reads    % reads     Writes    % writes   % total I/O
System                  689      0.06%         74       0.01%         0.03%
Rollback                 12      0.00%     44,133       3.25%         1.75%
Temporary            41,828      3.59%     42,316       3.11%         3.33%
Logician_data     1,105,400     94.84%     24,316       1.79%        44.75%
Logician_index       13,447      1.15%     35,341       2.60%         1.93%
Logician_data_ro      3,572      0.31%          0       0.00%         0.14%
Logician_index_ro       124      0.01%          0       0.00%         0.00%
Tutorial                 12      0.00%          0       0.00%         0.00%
Maint                     0      0.00%          0       0.00%         0.00%
L3                      419      0.04%      6,311       0.46%         0.27%
Photos                    0      0.00%          0       0.00%         0.00%
Redo Logs                 0      0.00%  1,206,535      88.78%        47.79%
Total             1,165,503             1,359,026                2,524,529

In general, disk writes are more expensive than disk reads, so a basic strategy would be to move the redo logs onto a disk device separate from all the other files. This observation is in line with Oracle's basic tuning recommendations. One item not noted above is that when a redo log fills up, it is copied to the archive area as an archive log; this additional disk activity is not accounted for in these figures.

The technical papers from Oracle listed in the bibliography can give more insight into this area.

Which RAID level should I use? 0, 1, 0+1, 5, none

A disk device is a logical drive or set of drives. It may be a single disk drive, a mirrored set, or some sort of Redundant Array of Inexpensive Disks (RAID). RAID 0 is disk striping without parity; it is very fast but lacks data protection. A single disk failure wipes out the logical device.

RAID 1 is disk mirroring. It is very reliable in terms of data protection but requires twice the number of disks to hold the same amount of data.

RAID 0+1 is a mirrored stripe. It is very fast and reliable.

RAID 5 is a disk stripe with parity. It is not quite as fast as 0+1 and reliability is limited. Since parity is striped across disks, if more than one disk fails, the entire logical device is usually lost.

In general a mirrored stripe, commonly referred to as 0+1, will give you the most performance, but at the cost of the number of disk drives needed.
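To make the capacity trade-off mentioned above concrete, the short sketch below computes usable capacity for each RAID level discussed, using the usual textbook formulas (striping uses all disks, mirroring halves capacity, RAID 5 gives up one disk's worth to parity). The eight-disk, 9 GB configuration is an arbitrary assumption, not one from the paper.

    /*
     * Usable-capacity arithmetic for the RAID levels discussed above, using
     * the standard formulas: RAID 0 = N*S, RAID 1 and 0+1 = N*S/2,
     * RAID 5 = (N-1)*S.  The disk count and size are arbitrary assumptions.
     */
    #include <stdio.h>

    int main(void)
    {
        const int    disks   = 8;     /* assumed number of drives in the set */
        const double size_gb = 9.0;   /* assumed capacity per drive (GB)     */

        printf("RAID 0   : %.0f GB usable (no redundancy)\n",   disks * size_gb);
        printf("RAID 1   : %.0f GB usable (mirrored pairs)\n",  disks * size_gb / 2.0);
        printf("RAID 0+1 : %.0f GB usable (mirrored stripe)\n", disks * size_gb / 2.0);
        printf("RAID 5   : %.0f GB usable (one disk of parity)\n",
               (disks - 1) * size_gb);
        return 0;
    }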

Hardware or Software RAID

Hardware RAID implementations are typically faster than software RAID, but at a price. Hardware implementations are more expensive, since the capability to implement software RAID already exists in the various server operating systems without the added cost of a hardware RAID array controller.

Various server and workstation options available

MedicaLogic continues to provide customers with updated server and workstation sizing information based upon testing done in our Scalability Lab. While we do not emphasize which server or workstation vendor our customers should purchase their hardware from, we do recommend that they plan their purchases to accommodate their projected user and database growth. Areas that we provide sizing information on include:

1. Processor number and speed
2. RAM
3. Disk space requirements
4. Network and workstation OS configuration parameters

We typically recommend that our customers purchase the most powerful server (in terms of CPU speed and number of processors, RAM, and disks) that they can afford and that meets their projected growth requirements. We do this in order to help minimize our customers' hardware upgrade costs later.

As we attempt to load more users onto a single database server, we have essentially reached the maximum user load that currently available Intel platforms can support under NetWare and NT Server.

Some network OS configuration parameters are available that will allow more users to be loaded onto the system than would otherwise be allowed. One in particular involves NT Server: under Network Properties > Services > Server > Properties, four optimization settings are available. If "Maximize throughput for network applications" is selected, NT Server can be brought up to a performance level equivalent to NetWare as an application server. We are continually testing other configuration parameters as well to determine what performance gains may be achieved.

To meet the user load requirements for our larger (greater than 500 users) customer sites MedicaLogic has extensively evaluated several UNIX server platforms. From this evaluation MedicaLogic has chosen HP-UX servers as our UNIX database server platform of choice for our large customer implementations.

WAN configurations available and overcoming WAN bottlenecks

Many customers have remote clinics for which they want to enable remote client access to the Logician database, as well as larger sites with hundreds of users. To help our customers achieve this goal, MedicaLogic performs network load testing as well. These tests determine the maximum user loads that different LAN / WAN / remote connections can support while still maintaining product usability. The connection speeds tested range from 28.8 Kbps to 100 Mbit/s, including Frame Relay and ISDN connections. From these tests, MedicaLogic publishes a WAN Planning Guide for customer use. Due to the amount of traffic generated by Logician, the number of remote connections (especially on slower lines) is limited.

In order to reduce these limitations we continue to investigate alternative connection solutions with an eye toward reducing our customers' hardware implementation and upgrade costs. One particular solution we have investigated is Citrix WinFrame (a multi-user version of Windows NT). This solution offered several benefits:

- centralized client administration
- reduced hardware upgrade costs
- reduced remote bandwidth requirements while maintaining acceptable client performance

This solution allows our existing customer sites with older PCs to use their existing hardware with the newer version of Logician.

As an example, the older versions of Logician required a 486 running Windows 3.11 as the base client platform. The current version requires a Pentium 100 running Windows 95 or NT.

Tuning the client / server application for improved database performance

The purpose of tuning our client/server application is to achieve increased performance. For Oracle, this is measured by an increase in the SQL_AREA cache hit ratio of our application to above 95%. The SQL_AREA is Oracle's cache for the SQL statements that client machines send over. As an example, prior versions of Logician had a cache hit rate much lower than this (less than 38%), which means that more than 62% of the time Oracle had to fully parse and validate a SQL statement. This is highly CPU intensive.

Several methods being employed to improve this ratio include:

1. Use of Oracle OCI instead of ODBC for database communications.
2. Reducing SQL query parsing through the implementation and increased use of indexing, cursors, and host variables (see the sketch following this list).
3. Reducing and eliminating redundant queries.
4. Modifying the database schema to reduce table contention.
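The paper does not show Logician's SQL, so the fragment below is only an illustrative sketch of why the host (bind) variables in method 2 raise the SQL_AREA hit ratio: building a statement with literal values produces a different SQL text for every patient, so each one must be fully parsed, whereas a single parameterized text can be parsed once and reused. The statement, table, and helper names are made up; exec_sql() and exec_sql_bound() are stand-ins, not OCI calls.

    /*
     * Illustrative contrast between literal SQL (poor shared-SQL reuse) and
     * a host-variable form (good reuse).  Table, column, and helper names
     * are made up; the helpers simply print what would be sent to Oracle.
     */
    #include <stdio.h>

    static void exec_sql(const char *sql)
    {
        printf("parse + execute: %s\n", sql);
    }

    static void exec_sql_bound(const char *sql, long bind_value)
    {
        printf("execute (reusing parsed form): %s  [:pid = %ld]\n", sql, bind_value);
    }

    int main(void)
    {
        long patient_ids[] = { 100001L, 100002L, 100003L };
        char literal_sql[128];

        /* Literal form: every statement text is unique, so the SQL_AREA
         * never gets a hit and each statement is fully parsed. */
        for (int i = 0; i < 3; i++) {
            snprintf(literal_sql, sizeof literal_sql,
                     "SELECT * FROM encounter WHERE patient_id = %ld",
                     patient_ids[i]);
            exec_sql(literal_sql);
        }

        /* Host-variable form: one statement text, parsed once, executed with
         * different bound values, so repeated executions hit the cache. */
        const char *bound_sql = "SELECT * FROM encounter WHERE patient_id = :pid";
        for (int i = 0; i < 3; i++)
            exec_sql_bound(bound_sql, patient_ids[i]);

        return 0;
    }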

Methods 2, 3, and 4 have proven to be the most effective for tuning Logician to achieve the desired performance levels in tests conducted at MedicaLogic, with the current version of Logician achieving an average cache hit ratio of greater than 85%.

The Future

We continue to explore hardware and software issues that impact performance. As new OS and software versions, as well as faster hardware, are released, customers tend to expect performance to improve as well.

Bibliography

1. Performance Tuning Tips for Oracle7 RDBMS on Microsoft Windows NT (White Paper). Desktop Performance Group, Oracle Corporation, February 1995.

2. Configuring Oracle Server for VLDB. Cary V. Millsap, Oracle System Performance Group, Oracle Corporation, March 7, 1996.

3. Oracle White Paper: High Confidence Load Testing, A Practical Approach to Low Risk Implementation. Doug Chandler, Darryl Presley, Robert Michael, and Doug Liles, Oracle Large Systems Support, Oracle Corporation, February 1997.