Applying the Grid Computing Paradigm within a Liberal Arts Academic Environment

Sarah Monisha Pulimood Department of Computer Science

The College of New Jersey

Ewing, NJ 08628 1-609-771-2788

[email protected]

Thomas R. Hagedorn Department of Mathematics and Statistics

The College of New Jersey

Ewing, NJ 08628 1-609-771-3053

[email protected]

ABSTRACT

Until very recently, analyses and computations on a large scale were feasible only on supercomputers or clusters of high-end processors. Such computational infrastructure requires massive investments that can be unrealistic in a liberal arts college environment such as that found at The College of New Jersey (TCNJ). However, TCNJ has several state-of-the-art campus computer labs for use by students and faculty. There are periods (e.g. nights, weekends, and summer months) when a significant number of these computers are underutilized. A grid computing environment would enable a user to harness underutilized campus resources for computationally intensive applications. In this paper we discuss some of the issues faced in the liberal arts environment, our approach for resolving them, and our initial results.

Keywords

Grid computing, grid computing application, computational problems, resource management

1. INTRODUCTION

The College of New Jersey (TCNJ) is a medium-sized, student-focused state college with active research faculty. Some of their research requires analysis of immense volumes of data or highly complex, computation-intensive calculations. Until very recently, analyses and computations on a large scale were feasible only on supercomputers or clusters of high-end processors. Such computational infrastructure requires massive investments that often cannot be adequately justified in a liberal arts environment like TCNJ. As an example, the second author has been studying the Jacobsthal function [11], j(n), a function in number theory. Suitable knowledge about j(n) can be used to estimate bounds for the size of the gaps between consecutive prime numbers. While the study of prime numbers is one of the oldest areas of mathematics, it has become especially important in the last two decades as Internet and financial transactions are encoded using algorithms based upon prime numbers. As one application, knowledge about prime numbers can be used to study which algorithms provide the best security for online commerce.

Previously, j(n) had been known only for values of n less than 21 [17]. Prior to using a grid computing environment, Hagedorn had written and used a C program to extend the list of known values of j(n) to all n less than 41. Calculating j(41) took approximately three days on a single computer server. As the number of steps needed to calculate j(n) should grow exponentially with n, the calculation of j(42) could theoretically take 42 times the time needed to calculate j(41). This program could be executed in just a few hours if given access to sufficient computational resources. It thus becomes imperative to find additional computing resources to calculate higher values. By using a grid computing environment, we have been able to determine the values of j(n) for all n less than 50.

A grid computing system is a distributed collection of computers that enables Internet Programming, i.e. the sharing, selection, and aggregation of resources across a large network like the Internet. Resources are shared according to their availability, capability, performance, cost, and ability to meet quality-of-service requirements. TCNJ has several state-of-the-art computer labs for use by students and faculty. There are periods (e.g. nights, weekends, and summer months) when a significant number of these computers are underutilized. A grid computing environment would enable faculty to harness underutilized campus resources for computations by distributing them across the idle computers, each of which can execute smaller, more manageable subtasks. By increasing the total computing power available for a problem, a grid computing infrastructure could potentially enable new algorithms to be implemented for solving large computational problems.

Until fairly recently, such sharing of resources was realistic only in a limited way, primarily due to constraints on network communication reliability and speeds, processing capabilities, and security. Advances in technology have overcome or mitigated many of these constraints, paving the way for grid computing to become a reality on a larger scale. The term ‘grid computing’ was coined to play on the analogy to the electrical power grid [9]. The intention was that virtually unlimited computing power could be available to anybody, at any time [15]. Sun Microsystems, for example, offers such a service for a fee per hour of usage [12]. Berkeley Open Infrastructure for Network Computing (BOINC) [3] provides support for volunteer computing as well as desktop grid computing. OpenMacGrid [1] allows Mac users to donate spare cycles on their machines for the computational requirements of other researchers but, unlike the other grid computing networks, allows researchers to access this resource with their own scientific applications. The Globus Toolkit [8] enables sharing of computing power, databases, and other tools securely across corporate, institutional, and geographic boundaries.

Developing a customized grid computing environment, and integrating the available tools and services to meet the user’s specific needs, still requires substantial research and knowledge [2]. Many questions must be answered in order to fully enable such scenarios, and they are the subject of ongoing research. Some such questions are: How can users retain their ability to cooperate while not in their home environment? What is the role of context and location in determining how cooperation can be carried out? How can resources be described semantically in a meaningful way, so that limited resources are exploited more efficiently by providing data relevant to the user, improving interoperability with the environment and with other users, and deciding when and how to process data? Software architectures for distributed collaborative communities must support the fundamental requirements for distributed cooperation: efficient and semantically enhanced information sharing across a widely distributed environment; constant and timely update of distributed knowledge bases, with many different sites acting both as potential users and potential providers of information; shared access to services; security and trust in these environments; different access modes to the same information; and so on. A particularly interesting line of research is exploring the peer-to-peer paradigm, enriched with sharing abstractions in which each network node is both a potential user and a provider of information. We must also understand how existing computational applications can be made distributable in order to leverage the grid computing network. These crosscutting questions require an interdisciplinary view of the domain and input from a variety of fields within computer science, as well as from potential users in other disciplines.

Our goal has been to apply the grid computing paradigm to the small liberal arts college environment, while exploring and understanding some of the issues described above. In the first (and current) phase, the authors are collaborating to design the grid computing environment. In later phases we plan to collaborate with other researchers to understand their work and extend / adapt the system accordingly.

2. TCNJ GRID (T-GRID)

At our institution we are constrained by the campus-wide information technology policies. Due to concerns about security of computations and machines, we cannot use applications such as OpenMacGrid, which does not allow us to control which computations utilize the resources we make available [1]. The Globus Toolkit [8] requires that several modules be installed on all machines on the grid. We cannot install these on lab machines, and maintaining the software would be an issue even if we were allowed to install it. The Globus Toolkit was also deemed too complex for our needs. Hence, we made the decision to implement a lightweight grid computing framework customized to our needs.

2.1 Resource Management

One of the major challenges of the grid computing paradigm is resource management. The hierarchical resource management model [5] with active and passive components is the most commonly used model. The passive components we consider in our framework are the resources, tasks, jobs, and schedules, while the active components are schedulers, users, monitors, and job control agents. At this time resources are limited to CPU time. Within the campus environment, bandwidth can be assumed to be consistent. We hope to consider disk usage at a later stage.

We define a job as an activity that requires resources; more specifically, a job is a computationally intensive program. Jobs are hierarchical (tree-structured) and may contain sub-jobs, with the smallest unit, or leaf, being a task. The Job Control Agent guides the job through the system, communicating with different components to ensure that the job is completed. The Scheduler receives user requests and creates the schedule, which is a map of tasks to resources over time. The Scheduler also compensates for resource availability errors that may occur, for example if a client machine goes down [18]. The Monitor stores current information about the grid, such as the IP addresses of client machines. It oversees the grid and tracks the progress of each task. It also keeps track of all the clients available and determines when a task is sent to a client. It monitors individual clients to determine when they complete tasks and become available. Since the College owns all the computers we plan to put on the grid, we are not concerned with the issues of ownership over domains that exist in larger grids.
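The interplay between the Scheduler and the Monitor described above can be illustrated with a minimal sketch. It is not the T-GRID implementation: the class name, method names, and the use of simple FIFO queues are assumptions made for illustration. It only shows the idea of mapping tasks to idle clients and returning a task to the queue when its client becomes unavailable.

```python
from collections import deque


class Scheduler:
    """Hypothetical scheduler: maps tasks to idle clients, requeues lost work."""

    def __init__(self, tasks, clients):
        self.pending = deque(tasks)   # tasks not yet assigned
        self.idle = deque(clients)    # clients ready for work
        self.running = {}             # client -> task it is executing
        self.done = []                # completed tasks, in completion order

    def dispatch(self):
        """Assign pending tasks to idle clients; return the new assignments."""
        assigned = []
        while self.pending and self.idle:
            client = self.idle.popleft()
            task = self.pending.popleft()
            self.running[client] = task
            assigned.append((client, task))
        return assigned

    def complete(self, client):
        """Record a finished task and mark the client idle again."""
        self.done.append(self.running.pop(client))
        self.idle.append(client)

    def client_lost(self, client):
        """Requeue the task of a client that went down or was claimed by a user."""
        task = self.running.pop(client, None)
        if task is not None:
            self.pending.appendleft(task)
```

In this sketch a lost client is not returned to the idle pool; a monitor component would add it back once the machine reports that it is available again.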

The T-GRID comprises lab computers across campus that are used by students for assignments and projects. It is important that users are not affected by tasks being processed. Consequently, a task must pause when a user starts to use the computer and must either wait for the computer to become free again or be rescheduled to a different computer. Modern operating systems allow applications to run at different priorities, so the operating system itself can take care of multitasking. For rescheduling, we use a combination of the time-based method of waiting a set amount of time before canceling and rescheduling the task [18], and the event-based method of waiting until a failure is reported or detected to reschedule the task [18]. We opted to use the Unix utility “nice”, which ensures that the task runs at a lower process priority. By default, nice runs the task at a niceness of 10, which can be raised as far as 19, the lowest priority under Unix. On a Windows-based operating system, a similar command, start /low, can be used. This is an effective method for ensuring that the grid computing environment does not interfere with any other applications running on the computer.
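As a rough illustration of the priority mechanism, the following sketch launches a task at the lowest priority. The function names are hypothetical, and the Windows branch uses the idle priority class of Python's subprocess module as a stand-in for start /low; the paper does not describe how the T-GRID client invokes these commands.

```python
import os
import subprocess


def low_priority_argv(argv, niceness=19):
    """Prefix a command with `nice` on POSIX systems; leave it unchanged elsewhere."""
    if os.name == "posix":
        return ["nice", "-n", str(niceness), *argv]
    return list(argv)


def run_low_priority(argv, niceness=19):
    """Run a task so that interactive users of the machine keep priority."""
    kwargs = {}
    if os.name == "nt":
        # Assumed Windows analogue of `start /low`.
        kwargs["creationflags"] = subprocess.IDLE_PRIORITY_CLASS
    return subprocess.run(
        low_priority_argv(argv, niceness),
        capture_output=True,
        text=True,
        **kwargs,
    )
```

A client might call run_low_priority(["./shared_executable", "task-17"]), where both the executable name and the task argument are placeholders.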

The grid keeps track of the location of each computer so that a processing barter system can be created in the future. This will allow the departments that contribute the most processing power to the grid to have access to more resources.

2.2 Security

Security is another challenging aspect that must be considered in a grid computing environment. It is important that only authorized TCNJ users process data on the grid, hence we provide user authentication. It is also essential that computations be kept private and secure from malicious hijacking or corruption. Each computer has a unique identifier and password, which are authenticated.

Since potentially sensitive information may be transmitted between the client and server, Secure Sockets Layer (SSL) connections are used for secure communication. Typically in SSL, the server holds a certificate with a corresponding private key. When a client connects, the server presents this certificate, which contains the matching public key. To establish the identity of the server, the client verifies the server’s certificate for validity and trustworthiness. In commercial applications, a trusted certification authority, such as Verisign or Entrust, or a trusted certification server within the organization, signs this certificate. This is done to prevent hijacking of routes by malicious applications. For now, we assume that T-GRID does not need to communicate with any applications that are not trusted by the campus network, and so it uses self-signed certificates.
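With a self-signed certificate there is no certification authority to vouch for the server, so each client has to be told explicitly which certificate to trust. The sketch below shows one way this could look using Python's standard ssl module; the function name and the certificate file name in the usage note are assumptions, and the paper does not specify how T-GRID distributes its certificate.

```python
import ssl


def make_client_context(trusted_cert_path=None):
    """Build a client-side TLS context that verifies the server certificate.

    trusted_cert_path is a hypothetical PEM file holding the grid server's
    self-signed certificate, copied to each client ahead of time.
    """
    # PROTOCOL_TLS_CLIENT enables certificate and hostname checks by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if trusted_cert_path is not None:
        # Trust exactly this certificate instead of a public authority.
        context.load_verify_locations(cafile=trusted_cert_path)
    return context
```

A client would then wrap its socket with context.wrap_socket(sock, server_hostname=...), after creating the context with something like make_client_context("tgrid-server.pem").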

A Service Level Agreement (SLA) provides an agreement and understanding between the client and the server. It builds a level of trust between the two entities so as to maintain a level of security and comfort with participating on the grid [6, 7]. Typically, three types of SLAs are used. Task or Transactional Service Level Agreements (TSLA) are agreements that are generated to set the guidelines for a certain task to be performed, in our case, by the client. This is particularly important to prevent malicious code from executing and compromising security. Resource Service Level Agreements (RSLA) specify what resources the grid application can consume. The RSLA is created when the client computer first connects to the grid and indicates its availability. Binding Service Level Agreements (BSLA) are designed to connect TSLAs and RSLAs and help to maximize resource utilization, while preventing the server from assuming that the client has more resources available than it does.

Since the system is designed for a college campus network that is already behind a firewall, we did not need to concern ourselves with this aspect.

2.3 Communication

We use sockets for communication between the clients and the server, and all communication is encrypted. Sockets do not offer the ease of transmission of data over XML, but do allow the server to maintain an accurate and up-to-date list of clients connected to it [13]. Most of the data transmission is text-based, to make it lightweight and easy to understand. To transmit the text, we used BufferedReaders and BufferedWriters (which are built into the Java API) to manage the buffers.

Availability and knowledge of availability are essential to the grid, so both the client and server need to be consistently ready to accept communications. The grid relies on having multiple clients to carry out computations, so if there are fewer clients connected, operations will take longer. To ease future maintenance and updates, we have kept our protocols simple and similar to other existing protocols, such as FTP or SMTP.
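A text protocol in the style of FTP or SMTP exchanges one command per line and answers with a numeric status line. The following sketch shows that pattern over a plain socket. The command verbs (HELLO, STATUS) and reply codes are invented for illustration and are not the T-GRID command set, and the sketch omits the encryption layer.

```python
import socket


def handle_command(line):
    """Map one command line to a status reply (hypothetical verbs and codes)."""
    verb, _, arg = line.strip().partition(" ")
    if verb == "HELLO":
        return f"200 welcome {arg}"
    if verb == "STATUS":
        return "200 idle"
    return "500 unknown command"


def serve_one(conn):
    """Read a single command line from a connected socket and answer it."""
    reader = conn.makefile("r", encoding="ascii", newline="\n")
    writer = conn.makefile("w", encoding="ascii", newline="\n")
    with reader, writer:
        line = reader.readline()
        writer.write(handle_command(line) + "\n")
        writer.flush()
```

Because every message is a short line of ASCII text, the exchange can be inspected and debugged with ordinary tools, which is one reason such protocols are easy to maintain.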

To reduce network traffic, and to ensure that all clients have the most recent version of the shared executable, the system employs the MD5 file hashing algorithm (identical to the one used for password hashing) to obtain a short, textual representation of the executable. When the server assigns a job to a client, it asks for a hash of the executable residing on the client’s computer. If the client has the file, it sends the sixteen-byte MD5 hash of the file back to the server; otherwise it sends a null value back. The server then creates a hash of the shared executable that it has and compares it with the hash sent by the client. If the hashes match, the client is asked to start the job with the file it has; otherwise it is sent the required file.
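The hash comparison can be sketched with Python's hashlib. The function names are illustrative, and the hexadecimal text form of the digest is an assumption about how the hash would travel over a text-based protocol.

```python
import hashlib
from pathlib import Path


def md5_of_bytes(data):
    """Return the MD5 digest of a byte string as hexadecimal text."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=65536):
    """Hash a file in chunks; return None if the file is absent (the null reply)."""
    file_path = Path(path)
    if not file_path.is_file():
        return None
    digest = hashlib.md5()
    with file_path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_transfer(server_hash, client_hash):
    """The server sends the executable only when the client's copy differs or is missing."""
    return client_hash is None or client_hash != server_hash
```

Only the short digest crosses the network when the client already holds the current executable, so the full file is transferred once per client per version.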

A level of complexity is added, however, when the shared executable has to be transferred from the server to the client. The difficulty with transferring the file comes from executables being in binary, while all of our commands are in ASCII (8 bits for binary files versus 7 bits for text). To work around this issue, the client changes the reader type whenever a file is sent. The client application switches from an ASCII reader to a binary reader. After the binary reader has accepted the appropriate number of bytes, the ASCII reader is enabled again, ready to accept further commands. We considered MIME encoding, which is a way to convert a binary file to 7-bit ASCII text. This would have made the development process shorter; however, more processor time would be required at runtime if the file being transferred is large.
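The switch between text and binary reading can be sketched as a length-prefixed frame: a text command announces how many raw bytes follow, the reader consumes exactly that many bytes, and then resumes reading text lines. The FILE command and its syntax are assumptions for illustration, not the T-GRID wire format.

```python
import io


def read_messages(stream):
    """Yield ("CMD", text) or ("FILE", payload) items from a binary stream.

    `stream` is any buffered binary reader, for example io.BytesIO or the
    object returned by socket.makefile("rb").
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        text = line.decode("ascii").rstrip("\r\n")
        if text.startswith("FILE "):
            # Hypothetical header "FILE <byte count>": switch to binary mode.
            size = int(text.split()[1])
            payload = stream.read(size)
            if len(payload) != size:
                raise EOFError("connection closed in the middle of a file")
            yield ("FILE", payload)
        else:
            yield ("CMD", text)
```

Sending the raw bytes avoids the size overhead of a text encoding such as Base64, which represents every 3 bytes as 4 characters.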

3. IMPLEMENTATION

We considered using web services to implement the grid computing framework. Web services are a web-specific software component methodology that deals with modular, self-contained, self-describing software components whose public interfaces are described using XML. They allow applications to cross system, programming language, and even platform boundaries. Web services expedite, simplify, and reduce the cost of new application development by providing application developers with systematic access to standards-compliant third-party software functionality that is invoked across the Web. They also allow valuable software functionality embedded within existing applications to be isolated and reused. Since they are web-centric, developers looking for sources of best-of-breed software functionality have ready access to the entire worldwide software community without the hindrance of geographic, political, or trade boundaries.

Data transfers between applications and the web services they invoke are in the form of XML documents that are exchanged using a messaging scheme, the most popular one being SOAP [4, 10, 13], which stands for Simple Object Access Protocol. SOAP is an XML-based messaging scheme that is platform and programming language agnostic. It is a simple, lightweight mechanism for exchanging structured and typed information peer to peer in decentralized and distributed environments such as the Web. In the context of web services, SOAP flows across the various possible transport options such as HTTP, RPC, TCP, SMTP, message queuing, FTP, and BEEP. Thus, the data transfers that take place between applications and web services occur in the form of XML documents exchanged via SOAP, which in turn relies on a lower-level transport scheme such as HTTP or TCP for connectivity and networking across the Web [10, 13].

Web services use Universal Description, Discovery, and Integration (UDDI) for electronic directories that contain detailed information about businesses, the services they provide, and the means for utilizing these services. A UDDI directory is meant to be platform independent and can be readily accessed using a web browser-based GUI or by applications via published APIs. Its goal is to ensure that enterprises and individuals can quickly, easily, and dynamically locate and make use of services, in particular web services, that are of interest.

Since web services adhere to strict HTTP protocols, two-way push-pull communications are not possible. Instead, HTTP fetch commands are used to transfer data. In this scenario, the client must initiate all communications to the server. If the server were to need to communicate with the client, it would have to wait for the client to initiate a communication with it. The server can attempt to maintain a list of available clients by keeping a log of recent connections to the server, but this method adds increased overhead and can be inaccurate. We investigated having the clients periodically make a request to the server. The server would need to maintain an accurate list of connected clients in order to ensure that jobs do not sit idle, waiting to be picked up, potentially by a client that is no longer available. Given these reasons, we determined that while web services have a number of advantages, they would be inefficient and impracticable for our system.

Ours is a liberal arts institution, so it is imperative that the grid technology be made transparent to the majority of users. We provide a graphical user interface (GUI) for users to submit jobs and retrieve the results.

We determined that for the T-GRID, a client-server system was the best option. Our campus has machines running the Windows, Unix, Mac, and Linux operating systems, so we chose Java as the implementation language over C and C++ for its portability. The grid operations are relatively lightweight, so the overheads of using Java are negligible. Shared executables can also be compiled to native code. Another advantage of Java is that it has extensive libraries and APIs. Many useful objects are already implemented and can be reused, thereby reducing the programming tasks.

One of our initial thoughts was to use a decentralized system, such as Gnutella or BitTorrent. However, we determined that a centralized system would be much more effective. While a peer-to-peer system would offer more flexibility, a centralized system lends itself to tighter management and to ensuring that only authorized users can submit or process jobs. Considering the security risks of allowing applications to execute on campus equipment, close control and monitoring of the grid is extremely important.

As a proof of concept, we built a prototype for the grid computing framework in Python. This program was designed so that clients could connect to the server, which would then distribute tasks. It does not provide fault tolerance; if a client’s task is interrupted for any reason, the server program has no way of knowing whether the task was completed or not. It also has no mechanism for informing the server if a task was halted due to a user logging into a machine. The task could potentially be postponed indefinitely and there is no guarantee that the task will eventually resume.

The prototype has been tested with a large computational problem discussed in the next section and has yielded very promising results in spite of these shortcomings. Fault tolerance was handled by creating a text file of all completed tasks that was then searched to ensure that all necessary cases had been checked.
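The bookkeeping described above, in which a text file of completed tasks is searched for gaps, can be sketched as follows. The task identifiers and the one-identifier-per-line log format are assumptions; the paper does not specify the layout of the prototype's file.

```python
def missing_tasks(expected_ids, log_lines):
    """Return the task ids that never appear in the completed-task log.

    Assumes one completed task id per log line; blank lines and repeated
    entries (a task that ran twice) are tolerated.
    """
    completed = {line.strip() for line in log_lines if line.strip()}
    return sorted(set(expected_ids) - completed)
```

Any identifiers returned by this check would simply be submitted to the grid again, which gives a coarse form of fault tolerance without changes to the server.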

4. COMPUTATIONAL PROBLEM

Jacobsthal [11] defined the function $j(n)$ to be the smallest positive integer $m$ with the following property: every sequence of $m$ consecutive integers contains a number that is not a multiple of any of the first $n$ primes. For example, $j(2) = 4$, as every sequence of four consecutive integers has at least one integer that is a multiple of neither 2 nor 3.

The calculation of $j(n)$ has important connections to the size of the gaps between consecutive prime numbers. Important theoretical work by a number of authors has established a lower bound for $j(n)$. In [16], Pintz, improving on the work of [14], proved

$$ j(n) + 1 \ge \left(2e^{\gamma} + o(1)\right) \frac{n \log n \, \log_3 n}{\log_2^2 n}, \tag{1} $$

where $\gamma \approx .57721$ is Euler’s constant, $\log_1 x = \log x$, $\log_n x = \log(\log_{n-1} x)$, and $o(1)$ indicates a quantity that goes to 0 as $n$ goes to infinity.

Table 1: Values of j(n) for 20<n<50

Let $p_n$ denote the $n$th prime, with $p_1 = 2$, and let $g(n) = p_{n+1} - p_n$ be the function giving the gap between two consecutive primes. The function

$$ G(x) = \max_{p_{n+1} \le x} g(n) $$

gives the largest gap between consecutive primes not exceeding $x$. Using (1), one can prove the following lower bound for the gaps between consecutive prime numbers:

$$ G(x) \ge \left(2e^{\gamma} + o(1)\right) \frac{\log x \, \log_2 x \, \log_4 x}{\log_3^2 x}. $$

The calculation of $j(n)$ is equivalent to determining the largest gap between units in $\mathbb{Z}_{P(n)}$, the ring of integers modulo $P(n)$. The number of elements in $\mathbb{Z}_{P(n)}$ grows exponentially in $n$. For example, $\mathbb{Z}_{P(36)}$ has $P(36) \approx 1.985 \times 10^{59}$ elements, and a straightforward brute force calculation of $j(36)$ would be infeasible. Previously, the values of $j(n)$ were known [17] for $n \le 20$. (The cases $21 \le n \le 24$ were subsequently and independently determined by M. Alekseyev, see [17].)
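For very small $n$ the brute force approach is still practical and makes the definition concrete. The sketch below assumes that $P(n)$ is the product of the first $n$ primes, which the text does not state explicitly, and finds the largest gap between consecutive units modulo $P(n)$. It is a naive illustration only and is unrelated to the optimized algorithm used in this work.

```python
from math import gcd, prod


def first_primes(n):
    """Return the first n primes by trial division (adequate for tiny n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def jacobsthal_brute_force(n):
    """Largest gap between consecutive units modulo P(n).

    Assumption: P(n) is the product of the first n primes.
    """
    modulus = prod(first_primes(n))
    units = [r for r in range(1, modulus + 1) if gcd(r, modulus) == 1]
    units.append(units[0] + modulus)  # close the cycle across the modulus
    return max(b - a for a, b in zip(units, units[1:]))
```

The list of residues has $P(n)$ entries, so the memory and time needed by this approach grow with the product of the primes; it becomes hopeless long before $n = 36$.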

Based upon an idea of J. Haugland, we can calculate $j(n)$ using many fewer computations. On a single 2.2 GHz processor Linux server, the calculation of $j(36) = 450$ takes approximately 6 hours. The computational cost of calculating $j(n)$ using our algorithm roughly increases as a function of $n$. It takes one month and two months respectively to calculate $j(48)$ and $j(49)$ on the same single-processor machine. It is difficult to precisely estimate the calculation time for $j(n)$ as a function of $n$.

Using the T-GRID, we have been able to determine $j(n)$ for $n < 50$ (see Table 1).

Our algorithm depends on certain initial parameter choices, which, while not affecting the final value of $j(n)$, can affect the algorithm's running time. The effect of these choices on the running time cannot be predicted prior to running the algorithm. For some smaller values of $n$, such as $n = 46$, the calculation of $j(n)$ took several months due to an unlucky choice of initial parameters.

5. ANALYSIS

The use of the grid computing environment has drastically lowered the computational time needed to calculate $j(n)$. With a grid of 23 similar 2 GHz machines, the calculation of $j(36)$ requires only 35 minutes, an approximate ten-fold increase in computation speed. Similar improvements are seen in the use of a grid computing environment to calculate $j(48)$ and $j(49)$.
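The ten-fold figure follows directly from the two timings quoted for $j(36)$. The short calculation below reproduces it and also derives the parallel efficiency, a standard measure that the paper does not compute itself. It treats the 2.2 GHz server and the 2 GHz grid machines as comparable, which is an approximation.

```python
def speedup(serial_minutes, parallel_minutes):
    """Ratio of single-machine time to grid time."""
    return serial_minutes / parallel_minutes


def efficiency(speedup_value, machines):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup_value / machines


# Figures from the text: about 6 hours on one server, 35 minutes on 23 machines.
s = speedup(6 * 60, 35)
e = efficiency(s, 23)
print(f"speedup = {s:.1f}x, efficiency = {e:.0%}")
```

Under these assumptions the grid achieves a little under half of the ideal 23-fold speedup.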

An additional benefit of using T-GRID has been to expose the computational landscape of the underlying problem. To calculate $j(n)$, the problem is broken up into thousands of smaller parts, each of which is evaluated independently by one of the computers on the grid. The majority of these parts can be evaluated very quickly, while the remaining parts require a disproportionate amount of time. Future work will study these “slow” cases with the hope of using them to further improve the efficiency of our algorithm.

 n   j(n)     n   j(n)     n   j(n)
21   190     31   354     41   550
22   200     32   378     42   574
23   216     33   388     43   600
24   234     34   414     44   616
25   258     35   432     45   642
26   264     36   450     46   660
27   282     37   476     47   686
28   300     38   492     48   718
29   312     39   510     49   742
30   330     40   538

Another area of future work will be to improve the efficiency of the grid computing environment. Using a 23-machine grid, we have observed a ten-fold increase in the computational speed. Ideally, we would like to see a 23-fold improvement in computational time (assuming the grid machines are dedicated to this one project). However, the present organization of the algorithm for calculating $j(n)$ prevents seeing such a full improvement. The algorithm calculates $j(n)$ by checking, for a given (even) $m$, whether $j(n) \ge m$. Once this inequality is verified for $m$, we increment $m$ by 2 and check whether the new inequality is true. The algorithm ends when we find an $m$ for which $j(n) \ge m$ fails to hold. Then $j(n) = m - 2$ (we note that $j(n)$ is always even).

As $m$ approaches the final value $j(n)$, there is a decrease in the number of computations needed to check the truth of the inequality $j(n) \ge m$. Hence, in the grid computing implementation, initially there are many machines analyzing (for different parameters) the (time-consuming) inequality $j(n) \ge m$ for values of $m$ not close to the (final) value of $j(n)$. By changing the organization of our grid environment so that once one machine has verified the inequality $j(n) \ge m$, all machines would then automatically proceed to checking the next inequality $j(n) \ge m + 1$ (again for different parameters), we should be able to achieve a 23-fold improvement in computational time.
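The incremental search over even values of $m$ described in this section can be sketched for tiny $n$ as follows. The check used here is a naive scan for a run of $m - 1$ consecutive integers that each share a factor with $P(n)$, taken (as an assumption) to be the product of the first $n$ primes. The actual algorithm, based on Haugland's idea and split across parameters on the grid, is far more efficient and is not reproduced here.

```python
from math import gcd, prod

SMALL_PRIMES = [2, 3, 5, 7, 11, 13]  # enough for n <= 6 in this sketch


def exists_run(modulus, length):
    """True if some `length` consecutive integers all share a factor with modulus."""
    run = 0
    for r in range(1, 2 * modulus + 1):  # two periods, so no run is cut short
        if gcd(r, modulus) > 1:
            run += 1
            if run >= length:
                return True
        else:
            run = 0
    return False


def jacobsthal_by_increments(n):
    """Raise m in steps of 2 while j(n) >= m can be verified; return m - 2."""
    modulus = prod(SMALL_PRIMES[:n])
    m = 2
    # j(n) >= m holds exactly when m - 1 consecutive integers contain no unit.
    while exists_run(modulus, m - 1):
        m += 2
    return m - 2
```

In a grid setting each call to the check would be divided among clients, and the reorganization proposed above amounts to cancelling the outstanding pieces for a given $m$ as soon as one client finds a witness.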

6. CONCLUSIONS

Applying the grid computing paradigm, we were able to calculate $j(n)$ for values of $n$ far higher than those previously determined. In addition, using the prototype on 23 machines, we were able to achieve a ten-fold increase in computational speed. These results clearly demonstrate that applying the grid computing paradigm in a liberal arts institution is not only feasible but also beneficial to the college faculty and thus the larger research community. Once a working model has been implemented, we plan to educate other faculty on campus about the advantages of the grid environment and how to use it.

7. ACKNOWLEDGMENTS

Special thanks to the undergraduate students who implemented sections of the prototype and framework as part of their class and capstone projects: Dan Tilden, Andrew Chiusano, Anthony LaTorre, Ian Scott, Gregory Adkins, Justin Freund, Derek Haas, and Scott Carpenter.

REFERENCES

[1] “Announcing OpenMacGrid: Together We Are Strong”, http://www.macresearch.org/announcing_openmacgrid_together_we_are_strong, January 2007.

[2] Bal, H., Casanova, H., Dongarra, J., and Matsuoka, S., “The GRID2”, ch. 24, Application-Level Tools, pp. 463 – 489, Morgan Kaufmann Publishers, 2004.

[3] Berkeley Open Infrastructure for Network Computing, http://boinc.berkeley.edu/.

[4] Box, Don. A Brief History of SOAP, O'Reilly Media, Inc, New York, NY, 2001.

[5] Buyya, R., Chapin, S. J., and DiNucci, D. C. 2000. Architectural Models for Resource Management in the Grid. In Proceedings of the First IEEE/ACM international Workshop on Grid Computing R. Buyya and M. Baker, Eds. Lecture Notes In Computer Science, vol. 1971. Springer-Verlag, London, 18-35.

[6] Czajkowski, Karl. Grid Scheduling through Service-Level Agreement, 15 Dec 2006. http://www.isi.edu/~annc/classes/fall2003/lecture4rm.ppt

[7] de Bruijn, E.W. and Gommans, L.H.M. IRTF AAA Arch Research Group Outline of a Service Level Agreement, http://www.aaaarch.org/doc13/ServiceLevelAgreement.htm, 2000.

[8] The Globus Alliance. http://www.globus.org/.

[9] Grid Computing, http://en.wikipedia.org/wiki/Grid_computing.

[10] Guruge, Anura. Web Services: Theory and Practice, Elsevier Inc, Oxford, 2004.

[11] Jacobsthal, E., Über Sequenzen ganzer Zahlen, von denen keine zu n teilerfremd ist, I-III, Norske. Vid. Selsk. Forhdl., 33, 117-139, 1960.

[12] LaMonica, M. Sun plugs software into the grid, August 2005. CNET News.com.

[13] Liu, M. L. Distributed Computing: Principles and Applications. Pearson Education, San Francisco. 2004.

[14] Maier, H., Pomerance, C., Unusually Large Gaps Between Consecutive Primes, Trans. Amer. Math. Soc. 322 (1), 201-237, (1990).

[15] Nemeth, Z., and Sunderam, V. Characterizing grids: Attributes, definitions, and formalisms.

[16] Pintz, J. Very Large Gaps between Consecutive Primes, J. Number Theory, 63 (2), 286-301, 1997.

[17] Sloane, N.J.A, On-Line Encyclopedia of Integer Sequences, Seq. A048670, http://www.research.att.com/~njas/sequences/, 2006.

[18] Zheng, Ran, and Hai Jin. An Integrated Management and Scheduling Scheme for Computational Grid. Huazhong University of Science and Technology. http://grid.hust.edu.cn/downloads/An_Integrated_Management_and_Scheduling_Scheme.pdf.
