Parallel Language Processing System for High-Performance Computing

Eiji Yamanaka     Tatsuya Shindo
(Manuscript received April 22, 1997)

FUJITSU Sci. Tech. J., 33, 1, pp. 39-51 (June 1997)
UDC 519.68: 681.326

Fujitsu has developed a common parallel language processing system for the VPP and AP distributed memory parallel computers. The parallel language processing system includes a parallelizing compiler, libraries, and parallelizing support tools. The systems were developed to satisfy the following requirements:
1) To provide a language processing system that is common to the VPP and AP,
2) to provide multiple paradigms for parallel programming, and
3) to realize functions to achieve high performance.
The following have also been developed:
- A Fortran parallelizing compiler that processes VPP Fortran. The compiler provides parallelizing notations for manual tuning and makes programming easy.
- The MPI and PVM message passing libraries, which can be custom tuned to suit a machine's architecture.
- The SSL II library of popular numerical calculation algorithms, which are parallelized to achieve high performance.
- A GUI programming support tool called Workbench that provides users with several options for parallel programming.

1. Introduction
The performance of single central processing units has reached the limit of improvement, and it is expected that the same is true for shared memory parallel processing computers. Therefore, distributed memory parallel computers, in which multiple processor elements (PEs: a PE is an operation unit consisting of a CPU and memory) are connected through a high-speed network, are being investigated.
Accordingly, Fujitsu has developed the VPP distributed-memory vector parallel supercomputer and the AP scalar parallel server.
To make the best use of a distributed memory parallel computer, it is necessary to distribute the load among PEs evenly and to reduce the overheads for communication and synchronization between PEs. It is therefore important to develop programming techniques that can be used to meet these requirements.
The parallel language processing system supports parallel programming techniques with features that have not been implemented in conventional programming.1),2)

2. Purposes of the Parallel Language Processing System
The system was developed to satisfy the following requirements:
1) To provide a language environment that is common to the VPP and AP systems,
2) to support multiple parallel processing methods, and
3) to achieve a higher processing rate.

2.1 Language environment common to the VPP and AP systems
The VPP system is a vector parallel system in which vector computers are used as nodes. The AP system is a scalar parallel system in which scalar computers are used as nodes. The VPP and AP systems each have their own optimal type of program; therefore, users should choose one of these systems, or link them, according to the programs to be executed.

The most important objective was to enable programs to be ported between the two systems easily.

For the parallel language processing system, the parallel programming language, the message passing libraries, and the numerical calculation library were designed to be used on both the VPP and AP systems. Also, the user views in Workbench (the programming environment) are unified to satisfy this objective.

2.2 Support of multiple parallel processing methods
There are two major methods of creating a parallel program: one method uses a parallel programming language and the other uses message passing.

Parallel programming languages accept conventional programs created in sequential processing languages and are easy to write; however, their syntax also restricts the parallel processing models that can be used and makes other models hard to write.

Message passing can describe various parallel processing methods because the communication between processors (i.e., the low-level implementation) is described directly. However, message passing also makes programming difficult.

Each method has its own advantages and disadvantages. Therefore, the VPP and AP systems support both methods so that one or the other can be applied according to the purpose.

For message passing, a number of specifications have been proposed and standardized. MPI and PVM, which are in wide use, are supported as platforms on both the VPP and AP systems. In addition, the VPP system supports PARMACS, which is popular in Europe, and the AP system supports APlib, which was developed for the AP1000. Users therefore have a wide selection.

2.3 In search of higher processing rates
The primary goal of high-performance computing (HPC) is higher processing rates. The parallel language processing system focuses on providing specifications that speed up the execution of parallel processing programs and on developing implementations that exploit the hardware capabilities.

The parallel processing compiler processes a specially developed parallel programming language called VPP Fortran. VPP Fortran supports split allocation of arrays to multiple processors, highly abstract parallel descriptions such as the parallel execution of DO loops, and a notation that enables flexible control of system operations, including block data transfer and synchronization.

To implement message passing, a lower layer called MPlib was placed in the VPP and AP systems, and all message passing libraries are built on MPlib. The interface of MPlib is designed so that the overheads due to this layer are minimized, and the programs inside MPlib are tuned so that MPlib is optimized for each machine.

The components of the parallel language processing system are described in the following sections.

3. Parallel Programming Language
The VPP and AP systems support the VPP Fortran parallel programming language. This section looks at VPP Fortran.

3.1 Purposes of VPP Fortran
VPP Fortran was designed originally for the VPP500 system. When VPP Fortran was designed, the situation regarding languages for distributed memory parallel computers was as follows:
- There were no practical standard parallel languages for distributed memory parallel computers.
- Techniques for implementing automatic parallelization on distributed memory parallel computers were not fully developed, and it was expected that sufficient performance could not be achieved by automatic parallelization using compilers alone.
- Message passing differed greatly from conventional sequential processing programming, so programmers who created sequential processing programs could not easily handle message passing.
Because of the above, VPP Fortran was developed so that general programmers could write parallel processing programs for distributed memory parallel computers.

3.1.1 Easy programming
Generally, programs for parallel computers with distributed memory are hard to create, and only users very familiar with parallel processing can use such computers. Parallel computer systems for practical use must have sufficient performance and easy programming features for general programmers.

3.1.2 Application of existing resources
Parallel computer systems for practical use must be able to handle existing user programs.

3.1.3 Performance on the hardware
The performance of programs running on distributed memory parallel computers tends to be much lower than the peak performance offered by the hardware. To avoid this waste, new parallel processing languages must fully exploit the hardware's performance.

3.2 Overview of VPP Fortran
VPP Fortran is a kind of Fortran 90 with parallel processing functions for distributed memory parallel computers.

3.2.1 Logical configuration
Figure 1 shows the logical configuration of VPP Fortran. A parallel computer with a layered memory has a global space shared by all PEs and local spaces specific to each PE.
The global space is a shared virtual memory space constructed from the memory of each PE by the operating system and the execution libraries.

3.2.2 Features of VPP Fortran
1) Global array
One of the problems in creating programs for distributed memory parallel computers is how to locate a logical unit of data that is stored in physically separated memory areas.
To solve this problem, VPP Fortran introduces the global array, which uses the global space provided by the system. In practice, a global array is split and assigned across the PEs.
Use of the global array enables a unit of data to be handled logically. Therefore, there is generally no need to change a conventional programming style. Distributed memory parallel computers supporting the global array simplify the porting of programming styles and existing programs as compared with conventional distributed memory parallel computers.
2) Directive method
Parallel processing functions in VPP Fortran are supported by directives and service subroutines, and most are described by directives only. Because the directives take the form of comments, a program whose parallel processing functions are described by directives is guaranteed to run in the same manner even if it is compiled and executed sequentially. This provides compatibility with programs on other systems and is useful for verifying parallel processing programs.
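The directive style can be sketched as follows. This is an illustrative sketch only: the directive names are those listed in Table 1, but the `!XOCL` sentinel and the exact operand syntax are assumed rather than quoted from this article. A compiler without VPP Fortran support treats the directive lines as ordinary comments and executes the loop sequentially, which is exactly the compatibility property described above:

```fortran
      PROGRAM SKETCH
*     Illustrative sketch: directive names follow Table 1; the !XOCL
*     sentinel and operand syntax are assumed, not quoted from this
*     article.
!XOCL PROCESSOR P(4)
      REAL A(1000)
!XOCL GLOBAL A
      INTEGER I
!XOCL PARALLEL REGION
*     The iterations below are spread across the four PEs of P; a
*     sequential compiler simply runs the whole loop on one processor.
!XOCL SPREAD DO
      DO 10 I = 1, 1000
         A(I) = 2.0 * A(I)
 10   CONTINUE
!XOCL END SPREAD
!XOCL END PARALLEL
      END
```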

Fig. 1 — Logical configuration of the system. (Each PE consists of a CPU and local memory; the local memories of all PEs together form a virtual global memory.)


3) Compatibility with subprograms in standard Fortran
Subprograms in standard Fortran can be used from VPP Fortran without any modification for parallel processing. VPP Fortran thus enables effective use of existing program libraries.
4) Gradual parallel processing
To obtain high performance on distributed memory parallel computers, it is always necessary to tune programs. The parallel processing functions of VPP Fortran are designed so that programs can be tuned gradually. Users can tune programs step by step to achieve the maximum parallel performance; thus, the performance that users obtain depends on the tuning.
5) Explicit splitting function
Data transfer between PEs generally degrades the performance of parallel processing programs that run on distributed memory parallel computers. If the area of a calculation assigned to a PE does not correspond to the data area on the PE, data must be transferred from another PE to continue the calculation. Therefore, VPP Fortran has specifications that ensure consistent data partitioning and procedure partitioning. These specifications minimize the data transfer frequency to prevent a reduction in the performance of parallel processing programs.
6) Explicit transfer
Although VPP Fortran is designed so that the data transfer frequency is reduced, some programs have structures that require data transfer between PEs. To maintain high performance, VPP Fortran has specifications that enable users to control transfer patterns and data transfer timing. These specifications guarantee that programs can transfer data efficiently.
Table 1 shows a list of the VPP Fortran parallel processing functions.

3.2.3 Execution method of VPP Fortran
A spread-barrier method is adopted for VPP Fortran. The spread-barrier method basically consists of two cyclic operations: spread and barrier. Spread is a split execution in each PE. Barrier is a simultaneous synchronization of all PEs.

Figure 2 shows an example of how a program is executed. The program is handled by eight PEs. parallel region indicates the beginning of the part to be parallel processed, and end parallel indicates the end of that part; the beginning and end are indicated once in a program. spread region indicates that parts C and D are allocated to different PEs and parallel processed. spread do indicates that the iterations of the DO loop are divided and parallel processed. end spread indicates the end of the split execution. Barrier synchronization occurs at the spread region/end spread and spread do/end spread boundaries. Programs are executed with cycles of spread and barrier.

Separating split execution from parallel processing allows the overheads due to parallel processing to be reduced.

3.3 Performance and applicability of VPP Fortran
3.3.1 Performance of VPP Fortran
Figure 3 shows the ratio of the actual performance to the estimated peak performance of the VPP500 hardware using the LINPACK program. The data in the figure can be found in reference 3). The LINPACK program written in VPP Fortran uses block LU decomposition, which is suitable for parallel processing. The program is within the scope of the VPP Fortran specifications; there are no machine-specific procedures such as coding in assembler. Figure 3 shows that a VPP Fortran program can be very efficient.

3.3.2 Applicability of VPP Fortran
Table 2 lists the number of parallel directives added when application programs in the field were converted for parallel processing using VPP Fortran. VPP Fortran's applicability can be measured by the number of parallel directives required to convert a program for parallel processing. (Other important factors are the ease of rewriting and the ability to represent the characteristics of parallel pro-


Table 1. List of parallel processing functions in VPP Fortran

Syntax/Function                        Description

Directives:
  processor, proc alias subprocessor   Declares processor group forms or aliases used in the program.
  index partition                      Declares the partition type or scope.
  local                                Declares the local variables.
  global                               Declares the global variables.
  parallel region / end parallel       Specifies the scope of the program to be parallel processed.
  spread region / end spread           Specifies split execution for a part of the program.
  spread do / end spread               Specifies split execution for the DO loop.
  spread move / end spread             Specifies data transfer between PEs.
  overlapfix                           Specifies transfer of the boundaries.
  movewait                             Specifies a wait for the end of asynchronous data transmission.
  broadcast                            Specifies broadcast data transmission.
  unify                                Specifies data transfer between reduplicated local arrays.
  barrier                              Specifies barrier synchronization.
  lockon / endlock                     Specifies the critical section.

Service procedures:
  novproc, norproc                     Inquires about the number of PEs used for execution.
  idvproc, idrproc                     Inquires about the identification numbers of PEs used for execution.
  postevt, waitevt                     Post/wait type synchronization.

Promotion of Fortran acceptance:
  common                               Common connection between global variables.
  equivalence                          Memory sharing for local variables and global variables.
  Procedure interface                  Virtual/actual connection between arguments for global variables and split arrays.

Input/Output:                          Introduces parallel I/O.

Fig. 2 — Spread-barrier execution. (The following program is executed by eight PEs, PE1 to PE8; the labels A to H mark the execution phases shown in the figure, and F1 to F8 are the spread iterations of the DO loop.)

      processor P(8)
   A  parallel region
   B  spread region /P(1 : 2)
   C  region /P(3 : 8)
   D  end spread
   E  spread do /(P)
         DO I=1,N
   F     END DO
      end spread
   G  end parallel
   H  END

Fig. 3 — Efficiency of LINPACK program. (Efficiency (%) versus peak performance (Gflops) for the VPP500, Intel Paragon XP/S MP, HITACHI SR2201, Cray T3D, IBM SP2-T2, and Thinking Machines CM-5.)


cessing.)
For VPP Fortran, the percentage of parallel directives is 10% or less in most programs in each field. This fact indicates the wide applicability of VPP Fortran.

4. Message Passing
Message passing techniques are based on interprocess communication. They were developed in the early 1980s, mainly in the USA. Currently, MPI, PVM, and PARMACS are the most popular.

4.1 Approach to message passing
Fujitsu started using message passing in the HPC field in 1993. Table 3 shows the message passing libraries for the VPP and AP systems currently provided by Fujitsu.

4.2 Message passing programming
Parallel processing programs using message passing are created by adding library calls for communication between PEs to sequential processing programs. Functions to create processes, to send and receive messages, and to control synchronization exclusively are essential for message passing.
Figure 4 shows a simple message passing program written with MPI as an example.

The features of message passing parallel programs can be clearly identified by comparing them with data-parallel programs such as VPP Fortran programs. Table 4 gives a basic comparison of the features.

4.3 Features for implementing the message passing function
The VPP and AP systems do not implement the message passing function directly on the hardware or the operating system. Instead, they implement it on a core library called MPlib, which is a common layer.

MPlib is fully capable of handling the high-speed networks in the VPP and AP systems to

Table 3. Major message passing libraries

Type      Developer                             Features
MPI       MPI Forum                             Supports group communication functions and communication space.
PVM       Oak Ridge National Laboratory (USA)   Supports heterogeneous environments using daemons.
PARMACS   PALLAS Inc. (FRG)                     Supports process topologies.

   main (argc, argv)
   {
       MPI_Init (&argc, &argv) ;          /* Initialization */
       if (rank == A) {
           strcpy (msg, "How are you?") ; /* Posts the message "How are you?" */
           MPI_Send (msg, …, B, …) ;      /* from A to B. */
       } else {
           MPI_Recv (msg, …, A, …) ;      /* B receives the message from A. */
       }
       MPI_Finalize ( ) ;                 /* End */
       return 0 ;
   }

Fig. 4 — MPI program.

Table 4. Features of message passing

Function                                  Message passing                            Data paralleled (VPP Fortran)
Data space                                Local space only                           Global variables and local space are used in parallel.
Data allocation                           No split allocation is supported.          Split allocation is supported.
Data transfer                             Flexible                                   Frequently used data is put in stationary patterns.
Direct quotation from another PE's data   Impossible (Explicit transfer is required.)   Possible (Transfer is automatic.)
Parallel processing method                Flexible                                   Mainly handles DO loops.

Table 2. Parallelization of application programs

Field                         Total number of steps   Number of directives   Content ratio
Computational hydrodynamics   13 765                   83                    0.6%
                               4 019                  303                    7.5%
Meteorology                    5 711                  777                   13.6%
                               3 645                  356                    9.8%
Petroleum                      5 302                  760                   14.3%
                              11 779                  381                    3.2%
Particle simulation              904                   48                    5.3%
                                 479                   14                    2.9%
Quantum chromodynamics         6 623                   66                    1.0%
                               1 209                   94                    7.8%


make the best use of the hardware. This implementation method enables the message passing functions listed above to be standardized for the VPP and AP systems.
Also, by setting MPlib as a common layer and using the execution information obtained by MPlib, all programs can be executed using a common operation method regardless of the type of message passing library used.

4.4 Performance of the message passing function
The VPP and AP systems are parallel computers with distributed memory. Their performance depends on their network communication capabilities.

The following two values were measured as indexes of data transmission capability using a program (the PingPong program) that transfers messages between two processes:
- Latency: the data transmission time for data of length 0 (the time required to start a data transfer)
- Bandwidth: the data transmission rate for a specified amount of data
Table 5 shows the performance of the major message passing libraries, including those of other vendors' distributed memory parallel computers. The rates in the "Bandwidth" column indicate that the VPP and AP systems make the best use of their high-speed networks.
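A PingPong measurement of this kind can be sketched as follows. This is an illustrative reconstruction, not the benchmark program actually used; MPI-1 Fortran bindings are assumed, and the message length LEN and repetition count REPS are arbitrary choices:

```fortran
      PROGRAM PNGPNG
*     Illustrative PingPong sketch (not the actual benchmark code).
*     Process 0 sends LEN bytes to process 1, which echoes them back.
      INCLUDE 'mpif.h'
      INTEGER LEN, REPS
      PARAMETER (LEN = 1000000, REPS = 100)
      CHARACTER BUF(LEN)
      INTEGER RANK, IERR, I, STAT(MPI_STATUS_SIZE)
      DOUBLE PRECISION T0, T1, ONEWAY
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
C     Many repetitions reduce clock-resolution error.
      T0 = MPI_WTIME()
      DO 10 I = 1, REPS
         IF (RANK .EQ. 0) THEN
C           Rank 0 sends the message and waits for the echo.
            CALL MPI_SEND(BUF, LEN, MPI_BYTE, 1, 0,
     &                    MPI_COMM_WORLD, IERR)
            CALL MPI_RECV(BUF, LEN, MPI_BYTE, 1, 0,
     &                    MPI_COMM_WORLD, STAT, IERR)
         ELSE IF (RANK .EQ. 1) THEN
C           Rank 1 echoes the message straight back.
            CALL MPI_RECV(BUF, LEN, MPI_BYTE, 0, 0,
     &                    MPI_COMM_WORLD, STAT, IERR)
            CALL MPI_SEND(BUF, LEN, MPI_BYTE, 0, 0,
     &                    MPI_COMM_WORLD, IERR)
         END IF
 10   CONTINUE
      T1 = MPI_WTIME()
C     Half the averaged round-trip time approximates the one-way time.
      ONEWAY = (T1 - T0) / (2.0D0 * REPS)
      IF (RANK .EQ. 0) PRINT *, 'One-way time (s):', ONEWAY
      CALL MPI_FINALIZE(IERR)
      END
```

Running with LEN = 0 measures the latency; for large LEN, the bandwidth is approximated by LEN divided by the one-way time.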

Figure 5 shows the performance efficiency of application software using PARMACS on the VPP500.
In general, the greater the transmission overhead caused by adding processes, the lower the efficiency of parallel processing. In the VPP system, however, the rate at which the efficiency decreases is low. This is due to the crossbar network, which enables fast and highly efficient operation in the VPP system.

5. Technological Computation Libraries for Parallel Processing
Fujitsu has developed the SSL II/VPP and SSL II/AP parallel numeric computation libraries for the VPP and AP systems.

5.1 Overview of SSL II/VPP and SSL II/AP
SSL II/VPP and SSL II/AP are new parallel numeric computation libraries that are highly tuned to bring out the hardware capabilities of the VPP and AP distributed memory parallel computers. SSL II/VPP and SSL II/AP feature high capabilities and scalability.
The parallel numeric computation algorithms are provided in the form of easy-to-use VPP For-

Table 5. Performance of message passing

System                     Type           Latency (µs)   Bandwidth (megabytes/second)
VPP system                 MPI            26             429
                           PVM            15             323
                           PARMACS Note)  17             470
AP system                  MPI            68              60
T3D (CRAY)                 PARMACS Note)  60              25
SP2 (IBM)                  PARMACS Note)  56              23
CM5E (Thinking Machines)   PARMACS Note)  18              25

Note) The values for PARMACS were measured by PALLAS Inc. (FRG).

Fig. 5 — Performance efficiency. (Performance efficiency versus number of PEs, from 0 to 80.)


tran subroutines.
Data for calculations is stored in global arrays that are separated and located in the PEs. To store sparse matrices, the Ellpack format and the diagonal format are supported.
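As a rough illustration of the Ellpack idea (a sketch under assumed names; ELLMV and its arguments are hypothetical and are not the SSL II interface), every row of an N × N sparse matrix is padded to a fixed maximum of MAXNZ nonzeros, giving a dense coefficient array and a parallel column-index array:

```fortran
*     Illustrative sketch: ELLMV and its argument names are
*     hypothetical, not the SSL II interface. Ellpack storage keeps,
*     for an N x N sparse matrix with at most MAXNZ nonzeros per row,
*     a dense N x MAXNZ coefficient array COEF and a matching
*     column-index array JCOEF; short rows are padded with zero
*     coefficients and any valid column index.
      SUBROUTINE ELLMV(N, MAXNZ, COEF, JCOEF, X, Y)
      INTEGER N, MAXNZ, JCOEF(N, MAXNZ)
      DOUBLE PRECISION COEF(N, MAXNZ), X(N), Y(N)
      INTEGER I, J
      DO 20 I = 1, N
         Y(I) = 0.0D0
 20   CONTINUE
*     Sweeping column-by-column keeps the inner loop long and regular,
*     so it vectorizes well on vector PEs.
      DO 40 J = 1, MAXNZ
         DO 30 I = 1, N
            Y(I) = Y(I) + COEF(I, J) * X(JCOEF(I, J))
 30      CONTINUE
 40   CONTINUE
      RETURN
      END
```

Padding every row to MAXNZ entries wastes some storage, but it turns the matrix-vector product into long, regular loops, which is why the format suits vector hardware.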

These parallel numeric computation algorithms were developed jointly by Fujitsu and a group studying numeric computation at the Australian National University. The members of the group have been leading users of parallel computers and are authorities on numeric computation.

SSL II/VPP and SSL II/AP use the same parallel numeric computation algorithms and interfaces to maintain compatibility between programs on the VPP and AP systems.

5.2 Functions in SSL II/VPP and SSL II/AP
SSL II/VPP and SSL II/AP provide functions that are used frequently or applied to sizable calculations. The following functions are provided:
- Linear equation solvers for dense matrices (real matrix, positive definite symmetric matrix, complex matrix)
- Linear equation solvers for banded matrices (real matrix, positive definite symmetric matrix)
- Linear equation solvers for sparse matrices (real matrix, positive definite symmetric matrix)
- Matrix multiplication of real matrices, and real sparse matrix-vector multiplication
- Inverse of real matrices
- Fourier transforms (one-dimensional to three-dimensional complex transforms, one-dimensional to three-dimensional real transforms)
- Eigenvalue problems (real symmetric matrix, tri-diagonal matrix, generalized eigenvalue problem)
- Singular value decomposition
- Least squares solutions
- Uniform random numbers

5.3 Overview of the algorithms
We developed up-to-date numeric computation algorithms to implement the functions listed above on distributed memory parallel computers. For large-scale calculations, the algorithms provide high-speed processing almost proportional to the number of PEs.

The algorithms have the following features:
- For the linear equation solvers, a blocked direct computational method is used. In this method, the optimal load balance is maintained by dynamically redistributing the data among PEs, and data transfer and computation are overlapped.
- For the sparse matrix solver, the preconditioned conjugate gradient method (CG method) and the robust modified generalized conjugate residual method (MGCR method) are used.
- For Fourier transform kernels, a recursive five-step algorithm suitable for vector computers is used.
- Double-precision uniform random numbers have a long period of 10^52 or more, and an algorithm with good statistical characteristics is used.
- For eigenvalue problems and singular value problems, the one-sided Jacobi method is used because of its good scalability.

5.4 Performance of SSL II/VPP
This section describes some examples of the performance of SSL II/VPP on the VPP700 (1 PE: 2.2 Gflops).
- SSL II/VPP can solve a real coefficient linear equation with 50,000 unknowns in 1,746 seconds using 27 PEs. (A memory restriction prevents the use of just a single PE; however, if a single PE could be used to solve the equation, it would take about 10 hours.) With 27 PEs, the performance is 47.72 Gflops, which is about 80% of the peak hardware performance. Figure 6 shows the relationship between the number of PEs and the performance.
- SSL II/VPP can compute a multiplication of two 10,000 × 10,000 real matrices in about 93 seconds using 10 PEs, yielding a performance of about 98% of the peak hardware performance. Figure 7 shows the relationship between the number of PEs and the performance.
- SSL II/VPP can process one-dimensional and three-dimensional large-sized complex FFTs at a high rate. Problems that are too large to store in a single PE can also be handled. For example, using 25 PEs, a one-dimensional FFT with 2^28 elements and a three-dimensional FFT with 1,024 × 1,024 × 1,024 elements can be calculated in 1.536 seconds (24.46 Gflops, 44.4% of the peak hardware performance) and 10.35 seconds (15.56 Gflops, 28.2% of the peak hardware performance), respectively. These performance figures are some of the highest in the field of large-scale FFTs. Figure 8 shows the relationship between the number of PEs and the performance.
- SSL II/VPP can generate 250 M double-precision uniform random numbers per second per PE. Figure 9 shows the relationship between the number of PEs and the performance.
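The quoted rates can be checked against the standard operation counts: about \(\tfrac{2}{3}n^3\) flops for the direct solution of an \(n \times n\) dense system, and \(5N\log_2 N\) flops for a complex FFT of length \(N\):

```latex
\begin{aligned}
\text{dense solve: } & \tfrac{2}{3}(5\times10^{4})^{3} \approx 8.3\times10^{13}\ \text{flops},
  & 8.3\times10^{13}/1746\ \text{s} &\approx 47.7\ \text{Gflops}
    \approx 0.80\times(27\times2.2\ \text{Gflops}),\\
\text{1-D FFT: } & 5\cdot2^{28}\cdot28 \approx 3.8\times10^{10}\ \text{flops},
  & 3.8\times10^{10}/24.46\ \text{Gflops} &\approx 1.5\ \text{s},\\
\text{3-D FFT: } & 5\cdot2^{30}\cdot30 \approx 1.6\times10^{11}\ \text{flops},
  & 1.6\times10^{11}/15.56\ \text{Gflops} &\approx 10.4\ \text{s}.
\end{aligned}
```

Against the 25-PE peak of 25 × 2.2 = 55 Gflops, the FFT rates of 24.46 and 15.56 Gflops are 44.5% and 28.3%, matching the quoted efficiencies.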

Fig. 6 — Linear equation solver. (Gflops versus number of PEs, for the maximum number of storable elements and for 10,000 elements.)

Fig. 7 — Matrix multiplication (10,000 elements). (Gflops versus number of PEs.)

Fig. 8 — Complex Fourier transform. (Gflops versus number of PEs, for three-dimensional transforms (256 × 256 × 256 elements and the maximum number of storable elements) and one-dimensional transforms (16,777,216 elements and the maximum number of storable elements).)

Fig. 9 — Uniform random number. (Generation rate (M/second) versus number of PEs.)


6. Program Development Environment: Workbench
Workbench is an integrated environment for developing programs for the VPP and AP systems. It supports the entire series of program development processes from the user's point of view.

6.1 Features of Workbench
Workbench has the following features:
1) A graphical user interface (GUI) with high operability using Motif Note 1)
2) An integrated environment for program development activities such as editing, compilation, debugging, performance analysis, and operations
3) Operation in the same way as the Fujitsu engineering development environment for SPARC Note 2)
Workbench is compatible with programming environments on workstations because it is operated in the same way as the engineering development environment for SPARC. Therefore, flexible development styles are possible. For example, the fundamental parts of a program can be created on a workstation first, and the program can then be moved to the VPP or AP system for vectorization or parallelization.

6.2 Functions of Workbench
The following shows the correspondence between the programming steps and the functions provided by Workbench:
6.2.1 Editing
Activates an editor. The user can select the editor.
6.2.2 Compile
Activates the compiler. The user can select compile options using buttons.
6.2.3 Execution
Executes the program on the VPP or AP system. The user can select execution options using buttons in the same way that compile options are selected.
6.2.4 Debugging
Displays a source program, and enables breakpoints to be set and variables to be displayed using the mouse. Vector programs can be debugged on the VPP system.
6.2.5 Performance analysis
Enables the performance analysis tools in the VPP and AP systems to be operated from Workbench.
6.2.6 Job management
Enables programs to be submitted, monitored, and canceled as batch jobs using a GUI. This supports not only the development of programs but also system operations.
Figure 10 shows the windows of Workbench.

6.3 Operating environment for Workbench
The components of Workbench are located on a workstation and on the VPP or AP system, and the components are linked with each other. Figure 11 shows the operating environment in the VPP system.

Making the GUI functions usable on the workstation provides a distributed development environment and reduces the load on the VPP and AP system mainframes.

Note 1) Registered trademark of OSF (Open Software Foundation, Inc.)
Note 2) Registered trademark of SPARC International, Inc.

Fig. 10— Windows of Workbench.


7. System-specific functions
In addition to the parallel processing system that is common to the VPP and AP systems, system-specific functions are provided to exploit the various features of each system and improve performance.

7.1 Functions specific to the VPP system
7.1.1 Automatic vectorization/LIW optimization functions
The VPP system has a vector parallel architecture. In this architecture, each PE corresponds to a conventional supercomputer. Thus, conventional programs can be used without performance deterioration even if they are not modified for parallel processing.

If a program runs satisfactorily on the VPP system, there is no need to modify it for parallel processing. To obtain higher performance, however, the program must be modified for parallel processing.

For more effective use of each PE's vector architecture, the VPP system supports the automatic vectorization and optimization functions that have proved reliable on the VP series. The main functions of automatic vectorization and optimization are as follows:
- Vectorization of nested DO loops
- Vectorization of DO loops containing control statements such as IF statements
- Vectorization of DO loops containing intrinsic functions
- Partial vectorization of DO loops
- Vectorization of total sums, inner product operations, maximum value/minimum value retrievals, and collection/diffusion operations
- Vector pipeline scheduling
- Optimization for LIW
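The collection/diffusion operations in the list above correspond to what are commonly called gather and scatter loops, in which an array is accessed indirectly through an index vector. A minimal sketch of the two loop forms such a vectorizer recognizes (the array and index names here are illustrative, not taken from the original):

```python
def gather(a, idx):
    """Gather (collection): out[i] = a[idx[i]], an indirect load."""
    return [a[j] for j in idx]

def scatter(values, idx, out):
    """Scatter (diffusion): out[idx[i]] = values[i], an indirect store."""
    for i, j in enumerate(idx):
        out[j] = values[i]
    return out

a = [10, 20, 30, 40]
idx = [3, 0, 2]
print(gather(a, idx))                         # -> [40, 10, 30]
print(scatter([1, 2, 3], idx, [0, 0, 0, 0]))  # -> [2, 0, 3, 1]
```

A vectorizing compiler turns such indirect accesses into single vector gather/scatter instructions instead of one scalar access per loop iteration.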

7.1.2 Analyzers
When creating new programs or porting programs developed for other systems, performance tuning is always necessary. The VPP system supports two analyzers for performance tuning: the sampler and PEPA/MPA.

1) Sampler
The sampler is a sampling-based performance analysis tool. In sampling, the executing program is checked at various points by periodic CPU interrupts to analyze its performance.

Sampling does not require retranslation of programs, and it enables performance analysis of programs in the standard executable format.

In vector program analyses, the distribution of cost (the percentage of CPU interrupts for each procedure, loop, and array, together with the average physical vector length) is displayed by routine and loop.

Using the information displayed, high-cost parts can be identified and programs can be tuned effectively.
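The principle of interrupt-driven sampling can be sketched in ordinary Python: a monitor loop periodically inspects the running program's call stack and counts which routine each sample lands in, so high-cost routines accumulate the most samples. This only illustrates the idea; it is not the VPP sampler, and all names below are illustrative:

```python
import collections
import sys
import threading
import time

def sample_profile(workload, interval=0.001):
    """Run workload in a thread and periodically sample its call stack,
    counting the function on top of the stack at each sample."""
    counts = collections.Counter()
    done = threading.Event()

    def worker():
        workload()
        done.set()

    t = threading.Thread(target=worker)
    t.start()
    while not done.is_set():
        # Snapshot the worker thread's current frame (one "interrupt").
        frame = sys._current_frames().get(t.ident)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval)
    t.join()
    return counts

def hot():
    # Deliberately expensive loop; should receive most samples.
    s = 0
    for i in range(5_000_000):
        s += i
    return s

profile = sample_profile(hot)
print(profile.most_common(3))
```

Because only periodic snapshots are taken, the program itself needs no recompilation or instrumentation, which mirrors the property described above.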

For parallel processing programs, the parallel processing rate, parallel processing acceleration rate, load balance between PEs, synchronization wait rate, and asynchronous transfer wait rate are displayed for the entire program or for a specific procedure.
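The relationship between the parallel processing rate and the acceleration rate (speedup) that such a tool reports is commonly modeled by Amdahl's law: if a fraction p of the execution is parallelized across n PEs, the ideal acceleration is 1 / ((1 - p) + p / n). A small illustrative calculation (not part of the original tool):

```python
def amdahl_speedup(p, n):
    """Ideal speedup when a fraction p of the work runs in parallel on n PEs."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 95%-parallel program is limited to a 20x speedup, however many PEs:
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

This is why the tools emphasize raising the parallel processing rate: the serial fraction, not the PE count, quickly becomes the limit.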

By using the above information, program performance can be improved by increasing the parallel processing rate and reducing the wait times for synchronization and asynchronous transfer.

2) PEPA/MPA

PEPA/MPA checks how the hardware is used, whereas the sampler analyzes performance in terms of the software's characteristics (procedures and loops).

The PEPA (PE performance analyzer) collects events related to PEs. It collects the VU busy rate

[Figure: components on the workstation (VPP-WB Manager with Job Mgr and Log, X-Server, and an editor) linked to components on the VPP system (VPP-WP BODY, compiler, Fdp, etc.), with start and operation flows between them.]

Fig. 11— Working environment.


(the fraction of time the vector pipeline is operating), the all-PE average performance (the ratio of scalar floating-point operations to vector floating-point operations), the number of operations (the total number of scalar and vector floating-point operations), and the measurement time (the time needed to collect the data).

The MPA (mover performance analyzer) collects events related to the DTU (data transfer unit), the hardware that handles data transfers between PEs.

The PEPA and MPA are provided as subroutines: PEPA/MPA can be used by adding calls to these subroutines at the beginning and end of the parts to be analyzed. It is also possible to analyze the entire program by specifying environment variables, without modifying the source program.
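The call-bracketing pattern described above (one call at the start of a region, one at the end, with the measured events reported in between) can be sketched generically in Python. The class and method names below are illustrative only; they are not the actual PEPA/MPA interface:

```python
import time

class RegionAnalyzer:
    """Illustrative stand-in for a PEPA/MPA-style analyzer: calls bracket
    the code region whose events (here, just wall time) are measured."""
    def __init__(self):
        self.results = {}
        self._name = None
        self._start = None

    def begin(self, name):
        # Call at the start of the region to be analyzed.
        self._name = name
        self._start = time.perf_counter()

    def end(self):
        # Call at the end of the region; records the measurement.
        elapsed = time.perf_counter() - self._start
        self.results[self._name] = elapsed
        return elapsed

analyzer = RegionAnalyzer()
analyzer.begin("kernel_loop")
total = sum(i * i for i in range(100_000))   # the region under analysis
analyzer.end()
print(sorted(analyzer.results))
```

Analyzing the whole program via an environment variable, as the text mentions, simply moves these bracketing calls to program entry and exit so the source needs no changes.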

7.2 Functions specific to the AP system
7.2.1 APlib (AP1000-compatible library)
The AP3000 supports APlib, which was originally designed as the communication library of the AP1000.

APlib inherits the software resources developed on the AP1000, and it permits multi-thread programming methods to be used on each node.

Tasks created using APlib are mapped onto threads on the AP3000. On a node in an SMP configuration, the tasks can be processed on different CPUs.

7.2.2 Parallel profiler
The AP system provides a parallel profiler and a performance analysis tool for analyzing the performance of programs written in VPP Fortran.

After a parallel processing program has been executed, the parallel profiler indicates the total time and the calculation, communication, and synchronization times for each subroutine, function procedure, or DO loop.

The performance analysis tool accumulates information about all events. Therefore, to avoid a data space shortage, it should be applied only to a specific portion of a parallel processing program.

When tuning a parallel processing program, it is recommended to first apply the parallel profiler to the entire program to locate the portions that take the most time to process. The performance analysis tool can then be used on the individual sections to check the details.

8. Conclusion
This paper outlined the parallel programming language, message passing libraries, numeric computation libraries, and programming tools of the parallel language processing system developed for the VPP and AP systems. These tools enable general users to put distributed memory parallel computers to practical use.

Current indications are that there is a limit to the performance of shared memory parallel computers. Therefore, distributed memory parallel computers will become more important in HPC.

In response, we intend to enhance these functions, implement automatic parallelization for distributed memory parallel computers (which still remains to be achieved), and promote standard languages for parallel computers.

References
1) M. Nakanishi, H. Ina, and K. Miura: A High Performance Linear Equation Solver on the VPP500 Parallel Supercomputer. Proceedings of Supercomputing '94, pp.803-810 (1994).
2) S. Kamiya, P. Lagier, W. Krotz-Vogel, and N. Asai: Two Programming Paradigms on the VPP System. Proceedings of the International Symposium on PDSC '95, pp.35-44 (1995).
3) J. J. Dongarra: Performance of Various Computers Using Standard Linear Equations Software. August 12, 1996.


Tatsuya Shindo received the B.S. degree in Electrical Engineering from Waseda University, Japan in 1983. He joined Fujitsu Laboratories Ltd. in 1983, where he was engaged in research on parallel processing. From 1990 until 1992 he was a visiting researcher at Stanford University on sabbatical leave. He is currently a manager of parallel software development at Fujitsu Ltd.

Eiji Yamanaka received the Master's degree in Control Engineering from the Tokyo Institute of Technology, Tokyo, Japan in 1988. He joined Fujitsu Ltd., Numazu, Japan in 1988, where he is currently engaged in research and development of parallel compilers for Fujitsu's VP and VPP computer series.