Introduction to Parallel Processing Algorithms and Architectures


Introduction to Parallel Processing
Algorithms and Architectures


PLENUM SERIES IN COMPUTER SCIENCE

Series Editor: Rami G. Melhem, University of Pittsburgh, Pittsburgh, Pennsylvania

FUNDAMENTALS OF X PROGRAMMING: Graphical User Interfaces and Beyond
Theo Pavlidis

INTRODUCTION TO PARALLEL PROCESSING: Algorithms and Architectures
Behrooz Parhami


Introduction to Parallel Processing
Algorithms and Architectures

Behrooz Parhami
University of California at Santa Barbara
Santa Barbara, California

NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW

KLUWER ACADEMIC PUBLISHERS


©2002 Kluwer Academic Publishers
New York, Boston, Dordrecht, London, Moscow

All rights reserved

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.

Created in the United States of America

Visit Kluwer Online at: http://www.kluweronline.com
and Kluwer's eBookstore at: http://www.ebooks.kluweronline.com

Print ISBN 0-306-45970-1

eBook ISBN 0-306-46964-2


To the four parallel joys in my life,

for their love and support.


Preface

THE CONTEXT OF PARALLEL PROCESSING

The field of digital computer architecture has grown explosively in the past two decades. Through a steady stream of experimental research, tool-building efforts, and theoretical studies, the design of an instruction-set architecture, once considered an art, has been transformed into one of the most quantitative branches of computer technology. At the same time, better understanding of various forms of concurrency, from standard pipelining to massive parallelism, and invention of architectural structures to support a reasonably efficient and user-friendly programming model for such systems, have allowed hardware performance to continue its exponential growth. This trend is expected to continue in the near future.

This explosive growth, linked with the expectation that performance will continue its exponential rise with each new generation of hardware and that (in stark contrast to software) computer hardware will function correctly as soon as it comes off the assembly line, has its downside. It has led to unprecedented hardware complexity and almost intolerable development costs. The challenge facing current and future computer designers is to institute simplicity where we now have complexity; to use fundamental theories being developed in this area to gain performance and ease-of-use benefits from simpler circuits; and to understand the interplay between technological capabilities and limitations, on the one hand, and design decisions based on user and application requirements, on the other.

In computer designers’ quest for user-friendliness, compactness, simplicity, high performance, low cost, and low power, parallel processing plays a key role. High-performance uniprocessors are becoming increasingly complex, expensive, and power-hungry. A basic trade-off thus exists between the use of one or a small number of such complex processors, at one extreme, and a moderate to very large number of simpler processors, at the other. When combined with a high-bandwidth, but logically simple, interprocessor communication facility, the latter approach leads to significant simplification of the design process. However, two major roadblocks have thus far prevented the widespread adoption of such moderately to massively parallel architectures: the interprocessor communication bottleneck and the difficulty, and thus high cost, of algorithm/software development.




The above context is changing because of several factors. First, at very high clock rates, the link between the processor and memory becomes very critical. CPUs can no longer be designed and verified in isolation. Rather, an integrated processor/memory design optimization is required, which makes the development even more complex and costly. VLSI technology now allows us to put more transistors on a chip than required by even the most advanced superscalar processor. The bulk of these transistors are now being used to provide additional on-chip memory. However, they can just as easily be used to build multiple processors on a single chip. Emergence of multiple-processor microchips, along with currently available methods for glueless combination of several chips into a larger system and maturing standards for parallel machine models, holds the promise for making parallel processing more practical.

This is the reason parallel processing occupies such a prominent place in computer architecture education and research. New parallel architectures appear with amazing regularity in technical publications, while older architectures are studied and analyzed in novel and insightful ways. The wealth of published theoretical and practical results on parallel architectures and algorithms is truly awe-inspiring. The emergence of standard programming and communication models has removed some of the concerns with compatibility and software design issues in parallel processing, thus resulting in new designs and products with mass-market appeal. Given the computation-intensive nature of many application areas (such as encryption, physical modeling, and multimedia), parallel processing will continue to thrive for years to come.

Perhaps, as parallel processing matures further, it will start to become invisible. Packing many processors in a computer might constitute as much a part of a future computer architect’s toolbox as pipelining, cache memories, and multiple instruction issue do today. In this scenario, even though the multiplicity of processors will not affect the end user or even the professional programmer (other than, of course, boosting the system performance), the number might be mentioned in sales literature to lure customers in the same way that clock frequency and cache size are now used. The challenge will then shift from making parallel processing work to incorporating a larger number of processors, more economically and in a truly seamless fashion.

THE GOALS AND STRUCTURE OF THIS BOOK

The field of parallel processing has matured to the point that scores of texts and reference books have been published. Some of these books that cover parallel processing in general (as opposed to some special aspects of the field or advanced/unconventional parallel systems) are listed at the end of this preface. Each of these books has its unique strengths and has contributed to the formation and fruition of the field. The current text, Introduction to Parallel Processing: Algorithms and Architectures, is an outgrowth of lecture notes that the author has developed and refined over many years, beginning in the mid-1980s. Here are the most important features of this text in comparison to the listed books:

1. Division of material into lecture-size chapters. In my approach to teaching, a lectureis a more or less self-contained module with links to past lectures and pointers towhat will transpire in the future. Each lecture must have a theme or title and must



proceed from motivation, to details, to conclusion. There must be smooth transitions between lectures and a clear enunciation of how each lecture fits into the overall plan. In designing the text, I have strived to divide the material into chapters, each of which is suitable for one lecture (1–2 hours). A short lecture can cover the first few subsections, while a longer lecture might deal with more advanced material near the end. To make the structure hierarchical, as opposed to flat or linear, chapters have been grouped into six parts, each composed of four closely related chapters (see diagram on page xi).

2. A large number of meaningful problems. At least 13 problems have been provided at the end of each of the 24 chapters. These are well-thought-out problems, many of them class-tested, that complement the material in the chapter, introduce new viewing angles, and link the chapter material to topics in other chapters.

3. Emphasis on both the underlying theory and practical designs. The ability to cope with complexity requires both a deep knowledge of the theoretical underpinnings of parallel processing and examples of designs that help us understand the theory. Such designs also provide hints/ideas for synthesis as well as reference points for cost–performance comparisons. This viewpoint is reflected, e.g., in the coverage of problem-driven parallel machine designs (Chapter 8) that point to the origins of the butterfly and binary-tree architectures. Other examples are found in Chapter 16, where a variety of composite and hierarchical architectures are discussed and some fundamental cost–performance trade-offs in network design are exposed. Fifteen carefully chosen case studies in Chapters 21–23 provide additional insight and motivation for the theories discussed.

4. Linking parallel computing to other subfields of computer design. Parallel computing is nourished by, and in turn feeds, other subfields of computer architecture and technology. Examples of such links abound. In computer arithmetic, the design of high-speed adders and multipliers contributes to, and borrows many methods from, parallel processing. Some of the earliest parallel systems were designed by researchers in the field of fault-tolerant computing in order to allow independent multichannel computations and/or dynamic replacement of failed subsystems. These links are pointed out throughout the book.

5. Wide coverage of important topics. The current text covers virtually all important architectural and algorithmic topics in parallel processing, thus offering a balanced and complete view of the field. Coverage of the circuit model and problem-driven parallel machines (Chapters 7 and 8), some variants of mesh architectures (Chapter 12), composite and hierarchical systems (Chapter 16), which are becoming increasingly important for overcoming VLSI layout and packaging constraints, and the topics in Part V (Chapters 17–20) do not all appear in other textbooks. Similarly, other books that cover the foundations of parallel processing do not contain discussions on practical implementation issues and case studies of the type found in Part VI.

6. Unified and consistent notation/terminology throughout the text. I have tried very hard to use consistent notation/terminology throughout the text. For example, n always stands for the number of data elements (problem size) and p for the number of processors. While other authors have done this in the basic parts of their texts, there is a tendency to cover more advanced research topics by simply borrowing



the notation and terminology from the reference source. Such an approach has the advantage of making the transition between reading the text and the original reference source easier, but it is utterly confusing to the majority of the students who rely on the text and do not consult the original references except, perhaps, to write a research paper.

SUMMARY OF TOPICS

The six parts of this book, each composed of four chapters, have been written with the following goals:

• Part I sets the stage, gives a taste of what is to come, and provides the needed perspective, taxonomy, and analysis tools for the rest of the book.

• Part II delimits the models of parallel processing from above (the abstract PRAM model) and from below (the concrete circuit model), preparing the reader for everything else that falls in the middle.

• Part III presents the scalable, and conceptually simple, mesh model of parallel processing, which has become quite important in recent years, and also covers some of its derivatives.

• Part IV covers low-diameter parallel architectures and their algorithms, including the hypercube, hypercube derivatives, and a host of other interesting interconnection topologies.

• Part V includes broad (architecture-independent) topics that are relevant to a wide range of systems and form the stepping stones to effective and reliable parallel processing.

• Part VI deals with implementation aspects and properties of various classes of parallel processors, presenting many case studies and projecting a view of the past and future of the field.

POINTERS ON HOW TO USE THE BOOK

For classroom use, the topics in each chapter of this text can be covered in a lecture spanning 1–2 hours. In my own teaching, I have used the chapters primarily for 1-1/2-hour lectures, twice a week, in a 10-week quarter, omitting or combining some chapters to fit the material into 18–20 lectures. But the modular structure of the text lends itself to other lecture formats, self-study, or review of the field by practitioners. In the latter two cases, the readers can view each chapter as a study unit (for 1 week, say) rather than as a lecture. Ideally, all topics in each chapter should be covered before moving to the next chapter. However, if fewer lecture hours are available, then some of the subsections located at the end of chapters can be omitted or introduced only in terms of motivations and key results.

Problems of varying complexities, from straightforward numerical examples or exercises to more demanding studies or miniprojects, have been supplied for each chapter. These problems form an integral part of the book and have not been added as afterthoughts to make the book more attractive for use as a text. A total of 358 problems are included (13–16 per chapter). Assuming that two lectures are given per week, either weekly or biweekly homework can be assigned, with each assignment having the specific coverage of the respective half-part



The structure of this book in parts, half-parts, and chapters.

(two chapters) or full part (four chapters) as its “title.” In this format, the half-parts, shown above, provide a focus for the weekly lecture and/or homework schedule.

An instructor’s manual, with problem solutions and enlarged versions of the diagrams and tables, suitable for reproduction as transparencies, is planned. The author’s detailed syllabus for the course ECE 254B at UCSB is available at http://www.ece.ucsb.edu/courses/syllabi/ece254b.html.

References to important or state-of-the-art research contributions and designs are provided at the end of each chapter. These references provide good starting points for doing in-depth studies or for preparing term papers/projects.



New ideas in the field of parallel processing appear in papers presented at several annual conferences, known as FMPC, ICPP, IPPS, SPAA, SPDP (now merged with IPPS), and in archival journals such as IEEE Transactions on Computers [TCom], IEEE Transactions on Parallel and Distributed Systems [TPDS], Journal of Parallel and Distributed Computing [JPDC], Parallel Computing [ParC], and Parallel Processing Letters [PPL]. Tutorial and survey papers of wide scope appear in IEEE Concurrency [Conc] and, occasionally, in IEEE Computer [Comp]. The articles in IEEE Computer provide excellent starting points for research projects and term papers.

ACKNOWLEDGMENTS

The current text, Introduction to Parallel Processing: Algorithms and Architectures, is an outgrowth of lecture notes that the author has used for the graduate course “ECE 254B: Advanced Computer Architecture: Parallel Processing” at the University of California, Santa Barbara, and, in rudimentary forms, at several other institutions prior to 1988. The text has benefited greatly from keen observations, curiosity, and encouragement of my many students in these courses. A sincere thanks to all of them! Particular thanks go to Dr. Ding-Ming Kwai, who read an early version of the manuscript carefully and suggested numerous corrections and improvements.

GENERAL REFERENCES

[Akl89] Akl, S. G., The Design and Analysis of Parallel Algorithms, Prentice–Hall, 1989.
[Akl97] Akl, S. G., Parallel Computation: Models and Methods, Prentice–Hall, 1997.
[Alma94] Almasi, G. S., and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 2nd ed., 1994.
[Bert89] Bertsekas, D. P., and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice–Hall, 1989.
[Code93] Codenotti, B., and M. Leoncini, Introduction to Parallel Processing, Addison–Wesley, 1993.
[Comp] IEEE Computer, journal published by IEEE Computer Society; has occasional special issues on parallel/distributed processing (February 1982, June 1985, August 1986, June 1987, March 1988, August 1991, February 1992, November 1994, November 1995, December 1996).
[Conc] IEEE Concurrency, formerly IEEE Parallel and Distributed Technology, magazine published by IEEE Computer Society.
[Cric88] Crichlow, J. M., Introduction to Distributed and Parallel Computing, Prentice–Hall, 1988.
[DeCe89] DeCegama, A. L., Parallel Processing Architectures and VLSI Hardware, Prentice–Hall, 1989.
[Desr87] Desrochers, G. R., Principles of Parallel and Multiprocessing, McGraw-Hill, 1987.
[Duat97] Duato, J., S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society Press, 1997.
[Flyn95] Flynn, M. J., Computer Architecture: Pipelined and Parallel Processor Design, Jones and Bartlett, 1995.
[FMPC] Proc. Symp. Frontiers of Massively Parallel Computation, sponsored by IEEE Computer Society and NASA. Held every 1 1/2–2 years since 1986. The 6th FMPC was held in Annapolis, MD, October 27–31, 1996, and the 7th is planned for February 20–25, 1999.
[Foun94] Fountain, T. J., Parallel Computing: Principles and Practice, Cambridge University Press, 1994.
[Hock81] Hockney, R. W., and C. R. Jesshope, Parallel Computers, Adam Hilger, 1981.
[Hord90] Hord, R. M., Parallel Supercomputing in SIMD Architectures, CRC Press, 1990.
[Hord93] Hord, R. M., Parallel Supercomputing in MIMD Architectures, CRC Press, 1993.
[Hwan84] Hwang, K., and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, 1984.
[Hwan93] Hwang, K., Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.



[Hwan98] Hwang, K., and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998.
[ICPP] Proc. Int. Conference Parallel Processing, sponsored by The Ohio State University (and in recent years, also by the International Association for Computers and Communications). Held annually since 1972.*
[IPPS] Proc. Int. Parallel Processing Symp., sponsored by IEEE Computer Society. Held annually since 1987. The 11th IPPS was held in Geneva, Switzerland, April 1–5, 1997. Beginning with the 1998 symposium in Orlando, FL, March 30–April 3, IPPS was merged with SPDP.**
[JaJa92] JaJa, J., An Introduction to Parallel Algorithms, Addison–Wesley, 1992.
[JPDC] Journal of Parallel and Distributed Computing, journal published by Academic Press.
[Kris89] Krishnamurthy, E. V., Parallel Processing: Principles and Practice, Addison–Wesley, 1989.
[Kuma94] Kumar, V., A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.
[Laks90] Lakshmivarahan, S., and S. K. Dhall, Analysis and Design of Parallel Algorithms: Arithmetic and Matrix Problems, McGraw-Hill, 1990.
[Leig92] Leighton, F. T., Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992.
[Lerm94] Lerman, G., and L. Rudolph, Parallel Evolution of Parallel Processors, Plenum, 1994.
[Lipo87] Lipovski, G. J., and M. Malek, Parallel Computing: Theory and Comparisons, Wiley, 1987.
[Mold93] Moldovan, D. I., Parallel Processing: From Applications to Systems, Morgan Kaufmann, 1993.
[ParC] Parallel Computing, journal published by North-Holland.
[PPL] Parallel Processing Letters, journal published by World Scientific.
[Quin87] Quinn, M. J., Designing Efficient Algorithms for Parallel Computers, McGraw-Hill, 1987.
[Quin94] Quinn, M. J., Parallel Computing: Theory and Practice, McGraw-Hill, 1994.
[Reif93] Reif, J. H. (ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1993.
[Sanz89] Sanz, J. L. C. (ed.), Opportunities and Constraints of Parallel Computing (IBM/NSF Workshop, San Jose, CA, December 1988), Springer-Verlag, 1989.
[Shar87] Sharp, J. A., An Introduction to Distributed and Parallel Processing, Blackwell Scientific Publications, 1987.
[Sieg85] Siegel, H. J., Interconnection Networks for Large-Scale Parallel Processing, Lexington Books, 1985.
[SPAA] Proc. Symp. Parallel Algorithms and Architectures, sponsored by the Association for Computing Machinery (ACM). Held annually since 1989. The 10th SPAA was held in Puerto Vallarta, Mexico, June 28–July 2, 1998.
[SPDP] Proc. Int. Symp. Parallel and Distributed Systems, sponsored by IEEE Computer Society. Held annually since 1989, except for 1997. The 8th SPDP was held in New Orleans, LA, October 23–26, 1996. Beginning with the 1998 symposium in Orlando, FL, March 30–April 3, SPDP was merged with IPPS.
[Ston93] Stone, H. S., High-Performance Computer Architecture, Addison–Wesley, 1993.
[TCom] IEEE Trans. Computers, journal published by IEEE Computer Society; has occasional special issues on parallel and distributed processing (April 1987, December 1988, August 1989, December 1991, April 1997, April 1998).
[TPDS] IEEE Trans. Parallel and Distributed Systems, journal published by IEEE Computer Society.
[Varm94] Varma, A., and C. S. Raghavendra, Interconnection Networks for Multiprocessors and Multicomputers: Theory and Practice, IEEE Computer Society Press, 1994.
[Zoma96] Zomaya, A. Y. (ed.), Parallel and Distributed Computing Handbook, McGraw-Hill, 1996.

*The 27th ICPP was held in Minneapolis, MN, August 10–15, 1998, and the 28th is scheduled for September 21–24, 1999, in Aizu, Japan.

**The next joint IPPS/SPDP is scheduled for April 12–16, 1999, in San Juan, Puerto Rico.


Contents

Part I. Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . 1

1. Introduction to Parallelism . . . . . . . . . . . . . . . . . . . . . 3

1.1. Why Parallel Processing? . . . . . . . . . . . . . . . . . . . . . 5
1.2. A Motivating Example . . . . . . . . . . . . . . . . . . . . . . 8
1.3. Parallel Processing Ups and Downs . . . . . . . . . . . . . . . . 13
1.4. Types of Parallelism: A Taxonomy . . . . . . . . . . . . . . . . 15
1.5. Roadblocks to Parallel Processing . . . . . . . . . . . . . . . . 16
1.6. Effectiveness of Parallel Processing . . . . . . . . . . . . . . . 19
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 23

2. A Taste of Parallel Algorithms . . . . . . . . . . . . . . . . . . . 25

2.1. Some Simple Computations . . . . . . . . . . . . . . . . . . . . 27
2.2. Some Simple Architectures . . . . . . . . . . . . . . . . . . . . 28
2.3. Algorithms for a Linear Array . . . . . . . . . . . . . . . . . . 30
2.4. Algorithms for a Binary Tree . . . . . . . . . . . . . . . . . . . 34
2.5. Algorithms for a 2D Mesh . . . . . . . . . . . . . . . . . . . . 39
2.6. Algorithms with Shared Variables . . . . . . . . . . . . . . . . 40
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 43

3. Parallel Algorithm Complexity . . . . . . . . . . . . . . . . . . . 45

3.1. Asymptotic Complexity . . . . . . . . . . . . . . . . . . . . . . 47
3.2. Algorithm Optimality and Efficiency . . . . . . . . . . . . . . . 50
3.3. Complexity Classes . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4. Parallelizable Tasks and the NC Class . . . . . . . . . . . . . . 55
3.5. Parallel Programming Paradigms . . . . . . . . . . . . . . . . . 56
3.6. Solving Recurrences . . . . . . . . . . . . . . . . . . . . . . . 58



Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 63

4. Models of Parallel Processing . . . . . . . . . . . . . . . . . . . 65

4.1. Development of Early Models . . . . . . . . . . . . . . . . . . 67
4.2. SIMD versus MIMD Architectures . . . . . . . . . . . . . . . . 69
4.3. Global versus Distributed Memory . . . . . . . . . . . . . . . . 71
4.4. The PRAM Shared-Memory Model . . . . . . . . . . . . . . . . 74
4.5. Distributed-Memory or Graph Models . . . . . . . . . . . . . . 77
4.6. Circuit Model and Physical Realizations . . . . . . . . . . . . . 80
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 85

Part II. Extreme Models . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5. PRAM and Basic Algorithms . . . . . . . . . . . . . . . . . . . . 89

5.1. PRAM Submodels and Assumptions . . . . . . . . . . . . . . . 91
5.2. Data Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3. Semigroup or Fan-In Computation . . . . . . . . . . . . . . . . 96
5.4. Parallel Prefix Computation . . . . . . . . . . . . . . . . . . . 98
5.5. Ranking the Elements of a Linked List . . . . . . . . . . . . . . 99
5.6. Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . 102
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 108

6. More Shared-Memory Algorithms . . . . . . . . . . . . . . . . . 109

6.1. Sequential Rank-Based Selection . . . . . . . . . . . . . . . . . 111
6.2. A Parallel Selection Algorithm . . . . . . . . . . . . . . . . . . 113
6.3. A Selection-Based Sorting Algorithm . . . . . . . . . . . . . . . 114
6.4. Alternative Sorting Algorithms . . . . . . . . . . . . . . . . . . 117
6.5. Convex Hull of a 2D Point Set . . . . . . . . . . . . . . . . . . 118
6.6. Some Implementation Aspects . . . . . . . . . . . . . . . . . . 121
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 127

7. Sorting and Selection Networks . . . . . . . . . . . . . . . . . . 129

7.1. What Is a Sorting Network . . . . . . . . . . . . . . . . . . . . 131
7.2. Figures of Merit for Sorting Networks . . . . . . . . . . . . . . 133
7.3. Design of Sorting Networks . . . . . . . . . . . . . . . . . . . 135
7.4. Batcher Sorting Networks . . . . . . . . . . . . . . . . . . . . 136
7.5. Other Classes of Sorting Networks . . . . . . . . . . . . . . . . 141
7.6. Selection Networks . . . . . . . . . . . . . . . . . . . . . . . . 142
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 147



8. Other Circuit-Level Examples . . . . . . . . . . . . . . . . . . . 149

8.1. Searching and Dictionary Operations . . . . . . . . . . . . . . . 151
8.2. A Tree-Structured Dictionary Machine . . . . . . . . . . . . . . 152
8.3. Parallel Prefix Computation . . . . . . . . . . . . . . . . . . . 156
8.4. Parallel Prefix Networks . . . . . . . . . . . . . . . . . . . . . 157
8.5. The Discrete Fourier Transform . . . . . . . . . . . . . . . . . 161
8.6. Parallel Architectures for FFT . . . . . . . . . . . . . . . . . . 163
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 168

Part III. Mesh-Based Architectures . . . . . . . . . . . . . . . . . . . 169

9. Sorting on a 2D Mesh or Torus . . . . . . . . . . . . . . . . . . . 171

9.1. Mesh-Connected Computers . . . . . . . . . . . . . . . . . . . 173
9.2. The Shearsort Algorithm . . . . . . . . . . . . . . . . . . . . . 176
9.3. Variants of Simple Shearsort . . . . . . . . . . . . . . . . . . . 179
9.4. Recursive Sorting Algorithms . . . . . . . . . . . . . . . . . . 180
9.5. A Nontrivial Lower Bound . . . . . . . . . . . . . . . . . . . . 183
9.6. Achieving the Lower Bound . . . . . . . . . . . . . . . . . . . 186
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 190

10. Routing on a 2D Mesh or Torus . . . . . . . . . . . . . . . . . . 191

10.1. Types of Data Routing Operations . . . . . . . . . . . . . . . 193
10.2. Useful Elementary Operations . . . . . . . . . . . . . . . . . 195
10.3. Data Routing on a 2D Array . . . . . . . . . . . . . . . . . . 197
10.4. Greedy Routing Algorithms . . . . . . . . . . . . . . . . . . . 199
10.5. Other Classes of Routing Algorithms . . . . . . . . . . . . . . 202
10.6. Wormhole Routing . . . . . . . . . . . . . . . . . . . . . . . 204
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 210

11. Numerical 2D Mesh Algorithms . . . . . . . . . . . . . . . . . . 211

11.1. Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . 213

11.2. Triangular System of Equations . . . . . . . . . . . . . . . . . 215
11.3. Tridiagonal System of Linear Equations . . . . . . . . . . . . . 218
11.4. Arbitrary System of Linear Equations . . . . . . . . . . . . . . 221
11.5. Graph Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 225
11.6. Image-Processing Algorithms . . . . . . . . . . . . . . . . . . 228
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
References and Suggested Reading . . . . . . . . . . . . . . . . . . . 233

12. Other Mesh-Related Architectures . . . . . . . . . . . . . . . . . 235

12.1. Three or More Dimensions . . . . . . . . . . . . . . . . . . . . 237


12.2. Stronger and Weaker Connectivities . . . . . . . . . . . . . 240
12.3. Meshes Augmented with Nonlocal Links . . . . . . . . . . . 242
12.4. Meshes with Dynamic Links . . . . . . . . . . . . . . . . . 245
12.5. Pyramid and Multigrid Systems . . . . . . . . . . . . . . . . 246
12.6. Meshes of Trees . . . . . . . . . . . . . . . . . . . . . . . . 248
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
References and Suggested Reading . . . . . . . . . . . . . . . . . . 256

Part IV. Low-Diameter Architectures . . . . . . . . . . . . . . . . . . . . 257

13. Hypercubes and Their Algorithms . . . . . . . . . . . . . . . . 259

13.1. Definition and Main Properties . . . . . . . . . . . . . . . . 261
13.2. Embeddings and Their Usefulness . . . . . . . . . . . . . . 263
13.3. Embedding of Arrays and Trees . . . . . . . . . . . . . . . . 264
13.4. A Few Simple Algorithms . . . . . . . . . . . . . . . . . . . 269
13.5. Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . 272
13.6. Inverting a Lower Triangular Matrix . . . . . . . . . . . . . 274
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
References and Suggested Reading . . . . . . . . . . . . . . . . . . 278

14. Sorting and Routing on Hypercubes . . . . . . . . . . . . . . . 279

14.1. Defining the Sorting Problem . . . . . . . . . . . . . . . . . 281
14.2. Bitonic Sorting on a Hypercube . . . . . . . . . . . . . . . . 284
14.3. Routing Problems on a Hypercube . . . . . . . . . . . . . . 285
14.4. Dimension-Order Routing . . . . . . . . . . . . . . . . . . . 288
14.5. Broadcasting on a Hypercube . . . . . . . . . . . . . . . . . 292
14.6. Adaptive and Fault-Tolerant Routing . . . . . . . . . . . . . 294
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
References and Suggested Reading . . . . . . . . . . . . . . . . . . 298

15. Other Hypercubic Architectures . . . . . . . . . . . . . . . . . 301

15.1. Modified and Generalized Hypercubes . . . . . . . . . . . . 303
15.2. Butterfly and Permutation Networks . . . . . . . . . . . . . 305
15.3. Plus-or-Minus-2ⁱ Network . . . . . . . . . . . . . . . . . . . 309
15.4. The Cube-Connected Cycles Network . . . . . . . . . . . . . 310
15.5. Shuffle and Shuffle–Exchange Networks . . . . . . . . . . . 313
15.6. That’s Not All, Folks! . . . . . . . . . . . . . . . . . . . . . 316
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
References and Suggested Reading . . . . . . . . . . . . . . . . . . 320

16. A Sampler of Other Networks . . . . . . . . . . . . . . . . . . 321

16.1. Performance Parameters for Networks . . . . . . . . . . . . 323
16.2. Star and Pancake Networks . . . . . . . . . . . . . . . . . . 326
16.3. Ring-Based Networks . . . . . . . . . . . . . . . . . . . . . 329


16.4. Composite or Hybrid Networks . . . . . . . . . . . . . . . . 335
16.5. Hierarchical (Multilevel) Networks . . . . . . . . . . . . . . 337
16.6. Multistage Interconnection Networks . . . . . . . . . . . . . 338
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
References and Suggested Reading . . . . . . . . . . . . . . . . . . 343

Part V. Some Broad Topics . . . . . . . . . . . . . . . . . . . . . . . . . 345

17. Emulation and Scheduling . . . . . . . . . . . . . . . . . . . . . 347

17.1. Emulations among Architectures . . . . . . . . . . . . . . . 349
17.2. Distributed Shared Memory . . . . . . . . . . . . . . . . . . 351
17.3. The Task Scheduling Problem . . . . . . . . . . . . . . . . . 355
17.4. A Class of Scheduling Algorithms . . . . . . . . . . . . . . 357
17.5. Some Useful Bounds for Scheduling . . . . . . . . . . . . . 360
17.6. Load Balancing and Dataflow Systems . . . . . . . . . . . . 362
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
References and Suggested Reading . . . . . . . . . . . . . . . . . . 367

18. Data Storage, Input, and Output . . . . . . . . . . . . . . . . . . 369

18.1. Data Access Problems and Caching . . . . . . . . . . . . . . 371
18.2. Cache Coherence Protocols . . . . . . . . . . . . . . . . . . 374
18.3. Multithreading and Latency Hiding . . . . . . . . . . . . . . 377
18.4. Parallel I/O Technology . . . . . . . . . . . . . . . . . . . . 379
18.5. Redundant Disk Arrays . . . . . . . . . . . . . . . . . . . . 382
18.6. Interfaces and Standards . . . . . . . . . . . . . . . . . . . . 384
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
References and Suggested Reading . . . . . . . . . . . . . . . . . . 388

19. Reliable Parallel Processing . . . . . . . . . . . . . . . . . . . . 391

19.1. Defects, Faults, . . . , Failures . . . . . . . . . . . . . . . . 393
19.2. Defect-Level Methods . . . . . . . . . . . . . . . . . . . . . 396
19.3. Fault-Level Methods . . . . . . . . . . . . . . . . . . . . . . 399
19.4. Error-Level Methods . . . . . . . . . . . . . . . . . . . . . 402
19.5. Malfunction-Level Methods . . . . . . . . . . . . . . . . . . 404
19.6. Degradation-Level Methods . . . . . . . . . . . . . . . . . . 407
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
References and Suggested Reading . . . . . . . . . . . . . . . . . . 413

20. System and Software Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415

20.1. Coordination and Synchronization . . . . . . . . . . . . . . 417
20.2. Parallel Programming . . . . . . . . . . . . . . . . . . . . . 421
20.3. Software Portability and Standards . . . . . . . . . . . . . . 425
20.4. Parallel Operating Systems . . . . . . . . . . . . . . . . . . 427
20.5. Parallel File Systems . . . . . . . . . . . . . . . . . . . . . . 430


20.6. Hardware/Software Interaction . . . . . . . . . . . . . . . . 431
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
References and Suggested Reading . . . . . . . . . . . . . . . . . . 435

Part VI. Implementation Aspects . . . . . . . . . . . . . . . . . . . . . 437

21. Shared-Memory MIMD Machines . . . . . . . . . . . . . . . . 439

21.1. Variations in Shared Memory . . . . . . . . . . . . . . . . . 441
21.2. MIN-Based BBN Butterfly . . . . . . . . . . . . . . . . . . 444
21.3. Vector-Parallel Cray Y-MP . . . . . . . . . . . . . . . . . . 445
21.4. Latency-Tolerant Tera MTA . . . . . . . . . . . . . . . . . . 448
21.5. CC-NUMA Stanford DASH . . . . . . . . . . . . . . . . . . 450
21.6. SCI-Based Sequent NUMA-Q . . . . . . . . . . . . . . . . . 452
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
References and Suggested Reading . . . . . . . . . . . . . . . . . . 457

22. Message-Passing MIMD Machines . . . . . . . . . . . . . . . . . . . 459

22.1. Mechanisms for Message Passing . . . . . . . . . . . . . . . 461
22.2. Reliable Bus-Based Tandem NonStop . . . . . . . . . . . . . 464
22.3. Hypercube-Based nCUBE3 . . . . . . . . . . . . . . . . . . 466
22.4. Fat-Tree-Based Connection Machine 5 . . . . . . . . . . . . 469
22.5. Omega-Network-Based IBM SP2 . . . . . . . . . . . . . . . 471
22.6. Commodity-Driven Berkeley NOW . . . . . . . . . . . . . . 473
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
References and Suggested Reading . . . . . . . . . . . . . . . . . . 477

23. Data-Parallel SIMD Machines . . . . . . . . . . . . . . . . . . . 479

23.1. Where Have All the SIMDs Gone? . . . . . . . . . . . . . . 481
23.2. The First Supercomputer: ILLIAC IV . . . . . . . . . . . . . 484
23.3. Massively Parallel Goodyear MPP . . . . . . . . . . . . . . 485
23.4. Distributed Array Processor (DAP) . . . . . . . . . . . . . . 488
23.5. Hypercubic Connection Machine 2 . . . . . . . . . . . . . . 490
23.6. Multiconnected MasPar MP-2 . . . . . . . . . . . . . . . . . 492
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
References and Suggested Reading . . . . . . . . . . . . . . . . . . 497

24. Past, Present, and Future . . . . . . . . . . . . . . . . . . . . . 499

24.1. Milestones in Parallel Processing . . . . . . . . . . . . . . . 501
24.2. Current Status, Issues, and Debates . . . . . . . . . . . . . . 503
24.3. TFLOPS, PFLOPS, and Beyond . . . . . . . . . . . . . . . . 506
24.4. Processor and Memory Technologies . . . . . . . . . . . . . 508
24.5. Interconnection Technologies . . . . . . . . . . . . . . . . . 510


24.6. The Future of Parallel Processing . . . . . . . . . . . . . . . . . 513

Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
References and Suggested Reading . . . . . . . . . . . . . . . . . . 517

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519


Part I. Fundamental Concepts

The field of parallel processing is concerned with architectural and algorithmic methods for enhancing the performance or other attributes (e.g., cost-effectiveness, reliability) of digital computers through various forms of concurrency. Even though concurrent computation has been around since the early days of digital computers, only recently has it been applied in a manner, and on a scale, that leads to better performance, or greater cost-effectiveness, compared with vector supercomputers. Like any other field of science/technology, the study of parallel architectures and algorithms requires motivation, a big picture showing the relationships between problems and the various approaches to solving them, and models for comparing, connecting, and evaluating new ideas. This part, which motivates us to study parallel processing, paints the big picture, and provides some needed background, is composed of four chapters:

• Chapter 1: Introduction to Parallelism
• Chapter 2: A Taste of Parallel Algorithms
• Chapter 3: Parallel Algorithm Complexity
• Chapter 4: Models of Parallel Processing
