
TERM PAPER

ON COMPUTER ORGANIZATION AND ARCHITECTURE

TOPIC: PROCESSORS WITH PARALLEL ARCHITECTURE

SUBMITTED TO: Mr. Kiran Kumar Kaki

SUBMITTED BY:

Nancy Goyal

ROLL NO: B44

REG NO: 11011419

SECTION: K2003


ACKNOWLEDGEMENT

I take this opportunity to express my gratitude towards my teacher and guide, MR. KIRAN KUMAR KAKI, who has helped me throughout this term paper. He has guided me and helped me clear up all my problems and doubts regarding my topic.

I am indebted to my seniors and friends who have assisted me and lent a helping hand in every aspect of this term paper. They have helped me in studying this topic and hence in preparing this term paper.

I am thankful to all who have helped me with this term paper.

NANCY GOYAL


CONTENTS

INTRODUCTION
CLASSIFICATION OF PARALLEL ARCHITECTURE
NEED OF PARALLEL PROCESSING
TYPES OF PARALLELISM
    BIT-LEVEL PARALLELISM
    INSTRUCTION-LEVEL PARALLELISM
    DATA PARALLELISM
    TASK PARALLELISM
HARDWARE
    SISD
    SIMD
    MISD
    MIMD
SHARED MEMORY ORGANISATION
MESSAGE PASSING ORGANISATION
INTERCONNECTION NETWORKS
    MODE OF OPERATION
    CONTROL STRATEGY
    SWITCHING TECHNIQUES
    TOPOLOGY
CLASSES OF PARALLEL COMPUTERS
    MULTI-CORE COMPUTING
    SYMMETRIC MULTIPROCESSORS
    DISTRIBUTED COMPUTING
    CLUSTER COMPUTING
    MASSIVELY PARALLEL PROCESSING
    GRID COMPUTING
SPECIALIZED PARALLEL COMPUTERS
    RECONFIGURABLE COMPUTING WITH FIELD-PROGRAMMABLE GATE ARRAYS
    GENERAL-PURPOSE COMPUTING ON GRAPHICS PROCESSING UNITS
    VECTOR PROCESSORS
APPLICATIONS OF PARALLEL PROCESSING
FUTURE OF PARALLEL PROCESSING
REFERENCES


ABSTRACT

This term paper covers the very basics of parallel processing. It begins with a brief overview of the concepts and terminology associated with parallel computing. Then the main topics of parallel processing are explored, including the need for parallel processing, types of parallelism, hardware, applications, and future scope. In the computational field, the technique of solving computational tasks by using multiple resources of different types simultaneously is called parallel processing. It breaks a large problem down into smaller ones, which are solved concurrently. It saves time and money, as the parallel use of more resources shortens task completion time, with potential cost savings, and parallel clusters can be constructed from cheap components. Many applications require more computing power than a traditional sequential computer can offer. Parallel processing provides a cost-effective solution to this problem by increasing the number of CPUs in a computer and by adding efficient communication between them.


INTRODUCTION

Computer architects have always strived to increase the performance of their computer architectures. High performance may come from fast dense circuitry, packaging technology, and parallelism. Parallel processors are computer systems consisting of multiple processing units connected via some interconnection network, plus the software needed to make the processing units work together.

Processing of multiple tasks simultaneously on multiple processors is called PARALLEL PROCESSING. A parallel program consists of multiple active processes simultaneously solving a given problem. A given task is divided into multiple subtasks using a divide-and-conquer technique, and each of them is processed on a different CPU. Programming on a multiprocessor system using the divide-and-conquer technique is called parallel programming. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately. Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors for accelerating specific tasks.

CLASSIFICATION OF PARALLEL ARCHITECTURE

Parallel architecture can be classified into two categories:

1. Data-parallel architecture
2. Function-parallel architecture

Data parallel architecture is further classified into four categories:

1. Vector architecture
2. Associative and neural architecture
3. SIMDs
4. Systolic architecture

Function parallel architecture is further classified into three categories:

1. Instruction-level parallel architecture (ILPs)
2. Thread-level parallel architecture
3. Process-level parallel architecture (MIMDs)

Instruction level parallel architecture is sub-divided into three categories:


Pipelined processors
VLIWs
Superscalar processors

Process level parallel architecture is sub-divided into two categories:

Distributed memory MIMD
Shared memory MIMD

NEED OF PARALLEL PROCESSING

The development of parallel processing is being influenced by many factors:-

Hardware improvements like pipelining and superscalar execution are not scaling well and require sophisticated compiler technology. Developing such compiler technology is a difficult task.

Sequential architectures are reaching physical limitations, as they are constrained by the speed of light; hence, an alternative way to get high computational speed is to connect multiple CPUs.

The computational requirements are ever increasing, both in the area of scientific and business computing. The technical computing problems which require high-speed computational power are related to life sciences, aerospace, geographical information systems, mechanical design, etc.

Significant development in networking technology is paving a way for network-based cost-effective parallel computing.

The parallel processing technology is now mature and is being exploited commercially.

All computers (including desktops and laptops) are now based on parallel processing (e.g., multi-core) architecture.

Parallel processing is the processing of program instructions by dividing them among multiple processors, with the objective of running a program in less time. In the earliest computers, only one program ran at a time. A computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs together. The computer would start an I/O operation, and while it was waiting for the operation to complete, it would execute the processor-intensive program. The total execution time for the two jobs would be a little over one hour.


The next improvement was multiprogramming. In a multiprogramming system, multiple programs submitted by users were each allowed to use the processor for a short time. To users it appeared that all of the programs were executing at the same time. Problems of resource contention first arose in these systems. Explicit requests for resources led to the problem of deadlock. Competition for resources on machines with no tie-breaking instructions led to the critical section routine.

The next step in parallel processing was the introduction of multiprocessing. In these systems, two or more processors shared the work to be done. The earliest versions had a master/slave configuration. One processor (the master) was programmed to be responsible for all of the work in the system; the other (the slave) performed only those tasks it was assigned by the master. This arrangement was necessary because it was not then understood how to program the machines so they could cooperate in managing the resources of the system.

TYPES OF PARALLELISM

Levels of parallelism are decided based on the lumps of code that can be potential candidates for parallelism.

Bit-level parallelism

From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling computer word size—the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction.
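As a rough illustration of the paragraph above, the following C sketch simulates how an 8-bit processor must add two 16-bit integers: one ordinary addition of the low-order bytes, followed by an add-with-carry on the high-order bytes (the variable names are illustrative, not taken from any particular instruction set).

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t x = 0x1234, y = 0x0FCD;

    /* Split each 16-bit operand into low and high bytes. */
    uint8_t x_lo = x & 0xFF, x_hi = x >> 8;
    uint8_t y_lo = y & 0xFF, y_hi = y >> 8;

    /* First instruction: add the low-order bytes (standard ADD). */
    uint16_t lo_sum = (uint16_t)x_lo + y_lo;
    uint8_t  carry  = lo_sum >> 8;                 /* carry out of the low-order addition */

    /* Second instruction: add the high-order bytes plus the carry (ADD-with-carry). */
    uint8_t hi_sum = (uint8_t)(x_hi + y_hi + carry);

    uint16_t result = ((uint16_t)hi_sum << 8) | (uint8_t)lo_sum;
    printf("0x%04X + 0x%04X = 0x%04X\n",
           (unsigned)x, (unsigned)y, (unsigned)result);   /* 0x1234 + 0x0FCD = 0x2201 */
    return 0;
}
```

A 16-bit processor performs the same addition with a single instruction, which is exactly the speed-up that widening the word size buys.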

Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which has been a standard in general-purpose computing for two decades. Not until recently with the advent of x86-64 architectures, have 64-bit processors become commonplace.

Instruction-level parallelism

A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction-level parallelism.
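A minimal C sketch of the idea: the first two statements below do not depend on each other, so the hardware may execute them in parallel or in either order without changing the result, while the third depends on both and must wait.

```c
/* Instruction-level parallelism: e and f can be computed in the same cycle. */
void ilp_example(int a, int b, int c, int d, int *out) {
    int e = a + b;   /* independent of f */
    int f = c + d;   /* independent of e: may issue in parallel with it */
    *out = e * f;    /* depends on both e and f: must issue after they complete */
}
```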


Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium 4 processor had a 35-stage pipeline.

Data parallelism

Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure." Many scientific and engineering applications exhibit data parallelism.

A loop-carried dependency is the dependence of a loop iteration on the output of one or more previous iterations. Loop-carried dependencies prevent the parallelization of loops.
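A small C sketch of both cases, assuming an OpenMP-capable compiler (the pragma is only a hint and the function names are illustrative): the first loop is data parallel because its iterations are independent, while the second has a loop-carried dependency and cannot be parallelized as written.

```c
/* Data-parallel loop: each iteration touches a different element and depends
   on no other iteration, so iterations can be divided among processors. */
void scale(double *a, const double *b, long n, double k) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] = k * b[i];          /* iterations are independent */
}

/* Loop-carried dependency: iteration i needs the result of iteration i-1. */
double running_sum(const double *b, long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += b[i];                /* each iteration depends on the previous value of s */
    return s;
}
```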

Task parallelism

Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.
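A minimal sketch of task parallelism using POSIX threads (assuming a pthreads environment; the task functions are illustrative): two entirely different calculations, a sum and a maximum, run concurrently on the same data set.

```c
#include <pthread.h>
#include <stdio.h>

static const int data[6] = {4, 8, 15, 16, 23, 42};   /* the shared data set */

static void *sum_task(void *arg) {                    /* task 1: compute the sum */
    (void)arg;
    static long sum = 0;
    for (int i = 0; i < 6; i++) sum += data[i];
    return &sum;
}

static void *max_task(void *arg) {                    /* task 2: find the maximum */
    (void)arg;
    static int max = 0;
    for (int i = 0; i < 6; i++) if (data[i] > max) max = data[i];
    return &max;
}

int main(void) {
    pthread_t t1, t2;
    void *sum_res, *max_res;
    pthread_create(&t1, NULL, sum_task, NULL);        /* both tasks run concurrently */
    pthread_create(&t2, NULL, max_task, NULL);
    pthread_join(t1, &sum_res);
    pthread_join(t2, &max_res);
    printf("sum = %ld, max = %d\n", *(long *)sum_res, *(int *)max_res);
    return 0;
}
```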

HARDWARE

The core elements of parallel processing are CPUs. Based on the number of instruction and data streams that can be processed simultaneously, computer systems are classified into the following four categories:

1) Single Instruction Single Data (SISD)

2) Single Instruction Multiple Data (SIMD)

3) Multiple Instruction Single Data (MISD)

4) Multiple Instruction Multiple Data (MIMD)

Single Instruction Single Data (SISD)

A SISD system is a uniprocessor machine capable of executing a single instruction operating on a single data stream. In a SISD machine, instructions are processed sequentially, and hence computers adopting this model are popularly called sequential computers. Most conventional computers are built using the SISD model. All the instructions and data to be processed have to be stored in primary memory.

Example: Older generation mainframes, minicomputers and workstations; most modern day PCs

Single Instruction Multiple Data (SIMD)

It is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data-level parallelism.


Examples: Processor Arrays: Connection Machine CM-2, MasPar MP-1 & MP-2, ILLIAC IV
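A small sketch of the SIMD idea, assuming a compiler that supports the GCC/Clang vector extensions: a single expression applies the same addition to four data elements at once, which is what a SIMD processing element does in hardware.

```c
/* Four 32-bit integers packed into one 128-bit vector. */
typedef int v4si __attribute__((vector_size(16)));

v4si add4(v4si a, v4si b) {
    return a + b;   /* one operation: four additions performed in lockstep */
}
```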

Multiple Instruction Single Data (MISD)

It is a type of parallel computing architecture where many functional units perform different operations on the same data. Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline.

Multiple Instruction Multiple Data (MIMD)


It is a technique employed to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD architectures may be used in a number of application areas such as CAD/CAM, simulation, modeling, and communication switches.

SHARED MEMORY ORGANISATION

A shared memory model is one in which processors communicate by reading and writing locations in a shared memory that is equally accessible by all processors. Each processor may have registers, buffers, caches and local memory banks as additional memory resources. A number of basic issues in the design of shared memory systems have to be taken into consideration. These include access control, synchronization, protection and security.

Access control determines which process accesses are possible to which resources. Access control models make the required check for every request issued by the processors to the shared memory, against the contents of the access control table.

Synchronization constraints limit the time of accesses from sharing processes to shared resources. Appropriate synchronization ensures that the information flows properly and ensures system functionality.
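A minimal sketch of such synchronization, assuming POSIX threads: two processes (here threads) update a location in shared memory, and a mutex serializes their accesses so that no update is lost.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                               /* shared memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                     /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);                   /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);                /* always 200000 with the mutex */
    return 0;
}
```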


Protection is a system feature that prevents processes from making arbitrary access to resources belonging to other processes. Sharing and protection are incompatible: sharing allows access, whereas protection restricts it.

MESSAGE PASSING ORGANISATION

Message passing systems are a class of multiprocessors in which each processor has access to its own local memory. Unlike shared memory systems, communication in message passing systems is performed via send and receive operations. A node in such a system consists of a processor and its local memory. Nodes are typically able to store messages in buffers and perform send/receive operations at the same time as processing. Simultaneous message processing and problem calculating are handled by the underlying operating system. Processors do not share a global memory, and each processor has access to its own address space. The processing units of a message passing system may be connected in a variety of ways, ranging from architecture-specific interconnection structures to geographically dispersed networks. The message passing approach is, in principle, scalable to large proportions. By scalable it is meant that the number of processors can be increased without significant decrease in efficiency of operation.

Message passing multiprocessors employ a variety of static networks for local communication.
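A minimal message passing sketch, assuming an MPI implementation is available (MPI is only one example of such a system): process 0 sends a value to process 1 with explicit send and receive operations; the two processes share no memory.

```c
#include <mpi.h>
#include <stdio.h>

/* Run with at least two processes, e.g.: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send to node 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                  /* receive from node 0 */
        printf("node 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```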

CLASSES OF PARALLEL COMPUTERS

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common.

MULTI-CORE COMPUTING

A multi-core processor is a processor that includes multiple execution units ("cores") on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); in contrast, a multi-core processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multi-core processor can potentially be superscalar as well—that is, on every cycle, each core can issue multiple instructions from one instruction stream.


SYMMETRIC MULTIPROCESSORS

A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. "Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that a sufficient amount of memory bandwidth exists."

DISTRIBUTED COMPUTING

A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.

CLUSTER COMPUTING

A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network.


MASSIVELY PARALLEL PROCESSING

A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having "far more" than 100 processors. In an MPP, "each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect."

GRID COMPUTING

Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem. Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Many grid computing applications have been created, of which SETI@home and Folding@Home are the best-known examples.

Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The most common grid computing middleware is the Berkeley Open Infrastructure for Network Computing (BOINC). Often, grid computing software makes use of "spare cycles", performing computations at times when a computer is idling.

INTERCONNECTION NETWORKS

Multiprocessor interconnection networks (INs) can be classified based on a number of criteria. These include:

MODE OF OPERATION


According to the mode of operation, INs are classified as synchronous versus asynchronous. In synchronous mode of operation, a single global clock is used by all components in the system such that the whole system is operating in a lock–step manner. Asynchronous mode of operation, on the other hand, does not require a global clock. Handshaking signals are used instead in order to coordinate the operation of asynchronous systems. While synchronous systems tend to be slower compared to asynchronous systems, they are race and hazard-free.

CONTROL STRATEGY

According to the control strategy, INs can be classified as centralized versus decentralized. In centralized control systems, a single central control unit is used to oversee and control the operation of the components of the system. In decentralized control, the control function is distributed among different components in the system.

SWITCHING TECHNIQUES

Interconnection networks can be classified according to the switching mechanism as circuit versus packet switching networks. In the circuit switching mechanism, a complete path has to be established prior to the start of communication between a source and a destination. The established path will remain in existence during the whole communication period. In a packet switching mechanism, communication between a source and destination takes place via messages that are divided into smaller entities, called packets. On their way to the destination, packets can be sent from one node to another in a store-and-forward manner until they reach their destination.

TOPOLOGY

An interconnection network topology is a mapping function from the set of processors and memories onto the same set of processors and memories. In other words, the topology describes how to connect processors and memories to other processors and memories. A fully connected topology, for example, is a mapping in which each processor is connected to all other processors in the computer. A ring topology is a mapping that connects processor k to its neighbors, processors (k - 1) and (k + 1).
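A small C sketch of the ring mapping just described (the function name is illustrative): given processor k out of p processors, it returns the two neighbors (k - 1) and (k + 1), wrapping around at the ends.

```c
/* Ring topology: processor k is connected to processors (k - 1) and (k + 1). */
void ring_neighbours(int k, int p, int *left, int *right) {
    *left  = (k - 1 + p) % p;   /* neighbour (k - 1), modulo the number of processors p */
    *right = (k + 1) % p;       /* neighbour (k + 1) */
}
```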

SPECIALIZED PARALLEL COMPUTERS

Within parallel computing, there are specialized parallel devices that remain niche areas of interest. While not domain-specific, they tend to be applicable to only a few classes of parallel problems.

Reconfigurable computing with field-programmable gate arrays

Reconfigurable computing is the use of a field-programmable gate array (FPGA) as a co-processor to a general-purpose computer. An FPGA is, in essence, a computer chip that can rewire itself for a given task.

FPGAs can be programmed with hardware description languages such as VHDL or Verilog. However, programming in these languages can be tedious. Several vendors have created C to HDL languages that attempt to emulate the syntax and/or semantics of the C programming language, with which most programmers are familiar.

General-purpose computing on graphics processing units (GPGPU)

General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data parallel operations—particularly linear algebra matrix operations.

In the early days, GPGPU programs used the normal graphics APIs for executing programs. However, several new programming languages and platforms have been built to do general-purpose computation on GPUs, with both Nvidia and AMD releasing programming environments with CUDA and CTM respectively.

Vector processors

A vector processor is a CPU or computer system that can execute the same instruction on large sets of data. "Vector processors have high-level operations that work on linear arrays of numbers or vectors. An example vector operation is A = B × C, where A, B, and C are each 64-element vectors of 64-bit floating-point numbers." They are closely related to Flynn's SIMD classification.

APPLICATIONS OF PARALLEL PROCESSING

This decomposition technique is used in applications requiring the processing of large amounts of data in sophisticated ways. For example:

1. Databases and data mining.

2. Networked video and multimedia technologies.

3. Medical imaging and diagnosis.

4. Advanced graphics and virtual reality.

5. Collaborative work environments.

6. Frameworks - Dataflow frameworks provide the highest performance and simplest method for expressing record-processing applications so that they are able to achieve high scalability and total throughput.

7. RDBMS - As the most common repositories for commercial record-oriented data, RDBMS systems have evolved so that the Structured Query Language (SQL) used to access them is executed in parallel. The nature of the SQL language lends itself to faster processing using parallel techniques.

8. Parallelizing Compilers - For technical and mathematical applications dominated by matrix algebra, there are compilers that can create parallel execution from seemingly sequential program source code. These compilers can decompose a program and insert the necessary message passing structures and other parallel constructs automatically.

FUTURE OF PARALLEL PROCESSING

Parallel processing is expected to lead to other major changes in the industry. Major companies like Intel Corp. and Advanced Micro Devices Inc. have already integrated four processors in a single chip. What is needed now are breakthroughs in technology, and the race for results in parallel computing is in full swing. Another great challenge is to write software that divides the work of a program into chunks for multiple processors. This may only be possible with new programming languages, which would revolutionize every piece of software written. Parallel computing may change the way computers work in the future and how we use them for work and play.


REFERENCES

COMPUTER ARCHITECTURE: PIPELINED AND PARALLEL PROCESSOR DESIGN
WRITTEN BY: MICHAEL J. FLYNN
PUBLICATION: JONES AND BARTLETT PUBLISHERS INTERNATIONAL

COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
WRITTEN BY: BHARAT BHUSHAN AGRAWAL AND SUMIT PRAKASH TAYAL
PUBLICATION: UNIVERSITY SCIENCE PRESS

ADVANCED COMPUTER ARCHITECTURE AND PARALLEL PROCESSING
WRITTEN BY: HESHAM EL-REWINI AND MOSTAFA ABD-EL-BARR

www.cse.iitd.ernet.in/.../Lect01.LecJan09_2006.Introduction.ppt
www.intel.com/ParallelProgramming
en.wikipedia.org/wiki/Parallel_computing
http://www.wifinotes.com/computer-networks/what-is-parallel-computing.html