comp.art.ppt.ppt

61
Computer Architecture and Organization ECEG-3143 Chapter 1 General Introduction

Transcript of comp.art.ppt.ppt

  • Computer Architecture and Organization ECEG-3143Chapter 1 General Introduction

  • Architecture & Organization 1Architecture is those attributes visible to the programmerInstruction set, number of bits used for data representation, I/O mechanisms, addressing techniques.e.g. Is there a multiply instruction?Organization is how features are implementedControl signals, interfaces, memory technology.e.g. Is there a hardware multiply unit or is it done by repeated addition?

  • Architecture & Organization 2All Intel x86 family share the same basic architectureThe IBM System/370 family share the same basic architecture

    This gives code compatibilityAt least backwardsOrganization differs between different versions

  • Structure & FunctionStructure is the way in which components relate to each otherFunction is the operation of individual components as part of the structure

  • FunctionAll computer functions are:Data processingData storageData movementControl

  • Functional View

  • Operations (a) Data movement

  • Operations (b) Storage

  • Operation (c) Processing from/to storage

  • Operation (d) Processing from storage to I/O

  • Structure - Top LevelComputerMain MemoryInputOutputSystemsInterconnectionPeripheralsCommunicationlinesCentralProcessing UnitComputer

  • Structural components of computerCentral processing unit (CPU): Controls the operation of the computer and performs its data processing functionsMain memory: Stores dataI/O: Moves data between the computer and its external environmentSystem interconnection: Some mechanism that provides for communication among CPU, main memory, and I/O

  • Structure - The CPUComputerArithmeticand Login UnitControlUnitInternal CPUInterconnectionRegistersCPUI/OMemorySystemBusCPU

  • Structure of the CPUControl unit: Controls the operation of the CPU and hence the computer Arithmetic and logic unit (ALU): Performs the computers data processing functions Registers: Provides storage internal to the CPUCPU interconnection: Some mechanism that provides for communication among the control unit, ALU, and registers

  • Structure - The Control UnitCPUControlMemoryControl Unit Registers and DecodersSequencingLogicControlUnitALURegistersInternalBusControl Unit

  • Computer Evolution and PerformanceThe evolution of computers has been characterized by: increasing processor speed decreasing component size increasing memory size, and Increasing I/O capacity and speed.

  • ENIAC - backgroundElectronic Numerical Integrator And ComputerEckert and MauchlyUniversity of PennsylvaniaTrajectory tables for weapons Started 1943Finished 1946Too late for war effortUsed until 1955

  • ENIAC - detailsDecimal (not binary)20 accumulators of 10 digitsProgrammed manually by switches18,000 vacuum tubes30 tons1500 square feet140 kW power consumption5,000 additions per second

  • von Neumann/TuringStored Program conceptMain memory storing programs and dataALU operating on binary dataControl unit interpreting instructions from memory and executingInput and output equipment operated by control unitPrinceton Institute for Advanced Studies IASCompleted 1952

  • von Neumann/TuringStored Program conceptMain memory storing programs and dataALU operating on binary dataControl unit interpreting instructions from memory and executingInput and output equipment operated by control unitPrinceton Institute for Advanced Studies IASCompleted 1952

  • Structure of von Neumann machine

  • IAS - details1000 x 40 bit wordsBinary number2 x 20 bit instructionsSet of registers (storage in CPU)Memory Buffer RegisterMemory Address RegisterInstruction RegisterInstruction Buffer RegisterProgram CounterAccumulatorMultiplier Quotient

  • Structure of IAS detail

  • Commercial Computers1947 - Eckert-Mauchly Computer CorporationUNIVAC I (Universal Automatic Computer)US Bureau of Census 1950 calculationsBecame part of Sperry-Rand CorporationLate 1950s - UNIVAC IIFasterMore memory

  • IBMPunched-card processing equipment1953 - the 701IBMs first stored program computerScientific calculations1955 - the 702Business applicationsLead to 700/7000 series

  • TransistorsReplaced vacuum tubesSmallerCheaperLess heat dissipationSolid State deviceMade from Silicon (Sand)Invented 1947 at Bell LabsWilliam Shockley et al.

  • Transistor Based ComputersSecond generation machinesNCR & RCA produced small transistor machinesIBM 7000DEC - 1957Produced PDP-1

  • MicroelectronicsLiterally - small electronicsA computer is made up of gates, memory cells and interconnectionsThese can be manufactured on a semiconductore.g. silicon wafer

  • Generations of ComputerVacuum tube - 1946-1957Transistor - 1958-1964Small scale integration - 1965 onUp to 100 devices on a chipMedium scale integration - to 1971100-3,000 devices on a chipLarge scale integration - 1971-19773,000 - 100,000 devices on a chipVery large scale integration - 1978 -1991100,000 - 100,000,000 devices on a chipUltra large scale integration 1991 -Over 100,000,000 devices on a chip

  • Moores LawIncreased density of components on chipGordon Moore co-founder of IntelNumber of transistors on a chip will double every yearSince 1970s development has slowed a littleNumber of transistors doubles every 18 monthsCost of a chip has remained almost unchangedHigher packing density means shorter electrical paths, giving higher performanceSmaller size gives increased flexibilityReduced power and cooling requirementsFewer interconnections increases reliability

  • Growth in CPU Transistor Count

  • IBM 360 series1964Replaced (& not compatible with) 7000 seriesFirst planned family of computersSimilar or identical instruction setsSimilar or identical O/SIncreasing speedIncreasing number of I/O ports (i.e. more terminals)Increased memory size Increased costMultiplexed switch structure

  • DEC PDP-81964First minicomputer (after miniskirt!)Did not need air conditioned roomSmall enough to sit on a lab bench$16,000 $100k+ for IBM 360Embedded applications & OEMBUS STRUCTURE

  • DEC - PDP-8 Bus Structure

  • Semiconductor Memory1970FairchildSize of a single corei.e. 1 bit of magnetic core storageHolds 256 bitsNon-destructive readMuch faster than coreCapacity approximately doubles each year

  • Intel1971 - 4004 First microprocessorAll CPU components on a single chip4 bitFollowed in 1972 by 80088 bitBoth designed for specific applications1974 - 8080Intels first general purpose microprocessor

  • Speeding it upPipeliningOn board cacheOn board L1 & L2 cacheBranch predictionData flow analysisSpeculative execution

  • Performance BalanceProcessor speed increasedMemory capacity increasedMemory speed lags behind processor speed

  • Logic and Memory Performance Gap

  • SolutionsIncrease number of bits retrieved at one timeMake DRAM wider rather than deeperChange DRAM interfaceCacheReduce frequency of memory accessMore complex cache and cache on chipIncrease interconnection bandwidthHigh speed busesHierarchy of buses

  • I/O DevicesPeripherals with intensive I/O demandsLarge data throughput demandsProcessors can handle this Problem moving data Solutions:CachingBufferingHigher-speed interconnection busesMore elaborate bus structuresMultiple-processor configurations

  • Typical I/O Device Data Rates

  • Key is BalanceProcessor componentsMain memoryI/O devicesInterconnection structures

  • Improvements in Chip Organization and ArchitectureIncrease hardware speed of processorFundamentally due to shrinking logic gate sizeMore gates, packed more tightly, increasing clock ratePropagation time for signals reducedIncrease size and speed of cachesDedicating part of processor chip Cache access times drop significantlyChange processor organization and architectureIncrease effective speed of executionParallelism

  • Problems with Clock Speed and Logic DensityPowerPower density increases with density of logic and clock speedDissipating heatRC delaySpeed at which electrons flow limited by resistance and capacitance of metal wires connecting themDelay increases as RC product increasesWire interconnects thinner, increasing resistanceWires closer together, increasing capacitanceMemory latencyMemory speeds lag processor speedsSolution:More emphasis on organizational and architectural approaches

  • Intel Microprocessor Performance

  • Increased Cache CapacityTypically two or three levels of cache between processor and main memoryChip density increasedMore cache memory on chipFaster cache accessPentium chip devoted about 10% of chip area to cachePentium 4 devotes about 50%

  • More Complex Execution LogicEnable parallel execution of instructionsPipeline works like assembly lineDifferent stages of execution of different instructions at same time along pipelineSuperscalar allows multiple pipelines within single processorInstructions that do not depend on one another can be executed in parallel

  • Diminishing ReturnsInternal organization of processors complexCan get a great deal of parallelismFurther significant increases likely to be relatively modestBenefits from cache are reaching limitIncreasing clock rate runs into power dissipation problem Some fundamental physical limits are being reached

  • New Approach Multiple CoresMultiple processors on single chipLarge shared cacheWithin a processor, increase in performance proportional to square root of increase in complexityIf software can use multiple processors, doubling number of processors almost doubles performanceSo, use two simpler processors on the chip rather than one more complex processorWith two processors, larger caches are justifiedPower consumption of memory logic less than processing logic

  • x86 Evolution (1)8080first general purpose microprocessor8 bit data pathUsed in first personal computer Altair8086 5MHz 29,000 transistorsmuch more powerful16 bitinstruction cache, prefetch few instructions8088 (8 bit external bus) used in first IBM PC8028616 Mbyte memory addressableup from 1Mb8038632 bitSupport for multitasking80486sophisticated powerful cache and instruction pipeliningbuilt in maths co-processor

  • x86 Evolution (2)PentiumSuperscalarMultiple instructions executed in parallelPentium ProIncreased superscalar organizationAggressive register renamingbranch predictiondata flow analysisspeculative executionPentium IIMMX technologygraphics, video & audio processingPentium IIIAdditional floating point instructions for 3D graphics

  • x86 Evolution (3)Pentium 4Note Arabic rather than Roman numeralsFurther floating point and multimedia enhancementsCoreFirst x86 with dual coreCore 264 bit architectureCore 2 Quad 3GHz 820 million transistorsFour processors on chip

    x86 architecture dominant outside embedded systemsOrganization and technology changed dramaticallyInstruction set architecture evolved with backwards compatibility~1 instruction per month added500 instructions availableSee Intel web pages for detailed information on processors

  • Performance Assessment Clock SpeedKey parametersPerformance, cost, size, security, reliability, power consumptionSystem clock speedIn Hz or multiples ofClock rate, clock cycle, clock tick, cycle timeSignals in CPU take time to settle down to 1 or 0Signals may change at different speedsOperations need to be synchronisedInstruction execution in discrete stepsFetch, decode, load and store, arithmetic or logicalUsually require multiple clock cycles per instructionPipelining gives simultaneous execution of instructionsSo, clock speed is not the whole story

  • System Clock

  • Instruction Execution RateMillions of instructions per second (MIPS)Millions of floating point instructions per second (MFLOPS)Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy

  • BenchmarksPrograms designed to test performanceWritten in high level language Portable Represents style of taskSystems, numerical, commercialEasily measuredWidely distributedE.g. System Performance Evaluation Corporation (SPEC)CPU2006 for computation bound17 floating point programs in C, C++, Fortran12 integer programs in C, C++3 million lines of codeSpeed and rate metricsSingle task and throughput

  • SPEC Speed MetricSingle taskBase runtime defined for each benchmark using reference machineResults are reported as ratio of reference time to system run timeTrefi execution time for benchmark i on reference machineTsuti execution time of benchmark i on test system

    Overall performance calculated by averaging ratios for all 12 integer benchmarksUse geometric meanAppropriate for normalized numbers such as ratios

  • SPEC Rate MetricMeasures throughput or rate of a machine carrying out a number of tasksMultiple copies of benchmarks run simultaneouslyTypically, same as number of processorsRatio is calculated as follows:Trefi reference execution time for benchmark iN number of copies run simultaneouslyTsuti elapsed time from start of execution of program on all N processors until completion of all copies of programAgain, a geometric mean is calculated

  • Amdahls LawGene Amdahl [AMDA67]Potential speed up of program using multiple processorsConcluded that:Code needs to be parallelizableSpeed up is bound, giving diminishing returns for more processorsTask dependentServers gain by maintaining multiple connections on multiple processorsDatabases can be split into parallel tasks

  • Amdahls Law FormulaFor program running on single processorFraction f of code infinitely parallelizable with no scheduling overheadFraction (1-f) of code inherently serialT is total execution time for program on single processorN is number of processors that fully exploit parralle portions of codeConclusionsf small, parallel processors has little effectN ->, speedup bound by 1/(1 f)Diminishing returns for using more processors

    **************