Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas ...

34
Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas http://iacoma.cs.uiuc.edu [email protected] FlexRAM

Transcript of Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas ...

Page 1: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Toward an Advanced Intelligent Memory

System

University of Illinois

Josep Torrellas

http://[email protected]

FlexRAM

Page 2: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

People Involved

Students Other faculty

Michael Huang

Seung YooH. V. JagadishJoe RenauDavid Padua

Jaejin LeeDaniel Reed

Page 3: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Technological Landscape

Merged Logic and DRAM (MLD):

•IBM, Mitsubishi, Samsung, Toshiba and others

•0.18 um (chips for 1 Gbit DRAM)

•Powerful: e.g. IBM SA-27E ASIC (Feb 99)

•Logic frequency: 400 MHz

•IBM PowerPC 603 proc + 16 KB I, D caches = 3%

•Further advances in the horizon

Opportunity: How to exploit MLD best?

Page 4: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Terminology

Processor In Memory (PIM)

Intelligent Memory orIntelligent RAM (IRAM)

=

Page 5: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Key Applications Benefit from HW

•Data Mining (decision trees and neural networks)

•Computational Biology (protein sequence matching)

•Molecular Dynamics (short-range forces)

•Financial Modeling (stock options, derivatives)

•Multimedia (MPEG-2)

•Decision Support Systems (TPC-D)

•Speech Recognition

All these are Data Intensive Applications

Page 6: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Example App: DNA Matching

Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

Page 7: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

How the Algorithm Works

Generate 50+ most-likely mutations

Pick 4 consecutive aminoacids from sample

Page 8: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Example App: DNA Matching

If match is found: try to extend it

Compare them to every positions in the database DNAs

Page 9: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

How to Use MLD

1. Main compute engine of the machine

•Add proc to DRAM chip

•Include a vector processor or multiple processors

Incremental gains

Hard to program

UC Berkeley: IRAMNotre Dame: Execube, PetaflopsMIT: RawStanford: Smart Memories

Page 10: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

How to Use MLD (II)

2. Co-processor, special-purpose processor

•ATM switch controller

•Process data beside the disk

•Graphics accelerator

Stanford: ImagineUC Berkeley: ISTORE

Page 11: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

How to Use MLD (III)

3. Our approach: take the place of memory chipsin a workstation or server

•PIM chip processes the memory-intensive parts of the program

Illinois: FlexRAMUC Davis: Active PagesUSC-ISI: DIVA

Page 12: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Our Solution: Principles

Extract high bandwidth from DRAM: Many simple processing units

Run legacy codes with high performance: Do not replace off-the-shelf uP in workstation Take place of memory chip. Same interface as DRAM Intelligent memory defaults to plain DRAM

Small increase in cost over DRAM: Simple processing units, still dense

General purpose: Do not hardwire any algorithm. No Special purpose

Page 13: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Architecture Proposed

Page 14: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

The FlexRAM Memory System

Can exploit multiple levels of parallelism

For a high-end workstation:

•100’s of P.Mems in memory (e.g. IBM PowerPC 603)

•1 P.Host processor (e.g. Merced, IBM GP)

•100,000’s of very simple P.Arrays in memory

Page 15: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Chip Organization

Page 16: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Memory in one FlexRAM Chip

•64 Mbytes of DRAM organized as 16Mx32 bits

•Organized in 64 1-Mbyte banks

•1 single port

•Associated to 1 P.Array

•2 2-Kbyte row buffers (no P.Array cache)

•P.Array access to memory: 10 ns (row hit) or 20 ns (miss)

•On-chip memory bandwidth: 102 Gbytes/second

•Each bank:

Page 17: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Memory in one FlexRAM Chip

Group of 4 P.Arrays share one 8-Kbyte, 4-portedSRAM instruction memory

•Holds the P.Array code

•Small because short code

•Aggressive access time: 1 cycle = 2.5 ns

Page 18: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

P.Array

•64 P.Arrays per chip. Not SIMD but SPMD

•32-bit integer arithmetic; 16 registers

•4 P.Arrays share one multiplier

•No caches, no floating point

•28 different 16-bit instructions

•Can access own 1 Mbyte of DRAM plus DRAM of left and right neighbors. Connection forms a ring•Broadcast and notify primitives: Barrier

Page 19: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

P.Mem

•2-issue static superscalar like IBM PowerPC 603

•16-Kbyte I, D caches

•Communication with P.Arrays:

•Executes serial sections

•Broadcast/notify or plain write/read to memory

•Communication with other P.Mems:•Memory in all chips is visible•Access via the inter-chip network•Must flush caches to ensure data coherence

Page 20: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Issues

Communication P.Mem-P.Host:

•P.Mem cannot be the master of bus

•P.Host polls a register in Rambus interf. of master P.Mem

•P.Host starts P.Mems by writing register in Rambus interf.

•If P.Mem not finished: memory controller retries. Retries are invisible to P.Host

Virtual memory:

•They share a range of virtual addresses with P.Host

•P.Mems and P.Arrays use virtual memory

Page 21: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Chip Architecture

Page 22: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Basic Block

Page 23: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Area Estimation (mm )

PowerPC 603+caches: 12

SRAM instruction memory: 34

64 Mbytes of DRAM: 330

P.Arrays: 96

Pads + network interf. + refresh logic 20

Rambus interface: 3.4Multipliers: 10

Total = 505

Of which 28% logic, 65% DRAM, 7% SRAM

2

VERY CONSERVATIVE

Page 24: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Evaluation

P.Host P.Host L1 & L2 Bus & Memor Freq: 800 MHz L1 Size: 32 KB Bus: Split Trans Issue Width: 6 L1 RT: 2.5 ns Bus Width: 16 B Dyn Issue: Yes L1 Assoc: 2 Bus Freq: 100 MHz I-Window Size: 96 L1 Line: 64 B Mem RT: 262.5 ns Ld/St Units: 2 L2 Size: 256 KB Int Units: 6 L2 RT: 12.5 ns FP Units: 4 L2 Assoc: 4 Pending Ld/St: 8/8 L2 Line: 64 B BR Penalty: 4 cyc P.Mem P.Mem L1 P.Array Freq: 400 MHz L1 Size: 16 KB Freq: 400 MHz Issue Width: 2 L1 RT: 2.5 ns Issue Width: 1 Dyn Issue: No L1 Assoc: 2 Dyn Issue: No Ld/St Units: 2 L1 Line: 32 B Pending St: 1 Int Units: 2 L2 Cache: No Row Buffers: 3 FP Units: 2 RB Size: 2 KB Pending Ld/St: 8/8 RB Hit: 10 ns BR Penalty: 2 cyc RB Miss: 20 ns BR Penalty: 2 cyc

Page 25: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Utilization

High P.Array Util Low P.Mem Util

Page 26: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Utilization

Low P.Host Utilization

Page 27: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Speedups

Constant Problem Sz Scaled Problem Sz

Page 28: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Speedups

Varying Logic Frequency

Page 29: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Programming FlexRAM

•FlexRAM programmed in C + extensions: C-Flex

•Library of Intelligent Memory Operations (IMOs)

Executed by P.Arrays or P.Mem

C subroutines that can be called from main pgm

Operate on large data sets with poor locality

•Library also contains plain subroutines

•Link program with IMOs or plain subroutines

Page 30: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

C-Flex Programming Extensions

•On processor_range: where the following code is executed

•Waitfor processor_range: processors waiting for others

•Release object

•Map object to processor_range: mapping of pages

•Flush(object), Flush&Inval(object): flush from cache

•Broadcast(address), Poll(), Receive(address), Notify()

•FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()

Page 31: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Performance Evaluation

Hardware performance monitoring embedded in the chip

Software tools to extract and interpret performance info

Page 32: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Current Status

Identified and wrote all applications Designed architecture based on apps & feasible

technology Conceived ideas behind language/compiler Need to do: chip layout and fabrication

development of the compiler Funds needed for:

processor core (P.Mem) chip fabrication hardware and software engineers

Page 33: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Overall Goal

•Fabricate chips

•Build a workstation with an intelligent memory system

•Demonstrate significant speedups on real applications

•Build a compiler for the intelligent memory system

Page 34: Toward an Advanced Intelligent Memory System University of Illinois Josep Torrellas  torrellas@cs.uiuc.edu FlexRAM.

Conclusion

We have a handle on: A promising technology (MLD) Key applications of industrial interest

Real chance to transform the computing landscape