A Tentative Proposal for ISTORE-2
Winfried W. Wilcke, wilcke@almaden.ibm.com, (408) 927-2139, Almaden Research Center
Richard C. Booth, rcbooth@us.ibm.com, (408) 927-1879, Almaden Research Center
David A. Patterson, pattrsn@cs.berkeley.edu, (510) 642-6587, University of California, Berkeley
July 18, 2000
Underlying Beliefs...
• Commodity components are quickly winning the server wars
– Gigabit Ethernet will win everything
– x86 Processors
– Linux OS will prosper
• Large servers (100-10k nodes) will be quite common - and most are storage centric
• What matters most:
– Ease of management, density of nodes and seamless geographical interconnect
Generations of IStore
• IStore = IStore-1: Present UCB Project
• IStore-2: Joint Research Prototype
– ~2000 nodes
– Split between UCB, IBM and others
– Hardware similar to IStore-1
– Focus on real applications and management software
– Operational YE 2001
• Follow-on Work
Talk Outline
• Project Goals
• Applications
• Research Topics
• Hardware Architecture
• Development Schedule
• Working Relationships
• Next Steps
Candidate Applications
• Research Focus
– NOAA Severe Weather Warning (R. Arps, ARC)
– Fast Image Recognition (J. Malik, UCB)
• Commercial Focus
– Scalable E-business server (IGS) - a must!
– Deep Searching of Entire Web; Webfountain (N. Pass)
– (tbd) Large Scale Network Attached Server (J. Palmer)
– (tbd) Speech Recognition Farms for Phone-based Special Web-services
NOAA Severe Weather....
Ron Arps
• Doppler Radar enables detection of violent tornadoes and plane crashes due to windshear
• Doubled warning time for residents in Oklahoma during '99 class 5 outbreaks
– Goal: 15 minutes avg. warning time in 2004
• Eventually 120 radar sites will be established
• Matches well with I-Store characteristics
– Needs scalable local storage/processing plus seamless transfer of data on geographical scale, manageable from one site
Webfountain
Norm Pass
• Index entire Web every few weeks
– Google, Northernlight index 25%
• 4 TB index => 200 TB in two years
• 'Miner' technology demonstrated
– Resumes, Prices, Geospatial,...
– Prototype running on a 30 node Linux farm
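As a rough check on the index growth quoted on this slide (4 TB growing to 200 TB in two years), the implied compound growth can be sketched in Python. Only the two sizes and the two-year span come from the slide; the function name and the assumption of smooth exponential growth are illustrative.

```python
import math

def implied_growth(start_tb: float, end_tb: float, years: float):
    """Return (annual growth factor, doubling time in months).

    Assumes smooth exponential growth between the two sizes, which is
    an assumption made for this sketch, not a claim from the slides.
    """
    total_factor = end_tb / start_tb
    annual_factor = total_factor ** (1.0 / years)
    doubling_months = 12.0 * years / math.log2(total_factor)
    return annual_factor, doubling_months

annual, doubling = implied_growth(4.0, 200.0, 2.0)
print(f"annual growth factor: {annual:.2f}x")
print(f"doubling time: {doubling:.1f} months")
```

Under that assumption the index would need to double roughly every four months, which is the kind of growth a scalable storage farm is meant to absorb.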
Software Model
• Users will see a standard Linux farm (shared nothing) programming model
– No porting effort for existing Linux farm applications (except dealing with different versions of Linux, of course)
• The system management functions are only visible to system administrators
– Exceptions are performance monitoring functions useful for tuning apps
Differences to a Linux Farm
• Much higher spatial density of Nodes or ‘Bricks’
• Single network protocol (Ethernet) for ALL off-node communications
• Design with geographical distribution in mind
• Diagnostic Processors
• Lego-like, standardized building blocks
– Regular and relaxed homogeneous
• Monitoring Hardware
• Measuring of relevant environmental parameters
• (New) System Management Language
• AME, SON and RAIN objectives
AME, RAIN and SON
• Three areas of system research to be explored with I-Store
• These three areas are largely independent of each other
AME
• Availability
– No single points of failure
– Introspection, failover and fast failure
– Fast repair by swapping identical blocks
• Maintainability
– Homogeneous structure
– System management language
• Extensibility/Scalability
– Shared nothing architecture
RAIN
• Redundant Array of Inexpensive Network (Switches)
• Issues to be explored
– Optimal topology
– Density/cost of ports, optics vs. copper
– Routing algorithms within a machine
– Need for TCP hardware acceleration
– Performance of Ethernet protocol
– Frame sizes
– Simplified switches
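To make the "Frame sizes" question concrete, here is a minimal Python sketch of how much of the wire time carries TCP payload for two frame sizes. It is not from the slides: it assumes TCP over IPv4 without options (20 + 20 header bytes), the usual per-frame Ethernet overhead (8-byte preamble, 14-byte header, 4-byte FCS, 12-byte interframe gap), and a 9000-byte "jumbo" MTU, which is a common vendor convention rather than part of the original Ethernet standard.

```python
# Per-frame Ethernet overhead in bytes: preamble+SFD, header, FCS, interframe gap.
ETH_OVERHEAD = 8 + 14 + 4 + 12
# IPv4 and TCP headers without options (an assumption of this sketch).
TCPIP_HEADERS = 20 + 20

def tcp_payload_efficiency(mtu: int) -> float:
    """Fraction of on-the-wire bytes that are TCP payload for a full frame."""
    return (mtu - TCPIP_HEADERS) / (mtu + ETH_OVERHEAD)

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_payload_efficiency(mtu):.1%} payload efficiency")
```

The wire-level gain from larger frames is only a few percent; the larger effect is usually the lower per-packet processing load on the host, which ties into the question of TCP hardware acceleration.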
SON
• Storage Oriented Nodes
• Basic Premise of one node=one disk=one processor
– It works in farms, but is it a good general choice?
– Is the loss of flexibility (in the ratio of disks per processor) a good tradeoff for easier management?
Additional Software Research Topics...
• Define AME, RAIN, SON benchmarks
• Server Management Language
• Parallel Searching of geographically distributed database
• Dynamic Resource Allocation (e.g. Firewalls)
• SCSI over TCP/IP (SAN within I-Store)
• Storage for mobile users (à la OceanStore)
System Management Language
• Define a high-level, interpretive(?) system management language
– May use facilities of system OS
• Highly regular I-Store is the first target
• Sample Verbs
– allocate, protect, share, map, backup, restore, copy, correlate, display, discover, ping, initialize, report, arm, define(node)....
System Management Language
• Should easily describe tasks such as:
– Backup all data located in the Philippines to Colorado (a volcano is about to blow)
– Set alarm if any disk is more than 80% full
– Define protected subregions in the system
– Display CPU utilization by time and state
– Discover present routing topology
– Show 3D correlation plot of disk vibration vs. brick temperature vs. actual failure events
– .....
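The proposed management language is not defined in these slides, so the following is only a Python sketch of what one of the sample tasks, "Set alarm if any disk is more than 80% full", has to compute. The `DiskStatus` record, the brick names and the sample figures are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DiskStatus:
    node: str           # hypothetical brick identifier
    used_gb: float
    capacity_gb: float

def disks_over_threshold(disks, threshold=0.80):
    """Return the names of nodes whose disk is more than `threshold` full."""
    return [
        d.node
        for d in disks
        if d.capacity_gb > 0 and d.used_gb / d.capacity_gb > threshold
    ]

# Made-up sample data for illustration only.
sample = [
    DiskStatus("brick-001", 10.0, 32.0),
    DiskStatus("brick-002", 27.0, 32.0),
    DiskStatus("brick-003", 16.0, 32.0),
]

for node in disks_over_threshold(sample):
    print(f"ALARM: disk on {node} is more than 80% full")
```

A real management language would presumably express this as a single declarative rule evaluated across all bricks; the loop here only shows the underlying check.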
IStore Hardware Architecture Goals
• Seamless Scalability
– O(10,000) AME Storage Nodes
– Optimized Storage Brick for Packaging Density
• Geographically Dispersed Nodes
– Gb Ethernet Connections to WAN Routers
• Storage Brick
– Full PME Brick: Processor, Memory, Cache
– Gb Ethernet as the Sole Interconnection Fabric
– Embedded Disk with 10s of GBytes
IStore Hardware Architecture Goals (cont.)
• State-of-the-art Intel Processor Memory Element (PME)
– 650 MHz Pentium III with 100 MHz System Bus
– 256 KB L2 cache
– O(512MB) main memory
• State-of-the-art Interconnect Fabric
– 1 Gb Ethernet Runtime Network
– 10/100 Mb Ethernet Diagnostic Network
• State-of-the-art Disks
– 2.5" ~32 GB drive
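For a sense of scale, the raw disk capacity implied by one ~32 GB drive per brick can be tallied with a few lines of Python. The node counts are the orders of magnitude mentioned in the talk (about 1000 per site, about 2000 in total, O(10,000) as the scalability goal); the function name is illustrative, and the figures are raw decimal capacity with no allowance for redundancy or file system overhead.

```python
def raw_capacity_tb(nodes: int, disk_gb: float = 32.0) -> float:
    """Raw capacity in decimal terabytes for one disk per node."""
    return nodes * disk_gb / 1000.0

for nodes in (1000, 2000, 10000):
    print(f"{nodes:>6} nodes x 32 GB = {raw_capacity_tb(nodes):.0f} TB raw")
```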
IStore Hardware Architecture Goals (cont.)
• Berkeley AME Hardware Management Support
– Diagnostic processor
– Environmental sensors
• TCP/IP Hardware Accelerator
– Class 4: Hardware State Machine
• SCSI over TCP ("iSCSI") Support
• Compatible with Standard Ethernet Switches/Routers
IStore-1 Current Berkeley Design
• 80 nodes
• AME
• 266 MHz Pentium II
• Four 100 Mb Ethernet Ports/brick
• Integrated UPS
IStore-2 Deltas from IStore-1
• Geographically Dispersed Nodes
– O(1000) nodes at Almaden
– O(1000) nodes at Berkeley
• Upgraded Storage Brick
– Pentium III 650 MHz Processor
– Two Gb Ethernet Copper Ports/brick
– One 2.5" ATA disk
• User Supplied UPS Support
• Standard Ethernet Switches
Follow on Work
• Ethernet Sourced in Memory Controller (North Bridge)
• TCP/IP Hardware Accelerator
– Class 4: Hardware State Machine
• SCSI over TCP Support
• Integrated UPS
Why an IStore-2 Prototype Is Interesting
• Storage Bricks
– New ratios for MIPS/bandwidth/storage
– New level of density
• AME Hardware Support
– Seamless scaling
– Self maintaining nodes
• It Exists
IStore-2 Core Design Team
• IBM (full time)
– System Architect: Winfried Wilcke
– Lead Designer: Richard Booth
– 1 Experienced Hardware Designer: tbd
– 3 Designers: tbd
• Berkeley
– 6 Graduate Students
IStore-2 Development Schedule
• Working Model
– 7/00: Agreement in Principle
– 8/00: Working Team Membership
• Design
– 9/00: Architecture Specification version 1.0
– 11/00: Design Workbook version 1.0
• Implementation
– 2Q/01: First 3 Nodes Power-up
– 3Q/01: O(64) nodes available to users
– 4Q/01: O(2000) nodes available to users
IStore-2 Footprint (per 1000 nodes)
• 16 Storage (19") Racks
– 64 Storage bricks/rack
  • 8 type 1 storage bricks/drawer
  • 8 storage drawers/rack
– Ethernet switches in rack
• 8 Global Ethernet Switch (19") Racks
• Requires 600 sq. ft. lab
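The packaging figures on this slide can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers from the slide; the constant names are illustrative.

```python
BRICKS_PER_DRAWER = 8    # type 1 storage bricks per drawer
DRAWERS_PER_RACK = 8     # storage drawers per rack
STORAGE_RACKS = 16       # storage racks per 1000 nodes
SWITCH_RACKS = 8         # global Ethernet switch racks per 1000 nodes

bricks_per_rack = BRICKS_PER_DRAWER * DRAWERS_PER_RACK
total_bricks = bricks_per_rack * STORAGE_RACKS
total_racks = STORAGE_RACKS + SWITCH_RACKS

print(f"bricks per rack: {bricks_per_rack}")
print(f"bricks in {STORAGE_RACKS} storage racks: {total_bricks}")
print(f"racks in total: {total_racks}")
```

Sixteen racks of 64 bricks give 1024 bricks, which matches the "per 1000 nodes" heading with a small margin.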
IStore-2 Platform Required Resources
• Staffing
– 6 ARC/SSD IBMers
– 6 UCB Graduate Students
• Lab Space
– 600 sq. ft. lab at Almaden
– 600 sq. ft. lab at Berkeley
• Hardware Costs
– $3M (mostly 2001 dollars)
IStore-2 Working Model
• Jointly Authored Architecture Specification
– 1 or 2 Almaden authors
– 1 or 2 Berkeley authors
• Design Workbook
– Each Core Team Member owns a section
• Weekly Half Day Working Face-to-face Meetings
– Alternate between Almaden and Berkeley
• Shared Electronic Documentation
• Machine Available -for free- to Users From Either Institution
• IP is Handled Like Previous IBM/UCB Projects ??
• Fabrication (some design ?) Vendored Out