Tier1 Status Report, Andrew Sansum, GRIDPP12, 1 February 2005.
Tier1 Status Report
Andrew Sansum, GRIDPP12
1 February 2005
Overview
• Hardware configuration/utilisation
• That old SRM story
• dCache deployment
• Network developments
• Security stuff
Hardware
• CPU
  – 500 dual-processor Intel PIII and Xeon servers, mainly rack mounts (about 884 KSI2K), plus about 100 older systems.
• Disk service: mainly a "standard" configuration
  – Commodity disk service based on a core of 57 Linux servers
  – External IDE & SATA/SCSI RAID arrays (Accusys and Infortrend)
  – ATA and SATA drives (1500 spinning drives, plus another 500 old SCSI drives)
  – About 220TB of disk (last tranche deployed in October)
  – Cheap and (fairly) cheerful
• Tape service
  – STK Powderhorn 9310 silo with 8 9940B drives. Maximum capacity 1PB at present, but capable of 3PB by 2007.
LCG in September
LCG Load
Babar Tier-A
Last 7 Days
ACTIVE: Babar, D0, LHCb, SNO, H1, ATLAS, ZEUS
dCache
• Motivation:
  – Needed SRM access to disk (and tape)
  – We have 60+ disk servers (140+ filesystems), so we needed disk pool management
• dCache was the only plausible candidate (illustrative transfer below)
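As a hedged illustration (not from the slides), this is the style of transfer an SRM in front of the disk pools enables, using the srmcp client shipped with dCache; the endpoint and path here are hypothetical:

    srmcp srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/dteam/test1 \
          file:////tmp/test1

srmcp negotiates a transfer URL through the SRM interface and then moves the data (typically over GridFTP); early versions expected the four-slash form for local file URLs.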
History (Norse saga) of dCache at RAL
• Mid 2003: we deployed a non-grid version for CMS. It was never used in production.
• End of 2003/start of 2004: RAL offered to package a production-quality dCache. The effort stalled due to bugs and holidays, and went back to the developers and the LCG developers.
• September 2004: redeployed dCache into the LCG system for the CMS and DTeam VOs.
• dCache deployed within the JRA1 testing infrastructure for gLite I/O daemon testing.
• Jan 2005: still working with CMS to resolve interoperation issues, partly due to "hybrid grid/non-grid use".
• Jan 2005: a prototype tape back end written.
dCache Deployment
• dCache deployed as a production service (plus a test instance, JRA1, developer1 and developer2?)
• Now available in production for ATLAS, CMS, LHCb and DTEAM (17TB now configured, 4TB used). Reliability is good, but load is light.
• Will use dCache (preferably the production instance) as the interface to Service Challenge 2.
• Work is underway to provide a tape backend; a prototype is already operational. This will be the production SRM route to tape at least until after the July service challenge.
Current Deployment at RAL
Original Plan for Service Challenges
[Diagram: experiments reach the production dCache over SJ4, while the service challenges were to use a test dCache(?) over UKLIGHT as the interface technology.]
dCache for Service Challenges
[Diagram: service challenge experiments arrive over SJ4 and UKLIGHT at a dCache head node (gridftp) in front of the disk servers.]
Network
• Recent upgrade to the Tier-1 network
• Beginning to put in place the new generation of network infrastructure
• Low-cost solution based on commodity hardware
• 10 Gigabit "ready"
• Able to meet the needs of:
  – forthcoming service challenges
  – increasing production data flows (rough link-count sketch below)
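To put numbers on "able to meet needs": a minimal sketch, assuming an illustrative 100MB/s sustained service challenge target and 80% usable capacity per link (both figures are assumptions, not from the slides):

    import math

    def links_needed(target_mb_s, link_gbit=1.0, efficiency=0.8):
        """1 Gbit carries 125 MB/s raw; derate for protocol overhead."""
        usable_mb_s = link_gbit * 1000 / 8 * efficiency
        return math.ceil(target_mb_s / usable_mb_s)

    print(links_needed(100))   # 1 Gbit link at 80% efficiency
    print(links_needed(500))   # 5 bonded links; at this point 10 Gbit looks attractive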
September
[Diagram: disk and CPU nodes hang off three Summit 7i switches, connected to the site router at 1 Gbit.]
Now (Production)
[Diagram: disk+CPU nodes at 1 Gbit/link on a Nortel 5510 stack (80Gbit), uplinked by N*1 Gbit to a Summit 7i, which connects by N*1 Gbit to the site router.]
Soon (Lightpath)
[Diagram: as the production layout, but with the Summit 7i dual-attached to UKLIGHT (10Gb access, used as 2*1Gbit) in addition to the 1 Gbit path to the site router.]
Next (Production)
[Diagram: disk+CPU nodes at 1 Gbit/link on Nortel 5510 stacks (80Gbit), uplinked by N*10 Gbit to a 10 Gigabit switch, connected at 10Gb to the new site router for the RAL site.]
Machine Room Upgrade
• Large mainframe cooling (cooking) infrastructure: ~540KW
  – Substantial overhaul now completed, with good performance gains, but we were close to maximum by mid-summer; temporary chillers covered August (funded by CCLRC)
  – Substantial additional capacity ramp-up planned for the Tier-1 (and other e-Science) services
• November (annual) air-conditioning RPM shutdown
  – Major upgrade: a new (independent) cooling system (400KW+), funded by CCLRC
• Also: profiling power distribution, and a new access control system.
• Worrying about power stability (a brownout and a blackout in the last quarter).
MSS Stress Testing
• Preparation for SC3 (and beyond) underway since August (Tim Folkes).
• Motivation: service load has historically been rather low; look for "gotchas".
• Review known limitations.
• Stress test; part of the way through the process, so just a taster here (a sketch of the measurement step follows the list):
  – Measure performance
  – Fix trivial limitations
  – Repeat
  – Buy more hardware
  – Repeat
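As a sketch of the "measure performance" step only (the mount point, file size and file count are hypothetical, and the real tests were driven against the tape system rather than this generic filesystem probe):

    import os, time

    CACHE = "/stage/cache"           # hypothetical tape-cache mount point
    FILE_MB = 512                    # size of each test file
    BLOCK = 4 * 1024 * 1024          # 4 MB write blocks

    def write_rate(n_files):
        """Write n_files sequentially and return aggregate MB/s."""
        buf = os.urandom(BLOCK)
        start = time.time()
        for i in range(n_files):
            path = os.path.join(CACHE, "stress_%03d.dat" % i)
            with open(path, "wb") as f:
                for _ in range(FILE_MB * 1024 * 1024 // BLOCK):
                    f.write(buf)
                f.flush()
                os.fsync(f.fileno())     # count only data flushed to disk
        return n_files * FILE_MB / (time.time() - start)

    print("%.1f MB/s" % write_rate(4))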
[Architecture diagram, dated Thursday, 04 November 2004. The STK 9310 silo feeds 8 x 9940 tape drives, four attached to each of two Brocade FC switches (ADS_switch_1 and ADS_Switch_2). Production system: AIX dataservers ermintrude, florence, zebedee and dougal; brian (AIX) runs flfsys; Redhat hosts ADS0CNTR (counter), ADS0PT01 (pathtape) and ADS0SB01 (SRB interface); dylan (AIX) handles import/export; buxton (SunOS) runs ACSLS; catalogue and cache sit on array1-array4. Test system: mchenry1 (AIX, test flfsys) and basil (AIX, test dataserver). Users come in through pathtape commands, SRB (Inq, S commands, MySRB) and sysreq admin commands (create, query). Connection types shown: physical FC/SCSI, sysreq UDP commands, user SRB commands, VTP and SRB data transfer, STK ACSLS commands, and logging. All sysreq, VTP and ACSLS connections to dougal also apply to the other dataserver machines, but are left out for clarity.]
Catalogue Manipulation
[Chart: time in seconds for batches of 1 to 250 catalogue operations, plotting average create, query and purge times on a 0-120 second scale; a driver sketch follows.]
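A hedged sketch of how such a benchmark can be driven; the slide does not give the actual flfsys/sysreq command lines, so the command below is a placeholder:

    import subprocess, time

    def batch_seconds(cmd, n):
        """Total wall-clock seconds to run cmd n times in sequence."""
        start = time.time()
        for _ in range(n):
            subprocess.run(cmd, check=True, capture_output=True)
        return time.time() - start

    for n in (1, 2, 5, 10, 20, 50, 100, 150, 200, 250):
        total = batch_seconds(["sysreq", "query"], n)   # placeholder command
        print("%4d queries: %6.1f s" % (n, total))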
Write Performance
[Chart: single-server test, plotting read, write and total rates (KB/sec, 0-100 scale shown) sampled at roughly 23-minute intervals overnight from 16:00 to 08:41.]
Conclusions
• Have found a number of easily fixable bugs
• Have found some less easily fixable architecture issues
• Have a much better understanding of the limitations of the architecture
• Estimates suggest 60-80MB/s to tape now; buy more/faster disk and try again.
• Current drives are good for 240MB/s peak; actual performance is likely to be limited by the ratio of drive (read+write) time to (load+unload+seek) time, as the sketch below illustrates.
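A worked sketch of that ratio; the 30MB/s per-drive streaming rate is consistent with the 240MB/s peak for 8 drives quoted above, while the 60-second mount/seek overhead and the file sizes are illustrative assumptions:

    def effective_mb_s(file_mb, stream_mb_s=30.0, overhead_s=60.0):
        """Per-drive throughput once load/unload/seek time is amortised."""
        return file_mb / (file_mb / stream_mb_s + overhead_s)

    for size_mb in (500, 2000, 10000):
        print("%6d MB files: %5.0f MB/s across 8 drives"
              % (size_mb, 8 * effective_mb_s(size_mb)))

With these assumptions, small files hold the 8-drive aggregate to roughly 50-130MB/s, well under the 240MB/s streaming peak and consistent with the 60-80MB/s estimate above.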
Security Incident
• 26 August: an X11 scan acquires a userid/password at an upstream site
• Hacker logs on to the upstream site and snoops known_hosts
• Ssh to a Tier-1 front-end host using an unencrypted private key found at the upstream site (see the key-check sketch below)
• Upstream site responds, but does not notify RAL
• Hacker loads an IRC bot on the Tier-1 and registers it at a remote IRC server (for command/control/monitoring)
• Tries a root exploit (fails); attempts logins to downstream sites (fails, we think)
• 7th October: the Tier-1 is notified of the incident by the IRC service and begins its response
• 2000-3000 sites involved globally
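Because the entry point was an unencrypted private key, a site might scan for keys with no passphrase. A minimal sketch, assuming era-appropriate PEM-format keys (which carry a "Proc-Type: 4,ENCRYPTED" header when passphrase-protected) and illustrative paths:

    import glob

    # Flag SSH private keys that show no sign of passphrase protection.
    for path in glob.glob("/home/*/.ssh/id_*"):
        if path.endswith(".pub"):
            continue                      # skip the public halves
        try:
            with open(path, errors="ignore") as f:
                head = f.read(200)
        except OSError:
            continue
        if "PRIVATE KEY" in head and "ENCRYPTED" not in head:
            print("unencrypted key:", path)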
Objectives
• Comply with site security policy (disconnect, etc.)
  – Will disconnect hosts promptly once an active intrusion is detected
• Comply with external security policies (e.g. LCG)
  – Notification
• Protect downstream sites by notification and pro-active disconnection
  – Identify the sites involved
  – Establish direct contacts with upstream/downstream sites
• Minimise service outage
  – Eradicate the infestation
Roles
• Incident controller: manage the information flood; deal with external contacts
• Log trawler: hunt for "contacts"
• Hunter/killer(s): search for compromised hosts
• Detailed forensics: understand what happened on compromised hosts
• End-user contacts: get in touch with users, confirm usage patterns, terminate IDs
Chronology
• At 09:30 on 7th October the RAL network group forwarded a complaint from Undernet suggesting unauthorised connections from Tier-1 hosts to Undernet.
• At 10:00 initial investigation suggested unauthorised activity on csfmove02; csfmove02 was physically disconnected from the network.
• By now 5 Tier-1 staff plus the e-Science security officer were 100% engaged on the incident, with additional support from the CCLRC network group and the site security officer. Effort remained at this level for several days. Babar support staff at RAL were also active tracking down unexplained activity.
• At 10:07 a request was made to the site firewall admin for firewall logs of all contacts with suspected hostile IRC servers.
• At 10:37 the firewall admin provided an initial report, confirming unexplained current outbound activity from csfmove02 but from no other nodes.
• At 11:29 Babar reported that the bfactory account password was common to the following additional IDs: bbdatsrv and babartst.
• At 11:31 Steve completed a rootkit check: no hosts found, although with possible false positives on Redhat 7.2 that we are uneasy about.
• By 11:40 preliminary investigations at RAL had concluded that an unauthorised access had taken place on host csfmove02 (a data mover node), which in turn was connected outbound to an IRC service. At this point we notified the security mailing lists (hepix-security, lcg-security, hepsysman).
Security Events
[Chart: count of security events (0-35) per 6-hour interval across the incident period.]
Security Summary
• The intrusion took 1-2 staff-months of CCLRC effort to investigate: 6 staff full-time for 3 days (5 Tier-1 plus the e-Science security officer), working long hours. Also involved:
  – Networking group
  – CCLRC site security
  – Babar support
  – Other sites
• Prompt notification of the incident by the upstream site would have substantially reduced the size and complexity of the investigation.
• The good standard of patching on the Tier-1 minimised the internal spread of the incident (but we were lucky).
• We can no longer trust who logged-on users are: many userids (globally) have probably been compromised.
Conclusions
• A period of consolidation
• User demand continues to fluctuate, but an increasing number of experiments are able to use LCG
• Good progress on SRM to disk (dCache)
• Making progress with SRM to tape
• Having an SRM isn't enough: it has to meet the needs of the experiments
• Expect the focus to shift (somewhat) towards the service challenges