EGEE is a project funded by the European Union under contract IST-2003-508833

LCG Installation – Site Validation

Peer Hasselmeyer
GridKa, FZK

GridKa School, 20-23 September 2004

www.eu-egee.org

GridKa School, 22 September 2004

Outline

• Introduction
• Monitoring Tools
• Information System
• CE
    • batch system
    • Globus
    • RB
• SE


Introduction

• Site configuration is complex
• Tests are important
    • ensure correct operation of your site
    • misconfigurations might affect the grid as a whole
• Some tests are provided in the install notes
• Read the rollout mailing list
    • installation/configuration problems are discussed there
• Understanding the internals helps determine the source of errors


Monitoring Tools

• GOC performs Grid monitoring
    • job submission
    • information system
    • storage system / replica manager
• You need to be registered in the GOC database
• http://goc.grid-support.ac.uk/gridsite/gocmain/monitoring/


Status Map


GIIS Monitor


Information System

• Accessible via LDAP, e.g.
  ldapsearch -x -H ldap://gridkap01.fzk.de:2135 -b "mds-vo-name=local,o=grid"
• No grid proxy needed

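[Diagram: CE, SE and other services feed the site GIIS, which feeds the BDII]

• CE / SE / other service: port 2135, mds-vo-name=local,o=grid
• Site GIIS: port 2135, mds-vo-name=fzklcg2,o=grid
• BDII: port 2170, mds-vo-name=local,o=grid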


GIIS: Site Info

# gridkap01.fzk.de/siteinfo, local, grid
dn: in=gridkap01.fzk.de/siteinfo,Mds-Vo-name=local,o=grid
objectClass: SiteInfo
objectClass: DataGridTop
objectClass: DynamicObject
siteName: FZK-LCG2
sysAdminContact: [email protected]: [email protected]: [email protected]: LCG-2_2_0
installationDate: 20040119103800Z


GIIS: Queue Info

# gridkap01.fzk.de:2119/jobmanager-pbspro-long, local, grid
dn: GlueCEUniqueID=gridkap01.fzk.de:2119/jobmanager-pbspro-long, mds-vo-name=local, o=grid
<…>
GlueCEInfoGatekeeperPort: 2119
GlueCEInfoHostName: gridkap01.fzk.de
GlueCEInfoLRMSType: pbs
GlueCEInfoLRMSVersion: PBSPro_5.4.1.41640
GlueCEInfoTotalCPUs: 874
GlueCEStateEstimatedResponseTime: 68540
GlueCEStateFreeCPUs: 17
GlueCEStateRunningJobs: 86
GlueCEStateStatus: Production
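The queue attributes in this listing lend themselves to a quick scripted check. What follows is a minimal sketch that is not part of the original slides: `ldif_attr` is a hypothetical helper name, and it assumes one `attribute: value` pair per unwrapped line, as in the listing.

```shell
#!/bin/sh
# ldif_attr NAME: print the value of every "NAME: value" line read on stdin.
# Hypothetical helper; assumes unwrapped LDIF lines.
ldif_attr() {
    awk -v key="$1" 'index($0, key ": ") == 1 { print substr($0, length(key) + 3) }'
}

# Example with values copied from the queue listing (no network needed).
printf '%s\n' \
    'GlueCEStateFreeCPUs: 17' \
    'GlueCEStateRunningJobs: 86' \
    'GlueCEStateStatus: Production' |
    ldif_attr GlueCEStateStatus
```

Against a live CE, the input would come from the ldapsearch command on the Information System slide, piped into `ldif_attr GlueCEStateFreeCPUs`.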


GIIS: Cluster Info

# gridkap01.fzk.de, gridkap01.fzk.de, local, grid
dn: GlueSubClusterUniqueID=gridkap01.fzk.de, GlueClusterUniqueID=gridkap01.fzk.de,mds-vo-name=local, o=grid
<…>
GlueHostApplicationSoftwareRunTimeEnvironment: VO-alice-ALICE-4.01.05
<…>
GlueHostApplicationSoftwareRunTimeEnvironment: LCG-2
GlueHostApplicationSoftwareRunTimeEnvironment: LCG-2_1_0
GlueHostApplicationSoftwareRunTimeEnvironment: LCG-2_2_0


GIIS: Cluster Info

GlueHostArchitectureSMPSize: 2
GlueHostBenchmarkSF00: 0
GlueHostBenchmarkSI00: 600
GlueHostMainMemoryRAMSize: 1006
GlueHostMainMemoryVirtualSize: 5100
GlueHostNetworkAdapterInboundIP: FALSE
GlueHostNetworkAdapterOutboundIP: TRUE
GlueHostOperatingSystemName: Redhat
GlueHostOperatingSystemRelease: 2.4.20-35_39.rh7.3.atsmp
GlueHostOperatingSystemVersion: 1 SMP Thu Jun 24 15:00:33 EDT 2004
GlueHostProcessorClockSpeed: 1262
GlueHostProcessorModel: Intel(R) Pentium(R) III CPU family 1266MHz


GIIS: CE/SE Binding

# gridkap02.fzk.de, gridkap01.fzk.de:2119/jobmanager-pbspro-long, local, grid
dn: GlueCESEBindSEUniqueID=gridkap02.fzk.de,GlueCESEBindGroupCEUniqueID=gridkap01.fzk.de:2119/jobmanager-pbspro-long, mds-vo-name=local,o=grid
<…>
GlueCESEBindCEAccesspoint: /grid/fzk.de/mounts/nfs/data/lcg1/SE00
GlueCESEBindCEUniqueID: gridkap01.fzk.de:2119/jobmanager-pbspro-long
GlueCESEBindSEUniqueID: gridkap02.fzk.de


GIIS: SE Info

# dteam:dteam, gridkap02.fzk.de, local, grid
dn: GlueSARoot=dteam:dteam,GlueSEUniqueID=gridkap02.fzk.de,Mds-Vo-name=local,o=grid
<…>
GlueSAPolicyFileLifeTime: permanent
GlueSAStateAvailableSpace: 1588648960
GlueSAStateUsedSpace: 3428001280
GlueSAAccessControlBaseRule: dteam
GlueChunkKey: GlueSEUniqueID=gridkap02.fzk.de
GlueSchemaVersionMajor: 1
GlueSchemaVersionMinor: 1


Job Processing

[Diagram: job flow UI → RB → CE (Gatekeeper, job manager) → PBS Server → WN]

• Transfers shown in the diagram
    • transfer job script
    • input/output sandbox: globus-url-copy
    • script: scp
    • std output/error: scp
• Log files
    • Gatekeeper: /var/log/messages, /var/log/globus-gatekeeper.log, $HOME/gram_job_mgr_<#>.log
    • PBS Server: /var/spool/pbs/server_logs/<date>, /var/spool/pbs/server_priv/accounting/<date>
    • WN: /var/spool/pbs/mom_logs/<date>


Test Job Processing

• Grid proxy required (grid-proxy-init)
• globus-job-run gridkap01.fzk.de:2119/jobmanager-fork /bin/hostname
  output: gridkap01.fzk.de
• Possible errors
    • authentication failure
        • user: no proxy?
        • CE: no CA certificates?
    • no user mapping
        • user: not listed in VO?
        • CE: VO not accepted?
    • UI/CE: clock not synchronized? Firewall?
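Two of the authentication causes listed here can be ruled out from the command line before submitting a job. This is a minimal sketch under stated assumptions, not something shown in the slides: the function names are hypothetical, CA certificates are assumed to be installed as `<hash>.0` files under /etc/grid-security/certificates (the usual Globus location), and the proxy check relies on `grid-proxy-info -exists`.

```shell
#!/bin/sh
# have_ca_certs [DIR]: succeed if DIR holds at least one CA certificate
# file named <hash>.0. Meant to be run on the CE. The default directory
# is an assumption (usual Globus location).
have_ca_certs() {
    ca_dir="${1:-/etc/grid-security/certificates}"
    for ca_file in "$ca_dir"/*.0; do
        [ -e "$ca_file" ] && return 0
    done
    return 1
}

# have_proxy: succeed if a valid proxy exists. Meant to be run on the UI.
# Returns 2 when the Globus client tools are not installed.
have_proxy() {
    command -v grid-proxy-info >/dev/null 2>&1 || return 2
    grid-proxy-info -exists
}
```

A typical use would be `have_proxy || grid-proxy-init` on the UI and `have_ca_certs || echo "no CA certificates found"` on the CE. Clock skew and firewall problems need separate checks.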


Test Job Processing

• globus-job-run gridkap01.fzk.de:2119/jobmanager-lcgpbs -q default /bin/hostname
  output: c01-002-108
• Possible errors
    • no output
        • copying output failed, ssh keys on CE not up to date? Test on WN:
          su - dteam001
          scp test <CE>:
    • batch system not working
        • try on CE:
          su - dteam001
          qsub test.sh
          <wait until job has finished>
          cat test.sh.o*
        • contents of test.sh:
          #!/bin/bash
          /bin/hostname


Test Job Processing

• edg-job-list-match hostname.jdl
    • check that your site is listed here, otherwise: problem with information system
        • not yet added? nothing published? VO missing?
• edg-job-submit -r gridkap01.fzk.de:2119/jobmanager-pbspro-default hostname.jdl
    • works regardless of whether site is listed in information system
    • in case of error: edg-job-get-logging-info -v 1 <jobId>
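The slides never show the contents of hostname.jdl. The sketch below writes one plausible version; the attribute values and the helper name `write_hostname_jdl` are assumptions, using only the common EDG JDL attributes Executable, StdOutput, StdError and OutputSandbox.

```shell
#!/bin/sh
# write_hostname_jdl [FILE]: create a minimal JDL file that runs
# /bin/hostname and returns its output. The contents are an assumption;
# the original hostname.jdl is not shown in the slides.
write_hostname_jdl() {
    cat > "${1:-hostname.jdl}" <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "hostname.out";
StdError      = "hostname.err";
OutputSandbox = {"hostname.out", "hostname.err"};
EOF
}
```

After calling `write_hostname_jdl`, the file can be passed to edg-job-list-match and edg-job-submit as on this slide.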


Job Logging Info

Event: Match
- dest_id = gridkap01.fzk.de:2119/jobmanager-pbspro-default
- host = lxn1177.cern.ch
- source = WorkloadManager
- src_instance = WM
- timestamp = Wed Sep 1 14:23:43 2004
- user = /O=GermanGrid/OU=CrossGrid/CN=Peer Hasselmeyer
---
Event: Running
- host = c01-007-107.gridka.de
- node = c01-007-107.gridka.de
- source = LRMS
- timestamp = Wed Sep 1 14:25:56 2004
- user = /O=GermanGrid/OU=CrossGrid/CN=Peer Hasselmeyer
---
Event: Done
- exit_code = 1
- host = lxn1177.cern.ch
- reason = Got a job held event, reason: Globus error 155: the job manager could not stage out a file
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Wed Sep 1 14:30:46 2004
- user = /O=GermanGrid/OU=CrossGrid/CN=Peer Hasselmeyer
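In a long logging dump, the `reason` field is usually the line worth reading first. This is a minimal sketch for pulling it out; `logging_reasons` is a hypothetical helper name, and the filter assumes the `- key = value` layout shown in this listing.

```shell
#!/bin/sh
# logging_reasons: print the value of every "- reason = ..." line on stdin.
# Hypothetical helper; assumes the "- key = value" layout of the listing.
logging_reasons() {
    sed -n 's/^[[:space:]]*- reason[[:space:]]*=[[:space:]]*//p'
}

# Example with lines copied from the "Done" event.
printf '%s\n' \
    '- exit_code = 1' \
    '- reason = Got a job held event, reason: Globus error 155: the job manager could not stage out a file' \
    '- status_code = FAILED' |
    logging_reasons
```

With a real job, the input would be the output of `edg-job-get-logging-info -v 1 <jobId>`.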


Test SE

[hassel@hik-lcg-ui hassel]$ edg-gridftp-ls -v gsiftp://gridkap02.fzk.de/
total 185
drwxr-xr-x 2 root root 4096 Apr 29 11:16 bin
drwxr-xr-x 4 root root 1024 Apr 29 11:20 boot
drwxr-xr-x 18 root root 86016 Aug 5 19:39 dev
drwxr-xr-x 47 root root 4096 Aug 31 16:43 etc
drwxr-xr-x 3 root root 4096 Apr 29 11:26 grid
drwxr-xr-x 2 root root 4096 Feb 6 1996 home
drwxr-xr-x 2 root root 4096 Jun 21 2001 initrd
drwxr-xr-x 6 root root 4096 Apr 29 11:15 lib
drwxr-xr-x 2 root root 4096 Apr 2 2002 misc
drwxr-xr-x 5 root root 4096 Jun 24 18:37 mnt
drwxr-xr-x 9 root root 4096 Apr 29 11:36 opt
dr-xr-xr-x 122 root root 0 Aug 5 19:38 proc
drwxr-x--- 5 root root 4096 Aug 30 09:08 root
drwxr-xr-x 2 root root 4096 Apr 29 11:17 sbin
drwxr-xr-x 2 root root 0 Aug 5 19:40 sw
drwxrwxrwt 3 root root 4096 Aug 31 16:42 tmp
drwxr-xr-x 19 root root 4096 Apr 29 11:36 usr
drwxr-xr-x 20 root root 4096 Apr 30 04:02 var


Test SE

• Copy file to SE:
  globus-url-copy file://`pwd`/test gsiftp://gridkap02.fzk.de/grid/fzk.de/mounts/nfs/data/lcg1/SE00/dteam/my.test
• Get file back:
  globus-url-copy gsiftp://gridkap02.fzk.de/grid/fzk.de/mounts/nfs/data/lcg1/SE00/dteam/my.test file://`pwd`/readout
• Compare the files:
  diff test readout
• Delete file:
  edg-gridftp-rm gsiftp://gridkap02.fzk.de/grid/fzk.de/mounts/nfs/data/lcg1/SE00/dteam/my.test
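The four steps on this slide can be chained into one round-trip check. This is a minimal sketch rather than a script from the installation guide: `se_roundtrip`, `run`, `SE_URL` and the `DRY_RUN` switch are hypothetical names introduced here, and the default URL is the dteam directory used on the slide.

```shell
#!/bin/sh
# Round-trip test for an SE: copy up, copy back, compare, delete.
# SE_URL defaults to the gsiftp directory used on the slide.
SE_URL="${SE_URL:-gsiftp://gridkap02.fzk.de/grid/fzk.de/mounts/nfs/data/lcg1/SE00/dteam}"

# run CMD...: execute the command, or only print it when DRY_RUN is set
# (hypothetical switch, handy on machines without the grid tools).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# se_roundtrip FILE [READBACK]: FILE is a local file in the current directory.
se_roundtrip() {
    src="$1"
    back="${2:-readout}"
    remote="$SE_URL/my.test"
    run globus-url-copy "file://$PWD/$src" "$remote" || return 1
    run globus-url-copy "$remote" "file://$PWD/$back" ||
        { run edg-gridftp-rm "$remote"; return 1; }
    status=0
    run diff "$src" "$back" || status=$?
    # Always try to clean up the remote copy, even if the files differ.
    run edg-gridftp-rm "$remote" || return 1
    return "$status"
}
```

Running `DRY_RUN=1 se_roundtrip test` prints the four commands without executing them. Without `DRY_RUN`, a valid grid proxy is needed and the function returns non-zero if any step fails.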


Test Replica Management

• edg-rm -v --vo dteam cr file://`pwd`/test
    • stores file on “close SE”
    • equivalent: lcg-cr -v --vo dteam -d gridkap02.fzk.de file://`pwd`/test
        • SE could be replaced by $VO_DTEAM_DEFAULT_SE on WN
        • error messages not useful at all, currently not recommended
• Possible errors:
    • site not in information system
    • copy failed (file might end up on different SE)
• Complete test scripts in installation guide, appendix F


To Do

• Have a look at your GIIS
    • also at the BDII (.rainbow.grid)
• Run some jobs
    • Globus, fork, lcgpbs
    • via RB (.rainbow.grid)
• Store and retrieve some files on your SE
• Replica system cannot be tested here (no RLS)