FAX UPDATE 26TH AUGUST 2013. Running issues FAX failover Moving to new AMQ server Informing on...
Ilija Vukotic [email protected] 2
CONTENT
Running issues
FAX failover
Moving to new AMQ server
Informing on endpoint status
Monitoring developments
Monitoring validation
dCache monitor 5.0.0
Collector
Dashboard
50 shades of green
RUNNING ISSUES
Dead endpoints:
Frascati, Manchester, LAL
cmsd services are dead at:
Taiwan-lcg2, LPSC, Protvino, SWT2_CPB
/atlas/dq2/user/gangarbt lookups
• Made half of the federation endpoints inaccessible from upstream redirectors.
• Johannes will explain in more detail.
Remaining issues with x509
• Communicating our wish to get it turned on
• BU, DESY-HH, DESY-ZN, FZK, LRZ-LMU, MPPMU, Freiburg, Wuppertal, Geogrid
RUNNING ISSUES
Rather green considering it’s August!
Quite a bit of traffic considering it’s August!
New functional HC tests should not contribute much, AFAIK.
FAX FAILOVER
FAX failover works: http://pandamon.cern.ch/fax/failover
Developments:
• Cloud is now shown and queue names are corrected
• Side menu
In the works:
• Filtering on user
• Graphing
To ponder:
• Site admins are not aware of this possibility. How do we communicate to them that it is in their best interest to turn it on?
FAX FAILOVER
FAX dedicated submenu
Will add PANDA-brokered job statistics here
Production jobs failing over to FAX
MOVING TO NEW AMQ SERVER
All FAX related info was sent to pilot.msg.cern.ch
There was no authentication
Moved to Dashboard test broker
Consumer now uses STOMP+SSL
This required moving to a new stomp version
Will move to the production server this week
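To make "STOMP+SSL" concrete, here is a minimal standard-library sketch of what such a consumer does on the wire: open a TLS connection with a client certificate, send a STOMP 1.1 CONNECT frame, then SUBSCRIBE. It is not the actual FAX consumer, which uses a stomp client library. The host, port, certificate paths, and destination are all assumed to be supplied by the caller; none of them come from the slides.

```python
import socket
import ssl


def build_frame(command, headers, body=b""):
    """Encode one STOMP frame: command line, header lines, blank line, body, NUL.

    Header values are not escaped here, which is acceptable for the simple
    CONNECT and SUBSCRIBE frames built below.
    """
    lines = [command] + [f"{key}:{value}" for key, value in headers.items()]
    return ("\n".join(lines) + "\n\n").encode("utf-8") + body + b"\x00"


def open_stomp_ssl(host, port, certfile, keyfile, destination):
    """Connect over TLS with a client certificate and subscribe to one destination.

    All arguments are caller-supplied assumptions (broker host, port,
    certificate files, queue or topic name).
    """
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    raw = socket.create_connection((host, port), timeout=30)
    conn = context.wrap_socket(raw, server_hostname=host)
    conn.sendall(build_frame("CONNECT", {"accept-version": "1.1", "host": host}))
    reply = conn.recv(4096)  # a CONNECTED frame is expected; ERROR means refusal
    if not reply.startswith(b"CONNECTED"):
        conn.close()
        raise ConnectionError(f"broker did not accept the connection: {reply[:80]!r}")
    conn.sendall(
        build_frame("SUBSCRIBE", {"id": "0", "destination": destination, "ack": "auto"})
    )
    return conn
```

In practice a maintained stomp client library handles heartbeats, reconnection, and frame parsing; the sketch only shows where the authentication that was previously missing enters the exchange.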
INFORMING ON ENDPOINT STATUS
Mailing from SSB works and gives results.
Do we want SAM updates too?
What would it take?
Who would do it?
MONITORING DEVELOPMENTS
There is a need to remotely check whether cmsd works.
• We had (and still have) sites showing as green for direct access and red for downstream redirection.
• Investigation shows that the cmsd’s are actually dead or not responding.
• Need a way to probe cmsd’s directly
• Andy will look into ways to do it.
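Until a direct cmsd probe exists, the pattern described above (green for direct access, red for downstream redirection) can at least be flagged automatically. The sketch below is illustrative only: the function names, the result labels, and the input layout (site name mapped to a pair of booleans) are assumptions, not part of the existing SSB tests.

```python
def diagnose_endpoint(direct_ok, downstream_ok):
    """Map the two test outcomes for one endpoint to a coarse diagnosis."""
    if direct_ok and downstream_ok:
        return "ok"
    if direct_ok:
        # The data server answers but redirection to it fails: the pattern
        # that turned out to mean a dead or unresponsive cmsd.
        return "suspect cmsd"
    if downstream_ok:
        # Redirection works while direct access fails; worth rechecking.
        return "inconsistent"
    return "endpoint down"


def suspect_cmsd_sites(results):
    """Return the sorted site names whose results match the suspect-cmsd pattern.

    `results` maps a site name to a (direct_ok, downstream_ok) pair.
    """
    return sorted(
        site
        for site, (direct_ok, downstream_ok) in results.items()
        if diagnose_endpoint(direct_ok, downstream_ok) == "suspect cmsd"
    )
```

This only infers a cmsd problem from existing test results; it does not replace a real probe of the cmsd itself.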
To develop new columns for SSB:
• xRootD version
• Rucio support
• Monitoring status
MONITORING VALIDATION
The first step is to validate that the results shown by Matevz’s collector are correct.
I sent xrootd summary messages to the collector and checked what appears in the plots. The messages arrive and are shown, but something is wrong in how the summaries are calculated or plotted.
DCACHE MONITOR 5.0.0
dCache monitor mostly rewritten:
• dCache compatible logging
• UDP messaging from same ports
• Sends “=” stream
• Sends more data (substitutes DN \CN with username etc.)
• Made compatible with collector
Tested at MWT2. Very good results.
By the end of the week an RPM will be produced and placed in the WLCG repository. CMS will be informed about the new version.
COLLECTOR
New version being prepared by Matevz
• New AMQ version
BIG ISSUE:
Some CMS sites are sending info to our collector. Will be raised with Brian B.
DCACHE MONITOR 5.0.0
Now gives really important and actionable information. Just during debugging I noticed:
• Files opened, a small percentage read, and then kept open for hours.
• Same file open twice in the same session (?!)
• Rather small usage of vector reads.
IN DASHBOARD
Why is there a difference between the table and the plots?
What is the idea of the “Site history” tab?
Need to investigate why CMS sites appear here (CERN-CMSTEST).
PANDA RE-BROKERING
Discussed at last CERN S&C week
We agreed to provide PANDA with an estimate of the cost of moving data over the WAN, so that it can re-broker jobs from very long queues to sites that have free slots and a good connection to the data.
The cost matrix exists in SSB.
Code that reads it from SSB, applies exponential decay smoothing, and sends the result to AGIS is running.
Have to check scalability of AGIS bulk update.
Waiting for Artem to code moving data from AGIS to schedconfig.
The next step is for Tadashi to use that table from schedconfig and actually re-broker.
Finally, we’ll have to monitor it the same way we do with Failover.
No developments
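As an illustration of the smoothing step, here is a minimal sketch assuming "exponential decay smoothing" means a standard exponentially weighted moving average, where each new measurement is blended with the previous smoothed value. The function name, the `alpha` weight, and the dictionary layout keyed by (source, destination) are assumptions; the actual code that reads SSB and writes to AGIS is not shown in the slides.

```python
def smooth_costs(previous, latest, alpha=0.3):
    """Blend the latest cost measurements into the smoothed cost matrix.

    previous: dict mapping (source, destination) to the smoothed cost so far
    latest:   dict mapping (source, destination) to the newest measurement
    alpha:    weight of the newest measurement, between 0 and 1 (assumed value)
    """
    smoothed = dict(previous)  # links without a new measurement keep their value
    for link, value in latest.items():
        if link in smoothed:
            smoothed[link] = alpha * value + (1.0 - alpha) * smoothed[link]
        else:
            smoothed[link] = value  # first measurement seen for this link
    return smoothed
```

With this form, older measurements fade geometrically: a measurement taken n updates ago contributes with weight proportional to (1 - alpha) to the power n, so a single bad transfer does not swing the cost that would be passed on for re-brokering.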
50 SHADES OF GREEN
Green color in any of the FAX SSB monitor metrics is based on one and the same file.
This involves a lot of cached information.
We need to measure the percentage of files successfully obtained from a much larger file pool, while avoiding caching effects.
Simple code was developed to test all endpoints that have FDR datasets. It does _file0->ls() on each of the ~800 files, sequentially.
Currently run by hand.
You may find it in FAXtools/FAXtestsFDR of our CERN FAX git repo.
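The idea of grading an endpoint by its success rate over many files, rather than by a single file, can be sketched as follows. This is not the code in FAXtools/FAXtestsFDR: the function name and the `probe` callable are hypothetical, with `probe` standing in for whatever actually opens a file through FAX and reports whether it worked.

```python
import random


def success_fraction(files, probe, sample_size=None, seed=None):
    """Probe files one at a time and return the fraction that succeeded.

    files:       iterable of file names or URLs in the pool
    probe:       callable taking one file and returning True on success
                 (hypothetical stand-in for the real open-and-list test)
    sample_size: if given, test only a random sample of this many files
    """
    pool = list(files)
    if sample_size is not None:
        pool = random.Random(seed).sample(pool, min(sample_size, len(pool)))
    if not pool:
        return 0.0
    succeeded = 0
    for name in pool:  # sequential, like the hand-run test
        try:
            if probe(name):
                succeeded += 1
        except Exception:
            pass  # a probe that raises counts as a failure
    return succeeded / len(pool)
```

Drawing a different random sample on each run is one possible way to reduce the chance that every result comes from already-cached files, though that is an assumption about how caching behaves at the endpoints.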