DAQ2 Shift Tutorial (cDAQ group): Monitoring of the DAQ2 system



Monitoring tools

1. RCMS/LVL0 interface: has been covered by Hannes.

2. aDAQMon: overview screen to see at a glance the CMS running configuration and rates.

3. DAQView: most comprehensive monitoring tool for issues with data flow. Here you can monitor the data from FEDs to BUs.

4. Elastic Search / Filter Farm monitoring (File Merging): shows the progress of file merging before the data are sent to T0. Important monitor of the transfer system. Also shows the state of the Filter Farm.

5. CPM controller: Central Partition Manager for the TCDS system. Good place to see rates, state of detector inputs, etc.

6. HotSpot: central display for sentinel messages for errors from all processes.


aDAQmon – DAQ Summary

History of HLT activity

http://cmsonline.cern.ch/daqStatusSCX/DAQstatusGre.html

Data taking history

DAQ flow

DAQ sub-system configuration

Status bar gives a quick overview of the DAQ


Main systems (LHC, DCS, ...) status

FED-RU data stream

FED/RU configuration. Box color: sub-system ID

RU/BU box color: CPU usage, 0 to 100%

FED IN

FED OUT

RU bandwidth plot

BU bandwidth plot

# events in BU

BU RAM disk %

BU OUT disk %

DAQ sub-system configuration

RU/BU box with red frame: flash data not updated

Event storage summary


DAQView


DAQView Status & navigation

FED Builder: FEROL/FMM

Event Builder: RU/EVM

FFF Appliances: BU & FU

Age of monitor data


DAQView - Navigation

Stop refreshing page

Switch pages between FEDbuilder, FFF, and all

You only need cDAQ

Start DAQView if it is not running

Current run

Duration and start time of run (or last restart of DAQView)

Last update of page must be current! If it is stale, you need to restart DAQView.
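The "last update must be current" rule can be sketched as a simple age check. This is only an illustration: the slides give no number for when a page counts as stale, so the 30-second threshold and the function name `is_stale` are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the tutorial does not say how old is "stale".
MAX_AGE = timedelta(seconds=30)


def is_stale(last_update: datetime, now: datetime,
             max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the page's last update is older than max_age."""
    return now - last_update > max_age


if __name__ == "__main__":
    now = datetime(2015, 6, 1, 12, 0, 0)  # made-up wall-clock time
    print(is_stale(datetime(2015, 6, 1, 11, 59, 50), now))  # 10 s old
    print(is_stale(datetime(2015, 6, 1, 11, 55, 0), now))   # 5 min old
```

If the check reports a stale page, the action from the slide applies: restart DAQView.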


DAQView – FED builder

TTC partition name & no.

Current TTS state of partition

%warning, %busy in TTS partition

FEROL PC (link to hyperdaq page)

FED information (see next page)

Min/max # fragments received by FEROL. Highlighted in yellow if different from the trigger count. Min is only displayed if not equal to max.

FED builder name

Confused? Try the table help button!


DAQView – FEROL and FMM

Entries are of the form

FRL_geoslot: FEDSourceID
or FRL_geoslot: FEDSourceID1, FEDSourceID2
or FEDSourceID for a pseudo-FED (= TTS link only, no data is read out by DAQ)

Additional info may be displayed next to the FEDSourceID (from left to right):

- Percentage of time during which the FED was in Warning (W) or Busy (B) during the last 3 seconds (if non-zero). Example: W:9.9% B:0.2%
- Current state of TTS if other than Ready. Example: W
- FEDSourceID (expected). Example: 601. Grey if FRL input not enabled (FMM not enabled in case of pseudo-FED). Highlighted in color of current TTS state if other than Ready.
- Percentage of time with DAQ backpressure during last update interval (5 s) if non-zero. Example: <6.9%
- Warnings: received source ID different from expected; FED or SLINK CRC errors (example: #FCRC=699605); number of fragments received by FRL if no data is flowing and this FRL is lagging "behind".

Use the Warning/Busy and backpressure percentages to judge whether a FED is creating dead-time because of a FED problem or because of DAQ backpressure.
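To make the annotation examples concrete, here is a minimal parsing sketch. It assumes the annotations can be read as plain text tokens shaped like the examples on this slide (`W:9.9%`, `B:0.2%`, `<6.9%`, `#FCRC=699605`); the class, field and function names are hypothetical, and reading `#FCRC` as a FED CRC error count is an assumption. The `likely_cause` hint is a crude simplification of the "FED problem or DAQ backpressure" judgement, not DAQView logic.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FedAnnotations:
    warning_pct: Optional[float] = None       # from "W:<x>%"
    busy_pct: Optional[float] = None          # from "B:<x>%"
    backpressure_pct: Optional[float] = None  # from "<<x>%"
    fed_crc_errors: Optional[int] = None      # from "#FCRC=<n>" (assumed meaning)


# Token shapes are taken from the slide's examples; anything else is ignored.
_PATTERNS = {
    "warning_pct": re.compile(r"W:(\d+(?:\.\d+)?)%"),
    "busy_pct": re.compile(r"B:(\d+(?:\.\d+)?)%"),
    "backpressure_pct": re.compile(r"<(\d+(?:\.\d+)?)%"),
    "fed_crc_errors": re.compile(r"#FCRC=(\d+)"),
}


def parse_annotations(text: str) -> FedAnnotations:
    """Extract the annotation tokens that are present in text."""
    result = FedAnnotations()
    for name, pattern in _PATTERNS.items():
        match = pattern.search(text)
        if match:
            raw = match.group(1)
            value = int(raw) if name == "fed_crc_errors" else float(raw)
            setattr(result, name, value)
    return result


def likely_cause(a: FedAnnotations) -> str:
    """Crude hint: backpressure first, then FED-side Warning/Busy."""
    if (a.backpressure_pct or 0) > 0:
        return "DAQ backpressure"
    if (a.warning_pct or 0) > 0 or (a.busy_pct or 0) > 0:
        return "FED"
    return "none"


if __name__ == "__main__":
    print(likely_cause(parse_annotations("W:9.9% B:0.2%")))
    print(likely_cause(parse_annotations("W:9.9% <6.9%")))
```

Checking backpressure first reflects the point that a FED can sit in Warning only because the DAQ is not draining its data.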


DAQView – RU/EVM Information

EVM/RU host (link to hyperdaq page)

First row is TCDS / EVM

Rate (kHz)

# fragments built by RU/EVM since start of run

# incomplete fragments: >> 1 indicates a problem on the RU

Throughput (MB/s)

Super-fragment size (kB)

# events currently in RU: >> 1 indicates a problem in IB

# events requested by BU: normally EVM >> 1 and RUs < 100

Each row is one FEDbuilder

Shaded values mean FEDbuilder is not in readout


DAQView – FFF/BU

BU host (link to hyperdaq page)

Rate per BU (kHz)

Throughput (MB/s)

Event size (kB)

Confused? Try the table help button!

Events built since start of run

# events being built

Resource information (see next page)

# files written

# LS for which there is a file

Current LS number

Each line is one Appliance


DAQView – BU Resources

BU resources are used for requesting events
- Each resource corresponds to multiple events
- Fewer resources mean fewer event requests to the EVM
- Load balancing between independent appliances
- Backpressure mechanism if FFF/HLT cannot keep up

Each BU has a number of resources (#resources)

Resources can be blocked (#blocked) when
- the RAM disk becomes full
- not enough FU CPU cores are available to process data
- FU processing lags behind

Resources for which no event data has been received are counted under #requests
- If #requests > 0, the BU is able to accept new events
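The three counters can be read together as in the sketch below. The class and method names are hypothetical, and the mapping "all resources blocked = blocked, some resources blocked = throttled" is an assumption based on how those words are used in this tutorial; only the `#requests > 0` rule is stated directly.

```python
from dataclasses import dataclass


@dataclass
class BuResources:
    total: int     # "#resources"
    blocked: int   # "#blocked"
    requests: int  # "#requests": resources still waiting for event data

    def can_accept_events(self) -> bool:
        # Stated on the slide: if #requests > 0 the BU can accept new events.
        return self.requests > 0

    def status(self) -> str:
        # Assumed wording: every resource blocked -> "blocked",
        # only some blocked -> "throttled", none blocked -> "ok".
        if self.total > 0 and self.blocked >= self.total:
            return "blocked"
        if self.blocked > 0:
            return "throttled"
        return "ok"


if __name__ == "__main__":
    # Illustrative numbers only.
    for bu in (BuResources(32, 32, 0), BuResources(32, 25, 7), BuResources(32, 0, 32)):
        print(bu.status(), bu.can_accept_events())
```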


DAQView – Running, or not?

LVL0: DAQ is running

No, rate is 0 kHz

None of the HF FEDs has sent any events

No fragments in RU

Many events requested

No data flow as HF has not sent any data. Talk to HF expert.


DAQView – Who Blocks the Run?

ECAL is 100% in Warning

Rate is 0 kHz

FED 602 is in warning and last event is 9605

There’s backpressure from DAQ

RU waits for data from FED 59

FED 59 has not sent any data

FED 59 is the culprit. Talk to Tracker expert.
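The reasoning on this slide (the FED in Warning is a victim of backpressure; the FED that has sent nothing is the culprit) can be sketched as "find the FEDs whose fragment count lags behind the rest". The function name and the input dictionary are hypothetical; the counts reuse the slide's FED numbers, with FED 601 added purely for illustration.

```python
from typing import Dict, List


def lagging_feds(fragments_by_fed: Dict[int, int]) -> List[int]:
    """Return FED source IDs that delivered fewer fragments than the maximum."""
    if not fragments_by_fed:
        return []
    most = max(fragments_by_fed.values())
    return sorted(fed for fed, n in fragments_by_fed.items() if n < most)


if __name__ == "__main__":
    # FED 602 is in Warning but has delivered event 9605; FED 59 sent nothing.
    counts = {59: 0, 601: 9605, 602: 9605}
    print(lagging_feds(counts))
```

With these counts the only lagging FED is 59, which matches the slide's conclusion; the TTS state of FED 602 is deliberately not used as evidence.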


DAQView – DAQ backpressure

ECAL is 50% in Warning

There’s backpressure from DAQ

Very few events requested by BUs

All BUs are “blocked” or “throttled”

RAM disk is full: all resources blocked

RAM disk is nearly full: 25/32 resources blocked

No FU cores available: all resources blocked

Only a few FU cores available: 26/32 resources are blocked

FFF is blocked. Try to figure out what is wrong (and call DAQ oncall).

The rate is 10 kHz


F3 Monitor


Storage & Transfer System

Aggregate files (event data, DQM histograms & metadata) as they appear

Micro-merger on each FU aggregates the data from all processes on the FU

Mini-merger on the BU aggregates the data from all FUs

Mega-merger(s) aggregate the data from all BUs

Data and meta-data are aggregated per luminosity section

Each luminosity section and stream is treated independently

If previous step has completed successfully, input data can be deleted
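The three merging levels can be pictured with a toy model in which each output is just an event count keyed by (luminosity section, stream). This is a sketch of the aggregation idea only, not the actual merger code; the function name, stream names and numbers are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[int, str]  # (luminosity section, stream name)


def merge(parts: Iterable[Dict[Key, int]]) -> Dict[Key, int]:
    """Sum event counts per (LS, stream); each key is handled independently."""
    merged: Dict[Key, int] = defaultdict(int)
    for part in parts:
        for key, n_events in part.items():
            merged[key] += n_events
    return dict(merged)


if __name__ == "__main__":
    # Micro-merge: outputs of all processes on each FU.
    fu1 = merge([{(1, "A"): 10}, {(1, "A"): 12, (1, "DQM"): 3}])
    fu2 = merge([{(1, "A"): 8}])
    # Mini-merge: all FUs attached to one BU.
    bu1 = merge([fu1, fu2])
    # Mega-merge: all BUs (only one in this toy example).
    total = merge([bu1])
    print(total)
```

Because every (LS, stream) key is summed on its own, a level can finish one luminosity section for one stream while others are still arriving, and the inputs of a completed step can then be deleted.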


F3 Monitor http://cmsdaq0/daqfff/ecd/

Nice demo available at http://cmsdaq0/daqfff/ecd/doc/presentation/

List of recent runs

Access old runs

Active run: both boxes must be green

Time chart of HLT activity

Confused? Try the guide!

Stream rates vs LS

Stream names (click to hide them)

Completeness of data: alert DAQ oncall when multiple boxes are not green (the situation shown here is okay)


CentralPartitionManager


TCDS

Combines the pre-LS1:

Trigger Control System (TCS): the conductor of all CMS triggering and data-taking

Trigger Timing and Control (TTC): the distributor of clock, L1As, and synchronisation signals

Trigger Throttling System (TTS): the feedback of readiness states from FEDs to TCS

Many-legged creature:

The ‘head’ is the Central Partition Manager (controlled by central DAQ)

Many different legs (i.e., partitions) across the different subsystems (controlled by the subsystems)


TCDSCentral tcds-control-central.cms:2000/urn:xdaq-application:lid=100


TCDSCentral tcds-control-central.cms:2000/urn:xdaq-application:lid=100

TTC machine interface applications: provide the connection between the LHC RF and timing signals and CMS.


TCDSCentral tcds-control-central.cms:2000/urn:xdaq-application:lid=100

Central Partition Manager (CPM): drives CMS. Controls triggers, calibration sequence, timing and synchronisation, …

This application should tell you what and how many triggers are flowing, or why not.


CPMController tcds-control-central.cms:2050/urn:xdaq-application:lid=100

Hardware status tab

Running state shows if triggers are flowing or why not:
- Stopped
- Running
- Blocked by TTS
- Blocked by DAQ backpressure
- etc.


CPMController tcds-control-central.cms:2050/urn:xdaq-application:lid=100

TTS and trigger blockers tab

Running state:
- Stopped
- Running
- Blocked by TTS
- Blocked by DAQ backpressure
- etc.

The tab shows what can/will block triggers


CPMController tcds-control-central.cms:2050/urn:xdaq-application:lid=100

TTS and trigger blockers tab

Running state:
- Stopped
- Running
- Blocked by TTS
- Blocked by DAQ backpressure
- etc.

This shows which partition is not TTS-READY


CPMController tcds-control-central.cms:2050/urn:xdaq-application:lid=100

Rates and deadtimes tab

This tab shows:
- What rate of triggers is flowing, per type
- What rate of triggers is being suppressed, per type
- What the deadtime is, per source
- How much time each partition spends in TTS not-READY (at the bottom)


CPMController tcds-control-central.cms:2050/urn:xdaq-application:lid=100

Add random triggers

Input sources


HotSpot

Make sure that it updates (pulsates)

Check regularly for Errors or Fatal by clicking on corresponding button


HotSpot

Click on error

Analyze the error and take appropriate action

You can use HTML to copy it into the elog

Acknowledge understood errors


Handsaw

Runs in a terminal on the shifter console
- You need an account in the online cluster to start it

Scrolling display of error messages from DAQ
- All messages (and more) are in HotSpot or LVL0
- Handsaw is often the quicker way to find the most relevant message


What to do if it does not work

Don’t panic! Keep cool.
- Not always easy, especially during stable beams

Think before clicking!
- GUIs are sometimes slow in reacting. Be patient…

Look for error messages (LVL0, HotSpot, Handsaw)

Look at DAQView for anything suspicious

Figure out which subsystem is causing problems
- Be aware that one subsystem might get backpressure from DAQ due to other issues

Talk to the shift leader and other shifters
- They might be aware of problems affecting DAQ
- E.g. if a subsystem lost power, DAQ will go into error (you might be the first to realize it!)

If you are unsure or stuck, don’t hesitate to call the DAQ oncall anytime (76600)


Documentation and Resources

DAQ2 shifters guide twiki page: https://twiki.cern.ch/twiki/bin/view/CMS/ShiftPourNuls2014

The left bar of the DAQ2 shifters guide has many valuable links:
- DAQ shifter bulletin board: read before every shift
- DAQ shifter hypernews: subscribe to this! All DAQ shift related announcements are sent here
- DAQ ELOG: link to DAQ area of the ELOG
- DAQ Shift Tutorial: link to slides from shift tutorial
- Glossary of DAQ Terms: definition of all the DAQ acronyms
- Expert on call: link to DAQ DOC area of shift tool
- Expert List: link to list of DAQ and HLT experts
- DAQ shift schedule: link to DAQ shifters area of shift tool
- P5 shuttle: link to shuttle schedule