System performance monitoring in the ALICE Data Acquisition System with Zabbix
Adriana Telesca
October 15th, 2013
CHEP 2013, Amsterdam

The ALICE Data Acquisition system

ALICE at the CERN LHC

Data Acquisition system requirements:
• 4 GB/s sustained recording rate
• 2.5 GB/s transfer to tape

15/10/2013 Adriana Telesca, CHEP 2013 2/24

The ALICE Data Acquisition system

For Run 2 (2015-2017): ~1000 nodes
• Readout
• Event building
• Recording
• Storage
• Support (network, PDUs)
• Operations

For Run 3 (2019-2021): ~2000 nodes

O2: a new combined online and offline computing for ALICE after 2018

P. Vande Vyvre’s talk today at 16:45 – Data Acquisition track

Lemon

Lemon was used to monitor the DAQ system during Run 1 (2008-2013).

Decision to replace it:
• Lemon future unsure
• Tools with additional/new functionalities
• LHC Long Shutdown 1

ALICE DAQ monitoring system needs

Low impact

Extensibility/Flexibility

Scalability

Full administration GUI

Easy access to data

Interface with other components

ORTHOS alarming system

Parameters to monitor

• CPU, Memory, Disk usage, Network Interfaces, Processes
• Voltage, Current, Temperature, Outlet status, Disk status
• Ethernet: CPU utilization, Memory utilization, Cards temperature
• Fiber Channel: RX/TX ports rate
• Readout links Bytes In/Out, DAQ XOFF, HLT XOFF, Processes CPU and memory
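As a rough illustration of what sampling a few of the host-level parameters above could look like, the sketch below uses only the Python standard library. It is not how the Zabbix agent or the ALICE DAQ collects data; the function name `sample_host_parameters` and the dictionary keys are hypothetical.

```python
import os
import shutil
import time


def sample_host_parameters(path="/"):
    """Return one sample of a few host-level parameters.

    Hypothetical helper for illustration only: the key names are
    assumptions, not Zabbix item keys.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    disk_used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0

    sample = {
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_used_percent": disk_used_percent,
    }

    # Load averages are only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            sample["load_1min"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this host

    return sample


if __name__ == "__main__":
    print(sample_host_parameters())
```

A monitoring agent would typically take such a sample at a configurable interval per metric and send it to a central server.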

Shortlist

Selection criteria:
1. SNMP
2. Logical grouping
3. Large user community
4. Distributed monitoring

Name | Agent | SNMP | Syslog | WebApp | Data Storage Method | License
Cacti | No | Yes | Yes | Full Control | RRDtool, MySQL | GPL
Icinga | Supported | Via plugin | Via plugin | Full Control | MySQL, PostgreSQL, Oracle Database | GPL
Zabbix | Supported | Yes | Yes | Full Control | Oracle, MySQL, PostgreSQL, IBM DB2, SQLite | GPL
Zenoss | No | Yes | Yes | Full Control | ZODB, MySQL, RRDtool | GPL
+ Splunk | Supported | Yes | Yes | Full Control | Raw files | Commercial
+ MonALISA

Source: http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems

Tools comparison

Name | Data gathering | Graphing | Triggering | Scalability | Data Storage | Extensibility
Icinga | Agent | 0 | 1 | 1 (up to 1000 hosts) | DB | 2
Cacti | Server | 2 | 0 | 1 (up to 1000 hosts) | RRDtool, DB | 2
Zenoss | Server | 1 | 1 | 2 (1000+ hosts) | RRDtool, DB | 1
Zabbix | Agent or Server | 2 | 1 | 2 (1000+ hosts) | DB | 2
Splunk | Agent | 2 | 1 | 2 (1000+ hosts) | Raw files | 2
MonALISA | Agent | 2 | 1 | 2 (1000+ hosts) | DB | 2

Scoring: 0-1 = Absent / Present; 0-1-2 = Absent / Present but not good / Good

Tools comparison

Name | SNMP | Community | Granularity | Auto Discovery | Free | Total
Icinga | 2 | 2 | 1 (1 minute per metric) | 2 | 1 | 12
Cacti | 2 | 2 | 1 (1 minute per metric) | 1 | 1 | 12
Zenoss | 1 | 1 | 1 (1 minute per collector) | 2 | 1 | 11
Zabbix | 2 | 2 | 2 (no limit per metric) | 2 | 1 | 16
Splunk | 2 | 2 | 2 (no limit per metric) | 2 | 0 | 15
MonALISA | 2 | 1 | 1 (1 minute per metric) | 2 | 1 | 14

Scoring: 0-1 = Absent / Present; 0-1-2 = Absent / Present but not good / Good
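The Total column is consistent with an unweighted sum of the nine scored criteria across the two comparison tables (Data gathering and Data Storage carry no score). The slides do not state how the total was computed, so the following is a sketch under that assumption; `SCORES`, `totals` and `ranking` are illustrative names.

```python
# Scores transcribed from the two "Tools comparison" tables.
# Column order: graphing, triggering, scalability, extensibility,
#               SNMP, community, granularity, auto discovery, free
SCORES = {
    "Icinga":   [0, 1, 1, 2, 2, 2, 1, 2, 1],
    "Cacti":    [2, 0, 1, 2, 2, 2, 1, 1, 1],
    "Zenoss":   [1, 1, 2, 1, 1, 1, 1, 2, 1],
    "Zabbix":   [2, 1, 2, 2, 2, 2, 2, 2, 1],
    "Splunk":   [2, 1, 2, 2, 2, 2, 2, 2, 0],
    "MonALISA": [2, 1, 2, 2, 2, 1, 1, 2, 1],
}


def totals(scores):
    """Sum the per-criterion scores of each tool (assumed unweighted)."""
    return {name: sum(values) for name, values in scores.items()}


def ranking(scores):
    """Return (tool, total) pairs sorted from highest to lowest total."""
    return sorted(totals(scores).items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, total in ranking(SCORES):
        print(f"{name:9s} {total}")
```

Under this assumption the computed sums match the Total column for all six tools, with Zabbix ranked first.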

Zabbix characteristics

• Graphing

• Full configuration GUI

• Many ways of data retrieval scalability

Zabbix footprint tests

Zabbix dashboard and usage

Conclusion

The evaluation of different monitoring tools resulted in the selection of Zabbix.

Zabbix meets the ALICE DAQ needs.

Zabbix will be in production for Run 2.

Thanks. Questions?