SplunkLive! Utrecht 2017 - ASML Customer Presentation
Richard van der Ven
21-11-2017
Alert & Health Monitoring
A Splunk and ITSI implementation
Public
Functional Cluster Architect, Litho Computing Platform
21 Nov 2017
Slide 3
Public
Who am I?
Worked at ASML for 16 years
• 13 years - IT Infrastructure
  • DBA, Storage, ITIL processes
  • IT Management
• 3 years - Functional Cluster Architect
  • Litho Computing Platform
  • Alert & Health Monitoring
Richard van der Ven
Slide 4
ASML makes the machines for making chips
• Lithography is the critical tool for producing chips
• All of the world’s top chip makers are our customers
• 2016 sales: €6.8 billion
• More than 17,000 employees (FTE) worldwide
Slide 5
A global presence
3,900 employees
Source: ASML Q1 2017
Offices in over 60 cities in 16 countries worldwide
9,600 employees 3,600 employees
Slide 6
A tightly integrated set of solutions for scaling and yield
Image
Compute/SW
Measure
Slide 7
Litho Computing Platform
• A cloud infra stack, called the Litho Computing Platform, designed for high availability and scalability
• Virtual machines are abstracted from the hardware
  HW may change or break -> virtual machines stay up -> highly available
• It centralizes all applications in one place
• It can serve 40 scanners & 50 YieldStars
• It runs in a dark site at ASML customers
An extendable HW platform that scales with application needs
Slide 8
Availability is key
Availability %             Downtime per year   Downtime per month*   Downtime per week
90% ("one nine")           36.5 days           72 hours              16.8 hours
95%                        18.25 days          36 hours              8.4 hours
97%                        10.95 days          21.6 hours            5.04 hours
98%                        7.30 days           14.4 hours            3.36 hours
99% ("two nines")          3.65 days           7.20 hours            1.68 hours
99.5%                      1.83 days           3.60 hours            50.4 minutes
99.8%                      17.52 hours         86.40 minutes         20.16 minutes
99.9% ("three nines")      8.76 hours          43.2 minutes          10.1 minutes
99.95%                     4.38 hours          21.60 minutes         5.04 minutes
99.99% ("four nines")      52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")     5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")     31.5 seconds        2.59 seconds          0.605 seconds
NOTE: This is the availability of the functionality that we sell, as perceived by the customer; thus Infra + HW + Virtualization layer + Application + Connectivity
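The downtime figures above follow from simple arithmetic: allowed downtime = (1 - availability) x period length. The sketch below shows that calculation in Python. It assumes a 365-day year, a 30-day month and a 7-day week, period lengths chosen because they match the table (for example 8.76 hours per year and 43.2 minutes per month at 99.9%). The function name `downtime_seconds` is illustrative and not part of AHM.

```python
# Period lengths in seconds. The 365-day year and 30-day month are assumptions.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def downtime_seconds(availability_pct: float, period: str) -> float:
    """Return the allowed downtime in seconds for an availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return unavailable_fraction * PERIOD_SECONDS[period]


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99):
        hours_per_year = downtime_seconds(pct, "year") / 3600
        minutes_per_month = downtime_seconds(pct, "month") / 60
        print(f"{pct}%: {hours_per_year:.2f} h/year, "
              f"{minutes_per_month:.2f} min/month")
```

Because the note defines availability end to end, the budget computed this way is shared by the infra, HW, virtualization, application and connectivity layers together.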
Slide 9
Some history
Timeline
2011
• 1st LCP in the field
• Started off by monitoring infrastructure components with Nagios
• Supported by PHP development
• Changing knowledge experts for the custom-built setup
End of 2015
• The need for improved monitoring and local analysis came up after some situations where:
  • Engineers didn’t notice application components failing
  • It took a long time to get requested log files via customer approval
  • It required several iterations to get the log files needed
Slide 10
Alert & Health Monitoring
Why Alert and Health Monitoring?
• Avoid unplanned downtime
• Reduce planned maintenance times
• A smart and robust monitoring solution platform to enable live monitoring
The AHM product will enable CS engineers to:
• Identify if LCP operation is at risk
• Diagnose the root cause of incidents
• Perform proactive maintenance
• Plan capacity
• Verify configuration state
Slide 11
Alert & Health Monitoring
Key features
Alerting
• Alert when a KPI exceeds its threshold
Monitoring / quick troubleshooting
• Health monitoring: HW / SW / FW / environment health, including network infra, databases and OSes
• Configuration reporting: exact HW/SW/FW config and changes, including licenses and serial numbers
Analysis / debugging
• Timeline reconstruction: chronological list of major events and threshold alerts
• Diagnostics deep dive
• Data downloading
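Two of the key features on this slide, threshold alerting and timeline reconstruction, can be illustrated together. The following is a minimal sketch of the idea only, not the AHM implementation (which is built on Splunk and ITSI). The `Sample` and `Event` types, the KPI name and the threshold value are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sample:
    """One KPI measurement (hypothetical shape)."""
    time: datetime
    kpi: str
    value: float


@dataclass(frozen=True)
class Event:
    """One timeline entry: a major event or a threshold alert."""
    time: datetime
    source: str
    message: str


def threshold_alerts(samples: Iterable[Sample],
                     thresholds: Dict[str, float]) -> List[Event]:
    """Emit an alert event for every sample above its KPI threshold."""
    alerts = []
    for s in samples:
        limit = thresholds.get(s.kpi)  # KPIs without a threshold never alert
        if limit is not None and s.value > limit:
            alerts.append(Event(s.time, "alert",
                                f"{s.kpi} = {s.value} exceeds {limit}"))
    return alerts


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge any number of event streams into one chronological list."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.time)


if __name__ == "__main__":
    t0 = datetime(2017, 11, 21, 9, 0)
    samples = [
        Sample(t0, "memory_used_pct", 70.0),
        Sample(t0 + timedelta(minutes=10), "memory_used_pct", 92.5),
    ]
    major_events = [
        Event(t0 + timedelta(minutes=5), "app", "service restarted"),
    ]
    alerts = threshold_alerts(samples, {"memory_used_pct": 90.0})
    for event in build_timeline(major_events, alerts):
        print(event.time.isoformat(), event.source, event.message)
```

Keeping alerts and major events in the same `Event` shape is what makes the merged timeline a plain sort on the timestamp.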
Slide 12
Alert & Health Monitoring
Support flow & organization
[Diagram: support flow between the customer, ASML local equipment support and ASML GSC equipment support. Labels: AHM, App 1, App 2, Alert, Monitoring, Status Report, Troubleshooting, Action Plan, VPN, Remote intervention, Under virtual escort by customer.]
Slide 13
AHM High-level Architecture
[Diagram: Alert and Health Monitoring architecture. Labels: Alerting, Analysis / Debugging, Monitoring / Quick troubleshooting, Hardware, Virtualization, Operating Systems, Middleware, Litho apps, AHM, Data Collection Scripts, Search Head, Index, Forwarders, @ Central Instance, Data Onboarding, Data Processing, Config Manager, AHM Configurator, Configuration, Metrics.]
Slide 14
Alert & Health Monitoring
Key figures
1x
165-239 KPIs
< 5GB daily
6-10
500GB
~221 sourcetypes
77 hosts
~2125 sources
> 20GB daily
> 25TB
> 50x
> 3000 hosts
Slide 15
Alert & Health Monitoring
Challenges
• Lead time
• Importance of log files for monitoring
• What determines application availability
• Changing requirements from stakeholders
• Service model
• ITSI implementation
Slide 16
Alert & Health Monitoring
Challenges with Splunk core and ITSI
• Service Model
  • Not usable out of the box
  • Generated with own tool
• UI: not usable
  • ITSI Dashboard: not configurable to our needs
  • Glass tables: static, where we need flexibility due to variable applications
• Event alerting
  • Implementation of customer-specific thresholds
Slide 17
Alert & Health Monitoring
How did we solve it?
• Service Model
  • Generated with our own configuration tool
  • ‘Manual’ regeneration at every change to the applications
  • Using mind maps for discussions
• UI
  • Dashboards built with tables and hyperlinks
  • New drill-down feature looks promising
• Event alerting
  • Aligning ITSI queries and core Splunk
  • Implementation of customer-specific thresholds
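One way to picture customer-specific thresholds is as a set of default KPI thresholds with per-customer overrides layered on top. The sketch below assumes that model. The slides do not describe how AHM or ITSI stores thresholds, so every name and number here is hypothetical.

```python
from typing import Dict, Mapping

# Hypothetical default thresholds shared by all installations.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "disk_used_pct": 90.0,
    "memory_used_pct": 85.0,
}

# Hypothetical per-customer overrides; only the deviating KPIs are listed.
CUSTOMER_OVERRIDES: Dict[str, Dict[str, float]] = {
    "customer_a": {"memory_used_pct": 75.0},
}


def thresholds_for(
    customer: str,
    defaults: Mapping[str, float] = DEFAULT_THRESHOLDS,
    overrides: Mapping[str, Mapping[str, float]] = CUSTOMER_OVERRIDES,
) -> Dict[str, float]:
    """Return the defaults with this customer's overrides applied on top."""
    resolved = dict(defaults)  # copy, so the shared defaults stay untouched
    resolved.update(overrides.get(customer, {}))
    return resolved


def is_breached(customer: str, kpi: str, value: float) -> bool:
    """True when the value exceeds the customer's effective threshold."""
    return value > thresholds_for(customer)[kpi]


if __name__ == "__main__":
    print(thresholds_for("customer_a"))
    print(is_breached("customer_a", "memory_used_pct", 80.0))
    print(is_breached("customer_b", "memory_used_pct", 80.0))
```

Listing only the deviating KPIs per customer keeps the overrides small and makes it obvious where an installation differs from the defaults.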
Slide 18
Implementation of Splunk and ITSI
Easy and clear drill-down dashboards
Users are non-IT
Slide 19
Alert & Health Monitoring
Benefits
• Easier access to log files, metrics and application data
• Less time spent on regular service checks
• Combine application and infra data
• Unforeseen side effects of changes diagnosed in the field and during internal testing
• More confidence in the actual system state
• Memory leak issue spotted in the field, before impact