Putting Data to Work by Splunking All the Things at Target - Gartner AADI 2012
Transcript of Putting Data to Work by Splunking All the Things at Target - Gartner AADI 2012
Splunk Company Overview
Company (NASDAQ: SPLK)
• Founded 2004, first software release in 2006
• HQ: San Francisco / Region HQ: London, Hong Kong
• Over 600 employees, based in 10 countries
• FY 12 Revenue: $121MM; FY 13 Guidance: $183MM
– Q2 FY 13 Revenue: $44.5 million
Business Model / Products
• Free download to massive scale
• Software deployed on-premise and in the cloud; Splunk Storm delivered via a SaaS model
4,400+ Customers
• Customers in over 80 countries
• 54 of the Fortune 100
• Largest license: 100 Terabytes per day
Copyright © 2012 Splunk, Inc.
Target Turns Machine Data into Application Intelligence
Leena Joshi, Splunk
Dan Cundiff, Target Corporation
Agenda
• Splunk Overview
• The machine data opportunity
• Splunk At Target
• Why Target chose Splunk
• Results with Splunk
• Best Practice Advice
Turn Machine Data into Application Intelligence
Spelunking: to explore underground caves
Splunking: to explore and visualize large amounts of machine data
Mission
Make machine data accessible, usable and valuable to everyone.
Splunk Collects and Indexes Any Machine Data
• Customer Facing Data: Click-stream data; Shopping cart data; Online transaction data
• Outside the Datacenter: Manufacturing, logistics…; CDRs & IPDRs; Power consumption; RFID data; GPS data
• Applications: Web logs; Log4J, JMS, JMX; .NET events; Code and scripts
• Networking: Configurations; syslog; SNMP; netflow
• Databases: Configurations; Audit/query logs; Tables; Schemas
• Virtualization & Cloud: Hypervisor; Guest OS, Apps; Cloud
• Linux/Unix: Configurations; syslog; File system; ps, iostat, top
• Windows: Registry; Event logs; File system; sysinternals
Logfiles, Configs, Messages, Traps, Alerts, Metrics, Scripts, Tickets, Changes
• No upfront schema
• No custom connectors
• No RDBMS
• No need to filter/forward
• Any amount, any location, any source.
Turning Machine Data into Operational Intelligence
Report and analyze
Custom dashboards
Monitor and alert
Ad hoc search
Real-time
Collection and Indexing
Developer Platform
Integrated Collection, Storage and Visualization.
• Business Insights: Gain real-time insight from your machine data to make better-informed business decisions.
• Operational Visibility: Gain operational visibility to make better-informed IT decisions.
• Proactive Monitoring: Monitor infrastructure to identify issues, problems and attacks before they impact your customers and services.
• Search and Investigation: Find and fix problems across the organization using machine data.
Enabling Application Intelligence for Dev & Production
• End user devices
• Storage
• Messaging
• Servers
• Legacy Systems
• Databases
• Virtualization
• Web Services
• App Servers
• Networking/Load balancing
• Security
Talks to every technology in your stack
Correlates data across the different tiers – find causal links
Built for Big Data - Visualize, analyze, trend all your data at scale
Operational Intelligence Across Use Cases
• IT Ops
• Security
• Compliance
• Application Management
• Web Intelligence
• Business Analytics
• Internet of Things
DEVELOPER FRAMEWORK
Broad Adoption Across 4,400+ Customers
Over Half the Fortune 100
• Cloud and Online Services
• Education
• Energy and Utilities
• Financial Services & Insurance
• Government
• Manufacturing
• Media & Entertainment
• Healthcare
• Travel and Leisure
• Retail
• Telecommunications
• Technology
Putting Data to Work by Splunking All the Things at Target
Dan Cundiff, Target Corporation
About Me
• Technical Architect
• 7+ years development experience working across several groups: security, social media and knowledge management, and service oriented architectures
• Currently focused on API development, creating RESTful APIs that are used in and outside of the enterprise across a wide range of devices, applications, and business partners
• Enjoy automating all the things and exchanging pro tips on continuous integration and deployment
• @pmotch
Context: Enterprise Services @ Target
• Data and transactional APIs for all the domains in our business
– Products (inventory, price, description, etc.)
– Locations
– Coupons
– etc.
• APIs exposed inside and outside
• Mostly RESTful APIs, some pub/sub messaging
• Used by mobile devices, applications, partners on the outside, etc.
• Constantly evolving, rapidly improving, all the time
Part Problem. Part Opportunity.
• First API go-live:
– Millions of log events per day (grep/cut/sed/awk not cutting it)
– Logs scattered everywhere
– Limited access to logs
– Needed end-to-end visibility of web services
– Needed ability to discover information in logs
– Can we be proactive? React faster?
• Looming horizon:
– BILLIONS of log events coming
– Questions changing every day from business, support, execs, developers
Solution. Gave Splunk a Try.
• Installed Splunk on a lab server
• Hooked up Splunk to the logs
• Quickly created 15+ searches and reports
• Generated a dashboard for visibility and trending
Total time to do all this in Splunk: ~4 hours
Why Splunk?
Find What We Don’t Know
• Understand “Normal”
• Actionable events
• Identify tolerances
• Find things we didn’t know existed
Proactive
• Indicators of outliers, anomalies, percentage changes, standard deviations
Full Stack Visibility
• API gateway
• Network (load balancers, firewalls)
• Web/app
• OS
• Quick and flexible dashboards
• Drilldown
Community!
• Community (Splunkbase, blogs, etc.)
• Google-able™
• App store!
Splunk delivers us a new type of intelligence.
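The "Proactive" indicators (outliers, standard deviations) can be illustrated outside Splunk with a small Python sketch. This is not how Target or Splunk implements it: the function name, the 2-standard-deviation threshold, and the sample response times are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` standard
    deviations away from the mean of `values`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]


# Hypothetical per-minute API response times in milliseconds.
response_ms = [102, 98, 105, 99, 101, 97, 100, 480, 103, 96]
print(find_outliers(response_ms))
```

With the sample data only the 480 ms reading is flagged. The threshold is the "tolerance" knob: tightening it surfaces more events, loosening it cuts noise.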
Understanding “Normal”
• API response time SLAs
• Error code by proportion
• Overall volume of requests
• Error code by volume
All the data in one place allows us to track multiple indicators of “Normal”
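As a rough illustration of "error code by volume" and "error code by proportion", the sketch below computes both from a list of HTTP status codes. The function name and the sample codes are hypothetical; in the talk these views are Splunk dashboard panels, not Python.

```python
from collections import Counter


def status_breakdown(status_codes):
    """Return {status: (count, share_of_total)} for a list of status codes."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return {code: (n, n / total) for code, n in counts.items()}


# Hypothetical status codes pulled from an API access log.
statuses = [200, 200, 200, 404, 200, 500, 200, 200, 404, 200]
for code, (n, share) in sorted(status_breakdown(statuses).items()):
    print(f"status={code} count={n} share={share:.0%}")
```

Tracking both matters: volume can rise simply because traffic rises, while a rising proportion points to a real change in behavior.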
Better Understand Consumers
• Who is using it, and how?
• What’s their experience?
Better Understand Consumers, Part 2
Load testing in production?
Understanding Our Infrastructure
• Expected design vs actual implementation
• Not balancing workload as expected
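One way to spot "not balancing workload as expected" is to compare each server's share of requests against an even split. The sketch below assumes one host name per logged request; the host names, the 10% tolerance, and both function names are hypothetical.

```python
from collections import Counter


def load_share(hosts):
    """Return each host's share of the total request count."""
    counts = Counter(hosts)
    total = sum(counts.values())
    return {host: n / total for host, n in counts.items()}


def imbalanced(hosts, tolerance=0.10):
    """Return hosts whose share differs from an even split by more
    than `tolerance` (an absolute fraction of total traffic)."""
    shares = load_share(hosts)
    if not shares:
        return []
    expected = 1 / len(shares)
    return sorted(h for h, s in shares.items() if abs(s - expected) > tolerance)


# Hypothetical: one host field per request, as extracted from access logs.
requests = ["app1"] * 70 + ["app2"] * 20 + ["app3"] * 10
print(load_share(requests))
print(imbalanced(requests))
```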
Understanding Providers
• How are providers responding?
• Is overhead added to the API response?
Requirements Feedback Loop
• Requirement: 200 tps
• Actual: ~20 tps
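A minimal sketch of measuring actual transactions per second (tps) from log timestamps, so a stated requirement can be checked against reality. It assumes each event carries an ISO 8601 timestamp; the function name and the sample events are hypothetical, and only the 200 tps figure comes from the slide.

```python
from collections import Counter
from datetime import datetime


def observed_tps(timestamps):
    """Return (average_tps, peak_tps) for ISO 8601 event timestamps."""
    if not timestamps:
        return 0.0, 0
    times = [datetime.fromisoformat(ts) for ts in timestamps]
    # Bucket events by whole second to find the busiest second.
    per_second = Counter(t.replace(microsecond=0) for t in times)
    span = (max(times) - min(times)).total_seconds()
    average = len(times) / span if span > 0 else float(len(times))
    return average, max(per_second.values())


# Hypothetical events; real input would be timestamps extracted from API logs.
events = [
    "2012-11-28T10:15:00.100", "2012-11-28T10:15:00.400",
    "2012-11-28T10:15:00.900", "2012-11-28T10:15:01.500",
    "2012-11-28T10:15:02.050", "2012-11-28T10:15:02.100",
]
average, peak = observed_tps(events)
required_tps = 200  # the requirement quoted on the slide
print(f"average={average:.1f} peak={peak} required={required_tps}")
```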
Real-time Intelligence from APIs
• Where are people searching?
• Where should we build our next store(s)?
• How far are people traveling?
• What time of day?
• Mobile vs website?
• iOS vs Android?
• International?
Metrics for APIs
(source: http://blog.programmableweb.com/2012/08/02/the-api-measurement-secret-know-what-metrics-matter/)
• Traffic Metrics
– Total calls
– Top methods
– Call chains
– Quota faults
• Developer Metrics
– Total developer count
– # of active developers
– Top developers
– Trending apps
– Retention
• Service Metrics
– Performance
– Availability
– Error rates
– Code defects
• Marketing Metrics
– Developer registrations
– Developer portal funnel
– Traffic sources
– Event metrics
• Support Metrics
– Support tickets
– Response time
– Community metrics
• Business Metrics
– Direct revenue
– Indirect revenue
– Market share
– Costs
In progress and future stuff.
Splunking all the Things
• Consumer apps
• Provider systems
• OS, firewalls, proxies
• External API gateway logs
• Anything in between (middleware, integrations, etc.)
• Correlate with logs from apps degrees away (e.g. .com web logs)
• Development (perf test results, git, Jenkins/CI, wiki, etc.)
Dashboards
• Global dashboard summarizing all APIs
• BI dashboards
• Executive dashboards
Custom dashboards for different roles bring the right information to the right fingertips
Dashboards, Part 2
• Environment dashboards for each API
– CI
– Test
– Stage
– Prod
Dashboards, Part 3
Alert trending dashboards for each API
Splunking Continuous Integration
• Drill down into CI results linked straight from Jenkins
– Filtered by date OR transaction GUID
Splunking Continuous Integration, Part 2
• We practice code as documentation
• On every commit, Jenkins runs, extracts documentation from code, and puts it in the respective wiki pages (pretty cool! – automated / no humans)
• Splunk monitors wiki changes using the MediaWiki API
• Monitor CI + human wiki changes
https://github.com/pmotch/wikislurp
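To make "monitors wiki changes using the MediaWiki API" concrete, here is a sketch that builds a `list=recentchanges` query and turns the JSON response into flat log lines that could be indexed. It is not the wikislurp implementation, which the slides do not show. The wiki URL, the function names, and the sample payload are hypothetical; the query parameters are standard MediaWiki API parameters.

```python
import json
from urllib.parse import urlencode


def recent_changes_url(api_base, limit=50):
    """Build a MediaWiki API URL that lists recent changes as JSON."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp|comment",
        "rclimit": limit,
        "format": "json",
    }
    return f"{api_base}?{urlencode(params)}"


def to_log_lines(response_text):
    """Convert a recentchanges JSON response into timestamp-first k=v lines."""
    changes = json.loads(response_text)["query"]["recentchanges"]
    return [
        f'{c["timestamp"]} user="{c.get("user", "unknown")}" '
        f'title="{c["title"]}" type={c.get("type", "unknown")}'
        for c in changes
    ]


# Hypothetical wiki endpoint and a hand-written sample response.
url = recent_changes_url("https://wiki.example.com/api.php", limit=10)
sample = json.dumps({"query": {"recentchanges": [
    {"type": "edit", "ns": 0, "title": "Products API", "user": "jenkins",
     "timestamp": "2012-11-28T10:15:00Z", "comment": "auto-generated docs"},
]}})
print(url)
print(to_log_lines(sample))
```

Keeping the `user` field is what would allow CI-authored edits to be told apart from human ones, assuming the CI job edits the wiki under its own account.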
Common Logging Service
• CLS is our strategy for getting logs from all places into Splunk
• How
– Use UFs (Splunk Universal Forwarders) on end points everywhere
– Else, consolidate and mount Splunk
– Else, use CLS RESTful API
• Enables end-to-end visibility
– Insert GUIDs across all the hops in the transaction
• Use out of the box log formats (e.g. Log4j)
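The "insert GUIDs across all the hops" idea can be sketched as follows: the first hop mints a transaction GUID, every later hop reuses it, and each hop logs it so a single search ties the transaction together. The header name `X-Transaction-GUID` and both helper functions are assumptions for illustration, not the CLS API.

```python
import uuid

HEADER = "X-Transaction-GUID"  # hypothetical header name


def ensure_guid(headers):
    """Return a copy of `headers` carrying a transaction GUID,
    minting a new one only if none is present."""
    out = dict(headers)
    out.setdefault(HEADER, str(uuid.uuid4()))
    return out


def log_hop(hop, headers, message):
    """Format a flat k=v log line that starts with the transaction GUID."""
    return f'guid={headers[HEADER]} hop={hop} msg="{message}"'


# First hop (gateway) mints the GUID; the next hop (provider) reuses it.
incoming = ensure_guid({"Accept": "application/json"})
gateway_line = log_hop("gateway", incoming, "request received")
outgoing = ensure_guid(incoming)
provider_line = log_hop("provider", outgoing, "lookup complete")
print(gateway_line)
print(provider_line)
```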
Best Practice Advice
Lessons
• RTFM
– Keep logs flat
– Keep timestamp (ISO8601) at the beginning
– k=v
• Iterate quickly, push to prod; minimal tweaks to Splunk
• Flatten out of box audit events (XML)
– Toggle at runtime
• Don’t re-invent the wheel, use what your system provides, Splunk can handle it!
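The logging lessons (flat lines, ISO 8601 timestamp first, k=v pairs, flattened XML audit events) can be sketched in a few lines of Python. The function names, the underscore-joined key scheme, and the sample audit event are hypothetical; the slides do not show Target's actual formats.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def kv_line(timestamp, fields):
    """Format one flat log line: ISO 8601 timestamp first, then k=v pairs."""
    parts = [timestamp.isoformat()]
    for key, value in fields.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'  # quote values that contain spaces
        parts.append(f"{key}={text}")
    return " ".join(parts)


def flatten_xml(xml_text):
    """Flatten an XML event into {path_joined_by_underscores: text}
    for every leaf element."""
    flat = {}

    def walk(node, prefix):
        path = f"{prefix}_{node.tag}" if prefix else node.tag
        children = list(node)
        if not children:
            flat[path] = (node.text or "").strip()
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return flat


# Hypothetical XML audit event.
audit_xml = ("<audit><user>svc-api</user><action>toggle</action>"
             "<result><code>200</code></result></audit>")
ts = datetime(2012, 11, 28, 10, 15, 0, tzinfo=timezone.utc)
print(kv_line(ts, flatten_xml(audit_xml)))
```

Lines in this shape need little or no custom parsing configuration, which fits the "minimal tweaks to Splunk" lesson.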
Lessons, Part 2
• Don’t pre-optimize up front
– Governance
– Standards
– Alerting
– Access controls
• Optimize as needed
Lessons, Part 3
• Create a community
Lessons, Part 4
• Create best practices, standards, etc. in a wiki
Challenges: Organizational
• “Stop. We already have tools that do this. Use those.”
– tgtMAKE saves the day
– tgtMAKE = R&D
– R&D = $, servers, flak shelter, people network
• “Make it real” strategy
– Demo to as many key players as possible
– Drum up interest
– Show actual value
Challenges: Organizational, Part 2
• The data can’t be trusted?
Recap
Be bold. Tooling matters. Sell it.
Splunk all the things!
Iterate, adapt, change quickly.
We’re hiring
(come talk to me)
Resources
• Speaker emails: dan.cundiff AT target.com, ljoshi AT splunk.com
• Splunk download: www.splunk.com/goto/download
• Splunk Storm SaaS Service: www.splunkstorm.com/
Thank You