Splunk All the Things: Our First 3 Months Monitoring Web Service APIs - Splunk .conf2012

Splunk All the Things: Our First 3 Months Monitoring Web Service APIs

Dan Cundiff (@pmotch) and Eric Helgeson (@nulleric)

Target Corporation

Agenda

Context

Problem

Solution

Examples

In progress and future stuff

Lessons and challenges

Context: Enterprise Services @ Target

Data and transactional APIs for all the domains in our business
– Products (inventory, price, description, etc.)
– Locations
– Coupons
– etc.

APIs exposed inside and outside
Mostly RESTful APIs, some pub-sub/messaging
Used by mobile devices, applications, partners on the outside, etc.
Constantly evolving, rapidly improving, all the time

Problem

First API go-live:
– Millions of log events per day (grep/cut/sed/awk not cutting it)
– Logs scattered everywhere
– Limited access to logs
– Needed end-to-end visibility of web services
– Needed ability to discover information in logs
– Can we be proactive? React faster?

Looming horizon:
– BILLIONS of log events coming
– Questions changing every day from business, support, execs, developers

Solution: Gave Splunk a try

Installed Splunk on a lab server
Hooked up Splunk to the logs
Quickly created 15+ searches and reports
Generated a dashboard for visibility and trending
Total time to do all this in Splunk: ~4 hours

Why Splunk

Understanding what’s “normal”
– Identify tolerances
– Identify actionable events vs. anomalies

You don’t know what you don’t know
– …but Splunk can tell you what you don’t know

Why Splunk, part 2

Indicators of when things are trending badly
– Proactive monitoring and recovery
– Standard deviations, percentage changes over time, outliers

Full stack visibility
– API gateway
– Network (load balancers, firewalls)
– Web/app
– OS
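The "standard deviations, percentage changes over time, outliers" bullet can be illustrated outside Splunk with a few lines of Python. This is a minimal sketch, not the searches used in the talk: the function names and the threshold of 3 standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def flag_outliers(counts, threshold=3.0):
    """Return (index, value, z_score) for every point that lies more than
    `threshold` sample standard deviations away from the mean."""
    if len(counts) < 2:
        return []
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [
        (i, c, (c - mu) / sigma)
        for i, c in enumerate(counts)
        if abs(c - mu) / sigma > threshold
    ]


def percent_change(previous, current):
    """Percentage change from one interval to the next."""
    if previous == 0:
        raise ValueError("previous must be non-zero")
    return (current - previous) / previous * 100.0
```

Feeding `flag_outliers` a series of per-minute event counts gives a rough "is this normal?" signal; the threshold would need tuning against real traffic.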

Why Splunk, part 3

Quick and flexible dashboards
Drill down
Community (Splunkbase, blogs, etc.)
Google-able™
App store!

Locations Service Examples

What is “normal”?

Volume

What is “normal”?, part 2

API response time SLAs
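One common way to check a response-time SLA is to compare a high percentile of observed latencies against the agreed limit. The sketch below uses the nearest-rank percentile; the function names, the 95th percentile default, and the millisecond units are assumptions, not details from the talk.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def sla_report(response_ms, sla_ms, pct=95):
    """Compare the pct-th percentile response time against an SLA limit."""
    observed = percentile(response_ms, pct)
    return {
        "percentile": pct,
        "observed_ms": observed,
        "sla_ms": sla_ms,
        "met": observed <= sla_ms,
    }
```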

What is “normal”?, part 3

Errors happen, but what is acceptable?

404s

~1700 errors once a day every week
404s for stores that don’t exist
Bot?
– Who are they?
– Malicious? Competitor? Individual?
– Reach out to understand why
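Answering "who are they?" starts with grouping the 404s by caller. A minimal sketch, assuming each log event has already been parsed into a dict with hypothetical `client` and `status` fields:

```python
from collections import Counter


def top_404_clients(events, limit=5):
    """Count 404 responses per client and return the heaviest offenders.

    `events` is an iterable of dicts; the `client` and `status` keys are
    assumed field names, not ones taken from the talk.
    """
    counts = Counter(
        e.get("client", "unknown") for e in events if e.get("status") == 404
    )
    return counts.most_common(limit)
```

A single client accounting for most of the 404s on a fixed schedule would point toward a bot or a misconfigured integration rather than ordinary users.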

Understanding consumers

Who is using it, and how?
What’s their experience?

Understanding consumers, part 2

Load testing in production?

Understanding infrastructure

Expected design vs actual implementation
Not balancing workload as expected

Understanding providers

How are providers responding?
Is overhead added to the API response?

Requirements feedback loop

Requirement: 200 tps
Actual: ~20 tps
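Comparing a stated requirement with reality means measuring transactions per second from the logs. A rough sketch, assuming one ISO 8601 timestamp per request has been extracted from the log events (the function name is hypothetical):

```python
from collections import Counter
from datetime import datetime


def peak_tps(timestamps):
    """Return the largest number of requests seen in any one-second bucket.

    `timestamps` is an iterable of ISO 8601 strings, one per request.
    """
    per_second = Counter(
        datetime.fromisoformat(ts).replace(microsecond=0) for ts in timestamps
    )
    return max(per_second.values(), default=0)
```

If the observed peak sits an order of magnitude below the requirement, as on this slide, that is worth feeding back before sizing infrastructure for the larger number.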

Business intelligence from APIs

Where are people searching?
Where should we build our next store?
How far are people traveling?
What time of day?
Mobile vs website?
iOS vs Android?
International?
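"How far are people traveling?" can be approximated from a search location and the store location returned by the API. The sketch below uses the haversine great-circle formula; it assumes both points are available as latitude/longitude pairs, which the talk does not state, and it measures straight-line rather than driving distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to guard against tiny floating-point overshoot above 1.0.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```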

Metrics for APIs
(source: http://blog.programmableweb.com/2012/08/02/the-api-measurement-secret-know-what-metrics-matter/)

Traffic Metrics
– Total calls
– Top methods
– Call chains
– Quota faults

Developer Metrics
– Total developer count
– Number of active developers
– Top developers
– Trending apps
– Retention

Service Metrics
– Performance
– Availability
– Error rates
– Code defects

Marketing Metrics
– Developer registrations
– Developer portal funnel
– Traffic sources
– Event metrics

Support Metrics
– Support tickets
– Response time
– Community metrics

Business Metrics
– Direct revenue
– Indirect revenue
– Market share
– Costs

In progress and future stuff

Splunk all the things

Consumer apps
Provider systems
OS, firewalls, proxies
External API gateway logs
Anything in between (middleware, integrations, etc.)
Correlate with logs from apps degrees away (e.g. .com web logs)
Development (perf test results, git, Jenkins/CI, wiki, etc.)

Dashboards

Global dashboard summarizing all APIs
BI dashboards
Executive dashboards

Dashboards, part 2

Environment dashboards for each API
– CI
– Test
– Stage
– Prod

Dashboards, part 3

Alert trending dashboards for each API

Splunking Continuous Integration

Drill down into CI results linked straight from Jenkins
– Filtered by date OR transaction GUID

Splunking Continuous Integration, part 2

We practice code as documentation
On every commit, Jenkins runs, extracts documentation from the code, and puts it in the respective wiki pages (pretty cool! – automated / no humans)
Splunk monitors wiki changes using the MediaWiki API
Monitor CI + human wiki changes

https://github.com/pmotch/wikislurp
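To show roughly what monitoring wiki changes through the MediaWiki API involves, here is a minimal sketch. It is not taken from the wikislurp repository: the function names and the output format are assumptions. It builds a `list=recentchanges` query URL and turns the JSON response into flat k=v lines that Splunk could index; the HTTP call itself is left out so the example runs without a wiki.

```python
import json
from urllib.parse import urlencode


def recent_changes_url(api_base, limit=50):
    """Build a MediaWiki API URL that lists recent changes as JSON.

    `api_base` is the wiki's api.php endpoint (supplied by the caller).
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|user|timestamp|comment",
        "rclimit": limit,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def changes_to_log_lines(response_text):
    """Convert a recentchanges JSON response into timestamp-first k=v lines."""
    data = json.loads(response_text)
    changes = data.get("query", {}).get("recentchanges", [])
    lines = []
    for change in changes:
        lines.append(
            '{ts} title="{title}" user="{user}" comment="{comment}"'.format(
                ts=change.get("timestamp", ""),
                title=change.get("title", ""),
                user=change.get("user", ""),
                comment=change.get("comment", "").replace('"', "'"),
            )
        )
    return lines
```

Because Jenkins and people both edit the wiki, the `user` field is what lets a search separate automated documentation updates from human changes.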

Common Logging Service

CLS is our strategy for getting logs from all places into Splunk
How
– Use UFs (Universal Forwarders) on end points everywhere
– Else, consolidate and mount Splunk
– Else, use CLS RESTful API

Enables end-to-end visibility
– Insert GUIDs across all the hops in the transaction

Use out of the box log formats (e.g. Log4j)
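Inserting a GUID across all the hops usually means generating it at the first hop, passing it along in a header, and logging it everywhere. The sketch below illustrates that idea only; the header name `X-Transaction-GUID` and both function names are hypothetical, not part of CLS as described in the talk.

```python
import uuid

# Hypothetical header name, chosen for illustration only.
CORRELATION_HEADER = "X-Transaction-GUID"


def ensure_transaction_guid(headers):
    """Return a copy of `headers` that carries a transaction GUID.

    An inbound GUID is reused so every hop logs the same value; a new
    one is generated only at the first hop.
    """
    out = dict(headers)
    if not out.get(CORRELATION_HEADER):
        out[CORRELATION_HEADER] = str(uuid.uuid4())
    return out


def log_hop(hop, headers, message):
    """Format a k=v log line that includes the transaction GUID."""
    return 'guid={} hop={} msg="{}"'.format(
        headers[CORRELATION_HEADER], hop, message
    )
```

Searching for a single `guid=` value then returns the gateway, web/app, and provider events for one transaction together.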

Lessons and challenges

Lessons

RTFM
– Keep logs flat
– Keep timestamp (ISO 8601) at the beginning
– k=v

Iterate quickly, push to prod; minimal tweaks to Splunk
Flatten out-of-the-box audit events (XML)
– Toggle at runtime

Don’t re-invent the wheel, use what your system provides, Splunk can handle it!
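The lessons above (flat logs, ISO 8601 timestamp first, k=v pairs, flattened XML audit events) can be combined in one small sketch. The element names in the usage note and the dotted-key convention are assumptions; the real audit event schema is not shown in the talk.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def flatten_xml(element, prefix=""):
    """Yield (key, value) pairs for attributes and leaf elements,
    using dotted keys such as `audit.request.status`."""
    name = prefix + element.tag
    for attr, value in element.attrib.items():
        yield name + "." + attr, value
    children = list(element)
    if not children:
        text = (element.text or "").strip()
        if text:
            yield name, text
    for child in children:
        yield from flatten_xml(child, name + ".")


def to_flat_line(xml_text, when):
    """Render an XML event as one flat line: ISO 8601 timestamp, then k=v pairs."""
    root = ET.fromstring(xml_text)
    pairs = " ".join(
        '{}="{}"'.format(key, value.replace('"', "'"))
        for key, value in flatten_xml(root)
    )
    return when.isoformat() + " " + pairs
```

For example, `to_flat_line('<audit id="42"><user>dan</user></audit>', datetime.now(timezone.utc))` yields a single line starting with the timestamp followed by `audit.id="42" audit.user="dan"`. Repeated sibling tags would produce duplicate keys, so a real implementation might add an index to the key.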

Lessons, part 2

Don’t pre-optimize up front
– Governance
– Standards
– Alerting
– Access controls

Optimize as needed

Lessons, part 3

Create a community

Lessons, part 4

Create best practices, standards, etc. in a wiki

Challenges: Organizational

“Stop. We already have tools that do this. Use those.”
– tgtMAKE saves the day
– tgtMAKE = R&D
– R&D = $, servers, flak shelter, people network

“Make it real” strategy
– Demo to as many key players as possible
– Drum up interest
– Show actual value

Challenges: Organizational, part 2

http://knowyourmeme.com/photos/361379-shut-up-and-take-my-money

Challenges: Organizational, part 3

The data can’t be trusted?

Challenges: OS

RHEL 6
SELinux
Ipfw
Install notes: http://nulleric.tumblr.com/post/13855621770/splunk-on-redhat-6-install-notes

Challenges: Infrastructure

VM requirement
Adhering to MDHA requirements
Universal Forwarder skepticism

Challenges: Logs on the outside

Universal Forwarders on servers that we don’t manage
Firewalls
Multi-layered DMZs

Challenges: Splunk…

Challenges: Splunk (err, improvements)

Index improvements
– Cheap servers, can fail, can expand
– Replication, N=3
– Replicas on N-1 subsequent nodes
– Data is always available, smooth out across servers if they go down or expand
– Multi-tenant
– Think OpenStack Swift “Ring” concept or Cassandra
– There’s that CAP Theorem thing; they say it’s a big deal.

GUI for deployment client configurations (lazy and for n00bs, we know)
Ability to extend charts with other libraries (like D3 or something)
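The index wish list ("Replication, N=3" with "replicas on N-1 subsequent nodes") describes ring-style placement. The sketch below is a generic consistent-hashing illustration of that idea, not how Splunk, OpenStack Swift, or Cassandra actually implement it; the class name, node names, and bucket ids are all hypothetical.

```python
import hashlib
from bisect import bisect_right


class Ring:
    """Place each bucket on a primary node plus the next N-1 nodes on a hash ring."""

    def __init__(self, nodes, replicas=3):
        if replicas > len(nodes):
            raise ValueError("need at least as many nodes as replicas")
        self.replicas = replicas
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, bucket_id):
        """Return the primary node and the N-1 subsequent nodes for a bucket."""
        start = bisect_right(self._keys, self._hash(bucket_id)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replicas)
        ]

    def available_nodes_for(self, bucket_id, down):
        """Return the replicas that are still reachable when `down` nodes fail."""
        return [n for n in self.nodes_for(bucket_id) if n not in down]
```

With N=3, any single node failure still leaves two copies of every bucket, which is the "data is always available" property the slide asks for.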

Be bold. Tooling matters. Sell it.
Splunk all the things!
Iterate, adapt, change quickly.

We’re hiring (come talk to us)

Questions?
