[@IndeedEng] Logrepo: Enabling Data-Driven Decisions
Transcript of [@IndeedEng] Logrepo: Enabling Data-Driven Decisions
go.indeed.com/IndeedEngTalks
Logrepo: Enabling Data-Driven Decisions
Jeff Chien, Software Engineer, Indeed Apply Team
Scale
More job searches worldwide than any other employment website.
● Over 100 million unique users
● Over 3 billion searches per month
● Over 24 million jobs
● Over 50 countries
● Over 28 languages
I help people get jobs.
1. Search
2. View job
3. Click “Apply Now”
4. Submit application
Job seeker flow using Indeed Apply
Knowing how users interact with our system
helps us make better products
[chart: likelihood of applying to a job, “have to upload a resume” vs. “have Indeed Resume”]
We Have Questions
● What percentage of applications use Indeed resumes?
● How many searches for “java” in “Austin”?
● How often are resumes edited?
● How long does it take to aggregate jobs?
How many applications … to jobs from CareerBuilder … by job seekers who searched for “java” in “Austin” … used an Indeed resume?
Is the percentage different on mobile compared to web?
How much has this changed in 2011 compared to 2014?
Complicated Questions
More Information
Better Decisions
More information
Need to log events
● job searches
● clicks
● applies
What to log
Client information - unique user identifier, user agent, IP address…
User behavior - clicks, alert signups…
Performance - backend request duration, memory usage...
A/B test groups - control and test groups
Better decisions
Use empirical data to make decisions
Not based on assumptions or the highest-paid person’s opinion!
Objective
Collect data on user actions and system performance from many different applications in multiple data centers
How we build systems
Simple
Fast
Resilient
Scalable
Simple
Easy interface
Reuse familiar technologies
Fast
No impact to runtime performance
Data available soon
Resilient
Does not lose data in spite of system or network failures
Scalable
Can handle large quantities of data
Requirements
● Powerful enough to express diverse data
● Store all data forever
● Events stored at least once
● Easy to add new data to logs
● Easy to access logs in bulk
● Time range based access
Non-Goals
Random access to individual events
Real time access to events
Complex data types
Logrepo: A distributed event logging system
Est. 2006
Logrepo stores log entries
Everything is a string
Key/value pairs
URL-encoded
Organic click log entry
uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http%3A%2F%2Fwww.indeed.com%2Frc%2Fclk&href=http%3A%2F%2Fwww.indeed.com%2Fjobs%3Fq%3D%26l%3DNewburgh%252C%2BNY%26start%3D20&agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%3B+rv%3A26.0%29+Gecko%2F20100101+Firefox%2F26.0&raddr=173.50.255.255&ckcnt=17&cksz=1033&ctk=18dtbc6960nk20vd&ctkRcv=1&&
URL-decoded organic click log entry
uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&jobId=500&onclick=1&avgCmpRtg=2.9&url=http://www.indeed.com/rc/clk&href=http://www.indeed.com/jobs?q=&l=Newburgh%2C+NY&start=20&agent=Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0&...
Advantages
● Human-readable
● Arbitrary keys
● Low overhead to add new key/value pairs
● Self-describing
● Easy to parse in any language
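As a quick illustration of that last point, a minimal parser needs only string splitting and URL decoding. This is a hypothetical sketch, not Indeed's client library:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for logrepo's URL-encoded key/value format:
// split on '&', then on the first '=', URL-decoding each value.
public class LogEntryParser {
    public static Map<String, String> parse(String entry) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : entry.split("&")) {
            if (pair.isEmpty()) continue;          // tolerate trailing "&&"
            int eq = pair.indexOf('=');
            if (eq < 0) continue;
            String key = pair.substring(0, eq);
            String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            fields.put(key, value);
        }
        return fields;
    }
}
```

Because the format is just percent-encoded pairs, the same few lines translate directly into python, php, or pig.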
Required log entry keys
Every log entry has uid and type
Type is an arbitrary string
uid=18dtbolr20nk23qh&type=orgClk&...
UID format
uid=18ducm8u50nk23qh&type=jobsearch&...
UID is always the first key
Unique
16 characters
Base 32 [0-9a-v]
uid=18ducm8u50nk23qh
Date = 2014-01-10, Time = 09:35:24.357
Server id = 1512
App instance id = 2
UID Version = 0
Random value = 3921
UID breakdown
UID generation
Unique IDs are unique
Random value avoids UID collisions
Random value is between 0 and 8191
Up to 8000 events per application instance per millisecond
UID format benefits
Contains useful metadata
Compact format reduces memory requirements
Easy to compare or sort events by time
Job seeker events
1. Search for jobs
2. Click on job
3. Apply to job
All events are part of the same flow
Parent-child relationships between events
Events can reference other events with &tk=18ducm8u50nk23qh...
Children know their parents
Parents don’t know their children
Extremely powerful model
Parent-child relationships between events
An organic click points to the search it occurred on
uid=18dtbnn3p0nk20g9&type=jobsearch&v=0&...
uid=18dtbolr20nk23qh&type=orgClk&v=0&tk=18dtbnn3p0nk20g9&...
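On the consumer side, this model makes child lookups a simple grouping step. The helper below is illustrative (names are hypothetical, not Indeed's API): it groups parsed child events under their parent's UID via the tk key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: children know their parents via tk, so one pass
// builds a parent-UID -> children index for parsed entries.
public class EventTree {
    public static Map<String, List<Map<String, String>>> childrenByParent(
            List<Map<String, String>> entries) {
        Map<String, List<Map<String, String>>> children = new HashMap<>();
        for (Map<String, String> entry : entries) {
            String parentUid = entry.get("tk");    // absent on root events
            if (parentUid != null) {
                children.computeIfAbsent(parentUid, k -> new ArrayList<>()).add(entry);
            }
        }
        return children;
    }
}
```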
More jobsearch child events
Sponsored job clicks
Javascript errors
Job alert signups
And many more...
uid=18en3o3ov16r25rp&type=viewjob&...
Job seeker views a job
[diagram: job view (18en3o3ov16r25rp) → load IndeedApply → user submission → post to employer]
Indeed Apply loads
uid=18en3o3s216ph6d5&type=loadJs&vjtk=18en3o3ov16r25rp&...
[diagram: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission → post to employer]
Prepare job application
uid=18en3qe0u16pi5ct&type=appSubmit&loadJsTk=18en3o3s216ph6d5&...
[diagram: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission (18en3qe0u16pi5ct) → post to employer]
POST /apply HTTP/1.1
Host: employer.com

{
  "applicant": {
    "name": "John Doe",
    "email": "[email protected]",
    "phone": "555-555-5555"
  },
  "jobTitle": "Software Engineer"
  ...
}
Submit job application
uid=18en3qe2r0nji3h6&type=postApp&appSubmitTk=18en3qe0u16pi5ct&...
[diagram: job view (18en3o3ov16r25rp) → load IndeedApply (18en3o3s216ph6d5) → user submission (18en3qe0u16pi5ct) → post to employer (18en3qe2r0nji3h6)]
Javascript latency ping
At start of page load, browser executes js to ping Indeed
Server receives the ping and logs an event
Parent job search and child js latency ping
uid=18dqpc3lm16pi2an&type=jobsearch&...
uid=18dqpc3s516pi566&type=lat&tk=18dqpc3lm16pi2an
Latency = 1389247205253 - 1389247205046 = 207 ms
Approximates perceived latency to jobseeker
uid timestamp Jan 9, 2014 00:00:05.253
tk timestamp Jan 9, 2014 00:00:05.046
Subtracting UID timestamps yields duration
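Because UIDs lead with a time component, durations fall out of plain subtraction. The sketch below assumes the first nine base-32 characters [0-9a-v] hold milliseconds since the Unix epoch; that layout is inferred from the sample UIDs in this deck, not a published spec:

```java
// Hypothetical decoder for the UID's leading timestamp.
// Assumption: the first 9 base-32 chars encode ms since the Unix epoch.
public class UidTime {
    public static long timestampMillis(String uid) {
        long ms = 0;
        for (int i = 0; i < 9; i++) {
            ms = ms * 32 + Character.digit(uid.charAt(i), 32);  // '0'-'9','a'-'v' -> 0-31
        }
        return ms;
    }

    // Perceived latency = ping UID time minus parent jobsearch UID time.
    public static long durationMillis(String childUid, String parentUid) {
        return timestampMillis(childUid) - timestampMillis(parentUid);
    }
}
```

Applied to the latency-ping pair above, this reproduces the 207 ms figure.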
West coast perceived latency in California vs. Washington
Writing log entries from apps
LogEntry entry = factory.createLogEntry("search");
entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...
entry.commit();
Creating a log entry
LogEntry entry = factory.createLogEntry("search");
Creates a log entry with UID and type set
UID timestamp tied to createLogEntry() call
Populating a log entry
entry.setProperty("q", query);
entry.setProperty("acctId", accountId);
entry.setProperty("time", elapsedMillis);
// ...
Lists
Separate values with commas
String groups = "foo,bar,baz";
logEntry.setProperty("grps", groups);
// uid=...&grps=foo%2Cbar%2Cbaz&...
Lists of Tuples
Encapsulate each tuple in parentheses
Comma-separate elements within tuple
// Two jobs with (job id, score)
String jobs = "(123,1.0)(400,0.8)";
logEntry.setProperty("jobs", jobs);
// uid=...&jobs=%28123%2C1.0%29%28400%2C0.8%29&...
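A hypothetical consumer-side reader for this convention (not part of Indeed's library) only needs a small regular expression over the decoded value:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader for the "(a,b)(a,b)" list-of-tuples convention.
public class TupleList {
    private static final Pattern TUPLE = Pattern.compile("\\(([^,()]*),([^()]*)\\)");

    public static List<String[]> parse(String value) {
        List<String[]> tuples = new ArrayList<>();
        Matcher m = TUPLE.matcher(value);
        while (m.find()) {
            tuples.add(new String[] { m.group(1), m.group(2) });
        }
        return tuples;
    }
}
```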
Committing a log entry
After log entry is fully populated...
entry.commit();
Jason Koppe, System Administrator
I engineer systems that help people get jobs.
Before logrepo
log4j - Java logging framework
● Code - what
● Configuration - define what goes where
● Appender - where (file, smtp)
http://logging.apache.org/log4j/1.2/
Reusing log4j for logrepo
Redundancy from the start
Write to local disk (FileAppender)
Write to remote server #1 (? Appender)
Write to remote server #2 (? Appender)
Writing to a remote server
syslog
Protocol for transporting messages across an IP network
Est. 1980s
http://tools.ietf.org/html/rfc5424
Using log4j with syslog
Out-of-the-box, log4j only supported UDP syslog
UDP could result in data loss
Avoiding data loss
TCP guarantees data transfer
Use TCP!
SyslogTcpAppender
● created by Indeed
● TCP-enabled log4j syslog Appender
● buffers messages before transport
Resilient for short network and syslog server downtimes
Creating a reliable Appender
Choosing a syslog daemon
syslog-ng
syslog daemon which supports TCP
Est. 1998
http://www.balabit.com/network-security/syslog-ng
Redundancy with log4j
Write to local disk (FileAppender)
Write to remote server #1 (SyslogTcpAppender)
Write to remote server #2 (SyslogTcpAppender)
Redundancy over TCP
Each syslog-ng server
receives unsorted log entries
immediately flushes entries to files on disk called raw logs
Quick redundancy over TCP
Optimized for redundancy
raw logs are probably out-of-order
each app writes to syslog independently
Optimize for read access patterns
LogRepositoryBuilder (“Builder”)
● sort
● deduplicate
● compress
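The sort and deduplicate steps can be sketched in a few lines (a hypothetical single-pass version, not Indeed's implementation; compression is omitted):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of the Builder's sort + dedupe: raw lines start with "uid=...",
// sorting by UID also sorts by time, and a duplicated UID (the same event
// received twice for redundancy) collapses to a single entry.
public class SegmentBuilder {
    public static List<String> sortAndDedupe(List<String> rawLines) {
        TreeMap<String, String> byUid = new TreeMap<>();
        for (String line : rawLines) {
            String uid = line.substring(4, line.indexOf('&'));  // value of leading uid=
            byUid.putIfAbsent(uid, line);
        }
        return new ArrayList<>(byUid.values());
    }
}
```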
Builder architecture
Builder creates segment files
uid=15mt000000k1&type=orgClk&v=1&k=4...
uid=15mt000010k7&type=orgClk&v=1&k=3...
uid=15mt000020k8&type=orgClk&v=1&k=2...
uid=15mt000030ss&type=orgClk&v=1&k=9...
Repeated strings compress well
Compresses by 85%
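The figure is easy to believe: DEFLATE (the algorithm behind gzip) exploits exactly this kind of repetition. A small self-contained demonstration using java.util.zip:

```java
import java.util.zip.Deflater;

// Deflates a byte array and returns the compressed size, to show how well
// near-identical log entry lines compress.
public class CompressionDemo {
    public static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A segment-like blob: the same key/value skeleton repeated 1000 times.
        byte[] raw = "uid=15mt000000k1&type=orgClk&v=1&k=4\n".repeat(1000).getBytes();
        System.out.printf("raw=%d bytes, deflated=%d bytes%n", raw.length, deflatedSize(raw));
    }
}
```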
Archive directory structure
/orgClk/15mt/0.log4181.seg.gz
● orgClk - logentry type
● 15mt - 4-char UID prefix, base 32 (~9.3 hour time period)
● 15mt0 - 5-char UID prefix, base 32 (~17 minute time period)
● 4181 - unique number; supports more than 1 segment file per type per 5-char UID prefix
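As an illustration, a hypothetical helper (not Indeed's code) can derive the directory-plus-prefix part of a segment path from an entry's type and UID; the trailing number is assigned at build time, so it cannot be derived from the entry itself:

```java
// Hypothetical mapping of (type, uid) to the archive location prefix,
// following the /type/<4-char prefix>/<5th char>.log<n>.seg.gz layout.
public class ArchivePath {
    public static String segmentPrefix(String type, String uid) {
        return "/" + type + "/" + uid.substring(0, 4) + "/" + uid.charAt(4);
    }
}
```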
Multiple segment files
Keep Builder memory usage fixed
When Builder memory fills, it flushes to disk
Each flush creates files for 5-char UID prefix
Builder creates the archive
Redundancy
Ensure archive consistency
● Delayed Builder on second server
● Add new segment files for log entries missed by first Builder
● Causes multiple segment files for a 5-char UID prefix
Providing access to logrepo
LogRepositoryReader (“Reader”)
● simple request protocol
● reads from (multiple) segment files
● provides sorted stream of entries to TCP client as quickly as possible
Reader request protocol
1. Start time
2. End time
3. Logrepo type
Reader request using netcat
$ echo 1295905740000 1295913600000 orgClk \
  | nc 192.168.0.1 9999
● 1295905740000 - start time (ms since 1970-01-01, the start of Unix time)
● 1295913600000 - end time (ms since 1970-01-01)
● orgClk - logrepo type
● nc - sends the request across a TCP session
UID-sorted results:
uid=15mt00l710k3262q&type=orgClk&v=0&...
uid=15mt00l780k137d9&type=orgClk&v=0&...
...
uid=15mt7ggvj142h06k&type=orgClk&v=0&...
Reading entries from archive
1. Isolate to the type directory
1295905740000 1295913600000 orgClk
Reading entries from archive
2. Convert request timestamps to UID prefix
uidPrefixFromTime(1295905740000) = 15mt0
uidPrefixFromTime(1295913600000) = 15mt7
1295905740000 1295913600000 orgClk
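Step 2 can be sketched as follows, under the assumption (consistent with the example values on this slide) that a UID's first nine base-32 characters encode milliseconds since the Unix epoch:

```java
// Sketch of uidPrefixFromTime: encode millis as the UID's 9-char base-32
// timestamp, then keep the leading characters (4 or 5 in practice).
public class UidPrefix {
    public static String uidPrefixFromTime(long millis, int chars) {
        StringBuilder sb = new StringBuilder(9);
        for (int i = 8; i >= 0; i--) {
            sb.append(Character.forDigit((int) ((millis >>> (5 * i)) & 31), 32));
        }
        return sb.substring(0, chars);
    }
}
```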
Reading entries from archive
3. Find segments matching first UID prefix
$ ls orgClk/15mt/0*
orgClk/15mt/0.log3094.seg.gz
orgClk/15mt/0.log4181.seg.gz
1295905740000 1295913600000 orgClk
15mt0
Reading entries from archive
4. Read sorted segments simultaneously, merge into a single sorted stream
/orgClk/15mt/0.log3094.seg.gz:
  uid=15mt000080g1i0j5&type=orgClk&...
  uid=15mt00l780k137d9&type=orgClk&...
/orgClk/15mt/0.log4181.seg.gz:
  uid=15mt00l710k3262q&type=orgClk&...
  uid=15mt00l790k1i2rs&type=orgClk&...
1295905740000 1295913600000 orgClk
Reading entries from archive
4. Read sorted segments simultaneously, merge into a single sorted stream
uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...
1295905740000 1295913600000 orgClk
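The merge in step 4 is a standard k-way merge; a hypothetical version (not Indeed's Reader code) uses a priority queue keyed on each stream's current line, which starts with uid= and therefore sorts by time:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of a k-way merge of already-sorted segment streams into one
// sorted stream of log entry lines.
public class SegmentMerger {
    public static List<String> merge(List<Iterator<String>> segments) {
        // Each heap element holds a stream's current line and the stream itself.
        PriorityQueue<Object[]> heap =
                new PriorityQueue<>(Comparator.comparing((Object[] e) -> (String) e[0]));
        for (Iterator<String> segment : segments) {
            if (segment.hasNext()) heap.add(new Object[] { segment.next(), segment });
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Object[] head = heap.poll();
            merged.add((String) head[0]);
            @SuppressWarnings("unchecked")
            Iterator<String> segment = (Iterator<String>) head[1];
            if (segment.hasNext()) heap.add(new Object[] { segment.next(), segment });
        }
        return merged;
    }
}
```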
Reading entries from archive
5. Only return log entries between timestamps
uid=15mt000080g1i0j5&type=orgClk&...
uid=15mt00l710k3262q&type=orgClk&...
uid=15mt00l780k137d9&type=orgClk&...
uid=15mt00l790k1i2rs&type=orgClk&...
1295905740000 1295913600000 orgClk
Reading entries from archive
6. Read segments for each UID prefix, one prefix at a time
1295905740000 1295913600000 orgClk
15mt0, 15mt1, 15mt2, 15mt3, 15mt4, 15mt5, 15mt6, 15mt7
Reading entries from archive
7. Stop reading files when entry crosses request boundary
1295905740000 1295913600000 orgClk
The first years (2007 & 2008)
● Single datacenter
● App servers
● 2 logrepo servers
● syslog-ng
● Builder
● Reader
Growth
● job seekers
● products
● datacenters
● log entries
Multi-datacenter rationale
● Latency
● Redundancy
● Job seekers
Logrepo in multiple datacenters
Single datacenter
● Consumers
● Reader
Every datacenter
● Applications producing log entries
● 2 syslog servers
● Builders (minimize Internet traffic)
Single datacenter archival
/dc1/orgClk/15mt/0.log4181.seg.gz
● orgClk - event type (orgClk means organic search result click)
● 15mt0 - 25-bit timestamp prefix, base 32 (~17-minute time period)
● 4181 - random number
Multiple datacenter archival
/dc1/orgClk/15mt/0.log4181.seg.gz
● dc1 - datacenter
● orgClk - event type
● 15mt0 - 25-bit timestamp prefix, base 32 (~17-minute time period)
● 4181 - random number
Datacenter dirs avoid collisions
~$ ls */orgClk/15mt/0*
dc1/orgClk/15mt/0.log1481.seg.gz
dc3/orgClk/15mt/0.log1481.seg.gz
Different datacenters, same segment filename
Independent Builders
uid=18ducm8u50nk23qh
Date = 2014-01-10, Time = 09:35:24.357
Server id = 1512
App instance id = 2
UID Version = 0
Random value = 3921
UID breakdown
Using server ID for uniqueness
Each datacenter gets 256 server IDs
1. DC #1 uses 0 - 255
2. DC #2 uses 256 - 511
3. DC #3 uses 512 - 767
4. ...
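With contiguous blocks of 256 IDs per datacenter, the allocation above makes the reverse lookup simple arithmetic; a minimal sketch (hypothetical helper, 0-based index):

```java
// With 256 server IDs per datacenter, serverId / 256 is the
// datacenter's 0-based index: DC #1 -> 0, DC #2 -> 1, ...
public class ServerIds {
    public static int datacenterIndex(int serverId) {
        return serverId / 256;
    }
}
```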
The next years (2009 - 2011)
Multiple datacenters
● 2 logrepo servers
● syslog-ng
● Builder
Consumer datacenter
● Reader
● Consumers
More logentries
More consumers
Diverse requests
Single server disk bottleneck
Scaling logrepo reads
Bottleneck: single active Reader server
Goal: spread logrepo accesses across a cluster of servers
Read logrepo from HDFS
Hadoop Distributed File System (HDFS)
“a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.”
http://hadoop.apache.org/docs/stable1/hdfs_design.html
Using HDFS for logrepo access
Resilient logrepo in HDFS
Store each logentry on 3 servers
Push to HDFS quickly
Mirror every segment file into HDFS
/dc1/orgClk/15mt/0.log4181.seg.gz
5-char UID prefix, base 32 (~17-minute time period)
500,000+ files per day
HDFS optimized for fewer files
Reducing the number of logrepo files in HDFS keeps us efficient
HDFSArchiver
Archive yesterday in HDFS
/dc1/orgClk/15mt/0.log4181.seg.gz
● orgClk - type
● 15mt - 20-bit timestamp prefix (~9.3 hour period)
2,500 files per day
Scaling logrepo in HDFS
500,000+ files per day
2,500 files per day
Logrepo: A distributed event logging system
Created @IndeedEng
● Application
● SyslogTcpAppender
● Builder
● Reader
● HDFSPusher
● HDFSReader
● HDFSArchiver
Open source
● log4j
● syslog-ng
● gzip
● rsync+ssh
● Hadoop
LogrepoA distributed event logging system
All time logrepo = 150 TB compressed
jobsearch event set
[slide: wall of hundreds of field names logged on jobsearch events, e.g. ckcnt, cksz, totTime, searchTime, …]
[slide: wall of hundreds of logrepo event types, e.g. jobsearch, mobsearch, viewjob, resumesearch, organic, …]
Every day at Indeed
● Create 5 billion log entries
● App spends 0.03 ms to create each log entry
● Add 500 GB to the archive
● Add 1.5 TB to HDFS
● Consumers read from HDFS at 18.5 GB/s
● 100s of consumers request 1000 different logrepo types
Four types of consumers
Ad-hoc command line
Standard Java programs
Hadoop map/reduce
Real-time monitoring
$ echo 1388556000000 1388642400000 jobsearch \
  | nc logrepo 9999
uid=18d6666o916r15g3&type=jobsearch&q=VP+IT
uid=18d6666ob0mp27aa&type=jobsearch&q=Lab+Tech
uid=18d6666ob0nl15ce&type=jobsearch&q=daycare
uid=18d6666og0nk24rb&type=jobsearch&q=Chef+Upscale
...
Command line access
Reuses standard unix tools and patterns
$ echo 1388556000000 1388642400000 jobsearch \
  | nc logrepo 9999 \
  | egrep -o '&searchTime=[^&]+' \
  | egrep -o '[0-9]+' \
  | sort -r -n \
  | head
Slowest searches from log entries
Programmatic access is trivial
We have clients for
● java
● python
● php
● pig
A typical logrepo consumer (single machine)
Reads one primary log event type
Reads a dozen child events per primary
Total size of each event set = 10KB
A typical logrepo consumer (single machine)
Millions of events read per run
Thousands of consumers run each day
Tens of terabytes processed each day
Efficient Parsing
Important for single machine consumers
Log entry parsing too slow
Fast
Minimize memory usage
URL String Parsing (now available on github)
4x faster than String.split(...), generates 50% less garbage
Parses 1 million log entries of size 0.5K each in 3 seconds
https://github.com/indeedeng
http://go.indeed.com/urlparsing
Hadoop clients
Reliable, scalable, distributed computing
● Most new consumers use Hadoop
● Read log entries directly from HDFS
● Divide and conquer to scale
Monitoring
Want to monitor
● Business metrics
● Operational metrics
“Available soon” isn’t good enough
Datadog
Third party monitoring service
Stream metrics to Datadog HQ
Real-time dashboards
Datadog
miniEPL
'jobsearch.organic_clk': "SELECT COUNT(*), 'clicks' AS unit FROM orgClk",
'jobsearch.totTime': "SELECT int(totTime), 'ms' AS unit FROM jobsearch(totTime IS NOT NULL)",
'mobile.mobsearch.oji': "SELECT tupleCount(orgRes), 'results' AS unit FROM mobsearch",
Getting logs into Datadog
Data redundancy
Replaying events
Click charging
Replaying events
1. Job alert email sign up broke for logged in users
2. Got alert parameters + jobsearch uid from access logs
3. Got account id from jobsearch log entries
4. Recreated job alert sign ups
Click charging
1. Store sponsored click data in database
2. Log sponsored click data to logrepo
3. Verify logs match database
4. Charge for clicks
5. Profit!
What does logrepo enable?
Answering business and operational questions
Data-driven decisions
Average cover letter length inside US vs. outside US?
Mobile searches per hour inJP vs. UK?
Resume creation by country?
Email alert opens by email domain?
Percent of app downloads fromiOS, Android, Windows?
How quickly does a datacenter take on traffic after a failover?
Q & A
https://github.com/indeedeng
http://go.indeed.com/urlparsing
Next @IndeedEng Talk
Big Value from Big Data: Building Decision Trees at Scale
Andrew Hudson, Indeed CTO
February 26, 2014
http://engineering.indeed.com/talks