About VisualDNA Architecture @ Rubyslava 2014
Michal Hariš : [email protected]
- Technical Architect, joined VisualDNA in 2012
Where were we 3 years ago
● 10 people working around one MySQL table holding 50M+ user profiles
● LAMP Architecture
SCALABILITY ISSUES
DECISION TO GO BIG (DATA) !
Where were we 18 months ago
● 30-strong team, of which a single tech team of roughly 15 people
● Basically a batch architecture
● just not MySQL but Cassandra + Hadoop at the back
● http+php trackers with piped custom log batch process
● s3 upload every 5 min
● daily hdfs distcp
● POC = daily hadoop inference > 6 node cassandra -> batch integrations
● POC was a daily batch job which on bad days took 30 hours
● One of the first commercial Cassandra clusters in the world
● very unstable
Where are we today
● Stack
● Java
● Scala
● Hadoop
● Cassandra
● Kafka
● Redis
● R
● AngularJS for the front-end
Where are we today
● Auto-scaling geo-located Tracker Clusters - well, almost auto-scaling
● Robust Streaming Infrastructure - aggregation of all data streams in central infrastructure
● bringing in 8.5k events/second at peak
● Real-time end-user products, scoring services, integrations with third parties where possible, pre-computation infrastructure that scales more predictively
● These are primary events which get multiplied by various speed-layer
● ETL Pipeline - offloading data streams and pre-computing materialised views onto HDFS > 30TB of primary data
● some data we keep only for the last 60 or 90 days, others we keep forever
● Decision Analytics Pipeline (or RD Pipe) > 100TB+ of secondary data
● Using feature-extraction machine learning methods
Where are we today
● Still one Cassandra ring, just bigger and more stable, 16 nodes, 250M+ active user profiles
● Lambda Architecture for real-time products like WHY Analytics
● RD Pipe is the "batch" layer (daily) that generates active profiles in Cassandra ("view layer")
● Primary Events are enriched by the Enrichment service with the user profiles produced daily ("speed layer")
● Combination of probabilistic counters and Redis cubes calculates the current audience profiles for subscribed websites ("speed layer")
● API on top of the Redis cubes serves the current audience profiles for the front end suite of real-time analytics products ("serving layer")
● Audience Analytics product suite is the good looking bit - http://www.visualdna.com/why/
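The layering on this slide can be sketched roughly as follows. This is a minimal illustration, not VisualDNA's code: the class `LambdaSketch`, its method names and the `"website|segment"` key format are all hypothetical, and plain in-memory maps stand in for Cassandra (batch view) and Redis (cubes). It counts events rather than distinct users, so the probabilistic counters mentioned above are not modelled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for Cassandra and Redis.
final class LambdaSketch {
    // Batch view: userId -> profile segment, as a daily batch layer might produce it.
    private final Map<String, String> batchProfiles = new HashMap<>();
    // Speed-layer cube: "website|segment" -> number of events seen so far.
    private final Map<String, Long> cube = new HashMap<>();

    // Batch layer output: load (or overwrite) one user's profile segment.
    void loadBatchProfile(String userId, String segment) {
        batchProfiles.put(userId, segment);
    }

    // Speed layer: enrich a primary event with the user's profile, then update the cube.
    void onEvent(String website, String userId) {
        String segment = batchProfiles.getOrDefault(userId, "unknown");
        cube.merge(website + "|" + segment, 1L, Long::sum);
    }

    // Serving layer: read the current event count for one website and segment.
    long audience(String website, String segment) {
        return cube.getOrDefault(website + "|" + segment, 0L);
    }

    public static void main(String[] args) {
        LambdaSketch sketch = new LambdaSketch();
        sketch.loadBatchProfile("u1", "explorer");
        sketch.onEvent("example.com", "u1");
        sketch.onEvent("example.com", "u2"); // no batch profile yet, counted as "unknown"
        System.out.println(sketch.audience("example.com", "explorer"));
    }
}
```

Note that in this sketch the enrichment lookup sits on the path of every event, so it is the natural place for a bottleneck to appear.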
Where are we today
● 120-strong team, of which tech is roughly 60:
● Sysadmin Team
● Architecture Tech Team
● Decision Analytics Tech Team
● Consumer Tech Team
● WHY Analytics Team
What have we learned
● Architecture:
● Updating json blobs in Cassandra columns is a trap
● Logging is better http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
● Metrics are crucial in large distributed systems
● yammer metrics + graphite + icinga works well for infrastructure
● but complex event/anomalies detection and pattern analysis gives the edge
● Real-Time processing of Data Streams is not only cool, but scales well ... until you find a bottleneck in a single component which will limit the entire system
● Batch still matters
● but could be much faster than Hadoop which falls on too much redundant I/O and requires a coordinated ETL pipeline
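The "json blobs vs. logging" point above can be illustrated with a small sketch. The slide does not say why blob updates are a trap; one common reading, assumed here, is that rewriting a whole blob forces a read-modify-write on every change, whereas an append-only log lets writers only append and lets readers derive the current state by replay. The class `ProfileLogSketch` and its methods are hypothetical, and a Java list stands in for the real log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an append-only profile log (a list stands in for the log store).
final class ProfileLogSketch {
    // One immutable log entry: a single attribute change for one user.
    static final class Update {
        final String userId;
        final String attribute;
        final String value;

        Update(String userId, String attribute, String value) {
            this.userId = userId;
            this.attribute = attribute;
            this.value = value;
        }
    }

    private final List<Update> log = new ArrayList<>();

    // Write path: append only, nothing is read back or overwritten.
    void append(String userId, String attribute, String value) {
        log.add(new Update(userId, attribute, value));
    }

    // Read path: replay the log in order; the last write per attribute wins.
    Map<String, String> profileOf(String userId) {
        Map<String, String> profile = new TreeMap<>();
        for (Update u : log) {
            if (u.userId.equals(userId)) {
                profile.put(u.attribute, u.value);
            }
        }
        return profile;
    }

    public static void main(String[] args) {
        ProfileLogSketch sketch = new ProfileLogSketch();
        sketch.append("u1", "country", "SK");
        sketch.append("u1", "country", "CZ"); // a later entry supersedes the earlier one
        System.out.println(sketch.profileOf("u1"));
    }
}
```

A real system would compact or snapshot the log rather than replay it in full on every read; that step is left out to keep the sketch short.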
What have we learned
● Engineering:
● the unix philosophy of building short, simple, clear, modular, and extendable code also applies to the design of distributed systems, not just an OS
● bad tests are better than no tests but they are still bad, and most tests only test the positive outcome
● the story of Math.abs() -> can actually return a negative number -> but none of the unit-tests anticipated this -> which is why metrics and systems with feedback control are crucial
● Process:
● It is possible to co-operate remotely even on complex and not well-defined systems - at the moment some of the architecture team is working remotely on a permanent basis
● QA is intrinsic to Architecture and local to products
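The Math.abs() story from the Engineering list is easy to reproduce in Java: `Math.abs(Integer.MIN_VALUE)` overflows and returns `Integer.MIN_VALUE` itself. The talk does not say where the call was used; the sketch below assumes, purely for illustration, that it was bucketing a hash value. `MathAbsSketch`, `naiveBucket` and `safeBucket` are hypothetical names.

```java
// Hypothetical sketch: how Math.abs() can produce a negative bucket index.
final class MathAbsSketch {
    // Naive bucketing: Math.abs(Integer.MIN_VALUE) is still negative,
    // so the remainder can be negative too.
    static int naiveBucket(int hash, int buckets) {
        return Math.abs(hash) % buckets;
    }

    // Safer bucketing: Math.floorMod returns a value in [0, buckets) for buckets > 0.
    static int safeBucket(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Integer.MIN_VALUE));        // negative
        System.out.println(naiveBucket(Integer.MIN_VALUE, 10)); // negative index
        System.out.println(safeBucket(Integer.MIN_VALUE, 10));  // valid index
    }
}
```

Only one of the 2^32 possible int inputs triggers the problem, which fits the slide's point that tests covering the positive outcome miss it and production metrics are what catch it.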
Interesting issues we’re facing
1. SLAs vs. Start-up dynamics - Separate process (and to some degree architecture) for different levels of guarantee of service
2. Globally-distributed highly-available API for random access to our profiles - enabling decisions based on VDNA profiles on-demand
3. Our Lambda has a bottleneck at the enrichment point - although if we solve (2.) we will be half-way through
4. Complex data pooling attribution model
5. Cassandra still gives us some pain - it's the drivers! - interesting about consistency: http://aphyr.com/posts/294-call-me-maybe-cassandra/
6. Preserving start-up dynamics and culture in a company of 200+ with offices in several cities
We’re hiring for the Bratislava office!
● We’re looking for engineers and analysts and more to be based in Bratislava