Post on 16-Apr-2017
Scala SWAT
Artur Bańkowski · artur@evojam.com · @abankowski
TACKLING A 1 BILLION MEMBER SOCIAL NETWORK
This is a story about our participation in one of our customers' projects
Previous Experience
~10 million records
~60 million documents
~60 million vertex graph (non-commercial)
The datasets I had previously worked on were much smaller!
This time we were about to deal with a much bigger social network, one that is naturally represented as a graph. We were going to join the team…
Graph structure
[diagram: Users and Companies (Foo Inc., Bar Ltd.) connected by "worked" and "knows" edges]
2 types of vertices (Companies and Users), 2 types of relations (worked, knows)
The concept: provide a graph subset
[diagram: from the full graph (Foo Inc., Bar Ltd., Users, "worked" and "knows" edges), extract the subgraph around Foo Inc.: its employees and their acquaintances]
Fulltext-searchable, sortable, filterable result for Foo Inc.
1 billion challenge
835,272,759 vertices (751,857,081 users + 83,415,678 companies)
6,956,990,209 relations
Subsets from thousands to millions of users
When we finally got our hands on the data, we realized that the dataset was smaller than expected.
What else did we find?
Existing app workflow
One engineer-evening, multiple custom scripts
750m profiles → extract subset (e.g. 3m profiles) → JSON files
Data duplication: only manually selected data, 60m profiles incl. duplicates
A manual process that takes a few days from the user's perspective
This was already being used by end customers
PoC - The Goal
Our primary goal: handle 1 billion profiles automatically
PoC - Definition of done
• Graph traversable on demand
• First results available under 1 minute
• Entire subset ready in a few minutes
REST API with search
PoC - Concept
Graph DB + Document DB + API: the whole dataset stored in two engines
• Graph DB: all 6,956,990,209 relations
• Document DB: all 751,857,081 profiles
Why graph DB?
• Fast graph traversal
• Easily extendable with new relations and vertices
• Convenient algorithm description
PoC - Flow
Graph DB + Document DB + API
1. Generate the subset from the graph, using the user and company relations (Graph DB)
2. Tag the documents with a unique id in the database holding all profiles (Document DB)
3. Perform the search with that unique id as the primary filter (API)
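The three-step flow can be sketched end to end with plain Scala collections. This is a toy in-memory model with hypothetical names (`Profile`, `subset`, `tag`, `search`), not the production code; the real system used Neo4j for step 1 and Elasticsearch for steps 2 and 3.

```scala
// Toy in-memory sketch of the PoC flow (hypothetical names).
case class Profile(id: Int, name: String, tags: Set[String] = Set.empty)

// Step 1: "traverse the graph" - employees of a company plus the people
// they know, excluding the employees themselves.
def subset(worked: Map[String, Set[Int]],
           knows: Map[Int, Set[Int]],
           company: String): Set[Int] = {
  val employees = worked.getOrElse(company, Set.empty[Int])
  employees.flatMap(e => knows.getOrElse(e, Set.empty[Int])) -- employees
}

// Step 2: tag the matching documents with a unique subset id.
def tag(profiles: Map[Int, Profile], ids: Set[Int], tagId: String): Map[Int, Profile] =
  profiles.map { case (id, p) =>
    id -> (if (ids(id)) p.copy(tags = p.tags + tagId) else p)
  }

// Step 3: search with the tag as the primary filter, the phrase as secondary.
def search(profiles: Map[Int, Profile], tagId: String, phrase: String): Seq[Profile] =
  profiles.values.filter(p => p.tags(tagId) && p.name.contains(phrase)).toSeq
```

For example, with `worked = Map("foo" -> Set(1, 2))` and `knows = Map(1 -> Set(2, 3), 2 -> Set(4))`, `subset(worked, knows, "foo")` yields `Set(3, 4)`: user 2 is an employee, so only the outside acquaintances remain.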
Weapon of choice
Graph DB? Document DB? API?
Scala and Play were the natural choice for the API app; some research was required for the databases
First Steps: Extraction
Extract anonymized profiles, companies and relations
Clean up the data, sort it and generate input files
A few days to pull, streaming with Akka and Slick
To do the research we had to get our hands on the real data
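The cleanup step can be illustrated in plain Scala. This is a minimal sketch assuming a hypothetical "id,name" CSV layout; the real pipeline streamed rows with Akka and Slick rather than holding them in memory.

```scala
// Minimal sketch of the cleanup step (assumed "id,name" CSV layout):
// drop malformed rows, deduplicate by id, sort, emit import-ready lines.
def cleanup(rawLines: Seq[String]): Seq[String] =
  rawLines.iterator
    .map(_.trim)
    .filter(_.matches("""\d+,.+"""))    // keep only well-formed "id,name" rows
    .map { line =>
      val Array(id, name) = line.split(",", 2)
      id.toLong -> name
    }
    .toMap                              // deduplicate by id (last row wins)
    .toSeq
    .sortBy(_._1)                       // sorted input helps the bulk importer
    .map { case (id, name) => s"$id,$name" }
```

For example, `cleanup(Seq("2,Bob", "1,Alice", "garbage", "2,Bob"))` returns `Seq("1,Alice", "2,Bob")`.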
First Steps: Pushing forward
Push profiles to the Document DB for searching
Push vertices and relations to the Graph DB for traversal
Two tools, highly dependent on the DB engines
Fulltext-searchable Document DB
• Mature
• Horizontally scalable
• Fast indexing (~3k documents per second on a single node)
• Well documented
• With Scala libraries:
  • https://github.com/sksamuel/elastic4s
  • https://github.com/evojam/play-elastic4s
We already had significant experience scaling Elasticsearch to 80 million documents
Which graph DB?
Considered three engines, with OrientDB and Neo4j in their community editions
Goodbye Titan
• Performed well in ~60M tests, fast traversing
• Development has been put on hold?
• No batch insertion, slow initial load
• Stalled writes after 200 million vertices with relations
• Horror stories on the internet
Neo4j FTW
https://github.com/AnormCypher/AnormCypher
• Fast
• Already known
• Convenient in Scala with AnormCypher
• Offline bulk loading
• Result streaming
Weapon of choice
Neo4j (Graph DB) + Elasticsearch (Document DB) + Play API
Final stack :)
Final Setup on AWS
Neo4j, API, ES Cluster (nodes hr-1 and hr-2)
Instances: i2.xlarge (4 vCPU, 30.5GB), i2.2xlarge (8 vCPU, 61GB), m4.large (2 vCPU, 8GB)
ES Cluster: 2 nodes, 2 indexes, 10 shards per index, 0 replicas
Step #1 - Bulk loading into Neo4j
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported: 835273352 nodes, 6956990209 relationships, 0 properties
It took 12 hours on an Amazon instance half that size.
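The log above comes from Neo4j's offline bulk importer. An invocation looks roughly like the following sketch (Neo4j 3.x `neo4j-import`; the database must be stopped during the import). The file names and labels here are illustrative, not the project's actual setup.

```shell
# Hypothetical offline bulk load with Neo4j's import tool.
bin/neo4j-import --into data/graph.db \
  --nodes:User users.csv \
  --nodes:Company companies.csv \
  --relationships:worked worked.csv \
  --relationships:knows knows.csv
```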
Step #2 - Bulk loading into ES
Grouped, throttled insert:
1. Create a Source from the CSV file, framed by \n
2. Decode the id from ByteString, generate the user JSON
3. Group by the bulk size (e.g. 4500)
4. Throttle
5. Execute a bulk insert into Elasticsearch
Akka advantage: CPU utilization, parallel data enrichment, human-readable code
Step #2 - Bulk loading into ES

FileIO.fromFile(new File(sourceOfUserIds))
  .via(Framing.delimiter(
    ByteString('\n'),
    maximumFrameLength = 1024,
    allowTruncation = false))
  .mapAsyncUnordered(16)(prepareUserJson)
  .grouped(4500)
  .throttle(
    elements = 2,
    per = 2.seconds,
    maximumBurst = 2,
    mode = ThrottleMode.Shaping)
  .mapAsyncUnordered(2)(executeBulkInsert)
  .runWith(Sink.ignore)
Flow description with Akka Streams
Step #3 - Tagging
[diagram: from the full graph, select the subgraph around Foo Inc.: its employees ("worked") and their acquaintances ("knows")]
MATCH (c:Company)-[:worked]-(employee:User)-[:knows]-(acquaintance:User)
WHERE ID(c) = {foo-inc-id}
  AND NOT (acquaintance)-[:worked]-(c)
RETURN DISTINCT ID(acquaintance)

Neo4j traversal query in Cypher
Akka Streams to the rescue
val idEnum: Enumerator[CypherRow] = ???

val src = Source
  .fromPublisher(Streams.enumeratorToPublisher(idEnum))
  .map(_.data.head.asInstanceOf[BigDecimal].toInt)
  .via(new TimeoutOnEmptyBuffer())
  .map(UserId(_))
  .mapAsyncUnordered(parallelism = 1)(id => dao.tagProfiles(id, companyId))
• Readable flow
• Buffering to protect Neo4j when indexing is too slow
• Timeout due to a bug in the underlying implementation
Bottleneck
1.5 hours for a 3-million subset, that's too long!
Bulk update with Akka Streams tuning

src
  .grouped(20000)
  .throttle(
    elements = 2,
    per = 6.seconds,
    maximumBurst = 2,
    mode = ThrottleMode.Shaping)
  .mapAsyncUnordered(parallelism = 1)(ids => dao.bulkTag(ids, ...))

Bulk tagging in a few lines
Tagging Foo-Company
~14 seconds until the first batch is tagged
~7 minutes 11 seconds until all are tagged
(reference implementation: a few hours)
2,222,840 profiles matching the criteria
Sample subset :)
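These timings are easy to sanity-check against the throttle settings from the bulk-tagging flow: 20,000 ids per bulk request, at most 2 requests per 6 seconds, i.e. one request every ~3s on average. A back-of-the-envelope estimate in plain Scala (the batch and throttle numbers come from the slides; the formula itself is ours):

```scala
// Back-of-the-envelope estimate of the pure throttling time for tagging.
def estimatedSeconds(subsetSize: Long,
                     batchSize: Int,
                     batchesPerWindow: Int,
                     windowSeconds: Int): Long = {
  val batches = (subsetSize + batchSize - 1) / batchSize  // ceiling division
  batches * windowSeconds / batchesPerWindow
}

// 2,222,840 profiles at 20,000 per batch = 112 batches; 112 * 3s = 336s,
// i.e. under 6 minutes of pure throttling, consistent with the measured
// ~7 minutes once Elasticsearch indexing overhead is added.
val pureThrottleTime = estimatedSeconds(2222840L, 20000, 2, 6)
```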
Step #5 - Search
With the data ready in ES, the search implementation was pretty straightforward
Step #5 - Search benchmark
Fulltext search on the 2-million subset, within the 750-million-profile database
GET /users?company=foo-company&phrase=John
2000 phrases for benchmarking, based on the real names and surnames used during profile enrichment
Response: JSON with 50 profiles
Random request ordering
Step #5 - Search under siege
Response time ~0.14s with 50 concurrent users
Constant latency, constant search rate
Objective achieved
Search response time from the API: ~0.14s
14 seconds until the first batch is tagged
7 minutes until the 2-million subset is ready
Summary

                reference implementation | PoC
process:        manual                   | automatic
first results:  few days                 | 14 seconds
full subset:    few days                 | 7 minutes
scale:          ~40 million              | ~750 million
analytics:      none                     | GraphX ready

A tool ready for data scientists: we can implement core traversal modifications almost instantly.
Summary
Try at home! Sample graphs ready to use:
https://snap.stanford.edu/data/
http://neo4j.com/
https://playframework.com/
http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0/scala/stream-index.html
https://www.elastic.co/
Artur Bańkowski · artur@evojam.com · @abankowski