BIG data anti-patterns

Polyglot data integration

[Diagram: polyglot data integration, with OLTP, OLAP/EDW, HBase, Cassandra, Voldemort and Hadoop stores feeding security, analytics, recommendation-engine, search, monitoring and social-graph applications]

Background

• Apache project

• Originated from LinkedIn

• Open-sourced in 2011

• Written in Scala and Java

• Borrows concepts from messaging systems and logs

• Foundational data movement and integration technology

What’s the big whoop about Kafka?

Throughput

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

[Diagram: three producers and three consumers against a three-node Kafka cluster; 2,024,032 TPS produced, ~2ms latency, 2,615,968 TPS consumed]
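To give a feel for the producer side of that setup, here is a minimal sketch against the kafka-clients Java API; the broker address and the "events" topic are placeholders, not values from the benchmark:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "1");  // wait for the partition leader only

        // Fire-and-forget a small batch of records at a hypothetical "events" topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("events", Integer.toString(i), "value-" + i));
            }
        }  // close() flushes anything still buffered
    }
}

Throughput at the level quoted above relies heavily on batching many records per request and on the page-cache behaviour described next.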

O.S. page cache is leveraged

[Diagram: the producer writes into the OS page cache, which is flushed to disk; consumers A and B read from the page cache, so disk is only touched on a cache miss]

Things to look out for

• Leverages ZooKeeper, which is tricky to configure

• Reads can become slow when the page cache is missed and disk needs to be hit

• Lack of security

Summary

• Don’t write your own data integration

• Use Kafka for lightweight, fast and scalable data integration
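To ground that second point, the consuming side of a lightweight integration pipeline is similarly small. A sketch assuming a recent kafka-clients Java API (poll with a Duration), a broker at localhost:9092 and a topic named "events":

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "integration-demo");          // consumer group tracks offsets
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");          // start from the beginning of the topic

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            // Poll a few times for the sketch; a real consumer would loop until shutdown.
            for (int i = 0; i < 10; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}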

Full scans!

SELECT * FROM huge_table JOIN other_huge_table ON …

What’s the problem?

so partition your data according to how you will most commonly access it

hdfs:/data/tweets/date=20140929/

hdfs:/data/tweets/date=20140930/

hdfs:/data/tweets/date=20141001/

and then make sure to include a filter in your queries so that only those partitions are read

... WHERE DATE=20151027

include projections to reduce the data that needs to be read from disk or pushed over the network

SELECT id, name FROM...

hash joins require network I/O, which is slow

and tell your query engine to use a sort-merge-bucket (SMB) join

-- Hive properties to enable an SMB join
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;

and look at using a columnar data format like Parquet

Column storage:

Column 1 (Symbol): GOOGL | MSFT
Column 2 (Date):   05-10-2014 | 05-10-2014
Column 3 (Price):  526.62 | 39.54

Summary

• Partition, filter and project your data (same as you used to do with relational databases)

• Look at bucketing and sorting your data to support advanced join techniques such as sort-merge-bucket

• Consider storing your data in columnar form

Tombstones

What is Cassandra?

• Low-latency distributed database

• Apache project modeled after Dynamo and BigTable

• Data replication for fault tolerance and scale

• Multi-datacenter support

Node

Node

Node

Node

Node

Node

East West

What’s the problem?

[Diagram: a wide row of keys and values where tombstone markers indicate that the column has been deleted]

deletes in Cassandra are soft; deleted columns are marked with tombstones

these tombstoned columns slow down reads

don’t use Cassandra for this kind of workload; use Kafka instead

Counting with Java’s built-in collections

What’s the problem?

use HyperLogLog to work with approximate distinct counts at scale

HyperLogLog

• Cardinality estimation algorithm

• Uses (a lot) less space than sets

• Doesn’t provide exact distinct counts (being “close” is probably good enough)

• Cardinality Estimation for Big Data: http://druid.io/blog/2012/05/04/fast-cheap-and-98-right-cardinality-estimation-for-big-data.html

1 billion distinct elements ≈ 1.5 kB of memory

standard error = 2%
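A rough way to see where those two figures come from is the standard HyperLogLog error bound (register count and width vary by implementation; the values below are one choice that matches the slide's numbers):

\text{standard error} \approx \frac{1.04}{\sqrt{m}}, \qquad
m = 2^{11} = 2048 \;\Rightarrow\; \frac{1.04}{\sqrt{2048}} \approx 2.3\%, \qquad
2048 \times 6 \text{ bits} = 1536 \text{ bytes} \approx 1.5 \text{ kB}

The structure is fixed-size, so it stays at roughly 1.5 kB whether it has seen a thousand or a billion distinct elements.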

Bit pattern observations

50% of hashed values will look like:   1xxxxxxxxx..x
25% of hashed values will look like:   01xxxxxxxx..x
12.5% of hashed values will look like: 001xxxxxxx..x
6.25% of hashed values will look like: 0001xxxxxx..x
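The estimator builds directly on this: the rarer the leading bit pattern you have observed, the more distinct values you have probably hashed. The toy Java sketch below shows only that intuition, not the full HyperLogLog algorithm (which averages many such observations across registers); the hash-mixing helper is hypothetical and just for illustration:

public class LeadingZeroIntuition {
    public static void main(String[] args) {
        int maxLeadingZeros = 0;

        // Hash a stream of distinct items and track the longest run of leading zero bits seen.
        for (int i = 0; i < 1_000_000; i++) {
            long hash = mix64("item-" + i);
            maxLeadingZeros = Math.max(maxLeadingZeros, Long.numberOfLeadingZeros(hash));
        }

        // A hash whose first one-bit is preceded by k zeros shows up about once per
        // 2^(k+1) distinct items, so the longest run gives a (very noisy) estimate.
        System.out.println("crude estimate: ~2^" + (maxLeadingZeros + 1)
                + " = " + Math.pow(2, maxLeadingZeros + 1));
    }

    // Spread String.hashCode() over 64 bits; a real implementation would use a
    // stronger hash such as MurmurHash3.
    private static long mix64(String s) {
        long h = s.hashCode();
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}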

HLL Java library

• https://github.com/aggregateknowledge/java-hll

• Neat implementation: it automatically promotes its internal data structure to a full HLL once it grows beyond a certain size
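A minimal usage sketch in the spirit of the java-hll README; it assumes the net.agkn:hll artifact plus Google Guava (for Murmur3 hashing) are on the classpath, and the log2m/regwidth parameters are just illustrative:

import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import net.agkn.hll.HLL;

public class DistinctUsers {
    public static void main(String[] args) {
        HashFunction murmur = Hashing.murmur3_128();
        HLL hll = new HLL(13 /* log2 of register count */, 5 /* bits per register */);

        // Feed in pre-hashed user ids; duplicates don't change the estimate.
        for (int userId = 0; userId < 1_000_000; userId++) {
            long hashed = murmur.hashInt(userId % 250_000).asLong();
            hll.addRaw(hashed);
        }

        // Prints roughly 250,000, typically within a couple of percent.
        System.out.println("approximate distinct users: " + hll.cardinality());
    }
}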

Approximate count algorithms

• HyperLogLog (distinct counts)

• CountMinSketch (frequencies of members)

• Bloom Filter (set membership)
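As an illustration of the last item, Guava ships a Bloom filter that answers "have I probably seen this element?" in a fixed amount of memory; the sizing values in this sketch are only examples:

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class SeenBefore {
    public static void main(String[] args) {
        // Sized for ~1 million entries at a ~1% false-positive rate.
        BloomFilter<CharSequence> seen =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        seen.put("user-42");

        System.out.println(seen.mightContain("user-42"));  // true
        System.out.println(seen.mightContain("user-43"));  // almost certainly false
        // There are no false negatives: a "false" answer means the item was never added.
    }
}

CountMinSketch makes the same kind of fixed-memory trade-off for estimating per-member frequencies.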

Summary

• Data skew is a reality when working at Internet scale

• Java’s built-in collections have a large memory footprint and don’t scale

• For high-cardinality data use approximate estimation algorithms

It’s open

Important questions to ask

• Is my data encrypted when it’s in motion?

• Is my data encrypted on disk?

• Are there ACLs defining who has access to what?

• Are these checks enabled by default?

How do tools stack up?

[Table: Oracle, Hadoop, Cassandra, ZooKeeper and Kafka compared on ACLs, at-rest encryption, in-motion encryption, whether security is enabled by default, and ease of use]

Summary

• Enable security for your tools!

• Include security as part of evaluating a tool

• Ask vendors and project owners to step up to the plate

Thanks for your time!

Best-in-class big data tools (in my opinion)

If you want …                                                Consider …
Low-latency lookups                                          Cassandra, memcached
Near real-time processing                                    Storm
Interactive analytics                                        Vertica, Teradata
Full scans, system of record data, ETL, batch processing     HDFS, MapReduce, Hive, Pig
Data movement and integration                                Kafka