Data Lake and the rise of the microservices
About Me
• IT Operations Manager @BigStepInc
• Tech Support 2007 - 2008
• Systems Administrator 2008 - late 2014 (from Junior to Senior)
• IT Operations Manager late 2014 - Present
• Passionate about improvement and systems in general
• Totally dislike repetitive tasks
@mboeru
@bigstepinc
About Bigstep
• High performance, bare metal cloud purpose-built for big data
• Automatically deployed (managed and unmanaged) big data software stacks
• HDFS as a Service offering
• Managed Docker platform (coming soon)
• Spark clusters as a service (coming soon)
• Purely on-demand: bare metal instances get deployed in 2 minutes and can be deleted anytime
• Locally attached drives support
• SDN controlled Layer 2 networking (40Gbps per instance, cut through)
• Distributed SSD based storage fabric
Big Data technologies for mainstream and vice-versa
• Due to the cap on CPU frequency, scaling horizontally is the only direction left to grow in.
• The client-server architecture is outdated.
• All components of an application must be as independent and as scalable as possible.
• Big data technologies are increasingly used in general purpose applications.
• In-memory technologies are orders of magnitude faster than the others.
• Docker promotes and simplifies large scale application management using low-overhead containers instead of VMs
• Mesos used with Docker and some additional services creates a Distributed OS
Source: Tori Randall, Ph.D. prepares a 550-year old Peruvian child mummy for a CT scan
Data as artefacts
• Just like archeological artefacts, old data can yield new insights if correlated in a novel way or analysed with a new technology.
• Throwing away data because it is of no use today might cripple the business tomorrow.
The Data Lake - A paradigm, not a technology
• Store unstructured data in its original format
• Store structured data along with the structure (schema) so it can be distributed onto multiple machines
• Ingest massive amounts of data - go to petabyte scale if needed
• Stream in or batch import data from any source
• Perform new, deeper analytics by focusing on correlations between diverse data sources: clickstream, social media, machine data, documents, audio/video, etc.
• Store anonymised data (keep IDs and not names or other personally identifiable information)
• Promote data exploration
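The anonymisation bullet above could look roughly like the sketch below. It drops name-like fields before a record is stored and keeps only an identifier. The field names and the record layout are hypothetical, and replacing the ID with a keyed hash is an extra step assumed here, not something the slides prescribe.

```python
import hashlib
import hmac

# Hypothetical list of personally identifiable fields; adapt to the real schema.
PII_FIELDS = {"name", "email", "phone"}


def anonymise(record: dict, secret_key: bytes) -> dict:
    """Return a copy of record without PII fields and with a pseudonymous user ID."""
    # Drop every field that could identify a person directly.
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    # Assumption: the ID is replaced by a keyed hash so records can still be
    # correlated without exposing the original identifier.
    raw_id = str(record["user_id"]).encode("utf-8")
    cleaned["user_id"] = hmac.new(secret_key, raw_id, hashlib.sha256).hexdigest()
    return cleaned


event = {"user_id": 42, "name": "Jane Doe", "email": "jane@example.com", "page": "/pricing"}
safe_event = anonymise(event, secret_key=b"demo-key-not-for-production")
print(safe_event)
```

Because the hash is keyed and deterministic, the same user maps to the same pseudonym across data sources, which keeps correlation possible.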
Data Services
• A data service provides data to other services
[Diagram: example data services (Clusters Service, Timetable, Service center load predictor, Driver's path optimiser) and their data lakes]
Data Services - It’s about the teams and not the technology
• Conway’s law: “[…] organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”
[Diagram: a layered application (View, Controller, Model) with UX specialists in charge of the View, backend specialists in charge of the Controller and DB specialists in charge of the Model; poor communication between the groups]
Per data microservice teams
• Data teams are independent
• A data service has its own release cycle
• Ultra-specialisation is reduced
• Communication among members of the same team is better
[Diagram: groups of apps, each group behind its own API, with better communication inside each group]
Monolith vs. Microservices approach
[Diagram: servers and the apps they run under the monolithic approach versus the microservices approach]
Polyglot persistence
• The data does not have to reside in the same place (e.g.: same HDFS cluster)
• But it has to be always available for any team, microservice, or data application authorized to use it
[Diagram: a single DB and its slave, divided into pieces 1 to 4, contrasted with separate databases DB 1 to DB 4]
Microservices orientated architecture
• Components, not layers.
• Each component can scale horizontally and is masterless
• Each component can be unit tested independently
• Each component can be deployed independently to production
• Multiple versions of the same component can coexist for a short amount of time
• Using APIs to integrate components as opposed to direct method calls
• Use natively backward compatible API designs and implementations
• Use distributed locking (e.g.: Zookeeper) instead of file based locking
• Use queuing instead of blocking calls with evolving schemas (e.g.: Kafka with Avro serialiser)
• Using distributed databases (e.g.: Couchbase) instead of master-slave oriented ones. Avoid immutable schemas.
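The queuing and backward-compatibility bullets above can be sketched with the standard library alone. Here an in-process `queue.Queue` and JSON stand in for Kafka and Avro, and the order schema and its defaults are hypothetical. The point is only that a consumer which fills defaults for missing fields and ignores unknown ones keeps working while producers on different schema versions coexist.

```python
import json
import queue

# Hypothetical schema: field name -> default used when an older producer omits it.
FIELD_DEFAULTS = {"order_id": None, "amount": 0.0, "currency": "EUR"}


def decode(message: str) -> dict:
    """Decode a message, filling defaults for missing fields and ignoring unknown ones."""
    payload = json.loads(message)
    return {field: payload.get(field, default) for field, default in FIELD_DEFAULTS.items()}


events = queue.Queue()

# A producer still on schema v1 (no "currency") and a newer one share the queue.
events.put(json.dumps({"order_id": 1, "amount": 9.5}))
events.put(json.dumps({"order_id": 2, "amount": 3.0, "currency": "USD", "coupon": "X1"}))

# The consumer drains the queue at its own pace instead of being called directly.
decoded = []
while not events.empty():
    decoded.append(decode(events.get()))
print(decoded)
```

A real Avro setup expresses the same idea through reader and writer schemas with declared defaults.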
Docker
• A Docker container is neither a VM nor a VPS
• Application level virtualisation
• Same kernel
• No performance overhead
• Instant deployment
• Usually a single app per container
• Uses libcontainer (previously used LXC) engine (network namespaces and cgroups)
• Git-like deployment method with branches and repositories.
[Diagram: several containers sharing one kernel, each with a LAN vNIC and a WAN vNIC]
Docker Persistency
• Docker is designed for services that do not need persistency, but it does support it
• By default each container has its own clone of the filesystem in the image
• All changes to this clone are stored in per-container directories that do not get garbage collected
• A new container has a new tree, so restarting a container without an explicit mapping appears to have destroyed the data.
• Docker achieves persistency by mapping directories from the host machine to the container.
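The host-directory mapping from the last bullet corresponds to the `-v host_dir:container_dir` option of `docker run`. The sketch below only builds and prints such a command. The image name and both paths are illustrative assumptions, not values from the slides.

```python
import shlex


def docker_run_command(image: str, host_dir: str, container_dir: str) -> list:
    """Build a `docker run` argument list that bind-mounts host_dir at container_dir."""
    # -d runs the container detached; -v maps a host directory into the container,
    # so data written under container_dir survives container re-creation.
    return ["docker", "run", "-d", "-v", f"{host_dir}:{container_dir}", image]


# Assumed example values: a Cassandra image with its data directory kept on the host.
cmd = docker_run_command("cassandra", "/data/cassandra", "/var/lib/cassandra")
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd)` on a host where Docker is installed.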
Mesos & Marathon
• Allows an app’s environment to be software defined.
• Docker (currently) knows only about 1 host
• Orchestration layer for Docker containers
• Out of the box load-balancing
• Monitors and restarts containers if failed
• API driven
• Useful for creating high performance, distributed, fault tolerant architectures.
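To make "software defined" and "API driven" concrete, the sketch below builds a Marathon-style application definition as plain JSON. The app ID, image and resource figures are hypothetical, and the field names follow the Marathon `/v2/apps` format as commonly documented for Docker containers, so they should be checked against the Marathon version in use.

```python
import json


def marathon_app(app_id: str, image: str, instances: int,
                 cpus: float = 0.5, mem: int = 256) -> dict:
    """Describe a Docker-based app as a Marathon-style application definition."""
    return {
        "id": app_id,
        "instances": instances,  # Marathon keeps this many containers running
        "cpus": cpus,
        "mem": mem,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 asks for a dynamically assigned host port
                "portMappings": [{"containerPort": 80, "hostPort": 0}],
            },
        },
        # Failed health checks lead to the container being restarted.
        "healthChecks": [{"protocol": "HTTP", "path": "/", "portIndex": 0}],
    }


app = marathon_app("/web/frontend", "nginx", instances=3)
print(json.dumps(app, indent=2))
```

Such a document would typically be sent with an HTTP POST to the `/v2/apps` endpoint of a Marathon master.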
Docker Networking in Mesos
[Diagram: instances instance-1001.bigstep.io, instance-1002.bigstep.io ... instance-100n.bigstep.io each run several containers behind interface eno1 on a layer 2 LAN. Containers have private addresses such as 172.167.1.2:80, 172.167.1.3:80 and 172.167.1.200:80. An haproxy per instance exposes WAN addresses 31.00.62.211:80, 31.00.62.212:80 and 31.00.62.213:80. A client on the internet reaches Instancearray01.bigstep.io through DNS load-balancing.]
• Uses network namespaces
• Needs Layer 2 or software overlay network
• Each container gets a private IP
• Bigstep Automatic DNS load-balancing
• Automatic HAProxy load-balancing
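DNS load-balancing is commonly done by rotating the order of the address records returned for a name, so successive clients connect to different front ends. The sketch below shows that generic round-robin idea only. It is not Bigstep's implementation, and the addresses are placeholders from the documentation range.

```python
from collections import deque


class RoundRobinRecord:
    """Rotate the list of addresses returned for one DNS name."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self) -> list:
        # Return the current order, then rotate left by one so the next
        # query sees a different first address.
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


# Placeholder addresses standing in for the per-instance HAProxy front ends.
record = RoundRobinRecord(["192.0.2.11", "192.0.2.12", "192.0.2.13"])
first = record.resolve()
second = record.resolve()
print(first[0], second[0])
```

Clients that always pick the first address in the answer end up spread evenly across the front ends.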
Docker vs Native - Latency
[Chart: average response time, smaller is better, for INSERT, SELECT and UPDATE; 1 node native versus 1 Docker container. Axis from 0 to 22. Bar labels: 11, 19, 21 and 10, 18, 19.]
Source: Bigstep’s Cassandra Benchmark 2015
Docker vs Native - Throughput
[Chart: throughput in KReq/s, bigger is better, for INSERT, SELECT and UPDATE; 1 node native versus 1 Docker container. Axis from 0 to 170. Bar labels: 149, 92, 82 and 168, 96, 90.]
Source: Bigstep’s Cassandra Benchmark 2015
Streaming versus batch
• Resource usage patterns for streaming resemble those of web-centric systems, and need consolidation for efficiency as well as high availability
[Chart: resource usage (%) over time, resource usage pattern of a production system (25% marked)]
[Chart: resource usage (%) over time, typical resource usage pattern of a big data analytics system (100% marked)]
Spark with Mesos
• Spark & Spark Streaming are great candidates for building data microservices as they are very fast and easy to use
• Spark can use Mesos as a resource manager
• Spark needs YARN to access secure HDFS. YARN on Mesos: Myriad
Is it hard to build a Data Lake?
• Use flexible infrastructure: workloads are very difficult to predict as data volumes and types of analysis change all the time
• Polyglot Persistency promotes the idea that data must be always available, but that it can be stored in any technology that fits, e.g. Hadoop, NoSQL.
• Polyglot Programming advocates the use of the right tool for the right job. Docker-based deployment makes environment setup more or less irrelevant.
• Mesos is more complicated to set up on-premise. Mesosphere offers a commercial product for this. Bigstep also automates a scalable Mesos (with Docker) deployment on bare metal.
• Data import services can be tricky to set up. The problem is the organisation structure and security. Anonymisation is required.
• A service discovery solution is required: use mesos-dns
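With mesos-dns, a task started by a framework is typically published under a name of the form `task.framework.domain`, for example `ingest.marathon.mesos` with the default domain. The sketch below builds such a name and looks it up through the system resolver. The task name `ingest` is hypothetical, and the lookup only succeeds on hosts whose resolver points at a mesos-dns server.

```python
import socket


def mesos_dns_name(task: str, framework: str = "marathon", domain: str = "mesos") -> str:
    """Build the DNS name under which mesos-dns publishes a task."""
    return f"{task}.{framework}.{domain}"


def resolve_task(task: str, framework: str = "marathon", domain: str = "mesos") -> list:
    """Return the IP addresses registered for a task, or [] if the name does not resolve."""
    name = mesos_dns_name(task, framework, domain)
    try:
        _, _, addresses = socket.gethostbyname_ex(name)
    except OSError:
        # Covers socket.gaierror: the name is unknown or no mesos-dns is reachable.
        return []
    return addresses


# Hypothetical task name for a data import service launched through Marathon.
print(mesos_dns_name("ingest"))
```

A data service would call `resolve_task("ingest")` at connection time instead of hard-coding container addresses.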
Conclusions
• Data (micro-)services allow building a data ecosystem within your organisation. A team is a provider of data to other teams.
• An agile data environment enables an agile business. New tools must be inserted quickly into the mix. (E.g.: you found out about Looker today, so why not try it on the data?)
• There are methods to improve consolidation ratios by 40% while preserving the performance of data services
[Diagram: Business understanding + Data analysis = Business modelling. Labels: Production Systems, machine data, prediction model, Visualization & Reports]