Data Modeling and Scale Out
Jason Stamper, 451 Research
Vladi Vexler and Paul Campaniello, ScaleBase
Agenda
Data Modeling and Scale Out
1. 451 Research
• Key challenges in the data landscape
• Evolution of distributed database environments
2. ScaleBase
• Pros and cons of abstracting complex database topology
• Top strategies of distributed data modeling
• Advanced data modeling and “what-if” simulations with Analysis Genie
• Scaling real apps – From need to deployment
• Demo
3. Q & A (please type questions directly into the GoToWebinar side panel)
Today’s Presenters
Jason Stamper, Analyst, Data Management and Analytics, 451 Research
• Over 20 years of experience in IT
• Formerly Editor of Computer Business Review & Technology Editor at The New Statesman
Vladi Vexler, Vice President, Tech. & Product Marketing, ScaleBase
• Over 15 years of experience in software development and product management
• Author of patents in the fields of database innovation, dynamic data caching and machine learning analytics
Paul Campaniello, Vice President, Worldwide Marketing, ScaleBase
• Over 25 years of software marketing & sales experience
• Held senior marketing and sales positions at Mendix, Lumigent, ESI, ComBrio, Savantis and Precise Software
About 451 Research
Founded in 2000
210+ employees, including over 100 analysts
1,000+ clients: Technology & Service providers, corporate advisory, finance, professional services, and IT decision makers
10,000+ senior IT professionals in our research community
Over 52 million data points each quarter
Headquartered in New York with offices in Boston, San Francisco, Washington, London…
Research & Data
Advisory Services
Events
The Challenge
Businesses and their users are facing what one might call a
perfect storm – decision-makers need insight faster than ever,
and yet IT is struggling to avoid becoming a bottleneck.
The Facts Speak for Themselves…
Recent survey by trade magazine Computer Business
Review: 98% (of 200 UK CIOs) admit “significant gap”
between what business expects and what IT can deliver.
So What Does the Business Want?
Speed
Information, not data
Flexibility
Ease-of-use
Mobility
New ways of working
Self-service
Scale
Collaboration
What Causes IT to Become a Bottleneck?
Governance
Control
Security
Budget
Legacy
Staff
What Have We Learned So Far?
• So far, the emergence of so-called ‘hot’ data platform and analytics technologies has not solved the IT information bottleneck.
• Hadoop isn’t going to save the world (and neither is NoSQL).
• The ability to analyze large data sets, in real- or near real-time, is only set to grow in the era of the Internet of Things.
• IT is still critical, but it needs to enable the business to help itself. The question is how to achieve the right blend of usability, value-for-money and scalability.
A Word or Two on Hadoop Adoption
[Chart: average total storage capacity (TBs) by workload, 2012 vs. 2013, on a scale of 0 to 8,000. Workloads: DW and DBMS, Unstructured file, Virtualized server/OS, Backup, Archive, Other, Big data/Hadoop]
Average total storage capacity (TBs) and total storage footprint by workload illustrate the low level of Hadoop adoption today.
451 Research’s View of the ‘Total Data Approach’
What is Driving the Change?
Developers: Agile, REST, JSON, Schemaless, Schema-on-read, Flexible
Applications: Web, Social, Mobile, Always-on, Interactive, Local
Architecture: Cloud, Scalable, Elastic, Virtual, Distributed, Flexible
New applications require distributed architecture
Distributed architecture encourages new development approaches
New development approaches demand new architecture
Distributed architecture enables new applications
New app requirements demand new development approaches
New development approaches enable new lightweight apps
The Database Challenge
– The traditional relational database has been stretched beyond its normal capacity limits by the needs of high-volume, highly distributed or highly complex applications.
– There are workarounds – such as DIY sharding – but manual, homegrown efforts can result in database administrators being stretched beyond their available capacity in terms of managing complexity.
– Scalability
– Performance
– Relaxed consistency
– Agility
– Intricacy
– Necessity
Increased willingness to look for emerging alternatives
Scalability, and Other Challenges
• As usage of MySQL and MariaDB has grown, so has the usage
of applications that depend on MySQL and MariaDB:
– Games; Social; Customer Facing; Web; Business apps like Ad Networks
• This has highlighted a number of challenges:
– Scalability of master-slave architecture
– Performance and predictability at scale
– Lower latency; greater throughput; richer apps
– User expectations rising
– Manageability of increasing database/app sprawl
• External factors driving greater complexity:
– Distributed computing architectures
– Proliferation of cloud and elasticity requirements
– Geo-distributed application requirements
– Viral success means growth can come very quickly
Conclusions
• The success of MySQL and MariaDB has led to scalability concerns
• Distributed computing, proliferation of cloud, and geo-distributed applications are adding to the complexity
• Manual sharding techniques transfer the strain from the database to the database administrator
• MySQL and its administrators have never been under so much strain
• Database scalability software enables users to move beyond the limitations and complexity of DIY sharding; precisely how data is managed with a distributed database, in the cloud or on premises, is key.
Scale Out Designs
About ScaleBase
Distributed Database Management System
Architected for the Cloud
Simple. Reliable. Powerful.
Designs of Distributed MySQL Environments
Quick Scale Out
Medium scale needs
Multiple database replicas performing load balancing with read/write splitting
Massive Scale Out
High scale needs
Complete distributed database environment, with policy-based data sharding/distribution
Quick Scale-Out
Read/Write Splitting and Continuous Availability
[Diagram: the application is redirected (IP/port) to one MySQL master serving reads and writes (R/W) and to three MySQL replicas serving reads (R)]
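A minimal sketch of the read/write splitting idea, not ScaleBase's implementation. The class name, the host strings and the rule "statements starting with SELECT or SHOW are reads" are all assumptions made for illustration.

```python
import itertools


class ReadWriteSplitter:
    """Send reads to replicas in round-robin order and everything else to the master."""

    # Assumption: a statement is a read if it starts with one of these keywords.
    READ_PREFIXES = ("SELECT", "SHOW")

    def __init__(self, master, replicas):
        self.master = master
        # Fall back to the master when no replicas are configured.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, sql, in_transaction=False):
        is_read = sql.lstrip().upper().startswith(self.READ_PREFIXES)
        # Reads inside a transaction stay on the master so they see their own writes.
        if is_read and not in_transaction:
            return next(self._replicas)
        return self.master


# Hypothetical endpoints: one master and three replicas.
router = ReadWriteSplitter("master:3306", ["replica1:3306", "replica2:3306", "replica3:3306"])
targets = [router.route("SELECT * FROM users") for _ in range(4)]
write_target = router.route("UPDATE users SET name = 'x' WHERE id = 1")
```

A production router would also need to handle cases this sketch ignores, such as SELECT ... FOR UPDATE, replication lag and replica failure.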
Massive Scale-Out
[Diagram: shards 0, 1, 2, etc., each consisting of a master and its replicas]
The Right Solution for You Depends on Your Goals
• Scale (mostly) reads
• Scale (mostly) writes
• Performance of reads
– Affected by joins and scans of big tables
• Performance of writes
– Affected by I/O (reads/writes), CPU and table indexes (a growing overhead)
• Locks
• CPU/IO/RAM issues
• Load peaks
• Data growth
• Geo-distribution, special data distribution needs
Pros and Cons of Abstracting Complex Database Topology
Pros of Abstracting Complex Database Topology
• Development Agility – Accelerates your innovation speed
• Simplifies application code
• Reduces maintenance costs and simplifies maintenance
• Operations Efficiency – Zero downtime for applications
• Reduces operation costs
• Better monitoring, analytics, HA, scale, elasticity, etc.
Cons of Abstracting Complex Database Topology
• Additional technology component may increase complexity
• Additional layer to monitor and manage
• Additional machines to monitor and manage (possible increased opex)
• Less control at the application code level (the abstraction is transparent)
Scale Out Methodologies Comparison
Characteristics & Modeling in a Distributed Database System
Characteristics of Distributed Table Types
• MASTER – On master shard (0) only
– Site settings, Admin data tables
• GLOBAL – Full copy on all shards
– Lookups, Frequently joined tables, Slow growing tables
• DISTRIBUTED-ROOT – Distribution based on a key column
– User.Id
• DISTRIBUTED-CASCADED (child) – Based on parent row
– User_Photos, User_Photos_Likes – depend on Users
Shards:        0           1           2           3
MASTER         Full table
GLOBAL         Full table  Full table  Full table  Full table
DISTRIBUTED    ¼ table     ¼ table     ¼ table     ¼ table
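The four table types above can be sketched as a small placement function. This is an illustration under stated assumptions, not ScaleBase's policy format: the POLICY dictionary, the lowercase table names and the modulo-based placement are hypothetical.

```python
N_SHARDS = 4  # shards 0..3

# Hypothetical policy: table name -> (table type, column holding the root key).
POLICY = {
    "site_settings": ("MASTER", None),
    "countries": ("GLOBAL", None),
    "users": ("DISTRIBUTED-ROOT", "id"),
    "user_photos": ("DISTRIBUTED-CASCADED", "user_id"),
}


def shards_for_write(table, row):
    """Return the shard numbers that must store this row."""
    table_type, key_column = POLICY[table]
    if table_type == "MASTER":
        return [0]  # master shard only
    if table_type == "GLOBAL":
        return list(range(N_SHARDS))  # full copy on every shard
    # Root and cascaded tables both resolve to the root key (the user id here),
    # so a child row lands on the same shard as its parent row.
    # Assumption: integer keys placed by simple modulo.
    return [row[key_column] % N_SHARDS]
```

With this policy, a user with id 42 and a photo row carrying user_id 42 both map to shard 2, which keeps joins between the two tables on a single shard.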
Characteristics of Distributed Queries
• ONE-DB – 1 shard, 1 node. Most efficient.
1) Any call when data is known to be in one shard (Distributed/Master)
2) Call to Global table (load balance)
• ALL-DB – All shards, 1 node.
1) AGGREGATED READs (like map-reduce)
2) DML (writes) on Global tables
3) DDL (create, drop, alter schema)
• FULL-DB – All shards, all nodes.
1) Session calls (USE, SET)
• CROSS-DB – #n shards, 1 node. Least efficient, but critical.
1) Cross-shard conflict resolution
Note: Not all sharding platforms support all distributed query types.
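To make the ONE-DB and ALL-DB cases concrete, here is a small in-memory sketch. The SHARDS data, the function names and the modulo placement are assumptions for illustration; a real platform would run SQL against each shard.

```python
# Four in-memory "shards", each holding part of a hypothetical users table.
# Rows are placed by id % 4.
SHARDS = [
    [{"id": 4, "country": "US"}, {"id": 8, "country": "UK"}],
    [{"id": 1, "country": "US"}],
    [{"id": 2, "country": "UK"}, {"id": 6, "country": "US"}],
    [{"id": 3, "country": "US"}],
]


def one_db_lookup(user_id):
    """ONE-DB: the key identifies the single shard to query."""
    shard = SHARDS[user_id % len(SHARDS)]
    return next((row for row in shard if row["id"] == user_id), None)


def all_db_count_by_country():
    """ALL-DB: run the same aggregate on every shard (map), then merge (reduce)."""
    partials = []
    for shard in SHARDS:
        counts = {}
        for row in shard:
            counts[row["country"]] = counts.get(row["country"], 0) + 1
        partials.append(counts)

    merged = {}
    for counts in partials:
        for country, n in counts.items():
            merged[country] = merged.get(country, 0) + n
    return merged
```

The lookup touches one shard, while the aggregate touches all four and merges the partial counts, which is why ONE-DB queries are the cheapest to serve.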
Why Is Data Modeling Important?
• DATA and LOAD – Efficient distribution of:
– DATA - all / main tables and data
– READS
– WRITES
• QUERIES
– Handle ALL-DB Queries (Map-reduce concept)
– Minimize (but support!) CROSS-DB Queries – higher performance and scale
• OPTIMIZE DEVELOPMENT with SQL ANALYTICS
– Insight into the real database usage
Data Relationships Can be Extremely Complex
Usually, scale out is applied to growing-mature apps.
How do you define an optimal data distribution policy?
Analysis Genie: MySQL Visual Analysis & Optimal Distribution Policy Configuration
ScaleBase Analysis Genie
• A tool for visual analysis of MySQL and for building an optimal data distribution policy; designed for DBAs, Architects & Dev. Managers
• Two-step process:
– Analysis Assistant
– An agent captures app/DB information, including SQL traffic and database metrics
– Obfuscates, summarizes and packages the App-DB data
– Analysis Genie
– A SaaS application that receives the Analysis Assistant (AA) package, presents the visual analysis and details the policy configuration
ScaleBase Analysis Genie
• Advanced analytics
– Schemas, data & queries
– Semantic structure analysis
– Usage, Load and Scale analytics
• Data Modeling and Scale-out planning
– Customized for the most complex applications
– Auto identification of optimal data distribution policy
– Complete policy control
• Quality assurance
– Review before production
• Simulation of results
– “What-if” analysis
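As a rough illustration of a "what-if" check, the sketch below compares how evenly two candidate distribution keys would spread rows across shards. It is not how Analysis Genie works; the function names, the sample rows and the CRC32-based placement are assumptions.

```python
import zlib
from collections import Counter


def simulate_distribution(rows, key_column, n_shards):
    """Count how many rows each shard would receive for a candidate key."""
    counts = Counter({shard: 0 for shard in range(n_shards)})
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash for strings.
        shard = zlib.crc32(str(row[key_column]).encode()) % n_shards
        counts[shard] += 1
    return counts


def skew(counts):
    """Largest shard relative to a perfectly even split (1.0 is ideal)."""
    total = sum(counts.values())
    return max(counts.values()) / (total / len(counts)) if total else 0.0


# Hypothetical sample: 100 users, all in the same country.
rows = [{"user_id": i, "country": "US"} for i in range(100)]
by_user = simulate_distribution(rows, "user_id", 4)
by_country = simulate_distribution(rows, "country", 4)
```

Because every row shares one country value, distributing by country puts all 100 rows on a single shard (a skew of 4.0 with four shards), while distributing by user_id spreads them across shards.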
Relationship Identification
Mapping includes:
• Schema structures
• Table and column name matching
• Query parsing and identification of joined tables and columns
• Statistics on the size of, and access to, every object
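The query-parsing step can be approximated with a short sketch that counts which tables appear together in captured SQL. The regular expression and the QUERY_LOG sample are assumptions; a real analyzer would use a full SQL parser and would also handle subqueries and comma-style joins.

```python
import re
from collections import Counter
from itertools import combinations

# Simplified assumption: a table name follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def joined_table_pairs(queries):
    """Count how often each pair of tables appears together in one query."""
    pairs = Counter()
    for sql in queries:
        tables = sorted({name.lower() for name in TABLE_REF.findall(sql)})
        pairs.update(combinations(tables, 2))
    return pairs


# Hypothetical captured SQL traffic.
QUERY_LOG = [
    "SELECT * FROM users u JOIN user_photos p ON p.user_id = u.id",
    "SELECT * FROM user_photos p JOIN users u ON u.id = p.user_id WHERE u.id = 7",
    "SELECT * FROM users WHERE id = 7",
]
pairs = joined_table_pairs(QUERY_LOG)
```

Pairs that are joined often are natural candidates to be distributed together, for example as a root table and its cascaded child.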
Analyzing Relationships: From Chaos to Order
Understanding and mapping complex relationships
ScaleBase Genie Demo
MySQL Visual Analysis Demo
• Visual analysis
• Distribution policy identification and configuration
• Scale out load via data sharding (massive scale out)
ScaleBase Enterprise
Analysis Genie
Summary
Reading Plus
Who:
• Online education company
Problem:
• Busy season (back-to-school) was approaching and they needed a solution that could be quickly implemented, while guaranteeing uptime
• With increasing growth, they needed to implement a scale out solution quickly
Alternatives Considered:
• A clustering technology, which proved to be infeasible due to schema complexity and a lengthy re-architecture requirement
Solution:
• Used visual analysis to determine best scale out plan
• ScaleBase Lite for instant scale out and continuous availability
• 35 Tomcat application servers were connected to 3 ScaleBase controllers
• ScaleBase performed automated read/write splitting and load balancing
Next Gen SaaS ERP Company
Who:
• Inventory management ecommerce company
• Hosted on Rackspace (ScaleBase Partner)
Problem:
• Largest available hardware could not support workload
Alternatives Considered:
• Initially went with a “black box” solution, encountering many issues
Solution:
• Scaled out a single MySQL instance to 8 clustered shards
• On-demand growth – current workload over 20,000 TPS
– Plan to double footprint in next quarter
– Support all production customers during Black Friday & Cyber Monday
ScaleBase Distributed Database Management System
Scale out to unlimited users
Continuous availability
Dynamic workload optimization
Fast and simple deployment
Easily scale out a single MySQL instance
Optimized for the Cloud
Reduces time-to-market
No changes needed to app or database
Database usage analytics
Intelligent load balancing
Centralized data management
Products and Editions
Community – Limited by Deployment
Startup – Free for Qualified Candidates
Lite – Quick Scale Out
Enterprise – Massive Scale Out
Analysis Genie – Database Performance Analytics
Also available on: [platform logos]
How Can I Learn More?
Use visual analysis to plan your scale out strategy
Download the Analysis Genie: https://www.scalebase.com/software
Read the 451 report about ScaleBase (& the DB market)
Download Jason’s Report (authored last week): https://www.scalebase.com/resources/whitepapers
Questions?
Contact Info:
Paul Campaniello
Vladi Vexler
Resources:
www.scalebase.com
www.scalebase.com/resources
www.scalebase.com/blog
(617) 630.2800