10 ways to stumble with big data
10 ways to stumble with big data
2017-09-14, Lars Albertsson
www.mapflat.com
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test + debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
Data-centric systems, 1st generation
● The monolith
  ○ All data in one place
  ○ Analytics + online serving from a single database
[Diagram: presentation, logic, and storage tiers backed by a single DB]
Data-centric systems, 2nd generation
● Collect aggregated data from multiple online systems to a data warehouse
● Aggregate to OLAP cubes
● Analytics focused
[Diagram: services and a web application feeding daily aggregates into a data warehouse]
3rd generation - event oriented
[Diagram: ETL into cluster storage (data lake); datasets, jobs, and pipelines feed AI features, data-driven product development, and analytics]
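To make the dataset / job / pipeline vocabulary above concrete, here is a minimal sketch in plain Python. All names, paths, and the partition date are hypothetical illustrations, not the speaker's actual code: each job reads one dataset from the lake and writes a new, dated dataset, never mutating its input.

```python
import json
from pathlib import Path

def extract(raw_events, lake_dir):
    """Job: write raw events verbatim to a dated dataset in the lake."""
    out = Path(lake_dir) / "events" / "2017-09-14" / "part-0.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(json.dumps(e) for e in raw_events))
    return out

def transform(in_path, lake_dir):
    """Job: derive a cleaned dataset from the raw one; input stays untouched."""
    events = [json.loads(l) for l in Path(in_path).read_text().splitlines()]
    cleaned = [e for e in events if "user_id" in e]  # toy cleaning rule
    out = Path(lake_dir) / "clean_events" / "2017-09-14" / "part-0.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(json.dumps(e) for e in cleaned))
    return out
```

Chaining such jobs, each producing a new immutable dataset, is what the slides call a pipeline.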
Why bother?
● Development iteration speed
● Data-driven development
● Machine learning features
● Democratised data access
1 - Spending-driven development
● Large spending before value delivery
● Vendors want you to make this mistake
Warning signs:
● No workflow orchestration tool
● Driven by infrastructure department
● Project named “data lake” or “data platform”
● High trust in vendor
2 - Premature scaling
● You don’t have big data!
● Max cloud instance memory: 2 TB
● Does your data
  ○ fit?
  ○ grow faster than Moore’s law?
● Scale out only when needed
● Big data → lean data
  ○ Time-efficient data handling
  ○ Democratised data
  ○ Complex business logic
  ○ Human fault tolerance
  ○ Data agility
Warning signs:
● Funky databases
● In-memory technology
● Daily work requires cluster
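The fit-in-memory question above is simple compound-growth arithmetic. A hedged sketch (the function name and the example numbers are illustrative, not from the talk): given current data volume and annual growth rate, estimate how many years remain before a single large instance no longer suffices.

```python
import math

def years_until_exceeds(current_tb, annual_growth, limit_tb=2.0):
    """Years until a dataset growing at `annual_growth` (0.5 = +50%/year)
    outgrows `limit_tb` of single-instance memory (2 TB per the slides).
    Compound growth: current_tb * (1 + g)^t = limit_tb, solved for t."""
    if current_tb >= limit_tb:
        return 0.0
    return math.log(limit_tb / current_tb) / math.log(1 + annual_growth)
```

For example, 100 GB growing 50% per year still fits in a 2 TB instance for roughly seven more years; and if your data grows slower than the instance limit does (Moore's-law-style doubling every couple of years), you may never need to scale out at all.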
3 - The data waterfall
● Handovers add latency
● Low product agility
Warning signs:
● High time to delivery
● Unclear use cases
● Many teams from source to end
● No workflow orchestration tool
● Mono-functional teams
Right turn: Feature-driven teams & infrastructure
● Cross-functional teams own specific feature
● Path from source data to end user service
● Start out with workflow orchestration
● Self-service infrastructure added lazily
● Postpone clusters & investments
● End-to-end proof of concepts
● Team that owns data exports to lake
● Team needing data imports to lake
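What "start out with workflow orchestration" buys you can be sketched in a few lines. This is a toy stand-in, not a real orchestration API (tools like Luigi or Airflow follow the same idea): a task is considered done when its output exists, so rerunning a pipeline resumes exactly where it failed instead of redoing finished work.

```python
from pathlib import Path

class Task:
    """Toy orchestration task: done iff its output file exists.
    Illustrative only; real tools add scheduling, retries, and UIs."""
    def __init__(self, name, deps, run_fn, out_dir):
        self.name = name
        self.deps = deps          # upstream Task objects
        self.run_fn = run_fn      # the actual job logic
        self.output = Path(out_dir) / f"{name}.done"

    def complete(self):
        return self.output.exists()

    def run(self):
        for dep in self.deps:     # ensure upstream datasets exist first
            if not dep.complete():
                dep.run()
        if not self.complete():   # idempotent: skip already-finished work
            self.run_fn()
            self.output.write_text("ok")
```

Because completion is derived from persisted outputs rather than in-process state, a crashed run needs no manual cleanup: just run the terminal task again.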
4 - Lake of trash
Warning signs:
● Excessive time spent cleaning
● Data feature teams access production data
● Data quality & semantics issues
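One common countermeasure to a lake of trash is a quality gate at the point where a team exports data to the lake. A hedged sketch (function and field names are hypothetical): measure how many records violate the dataset's contract before publishing, so consumers downstream are not the ones discovering the mess.

```python
def quality_report(records, required_fields):
    """Count records missing any required field; a publishing job could
    refuse to write the dataset if missing_ratio exceeds a threshold."""
    missing = sum(1 for r in records
                  if any(f not in r for f in required_fields))
    total = len(records)
    return {"total": total,
            "missing": missing,
            "missing_ratio": missing / total if total else 0.0}
```

This keeps the data-owning team responsible for quality, rather than letting every consumer re-clean the same raw production data.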
5 - Random walk
● Many iterative steps without a target vision
● Works fine for months. Pain then increases gradually.
● Difficult to be GDPR compliant.
Warning signs:
● Autonomous / microservice culture
● Little technology governance
● No plan for schemas, deployment, privacy
● Wide changes difficult
6 - Distinct crawl
● Batch data pipelines are forgiving
  ○ Workflow orchestration tool for recovery
● Many practices are cargo rituals
  ○ Release management
  ○ In situ testing
  ○ Performance testing
● Start minimal & quick
  ○ Developer integration tests
  ○ Continuous deployment pipeline
● Add process iff pain
Warning signs:
● Enterprise culture
● Heavy practice governance
● Standard rituals applied
● Late first delivery
7 - Data loss by design
Warning signs:
● Processing during data ingestion
● Unclear source of truth
● Mutable master data
[Diagram: store every event + immutable data + reproducible execution → large recovery buffers, human error tolerance, component error tolerance, rapid iteration speed, eliminate manual precautions]
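The store-every-event / immutable-data idea above can be sketched in a few lines. This is an illustrative toy (an account balance as the derived state), not a production event store: events are only ever appended, and any derived value can be rebuilt by replaying the log, which is what makes execution reproducible and human errors recoverable.

```python
import json
from pathlib import Path

def append_event(log_path, event):
    """Store every event: the log is append-only, never updated in place."""
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

def replay_balance(log_path):
    """Reproducible execution: derived state is rebuilt from the immutable
    log, so a buggy aggregation can simply be fixed and replayed."""
    balance = 0
    for line in Path(log_path).read_text().splitlines():
        balance += json.loads(line)["amount"]
    return balance
```

Contrast this with mutable master data: once a record is overwritten during ingestion, no amount of reprocessing can recover what was lost.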
8 - AI first
● You can climb, not jump
● PoCs are possible
[Pyramid diagram, top to bottom, with value and effort axes:
AI / Deep learning
A/B testing / Machine learning
Analytics / Segments
Curation / Anomaly detection
Data infrastructure / Pipelines
Instrumentation / Data collection]
Credits: “The data science hierarchy of needs”, Monica Rogati
9 - Technical bankruptcy
● Data pipeline == software product
● Apply common best practices
  ○ Quality tools & processes
  ○ Automated (integration) testing
  ○ CI/CD
  ○ Refactoring
● Avoid tools that steer you away
  ○ Local execution?
  ○ Difficult testing?
  ○ Mocks required?
● Strong software engineers needed
  ○ Rotate if necessary
Warning signs:
● Heterogeneous environment
● Weak release process
● Few code quality tools
● Excessive time on operations
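The "automated (integration) testing" and "local execution" points above combine naturally: if a pipeline job is plain code, a developer can run it on a tiny fixture without a cluster. A hedged sketch with a hypothetical sessionization job; the logic and names are illustrative, not from the talk.

```python
def sessionize(events):
    """Toy pipeline job under test: group actions per user."""
    sessions = {}
    for e in events:
        sessions.setdefault(e["user_id"], []).append(e["action"])
    return sessions

def test_sessionize():
    """Developer integration test: run the job locally on a small,
    hand-written fixture and assert on the complete output."""
    events = [{"user_id": "a", "action": "play"},
              {"user_id": "b", "action": "pause"},
              {"user_id": "a", "action": "stop"}]
    assert sessionize(events) == {"a": ["play", "stop"], "b": ["pause"]}
```

Tools that require a live cluster, heavy mocks, or non-local execution to test this kind of logic are exactly the ones the slide warns will steer you toward technical bankruptcy.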
10 - Team trinity unbalance
● Team sport
● Mutual respect & learning
● Be driven by
  ○ user value
● Balance with
  ○ innovation
  ○ engineering
[Diagram: trinity of data engineer, data scientist, and product owner; unbalance risks: increasing tech debt, little innovation, low business value]
11 - Miss the train
● Big data + AI is not optional
● C.f. Internet, smartphones, …
● Product development speed impact is significant
● Data-driven evaluation
● Forgiving environment - move fast without breaking things
● Democratised access to data