How to provide a reliable ridesharing service - USENIX
![Page 1: How to provide a reliable ridesharing service - USENIX · How to provide a reliable ridesharing service DiDiChuxingcompanyservicereliabilityassuranceteam MingHua:huaming@didichuxing.com](https://reader030.fdocuments.us/reader030/viewer/2022020215/5b61bbed7f8b9a09498cbeab/html5/thumbnails/1.jpg)
How to provide a reliable ridesharing service
DiDi Chuxing company service reliability assurance team
Ming Hua: huaming@didichuxing.com
Lin Tan: [email protected]
About Speakers
• Ming Hua
  ⚬ Principal Architect, SRE team, DiDi Chuxing, China
  ⚬ Areas of focus: service reliability, operation automation
  ⚬ Email: huaming@didichuxing.com
• Lin Tan
  ⚬ Senior Engineer, SRE team, DiDi Chuxing, China
  ⚬ Areas of focus: service reliability, container and cloud platform
  ⚬ Email: [email protected]
Agenda
Introduction of DiDi Chuxing
The challenges of building reliability in a rapidly growing service
Our technical solutions for service reliability
Our competition mechanism for reliability work
What we learned
Introduction of DiDi Chuxing
• Founded in 2012 in China
• About 5 years of rapid growth
  ⚬ One of the fastest-growing companies in the world
• The world's leading mobile transportation platform
  ⚬ 400+ million users
  ⚬ 17+ million drivers
  ⚬ 400+ cities
• DiDi Chuxing services
  ⚬ Taxi
  ⚬ Premier car
  ⚬ Express
  ⚬ Hitch
  ⚬ Chauffeur
  ⚬ Minibus
  ⚬ Bus
  ⚬ Test Drive
  ⚬ Enterprise
  ⚬ Car Rental
Challenges of service reliability in DiDi Chuxing
Challenges of service stability construction in the early stage:
⚬ System overload: 500% annual growth in requests
⚬ Release risk: more than 400 releases per day without a standard process or environment
⚬ Vulnerable infrastructure: single-cluster architecture
⚬ Lack of stability improvement: no measurable mechanism
Our Solutions
• Full-link stress test: a ridesharing-specific stress test for evaluating system capacity and locating bottlenecks.
• Standardization: standard practices for configuration management, service monitoring, and program delivery.
• Multi-cluster service: a high-availability system deployed across different regions, with a load-balancing strategy based on cities.
• Competition mechanism: a dedicated mechanism for system reliability and all the related work of the teams.
Full-link stress test
• Purpose:
  ⚬ Service capacity evaluation
  ⚬ Bottleneck location
• Challenges:
  ⚬ Data separation
  ⚬ Data accuracy
• Strategy:
  ⚬ Simulation
  ⚬ Online test
Stress test - strategies for simulation
• A virtual copy of China located in the Pacific Ocean
• Virtual passengers send virtual orders
• Virtual drivers take virtual orders

Virtual country = original coordinates + offset + map information
Virtual passenger/driver = original ID + offset + passenger/driver information (some attributes are also offset)
Stress test - virtual data logic isolation
| Virtual data | Construction | Comment |
| --- | --- | --- |
| City ID mapping | ID + 10000 | Does not overlap with any normal city ID |
| Coordinate mapping | Lng - 230.80078, Lat - 59.63827 | Located in the Pacific Ocean; does not overlap with any normal country |
| Passenger and driver ID | ID + 140737488355328 | uid is 64 bits: role id (16 bits) followed by passenger or driver id (48 bits) |
| Virtual passenger phone number | 11100020000-11169999999 | About 68 million |
| Virtual driver phone number | 11170000000-11199999999 | About 30 million |
| Virtual order ID | High 8 bits are non-zero | The normal ID range is enough for millions of years |

Reasonable offsets and ranges make sure virtual data does not overlap with any real data.
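
The offsets listed on this slide can be sketched in Python as follows. The constants are taken from the slide; the function names are hypothetical. The sketch also assumes, for illustration only, that real city IDs stay below 10000, that real passenger/driver ids stay below 2^47, and that order IDs are 64 bits wide. It is not DiDi's actual code.

```python
from typing import Tuple

# Offsets and ranges taken from the slide; every function name is hypothetical.
CITY_ID_OFFSET = 10_000
LNG_OFFSET = 230.80078
LAT_OFFSET = 59.63827
USER_ID_OFFSET = 140737488355328  # equals 2**47
PHONE_MIN, PHONE_MAX = 11100020000, 11199999999  # passenger + driver ranges


def to_virtual_city(city_id: int) -> int:
    return city_id + CITY_ID_OFFSET


def to_virtual_coord(lng: float, lat: float) -> Tuple[float, float]:
    # Shifts a point in China to the virtual country in the Pacific Ocean.
    return lng - LNG_OFFSET, lat - LAT_OFFSET


def to_virtual_user_id(uid: int) -> int:
    return uid + USER_ID_OFFSET


def is_virtual_city(city_id: int) -> bool:
    # Assumption: real city IDs are all below the offset.
    return city_id >= CITY_ID_OFFSET


def is_virtual_user_id(uid: int) -> bool:
    # Assumption: the low 48 bits hold the passenger/driver id and
    # real ids never reach 2**47.
    return (uid & ((1 << 48) - 1)) >= USER_ID_OFFSET


def is_virtual_phone(phone: int) -> bool:
    return PHONE_MIN <= phone <= PHONE_MAX


def is_virtual_order_id(order_id: int) -> bool:
    # Assumption: order IDs are 64-bit; virtual ones have a non-zero top byte.
    return (order_id >> 56) != 0
```

Because every mapping is a fixed offset or a reserved range, any service can decide whether a record is virtual from the value alone, without a lookup.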
Stress test - tag for test traffic
HTTP communication: add hint-code=1 to the HTTP header
  didi-header-rid: $traceid;
  didi-header-spanid: $spanid;
  didi-header-hint-code: $hintCode;
  didi-header-hint-content: $hintContent;

Thrift communication: add hintCode to the message structure
  struct trace {
    1: required string logid;
    2: required string caller;
    3: optional string spanid;
    4: optional string srcMethod;
    5: optional i64 hintCode;
    6: optional string hintContent;
  }

Business traffic: distinguish by specific range
  cityid; passengerid; driverid; phone; orderid;

Database request: add a hint tag before the SQL
  /*{"mode":"shadow"}*/ $sql
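
As a minimal sketch of how a client might apply these tags, the Python helpers below add the hint headers to an outgoing HTTP request and prefix the hint comment to a SQL statement. The header names and the hint comment come from the slide; the function names and the choice to carry the hint code as the string "1" are assumptions made for illustration.

```python
from typing import Dict

# Hint comment from the slide, written with plain ASCII quotes.
SHADOW_SQL_HINT = '/*{"mode":"shadow"}*/'


def tag_http_headers(headers: Dict[str, str], trace_id: str, span_id: str,
                     hint_code: int = 1, hint_content: str = "") -> Dict[str, str]:
    """Return a copy of the headers with the trace and hint fields added."""
    tagged = dict(headers)
    tagged.update({
        "didi-header-rid": trace_id,
        "didi-header-spanid": span_id,
        "didi-header-hint-code": str(hint_code),
        "didi-header-hint-content": hint_content,
    })
    return tagged


def is_test_request(headers: Dict[str, str]) -> bool:
    """A request counts as stress-test traffic when hint-code is 1."""
    return headers.get("didi-header-hint-code") == "1"


def tag_sql(sql: str) -> str:
    """Prefix the shadow hint so a proxy can recognize test statements."""
    return f"{SHADOW_SQL_HINT} {sql}"
```

Each hop copies the hint from the incoming request to its outgoing calls, so the tag follows the test order through the whole link.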
Stress test - data storage and cleaning
Different strategies for different storage types:
• Persistent storage (MySQL/rockstable): add a hint before the SQL; dbproxy detects it and routes the statement
• Cache storage (Codis): data is cleaned automatically by setting a short TTL (less than 30 minutes)
• Queue (Kafka/beanstalkd): write to a shadow tube/topic, e.g. tube -> tube_shadow
• Log (business log): add a hint tag to the program log, e.g. hint=1;

How dbproxy routes business SQL:
• Select * from order -> original table (order)
• /*{"mode":"shadow"}*/ Select * from order -> shadow table (order_shadow)
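
The routing step on this slide can be sketched as below. This is an illustration only: the slide does not describe how dbproxy is implemented, so the regex-based table rewrite and the function names are assumptions, and a real proxy would parse the SQL properly. Only the hint format and the `_shadow` suffix come from the slide.

```python
import json
import re

# Leading hint comment such as /*{"mode":"shadow"}*/
HINT_RE = re.compile(r"^\s*/\*(\{.*?\})\*/\s*")
# Very rough table finder; enough for simple single-table statements.
TABLE_RE = re.compile(r"\b(from|into|update)\s+([A-Za-z_][A-Za-z0-9_]*)",
                      re.IGNORECASE)


def route_sql(sql: str) -> str:
    """Rewrite a hinted statement to use shadow tables; pass others through."""
    match = HINT_RE.match(sql)
    if not match:
        return sql
    try:
        hint = json.loads(match.group(1))
    except json.JSONDecodeError:
        return sql
    body = sql[match.end():]
    if not isinstance(hint, dict) or hint.get("mode") != "shadow":
        return body
    return TABLE_RE.sub(lambda t: f"{t.group(1)} {t.group(2)}_shadow", body)


def shadow_queue_name(name: str) -> str:
    """Queue counterpart of the same idea: tube -> tube_shadow."""
    return f"{name}_shadow"
```

Keeping the decision in the proxy means business code only has to add the hint; it never needs to know the shadow table names.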
Stress test - monitor support
The monitor system receives test data from three sources and keeps it apart from real data:
• Log: distinguish by tag
• Metrics: programs report under a different tag
• DB: read from the shadow table
Stress test - traffic generation
• Virtual customer orders: tens of thousands per minute
• Simultaneous online drivers: several million
• The stress test runs online, during the low-peak period of the business
Standardization
• The benefits of a standardized environment
  ⚬ Standardization supports both stability and efficiency
• Why standardization is necessary at DiDi (3H)
  ⚬ Hard to manage service relationships
  ⚬ Hard to troubleshoot
  ⚬ High risk of misoperation and delivery accidents
Our standardization work
Focus on three fields: configuration, monitor, delivery
Standardization for Configuration
Connection relationship management standardization - disf
• Rely on IP: A reaches B directly through IPs, host names, or intranet domains; hard to manage and change
• Rely on NS (disf): B registers with the name service, and A registers and queries it for the list of B instances; centralized management and easy to change
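
The register/query pattern on this slide can be illustrated with a small in-memory name service. This is a hypothetical sketch of the general idea, not the disf API; the class name, method names, and endpoint strings are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


class NameService:
    """Toy in-memory registry: services register endpoints, callers query them."""

    def __init__(self) -> None:
        self._endpoints: Dict[str, List[str]] = defaultdict(list)

    def register(self, service: str, endpoint: str) -> None:
        if endpoint not in self._endpoints[service]:
            self._endpoints[service].append(endpoint)

    def deregister(self, service: str, endpoint: str) -> None:
        current = self._endpoints.get(service, [])
        if endpoint in current:
            current.remove(endpoint)

    def query(self, service: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry.
        return list(self._endpoints.get(service, []))


ns = NameService()
ns.register("B", "10.0.0.1:8000")  # B instances register themselves
ns.register("B", "10.0.0.2:8000")
b_instances = ns.query("B")        # A asks for the current B list
```

When a B instance moves, only its registration changes; A keeps calling `query("B")` and needs no configuration edit.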
Standardization for Monitor
Principle: all programs include metrics-lib and report their own status
• Programs A, B, C, and D each embed metrics-lib
• Their reports are collected into the monitor dashboard (odin/open-falcon)
• Every business has a health dashboard
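
To make the principle concrete, here is a minimal sketch of what an embedded metrics library could look like. The slides do not show the metrics-lib interface, so the class, its methods, and the "normal"/"shadow" tag values are hypothetical; the separate tag for test traffic mirrors the earlier point that programs report stress-test data under a different tag.

```python
from collections import Counter
from typing import Dict, List, Tuple


class Metrics:
    """Toy stand-in for an embedded metrics library (names are hypothetical)."""

    def __init__(self, service: str) -> None:
        self.service = service
        self._counters: Counter = Counter()

    def incr(self, name: str, value: int = 1, test_traffic: bool = False) -> None:
        # Stress-test traffic is counted under its own tag.
        tag = "shadow" if test_traffic else "normal"
        self._counters[(name, tag)] += value

    def report(self) -> List[Dict[str, object]]:
        """Snapshot that a collector could scrape and send to a dashboard."""
        items: List[Tuple[Tuple[str, str], int]] = sorted(self._counters.items())
        return [
            {"service": self.service, "metric": name, "tag": tag, "value": value}
            for (name, tag), value in items
        ]


m = Metrics("A")
m.incr("order.create")
m.incr("order.create")
m.incr("order.create", test_traffic=True)
snapshot = m.report()
```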
Standardization for Delivery
Manual-dependent => fully automatic workflow:
⚬ blue/green deployment
⚬ a necessary pause at every step
⚬ standard deployment path and backup
⚬ standard program control interface: control.sh start|stop|reload

Pipeline stages (the diagram is labeled both "Semi-automatic" and "Fully-automatic workflow"): launch, compile, build, preview release, gray release, full release, monitor
A module description file in JSON goes through the whole flow.
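
A staged workflow with a pause before every step could be sketched like this. The stage names follow the slide; the JSON field, the callback signatures, and the function name are assumptions for illustration, since the slides do not show the module description format or the real tooling.

```python
import json
from typing import Callable, List

# Stage names as listed on the slide.
STAGES = ["launch", "compile", "build",
          "preview release", "gray release", "full release"]


def run_pipeline(module_json: str,
                 run_stage: Callable[[str, dict], None],
                 confirm: Callable[[str, dict], bool]) -> List[str]:
    """Run stages in order, pausing for confirmation before each one.

    The same parsed module description is handed to every stage.
    Returns the stages that were actually completed.
    """
    module = json.loads(module_json)
    completed: List[str] = []
    for stage in STAGES:
        if not confirm(stage, module):  # the "necessary pause"
            break
        run_stage(stage, module)
        completed.append(stage)
    return completed


# Hypothetical description; only "name" is used by this example.
description = '{"name": "demo-module"}'
log: List[str] = []
done = run_pipeline(
    description,
    run_stage=lambda stage, module: log.append(f"{module['name']}: {stage}"),
    confirm=lambda stage, module: stage != "full release",  # stop before full release
)
```

Stopping at the confirmation step leaves earlier stages applied, which is where a standard backup path and blue/green deployment make rollback practical.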
Result
• Frequent service crashes caused by misoperation and delivery have been brought well under control.
• Accident alerting and fault location take less time.
![Page 22: How to provide a reliable ridesharing service - USENIX · How to provide a reliable ridesharing service DiDiChuxingcompanyservicereliabilityassuranceteam MingHua:huaming@didichuxing.com](https://reader030.fdocuments.us/reader030/viewer/2022020215/5b61bbed7f8b9a09498cbeab/html5/thumbnails/22.jpg)
22
Multi-cluster service
With a single cluster, recovery often takes a long time even when only a slight fault occurs.
• All requests from 400+ cities go to one cluster
• There is no backup IDC for traffic switching
• Failure types: IDC TOR failures, IDC network failures, service failures, other failures
• Services in the cluster: sign in, sign up, request, respond, dispatch, bid, pay, review
Multi-cluster challenges
• Difficulties of building a multi-cluster ridesharing service:
⚬ Most modules are stateful
⚬ Data must stay consistent
⚬ Both the driver client and the passenger client are location-aware
Multi-cluster implementation
Everything related to a region is kept within one unit/cluster.

Request flow: requests from passengers and drivers in all cities reach the routers, which dispatch them by the city ID carried in the request. Each cluster runs the full set of services (sign in, sign up, request, respond, dispatch, bid, pay, review). The clusters are kept in step by master-slave synchronization, and traffic can be switched between them when IDC TOR failures, IDC network failures, service failures, or other failures occur.

Key solutions:
• All data of drivers, passengers, and others is distinguished by city and split into more than one copy.
• Each cluster holds part of the data plus a copy of the other cluster's data.
• Requests are dispatched by city ID at the request entrance.
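
The city-based dispatch with failover can be sketched as follows. The slide only states that requests are dispatched by city ID and that each cluster holds a copy of the other's data; the class, the cluster names, and the health-tracking logic here are illustrative assumptions.

```python
from typing import Dict, Set


class CityRouter:
    """Route a request to the cluster that owns its city, or to the backup."""

    def __init__(self, home_cluster: Dict[int, str],
                 backup_cluster: Dict[str, str]) -> None:
        self.home_cluster = home_cluster      # city ID -> owning cluster
        self.backup_cluster = backup_cluster  # cluster -> cluster holding its copy
        self.down: Set[str] = set()

    def mark_down(self, cluster: str) -> None:
        self.down.add(cluster)

    def mark_up(self, cluster: str) -> None:
        self.down.discard(cluster)

    def route(self, city_id: int) -> str:
        home = self.home_cluster[city_id]
        if home not in self.down:
            return home
        backup = self.backup_cluster[home]
        if backup in self.down:
            raise RuntimeError(f"no healthy cluster for city {city_id}")
        return backup


# Hypothetical layout: two clusters that back each other up.
router = CityRouter(
    home_cluster={1: "cluster-a", 2: "cluster-b"},
    backup_cluster={"cluster-a": "cluster-b", "cluster-b": "cluster-a"},
)
before = router.route(1)
router.mark_down("cluster-a")
after = router.route(1)
```

Routing on city ID keeps a driver and the passengers near that driver in the same cluster, which is what lets stateful, location-aware modules stay local to one unit.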
Competition Mechanism
The Principal Rule
Our competition mechanism for reliability work
A competition named "Starflower": a mechanism for ensuring that all teams invest in reliability work
Starflower - Competition Rules
Judge: Organizing Committee (OC)
Participants: all the reliability-related teams (across nine business units and the technical departments)
Goal:
• Server reliability:
  ⚬ 2016: 99.9%, downtime per month less than 40 minutes
  ⚬ 2017: 99.95%, downtime per month less than 22 minutes
Rule:
• Downtime: the time during which a core performance indicator (request, respond, pay) was decreased by 10% * the affected orders / the total number of weekly orders
• Quota: the available downtime is assigned to each participant as its reliability quota
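
The arithmetic behind these goals can be checked with a short Python sketch. The availability figures come from the slide; the 30-day month, the function names, and the reading of the downtime rule as "duration multiplied by the share of weekly orders affected" are assumptions.

```python
def monthly_downtime_budget(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime per month allowed at the given availability."""
    minutes_in_month = days * 24 * 60
    return round((100.0 - availability_pct) / 100.0 * minutes_in_month, 2)


def weighted_downtime(duration_min: float, affected_orders: int,
                      weekly_orders: int) -> float:
    """One reading of the rule: outage duration scaled by affected-order share."""
    return duration_min * affected_orders / weekly_orders


def within_quota(used_min: float, quota_min: float) -> bool:
    return used_min <= quota_min


budget_2016 = monthly_downtime_budget(99.9)   # about 43.2 minutes
budget_2017 = monthly_downtime_budget(99.95)  # about 21.6 minutes
```

Under the 30-day assumption, 99.95% allows about 21.6 minutes per month, in line with the 22-minute figure, and 99.9% allows about 43.2 minutes, so the 40-minute target for 2016 is slightly stricter than the percentage alone.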
Starflower - exception handling
• Core indicators monitoring: phone call in urgent cases
• Firemen Group: communication channel
• Fire Map
Starflower - case follow
Case study:
• Case review
• Issue analysis
• Improvement analysis
• Duty review
Starflower - Rewards and punishments
Failure level:
• P1-P5, according to the severity of the failure
Punishment:
• 10,000 RMB (about $1,400) penalty for a P1
Summary conference every month:
• Teams that kept within their quota and did well in reliability are rewarded with money and honorary titles
• Teams that exceeded their quota are punished with fines and negative titles
• The CEO, CTO, or other VPs act as presenters
Result and the key points
All the teams and the whole technical department regard stability as the most important work.
Key points:
✓ The judge should not be a competitor
✓ Measurable
✓ Clear rules
✓ Punishment and reward
✓ Continuous follow-up
Results:
✓ 99.9% in 2016
✓ 99.95% in 2017 (up to now)
What we learned
• Collaborative working
• Long-term investment
• The earlier the better
Thanks!