GDC Tutorial, 2005. Building Multi-Player Games
Case Study: The Sims Online, Lessons Learned
Larry Mellon
TSO: Overview
- Initial team: little to no MMP experience
- Engineering estimate: switching from 4-8 player peer to peer to MMP client/server would take no additional development time!
- No code / architecture / tool support for
  - Long-term, continually changing nature of game
  - Non-deterministic execution, dual platform (win32 / Linux)
- Overall process designed for single-player complexity, small development team
  - Limited nightly builds, minimal daily testing
  - Limited design reviews, limited scalability testing, no “maintainable/extensible” impl. requirement
TSO: Case Study Outline (Lessons Learned)
- Poorly designed SP -> MP -> MMP transitions
- Scaling
  - Team & code size, data set size
  - Build & distribution
  - Architecture: logical & code
- Visibility: development & operations
- Testability: development, release, load
- Multi-Player, Non-determinism
- Persistent user data vs code/content updates
- Patching / new content / custom content
Scalability (Team Size & Code Size)
- What were the problems
  - Side effect breaks & ability to work in parallel
  - Limited encapsulation + poor testability + non-determinism = TROUBLE
  - Independent module design & impact on overall system (initially, no system architect)
  - #include structure
    - win32 / Linux, compile times, pre-compiled headers, ...
- What worked
  - Move to new architecture via Refactoring & Scaffolding
    - HSB, incSync, nullView Simulator, nullView client, …
  - Rolling integrations: never dark
  - Sandboxing & pumpkins
Scalability (Build & Distribution)
- To developers, customers & fielded servers
- What didn’t work (well enough)
  - Pulling builds from developer’s workstations
  - Shell scripts & manual publication
- What worked well
  - Heavy automation with web tracking
    - Repeatability, Speed, Visibility
  - Hierarchies of promotion & test
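One way to read "hierarchies of promotion & test" is a ladder of publication levels, where a build only moves up a level after that level's tests pass. The sketch below is a minimal illustration under that assumption; the level names, the `promote` function and the build fields are all hypothetical and are not taken from TSO's build system.

```python
LEVELS = ["dev", "qa", "staging", "live"]  # hypothetical promotion ladder


def promote(build: dict, tests: dict) -> str:
    """Advance a build through LEVELS until one level's tests fail.

    Returns the highest level reached, or "unpublished" if even the
    first level's tests fail. Levels with no tests pass automatically.
    """
    reached = "unpublished"
    for level in LEVELS:
        if not all(test(build) for test in tests.get(level, [])):
            break
        reached = level
    return reached


# Hypothetical test suites, one list of checks per level.
tests = {
    "dev": [lambda b: b["compiles"]],
    "qa": [lambda b: b["smoke_ok"]],
    "staging": [lambda b: b["load_clients"] >= 4000],
}

build = {"compiles": True, "smoke_ok": True, "load_clients": 1200}
level = promote(build, tests)
print(level)  # the load test fails, so the build stops at "qa"
```

Because every promotion decision is a function of recorded test results, the same decision can be repeated and published on a web page, which is the repeatability and visibility the slide asks for.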
Scalability (Architecture)
- Logical versus physical versus code structure
  - Only physical was not a major, MAJOR issue
- Logical: Replicated computing vs client / server
  - Security & stability implications
- Code: Client / server isolation & code sharing
  - Multiple, concurrent logic threads were sharing code & data, each impacting the others
  - Nullview client & simulator
  - Regulators vs Protocols: bug counts & state machines
- Go to final architecture ASAP
[Diagram, Multiplayer: four peers, each a combined Client/Sim. "Here be Sync Hell."]
[Diagram, Client/Server (evolved from Multiplayer): one Sim and three Clients, labeled "Nice Undemocratic Request/Command".]
Final Architecture ASAP: Make Everything Smaller & Separate
Final Architecture ASAP: Reduce Complexity of Branches
[Diagram labels: Packet Arrival; if (client); if (server); #ifdef (nullview); Shared Code; Shared State; Client Event; Server Event; More Packets!!]
Client & server teams would constantly break each other via changes to shared state & code
Final Architecture ASAP: “Refactoring”
- Decomposed into multiple DLLs
  - Found the Simulator
- Interfaces
- Reference Counting
- Client/Server subclassing
How it helped:
- Reduced coupling. Even reduced compile times!
- Developers in different modules broke each other less often.
- We went everywhere and learned the code base.
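As a rough illustration of the client/server subclassing idea, the sketch below moves role-specific work out of an `if (client)` / `if (server)` branch in a shared packet handler and into one subclass per role. It is written in Python for brevity rather than in the game's own language, and every name in it (`PacketHandler`, `on_packet`, `handle`) is hypothetical.

```python
from abc import ABC, abstractmethod


class PacketHandler(ABC):
    """Shared packet-arrival path; role-specific work lives in subclasses."""

    def __init__(self):
        self.log = []

    def on_packet(self, packet: dict) -> None:
        # Shared code: runs identically for every role.
        self.log.append(("received", packet["type"]))
        # Role-specific code: dispatched by type, not by an if/#ifdef branch.
        self.handle(packet)

    @abstractmethod
    def handle(self, packet: dict) -> None:
        ...


class ClientHandler(PacketHandler):
    def handle(self, packet: dict) -> None:
        self.log.append(("client_event", packet["type"]))


class ServerHandler(PacketHandler):
    def handle(self, packet: dict) -> None:
        self.log.append(("server_event", packet["type"]))


client = ClientHandler()
server = ServerHandler()
client.on_packet({"type": "move"})
server.on_packet({"type": "move"})
print(client.log)
print(server.log)
```

With this shape, a change to the client's event handling cannot alter the server's behavior unless it touches the shared base class, which makes cross-team breakage easier to spot in review.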
Final Architecture ASAP: It Had to Always Run
- Initially clients wouldn’t behave predictably
- We could not even play test
- Game design was demoralized
- We needed a bridge, now!
Final Architecture ASAP: Incremental Sync
- A quick temporary solution…
  - Couldn’t wait for final system to be finished
  - High overhead, couldn’t ship it
- We took partial state snapshots on the server and restored to them on the client
How it helped:
- Could finally see the game as it would be.
- Allowed parallel game design and coding
- Bought time to lay in the “right” stuff.
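The snapshot-and-restore idea can be sketched in a few lines. The version below assumes that "partial" means "only the objects changed since the last snapshot", which the slides do not state; `SimState`, `take_partial_snapshot` and `restore` are hypothetical names, not TSO's incSync code.

```python
import copy


class SimState:
    """Toy server state that remembers which keys changed since the last snapshot."""

    def __init__(self):
        self.objects = {}
        self._dirty = set()

    def set(self, key, value):
        self.objects[key] = value
        self._dirty.add(key)

    def take_partial_snapshot(self):
        # Copy only the entries touched since the previous snapshot.
        snap = {k: copy.deepcopy(self.objects[k]) for k in self._dirty}
        self._dirty.clear()
        return snap


def restore(client_objects, snapshot):
    # Overwrite the client's (possibly diverged) values with the server's.
    client_objects.update(copy.deepcopy(snapshot))


server = SimState()
server.set("sim_1", {"x": 1, "y": 2})
server.set("lamp", {"on": False})
client_objects = {}
restore(client_objects, server.take_partial_snapshot())

client_objects["sim_1"]["x"] = 99        # client drifts out of sync
server.set("sim_1", {"x": 5, "y": 2})    # only sim_1 changes on the server
second = server.take_partial_snapshot()
restore(client_objects, second)
print(sorted(second), client_objects == server.objects)
```

Copying state wholesale like this is simple but costs bandwidth and CPU on every sync, which is consistent with the slide's note that the approach had high overhead and could not ship.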
Architecture: Conclusions
- Keep it simple, stupid!
  - Client/server
- Keep it clean
  - DLL/module integration points
  - #ifdef’s must die!
- Keep it alive
  - Plan for a constant system architect role: review all modules for impact on team, other modules & extensibility
  - Expose & control all inter-process communication
    - See Regulators: state machines that control transactions
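The slides describe Regulators only as state machines that control transactions. A minimal sketch of that idea, assuming a single request/reply transaction, might look like the following; the states, event names and the `Regulator` class itself are illustrative assumptions rather than TSO's design.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    AWAITING_REPLY = auto()
    DONE = auto()
    FAILED = auto()


class Regulator:
    """Owns one client/server transaction and rejects out-of-order events."""

    # (current state, event) -> next state
    TRANSITIONS = {
        (State.IDLE, "send_request"): State.AWAITING_REPLY,
        (State.AWAITING_REPLY, "reply_ok"): State.DONE,
        (State.AWAITING_REPLY, "reply_error"): State.FAILED,
        (State.AWAITING_REPLY, "timeout"): State.FAILED,
    }

    def __init__(self):
        self.state = State.IDLE
        self.history = [State.IDLE]

    def on_event(self, event: str) -> State:
        key = (self.state, event)
        if key not in self.TRANSITIONS:
            raise ValueError(f"illegal event {event!r} in state {self.state.name}")
        self.state = self.TRANSITIONS[key]
        self.history.append(self.state)
        return self.state


reg = Regulator()
reg.on_event("send_request")
reg.on_event("reply_ok")
print([s.name for s in reg.history])

try:
    reg.on_event("reply_ok")  # a duplicate reply is rejected, not silently applied
except ValueError as err:
    print(err)
```

Keeping every legal transition in one table makes all inter-process communication for the transaction visible in a single place, and an unexpected message becomes an explicit error instead of a silent state change.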
Visibility
- Problems
  - Debugging a client/server issue was very slow & painful
  - Knowing what to work on next was largely guesswork
  - Reproducing system failures from live environment
  - Knowing how one build or server cluster differed from another was again largely guesswork
- What we did that worked
  - Log / crash aggregators & filters
  - Live “critical event” monitor
  - Esper: live player & engine metrics
  - Repeatable load testing
  - Web-based Dashboard: health, status, where is everything
  - Fully automated build & publish procedures
Visibility via “Bread Crumbs”: Aggregated Instrumentation Flags
[Chart annotations: Trouble Spots; Server Crash]
Quickly Find Trouble Spots
[Chart annotation: DB byte count oscillates out of control, server crashes]
Drill Down For Details
[Chart annotation: A single DB Request is clearly at fault]
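The workflow on these three slides (aggregate flags, spot the unstable one, drill down to a single request) can be sketched as below. This is an illustration only: the `BreadCrumbs` class, the "largest swing between intervals" ranking and all flag and request names are assumptions, and the numbers are made-up toy input rather than TSO measurements.

```python
from collections import Counter, defaultdict


class BreadCrumbs:
    """Aggregates named instrumentation flags into per-interval totals."""

    def __init__(self):
        self.totals = defaultdict(Counter)   # interval -> flag -> total
        self.details = defaultdict(Counter)  # flag -> detail -> total

    def drop(self, interval: int, flag: str, amount: int = 1, detail: str = ""):
        self.totals[interval][flag] += amount
        self.details[flag][detail] += amount

    def trouble_spots(self, top: int = 1):
        """Rank flags by the gap between their largest and smallest interval total."""
        flags = {f for counts in self.totals.values() for f in counts}
        swing = {}
        for flag in flags:
            series = [counts[flag] for counts in self.totals.values()]
            swing[flag] = max(series) - min(series)
        return sorted(swing, key=swing.get, reverse=True)[:top]

    def drill_down(self, flag: str):
        """Return the (detail, total) pair that contributes most to a flag."""
        return self.details[flag].most_common(1)[0]


crumbs = BreadCrumbs()
for interval, db_bytes in enumerate([100, 900, 50, 1200]):  # toy data
    crumbs.drop(interval, "db_bytes", db_bytes, detail="load_lot")
    crumbs.drop(interval, "db_bytes", 10, detail="save_avatar")
    crumbs.drop(interval, "logins", 5)

spots = crumbs.trouble_spots()
worst = crumbs.drill_down("db_bytes")
print(spots, worst)
```

The summary view only needs the per-interval totals, while the per-request breakdown is kept alongside so that a suspicious flag can be traced to one request type without re-running the test.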
Testability
- Development, release, load: all show stopper problems
- QA coordination / speed / cost
- Repeatability, non-determinism
- Need for many, many tests per day, each with multiple inputs (two to two thousand players per test)
Testability: What Worked
- Automated testing for repeatability & scale
  - Scriptable test clients: mirrored actual user play sessions
  - Changed the game’s architecture to increase testability
  - External test harnesses to control 50+ test clients per CPU, 4,000+ per session
  - Push-button UI to configure, run & analyze tests (developer & QA)
  - Constantly updated Baselines, with “Monkey Test” stats
  - Pre-checkin regression
  - QA: web-driven state machine to control testers & collect/publish results
- What didn’t work
  - Event Recorders, unit testing
  - Manual-only testing
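To give a feel for how one process can drive many lightweight test clients, here is a minimal harness sketch using Python's `asyncio`. It is not the TSO harness: the clients are in-memory stand-ins with no networking, and `run_client`, `run_session` and the two script steps are hypothetical.

```python
import asyncio


def login(state):
    state["logged_in"] = True


def chat(state):
    state["chat_sent"] += 1


async def run_client(client_id: int, script) -> tuple:
    """Run one scripted client and return (client_id, passed)."""
    state = {"logged_in": False, "chat_sent": 0}
    for step in script:
        await asyncio.sleep(0)  # yield so the clients interleave
        step(state)
    passed = state["logged_in"] and state["chat_sent"] == 1
    return client_id, passed


async def run_session(num_clients: int) -> dict:
    """Start num_clients scripted clients concurrently and summarize the results."""
    script = [login, chat]
    results = await asyncio.gather(
        *(run_client(i, script) for i in range(num_clients))
    )
    failed = [cid for cid, ok in results if not ok]
    return {"clients": len(results), "failed": failed}


summary = asyncio.run(run_session(50))
print(summary)
```

The harness returns a compact summary rather than raw logs, so a push-button UI or a pre-checkin gate can act on it directly.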
MMP Automated Testing: Approach
- Push-button ability to run large-scale, repeatable tests
- Cost
  - Hardware / Software
  - Human resources
  - Process changes
- Benefit
  - Accurate, repeatable, measurable tests during development and operations
  - Stable software, faster, measurable progress
  - Base key decisions on fact, not opinion
Why Spend The Time & Money?
- System complexity, non-determinism, scale
  - Tests provide hard data in a confusing sea of possibilities
- End users: high Quality of Service bar
- Dev team: greater comfort & confidence
- Tools augment your team’s ability to do their jobs
  - Find problems faster
  - Measure / change / measure: repeat as necessary
- Production & executives: come to depend on this data to a high degree
Scripted Test Clients
- Scripts are emulated play sessions: just like somebody plays the game
  - Command steps: what the player does to the game
  - Validation steps: what the game should do in response
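The split between command steps and validation steps can be shown with a toy script runner. The sketch below is an assumption-laden stand-in: `FakeGame` replaces real client-side game logic, and the step tuple format `(kind, description, function)` is invented for this example.

```python
class FakeGame:
    """Stand-in for client-side game logic (hypothetical)."""

    def __init__(self):
        self.room = None
        self.inventory = []

    def enter_room(self, name):
        self.room = name

    def buy(self, item):
        self.inventory.append(item)


def run_script(game, steps):
    """Run (kind, description, fn) steps; return descriptions of failed validations."""
    failures = []
    for kind, description, fn in steps:
        if kind == "command":
            fn(game)                      # what the player does to the game
        elif kind == "validate":
            if not fn(game):              # what the game should do in response
                failures.append(description)
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return failures


script = [
    ("command", "enter house", lambda g: g.enter_room("house")),
    ("validate", "player is in house", lambda g: g.room == "house"),
    ("command", "buy chair", lambda g: g.buy("chair")),
    ("validate", "chair is in inventory", lambda g: "chair" in g.inventory),
    ("validate", "inventory has two items", lambda g: len(g.inventory) == 2),
]

failures = run_script(FakeGame(), script)
print(failures)
```

The last validation fails on purpose, to show that the runner reports which expectation was not met instead of stopping at the first problem.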
Scripts Tailored To Each Test Application
- Unit testing: 1 feature = 1 script
- Load testing: Representative play session
  - The average Joe, times thousands
- Shipping quality: corner cases, feature completeness
- Integration: test code changes for catastrophic failures
Scripted Players: Implementation
[Diagram: Test Client and Game Client side by side. The Script Engine (test client) and the Game GUI (game client) each pass Commands to, and read State from, the Client-Side Game Logic via the Presentation Layer.]
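The arrangement in this diagram, where a script engine and a GUI drive the same game logic through one narrow interface, can be sketched as follows. All class and method names are hypothetical, and the "game" is reduced to a single position so the two front ends can be compared.

```python
class ClientGameLogic:
    """Toy client-side game logic: one avatar position."""

    def __init__(self):
        self.position = (0, 0)

    def move(self, dx, dy):
        x, y = self.position
        self.position = (x + dx, y + dy)


class PresentationLayer:
    """Narrow interface to the logic: commands go in, state comes out."""

    def __init__(self, logic):
        self._logic = logic

    def command(self, name, *args):
        getattr(self._logic, name)(*args)

    def state(self, name):
        return getattr(self._logic, name)


class GameGUI:
    """Front end used by a human player."""

    KEYS = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

    def __init__(self, layer):
        self.layer = layer

    def on_arrow_key(self, key):
        self.layer.command("move", *self.KEYS[key])


class ScriptEngine:
    """Front end used by a test client: each line is 'command arg arg ...'."""

    def __init__(self, layer):
        self.layer = layer

    def run(self, lines):
        for line in lines:
            name, *args = line.split()
            self.layer.command(name, *(int(a) for a in args))


gui_layer = PresentationLayer(ClientGameLogic())
gui = GameGUI(gui_layer)
gui.on_arrow_key("right")
gui.on_arrow_key("up")

script_layer = PresentationLayer(ClientGameLogic())
ScriptEngine(script_layer).run(["move 1 0", "move 0 1"])

print(gui_layer.state("position"), script_layer.state("position"))
```

Because both front ends go through the same layer, a scripted run exercises the same game logic a player would, without needing any rendering.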
Process Shift: Earlier Tools Investment Equals More Gain
[Chart: MMP Developer Efficiency. Axes: Time (Project Start to Target Launch) vs Amount of work done. Curves: Strong test support, Weak test support. Annotation: Not Good Enough.]
Process Shifts: Automated Testing Changes The Shape Of The Development Progress Curve
Scale & Feature Completeness
Keep Developers moving forward, not bailing water
Stability (Code Base & Servers)
Focus Developers on key, measurable roadblocks
Process Shift: Measurable Targets, Projected Trend Lines
[Chart: Core Functionality Tests, Any Feature (e.g. # clients) plotted against Time. Marked points: First Passing Test, Now, Target Complete at Any Time (e.g. Alpha).]
Actionable progress metrics, early enough to react
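A projected trend line can be as simple as a straight-line fit through the measurements taken so far, extended until it crosses the target. The sketch below assumes linear progress, which real projects rarely deliver, and uses made-up sample numbers; `projected_day` is a hypothetical helper, not a tool from the talk.

```python
def projected_day(samples, target):
    """Fit a least-squares line to (day, value) samples.

    Returns the day on which the fitted line reaches target, or None if
    the trend is flat or falling. Requires samples from at least two
    different days.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    return (target - intercept) / slope


# Toy data: (day since first passing test, clients supported per server).
samples = [(0, 100), (10, 200), (20, 300)]
day = projected_day(samples, target=1000)
alpha_day = 60  # hypothetical milestone
print(day, "on track" if day <= alpha_day else "behind schedule")
```

With these numbers the fitted line gains 10 clients per day from a base of 100, so the target of 1000 falls on day 90, well after the hypothetical day-60 milestone. That gap is the kind of early, actionable signal the slide describes.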
Process Shift: Load Testing (Before Paying Customers Show Up)
Expose issues that only occur at scale
Establish hardware requirements
Establish play is acceptable @ scale
Client-Server Comparison
User Data
- Oops!
  - Users stored much more data (with much more variance) than we had planned for
  - Caused many DB failures, city failures
  - BIG problem: their persistent data has to work, always, across all builds & DB instances
- What helped
  - Regression testing, each build, against live set of user data
- What would have helped more
  - Sanity checks against the DB
  - Range checks against user data
  - Better code & architecture support for validation of user data
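A range check against user data can be a small table of limits applied to every record before it is loaded or saved. The sketch below is illustrative only: the field names and limits in `LIMITS` are invented for the example and do not describe TSO's actual schema.

```python
LIMITS = {  # hypothetical fields and limits
    "objects_in_lot": (0, 500),
    "simoleans": (0, 10_000_000),
    "friends": (0, 1000),
}


def check_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    for field, (low, high) in LIMITS.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{field}: expected int, got {type(value).__name__}")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems


good = {"objects_in_lot": 120, "simoleans": 4500, "friends": 12}
bad = {"objects_in_lot": 9000, "simoleans": "lots"}

print(check_record(good))
for problem in check_record(bad):
    print(problem)
```

Running a check like this over a copy of live user data on every build would combine the regression testing that helped with the range checks the slide wishes had been in place.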
Patching / New Content / Custom Content
- Oops!
  - Initial Patch budget of 1Meg blown in 1st week of operations
  - New Content required stronger, more predictable process
  - Custom Content required infrastructure able to easily add new content, on the fly
- Key Issue: all effort had gone into going Live, not creating a sustainable process once Live
- Conclusion: designing these in would have been much easier than retrofitting…
Lessons Learned
- autoTest: Scripted test clients and instrumented code rock!
  - Collection, aggregation and display of test data is vital in making decisions on a day to day basis
- Lessen the panic
  - Scale&Break is a very clarifying experience
  - Stable code & servers greatly ease the pain of building a MMP game
  - Hard data (not opinion) is both illuminating and calming
- autoBuild: make it pushbutton with instant web visibility
  - Use early, use often to get bugs out before going live
- Budget for a strong architect role & a strong design review process for the entire game lifecycle
  - Scalability, testability, patching & new content & long-term persistence are requirements: MUCH cheaper to design in than frantic retrofitting
  - KISS principle is mandatory, as is expecting changes
Lessons Learned
- Visibility: tremendous volumes of data require automated collection & summarization
  - Provide drill-down access to details from summary view web pages
- Get some people on board who’ve been burned before: a lot of TSO’s pain could have been easily avoided, but little distributed system experience & MMP design issues existed in early phases of project
- Fred Brooks, the 31st programmer
  - Strong tools & process pays off for large teams & long-term operations
  - Measure & improve your workspace, constantly
- Non-determinism is painful & unavoidable
  - Minimize impact via explicit design support & use strong, constant calibration to understand it
Biggest Wins
Code Isolation
Scaffolding
Tools: Build / Test / Measure, Information Management
Pre-Checkin Regression / Load Testing
Biggest Losses
Architecture: Massively peer to peer
Early lack of tools
#ifdef across platform / function
“Critical Path” dependencies
More Details: www.maggotranch.com/MMP (3 TSO Lessons Learned talks)