Automated Testing of Massively Multi-Player Games
Lessons Learned from The Sims Online
Larry Mellon
Spring 2003
Context: What Is Automated Testing?
Classes Of Testing
System Stress
Load
Random Input
Feature Regression
Developer
QA
Automation Components
[Diagram: Startup & Control → System Under Test → Collection & Analysis, driven by Repeatable, Sync’ed Test Inputs.]
What Was Not Automated?
Startup & Control
Repeatable, Synchronized Inputs
Results Analysis
Visual Effects
Lessons Learned: Automated Testing
Time (60 Minutes)
1/3 – Design & Initial Implementation: Architecture, Scripting Tests, Test Client, Initial Results
1/3 – Fielding: Analysis & Adaptations
1/3 – Wrap-up & Questions: What worked best, what didn’t; Tabula Rasa: MMP / SPG
Design Constraints
Load
Regression
Churn Rate
Automation (Repeatable, Synchronized Input)
(Data Management)
Strong Abstraction
Test Client
Single, Data Driven Test Client
Regression / Load
Single API
Reusable Scripts & Data
Configurable Logs & Metrics
Key Game States: Pass/Fail
Responsiveness
“Testing feature correctness” vs. “Testing system performance”
Problem: Testing Accuracy
• Load & Regression: inputs must be
– Accurate
– Repeatable
• Churn rate: logic/data in constant motion
– How to keep the test client accurate?
• Solution: game client becomes test client
– Exact mimicry
– Lower maintenance costs
Test Client == Game Client
[Diagram: Test Control and the Game GUI are interchangeable front ends; each sends Commands and reads State through the Presentation Layer into the same Client-Side Game Logic.]
Game Client: How Much To Keep?
Game Client
View
Logic
Presentation Layer
What Level To Test At?
Game Client: Mouse Clicks → Presentation Layer
Regression: Too Brittle (pixel shift)
Load: Too Bulky
View
Logic
What Level To Test At?
Game Client
Internal Events
Presentation Layer
Regression: Too Brittle (Churn Rate vs Logic & Data)
View
Logic
Gameplay: Semantic Abstractions
NullView Client / View Client
Logic / Presentation Layer
Semantic actions: Buy Lot, Enter Lot, Use Object, …, Buy Object
~ ¾ / ~ ¼ split
Basic gameplay changes less frequently than UI or protocol implementations.
Scriptable User Play Sessions
• SimScript
– Collection: Presentation Layer “primitives”
– Synchronization: wait_until, remote_command
– State probes: arbitrary game state
• Avatar’s body skill, lamp on/off, …
• Test Scripts: Specific / ordered inputs
– Single user play session
– Multiple user play session
Scriptable User Play Sessions
• Scriptable play sessions: big win
– Load: tunable based on actual play
– Regression: constantly repeat hundreds of play sessions, validating correctness
• Gameplay semantics: very stable
– UI / protocols shifted constantly
– Game play remained (about) the same
SimScript: Abstract User Actions
include_script setup_for_test.txt
enter_lot $alpha_chimp
wait_until game_state inlot
chat I’m an Alpha Chimp, in a Lot.
log_message Testing object purchase.
log_objects
buy_object chair 10 10
log_objects
SimScript: Control & Sync
# Have a remote client use the chair
remote_cmd $monkey_bot use_object chair sit
set_data avatar reading_skill 80
set_data book unlock
use_object book read
wait_until avatar reading_skill 100
set_recording on
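The slides do not show how SimScript is executed. As a rough illustration only, a dispatcher for this style of one-command-per-line script can be sketched in Python. Everything below (the ScriptRunner class, the fake state dict, the handler stubs) is hypothetical; the real SimScript bound these commands to the Presentation Layer, and wait_until actually blocked rather than asserting.

```python
class ScriptRunner:
    """Parses one command per line and dispatches to a handler table."""

    def __init__(self, handlers):
        self.handlers = handlers  # command name -> callable(*args)

    def run(self, script):
        for line in script.strip().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            name, *args = line.split()
            self.handlers[name](*args)  # unknown command -> KeyError

# Fake presentation-layer bindings standing in for the real game client.
state = {"game_state": "lobby", "objects": []}

def enter_lot(avatar):
    state["game_state"] = "inlot"

def wait_until(key, value):
    # The real command blocks until the condition holds; this sketch
    # just asserts that the state is already there.
    assert state[key] == value, f"{key} != {value}"

def buy_object(obj, x, y):
    state["objects"].append((obj, int(x), int(y)))

runner = ScriptRunner({
    "enter_lot": enter_lot,
    "wait_until": wait_until,
    "buy_object": buy_object,
    "chat": lambda *words: None,  # chat output ignored in this sketch
})

runner.run("""
    # abridged from the slide's example
    enter_lot $alpha_chimp
    wait_until game_state inlot
    chat Testing object purchase.
    buy_object chair 10 10
""")
print(state["objects"])  # [('chair', 10, 10)]
```

Because the command set maps one-to-one onto Presentation Layer calls, the same script can drive a full game client or a headless load client.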
Client Implementation
Composable Client
Event Generators: Scripts, Cheat Console, GUI
Viewing Systems: Console, Lurker, GUI
Game Logic
Presentation Layer
Any / all components may be loaded per instance
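As a hedged sketch of the composable-client idea (all class and component names here are illustrative, not TSO’s actual code): a client instance is assembled from whatever mix of event generators and viewing systems a given role needs, around the same game logic.

```python
class GameLogic:
    """Stand-in for the real simulation; just records events it handles."""

    def __init__(self):
        self.events = []

    def handle(self, event):
        self.events.append(event)

class Client:
    """A client assembled from optional input and output components."""

    def __init__(self, generators=(), viewers=()):
        self.logic = GameLogic()
        self.generators = list(generators)  # e.g. script, console, GUI
        self.viewers = list(viewers)        # e.g. console, lurker, GUI

    def pump(self):
        # Pull events from every generator into the shared logic...
        for gen in self.generators:
            for event in gen():
                self.logic.handle(event)
        # ...then let every attached viewer observe the result.
        for view in self.viewers:
            view(self.logic.events)

# A NullView load client: script-driven input, no viewers at all.
script_events = lambda: ["enter_lot", "buy_object"]
load_client = Client(generators=[script_events])
load_client.pump()
print(load_client.logic.events)  # ['enter_lot', 'buy_object']

# A QA client: identical logic, plus a console viewer.
qa_client = Client(generators=[script_events],
                   viewers=[lambda evs: print("console:", evs)])
qa_client.pump()
```

The same Client class yields a headless load client or a fully viewed QA client, which is the point of the architecture.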
Lesson: View & Logic Entangled
Game Client
View
Logic
Few Clean Separation Points
Game Client
View
Logic
Presentation Layer
Solution: Refactored for Isolation
Game Client
View
Logic
Presentation Layer
Lesson: NullView Debugging
Without the (legacy) view system attached, tracing was “difficult”.
Logic
Presentation Layer
Solution: Embedded Diagnostics
Diagnostics: Timeout Handlers, …
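One plausible shape for such embedded diagnostics (names and timeouts invented for illustration): wrap each scripted step in a timeout handler so a headless client reports which step stalled instead of hanging silently.

```python
import time

class StepTimeout(Exception):
    pass

def run_step(name, step, timeout_s=5.0, poll_s=0.01):
    """Poll `step` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if step():
            return  # step completed normally
        time.sleep(poll_s)
    # Embedded diagnostic: name the stuck step in the failure itself.
    raise StepTimeout(f"step '{name}' did not complete in {timeout_s}s")

# A step that never completes yields a labelled diagnostic, not a hang.
try:
    run_step("enter_lot", lambda: False, timeout_s=0.05)
except StepTimeout as e:
    diag = str(e)
print(diag)
```

With no view system attached, these labelled failures are often the only visible trace of where a NullView client got stuck.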
Talk Outline: Automated Testing
Time (60 Minutes)
1/3 – Design & Initial Implementation: Architecture & Design, Test Client, Initial Results
1/3 – Lessons Learned: Fielding
1/3 – Wrap-up & Questions
Mean Time Between Failure
• Random Event, Log & Execute
• Record client lifetime / RAM
• Worked: just not relevant in early stages of development
– Most failures / leaks found were not high-priority at that time, when weighed against server crashes
Monkey Tests
• Constant repetition of simple, isolated actions against servers
• Very useful:
– Direct observation of servers while under constant, simple input
– Server processes “aged” all day
• Examples:
– Login / Logout
– Enter House / Leave House
QA Test Suite Regression
• High false positive rate & high maintenance
– New bugs / old bugs
– Shifting game design
– “Unknown” failures
Not helping in day-to-day work.
Talk Outline: Automated Testing
Time (60 Minutes)
¼ – Design & Initial Implementation
½ – Fielding: Analysis & Adaptations: Non-Determinism, Maintenance Overhead, Solutions & Results, Monkey / Sniff / Load / Harness
¼ – Wrap-up & Questions
Analysis: Testing Isolated Features
Analysis: Critical Path
Failures on the Critical Path block access to much of the game.
Test Case: Can an Avatar Sit in a Chair?
login ()
create_avatar ()
buy_house ()
enter_house ()
buy_object ()
use_object ()
Solution: Monkey Tests
• Primitives placed in Monkey Tests
– Isolate as much as possible, repeat 400x
– Report only aggregate results
• Create Avatar: 93% pass (375 of 400)
• “Poor Man’s” Unit Test
– Feature based, not class based
– Limited isolation
– Easy failure analysis / reporting
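A monkey test of this kind is easy to sketch: repeat one isolated primitive many times and report only the aggregate. The create_avatar stub and its pass rate below are invented stand-ins for the real primitive.

```python
import random

def monkey_test(action, runs=400):
    """Repeat `action` and return only aggregate results."""
    passes = sum(1 for _ in range(runs) if action())
    return passes, runs, 100.0 * passes / runs

random.seed(1)  # deterministic demo

def create_avatar():
    # Stand-in for the real primitive: succeeds most of the time.
    return random.random() < 0.94

passes, runs, rate = monkey_test(create_avatar)
print(f"Create Avatar: {rate:.0f}% pass ({passes} of {runs})")
```

Reporting only the aggregate keeps failure analysis cheap: a dip in the pass rate flags the feature without requiring a log dive per run.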
Talk Outline: Automated Testing
Time (60 Minutes)
1/3 – Design & Initial Implementation
1/3 – Lessons Learned: Fielding: Non-Determinism, Maintenance Costs, Solution Approaches, Monkey / Sniff / Load / Harness
1/3 – Wrap-up & Questions
Analysis: Maintenance Cost
• High defect rate in game code
– Code Coupling: “side effects”
– Churn Rate: frequent changes
• Critical Path: fatal dependencies
• High debugging cost
– Non-deterministic, distributed logic
Turnaround Time
[Timeline: Bug Introduced during Development → Checkin → Build → Smoke → Regression; days elapse before detection, driving up Time to Fix and Cost of Detection.]
Tests were too far removed from the introduction of defects.
Critical Path Defects Were Very Costly
[Same timeline, with an added cost: for the days a Critical Path bug lives between Checkin and detection in Smoke / Regression, it also has Impact on Others, blocking the rest of the team.]
Solution: Sniff Test
Pre-Checkin Regression: don’t let broken code into Mainline.
[Pipeline: Development → Sniff (Candidate Code → Pass / Fail, Diagnostics) → Checkin (Working Code) → Smoke → Regression]
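A minimal sketch of such a pre-checkin gate, assuming a list of fast smoke tests; the test names and the failure below are invented for illustration.

```python
def sniff_test(tests):
    """Run fast regression tests; return (ok, diagnostics) for the gate."""
    failures = []
    for name, test in tests:
        try:
            test()
        except Exception as e:  # collect diagnostics rather than stop
            failures.append(f"{name}: {e}")
    return not failures, failures

def attempt_checkin(tests):
    ok, diagnostics = sniff_test(tests)
    if not ok:
        # Broken candidate code never reaches Mainline.
        return "REJECTED: " + "; ".join(diagnostics)
    return "COMMITTED to Mainline"

def broken():
    raise RuntimeError("object DB unreachable")  # invented failure

good = [("login", lambda: None), ("enter_house", lambda: None)]
bad = good + [("buy_object", broken)]

print(attempt_checkin(good))  # COMMITTED to Mainline
print(attempt_checkin(bad))   # REJECTED: buy_object: object DB unreachable
```

The gate has to be fast: it runs on every checkin attempt, so it covers the Critical Path primitives rather than the full regression suite.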
Solution: Hourly Diagnostics
• SniffTest Stability Checker
– Emulates a developer
– Every hour: sync / build / test
• Critical Path monkeys ran non-stop
– Constant “baseline”
• Traffic Generation
– Keep the pipes full & servers aging
– Keep the DB growing
Analysis: CONSTANT SHOUTING IS REALLY IRRITATING
• Bugs spawned many, many emails
• Solution: Report Managers
– Aggregates / correlates across tests
– Filters known defects
– Translates common failure reports to their root causes
• Solution: Data Managers
– Information Overload: automated workflow tools mandatory
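A Report Manager’s core aggregation step might look like the following sketch; the defect strings and the known-issue list are invented for illustration.

```python
from collections import Counter

# Invented known-issue list; real entries would reference the bug tracker.
KNOWN_DEFECTS = {"bug-1041: lamp state desync"}

def summarize(failure_reports):
    """Collapse a flood of per-test failures into one filtered summary."""
    counts = Counter(failure_reports)
    # Filter out known defects; keep only new root causes.
    new = {msg: n for msg, n in counts.items() if msg not in KNOWN_DEFECTS}
    # One line per distinct cause, most frequent first.
    return [f"{n}x {msg}"
            for msg, n in sorted(new.items(), key=lambda kv: -kv[1])]

reports = (
    ["timeout in enter_lot"] * 12
    + ["bug-1041: lamp state desync"] * 30   # known: filtered out
    + ["avatar create failed"] * 3
)
for line in summarize(reports):
    print(line)
# 12x timeout in enter_lot
# 3x avatar create failed
```

Forty-five raw failures become two actionable lines, which is the difference between a report people read and one they filter to a folder.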
ToolKit Usability
• Workflow automation
• Information management
• Developer / Tester “push button” ease of use
• XP flavour: increasingly easy to run tests
– Must be easier to run than to avoid running
– Must solve problems “on the ground now”
Sample Testing Harness Views
Load Testing: Goals
• Expose issues that only occur at scale
• Establish hardware requirements
• Establish that response is playable @ scale
• Emulate user behaviour
– Use server-side metrics to tune test scripts against observed Beta behaviour
• Run full-scale load tests daily
Load Testing: Data Flow
[Diagram: a Load Control Rig drives multiple Test Driver CPUs, each hosting many Test Clients that send Game Traffic to the Server Cluster. Client Metrics, Resource Metrics, and Debugging Data flow back to the Load Testing Team via System Monitors and Internal Probes.]
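The fan-out above can be sketched as below. A real rig runs clients concurrently across driver CPUs and measures live game traffic; this stand-in runs sequentially with an invented latency function, purely to show the shape of the driver-and-metrics loop.

```python
import statistics

def play_session(client_id):
    """Scripted play session; returns a fake response-time metric in ms."""
    # Deterministic stand-in for real measured latency.
    return 50 + (client_id % 7) * 5

def run_load_test(num_clients):
    """Drive N scripted clients and aggregate client-side metrics."""
    latencies = [play_session(cid) for cid in range(num_clients)]
    return {
        "clients": num_clients,
        "mean_ms": statistics.mean(latencies),
        "worst_ms": max(latencies),
    }

metrics = run_load_test(4000)  # "Scale & Break": up to 4,000 clients
print(metrics["clients"], "clients, worst response", metrics["worst_ms"], "ms")
```

Tuning play_session against server-side metrics from Beta is what made the synthetic load representative of real players.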
Load Testing: Lessons Learned
• Very successful
– “Scale & Break”: up to 4,000 clients
• Some conflicting requirements w/ Regression
– Continue on fail
– Transaction tracking
– NullView client a little “chunky”
Current Work
• QA test suite automation
• Workflow tools
• Integrating testing into the new features design/development process
• Planned work
– Extend Esper Toolkit for general use
– Port to other Maxis projects
Talk Outline: Automated Testing
Time (60 Minutes)
1/3 – Design & Initial Implementation
1/3 – Lessons Learned: Fielding
1/3 – Wrap-up & Questions: Biggest Wins / Losses, Reuse, Tabula Rasa: MMP & SSP
Biggest Wins
• Presentation Layer Abstraction
– NullView client
– Scripted play sessions: powerful for regression & load
• Pre-Checkin Snifftest
• Load Testing
• Continual Usability Enhancements
• Team
– Upper Management Commitment
– Focused Group, Senior Developers
Biggest Issues
• Order Of Testing
– MTBF / QA Test Suites should have come last
– Not relevant when early & game too unstable
– Find / Fix Lag: too distant from Development
• Changing TSO’s Development Process
– Tool adoption was slow, unless mandated
• Noise
– Constant Flood Of Test Results
– Number of Game Defects, Testing Defects
– Non-Determinism / False Positives
Tabula Rasa
How Would I Start The Next Project?

PreCheckin SniffTest – Keep Mainline working
There’s just no reason to let code break.

Hourly Monkey Tests / Stability Checkers – Baseline for Developers
Useful baseline & keeps servers aging.

Dedicated Tools Group – Easy to Use == Used
Continual usability enhancements adapted tools to meet “on the ground” conditions.

Executive Level Support – Radical Shifts in Process
Mandates required to shift how entire teams operated.

Load Test: Early & Often – Break it before Live

Distribute Test Development & Ownership Across Full Team
Next Project: Basic Infrastructure
Control Harness for Clients & Components
Reference Client / Self Test
Reference Feature / Regression Engine
Living Doc
Building Features: NullView First
Control Harness
Reference Client
NullView Client
Self Test
Reference Feature
Regression Engine
Living Doc
Build The Tests With The Code
NullView Client
Login
Self Test
Monkey Test
Nothing Gets Checked In Without A Working Monkey Test.
Control Harness
Reference Client
Reference Feature
Regression Engine
Conclusion
• Estimated Impact on MMP: High
– Sniff Test: kept developers working
– Load Test: ID’d critical failures pre-launch
– Presentation Layer: scriptable play sessions
• Cost To Implement: Medium
– Much lower for SSP games
Repeatable, coordinated inputs @ scale and pre-checkin regression were very significant schedule accelerators.
Conclusion
Go For It…
Talk Outline: Automated Testing
Time (60 Minutes)
1/3 – Design & Initial Implementation
1/3 – Lessons Learned: Fielding
1/3 – Wrap-up & Questions