Poker AI: Equilibrium, Online Resolving, Deep Learning and...
Transcript of Poker AI: Equilibrium, Online Resolving, Deep Learning and...
PokerAI:Equilibrium,OnlineResolving,DeepLearningand
ReinforcementLearning
NikolaiYakovenkoNVidiaADLRGroup-- SantaClaraCA
ColumbiaUniversityDeepLearningSeminarApril2017
PokerisaTurn-BasedVideoGame
Call
Raise
Fold
ManyDifferentPokerGamesSingleDrawVideoPoker
2-7LowballTripleDraw(makelowhandfrom5cardswithmultipledraws)
LimitHold’em
NoLimitHold’emWorldSeriesofPoker[Humans]AnnualComputerPokerCompetition[Robots]
Hold HoldHold
Privatecards Publiccards
OneHandofTexasHold’emPrivatecards Flop(public) Turn River
Hero
Oppn
BettingRound
BettingRound
BettingRound
BettingRound
Showdown
Best5-CardHand
Wins
Flush
TwoPairs
CFR:EquilibriumBalancing• AbstractHold’em gametosmallerstate-space• Cycleoverevergamestates
– Update“regrets”– Adjuststrategytowardleast“regret”
• ConvergestoNashequilibriuminthesimplifiedgame.– Closeenoughtoanequilibriuminthefullgame
• WinnersofeveryAnnualComputerPokerCompetition(ACPC)since2007.– LimitHold’em:1%ofunexploitable (2015)*– NoLimitHold’em:defeatedtopprofessionalplayers(2017)**
*pre-computedstrategy**in-gamesimulationonsuper-computercluster
CFR:CounterfactualRegretMinimization
Player1:Randomstrategy Player2:Randomstrategy
Regret:foldinggoodhandsAction:betgoodhandsmore
Regret:notfoldingbadhandsAction:foldbadhands
Regret:notbluffingbadhandsAction:betwhencan’twin
Regret:notcallingbluffsAction:callwithsome%vsbluffs
(equilibrium)
CFR:Pre-ComputeEntireStrategyEachpoint:encodesgamestate*• Privatecardsforplayer• Publiccards• Betsmadesofar
*Opponentcannotdistinguishbetweensomestates
EntangledGameStates:Kuhn(3Card)PokerExample
Heads-upLimitHold’em PokerisSolved byBowling,etal[Nature]
Heads-upLimitHold’em isSolved
Heads-upLimitHold’em PokerisSolved byBowling,etal[Nature]
Within0.1%ofUnexploitable byPerfectResponse
Heads-upLimitHold’em PokerisSolved byBowling,etal[Nature]
Surprising:EquilibriumStrategyis(almost)Binary
Green:raise;Red:fold;Blue:call
DoesthisworkforNoLimitHold’em?
NoQuite:NLHismuchbigger
LimitHold’em• 2privatecards,• 5publiccards• 4roundsofbetting• 3bettingactions
– Check/Call– Bet/Raise– Fold
• 10^14gamestates
NoLimitHold’em (200BB)• 2privatecards• 5publiccards• 4roundsofbetting• Upto200bettingactions
– Check/Call– Bet/Raiseanysize– Fold
• 10^170gamestates– Gohas10^160gamestates
BetSizes:HugeBranchingFactor
DeepStack:… byMoravcik,etal[Science2017]
GoingOff-Tree
Closestoraverageknownstate:errorsaccumulate
ContinuousRe-Solving
(CMU’sLibratus alsoemployscontinuousre-solving)
Range:probabilityvectorover1326uniqueprivatecards
Re-solvingEarly:SolveEntireGame(TooBig)
EstimatevaluesatdepthXwithadeepneuralnetwork(UofAlbertaDeepStack)
DeepStack:EstimatingCFRValues
Goodenoughfor(super)humanperformance?
PracticalResults:Libratus (CMU)andDeepStack (U-Alberta)
Libratus• Design:
– Nocardabstraction– CFR+forpreflop andflop– ”Endgamesolving”onturn
andriver• Speed:
– Instantpreflop &flop– ~30secondsturnandriver– (200nodesuper-computer)
• Results:– $14.1/handvstoppros– 120,000handsover3weeks
DeepStack• Design:
– Nocardabstraction– “Continuousresolving”onall
streets– Depth-limitedsolvingw/DNN
• Speed:– ~5-10spreflop andflop– ~1-5sturnandriver– LaptopwithTorchandGPU
• Results:– $48.6/handvsnon-toppros– Upcomingfreezeoutmatches
Libratus Challenge(January2017)
ThehumanslosttoLibratus AIbeatby4+σ.Didtheyalsogettired?
PreviousACPCAgentsWereHighlyExploitable(AndMaybeStillAre)
Results:1250mbb/hand=$125/hand– morethanfoldingeveryhand.LBRagent=limitedbestresponseusing48betsizes.
Conclusions&Speculations(04/2017)
• Computerscanmatchtophumansatheads-upNoLimitHold’em poker.
• Thewinningapproachiscontinuousre-solving(similartochessorGo,butwithhandranges)
• Greattoolstomeasureexploitabilityandtheluckfactor(seeDeepStack paperfordetails)
• Canthisapproachgeneralizeto3-6playergames?• Cantheonlinesolvingbecomemuchfaster?• Willthisworkalwaysrequireextensivedomainexpertise?
Canwetrainastrongpokerplayerwithamuchsmallerstrategy?
Poker-CNN:Cardsas2DTensorsPrivatecards Flop(public) Turn River Showdown
[AhQs]+[As9s6s]
x23456789TJQKAc.............d.............h............1s....1..1..1.1 Flush draw
Pair (of Aces)
[AhQs]
x23456789TJQKAc.............d.............h............1s..........1..
Flush
[AhQsAs9s6s9c2s]
x23456789TJQKAc.......1.....d.............h............1s1...1..1..1.1
Flush!
Convnet:PredictAnythingYouWant
Input convolutions maxpool dense layer50%dropout
outputlayer
conv pool
Inputs:
• Privatecards• Publiccards• Potsize• Position• Previousbetshistory
(31x 17x 173Dtensor)
Predictactionvalue:
• Bet,call,foldvalues• Actionprobabilities• Valuebybetsize
• single-trial$win/loss
• nogradientforbetsnotmade
• noMonteCarlotreesearchrequired
Surrogatetasks:
• Allin odds• Opponenthanddistribution
$20,000 $20,000
$100 $50
+$265
0204060
20%pot
50%pot
1xpot
1.5xpot
3xpot
10xpot
BetSize
Raise 84.2%Call 15.8%Fold 0.0%
Raise 81.1%Call 18.9%Fold 0.0%
0
20
40
60
0% 33% 50% 66% 100%
Oddsvs Opponent
+$2616
(Call)
BigBlind SmallBlind
Raise 0.0%Call 94.3%Fold 5.7%
0204060
20%pot
50%pot
1xpot
1.5xpot
3xpot
10xpot
BetSize
$17,285 $17,285
$5,430
(Check) (Check)Bet 30.0%Check 70.0%
Valuevs random 91.3%Valuevs oppon 85.6%
Bet 25.9%Check 74.1%
Valuevs random 52.9%Valuevs oppon 32.6%
Bet 59.0%Check 41.0%
(Check) Bet 86.6%Check 13.4%
010203040
20%pot
50%pot
1xpot
1.5xpot
3xpot
10xpot
BetSize
0
50
100
0% 33% 50% 66% 100%
Oddsvs Opponent
+$3,967
$5,430
$17,285 $17,285
$13,318
$9,397
Raise 26.6%Call 73.4%Fold 0.0%
Valuevs random 91.3%Valuevs oppon 68.0%
+$17,285(allin)Call 32.4%Fold 67.6%
Valuevs random 84.7%Valuevs oppon 30.6%
($13,000allin call,towin$26,000)33.3%odds=break-even
$17,285+$3,967
Takeaways• Prettygoodpatternmatching,withenoughdata– Naïvenetworkdesign– andfoolishuseofpooling– Training≈4millionpreviousACPChands
• Struggleswithrarecases– Under-weightsoutliers– Outofsamplesituations
• Strugglesinbigpots– Largeeffectonaverageresults– Sparsedata
• Noattempttoavoidexploitability
FutureWork• Moregames,morecontexts– 3-6playerNoLimitHold’em– Pot-LimitOmaha(4privatecardsinsteadof2)– TournamentHold’em
• LearntheCFRinternalparameters?– Predictopponenthandrangesdirectly
• Personalizemodelagainstanopponent• Tunehyper-parameter…– 100,000handsperexperiment– Findidealnetworkarrangement– Exploitflexibilityofdeepneuralnets
TheDream
CanaDNNlearntoimitatestrongplayersin2+playergames?
Data:high-qualitysimulation,equilibriumsolving,orplayerlogs
ReinforcementLearning
DeepQ-LearningforAtariGames
Human-levelcontrol throughdeepreinforcementlearningbyDeepMind (Nature2015)
OpenAI Gym:Train&ShareRLAgents
Support forAtarigames,classicRLproblems, robotsoccer,Doom[nopoker]
ReinforcementLearningforGames
https://www.slideshare.net/onghaoyi/distributed-deep-qlearning
FaultyRewardFunction?
https://blog.openai.com/faulty-reward-functions/
CanPokerBeSolvedWithRL?
• Yes,withmodifications.– “Standard”RLisgreedyandrequirestheMarkovproperty
– Pokerdecisionscan’tbeoptimizedlocally– Somegame-customlocalsimulationisrequired
• Heinrich&Silver(DeepMind2016)matchstateoftheartonLimitHoldemwithmodifieddeepRL– https://arxiv.org/pdf/1603.01121.pdf– DeepRLalsogivesuseful“similarcontext”embeddingforpokersituations
– (AsshouldourPoker-CNN)
DeepRLHighWatermarks
• Atarigames– Resultskeepimproving– AlthoughOpenAIclaimsequal/betterresultsonthesimplerAtarigameswithevolutionaryalgorithms
• AlphaGo – super-humanachievement• RLsaves40%ondatacentercooling – Google
CanIApplyDeepRLtoMyProblem?
Pros– Goforit!• Cleargame-likerewardfunction• Easytosimulatetheenvironment• Markovpropertyapplies[stateis
notpath-dependent]• Bestpathcanbedeterministic• Rewardsareobservablein
relativelyshortsequences• Hardtocomputeexactproblem
gradients,evenifsolutionseasytocompare
• Accesstomassivemachineresources
Cons– trysomethingelse.• Noclearrewards(selfdrivingcar)• Trainingdata,nottraining
environment• Limitedcomputationalresources• Possibletocomputeexact
gradientsontheproblem(MNIST,videoclassification,etc)
• Notlikelythatrandomactionswillevergetapositivereward(DeepQRLscored0.0onMontezuma’sRevengeforalongtime)
QuestionsforFutureThought• WhataresomehardproblemsthatcouldbesolvedwithDeepRL,givenhugeresources?– Example:componentarrangementformicrochipmanufacture
• GivenaccesstoLibratus orDeepStack engine,couldyoudesignadeepnettoimitateit,ortobeatit?– Withorwithoutonlinesimulation?
• Fromasmallamountofexperttrainingdata,canyoutrainageneralagentfor2PgameslikeStreetFighter?– CouldyoubootstrapitlikeAlphaGo?– Couldyoutrainitsohumanscan’ttellthatit’sabot?
• WhatproblemswouldyoutrainwithaccesstoahugeGPUcluster?
Thankyou!Questions?
References&FurtherReading• DeepStack
– https://www.deepstack.ai/– WatchweeklyhumanvsAImatchesonTwitch:https://www.twitch.tv/deepstackai– Opensource(Torch)codeforNoLimit LeducHold’em (simplifiedNLH):
https://github.com/lifrordi/DeepStack-Leduc• Libratus
– http://www.cs.cmu.edu/~sandholm/– Mywriteupon#BrainsVsAI match:https://medium.com/@Moscow25/cmus-libratus-bluffs-
its-way-to-victory-in-brainsvsai-poker-match-99abd31b9cd4• Poker-CNN
– OurpaperfromAAAI2016onArXiv http://arxiv.org/abs/1509.06731– Code&models(admittedlyneedscleanup)https://github.com/moscow25/deep_draw
• AnnualComputerPokerCompetitionhttp://www.computerpokercompetition.org/• DeepReinforcementLearning
– DeepMind:https://deepmind.com/research/dqn/– OpenAI:https://openai.com/
• NVidiaApplied DeepLearningResearchGroup– Blog/interview:https://blogs.nvidia.com/blog/2016/12/07/ai-podcast-deep-learning-going-
next/– Openrequisition:https://sv.ai/nvidia-senior-research-scientist-deep-learning/