Poker AI: Equilibrium, Online Resolving, Deep Learning and...

PokerAI:Equilibrium,OnlineResolving,DeepLearningand

ReinforcementLearning

NikolaiYakovenkoNVidiaADLRGroup-- SantaClaraCA

ColumbiaUniversityDeepLearningSeminarApril2017

PokerisaTurn-BasedVideoGame

Call

Raise

Fold

ManyDifferentPokerGamesSingleDrawVideoPoker

2-7LowballTripleDraw(makelowhandfrom5cardswithmultipledraws)

LimitHold’em

NoLimitHold’emWorldSeriesofPoker[Humans]AnnualComputerPokerCompetition[Robots]

Hold HoldHold

Privatecards Publiccards

OneHandofTexasHold’emPrivatecards Flop(public) Turn River

Hero

Oppn

BettingRound

BettingRound

BettingRound

BettingRound

Showdown

Best5-CardHand

Wins

Flush

TwoPairs

CFR:EquilibriumBalancing• AbstractHold’em gametosmallerstate-space• Cycleoverevergamestates

– Update“regrets”– Adjuststrategytowardleast“regret”

• ConvergestoNashequilibriuminthesimplifiedgame.– Closeenoughtoanequilibriuminthefullgame

• WinnersofeveryAnnualComputerPokerCompetition(ACPC)since2007.– LimitHold’em:1%ofunexploitable (2015)*– NoLimitHold’em:defeatedtopprofessionalplayers(2017)**

*pre-computedstrategy**in-gamesimulationonsuper-computercluster

CFR:CounterfactualRegretMinimization

Player1:Randomstrategy Player2:Randomstrategy

Regret:foldinggoodhandsAction:betgoodhandsmore

Regret:notfoldingbadhandsAction:foldbadhands

Regret:notbluffingbadhandsAction:betwhencan’twin

Regret:notcallingbluffsAction:callwithsome%vsbluffs

(equilibrium)

CFR:Pre-ComputeEntireStrategyEachpoint:encodesgamestate*• Privatecardsforplayer• Publiccards• Betsmadesofar

*Opponentcannotdistinguishbetweensomestates

EntangledGameStates:Kuhn(3Card)PokerExample

Heads-upLimitHold’em PokerisSolved byBowling,etal[Nature]

Heads-upLimitHold’em isSolved


Within0.1%ofUnexploitable byPerfectResponse


Surprising:EquilibriumStrategyis(almost)Binary

Green:raise;Red:fold;Blue:call

DoesthisworkforNoLimitHold’em?

NoQuite:NLHismuchbigger

LimitHold’em• 2privatecards,• 5publiccards• 4roundsofbetting• 3bettingactions

– Check/Call– Bet/Raise– Fold

• 10^14gamestates

NoLimitHold’em (200BB)• 2privatecards• 5publiccards• 4roundsofbetting• Upto200bettingactions

– Check/Call– Bet/Raiseanysize– Fold

• 10^170gamestates– Gohas10^160gamestates

BetSizes:HugeBranchingFactor

DeepStack:… byMoravcik,etal[Science2017]

GoingOff-Tree

Closestoraverageknownstate:errorsaccumulate

ContinuousRe-Solving

(CMU’sLibratus alsoemployscontinuousre-solving)

Range:probabilityvectorover1326uniqueprivatecards

Re-solvingEarly:SolveEntireGame(TooBig)

EstimatevaluesatdepthXwithadeepneuralnetwork(UofAlbertaDeepStack)

DeepStack:EstimatingCFRValues

Goodenoughfor(super)humanperformance?

PracticalResults:Libratus (CMU)andDeepStack (U-Alberta)

Libratus• Design:

– Nocardabstraction– CFR+forpreflop andflop– ”Endgamesolving”onturn

andriver• Speed:

– Instantpreflop &flop– ~30secondsturnandriver– (200nodesuper-computer)

• Results:– $14.1/handvstoppros– 120,000handsover3weeks

DeepStack• Design:

– Nocardabstraction– “Continuousresolving”onall

streets– Depth-limitedsolvingw/DNN

• Speed:– ~5-10spreflop andflop– ~1-5sturnandriver– LaptopwithTorchandGPU

• Results:– $48.6/handvsnon-toppros– Upcomingfreezeoutmatches

Libratus Challenge(January2017)

ThehumanslosttoLibratus AIbeatby4+σ.Didtheyalsogettired?

PreviousACPCAgentsWereHighlyExploitable(AndMaybeStillAre)

Results:1250mbb/hand=$125/hand– morethanfoldingeveryhand.LBRagent=limitedbestresponseusing48betsizes.

Conclusions&Speculations(04/2017)

• Computerscanmatchtophumansatheads-upNoLimitHold’em poker.

• Thewinningapproachiscontinuousre-solving(similartochessorGo,butwithhandranges)

• Greattoolstomeasureexploitabilityandtheluckfactor(seeDeepStack paperfordetails)

• Canthisapproachgeneralizeto3-6playergames?• Cantheonlinesolvingbecomemuchfaster?• Willthisworkalwaysrequireextensivedomainexpertise?

Canwetrainastrongpokerplayerwithamuchsmallerstrategy?

Poker-CNN:Cardsas2DTensorsPrivatecards Flop(public) Turn River Showdown

[AhQs]+[As9s6s]

x23456789TJQKAc.............d.............h............1s....1..1..1.1 Flush draw

Pair (of Aces)

[AhQs]

x23456789TJQKAc.............d.............h............1s..........1..

Flush

[AhQsAs9s6s9c2s]

x23456789TJQKAc.......1.....d.............h............1s1...1..1..1.1

Flush!

Convnet:PredictAnythingYouWant

Input convolutions maxpool dense layer50%dropout

outputlayer

conv pool

Inputs:

• Privatecards• Publiccards• Potsize• Position• Previousbetshistory

(31x 17x 173Dtensor)

Predictactionvalue:

• Bet,call,foldvalues• Actionprobabilities• Valuebybetsize

• single-trial$win/loss

• nogradientforbetsnotmade

• noMonteCarlotreesearchrequired

Surrogatetasks:

• Allin odds• Opponenthanddistribution

$20,000 $20,000

$100 $50

+$265

0204060

20%pot

50%pot

1xpot

1.5xpot

3xpot

10xpot

BetSize

Raise 84.2%Call 15.8%Fold 0.0%


0

20

40

60

0% 33% 50% 66% 100%

Oddsvs Opponent

+$2616

(Call)

BigBlind SmallBlind


0204060

20%pot

50%pot

1xpot

1.5xpot

3xpot

10xpot

BetSize

$17,285 $17,285

$5,430

(Check) (Check)Bet 30.0%Check 70.0%

Valuevs random 91.3%Valuevs oppon 85.6%

Bet 25.9%Check 74.1%


Bet 59.0%Check 41.0%

(Check) Bet 86.6%Check 13.4%

010203040

20%pot

50%pot

1xpot

1.5xpot

3xpot

10xpot

BetSize

0

50

100

0% 33% 50% 66% 100%

Oddsvs Opponent

+$3,967

$5,430

$17,285 $17,285

$13,318

$9,397



+$17,285(allin)Call 32.4%Fold 67.6%


($13,000allin call,towin$26,000)33.3%odds=break-even

$17,285+$3,967

Takeaways• Prettygoodpatternmatching,withenoughdata– Naïvenetworkdesign– andfoolishuseofpooling– Training≈4millionpreviousACPChands

• Struggleswithrarecases– Under-weightsoutliers– Outofsamplesituations

• Strugglesinbigpots– Largeeffectonaverageresults– Sparsedata

• Noattempttoavoidexploitability

FutureWork• Moregames,morecontexts– 3-6playerNoLimitHold’em– Pot-LimitOmaha(4privatecardsinsteadof2)– TournamentHold’em

• LearntheCFRinternalparameters?– Predictopponenthandrangesdirectly

• Personalizemodelagainstanopponent• Tunehyper-parameter…– 100,000handsperexperiment– Findidealnetworkarrangement– Exploitflexibilityofdeepneuralnets

TheDream

CanaDNNlearntoimitatestrongplayersin2+playergames?

Data:high-qualitysimulation,equilibriumsolving,orplayerlogs

ReinforcementLearning

DeepQ-LearningforAtariGames

Human-levelcontrol throughdeepreinforcementlearningbyDeepMind (Nature2015)

OpenAI Gym:Train&ShareRLAgents

Support forAtarigames,classicRLproblems, robotsoccer,Doom[nopoker]

ReinforcementLearningforGames

https://www.slideshare.net/onghaoyi/distributed-deep-qlearning

FaultyRewardFunction?

https://blog.openai.com/faulty-reward-functions/

CanPokerBeSolvedWithRL?

• Yes,withmodifications.– “Standard”RLisgreedyandrequirestheMarkovproperty

– Pokerdecisionscan’tbeoptimizedlocally– Somegame-customlocalsimulationisrequired

• Heinrich&Silver(DeepMind2016)matchstateoftheartonLimitHoldemwithmodifieddeepRL– https://arxiv.org/pdf/1603.01121.pdf– DeepRLalsogivesuseful“similarcontext”embeddingforpokersituations

– (AsshouldourPoker-CNN)

DeepRLHighWatermarks

• Atarigames– Resultskeepimproving– AlthoughOpenAIclaimsequal/betterresultsonthesimplerAtarigameswithevolutionaryalgorithms

• AlphaGo – super-humanachievement• RLsaves40%ondatacentercooling – Google

CanIApplyDeepRLtoMyProblem?

Pros– Goforit!• Cleargame-likerewardfunction• Easytosimulatetheenvironment• Markovpropertyapplies[stateis

notpath-dependent]• Bestpathcanbedeterministic• Rewardsareobservablein

relativelyshortsequences• Hardtocomputeexactproblem

gradients,evenifsolutionseasytocompare

• Accesstomassivemachineresources

Cons– trysomethingelse.• Noclearrewards(selfdrivingcar)• Trainingdata,nottraining

environment• Limitedcomputationalresources• Possibletocomputeexact

gradientsontheproblem(MNIST,videoclassification,etc)

• Notlikelythatrandomactionswillevergetapositivereward(DeepQRLscored0.0onMontezuma’sRevengeforalongtime)

QuestionsforFutureThought• WhataresomehardproblemsthatcouldbesolvedwithDeepRL,givenhugeresources?– Example:componentarrangementformicrochipmanufacture

• GivenaccesstoLibratus orDeepStack engine,couldyoudesignadeepnettoimitateit,ortobeatit?– Withorwithoutonlinesimulation?

• Fromasmallamountofexperttrainingdata,canyoutrainageneralagentfor2PgameslikeStreetFighter?– CouldyoubootstrapitlikeAlphaGo?– Couldyoutrainitsohumanscan’ttellthatit’sabot?

• WhatproblemswouldyoutrainwithaccesstoahugeGPUcluster?

Thankyou!Questions?

References&FurtherReading• DeepStack

– https://www.deepstack.ai/– WatchweeklyhumanvsAImatchesonTwitch:https://www.twitch.tv/deepstackai– Opensource(Torch)codeforNoLimit LeducHold’em (simplifiedNLH):

https://github.com/lifrordi/DeepStack-Leduc• Libratus

– http://www.cs.cmu.edu/~sandholm/– Mywriteupon#BrainsVsAI match:https://medium.com/@Moscow25/cmus-libratus-bluffs-

its-way-to-victory-in-brainsvsai-poker-match-99abd31b9cd4• Poker-CNN

– OurpaperfromAAAI2016onArXiv http://arxiv.org/abs/1509.06731– Code&models(admittedlyneedscleanup)https://github.com/moscow25/deep_draw

• AnnualComputerPokerCompetitionhttp://www.computerpokercompetition.org/• DeepReinforcementLearning

– DeepMind:https://deepmind.com/research/dqn/– OpenAI:https://openai.com/

• NVidiaApplied DeepLearningResearchGroup– Blog/interview:https://blogs.nvidia.com/blog/2016/12/07/ai-podcast-deep-learning-going-

next/– Openrequisition:https://sv.ai/nvidia-senior-research-scientist-deep-learning/

Poker AI: Equilibrium, Online Resolving, Deep Learning and...

Documents

Transcript of Poker AI: Equilibrium, Online Resolving, Deep Learning and...