Design Choices of Golang for High Scalability
SeongJae Park <[email protected]>
This work by SeongJae Park is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To
view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
These slides were presented during GDG Seoul Meetup 201709(https://www.meetup.com/GDG-Seoul/events/242054608/)
What Makes Golang So Special on Multicore?
● People say Go is a good choice for high performance and scalability
● Why is scalability so important?
● Why are existing solutions not sufficient?
● What makes Go so special for these problems?
● TL;DR: Goroutines, dynamic stack management, and the integrated poller
DISCLAIMER: This talk is based on Dave Cheney’s OSCON15 presentation (http://cdn.oreillystatic.com/en/assets/1/event/129/High%20performance%20servers%20without%20the%20event%20loop%20Presentation.pdf)
Why Scalability?
A long time ago, in a galaxy far, far away...
Moore’s Law
https://www.karlrupp.net/wp-content/uploads/2015/06/35years.png
(Chart legend: # of transistors, single-thread performance, clock speed, power in watts, number of cores)
● Law: The number of transistors per square inch doubles roughly every 18 months
● CPU vendors used the law to increase CPU clock speed; the only thing programmers needed for better performance was patience for the free lunch
● However, CPU clock speed stopped increasing over a decade ago
Why No Clock Speed?
● Electrons move between transistors on every clock tick (clock speed is analogous to the switch on/off speed in the circuit diagram below)
● Moving anything requires energy; here, electrical energy
● Some of the electrical energy leaks during the transformation to kinetic energy and becomes heat; the temperature rises
● High temperature damages the CPU
● In short, increasing clock speed amplifies power consumption, heat dissipation, and CPU damage
http://fourthgradespace.weebly.com/uploads/1/3/3/9/13397069/2935717_orig.jpg
https://i.ytimg.com/vi/9S9vP2inD_U/maxresdefault.jpg
Moore’s Law Still Holds, but the Vendors Have Changed
● At the same clock speed, two 0.5-square-inch processors consume roughly as much power as a single 1-square-inch processor (the total distance electrons travel per clock tick is similar)
● Vendors therefore now prefer to ship multi-core processors
http://happierhuman.wpengine.netdna-cdn.com/wp-content/uploads/2012/11/One-cookie-vs-two-cookies.jpg
Parallelism is Not Free
● A multi-core system cannot help a zero-concurrency program
● Just increasing concurrency does not guarantee proportional speedup; clumsy concurrency control can make things even worse on multi-core
● Go has made important design choices for highly scalable concurrency control; the remainder of this talk describes some of those choices
https://img.devrant.io/devrant/rant/r_373632_a3SmV.jpg
Context Management
Process? Thread? Goroutine!
Resource Sharing and Context
● Concurrent tasks share processors and memory (the number of tasks is usually larger than the number of processors)
● To pause and resume an execution, the task’s context must be managed
○ Context in this context: pointer to the next instruction, stack frames, data in registers, ...
https://headguruteacher.files.wordpress.com/2017/05/x20142711071202qitokro-s8uda-pagespeed-ic-afnisfpvf0.jpg?w=640
Process: Analogous to a Room for Lease
● Abstraction of an execution of a given program
● Process context switching requires many expensive operations:
○ Finding the process to run next; managing waiting / pending processes
○ Backing up all current CPU registers; restoring all CPU registers of the next process from its last backup
○ Flushing the virtual memory mapping cache (TLB)
○ All of the above must run in the operating system kernel, which means a mode switch between user mode and kernel mode
https://www.youtube.com/watch?v=4OclkGRLuxw
Thread: a.k.a. Light-Weight Process
● Threads are similar to processes, but they share an address space
● Because of the shared address space, a thread’s context is smaller than a process’s; threads are faster than processes to create and to switch
● Still, context switch overhead remains
https://www.topdraw.com/assets/uploads/2015/04/standing-desk.jpg
Goroutine
● Not a thread, not a coroutine: a goroutine
● Go’s major primitive for concurrent task execution
● Designed to carry only minimal context overhead
http://edinburghopendata.info/wp-content/uploads/2015/05/141107-hackathon_18_d893499f2c13fe1fa05bd46252246b1e.jpg
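As a minimal sketch of the `go` statement this slide introduces (the `sumDoubled` helper and the value 5 are illustrative, not from the talk): each loop iteration spawns one goroutine, and a channel collects the results.

```go
package main

import (
	"fmt"
	"sync"
)

// sumDoubled launches one goroutine per input value; each goroutine
// sends its doubled value on a shared channel, and the caller sums them.
func sumDoubled(n int) int {
	results := make(chan int, n)
	var wg sync.WaitGroup
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(v int) { // the `go` statement: spawn a goroutine
			defer wg.Done()
			results <- v * 2
		}(i)
	}
	wg.Wait()
	close(results)
	sum := 0
	for v := range results {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumDoubled(5)) // 2+4+6+8+10 = 30
}
```

Spawning is just the `go` keyword; there is no thread object to configure, which is part of why goroutine creation stays cheap.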
Goroutine: Co-operative Scheduling
● Cooperative scheduling minimizes context switching itself
● Goroutines context switch only in well-defined situations:
○ Channel send / receive operations
○ `go` statements
○ Blocking system calls (file or network I/O)
○ Garbage collection
● If goroutines are not cooperative, starvation is possible (https://gist.github.com/sjp38/dcdb6295e10f1cfe919b)
https://renegadeinc.com/wp-content/uploads/2016/05/RInc-Cooperation-1969.jpg
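The channel send / receive switch points above can be sketched with a ping-pong pair of goroutines (a hypothetical example, not from the slides): every operation on an unbuffered channel is a place where the scheduler may hand the CPU to another goroutine.

```go
package main

import "fmt"

// pingPong bounces a counter between two goroutines over unbuffered
// channels. Each send and each receive is a well-defined scheduling
// point, so the two goroutines cooperatively take turns.
func pingPong(rounds int) int {
	ping := make(chan int)
	pong := make(chan int)
	go func() {
		for i := 0; i < rounds; i++ {
			n := <-ping   // receive: a scheduling point
			pong <- n + 1 // send: another scheduling point
		}
	}()
	n := 0
	for i := 0; i < rounds; i++ {
		ping <- n
		n = <-pong
	}
	return n
}

func main() {
	fmt.Println(pingPong(10)) // one increment per round
}
```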
Goroutine: Minimized Context
● For processes and threads, the kernel must back up and restore all registers because it does not know which registers are actually in use
● The Go compiler, which does know which registers are in use, emits code that backs up only those registers at each context switch point
https://i.pinimg.com/originals/c3/38/5f/c3385f909b2d2c36877f7ad02f841471.jpg http://www.cohoots.info/wp-content/uploads/2017/07/coworking-space-Co-Hoots.jpg
Goroutine: User-space scheduling
● M goroutines are multiplexed onto N kernel threads by the user-space Go runtime scheduler
● No transition between user mode and kernel mode is needed
https://image.slidesharecdn.com/realtime-linux-140810101151-phpapp02/95/making-linux-do-hard-realtime-74-638.jpg?cb=1429570932
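The M:N multiplexing can be observed with a small sketch (the helper name and the count of 10,000 are illustrative): even when `runtime.GOMAXPROCS(1)` restricts Go to a single concurrently-running thread, the user-space scheduler still interleaves all the goroutines without any kernel-level context switches between them.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// countWithOneProc runs n goroutines while only one OS-level execution
// context is active; the Go runtime scheduler multiplexes all of them
// onto it in user space.
func countWithOneProc(n int) int {
	prev := runtime.GOMAXPROCS(1) // limit to one running thread
	defer runtime.GOMAXPROCS(prev)

	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			total++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(countWithOneProc(10000)) // all goroutines completed
}
```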
Goroutine: Minimized Context Switch Overhead
● Minimized context switching
● Minimized context size
● No transition between user mode and kernel mode at all
● As a result, tens of thousands of goroutines in a single process are the norm
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_SHARE.png
Stack Management
Finding the Optimal Stack Size
● The stack is storage for a task’s call frames
○ Each call frame stores the return address, parameters, and local variables
● It must not overlap with the stacks of other concurrent tasks
(Diagram: a stack frame holds parameters, the return address, and local variables; the stack frame pointer and the stack pointer delimit the current frame; the stack grows downward, from high addresses to low)
Stack Management of Threads
● Threads allocate a fixed-size stack when created
● By default, 2 MiB on Linux/x86-32; with the pthreads library’s NPTL implementation, the stack size can be specified at thread creation time
● Too large a stack size limits the number of concurrent threads
http://docs.roguewave.com/legacy-hpp/thrug/images/stackallocation.gif
Stack Management of Goroutines
● The compiler knows how much stack a given function requires
● A goroutine starts with a very small stack
● Just before a function call, Go checks whether the current stack can accommodate the function’s stack requirement; if it is not sufficient, the stack is grown
● The stack can be shrunk, too
● As a result, goroutines keep only the stack they need, allowing the maximum number of concurrent goroutines

func f() {
    g()
}

go func() {
    f()
}()

Compiler: f() requires 1 KiB of stack; g() requires 1.5 KiB of stack
Goroutine starts with a 2 KiB stack
f() will use 1 KiB; the current stack (2 KiB free) is enough
g() will use 1.5 KiB; the current stack (1 KiB free) is not enough: allocate a bigger stack!
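Dynamic stack growth can be seen with deep recursion (a hypothetical example; the depth and per-frame padding are illustrative, and the exact frame layout is up to the compiler). On a fixed small thread stack this depth could overflow; a goroutine's stack simply grows as the calls pile up. The 64-bit `int64` result assumes nothing about the platform's `int` width.

```go
package main

import "fmt"

// depthSum recurses n levels deep, keeping some local data in every
// call frame. The goroutine running it starts with a tiny stack and
// grows it on demand as the recursion deepens.
func depthSum(n int) int64 {
	var pad [128]byte // local data occupying this call frame
	pad[0] = byte(n)
	if n == 0 {
		return int64(pad[0])
	}
	return int64(n) + depthSum(n-1)
}

func main() {
	done := make(chan int64)
	go func() { done <- depthSum(100000) }() // 100,000 stacked frames
	fmt.Println(<-done)                      // sum of 1..100000
}
```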
C10K Problem without an Event Loop
Events? Threads? Goroutines and the Integrated Poller!
C10K Problem
● How to hold 10,000 concurrent sessions?
● 10,000 threads for 10,000 sessions would incur high overhead
● An event loop usually results in complex callback-spaghetti code
https://www.youtube.com/watch?v=SgjAv1TnS5k
Integrated Poller: Goroutines Allocation
● Allocate 10,000 goroutines for 10,000 concurrent sessions; don’t worry, goroutine creation is fast enough, and tens of thousands of goroutines in a single process are the norm
● Goroutines waiting for events are simply scheduled out; the Go scheduler does not grow the number of threads under the hood, because most goroutines are scheduled out waiting on slow event completions
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_MIC_DROP.png https://github.com/ashleymcnamara/gophers/blob/master/DRAWING_GOPHER.png
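A sketch of parking many goroutines on a pending event (the helper name and the 10,000 figure are illustrative): each blocked goroutine costs little more than its small stack, and no extra OS threads are needed for the waiters.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// park launches n goroutines that all block waiting for an "event"
// (a channel close), counts how many goroutines coexist, then fires
// the event and waits for them all to finish.
func park(n int) int {
	event := make(chan struct{})
	ready := make(chan struct{}, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ready <- struct{}{}
			<-event // scheduled out until the event fires
		}()
	}
	for i := 0; i < n; i++ {
		<-ready // every goroutine has started
	}
	waiting := runtime.NumGoroutine()
	close(event) // wake everyone at once
	wg.Wait()
	return waiting
}

func main() {
	fmt.Println(park(10000) >= 10000) // 10,000 goroutines coexisted
}
```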
Integrated Poller: Polling and Scheduling
● The Go runtime uses select / kqueue / epoll / IOCP to learn which socket is ready, instead of leaving that to the goroutine that owns the socket
● Because the runtime knows which goroutine is waiting for each socket, it puts that goroutine back on the same CPU as soon as the socket becomes ready
● In short, waiting for events and waking the appropriate goroutine is delegated to the Go runtime
● As a result, gophers enjoy a simple programming model with appropriate context management overhead
https://talks.golang.org/2012/waza.slide#22
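The goroutine-per-connection model this enables can be sketched as a tiny echo server (`echoOnce` and `handle` are hypothetical names, and error handling is minimal): each connection's goroutine does plain blocking reads and writes, while the runtime's integrated poller parks and wakes it behind the scenes.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
)

// handle serves one connection with simple blocking I/O; the poller
// parks this goroutine whenever no data is available.
func handle(conn net.Conn) {
	defer conn.Close()
	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		fmt.Fprintf(conn, "%s\n", scanner.Text())
	}
}

// echoOnce starts a listener, accepts connections with one goroutine
// per session, sends msg as a client, and returns the echoed line.
func echoOnce(msg string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go handle(conn) // one goroutine per session
		}
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintf(conn, "%s\n", msg)
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return line[:len(line)-1], nil
}

func main() {
	got, err := echoOnce("hello, gopher")
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

Note there is no event loop and no callbacks in the handler; the blocking style is what keeps the code simple.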
Conclusion
● Go is special on multi-core systems owing to its clever design choices
● Goroutines are super cheap and fast for context management
● Dynamic stack management of goroutines allows more concurrency
● The integrated poller helps gophers get the benefits of both threads and event loops
https://github.com/ashleymcnamara/gophers/blob/master/GOPHER_LEARN.png
Thank You