This article will teach you how to understand Go’s plan9 assembly through stack operations.
Knowledge points
Memory layout of a Linux process
Each process in a multitasking operating system runs in its own memory sandbox. In 32-bit mode this sandbox is always a 4GB address space, and allocating memory means allocating virtual memory to the process. When the process first accesses a virtual address, a page fault is triggered; the OS then allocates a page of physical memory and maps the virtual address to it, so that later accesses are translated automatically into a valid physical address where data can be stored and read.
Kernel space : the operating system kernel address space.
Stack : The stack is where the program stores temporarily created local variables, and it grows from high addresses toward low addresses. On modern mainstream architectures (e.g., x86), the stack grows downward. However, there are some processors (e.g., B5000) where the stack grows upward, some architectures (e.g., System Z) that allow a custom growth direction, and even some processors (e.g., SPARC) that use a circular stack.
Heap : heap space, which is used to hold memory segments that are dynamically allocated while the process is running; it is not fixed in size and can be dynamically expanded or reduced.
BSS segment : holds global and static data that has not been initialized.
Data segment : a data segment, usually a memory area used to hold global variables that have been initialized in a program.
Text segment : A code segment, which refers to a memory area used to hold the code for program execution. The size of this segment is determined before the program runs, and the memory area is read-only.
Concepts related to the stack
In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program.
In computer programming, a subroutine is a sequence of program instructions that performs a specific task, packaged as a unit.
A stack frame is a frame of data that gets pushed onto the stack. In the case of a call stack, a stack frame would represent a function call and its argument data.
With function calls, there is a caller and a callee, e.g., if function A calls function B, A is the caller and B is the callee.
The stack structure of the caller and the callee is shown in the following figure.
Go's assembly documentation is vague about the stack registers, so for now we only need to know the role of two of them, BP and SP.
BP: base pointer register, which holds the base address of the current stack frame and is used to index variables and parameters, like an anchor point. On other architectures it is equivalent to the frame pointer FP, except that on x86 variables and parameters can also be indexed through SP.
SP: stack pointer register , always pointing to the top of the stack.
Goroutine stack operations
Each goroutine has a stack structure with two fields, lo and hi, that describe the bounds of its stack memory.
- stack.lo: the low address of the stack space.
- stack.hi: the high address of the stack space.
A goroutine uses stackguard0 to decide whether its stack has to grow.
- stackguard0: stack.lo + StackGuard, used for stack overflow detection.
- StackGuard: the size of the guard area, a constant that is 928 bytes on Linux.
- StackSmall: a constant of 128 bytes, used to optimize calls to small functions.
- StackBig: a constant of 4096 bytes.
Whether the stack has to grow is decided from the stack frame size of the function being called.
- When the stack frame size (FrameSize) is less than or equal to StackSmall (128), the stack is grown if SP is less than stackguard0.
- When the stack frame size (FrameSize) is greater than StackSmall (128), SP - FrameSize + StackSmall is compared with stackguard0, and the stack is grown if the result is less than stackguard0.
- When the stack frame size (FrameSize) is greater than StackBig (4096), the check first tests whether stackguard0 has been set to StackPreempt. It then evaluates SP - stackguard0 + StackGuard <= FrameSize + (StackGuard - StackSmall) and grows the stack if the inequality holds.
Note that because the stack grows from high addresses toward low addresses, every check that triggers growth is a "less than" comparison: the lower SP gets, the less room is left. This is worth pausing on.
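The three cases can be sketched in Go as follows. This is only an illustration, not the code the compiler emits: needsGrow and the sample addresses are made up, and the constants are the linux/amd64 values quoted above. The comparisons use <= because the jump that the compiler inserts (JLS) is taken on "lower or same".

```go
package main

import "fmt"

// Values quoted in this article for linux/amd64; other platforms and
// Go versions may differ.
const (
	stackGuard = 928
	stackSmall = 128
	stackBig   = 4096
)

// stackPreempt stands in for the runtime's StackPreempt sentinel,
// which is -1314 reinterpreted as an unsigned word.
const stackPreempt = ^uintptr(1313)

// needsGrow is a hypothetical helper: it reports whether the prologue
// check would branch to morestack for the given SP, stackguard0 and
// frame size.
func needsGrow(sp, stackguard0, framesize uintptr) bool {
	switch {
	case framesize <= stackSmall:
		// Small frame: compare SP with the guard directly.
		return sp <= stackguard0
	case framesize <= stackBig:
		// Medium frame: SP - FrameSize + StackSmall against the guard.
		return sp-(framesize-stackSmall) <= stackguard0
	default:
		// Large frame: a preempted goroutine always takes the slow path.
		if stackguard0 == stackPreempt {
			return true
		}
		return sp-stackguard0+stackGuard <= framesize+(stackGuard-stackSmall)
	}
}

func main() {
	const guard uintptr = 0x10000 + stackGuard // made-up stack.lo of 0x10000
	fmt.Println(needsGrow(guard+64, guard, 32))
	fmt.Println(needsGrow(guard+64, guard, 208))
	fmt.Println(needsGrow(guard+8192, guard, 5008))
	fmt.Println(needsGrow(guard+8192, stackPreempt, 5008))
}
```

With 64 bytes left above the guard, the 32-byte frame passes the check while the 208-byte frame does not, which is the behavior the second case is designed to catch.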
When the stack is grown, a larger block of stack memory is allocated and the entire contents of the old stack are copied into it. Pointers that referred to variables on the old stack are then adjusted to point into the new stack. Finally, the memory of the old stack is released and recycled, which completes the dynamic growth of the stack.
Assembly
Here is a brief explanation of the parts of Go's Plan 9 assembly that the later analysis relies on, in case they are unfamiliar.
assembly functions
Let’s start by looking at the definition of the plan9 assembly functions.
stack frame size: contains local variables and the space for additional function calls.
arguments size: the combined size of the arguments and the return values, e.g. if the input is 3 int64 values and the return value is 1 int64 value, then the arguments size is sizeof(int64) * 4.
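The arithmetic for the arguments size can be checked with a few lines of Go. The helper argsSize is hypothetical and only handles int64 values, matching the example in the text.

```go
package main

import (
	"fmt"
	"unsafe"
)

// argsSize adds up the bytes taken by nargs int64 arguments and
// nresults int64 return values.
func argsSize(nargs, nresults int) uintptr {
	var v int64
	return uintptr(nargs+nresults) * unsafe.Sizeof(v)
}

func main() {
	// 3 int64 arguments plus 1 int64 result.
	fmt.Println(argsSize(3, 1)) // 32
}
```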
Stack adjustment
Stack adjustment is achieved by performing operations on the hardware SP registers, for example:
Since the stack grows downward, SUBQ allocates the function's stack frame by subtracting from SP, and ADDQ releases the frame by adding the same amount back.
Common instructions
Addition and subtraction operations.
Data handling.
Constants are written as $num in Plan 9 assembly; they can be negative and are decimal by default. The number of bytes moved is determined by the suffix of the MOV instruction.
Another point to note is that with MOVQ an operand in parentheses is treated as a memory address to be dereferenced, while an operand without parentheses is the register's own value.
jump.
Address arithmetic.
The 2 in the above code is the scale, which can only be 1, 2, 4, or 8.
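As a rough model of what such an address operand computes, the sketch below evaluates disp + base + index*scale in Go. The function effectiveAddr and its parameter order are made up for illustration, and it assumes the x86 scale factors 1, 2, 4, and 8.

```go
package main

import "fmt"

// effectiveAddr models an operand of the form disp(base)(index*scale).
// It returns false when the scale is not one the hardware can encode.
func effectiveAddr(disp int64, base, index uintptr, scale uint8) (uintptr, bool) {
	switch scale {
	case 1, 2, 4, 8:
		// A negative displacement wraps around, which matches
		// two's-complement address arithmetic.
		return base + index*uintptr(scale) + uintptr(disp), true
	}
	return 0, false
}

func main() {
	addr, ok := effectiveAddr(8, 0x1000, 3, 2)
	fmt.Printf("%#x %v\n", addr, ok) // 0x100e true
}
```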
Parsing
Creation of G
Since a stack belongs to a goroutine, let’s start with how a G is created and how its stack space is initialized. Only the code that initializes the stack is explained here.
A G is created by calling runtime.newproc.
runtime.newproc
The newproc method switches to G0 to call the newproc1 function for G creation.
runtime.newproc1
The newproc1 method is very long and it mainly gets G and then does some initialization work on the obtained G. Let’s just look at the malg function call here.
When calling the malg function, a minimum stack size value is passed in: _StackMin (2048).
runtime.malg
malg takes the requested size, adds _StackSystem (extra space reserved for operating-system needs), and uses round2 to round the total up to a power of 2. It then switches to G0 and calls stackalloc to allocate the stack memory.
After allocation, stackguard0 is set to stack.lo + _StackGuard, which is used to determine whether the stack has to grow, as discussed below.
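The size calculation and the guard setup can be sketched as below. This is a simplified model, not the runtime code: newStackBounds is a hypothetical helper, the address passed to it is invented, and _StackSystem is assumed to be 0 as on linux/amd64.

```go
package main

import "fmt"

const (
	stackMin    = 2048 // _StackMin as quoted in the text
	stackSystem = 0    // assumed value on linux/amd64
	stackGuard  = 928  // _StackGuard as quoted in the text
)

// round2 rounds x up to the next power of two.
func round2(x int32) int32 {
	s := uint(0)
	for 1<<s < x {
		s++
	}
	return 1 << s
}

type stack struct{ lo, hi uintptr }

// newStackBounds mimics the bookkeeping done after allocation for a
// stack that a hypothetical allocator placed at address lo. It returns
// the stack bounds and the resulting stackguard0.
func newStackBounds(lo uintptr, requested int32) (stack, uintptr) {
	size := uintptr(round2(stackSystem + requested))
	s := stack{lo: lo, hi: lo + size}
	return s, s.lo + stackGuard
}

func main() {
	s, guard := newStackBounds(0x10000, stackMin)
	fmt.Printf("lo=%#x hi=%#x stackguard0=%#x\n", s.lo, s.hi, guard)
	fmt.Println(round2(3000)) // 4096
}
```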
Initialization of the stack
File location: src/runtime/stack.go
Before looking at stack allocation, let’s see what happens during stack initialization. Note that stackinit is called from runtime.schedinit, which runs before runtime.newproc.
Stack initialization sets up two global variables, stackpool and stackLarge. stackpool serves stacks smaller than 32KB, and stackLarge serves stacks of 32KB or more.
Allocation of the stack
The initialization of these two global variables already tells us that a stack is allocated from a different place depending on its size.
Small stack memory allocation
File location: src/runtime/stack.go
stackalloc allocates according to the size n passed to it. On Linux, if n is less than 32768 bytes (32KB), it takes the small-stack allocation path.
A small stack is one of 2K/4K/8K/16K. During allocation an order value is computed from the size: a 2K stack has order 0, a 4K stack has order 1, and so on. On the one hand, this reduces lock contention between goroutines that request different stack sizes; on the other hand, spans of each size can be cached in advance for quick access.
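The mapping from size to order can be sketched as follows. The helper stackOrder is hypothetical, and the constants are assumed to be the Linux values (a 2KB minimum stack, 4 orders, and a 32KB cache size).

```go
package main

import "fmt"

const (
	fixedStack     = 2048  // assumed _FixedStack on Linux
	numStackOrders = 4     // assumed _NumStackOrders on Linux
	stackCacheSize = 32768 // 32KB
)

// stackOrder returns the order for an n-byte stack and whether n is
// small enough to be served by the small-stack path.
func stackOrder(n uintptr) (order uint8, small bool) {
	if n >= fixedStack<<numStackOrders || n >= stackCacheSize {
		return 0, false
	}
	// Halve the size until it reaches the minimum, counting the steps.
	for n2 := n; n2 > fixedStack; n2 >>= 1 {
		order++
	}
	return order, true
}

func main() {
	for _, n := range []uintptr{2048, 4096, 8192, 16384, 32768} {
		order, small := stackOrder(n)
		fmt.Println(n, order, small)
	}
}
```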
thisg.m.p == 0 can occur while exitsyscall is running or while procresize is changing the number of Ps, and thisg.m.preemptoff != "" occurs during GC. In other words, when exitsyscall is called, when the number of Ps changes, or when GC is running, stack space is allocated from stackpool; otherwise it is taken from the mcache.
If the stackcache corresponding to mcache is not available, then stackcacherefill is called to request a piece of memory from the heap to fill the stackcache.
Note that stackalloc runs on G0, because the caller switches to G0 before calling it, so thisg is G0.
Let’s look at the stackpoolalloc and stackcacherefill functions separately.
runtime.stackpoolalloc
stackpoolalloc looks at the head of the span list stored in stackpool at index order. If the list is not empty, it removes the node that the head span's manualFreeList points to from the linked list and returns it. If list.first is empty, it calls the allocManual function of mheap_ to allocate an mspan from the heap.
allocManual allocates a 32KB block of memory. After the new span is allocated, the 32KB are cut into pieces of elemsize bytes, the pieces are chained into a singly linked list, and the address of the last piece is assigned to manualFreeList.
For example, the current elemsize represents a memory size of 8KB.
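To make the carving concrete for an elemsize of 8KB, the sketch below cuts a simulated 32KB span into four pieces and links them. The chunk type, the carve function, and the base address are invented for the illustration; the runtime stores the links inside the stack memory itself rather than in separate nodes.

```go
package main

import "fmt"

// chunk is a stand-in for one stack-sized piece of a span.
type chunk struct {
	addr uintptr
	next *chunk
}

// carve cuts [base, base+spanSize) into elemSize pieces and pushes
// each piece onto the front of a singly linked free list, so the list
// head ends up being the piece with the highest address.
// elemSize must be non-zero.
func carve(base, spanSize, elemSize uintptr) *chunk {
	var free *chunk
	for off := uintptr(0); off < spanSize; off += elemSize {
		free = &chunk{addr: base + off, next: free}
	}
	return free
}

func main() {
	// A 32KB span at a made-up address, cut into 8KB stacks.
	for c := carve(0x10000, 32<<10, 8<<10); c != nil; c = c.next {
		fmt.Printf("%#x\n", c.addr)
	}
}
```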
runtime.stackcacherefill
stackcacherefill calls stackpoolalloc to take half a cache's worth of space from stackpool, links the pieces into a list, and puts the list into the stackcache array.
Large Stack Memory Allocation
For a large stack, the runtime checks whether stackLarge has any space left; if not, it also calls mheap_.allocManual to request new memory from the heap.
Stack expansion
Stack overflow detection
During code generation the compiler runs stacksplit in src/cmd/internal/obj/x86/obj6.go, which inserts instructions chosen by the function's stack frame size to check whether the current goroutine has enough stack space.
- When the stack frame size (FrameSize) is less than or equal to StackSmall (128), the stack is grown if SP is less than stackguard0.
- When the stack frame size (FrameSize) is greater than StackSmall (128), SP - FrameSize + StackSmall is compared with stackguard0, and the stack is grown if the result is less than stackguard0.
- When the stack frame size (FrameSize) is greater than StackBig (4096), the check first tests whether stackguard0 has been set to StackPreempt. It then evaluates SP - stackguard0 + StackGuard <= FrameSize + (StackGuard - StackSmall) and grows the stack if the inequality holds.
Let’s take a look at the pseudo-code which will make it a bit clearer.
When the stack frame size (FrameSize) is less than or equal to StackSmall (128).
When the stack frame size (FrameSize) is larger than StackSmall (128).
Here AX = SP - FrameSize + StackSmall, and the CMPQ instruction then compares AX with stackguard0.
When the stack frame size (FrameSize) is larger than StackBig (4096).
The pseudo-code here is more complicated. The stackguard0 in G may be set to StackPreempt when the goroutine is being preempted, so to find out whether it has been preempted, stackguard0 is first compared with StackPreempt. Then the comparison SP - stackguard0 + StackGuard <= FrameSize + (StackGuard - StackSmall) is performed, with StackGuard added on both sides to ensure that the value on the left is positive.
Make sure you understand the checks above before reading further.
Note that for some functions the compiler adds the NOSPLIT flag automatically, and a function with this flag skips stack overflow detection. This can be seen in the following code.
Code location: cmd/internal/obj/x86/obj6.go
The logic is roughly this: when a function is a leaf of the call chain and its stack frame is smaller than StackSmall bytes, it is automatically marked NOSPLIT. Similarly, we can force the NOSPLIT attribute ourselves by adding //go:nosplit to a function.
Stack overflow example
Here we write a simple example.
Then print out its assembly:
The three function calls in the example correspond to the three cases described above.
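A program along the following lines exercises all three cases. It is a reconstruction under assumptions rather than the exact listing discussed here: only the function names add1, add2, add3 and the 5000-byte buffer come from the text, while the 20-byte and 200-byte buffers and the noinline directives are guesses chosen to give a small, a medium, and a large frame.

```go
package main

import "fmt"

// add1 needs only a small buffer, so its frame stays below StackSmall.
//
//go:noinline
func add1(x, y int) int {
	_ = make([]byte, 20)
	return x + y
}

// add2 needs a frame larger than StackSmall but smaller than StackBig.
//
//go:noinline
func add2(x, y int) int {
	_ = make([]byte, 200)
	return x + y
}

// add3 needs a frame larger than StackBig.
//
//go:noinline
func add3(x, y int) int {
	_ = make([]byte, 5000)
	return x + y
}

func main() {
	a, b := 1, 2
	fmt.Println(add1(a, b), add2(a, b), add3(a, b))
}
```

The assembly can be printed with a command such as go tool compile -S main.go. The exact frame sizes and instructions depend on the Go version, so the output may not match the numbers quoted in this article.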
main function
First, we load a value from the TLS (thread local storage) variable into the CX register, and then compare the SP with 16(CX), so what is TLS and what does 16(CX) stand for?
TLS is a pseudo-register that represents thread-local storage, which holds the G structure. Let’s take a look at the definition of G in the runtime source code.
You can see that the stack field takes up 16 bytes, so 16(CX) corresponds to g.stackguard0. The line CMPQ SP, 16(CX) therefore compares SP with stackguard0. If SP is smaller than stackguard0, the growth threshold has been reached, and JLS jumps to line 129 to call runtime.morestack_noctxt, which performs the stack growth.
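The 16-byte offset can be verified with a small Go program. The types below copy only the leading fields of the runtime's g and stack structures and omit everything else, so they are a sketch rather than the real definitions; the result is 16 on a 64-bit platform, where a uintptr is 8 bytes.

```go
package main

import (
	"fmt"
	"unsafe"
)

// ptrSize is the size of a machine word on the current platform.
const ptrSize = unsafe.Sizeof(uintptr(0))

// stack mirrors the two bounds described earlier in the article.
type stack struct {
	lo uintptr
	hi uintptr
}

// g keeps only the fields that matter for the offset calculation.
type g struct {
	stack       stack
	stackguard0 uintptr
	stackguard1 uintptr
}

// stackguard0Offset returns the byte offset of stackguard0 inside g.
func stackguard0Offset() uintptr {
	var gp g
	return unsafe.Offsetof(gp.stackguard0)
}

func main() {
	fmt.Println(unsafe.Sizeof(stack{}), stackguard0Offset())
}
```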
add1
Looking at the assembly of add1, we can see that its stack frame is only 32 bytes, below the 128 bytes of StackSmall, and that it is a leaf function that calls nothing else. The compiler therefore marks it with the NOSPLIT flag, which confirms the conclusion above.
add2
The stack frame size of add2 function is 208, which is larger than StackSmall 128 bytes, so you can see that it first loads a value from the TLS variable to the CX register.
Then the instruction LEAQ -80(SP), AX is executed. The -80 confused me at first, but the formula here is SP - FrameSize + StackSmall; substituting the values gives SP - 208 + 128 = SP - 80, and this value is loaded into the AX register.
Finally CMPQ AX, 16(CX) is executed. As discussed above, 16(CX) is stackguard0, so this compares AX with stackguard0; if AX is smaller, execution jumps to line 138 and calls runtime.morestack_noctxt.
add3
The add3 function directly allocates an array of 5000 bytes on the stack, so it starts the same way, loading a value from the TLS variable into the CX register and then assigning stackguard0 to the SI register.
Next, the instruction CMPQ SI, $-1314 is executed, which compares stackguard0 with StackPreempt. The value is -1314 because the compiler uses the StackPreempt constant directly when it inserts the assembly, and that constant is hard-coded in the source.
Code location: cmd/internal/objabi/stack.go
If the goroutine has not been preempted, execution continues with LEAQ 928(SP), AX, which is equivalent to AX = SP + _StackGuard; _StackGuard is 928 on Linux.
Next comes SUBQ SI, AX, which is equivalent to AX -= stackguard0.
Finally comes CMPQ AX, $5808, where 5808 is framesize + _StackGuard - _StackSmall. If AX is less than 5808, execution jumps to line 147 and calls the runtime.morestack_noctxt function.
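The immediates seen in add2 and add3 can be reproduced with a little arithmetic. In the sketch below both helper functions are hypothetical, and the 5008-byte frame size of add3 is inferred from 5808 - 928 + 128 rather than read from the listing.

```go
package main

import "fmt"

const (
	stackGuard = 928 // _StackGuard on Linux, as quoted in the text
	stackSmall = 128 // _StackSmall
)

// mediumLEAQOffset returns the displacement used in LEAQ off(SP), AX
// for frames between StackSmall and StackBig.
func mediumLEAQOffset(framesize int) int {
	return -(framesize - stackSmall)
}

// largeCompareImmediate returns the constant that AX is compared with
// for frames above StackBig.
func largeCompareImmediate(framesize int) int {
	return framesize + (stackGuard - stackSmall)
}

func main() {
	fmt.Println(mediumLEAQOffset(208))       // -80
	fmt.Println(largeCompareImmediate(5008)) // 5808
}
```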
That concludes stack overflow detection. I have read other articles on the topic and have not found one that covers it as thoroughly, especially the case where the stack frame is larger than _StackBig.
Stack expansion
runtime.morestack_noctxt is implemented in assembly and calls runtime.morestack. Its implementation is as follows.
Code location: src/runtime/asm_amd64.s
After runtime.morestack finishes its checks and assignments, it switches to G0 and calls runtime.newstack to complete the growth.
runtime.newstack
The first half of the newstack function takes on the task of preempting the Goroutine.
Before the stack copy starts, the new stack size is computed as twice the old size, and the goroutine is then switched to the _Gcopystack state.
Stack Copy
- copystack first computes how much of the stack is in use, so that only the used part is copied.
- It then calls the stackalloc function to allocate a block of memory for the new stack.
- It compares the hi values of the old and new stacks to compute delta, the distance between the two blocks. Functions such as adjustsudogs and adjustctxt use delta: they locate a pointer into the old stack and add delta to obtain the corresponding address in the new stack, so that the pointer can be redirected to the new stack.
- It calls memmove to copy the used memory from the old stack to the new stack in one piece.
- It continues calling pointer-adjusting functions to fix up the ctxt, defer, and panic pointers on the stack.
- It switches the stack reference on G to the new stack.
- Finally it calls stackfree to release the memory of the old stack.
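The copy and the delta adjustment can be imitated with a toy model. Everything in the sketch below is invented for illustration: simStack, copyStack, and the addresses are not runtime names, and the model treats any copied word whose value falls inside the old stack as a pointer, whereas the runtime relies on precise pointer information.

```go
package main

import "fmt"

const wordSize = 8 // simulated 64-bit machine word

// simStack is a toy stack: words[i] lives at address lo + i*wordSize.
type simStack struct {
	lo    uintptr
	words []uintptr
}

func (s simStack) hi() uintptr { return s.lo + uintptr(len(s.words))*wordSize }

// copyStack moves the used part of old (addresses sp up to hi) into a
// new stack of newWords words at newLo and returns the new stack
// together with the adjusted stack pointer.
func copyStack(old simStack, sp, newLo uintptr, newWords int) (simStack, uintptr) {
	used := int((old.hi() - sp) / wordSize)
	nw := simStack{lo: newLo, words: make([]uintptr, newWords)}
	delta := nw.hi() - old.hi() // wraps around if the new stack is lower

	// Copy only the used words, keeping them aligned to the high end.
	start := newWords - used
	copy(nw.words[start:], old.words[len(old.words)-used:])

	// Redirect copied words that still point into the old stack.
	for i := start; i < newWords; i++ {
		if v := nw.words[i]; v >= old.lo && v < old.hi() {
			nw.words[i] = v + delta
		}
	}
	return nw, sp + delta
}

func main() {
	// The word at 0x1010 holds a pointer to the word at 0x1018.
	old := simStack{lo: 0x1000, words: []uintptr{0, 0, 0x1018, 42}}
	nw, sp := copyStack(old, 0x1010, 0x8000, 8)
	fmt.Printf("new sp=%#x words=%#x\n", sp, nw.words)
}
```

Because both stacks are aligned at their high ends, a single delta is enough to translate the stack pointer and every pointer into the old stack.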
Stack shrinkage
Stack shrinkage occurs during the phase when the stack is scanned during GC.
runtime.shrinkstack
For shrinkstack I have omitted some validation checks and kept only the core logic.
The new stack is half the size of the old one; if that would be smaller than _FixedStack (2KB), the stack is not shrunk. The function also checks how much of the current stack is in use, and it shrinks the stack only if less than 1/4 is used.
Finally, if it is determined that the stack is to be shrunk, the copystack function is called to perform the stack copy logic.
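The decision can be sketched as a small function. This is a simplified model under assumptions: shrunkSize is a hypothetical helper, the addresses in main are made up, and the extra 800 bytes counted as used are assumed to be _StackLimit on linux/amd64 (_StackGuard - _StackSystem - _StackSmall).

```go
package main

import "fmt"

const (
	fixedStack = 2048 // _FixedStack (2KB)
	stackLimit = 800  // assumed _StackLimit on linux/amd64
)

// shrunkSize returns the new stack size and true when a stack with
// bounds [lo, hi) and stack pointer sp qualifies for shrinking.
// Otherwise it returns the old size and false.
func shrunkSize(lo, hi, sp uintptr) (uintptr, bool) {
	oldsize := hi - lo
	newsize := oldsize / 2
	if newsize < fixedStack {
		return oldsize, false // never go below the minimum stack
	}
	// Shrink only if less than a quarter of the stack is in use.
	if used := hi - sp + stackLimit; used >= oldsize/4 {
		return oldsize, false
	}
	return newsize, true
}

func main() {
	const lo, hi = uintptr(0x10000), uintptr(0x10000 + 8192)
	fmt.Println(shrunkSize(lo, hi, hi-256))
	fmt.Println(shrunkSize(lo, hi, hi-2000))
}
```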
Summary
Without knowing the memory layout, this material can be hard to follow: the heap grows from low addresses to high addresses, while the stack grows in the opposite direction, which is why the instructions that allocate a stack frame decrease SP instead of increasing it.
In addition, Go uses Plan 9 assembly, for which little documentation is available, so reading it takes some extra effort.