Signal-based preemptive scheduling in Go dissected from source code

Introduction

Before Go 1.14, scheduling was cooperative: a goroutine had to give up the processor on its own initiative. This left edge cases that could not be preempted at all, such as tight for loops, or garbage collection having to wait for a long-running goroutine. Go 1.14 addressed some of these problems with signal-based preemptive scheduling.

Here is an example to verify the difference in preemption between 1.14 and 1.13.

package main

import (
    "fmt"
    "os"
    "runtime"
    "runtime/trace"
    "sync"
)

func main() {
    runtime.GOMAXPROCS(1)
    f, _ := os.Create("trace.output")
    defer f.Close()
    _ = trace.Start(f)
    defer trace.Stop()
    var wg sync.WaitGroup
    for i := 0; i < 30; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            t := 0
            for i := 0; i < 1e8; i++ {
                t += 2
            }
            fmt.Println("total:", t)
        }()
    }
    wg.Wait()
}

This example records an execution trace with runtime/trace. The call runtime.GOMAXPROCS(1) limits the program to a single P (processor), which gives us a single-processor scenario. The for loop then starts 30 goroutines, each running an anonymous function that is purely computational and time-consuming, so the goroutine never goes idle and never yields voluntarily.

Here we compile the program and analyze the trace output.

$ go build -gcflags "-N -l" main.go
-N disables compiler optimizations
-l disables inlining

$ ./main

Then we get the trace.output file and visualize it.

$ go tool trace -http=":6060" ./trace.output

Go1.13 trace analysis

[Figure: go tool trace view of the Go 1.13 run]

From the figure above, we can see the following:

there is only one Proc0 in the PROCS column, because we limited the program to one P.
we started 30 goroutines in the for loop, and there are exactly 30 colored boxes in Proc0.
the 30 goroutines in Proc0 run serially, one after the other, without preemption.
clicking on any one goroutine shows a Wall Duration of about 0.23s in the details pane, meaning the goroutine ran continuously for 0.23s; the 30 goroutines together take about 7s.
the Start Stack Trace is main.main.func1:20, which in the code above is the head of the function: go func().
the End Stack Trace is main.main.func1:26, which in the code is the last statement of the function: fmt.Println("total:", t).

As the trace analysis shows, Go's cooperative scheduling cannot interrupt this computation-only function: once it starts, everything else has to wait until it finishes. Each goroutine runs for a full 0.23s, and no other goroutine can take the processor away from it.

Go 1.14+ trace analysis

[Figure: go tool trace view of the Go 1.14+ run]

Go 1.14 introduced signal-based preemptive scheduling. In the figure above, the Proc0 column is densely packed with switches between goroutines; a goroutine no longer holds the processor until it finishes once it has started running.

The total run time of about 4s in this trace can be ignored: the two traces were recorded on two machines with different configurations, so the totals are not comparable.

Let’s take a closer look at the breakdown.

[Figure: details of a single goroutine slice in the Go 1.14+ trace]

This breakdown shows that:

this goroutine ran for 0.025s before it gave up the processor.
the Start Stack Trace is main.main.func1:21, the same function as above.
the End Stack Trace is runtime.asyncPreempt:50, the function that runs when the preemption signal is received, which makes it clear that the goroutine was preempted asynchronously.
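The same effect can be observed without a trace viewer. The following is a minimal sketch, assuming Go 1.19 or later (for atomic.Bool); stopSpinner is a name made up for this illustration. A goroutine spins in a loop that contains no function calls on a single P, and the calling goroutine can only stop it if the runtime preempts the spinner.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// stopSpinner (hypothetical helper) runs a call-free busy loop on a single P
// and returns how many milliseconds passed before the caller could stop it.
// Without asynchronous preemption the caller is expected never to run again,
// so this function would not return.
func stopSpinner(pauseMs int) int64 {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var stop atomic.Bool
	done := make(chan struct{})
	start := time.Now()

	go func() {
		defer close(done)
		for !stop.Load() {
			// No function call here, so there is no cooperative yield point.
		}
	}()

	// Sleeping hands the only P to the spinner. The caller gets the P back
	// only if the runtime takes it away from the spinner.
	time.Sleep(time.Duration(pauseMs) * time.Millisecond)
	stop.Store(true)
	<-done
	return time.Since(start).Milliseconds()
}

func main() {
	fmt.Println("spinner stopped after ms:", stopSpinner(20))
}
```

Running the same binary with the documented setting GODEBUG=asyncpreemptoff=1 disables signal-based preemption, which is a convenient way to compare both behaviors on one machine; with that setting this sketch is expected to hang.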

Analysis

Preemption signal installation

runtime/signal_unix.go

initsig

func initsig(preinit bool) {
    // pre-initialization
    if !preinit {
        signalsOK = true
    }
    // iterate over the signal table
    for i := uint32(0); i < _NSIG; i++ {
        t := &sigtable[i]
        // skip signals: SIGKILL, SIGSTOP, SIGTSTP, SIGCONT, SIGTTIN, SIGTTOU
        if t.flags == 0 || t.flags&_SigDefault != 0 {
            continue
        } 
        ...  
        setsig(i, funcPC(sighandler))
    }
}

The initsig function iterates over all signals and calls setsig to register a handler for each one that is not skipped. The global variable sigtable shows what information is kept for each signal.

var sigtable = [...]sigTabT{
    /* 0 */ {0, "SIGNONE: no trap"},
    /* 1 */ {_SigNotify + _SigKill, "SIGHUP: terminal line hangup"},
    /* 2 */ {_SigNotify + _SigKill, "SIGINT: interrupt"},
    /* 3 */ {_SigNotify + _SigThrow, "SIGQUIT: quit"},
    /* 4 */ {_SigThrow + _SigUnblock, "SIGILL: illegal instruction"},
    /* 5 */ {_SigThrow + _SigUnblock, "SIGTRAP: trace trap"},
    /* 6 */ {_SigNotify + _SigThrow, "SIGABRT: abort"},
    /* 7 */ {_SigPanic + _SigUnblock, "SIGBUS: bus error"},
    /* 8 */ {_SigPanic + _SigUnblock, "SIGFPE: floating-point exception"},
    /* 9 */ {0, "SIGKILL: kill"},
    /* 10 */ {_SigNotify, "SIGUSR1: user-defined signal 1"},
    /* 11 */ {_SigPanic + _SigUnblock, "SIGSEGV: segmentation violation"},
    /* 12 */ {_SigNotify, "SIGUSR2: user-defined signal 2"},
    /* 13 */ {_SigNotify, "SIGPIPE: write to broken pipe"},
    /* 14 */ {_SigNotify, "SIGALRM: alarm clock"},
    /* 15 */ {_SigNotify + _SigKill, "SIGTERM: termination"},
    /* 16 */ {_SigThrow + _SigUnblock, "SIGSTKFLT: stack fault"},
    /* 17 */ {_SigNotify + _SigUnblock + _SigIgn, "SIGCHLD: child status has changed"},
    /* 18 */ {_SigNotify + _SigDefault + _SigIgn, "SIGCONT: continue"},
    /* 19 */ {0, "SIGSTOP: stop, unblockable"},
    /* 20 */ {_SigNotify + _SigDefault + _SigIgn, "SIGTSTP: keyboard stop"},
    /* 21 */ {_SigNotify + _SigDefault + _SigIgn, "SIGTTIN: background read from tty"},
    /* 22 */ {_SigNotify + _SigDefault + _SigIgn, "SIGTTOU: background write to tty"},

    /* 23 */ {_SigNotify + _SigIgn, "SIGURG: urgent condition on socket"},
    /* 24 */ {_SigNotify, "SIGXCPU: cpu limit exceeded"},
    /* 25 */ {_SigNotify, "SIGXFSZ: file size limit exceeded"},
    /* 26 */ {_SigNotify, "SIGVTALRM: virtual alarm clock"},
    /* 27 */ {_SigNotify + _SigUnblock, "SIGPROF: profiling alarm clock"},
    /* 28 */ {_SigNotify + _SigIgn, "SIGWINCH: window size change"},
    /* 29 */ {_SigNotify, "SIGIO: i/o now possible"},
    /* 30 */ {_SigNotify, "SIGPWR: power failure restart"},
    /* 31 */ {_SigThrow, "SIGSYS: bad system call"},
    /* 32 */ {_SigSetStack + _SigUnblock, "signal 32"}, /* SIGCANCEL; see issue 6997 */
    /* 33 */ {_SigSetStack + _SigUnblock, "signal 33"}, /* SIGSETXID; see issues 3871, 9400, 12498 */
    ...
}

The specific meaning of the signals can be found in this introduction: Unix Signals

Note that the flags of the preemption signal, SIGURG, are _SigNotify + _SigIgn:

{_SigNotify + _SigIgn, "SIGURG: urgent condition on socket"}
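To make the skip condition in initsig concrete, here is a small sketch that applies it to a few sigtable entries. The flag constants are local copies written from memory of the runtime's declaration order, so their exact values are an assumption, and installsHandler is a hypothetical name. It only models the one condition visible in the excerpt, not the checks hidden behind the "..." in initsig.

```go
package main

import "fmt"

// Local stand-ins for the runtime's signal flags. The order is an
// assumption for illustration; only sigDefault matters for the check below.
const (
	sigNotify = 1 << iota
	sigKill
	sigThrow
	sigPanic
	sigDefault
	sigGoExit
	sigSetStack
	sigUnblock
	sigIgn
)

// installsHandler mirrors the condition shown in initsig:
// entries with no flags or with the default flag are skipped.
func installsHandler(flags int) bool {
	return !(flags == 0 || flags&sigDefault != 0)
}

func main() {
	entries := []struct {
		name  string
		flags int
	}{
		{"SIGKILL", 0},
		{"SIGCONT", sigNotify + sigDefault + sigIgn},
		{"SIGURG", sigNotify + sigIgn},
		{"SIGSEGV", sigPanic + sigUnblock},
	}
	for _, e := range entries {
		fmt.Printf("%-8s handler installed: %v\n", e.name, installsHandler(e.flags))
	}
}
```

Under this model SIGKILL and SIGCONT are skipped, while SIGURG, the preemption signal, gets the runtime handler.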

Let’s look at the setsig function, which is inside the runtime/os_linux.go file.

setsig

func setsig(i uint32, fn uintptr) {
    var sa sigactiont
    sa.sa_flags = _SA_SIGINFO | _SA_ONSTACK | _SA_RESTORER | _SA_RESTART
    sigfillset(&sa.sa_mask)
    ...
    if fn == funcPC(sighandler) {
        // cgo-related
        if iscgo {
            fn = funcPC(cgoSigtramp)
        } else {
            // replace with a call to sigtramp
            fn = funcPC(sigtramp)
        }
    }
    sa.sa_handler = fn
    sigaction(i, &sa, nil)
}

Note that when fn equals sighandler, the function that actually gets installed is sigtramp. On Linux, the sigaction function installs the handler through the rt_sigaction system call (sys_rt_sigaction in the kernel).
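From user code, the handlers installed this way can be observed indirectly through the os/signal package, which receives the signals that carry the _SigNotify flag. The sketch below is a Unix-only illustration; receivesSIGUSR1 is a made-up helper name. SIGUSR1 is used instead of SIGURG because the runtime sends SIGURG to itself for preemption, which would make the result ambiguous.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// receivesSIGUSR1 (hypothetical helper) subscribes to SIGUSR1, sends the
// signal to the current process, and reports whether it was delivered to
// the channel within two seconds.
func receivesSIGUSR1() bool {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGUSR1)
	defer signal.Stop(ch)

	if err := syscall.Kill(syscall.Getpid(), syscall.SIGUSR1); err != nil {
		return false
	}
	select {
	case <-ch:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("SIGUSR1 delivered through the runtime handler:", receivesSIGUSR1())
}
```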

Implementing preemption signals

This section covers what happens when the signal is delivered. Logically it comes after the preemption signal has been sent, but since we have just looked at how the handler is installed, we follow the handler first. You can jump ahead to the section on sending the preemption signal and come back afterwards.

The above analysis shows that when fn is equal to sighandler, the function called will be replaced with sigtramp, which is an assembly implementation, as we see below.

src/runtime/sys_linux_amd64.s :

TEXT runtime·sigtramp<ABIInternal>(SB),NOSPLIT,$72
    ...
    // We don't save mxcsr or the x87 control word because sigtrampgo doesn't
    // modify them.

    MOVQ    DX, ctx-56(SP)
    MOVQ    SI, info-64(SP)
    MOVQ    DI, signum-72(SP)
    MOVQ    $runtime·sigtrampgo(SB), AX
    CALL AX

    ...
    RET

When the signal is delivered, runtime·sigtramp is called in response and takes over the processing of the signal. It then goes on to call runtime·sigtrampgo.

This function is in the runtime/signal_unix.go file.

sigtrampgo&sighandler

func sigtrampgo(sig uint32, info *siginfo, ctx unsafe.Pointer) {
    if sigfwdgo(sig, info, ctx) {
        return
    }
    c := &sigctxt{info, ctx}
    g := sigFetchG(c)
    ... 
    sighandler(sig, info, ctx, g)
    setg(g)
    if setStack {
        restoreGsignalStack(&gsignalStack)
    }
}

func sighandler(sig uint32, info *siginfo, ctxt unsafe.Pointer, gp *g) {
    _g_ := getg()
    c := &sigctxt{info, ctxt}
    ... 
    // if this is a preemption signal
    if sig == sigPreempt && debug.asyncpreemptoff == 0 {
        // handle the preemption signal
        doSigPreempt(gp, c)
    }

    ...
}

The sighandler function handles many other signals as well; we only care about the preemption part, where preemption is eventually carried out by doSigPreempt.

This function is in the runtime/signal_unix.go file.

doSigPreempt

func doSigPreempt(gp *g, ctxt *sigctxt) { 
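    // check whether this G wants to be preempted and can be preempted safely
    if wantAsyncPreempt(gp) {
        // check whether it is safe to preempt at this point
        if ok, newpc := isAsyncSafePoint(gp, ctxt.sigpc(), ctxt.sigsp(), ctxt.siglr()); ok {
            // modify the registers so that the preemption call is executed
            ctxt.pushCall(funcPC(asyncPreempt), newpc)
        }
    }

    // update the preemption-related fields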
    atomic.Xadd(&gp.m.preemptGen, 1)
    atomic.Store(&gp.m.signalPending, 0) 
}

The doSigPreempt function handles the preemption signal: it reads the PC and SP of the interrupted code and, if the goroutine is at a safe point, calls ctxt.pushCall to modify them so that execution continues in the asyncPreempt function declared in runtime/preempt.go.
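The effect of pushCall can be pictured with a toy model. The sketch below assumes a 64-bit target where a call pushes its return address onto the stack (as on amd64); fakeCtx and its fields are hypothetical stand-ins for the saved signal context, not the runtime's types.

```go
package main

import "fmt"

const ptrSize = 8 // assumption: 64-bit target

// fakeCtx is a hypothetical stand-in for the register state saved when the
// signal interrupted the goroutine.
type fakeCtx struct {
	pc, sp uint64
	stack  map[uint64]uint64 // simulated memory: address -> word
}

// pushCall makes the interrupted code look as if it had just called
// targetPC from resumePC: push the return address, then redirect the PC.
func (c *fakeCtx) pushCall(targetPC, resumePC uint64) {
	c.sp -= ptrSize
	c.stack[c.sp] = resumePC
	c.pc = targetPC
}

func main() {
	c := &fakeCtx{pc: 0x1000, sp: 0x8000, stack: map[uint64]uint64{}}
	c.pushCall(0x2000, c.pc)
	// prints: pc=0x2000 sp=0x7ff8 ret=0x1000
	fmt.Printf("pc=%#x sp=%#x ret=%#x\n", c.pc, c.sp, c.stack[c.sp])
}
```

When the signal handler returns, the thread resumes at the new PC, so the injected function runs on the goroutine's own stack and can later return to the interrupted instruction.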

// asyncPreempt saves the user registers and then calls asyncPreempt2
func asyncPreempt()

The assembly code for asyncPreempt is in src/runtime/preempt_amd64.s; it saves the user registers and then calls the asyncPreempt2 function in runtime/preempt.go.

asyncPreempt2

func asyncPreempt2() {
    gp := getg()
    gp.asyncSafePoint = true
    // whether this G has been asked to stop when it is preempted
    if gp.preemptStop {
        mcall(preemptPark)
    } else {
        // give up the M and put the G on the global queue to be scheduled later
        mcall(gopreempt_m)
    }
    gp.asyncSafePoint = false
}

This function gets the current G and checks its preemptStop field. preemptStop is set to true by the suspendG function in runtime/preempt.go when it finds the goroutine in the _Grunning state; it asks the goroutine to park in the _Gpreempted state when it is preempted, rather than simply being rescheduled.

Let’s look at the preemptPark function in runtime/proc.go, which is called to execute the preempt task.

preemptPark

func preemptPark(gp *g) {

    status := readgstatus(gp)
    if status&^_Gscan != _Grunning {
        dumpgstatus(gp)
        throw("bad g status")
    }
    gp.waitreason = waitReasonPreempted 
    casGToPreemptScan(gp, _Grunning, _Gscan|_Gpreempted)
    // make the current M drop the G, releasing the thread
    dropg()
    // change the status of the current goroutine to _Gpreempted
    casfrom_Gscanstatus(gp, _Gscan|_Gpreempted, _Gpreempted)
    // and continue scheduling
    schedule()
}

preemptPark changes the status of the current goroutine to _Gpreempted, calls dropg to release the thread, and finally calls schedule to carry on scheduling other goroutines.

gopreempt_m

gopreempt_m behaves more like a voluntary yield than a forced stop: the goroutine gives up the M and rejoins the run queue to wait for scheduling.

func gopreempt_m(gp *g) { 
    goschedImpl(gp)
}

func goschedImpl(gp *g) {
    status := readgstatus(gp)
    ...
    // update the status to _Grunnable
    casgstatus(gp, _Grunning, _Grunnable)
    // make the current M drop the G, releasing the thread
    dropg()
    lock(&sched.lock)
    // put the G back on the global run queue
    globrunqput(gp)
    unlock(&sched.lock)
    // and continue scheduling
    schedule()
}
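The two branches of asyncPreempt2 differ only in where the goroutine ends up. The following toy model is a sketch of that difference; fakeG, fakeSched and the status constants are hypothetical simplifications and leave out locking, the scan bit and the call to schedule.

```go
package main

import "fmt"

type gStatus int

const (
	gRunnable gStatus = iota
	gRunning
	gPreempted
)

// fakeG and fakeSched are hypothetical, heavily simplified stand-ins.
type fakeG struct {
	status      gStatus
	preemptStop bool
}

type fakeSched struct {
	globalRunq []*fakeG
}

// asyncPreempt2 mirrors the branch in the runtime function of the same name.
func (s *fakeSched) asyncPreempt2(gp *fakeG) {
	if gp.preemptStop {
		// preemptPark: the G is parked and is not put on any run queue;
		// whoever requested the stop must resume it later.
		gp.status = gPreempted
		return
	}
	// gopreempt_m -> goschedImpl: the G stays runnable and is queued globally.
	gp.status = gRunnable
	s.globalRunq = append(s.globalRunq, gp)
}

func main() {
	s := &fakeSched{}
	parked := &fakeG{status: gRunning, preemptStop: true}
	yielded := &fakeG{status: gRunning}
	s.asyncPreempt2(parked)
	s.asyncPreempt2(yielded)
	fmt.Println("parked:", parked.status == gPreempted, "queued:", len(s.globalRunq))
}
```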

Preemption signal sending

Preemption signaling is performed by preemptM.

This function is in the runtime/signal_unix.go file.

preemptM

const sigPreempt = _SIGURG

func preemptM(mp *m) {
    ...
    if atomic.Cas(&mp.signalPending, 0, 1) { 

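        // preemptM sends a preemption request to the M.
        // On receiving the request, if the running G or P is marked for preemption
        // and the goroutine is at an asynchronous safe point, it preempts the goroutine.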
        signalM(mp, sigPreempt)
    }
}

preemptM calls signalM to send the _SIGURG signal, whose handler was installed at initialization, to the specified M.
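The compare-and-swap on signalPending means that several preemption requests issued before the handler runs collapse into a single signal. Below is a minimal sketch of that handshake using sync/atomic; fakeM and the signalsSent counter are invented for the illustration, and the counter stands in for the real signalM call.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fakeM is a hypothetical stand-in for the runtime's m struct.
type fakeM struct {
	signalPending uint32
	preemptGen    uint32
	signalsSent   int // counts what would be signalM(mp, sigPreempt) calls
}

// preemptM sends at most one signal while a previous one is still pending.
func (mp *fakeM) preemptM() {
	if atomic.CompareAndSwapUint32(&mp.signalPending, 0, 1) {
		mp.signalsSent++
	}
}

// handleSignal mirrors the bookkeeping at the end of doSigPreempt.
func (mp *fakeM) handleSignal() {
	atomic.AddUint32(&mp.preemptGen, 1)
	atomic.StoreUint32(&mp.signalPending, 0)
}

func main() {
	mp := &fakeM{}
	mp.preemptM()
	mp.preemptM() // coalesced: the first signal has not been handled yet
	mp.handleSignal()
	mp.preemptM() // a new signal can be sent again
	fmt.Println("signals sent:", mp.signalsSent, "generation:", mp.preemptGen)
}
```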

preemptM is used to send preemption signals mainly in the following places:

the Go background monitor runtime.sysmon sends a preemption signal when it detects that a goroutine has run for too long.
the Go GC sends a preemption signal when it scans a goroutine's stack.
the Go GC calls preemptall at stop-the-world (STW) to preempt all Ps and pause them.

Go background monitoring execution preemption

The system monitor runtime.sysmon calls runtime.retake in a loop to take back processors that are running a goroutine or are blocked in a system call. retake traverses all the global processors.

The main reason the system monitor preempts in a loop is to avoid the starvation caused by a G occupying an M for too long.

runtime.retake has two main parts:

a call to preemptone to preempt the current processor.
a call to handoffp to give up access to the processor.

preempt the current processor

func retake(now int64) uint32 {
    n := 0

    lock(&allpLock) 
    // iterate over the allp array
    for i := 0; i < len(allp); i++ {
        _p_ := allp[i]
        if _p_ == nil { 
            continue
        }
        pd := &_p_.sysmontick
        s := _p_.status
        sysretake := false
        if s == _Prunning || s == _Psyscall {
            // number of scheduling rounds
            t := int64(_p_.schedtick)
            if int64(pd.schedtick) != t {
                pd.schedtick = uint32(t)
                // time of the last scheduling on this processor
                pd.schedwhen = now
            // preempt the G if 10ms have passed since scheduling was last triggered
            } else if pd.schedwhen+forcePreemptNS <= now {
                preemptone(_p_)
                sysretake = true
            }
        }
        ...
    }
    unlock(&allpLock)
    return uint32(n)
}

This part reads the current status of P. If P is in the _Prunning or _Psyscall state and at least 10ms have passed since scheduling was last triggered, preemptone is called to send the preemption signal. preemptone is covered in the section on StopTheWorld below.
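The tick comparison can be isolated into a few lines. This is a sketch under the assumption that forcePreemptNS is 10ms, as the comment in retake states; sysmonTick and observe are hypothetical names that model only the two fields used here.

```go
package main

import "fmt"

const forcePreemptNS = 10 * 1000 * 1000 // 10ms, per the comment in retake

// sysmonTick is a hypothetical copy of the two fields used by this check.
type sysmonTick struct {
	schedtick uint32
	schedwhen int64
}

// observe is called on every sysmon pass with P's current schedtick.
// It returns true when the same scheduling round has been observed
// for at least forcePreemptNS.
func (pd *sysmonTick) observe(pSchedtick uint32, now int64) bool {
	if pd.schedtick != pSchedtick {
		// P has scheduled a different G since the last pass: restart the clock.
		pd.schedtick = pSchedtick
		pd.schedwhen = now
		return false
	}
	return pd.schedwhen+forcePreemptNS <= now
}

func main() {
	pd := &sysmonTick{}
	for _, now := range []int64{0, 5_000_000, 10_000_000} {
		fmt.Printf("t=%dms preempt=%v\n", now/1_000_000, pd.observe(1, now))
	}
}
```

The first pass only records the tick; a preemption is requested once a later pass sees the same tick at least 10ms afterwards.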

call handoffp to give up access to the processor

func retake(now int64) uint32 {
    n := 0
    lock(&allpLock) 
    // iterate over the allp array
    for i := 0; i < len(allp); i++ {
        _p_ := allp[i]
        if _p_ == nil { 
            continue
        }
        pd := &_p_.sysmontick
        s := _p_.status
        sysretake := false
        ...
        if s == _Psyscall { 
            // number of system calls
            t := int64(_p_.syscalltick)
            if !sysretake && int64(pd.syscalltick) != t {
                pd.syscalltick = uint32(t)
                // time of the system call
                pd.syscallwhen = now
                continue
            } 
            if runqempty(_p_) && atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0 && pd.syscallwhen+10*1000*1000 > now {
                continue
            } 
            unlock(&allpLock) 
            incidlelocked(-1)
            if atomic.Cas(&_p_.status, s, _Pidle) { 
                n++
                _p_.syscalltick++
                // give up the use of the processor
                handoffp(_p_)
            }
            incidlelocked(1)
            lock(&allpLock)
        }
    }
    unlock(&allpLock)
    return uint32(n)
}

If P is in the _Psyscall state, three conditions are checked; if any one of them is not satisfied, handoffp is called to give up the use of P:

runqempty(_p_): P's run queue is empty.
atomic.Load(&sched.nmspinning)+atomic.Load(&sched.npidle) > 0: nmspinning is the number of spinning Ms (Ms that are looking for Gs to steal) and npidle is the number of idle Ps, so this checks whether there is an idle P or an M already searching for work.
pd.syscallwhen+10*1000*1000 > now: the system call has been observed for less than 10ms.
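The three conditions combine into a single decision, sketched below as a pure function. shouldHandoff and its parameter names are made up for this illustration; the arithmetic follows the expression in retake.

```go
package main

import "fmt"

// shouldHandoff (hypothetical helper) reports whether retake would take P
// away from a thread blocked in a system call. P is left alone only when
// all three conditions from the text hold at the same time.
func shouldHandoff(runqEmpty bool, spinningPlusIdle int32, syscallWhen, now int64) bool {
	keep := runqEmpty && spinningPlusIdle > 0 && syscallWhen+10*1000*1000 > now
	return !keep
}

func main() {
	const ms = 1_000_000
	fmt.Println(shouldHandoff(true, 1, 0, 5*ms))  // false: nothing to run, short syscall
	fmt.Println(shouldHandoff(false, 1, 0, 5*ms)) // true: P still has queued Gs
	fmt.Println(shouldHandoff(true, 0, 0, 5*ms))  // true: no idle P or spinning M
	fmt.Println(shouldHandoff(true, 1, 0, 10*ms)) // true: syscall has lasted 10ms
}
```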

Go GC stack scan sends preemption signal

When the GC marks roots, Go scans the stack of each G. It calls suspendG to suspend the G before scanning, and calls resumeG to resume it after the scan.

This function is in: runtime/mgcmark.go :

markroot

func markroot(gcw *gcWork, i uint32) { 
    ...
    switch { 
    ...
    // scan the stack of each G
    default:
        // get the G to be scanned
        var gp *g
        if baseStacks <= i && i < end {
            gp = allgs[i-baseStacks]
        } else {
            throw("markroot: bad index")
        } 
        ...
        // switch to g0 to do the scan
        systemstack(func() {
            ...
            // suspend the G so that it stops running
            stopped := suspendG(gp)
            if stopped.dead {
                gp.gcscandone = true
                return
            }
            if gp.gcscandone {
                throw("g already scanned")
            }
            // scan the stack of the G
            scanstack(gp, gcw)
            gp.gcscandone = true
            // resume the execution of the G
            resumeG(stopped) 
        })
    }
}

markroot switches to g0 before scanning the stack and then calls suspendG, which inspects the status of G. If the G is running (_Grunning), suspendG sets preemptStop to true and sends a preemption signal.

This function is in: runtime/preempt.go :

suspendG

func suspendG(gp *g) suspendGState {
    ...
    const yieldDelay = 10 * 1000

    var nextPreemptM int64
    for i := 0; ; i++ {
        switch s := readgstatus(gp); s { 
        ... 
        case _Grunning:
            if gp.preemptStop && gp.preempt && gp.stackguard0 == stackPreempt && asyncM == gp.m && atomic.Load(&asyncM.preemptGen) == asyncGen {
                break
            }
            if !castogscanstatus(gp, _Grunning, _Gscanrunning) {
                break
            }
            // set the preemption fields
            gp.preemptStop = true
            gp.preempt = true
            gp.stackguard0 = stackPreempt

            asyncM2 := gp.m
            asyncGen2 := atomic.Load(&asyncM2.preemptGen)
            // asyncM and asyncGen record the previous preemption in this loop, so that the same preemption is not requested twice
            needAsync := asyncM != asyncM2 || asyncGen != asyncGen2
            asyncM = asyncM2
            asyncGen = asyncGen2

            casfrom_Gscanstatus(gp, _Gscanrunning, _Grunning)

            if preemptMSupported && debug.asyncpreemptoff == 0 && needAsync { 
                now := nanotime()
                // limit how often preemption is requested
                if now >= nextPreemptM {
                    nextPreemptM = now + yieldDelay/2
                    // send the preemption signal
                    preemptM(asyncM)
                }
            }
        }
        ...
    }
}

For suspendG, I only show how a G in the _Grunning state is handled. This branch sets preemptStop to true, and it is the only place where that happens. preemptStop determines what happens when the preemption signal is handled; see the asyncPreempt2 function above if you need a reminder.
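The nextPreemptM bookkeeping in the loop is a small rate limiter: with yieldDelay at 10µs, a signal goes out at most once every 5µs. The sketch below isolates that logic; preemptLimiter and allow are hypothetical names, and the timestamps are passed in instead of being read from nanotime so the behavior is easy to follow.

```go
package main

import "fmt"

const yieldDelay = 10 * 1000 // nanoseconds, as in suspendG

// preemptLimiter is a hypothetical wrapper around the nextPreemptM variable.
type preemptLimiter struct {
	nextPreemptM int64
}

// allow reports whether a preemption signal may be sent at time now
// and, if so, pushes the next allowed time yieldDelay/2 into the future.
func (l *preemptLimiter) allow(now int64) bool {
	if now < l.nextPreemptM {
		return false
	}
	l.nextPreemptM = now + yieldDelay/2
	return true
}

func main() {
	var l preemptLimiter
	for _, now := range []int64{0, 4999, 5000, 5001} {
		fmt.Printf("t=%dns send=%v\n", now, l.allow(now))
	}
}
```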

Go GC StopTheWorld preempts all P

Go GC STW is executed with the stopTheWorldWithSema function, which is in runtime/proc.go :

stopTheWorldWithSema

func stopTheWorldWithSema() {
    _g_ := getg() 

    lock(&sched.lock)
    sched.stopwait = gomaxprocs
    // set gcwaiting; the scheduler waits when it sees this flag
    atomic.Store(&sched.gcwaiting, 1)
    // send the preemption signals
    preemptall()
    // stop the current P
    _g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic.
    ...
    wait := sched.stopwait > 0
    unlock(&sched.lock)
    if wait {
        for {
            // wait for 100 us
            if notetsleep(&sched.stopnote, 100*1000) {
                noteclear(&sched.stopnote)
                break
            }
            // send the preemption signals again
            preemptall()
        }
    }
    ...
}

The stopTheWorldWithSema function will call preemptall to send preemption signals to all P’s.
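The wait loop at the end of stopTheWorldWithSema is a "wait with timeout, then ask again" pattern. The sketch below reproduces its shape with a channel and a timer instead of the runtime's note primitives; waitStopped, the channel and the resend callback are all stand-ins invented for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// waitStopped (hypothetical helper) waits until stopped is closed. Each time
// intervalUs microseconds pass without that happening, it calls resend,
// which stands in for preemptall. It returns the number of resends.
func waitStopped(stopped <-chan struct{}, intervalUs int, resend func()) int {
	resends := 0
	interval := time.Duration(intervalUs) * time.Microsecond
	for {
		// Check first without a timer, so an already closed channel always wins.
		select {
		case <-stopped:
			return resends
		default:
		}
		select {
		case <-stopped:
			return resends
		case <-time.After(interval):
			resend()
			resends++
		}
	}
}

func main() {
	stopped := make(chan struct{})
	calls := 0
	n := waitStopped(stopped, 100, func() {
		calls++
		if calls == 3 {
			close(stopped) // pretend every P acknowledged after the third retry
		}
	})
	fmt.Println("preemption requests resent:", n)
}
```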

The file location of the preemptall function is in runtime/proc.go :

preemptall

func preemptall() bool {
   res := false
   // iterate over all Ps
   for _, _p_ := range allp {
      if _p_.status != _Prunning {
         continue
      }
      // send a preemption signal to each running P
      if preemptone(_p_) {
         res = true
      }
   }
   return res
}

The preemptone called by preemptall takes the executing G in M corresponding to P and marks it as being preempted; finally it calls preemptM to send a preempt signal to M.

The file location of this function is in runtime/proc.go :

preemptone

func preemptone(_p_ *p) bool {
    // get the M that corresponds to this P
    mp := _p_.m.ptr()
    if mp == nil || mp == getg().m {
        return false
    }
    // get the G that the M is running
    gp := mp.curg
    if gp == nil || gp == mp.g0 {
        return false
    }
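    // mark the G for preemption
    gp.preempt = true

    // the preemption request is also detected when the stack growth check runs
    gp.stackguard0 = stackPreempt

    // request asynchronous preemption of this P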
    if preemptMSupported && debug.asyncpreemptoff == 0 {
        _p_.preempt = true
        preemptM(mp)
    } 
    return true
}
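How preemptall and preemptone cooperate can be shown with a toy model. The types below (fakeG, fakeM, fakeP) are hypothetical simplifications, the signals counter stands in for the call to preemptM, and the g0 check and the preemptMSupported condition are left out.

```go
package main

import "fmt"

type pStatus int

const (
	pIdle pStatus = iota
	pRunning
)

// Hypothetical, simplified stand-ins for the runtime's g, m and p structs.
type fakeG struct{ preempt bool }
type fakeM struct {
	curg    *fakeG
	signals int // counts what would be preemptM(mp) calls
}
type fakeP struct {
	status  pStatus
	m       *fakeM
	preempt bool
}

// preemptone marks the G running on pp for preemption and signals its M.
// self is the M of the caller, which must not preempt itself.
func preemptone(pp *fakeP, self *fakeM) bool {
	mp := pp.m
	if mp == nil || mp == self {
		return false
	}
	gp := mp.curg
	if gp == nil {
		return false
	}
	gp.preempt = true
	pp.preempt = true
	mp.signals++
	return true
}

// preemptall asks every running P to stop and reports whether any
// preemption request was issued.
func preemptall(allp []*fakeP, self *fakeM) bool {
	res := false
	for _, pp := range allp {
		if pp.status != pRunning {
			continue
		}
		if preemptone(pp, self) {
			res = true
		}
	}
	return res
}

func main() {
	self := &fakeM{curg: &fakeG{}}
	other := &fakeM{curg: &fakeG{}}
	allp := []*fakeP{{status: pRunning, m: self}, {status: pRunning, m: other}, {status: pIdle}}
	fmt.Println("requested:", preemptall(allp, self), "signals to other M:", other.signals)
}
```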

Wrap-up

Up to this point, we have walked through the complete signal-based preemptive scheduling process. To summarize the logic:

when the program starts, the runtime installs its handler for the _SIGURG signal; for the preemption signal, the handler ends up in runtime.doSigPreempt.
at some point, an M1 sends the signal _SIGURG to M2 via the signalM function.
M2 receives the signal; the OS interrupts the code it is executing and switches to the signal handler, which reaches runtime.doSigPreempt.
doSigPreempt modifies the execution context so that M2 runs runtime.asyncPreempt, which re-enters the scheduling loop to schedule other Gs.
