Evolution of WaitGroup in golang

WaitGroup is a concurrency primitive often used in Go concurrent programming to do task scheduling. It looks like it has only a few simple methods and is relatively easy to use. In fact, the internal implementation of WaitGroup has been changed several times, mainly to optimize the atomic operations of its fields.

The original implementation of WaitGroup

The earliest implementation of WaitGroup is as follows.

type WaitGroup struct {
    m       Mutex
    counter int32
    waiters int32
    sema    *uint32
}
func (wg *WaitGroup) Add(delta int) {
    v := atomic.AddInt32(&wg.counter, int32(delta))
    if v < 0 {
        panic("sync: negative WaitGroup count")
    }
    if v > 0 || atomic.LoadInt32(&wg.waiters) == 0 {
        return
    }
    wg.m.Lock()
    for i := int32(0); i < wg.waiters; i++ {
        runtime_Semrelease(wg.sema)
    }
    wg.waiters = 0
    wg.sema = nil
    wg.m.Unlock()
}

The meaning of its implementation fields is clearer, but the implementation is still slightly rough, for example, sema is implemented using pointers.

Then the fields counter and waiters are merged. In order to guarantee 8-bit alignment for 64bit atomic operations, the alignment point of state1 needs to be found. sema removes the pointer implementation.

type WaitGroup struct {
    // 64-bit value: high 32 bits are counter, low 32 bits are waiter count.
    // 64-bit atomic operations require 64-bit alignment, but 32-bit
    // compilers do not ensure it. So we allocate 12 bytes and then use
    // the aligned 8 bytes in them as state.
    state1 [12]byte
    sema   uint32
}
func (wg *WaitGroup) state() *uint64 {
    if uintptr(unsafe.Pointer(&wg.state1))%8 == 0 {
        return (*uint64)(unsafe.Pointer(&wg.state1))
    } else {
        return (*uint64)(unsafe.Pointer(&wg.state1[4]))
    }
}

Later, WaitGroup was implemented as follows and stabilized.

type WaitGroup struct {
    noCopy noCopy
    // 64-bit value: high 32 bits are counter, low 32 bits are waiter count.
    // 64-bit atomic operations require 64-bit alignment, but 32-bit
    // compilers do not ensure it. So we allocate 12 bytes and then use
    // the aligned 8 bytes in them as state, and the other 4 as storage
    // for the sema.
    state1 [3]uint32
}
// state returns pointers to the state and sema fields stored within wg.state1.
func (wg *WaitGroup) state() (statep *uint64, semap *uint32) {
    if uintptr(unsafe.Pointer(&wg.state1))%8 == 0 {
        return (*uint64)(unsafe.Pointer(&wg.state1)), &wg.state1[2]
    } else {
        return (*uint64)(unsafe.Pointer(&wg.state1[1])), &wg.state1[0]
    }
}

The state1 and sema fields are combined into a single field state1, which is an array of uint32, four bytes. So either the first element is 8byte aligned, or the second element is 8byte aligned. Find the aligned 8byte, the remaining 4byte as sema.

There is no problem with this implementation, it’s just a bit roundabout. Because you have to check the alignment of state1 to determine which is the counters and waiters, which is sema.

A question: What is the maximum number of waiters in a WaitGroup?

Changes in Go 1.18

In Go 1.18, WaitGroup has been changed again to ensure that fields of type uint64 are aligned to 8byte for 64bit architecture environments.

type WaitGroup struct {
    noCopy noCopy
    // 64-bit value: high 32 bits are counter, low 32 bits are waiter count.
    // 64-bit atomic operations require 64-bit alignment, but 32-bit
    // compilers only guarantee that 64-bit fields are 32-bit aligned.
    // For this reason on 32 bit architectures we need to check in state()
    // if state1 is aligned or not, and dynamically "swap" the field order if
    // needed.
    state1 uint64
    state2 uint32
}

Of course, in order to be compatible with the 32bit architecture, it is still necessary to judge the alignment.

func (wg *WaitGroup) state() (statep *uint64, semap *uint32) {
    if unsafe.Alignof(wg.state1) == 8 || uintptr(unsafe.Pointer(&wg.state1))%8 == 0 {
        // state1 is 64-bit aligned: nothing to do.
        return &wg.state1, &wg.state2
    } else {
        // state1 is 32-bit aligned but not 64-bit aligned: this means that
        // (&state1)+4 is 64-bit aligned.
        state := (*[3]uint32)(unsafe.Pointer(&wg.state1))
        return (*uint64)(unsafe.Pointer(&state[1])), &state[0]
    }
}

Overall, this modification results in a 9% to 30% performance improvement in linux/amd64 environments.

Changes in Go 1.20

The optimizations aren’t over yet. In Go 1.19, Russ Cox implemented atomic.Uint64, which is 8byte aligned in both 64bit and 32bit architectures, why? Because it has a “killer feature”: align64.

// An Uint64 is an atomic uint64. The zero value is zero.
type Uint64 struct {
    _ noCopy
    _ align64
    v uint64
}

There is no problem in 64bit architecture, when you see this field in 32bit architecture, Go compiler will automatically align it to 8byte, it is a convention.

It doesn’t help to add align64 to the struct you define under your package.

But if you also want to align your struct 8byte, you can use the following technique:

import "sync/atomic"
type T struct {
    _ [0]atomic.Int64 // 占Use 0 bytes, but the implied fields are 8byte aligned
    x uint64 // x is 8byte aligned
}

In this way, the implementation of WaitGroup can be simplified as follows.

type WaitGroup struct {
    noCopy noCopy
    state atomic.Uint64 // high 32 bits are counter, low 32 bits are waiter count.
    sema  uint32
}

There is no need to implement a separate state() method either. Just use the state field directly (without the race code).

func (wg *WaitGroup) Add(delta int) {
    state := wg.state.Add(uint64(delta) << 32)
    v := int32(state >> 32)
    w := uint32(state)
    if v < 0 {
        panic("sync: negative WaitGroup counter")
    }
    if w != 0 && delta > 0 && v == int32(delta) {
        panic("sync: WaitGroup misuse: Add called concurrently with Wait")
    }
    if v > 0 || w == 0 {
        return
    }
    // This goroutine has set counter to 0 when waiters > 0.
    // Now there can't be concurrent mutations of state:
    // - Adds must not happen concurrently with Wait,
    // - Wait does not increment waiters if it sees counter == 0.
    // Still do a cheap sanity check to detect WaitGroup misuse.
    if wg.state.Load() != state {
        panic("sync: WaitGroup misuse: Add called concurrently with Wait")
    }
    // Reset waiters count to 0.
    wg.state.Store(0)
    for ; w != 0; w-- {
        runtime_Semrelease(&wg.sema, false, 0)
    }
}

Reference

https://colobu.com/2022/08/30/waitgroup-to-love-to-toss/
https://pkg.go.dev/sync#WaitGroup

Table of Contents

The original implementation of WaitGroup

Changes in Go 1.18

Changes in Go 1.20

Reference