Duff's device and Go language zero values

I’ve heard of Duff’s device, an optimization of serial replication, for a long time. Recently, I found that Go language also uses the idea of Duff’s device when initializing zero values. So I took a closer look at it and share it with you today.

If you need to copy a certain length of data to another address starting from a certain address, the simplest code is as follows.

send(short *to, short *from, int count)
{
  do {
    *to++ = *from++;
  } while (--count > 0)
}

Because only one byte is copied in each loop, a total of count conditional branch judgments are performed. Without compiler optimization, this code would be less efficient.

A simple optimization idea is to copy as many bytes as possible in one loop to reduce the number of branch judgments. For example, we can copy 8 bytes at a time.

send(short *to, short *from, int count)
{
  int n = count / 8;
  do {
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
  } while (--n > 0);
}

However, the above code can only copy data that is a multiple of 8 in length. In order to support arbitrary lengths, we need to handle the case where the length is less than 8 bytes. So the code can be rewritten as follows.

send(short *to, short *from, int count)
{
  int n = count % 8;
  while (n-- > 0) {
      *to++ = *from++;
  }
  n = count / 8;
  if (n == 0) return;
  do {
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
  } while (--n > 0);
}

But the newly inserted while (n-- > 0) will make at most 7 judgments. To further reduce the branching instructions, we can use the switch statement instead.

send(short *to, short *from, int count)
{
  int n = count % 8;
  switch (n) {
    case 7: *to++ = *from++;
    case 6: *to++ = *from++;
    case 5: *to++ = *from++;
    case 4: *to++ = *from++;
    case 3: *to++ = *from++;
    case 2: *to++ = *from++;
    case 1: *to++ = *from++;
    case 0: ;
  }

  n = count / 8;
  if (n == 0) return;
  do {
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
    *to++ = *from++;
  } while (--n > 0);
}

For any length, the switch statement can determine the position of a case branch by comparing it once. The subsequent case branches are then executed sequentially. This avoids multiple branching judgments.

Finally, Duff found can combine switch with the following while statement, which further reduces the compiled machine instructions. Thus came the Duff device.

send(short *to, short *from, int count)
{
  int n = (count + 7) / 8;
  switch (count % 8) { do {
    case 0: *to++ = *from++;
    case 7: *to++ = *from++;
    case 6: *to++ = *from++;
    case 5: *to++ = *from++;
    case 4: *to++ = *from++;
    case 3: *to++ = *from++;
    case 2: *to++ = *from++;
    case 1: *to++ = *from++;
  } while (--n > 0); }
}

With the development of compilation techniques, the performance optimization of Duff devices is no longer obvious, and in some cases even plays a negative optimization role (see this article. But I still think that the Duff device is a very intelligent design.

The Go language will set variables to zero values by default. In the case of very large structures or very long arrays, the Go compiler inserts so-called duffzero function calls as a way to improve the efficiency of zeroing. As you can see from the function names, the Go language also borrows ideas from the Duff device.

duffzero is a series of automatically generated assembly functions, which are essentially a set of assignment instructions. As an example for the AMD64 platform.

MOVUPS  X15,(DI)
MOVUPS  X15,16(DI)
MOVUPS  X15,32(DI)
MOVUPS  X15,48(DI)
LEAQ    64(DI),DI
...
RET

The DI register holds the target address to be cleared, and X15 is the zero value register. MOVUPS clears 16 bytes at a time. A set of four instructions can clear 64 bytes at a time. duffzero series functions are automatically generated by mkduff.go, and the instructions vary from one target platform to another. For example, the AMD64 architecture generates a total of 16 sets of up-clear instructions, which can clear up to 1024 bytes at a time.

Although it is called duffzero, the logic is very different from that of the Duff device. The Duff device adapts to various length values by dynamic computation with switch statements.

Go calculates the offset of the duffzero directly during compilation based on the length to be cleared, and then executes the clear instructions sequentially. It is more efficient because there is no need to dynamically compute and judge conditional branches during execution.

If you use the go tool objdump to look at the Go language assembly instructions, you may find the following code.

`1`	`CALL 0x105d4e2 ; The address of the function called directly, which points to a location inside duffzero`

The Go compiler automatically calculates the offset inside duffzero once it knows the length of the variable.

func dzOff(b int64) int64 {
  // Point to the last RET instruction first
  off := int64(dzSize)
  // dzClearLen is the length of the previous group clear, which is 64 bytes
  // dzBlockSize is the length of each instruction grouping
  off -= b / dzClearLen * dzBlockSize
  // If the length is not divisible by 64
  tailLen := b % dzClearLen
  if tailLen >= dzClearStep {
    // Calculate the instruction offset within the group
    // dzLeaqSize is the length of the LEAQ instruction
    // dzMovSize is the MOVUPS instruction length
    off -= dzLeaqSize + dzMovSize*(tailLen/dzClearStep)
  }
  return off
}

With the dzOff function, the compiler can determine the instruction offset at which duffzero needs to be actually called, and then insert a CALL instruction.