I’ve heard of Duff’s device, an optimization of serial replication, for a long time. Recently, I found that Go language also uses the idea of Duff’s device when initializing zero values. So I took a closer look at it and share it with you today.
If you need to copy a certain length of data to another address starting from a certain address, the simplest code is as follows.
Because only one byte is copied in each loop, a total of count
conditional branch judgments are performed. Without compiler optimization, this code would be less efficient.
A simple optimization idea is to copy as many bytes as possible in one loop to reduce the number of branch judgments. For example, we can copy 8 bytes at a time.
However, the above code can only copy data that is a multiple of 8 in length. In order to support arbitrary lengths, we need to handle the case where the length is less than 8 bytes. So the code can be rewritten as follows.
|
|
But the newly inserted while (n-- > 0)
will make at most 7 judgments. To further reduce the branching instructions, we can use the switch
statement instead.
|
|
For any length, the switch
statement can determine the position of a case
branch by comparing it once. The subsequent case
branches are then executed sequentially. This avoids multiple branching judgments.
Finally, Duff found can combine switch with the following while statement, which further reduces the compiled machine instructions. Thus came the Duff device.
|
|
With the development of compilation techniques, the performance optimization of Duff devices is no longer obvious, and in some cases even plays a negative optimization role (see this article. But I still think that the Duff device is a very intelligent design.
The Go language will set variables to zero values by default. In the case of very large structures or very long arrays, the Go compiler inserts so-called duffzero function calls as a way to improve the efficiency of zeroing. As you can see from the function names, the Go language also borrows ideas from the Duff device.
duffzero is a series of automatically generated assembly functions, which are essentially a set of assignment instructions. As an example for the AMD64 platform.
The DI register holds the target address to be cleared, and X15 is the zero value register. MOVUPS clears 16 bytes at a time. A set of four instructions can clear 64 bytes at a time. duffzero series functions are automatically generated by mkduff.go, and the instructions vary from one target platform to another. For example, the AMD64 architecture generates a total of 16 sets of up-clear instructions, which can clear up to 1024 bytes at a time.
Although it is called duffzero, the logic is very different from that of the Duff device. The Duff device adapts to various length values by dynamic computation with switch statements.
Go calculates the offset of the duffzero directly during compilation based on the length to be cleared, and then executes the clear instructions sequentially. It is more efficient because there is no need to dynamically compute and judge conditional branches during execution.
If you use the go tool objdump to look at the Go language assembly instructions, you may find the following code.
|
|
The Go compiler automatically calculates the offset inside duffzero once it knows the length of the variable.
|
|
With the dzOff
function, the compiler can determine the instruction offset at which duffzero needs to be actually called, and then insert a CALL instruction.