Classic BPF (the classical Berkeley Packet Filter) is a technique that works very well for improving performance in certain low-level Go network programming scenarios.
Background
I previously developed a Go UDP application in which the client and server communicate over raw sockets. The purpose of that program is rather specific, so I will present a simplified program as an example here.
Actually, I am not being strict when I say I use the raw socket approach. I did not implement the sockets and the communication in the following way (the link-layer way), nor did I use the IP-layer raw socket approach shown below.
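For reference, those two hand-rolled approaches would look roughly like the following sketch on Linux (created directly with the syscall package; error handling omitted, and neither socket is used anywhere in this article):

package main

import "syscall"

// htons converts a 16-bit value to network byte order,
// which AF_PACKET sockets expect for the protocol argument.
func htons(v uint16) uint16 { return v<<8 | v>>8 }

func main() {
	// Link-layer raw socket: receives complete Ethernet frames.
	fdLink, _ := syscall.Socket(syscall.AF_PACKET, syscall.SOCK_RAW, int(htons(syscall.ETH_P_ALL)))

	// IP-layer raw socket: receives IPv4 packets that carry UDP.
	fdIP, _ := syscall.Socket(syscall.AF_INET, syscall.SOCK_RAW, syscall.IPPROTO_UDP)

	syscall.Close(fdLink)
	syscall.Close(fdIP)
}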
Instead, it directly uses the method net.ListenPacket("ip4:udp", addr) from the Go standard library to send and receive packets at the IP layer, and I implement custom packet sending and receiving by wrapping my own UDP-layer data structures for network monitoring.
Some people may say: just use the standard library's UDPConn. For an ordinary UDP program that is indeed fine, but it does not work for some special needs. For example, if the server listens on 1,000 UDP ports and tens of thousands of nodes periodically send monitoring data, we can hardly build 1,000 * 10,000 UDPConns, so here I use the raw socket approach instead.
RawSocket is part of the standard Berkeley sockets API. When we use Go's standard library to develop network programs, most scenarios use the encapsulated datagram type (UDPConn) or stream type (TCPConn). But if you want to do lower-level network programming, working directly with underlying protocols such as TCP, UDP, ICMP, or ARP, you need to use a raw socket. Different operating systems may implement raw sockets differently; here we take the Linux environment as an example.
The Linux man pages give a detailed introduction to raw-socket-related knowledge: socket(2), packet(7), raw(7). It will not be reproduced in this article, and it is not the focus here.
According to the Linux documentation, a packet received by a Linux server is passed to both the kernel network stack and the raw socket(s), so you sometimes need to be careful when using raw sockets; for example, if you are processing a TCP packet, the Linux kernel network stack may have already processed the same packet.
Raw sockets may tap all IP protocols in Linux, even protocols like ICMP or TCP which have a protocol module in the kernel. In this case, the packets are passed to both the kernel module and the raw socket(s). This should not be relied upon in portable programs; many other BSD socket implementations have limitations here.
If there are no special requirements, we will just use net.ListenPacket to implement a RawSocket program. The signature of this method is as follows:
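func ListenPacket(network, address string) (PacketConn, error)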
The first parameter, network, can be udp, udp4, udp6, or unixgram, or it can be ip, ip4, or ip6 followed by a colon and a protocol number or protocol name, such as ip:1 or ip:icmp.
Demo program
Server-side program
The server-side program uses conn, err := net.ListenPacket("ip4:udp", *addr) to listen for all UDP packets on the local address and starts a goroutine to process them. The handler should also check whether the UDP port is the one we are dealing with, because net.ListenPacket here listens to all local UDP traffic, and many useless UDP packets may be passed up to the user-space program.
Here we use gopacket's definitions of the various protocol layers, which makes it easy to parse (or build) packets at each layer of the TCP/IP stack.
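A minimal sketch of such a server is shown below (error handling and the real business logic are trimmed; the addr flag, the port 8972 and the EncodeUDPPacket helper described in the "Auxiliary methods" section are assumed names for illustration, not necessarily the original code):

package main

import (
	"flag"
	"log"
	"net"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
)

var addr = flag.String("addr", "127.0.0.1", "local IP to listen on")

func main() {
	flag.Parse()

	// listen for all UDP packets delivered to this host at the IP layer
	conn, err := net.ListenPacket("ip4:udp", *addr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	go handle(conn) // a single goroutine reads and processes the packets
	select {}
}

func handle(conn net.PacketConn) {
	buf := make([]byte, 65536)
	for {
		n, raddr, err := conn.ReadFrom(buf)
		if err != nil {
			return
		}
		// For an "ip4:..." conn Go strips the IPv4 header, so the buffer
		// starts with the UDP header; parse it with gopacket.
		pkt := gopacket.NewPacket(buf[:n], layers.LayerTypeUDP, gopacket.Default)
		udpLayer := pkt.Layer(layers.LayerTypeUDP)
		if udpLayer == nil {
			continue
		}
		udp := udpLayer.(*layers.UDP)
		if udp.DstPort != 8972 { // skip ports we are not responsible for
			continue
		}

		// echo the payload back to the client
		data, err := EncodeUDPPacket(net.ParseIP(*addr), raddr.(*net.IPAddr).IP,
			uint16(udp.DstPort), uint16(udp.SrcPort), udp.Payload)
		if err != nil {
			continue
		}
		conn.WriteTo(data, raddr)
	}
}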
Client program
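A sketch of the client follows (the IP addresses and ports are example values, and EncodeUDPPacket is the helper from the "Auxiliary methods" section):

package main

import (
	"log"
	"net"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
)

func main() {
	localIP := net.ParseIP("192.168.0.2")  // this machine's IP (example)
	serverIP := net.ParseIP("192.168.0.1") // the server's IP (example)

	conn, err := net.ListenPacket("ip4:udp", localIP.String())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// build a UDP packet 40000 -> 8972 carrying "hello" and send it
	data, err := EncodeUDPPacket(localIP, serverIP, 40000, 8972, []byte("hello"))
	if err != nil {
		log.Fatal(err)
	}
	if _, err := conn.WriteTo(data, &net.IPAddr{IP: serverIP}); err != nil {
		log.Fatal(err)
	}

	// read packets until we see the reply addressed to our port
	buf := make([]byte, 65536)
	for {
		n, _, err := conn.ReadFrom(buf)
		if err != nil {
			log.Fatal(err)
		}
		pkt := gopacket.NewPacket(buf[:n], layers.LayerTypeUDP, gopacket.Default)
		if udpLayer := pkt.Layer(layers.LayerTypeUDP); udpLayer != nil {
			udp := udpLayer.(*layers.UDP)
			if udp.DstPort == 40000 {
				log.Printf("reply from server: %s", udp.Payload)
				return
			}
		}
	}
}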
The client program is simplified here: it writes a "hello" and reads the reply from the server. When we do performance testing, we use a loop to continuously write a sequence number and check whether the server returns that seq, in order to measure packet loss. We also use a rate limiter to cap the traffic and test the packet loss rate at a given RPS.
Auxiliary methods
The following is the EncodeUDPPacket method, which is used to generate the data of a UDP packet.
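A plausible implementation of EncodeUDPPacket using gopacket's serialization looks like this (a reconstruction, the original may differ; note that srcIP must be the real source IP, because it goes into the UDP checksum pseudo-header):

import (
	"net"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
)

// EncodeUDPPacket builds the UDP header plus payload. The IP layer is only
// used for the checksum pseudo-header; the kernel prepends the real IPv4
// header when we write to an "ip4:udp" conn.
func EncodeUDPPacket(srcIP, dstIP net.IP, srcPort, dstPort uint16, payload []byte) ([]byte, error) {
	ip := &layers.IPv4{
		SrcIP:    srcIP,
		DstIP:    dstIP,
		Protocol: layers.IPProtocolUDP,
	}
	udp := &layers.UDP{
		SrcPort: layers.UDPPort(srcPort),
		DstPort: layers.UDPPort(dstPort),
	}
	if err := udp.SetNetworkLayerForChecksum(ip); err != nil {
		return nil, err
	}

	buf := gopacket.NewSerializeBuffer()
	opts := gopacket.SerializeOptions{ComputeChecksums: true, FixLengths: true}
	// serialize only the UDP layer and the payload, not the IP header
	err := gopacket.SerializeLayers(buf, opts, udp, gopacket.Payload(payload))
	return buf.Bytes(), err
}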
Performance issues
Although the above program runs fine, there are some problems under higher concurrency.
Above, we started a single goroutine to read packets. This is a performance bottleneck: in the end the server can only use one core to handle the raw socket packets.
Even if you create multiple goroutines to read from this PacketConn, it does not help, because there is only one PacketConn and it remains the bottleneck; sometimes a single reader goroutine even performs better than several.
So can we call net.ListenPacket("ip4:udp", *addr) multiple times to create multiple raw sockets and process packets concurrently?
It seems plausible, but in reality all of those raw sockets read the same UDP packets; the load is not balanced and spread out over multiple sockets. So multiple raw sockets are not only useless, they also consume more of the server's resources.
In actual tests the throughput only reaches 20-30 thousand RPS, and the higher the concurrency, the higher the packet loss.
So is there no way out?
There is. The main performance bottleneck is that the program above has no way to balance the load and use multiple cores to process packets concurrently. The second bottleneck is that it listens to all UDP packets on the machine and hands them to the user-space program to filter and process, even though many of them are packets we do not need.
Both of these performance problems can be handled by BPF.
BPF for Packet Filtering
The classic BPF has been around since 1994, and although everyone is now talking about the extended BPF (eBPF), the classic BPF still has its uses.
You may not have used BPF in your own programs, but I believe you have worked with it in practice.
For example, when you use tcpdump to capture network traffic, you often add filters, such as the following command, which only captures TCP traffic on port 8080:
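sudo tcpdump tcp port 8080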
tcpdump actually compiles tcp port 8080 into a filter, the packets are filtered in the kernel, and only the packets that match the filter are delivered.
You can actually view the compiled filtering code with the following command.
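sudo tcpdump -d tcp port 8080

On an Ethernet interface a recent tcpdump prints something like the following (the exact jump targets can differ slightly between versions):

(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 8
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 19
(004) ldh      [54]
(005) jeq      #0x1f90          jt 18   jf 6
(006) ldh      [56]
(007) jeq      #0x1f90          jt 18   jf 19
(008) jeq      #0x800           jt 9    jf 19
(009) ldb      [23]
(010) jeq      #0x6             jt 11   jf 19
(011) ldh      [20]
(012) jset     #0x1fff          jt 19   jf 13
(013) ldxb     4*([14]&0xf)
(014) ldh      [x + 14]
(015) jeq      #0x1f90          jt 18   jf 16
(016) ldh      [x + 16]
(017) jeq      #0x1f90          jt 18   jf 19
(018) ret      #262144
(019) ret      #0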
What does this mean? BPF defines a small set of instructions, and the kernel runs them in a virtual machine to filter packets.
The first line loads the two bytes at offset 12 (the EtherType field of the Ethernet frame). The second line checks whether it is IPv6 (0x86dd): if so, jump to 002; if not, jump to 008. Let's focus on IPv4.
Line 008 checks whether it is IPv4 (0x800); if so, jump to 009. Line 009 loads one byte at offset 23, which is the IP protocol field, and line 010 checks whether the protocol is TCP; if so, jump to 011.
Next, the fragmentation flags are checked and the IP header length is computed in order to locate the start of the TCP header.
Lines 014 and 016 read the source and destination ports of the TCP header; if either equals 8080 (0x1f90), the packet is accepted with a maximum returned size of 262144 bytes; otherwise the packet is discarded.
Of course, the code generated by tcpdump is quite rigorous. When we write the filter by hand and are sure the traffic is IPv4 with no unusual IP options, the code can be simpler than this. Still, when we actually apply BPF we might as well use the code generated by tcpdump, which is free of such errors.
Use -dd to display it as a C code fragment, and -ddd to display it as decimal numbers. Let's look at the effect of -dd, since that result can be converted into Go code.
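For tcp port 8080 the -dd output is a C array of (opcode, jt, jf, k) quadruples, roughly like this (a reconstruction of the instructions shown above; exact values depend on the tcpdump version):

sudo tcpdump -dd tcp port 8080
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 6, 0x000086dd },
{ 0x30, 0, 0, 0x00000014 },
{ 0x15, 0, 15, 0x00000006 },
{ 0x28, 0, 0, 0x00000036 },
{ 0x15, 12, 0, 0x00001f90 },
{ 0x28, 0, 0, 0x00000038 },
{ 0x15, 10, 11, 0x00001f90 },
{ 0x15, 0, 10, 0x00000800 },
{ 0x30, 0, 0, 0x00000017 },
{ 0x15, 0, 8, 0x00000006 },
{ 0x28, 0, 0, 0x00000014 },
{ 0x45, 6, 0, 0x00001fff },
{ 0xb1, 0, 0, 0x0000000e },
{ 0x48, 0, 0, 0x0000000e },
{ 0x15, 2, 0, 0x00001f90 },
{ 0x48, 0, 0, 0x00000010 },
{ 0x15, 0, 1, 0x00001f90 },
{ 0x6, 0, 0, 0x00040000 },
{ 0x6, 0, 0, 0x00000000 },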
In fact, x/net/bpf provides the corresponding methods to write BPF programs and to assemble and disassemble them more easily. For example, to write a filter that only keeps packets whose destination port equals 8972, we can simply write it as follows (for simplicity, we only consider IPv4 and ordinary IP packets):
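A sketch using golang.org/x/net/bpf (the function name udpDstPortFilter is mine; the offset 22 assumes a plain 20-byte IPv4 header with no options and no fragmentation):

import "golang.org/x/net/bpf"

// udpDstPortFilter accepts only UDP packets whose destination port equals
// port. The raw socket hands BPF the packet starting at the IPv4 header, so
// with a plain 20-byte header the destination port sits at offset 22.
func udpDstPortFilter(port uint16) []bpf.Instruction {
	return []bpf.Instruction{
		bpf.LoadAbsolute{Off: 22, Size: 2},                               // load the 2-byte destination port
		bpf.JumpIf{Cond: bpf.JumpEqual, Val: uint32(port), SkipFalse: 1}, // not our port? skip to the drop
		bpf.RetConstant{Val: 0xffff},                                     // accept: keep up to 0xffff bytes
		bpf.RetConstant{Val: 0},                                          // drop the packet
	}
}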
We can also write a program that converts the code generated by tcpdump into bpf.RawInstruction instructions:
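One simple way is to parse the decimal -ddd form: its first line is the number of instructions, and every following line holds the opcode, jt, jf and k fields of one instruction, which map directly onto bpf.RawInstruction. A sketch (the function name is mine):

import (
	"fmt"
	"strings"

	"golang.org/x/net/bpf"
)

// rawInstructionsFromTcpdump converts the output of `tcpdump -ddd <filter>`
// into instructions that can be passed to SetBPF.
func rawInstructionsFromTcpdump(out string) ([]bpf.RawInstruction, error) {
	lines := strings.Split(strings.TrimSpace(out), "\n")
	if len(lines) < 2 {
		return nil, fmt.Errorf("unexpected tcpdump -ddd output")
	}

	ins := make([]bpf.RawInstruction, 0, len(lines)-1)
	for _, line := range lines[1:] { // skip the leading instruction count
		var op, jt, jf, k uint32
		if _, err := fmt.Sscanf(strings.TrimSpace(line), "%d %d %d %d", &op, &jt, &jf, &k); err != nil {
			return nil, err
		}
		ins = append(ins, bpf.RawInstruction{Op: uint16(op), Jt: uint8(jt), Jf: uint8(jf), K: k})
	}
	return ins, nil
}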
Now that all the background knowledge has been introduced and the performance bottlenecks of the current raw socket program have been identified, how do we solve them?
For the first bottleneck, we can start multiple goroutines, each responsible for filtering a portion of the packets, so that the load is balanced. For example, we can filter based on the client's IP; or, if the server listens on 1,000 ports, each goroutine can be responsible for only a subset of the ports; filtering can also be based on the client's source port, and so on. In short, with BPF filtering each goroutine is responsible for only a part of the packets, enabling multi-core processing.
The second bottleneck is solved along with the first: because BPF only lets through packets for our specific ports, UDP packets for other ports are not copied from kernel space to user space, which reduces the processing of useless packets.
There are various ways to set a BPF filter on the standard library's PacketConn, such as setting the SO_ATTACH_FILTER socket option directly with a setsockopt call. But golang.org/x/net/ipv4 provides the SetBPF method, so we can simply wrap the standard library's PacketConn in an ipv4.PacketConn and set the filter on it.
For example, we can modify the server program above so that the filtering happens in the kernel via BPF:
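A sketch of the relevant part of main (udpDstPortFilter is the hypothetical helper defined above; it uses golang.org/x/net/bpf and golang.org/x/net/ipv4, and error handling is abbreviated):

conn, err := net.ListenPacket("ip4:udp", *addr)
if err != nil {
	log.Fatal(err)
}

// wrap the standard library conn so we can attach a classic BPF filter
pc := ipv4.NewPacketConn(conn)

// assemble the filter and install it on the socket: only UDP packets for
// port 8972 are now copied from the kernel to this conn
raw, err := bpf.Assemble(udpDstPortFilter(8972))
if err != nil {
	log.Fatal(err)
}
if err := pc.SetBPF(raw); err != nil {
	log.Fatal(err)
}

// read from conn (or pc) as before; with several conns, each carrying a
// different filter, every goroutine only sees its own share of the traffic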