Netfilter (together with iptables) lets user-space applications register the rules that the kernel network stack applies when processing packets, enabling efficient network forwarding and filtering. Many common host firewalls, as well as Kubernetes Service forwarding, are implemented with iptables.

Most introductory articles on netfilter cover only abstract concepts, yet the kernel implementation is not especially complicated. This article mainly refers to Linux kernel version 2.6 (earlier versions are simpler), which may differ considerably from the latest 5.x versions, but the basic design has changed little, so the differences do not affect an understanding of the principles.

This article assumes that the reader already has a basic understanding of the TCP/IP protocol.

Netfilter Design and Implementation

Netfilter is defined as a network packet processing framework inside the Linux kernel. To thoroughly understand how it works, we first need a basic picture of the packet processing path in the kernel.

Kernel packet processing flow

The packet processing path in the kernel, i.e. the chain of kernel function calls that handle a network packet, can be broadly divided into layers according to the TCP/IP model. Take the reception of an IPv4 TCP packet as an example.

  1. At the physical/network device layer, the NIC writes the received packet into a ring buffer in memory via DMA. After a series of interrupts and scheduling, the OS kernel takes the packet off the corresponding device's processing queue (__skb_dequeue) as an sk_buff (socket buffer; from here on, a packet can be regarded as an sk_buff object), and finally calls the netif_receive_skb function, which sorts packets by protocol type and jumps to the corresponding handler (an excerpt of how that handler is registered is shown after this list).

  2. Assuming the packet is an IP packet, the corresponding receive handler ip_rcv is called and packet processing enters the network (IP) layer. ip_rcv checks the packet's IP header and discards erroneous packets, reassembling fragmented IP packets if necessary. The ip_rcv_finish function is then executed to route the packet and decide whether to deliver it locally or forward it to another host. Assuming the packet is destined for this host, the dst_input function next calls ip_local_deliver, which determines the transport protocol from the protocol number in the IP header and finally calls the handler for that protocol. In this example the tcp_v4_rcv function for TCP is called, and packet processing enters the transport layer.

  3. The tcp_v4_rcv function reads the packet's TCP header and verifies the checksum, then maintains the necessary state in the TCP control buffer associated with the packet, including the TCP sequence number and SACK information. It next calls __tcp_v4_lookup to find the socket the packet belongs to; if none is found, or if the socket's connection state is TCP_TIME_WAIT, the packet is dropped. If the socket is not locked, the packet is placed on the prequeue by calling tcp_prequeue, after which it becomes available for processing by the user-space program. Transport-layer processing is far more involved than this and is beyond the scope of this article.
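The dispatch by protocol type in step 1 works through packet_type handlers registered with dev_add_pack(). A trimmed excerpt from net/ipv4/af_inet.c (fields and initialization details vary across kernel versions) shows how ip_rcv is wired up as the receive handler for IPv4 frames.

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),   /* netif_receive_skb matches on this type */
    .func = ip_rcv,                  /* entry point into the IP layer */
};

static int __init inet_init(void)
{
    /* ... protocol and socket family registration ... */
    dev_add_pack(&ip_packet_type);
    /* ... */
    return 0;
}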

netfilter hooks

Let’s get down to business. The primary component of netfilter is netfilter hooks.

hook trigger points

For different protocols (IPv4, IPv6, ARP, etc.), the Linux kernel network stack triggers the corresponding hooks at predefined locations along that protocol's packet processing path. The trigger-point locations in the different protocol processing flows and the corresponding hook names (in bold, outside the blue rectangles) are shown below, with the focus on IPv4.

hook trigger points

The so-called hooks are essentially enumerated objects in the code (integers with values incrementing from 0).

enum nf_inet_hooks {
    NF_INET_PRE_ROUTING,
    NF_INET_LOCAL_IN,
    NF_INET_FORWARD,
    NF_INET_LOCAL_OUT,
    NF_INET_POST_ROUTING,
    NF_INET_NUMHOOKS
};

Each hook corresponds to a specific trigger point location in the kernel network stack, for example, the IPv4 stack has the following netfilter hooks definition.

netfilter hooks definition

  • NF_INET_PRE_ROUTING: This hook is executed in the ip_rcv function of the IPv4 stack or the ipv6_rcv function of the IPv6 stack. It is the first trigger point for all incoming packets (in fact, newer Linux versions add an INGRESS hook that fires even earlier), and it runs before any routing decision is made (a trimmed excerpt is shown after this list).
  • NF_INET_LOCAL_IN: This hook is executed in the ip_local_deliver() function of the IPv4 stack or the ip6_input() function of the IPv6 stack. After the routing decision, all incoming packets destined for the local host reach this trigger point.
  • NF_INET_FORWARD: This hook is executed in the ip_forward() function of the IPv4 protocol stack or the ip6_forward() function of the IPv6 protocol stack. After routing, all incoming packets whose destination address is not local arrive at this hook trigger point.
  • NF_INET_LOCAL_OUT: This hook is executed in the __ip_local_out() function of the IPv4 stack or the __ip6_local_out() function of the IPv6 stack. All locally generated packets ready to be sent first reach this hook trigger point after entering the network stack.
  • NF_INET_POST_ROUTING: This hook is executed in the ip_output() function of the IPv4 protocol stack or the ip6_finish_output2() function of the IPv6 protocol stack. Packets about to be sent out, whether locally generated or forwarded, reach this trigger point after the routing decision.
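As a concrete example of the first trigger point, here is a trimmed version of ip_rcv from net/ipv4/ip_input.c (details vary by kernel version): after the header sanity checks, the function ends by firing NF_INET_PRE_ROUTING through the NF_HOOK macro described in the next section, passing ip_rcv_finish as the continuation.

int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    /* ... length, version and checksum checks on the IP header ... */

    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);

    /* ... error labels that drop the packet ... */
}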

NF_HOOK macro and netfilter vector

All trigger points invoke the hook through the same macro, NF_HOOK.

static inline int NF_HOOK(uint8_t pf, unsigned int hook, struct sk_buff *skb,
    struct net_device *in, struct net_device *out,
    int (*okfn)(struct sk_buff *))
{
    return NF_HOOK_THRESH(pf, hook, skb, in, out, okfn, INT_MIN);
}

The parameters received by NF_HOOK are as follows.

  • pf: The protocol family of the packet, NFPROTO_IPV4 for IPv4.
  • hook: The netfilter hook enumeration object shown in the above figure, such as NF_INET_PRE_ROUTING or NF_INET_LOCAL_OUT.
  • skb: SKB object indicating the packet being processed.
  • in: The input network device for the packet.
  • out: The output network device for the packet.
  • okfn: A pointer to the function called when the hook terminates with NF_ACCEPT; it is usually the next processing function on the packet processing path.

The return value of NF_HOOK is one of the following netfilter vectors, each with a specific meaning.

  1. NF_ACCEPT: Continue normally along the processing path (in practice, the okfn passed in is executed inside NF_HOOK).
  2. NF_DROP: Discard the packet and terminate processing.
  3. NF_STOLEN: The packet has been taken over by the hook function; terminate processing.
  4. NF_QUEUE: Queue the packet for other processing.
  5. NF_REPEAT: Invoke the current hook's callback again.
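These vectors are plain macros in include/linux/netfilter.h; the values below are as in 2.6-era kernels (an additional NF_STOP verdict also exists but is not discussed here).

#define NF_DROP   0
#define NF_ACCEPT 1
#define NF_STOLEN 2
#define NF_QUEUE  3
#define NF_REPEAT 4
#define NF_STOP   5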

Going back to the source code, the IPv4 kernel network stack calls NF_HOOK() in the following code module.

NF_HOOK

As a concrete example, take the packet-forwarding source net/ipv4/ip_forward.c: at the end of the ip_forward function (line 115), the NF_HOOK macro is called with the NF_INET_FORWARD hook as input, and the next function on the processing path, ip_forward_finish, is passed in as the okfn argument.

int ip_forward(struct sk_buff *skb)
{
...
    if (rt->rt_flags & RTCF_DOREDIRECT && !opt->srr && !skb_sec_path(skb))
        ip_rt_send_redirect(skb);

    skb->priority = rt_tos2priority(iph->tos);

    return NF_HOOK(NFPROTO_IPV4, NF_INET_FORWARD, skb, skb->dev,
                   rt->dst.dev, ip_forward_finish);
...
}

Callback functions and priorities

Another component of netfilter is the hook callback functions. The kernel network stack uses a hook both to mark a specific trigger location and, as an integer value, to index into the data that stores the callback functions registered for that trigger point.

Other kernel modules can register callback functions on a given hook through the API provided by netfilter. The same hook can have multiple callback functions registered on it, and the priority parameter supplied at registration determines the order in which they are executed.

To register a callback function for a hook, first define an nf_hook_ops structure (or an array of such structures), which is defined as follows.

struct nf_hook_ops {
    struct list_head list;

    /* User fills in from here down. */
    nf_hookfn *hook;
    struct module *owner;
    u_int8_t pf;
    unsigned int hooknum;
    /* Hooks are ordered in ascending priority. */
    int priority;
};

There are 3 important members in the definition.

  • hook: the callback function to be registered; its parameters are defined much like those of NF_HOOK, and it can be chained with other functions via the okfn parameter.
  • hooknum: The enumeration value of the target hook to be registered.
  • priority: the priority of the callback function; callbacks with smaller values are executed first.

After defining the structure, a single callback can be registered with int nf_register_hook(struct nf_hook_ops *reg), or several at once with int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n). All the nf_hook_ops registered under the same netfilter hook are linked into a list ordered by priority; registration finds the appropriate position in the linked list according to the priority and then performs the list insertion.
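Putting the pieces together, a minimal kernel module that registers a callback on NF_INET_PRE_ROUTING might look like the sketch below. It follows the 2.6-era callback signature and registration API used throughout this article (newer kernels changed both, e.g. to nf_register_net_hook), and the demo_* names are made up for illustration.

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/skbuff.h>

/* Inspect every incoming packet before routing and always let it pass. */
static unsigned int demo_hookfn(unsigned int hooknum,
                                struct sk_buff *skb,
                                const struct net_device *in,
                                const struct net_device *out,
                                int (*okfn)(struct sk_buff *))
{
    return NF_ACCEPT;   /* let nf_iterate continue with the next callback */
}

static struct nf_hook_ops demo_ops = {
    .hook     = demo_hookfn,
    .owner    = THIS_MODULE,
    .pf       = NFPROTO_IPV4,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,    /* smallest value, so it runs first */
};

static int __init demo_init(void)
{
    return nf_register_hook(&demo_ops);
}

static void __exit demo_exit(void)
{
    nf_unregister_hook(&demo_ops);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");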

When the NF_HOOK macro is executed to trigger the specified hook, the nf_iterate function is called to traverse the nf_hook_ops linked list for that hook, and the hook callback registered in each nf_hook_ops is invoked in turn. The diagram is as follows.

hookfn

This chained execution of callback functions is also why a netfilter hook is commonly called a chain, a connection that becomes especially clear in the iptables discussion below.

Each callback function must also return a netfilter vector. If the vector is NF_ACCEPT, nf_iterate continues with the next nf_hook_ops callback, until all callbacks have been called and NF_ACCEPT is returned; if it is NF_DROP, the traversal is broken off and NF_DROP is returned immediately; if it is NF_REPEAT, the callback function is executed again. The return value of nf_iterate is also used as the return value of NF_HOOK, and the network stack decides whether to continue executing the processing functions based on the value of this vector. The schematic diagram is as follows.

nf_hook_ops

The callback function mechanism of netfilter hook has the following features.

  • Callback functions are executed in priority order; the next callback runs only if the previous one returned NF_ACCEPT.
  • Any callback function can interrupt the hook's execution chain and make the whole network stack stop processing the packet.
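These two properties can be illustrated with a small, purely hypothetical user-space simulation (ordinary C, not kernel code): the callbacks below stand in for nf_hook_ops entries already sorted by priority, and the first non-NF_ACCEPT verdict short-circuits the traversal just as nf_iterate does.

#include <stdio.h>

enum { NF_DROP, NF_ACCEPT, NF_STOLEN, NF_QUEUE, NF_REPEAT };

typedef unsigned int (*hookfn_t)(const char *pkt);

static unsigned int log_cb(const char *pkt)    { printf("log:    %s\n", pkt); return NF_ACCEPT; }
static unsigned int filter_cb(const char *pkt) { printf("filter: %s\n", pkt); return NF_DROP; }
static unsigned int nat_cb(const char *pkt)    { printf("nat:    %s\n", pkt); return NF_ACCEPT; }

int main(void)
{
    /* already ordered by ascending priority, as registration keeps the list */
    hookfn_t chain[] = { log_cb, filter_cb, nat_cb };
    unsigned int verdict = NF_ACCEPT;
    size_t i;

    for (i = 0; i < sizeof(chain) / sizeof(chain[0]); i++) {
        verdict = chain[i]("skb#1");
        if (verdict != NF_ACCEPT)
            break;              /* filter_cb drops, so nat_cb never runs */
    }
    printf("verdict handed back to the stack: %u\n", verdict);
    return 0;
}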

iptables

On top of the hook callback mechanism provided by netfilter in the kernel, netfilter author Rusty Russell also developed iptables, which manages the custom rules applied to packets from user space.

iptables is divided into two parts.

  • The user-space iptables command provides the user with an administrative interface to access the kernel iptables module.
  • The kernel-space iptables module maintains the rules table in memory, enabling table creation and registration.

Kernel space module

Initialization of xt_table

In the kernel network stack, iptables manages large numbers of packet processing rules in an orderly way through the xt_table structure. An xt_table corresponds to one rule table, the kernel counterpart of the user-space concept of a table. Different rule tables differ in that they:

  • take effect on different netfilter hooks;
  • are checked with different priorities within the same hook.

Based on the purpose of the rules, iptables initializes four different rule tables by default, namely raw, filter, nat, and mangle. The following uses filter as an example to describe the initialization and invocation of an xt_table.

The definition of filter table is as follows.

#define FILTER_VALID_HOOKS ((1 << NF_INET_LOCAL_IN) | \
                            (1 << NF_INET_FORWARD) | \
                            (1 << NF_INET_LOCAL_OUT))

static const struct xt_table packet_filter = {
    .name = "filter",
    .valid_hooks = FILTER_VALID_HOOKS,
    .me = THIS_MODULE,
    .af = NFPROTO_IPV4,
    .priority = NF_IP_PRI_FILTER,
};

(net/ipv4/netfilter/iptable_filter.c)

The initialization function iptable_filter_init in iptable_filter.c calls xt_hook_link on the xt_table structure packet_filter, which performs the following initialization procedure.

  1. Iterate over the .valid_hooks bitmask for every hook the xt_table takes effect on; for filter these are the three hooks NF_INET_LOCAL_IN, NF_INET_FORWARD and NF_INET_LOCAL_OUT.
  2. For each of these hooks, register a callback function on it, using the priority property of the xt_table (a lightly abridged version of this linking code is shown below).
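The linking itself is done by xt_hook_link() in net/netfilter/x_tables.c; the lightly abridged version below (details vary by kernel version) shows how one nf_hook_ops is filled in per bit set in valid_hooks, all sharing the table's priority and the callback passed in.

struct nf_hook_ops *xt_hook_link(const struct xt_table *table, nf_hookfn *fn)
{
    unsigned int hook_mask = table->valid_hooks;
    uint8_t i, num_hooks = hweight32(hook_mask);
    uint8_t hooknum;
    struct nf_hook_ops *ops;
    int ret;

    ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
    if (ops == NULL)
        return ERR_PTR(-ENOMEM);

    /* one nf_hook_ops per valid hook, e.g. LOCAL_IN / FORWARD / LOCAL_OUT */
    for (i = 0, hooknum = 0; i < num_hooks && hook_mask != 0;
         hook_mask >>= 1, ++hooknum) {
        if (!(hook_mask & 1))
            continue;
        ops[i].hook     = fn;               /* iptable_filter_hook for filter */
        ops[i].owner    = table->me;
        ops[i].pf       = table->af;
        ops[i].hooknum  = hooknum;
        ops[i].priority = table->priority;  /* NF_IP_PRI_FILTER for filter */
        ++i;
    }

    ret = nf_register_hooks(ops, num_hooks);
    if (ret < 0) {
        kfree(ops);
        return ERR_PTR(ret);
    }

    return ops;
}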

The priority values for different tables are as follows.

enum nf_ip_hook_priorities {
    NF_IP_PRI_RAW = -300,
    NF_IP_PRI_MANGLE = -150,
    NF_IP_PRI_NAT_DST = -100,
    NF_IP_PRI_FILTER = 0,
    NF_IP_PRI_SECURITY = 50,
    NF_IP_PRI_NAT_SRC = 100,
};

When a packet reaches a hook trigger point, all callback functions registered on that hook by different tables are executed sequentially, always in a fixed relative order according to the priority value above.

hook

ipt_do_table()

The hook callback registered by filter, iptable_filter_hook, executes the common rule-checking function ipt_do_table against its xt_table structure. ipt_do_table receives the skb, the hook and the xt_table as arguments, runs the ruleset determined by the latter two against the skb, and returns a netfilter vector as the return value of the callback function.

Before diving into the rule execution process, it is important to understand how the ruleset is represented in memory. Each rule consists of 3 parts.

  1. One ipt_entry structure, whose .next_offset field records the memory offset to the next ipt_entry.
  2. Zero or more ipt_entry_match structures, each of which may carry additional dynamically sized data.
  3. One ipt_entry_target structure, which may also carry additional dynamically sized data.

The ipt_entry structure is defined as follows.

struct ipt_entry {
    struct ipt_ip ip;
    unsigned int nfcache;

    /* size of ipt_entry + matches in memory */
    u_int16_t target_offset;
    /* size of ipt_entry + matches + target in memory */
    u_int16_t next_offset;

    /* points back to the previous rule after a jump */
    unsigned int comefrom;
    /* packet counters */
    struct xt_counters counters;
    /* zero-length-array trick: marks the memory address where the matches begin */
    unsigned char elems[0];
};
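The two offset fields are what make it possible to walk rules of varying length: the target sits right after the matches, and the next rule right after the target. The kernel has an equivalent helper named ipt_get_target(); next_rule() below is just an illustrative name.

/* the ipt_entry_target of a rule starts target_offset bytes from the entry */
static inline struct ipt_entry_target *get_target(struct ipt_entry *e)
{
    return (void *)e + e->target_offset;
}

/* the next rule starts next_offset bytes from the entry */
static inline struct ipt_entry *next_rule(struct ipt_entry *e)
{
    return (void *)e + e->next_offset;
}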

ipt_do_table first jumps to the corresponding ruleset memory area based on the hook type and the xt_table.private.entries property, and performs the following procedure.

ipt_do_table

  1. First check whether the packet's IP header matches the .ip (ipt_ip) member of the first rule's ipt_entry; if it does not match, jump to the next rule via the next_offset attribute.
  2. If the IP header matches, every ipt_entry_match object defined by the rule is checked in turn by calling the match function bound to it; depending on the return value, the result is returned to the callback function (together with whether to drop the packet), processing jumps to the next rule, or checking continues.
  3. Once all checks pass, the ipt_entry_target is read; depending on its attributes, a netfilter vector is returned to the callback function, processing continues with the next rule, or execution jumps to another rule at the specified memory offset. A non-standard ipt_entry_target also calls the function bound to it, but it can only return a vector value and cannot jump to another rule.

Flexibility and update latency

The data structures and execution flow above give iptables great extensibility: we can flexibly customize each rule's matching criteria, perform different actions depending on the result, and even jump between additional rulesets.

Because each rule has a variable length and a complex internal structure, and because a ruleset sits in one contiguous region of memory, iptables updates rules by full replacement. This allows user space to add or remove rules atomically, but non-incremental updates cause serious performance problems once the number of rules becomes large: if Kubernetes Services in a large cluster are implemented with iptables, updating even a single Service rewrites the iptables rule table as a whole. Since the full replacement is protected by a kernel lock, updates can suffer significant latency.

User-space tables, chains and rules

The user-space iptables command line can read the data of a specified table and render it in the terminal, add new rules (which in fact replaces the entire table's ruleset), and so on.

iptables mainly operates on the following objects.

  • table: corresponds to the xt_table structure in kernel space. All iptables operations are performed on a specified table, which defaults to filter.
  • chain: the set of rules that the specified table invokes from a specific netfilter hook; custom chains can also be defined and jumped to from a hook's chain.
  • rule: corresponds to ipt_entry, ipt_entry_match and ipt_entry_target above, defining the rules for matching packets and the behavior to be performed after the match.
  • match: a highly extensible custom matching condition.
  • target: a highly extensible custom action performed after a match.

Based on the flow of the code invocation process described above, the chain and rule are executed as shown in the following diagram.

chain and rule

For the specific usage and commands of iptables, you can refer to Iptables Essentials: Common Firewall Rules and Commands | DigitalOcean.

conntrack

Filtering packets by the layer 3 or layer 4 header information alone is not always enough; sometimes the state of the connection must also be considered. netfilter performs connection tracking with another built-in module, conntrack, to provide more advanced filtering features such as per-connection filtering and address translation (NAT). Because connection state has to be determined per protocol, conntrack contains protocol-specific implementations, while the overall mechanism stays the same.

I originally planned to continue with conntrack and NAT, but decided against it because of the length. I will write a separate article explaining the principles, applications and Linux kernel implementation of conntrack when I have time.