The two core issues that need to be addressed by container networks are as follows.

  1. Management of container IP addresses
  2. Inter-container communication

Managing container IP addresses covers both allocating and reclaiming them, while inter-container communication covers two scenarios: communication between containers on the same host and communication between containers across hosts. The two issues cannot be treated entirely separately, since a given solution usually has to address both.

Container networking is by now fairly mature. This article first gives an overview of the mainstream container network models and then explores the typical models in more depth.

CNM vs CNI

For container networks, Docker and Kubernetes each proposed their own specification standard.

  • CNM (Container Network Model), used by Docker
  • CNI (Container Network Interface), supported by Kubernetes

CNM is the model specification implemented by libnetwork, which is built into Docker; its general architecture is shown in the figure below.

General Architecture of CNM

As you can see, the CNM specification defines the following three main components.

  • Sandbox: Each Sandbox holds a container's network stack configuration: its network interfaces, routing table, DNS settings, and so on. A Sandbox can be implemented with a Linux network namespace (netns).
  • Endpoint: Each Sandbox joins a Network through an Endpoint, which can be implemented with a Linux virtual network device (veth pair).
  • Network: A group of Endpoints that can communicate with each other directly. A Network can be implemented with a Linux bridge device, a VLAN, and so on.

As you can see, the underlying building blocks are still the Linux virtual network devices and network namespaces introduced earlier. The typical CNM workflow looks like this: a user creates one or more Networks; a container's Sandbox joins one or more Networks through Endpoints; Sandboxes in the same Network can communicate with each other, while Sandboxes in different Networks are isolated. This decouples networks from containers: you can create the networks first and then decide which network each container should join when you create it.
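
As a quick illustration of this decoupling, here is a minimal Docker CLI sketch (the network and container names are just examples):

# docker network create --driver bridge mynet
# docker run -d --name web --network mynet nginx:latest
# docker network connect bridge web    # the same container can join a second network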

As for why Kubernetes does not use the CNM specification and opted for CNI instead, the reasoning can be found on the official Kubernetes blog post Why Kubernetes doesn't use libnetwork. In short, Kubernetes considered CNM too tightly coupled to the container runtime, so a number of organizations, led by Kubernetes, started working on the new CNI specification.

CNI is not natively supported by Docker. It is a generic network interface designed for container technology in general, so it is easy for an orchestrator on top to call down into CNI, but going from the bottom layer back up is not as easy, which is why common CNI plugins are hard to use directly with Docker. Both models, however, are pluggable: anyone can write their own network implementation against either specification.
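
To give a feel for the CNI side, here is a sketch of a typical configuration for the standard CNI bridge plugin (the file name, bridge name, and subnet are illustrative; /etc/cni/net.d/ is the conventional directory where CNI-aware runtimes look for configurations):

# cat > /etc/cni/net.d/10-mynet.conf <<'EOF'
{
    "cniVersion": "0.4.0",
    "name": "mynet",
    "type": "bridge",
    "bridge": "cni0",
    "isGateway": true,
    "ipMasq": true,
    "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16"
    }
}
EOF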

The network models natively supported by Docker through libnetwork can be listed with docker network ls.

# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
f559b082c95f        bridge              bridge              local
5f11ccbbf488        host                host                local
97aedfe8792d        none                null                local

As you can see, Docker supports three network models by default, and you can choose one with the --network option when creating a container. bridge is the default, and we will describe and simulate its implementation next; none does not create any network for the container; host reuses the host's network and does not create a new network namespace.
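
For example (the image names are just examples):

# docker run -d --network host nginx:latest      # shares the host's network namespace
# docker run -d --network none alpine sleep 1d   # no network except the loopback interface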

Note: If Docker Swarm is enabled, you will also see the overlay network model; we will cover the implementation of Docker's native overlay network in more detail later.

bridge network

The bridge network model is Docker's default: if we do not specify a network model when creating a container, bridge is used. It addresses both communication between containers on a single host and the exposure of container services to the outside world, and its implementation is quite simple.
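
For example, exposing a container's service to the outside world under the bridge model is done by publishing ports, for which Docker installs the corresponding NAT rules (the port numbers here are just examples):

# docker run -d --name web -p 8080:80 nginx:latest   # host port 8080 is forwarded to container port 80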

bridge network

As you can see, the bridge network model relies on the well-known docker0 bridge and veth virtual device pairs. From our earlier notes on Linux virtual network devices, we know that a packet sent into one end of a veth pair comes out directly at the other end, even when the two ends sit in different network namespaces. A veth pair is therefore effectively a "network cable" connecting network namespaces, while the docker0 bridge acts as the gateway of the container network. In fact, as soon as we create a container in bridge network mode, the corresponding veth pair is created automatically, with one end attached to the docker0 bridge and the other end becoming the eth0 virtual NIC inside the container's network namespace.

First we look at the bridge device docker0 and the routing rules on the host where docker is installed.

# ip link show docker0
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:59:c8:67:c0 brd ff:ff:ff:ff:ff:ff
# ip route ls
...
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1

Then create a container using the default bridge network model and look at the veth device pair on the host side.

# docker run -d --name mynginx nginx:latest
# ip link show type veth
11: veth42772d8@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default
    link/ether e2:a3:89:76:14:f3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
# bridge link
11: veth42772d8 state UP @if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master docker0 state forwarding priority 32 cost 2

You can see that one end of the new veth device pair, veth42772d8, is already connected to the docker0 bridge, but what about the other end?

# ls /var/run/docker/netns/
62fd67d9ef3e  default
# nsenter --net=/var/run/docker/netns/62fd67d9ef3e ip link show type veth
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
# nsenter --net=/var/run/docker/netns/62fd67d9ef3e ip addr show type veth
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

As expected, the other end of the veth device pair sits in the new network namespace 62fd67d9ef3e and has the IP address 172.17.0.2/16, which is on the same subnet as docker0.

Note: If we create a symbolic link /var/run/netns/ pointing to /var/run/docker/netns/, we do not need the nsenter command or a shell inside the container to inspect the other end of the veth device pair; the iproute2 toolkit can show it directly, as below.
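
A minimal way to create that link (a sketch, assuming the default Docker paths and that /var/run/netns does not already exist):

# ln -s /var/run/docker/netns /var/run/netns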

# ip netns show
62fd67d9ef3e (id: 0)
default
# ip netns exec 62fd67d9ef3e ip link show type veth
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
   link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
# ip netns exec 62fd67d9ef3e ip addr show type veth
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
   link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
   inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
      valid_lft forever preferred_lft forever

bridge network simulation

We will then simulate the implementation of the bridge network model. The basic network topology is shown below.

bridge network

  1. First create two netns network namespaces.

    # ip netns add netns_A
    # ip netns add netns_B
    # ip netns
    netns_B
    netns_A
    default
    
  2. Create the bridge device mybr0 in the default network namespace and assign the IP address 172.18.0.1/16 to make it the gateway of the corresponding subnet.

    # ip link add name mybr0 type bridge
    # ip addr add 172.18.0.1/16 dev mybr0
    # ip link show mybr0
    12: mybr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/ether ae:93:35:ab:59:2a brd ff:ff:ff:ff:ff:ff
    # ip route
    ...
    172.18.0.0/16 dev mybr0 proto kernel scope link src 172.18.0.1
    
  3. Next, create a veth device pair; it will be used to connect the default network namespace (via the mybr0 bridge) to the netns_A namespace created in the first step.

    # ip link add vethA type veth peer name vethpA
    # ip link show vethA
    14: vethA@vethpA: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether da:f1:fd:19:6b:4a brd ff:ff:ff:ff:ff:ff
    # ip link show vethpA
    13: vethpA@vethA: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/ether 86:d6:16:43:54:9e brd ff:ff:ff:ff:ff:ff
    
  4. Connect one end of the veth device pair created in the previous step, vethA, to the mybr0 bridge and bring it up.

    # ip link set dev vethA master mybr0
    # ip link set vethA up
    # bridge link
    14: vethA state LOWERLAYERDOWN @vethpA: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 master mybr0 state disabled priority 32 cost 2
    
  5. Move the other end of the veth device pair, vethpA, into the netns_A network namespace, rename it to eth0, assign it an IP address, and bring it up.

    # ip link set vethpA netns netns_A
    # ip netns exec netns_A ip link set vethpA name eth0
    # ip netns exec netns_A ip addr add 172.18.0.2/16 dev eth0
    # ip netns exec netns_A ip link set eth0 up
    # ip netns exec netns_A ip addr show type veth
    13: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 86:d6:16:43:54:9e brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 172.18.0.2/16 scope global eth0
        valid_lft forever preferred_lft forever
    
  6. We can now verify that the mybr0 gateway is reachable from the netns_A network namespace.

    # ip netns exec netns_A ping -c 2 172.18.0.1
    PING 172.18.0.1 (172.18.0.1) 56(84) bytes of data.
    64 bytes from 172.18.0.1: icmp_seq=1 ttl=64 time=0.096 ms
    64 bytes from 172.18.0.1: icmp_seq=2 ttl=64 time=0.069 ms
    
    --- 172.18.0.1 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1004ms
    rtt min/avg/max/mdev = 0.069/0.082/0.096/0.016 ms
    
  7. If you want to reach addresses outside 172.18.0.0/16 from the netns_A network namespace, you need to add a default route.

    # ip netns exec netns_A ip route add default via 172.18.0.1
    # ip netns exec netns_A ip route
    default via 172.18.0.1 dev eth0
    172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.2
    

    Note: If you try to ping a public address such as google.com from netns_A, it will still fail. The outgoing ICMP packets have not undergone source address translation (SNAT), so the replies cannot find their way back; Docker solves this by installing iptables rules.
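
    For reference, here is a rule in the spirit of what Docker installs, adapted to the mybr0/172.18.0.0/16 setup used in this simulation (a sketch; it also assumes IP forwarding is enabled on the host):

    # sysctl -w net.ipv4.ip_forward=1
    # iptables -t nat -A POSTROUTING -s 172.18.0.0/16 ! -o mybr0 -j MASQUERADE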

  8. Next, follow the same steps to create a veth device pair connecting the default network namespace and netns_B.

    # ip link add vethB type veth peer name vethpB
    # ip link set dev vethB master mybr0
    # ip link set vethB up
    # ip link set vethpB netns netns_B
    # ip netns exec netns_B ip link set vethpB name eth0
    # ip netns exec netns_B ip addr add 172.18.0.3/16 dev eth0
    # ip netns exec netns_B ip link set eth0 up
    # ip netns exec netns_B ip route add default via 172.18.0.1
    # ip netns exec netns_B ip add show eth0
    15: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
        link/ether 0e:2f:c6:de:fe:24 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 172.18.0.3/16 scope global eth0
        valid_lft forever preferred_lft forever
    # ip netns exec netns_B ip route show
    default via 172.18.0.1 dev eth0
    172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
    
  9. On a host running Docker, bridged packets also traverse the iptables FORWARD chain (Docker enables br_netfilter), and Docker sets that chain's default policy to DROP, so you cannot yet ping netns_B from netns_A. Add an iptables rule to allow forwarding for the mybr0 bridge.

    # iptables -A FORWARD -i mybr0 -j ACCEPT
    
  10. It is now possible to verify that the two network namespaces can communicate with each other.

    # ip netns exec netns_A ping -c 2 172.18.0.3
    PING 172.18.0.3 (172.18.0.3) 56(84) bytes of data.
    64 bytes from 172.18.0.3: icmp_seq=1 ttl=64 time=0.091 ms
    64 bytes from 172.18.0.3: icmp_seq=2 ttl=64 time=0.093 ms
    
    --- 172.18.0.3 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1027ms
    rtt min/avg/max/mdev = 0.091/0.092/0.093/0.001 ms
    # ip netns exec netns_B ping -c 2 172.18.0.2
    PING 172.18.0.2 (172.18.0.2) 56(84) bytes of data.
    64 bytes from 172.18.0.2: icmp_seq=1 ttl=64 time=0.259 ms
    64 bytes from 172.18.0.2: icmp_seq=2 ttl=64 time=0.078 ms
    
    --- 172.18.0.2 ping statistics ---
    2 packets transmitted, 2 received, 0% packet loss, time 1030ms
    rtt min/avg/max/mdev = 0.078/0.168/0.259/0.091 ms
    

In fact, the two network namespaces are on the same subnet here, so the bridge device mybr0 is still working at layer 2 (the data link layer), and each side only needs the other's MAC address to reach it.

But if you need to reach addresses on other network segments from these namespaces, that is where setting mybr0's address as the default gateway comes into play: a packet whose destination IP is not on the local subnet is sent to the gateway mybr0. At that point the host is actually working at layer 3 (the IP network layer): after receiving the packet, it consults its local routing table and the destination IP address to find the next hop.
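
You can observe both behaviors with a few commands against the setup above (a sketch; the exact output will vary):

# bridge fdb show br mybr0                     # layer 2: MAC addresses learned by the bridge
# ip netns exec netns_A ip neigh               # layer 2: ARP entries inside netns_A
# ip netns exec netns_A ip route get 8.8.8.8   # layer 3: off-subnet traffic is routed via the 172.18.0.1 gateway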