This post documents the fix for a cluster failure caused by a host IP change. There are two clusters: a single-node (all-in-one) cluster and a four-node cluster (3 masters, 1 worker).
1. Update Etcd certificate
- Back up the Etcd certificates on each Etcd node.

  ```shell
  cp -R /etc/ssl/etcd/ssl /etc/ssl/etcd/ssl-bak
  ```
- View the domains in the old Etcd certificate.

  All DNS and IP SAN values need to be recorded; they will be used to generate the new certificates.
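  The SANs can be listed with openssl; the path and file name below are examples, point them at one of the backed-up certificates:

  ```shell
  # Example path under the backup directory created above; adjust the file name
  openssl x509 -in /etc/ssl/etcd/ssl-bak/member-node1.pem -noout -text \
    | grep -A1 "Subject Alternative Name"
  ```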
- Clean up the old Etcd certificates on each Etcd node.

  ```shell
  rm -f /etc/ssl/etcd/ssl/*
  ```
- Generate the Etcd certificate configuration on one Etcd node.

  ```shell
  vim /etc/ssl/etcd/ssl/openssl.conf
  ```

  ```
  [req]
  req_extensions = v3_req
  distinguished_name = req_distinguished_name

  [req_distinguished_name]

  [ v3_req ]
  basicConstraints = CA:FALSE
  keyUsage = nonRepudiation, digitalSignature, keyEncipherment
  subjectAltName = @alt_names

  [ ssl_client ]
  extendedKeyUsage = clientAuth, serverAuth
  basicConstraints = CA:FALSE
  subjectKeyIdentifier=hash
  authorityKeyIdentifier=keyid,issuer
  subjectAltName = @alt_names

  [ v3_ca ]
  basicConstraints = CA:TRUE
  keyUsage = nonRepudiation, digitalSignature, keyEncipherment
  subjectAltName = @alt_names
  authorityKeyIdentifier=keyid:always,issuer

  [alt_names]
  DNS.1 = localhost
  DNS.2 = etcd.kube-system.svc.cluster.local
  DNS.3 = etcd.kube-system.svc
  DNS.4 = etcd.kube-system
  DNS.5 = etcd
  DNS.6 = xxx
  IP.1 = 127.0.0.1
  IP.2 = x.x.x.x
  ```

  The hostnames and IP addresses of all deployed Etcd nodes need to be included in `[alt_names]`.
- Generate Etcd's CA certificate on one Etcd node.
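  The original does not show the commands for this step; a common sketch (file names follow the admin/member commands below; the CN is an assumption) is:

  ```shell
  # Sketch: create a new CA key and self-signed CA certificate in /etc/ssl/etcd/ssl/
  openssl genrsa -out ca-key.pem 2048
  openssl req -x509 -new -nodes -key ca-key.pem -days 3650 -out ca.pem -subj "/CN=etcd-ca"
  ```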
- Generate an Etcd admin certificate for each node, on one Etcd node.

  Generate a certificate for each node by setting a different environment variable, e.g. `export host=node1`. Here node1 is the hostname; keep it the same as before, so that the certificate is not missed because of a name change.

  ```shell
  openssl genrsa -out admin-${host}-key.pem 2048
  openssl req -new -key admin-${host}-key.pem -out admin-${host}.csr -subj "/CN=etcd-admin-${host}"
  openssl x509 -req -in admin-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out admin-${host}.pem -days 3650 -extensions ssl_client -extfile openssl.conf
  ```
- Generate an Etcd member certificate for each node, on one Etcd node.

  Switch nodes with `export host=node1` and generate a certificate for each node.

  ```shell
  openssl genrsa -out member-${host}-key.pem 2048
  openssl req -new -key member-${host}-key.pem -out member-${host}.csr -subj "/CN=etcd-member-${host}" -config openssl.conf
  openssl x509 -req -in member-${host}.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out member-${host}.pem -days 3650 -extensions ssl_client -extfile openssl.conf
  ```
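  Before distributing them, the freshly generated certificates can be checked against the new CA (the file name is an example):

  ```shell
  openssl verify -CAfile ca.pem member-node1.pem
  ```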
- Distribute the certificates generated on the one Etcd node.

  The certificates under `/etc/ssl/etcd/ssl/` need to be distributed to every Etcd node.

- View the etcd configuration on one Etcd node.

  Here etcd is started as a binary, and the location of the etcd configuration file can be found in the systemd unit.
- Replace the IPs on each Etcd node.

  Since there are multiple Etcd nodes, multiple sets of IPs need to be replaced; three nodes are used as the example here. `/etc/hosts` also needs its IPs replaced, since the hostname is sometimes used in the configuration file. If you have a scheduled backup task, you will also need to replace the relevant IPs there.
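  As a sketch, the per-node replacement can be scripted with sed; the variable values and the etcd config file path are assumptions to adapt to your environment:

  ```shell
  # Hypothetical address pair; repeat for every node whose IP changed
  oldip1=x.x.10.1
  newip1=x.x.20.1
  sed -i "s/${oldip1}/${newip1}/g" /etc/etcd.env /etc/hosts
  ```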
- Restore the Etcd data from a backup on each Etcd node.

  This step can be skipped if Etcd is a single node. The Etcd cluster is no longer operational because the node IPs have changed. A multi-node Etcd cluster must be recovered from backup data: Etcd's member information is stored in the on-disk data, so modifying only the configuration file is not enough.

  Distribute the Etcd backup file `snapshot.db` to each Etcd node, then execute the following commands on each node:
  ```shell
  rm -rf /var/lib/etcd
  ```

  ```shell
  etcdctl snapshot restore snapshot.db --name etcd-node1 \
    --initial-cluster "etcd-node1=https://x.x.10.1:2380,etcd-node2=https://x.x.10.2:2380,etcd-node3=https://x.x.10.3:2380" \
    --initial-cluster-token k8s_etcd \
    --initial-advertise-peer-urls https://x.x.10.1:2380 \
    --data-dir=/var/lib/etcd
  ```

  Note that the `etcd-node1` name and the `--initial-advertise-peer-urls` parameter vary on each node.

- Restart etcd on each Etcd node.

  ```shell
  systemctl restart etcd
  ```
- View the etcd status on each Etcd node.

  ```shell
  systemctl status etcd
  ```
2. Update K8s certificate
- Back up the certificates on each Kubernetes node.
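  The backup snippet here was lost from the original; a typical backup of the kubeadm certificates and kubeconfig files might be:

  ```shell
  # Sketch: back up the certificate directory and the conf files before regenerating them
  cp -R /etc/kubernetes/pki /etc/kubernetes/pki-bak
  mkdir -p /etc/kubernetes/conf-bak
  cp /etc/kubernetes/*.conf /etc/kubernetes/conf-bak/
  ```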
- Replace the IP addresses in the associated files on each Kubernetes node.

  ```shell
  sed -i "s/$oldip1/$newip1/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  sed -i "s/$oldip2/$newip2/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  sed -i "s/$oldip3/$newip3/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  sed -i "s/$oldip4/$newip4/" /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  ```
- Generate certificates on one master node.

  ```shell
  rm -f /etc/kubernetes/pki/apiserver*
  ```

  ```shell
  kubeadm init phase certs all --config /etc/kubernetes/kubeadm-config.yaml
  ```
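  To confirm that the new IP made it into the regenerated apiserver certificate, its SANs can be inspected (the path is the standard kubeadm location):

  ```shell
  openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text \
    | grep -A2 "Subject Alternative Name"
  ```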
- Distribute the generated certificates to each Kubernetes node.

  The worker nodes do not need the key, only the crt.
3. Update the conf files of the cluster components
- Generate new configuration files on one master node.

  ```shell
  kubeadm init phase kubeconfig all --config /etc/kubernetes/kubeadm-config.yaml
  ```
- Distribute the new configuration files to each Kubernetes node.

  Every node needs `/etc/kubernetes/kubelet.conf`; each master node additionally needs `/etc/kubernetes/controller-manager.conf` and `/etc/kubernetes/scheduler.conf`.
Configure user access credentials on the node that needs to use kubectl
1
cp /etc/kubernetes/admin.conf $HOME/.kube/config
- Restart the kubelet on each Kubernetes node, e.g. with `systemctl daemon-reload && systemctl restart kubelet`.
- View the kubelet status on each Kubernetes node.

  ```shell
  systemctl status kubelet
  ```
4. Fix ConfigMap
- Replace the IP.

  ```shell
  kubectl -n kube-system edit cm kube-proxy
  ```

  kube-proxy affects node communication. If you are using an LB or a domain name as the apiserver entry point, this edit can be skipped. After editing, the kube-proxy pods need to be recreated for the change to take effect. As for kubeadm-config, it was already replaced automatically in the steps above, so no additional processing is needed.
5. Summary
It is strongly recommended that you do not change the IP addresses of cluster hosts. If the change is a planned host IP change, you can rebuild the cluster via backup and restore.
If it is an unintended host IP change, it is recommended to fix it in the order above:
- Etcd
- K8s certificates
- Core components on the K8s master and worker nodes
- Cluster ConfigMap configuration
The content above documents the repair process. However, things got quite messy when repairing the multi-master cluster: containers restarted continuously and kept reporting port conflicts, and I also had to reboot a machine once. The documentation may be imperfect, but following the sequence, fixing one component at a time, should not cause big problems.