Background
Elastic Cloud online services have long been plagued by cache inconsistency.
Cache inconsistency usually appears after an upgrade or restart of the kube-apiserver. When it occurs, the impact is clearly visible to users, and in severe cases it causes online service failures. Common failure modes include:
- Platform data inconsistency: a Pod's status flips back and forth between normal and abnormal.
- Loss of service management events: when a service changes, service management no longer works correctly, e.g. the service tree is not mounted or traffic is not routed to the service.
Before the root cause was located, Elastic Cloud developed a number of problem-detection and stop-loss (quick mitigation) strategies.
- Problem detection:
  - Manual: when the kube-apiserver is upgraded or restarted, manually notify the affected teams so that the platform services are restarted as well
  - Automated: configure monitoring and alerting policies that fire when no change events have been received for k8s objects for a period of time (a rough sketch of this idea follows the list)
- Stop loss:
  - Restart: when cache inconsistency occurs, restart the service so that it pulls the full, latest data from the kube-apiserver
  - Self-healing: in some scenarios even a restart cannot fully recover the service, so a self-healing policy proactively detects and handles abnormal situations
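As a rough illustration of the automated detection idea above, here is a minimal, hypothetical sketch (not Elastic Cloud's actual alerting code; the names and thresholds are invented): informer event handlers record the time of the last observed event, and a background loop raises an alert when that timestamp becomes too old.

```go
package main

import (
	"log"
	"sync/atomic"
	"time"
)

// staleness tracks when the last k8s change event was observed.
type staleness struct {
	last atomic.Int64 // unix nanoseconds of the last observed event
}

// Observe should be called from the informer's add/update/delete handlers.
func (s *staleness) Observe() { s.last.Store(time.Now().UnixNano()) }

// Watch periodically checks the age of the last event and calls alert
// when it exceeds the threshold.
func (s *staleness) Watch(threshold time.Duration, alert func(age time.Duration)) {
	ticker := time.NewTicker(threshold / 2)
	defer ticker.Stop()
	for range ticker.C {
		if age := time.Since(time.Unix(0, s.last.Load())); age > threshold {
			alert(age)
		}
	}
}

func main() {
	s := &staleness{}
	s.Observe() // wire this call into the event handlers of the informers
	go s.Watch(10*time.Minute, func(age time.Duration) {
		log.Printf("no k8s change events for %v, possible cache inconsistency", age)
	})
	select {} // block forever; the real service would run its controllers here
}
```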
Detection and stop-loss strategies do not really solve the problem; they only attempt to restore service in known scenarios, and as more anomalous scenarios are discovered, the strategies have to be adjusted in step.
Problem Location
Detection and mitigation are patches rather than a cure; obviously, we would prefer a complete solution. So let's start from the root cause of the cache inconsistency.
We chose notifier to troubleshoot the problem. Notifier is a collection of controllers for cluster management services; its functionality mainly consists of:
- Service tree mounts
- DNS registration
- LVS traffic admission, etc.
The reason for choosing notifier is its simplicity: notifier uses client-go's informer mechanism and registers handler functions for core resource events; in addition, there is no complex business logic to interfere with troubleshooting.
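To make this concrete, below is a minimal, hypothetical sketch (not notifier's actual code; the kubeconfig path and handler bodies are placeholders) of how a controller like notifier consumes kube-apiserver events with client-go: a shared informer for Endpoints with add/update/delete handlers registered.

```go
package main

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig (path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory with a 30s resync period.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			ep := obj.(*corev1.Endpoints)
			log.Printf("endpoints added: %s/%s", ep.Namespace, ep.Name) // e.g. mount service tree, register DNS
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			log.Printf("endpoints updated") // e.g. adjust LVS traffic admission
		},
		DeleteFunc: func(obj interface{}) {
			log.Printf("endpoints deleted") // e.g. clean up DNS / LVS entries
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh) // starts the informers (and the Reflector discussed below)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```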
Problem Replication
We tested in our offline environment and found that restarting the kube-apiserver reliably reproduced the problem, which made troubleshooting much easier. The steps to reproduce are:
- Start the notifier service
- Restart the kube-apiserver service
Status Analysis
When the problem occurs, we first do some basic checks on the service status.
```
# # Service liveness status
# ps -ef|grep notifier
stupig 1007922 1003335 2 13:41 pts/1 00:00:08 ./notifier -c configs/notifier-test.toml

# # Open file descriptors of the service
# lsof -nP -p 1007922
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nobody 1007922 stupig 0u CHR 136,1 0t0 4 /dev/pts/1
nobody 1007922 stupig 1u CHR 136,1 0t0 4 /dev/pts/1
nobody 1007922 stupig 2u CHR 136,1 0t0 4 /dev/pts/1
nobody 1007922 stupig 3u unix 0xffff8810a3132400 0t0 4254094659 socket
nobody 1007922 stupig 4u a_inode 0,9 0 8548 [eventpoll]
nobody 1007922 stupig 5r FIFO 0,8 0t0 4253939077 pipe
nobody 1007922 stupig 6w FIFO 0,8 0t0 4253939077 pipe
nobody 1007922 stupig 8u IPv4 4254094660 0t0 UDP *:37087
nobody 1007922 stupig 9r CHR 1,9 0t0 2057 /dev/urandom
nobody 1007922 stupig 10u IPv4 4253939079 0t0 TCP *:4397 (LISTEN)
nobody 1007922 stupig 11u REG 8,17 12538653 8604570895 ./logs/notifier.stupig.log.INFO.20211127-134138.1007922
nobody 1007922 stupig 15u IPv4 4254204931 0t0 TCP 127.0.0.1:43566->127.0.0.1:2479 (ESTABLISHED) # ETCD
nobody 1007922 stupig 19u REG 8,5 252384 821 /tmp/notifier.stupig.log.ERROR.20211127-134505.1007922
nobody 1007922 stupig 20u REG 8,5 252384 822 /tmp/notifier.stupig.log.WARNING.20211127-134505.1007922
nobody 1007922 stupig 21u REG 8,17 414436 8606917935 ./logs/notifier.stupig.log.WARNING.20211127-134139.1007922
nobody 1007922 stupig 24u REG 8,17 290725 8606917936 ./logs/notifier.stupig.log.ERROR.20211127-134238.1007922
nobody 1007922 stupig 30u REG 8,5 252384 823 /tmp/notifier.stupig.log.INFO.20211127-134505.1007922
```
Comparing this with the service status before the problem occurred, we found a serious issue: the connection that notifier had established to the kube-apiserver service address https://localhost:6443 had disappeared.
As a result, notifier lost data synchronization with the kube-apiserver; it no longer received business change events and eventually lost the ability to manage services.
Log Analysis
Next we analyzed notifier's runtime logs, focusing on what notifier printed when the kube-apiserver restarted. The key exception message was http2: no cached connection was available, and the associated operation was the EndpointInformer's ListAndWatch.
This gave us the key clue; the next step was to analyze the code and locate the root cause.
Code Analysis
An introduction to how the Informer works is beyond the scope of this article; we will only look at the following code snippet.
```go
// Run starts a watch and handles watch events. Will restart the watch if it is closed.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
	glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
	wait.Until(func() {
		if err := r.ListAndWatch(stopCh); err != nil {
			utilruntime.HandleError(err)
		}
	}, r.period, stopCh)
}
```
The Informer's Reflector component runs in a separate goroutine and calls ListAndWatch in a loop (via wait.Until) to receive notification events from the kube-apiserver.
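To make the looping behavior concrete, here is a small self-contained sketch (an illustration, not notifier code) of the wait.Until pattern used above: the function is invoked again and again, pausing period between runs, until stopCh is closed, so a ListAndWatch call that returns an error is simply retried on the next iteration.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	stopCh := make(chan struct{})
	go func() {
		time.Sleep(3 * time.Second)
		close(stopCh) // the loop exits once stopCh is closed
	}()

	// Stand-in for r.ListAndWatch: wait.Until re-invokes it roughly once
	// per period, regardless of whether the previous call failed.
	wait.Until(func() {
		fmt.Println("ListAndWatch attempt at", time.Now().Format(time.RFC3339))
	}, time.Second, stopCh)
}
```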
From the log analysis we concluded that, after the kube-apiserver restarted, every ListAndWatch call in the notifier service returned the http2: no cached connection was available error. We therefore shifted our focus to this error message.
By searching through the code, we located where the error is defined and where it is returned:
```go
// file: vendor/golang.org/x/net/http2/transport.go:L301
var ErrNoCachedConn = errors.New("http2: no cached connection was available")

// file: vendor/golang.org/x/net/http2/client_conn_pool.go:L55~80
func (p *clientConnPool) getClientConn(req *http.Request, addr string, dialOnMiss bool) (*ClientConn, error) {
	if isConnectionCloseRequest(req) && dialOnMiss {
		// It gets its own connection.
		const singleUse = true
		cc, err := p.t.dialClientConn(addr, singleUse)
		if err != nil {
			return nil, err
		}
		return cc, nil
	}
	p.mu.Lock()
	for _, cc := range p.conns[addr] {
		if cc.CanTakeNewRequest() {
			p.mu.Unlock()
			return cc, nil
		}
	}
	if !dialOnMiss {
		p.mu.Unlock()
		return nil, ErrNoCachedConn
	}
	call := p.getStartDialLocked(addr)
	p.mu.Unlock()
	<-call.done
	return call.res, call.err
}
```
The above code returns ErrNoCachedConn only when both of the following hold:
- the dialOnMiss parameter is false
- there is no usable connection in the p.conns connection pool

In theory, when an HTTP request is sent and the connection pool is empty, a connection is established first and the request is then sent over it; the pool is also expected to evict connections in an abnormal state. So how does the problem discussed in this article arise?
Now let's focus on the call chains that reach the getClientConn method. There are two main ones.

Stack one:
```
 0  0x0000000000a590b8 in notifier/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x0000000000a5aea6 in notifier/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x0000000000a6c4f9 in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000a6bd0e in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:313
 4  0x0000000000a5b97e in notifier/vendor/golang.org/x/net/http2.noDialH2RoundTripper.RoundTrip
    at ./go/src/notifier/vendor/golang.org/x/net/http2/configure_transport.go:75
 5  0x0000000000828e45 in net/http.(*Transport).roundTrip
    at /usr/local/go1.16.7/src/net/http/transport.go:537
 6  0x00000000008016de in net/http.(*Transport).RoundTrip
    at /usr/local/go1.16.7/src/net/http/roundtrip.go:17
 7  0x00000000016a1ef8 in notifier/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at ./go/src/notifier/vendor/k8s.io/client-go/transport/round_trippers.go:162
 8  0x00000000007a3aa2 in net/http.send
    at /usr/local/go1.16.7/src/net/http/client.go:251
 9  0x00000000007a324b in net/http.(*Client).send
    at /usr/local/go1.16.7/src/net/http/client.go:175
10  0x00000000007a6ed5 in net/http.(*Client).do
    at /usr/local/go1.16.7/src/net/http/client.go:717
11  0x00000000007a5d9e in net/http.(*Client).Do
    at /usr/local/go1.16.7/src/net/http/client.go:585
12  0x00000000016b9487 in notifier/vendor/k8s.io/client-go/rest.(*Request).request
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:732
13  0x00000000016b9f2d in notifier/vendor/k8s.io/client-go/rest.(*Request).Do
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:804
14  0x00000000017093bb in notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*endpoints).List
    at ./go/src/notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1/endpoints.go:83
……
```
Stack two:
```
 0  0x0000000000a590b8 in notifier/vendor/golang.org/x/net/http2.(*clientConnPool).getClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:55
 1  0x0000000000a5aea6 in notifier/vendor/golang.org/x/net/http2.noDialClientConnPool.GetClientConn
    at ./go/src/notifier/vendor/golang.org/x/net/http2/client_conn_pool.go:255
 2  0x0000000000a6c4f9 in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTripOpt
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:345
 3  0x0000000000a6bd0e in notifier/vendor/golang.org/x/net/http2.(*Transport).RoundTrip
    at ./go/src/notifier/vendor/golang.org/x/net/http2/transport.go:313
 4  0x00000000008296ed in net/http.(*Transport).roundTrip
    at /usr/local/go1.16.7/src/net/http/transport.go:590
 5  0x00000000008016de in net/http.(*Transport).RoundTrip
    at /usr/local/go1.16.7/src/net/http/roundtrip.go:17
 6  0x00000000016a1ef8 in notifier/vendor/k8s.io/client-go/transport.(*userAgentRoundTripper).RoundTrip
    at ./go/src/notifier/vendor/k8s.io/client-go/transport/round_trippers.go:162
 7  0x00000000007a3aa2 in net/http.send
    at /usr/local/go1.16.7/src/net/http/client.go:251
 8  0x00000000007a324b in net/http.(*Client).send
    at /usr/local/go1.16.7/src/net/http/client.go:175
 9  0x00000000007a6ed5 in net/http.(*Client).do
    at /usr/local/go1.16.7/src/net/http/client.go:717
10  0x00000000007a5d9e in net/http.(*Client).Do
    at /usr/local/go1.16.7/src/net/http/client.go:585
11  0x00000000016b9487 in notifier/vendor/k8s.io/client-go/rest.(*Request).request
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:732
12  0x00000000016b9f2d in notifier/vendor/k8s.io/client-go/rest.(*Request).Do
    at ./go/src/notifier/vendor/k8s.io/client-go/rest/request.go:804
13  0x00000000017093bb in notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1.(*endpoints).List
    at ./go/src/notifier/vendor/k8s.io/client-go/kubernetes/typed/core/v1/endpoints.go:83
……
```
After tracing the two call stacks separately, we can quickly rule out stack one.
```go
// file: net/http/transport.go:L502~620
func (t *Transport) roundTrip(req *Request) (*Response, error) {
	if altRT := t.alternateRoundTripper(req); altRT != nil { // L537
		if resp, err := altRT.RoundTrip(req); err != ErrSkipAltProtocol {
			return resp, err
		}
		var err error
		req, err = rewindBody(req)
		if err != nil {
			return nil, err
		}
	}
	// ...
}

// file: vendor/golang.org/x/net/http2/configure_transport.go:L74~80
func (rt noDialH2RoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	res, err := rt.t.RoundTrip(req) // L75
	if err == ErrNoCachedConn {
		return nil, http.ErrSkipAltProtocol
	}
	return res, err
}

// file: vendor/golang.org/x/net/http2/transport.go:L312~314
func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.RoundTripOpt(req, RoundTripOpt{}) // L313
}

// file: vendor/golang.org/x/net/http2/transport.go:L337~379
func (t *Transport) RoundTripOpt(req *http.Request, opt RoundTripOpt) (*http.Response, error) {
	addr := authorityAddr(req.URL.Scheme, req.URL.Host)
	for retry := 0; ; retry++ {
		cc, err := t.connPool().GetClientConn(req, addr) // L345
		if err != nil {
			t.vlogf("http2: Transport failed to get client conn for %s: %v", addr, err)
			return nil, err
		}
	}
}

// file: vendor/golang.org/x/net/http2/client_conn_pool.go:L254~256
func (p noDialClientConnPool) GetClientConn(req *http.Request, addr string) (*ClientConn, error) {
	return p.getClientConn(req, addr, noDialOnMiss) // L255
}
```
In stack one, the call to getClientConn returns an ErrNoCachedConn error, which noDialH2RoundTripper.RoundTrip converts into http.ErrSkipAltProtocol. After that error is returned to roundTrip, the code falls through to the rest of the process and ends up in stack two.

We therefore focus on the stack-two path.
```go
// file: net/http/transport.go:L502~620
func (t *Transport) roundTrip(req *Request) (*Response, error) {
	for {
		var resp *Response
		if pconn.alt != nil {
			// HTTP/2 path.
			t.setReqCanceler(cancelKey, nil) // not cancelable with CancelRequest
			resp, err = pconn.alt.RoundTrip(req) // L590
		}
		if err == nil {
			resp.Request = origReq
			return resp, nil
		}

		// Failed. Clean up and determine whether to retry.
		if http2isNoCachedConnError(err) {
			if t.removeIdleConn(pconn) {
				t.decConnsPerHost(pconn.cacheKey)
			}
		} else if !pconn.shouldRetryRequest(req, err) {
			return nil, err
		}
	}
}

// file: vendor/golang.org/x/net/http2/transport.go:L312~314
func (t *Transport) RoundTrip(req *http.Request) (*http.Response, error) {
	return t.RoundTripOpt(req, RoundTripOpt{}) // L313
}

// The inner call stack is the same as stack one, so it is not repeated here.
```
Unlike stack one, stack two does not convert the returned error: ErrNoCachedConn is returned directly, and roundTrip's error handling even treats this kind of error specially; if http2isNoCachedConnError returns true, the bad connection is removed from the pool.

Everything looks reasonable, so how does the problem occur? The culprit is http2isNoCachedConnError:
```go
// file: net/http/h2_bundle.go:L6922~6928
// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func http2isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}
```
If the err object implements the anonymous interface (which defines a single method, IsHTTP2NoCachedConnError), the function returns true; otherwise it returns false. So, does the error value returned by getClientConn implement that interface? Clearly not:
```go
// file: vendor/golang.org/x/net/http2/transport.go:L301
var ErrNoCachedConn = errors.New("http2: no cached connection was available")
```
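The mismatch is easy to demonstrate in isolation. The following self-contained sketch (written for illustration, not taken from the libraries involved) shows that a plain errors.New value with the same message fails the interface assertion used by http2isNoCachedConnError, while a concrete type carrying the IsHTTP2NoCachedConnError method passes it.

```go
package main

import (
	"errors"
	"fmt"
)

// noCachedConnError mirrors the concrete error type later introduced in x/net.
type noCachedConnError struct{}

func (noCachedConnError) IsHTTP2NoCachedConnError() {}
func (noCachedConnError) Error() string             { return "http2: no cached connection was available" }

// isNoCachedConnError is the same sniffing check net/http performs.
func isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}

func main() {
	oldErr := errors.New("http2: no cached connection was available") // old x/net sentinel value
	var newErr error = noCachedConnError{}                            // fixed x/net concrete type

	fmt.Println(isNoCachedConnError(oldErr)) // false: same message, but no method
	fmt.Println(isNoCachedConnError(newErr)) // true: method set matches the sniffed interface
}
```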
At this point the root cause is basically clear: because the vendored golang.org/x/net/http2 returns a plain errors.New value, net/http's http2isNoCachedConnError never recognizes the error, so the stale connection entry is never cleaned out of the pool and no new connection is dialed; every subsequent request keeps failing with the same error.
Solution
Since the problem is that the ErrNoCachedConn error returned by getClientConn does not implement the IsHTTP2NoCachedConnError method, the natural fix is to change the returned error type so that it implements that interface.
Note that this code comes from an external library we vendor. Checking the latest golang.org/x/net code, we found that the problem had already been fixed back in January 2018; see the golang.org/x/net fix.
```go
// noCachedConnError is the concrete type of ErrNoCachedConn, which
// needs to be detected by net/http regardless of whether it's its
// bundled version (in h2_bundle.go with a rewritten type name) or
// from a user's x/net/http2. As such, as it has a unique method name
// (IsHTTP2NoCachedConnError) that net/http sniffs for via func
// isNoCachedConnError.
type noCachedConnError struct{}

func (noCachedConnError) IsHTTP2NoCachedConnError() {}
func (noCachedConnError) Error() string             { return "http2: no cached connection was available" }

// isNoCachedConnError reports whether err is of type noCachedConnError
// or its equivalent renamed type in net/http2's h2_bundle.go. Both types
// may coexist in the same running program.
func isNoCachedConnError(err error) bool {
	_, ok := err.(interface{ IsHTTP2NoCachedConnError() })
	return ok
}

var ErrNoCachedConn error = noCachedConnError{}
```
The version we were running online was still 1c05540f6. Our fix therefore became much simpler: upgrade the dependency's version in vendor.
The online notifier service has since been upgraded and fully rolled out across all data centers, and we have verified that restarting the kube-apiserver no longer causes the notifier service to misbehave.