Recently, I ran into a bug in my project: the receiver of a tokio channel was dropped for some reason, resulting in a send error.
After debugging, I found that it was actually caused by hyper’s cancel propagation. The issue can be reproduced with changes along the following lines to examples/web_api.rs in the hyper source code.
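A minimal sketch of the modified handler (my own reconstruction of the change described below, assuming hyper 0.14-era APIs as in that example; only the relevant parts are shown):

```rust
use std::time::Duration;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Response, Server};
use tokio::sync::oneshot;

// the request handler: spawn a task for the slow work and wait for it
// to report back over a oneshot channel
async fn api_get_response() -> Result<Response<Body>, hyper::Error> {
    let (tx, rx) = oneshot::channel::<()>();

    tokio::spawn(async move {
        // simulate a task that takes 5 seconds to finish
        tokio::time::sleep(Duration::from_secs(5)).await;
        // if the client has disconnected, api_get_response has been cancelled,
        // rx is already dropped, and this send panics
        tx.send(()).unwrap();
    });

    let _ = rx.await;
    Ok(Response::new(Body::from("done")))
}

#[tokio::main]
async fn main() {
    let make_svc = make_service_fn(|_conn| async {
        Ok::<_, hyper::Error>(service_fn(|_req| api_get_response()))
    });
    let addr = ([127, 0, 0, 1], 1337).into();
    Server::bind(&addr).serve(make_svc).await.unwrap();
}
```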
async fn api_get_response is the asynchronous handler hyper calls for each HTTP request; inside it we spawn a task to do some time-consuming work. We use sleep to simulate a task that takes 5 seconds to finish, and finally send the processed data/result back to async fn api_get_response via a channel.
If the client actively closes the connection before the server responds, hyper cancels async fn api_get_response.
So rx is dropped and the subsequent send fails.
The problem with cancel not propagating
A large number of malicious requests that actively disconnect
This leads to a problem: the client has already disconnected without waiting for the response, yet the server is still querying the database in the spawned task. For example, if someone maliciously sends 100,000 requests that take a long time to process, and the client cancels immediately after each request is sent, the server keeps processing these “already disconnected” requests, which wastes resources.
If the client disconnects, it is reasonable to cancel all the asynchronous Tasks/Futures associated with that request, propagating the cancellation from the root node downward.
Otherwise, the client will have disconnected long ago while the server foolishly keeps running a bunch of database queries, spending a lot of resources to fetch the data it intends to return to the client, only to hit a send error when it tries.
systemctl stop timeout
For example, a web server process generally registers a libc::signal callback so that when it receives a shutdown signal it can gracefully shut down the web server.
Usually, after receiving the signal, the process propagates the cancel to each task, but some stubborn tasks live a long time (such as loop + sleep polling tasks).
However, stopping the process with a kill signal just to finish a deployment update is not great, because the libc::signal callback cannot and will not handle SIGKILL, so no custom resource cleanup can be performed.
(Go has this problem too, in the form of goroutine leaks; all Go can do is pair each spawned goroutine with a done channel/ctx notification mechanism passed down from the parent task.)
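A minimal sketch of the kind of propagation I mean (my own example, not from the original code: a broadcast shutdown channel plus select!, so that a loop + sleep polling task reacts to the signal instead of forcing a SIGKILL):

```rust
use std::time::Duration;
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    let (shutdown_tx, _) = broadcast::channel::<()>(1);

    // a "stubborn" loop + sleep polling task that now listens for shutdown
    let mut shutdown_rx = shutdown_tx.subscribe();
    let poller = tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = tokio::time::sleep(Duration::from_secs(60)) => {
                    // do one round of polling work
                }
                _ = shutdown_rx.recv() => {
                    // run custom resource cleanup here, then exit the loop
                    break;
                }
            }
        }
    });

    // wait for ctrl-c (requires tokio's "signal" feature); for SIGTERM,
    // tokio::signal::unix::signal(SignalKind::terminate()) can be used instead
    tokio::signal::ctrl_c().await.unwrap();
    let _ = shutdown_tx.send(());
    poller.await.unwrap();
}
```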
Future pruning?
async fn api_get_response can be abstracted as the root Future node, and the Futures spawned inside it as child nodes.
I had assumed that when hyper kills the root Future of a request handler because the client disconnected, all the leaf Future nodes under it would be cancelled along with it, but in fact tokio has no such API or design.
What I would like is a constraint like scoped threads (child threads must live shorter than the parent), so that any Future spawned by async fn api_get_response must live shorter than it. That would reduce the mental burden of worrying about a child Future holding the parent’s resources after the parent is dropped.
Sounds a bit like a scoped future.
How to get hyper’s cancel to propagate
The line tx.send(()).unwrap(); panics when the client disconnects early.
To make the task started by tokio::spawn abort early when the request is cancelled, more boilerplate code has to be introduced.
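Roughly, the extra boilerplate amounts to storing the JoinHandle and aborting it when the handler future is dropped (a sketch; the AbortOnDrop guard is my own hypothetical helper, and the server wiring is the same as above):

```rust
use std::time::Duration;

use hyper::{Body, Response};
use tokio::task::JoinHandle;

/// Aborts the wrapped task when dropped, so cancelling the parent
/// future also cancels the spawned child task.
struct AbortOnDrop<T>(JoinHandle<T>);

impl<T> Drop for AbortOnDrop<T> {
    fn drop(&mut self) {
        self.0.abort();
    }
}

async fn api_get_response() -> Result<Response<Body>, hyper::Error> {
    let (tx, rx) = tokio::sync::oneshot::channel::<()>();

    let handle = tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(5)).await;
        // only reached if the task was not aborted in the meantime
        dbg!("task finished");
        // the receiver may be gone if the client disconnected, so don't unwrap
        let _ = tx.send(());
    });
    // when hyper drops this future (client disconnected), the guard's Drop
    // runs and aborts the spawned task
    let _guard = AbortOnDrop(handle);

    let _ = rx.await;
    Ok(Response::new(Body::from("done")))
}
```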
When the client disconnects early, hyper’s log shows that the line dbg!("task finished") was never executed before the spawned task was cancelled, as expected.
You must ask your superiors before writing a spawn
In order to avoid bugs caused by improper handling of tokio::spawn, the use of spawn must be strictly limited by code review.
Our company requires that before introducing a new tokio::spawn you must request approval from your supervisor, explaining why you want to use spawn, and include in the code comments a detailed explanation of why the spawn is needed, how long it will live, and what external resources it captures.
You are also required to store the spawn’s JoinHandle explicitly and call handle.abort() on it, usually in a Drop implementation.
spawn is not a silver bullet; if the goal is merely not to block the current code, there are other ways to write it (see the sketch below), so think carefully before using spawn.
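For example, if the only requirement is that two pieces of work run concurrently within one request, join! keeps them inside the same task, so cancelling the parent cancels them too (a sketch; query_users and query_orders are hypothetical stand-ins):

```rust
use std::time::Duration;
use tokio::time::sleep;

// hypothetical stand-ins for two independent database queries
async fn query_users() -> u32 {
    sleep(Duration::from_millis(100)).await;
    1
}

async fn query_orders() -> u32 {
    sleep(Duration::from_millis(100)).await;
    2
}

// both queries run concurrently inside the *same* task, so when the
// request handler is cancelled, both are cancelled with it,
// with no JoinHandle bookkeeping needed
async fn handler() -> u32 {
    let (users, orders) = tokio::join!(query_users(), query_orders());
    users + orders
}

#[tokio::main]
async fn main() {
    println!("{}", handler().await);
}
```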
Some thoughts on tokio issues
tokio::task::JoinHandle
The documentation says that “A JoinHandle detaches the associated task when it is dropped.”
Personally, I don’t find that as convenient as glommio’s detach or libpthread.so’s pthread_detach, which let the caller decide whether to detach or not.
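In other words, dropping the handle silently detaches the task rather than making the choice explicit (a quick illustration):

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_secs(1)).await;
        println!("still running: dropping the handle only detached me");
    });

    // dropping the JoinHandle silently detaches the task; it keeps running
    drop(handle);

    // to actually stop a spawned task you must keep the handle
    // and call handle.abort() before it goes out of scope
    tokio::time::sleep(Duration::from_secs(2)).await;
}
```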
pin Future to CPU core
Does tokio lack a concept like glommio’s pin-to-core, similar to libpthread.so’s pthread_setaffinity_np? It seems tokio doesn’t have one.
This is also a limitation of tokio being cross-platform: some APIs are available on Linux, but tokio may not provide them because Windows/macOS don’t have them.
Can tokio pin a certain Future to a certain CPU core to avoid the context-switch and migration overhead of executing across multiple cores? But this seems to conflict with tokio’s work-stealing design, which load-balances across all cores because tokio does not want any core to sit idle.
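A common workaround (a sketch of my own, not a tokio feature; it assumes the third-party core_affinity crate) is to pin an OS thread to one core and run a current_thread runtime on it, so every Future on that runtime stays on that core:

```rust
use std::thread;

fn main() {
    // pick a core to pin to (unwrap is fine for a sketch)
    let core = core_affinity::get_core_ids().unwrap()[0];

    let handle = thread::spawn(move || {
        // pin this OS thread to the chosen core
        core_affinity::set_for_current(core);

        // run a single-threaded runtime on the pinned thread; Futures
        // spawned onto it never migrate to another core
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        rt.block_on(async {
            // pinned async work goes here
        });
    });

    handle.join().unwrap();
}
```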
Single-threaded runtime is faster in some cases
In our project’s benchmarks we found that some modules perform better on tokio’s single-threaded runtime, so you can’t take it for granted that multi-threaded is necessarily better than single-threaded: the CPU pays a lot of overhead synchronizing data between cores, so you have to analyze case by case whether to use the single-threaded or the multi-threaded runtime.
(But the Future bound on tokio::spawn is the same for both the single-threaded and multi-threaded runtimes… the single-threaded one doesn’t drop the Send constraint…)
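That said, for !Send futures on a single-threaded runtime there is tokio::task::spawn_local inside a LocalSet; it just isn’t the default tokio::spawn signature (a minimal sketch):

```rust
use std::rc::Rc;
use tokio::task::LocalSet;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let local = LocalSet::new();
    local.block_on(&rt, async {
        // Rc is !Send, so this future could not be passed to tokio::spawn
        let not_send = Rc::new(42);
        tokio::task::spawn_local(async move {
            println!("value = {}", not_send);
        })
        .await
        .unwrap();
    });
}
```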
tokio’s discussion of cancel on drop
https://github.com/tokio-rs/tokio/issues/1830
Finally, I dare to imagine designing a runtime for Linux only, without worrying about compatibility with macOS/Windows, so that many Linux-only APIs could be used to squeeze all the performance out of Linux. We might eventually find that Rust’s own Future is not quite enough, or not good enough, and even end up creating a runtime-specific Future (just like many C++ libraries, each rolling their own non-interoperable Future).