Consider the scenario of an API gateway that distributes requests to multiple upstream nodes, as in nginx's upstream configuration.
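A minimal sketch of such a configuration (the upstream name, host names, and ports below are placeholders):

```nginx
# hypothetical upstream pool with four backend services
upstream backend {
    server host1.example.com:8080;
    server host2.example.com:8080;
    server host3.example.com:8080;
    server host4.example.com:8080;
}

server {
    listen 80;
    location / {
        # forward every request for / to one of the four nodes
        proxy_pass http://backend;
    }
}
```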
There are several common ways to distribute requests to the route / evenly across the four services, such as randomized algorithms.
Random
Treat each node as an element of an array, and for each request generate a random number and take it modulo the number of nodes (four here), giving an index x. We then use hosts[x] directly as the target of the request. We therefore only need to ensure that the random number generator is uniform enough to achieve a balanced distribution.
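A minimal sketch in Go, assuming a hypothetical hosts array of four nodes:

```go
package main

import (
	"fmt"
	"math/rand"
)

// hypothetical upstream nodes
var hosts = []string{"host1:8080", "host2:8080", "host3:8080", "host4:8080"}

// pickRandom returns a node chosen uniformly at random.
func pickRandom() string {
	x := rand.Intn(len(hosts)) // random index in [0, len(hosts))
	return hosts[x]
}

func main() {
	fmt.Println("forwarding request to", pickRandom())
}
```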
Round-Robin
In addition to random selection, round-robin (polling) can be used to achieve an even distribution, e.g.
This approach maintains a cursor that records the last node selected, and each new request advances to the next node in the array.
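A minimal sketch in Go with the same hypothetical hosts array; an atomic counter plays the role of the cursor:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var hosts = []string{"host1:8080", "host2:8080", "host3:8080", "host4:8080"}

var cursor uint64 // counts how many requests have been assigned so far

// pickNext advances the cursor and returns the next node in order.
func pickNext() string {
	n := atomic.AddUint64(&cursor, 1)
	return hosts[n%uint64(len(hosts))]
}

func main() {
	for i := 0; i < 6; i++ {
		fmt.Println("forwarding request to", pickNext())
	}
}
```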
Directed Distribution
In some scenarios, requests from the same IP (or carrying the same value of some other field) need to be forwarded to the same node. The IP then has to be converted into a number and mapped to an array index; for example, a crc32 hash can map the string to an integer, which is then taken modulo the number of nodes. Assuming client IPs are themselves roughly evenly distributed, this approach also serves as a load balancing strategy.
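A minimal sketch in Go, hashing the client IP with crc32 and taking the result modulo the number of nodes (the hosts array is again a placeholder):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var hosts = []string{"host1:8080", "host2:8080", "host3:8080", "host4:8080"}

// pickByIP maps the same client IP to the same node.
func pickByIP(clientIP string) string {
	sum := crc32.ChecksumIEEE([]byte(clientIP)) // string -> uint32
	return hosts[sum%uint32(len(hosts))]        // modulo the number of nodes
}

func main() {
	fmt.Println(pickByIP("203.0.113.7")) // always resolves to the same node
}
```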
All of the methods above are common and easy to think of. On top of these balancing strategies, more business-friendly variants such as weighting can also be implemented. For ordinary business scenarios these methods are usually sufficient. The one remaining problem is that if a node goes down while the service is running, a portion of requests will still be forwarded to the failed node. Liveness detection mechanisms exist to solve this problem.
Heartbeat Detection
To ensure service availability, the LB can maintain a heartbeat check against each node, with every node exposing a health-check interface. For example:
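A minimal sketch of such a node-side interface in Go (the /ping path and the port are assumptions):

```go
package main

import "net/http"

func main() {
	// Each node exposes a trivial health endpoint; a 200 response means healthy.
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	http.ListenAndServe(":8080", nil)
}
```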
The LB then calls this interface at a fixed interval, say every 10ms; the node is considered healthy only if the call returns normally, and when it fails the node is removed from the hosts list.
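A minimal sketch of the LB-side check in Go, assuming each node exposes the hypothetical /ping endpoint above and that the selection strategies only use nodes marked alive:

```go
package main

import (
	"net/http"
	"sync"
	"time"
)

var (
	mu    sync.Mutex
	hosts = []string{"host1:8080", "host2:8080", "host3:8080", "host4:8080"}
	alive = map[string]bool{}
)

// healthLoop probes every node periodically and marks it dead on failure.
func healthLoop(interval time.Duration) {
	client := &http.Client{Timeout: interval}
	for {
		for _, h := range hosts {
			resp, err := client.Get("http://" + h + "/ping") // hypothetical endpoint
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			mu.Lock()
			alive[h] = ok // only healthy nodes remain eligible for load balancing
			mu.Unlock()
		}
		time.Sleep(interval)
	}
}

func main() {
	go healthLoop(10 * time.Millisecond)
	select {} // keep the LB process running
}
```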
Combined with the load balancing strategies described above, health checks greatly improve availability when a node goes down. All of the above, however, applies to stateless services. For a stateful service, such as a WebSocket long-connection service or a data storage service, node downtime also means data loss, so data must be resynchronized when a node of a stateful service fails. With the random/modulo approaches above, almost all existing data would have to be re-hashed and redistributed across the remaining nodes, which may be unacceptably expensive in a large distributed system. A different strategy is needed to bound the cost of data synchronization, such as consistent hashing.
Consistent Hashing
First, define a ring data structure, such as an array or linked list; here assume a ring array with 16 slots.
Next, bind specific hosts (or virtual nodes) to positions on the ring, for example the following bindings.
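For illustration, assume the four hosts are bound to evenly spaced slots (a3.com and a4.com reappear in the failure example below; the exact slot positions are an assumption):

```
slot 0  -> a1.com
slot 4  -> a2.com
slot 8  -> a3.com
slot 12 -> a4.com
```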
When a request comes in, we derive a number from a random value, the client IP, or some other identifying field, and map it onto the ring, for example by hashing it and taking the result modulo 16.
From that position, walk the ring clockwise (counterclockwise works just as well) until the first slot with a bound host is found, and use that host as the target node for load balancing.
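A minimal sketch of the lookup in Go, using the hypothetical bindings above; walking clockwise is modelled as increasing slot index with wrap-around:

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const ringSize = 16

// ring slot -> bound host (same hypothetical bindings as above)
var ring = map[int]string{0: "a1.com", 4: "a2.com", 8: "a3.com", 12: "a4.com"}

// lookup hashes the key onto the ring, then walks clockwise to the
// first slot that has a host bound to it.
func lookup(key string) string {
	slot := int(crc32.ChecksumIEEE([]byte(key)) % ringSize)
	for i := 0; i < ringSize; i++ {
		if host, ok := ring[(slot+i)%ringSize]; ok {
			return host
		}
	}
	return "" // unreachable as long as at least one host is bound
}

func main() {
	fmt.Println(lookup("203.0.113.7")) // the same key always lands on the same host
}
```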
When a node in the system goes down, data must be migrated. Suppose a3.com fails: all of a3.com's data simply needs to be transferred to a4.com, the next bound node clockwise. The migration only involves a3.com and a4.com; the other nodes are unaffected. However, with this simple forward transfer the data is no longer evenly distributed, and a4.com ends up carrying 50% of it. To improve this, each physical node can be mapped to multiple virtual nodes spread around the ring.
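For example, a possible layout with two virtual nodes per physical host (slot positions chosen purely for illustration):

```
a1.com -> slots 0, 9
a2.com -> slots 2, 12
a3.com -> slots 4, 10
a4.com -> slots 6, 14
```

With this layout, if a3.com goes down, the range behind slot 4 falls to a4.com (slot 6) while the range behind slot 10 falls to a2.com (slot 12), so no single node absorbs all of the failed node's data.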
This way, when a node fails, its data is spread more evenly across the remaining nodes. Consistent hashing is designed precisely to minimize the cost of data synchronization and recovery when a node goes down in a distributed environment, thereby improving overall availability.
Nginx’s Load Balancing Strategies
Nginx implements several load balancing strategies. The default is round-robin: requests are distributed to each node in turn in the order they arrive, and a node that fails is automatically taken out of rotation.
Weights can also be specified so that requests are sent to nodes in proportion to their weight.
Requests can also be distributed in a targeted manner, for example with ip_hash.
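A minimal sketch of both options (host names are placeholders):

```nginx
# weighted distribution: host1 receives roughly three times as many requests
upstream backend_weighted {
    server host1.example.com weight=3;
    server host2.example.com weight=1;
}

# ip_hash: requests from the same client IP always go to the same node
upstream backend_sticky {
    ip_hash;
    server host1.example.com;
    server host2.example.com;
}
```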
SRV Records
SRV is a type of DNS record that supports load balancing at DNS resolution time. An SRV record specifies a target host, port, priority, and weight, so it can also be used for cluster load balancing.
The DNS “service” (SRV) record specifies a host and port for specific services such as voice over IP (VoIP), instant messaging, and so on. Most other DNS records only specify a server or an IP address, but SRV records include a port at that IP address as well. Some Internet protocols require the use of SRV records in order to function.
An SRV record has the following format.
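The general syntax is shown first, followed by a hypothetical record for a SIP service under example.com:

```
_service._proto.name.  TTL    class SRV priority weight port target.
_sip._tcp.example.com. 86400  IN    SRV 10       60     5060 sipserver.example.com.
```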
That is, a lookup for a service under example.com resolves to the target host and port of one of its SRV records, chosen according to priority and weight.