I’ll go into some more detail as to how restarts currently work to clarify things.
The chat system is split into clusters:
- main cluster (majority of channels, on 7 IPs)
- event cluster (large event channels, on 3 IPs)
- group cluster (on 2 IPs)
There are 3 edge servers per IP, one per port: 80, 443, and 6667.
During restarts, at most one box (IP) per cluster is being restarted at any given time. That means we can restart edges on a group server, an event server, and a main server simultaneously, but we cannot restart edges on 2 main servers simultaneously. Each server’s restart process takes ~15 minutes. The server notifies clients to reconnect at a constant rate over the course of those 15 minutes (each connection is closed by the server ~30s after the client is notified). Disconnecting clients one by one over this period is how we avoid thundering herds; we do not notify all clients to reconnect at once.
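To make the pacing concrete, here is a rough sketch of what a constant-rate drain over one 15 minute window could look like. This is only an illustration: the function name `drain_schedule` and the evenly spaced timing are assumptions, not a description of the actual edge server code.

```python
RESTART_WINDOW_S = 15 * 60  # ~15 minutes per box, as described above
CLOSE_DELAY_S = 30          # connection is closed ~30s after the notification


def drain_schedule(client_ids, window_s=RESTART_WINDOW_S, close_delay_s=CLOSE_DELAY_S):
    """Return (client_id, notify_at_s, close_at_s) tuples spaced evenly over the window.

    Hypothetical helper: it assumes "constant rate" means equal gaps between
    notifications, which is a simplification.
    """
    if not client_ids:
        return []
    interval = window_s / len(client_ids)
    return [
        (cid, i * interval, i * interval + close_delay_s)
        for i, cid in enumerate(client_ids)
    ]


# Six clients on one box: one notification every 150 seconds.
schedule = drain_schedule([f"client{i}" for i in range(6)])
```

The point is that no two clients are told to reconnect at the same instant, so the reconnect load is spread across the whole window.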
Restarting all edge servers in the main cluster currently takes a minimum of 105 minutes (7 IPs at ~15 minutes each). With that in mind, bots should distribute their connections across all servers so that when a given connection is disconnected, you only need to rejoin a subset of the rooms you care about. Obviously, over the course of a full cluster restart you’ll need to rejoin all rooms, but you should never need to do a full restart (aka rejoin all your channels immediately).
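For bot authors, one possible way to do this is to shard channels across connections and rejoin only the affected shard when a connection drops. The sketch below is a minimal illustration; `shard_channels` and `channels_to_rejoin` are hypothetical helpers, and the round-robin assignment is just one reasonable choice.

```python
MAIN_CLUSTER_IPS = 7
RESTART_MINUTES_PER_IP = 15
# One box at a time, so a full main-cluster restart takes at least 7 * 15 minutes.
MIN_FULL_RESTART_MINUTES = MAIN_CLUSTER_IPS * RESTART_MINUTES_PER_IP


def shard_channels(channels, num_connections):
    """Assign channels to connections round-robin (hypothetical helper)."""
    shards = [[] for _ in range(num_connections)]
    for i, channel in enumerate(channels):
        shards[i % num_connections].append(channel)
    return shards


def channels_to_rejoin(shards, dropped_index):
    """Only the channels on the dropped connection need to be rejoined."""
    return list(shards[dropped_index])


# 14 channels over 7 connections: each connection carries 2 channels.
shards = shard_channels([f"#chan{i}" for i in range(14)], MAIN_CLUSTER_IPS)
rejoin = channels_to_rejoin(shards, 3)
```

With this layout, losing one connection means rejoining 2 of the 14 channels rather than all of them.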
It’s still potentially troublesome for bots: consider the case where you get unlucky and one connection is notified at the end of a 15 min interval, and then another connection is notified at the start of the next 15 min interval. You effectively have 2 connections restarting simultaneously, but it’s still not a full restart.
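As a rough way to size that worst case, assuming a bot splits its channels evenly across its connections (an assumption, not something the servers enforce), the share of channels it has to rejoin at once is the number of overlapping connections divided by the total. `worst_case_rejoin_fraction` is a hypothetical helper for illustration.

```python
from fractions import Fraction


def worst_case_rejoin_fraction(num_connections, overlapping=2):
    """Share of channels to rejoin when `overlapping` connections drop together.

    Assumes channels are split evenly across connections.
    """
    if num_connections < 1:
        raise ValueError("num_connections must be at least 1")
    return Fraction(min(overlapping, num_connections), num_connections)


# With 7 connections, two back-to-back notifications affect 2/7 of the channels.
worst_case = worst_case_rejoin_fraction(7)
```

A bot with a single connection gets no benefit here, since every disconnect is a full rejoin.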
The arc of work to make restarts not affect users (aka no missed messages) is aimed at improving the restart experience for users (who largely only need to join a single room) and not bots which are connected to hundreds/thousands of rooms.
This work unlocks the ability for us to iterate on our edge server without hesitation and start improving it. A large portion of the work we have planned consists of changes that have been requested via these forums.