Documentation on various internal structures.

Most important structures use an anonymous shared mmap() so that child processes can watch them. (All the cli connections are handled in child processes.)

TODO: Re-investigate threads to see if we can use a thread to handle cli connections without killing forwarding performance.

session[]

An array of session structures. This is one of the two major data structures that are sync'ed across the cluster.

This array is statically allocated at startup time to a compile time size (currently 50k sessions). This sets a hard limit on the number of sessions a cluster can handle.

There is one element per l2tp session. (I.e. each active user.) The zero'th session is always invalid.

tunnel[]

An array of tunnel structures. This is the other major data structure that's actively sync'ed across the cluster.

As per sessions, this is statically allocated at startup time to a compile time size limit.

There is one element per l2tp tunnel. (Normally one per BRAS that this cluster talks to.) The zero'th tunnel is always invalid.

ip_pool[]

A table holding all the IP addresses in the pool. As addresses are used, they are tagged with the username of the session, and the session index.

When they are free'd the username tag ISN'T cleared. This is to ensure that where possible we re-allocate the same IP address back to the same user.

radius[]

A table holding active radius sessions. Whenever a radius conversation is needed (login, accounting et al), a radius session is allocated.

char **ip_hash

A mapping of IP address to session structure. This is a ternary tree (each byte of the IP address is used in turn to index that level of the tree).

If the value is positive, it's considered to be an index into the session table. If it's negative, it's considered to be an index into the ip_pool[] table. If it's zero, then there is no associated value.

config->cluster_iam_master

If true, indicates that this node is the master for the cluster.
This has many consequences...

config->cluster_iam_uptodate

On the slaves, this indicates whether the slave has seen a full run of sessions from the master, and thus whether it's safe to be taking traffic.

On the master, this indicates that all slaves are up to date. If any of the slaves aren't up to date, this variable is false, which indicates that we should shift to more rapid heartbeats to bring the slaves back up to date.

============================================================

Clustering: How it works.

At a high level, the various members of the cluster elect a master. All other machines become slaves as soon as they hear a heartbeat from the master.

Slaves handle normal packet forwarding. Whenever a slave gets a 'state changing' packet (i.e. tunnel setup/teardown, session setup etc) it _doesn't_ handle it, but instead forwards it to the master.

'State changing' is defined to be "a packet that would cause a change in either a session or tunnel structure that isn't just updating the idle time or byte counters". In practice, this means almost all LCP, IPCP, and L2TP control packets.

The master then handles the packet normally, updating the session/tunnel structures. The changed structures are then flooded out to the slaves via a multicast packet.

Heartbeating:

The master sends out a multicast 'heartbeat' packet at least once every second. This packet contains a sequence number, and any changes to the session/tunnel structures that have been queued up. If there is room in the packet, it also sends out a number of extra session/tunnel structures.

The sending out of 'extra' structures means that the master will slowly walk the entire session and tunnel tables. This allows a new slave to catch up on cluster state.

Each heartbeat has an in-order sequence number. If a slave receives a heartbeat with a sequence number other than the one it was expecting, it drops the unexpected packet and unicasts C_LASTSEEN to tell the master the last heartbeat it had seen.
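The slave-side sequence check described above can be sketched roughly as follows. This is an illustrative sketch, not the actual l2tpns implementation; the function and variable names are invented:

```c
#include <stdint.h>

/* Hypothetical slave-side state; not the real l2tpns variables. */
static uint32_t expected_seq;   /* sequence number of the next heartbeat */
static int      have_heartbeat; /* have we seen any heartbeat yet?       */

/* Check an incoming heartbeat's sequence number.
 * Returns 1 if the heartbeat is in order and should be processed.
 * Returns 0 if it must be dropped; the caller should then unicast
 * C_LASTSEEN carrying (expected_seq - 1), the last heartbeat it saw,
 * so the master can retransmit the missing heartbeats. */
int slave_check_heartbeat(uint32_t seq)
{
    if (have_heartbeat && seq != expected_seq)
        return 0;               /* gap or duplicate: drop and report back */
    have_heartbeat = 1;
    expected_seq = seq + 1;     /* in order: accept, expect the next one */
    return 1;
}
```

If the master then retransmits the missing heartbeats in order, each one matches `expected_seq` in turn and the slave catches up without any further signalling.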
The master normally then unicasts the missing packets to the slave. If the master doesn't have the old packet any more (i.e. it's outside the transmission window) then the master unicasts C_KILL to the slave, asking it to die. (The slave should then restart, and catch up on state via the normal process.)

If a slave goes for more than a few seconds without hearing from the master, it sends out a preemptive C_LASTSEEN. If the master still exists, this forces the master to unicast the missed heartbeats. This is a workaround for a temporary multicast problem. (I.e. if an IGMP probe is missed, the slave will temporarily stop seeing the multicast heartbeats. This workaround prevents the slave from becoming master, with horrible consequences.)

Pinging:

All slaves send out a 'ping' once per second as a multicast packet. This 'ping' contains the slave's IP address, and most importantly, the number of seconds from epoch at which the slave started up. (I.e. the value of time(2) when the process started. This is the 'basetime'.) Obviously, this is never zero.

There is a special case: the master can send a single ping on shutdown to indicate that it is dead and that an immediate election should be held. This special ping is sent from the master with a 'basetime' of zero.

Elections:

All machines start up as slaves. Each slave listens for a heartbeat from the master. If a slave fails to hear a heartbeat for N seconds then it checks to see if it should become master.

A slave will become master if:

* It hasn't heard from a master for N seconds.
* It is the oldest of all its peers (the other slaves).
* In the event of a tie, the machine with the lowest IP address wins.

A 'peer' is any other slave machine that has sent out a ping in the last N seconds. (I.e. we must have seen a recent ping from that slave for it to be considered.)

The upshot of this is that no special communication takes place when a slave becomes a master.
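The election rule above (oldest basetime wins, lowest IP address breaks ties, stale peers ignored) can be sketched as a pure decision function. The peer record layout, the function name and the timeout value are all illustrative assumptions, not the real l2tpns structures:

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical peer record, built up from received pings. */
struct peer {
    uint32_t addr;      /* peer's IP address (host byte order)      */
    time_t   basetime;  /* time(2) when the peer's process started  */
    time_t   last_ping; /* when we last heard a ping from this peer */
};

#define PING_TIMEOUT 15 /* 'N seconds'; illustrative value only */

/* Decide whether this node should become master.
 * my_basetime/my_addr describe this node; peers[] holds the other
 * slaves we know about. Only machines that have pinged within the
 * last N seconds count as peers. A lower basetime is an older
 * process, and an older process outranks us. */
int should_become_master(time_t now, time_t my_basetime, uint32_t my_addr,
                         const struct peer *peers, int npeers)
{
    for (int i = 0; i < npeers; i++) {
        if (now - peers[i].last_ping > PING_TIMEOUT)
            continue;                       /* stale: not a peer any more */
        if (peers[i].basetime < my_basetime)
            return 0;                       /* an older slave outranks us */
        if (peers[i].basetime == my_basetime && peers[i].addr < my_addr)
            return 0;                       /* tie: lowest IP wins */
    }
    return 1;
}
```

Because every slave evaluates the same rule over the same ping data, exactly one of them concludes it should be master, which is why no extra election traffic is needed.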
On initial cluster startup, the process would be (for example):

* 3 machines start up simultaneously, all as slaves.
* Each machine sends out a multicast 'ping' every second.
* 15 seconds later, the machine with the lowest IP address becomes master, and starts sending out heartbeats.
* The remaining two machines hear the heartbeat and set that machine as their master.

Becoming master:

When a slave becomes master, the only structures maintained up to date are the tunnel and session structures. This means the new master must rebuild a number of mappings.

#0. All the session and tunnel structures are marked as defined. (Even if we weren't fully up to date, it's too late now.)

#1. All the token bucket filters are re-built from scratch, with the associated session-to-tbf pointers being re-built.

TODO: These changed tbf pointers aren't flooded to the slaves right away! Throttled sessions could take a couple of minutes to start working again on master failover!

#2. The ipcache-to-session hash is rebuilt. (This isn't strictly needed, but it's a safety measure.)

#3. The mapping from the ippool into the session table (and vice versa) is re-built.

Becoming slave:

At startup the entire session and tunnel structures are marked undefined. As the slave sees updates from the master, the updated structures are marked as defined.

When there are no undefined tunnel or session structures, the slave marks itself as 'up-to-date' and starts advertising routes (if BGP is enabled).

STONITH:

Currently, there is very minimal protection from split brain. In particular, there is no real STONITH protocol to stop two masters appearing in the event of a network problem.

TODO:

Should slaves that have undefined sessions, and that receive a packet from a non-existent session, forward it to the master?? In normal practice, a slave with undefined sessions shouldn't be handling packets, but ...

There is far too much walking of large arrays (in the master specifically).
Although this is mitigated somewhat by cluster_high_{sess,tun}, this benefit is lost as that value gets closer to MAX{SESSION,TUNNEL}. There are two issues here:

* The tunnel, radius and tbf arrays should probably use a mechanism like sessions, where grabbing a new one is a single lookup rather than a walk.

* A list structure (similarly rooted at [0].interesting) is required to avoid having to walk tables periodically. As a back-stop, the code in the master which *does* walk the arrays can mark any entry it processes as "interesting" to ensure it gets looked at even if a bug causes it to be otherwise overlooked.

Support for more than 64k sessions per cluster:

There is currently a 64k session limit because each session gets an id that is global over the cluster (as opposed to local to the tunnel). Obviously, the tunnel id needs to be used in conjunction with the session id to index into the session table. But how?

I think the best way is to use something like page tables. For a given <tid, sid>, the appropriate session index is

	session[ tunnel[tid].page[sid>>10] + (sid & 1023) ]

where tunnel[].page[] is a 64 element array. As a tunnel fills up its page block, it allocates a new 1024-session block from the session table and fills in the appropriate .page[] entry.

This should be a reasonable compromise between wasting memory (average 500 sessions per tunnel wasted) and speed. (Still a direct index without searching, but extra lookups required.)

Obviously the <6,10> split on the sid can be moved around to tune the size of the page table vs the session table block size.

This unfortunately means that the tunnel structure HAS to be filled in on the slave before any of the sessions on it can be used.
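The proposed page-table lookup could look roughly like this. A minimal sketch under the <6,10> split described above; the structure layout and the convention that an unallocated page maps to the always-invalid session 0 are assumptions, not existing l2tpns code:

```c
#include <stdint.h>

#define PAGES_PER_TUNNEL  64    /* <6,10> split: 64 pages per tunnel ... */
#define SESSIONS_PER_PAGE 1024  /* ... of 1024 sessions each             */

struct tunnel_pages {
    /* Base index into the global session[] table for each
     * 1024-session block, or 0 if that page isn't allocated yet.
     * Filled in as the tunnel grows and a new block is grabbed
     * from the session table. */
    uint32_t page[PAGES_PER_TUNNEL];
};

/* Map a <tid, sid> pair to an index into the global session[] table.
 * Returns 0 (the always-invalid zero'th session) if the page holding
 * this sid has never been allocated. */
uint32_t session_index(const struct tunnel_pages *tunnels,
                       uint16_t tid, uint16_t sid)
{
    uint32_t base = tunnels[tid].page[sid >> 10];
    if (!base)
        return 0;                           /* page never allocated */
    return base + (sid & (SESSIONS_PER_PAGE - 1));
}
```

Still a direct index, as the text notes: one extra array lookup per packet, no searching, at the cost of the tunnel's page array having to be synced to the slave before its sessions are usable.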