INTERNALS

Documentation on various internal structures.

The most important structures live in an anonymous shared mmap()
so that child processes can see them. (All the CLI connections
are handled in child processes.)
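
As a rough illustration of the technique (not the actual allocation
code), an anonymous shared mapping can be set up as sketched below.
The names shared_alloc() and shared_demo() are invented for this
example, and the sketch assumes a platform that provides MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: allocate a zero-filled region that stays shared
 * with any child created by a later fork(). Returns NULL on failure. */
void *shared_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Demonstration: a child writes into the region and the parent reads
 * the value back. Returns the value seen by the parent, or -1 on error. */
int shared_demo(void)
{
    int *counter = shared_alloc(sizeof *counter);
    if (counter == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(counter, sizeof *counter);
        return -1;
    }
    if (pid == 0) {            /* child: update the shared structure */
        *counter = 42;
        _exit(0);
    }

    int seen = -1;
    if (waitpid(pid, NULL, 0) == pid)
        seen = *counter;       /* parent: observes the child's write */
    munmap(counter, sizeof *counter);
    return seen;
}
```

Because the mapping is MAP_SHARED, the child's write is visible to the
parent. With MAP_PRIVATE each process would get its own copy after
fork() and the parent would still read 0.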

TODO: Re-investigate threads to see if we can use a thread to handle
CLI connections without killing forwarding performance.

session[]
        An array of session structures. This is one of the two
        major data structures that are synced across the cluster.

        This array is statically allocated at startup time to a
        compile-time size (currently 50k sessions). This sets a
        hard limit on the number of sessions a cluster can handle.

        There is one element per L2TP session (i.e. one per active user).

        The zeroth session is always invalid.

tunnel[]
        An array of tunnel structures. This is the other major data structure
        that's actively synced across the cluster.

        As with sessions, this is statically allocated at startup time
        to a compile-time size limit.

        There is one element per L2TP tunnel (normally one per BRAS
        that this cluster talks to).

        The zeroth tunnel is always invalid.

ip_pool[]

        A table holding all the IP addresses in the pool. As addresses
        are used, they are tagged with the username of the session
        and the session index.

        When they are freed the username tag ISN'T cleared. This is
        to ensure that, where possible, we re-allocate the same IP
        address back to the same user.
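
        The effect of keeping the username tag can be sketched as
        follows. This is an illustration only: the struct layout, the
        names pool_assign() and pool_release(), and the fallback order
        (untagged entries before entries tagged for someone else) are
        assumptions made for the example, not the real allocator.

```c
#include <string.h>

#define MAX_USER 32

struct pool_entry {            /* hypothetical layout */
    unsigned int address;      /* IPv4 address */
    int assigned;              /* non-zero while in use */
    int session;               /* owning session index, 0 if none */
    char user[MAX_USER];       /* last user; survives release */
};

/* Pick an address for `user`. Preference order: a free entry last used
 * by the same user, then a never-used entry, then any other free entry.
 * Returns the pool index, or -1 if the pool is exhausted. */
int pool_assign(struct pool_entry *pool, int n, const char *user, int session)
{
    int untagged = -1, tagged = -1, chosen = -1;

    for (int i = 0; i < n; i++) {
        if (pool[i].assigned)
            continue;
        if (strcmp(pool[i].user, user) == 0) {
            chosen = i;        /* same user as last time */
            break;
        }
        if (pool[i].user[0] == '\0') {
            if (untagged < 0)
                untagged = i;
        } else if (tagged < 0) {
            tagged = i;
        }
    }
    if (chosen < 0)
        chosen = untagged >= 0 ? untagged : tagged;
    if (chosen < 0)
        return -1;

    pool[chosen].assigned = 1;
    pool[chosen].session = session;
    strncpy(pool[chosen].user, user, MAX_USER - 1);
    pool[chosen].user[MAX_USER - 1] = '\0';
    return chosen;
}

/* Free an address. The username tag is deliberately left in place. */
void pool_release(struct pool_entry *pool, int i)
{
    pool[i].assigned = 0;
    pool[i].session = 0;
}
```

        With a zeroed three-entry pool, a user who releases an address
        and later reconnects gets the same entry back, even if other
        users were assigned addresses in between.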

radius[]
        A table holding active RADIUS sessions. Whenever a RADIUS
        conversation is needed (login, accounting, etc.), a RADIUS
        session is allocated.

char **ip_hash

        A mapping of IP address to session structure. This is a
        byte-indexed tree (each byte of the IP address is used in turn
        to index that level of the tree).

        If the value found is positive, it's considered to be an index
        into the session table.

        If it's negative, it's considered to be an index into
        the ip_pool[] table.

        If it's zero, then there is no associated value.
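
        A minimal sketch of such a lookup, assuming IPv4 addresses in
        host byte order and a four-level tree whose inner tables are
        allocated on demand. The union layout and the names
        hash_lookup() and hash_store() are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

union hash_node {              /* hypothetical node layout */
    union hash_node *child;    /* levels 1-3: next 256-entry table */
    int value;                 /* level 4: >0 session, <0 pool, 0 none */
};

static union hash_node root[256];

/* Walk the tree one address byte at a time. Returns 0 when any level
 * is missing, which matches the "no associated value" case. */
int hash_lookup(uint32_t ip)
{
    union hash_node *table = root;

    for (int shift = 24; shift > 0; shift -= 8) {
        table = table[(ip >> shift) & 0xff].child;
        if (table == NULL)
            return 0;
    }
    return table[ip & 0xff].value;
}

/* Store a value, allocating intermediate tables as needed. calloc() is
 * relied on to give NULL children and zero values (true on mainstream
 * platforms). Returns 0 on success, -1 if allocation fails. */
int hash_store(uint32_t ip, int value)
{
    union hash_node *table = root;

    for (int shift = 24; shift > 0; shift -= 8) {
        union hash_node *slot = &table[(ip >> shift) & 0xff];
        if (slot->child == NULL) {
            slot->child = calloc(256, sizeof *slot->child);
            if (slot->child == NULL)
                return -1;
        }
        table = slot->child;
    }
    table[ip & 0xff].value = value;
    return 0;
}
```

        A caller would treat a positive result as a session index, and
        a negative result as a reference into ip_pool[].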

config->cluster_iam_master

        If true, indicates that this node is the master for
        the cluster. This has many consequences...

config->cluster_iam_uptodate

        On a slave, this indicates whether it has seen a full run
        of sessions from the master, and thus whether it's safe to be
        taking traffic.

        On the master, this indicates that all slaves are
        up to date. If any of the slaves aren't up to date,
        this variable is false, which tells the master to
        shift to more rapid heartbeats to bring the slave
        back up to date.


============================================================

Clustering: How it works.

        At a high level, the various members of the cluster elect
a master. All other machines become slaves as soon as they hear
a heartbeat from the master. Slaves handle normal packet forwarding.
Whenever a slave gets a 'state changing' packet (i.e. tunnel setup/teardown,
session setup etc.) it _doesn't_ handle it, but instead forwards it
to the master.

        'State changing' is defined to be "a packet that would cause
a change in either a session or tunnel structure that isn't just
updating the idle time or byte counters". In practice, this means
almost all LCP, IPCP, and L2TP control packets.

        The master then handles the packet normally, updating
the session/tunnel structures. The changed structures are then
flooded out to the slaves via a multicast packet.


Heartbeat'ing:
        The master sends out a multicast 'heartbeat' packet
at least once every second. This packet contains a sequence number,
and any changes to the session/tunnel structures that have
been queued up. If there is room in the packet, it also sends
out a number of extra session/tunnel structures.

        The sending out of 'extra' structures means that the
master will slowly walk the entire session and tunnel tables.
This allows a new slave to catch up on cluster state.
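
        One way to picture the 'extra' structures is a cursor that
keeps its position between heartbeats, so that spare room in each
packet is used to advance a slow walk over the table. The sketch below
uses tiny made-up sizes and hypothetical names (build_heartbeat(),
struct walker); it only records which session indexes would be sent.

```c
#include <stddef.h>

#define MAX_SESS  6            /* tiny table for illustration */
#define PKT_SLOTS 3            /* entries that fit in one heartbeat */

struct walker {
    int next;                  /* next table index to send as an extra */
};

/* Fill out[] with queued changes first, then pad any remaining room
 * with entries taken from the rolling cursor. Changes beyond PKT_SLOTS
 * are left for the caller to send in a later packet.
 * Returns the number of entries written. */
int build_heartbeat(const int *changed, int nchanged,
                    struct walker *w, int *out)
{
    int n = 0;

    for (int i = 0; i < nchanged && n < PKT_SLOTS; i++)
        out[n++] = changed[i];

    while (n < PKT_SLOTS) {
        out[n++] = w->next;
        /* wrap around, skipping the always-invalid entry 0 */
        w->next = (w->next + 1 < MAX_SESS) ? w->next + 1 : 1;
    }
    return n;
}
```

        Starting with the cursor at 1 and no queued changes, successive
heartbeats would carry entries 1, 2, 3, then 4, 5, 1, and so on, so
every entry is eventually resent even if it never changes.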


        Each heartbeat has an in-order sequence number. If a
slave receives a heartbeat with a sequence number other than
the one it was expecting, it drops the unexpected packet and
unicasts C_LASTSEEN to tell the master the last heartbeat it
had seen. The master normally then unicasts the missing packets
to the slave. If the master doesn't have the old packet any more
(i.e. it's outside the transmission window) then the master
unicasts C_KILL to the slave asking it to die. (The slave should
then restart, and catch up on state via the normal process.)
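
        The decisions on each side can be sketched roughly as below.
Only C_LASTSEEN and C_KILL come from the protocol described here; the
enum names, the function names and the window size of 8 are assumptions
made for the example.

```c
#include <stdint.h>

#define WINDOW 8   /* assumed number of past heartbeats the master keeps */

enum slave_action  { ACCEPT_HEARTBEAT, SEND_LASTSEEN };
enum master_action { RESEND_FROM_WINDOW, SEND_KILL };

struct slave_state {
    uint32_t expected_seq;     /* sequence number of the next heartbeat */
};

/* Slave side: accept only the heartbeat we were expecting. Anything
 * else is dropped and answered with C_LASTSEEN (expected_seq - 1). */
enum slave_action slave_on_heartbeat(struct slave_state *s, uint32_t seq)
{
    if (seq != s->expected_seq)
        return SEND_LASTSEEN;
    s->expected_seq = seq + 1;
    return ACCEPT_HEARTBEAT;
}

/* Master side: `next_seq` is the sequence number the master will use
 * for its next heartbeat, `lastseen` is what the slave reported.
 * Unsigned arithmetic keeps this correct across wrap-around. */
enum master_action master_on_lastseen(uint32_t next_seq, uint32_t lastseen)
{
    uint32_t missing = next_seq - (lastseen + 1);

    return missing <= WINDOW ? RESEND_FROM_WINDOW : SEND_KILL;
}
```

        For example, a slave expecting 101 that receives 103 leaves its
state untouched and reports 100; a master about to send 104 would then
resend 101 to 103, while a master about to send 200 would answer with
C_KILL.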

        If a slave goes for more than a few seconds without
hearing from the master, it sends out a preemptive C_LASTSEEN.
If the master still exists, this forces the master to unicast
the missed heartbeats. This is a workaround for temporary
multicast problems (e.g. if an IGMP probe is missed, the slave
will temporarily stop seeing the multicast heartbeats). The
workaround prevents the slave from becoming master, with
horrible consequences.

Ping'ing:
        All slaves send out a 'ping' once per second as a
multicast packet. This 'ping' contains the slave's IP address
and, most importantly, the number of seconds from epoch
at which the slave started up (i.e. the value of time(2) at
the moment the process started). This is the 'basetime'.
Obviously, this is never zero.

        There is a special case. The master can send a single
ping on shutdown to indicate that it is dead and that an
immediate election should be held. This special ping is
sent from the master with a 'basetime' of zero.

Elections:

        All machines start up as slaves.

        Each slave listens for a heartbeat from the master.
If a slave fails to hear a heartbeat for N seconds then it
checks to see if it should become master.

        A slave will become master if:
                * It hasn't heard from a master for N seconds.
                * It is the oldest of all its peers (the other slaves),
                        i.e. it has the earliest basetime.
                * In the event of a tie, the machine with the
                        lowest IP address will win.

        A 'peer' is any other slave machine that has sent out a
        ping in the last N seconds. (i.e. we must have seen
        a recent ping from that slave for it to be considered).

        The upshot of this is that no special communication
        takes place when a slave becomes a master.
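
        The election rule above can be written as a single predicate
        that each slave evaluates locally. This is a sketch, not the
        cluster code: struct peer, should_become_master() and the
        exact boundary handling around N are assumptions.

```c
#include <stdint.h>
#include <time.h>

struct peer {                  /* hypothetical record of another slave */
    uint32_t ip;               /* IP address, host byte order */
    time_t basetime;           /* startup time carried in its pings */
    time_t last_ping;          /* when we last heard a ping from it */
};

/* Returns 1 if this slave should promote itself to master, else 0.
 * `timeout` plays the role of N in the text. */
int should_become_master(uint32_t my_ip, time_t my_basetime,
                         time_t last_heartbeat,
                         const struct peer *peers, int npeers,
                         time_t now, time_t timeout)
{
    if (now - last_heartbeat < timeout)
        return 0;                          /* master was heard recently */

    for (int i = 0; i < npeers; i++) {
        if (now - peers[i].last_ping >= timeout)
            continue;                      /* stale: not a peer any more */
        if (peers[i].basetime < my_basetime)
            return 0;                      /* an older peer exists */
        if (peers[i].basetime == my_basetime && peers[i].ip < my_ip)
            return 0;                      /* tie: lower IP address wins */
    }
    return 1;
}
```

        Since every slave sees the same pings, each one can reach the
        same answer on its own, which is why no extra negotiation
        traffic is needed.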

        On initial cluster startup, the process would be (for example):

                * 3 machines start up simultaneously, all as slaves.
                * Each machine sends out a multicast 'ping' every second.
                * 15 seconds later, the machine with the lowest IP
                        address becomes master, and starts sending
                        out heartbeats.
                * The remaining two machines hear the heartbeat and
                        set that machine as their master.

Becoming master:

        When a slave becomes master, the only structures maintained up
        to date are the tunnel and session structures. This means
        the master will rebuild a number of mappings.

        #0. All the session and tunnel structures are marked as
        defined. (Even if we weren't fully up to date, it's
        too late now).

        #1. All the token bucket filters are rebuilt from scratch
        with the associated session to tbf pointers being rebuilt.

TODO: These changed tbf pointers aren't flooded to the slaves right away!
Throttled sessions could take a couple of minutes to start working again
on master failover!

        #2. The ipcache to session hash is rebuilt. (This isn't
        strictly needed, but it's a safety measure).

        #3. The mapping from the ippool into the session table
        (and vice versa) is rebuilt.


Becoming slave:

        At startup all the session and tunnel structures are
        marked undefined.

        As the slave sees updates from the master, the updated structures
        are marked as defined.

        When there are no undefined tunnel or session structures, the
        slave marks itself as 'up-to-date' and starts advertising routes
        (if BGP is enabled).
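
        The bookkeeping this implies is small. The sketch below keeps
        a count of still-undefined entries so that the up-to-date check
        does not need a table walk; the counter, the tiny table sizes
        and all names are assumptions for illustration.

```c
#include <string.h>

#define MAX_SESS 8             /* tiny table sizes for illustration */
#define MAX_TUN  4

struct slave_sync {
    unsigned char sess_defined[MAX_SESS];
    unsigned char tun_defined[MAX_TUN];
    int undefined;             /* entries not yet seen from the master */
    int uptodate;              /* set once `undefined` reaches zero */
};

/* At startup everything is undefined. Entry 0 of each table is always
 * invalid, so this sketch does not wait for it. */
void slave_reset(struct slave_sync *s)
{
    memset(s, 0, sizeof *s);
    s->undefined = (MAX_SESS - 1) + (MAX_TUN - 1);
}

/* Mark one entry as defined; flip to up-to-date when none remain. */
static void mark_defined(struct slave_sync *s, unsigned char *flag)
{
    if (!*flag) {
        *flag = 1;
        s->undefined--;
    }
    if (s->undefined == 0)
        s->uptodate = 1;       /* route advertisement would start here */
}

void slave_saw_session(struct slave_sync *s, int i)
{
    if (i > 0 && i < MAX_SESS)
        mark_defined(s, &s->sess_defined[i]);
}

void slave_saw_tunnel(struct slave_sync *s, int i)
{
    if (i > 0 && i < MAX_TUN)
        mark_defined(s, &s->tun_defined[i]);
}
```

        Repeated updates for an entry that is already defined leave the
        counter alone, so the slave only becomes up-to-date once every
        valid session and tunnel entry has been seen at least once.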

STONITH:

        Currently, there is very minimal protection from split brain.
In particular, there is no real STONITH protocol to stop two masters
appearing in the event of a network problem.



TODO:
        Should a slave that has undefined sessions, and receives
a packet for a non-existent session, forward it to the master?
In normal practice, a slave with undefined sessions shouldn't be
handling packets, but ...

        There is far too much walking of large arrays (in the master
specifically).  Although this is mitigated somewhat by the
cluster_high_{sess,tun}, this benefit is lost as that value gets
closer to MAX{SESSION,TUNNEL}.  There are two issues here:

        * The tunnel, radius and tbf arrays should probably use a
          mechanism like sessions, where grabbing a new one is a
          single lookup rather than a walk.

        * A list structure (similarly rooted at [0].interesting) is
          required to avoid having to walk tables periodically.  As a
          back-stop the code in the master which *does* walk the
          arrays can mark any entry it processes as "interesting" to
          ensure it gets looked at even if a bug causes it to be
          otherwise overlooked.

        Support for more than 64k sessions per cluster. There is
currently a 64k session limit because each session gets an id that is global
over the cluster (as opposed to local to the tunnel). Obviously, the tunnel
id needs to be used in conjunction with the session id to index into
the session table. But how?

        I think the best way is to use something like page tables.
For a given <tid,sid>, the appropriate session index is
        session[ tunnel[tid].page[sid>>10] + (sid & 1023) ]
where tunnel[].page[] is a 64 element array. As a tunnel
fills up its page block, it allocates a new 1024 session block
from the session table and fills in the appropriate .page[]
entry.
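
        A sketch of the proposed lookup, under some assumptions that
are not part of the proposal itself: a page entry of 0 means "no block
allocated yet", the first 1024-entry block is therefore left unused, and
blocks are handed out by a simple bump allocator. All names and the
small MAX_SESS are hypothetical.

```c
#include <stdint.h>

#define PAGE_BITS 10
#define PAGE_SIZE (1 << PAGE_BITS)     /* 1024 sessions per block */
#define PAGES     64                   /* 64 x 1024 = 64k sids per tunnel */
#define MAX_SESS  (4 * PAGE_SIZE)      /* small table for illustration */

struct tunnel {
    int32_t page[PAGES];       /* base session index per block, 0 = none */
};

/* Next unallocated block. Block 0 is skipped so that 0 can double as
 * the "unallocated" marker and as the always-invalid session. */
static int32_t next_free_block = PAGE_SIZE;

/* Pure lookup: returns the session index for <tunnel,sid>, or 0 (the
 * invalid session) if the tunnel has no block covering this sid. */
int32_t session_index(const struct tunnel *t, uint16_t sid)
{
    int32_t base = t->page[sid >> PAGE_BITS];

    if (base == 0)
        return 0;
    return base + (sid & (PAGE_SIZE - 1));
}

/* Lookup that allocates a new block on first use of a page.
 * Returns 0 if the session table has no blocks left. */
int32_t session_index_alloc(struct tunnel *t, uint16_t sid)
{
    int32_t *base = &t->page[sid >> PAGE_BITS];

    if (*base == 0) {
        if (next_free_block + PAGE_SIZE > MAX_SESS)
            return 0;
        *base = next_free_block;
        next_free_block += PAGE_SIZE;
    }
    return *base + (sid & (PAGE_SIZE - 1));
}
```

        Two tunnels using the same sid end up in different blocks, and
a sid of 1030 in a tunnel maps to offset 6 of that tunnel's second
block.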

        This should be a reasonable compromise between wasting memory
(average 500 sessions per tunnel wasted) and speed. (Still a direct
index without searching, but extra lookups required). Obviously
the <6,10> split on the sid can be moved around to tune the size
of the page table versus the session table block size.

        This unfortunately means that the tunnel structure HAS to
be filled on the slave before any of the sessions on it can be used.
