I was asked to describe clustering to a person who doesn’t know much about this stuff, so here goes (before anyone asks, the answer is no…I didn’t use ChatGPT for this):

 

When you hear “server cluster,” you think of a bunch of servers joined together to improve performance, availability, and so on. But how is that any different from just having a bunch of servers on a network?

 

Servers in a cluster work much the same way a committee operates:

– First, there has to be quorum for the session to be viable. This means at least 1/2 + 1 of all members (servers), a simple majority, must be present at all times, or the whole thing gets shut down.

– Second, each participating member regularly checks in with its vote and its current availability. Each time a decision is needed, an election is held and the winning member is assigned the resources (there’s a rough code sketch of this voting just after this list).

– Incumbents get priority, so if you already have the resources and you’re still able to keep them running, please do so.

– If they can’t, it usually falls to the old reliable member who still has a half-empty plate.

– Sometimes, however, a young buck gets promoted to the committee with a completely empty plate. When this occurs, the Chair (the systems administrator) can call a full election and reassign jobs without preferential treatment. This is known as Rebalancing.
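
If you’re curious what that voting actually looks like, here’s a rough sketch in Python. It is not any vendor’s real algorithm, and the node names and numbers are invented, but it shows the shape of the decision: check for quorum first, let the incumbent keep its work, and otherwise hand it to whoever has the most room.

```python
# A toy sketch of the committee logic: quorum is a simple majority of voting
# members, and a resource prefers the member that already owns it. Node names
# and capacity numbers are made up for illustration.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    online: bool
    free_capacity: int  # stand-in for "room left on their plate"

def has_quorum(nodes: list[Node]) -> bool:
    # Majority rule: at least half the members plus one must be reachable.
    votes_needed = len(nodes) // 2 + 1
    return sum(n.online for n in nodes) >= votes_needed

def elect_owner(nodes: list[Node], current_owner: str | None) -> str | None:
    """Decide who runs a resource: the incumbent if it is still healthy,
    otherwise the online member with the most free capacity."""
    if not has_quorum(nodes):
        return None  # no quorum: the cluster stops hosting resources entirely
    online = [n for n in nodes if n.online]
    for n in online:
        if n.name == current_owner:
            return n.name  # incumbent keeps the job
    return max(online, key=lambda n: n.free_capacity).name

cluster = [Node("node-a", True, 20), Node("node-b", False, 80), Node("node-c", True, 50)]
print(has_quorum(cluster))             # True: 2 of 3 members are online
print(elect_owner(cluster, "node-b"))  # node-b is down, so node-c wins the election
```

Real cluster managers (Windows Failover Clustering, Pacemaker, and the like) do this with constant heartbeats and much richer scoring, but the order of the checks is the same.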

 

Now the cool part is, the member only really has to use their brain power (processor and memory). All the actual work is handled by two contracted groups: Logistics (networking) and Labour (storage). So when someone changes responsibility, they really just need to pick up the project binder, look at the first page, and wait for a call.

 

Technically, in networking we have firewalls, routers, and switches that these servers connect to. They are configured specifically to let the cluster function as it needs to, but also to let it communicate with the rest of the world. Networking provides us with Border Security, Transit Operations, and Priority Access for our most important applications.

 

Storage, the set of disks that store all the actual data for files and servers, is also managed separately. It is accessed by the cluster through that priority access on the network I mentioned a moment ago.

What all this means is that there would have to be a series of catastrophic failures across these systems before an actual “outage” would occur.

 

To keep this from happening to the real committee (Production), we have one to three mock environments where we can test the impact of changes we make ourselves. These are known as Test, Development, and Staging. They are completely separate from Production and run on their own hardware.

 

Unfortunately, outages still do happen. To go back to our committee analogy, we protect against this by having multiple committees (separate clusters). Furthermore, we make sure these other committees are all set up to take over everything, and that they work out of a different city.

 

In the event that the main committee can’t function, the first step is to bring in the retired committee members who have been sitting in on the meetings but don’t vote.

This is known as a local backup cluster. We usually have it on a separate physical rack in the datacenter, with a separate power source and separate switches connecting it to storage, which is also supposed to be duplicated locally.

 

If the problem is really bad, and both the retirees and the facility are unavailable, we call one of the extra committees we established earlier.

They get daily updates sent to them at night, so they may not know about anything that happened today on any of the projects. However, after a very brief period, they can take over the job. They’re probably not going to be as fast, and it takes a little longer to drive there, but everything will work.

 

This, my friends, is known as a disaster, and we’ve entered Disaster Recovery mode.

 

We’ve brought our backup facility online, performed a failover of our replicas, and restored our non-critical servers from backup. There are networking changes that must be made to point everyone to the new location, and those changes (mostly DNS updates) have to propagate to resolvers and caches around the world before service is fully restored for everyone.
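
For a feel of that last waiting game, here’s a hedged Python sketch that simply polls DNS until our service name points at the recovery site’s address. The hostname and addresses are made up, and real cutovers involve much more than this, but it shows what “waiting for the internet to catch up” means.

```python
# A toy illustration of the "wait for the world to learn the new address" step
# after a DR failover: poll DNS until the service name resolves to the recovery
# site instead of the failed primary site. Hostname and addresses are invented.

import socket
import time

HOSTNAME = "app.example.com"   # hypothetical public name of our service
DR_SITE_IP = "203.0.113.50"    # hypothetical address at the recovery site

def resolved_ip(hostname: str) -> str | None:
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None  # name doesn't resolve at all right now

def wait_for_cutover(poll_seconds: int = 60) -> None:
    while True:
        ip = resolved_ip(HOSTNAME)
        if ip == DR_SITE_IP:
            print(f"{HOSTNAME} now resolves to the DR site ({ip}); traffic will follow.")
            return
        print(f"{HOSTNAME} still resolves to {ip}; cached answers haven't expired yet.")
        time.sleep(poll_seconds)

# wait_for_cutover()  # uncomment to actually poll
```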

 

Clusters these days are small, usually 2-8 servers. We also have what we call Hyper-converged servers, which have processing, networking, and storage all in the same multi-node chassis. For 2-node clusters, we use what’s called a Witness Server to ensure the cluster can maintain quorum (with only 2 members, losing one drops you below 1/2 + 1). This would be a committee member who has a vote, but doesn’t actually take on any work.
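
If the arithmetic there isn’t obvious, here it is spelled out; this is just the majority rule from earlier, not any vendor’s quorum code.

```python
# Why a 2-node cluster needs a witness: the majority of 2 votes is 2, so losing
# either node loses quorum, but adding a witness vote makes the majority 2-of-3.

def majority(total_votes: int) -> int:
    return total_votes // 2 + 1

def survives_one_failure(total_votes: int) -> bool:
    # After one voter drops out, do the survivors still form a majority?
    return (total_votes - 1) >= majority(total_votes)

print(majority(2), survives_one_failure(2))  # 2 False: two nodes alone can't lose either one
print(majority(3), survives_one_failure(3))  # 2 True:  node + node + witness can lose one
```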

 

While 2-node clusters and hyper-converged infrastructure (HCI) may not sound optimal, the solution uses the fastest components available and provides one level of failover for smaller operations. That gives you availability, increased performance, and stability over a single server, without the footprint or cost of a full 3+ node cluster.

 

Summing it all up, the main problem with the CrowdStrike outage is that it involved security software that is installed on virtually all devices within an organization. Though I haven’t investigated this incident specifically, for security products to remain viable they have to constantly update their agents with virus identifiers (called definitions) to respond to the latest threats. From an organization’s perspective, this process can’t be stringently tested, because the vulnerabilities are known to threat actors as well as to the security company, so we tend to trust what we’re given and roll updates out to all our devices multiple times a day.

The only way to prevent this from happening is to thoroughly test your update before releasing it into the wild, but time is of the essence. There’s ultimately a gray area of accepted risk traded for expediency, and sometimes that line gets pushed too far.
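
For what that “thorough testing” can look like without giving up all the speed, here’s a hedged Python sketch of a ring-based (canary) rollout. The ring sizes and the health check are invented for illustration; this is not CrowdStrike’s or anyone else’s actual pipeline.

```python
# A sketch of "test before releasing to everyone": push an update to a small
# canary group first, check health telemetry, and only widen the blast radius
# if nothing looks wrong. Ring sizes and the health check are illustrative.

import random

RINGS = {
    "canary": 0.01,   # 1% of devices
    "early":  0.10,   # 10% of devices
    "broad":  1.00,   # everyone else
}

def push_update(ring: str, fraction: float) -> None:
    print(f"Pushing update to the '{ring}' ring ({fraction:.0%} of devices)...")

def ring_is_healthy() -> bool:
    # Stand-in for real telemetry: crash rates, boot loops, agents checking in, etc.
    return random.random() > 0.05

def staged_rollout() -> None:
    for ring, fraction in RINGS.items():
        push_update(ring, fraction)
        if not ring_is_healthy():
            print(f"Health check failed in the '{ring}' ring; halting and rolling back.")
            return
    print("Update reached every ring without tripping a health check.")

staged_rollout()
```

The trade-off is right there in the loop: every extra ring you wait on buys safety, but it also delays protection against the very threat the update was meant to stop.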

If you made it this far, hope this helps!!