After suffering an outage of over six hours, Facebook and their family of apps are finally emerging from a blackout that has sparked furore around the world.

Because of their size and global reach, Facebook and other large companies have complicated systems they need to keep up and running and, to be fair, most of the time they do! Numerous automated and manual checks are being performed on these systems all the time.

However, even with the best tools and processes, complacency can still set in. While Facebook has made no official announcement about the cause of their issues, other than to deny that it was a DDoS attack, these are some common automation failures that can allow problems like this to reach a live environment undetected.

Let’s take a closer look at what some of the underlying issues could be:

Insufficient contract automation around third-party APIs

Third party APIs can change without warning and can cause large disruptions to your services. Contract testing ensures that the data flow between your software and third parties continues to conform to the agreed standards. It catches any unexpected changes that have the potential to break your own software before they make it into a production environment.
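As a minimal sketch of the idea, a consumer-side contract check can be as simple as asserting that a third-party response still has the fields and types your software depends on. The endpoint payload and field names below are hypothetical; real projects would typically use a dedicated tool such as Pact or JSON Schema for this.

```python
# Hypothetical contract for a third-party "user" payload: each field
# we depend on, mapped to the type we expect it to have.
EXPECTED_CONTRACT = {
    "id": int,
    "name": str,
    "email": str,
}

def violates_contract(response: dict) -> list:
    """Return a list of contract violations found in a response payload."""
    problems = []
    for field, expected_type in EXPECTED_CONTRACT.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return problems

# Simulated responses from the third party:
ok = violates_contract({"id": 1, "name": "Ada", "email": "ada@example.com"})
# Here the provider silently changed "id" to a string and dropped "email":
broken = violates_contract({"id": "1", "name": "Ada"})
```

Run in a pipeline against a staging call to the provider, a check like this flags the breaking change before it ever reaches production.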

Badly-scoped automation (UI and API)

Automated testing can never catch all of the bugs coming through, and often there is not enough time and money to build and execute huge suites of tests. Mapping your systems and potential points of failure can help define which systems are capable of causing critical failures or outages, and automation in these areas should be given priority. No one cares if your FAQ page has spelling mistakes if users cannot log in and use your service!

Infrequent use of existing automation

Ideally automation is run as part of a build pipeline in a team doing CI/CD. There are sometimes compromises (UI suites are slow to execute – often a subset is used for a smoke test and the full regression suite is run on a scheduled basis) but the more frequently it is run, the more quickly these issues are picked up. If the tests are run infrequently, or mothballed, then the value is lost.
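The smoke-versus-regression compromise above often comes down to tagging tests and selecting by tag at run time. Frameworks like pytest support this natively via markers; the sketch below shows the underlying selection logic with made-up test names and tags.

```python
# Each test carries tags; "smoke" tests run on every commit, while the
# full "regression" set runs on a schedule (e.g. nightly). All names
# and tags here are hypothetical.
TESTS = [
    {"name": "test_login",        "tags": {"smoke", "regression"}},
    {"name": "test_checkout",     "tags": {"smoke", "regression"}},
    {"name": "test_profile_edit", "tags": {"regression"}},
    {"name": "test_faq_render",   "tags": {"regression"}},
]

def select(tests, tag):
    """Pick the subset of tests carrying the given tag."""
    return [t["name"] for t in tests if tag in t["tags"]]

smoke_suite = select(TESTS, "smoke")            # fast, every commit
regression_suite = select(TESTS, "regression")  # full, scheduled
```

The trade-off is explicit: the smoke suite stays fast enough to run on every build, so critical breakages surface within minutes rather than waiting for the nightly run.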

Failures are not followed up

Sometimes automation finds issues and no one investigates them in a timely manner. A brittle automation suite can lead to failure fatigue (“The automation tests are always broken so it’s a waste of time investigating!”) so having robust automation is very important. Clear results reporting (such as a visible dashboard) assists in making sure someone is following up all failures as soon as possible.
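A dashboard that fights failure fatigue needs to separate the one-off flake from the test that fails every single run. As a rough sketch (the thresholds and test names are assumptions, not a real tool's API), recent results can be grouped so persistent failures stand out:

```python
from collections import Counter

def triage(results, runs, persistent_threshold=0.5):
    """Group failures across recent runs.

    results: list of (test_name, passed) tuples collected over `runs` runs.
    A test failing in at least `persistent_threshold` of runs is flagged
    for immediate investigation; rarer failures are marked as possibly flaky.
    """
    failures = Counter(name for name, passed in results if not passed)
    report = {}
    for name, count in failures.items():
        rate = count / runs
        report[name] = "investigate now" if rate >= persistent_threshold else "flaky?"
    return report

# Hypothetical history over the last 3 runs: login fails every time,
# search failed once and then recovered.
history = [
    ("test_login", False), ("test_login", False), ("test_login", False),
    ("test_search", False), ("test_search", True), ("test_search", True),
]
report = triage(history, runs=3)
```

Surfacing that distinction on a visible dashboard makes it much harder for a genuine, repeating failure to hide behind "the tests are always broken".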

Have you set a plan in place to ensure automation success?