
Disaster Recovery

This page describes our disaster recovery plans, including backups.

All of our infrastructure is built with Terraform, and our k8s clusters are managed by Flux. Because we follow GitOps principles, our k8s resources are "self healing": Flux continually reconciles the cluster against what is defined in Git. The underlying infrastructure, however, needs to be rebuilt in case of a disaster.

Scenarios

AWS Region outage - not tested

If the outage lasts longer than the business deems acceptable, a new environment must be spun up from scratch in a different region. This can be done in the platform-terraform repo by creating a new environment for that region, or by copying and pasting the main module definition.

AWS Account or resources deleted - tested (as part of DORA DiRT exercise)

If an AWS account or its resources are deleted, Terraform can be run in the new account to recreate the resources. The services must then be reconfigured to point to the new datastores, and the RabbitMQ users must be created in the new RabbitMQ cluster.

Database outage - tested (as part of DORA DiRT exercise)

If the database suffers a catastrophic failure and is unrecoverable, Terraform can be used to create a new one. The services' configuration must then be updated with the new hostname, and the users must be created.
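As an illustration of the "update the configuration with the new hostname" step, here is a minimal sketch that swaps only the host in a connection URL while keeping the credentials, port, database name and query options. It assumes the service stores its database address as a single URL; the `swap_db_host` helper, the `postgresql` scheme and the example hostnames are hypothetical and not taken from our services.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_db_host(url: str, new_host: str) -> str:
    """Return `url` with its hostname replaced by `new_host`.

    Hypothetical helper: credentials, port, path and query are preserved.
    """
    parts = urlsplit(url)
    # netloc looks like "user:password@host:port"; keep everything before "@".
    userinfo, separator, _old_hostport = parts.netloc.rpartition("@")
    port = f":{parts.port}" if parts.port is not None else ""
    new_netloc = f"{userinfo}{separator}{new_host}{port}"
    return urlunsplit(parts._replace(netloc=new_netloc))


if __name__ == "__main__":
    # Example values only; real hostnames come from the Terraform outputs.
    old_url = "postgresql://app:secret@old-db.example.internal:5432/orders?sslmode=require"
    print(swap_db_host(old_url, "new-db.example.internal"))
```

Changing only the host keeps the rest of the service configuration untouched, which makes the change easier to review during an incident.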

K8s Cluster unavailability - tested (as part of DORA DiRT exercise)

If the cluster becomes unavailable, build a new one side by side with the old one. This is easily accomplished with the Terraform.

DR Steps

The main mechanism for restoring an environment is the Terraform and Kubernetes resources in platform-terraform and platform-k8s respectively. To create an environment, run the platform-terraform steps below in order.

Terraform

  • Run the Networking terraform. This will recreate each environment's networking, VPCs and hosted zone configuration.
  • The nameservers in the Route53 configuration for the domain must be updated to point to the hosted zone.
  • Run the Application terraform
  • Run the Kubernetes terraform
  • Log in to the Kubernetes cluster (via EKS commands, see the EKS docs in the Kubernetes docs)
  • Verify that Flux is deployed and has synchronized the platform-k8s repository with the Kubernetes cluster
  • Re-encrypt the Backend repository with SOPS using the new KMS key
  • Verify the backend and frontend are working for the new deployment
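
Because the order of these steps matters, it can help to keep them as a checklist that is printed before anything is run. The following is a minimal sketch under stated assumptions, not our actual tooling: the `Step` class, `render_runbook` function and the directory names (`networking`, `application`, `kubernetes`) are hypothetical placeholders for wherever each Terraform configuration lives in platform-terraform. It only prints commands and does not run them. `-chdir` is Terraform's global option for running a command against another directory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    workdir: str = ""  # hypothetical Terraform directory; empty means a manual step


# Order mirrors the list above. Directory names are placeholders, not the
# real layout of platform-terraform.
DR_STEPS = [
    Step("Run the Networking terraform", "networking"),
    Step("Update the domain's nameservers to point to the hosted zone"),
    Step("Run the Application terraform", "application"),
    Step("Run the Kubernetes terraform", "kubernetes"),
    Step("Log in to the Kubernetes cluster"),
    Step("Verify that Flux has synchronized platform-k8s"),
    Step("Re-encrypt the Backend repository with SOPS using the new KMS key"),
    Step("Verify the backend and frontend"),
]


def render_runbook(steps):
    """Return the checklist as text lines, expanding Terraform steps into commands."""
    lines = []
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.description}")
        if step.workdir:
            for subcommand in ("init", "plan", "apply"):
                lines.append(f"   terraform -chdir={step.workdir} {subcommand}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_runbook(DR_STEPS)))
```

Printing the plan rather than executing it keeps a human in the loop for the manual steps (nameservers, SOPS re-encryption, verification), which cannot be safely automated from this list alone.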