Disaster Recovery

This page describes our disaster recovery plans, including backups.

All of our infrastructure is built with Terraform, and our Kubernetes (k8s) clusters are managed by Flux. Because we follow GitOps principles, our k8s resources are "self-healing": Flux reconciles them from Git. The underlying infrastructure, however, must be rebuilt with Terraform in the event of a disaster.

Scenarios

AWS Region outage - not tested

If the outage lasts longer than the business deems acceptable, a new environment must be spun up from scratch in a different region. This is done in the platform-terraform repo, either by creating a new environment for that region or by copying the main module definition.
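The rebuild can be sketched as an ordered list of Terraform commands. This is a minimal illustration, not the actual runbook: the environment directory layout and the `region` input variable are assumptions, and only standard Terraform CLI options (`-chdir`, `-var`, `-out`) are used.

```python
def region_rebuild_commands(env_dir: str, region: str) -> list[list[str]]:
    """Return the Terraform commands, in order, to stand up one environment."""
    chdir = f"-chdir={env_dir}"
    return [
        ["terraform", chdir, "init"],
        # "region" is an assumed variable name; check the module's real inputs.
        ["terraform", chdir, "plan", "-var", f"region={region}", "-out=tfplan"],
        # Applying the saved plan ensures that what was reviewed is what runs.
        ["terraform", chdir, "apply", "tfplan"],
    ]


if __name__ == "__main__":
    # Hypothetical directory and region, for illustration only.
    for cmd in region_rebuild_commands("environments/dr", "eu-west-1"):
        print(" ".join(cmd))
```

Saving the plan to a file and applying that file keeps a review step between planning and changing anything, which matters when working under pressure in a new region.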

AWS Account or resources deleted - tested (as part of DORA DiRT exercise)

If an AWS account or its resources are deleted, Terraform can be run in the new account to recreate the resources. The services must then be reconfigured to point to the new datastores, and the RabbitMQ users must be created in the new RabbitMQ cluster.
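Recreating the RabbitMQ users could look roughly like the sketch below. It assumes `rabbitmqctl` access to the new cluster; a managed broker would need the management HTTP API instead. The user names, the vhost, and the blanket `.*` permissions are placeholders, not our actual configuration.

```python
def rabbitmq_user_commands(users: dict[str, str], vhost: str = "/") -> list[list[str]]:
    """Build rabbitmqctl commands that create each user and grant permissions."""
    commands = []
    for name, password in users.items():
        commands.append(["rabbitmqctl", "add_user", name, password])
        # Argument order is configure, write, read. ".*" grants everything and
        # is an assumption; narrow the patterns to what each service needs.
        commands.append(
            ["rabbitmqctl", "set_permissions", "-p", vhost, name, ".*", ".*", ".*"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical user; real credentials should come from a secret store.
    for cmd in rabbitmq_user_commands({"orders-service": "change-me"}):
        print(" ".join(cmd))
```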

Database outage - tested (as part of DORA DiRT exercise)

If the database suffers a catastrophic, unrecoverable failure, Terraform can be used to create a new one. The services' configuration must then be updated with the new hostname, and the database users must be recreated.
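Updating the hostname in a service's connection string can be sketched as follows. This assumes services are configured with a URL-style connection string; the `postgresql` scheme and the hostnames are illustrative only, since this page does not name the database engine.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_db_host(url: str, new_host: str) -> str:
    """Return the connection URL with only its hostname replaced."""
    parts = urlsplit(url)
    # Keep any "user:password@" prefix and the port exactly as they were.
    userinfo, sep, _old_hostport = parts.netloc.rpartition("@")
    port = f":{parts.port}" if parts.port is not None else ""
    return urlunsplit(parts._replace(netloc=f"{userinfo}{sep}{new_host}{port}"))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(swap_db_host("postgresql://app:secret@old-db.internal:5432/orders",
                       "new-db.internal"))
```

Replacing only the host component, rather than doing a plain string substitution, avoids accidentally rewriting a user name or database name that happens to contain the old hostname.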

K8s Cluster unavailability - tested (as part of DORA DiRT exercise)

If the cluster becomes unavailable, build a new one side by side with the old one; this is easily accomplished with Terraform.
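Before moving traffic to the rebuilt cluster, it helps to confirm that Flux has finished reconciling. The sketch below is one possible check, not an established part of our process: it parses the JSON printed by `kubectl get kustomizations.kustomize.toolkit.fluxcd.io --all-namespaces -o json` and lists every Kustomization whose `Ready` condition is not `True`. The function name is hypothetical.

```python
import json


def unready_kustomizations(kubectl_json: str) -> list[str]:
    """Return "namespace/name" for each Flux Kustomization that is not Ready."""
    unready = []
    for item in json.loads(kubectl_json).get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            meta = item["metadata"]
            unready.append(f"{meta['namespace']}/{meta['name']}")
    return unready
```

An empty result means every Kustomization reports `Ready`. A Kustomization with no status yet is treated as not ready, which is the safer default on a freshly built cluster.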
