This documentation is intended for internal use by the RE team.

Automate On Call & Interruptible

This document will serve as a quick reference for people on the Automate interruptible or on-call rotas. It often links to external resources for more detail and context.

Interruptible routines

A few things to do as part of the interruptible routine:

  1. Update the document that is linked from the #re-autom8 channel topic with that week’s (or day’s) interruptible person
  2. Update the topic in #reliability-eng with that week’s (or day’s) interruptible person

GOV.UK Verify

Team manual: https://verify-team-manual.cloudapps.digital/

Hub

On call

The only alert which can directly call the Automate interruptible out of hours is the Prometheus Cronitor heartbeat for the Hub production environment. This alert fires continuously by design; Cronitor raises a PagerDuty incident if the heartbeat ever stops arriving. In this way we have confidence that the alerting pipeline is working end to end.
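
As a quick sanity check, you can confirm the heartbeat alert is still firing by querying the Alertmanager v2 API. A minimal sketch, assuming a port-forwarded Alertmanager on localhost:9093 and an alert named DeadMansSwitch (both are assumptions; substitute the real values):

    # Query Alertmanager for the always-firing heartbeat alert.
    # The URL and the alert name "DeadMansSwitch" are assumptions.
    curl -s -G 'http://localhost:9093/api/v2/alerts' \
      --data-urlencode 'filter=alertname="DeadMansSwitch"' | jq '.[].labels'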

Other than that, only escalations from the Verify primary on-call team apply out of hours; Alertmanager won’t page reliability engineering directly. Occasionally a page may also come from a cyber security escalation.

Concourse deploys the Hub to ECS (one cluster per application) on EC2 (i.e. not ECS on Fargate). Reliability engineering offer out-of-hours support only for the production services. Some useful links:

Relevant gds-cli aws account names:

  • verify-prod-a
  • verify-tools
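
For a first look at the production ECS clusters without opening the console, something like the following should work (a sketch; the aws subcommands are standard, but the cluster name is illustrative):

    # List the per-application ECS clusters in the Hub production account,
    # then the services running in one of them.
    gds aws verify-prod-a -- aws ecs list-clusters
    gds aws verify-prod-a -- aws ecs list-services --cluster <cluster-name>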

Interruptible

Reliability engineers may also need to respond in hours to issues and incidents in non-production environments. Examples:

  • problems deploying the application to non-production environments (such as integration or staging)
  • machine reboots (Prometheus only; all other non-tools environment machines are routinely rebuilt by Concourse)

Some useful links:

Relevant gds-cli aws account names:

  • verify-staging
  • verify-integration-a
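
For the Prometheus reboots, a rough sketch using the standard EC2 CLI (the Name tag value is an assumption; check how the instances are actually tagged):

    # Find the Prometheus instance(s) in staging by Name tag, then reboot them.
    ids=$(gds aws verify-staging -- aws ec2 describe-instances \
      --filters 'Name=tag:Name,Values=prometheus*' \
      --query 'Reservations[].Instances[].InstanceId' --output text)
    gds aws verify-staging -- aws ec2 reboot-instances --instance-ids $ids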

DCS

On call

The Document Checking Service runs in GSP on AWS. As with the Hub, reliability engineering are not part of the primary on-call flow and so will only be called on if necessary as part of an escalation.

Some useful links:

Interruptible

The most common tasks are the UKCloud and Carrenza machine reboots. Other notable items include available disk space on packages-1, Sensu alerts (both low-side and high-side), and other general escalations from the yak team or the Verify devs.
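
For the packages-1 disk space check, something along these lines (assuming direct SSH access; the real connection may go via a jumpbox):

    # Check free disk space on packages-1; investigate anything close to full.
    ssh packages-1 'df -h'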

Useful links:

Proxy Node

On call

The proxy node is part of the eIDAS framework. It is deployed to the GSP on the Verify cluster by the in-cluster Concourse.

At the time of writing, the proxy node is not supported out of hours, except for one scenario: a security issue severe enough to warrant the invocation of the kill switch.

GSP Platform

The GSP platform supports DCS and the Proxy Node above, but there are some alerts for the platform itself.

On call

The only alert which can directly call the Automate interruptible out of hours is the Prometheus Cronitor heartbeat for the GSP cluster. As with the Hub heartbeat, this alert fires continuously by design; Cronitor raises a PagerDuty incident if the heartbeat ever stops arriving, giving us confidence that the alerting pipeline is working end to end.
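
The same kind of check as for the Hub heartbeat applies here, e.g. with amtool pointed at the cluster’s Alertmanager (the URL and the alert name are assumptions):

    # Confirm the heartbeat alert is present in the GSP cluster's Alertmanager.
    amtool alert query alertname="Heartbeat" \
      --alertmanager.url=http://localhost:9093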

Dev infrastructure

Interruptible

Other “misc” infrastructure for Verify is supported in-hours:

  • AWS tools environment, which includes Concourse, Matomo, dash, etc.
  • UKCloud components such as artifactory and jenkins

Relevant gds-cli aws account names:

  • verify-tools

Observe

Alertmanager

On call

The Observe Alertmanager deployment is used by several programmes’ services to route alerts to other systems like PagerDuty. It is part of Observe’s Prometheus service. Alertmanager is deployed to ECS Fargate by Concourse. Alertmanager instances are clustered together to handle (among other things) deduplication of downstream alerting: each Prometheus instance sends alerts to all the Alertmanager instances, and the cluster ensures only one notification goes out.
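
To confirm the instances are actually peered, the Alertmanager v2 status endpoint reports gossip-cluster membership. A sketch, assuming a reachable instance (the hostname is an assumption):

    # Show the Alertmanager cluster status and its list of peers.
    curl -s http://alertmanager.example.internal:9093/api/v2/status | jq '.cluster'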

The most probable source of any out-of-hours event is Cronitor.

Relevant gds-cli account names:

  • re-prom-prod

Interruptible

In addition to Alertmanager, the Prometheus instances need to be managed in-hours. The most common incident involving the Observe Prometheus instances is the PaaS service broker failing to remove deleted PaaS apps from its S3 store, causing Prometheus to alert on those instances being down.
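
The usual fix is to remove the stale target files from the S3 store by hand. A sketch, assuming one target file per app (the bucket and key names are assumptions; find the real bucket in the account first):

    # List the service broker's target files, then remove the one for the
    # deleted app. Bucket and key names here are illustrative only.
    gds aws re-prom-staging -- aws s3 ls s3://prometheus-targets-bucket/
    gds aws re-prom-staging -- aws s3 rm s3://prometheus-targets-bucket/deleted-app.json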

Main article: Prometheus for GDS PaaS Service

Useful links:

Relevant gds-cli account names:

  • re-prom-staging

Cyber security escalations

On call

Cyber security may escalate to reliability engineering out of hours for a number of reasons. Possible candidates:

  • proxy node or connector metadata CloudHSM communications look suspicious
  • certain AWS account access out-of-hours (such as admin)
  • some GSP cluster access patterns out-of-hours

Useful links:

Relevant gds-cli targets:

  • gsp-verify (for AWS)
  • verify (for kubectl & other GSP-related actions)
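
For the AWS-side escalations, CloudTrail is usually the first stop. A sketch of looking for recent console logins (the lookup-events call is standard CloudTrail; the account target is from the list above):

    # Show the ten most recent console logins recorded by CloudTrail.
    gds aws gsp-verify -- aws cloudtrail lookup-events \
      --lookup-attributes AttributeKey=EventName,AttributeValue=ConsoleLogin \
      --max-items 10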

Interruptible

TODO

Performance Platform

Performance Platform is supported best effort, in-hours only.

Interruptible

Main article: Performance Platform

Multi-tenant concourse

Terraform is applied manually from the tech-ops-private repository.

Useful links:

Relevant gds-cli account names:

  • techops
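
A sketch of the manual apply, assuming a local checkout of tech-ops-private (the path to the Concourse Terraform within the repository is an assumption):

    # From the directory holding the Concourse Terraform, plan and apply
    # with credentials for the techops account.
    cd tech-ops-private/terraform/deployments/concourse   # path is illustrative
    gds aws techops -- terraform init
    gds aws techops -- terraform plan -out=tfplan
    gds aws techops -- terraform apply tfplan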

AWS Account Actions

Interruptible

Main article: GDS AWS Account Management Service

Deleting AWS Accounts no longer required

If the account uses the aws-root-accounts@digital.cabinet-office.gov.uk email address, then GDS staff who have access to the safe and are members of the aws-root-accounts Google group can remove the account.

If it uses a different email address, the account can only be removed by someone with access to that email address and the associated MFA secret.