This documentation is intended for internal use by the RE team.

Prometheus for GDS PaaS Service

Reliability Engineering operates a monitoring and alerting service for GDS PaaS tenants and is responsible for the support and reliability of the service.

The service is known informally as “Prometheus for GDS PaaS” and includes Prometheus, Alertmanager, Grafana, supporting infrastructure, metric exporters and user documentation.

Architecture

Architectural Drawings

Figure 1: Architecture for components hosted on AWS

Description and Notes:

  • Three instances each of Prometheus and Alertmanager are deployed across three AWS availability zones in Ireland (eu-west-1) for resilience and high availability (figure 1).
  • URLs for these instances are:
  • Each Prometheus instance has its own persistent EBS storage. Each instance is independent of the others and scrapes metrics separately.
  • The three Prometheis are not load balanced; each has its own public URL and the ALB routes requests according to the request URL (prom-1, prom-2, prom-3).
  • The ALB for Alertmanager routes traffic to the corresponding Alertmanager according to the request URL; it does not load balance the traffic. Inbound requests are restricted to office IP addresses only.
  • Alerts are configured and generated in Prometheus.
  • Each Prometheus sends its alerts to all three Alertmanagers.
  • One instance of Grafana and a Postgres database (small-9.5) are deployed on GOV.UK PaaS. Grafana uses prom-1 as a data source.
  • Configurations for Prometheus and Alertmanager are provided in YAML files and are stored in an S3 bucket.
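
The production configuration files live in that S3 bucket and are not reproduced here. Purely as an orientation sketch of the behaviour described above (hostnames, paths and intervals below are assumptions, not the real values), a Prometheus configuration of this shape looks roughly like:

    global:
      scrape_interval: 30s                       # assumed default

    alerting:
      alertmanagers:
        - scheme: https
          static_configs:
            - targets:                           # each Prometheus notifies all three Alertmanagers
                - 'alerts-1.example:443'         # placeholder hostnames
                - 'alerts-2.example:443'
                - 'alerts-3.example:443'

    rule_files:
      - /etc/prometheus/alerts/*.yml             # assumed path for alerting rules

    scrape_configs:
      - job_name: paas-targets                   # assumed job name
        scheme: https
        file_sd_configs:
          - files:
              - /etc/prometheus/targets/*.json   # target files synced from the S3 bucket (see service discovery below)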

System Boundary

Figure 2: System boundary diagram for Prometheus for GDS PaaS - interaction with external systems and services.

Integration with GDS PaaS applications

Figure 3: Interaction between PaaS tenants, Prometheus for GDS PaaS and service discovery

  • Tenants deploy Prometheus exporters on PaaS to expose container-level, app and service metrics on /metrics endpoints to be scraped.
  • Tenants create a service using the gds-prometheus service broker and bind apps to the service (see the manifest sketch after this list).
  • If tenants wish to restrict web requests with an IP safelist, they can optionally deploy the ip-safelist route service and bind application routes to it.
  • PaaS tenants can use the Prometheus GUI to query the metrics.
  • PaaS tenants can use Grafana to create dashboards for the metrics.
  • PaaS tenants can configure additional PaaS targets to be scraped for their organisations.
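
Binding can be done with the cf CLI or declared in the app manifest. A minimal manifest sketch, assuming a service instance created from the gds-prometheus broker (the app and service instance names here are hypothetical):

    applications:
      - name: my-app                       # hypothetical app name
        services:
          - my-prometheus-service          # service instance created from the gds-prometheus broker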

Service discovery

  • Service discovery allows Prometheus for GDS PaaS to discover which targets on PaaS to scrape.
  • A service broker, named “gds-prometheus”, is available to GDS PaaS tenants (they can see it via cf marketplace) and is deployed from the cf_app_discovery code base.
  • cf_app_discovery is an app written in Ruby, which is composed of two elements:
    • prometheus-service-broker: a Sinatra app that listens to calls made by the CloudFoundry service API when applications bind to or unbind from the service; and
    • prometheus-target-updater: a cron job that runs every five minutes to tell prometheus-service-broker to detect apps that have been stopped, scaled or killed
  • PaaS tenants create a service with the gds-prometheus service broker and bind their apps to the service. This registers and updates the targets to be scraped by Prometheus.
  • prometheus-service-broker writes JSON files to an S3 bucket which detail the target to monitor, the labels to apply to it, and the application guid, which is used by the instrumentation libraries to protect the /metrics endpoint on the app via basic auth (a sketch of this structure follows this list).
  • A cron job running on each Prometheus instance syncs these files to the config directory so that Prometheus can pick up the changes.
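
The broker writes these files in Prometheus's file_sd format. A hedged sketch of one entry (all values are invented for illustration; the real files are JSON, which is structurally equivalent to this YAML, and the exact label names are chosen by the broker):

    - targets:
        - my-app.cloudapps.digital                       # hypothetical app route to scrape
      labels:
        job: my-app                                      # illustrative labels only
        space: my-space
        guid: 12345678-aaaa-bbbb-cccc-1234567890ab       # app guid used for basic auth on /metrics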

AWS Nginx configuration

Nginx is set up in front of Prometheus and acts as an ingress/egress request proxy. It is composed of two elements:

paas-proxy:

A forward proxy is used for the traffic from Prometheus to PaaS, for two purposes:

  • Custom header insertion: the custom header X-CF-APP-INSTANCE, a CloudFoundry-specific header that requests a specific instance ID to scrape, is inserted into requests from Prometheus to PaaS so that Prometheus can get metrics from each instance of an app - EC2 Nginx config.

  • Bearer token: set to the CloudFoundry app guid, the bearer token is used to authorise the connection to the /metrics endpoint for metrics exporters running on PaaS - EC2 Nginx config.

auth-proxy

Basic auth is used to protect inbound access to Prometheus (EC2 Nginx config), unless the inbound requests originate from office IPs. Basic auth is needed so that Grafana, which does not have a static IP, can access Prometheus.

AWS session manager

We use AWS Session Manager to access AWS EC2 node instances via the Systems Manager console (log in to AWS first) or the CLI. We do this instead of SSHing into the nodes, so we do not need a bastion host in our architecture.

IP safelist for PaaS routes

PaaS tenants can optionally deploy an IP safelist service on PaaS. It is based on the PaaS route service, which provides a full proxy for application routes that are bound to it (for example, prometheus-metric-exporter). Tenants can use the route service to add an IP restriction layer before web requests reach their applications running on PaaS.

Logging, monitoring and alerting

The following apps and SaaS products are used for monitoring, logging and alerting for Prometheus for GDS PaaS.

Logit

We send logs generated from AWS EC2 and PaaS to Logit, which provides an ELK (Elasticsearch/Logstash/Kibana) service for storing, visualising and filtering logs.

Prometheus and Alertmanager

We use Prometheus for GDS PaaS to monitor and alert on itself. Most applications we run expose a /metrics page by default, for example Grafana, or we run additional exporters where needed, for example the node_exporter on our AWS EC2 instances.

The metrics endpoints we expose for Prometheus for GDS PaaS are summarised in the table below.

| Metric endpoint / job name | Exposed by | Description |
|---|---|---|
| Observe PaaS metrics {job="observe-paas-prometheus-exporter"} | paas-prometheus-exporter | Container-level metrics for apps and services we run in the PaaS |
| Grafana app metrics {job="grafana-paas"} | Grafana | Application-level metrics for Grafana |
| Prometheus metrics {job="prometheus"} | prom-1 on EC2 | Prometheus application metrics; also available for prom-2 and prom-3 |
| Alertmanager metrics {job="alertmanager"} | alerts-1 on ECS | Alertmanager application metrics; also available for alerts-2 and alerts-3 |
| Service broker {job="prometheus-service-broker"} | gds_metrics_ruby | Request metrics for the Prometheus service broker |
| EC2 node metrics {job="prometheus_node"} | node_exporter | VM metrics for the EC2 instances running Prometheus (the URL is not public) |

You can see available metrics from those metric endpoints either by visiting the endpoint URL (if public facing) or by querying {job="<job-name>"} in Prometheus.

Cronitor

Cronitor is a “Dead Man’s Switch” type of service for health and uptime monitoring of cron jobs. Regular “heartbeats” are sent to Cronitor to indicate uptime; Cronitor raises a PagerDuty incident if it misses the configured number of heartbeats. We use this to page us if our alerting pipeline is not working.
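
One way this is typically wired up in Alertmanager is a webhook receiver pointing at a Cronitor heartbeat endpoint. A minimal sketch only (the receiver name and URL are placeholders, and the actual delivery mechanism may differ):

    receivers:
      - name: cronitor-heartbeat
        webhook_configs:
          - url: https://cronitor.link/<monitor-id>/run    # placeholder Cronitor heartbeat URL
            send_resolved: false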

Zendesk and Pagerduty

Zendesk is used to receive non-interrupting alerts (tickets) and PagerDuty to receive interrupting alerts (pages). Alert priority is defined in the Prometheus alert itself, and Alertmanager routes the tickets and pages to the appropriate service. The alerting actions and procedures are defined in Zendesk and PagerDuty. Refer to gds-way for information on managing alerts.
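
As a hedged illustration of how this routing is typically expressed (the receiver names, label values and delivery mechanisms below are assumptions, not the production configuration), an Alertmanager route keyed on a severity label might look like:

    route:
      receiver: zendesk-ticket                 # assumed default: non-interrupting alerts become tickets
      routes:
        - match:
            severity: page                     # priority label set on the Prometheus alert itself
          receiver: pagerduty-page

    receivers:
      - name: pagerduty-page
        pagerduty_configs:
          - service_key: <pagerduty-integration-key>   # placeholder
      - name: zendesk-ticket
        email_configs:
          - to: <zendesk-intake-address>               # placeholder; assumes tickets are raised by email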

Repositories

Infrastructure, service discovery and secrets

| Repositories | Description |
|---|---|
| prometheus-aws-configuration-beta | Terraform configuration to run Prometheus, Alertmanager and nginx on AWS EC2 and ECS, with supporting infrastructure such as S3. |
| re-secrets | Contains secrets used for Prometheus for GDS PaaS. |
| cf_app_discovery | Cloud Foundry service broker (“gds-prometheus”) that acts as a service discovery agent and updates the list of target apps to be scraped by Prometheus for GDS PaaS. Tenants bind their apps to the service to be discovered by Prometheus for GDS PaaS. |
| grafana-paas | Grafana configured to be deployed to GOV.UK PaaS. |

Metric exporters for Prometheus

| Repositories | Description |
|---|---|
| paas-prometheus-exporter | Exposes container-level app metrics and some backing service metrics for the org that this exporter has read access to. It reads the metrics from the PaaS Doppler component. |
| gds_metrics_dropwizard | Exposes app metrics for Dropwizard-based apps. |
| gds_metrics_python | Exposes app metrics for Python-based apps. |
| gds_metrics_ruby | Exposes app metrics for Ruby-based apps. |

IP Safelist proxy service for PaaS services

| Repositories | Description |
|---|---|
| re-paas-ip-safelist-service | Cloud Foundry route service (an nginx) that implements an IP safelist (whitelist) so that only Prometheus and GDS office IPs can access /metrics endpoints. |

Documentation

| Repositories | Description |
|---|---|
| re-team-manual | Team manual for internal use, including but not limited to team rituals, incident process and runbooks. |
| reliability-engineering | The team maintains the metrics and logging section of the reliability engineering manual. |

Access infrastructure

Access to AWS

Our AWS account ids are as follows.

Production and Staging Stacks

re-prometheus-production = 455214962221

re-prometheus-staging = 027317422673

example .aws/config

[profile re-prometheus-staging]
region = eu-west-1
source_profile=fredbloggs
role_arn=arn:aws:iam::027317422673:role/Administrator
mfa_serial=arn:aws:iam::1234567890123:mfa/fred.bloggs@digital.cabinet-office.gov.uk

[profile re-prometheus-production]
region = eu-west-1
source_profile=fredbloggs
role_arn=arn:aws:iam::455214962221:role/Administrator
mfa_serial=arn:aws:iam::1234567890123:mfa/fred.bloggs@digital.cabinet-office.gov.uk

AWS EC2 Access

Access to our EC2 instances is detailed within EC2 Access.

Access to PaaS

PaaS information can be found at PaaS.

  • The spaces that are relevant to Observe are:
    • prometheus-grafana (production grafana)
    • prometheus-production
    • prometheus-staging

Access to our secrets

Our secrets and passwords are stored in GitHub. Our specific secrets are in re-secrets/observe.

Access restrictions

Prometheus is accessible from safelisted office IPs and otherwise falls back to basic auth. Basic auth details are held in re-secrets.

Alertmanager is accessible from safelisted office IPs only and does not fall back to basic auth.

The office IPs can be found in GDS Internal IP Network.

Support

Support hours

The Automate team offers in-hours support, 9am to 6pm, Monday to Friday. Any issues out of hours will be recorded and handled during work hours.

Support process and tasks

  • Keep interruptible documentation up to date
  • Respond to PagerDuty alerts
  • Support users on the #re-prometheus-support and #reliability-eng slack channels
  • Report any problems on the #re-prometheus-support and #reliability-eng slack channels with regular updates
  • Check emails for Logit and PagerDuty status updates
  • Check on the status of the Prometheus service
  • Check Zendesk queue for tickets
  • Triage issues and bugs
  • Initiate the incident process

Keep interruptible documentation up to date

Keep the documentation for how to support this service up to date.

Respond to PagerDuty alerts

PagerDuty is configured to ring the Interruptible phone when an alert is triggered. PagerDuty alerts should be acknowledged and investigated.

Support users on the #re-prometheus-support and #reliability-eng slack channels

Users of the monitoring service can request help on the #re-prometheus-support and #reliability-eng slack channels. As the interruptible person you should monitor these channels and engage with users:

  • help solve the user's problem if it is simple
  • triage the user's problem if it is an issue or bug
  • keep the user updated with progress

Check emails for Logit and PagerDuty status updates

  • if there is a status update that affects us, inform #reliability-eng on Slack and send a message to the reliability engineering announce email group

Check on the status of the Prometheus service

Periodically:

  • Monitor the Prometheus benchmark (beta) dashboard on Grafana
  • Check the Prometheus targets on the active Prometheus dashboard

Triage issues and bugs

If you spot what may be an issue or bug then investigate following the triage process.

Triage process

One of the goals is to capture which tasks are being performed whilst on the interruptible duty.

It is important that tasks are recorded in Trello cards so that we understand what tasks are being performed and how long they take; the card should have the label Interruptible. This information is important so that we can feed back and improve the process.

When triaging an issue you should take some time to ask the following questions:

  • is someone else already looking at the issue?
    • slack the #re-autom8 channel, ask the team and look at existing Trello cards.
  • what impact is it having on tenants?
    • High - it affects their services, e.g. causes problems with deployments or affects the performance of their apps.
    • Mid - it impacts their metrics collection, e.g. unexpected gaps in metrics, odd values, or loss of historical metrics.
    • Low - it causes problems in viewing metrics on Grafana, but metrics are still being collected and stored.
  • how long will the issue take to resolve?
    • get an estimate from the person who is working on resolving the issue.
    • update the tenant(s) affected with progress.

Ideally you will not need to spend more than 30 minutes finding the answers.

If it is a new issue, and no one else is aware of it then create a Trello card adding the details you have found and add the appropriate label.

Talk to the team and decide who is going to be responsible for fixing the issue.

Incident process

  • Identify if our users are being affected
  • Inform our users on #re-prometheus-support for Prometheus, #reliability-eng for Logit or team specific channels for issues only affecting a single team.
  • If an incident has not already been created on PagerDuty then create one.
  • Triage and technical investigation.
  • Gather and preserve evidence.
  • Resolution, update users that the issue has been resolved.
  • Closure, organise a team incident review.

Reliability

User expectations

The most important events for our users are:

  • their metrics can be successfully scraped by Prometheus
  • they can access their dashboards in Grafana
  • alerts are delivered to their receivers

This is the minimum so our users can be alerted to problems with their system and debug/monitor them.

We also consider the following important but not as critical for our users:

  • access to the Prometheus user interface (most functionality is available in Grafana)
  • access to the Alertmanager user interface (currently rarely used by our users)
  • alerting rules quickly reviewed and deployed to Prometheus (this is a rare process with little evidence seen so far that quick review and deployment of alerts is essential for users)

Note: these lists may not be exhaustive and are expected to change as our system develops.

Service Level Indicators and Objectives (SLIs and SLOs)

We measure, monitor and alert on our most important user events using SLIs and SLOs. Our SLIs are defined and measured on our Grafana SLI dashboard.

We still need to define and implement SLIs for all of our most important user events (see above). We still need to define SLOs for our SLIs and set up corresponding monitoring and alerting. Until these are done we may not find out if we are not meeting our users' expectations of our service levels.

Alerts

AlwaysAlert

Alert delivery is one of the main things our users rely on. The purpose of this alert is to provide confidence that an alert that fires in Prometheus will be sent from Alertmanager, by using a dead man's switch. This alert is configured to always be firing (so it will appear red in Prometheus and as alerting in Alertmanager). The alerts are sent to Cronitor, our external monitoring service. If Cronitor has not received an alert from our Alertmanagers for 10 minutes then an alert is raised via PagerDuty.
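
For reference, a dead man's switch rule of this kind is usually just an expression that is always true. A minimal sketch (the labels and annotation are assumptions, not the exact production rule):

    groups:
      - name: deadmans-switch
        rules:
          - alert: AlwaysAlert
            expr: vector(1)                 # always true, so the alert fires permanently
            labels:
              severity: constant            # assumed label used to route this alert to the Cronitor receiver
            annotations:
              message: Dead man's switch to prove the Prometheus to Alertmanager to Cronitor pipeline is working.

If the Cronitor page fires, work through the following checks: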

  1. Check using the AWS console that the Prometheus instances are running in EC2.
  2. Check that the alert is firing: Production or Staging.
  3. Check using the AWS console that there are a sufficient number of running ECS instances (the Auto Scaling Group is self healing).
  4. Check that the alert has arrived at the Alertmanager: Production or Staging.
  5. If the above are all working, slack the #re-autom8 channel and ask the team to check Cronitor.

RE_Observe_AlertManager_Below_Threshold

The current number of Alertmanagers running in production has gone below two.

  1. Check using the AWS console that there are a sufficient number of running ECS instances (the Auto Scaling Group is self healing).
  2. Check using the AWS console if the ECS Alertmanager tasks are trying to start and are failing to do so.
  3. Check the ECS logs for the Alertmanager services - these can be found in the ECS console.

RE_Observe_No_FileSd_Targets

Prometheus has no targets via file service discovery for the GOV.UK PaaS.

Check the govukobserve-targets-production S3 targets bucket in the gds-prometheus-production AWS account to ensure that the targets exist in the bucket.

If there are files in the targets bucket then:

  1. Check the Prometheus logs for errors.
  2. SSH onto Prometheus and check if the target files exist on the instance.

If there are no files in the targets bucket then:

  1. Check the service broker logs for errors.
  2. Check the prometheus-service-broker and prometheus-target-updater are running by logging into the PaaS prometheus-production space.

RE_Observe_Prometheus_Below_Threshold

The current number of Prometheis running in production has gone below two.

  1. Check the status of the Prometheus instances in EC2.
  2. Check the Prometheus logs for errors.

RE_Observe_Prometheus_Disk_Predicted_To_Fill

The available disk space on the /mnt EBS volume is predicted to reach 0GB within 72 hours.

Look at Grafana for the volume’s disk usage or the raw data in Prometheus. This will show the current available disk space.

Increase the EBS volume size (base the increase on the current growth rate shown in the Prometheus dashboard) in the RE Observe Prometheus Terraform repository and then run terraform apply. When each instance is available, SSH into it and run sudo resize2fs /dev/xvdh so that the file system picks up the new disk space.

RE_Observe_Prometheus_High_Load

Prometheus query engine timing is above the expected threshold. It indicates Prometheus may be beginning to struggle with the current load. This may be caused by:

  • too many queries being run against it
  • queries being run which are too resource intensive as they query over too many metrics or too long a time period
  • an increase in the number of metrics being scraped causing existing queries to be too resource intensive

Queries can originate from a Grafana instance, alerting or recording rules, or be manually run by a user.

If this issue occurs please notify and discuss with the team.
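
The precise expression and threshold live in the alerting rules, not in this manual. Purely to illustrate the shape of such a rule based on the query engine timing described above (every value below is an assumption):

    groups:
      - name: observe-self-monitoring          # illustrative only, not the production rule
        rules:
          - alert: RE_Observe_Prometheus_High_Load
            expr: prometheus_engine_query_duration_seconds{quantile="0.9"} > 1   # assumed metric and threshold
            for: 10m
            labels:
              severity: ticket                 # assumed severity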

RE_Observe_Prometheus_Over_Capacity

Prometheus query engine timing is above the expected threshold. It indicates Prometheus cannot cope with the load and is critically over capacity. This may be caused by:

  • too many queries being run against it
  • queries being run which are too resource intensive as they query over too many metrics or too long a time period
  • an increase in the number of metrics being scraped causing existing queries to be too resource intensive

Queries can originate from a Grafana instance, alerting or recording rules, or be manually run by a user.

If this issue occurs please notify and discuss with the team.

RE_Observe_Target_Down

There is a Prometheus target that has been marked as down for 24 hours.

This alert is used as a catch all to identify failing targets that may have no related alert (of which there are several).

You should identify who is responsible for the target and check their alerting rules to see if they would have been notified of this. If they would not have received an alert because they do not have one set up then you should contact them.

If the target is a leftover test app deployed by ourselves, check with the team; we may delete the application if it is no longer needed, or unbind the service from the app, either manually or by removing the service from the application manifest.

We have also seen a potential bug with our PaaS service discovery leaving targets for blue-green deployed apps even after the old (also known as venerable) application has been deleted. If this is the case we should try and identify what caused it. If we can’t figure out why, manually delete the file from the govukobserve-targets-production bucket.

RE_Observe_Grafana_Down

The Grafana endpoint hasn’t been successfully scraped for over 5 minutes. This may be caused by:

  1. A deploy taking longer than expected.
  2. An issue with the PaaS.

Check:

  1. Check with the team to see if there is a current deploy happening.
  2. Check the non-2xx Grafana logs.

Runbook

There is a problem with the monitoring service (Prometheus or Alert Manager)

  • check if the services are available
  • check if there are any deployments in progress
  • check that Grafana is pointing to a live Prometheus service by looking at the data sources under configuration.
  • check the health of the ECS cluster to make sure that the services are running in each AZ.

Escalate the issue to the rest of the team if you are unable to track down the problem.

If the issues are not affecting services (Users are able to continue to use the service without any disruption) then follow the triage process.

There is a problem with one of the Prometheus tenants

Put a message in slack: #re-prometheus-support and speak to someone in the team who is responsible for the service which has a problem.

Adding and editing Grafana permissions

If a user requests a change in Grafana permissions, for example so that they can edit a team dashboard, then you should add that user to the relevant Grafana team and ensure that the team has admin permissions for their team folder.

Do not change a user’s overall permissions found in Configuration > Users - this should remain as ‘Viewer’ for all users who are not part of the RE Observe team.

Rotate basic auth credentials for Prometheus for PaaS

  1. Create a new password, for example using openssl rand -base64 12

  2. Save the plaintext password in the re-secrets store under re-secrets/observe/prometheus-basic-auth.

  3. Hash the password:

    # start a throwaway Ubuntu container; mkpasswd is provided by the whois package
    docker run -ti ubuntu
    apt-get update
    apt-get -y install whois
    # hash the plaintext password with SHA-512 (paste the password when prompted)
    mkpasswd -m sha-512

  4. Create the htpasswd entry by prefixing the hashed password with grafana: (i.e. grafana:<your-hashed-password>) and save this in the re-secrets store under re-secrets/observe/prometheus-basic-auth-htpasswd.

  5. Deploy Prometheus to staging. As this deploy changes the cloud.conf for our instances, you may need to follow steps in the Prometheus README to deploy with zero downtime.

  6. Update the basic auth password for the Prometheus staging data source in Grafana. You will need to do this for every Grafana organisation.

  7. Repeat step 5 for production. Note: as soon as this has been deployed to the main Prometheus that Grafana uses as a data source, our users' dashboards will start breaking as they will still be using the old credentials.

  8. Repeat step 6 for production.

  9. Let users know via the #re-prometheus-support Slack channel that they may need to refresh any Grafana dashboards they have open to use the new basic auth credentials.

Architecture history

The major development milestones are summarised as follows:

Year/Quarter: 2018/Q1

Alpha (previous docs)

  • Self-hosted and configured a Prometheus instance on AWS EC2
  • Deployed nginx auth-proxy and paas-proxy on the same EC2 machines
  • Developed exporters to expose app and service metrics to be scraped by Prometheus
  • Developed the PaaS service broker for the exporters so that PaaS tenants can export their metrics to Prometheus

Year/Quarter: 2018/Q2-3

Beta (previous docs)

  • Deployed three instances of Prometheus on AWS ECS
  • Deployed three instances of Alertmanager on AWS ECS
  • Deployed one instance of Grafana on GOV.UK PaaS
  • Configured metrics and logs monitoring for the service
  • Later migrated the Prometheus and nginx processes from ECS to EC2
  • Successfully tested two instances of Alertmanager running on the new Kubernetes platform
  • Started migrating the nginx auth-proxy and paas-proxy back from ECS to EC2

Year/Quarter: 2019/Q4

  • Fixed meshing across the Alertmanagers
  • Migrated Alertmanagers to Fargate
  • Made the Alertmanagers continuously deployed via Concourse

Architectural decisions

1. Environments

Date: 2018-04-16

Status

Accepted

Context

We want to have separate environments for running our software at different stages of release. This will be used to provide a location where changes can be tested without impacting production work and our users.

Decision

We have decided to have N+2 separate environments: development, staging and production. In development, we can create as many separate stacks as we want. The staging environment will be where the tools team will test their changes. The production environment will run all of our users' monitoring and metrics and poll each of their environments.

Any code can be deployed to development environments. Only code on the master branch can be deployed to staging and production.

Consequences

Keeping the environments separate reduces the chance of a core change impacting our users and will allow us to test aspects of our system such as handling load, failover testing and new, possibly behaviour-changing, version upgrades. Users will test their Prometheus-related changes in our production environment and will not have access to the staging environment.

2. Configuration Management

Date: 2018-04-18

Status

No longer relevant.

This decision related to the alpha code (in alphagov/prometheus-aws-configuration), which is no longer being actively developed.

Context

We have the requirement of adding some resources to the base cloud instances. We currently do this via the cloud.conf system. This presents us with some limitations, such as configuration being limited to 16kb, duplication in each instance terraform and a lack of fast feedback testing.

Decision

We have decided to move away from cloud.conf as much as possible and instead use it to instantiate a masterless puppet agent which will manage the resources.

Consequences

This change firstly brings us in line with the GDS Way, and most of the programmes, in our selection of tooling. It removes the 16kb configuration limit and allows the reuse of existing testing tools. By running in masterless mode we remove the need to run a puppet master and related infrastructure. Our puppet manifests can be reused both within tools and possibly by other programmes.

It’s worth noting we will still need a basic cloud.conf file to install and run the puppet agent but this will be minimal and reusable in each of our terraform projects. There is also the risk that people will put more in the puppet code than they should. This will be remediated via review of architecture and code.

3. Use Amazon ECS for initial beta build-out

Date: 2018-06-21

(although note that this ADR was written post hoc, a couple of months after the decision was made)

Status

Superseded.

We have now migrated prometheus off ECS and on to EC2. The rest is covered in ADR #12.

Context

Existing self-hosted infrastructure at GDS has been managed in code using tools like puppet, but in a somewhat ad hoc way with each team doing things differently, little sharing of code, and much reinvention of wheels. We would like to learn about other ways of deploying infrastructure which encourage consistency: in terms of code artifacts, configuration methods, and such like.

Systems such as Kubernetes and Amazon ECS are coalescing around Docker as a standard for packaging software and managing configuration.

Decision

We will build our initial prometheus beta in Amazon ECS, and assess how effective it is. We will review this decision once we have learnt more about both prometheus and ECS.

Consequences

There may be ways in which Prometheus’s opinions clash with ECS’s opinions. For example, Prometheus by default uses local disk for storing state, which may mean that we need to pin the prometheus container to a single underlying instance so that it always gets the same local disk.

4. Cloudwatch Metrics

Date: 2018-06-28

Status

Accepted.

Context

We wanted to gather metrics from our own infrastructure so that we can drive alerts when things go wrong. Amazon exposes platform-level metrics via CloudWatch. Prometheus provides a cloudwatch_exporter which makes these metrics available to prometheus.

We had a spike to see if we could use the CloudWatch exporter to get CloudWatch metrics and trigger alerts. After getting it to work with a number of metrics we were able to start estimating costs. From these estimates we had concerns about the high cost of running it, due to the number of metrics requested via the CloudWatch API.

Each hit on the /metrics endpoint triggers the exporter to retrieve metrics from the CloudWatch API. Every CloudWatch metric that you request costs money (regardless of whether you ask for 100 metrics using one metric per API call or 100 metrics in a single API call).

cost = number of metrics on /metrics page x number of scrapes an hour x number of hours in a year x price of requesting a metric using the API x number of prometheis running across all our environments.

Based on a simple assumption of 100 metrics requested, being scraped once every 10 minutes (6 per hour), 3 Prometheis in production, 3 Prometheis in staging and 4 in dev accounts, it would work out:

100 x 6 x 24 x 365 x $0.00001 x 10 = $525.6 per year

However, if we wished to scrape at the same rate as a normal target, e.g. every 30 seconds, that would become roughly $10,500 per year.

100 metrics also appears to be unlikely. Based on asking for just these ALB and EBS metrics:

ApplicationELB/RequestCount
ApplicationELB/TargetResponseTime
ApplicationELB/HTTPCode_ELB_5XX_Count
EBS/VolumeWriteBytes
EBS/VolumeReadBytes
EBS/VolumeReadOps
EBS/VolumeWriteOps

we appear to be requesting roughly 4000 metrics per scrape. If we scraped our current config once every 10 minutes we would end up paying roughly $21,000 a year. If we scraped it every 30 seconds it would be about $420,000.

By requesting only ALB metrics in the dev accounts, we still produce about 160 API requests according to the cloudwatch_requests_total counter for each scrape. Somewhat strangely, this only returns about 30 timeseries so we are not sure if our config is incorrect and if the number of API calls could be reduced to closer match the number of timeseries.

Note: as the dev accounts have lots of resources (e.g. volumes), there may be fewer metrics requested for staging and production, since unlike the dev account we wouldn't be exporting metrics for other stacks.

It takes a long time to get a response from the /metrics endpoint as the exporter needs to make many API calls. This causes slow response times, which our Prometheus scrape config needs to allow for using the scrape_timeout setting.
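
For illustration only (the job name, interval and timeout below are assumptions, and we decided not to run this exporter), the scrape config would need something along these lines:

    scrape_configs:
      - job_name: cloudwatch
        scrape_interval: 10m
        scrape_timeout: 60s                       # allow for the exporter's slow CloudWatch API calls
        static_configs:
          - targets:
              - 'cloudwatch-exporter:9106'        # hypothetical exporter address (9106 is the exporter's default port)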

We found the CloudWatch exporter app to be very CPU intensive: we had to double our standard instance size to an m4.2xlarge to get it to work. We were also concerned about having such a CPU-intensive task running on the same instance as Prometheus.

Because we fetch config from S3 using a sidecar container, there is a race condition between fetching the config and starting cloudwatch_exporter. We found that this condition was encountered every time, meaning either a second restart of the task was needed or we would need to add a delay to the exporter starting. We did not try to solve this.

The exporter task is slow to start up, so the health check needs longer than usual to pass.

We spotted that several targets we are scraping, such as prometheus and alertmanager, are set at very short scrape intervals (every 5s); this seems excessive and we can likely change them to every 30 seconds regardless of this story.

There is roughly a 15 minute delay in Cloudwatch metrics. The Prometheus CloudWatch exporter README explains it:

CloudWatch has been observed to sometimes take minutes for reported values to converge. The default delay_seconds will result in data that is at least 10 minutes old being requested to mitigate this.

Decision

We will not use the cloudwatch_exporter to gather Cloudwatch metrics into prometheus.

Consequences

We will not have alerts for EBS volumes not being attached to the instances, which was a concern as Prometheus would start but no metrics would be stored.

This has, however, been fixed in this commit, which causes the user data script to exit with failure when the volume is not attached.

No alert is raised but the failure is no longer silent as Prometheus will no longer run without an attached EBS volume.

We also don’t have metrics available for ALBs and other parts of our AWS infrastructure.

We need to explore other solutions to get these metrics that are more cost-efficient.

5. Give users access to Prometheus and Alertmanager

Date: 2018-08-08

Status

Accepted.

Context

Our users need an easier way to write alerts. They currently have no easy way to test queries before writing their alerts.

In principle, they could use Prometheus’s expression browser for this, but our Prometheus service requires basic auth, which locks our users out of it.

Decision

We will give our users access to Prometheus so they can use its expression browser to test queries when writing alerts. We will do this by using IP whitelisting instead of basic auth and only allowing our office IP addresses.

We identified a number of different routes that we could have taken to allow our users to access Prometheus. One possible route that we considered was using Oauth2 authentication. This would enable users to authenticate themselves to the platform with their Google account.

We did not choose this option this time, for expediency: the idea was to deliver value to the user as fast as possible. This method also enables us to learn more about users' usage patterns. We do intend to add authentication, but this will be done at a later date.

Consequences

Anyone in the office can access Prometheus and Alertmanager. This means that they can use the expression browser to dynamically test and create queries with a fast feedback loop.

With IP whitelisting, we are not able to track our users.

6. Model for deploying prometheus to private networks in AWS

Date: 2018-07-26

Status

Draft

Context

We are looking to offer prometheus to non-PaaS teams. The infrastructure to be monitored will be run by another team (called “client team” in this document), but we will provide one or more prometheus servers which will be responsible for gathering metrics from the underlying infrastructure.

Longer term, we are aiming to provide prometheus as a service to multiple environments across multiple programmes.

Non-suitability of existing infrastructure

Our existing prometheus infrastructure (for PaaS teams) works by using our service broker to generate file_sd_configs which prometheus then uses to scrape PaaS apps over the public internet.

This approach won’t work for non-PaaS teams, because it’s based on an assumption – that every app is directly routable from the public internet – that doesn’t hold in non-PaaS environments. Instead, apps live on private networks and public access is controlled by firewalls and load balancers.

Main problem to be solved: scraping apps on private networks

As the previous section explained, our main problem is that we want a prometheus (provided by us) to be able to scrape apps and other endpoints (owned by the client team, and living on a private network). We want to do this in a way which doesn’t require clients to unnecessarily expose metrics endpoints to the public internet.

Some other things we would like to be able to do are:

  • maintain prometheus at a single common version, by upgrading prometheus across our whole estate
  • update prometheus configuration without having to restart prometheus
  • allow client teams to provide configuration (for example, for alert rules)
  • perform ec2 service discovery by querying the EC2 API (for which we need permissions to read client account EC2 resources)
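
For context, the EC2 service discovery mentioned in the last bullet is configured in Prometheus roughly as follows. This is a sketch only; the region, role ARN and relabelling are assumptions, and cross-account access would use an IAM role such as the one discussed below:

    scrape_configs:
      - job_name: client-team-ec2
        ec2_sd_configs:
          - region: eu-west-1
            role_arn: arn:aws:iam::123456789012:role/prometheus-ec2-discovery   # hypothetical role in the client account
        relabel_configs:
          - source_labels: [__meta_ec2_private_ip]
            regex: '(.*)'
            replacement: '${1}:9100'             # e.g. scrape node_exporter on the discovered instance's private IP
            target_label: __address__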

There are several ways we might provide a prometheus pattern that allows us to scrape private endpoints:

  • provide an artefact to be deployed by the client team
  • client team provides IAM access to us and we deploy prometheus ourselves, within their VPC
  • we build in our own infrastructure and use VPC peering to access client team’s private networks
  • we build in our own infrastructure and use VPC Endpoint Services as a way to get prometheus to access client team’s private networks

Provide an artefact to be deployed by the client team

The artefact we provide could take several forms:

  • an AMI (amazon machine image) which the client team deploys as an EC2 instance
  • a terraform module which the client team includes in their own terraform project

This model has the downside that it doesn’t allow us to maintain prometheus at a single common version, because we are at the mercy of client teams’ deploy cadences to ensure things get upgraded.

Client team provides IAM access so that we can deploy prometheus ourselves

In this model, the client team would create an IAM Role and allow us to assume it, so that we can build our own prometheus infrastructure within their VPC.

This would mean that the client team needs to do some work to provide us with the correct IAM policies so that we can do what we need to do, without giving us more capability than they feel comfortable with.

The client team would have visibility over what we had built, and would be able to see it in their own AWS console. However, they would likely not have ssh access to the instance itself.

One possible issue with this model is that we’re reliant on the client to provide us with a CIDR range to deploy onto, and depending on their existing estate, private IPs may be in short supply.

You’re also dependent on their networking arrangements. You will need to ask questions like:

  • how can you get in and out of the network?
  • do you need to download packages to install? how will that work?
  • do you need to download security updates? how will that work?

The answers to these questions may be different for different client teams. This means that our prometheus pattern needs to be flexible enough to cope with these differences, which will take extra engineering effort.

We would need to work out what the integration point between our teams would be. This could be:

  • terraform outputs that appear in a remote state file
    • or stereotyped resource names/tags which can be matched using data sources

Whether or not we go with this option for deploying prometheus, if we want to do ec2 service discovery (described above), prometheus will need some sort of IAM access into the client team account anyway.

Use VPC Peering to provide access for prometheus to scrape target infrastructure

VPC Peering is a feature which allows you to establish a networking connection between two VPCs. In particular, this is a peering arrangement, which means that the two networks on either side of it cannot have overlapping IP address ranges.

Crucially for us, the VPCs can be in different AWS accounts.

This means that we could build Prometheus in an account we own, and it could access client team infrastructure over the peering connection. Running our own infrastructure in our own account without being dependent on client teams providing anything to us would make for smoother operations and deployments for us.

This has a drawback for the client in that it adds extra points of ingress and egress for traffic in the client networks. This increases the attack surface of the network which makes doing a security audit harder and makes it harder to have confidence in the security of the system.

There’s also a drawback in terms of the combination of connections: as RE builds more services that might be provided to client teams, and as we extend these services to more client teams, we end up with N*M VPC peering connections to maintain.

As RE (and techops more broadly) provides more services, client teams end up having to consider more VPC Peering connections in their security analyses, and this doesn’t feel like a particularly scalable way for us to offer services to client teams.

Finally, we believe that VPC Peering is something that has to be accepted manually on the receiving account console (at least when peering across AWS accounts), which compounds the scaling problem.

There are two sub cases worth exploring here:

A single prometheus VPC peers with multiple client VPCs

In this model, we would build a single prometheus service in a VPC owned by us, and it would have VPC peering arrangements with multiple client team VPCs in order to scrape metrics from each of them.

This has the benefit that we can run fewer instances to scrape metrics from the same number of environments.

This has some drawbacks:

  • the single VPC becomes more privileged, because it has access to more environments. This means that a compromise of the prometheus VPC could lead to a compromise of more clients’ VPCs.

A single prometheus VPC peers with only a single client VPC

In this scenario, we would build a single prometheus in its own VPC for each client team VPC we offer the service to.

This avoids some of the drawbacks of the previous case, in that the prometheus doesn’t have privileges to access multiple separate VPCs.

Use VPC Endpoint Services to access scrape targets

This is a similar idea to VPC Peering. VPC Endpoint Services (aka AWS PrivateLink) provides a way to provide services to a VPC, again potentially in another account.

This allows you to make a single Network Load Balancer (NLB) appear directly available (ie without going through a NAT Gateway) in another VPC.

In the case of Prometheus, because it has a pull model, it seems likely that the only way we could make this work would be by having the client team provide the endpoint service and prometheus consume it. This would mean the client team would need to add a routing layer (possibly path- or vhost-based routing, possibly using an ALB) to distribute scrape requests to individual scrape targets.

This has the following advantages:

  • the IP address spaces in the prometheus VPC and the client team VPC are completely independent

However it has some drawbacks:

  • it feels like we’re not using Endpoint Services in a designed use case. In other words, it feels like a bit of a hack.
  • Prometheus is designed to be close to its targets, so that there are fewer things to go wrong and prevent scraping. The more layers of routing between prometheus and its scrape targets, the more chance we’ll lose metrics during an outage, exactly when we need them.

Decision

  • Deploy Prometheus into the Verify Performance environment to test the concept
  • Deploy an instance of Prometheus into the client team's VPC
  • Use Ubuntu 18.04 as the distribution for the instance (ADR-7)
  • Use cloud init to configure Prometheus (ADR-9)
  • Use Verify infrastructure initially for the PoC (ADR-8, ADR-10)

Consequences

We haven’t yet solved the problem of how client teams provide alert configuration to prometheus.

For the moment, it is acceptable to define configuration in cloud-init, but this does not meet our need to update configuration without rebuilding instances so we will need to revise this in future.

7. Use Ubuntu 18.04

Date: 2018-08-15

Status

Accepted

Context

Prometheus needs to run on a Linux distribution. The choice of distribution impacts the methods of packaging, deploying and configuring Prometheus. It is important to consider what is currently used across GDS, as prior experience of a distribution is helpful when supporting the service. Ubuntu is currently in use in Verify and GOV.UK (and possibly others), which provides a pool of operators with the experience to support the service. The alternative considered was Amazon Linux; its advantage is that it is optimised for the Amazon cloud, but this did not outweigh the advantage of familiarity across GDS Reliability Engineering.

Decision

Use Ubuntu 18.04 LTS

Consequences

Ubuntu provides security updates and support for 18.04 LTS until 2023, reducing the support burden. The use of Ubuntu matches other programmes' choices.

8. Use of Verify Egress Proxies

Date: 2018-08-15

Status

Accepted

Context

Verify employs egress proxies to control access to external resources. These are a security measure to help prevent data from being exfiltrated from within Verify. The Prometheus server will need access to external resources, notably an Ubuntu APT mirror during the bootstrap process. The Prometheus server should not set up its own routes that bypass the egress proxy (i.e. use a NAT gateway or Elastic IP), as this would potentially open up a route for data exfiltration.

Decision

The Prometheus server should use Verify’s egress proxies and choice of APT mirror.

Consequences

This creates a binding to Verify’s infrastructure; this should be considered a temporary compromise until the Prometheus server is built as a machine image.

9. Use Cloud Init to build Prometheus Server

Date: 2018-08-15

Status

Accepted

Context

The Prometheus server needs to be built in a reproducible way within AWS. Reproducible in this context means that the server can be built, destroyed and rebuilt; the rebuilt server will be identical to the original server, with no external intervention required (i.e. no logging into the server to make changes to configuration).

Decision

Cloud init will be used to build a reproducible server.

Consequences

Cloud init was chosen over other strategies, such as creating a machine image, because there is prior art for building a Prometheus server with cloud init, and building machine images requires additional tools. It was felt that cloud init would be the fastest way to achieve the short-term goals. The use of cloud init should be reviewed at the earliest opportunity.

10. Packaging Node Exporter as .deb for Verify

Date: 2018-08-15

Status

Accepted

Context

Node Exporter needs to be installed on Verify’s infrastructure so that machine metrics can be gathered. Verify runs Ubuntu Trusty which does not have an existing node exporter package. Verify has an existing workflow for packaging binaries which can be leveraged to package node exporter.

Decision

Node exporter will be packaged as a deb using FPM, following Verify’s existing packaging workflow.

Consequences

The use of Verify’s infrastructure ties the node exporter package to Verify; the node exporter would need to be repackaged for other programmes to be able to use it.

11. SLI for how reliably do we deliver pages

Date: 2018-10-25

Status

Accepted

Context

https://trello.com/c/56qyWJ60/675-show-an-sli-how-reliably-do-we-deliver-pages

We wish to have a service level indicator (SLI) for how reliably we deliver pages. We believe the way to measure this is to calculate, for a given time period:

the number of incidents created in pagerduty / the number of incidents that we expect to have been created in pagerduty

Calculating the number of incidents created in pagerduty

We think we can work out how many incidents there have been within a given timeframe using the PagerDuty API. We have done this successfully for our own PagerDuty account using a few lines of Ruby code. We would need access to every other team's account in order to know about all incidents, not just the ones for our team; we did not spend time trying to do this. We would also need to run an exporter to export this information from PagerDuty so that Prometheus could scrape it.

Calculating the number of incidents that we expect to have been created in pagerduty

The main source of information is the ALERTS metric in Prometheus but there are a few problems with this.

Problem 1

At the moment we don’t have a way of identifying which alerts are tickets and which are pages. Some metrics include this information using labels, but not all do. We could solve this, if needed, by adding severity labels to all alerts and adding documentation so that our users would also do this.

Problem 2

Prometheus doesn’t provide a metric for how many incidents we believe should have been created. Prometheus instead has metrics which measure whether alerts are currently firing. We would need to reliably turn the ALERTS metrics into a single metric for how many incidents we believe should have been created.

We came up with:

count(ALERTS{alertstate="firing", severity="page"}) or vector(0)

We think that from here we can use the increase function to tell us how many times alerts have begun firing. To use this we think we would need to use recording rules as per https://www.robustperception.io/composing-range-vector-functions-in-promql to turn our query into a single range vector.
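
A sketch of the recording-rule approach described above (the rule name is a placeholder of our own; as the decision below records, this was never implemented):

    groups:
      - name: sli-pages
        rules:
          - record: alerts_firing:page:count
            expr: count(ALERTS{alertstate="firing", severity="page"}) or vector(0)

    # The recorded series could then be queried over a range, e.g.
    #   increase(alerts_firing:page:count[1d])
    # though, as noted below, this may not map cleanly onto PagerDuty incident counts.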

At that point we should have a number for how many alerts have begun firing in a given time period. However, we are not confident that this number is equal to the number of PagerDuty incidents we expect to be created. The reason for this is that Alertmanager groups firing alerts, meaning multiple firing alerts may result in only one notification and therefore one incident. A potential way around this would be to try to edit the grouping behaviour of Alertmanager using its config, but it doesn’t look as if it is possible to turn it off completely. There could also be issues if an alert fires, resolves itself, and then fires again immediately, only triggering a single incident.

Decision

We have decided not to try to implement this SLI at the moment, as we are not confident that we can accurately calculate the number of incidents we expect in PagerDuty for a given time period using metrics from Prometheus. It might be possible, but it would take a few days to investigate and would likely end up as a somewhat complex system to measure this SLI. We could change this decision if we become more confident.

Consequences

We do not measure one of our main user journeys accurately and thus can’t alert if there are problems with it. We may instead have to use a proxy, or measure the individual components that make up the user journey, which may be easier to measure but less accurate.

12. Deploying alertmanager to the new platform

Date: 2018-11-06

Status

Accepted

Context

The Observe team is a part of the new platform team, which is building out kubernetes capability in GDS.

There is a long-term goal that teams in GDS should avoid running bespoke infrastructure, specific to that team, so that any infrastructure we run is run in a common way and supportable by many people.

We also have a desire to migrate off ECS. ECS is painful for running alertmanager because:

  • ECS doesn’t support dropping configuration files in place
  • ECS doesn’t support exposing multiple ports via load balancer for service discovery

Kubernetes does not have either of these limitations.

Currently, we have a plan to migrate everything to EC2, in order to get away from ECS. We have quite a bit of outstanding pain from the old way of doing things:

  • we have two different deploy processes; one using the Makefile and one using the deploy_enclave.sh
  • we have two different code styles, related to the above
  • we have two different types of infrastructure

We haven’t fully planned out how we would migrate alertmanager to EC2, but we suspect it would involve at least the following tasks:

  • create a way of provisioning an EC2 instance with alertmanager installed (probably a stock ubuntu AMI with cloud.conf to install software)
  • create a way of deploying that instance with configuration added (probably a terraform module similar to what we have for prometheus)
  • actually deploy some alertmanagers to EC2 in parallel with ECS
  • migrate prometheus to start using both EC2 and ECS alertmanagers in parallel
  • once we’re confident, switch off the ECS alertmanagers
  • tidy up the old ECS alertmanager code

This feels like a lot of work, especially if our longer-term goal is that we shouldn’t run bespoke infrastructure and should instead run in some common way such as the new platform.

Nevertheless, even if we leave alertmanager itself in ECS, we could still ease some of the pain by refactoring the terraform code to be the new module style instead of the old project-and-Makefile style.

(Prometheus is different: we want to run prometheus the same way that non-PaaS teams such as Verify or Pay run it, so that we can offer guidance to them. The principle is the same: we want to run things the same way other GDS teams run them.)

Decision

  1. We will pause any work migrating alertmanager to EC2
  2. We will run an alertmanager in the new platform, leaving the remaining alertmanagers in ECS
  3. We will try to migrate as much of nginx out of ECS as possible; in particular, we want paas-proxy to move to the same network (possibly same EC2 instance) as prometheus.
  4. We will refactor our terraform for ECS to be module-based rather than the old project-and-Makefile style, so that we reduce the different types of code and deployment style.
  5. We will keep prometheus running in EC2 and not migrate it to the new platform (although new platform environments will each have a prometheus available to them)

Consequences

We will have to be careful to keep the alertmanager configuration in sync between the old and new infrastructure.

We will have to keep our ECS instances running longer than we might otherwise choose to.