Prometheus for GOV.UK
“From metrics to insight”: read up on that catchy phrase and more at prometheus.io.
Simply put, Prometheus lets us see and respond to things that happen on the computers we look after.
Prometheus is a flexible monitoring system that other teams across GDS use. Implementation can be on a small or large scale, simple or complex. All of this makes it a good choice to get monitoring in place quickly.
Tell me more…
Prometheus uses a client/server pull model to collect time series data. The Prometheus server will pull data from each configured node or exporter. This data can then be used in a number of ways.
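To make the pull model a little more concrete: each node or exporter serves its current metrics as plain text over HTTP, and the server fetches that text on a schedule. The sketch below is illustrative only. The helper name samples_for and the sample metric values are made up for this page, and it filters text in the Prometheus exposition format rather than talking to a real exporter.

```shell
# Print the sample lines for one metric from Prometheus text-format output on stdin.
# Comment lines (# HELP, # TYPE) never match because they start with "#".
# samples_for is a hypothetical helper name, not part of Prometheus.
samples_for() {
  grep -E "^$1(\{| )"
}

# Demo with made-up data in the exposition format:
samples_for node_load1 <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_load15 0.3
EOF
# prints: node_load1 0.42
```

Against a running exporter the same helper could be fed from curl, for example `curl -s http://<host>:<port>/metrics | samples_for node_load1`, where host and port are placeholders.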
Various configurations are possible, but at its simplest the “Prometheus monitoring server” can be installed on its own and collect metrics from itself.
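As a rough illustration of that simplest case, the sketch below writes a minimal configuration in which the server scrapes only itself. The file name prometheus-minimal.yml is made up for this example, and localhost:9090 assumes the server listens on the upstream default port. The GOV.UK configuration is managed by Puppet and will differ.

```shell
# Write a minimal config: one scrape job that targets the Prometheus server itself.
# The file name is hypothetical; 9090 is the upstream default port, not necessarily GOV.UK's.
cat > prometheus-minimal.yml <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
EOF

# With Prometheus installed, the file could then be checked and used:
#   promtool check config prometheus-minimal.yml
#   prometheus --config.file=prometheus-minimal.yml
```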
In more complex configurations, multiple exporters can be set up for the server to pull data from, “Alertmanager” can be used to push alerts, and data can be exported and visualised by, for example, Grafana. See the diagram and more at prometheus.io.
At GDS, Reliability Engineering (Slack #reliability-eng) is the best place to start. Depending on the implementation and what needs to be done, some teams will be better placed to help with specific questions. Code is generally “owned” by the team that implemented it.
There’s a public community supporting the software and more information can be found at Prometheus.io Community.
There’s also a London Prometheus Meetup that GDS has helped organise and has given talks at (there’s often free pizza and drinks!).
A good start is prometheus.io. [note: If you find good reading please add it to this page!]
Read this guide if you are interested in the GOV.UK implementation, and say hello to Reliability Engineering (Slack #reliability-eng) if you have any questions.
Prometheus Implementation for GOV.UK
How do I read, run and change the code?
GitHub Repositories
- alphagov/govuk-aws
- alphagov/govuk-aws-data
- alphagov/govuk-dns-config
- alphagov/govuk-puppet
- alphagov/packager
* a “node” using the
* an EBS volume
* an IAM role and policy for prometheus
* security groups for access control
* an “LB” using the
* the Auto-Scaling Group data and attachment
* route 53 records
Puppet data used by app-prometheus (subnets) and infra-public-services (service name).
DNS setup for publishing.service.gov.uk CNAME records, using govuk.digital data for the different environments the prometheus server is deployed to.
govuk_prometheus_node_exporter sets the apt::source, installs and configures the software, permits inbound 9080 port access and sets up an
govuk_prometheus sets the apt::source, installs and configures the software (including nginx), and sets up an
Existing code was modified to include the modules where needed and to add a new govuk::node for prometheus. The apt_mirror_hostname for both packages is included in the hieradata.
FPM recipes for
prometheus-node-exporter can be found here.
aptly::repo puppet resource
- Package the prometheus software
- Get package onto mirrors for distribution
- follow steps in GOV.UK docs
- manual steps might be needed
- Run terraform to create infrastructure
- Deploy puppet to complete configuration
- autodeployment to integration
- check puppet deployment status in staging and production
- Configure & Deploy DNS
what to set up
- to AWS
- to the required GitHub repos
- to getting started on GOV.UK
- solo - this could be a VM or an AWS instance. Make sure it’s appropriate: the correct distribution/OS/AMI is a good start.
- together - there are shared or common GOV.UK environments. Know the users of each and how to notify them of changes. Integration is often exposed to breaking changes but that can still impact a large number of people who don’t expect it.
- all the brew installs and downloads
- the existing code base
what to build
- build the software packages
brew install all_the_things (please check individual repos for specifics)
how to test
- check you can browse to the package on the aptly mirrors, https://apt.publishing.service.gov.uk/
dpkg -l to list all software on a host and confirm the package is installed
- check the config files
/etc/prometheus/prometheus.yml - to confirm the details expected for the server, e.g. scrape_configs and EC2 discovery
/etc/init/node-exporter.conf - to confirm the details expected for the node exporter
- check the AWS Console to confirm resources have been created by terraform
- check ports are listening on clients
- ssh onto a host and
- check connectivity between hosts, e.g.
a_host$ curl http://<b_host>:9080/metrics
- ssh onto a host and
- check access to the load balancer
- from a browser:
- check the health of the targets in the AWS Console
- from a browser:
- raise a govuk-puppet PR and check the tests in CI govuk-puppet
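Several of the host-level checks above can be gathered into small shell helpers. This is only a sketch: the function names are made up for this page, the port (9080) and config path are the ones mentioned in the checklist, and the use of ss and curl assumes those tools are available on the host.

```shell
# Hypothetical helper names; port 9080 and the file path come from the checklist above.
NODE_EXPORTER_PORT=9080

# Build the metrics URL for a host, e.g. metrics_url b_host
metrics_url() {
  printf 'http://%s:%s/metrics\n' "$1" "${2:-$NODE_EXPORTER_PORT}"
}

# List installed packages whose name or description mentions prometheus.
check_package() {
  dpkg -l | grep -i prometheus
}

# Show where the server config mentions scrape_configs and EC2 discovery.
check_server_config() {
  grep -nE 'scrape_configs|ec2_sd_configs' "${1:-/etc/prometheus/prometheus.yml}"
}

# Show listening TCP sockets on the node exporter port (ss is part of iproute2).
check_port() {
  ss -tln | grep ":${1:-$NODE_EXPORTER_PORT} "
}

# Fetch metrics from another host; exits non-zero if unreachable or on an HTTP error.
check_metrics() {
  curl --silent --fail --max-time 5 "$(metrics_url "$1")"
}
```

For example, after ssh-ing onto a_host, `check_metrics <b_host> | head` would show the first few lines served by the exporter on b_host, where b_host is a placeholder.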
how to deploy it
- always run changes through integration before progressing them onto staging and production
- talk in the GOV.UK and Reliability Engineering Slack channels about testing
- get PRs raised and peer reviewed for your changes
- run the terraform job with a
plan. Check the output, then run the
- run the
- check GOV.UK Icinga Alerts after deploying
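In plain Terraform CLI terms, reviewing a plan before making changes usually means saving the plan to a file and then applying exactly that file. The sketch below is a generic illustration, not the GOV.UK deployment job, which may wrap Terraform differently. The function name plan_then_apply, the TERRAFORM variable and the default plan file name are all made up for this example.

```shell
# Hypothetical wrapper. TERRAFORM can be overridden (e.g. to "echo terraform") for a dry run.
TERRAFORM="${TERRAFORM:-terraform}"

plan_then_apply() {
  plan_file="${1:-tfplan}"
  # Save the plan so that apply runs exactly what was reviewed.
  $TERRAFORM plan -out="$plan_file" || return 1
  printf 'Apply this plan? Type "yes" to continue: '
  read -r answer
  if [ "$answer" = "yes" ]; then
    $TERRAFORM apply "$plan_file"
  else
    echo "Plan not applied."
    return 1
  fi
}
```

Setting `TERRAFORM="echo terraform"` before calling the function prints the commands instead of running them, which is a cheap way to see what would happen.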
Note: the format of these docs has been inspired by the “Language System API” section of https://github.com/saiaps/language-system-api