
Hosted Runner maintenance for {customer} has failed

General Troubleshooting DHR Maintenance Failure


First, know that most likely only the inactive shard of the Dedicated Hosted Runner (DHR) is experiencing a problem, while the active shard continues to process jobs. You can verify this by checking the Hosted Runners Overview dashboard and confirming that the active shard is still processing jobs.

Most of what you need to know about troubleshooting a failed hosted runner maintenance can be found under "Troubleshooting problems with ZDD" in hosted-runners-troubleshooting.md in the team repo.

Specific known categories of DHR Maintenance Failure

  1. Hosted_runner_provision post deploy healthcheck failed
  2. Hosted_runner_provision pre deploy healthcheck failed
  3. Inaccuracies between deployment_status SSM Parameter and state of infrastructure

After you rerun provision successfully, always run shutdown and cleanup as well, so that we don’t waste money or cause confusion by leaving both colours live at the same time.
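As a rough illustration of the end state this step aims for, the sketch below checks that exactly one colour is live and that it matches what is recorded as active. Everything in it is hypothetical: the `ShardState` type, the `check_single_live` function, the colour names `blue` and `green`, and the idea that deployment_status records the active colour are assumptions made for illustration, not the actual tooling or parameter schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardState:
    """Hypothetical view of one shard; not the real data model."""
    colour: str    # assumed colour name, e.g. "blue" or "green"
    running: bool  # True if this shard's infrastructure is up


def check_single_live(recorded_active: str, shards: list[ShardState]) -> list[str]:
    """Return a list of problems; an empty list means the state looks consistent.

    `recorded_active` stands in for whatever deployment_status records as the
    active colour (an assumption about its contents).
    """
    problems: list[str] = []
    live = sorted(s.colour for s in shards if s.running)

    if len(live) > 1:
        # Both colours up: the situation shutdown and cleanup are meant to prevent.
        problems.append(
            f"both colours are live ({', '.join(live)}): run shutdown and cleanup"
        )
    elif not live:
        problems.append("no shard is live")
    elif live[0] != recorded_active:
        # One colour up, but not the one the recorded state says is active.
        problems.append(
            f"deployment_status says {recorded_active!r} but {live[0]!r} is live"
        )
    return problems


if __name__ == "__main__":
    # Illustrative input only: both colours left running after a rerun.
    state = [ShardState("blue", True), ShardState("green", True)]
    for problem in check_single_live("blue", state):
        print(problem)
```

An empty result corresponds to the state you want before moving on: one colour live, and the recorded state agreeing with it.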

If you had to do some serious shenanigans to get a successful run of hosted_runner_provision, it is highly recommended that you rerun the entire hosted_runner_deploy pipeline for that runner stack and get a clean, successful maintenance before moving on. This is especially relevant if you had to fix an infrastructure issue on one shard, as the same problem may be present on the other shard.