Job Queue Duration Apdex SLO Violations

Covers:

  • CiOrchestrationServiceSharedRunnerJobQueueDurationApdexSLOViolation
  • CiOrchestrationServiceNonSharedRunnerJobQueueDurationApdexSLOViolation

These alerts fire when the job queue duration apdex for shared or non-shared runners violates its SLO burn rate threshold, indicating that jobs are waiting too long in the pending state before being picked up by a runner.

The job_queue_duration_seconds histogram measures the time between job creation and runner assignment; it is emitted when a runner picks up a job via POST /api/v4/jobs/request. Separate SLIs track shared runners (SaaS-managed) and non-shared runners (project/group runners).

Impact:

  • Longer pipeline durations due to queueing delays
  • Developer feedback loops slowed
  • Shared runner impact: affects all GitLab.com users without self-managed runners
  • Non-shared runner impact: typically affects specific projects or groups

Common causes for shared runners:

  • Runner fleet capacity insufficient for demand
  • Autoscaling delays in provisioning new runners
  • Runner pod evictions or node failures
  • Surge in CI demand (e.g., mass retries after flaky test fix)

Common causes for non-shared runners:

  • Project/group runners offline or misconfigured
  • Tag mismatches (jobs requiring tags that no runner provides)
  • Runner concurrency limits reached
  • Plan-gating: no_matching_runner due to allowed_plans restrictions (results in stuck_or_timeout_failure after 24h)

Shared runner SLI: uses job_queue_duration_seconds_bucket{shared_runner="true"} with histogramApdex.

  • Satisfied threshold: 1 second (jobs picked up within 1s are “satisfactory”)
  • SLO: 90% apdex
  • MWMBR fires at: < 40% (6h window). The 1h window threshold computes to < -44%, which is below 0% and therefore mathematically impossible — in practice, only the 6h window can trigger this alert.
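
As a quick check, the same apdex can be approximated directly from the raw histogram: the rate of jobs assigned within the 1-second satisfied threshold divided by the rate of all assignments. This is a simplified sketch of what the generated recording rules compute, not the exact production expression:

# Approximate shared runner queue duration apdex over the last hour.
# Numerator: jobs picked up within the 1s satisfied threshold; denominator: all pickups.
# Depending on the bucket label format (see the bucket note below), le may need to be "1.0" instead of "1".
sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="1"}[1h]))
/
sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="+Inf"}[1h]))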

Non-shared runner SLI: uses job_queue_duration_seconds_bucket{shared_runner="false"} with histogramApdex.

  • Satisfied threshold: 30 seconds
  • SLO: 95% apdex
  • MWMBR fires at: < 70% (6h window) / < 28% (1h window)
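
The multi-window check requires both windows to be in breach at the same time. As a minimal sketch, assuming the apdex is also recorded at 1h and 6h windows under the gitlab_component_apdex:ratio_<window> naming used in the queries further down (the real alert expressions may differ in exact rule names and window pairing):

# Short window (1h): apdex below 28%.
gitlab_component_apdex:ratio_1h{component="non_shared_runner_job_queue_duration", type="ci-orchestration", environment="gprd"} < 0.28
# Long window (6h): apdex below 70%. The alert requires both conditions to hold simultaneously.
gitlab_component_apdex:ratio_6h{component="non_shared_runner_job_queue_duration", type="ci-orchestration", environment="gprd"} < 0.70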

Histogram buckets: 1, 3, 10, 30, 60, 300, 900, 1800, 3600, +Inf seconds. The metricsFormat='migrating' setting handles both Prometheus 2 (integer) and Prometheus 3 (float) bucket label formats.
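
When querying the raw buckets directly, the le label can therefore appear as either an integer ("1") or a float ("1.0"); a regex matcher is a safe way to cover both forms, for example:

# Matches both the Prometheus 2 ("1") and Prometheus 3 ("1.0") forms of the 1s bucket label.
sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le=~"1(\\.0)?"}[5m]))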

  • Severity: S3 (Slack-only, no paging)
  • Routes to: #s_verify_alerts
  • MWMBR requires both short and long windows to breach simultaneously
  • Non-shared runner queue duration is inherently noisier (depends on customer runner fleet availability)
  • Silencing: Safe to silence during known runner fleet maintenance or capacity scaling events. Use Alertmanager silence with matchers type=ci-orchestration, component=~.*job_queue_duration
  • Expected frequency: Shared runner alerts should be rare. Non-shared runner alerts may fire more frequently due to customer runner fleet variability

Default severity is S3. Consider upgrading to S2 if:

  • Shared runner queue duration p90 > 5 minutes sustained
  • Runner fleet-wide issue affecting all shared runner jobs
  • Correlated customer reports of jobs stuck in pending
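
A quick way to check the first criterion against live data is to compare the shared runner p90 (using the same recording rule as the queries below) with the 300-second threshold:

# A non-empty result means the shared runner queue p90 currently exceeds 5 minutes.
histogram_quantile(0.90, sum by (le) (sli_aggregations:job_queue_duration_seconds_bucket:rate_5m{environment="gprd", shared_runner="true"})) > 300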

Useful queries:

# Shared runner queue duration apdex
gitlab_component_apdex:ratio_5m{component="shared_runner_job_queue_duration", type="ci-orchestration", environment="gprd"}
# Non-shared runner queue duration apdex
gitlab_component_apdex:ratio_5m{component="non_shared_runner_job_queue_duration", type="ci-orchestration", environment="gprd"}
# Queue duration percentiles (shared)
histogram_quantile(0.90, sum by (le) (sli_aggregations:job_queue_duration_seconds_bucket:rate_5m{environment="gprd", shared_runner="true"}))
# Queue duration percentiles (non-shared)
histogram_quantile(0.90, sum by (le) (sli_aggregations:job_queue_duration_seconds_bucket:rate_5m{environment="gprd", shared_runner="false"}))

For shared runner capacity issues:

  • CI Runners dashboard for runner availability and saturation
  • Check pending jobs queue length vs. runner capacity
  • Look for autoscaling issues or GCP quota limits

Non-shared runner queue duration issues are often caused by jobs requesting tags that no available runner provides. Such jobs sit in the pending state until they fail with stuck_or_timeout_failure after 24 hours.

The “Job pickup rate” panel on the Pipeline Observability dashboard shows how quickly jobs are being assigned to runners. A declining pickup rate with stable creation rate indicates runner capacity issues.
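
If the dashboard is unavailable, the pickup rate can be approximated from the histogram itself: job_queue_duration_seconds is observed at the moment a runner picks up a job, so the rate of its _count series is a rough proxy. A week-over-week comparison (label selectors may need adjusting for where you query):

# Current job pickup rate vs. the same time last week, split by runner type.
# Ratios well below 1 suggest pickups are falling behind the usual pace.
sum by (shared_runner) (rate(job_queue_duration_seconds_count{environment="gprd"}[10m]))
/
sum by (shared_runner) (rate(job_queue_duration_seconds_count{environment="gprd"}[10m] offset 1w))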

Unusual spikes in pipelines_created_total or job creation rate can overwhelm runner capacity:

  • Mass retries after a flaky test fix
  • Scheduled pipeline bursts
  • Large merge train activity
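
A surge can usually be confirmed by comparing the current pipeline creation rate with a recent baseline, for example the same window 24 hours earlier:

# Ratio of the current pipeline creation rate to the same 15-minute window yesterday.
# Values well above 1 indicate a surge that may be outpacing runner capacity.
sum(rate(pipelines_created_total{environment="gprd"}[15m]))
/
sum(rate(pipelines_created_total{environment="gprd"}[15m] offset 1d))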

No past incidents have been recorded yet for this alert. This section will be updated as incidents occur.

Dependencies:

  • CI Runners (shared): SaaS runner fleet capacity and health
  • Project/Group Runners (non-shared): Customer-managed runner availability
  • Rails API: Job assignment endpoint (/api/v4/jobs/request)

Consider escalating if:

  • Shared runner queue p90 > 10 minutes for > 30 minutes
  • Runner fleet capacity appears exhausted
  • Correlated with runner service degradation

Relevant Slack channels:

  • #s_verify_alerts (primary)
  • #g_runner (Runner team)
  • #f_hosted_runners_on_linux (Hosted Runners)
  • #production (if S2+ severity)