Sidekiq Concurrency Limit
Throttling/Circuit Breaker based on database usage
To protect the primary database against misbehaving or inefficient workers, which can lead to incidents such as slowed job processing and degraded web availability, we have developed a circuit-breaking mechanism within Sidekiq itself.
When the database usage of a worker violates an indicator, Sidekiq throttles the worker by decreasing its concurrency limit once every minute. In the worst case, the worker's concurrency limit is suppressed down to 1.
Once the database usage has returned to a healthy level, the concurrency limit automatically recovers towards its default limit, but at a much slower rate than the throttling rate. The throttling and recovery rates are defined here.
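To illustrate the asymmetry between throttling and recovery, here is a minimal, self-contained Ruby sketch. The halving step, the +10% recovery step, and the default limit of 200 are illustrative assumptions only; the real rates are defined in the source linked above.

```ruby
# Illustrative sketch only: the actual rates live in the GitLab codebase.
# Assumption: throttling halves the limit each minute while an indicator is
# violated, and recovery adds back a small fraction each minute once healthy.
DEFAULT_LIMIT   = 200
THROTTLE_FACTOR = 0.5   # assumed: aggressive decrease while unhealthy
RECOVERY_FACTOR = 0.1   # assumed: gentle increase while healthy

def next_limit(current_limit, violating:)
  if violating
    [(current_limit * THROTTLE_FACTOR).floor, 1].max                           # never below 1
  else
    [current_limit + (DEFAULT_LIMIT * RECOVERY_FACTOR).ceil, DEFAULT_LIMIT].min # never above default
  end
end

limit = DEFAULT_LIMIT
5.times { limit = next_limit(limit, violating: true) }   # throttled fast: 200 -> 6 after 5 minutes
puts limit
5.times { limit = next_limit(limit, violating: false) }  # recovers slowly: 6 -> 106 after 5 minutes
puts limit
```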
Observability around the concurrency limit is available on the sidekiq: Worker Concurrency Detail dashboard.
Database usage indicators
There are 2 indicators on which the application will throttle a worker:
- DB duration usage (primary DBs only)

  Dashboard:

  By default, the per-minute DB duration should not exceed a limit of 20,000 DB seconds/minute for non-high-urgency workers and 100,000 DB seconds/minute for high-urgency workers (source).

  The limits above can also be overwritten as described below. To check the current limit:

  ```shell
  glsh application_settings get resource_usage_limits -e gprd
  ```
- Number of non-idle DB connections

  Dashboards:

  - pgbouncer connection saturation
  - pgbouncer-ci connection saturation
  - pgbouncer-sec connection saturation

  Sidekiq periodically samples non-idle DB connections from pg_stat_activity to determine which worker classes are consuming the most connections. The system determines the predominant worker (the worker consuming the most connections) by:

  - Summing the number of connections used by a worker over the last 4 samples of pg_stat_activity (approximately 4 minutes of data)
  - Treating the worker with the most aggregated connections as the "predominant worker" (see the sketch after this list)
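A minimal Ruby sketch of that aggregation, assuming each sample is a hash of worker class name to non-idle connection count (the sample shape, worker names, and numbers here are illustrative, not the actual implementation):

```ruby
# Illustrative only: aggregate the last 4 pg_stat_activity samples and pick
# the worker holding the most non-idle connections overall.
samples = [
  { "WebHooks::LogExecutionWorker" => 30, "Chaos::DbSleepWorker" => 5 },
  { "WebHooks::LogExecutionWorker" => 28, "Chaos::DbSleepWorker" => 40 },
  { "WebHooks::LogExecutionWorker" => 31, "Chaos::DbSleepWorker" => 2 },
  { "WebHooks::LogExecutionWorker" => 29, "Chaos::DbSleepWorker" => 1 },
]

totals = Hash.new(0)
samples.last(4).each do |sample|
  sample.each { |worker, connections| totals[worker] += connections }
end

predominant_worker, total = totals.max_by { |_worker, connections| connections }
puts "#{predominant_worker} held #{total} connections over the last ~4 minutes"
# => WebHooks::LogExecutionWorker held 118 connections over the last ~4 minutes
```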
Throttling events
The table below illustrates what happens when each indicator is violated (❌ = indicator violated, ✅ = indicator within its limit):
| Indicator 1 (DB duration) | Indicator 2 (DB connections) | Throttling Event |
|---|---|---|
| ❌ | ✅ | Soft Throttle |
| ❌ | ❌ | Hard Throttle |
| ✅ | ❌ | No throttling, as some workers may momentarily hold many connections during normal workload |
| ✅ | ✅ | No throttling |
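The same decision matrix expressed as a small Ruby sketch (the method and symbol names are illustrative, not the actual middleware API):

```ruby
# Illustrative mapping of the table above to code.
def throttling_event(db_duration_violated:, db_connections_violated:)
  if db_duration_violated && db_connections_violated
    :hard_throttle
  elsif db_duration_violated
    :soft_throttle
  else
    # A connection spike alone is tolerated: some workers momentarily hold
    # many connections during normal workload.
    :no_throttling
  end
end

puts throttling_event(db_duration_violated: true,  db_connections_violated: false) # soft_throttle
puts throttling_event(db_duration_violated: true,  db_connections_violated: true)  # hard_throttle
puts throttling_event(db_duration_violated: false, db_connections_violated: true)  # no_throttling
```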
Updating DB duration limits
The DB duration limits described above can only be updated by calling the application settings API. They cannot currently be set using the admin web UI.
- Prepare a JSON file. Here's an example to update a single worker, Chaos::DbSleepWorker, to have its own limit on the main DB:

  ```json
  {
    "rules": [
      {
        "name": "main_db_duration_limit_per_worker",
        "resource_key": "db_main_duration_s",
        "metadata": { "db_config_name": "main" },
        "scopes": ["worker_name"],
        "rules": [
          { "selector": "worker_name=Chaos::DbSleepWorker", "threshold": 5, "interval": 60 },
          { "selector": "urgency=high", "threshold": 100000, "interval": 60 },
          { "selector": "*", "threshold": 20000, "interval": 60 }
        ]
      },
      {
        "name": "ci_db_duration_limit_per_worker",
        "resource_key": "db_ci_duration_s",
        "metadata": { "db_config_name": "ci" },
        "scopes": ["worker_name"],
        "rules": [
          { "selector": "urgency=high", "threshold": 100000, "interval": 60 },
          { "selector": "*", "threshold": 20000, "interval": 60 }
        ]
      },
      {
        "name": "sec_db_duration_limit_per_worker",
        "resource_key": "db_sec_duration_s",
        "metadata": { "db_config_name": "sec" },
        "scopes": ["worker_name"],
        "rules": [
          { "selector": "urgency=high", "threshold": 100000, "interval": 60 },
          { "selector": "*", "threshold": 20000, "interval": 60 }
        ]
      }
    ]
  }
  ```

  To prepare a file with the current configuration to edit, run:

  ```shell
  glsh application_settings get resource_usage_limits > rules.json
  ```
- Run the helper script glsh application_settings resource_usage_limits to update the limits with an admin PAT:

  ```shell
  glsh application_settings set resource_usage_limits -f rules.json -e gprd
  ```
Disabling the Throttling/Circuit Breaker feature entirely
To disable throttling globally for all workers:

```shell
/chatops run feature set sidekiq_throttling_middleware false
```

To disable throttling for a single worker:

```shell
# replace Security::SecretDetection::GitlabTokenVerificationWorker with the worker you want to disable
/chatops run feature set `disable_sidekiq_throttling_middleware_Security::SecretDetection::GitlabTokenVerificationWorker` true
```

SidekiqConcurrencyLimitQueueBacklogged Alert
This alert fires when a Sidekiq worker has accumulated too many jobs in the Concurrency Limit queue (>100,000 jobs for more than 1 hour).
A long backlog is usually caused by a higher arrival rate of jobs than the rate at which jobs are resumed by ConcurrencyLimit::ResumeWorker.
- The arrival rate is equal to the worker deferment rate, which can be found here.
- The rate of resuming jobs can be found in Kibana.
If the arrival rate is consistently higher than the rate of resuming jobs, the only option is to disable the concurrency limit for the worker class as described in Option 2 below.
These jobs are queued in Redis Cluster SharedState, so a large number of jobs could saturate Redis Cluster SharedState memory if left untreated.
Option 1: Increase Worker Concurrency Limit
Section titled “Option 1: Increase Worker Concurrency Limit”If the worker can safely handle more concurrent jobs:
- Locate the worker definition in the codebase
- Check the current concurrency limit setting on the dashboard or in the worker class definition.
- Create an MR to increase the limit to an appropriate value based on how many jobs the worker can safely run concurrently
If the concurrency_limit attribute is not set in the worker class, consider overriding the
max_concurrency_limit_percentage attribute to use a higher percentage of the maximum total threads in the Sidekiq shard. The default
percentage can be found here (based on the worker's urgency).
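As a rough illustration of how the percentage-based ceiling translates into an effective limit, here is a small Ruby sketch. The shard size and percentages below are made-up example values, not GitLab defaults; confirm the real defaults at the link above.

```ruby
# Illustrative arithmetic only: an effective concurrency limit derived from a
# percentage of the shard's total Sidekiq threads.
shard_total_threads = 200                # assumed total threads in the Sidekiq shard
default_percentage  = 0.25               # assumed default for this urgency class

effective_limit = (shard_total_threads * default_percentage).floor
puts effective_limit                     # => 50

# Overriding max_concurrency_limit_percentage raises the ceiling without
# hard-coding a concurrency_limit value:
overridden_limit = (shard_total_threads * 0.35).floor
puts overridden_limit                    # => 70
```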
Option 2: Temporarily Disable Concurrency Limit
Section titled “Option 2: Temporarily Disable Concurrency Limit”Alternatively, disable_sidekiq_concurrency_limit_middleware_#{worker_name} feature flag can be enabled to help clear the backlogs instantly
without waiting for deployment as in Option 1.
- Enable the feature flag:

  ```shell
  /chatops run feature set `disable_sidekiq_concurrency_limit_middleware_WebHooks::LogExecutionWorker` true --ignore-feature-flag-consistency-check
  ```

- Monitor the concurrency limit queue size to confirm it's draining
- If we decide to increase the concurrency limit, wait until the limit has been increased and then remove the feature flag:

  ```shell
  /chatops run feature delete `disable_sidekiq_concurrency_limit_middleware_WebHooks::LogExecutionWorker`
  ```

When the concurrency limit middleware is disabled:
- Jobs will be resumed at a higher pace.
- New jobs will execute immediately.
Post-Incident Tasks
Section titled “Post-Incident Tasks”- Create an issue to properly address the root cause if Option 2 was used
- Update monitoring thresholds if needed