Description
- Start 60 independent parallel act_runners (as opposed to a single runner that has parallel jobs enabled)
# docker-compose.yml
services:
  runner:
    build: . # this Dockerfile adds self-signed certs so act_runner is able to connect
    environment:
      - GITEA_INSTANCE_URL= # Your Gitea instance to register to
      - GITEA_RUNNER_REGISTRATION_TOKEN= # The Gitea registration token
      - GITEA_RUNNER_LABELS=label # The labels of your runner (comma separated)
    user: root
    deploy:
      mode: replicated
      replicas: 60 # <--- Required. Run this on a different host than the Gitea server; single-runner setups do not seem to be affected
- Now create a large 10x10 matrix with 100 jobs at once
on: push
jobs:
  stress-test:
    strategy:
      matrix:
        a: [0,1,2,3,4,5,6,7,8,9]
        b: [0,1,2,3,4,5,6,7,8,9]
      max-parallel: 60
    runs-on: label
    steps:
      - run: ${{ tojson(github) }}
        shell: cat {0}
      - run: uname -a
      - run: ${{ tojson(github) }}
        shell: cat {0}
      - run: sleep 60
      - run: ${{ tojson(github) }}
        shell: cat {0}
      - run: ${{ tojson(runner) }}
        shell: cat {0}
      - run: ${{ tojson(env) }}
        shell: cat {0}
      - run: ${{ tojson(strategy) }}
        shell: cat {0}
      - run: ${{ tojson(job) }}
        shell: cat {0}
      - run: ${{ tojson(needs) }}
        shell: cat {0}
      - run: ${{ tojson(steps) }}
        shell: cat {0}
- Notice that only a random 1-10 jobs get assigned to runners
- The other 90-99 jobs keep waiting
- Once the runners that make progress finish their job, they set taskversion to 0 and get a new one
- This job has a sleep 60 to keep the working runners busy for some amount of time
- Notice that the old workflow run might continue to queue new jobs to runners even though it should have been stopped by concurrency = 1
- This could be a side effect of a database / request timeout
Observed internal behavior
- Under load, fetchtask might return no job even though jobs are available, instead of returning an error
- Once this has happened, taskversion gets updated and no picktask calls happen until taskversion is incremented by new jobs
- About 50 of the 60 runners directly update their taskversion to the latest value even though jobs are still queued
- In this scenario, "rerun all jobs" returns HTTP 500, probably due to a database / request timeout
- All other features of Gitea remain functional
Workaround
- Patch act_runner to always send taskversion 0, which forces a database query, and set fetchtimeout to 50; with both changes, all runners got a queued job assigned
- Giving the database and Gitea more CPU power and more RAM works as well (tested on an M4 Pro Mac with SQLite)
- Possible alternative (untested): use a single act_runner to delegate resources to other machines
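
To make the stall and the first workaround easier to follow, here is a minimal Python model of the behavior described above. It is a sketch under assumptions, not Gitea or act_runner code: `Server`, `Runner`, `overloaded` and `always_send_zero` are hypothetical names, and the only rules modelled are the ones reported in this issue (the server skips the job lookup when the runner's version is already the latest, and the runner stores the returned version even when no job came back).

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy stand-in for the server side of FetchTask (assumption, not real code)."""
    tasks_version: int = 1            # bumped whenever new jobs are queued
    queued: list = field(default_factory=list)
    overloaded: bool = False          # simulates the "no job, no error" answer

    def fetch_task(self, runner_version: int):
        """Return (job_or_None, latest_version)."""
        # Jobs are only looked up when the runner's version is outdated.
        outdated = runner_version != self.tasks_version
        if outdated and not self.overloaded and self.queued:
            return self.queued.pop(0), self.tasks_version
        return None, self.tasks_version


@dataclass
class Runner:
    always_send_zero: bool = False    # models the patched act_runner
    version: int = 0

    def poll(self, server: Server):
        sent = 0 if self.always_send_zero else self.version
        job, latest = server.fetch_task(sent)
        # The version is stored even if no job was returned; after a job,
        # the runner resets to 0 so that its next poll queries again.
        self.version = latest if job is None else 0
        return job


server = Server(queued=["job-0", "job-1", "job-2"])
stock, patched = Runner(), Runner(always_send_zero=True)

server.overloaded = True              # one empty answer under load
first = (stock.poll(server), patched.poll(server))
server.overloaded = False             # load is gone, jobs are still queued

stalled = [stock.poll(server) for _ in range(3)]
forced = [patched.poll(server) for _ in range(3)]
print(first, stalled, forced)
```

In this model the unpatched runner never receives a job again until `tasks_version` changes, while the runner that always sends 0 drains the queue. The real timing, timeouts and database behavior are not represented.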
I'm not aware of other reports of this. I'm still debugging and trying to understand why Gitea sometimes claims that no new job is available even though there are clearly tens of them.
I'm not planning to run these tests against the demo site, to avoid stress testing its resources.
EDIT
Update: with a MacBook Pro M4 as the Gitea server, I got 20 parallel jobs using SQLite during debugging.
Now I need to set breakpoints to find out why FetchTask returns no error and no job.
EDIT
Need to collect more information...
EDIT
The first edit is obsolete; a more powerful device solves this problem as well.
So the good path works perfectly fine, but there must also be a bad path that shows up under degraded performance.
EDIT
A possible root cause is here: in line 320, a concurrent job assignment is reported as "no more jobs" instead of as an error.
Lines 317 to 320 in 09a3b07
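
The suspected failure mode can be illustrated with a small Python sketch. It is not the code at the referenced lines; `claim`, `pick_task_swallowing`, `pick_task_reporting`, `pick_with_retry` and `AssignmentConflict` are hypothetical names, and the race is simulated by letting two runners work from the same stale view of the queue.

```python
class AssignmentConflict(Exception):
    """Hypothetical error: another runner claimed the job first."""


def claim(jobs: dict, job_id: int, runner: str) -> bool:
    """Compare-and-set style claim: succeeds only if the job is unassigned."""
    if jobs[job_id] is None:
        jobs[job_id] = runner
        return True
    return False


def pick_task_swallowing(jobs, candidate, runner):
    """A lost race looks exactly like an empty queue: None and no error."""
    return candidate if claim(jobs, candidate, runner) else None


def pick_task_reporting(jobs, candidate, runner):
    """A lost race is raised, so the caller can tell it apart from 'no jobs'."""
    if not claim(jobs, candidate, runner):
        raise AssignmentConflict(candidate)
    return candidate


def pick_with_retry(jobs, snapshot, runner):
    """Try each job from a possibly stale snapshot until one claim succeeds."""
    for job_id in snapshot:
        try:
            return pick_task_reporting(jobs, job_id, runner)
        except AssignmentConflict:
            continue  # lost the race for this job; try the next one
    return None


jobs = {1: None, 2: None}   # job id -> assigned runner
snapshot = [1, 2]           # both runners saw both jobs as free

a = pick_task_swallowing(jobs, 1, "runner-a")
b = pick_task_swallowing(jobs, 1, "runner-b")   # lost the race for job 1
still_queued = jobs[2] is None                  # job 2 was never offered
retry = pick_with_retry(jobs, snapshot, "runner-b")
print(a, b, still_queued, retry)
```

With the swallowing variant, runner-b is told there is nothing to do while job 2 is still queued; with the reporting variant plus a retry, it picks up job 2. Whether Gitea should retry or surface the error to the runner is an open design question.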
Gitea Version
1.23.1
Can you reproduce the bug on the Gitea demo site?
No
Log Gist
No response
Screenshots
No response
Git Version
No response
Operating System
ubuntu 22.04 arm64
How are you running Gitea?
Docker image on a Raspberry Pi 4 (8 GB). Depending on database performance, more parallel runners might be needed to see something similar.
Database
MySQL/MariaDB