Description
Description of the problem
When rejudging a large contest, or when receiving many submissions for problems with many testcases, some submissions can take much longer in wall time than in CPU time. With a short timelimit overshoot, these submissions might be judged as TLE even if they are correct.
This is what actually happened in a recent ICPC Asia Regional Contest (with ~350 teams and an easy problem with 50 testcases). After a lot of time spent bisecting the kernel and debugging, we found that a lock contention issue (2 global locks: shrinker_rwsem and cgroup_mutex) in kernels < 6.3 under heavy load might block kernel operations such as cgroup handling and page fault handling inside a memory cgroup for several seconds.
(This is fixed (or alleviated) by kernel commit torvalds/linux@da27f79)
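For server admins, a quick check of the running kernel against the 6.3 threshold mentioned above could look like the following. This is only a minimal sketch: the helper names `kernel_version` and `may_be_affected` are made up for illustration, and a plain version comparison is a heuristic, since a distribution kernel may carry backported fixes (or lack them) regardless of its version number.

```python
import platform
import re

# Per the report above, kernels older than 6.3 may show the lock contention.
FIXED_IN = (6, 3)


def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(release):
    """Heuristic only: True if the kernel version is older than FIXED_IN."""
    return kernel_version(release) < FIXED_IN


if __name__ == "__main__":
    release = platform.release()
    if may_be_affected(release):
        print(f"kernel {release}: older than 6.3, may be affected under heavy load")
    else:
        print(f"kernel {release}: 6.3 or newer")
```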
Although judgedaemon (runguard) cannot "fix" this issue in code, mentioning the kernel issue in the documentation could be helpful for server admins.
Your environment
- DOMjudge/Webserver: any compatible version
- OS: Ubuntu 22.04 with kernel 5.15 (default) or 6.2 (latest generic kernel in jammy repo)
- Tested in a KVM guest with 32 cores and 21 or 30 judgedaemons, and on a bare-metal server with 2 CPUs (40 cores) and 21 judgedaemons.
Steps to reproduce
Submit a correct solution many times at once, for example with this loop (fish shell syntax; in bash, use `do` ... `done` instead of `end`):
for i in $(seq 1 1000); ~/Downloads/domjudge-8.2.2/submit/submit --url http://localhost:12345/ --contest test -y G.cpp; end
Then wait for all submissions to be judged.
Expected behaviour
Reasonable judgehost system load, and no submission takes a wall time much longer than its CPU time.
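To make "wall time much longer than CPU time" concrete, the gap can be measured for any command outside of DOMjudge. The sketch below is an illustration only and is not how runguard measures time; `measure` and `stalled` are hypothetical helper names, and the 1-second slack is an arbitrary assumption.

```python
import os
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (wall_seconds, cpu_seconds) used by the child process."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    wall = time.monotonic() - start
    after = os.times()
    # children_user / children_system accumulate CPU time of waited-for children.
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, cpu


def stalled(wall, cpu, slack=1.0):
    """True if wall time exceeds CPU time by more than slack seconds."""
    return wall - cpu > slack


if __name__ == "__main__":
    # A sleeping child uses wall time but almost no CPU time.
    wall, cpu = measure([sys.executable, "-c", "import time; time.sleep(2)"])
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s stalled={stalled(wall, cpu)}")
```

A CPU-bound, single-threaded solution should normally show wall and CPU times that are close to each other; a gap of several seconds on such a program points at the system rather than the submission.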
Actual behaviour
Judgehost system load >= 2 * the number of judgedaemons. With timelimit overshoot set to 1s|10%, some submissions are judged as TLE even though they only take a very short CPU time. Judging is very slow.
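To see why a multi-second stall is enough to cause a TLE here, it helps to compute how much extra time an overshoot setting such as 1s|10% grants. The sketch below assumes, based on my reading of the DOMjudge configuration description, that `+`, `|` and `&` mean the sum, maximum and minimum of the two terms; `overshoot_seconds` is a hypothetical helper and not DOMjudge code.

```python
import re


def overshoot_seconds(timelimit, spec):
    """Extra seconds allowed beyond timelimit for a spec like '1s', '10%' or '1s|10%'.

    Assumed semantics: '+' is the sum, '|' the maximum, '&' the minimum.
    """

    def term(text):
        match = re.fullmatch(r"(\d+(?:\.\d+)?)(s|%)", text.strip())
        if match is None:
            raise ValueError(f"cannot parse overshoot term: {text!r}")
        value = float(match.group(1))
        # 'Xs' is absolute seconds, 'Y%' is a percentage of the timelimit.
        return value if match.group(2) == "s" else timelimit * value / 100

    for separator, combine in (("+", lambda a, b: a + b), ("|", max), ("&", min)):
        if separator in spec:
            left, right = spec.split(separator, 1)
            return combine(term(left), term(right))
    return term(spec)


if __name__ == "__main__":
    for limit in (1.0, 5.0, 20.0):
        extra = overshoot_seconds(limit, "1s|10%")
        print(f"timelimit {limit:>4}s -> killed after {limit + extra}s")
```

Under these assumptions, a problem with a 1-second timelimit gets only 1 second of overshoot, so a kernel stall of several seconds pushes the wall time past the hard limit no matter how little CPU time the submission used.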
Any other information that you want to share?
#2157 mentions that "the call cgroup_delete_cgroup_ext did sometimes hang for multiple seconds". I'm afraid the rejudging for that contest might need to be double-checked to ensure no correct solutions were judged as TLE...
If you are interested in this specific kernel issue, I have also written a blog post (in Simplified Chinese) that explains it to the contestants affected in this regional contest and to server admins of later contests.