How to restore the cortex operator when too many jobs are requested #2394

Open
@nellaG

Hello.

I'm currently using cortex 0.40.0.

Occasionally I submit thousands of jobs to a cortex API by mistake.
When that happens, the cortex CLI becomes barely usable (responses are very slow, or it just hangs), and I suspect that I have overloaded the cortex operator.
(The operator-controller-manager pod repeatedly goes from OOMKilled to CrashLoopBackOff.)

To resolve this, I have tried the following so far, but none of it worked:

  1. Deleting the thousands of AWS SQS queues
  2. Deleting all of the enqueuer jobs and worker jobs created by mistake
  3. Deleting the affected cortex API and re-deploying it

In the end I just brought the cluster down and back up (and re-deployed all of the APIs) to get cortex working again.
If this happens again, how can I restore cortex without taking the cluster down and bringing it back up?

I would appreciate your support. Thank you so much.

Labels: question (Further information is requested)
