Spent the morning looking into why the service I’m working on was performing so badly. Turns out that one of the reasons was just us being stupid, but another was actually quite surprising.
Without giving too much away, the service I’m working on takes messages from a NATS Jetstream and converts them into jobs to be sent to a pool of workers. The workers need to access an API to do the job, and the service I’m responsible for needs to set up the permissions so they can do so. This involves calling out to a bunch of other micro-services and producing a signed JWT to grant access to the API. If the workers are fully utilised, the service will send any incoming jobs to a queue, and when a worker finishes a job, it will pick up the next one from there.
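For context, the consuming side of this looks roughly like the sketch below. This is a minimal stand-in using the nats.go JetStream API; the subject, durable name, and handler body are all made up for illustration.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical subject and durable name; the real ones are elided.
	if _, err := js.Subscribe("jobs.incoming", func(m *nats.Msg) {
		// Convert the message into a job, set up API access (micro-service
		// calls plus JWT signing), then hand it to a worker or the queue.
		m.Ack()
	}, nats.Durable("job-dispatcher"), nats.ManualAck()); err != nil {
		log.Fatal(err)
	}

	select {} // block forever while the handler runs
}
```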
One of the responsibilities of this service is to make sure the workers are doing work as much as possible. The pool of workers is fixed, and we’re paying a fixed price for them, so we’d like them to be in use, much like an airline would like its planes to be in the sky. So one of the things I added was having the Jetstream handler try to send a new job to a worker first, before adding it to the queue to be worked on later. This means a job can skip the queue entirely if a worker is available to pick it up then and there.
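The shape of that bypass was roughly the following. This is a sketch with made-up names (the real queue is not an in-process channel), but it shows the important detail: the prep work happens before we know whether any worker is free.

```go
package main

// Job is a hypothetical stand-in for the real job type.
type Job struct {
	Payload []byte
	Token   string // the signed JWT granting API access
}

type service struct {
	workerCh chan Job // idle workers receive on this channel
	queue    chan Job // stand-in for the real queue
}

// prepareAccess stands in for the micro-service calls and JWT signing.
func (s *service) prepareAccess(job Job) string { return "signed-jwt" }

func (s *service) dispatch(job Job) {
	// The expensive part, done up front in all cases.
	job.Token = s.prepareAccess(job)

	select {
	case s.workerCh <- job:
		// A worker was free, so the job skips the queue entirely.
	default:
		// All workers are busy: the job goes to the queue and the
		// freshly signed JWT is wasted.
		s.queue <- job
	}
}
```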
Turns out this was a huge hit on performance. When the workers are all busy with jobs, any work offered by this queue-bypass logic would be refused, making all the prep work for setting up access to the API completely unnecessary. The time spent calling the micro-services I can completely understand, but I was surprised to find that a significant chunk of the prep work went on signing the JWT: a JWT that would never be used. This showed up in the CPU profile trace I took, but it also came through in the output of top and the CPU stats on the EC2 dashboard. The CPU was maxed out at 99%, and virtually all of that was spent in the service itself; CPU idle time, system time, and I/O wait time were pretty much at 0%.
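For what it’s worth, getting that CPU profile out of a running Go service is the standard net/http/pprof dance, nothing specific to this service:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Expose the profiling endpoints on a local-only port next to the service.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service ...
	select {}
}
```

Then `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` grabs a 30-second CPU profile to poke around in.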
Taking out this queue-bypass logic dramatically improved things (although CPU usage is still unusually high, which I need to look into). What I’ll eventually do is add a circuit breaker to the Jetstream handler, so that if no worker is available to pick up a job then and there, the following jobs go straight to the queue. There are a few other things we could do here too, like raising the number of database connections and Jetstream listeners.
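A first cut of that circuit breaker could be as simple as a shared flag: trip it when a dispatch is refused, reset it when a worker frees up. Just a sketch of the idea, not what will actually ship:

```go
package main

import "sync/atomic"

// bypassBreaker gates the queue-bypass path. While tripped, the handler
// skips the direct dispatch (and its expensive prep work) entirely.
type bypassBreaker struct {
	tripped atomic.Bool
}

// TryBypass reports whether the handler should attempt a direct dispatch.
func (b *bypassBreaker) TryBypass() bool { return !b.tripped.Load() }

// Refused trips the breaker when a worker turns down a job.
func (b *bypassBreaker) Refused() { b.tripped.Store(true) }

// WorkerFree resets the breaker when a worker finishes a job.
func (b *bypassBreaker) WorkerFree() { b.tripped.Store(false) }
```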
But I’m shocked that signing a JWT was having such an impact. One of my colleagues suggested building the service with GOAMD64=v3, which I’m guessing would let the compiler use newer x86-64 instructions in the compiled binary (the v3 level includes things like AVX2 and BMI2). I’d be interested to see if this helps.
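Something like the benchmark below would let me measure that directly. I’m assuming RS256 via golang-jwt here, which may not match what the service actually uses:

```go
package jwtbench

import (
	"crypto/rand"
	"crypto/rsa"
	"testing"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

// BenchmarkSignJWT times one RS256 signature per iteration, so two runs
// of the benchmark can be compared across GOAMD64 levels.
func BenchmarkSignJWT(b *testing.B) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		b.Fatal(err)
	}
	claims := jwt.MapClaims{
		"sub": "worker-42", // hypothetical subject
		"exp": time.Now().Add(5 * time.Minute).Unix(),
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		token := jwt.NewWithClaims(jwt.SigningMethodRS256, claims)
		if _, err := token.SignedString(key); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running it plain and then with `GOAMD64=v3 go test -bench=SignJWT` should show whether the wider instruction set actually moves the needle.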