Acute Performance Issues

Acute Performance Issues

A new release was rolled out in the production environment over the weekend.

Next morning after a few hours of usual operations, the production DB server was set on fire.

The CPU utilization which usually remained within 40%-60% for this box suddenly jumped up to 95% and would not come down.

The client APIs which generally would run with a response time of 400 ms were now taking more than 900 ms.

SE team of Sr. DBAs was engaged and initiated investigation as they were informed on this issue.

After detailed investigation the team identified a MSTVF (multi statement table valued function) which was running with a bad execution plan. This function was being called by several different stored procedures and would get executed anywhere between 25k to 40k times every single hour.

With the release that went in over the weekend a part of the code change was now inducing an index scan on a particular table which was leading to persistent high cpu utilization.

Pre-release this function call would take anywhere between 5ms - 10ms of cpu time and post the release the same function call was taking anywhere between 300ms - 550ms.

This jump of 60x to 100x in the cpu utilization was good enough to set the box ablaze.

The SE team recommended a particular index to curb the situation immediately and then later helped with refactoring the code to bring the down the cpu utilization naturally in the following release.