In one of our recent projects we had alerts set up for CPU and memory spikes, with CPU alert thresholds at 10% and 50%. We got the 10% alert periodically when the app was in active use, and an occasional 50% alert that would resolve itself within 5 minutes. Notice how the alerts are activated and resolved all within 5 minutes.
However, one day we got a 50% CPU alert that did not get resolved. When no resolved email arrived for about 45 minutes, we logged onto the portal and saw the CPU at 99% usage even though no one was using the system. A look at the HTTP logs also did not indicate any heavy recent usage. Thinking it might just be a system glitch, we quickly recycled the application pool using the Azure support portal's "Restart w3wp" and watched the CPU and memory usage for the next 30 minutes or so. Everything looked good, so we were convinced it was a system glitch!
About 2 days before going live on production we got another alert, with the same behavior, on our staging environment. Now we knew we were in trouble! We quickly came to a few conclusions:
- The system overall is stable, since we can go days with heavy usage without entering this 99% CPU usage state
- There is no continuous memory leak, since CPU and memory spike together and suddenly, and once the pool is recycled they become steady again
- The most likely cause was a specific piece of functionality sending the system into a loop, one the client seemed to be using more than we did
Our first step was to diagnose the system while it was still in the faulty state.
- Open the Azure support portal for your app and click Analyze → Diagnose Now to begin the diagnosis!
Once it's done analyzing, it presents raw data as well as a VERY useful Analysis Status. The following is what helped us:
- Open up the HTTP Logs Analysis to see the "Top 20 Longest Processing Requests". In our case it was an API call, "/api/Calender/GetFirstValidCatteEntriesOnline", on 2016-03-28 at 16:04:21.
- Open up the HTTP log file and look for a matching entry at that exact time.
The last field of the log entry shows the time the response took: 230001 ms!! So we knew our culprit.
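Finding the slowest requests in these logs can also be done by hand. As a minimal sketch (assuming the default W3C field layout for IIS HTTP logs, where `time-taken` in milliseconds is the last column; check your log's `#Fields:` header and adjust the index if it differs):

```python
# Scan IIS W3C HTTP log lines for the slowest requests.
# Assumes time-taken (ms) is the last field; adjust if your
# "#Fields:" header lists a different order.

def slowest_requests(log_lines, top_n=5):
    """Return (time_taken_ms, raw_line) pairs, slowest first."""
    results = []
    for line in log_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip W3C comment/header lines and blanks
        fields = line.split()
        try:
            time_taken = int(fields[-1])  # final field: time-taken in ms
        except ValueError:
            continue  # line doesn't end in a number; skip it
        results.append((time_taken, line.strip()))
    results.sort(reverse=True)
    return results[:top_n]

# Tiny sample in the same shape as the entry from the post:
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2016-03-28 16:04:21 GET /api/Calender/GetFirstValidCatteEntriesOnline 200 230001",
    "2016-03-28 16:03:10 GET /api/Home/Index 200 120",
]
print(slowest_requests(sample, top_n=1))
```

Running this over the sample surfaces the 230001 ms call at the top, which is exactly what the portal's "Top 20 Longest Processing Requests" view did for us.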
- The next problem was reproducing the issue. Notice that the HTTP log file did not record the parameters passed to the API; instead the parameters show up as — 0 ——. Parameters are reported properly in all other cases, so while the HTTP logs helped us confirm the culprit, we could not reproduce the issue because we didn't know which parameters were sending it into a loop!
- Next we checked the Memory Analysis results, and as expected it reported that the CPU was running at full capacity and that one API call seemed to be holding the entire system to ransom!
Clicking on Client Connections took us to the exact call, and there we could see the exact query parameters!!!
When we hit our local deployment with the exact query string, the system went into an infinite while loop and we had found our bug!
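The bug followed a classic pattern. A hypothetical reconstruction (the names and structure are invented for illustration, since the original source code isn't shown here): a while loop whose advance step only runs in one branch, so certain inputs leave the loop variable unchanged forever and the request pegs a core.

```python
# Hypothetical sketch of the infinite-while-loop pattern (invented
# names; not the actual project code): the fix is to advance the
# loop variable unconditionally on every iteration.

def first_valid_entry(entries, start_index):
    """Return the first entry marked valid at or after start_index."""
    i = start_index
    while i < len(entries):
        if entries[i].get("valid"):
            return entries[i]
        # BUG (before the fix): the index was only advanced inside a
        # conditional branch; for some query parameters that branch
        # never ran, so `i` never changed and the loop spun forever,
        # pinning the CPU at ~99% just like we saw in the portal.
        i += 1  # fixed: always make progress
    return None

entries = [{"valid": False}, {"valid": True}]
print(first_valid_entry(entries, 0))  # → {'valid': True}
```

Once we could reproduce the hang locally with the recovered query string, spotting the non-advancing loop and making it progress on every iteration was the easy part.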
So the next time you see a sudden, persistent CPU spike in your system, use these steps to figure out the faulty method!! Awesome collection of data by the Azure team!!
Oh, and once you are done analyzing, please clean up your VM's file system, because diagnostics just put a huge memory dump there!! More info here.
Until next time!