A key skill in being a good developer is the ability to effectively debug software problems. Your skill at debugging improves with experience. When I think back to the start of my career as a developer, I sucked at debugging. Over time I have built up a number of strategies to get to the bottom of a problem and fix it quickly and efficiently.
The project I am working on at the moment is about to get released out in the wild. We have now entered that stressful time as bugs are ironed out. We race to complete the last mile. Stress is high and stress leads to ineffective debugging unless you stick to the strategies for debugging you have built up.
Yesterday was one of those days when the development team was stopped from working due to a service issue. Something which I thought was working suddenly stopped working, and I had to quickly diagnose and fix the issue.
People debug issues in different ways. The way I approach debugging may not be what works for you but here is how I approach it with an example.
Steps to effectively solve software problems and debug code
Understand the problem
Replicate the problem
Make it smaller
Question your assumptions
Understand the problem
I first try to understand the problem. The reported problem may not be of help in doing this. Incident reports often contain detail that doesn’t help in the slightest.
The data displayed on the results page is incorrect.
This is an awful description. Yes it tells you the data displayed are incorrect but there is nothing about how to create the problem, what it did show and what it should have shown.
Bad incident reports don’t help at all but are often the reality. It doesn’t inspire you to fix the problem when before you can even start you have to spend all your time getting to the bottom of what the person who raised the issue was trying to highlight in the incident report.
The incident raised by the development team consuming the service I was responsible for wasn’t too bad.
I’m getting a 504 Gateway timeout for save requests on the accounts service.
There are many levels of understanding when it comes to debugging. Initially, though, it is enough to understand what the general problem is. In this case save requests to the accounts service are resulting in 504 Gateway Timeout errors. Unfortunately the incident report doesn’t say how often this happens or under what conditions. Is it every time, sometimes, under load?
Replicate the problem
It’s now time to see if we can replicate the problem. In the case of this incident it was an API issue so I used Postman to send a request to the service to see if I could reproduce the issue.
In this case a 504 error resulted for every request. Sometimes you will find that only a percentage of requests fail. Postman Collection Runner can really help if the error does not always occur. Collection Runner allows you to repeat a call for a specified number of iterations. It allows you to set a test for each call, so you can quickly determine the number of successful and failed runs.
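If Postman isn't to hand, the same idea can be roughed out in a few lines of JavaScript. This is only a sketch of the repeat-and-count approach, not how Collection Runner works internally. The names `repeatCall` and `sendRequest` are my own, and `sendRequest` is assumed to be any async function that resolves to an HTTP status code.

```javascript
// Repeat a call a fixed number of times and tally the outcomes.
// `sendRequest` is a hypothetical async function returning an HTTP status code.
async function repeatCall(sendRequest, iterations) {
  const tally = { passed: 0, failed: 0, statuses: {} };
  for (let i = 0; i < iterations; i++) {
    let status;
    try {
      status = await sendRequest(i);
    } catch (err) {
      status = "network-error"; // the call never produced a response
    }
    // Count how often each status was seen.
    tally.statuses[status] = (tally.statuses[status] || 0) + 1;
    // The "test" for each call: any 2xx status counts as a pass.
    if (typeof status === "number" && status >= 200 && status < 300) {
      tally.passed++;
    } else {
      tally.failed++;
    }
  }
  return tally;
}
```

In Node 18 or later, where `fetch` is built in, a call might look like `repeatCall(() => fetch(url, { method: "POST" }).then((res) => res.status), 50)`, where `url` is a placeholder for the endpoint under test. A tally of 50 failed and 0 passed tells you the failure is consistent; a mixed tally tells you it is intermittent.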
Make it smaller
Programs you write are often very complex and large. If faced with an error don’t be one of those people who run their program through the debugger, tracing variables until you think you see the problem. Make the problem smaller first. Get rid of extraneous code which isn’t part of your problem. If you can reproduce a problem with one line of code then that makes it very easy to fix. So start from where the problem occurs and work backwards. Isolate the problem.
In the case of the issue I was looking at, the service is exposed by Apigee. Apigee is used to expose a public interface to services, so, again using Postman, I tried hitting the internal gateway directly. This time the call worked :-] Now I know the problem isn’t the service itself; it is within the public interface to the service.
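Working from the inside out like this can be written down as a small routine. The sketch below is illustrative only: `firstFailingLayer` is a made-up name, and the `probe` functions are hard-coded stand-ins for the Postman calls I made by hand.

```javascript
// Layers are listed from the innermost (closest to the service) outwards.
// Each `probe` is a hypothetical function that returns true when a request
// sent directly to that layer succeeds.
function firstFailingLayer(layers) {
  for (const layer of layers) {
    if (!layer.probe()) {
      return layer.name; // every layer inside this one has already passed
    }
  }
  return null; // nothing failed, so the problem was not reproduced
}

// Stand-in results matching what I saw in Postman.
const layers = [
  { name: "internal gateway", probe: () => true },  // direct call worked
  { name: "public interface", probe: () => false }, // call returned a 504
];

console.log(firstFailingLayer(layers)); // prints "public interface"
```

The first layer that fails is where to look next, because everything inside it has already been shown to work.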
Question your assumptions
I find it useful to maintain a list of assumptions and prove them as fact to narrow down the problem. You can do this in your head, on paper, or maintain the list digitally. It doesn’t matter; just make sure you don’t chain a list of assumptions together and build a "fact" which sends you off down the wrong path.
For the problem I was working on, I knew the problem was happening when I hit the public interface to the service but not when I hit the internal gateway to the service. I wanted to be certain that it happened 100% of the time, so I used Postman Collection Runner to test both the Internal Gateway and the Public Gateway. The test confirmed my assumption: 100% of calls to the Public Gateway failed and 100% of calls to the Internal Gateway succeeded.
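If you prefer to keep the list digitally, it doesn't need to be anything fancy. Here is a minimal sketch; `createAssumptionLog` and its methods are names I made up for illustration. The point is simply that an assumption stays "unverified" until a test has actually been run against it.

```javascript
// A tiny assumption log. Each entry starts as "unverified" and only changes
// status once a test has been run and its outcome recorded.
function createAssumptionLog() {
  const entries = [];
  const textsWithStatus = (status) =>
    entries.filter((e) => e.status === status).map((e) => e.text);
  return {
    assume(text) {
      entries.push({ text, status: "unverified" });
      return entries.length - 1; // id used to record the outcome later
    },
    record(id, held) {
      entries[id].status = held ? "proven" : "disproven";
    },
    facts: () => textsWithStatus("proven"),
    unverified: () => textsWithStatus("unverified"),
  };
}

const assumptions = createAssumptionLog();
const publicFails = assumptions.assume("Every call to the public gateway fails");
const internalWorks = assumptions.assume("Every call to the internal gateway succeeds");
const serviceBroken = assumptions.assume("The accounts service itself is broken");

assumptions.record(publicFails, true);    // confirmed with Collection Runner
assumptions.record(internalWorks, true);  // confirmed with Collection Runner
assumptions.record(serviceBroken, false); // direct calls to the service worked

console.log(assumptions.facts());
```

Only what comes back from `facts()` should be used to decide the next step; anything still in `unverified()` needs a test first.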
Cycling through the stages
During this phase of debugging you need to iterate through each of the steps, increasing your understanding as you complete each one. Making the problem smaller will help you understand it. You will make a series of assumptions, and you need to question each of them. Replicate the problem with the smallest piece of code possible and repeat.
I opened the debugger in Apigee and ran my test in Postman again. The Apigee debugger highlighted the point in the service which is resulting in a 504 Gateway Timeout Error.
The problem was occurring within a Service Callout policy. The Service Callout policy lets you call out to an external service from your API flow. A Service Callout policy is used in conjunction with an Assign Message policy to build up the request message. The code below shows the policies which are used during a save request in Apigee.
The 504 Gateway Timeout Error occurs within the Callout.SaveAccountRequest. The AssignMessage.GenerateSaveRequest builds up the request object which is used within the Callout.SaveAccountRequest. So in terms of “Make it smaller” we only need to worry about two policies: AssignMessage.GenerateSaveRequest and Callout.SaveAccountRequest.
I decided to make an assumption about what was going wrong. Requests without a User Agent are thrown out by the Internal Gateway so one possible reason for the 504 Gateway Timeout Error is there is no User Agent on the request.
If we make an assumption we must try to prove or disprove it. The request is built up by the policy AssignMessage.GenerateSaveRequest. On the surface it looks like the User Agent is set in the AssignMessage.GenerateSaveRequest policy.
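// Read the request message built by the Assign Message policy.
var requestObject = JSON.parse(context.getVariable("SaveAccountItemRequest.content"));
// Bracket notation is needed here: header.User-Agent would parse as "header.User minus Agent".
var userAgent = requestObject.header["User-Agent"];
// Temporary debugging aid: throw so the value is surfaced as an error.
throw "User-Agent: " + userAgent;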
When running the request again using Postman, no 504 Gateway Timeout error was encountered and the request worked. My assumption was correct: the User-Agent isn’t getting set correctly. This also narrows down where the problem lies: it is not the Callout policy but the Assign Message policy.
Looking at the Assign Message Policy again with this knowledge there is an obvious typo. On line 3 the AssignTo variable reads SaveAccountRequest instead of SaveAccountItemRequest.
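A mismatch like this is easier to spot if the step that reads the variable fails loudly when the name it expects isn't there. The sketch below is plain JavaScript to show the idea, not a drop-in Apigee policy. `requireVariable` and `makeContext` are hypothetical helpers, `makeContext` is only a stand-in for the real Apigee `context` object, and I'm assuming that looking up an unset variable gives back null or undefined.

```javascript
// Stand-in for the flow's variable store; NOT the real Apigee `context` object.
function makeContext(variables) {
  return {
    getVariable: (name) => (name in variables ? variables[name] : null),
  };
}

// Hypothetical guard: read a variable, or fail with a message that names it.
function requireVariable(context, name) {
  const value = context.getVariable(name);
  if (value === null || value === undefined) {
    throw new Error("Expected flow variable is not set: " + name);
  }
  return value;
}

// The Assign Message step wrote to the misspelt name...
const flowContext = makeContext({ "SaveAccountRequest.content": "{}" });

// ...so reading the name the callout expects fails straight away.
try {
  requireVariable(flowContext, "SaveAccountItemRequest.content");
} catch (err) {
  console.log(err.message);
}
```

An error that names the missing variable points straight at the typo, rather than leaving you to work backwards from a 504.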