Friday, May 1, 2009

When it comes to debugging, don't trust anyone!

I was assigned a bug whose report simply claimed that a common feature, known to have worked for a long time, was broken on a specific hardware platform of the product. Being new to the source code and the product, the first thing I did was talk to the original software developer. He vehemently maintained that the source code had not changed in many years, so there was no point in spending time looking for the problem there. His theory was that recent changes in some other component traversed by the incoming network traffic were causing this misbehavior as a side effect.

I spent painstakingly long hours comparing multiple versions of the software builds from around the date the issue was reported. I compared the software changes made between the last date the feature was found working and the date it actually stopped working. Note: the latter date could be even earlier than the date the issue was observed and reported by the tester. The original developer then undid the changes he thought might be causing the issue, and I tried those builds on the test setup once again, to no avail! It was already Friday.

Over the weekend I decided to drop all the theories, start with the basics, and try to identify the problem by eliminating one software component at a time. Unfortunately for me, the original developer did just that and had already found the root cause: an incorrect acquisition of a synchronization spinlock! It had been there since day one, when he introduced multi-threading in the software for the new multicore hardware platform.

Conclusions
  • Never trust anyone when it comes to analyzing software bugs! Not even the original developer...
  • Managers should allow engineers handling issues sufficient time to form their own theories and then go about the analysis, instead of coercing them to follow specific techniques the managers have seen work earlier. One cannot use the same tool and approach for every problem; trying to do so wastes significant time. After all, it is they who hired this engineer, so the engineer's own judgment should be trusted and given a chance. In this situation, my not being the original developer in fact put me in the best position to question the correctness of the source code.
  • Such a basic issue should have been caught during unit testing by the original developer.
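
As an illustration of the kind of unit test that could expose a locking mistake early, here is a minimal sketch. It is not the product's code: the post does not say what the faulty spinlock acquisition looked like, so the PacketCounter class and the stress helper are hypothetical names, and Python's threading.Lock stands in for a spinlock. The idea is simply to hammer a shared structure from several threads and check that no update is lost.

```python
import threading


class PacketCounter:
    """Hypothetical shared counter updated by several worker threads."""

    def __init__(self):
        # threading.Lock stands in for the spinlock mentioned in the post.
        self._lock = threading.Lock()
        self.count = 0

    def increment(self):
        # Hold the lock around the whole read-modify-write sequence.
        with self._lock:
            self.count += 1


def stress(counter, threads=8, iterations=10_000):
    """Update the counter from several threads and return the final value."""

    def worker():
        for _ in range(iterations):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.count
```

A test would then assert that stress(PacketCounter(), threads, iterations) equals threads * iterations. Races are timing dependent, so a passing run does not prove the locking is right, but a test like this run on the multicore target has a fair chance of catching a lock that is taken incorrectly.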

Don't trust a bug reporter

A couple of weeks ago I was assigned a year-old bug that had changed hands many times, going back and forth between developers from different teams and the reporter.

The issue
The reported problem was this: when an IPv6 ICMP packet larger than the MTU of an outgoing interface of an intermediate router between the source and destination is sent with the don't fragment (DF) flag set, the packet does not reach its destination, as expected. However, the corresponding statistics on the intermediate router that is expected to discard such packets, namely the number of received ICMP packets and the number of those exceeding the MTU, are not updated.
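
The behavior the reporter expected can be sketched roughly as follows. This is only a model of the report as described above, not the router's actual code; RouterStats, Packet and handle_outbound are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RouterStats:
    """Hypothetical counters the reporter expected to see updated."""
    icmp_received: int = 0
    mtu_exceeded: int = 0


@dataclass
class Packet:
    size: int            # packet size in bytes
    dont_fragment: bool  # the DF flag from the bug report
    is_icmp: bool = True


def handle_outbound(packet, out_mtu, stats):
    """Return True if the packet is forwarded, False if it is discarded."""
    if packet.is_icmp:
        stats.icmp_received += 1
    if packet.size > out_mtu and packet.dont_fragment:
        # Too big and may not be fragmented: count it, then discard it.
        stats.mtu_exceeded += 1
        return False
    return True
```

With an outgoing MTU of 1500, a 1600-byte packet with DF set should be discarded and both counters should go up by one. The complaint in the bug was that the discard apparently happened while the counters stayed where they were.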

My struggle
Being new to the portion of the source code that handles exception packets, I spent time understanding the packet path and added debug logs to trace the path taken by incoming packets. Unfortunately, because of the specific router configuration required, and because one of the normal interfaces was repurposed for management traffic, the router being debugged was also receiving traffic other than that required for the analysis. The management interface was to be used for uploading my debug software image.

My debug logs confirmed that the required ICMP and IP fragment handlers were installed properly. What puzzled me was that the packets were not seen reaching the fragment handler that checks for the DF flag. Interestingly, this issue was not seen for IPv4 traffic!

The setup involved sending traffic to a multicast group. Not being familiar with multicast, I had a third person from the bug reporter's team verify that the configuration was adequate for the purpose. To make my life difficult, the original bug reporter was not available when needed the most.

Troubleshooting
I decided to take help from a colleague who had been familiar with the product and source code for more than three years. Seeing that the originating router was sending the ICMP packets to a multicast group, he first checked the forwarding state and packet count for that group. The first clue was that the packet counters were not increasing. The second was that the forwarding state for the multicast group was reported as 'pruned'.
It turned out that the original configuration had omitted the rendezvous point (RP). In addition, the third router did not have any multicast listener configured; it only had PIM configured. This was the router that was supposed to receive the ICMP packets, at least those small enough for the intermediate router to forward through its outbound interface MTU. The missing listener caused the intermediate router to set the forwarding state to 'pruned'. Given these shortcomings in the configuration, the multicast setup consisting of the three routers had no idea where to send the packets and was silently dropping them.
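
A quick sanity check of the setup, done before any debugging, might have caught both omissions. The sketch below is purely illustrative: the dictionary layout, the role names and the check_multicast_setup function are assumptions made for this example and do not correspond to any real router CLI or API.

```python
def check_multicast_setup(setup, group):
    """Return a list of configuration problems; an empty list means none found."""
    problems = []

    # The setup in the post needed a rendezvous point (RP) to be configured.
    if not setup.get("rp_address"):
        problems.append("no rendezvous point (RP) configured")

    # Every router meant to receive the traffic needs a listener for the group.
    for router in setup.get("routers", []):
        if router.get("role") != "receiver":
            continue
        if group not in router.get("listener_groups", []):
            problems.append(
                f"{router['name']}: no multicast listener for group {group}"
            )
    return problems


# Hypothetical description of the three-router setup from the bug report.
broken_setup = {
    "rp_address": None,
    "routers": [
        {"name": "R1", "role": "source"},
        {"name": "R2", "role": "transit"},
        {"name": "R3", "role": "receiver", "listener_groups": []},
    ],
}
```

Calling check_multicast_setup(broken_setup, "ff0e::1") with this made-up group address reports both the missing RP and the missing listener on R3, the two findings that explain the 'pruned' state.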

Learnings
  • Before starting the analysis of any bug, first understand what the bug reporter intends to do.
  • Never trust their claims about the correctness of the setup configuration. Discovering its shortcomings, whether due to omission or to their incorrect understanding of features, will save us further needless and painful analysis using a debugger.
  • Being silent about error handling in one's software is a crime, for which developers will spend long, clueless hours discovering the problem the hard way.
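
To make the last point concrete, here is a minimal sketch of a forwarding decision that refuses to drop packets silently. It is not the product's code: forward_multicast, the drop_reasons counter and the state strings are hypothetical names, loosely modelled on the 'pruned' state from this bug.

```python
import logging
from collections import Counter

log = logging.getLogger("mcast")

# Hypothetical per-reason drop counters that an operator could inspect.
drop_reasons = Counter()


def forward_multicast(group, forwarding_state):
    """Return True if the packet is forwarded; otherwise record why not."""
    state = forwarding_state.get(group)
    if state is None:
        reason = "no_forwarding_entry"
    elif state == "pruned":
        reason = "pruned"
    else:
        return True

    # Never drop silently: count the drop and log the reason.
    drop_reasons[reason] += 1
    log.warning("dropping packet for group %s: %s", group, reason)
    return False
```

Had the router kept a counter like drop_reasons["pruned"] or written a log line of this kind, the 'pruned' state would have been visible from the first test run, without anyone needing a debug image.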