Friday, May 1, 2009

Don't trust a bug reporter

A couple of weeks ago I was assigned a year-old bug that had changed hands many times, going back and forth between developers on different teams and the reporter.

The issue
The reported problem: when an IPv6 ICMP packet larger than the MTU of an outgoing interface on an intermediate router between the source and destination is sent with the don't fragment (DF) flag set, the packet does not reach its destination, as expected. However, the corresponding statistics on the intermediate router, which is expected to discard such packets, are not updated: neither the number of received ICMP packets nor the number of those exceeding the MTU.
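To make the expected behaviour concrete, here is a minimal sketch of what the intermediate router was supposed to do. The names (`IcmpStats`, `handle_icmp_packet`) are made up for illustration and are not from the actual router code; the sketch only mirrors the two counters described in the bug report.

```python
from dataclasses import dataclass


@dataclass
class IcmpStats:
    """Hypothetical counters mirroring the two statistics in the bug report."""
    received: int = 0
    exceeded_mtu: int = 0


def handle_icmp_packet(size: int, egress_mtu: int, dont_fragment: bool,
                       stats: IcmpStats) -> bool:
    """Return True if the packet would be forwarded, False if discarded."""
    # Every ICMP packet that reaches the handler is counted as received.
    stats.received += 1
    # A packet too big for the outgoing interface that must not be
    # fragmented is discarded, and the discard is counted.
    if dont_fragment and size > egress_mtu:
        stats.exceeded_mtu += 1
        return False
    return True
```

In the reported bug neither counter moved, which suggests the packets never got as far as a handler like this one.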

My struggle
Being new to the part of the source code that handles exception packets, I spent time understanding the packet path and added debug logs to trace the path taken by incoming packets. Unfortunately, because of the specific router configuration required, and because one of the normal interfaces had been repurposed for management traffic, the router being debugged was also receiving traffic other than what the analysis needed. The management interface was to be used for uploading my debug software image.

My debug logs confirmed that the necessary ICMP and IP fragment handlers were installed properly. What puzzled me was that the packets never reached the fragment handler that checks for the DF flag. Interestingly, the issue was not seen for IPv4 traffic!

The setup involved sending traffic to a multicast group. Not being familiar with multicast, I had a third person from the bug reporter's team verify that the configuration was adequate for the purpose. To make my life difficult, the original bug reporter was not available when needed the most.

Troubleshooting
I decided to get help from a colleague who had worked with the product and its source code for more than three years. Seeing that the originating router was sending the ICMP packets to a multicast group, he first checked the forwarding state and packet count for that group. The first clue was that the packet counters were not increasing. The second was that the forwarding state for the multicast group was reported as 'pruned'.
It turned out that the original configuration had omitted the rendezvous point (RP). In addition, the third router, which was supposed to receive the ICMP packets (had the intermediate router forwarded those smaller than its outbound interface MTU), did not have any multicast listener configured; it only had PIM configured. This caused the intermediate router to set the forwarding state to 'pruned'. Given these shortcomings in the configuration, the multicast setup of three routers had no idea where to send the packets and was silently dropping them.
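The silent drop can be pictured with a toy forwarding decision. This is a deliberately simplified sketch, not how PIM or the router's forwarding code actually works; `DropReason` and `forward_multicast` are hypothetical names. It shows the difference a per-reason drop counter would have made: the two configuration gaps would have shown up in the statistics instead of vanishing.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List


class DropReason(Enum):
    """Hypothetical drop reasons matching the two configuration gaps."""
    NO_RENDEZVOUS_POINT = "no rendezvous point configured"
    NO_LISTENERS = "no downstream listeners (pruned)"


def forward_multicast(rp_configured: bool, listeners: Iterable[str],
                      drops: Counter) -> List[str]:
    """Return the interfaces to forward on, recording why a packet is dropped."""
    if not rp_configured:
        drops[DropReason.NO_RENDEZVOUS_POINT] += 1
        return []
    outgoing = sorted(listeners)
    if not outgoing:
        # Nobody downstream asked for the group: the state is 'pruned'.
        drops[DropReason.NO_LISTENERS] += 1
        return []
    return outgoing
```

With a counter like `drops` exposed next to the packet statistics, a glance at it would have pointed at the configuration rather than at the fragment handler.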

Learnings
  • Before starting the analysis of any bug, first understand what the bug reporter intends to do.
  • Never trust their claims about the correctness of the setup configuration. Discovering its shortcomings early, whether they come from omission or from an incorrect understanding of features, saves needless and painful analysis with a debugger.
  • Being silent about error handling in one's software is a crime: developers will spend long, clueless hours discovering the problem the hard way.
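The second lesson can be turned into a habit: check the setup before reaching for the debugger. Below is a rough sketch of such a pre-flight check for this particular bug. The configuration keys (`rendezvous_point`, `listeners`) and the function name are assumptions made for the example, not a real router configuration format.

```python
from typing import Dict, List


def check_multicast_setup(config: Dict[str, object]) -> List[str]:
    """Return a list of configuration problems; an empty list means none found."""
    problems = []
    # The original setup had no rendezvous point at all.
    if not config.get("rendezvous_point"):
        problems.append("no rendezvous point (RP) configured")
    # The receiving router had PIM but no multicast listener.
    if not config.get("listeners"):
        problems.append("no multicast listener on the receiving router")
    return problems
```

Run against the setup as reported, a check like this would have returned both problems in seconds.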
