Debugging experiences

Sunday, October 7, 2012

Voice, video streaming issue with Google+, Skype (fixed)

It was my router! It was incorrectly identifying genuine audio/video packets (UDP) as part of an ongoing denial of service or port scan attack from the Internet, and discarding them. That led to large gaps in real-time audio and poor video quality, so much so that I was unable to carry out any meaningful conversation. The remote party was able to see and hear us fine. Initially I blamed everything else in my setup that I could think of, including the net connectivity.

Audio/ video packets incorrectly identified as attack

Fortunately, there's a setting to disable the behavior. After disabling it, I tested the call quality using Skype's test server and it was fabulous! I was just lucky that some of the previous Skype sessions had worked fine. I admit, some sessions did have this issue; just not as consistently as with Google+. I also tested with Gmail video chat. What a relief to know the root cause, and also why its occurrence was so unpredictable.

Disable DoS/ port scan detection

After this fix, even video playback, like that on YouTube, goes smoothly without awkward halts in between.

The particular router model is more than four years old. So, the issue may not be relevant in recent devices with improved software implementation of DoS/ port scan detection.

Conclusions

Always good to provide some kind of logs for the users to figure out what is going on, like Netgear provided in this case.
It is unclear why the router would classify packets that are part of an already established flow as part of DoS or port scan attempts. It would classify even packets coming from Google DNS (8.8.8.8 or 8.8.4.4) as port scan.
Fortunately Netgear has already made the source code available publicly. Will share here if I find something.
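
To make the puzzle above concrete, here is a minimal sketch of the kind of check one might expect a stateful router to apply before running scan heuristics: remember flows opened from the LAN side and treat matching inbound packets as replies. This is not Netgear's code; the Packet and FlowTable names, the LAN address and the ports are all made up for illustration, and NAT is ignored.

```python
from typing import NamedTuple, Set


class Packet(NamedTuple):
    proto: str
    src: str
    sport: int
    dst: str
    dport: int


class FlowTable:
    """Remembers flows that were initiated from the LAN side (hypothetical)."""

    def __init__(self) -> None:
        self._flows: Set[Packet] = set()

    def note_outbound(self, pkt: Packet) -> None:
        # Record the 5-tuple of every packet leaving the LAN.
        self._flows.add(pkt)

    def is_reply(self, pkt: Packet) -> bool:
        # An inbound packet is a reply if its reversed 5-tuple went out earlier.
        reverse = Packet(pkt.proto, pkt.dst, pkt.dport, pkt.src, pkt.sport)
        return reverse in self._flows


def should_inspect_for_scan(table: FlowTable, inbound: Packet) -> bool:
    # Only unsolicited inbound packets are candidates for scan/DoS heuristics.
    return not table.is_reply(inbound)


table = FlowTable()
# Hypothetical LAN host asking Google DNS a question.
table.note_outbound(Packet("udp", "192.168.0.2", 40000, "8.8.8.8", 53))
dns_reply = Packet("udp", "8.8.8.8", 53, "192.168.0.2", 40000)
unsolicited = Packet("udp", "203.0.113.9", 53, "192.168.0.2", 40001)
print(should_inspect_for_scan(table, dns_reply))    # False
print(should_inspect_for_scan(table, unsolicited))  # True
```

With a check like this, the reply from 8.8.8.8 would never reach the port scan logic, which is why the observed behavior looked so odd to me.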

Monday, October 12, 2009

CONFIG_RELOCATABLE fails booting linux-2.6.31.3 kernel

After compiling the recently brewed linux-2.6.31.3 kernel, I noticed that the resulting binary failed to boot. After I selected the image from the GRUB boot menu, the PC would just reset without any message.

Fortunately I had my previous configuration saved to compare against. Also, I knew what I had explicitly changed after invoking 'make oldconfig'. It was this little guy here:

CONFIG_RELOCATABLE

Looks like it is not working fine with my hardware configuration. Disabling it in the .config using xconfig fixes it. Thought I should share in case someone else is facing this, too.
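
Comparing a saved configuration against a new one can also be scripted. The following is a rough sketch, assuming the usual .config line shapes (CONFIG_X=value and "# CONFIG_X is not set"); the two sample fragments and the function names are made up for illustration and are not my actual configuration.

```python
import re
from typing import Dict, Optional, Tuple

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> Dict[str, str]:
    """Map each option to its value; options marked 'is not set' map to 'n'."""
    options: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old: str, new: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {option: (old_value, new_value)} for every option that differs."""
    a, b = parse_config(old), parse_config(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Hypothetical fragments, only to show the output shape.
old_fragment = "# CONFIG_RELOCATABLE is not set\nCONFIG_SMP=y\n"
new_fragment = "CONFIG_RELOCATABLE=y\nCONFIG_SMP=y\n"
print(diff_configs(old_fragment, new_fragment))
# {'CONFIG_RELOCATABLE': ('n', 'y')}
```

A short list of differences like this makes it much easier to spot the one option worth reverting first.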

Oh, and I decided to give the following a try for a more responsive desktop environment:

# CONFIG_PREEMPT_VOLUNTARY is not set

CONFIG_PREEMPT=y

Yet to think of a test to feel the difference. At least the kernel boots :-)

Friday, May 1, 2009

When it comes to debugging, don't trust anyone!

I was assigned a bug whose report simply claimed that a common feature, known to have been working for a long time, was broken on a specific hardware platform of the product. Being new to the source code and the product, the first thing I did was talk to the original software developer. He vehemently maintained that the source code had not changed in many years, so there was no point in spending time looking for the problem there. His theory was that recent changes in some other component traversed by the incoming network traffic were causing this misbehavior as a side effect.

I spent painstakingly long hours comparing multiple versions of the software builds from around the date the issue was reported. I compared the software changes made between the last date it was found working and the date it actually stopped working. Note: the latter date could be even earlier than the date the issue was observed and reported by the tester. The original developer then undid the changes he thought might be causing the issue, and I tried those on the test setup once again, to no avail! It was a Friday already.

For the weekend I had decided to drop all the theories, start with the basics, and identify the problem by eliminating one software component at a time. Unfortunately for me, the original developer did just that and had already found the root cause--an incorrect acquisition of a synchronization spinlock! It had been there since day one, when he introduced multi-threading in the software for the new multicore hardware platform.

Conclusions

Never trust anyone when it comes to analysing software bugs! Not even the original developer...
Managers should allow engineers handling issues sufficient time to form their own theories and then go about the analysis, instead of coercing them to follow specific techniques they have seen work earlier. One cannot use the same tool and approach for every problem; that leads to significant wasted time. After all, they are the ones who hired this engineer, so the engineer's own judgment should be trusted and given a chance. In this situation, my not being the original developer in fact put me in the best position to question the correctness of the source code.
Such a basic issue should have been caught during unit testing by the original developer.

Don't trust a bug reporter

A couple of weeks ago I was assigned a year-old bug that had changed hands many times, going back and forth between developers from different teams and the reporter.

The issue
The reported problem was that when an IPv6 ICMP packet larger than the MTU of an outgoing interface of an intermediate router between the source and destination is sent with the don't fragment (DF) flag set, the packet does not reach its destination, which is expected. However, the corresponding statistics for the number of received ICMP packets and the number of those exceeding the MTU are not updated on the intermediate router that is expected to discard such packets.

My struggle
Being new to the portion of the source code that handles exception packets, I spent time understanding the packet path and added debug logs to trace the path taken by incoming packets. Unfortunately, because of the specific router configuration required, and because one of the normal interfaces was repurposed for management traffic, the router being debugged was also receiving traffic other than what the analysis required. The management interface was needed for uploading my debug software image.

My debug logs confirmed that the necessary ICMP and IP fragment handlers were installed properly. What puzzled me was that the packets were not seen reaching the fragment handler that checks for the DF flag. Interestingly, this issue was not seen for IPv4 traffic!

The setup involved sending traffic to a multicast group. Not being familiar with multicast, I had a third person from the bug reporter's team verify that the configuration was adequate for the purpose. To make my life difficult, the original bug reporter was not available when needed the most.

Troubleshooting
I decided to take help from a colleague who had been working with the product and source code for more than three years. Seeing that the originating router was sending the ICMP packets to a multicast group, he first checked the forwarding state and packet count for that group. The first clue was that the packet counters were not increasing. The second was that the forwarding state for the multicast group was reported as 'pruned'.
It turned out that, first, the original configuration had omitted the rendezvous point (RP). Second, the third router, which was supposed to receive the ICMP packets--had the intermediate router allowed those smaller than its outbound interface MTU--did not have any multicast listener configured. It only had PIM configured. This caused the intermediate router to set the forwarding state to 'pruned'. Given these shortcomings in the configuration, the multicast setup of three routers had no idea where to send the packets, so it was silently dropping them.

Learnings

Before starting with analysis of any bug, first understand what the bug reporter intends to do.
Never trust their claims about the correctness of the setup configuration. Discovering its shortcomings, whether due to omission or to their incorrect understanding of features, will save us further needless and painful analysis using a debugger.
Handling errors silently in one's software is a crime: developers will spend long, clueless hours discovering the problem the hard way.
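
As an illustration of the last point, here is a toy sketch of a forwarding path that counts every drop under a named reason instead of discarding silently. It is not the product's code; the Forwarder class, the reason strings and the sample multicast group address are all invented for the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Packet(NamedTuple):
    size: int
    dont_fragment: bool
    group: str


class Forwarder:
    """Toy forwarding path that records why each packet was dropped."""

    def __init__(self, mtu: int, joined_groups: Iterable[str]) -> None:
        self.mtu = mtu
        self.joined_groups = set(joined_groups)
        self.drops: Counter = Counter()
        self.forwarded = 0

    def handle(self, pkt: Packet) -> Optional[str]:
        """Return None when the packet is forwarded, else the drop reason."""
        if pkt.group not in self.joined_groups:
            return self._drop("no-listener-for-group")
        if pkt.size > self.mtu and pkt.dont_fragment:
            return self._drop("exceeds-mtu-df-set")
        self.forwarded += 1
        return None

    def _drop(self, reason: str) -> str:
        # Every discard is counted under a reason an operator can query later.
        self.drops[reason] += 1
        return reason


# Hypothetical setup resembling the bug: nobody has joined the group.
router = Forwarder(mtu=1500, joined_groups=[])
print(router.handle(Packet(size=2000, dont_fragment=True, group="ff0e::1")))
print(dict(router.drops))
```

Had a counter like "no-listener-for-group" been visible on the intermediate router, the misconfiguration would have surfaced in minutes rather than after days with debug images.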

Thursday, June 5, 2008

Unable to use Windows for half minute after each login

So I was part of a software development team that worked on a network security product at a startup company. Before allowing access to the network, the product would perform network access control by asking the user for access credentials--a sort of captive portal. The issue observed was that when a user accessed the network from a Windows client, the network equipment to which the client was connected would leave the PC unusable for at least half a minute. For example, if one started a Web browser, the graphical user interface shell of Windows would freeze for about 30 seconds and would not interact with the user until the browser opened. This was repeated for any program that attempted to access the network, and the problem persisted until the user was authenticated for network access. By comparison, ordinary non-secured equipment would not even let the user feel its presence in the network--meaning no delays in accessing network resources or the user's own PC.

Before its release, the product was being used internally within the development labs. More than a hundred software development engineers and testers were connected to the network through this security product. Apparently no one complained about the half-minute delay or tried to find out why it happened; everyone just accepted it as a 'fact of life'...

I could not take it any more, all the more so as a member of the product's development team. I started my analysis by thinking about what happens between the Windows GUI shell coming up and the user invoking a program that attempts to access the network. I knew network traffic from Windows is quite chatty thanks to SMB and NBNS. Using common sense, I suggested that the network equipment explicitly send a connection reject message for all network access attempts from the network interface of an unauthenticated user. Earlier, the equipment was configured to simply discard such traffic. The common sense paid off! Until authenticated, the user would still not have access to the network--the desired behavior--but that no longer prevented the user from interacting with their OS shell.

When unauthenticated traffic was silently dropped, the chatty protocols on the user's client PC would keep retrying until their preconfigured timeout expired, which in this case was half a minute. Issue resolved without changing a single line of software source code. Phew! Needless to say I was quite happy that day.
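
The difference between dropping and rejecting can be put into a back-of-the-envelope model. The sketch below assumes a client that retries with a doubling timeout; the first timeout, the number of attempts and the round-trip time are hypothetical values picked for illustration, not measured Windows settings.

```python
def wait_when_dropped(first_timeout_s: float, attempts: int, backoff: float = 2.0) -> float:
    """Total time a client waits if every attempt is silently discarded."""
    total = 0.0
    timeout = first_timeout_s
    for _ in range(attempts):
        total += timeout      # the client sits out the full timeout each time
        timeout *= backoff    # and waits longer before the next retry
    return total


def wait_when_rejected(round_trip_s: float) -> float:
    """With an explicit reject the client learns the outcome after one round trip."""
    return round_trip_s


# Hypothetical numbers: 2 s first timeout, 4 attempts, 1 ms LAN round trip.
print(wait_when_dropped(2.0, 4))   # 30.0
print(wait_when_rejected(0.001))   # 0.001
```

Whatever the real retry parameters were, the shape is the same: silence costs the sum of all the timeouts, while a reject costs one round trip.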

Sunday, June 1, 2008

Netgear DG834G used with BSNL DataOne as ISP would not open certain Web sites

Hope the experience shared here will help fellow buyers of home DSL modem routers that are used with BSNL DataOne broadband Internet service.

Recently I purchased a Netgear DG834G wireless router with a built-in DSL modem. Setting it up proved a challenge for the following reasons:

By default the modem kept insisting on its pre-programmed automatic setup wizard. The wizard only offers to auto-detect ISP settings. It could not detect the settings for my ISP, and at the same time did not offer any way to specify the settings manually. The method described in the installation manual still followed the auto-detect path. Manually entering the address http://192.168.0.1 still took me to the same wizard.

Solution: a manual reset using the switch on the back of the device gave me access to the manual configuration method. Subsequently I entered the settings for the ISP and I was connected! It was a very satisfying experience thereafter. The range of this wireless router is pretty good. I tried the farthest corners of the house with multiple walls between the router and my PC. It just worked, with the lowest signal strength being of 'good' quality (the higher ones being 'very good' and 'excellent').

The joy was short-lived. Next is what went wrong and how I cracked it.
With this new router I could not open the Microsoft Web site! Huh. Then I tried a few other sites I use frequently. Same with download.com. Everything else would work fine: IMAP, POP, instant messenger, VPN access... Firefox would consistently not display the site contents. Internet Explorer would unreliably open the front page after taking too long. Clicking on any of the hyperlinks on the page would eventually time out without any data coming from the remote site. I was baffled! What is it about these Web sites and my new router?

Narrowing down the problem and discovering a solution was a very satisfying debugging experience. Those who are only interested in knowing the fix can skip to the last section, 'Conclusion!'.

Narrowing down the problem from the symptoms:
First I suspected the DNS settings. I obtained the settings from the ISP's Web site and set them manually in the router instead of using the default option of obtaining them from the ISP. That did not help. Someone suggested setting them manually on my PC. Still no luck.

The same sites would open with the ordinary modem provided by the ISP.

Suspecting something foul with the firmware, I downloaded the latest software image from the manufacturer's support Web site. When that did not help either, I started getting restless as a new buyer. I started thinking less rationally. I logged in to the office using VPN. That way I could access the problem Web sites. Now I knew the router did not have anything against getting data bytes from those Web sites :-) I was still doubtful about the DNS part. I decided to sleep on the problem to be able to think logically.

When I shared the experience with my kind friends and colleagues, I received the following suggestions, which helped me rule out certain possibilities:

"disable IPv6 access in your Internet browser on your PC" -- then why would the same browser open those problem sites with another modem?
"don't use your router as a DNS proxy; instead set DNS manually on your PC" -- same as above
"your router has bugs; get a replacement" -- hmm! That's easy. Then why do other Web sites open and other services work? What is so specific about these sites?

I was not convinced.

The analysis
I decided to dig deeper and understand what was happening behind the scenes. This was the turning point! I captured network traffic using Wireshark, a software tool that allows you to capture and analyze network traffic from/to your PC. I compared the traffic generated while accessing the sites that worked against that generated to/from the problem sites. I noticed that for the problem sites, the response to the HTTP GET request was not reaching my PC. The PC would keep retrying the same request for a certain time without any result and then give up. At least this confirmed that DNS resolution was working fine. So it was time to make some educated guesses as to why the reply from the remote site would not arrive!?

1) DNS request initiated from my PC as a result of entering the Web site address in the browser:
No. Time Source Destination Protocol Info
1 0.000000 192.168.x.y 218.248.255.139 DNS Standard query A www.microsoft.com

2) DNS response comes back with a list of IP addresses from where the site contents can be fetched. Notice the CNAME chain through akadns.net (Akamai)

No. Time Source Destination Protocol Info
2 0.011239 218.248.255.139 192.168.x.y DNS Standard query response CNAME toggle.www.ms.akadns.net CNAME g.www.ms.akadns.net CNAME lb1.www.ms.akadns.net A 207.46.193.254 A 207.46.19.190 A 207.46.19.254 A 207.46.192.254
3) From the list of IP addresses in the DNS response, my PC decides to initiate a TCP connection with one of them
No. Time Source Destination Protocol Info
3 0.014299 192.168.x.y 207.46.193.254 TCP rich-cp > http [SYN] Seq=0 Win=16384 Len=0 MSS=1460

No. Time Source Destination Protocol Info
4 0.444975 207.46.193.254 192.168.x.y TCP http > rich-cp [SYN, ACK] Seq=0 Ack=1 Win=8190 Len=0 MSS=1436

No. Time Source Destination Protocol Info
5 0.445056 192.168.x.y 207.46.193.254 TCP rich-cp > http [ACK] Seq=1 Ack=1 Win=17232 Len=0

4) TCP/IP connection established. Now the PC asks for HTTP data
No. Time Source Destination Protocol Info
6 0.445475 192.168.x.y 207.46.193.254 HTTP GET / HTTP/1.1

...
Internet Protocol, Src: 192.168.x.y (192.168.x.y), Dst: 207.46.193.254 (207.46.193.254)
...Total Length: 451
...Protocol: TCP (0x06)
...
Transmission Control Protocol, Src Port: rich-cp (2057), Dst Port: http (80), Seq: 1, Ack: 1, Len: 411
Hypertext Transfer Protocol

...0030 43 50 5c b1 00 00 47 45 54 20 2f 20 48 54 54 50 CP\...GET / HTTP
0040 2f 31 2e 31 0d 0a 48 6f 73 74 3a 20 77 77 77 2e /1.1..Host: www.
0050 6d 69 63 72 6f 73 6f 66 74 2e 63 6f 6d 0d 0a 55 microsoft.com..U

No. Time Source Destination Protocol Info

7 0.854522 207.46.193.254 192.168.x.y HTTP HTTP/1.1 302 Found (text/html)
...Internet Protocol, Src: 207.46.193.254 (207.46.193.254), Dst: 192.168.x.y (192.168.x.y)
...Total Length: 553
...Time to live: 108
Protocol: TCP (0x06)
...Transmission Control Protocol, Src Port: http (80), Dst Port: rich-cp (2057), Seq: 1, Ack: 412, Len: 513
Hypertext Transfer Protocol
Line-based text data: text/html

No. Time Source Destination Protocol Info
8 0.869677 192.168.x.y 207.46.193.254 HTTP GET /en/us/default.aspx HTTP/1.1

...
Internet Protocol, Src: 192.168.x.y (192.168.x.y), Dst: 207.46.193.254 (207.46.193.254)
...Total Length: 469
...Time to live: 128
Protocol: TCP (0x06)
...Transmission Control Protocol, Src Port: rich-cp (2057), Dst Port: http (80), Seq: 412, Ack: 514, Len: 429
Hypertext Transfer Protocol

...0030 41 4f fa 75 00 00 47 45 54 20 2f 65 6e 2f 75 73 AO.u..GET /en/us
0040 2f 64 65 66 61 75 6c 74 2e 61 73 70 78 20 48 54 /default.aspx HT
0050 54 50 2f 31 2e 31 0d 0a 48 6f 73 74 3a 20 77 77 TP/1.1..Host: ww
0060 77 2e 6d 69 63 72 6f 73 6f 66 74 2e 63 6f 6d 0d w.microsoft.com.

5) PC retries fetching the previously requested data as it did not receive a response for a certain time
No. Time Source Destination Protocol Info
9 3.600904 192.168.x.y 207.46.193.254 HTTP [TCP Retransmission] GET /en/us/default.aspx HTTP/1.1

...Internet Protocol, Src: 192.168.x.y (192.168.x.y), Dst: 207.46.193.254 (207.46.193.254)
...Time to live: 128
Protocol: TCP (0x06)
...Transmission Control Protocol, Src Port: rich-cp (2057), Dst Port: http (80), Seq: 412, Ack: 514, Len: 429
Hypertext Transfer Protocol

6) Problem detected: the TCP sequence number of the eventual response from the remote host shows that my PC did not receive the intermediate segments sent by the remote site.
No. Time Source Destination Protocol Info
10 4.028862 207.46.193.254 192.168.x.y TCP [TCP Previous segment lost] http > rich-cp [ACK] Seq=4822 Ack=841 Win=63430 Len=0

...Internet Protocol, Src: 207.46.193.254 (207.46.193.254), Dst: 192.168.x.y (192.168.x.y)
...Protocol: TCP (0x06)
...
Transmission Control Protocol, Src Port: http (80), Dst Port: rich-cp (2057), Seq: 4822, Ack: 841, Len: 0
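
The size of the hole in step 6 can be worked out from the numbers in the capture. This is a small sketch using Wireshark's relative sequence numbers: my PC had acknowledged up to 514 (frames 8 and 9), the server's next segment carried Seq=4822 (frame 10), and the server advertised MSS=1436 in its SYN, ACK (frame 4). The function names are mine, and sequence number wraparound is ignored.

```python
from typing import Tuple


def missing_bytes(next_expected_seq: int, received_seq: int) -> int:
    """Bytes the sender claims to have sent that the receiver never saw."""
    return max(0, received_seq - next_expected_seq)


def split_into_segments(gap: int, mss: int) -> Tuple[int, int]:
    """Return (number of full-size segments, leftover bytes) covering the gap."""
    return divmod(gap, mss)


gap = missing_bytes(next_expected_seq=514, received_seq=4822)
full, leftover = split_into_segments(gap, mss=1436)
print(gap, full, leftover)  # 4308 3 0
```

The gap is exactly three segments of the server's maximum size, which suggests that it was the full-size packets that went missing on the way to my PC.
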

Conclusion!
So what could have prevented the missing TCP segments from the remote site from being delivered? Probably the intermediate nodes between my router and the remote site. This pointed me to the MTU used by my router. Probably the ISP's intermediate IP forwarding nodes were not happy to receive the bigger TCP segments sent by the remote site. The default MTU used by my Netgear router was 1492 (the 1500-byte maximum MTU of Ethernet minus 8 bytes for PPPoE encapsulation). When I reduced the MTU to 1400 bytes, bingo! The problem sites opened up in a flash :-) I also noticed that Web traffic felt more responsive than its earlier sluggish self.
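
The MTU arithmetic above can be written down as a few lines of code. This is only a sketch and assumes plain IPv4 and TCP headers of 20 bytes each with no options; the function names are mine.

```python
ETHERNET_MTU = 1500    # maximum payload of a standard Ethernet frame
PPPOE_OVERHEAD = 8     # bytes taken by PPPoE encapsulation
IPV4_HEADER = 20       # assumes no IP options
TCP_HEADER = 20        # assumes no TCP options


def pppoe_mtu(link_mtu: int = ETHERNET_MTU) -> int:
    """MTU left for IP packets once PPPoE has taken its share."""
    return link_mtu - PPPOE_OVERHEAD


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IP packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def packet_size_for_mss(mss: int) -> int:
    """Size of the IP packet that carries a full segment of the given MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


print(pppoe_mtu())                # 1492, the router's default
print(mss_for_mtu(1400))          # 1360, payload per packet after my change
print(packet_size_for_mss(1436))  # 1476, full segments from the server in the trace
```

Under these assumptions a full-size segment from the Web server travels as a 1476-byte IP packet, so anything on the path that cannot pass packets of that size would swallow it.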

In the case of microsoft.com, I noticed that after the HTTP GET request that followed the initial redirect, there was no response from the other side. After a retransmission, only a partial response was captured. That response carried a TCP sequence number way beyond what my client was expecting and had acknowledged so far in the session. In other words, the requester did not get the data it was expecting, while the sender was saying 'I have sent all that to you'. This pointed me to packet drops by intermediate (ISP) nodes, possibly due to packets exceeding the MTU. This did not happen with other sites, perhaps because their servers did not respond aggressively enough to have response packets dropped for exceeding the MTU.

Later I experimented further and found that an MTU of 1460 is the maximum that works (the same number appears as the MSS my PC advertised in the three-way handshake with the Web server pasted above). Now that I knew the root cause, I searched the Web and found that multiple users of the product and the ISP service had reported this issue.
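
Hunting for the largest working MTU by hand is really a binary search. Here is a minimal sketch of that search; the works callback is hypothetical and stands for whatever check one uses (for example, setting the router MTU and trying to load a problem site), and it assumes that once a size fails, every larger size fails too.

```python
from typing import Callable


def largest_working_mtu(works: Callable[[int], bool], low: int, high: int) -> int:
    """Largest size in [low, high] for which works(size) is True.

    Assumes works(low) is True and that failures are monotonic:
    if a size fails, all larger sizes fail as well.
    """
    if not works(low):
        raise ValueError("even the smallest size fails")
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if works(mid):
            low = mid                 # mid works, look higher
        else:
            high = mid - 1            # mid fails, look lower
    return low


# Stand-in for a real test: pretend everything up to 1460 works, as I found.
print(largest_working_mtu(lambda size: size <= 1460, low=1400, high=1492))  # 1460
```

Between 1400 and 1492 this needs only a handful of trials instead of trying every value in turn.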

Netgear knowledge base articles also address it (article 1, article 2) in the context of AOL.
Another good description of the same issue with a formal name for it: path MTU discovery (a.k.a. PMTU).
List of servers with reportedly broken PMTU.

:-)