Thursday, June 5, 2008

Unable to use Windows for half a minute after each login

So I was part of a software development team working on a network security product at a startup company. Before allowing access to the network, the product performed network access control by asking the user for access credentials, much like a captive portal. The issue was this: when a user accessed the network from a Windows PC, the network equipment to which the PC was connected would effectively lock the user out of their own PC for at least half a minute. For example, starting a Web browser would freeze the Windows graphical shell for about 30 seconds; it would not respond to the user until the browser finally opened. This was repeated for any program that attempted to access the network, and the problem persisted until the user was authenticated for network access. By contrast, ordinary non-secured equipment would not even make its presence in the network felt: there were no delays in accessing network resources or in using one's own PC.

Before release, the product was being used internally within the development labs. More than a hundred software development engineers and testers were connected to the network through this security product. Apparently no one complained about the half-minute delay or tried to find out its cause; everyone just accepted it as a 'fact of life'...

I could not take it any more, all the more so as a member of the product's development team. I started my analysis by thinking about what happens between the Windows GUI shell coming up and the user invoking a program that attempts to access the network. I knew that network traffic from Windows is quite chatty, thanks to SMB and NBNS. Using common sense, I suggested that the network equipment explicitly send a connection-reject message for every network access attempt from the network interface of an unauthenticated user. Until then, the equipment had been configured to simply discard such traffic. The common sense paid off! Until authenticated, the user still had no access to the network, which was the desired behavior, but this no longer prevented the user from interacting with their OS shell.

While unauthenticated traffic was being silently dropped, the chatty protocols on the user's PC would keep retrying until their preconfigured timeout expired, which in this case was half a minute. With an explicit reject, each attempt fails immediately instead of waiting for that timeout. Issue resolved without changing a single line of software source code. Phew! Needless to say, I was quite happy that day.
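The difference between a rejected and a silently dropped connection attempt can be seen from the client side with a few lines of Python. This is only a rough sketch, not the product's code: `probe` and `unused_local_port` are names made up for this illustration, and a closed port on the local machine stands in for equipment that sends an explicit reject.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try one TCP connection and return (outcome, seconds_elapsed)."""
    start = time.monotonic()
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        outcome = "connected"
    except ConnectionRefusedError:
        # The peer answered with a reset: the caller learns of the failure at once.
        outcome = "refused"
    except socket.timeout:
        # Nothing came back at all: the caller had to wait out the whole timeout.
        outcome = "timed out"
    finally:
        sock.close()
    return outcome, time.monotonic() - start


def unused_local_port():
    """Ask the OS for a free port, then release it so that nothing listens on it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print(probe("127.0.0.1", unused_local_port()))
```

Against a closed local port the call should come back as "refused" almost immediately. Pointed at a host whose traffic is silently discarded, the same call would be expected to block for the full timeout, which is the behavior the Windows clients were stuck in.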

Sunday, June 1, 2008

Netgear DG834G used with BSNL DataOne as ISP would not open certain Web sites

I hope the experience shared here helps fellow buyers of home DSL modem routers used with the BSNL DataOne broadband Internet service.

Recently I purchased a Netgear DG834G wireless router with a built-in DSL modem. Setting it up proved a challenge for the following reasons:
  1. By default the modem kept insisting on its pre-programmed automatic setup wizard. The wizard only tries to auto-detect ISP settings. It could not detect the settings for my ISP and at the same time offered no way to specify them manually. The method described in the installation manual still followed the auto-detect path. Manually entering the address http://192.168.0.1 took me to the same wizard.

    Solution: a manual reset using the switch on the back of the device gave me access to the manual configuration method. I then entered the settings for the ISP and I was connected! It was a very satisfying experience thereafter. The range of this wireless router is pretty good. I tried the farthest corners of the house, with multiple walls between the router and my PC. It just worked, with the lowest signal strength being rated 'good' (the higher ratings being 'very good' and 'excellent').

    The joy was short-lived. Next comes why, and how I cracked it.
  2. With this new router I could not open the Microsoft Web site! Huh. Then I tried a few other sites I use frequently. Same with download.com. Everything else worked fine: IMAP, POP, instant messenger, VPN access... Firefox consistently failed to display the site contents. Internet Explorer would unreliably open the front page after taking too long. Clicking any of the hyperlinks on the page would eventually time out without any data arriving from the remote site. I was baffled! What was it about these Web sites and my new router?
Narrowing down the problem and discovering a solution was a very satisfying debugging experience. Those who are only interested in knowing the fix can skip to the last section, 'Conclusion!'.

Narrowing down the problem from the symptoms:
First I suspected the DNS settings. I obtained the settings from the ISP's Web site and set them manually in the router instead of using the default option of obtaining them from the ISP. That did not help. Someone suggested setting them manually on my PC. Still no luck.

The same sites would open with the ordinary modem provided by the ISP.

Suspecting something foul with the firmware, I downloaded the latest software image from the manufacturer's support Web site. When that did not help either, I started getting restless as a new buyer and thinking less rationally. I logged in to the office over VPN. That way I could access the problem Web sites. Now I knew the router had nothing against fetching data bytes from those Web sites :-) I was still doubtful about the DNS part. I decided to sleep on the problem so that I could think logically.

When I shared the experience with my kind friends and colleagues, I received the following suggestions, which helped me rule out certain possibilities:
"disable IPv6 access in your Internet browser on your PC" -- then why would the same browser open those problem sites with another modem?
"don't use your router as a DNS proxy; instead set DNS manually on your PC" -- same as above
"your router has bugs; get a replacement" -- hmm! That's easy. Then why do other Web sites open and other services work? What is so specific about these sites?
I was not convinced.

The analysis
I decided to dig deeper and understand what was happening behind the scenes. This was the turning point! I captured network traffic using Wireshark, a software tool that lets you capture and analyze network traffic to and from your PC. I compared the traffic generated while accessing the sites that worked fine against the traffic to and from the problem sites. I noticed that for the problem sites, the response to the HTTP GET request was not reaching my PC. The PC would keep retrying the same request for a certain time without success and then give up. At least this confirmed that DNS resolution was working fine. So it was time to make some educated guesses as to why the reply from the remote site would not arrive.

1) DNS request initiated from my PC as a result of entering the Web site address in the browser:
No. Time Source Destination Protocol Info
1 0.000000 192.168.x.y 218.248.255.139 DNS Standard query A www.microsoft.com
2) DNS response comes back with a list of IP addresses from which the site contents can be fetched. Notice the akadns.net (Akamai) names in the CNAME chain.

No. Time Source Destination Protocol Info
2 0.011239 218.248.255.139 192.168.x.y DNS Standard query response CNAME toggle.www.ms.akadns.net CNAME g.www.ms.akadns.net CNAME lb1.www.ms.akadns.net A 207.46.193.254 A 207.46.19.190 A 207.46.19.254 A 207.46.192.254
3) From the list of IP addresses in the DNS response, my PC picks one and initiates a TCP connection to it
No. Time Source Destination Protocol Info
3 0.014299 192.168.x.y 207.46.193.254 TCP rich-cp > http [SYN] Seq=0 Win=16384 Len=0 MSS=1460

No. Time Source Destination Protocol Info
4 0.444975 207.46.193.254 192.168.x.y TCP http > rich-cp [SYN, ACK] Seq=0 Ack=1 Win=8190 Len=0 MSS=1436

No. Time Source Destination Protocol Info
5 0.445056 192.168.x.y 207.46.193.254 TCP rich-cp > http [ACK] Seq=1 Ack=1 Win=17232 Len=0

4) TCP/IP connection established. Now the PC asks for HTTP data
No. Time Source Destination Protocol Info
6 0.445475 192.168.x.y 207.46.193.254 HTTP GET / HTTP/1.1

...
Internet Protocol, Src: 192.168.x.y (192.168.x.y), Dst: 207.46.193.254 (207.46.193.254)
...Total Length: 451
...Protocol: TCP (0x06)
...
Transmission Control Protocol, Src Port: rich-cp (2057), Dst Port: http (80), Seq: 1, Ack: 1, Len: 411
Hypertext Transfer Protocol


...0030 43 50 5c b1 00 00 47 45 54 20 2f 20 48 54 54 50 CP\...GET / HTTP
0040 2f 31 2e 31 0d 0a 48 6f 73 74 3a 20 77 77 77 2e /1.1..Host: www.
0050 6d 69 63 72 6f 73 6f 66 74 2e 63 6f 6d 0d 0a 55 microsoft.com..U


No. Time Source Destination Protocol Info

7 0.854522 207.46.193.254 192.168.x.y HTTP HTTP/1.1 302 Found (text/html)

...Internet Protocol, Src: 207.46.193.254 (207.46.193.254), Dst: 192.168.x.y (192.168.x.y)
...Total Length: 553
...Time to live: 108
Protocol: TCP (0x06)
...Transmission Control Protocol, Src Port: http (80), Dst Port: rich-cp (2057), Seq: 1, Ack: 412, Len: 513
Hypertext Transfer Protocol
Line-based text data: text/html


No. Time Source Destination Protocol Info
8 0.869677 192.168.x.y 207.46.193.254 HTTP GET /en/us/default.aspx HTTP/1.1

...
Internet Protocol, Src: 192.168.x.y (192.168.x.y), Dst: 207.46.193.254 (207.46.193.254)
...Total Length: 469
...Time to live: 128
Protocol: TCP (0x06)
...Transmission Control Protocol, Src Port: rich-cp (2057), Dst Port: http (80), Seq: 412, Ack: 514, Len: 429
Hypertext Transfer Protocol

...0030 41 4f fa 75 00 00 47 45 54 20 2f 65 6e 2f 75 73 AO.u..GET /en/us
0040 2f 64 65 66 61 75 6c 74 2e 61 73 70 78 20 48 54 /default.aspx HT
0050 54 50 2f 31 2e 31 0d 0a 48 6f 73 74 3a 20 77 77 TP/1.1..Host: ww
0060 77 2e 6d 69 63 72 6f 73 6f 66 74 2e 63 6f 6d 0d w.microsoft.com.
5) The PC retransmits the previous request, as it received no response within a certain time
No. Time Source Destination Protocol Info
9 3.600904 192.168.x.y 207.46.193.254 HTTP [TCP Retransmission] GET /en/us/default.aspx HTTP/1.1

...Internet Protocol, Src: 192.168.x.y (192.168.x.y), Dst: 207.46.193.254 (207.46.193.254)
...Time to live: 128
Protocol: TCP (0x06)
...Transmission Control Protocol, Src Port: rich-cp (2057), Dst Port: http (80), Seq: 412, Ack: 514, Len: 429
Hypertext Transfer Protocol
6) Problem detected: the TCP sequence number of the eventual response from the remote host shows that my PC did not receive the intermediate segments sent by the remote site.
No. Time Source Destination Protocol Info
10 4.028862 207.46.193.254 192.168.x.y TCP [TCP Previous segment lost] http > rich-cp [ACK] Seq=4822 Ack=841 Win=63430 Len=0

...Internet Protocol, Src: 207.46.193.254 (207.46.193.254), Dst: 192.168.x.y (192.168.x.y)
...Protocol: TCP (0x06)
...
Transmission Control Protocol, Src Port: http (80), Dst Port: rich-cp (2057), Seq: 4822, Ack: 841, Len: 0

Conclusion!
So what could it be that failed to deliver the missing TCP segments from the remote site? Probably the intermediate nodes between my router and the remote site. This pointed me to the MTU used by my router. Probably the ISP's intermediate forwarding IP nodes were not happy to receive the bigger TCP segments sent by the remote site. The default MTU used by my Netgear router was 1492 (the 1500-byte maximum MTU of Ethernet minus 8 bytes for PPPoE encapsulation). When I reduced the MTU to 1400 bytes, bingo! The problem sites opened up in a flash :-) I also noticed that Web traffic felt more responsive than the earlier sluggish feel.
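The MTU arithmetic can be written down in a few lines of Python. This is a minimal sketch that assumes IPv4 and TCP headers without options (20 bytes each); the constant and function names are made up for this illustration.

```python
ETHERNET_MTU = 1500   # largest payload of a standard Ethernet frame
PPPOE_OVERHEAD = 8    # PPPoE header (6 bytes) plus PPP protocol field (2 bytes)
IPV4_HEADER = 20      # assumption: no IP options
TCP_HEADER = 20       # assumption: no TCP options


def pppoe_mtu(link_mtu=ETHERNET_MTU):
    """MTU left for IP packets once PPPoE encapsulation is accounted for."""
    return link_mtu - PPPOE_OVERHEAD


def mss_for_mtu(mtu):
    """Largest TCP payload that fits in one IP packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mtu_for_mss(mss):
    """Inverse of mss_for_mtu: the MTU implied by an advertised MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


if __name__ == "__main__":
    print(pppoe_mtu())          # router default over PPPoE
    print(mss_for_mtu(1400))    # TCP payload per packet after lowering the MTU
    print(mtu_for_mss(1460))    # MTU implied by the MSS my PC advertised
    print(mtu_for_mss(1436))    # MTU implied by the MSS the server advertised
```

Under these assumptions, the MSS=1460 that my PC advertised in the handshake corresponds to a plain Ethernet MTU of 1500, and the MSS=1436 advertised by the server corresponds to an MTU of 1476.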

In the case of microsoft.com, I noticed that after the HTTP GET request that followed the redirect, there was no response from the other side. After a retransmission, only a partial response was captured. That response carried a TCP sequence number way beyond what my client was expecting and had acknowledged so far in the session. In other words, the requester did not get the data it was expecting, while the sender said, "I have sent all that to you." This pointed to packet drops by intermediate (ISP) nodes, possibly due to an oversize MTU. This presumably did not happen with other sites because their servers did not respond aggressively enough to cause response packets to be dropped for exceeding the MTU.
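The size of the hole in the byte stream can be read straight off the capture. The sketch below uses the relative sequence numbers shown by Wireshark and ignores sequence-number wraparound; the function names are made up for this illustration.

```python
def missing_bytes(next_expected_seq, received_seq):
    """Bytes the receiver never saw between what it expected and what arrived."""
    return max(0, received_seq - next_expected_seq)


def full_segments(gap, mss):
    """How many whole segments of size mss fit in the gap, and the leftover bytes."""
    return divmod(gap, mss)


if __name__ == "__main__":
    # Frames 8 and 9: my PC acknowledges up to 514, so byte 514 is expected next.
    # Frame 10: the server is already at Seq=4822.
    gap = missing_bytes(next_expected_seq=514, received_seq=4822)
    print(gap, full_segments(gap, mss=1436))
```

The gap works out to 4308 bytes, which is exactly three segments of 1436 bytes, the MSS the server advertised in its SYN, ACK. That is at least consistent with the lost data having travelled in full-size segments, although the capture alone cannot prove it.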

Later I experimented further and found that an MTU of 1460 is the maximum that works. (The same number appears as the MSS my PC advertised in the three-way handshake with the Web server pasted above, although MSS counts only the TCP payload and excludes the IP and TCP headers.) Now that I knew the root cause, I searched the Web and found that multiple users of this product and this ISP service had reported the issue.
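I found the limit by trial and error, but the same experiment can be organized as a binary search. The sketch below is only an illustration: `works` is a hypothetical callable standing in for whatever test is used (for example, setting the router MTU and reloading a problem site), and it assumes that every size below a working size also works.

```python
def largest_working_size(works, low, high):
    """Return the largest size in [low, high] for which works(size) is true.

    Returns None if even `low` fails. Assumes that if a size works,
    every smaller size works too.
    """
    if not works(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always makes progress
        if works(mid):
            low = mid                 # mid is fine, look for something larger
        else:
            high = mid - 1            # mid is too big, look below it
    return low


if __name__ == "__main__":
    # Stand-in probe that mimics what I observed: sizes up to 1460 work.
    observed_limit = 1460
    print(largest_working_size(lambda size: size <= observed_limit, 1400, 1492))
```

Searching between 1400 (known to work) and 1492 (the router default) this way needs only a handful of trials instead of stepping through every value.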

Netgear knowledge base articles also address it (article 1, article 2) in the context of AOL.
Another good description of the same issue with a formal name for it: path MTU discovery (a.k.a. PMTU).
List of servers with reportedly broken PMTU.

:-)