I spent painstakingly long hours comparing multiple versions of the software builds around the date the issue was reported on. Compared the software changes that happened between the last date it was found working fine and the date it actually stopped working. Note: the latter date could be even earlier than the date the issue was observed and reported by the tester. The original developer then undid the changes he thought may be causing the issue and I tried those on the test setup once again with no avail! It was a Friday already.
For the weekend I had decided to drop all the theories and start with some basics and try identifying the problem by the way of elimination of one software component at a time. Unfortunately for me, the original developer did just that and already found the root cause--an incorrect acquisition of synchronization spinlock! This was there since the day-1 when he introduced multi-threading in the software for the new multicore hardware platform.
- Never trust anyone when it comes to analysing software bugs! Not even the original developer...
- Managers should allow sufficient time for engineers handling issues to form their own theories and then go about analyzing instead of coercing them to follow specific techniques they had seen working earlier. One cannot use the same tool and approach for just all problems. Otherwise that leads to significantly wasted time. After all, it is them who have hired this engineer and the engineer's own judgment should be trusted and be given a chance. In this situation, me not being the original developer, in fact, put me in the best position to question the correctness of the source code.
- Such a basic issue should have been caught during unit testing by the original developer.