My favourite bug of all time was uncovered because I was just too impatient.

I was working on a fairly large and complex embedded project. A microprocessor was interfacing with some custom hardware, which it would configure, control, and monitor. The microprocessor was running an RTOS with a lightweight TCP/IP stack, so clients could connect to the device over Ethernet and configure it remotely. We also used this for diagnostics and testing during development. We had successfully used a similar design on several products, with great results. We had also designed a protocol for monitoring and control that was used in earlier models, and we extended it here - though this product was significantly larger and more complex than its predecessors. What could possibly go wrong?

The first thing the microprocessor does when you connect is dump the status of the entire system. With smaller devices, this was very fast. But this new model was so much larger, with so many more components and subsystems, that the dump was taking a very long time on every connection. The device could also serve many concurrent clients, so this slow startup would be repeated for every client. The processing and time cost was far from trivial, and the delay limited how quickly clients could start updating their own settings.

Slow and inefficient code bothers me. Eventually I got sick of waiting a good 10-20 seconds every time I connected before I could start working with the hardware to test my latest changes, so I decided to track down the performance problem. As a baseline, I measured the system’s performance before changing anything, and then got to work. It had to be something simple, after all - we’d had this same code working on other products in the field for many months.

I figured a few hours would probably be enough to solve it. The debugging process was made somewhat more difficult by the fact that there was no keyboard or monitor on this hardware; all debugging was done either via JTAG and a USB serial connection, or over the Ethernet port - which was itself part of the problem.

The actual time to solve the mystery would be more like a week, with several long, late, pizza-fuelled nights, drilling through many layers and thousands of lines of code to find the very surprising root cause.

First layer: application protocol

The most obvious place to start was the application-level buffering. A protocol layer abstracted away the TCP/IP networking (which also allowed for serial interfaces) and had its own read and write buffers. I guessed it might help to increase the buffer sizes a little, to reduce the number of networking calls. A small increase didn’t help much overall, nor did a big one. Ok, so it wasn’t the protocol buffering.

The status dump on connection has to traverse many data structures, build strings, and create structured text for output. I reviewed the algorithms and memory management, reduced reallocations and copies, tidied things up, and made better use of pre-allocated buffers.
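To give a concrete flavour of that kind of change (the names below are invented for the example, not lifted from the real codebase): instead of allocating and copying strings for every status line, the dump code can append into a single pre-allocated buffer and flush it through the protocol layer in large chunks.

```c
/* Sketch only: a pre-allocated append buffer for building the status dump.
 * status_buf_t and proto_write() are hypothetical stand-ins. */
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

#define STATUS_BUF_SIZE 4096

typedef struct {
    char   data[STATUS_BUF_SIZE];
    size_t used;
} status_buf_t;

/* Assumed protocol-layer write call - not the project's real API. */
extern int proto_write(const char *buf, size_t len);

static void buf_flush(status_buf_t *b)
{
    if (b->used > 0) {
        proto_write(b->data, b->used);
        b->used = 0;
    }
}

/* Append formatted text to the buffer; if it doesn't fit, flush what we
 * have and retry once, instead of realloc-and-copy for every line. */
static void buf_appendf(status_buf_t *b, const char *fmt, ...)
{
    va_list ap, retry;
    va_start(ap, fmt);
    va_copy(retry, ap);

    int n = vsnprintf(b->data + b->used, STATUS_BUF_SIZE - b->used, fmt, ap);
    if (n > 0 && (size_t)n >= STATUS_BUF_SIZE - b->used) {
        buf_flush(b);
        n = vsnprintf(b->data, STATUS_BUF_SIZE, fmt, retry);
    }
    if (n > 0 && (size_t)n < STATUS_BUF_SIZE - b->used) {
        b->used += (size_t)n;
    }

    va_end(retry);
    va_end(ap);
}
```

The real change was spread across the status-dump code, but the principle was the same: allocate once, append, and flush in big writes.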

I measured again and saw a modest improvement. All that work helped - but it was still not enough; it was just too slow for my liking. According to my calculations, the system should have been capable of much higher throughput. I decided to keep digging.

Second layer: networking and OS

Underneath the protocol code was a high-level interface for networking. Perhaps this wasn’t as efficient as it could be? After poring over the code and analysing it for potential performance problems, I found a few small issues, but no smoking gun.

Now, this RTOS had many tasks (threads) running, only one of which was the networking support. Could the networking task be getting starved of processing time? The interrupt handling latency was guaranteed and well within the required bounds. I disabled all non-essential tasks, tried increasing the priority of the networking task, and made various other tweaks. None had any impact. The RTOS kernel was perfectly happy and running smoothly.
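For the record, the kind of tweak involved looks roughly like this with a FreeRTOS-style API (an assumption - the post doesn’t name the RTOS - and the task handles are invented):

```c
/* Sketch only, assuming a FreeRTOS-style RTOS; the task handles are hypothetical. */
#include "FreeRTOS.h"
#include "task.h"

extern TaskHandle_t net_task_handle;     /* the TCP/IP stack's task */
extern TaskHandle_t logging_task_handle; /* a non-essential task    */

void starve_test_setup(void)
{
    /* Give the networking task more CPU time by raising its priority... */
    vTaskPrioritySet(net_task_handle, configMAX_PRIORITIES - 2);

    /* ...and take non-essential work out of the picture entirely. */
    vTaskSuspend(logging_task_handle);
}
```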

Keep digging…

Third layer: TCP/IP tuning

The TCP/IP stack we were using had a boatload of parameters you could configure at build time, including its internal buffers. This was a prime candidate for performance issues. I dug up the documentation, went through our configuration, and sure enough - bingo! Several parameters were not at the recommended values for this version of the library. Some buffer sizes needed to be multiples of the packet size (e.g. the MSS) and tuned to match other significant parameters. Getting this wrong could cause fragmented packets or memory buffers, and introduce small but potentially disruptive delays into the flow.
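To give a flavour of the tuning involved (the stack isn’t named here, so treat this as an lwIP-style example rather than our actual configuration): in an lwipopts.h the buffers are conventionally expressed as multiples of the MSS, for example:

```c
/* Sketch of lwIP-style build-time tuning; the real stack and values may differ. */
#define TCP_MSS          1460                          /* fits one standard Ethernet frame  */
#define TCP_SND_BUF      (4 * TCP_MSS)                 /* send buffer: a multiple of MSS    */
#define TCP_WND          (4 * TCP_MSS)                 /* receive window: a multiple of MSS */
#define TCP_SND_QUEUELEN ((2 * TCP_SND_BUF) / TCP_MSS) /* enough queue entries to cover the send buffer */
#define PBUF_POOL_SIZE   16                            /* packet buffers for received frames */
```

The point is less the specific numbers than the relationships: the send buffer, window and queue lengths all need to agree with the MSS, or the stack ends up fragmenting its own buffers.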

This tuning process took many, many hours, and eventually resulted in a decent improvement in throughput. But was it enough? No - the big dump when the connection was established wasn’t just slow now, it was noticeably jerky and very bursty. I really needed to see exactly what was happening between the two socket endpoints. I needed to understand why it was bursty - and fix it.

Fourth layer: on the wire

Having calculated the theoretical peak throughput, I decided there was no good reason this microprocessor shouldn’t be able to sustain far more than it was delivering. Time to do some low-level packet analysis.

I set up Wireshark and started capturing packets. At first everything seemed ok, but the timestamps clearly showed that the transmissions were very bursty. Sometimes there were delays of a few seconds between packets! No wonder it was taking so long for a full status dump… but what was causing this?

Looking at the TCP/IP layers, I decoded and inspected the session piece by piece, from the very first packet. SYN, SYN-ACK, ACK... All good so far. But after only a few data packets: duplicate ACKs - TCP’s nearest thing to a NAK. Retransmissions? Backoff? Delays! What on earth was going on? The trace showed the micro resending packets it had already transmitted successfully - and yet, matching up the sequence numbers, those very packets had been ACKed by the other end. Eventually, after a few segments arrived out of order, the sender backed off, increasing its retransmission timeouts. That explained the bursty shape of the traffic perfectly. But what could be causing it?

Not leaving anything to chance, I tried changing Ethernet cables to make sure it wasn’t a dodgy connection causing the fault. No dice.

At this point, my best hunch pointed to a bug in the TCP/IP library. Resending an already acknowledged packet? Madness! Since we had found bugs in this library before, it was quite conceivable. I upgraded the stack to the absolute latest version and reran all the tests. Same problem. Yet according to the forums and bug tracker, nobody else had reported this kind of problem with this stack before.

I decided some major action was needed. I needed to partition the problem and eliminate large components to isolate the fault.

Isolation

First stop: write a simple socket server that would accept a client connection and then send out packets in a tight loop, as fast as it could. This would exercise the TCP/IP stack, driver and hardware without any of the protocol or application code. The packets contained a monotonic counter, so I could see if any were being corrupted or lost.
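The test server was along these lines (a reconstruction, not the original code, using the BSD-style sockets API that most embedded TCP/IP stacks provide; error handling omitted):

```c
/* Sketch: accept one client, then blast sequence-numbered packets at it
 * as fast as the stack will take them. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define TEST_PORT     5000
#define PAYLOAD_WORDS 256

void run_blast_server(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(TEST_PORT);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, 1);

    int client = accept(listener, NULL, NULL);

    uint32_t payload[PAYLOAD_WORDS];
    uint32_t seq = 0;

    for (;;) {
        /* Fill the payload with a monotonic counter so the receiver can
         * spot lost or corrupted packets by looking for gaps. */
        for (int i = 0; i < PAYLOAD_WORDS; i++) {
            payload[i] = seq;
        }
        seq++;

        if (send(client, payload, sizeof(payload), 0) < 0) {
            break;  /* client went away, or the stack reported an error */
        }
    }
}
```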

Running this test and capturing packets on the wire revealed the same problem: a burst of traffic, then a flurry of ACKs and duplicate ACKs, followed by timeouts and retransmissions. Curses, foiled again!

Ok, how do I eliminate the TCP/IP stack from the equation? I constructed a UDP ping packet by hand, using parts of the wire capture data to fill in the relevant fields (such as the MAC addresses). I kept a monotonic sequence counter and copied it into the binary blob at the correct offset, then passed the packet directly to the driver, with my workstation hard-coded as the destination. I started with a small delay, on the order of 100 ms between packets. This seemed to work ok. But as I decreased the delay, packets started being dropped. Dropped?!
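The hand-crafted test looked roughly like this (heavily paraphrased - the frame template, driver entry point and delay helper are stand-ins, and the offset assumes a plain Ethernet + IPv4 + UDP frame with no VLAN tag or IP options):

```c
/* Sketch: patch a sequence counter into a pre-captured UDP frame and hand
 * it straight to the driver, bypassing the TCP/IP stack entirely. */
#include <stdint.h>
#include <string.h>

/* Raw Ethernet + IPv4 + UDP frame copied from a wire capture, with the MAC
 * and IP addresses of the workstation already filled in. The UDP checksum
 * is left at zero (optional for IPv4), so patching the payload doesn't
 * invalidate it. */
extern const uint8_t  ping_template[];
extern const uint16_t ping_template_len;

/* Assumed entry points - stand-ins for the real driver and delay helper. */
extern int  eth_driver_send(const uint8_t *frame, uint16_t len);
extern void delay_ms(uint32_t ms);

/* 14 bytes Ethernet + 20 bytes IPv4 + 8 bytes UDP header = start of payload. */
#define SEQ_OFFSET (14 + 20 + 8)

void send_raw_burst(uint32_t count, uint32_t gap_ms)
{
    static uint8_t frame[1518];   /* maximum standard Ethernet frame */

    for (uint32_t seq = 0; seq < count; seq++) {
        memcpy(frame, ping_template, ping_template_len);

        /* Drop the monotonic counter into the payload so the capture on the
         * workstation shows exactly which packets went missing. */
        memcpy(&frame[SEQ_OFFSET], &seq, sizeof(seq));

        eth_driver_send(frame, ping_template_len);
        delay_ms(gap_ms);         /* started around 100 ms, then reduced */
    }
}
```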

The only thing between this test and the wire is the device driver and hardware. Could the driver be corrupting or dropping packets?

Fifth layer: device driver

A code review of the device driver didn’t turn up anything suspicious. The memory management and interrupt handling all seemed quite carefully written. Many hours later, I was no closer to the problem.

I pulled up the datasheet for the Ethernet controller and started querying its status registers, halting the microprocessor and printing diagnostics. There were no clear errors to be found, so the driver did not appear to be causing the hardware to fail to send or receive data.

Sixth layer: hardware

The microprocessor has onboard Ethernet support, which is connected to a separate MAC (Media Access Control) chip. This MAC performs the actual electrical interfacing, and is the last piece of silicon before the wire. I started reading its datasheet and looking at the initialisation sequence in the driver, which configures the MAC’s registers on power-up. I verified that the register flags and values were correct, but while I was reading I noticed some counter registers that collected counts of certain types of media (physical layer) errors.

I added some code to my minimalist hand-crafted ping test to read these counters from the MAC registers, showing the values before and after the ping burst. Sure enough, the counters were zero on power-up, and after the ping test one of the error counters held a very large number. Ok, I think we’re finally on to something…
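The counter check was roughly this shape - the register names and addresses below are placeholders, with the real ones coming from the MAC’s datasheet:

```c
/* Sketch: snapshot the MAC's error counters before and after a test burst.
 * The base address and register offsets are made up for illustration. */
#include <stdint.h>
#include <stdio.h>

#define MAC_BASE            0x40028000u             /* hypothetical base address */
#define MAC_REG(offset)     (*(volatile uint32_t *)(MAC_BASE + (offset)))

#define MAC_RX_CRC_ERRORS   MAC_REG(0x50)           /* frames with a bad CRC     */
#define MAC_RX_ALIGN_ERRORS MAC_REG(0x54)           /* alignment/symbol errors   */
#define MAC_TX_UNDERRUNS    MAC_REG(0x58)           /* transmit FIFO underruns   */

static void dump_mac_error_counters(const char *label)
{
    printf("%s: crc=%lu align=%lu underrun=%lu\r\n",
           label,
           (unsigned long)MAC_RX_CRC_ERRORS,
           (unsigned long)MAC_RX_ALIGN_ERRORS,
           (unsigned long)MAC_TX_UNDERRUNS);
}

void run_counter_check(void)
{
    dump_mac_error_counters("before");   /* all zero after power-up             */

    /* ... run the hand-crafted ping burst here ... */

    dump_mac_error_counters("after");    /* one error counter had jumped hugely */
}
```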

Back on the wire

I modified the test program to send out hand-crafted ARP packets. The only other code in play was the driver. I went back to Wireshark and captured another session. This time, I exported the trace data to a file and analysed the timing information in the headers.

I then stepped through and counted the number of packets successfully sent before a failure. Then the next, and the next. A sort of pattern started to emerge: the gaps were something like 9, 17, 33, 51… and eventually they would come back down and repeat. A regular pattern is very interesting - but what could be causing this kind of failure?
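Counting the gaps is easy to script. Here is a sketch of the idea, assuming the trace was exported as lines of “sequence,timestamp” pairs (not necessarily the exact format I used):

```c
/* Sketch: read "seq,timestamp" lines exported from the capture and report
 * how many packets arrived between each missing sequence number. */
#include <stdio.h>

int main(void)
{
    unsigned seq;
    double   ts;
    long     expected = -1;
    unsigned since_last_gap = 0;

    while (scanf("%u,%lf", &seq, &ts) == 2) {
        if (expected >= 0 && seq != (unsigned)expected) {
            /* A gap: one or more packets never made it onto the wire. */
            printf("gap before seq %u (t=%.6f s) after %u good packets\n",
                   seq, ts, since_last_gap);
            since_last_gap = 0;
        } else {
            since_last_gap++;
        }
        expected = (long)seq + 1;
    }
    return 0;
}
```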

Stepping back and looking at the regular pattern of successes and failures over time was like looking at an interference pattern - like ripples in a pond: where the waves met, packets were dropped. A colleague observed that it looked a bit like there were two slightly different frequencies involved… Wait a minute!

Don’t blame the hardware

It was nearly midnight, and I desperately wanted to talk to the Hardware Engineer who designed the system. But it would have to wait until the morning. I fired off an email explaining what we had found, and went home exhausted.

The next day, I walked up to the Hardware Engineer, who had a big grin on his face. “I think I found your problem…”, he opened. I was skeptical but excited, and urged him to explain. “In the last board spin, I rerouted the clock source of the MAC controller. So the microprocessor and the MAC were actually running off two different clocks!”

I was elated. This perfectly explained the “interference pattern” we had observed. The frequencies of the two clocks were supposed to be the same, but they were not perfectly matched, and even a slight difference in frequency causes a ‘beating’ effect as the clocks drift in and out of phase - much like when tuning a guitar: when two strings are almost, but not quite, in tune, you hear a slow, low-frequency ‘wow’.

So - while the two clocks were aligned, the microprocessor and the MAC controller chip could communicate reliably, and the packets flowed normally. But as the clocks drifted out of phase, the chip-to-chip communication was corrupted as the rising and falling edges of the signals between them became misaligned. This explained why packets appeared to be sent or received at the higher layers, but were in fact lost in the intermittently garbled transfers. It’s almost a marvel TCP/IP worked at all!
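To put purely illustrative numbers on it (the actual clock rates on this board aren’t given here): if both clocks are nominally 25 MHz but differ by 50 parts per million, the relative phase slips at the beat frequency

$$
f_{\text{beat}} = \lvert f_1 - f_2 \rvert \approx 25\,\text{MHz} \times 50 \times 10^{-6} = 1.25\,\text{kHz},
$$

so the two chips cycle from aligned, to garbled, and back again roughly every 0.8 ms (1/1.25 kHz) - a beat thousands of times slower than the clocks themselves, which is exactly the sort of slow, regular interference pattern that showed up in the packet trace.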

The fix

In the end, it was a simple matter of ensuring both chips used a common clock source - which required a modification to the PCB routing. But for now, to test the fix, the Hardware Engineer broke out the soldering iron and patched the routing by hand on our development system. (We were fortunate that the clock signal was on an outer layer of the PCB!) I started the test again, very nervous to see the results. After days of chasing ghosts, I didn’t want to get my hopes up.

It worked. The hand-crafted ARP and ping tests ran for as long as we liked and never skipped a beat, running as fast as they could go. Finally, full throughput was achieved. I queried the registers for protocol and link errors, and it was all good. I checked the TCP/IP layer diagnostics for errors and statistics: no red flags. I went back to the original application firmware and tested the protocol, monitoring the wire for good measure.

This time, it took less than a second for a full status dump. Finally - success!

Wrapup

So - what began as a seemingly simple software performance problem eventually turned out to be caused by a hardware design fault. And it revealed several other opportunities for improvement along the way. This was a great learning experience, and a very satisfying puzzle to solve!

Random thoughts:

  1. TCP/IP is really, really good at surviving unreliable hardware and problems in layers below.
  2. Don’t mix clock sources between chips!
  3. Don’t assume that the first problem you find is the one causing the symptoms you’re chasing.
  4. Don’t assume the improvement you make is sufficient. Measure!
  5. Performance is relative. What can you reasonably expect from the system in front of you?
  6. Performance tuning is all about careful measurements and consistent tests. And changing one thing at a time!
  7. It’s hardly ever the fault of the OS. But it could be. And it’s hardly ever the fault of the hardware. But it could be.
  8. Don’t be satisfied with a fix until you understand how it works and why it fixes the problem.
  9. It is sometimes possible to diagnose hardware problems through analysing software.
  10. Persistence usually pays off!