Sunday, June 10, 2012

Post Upgrade Woes - Part 2: Spontaneous Reboots

After running into an initially scary but simple to fix problem following the recent(ish) GPU upgrade I performed, the PC was subjected to its usual routine as the house work-horse: a bit of web-browsing, photo editing/curating, primary school teacher planning and resource creation, and plenty of gaming! In fact, the machine was subjected to some fairly lengthy Skyrim sessions post-upgrade; check out some of my screen caps for an idea of how amazing it looked!

Over the course of a Saturday, the machine had been left idling; both my own account and that of my fiancée were logged in, with her account the currently visible session. I happened to notice that the PC was displaying the Windows Welcome/Log in screen, neither of our accounts displaying the "currently logged in" message, which was strange. Remembering the previous issue and fearing a spontaneous reboot, I checked the Windows Event Viewer.

As I had feared, the machine had experienced a BSOD and rebooted; this time, however, with a different error than before:


The computer has rebooted from a bugcheck. The bugcheck was: 0x00000124 (0x0000000000000000, 0xfffffa8004d90038, 0x0000000000000000, 0x0000000000000000). A dump was saved in: C:\Windows\Minidump\032512-15288-01.dmp. Report Id: 032512-15288-01.
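
For anyone wanting to pull the useful fields out of an event like the one above, here is a minimal Python sketch. The function name `parse_bugcheck` and the lookup table are my own inventions for illustration, not part of any Windows tooling. As far as I understand Microsoft's bug check reference, 0x124 is WHEA_UNCORRECTABLE_ERROR, and a first parameter of zero indicates a machine check exception, but treat that reading as an assumption rather than gospel.

```python
import re

# Only the bug check code seen in this post; extend as needed (illustrative table).
KNOWN_BUGCHECKS = {0x124: "WHEA_UNCORRECTABLE_ERROR"}

# Matches the Event Viewer wording quoted above; other Windows versions may differ.
BUGCHECK_RE = re.compile(
    r"The bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)\."
    r".*?A dump was saved in: (.+?\.dmp)\.",
    re.DOTALL,
)


def parse_bugcheck(message):
    """Extract the bug check code, its four parameters and the dump path."""
    match = BUGCHECK_RE.search(message)
    if match is None:
        raise ValueError("not a bugcheck message")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return {
        "code": code,
        "name": KNOWN_BUGCHECKS.get(code, "unknown"),
        "params": params,
        "dump_path": match.group(3),
    }


message = (
    "The computer has rebooted from a bugcheck. The bugcheck was: 0x00000124 "
    "(0x0000000000000000, 0xfffffa8004d90038, 0x0000000000000000, "
    "0x0000000000000000). A dump was saved in: "
    r"C:\Windows\Minidump\032512-15288-01.dmp. Report Id: 032512-15288-01."
)

info = parse_bugcheck(message)
print(hex(info["code"]), info["name"])
print([hex(p) for p in info["params"]])
print(info["dump_path"])
```

The second parameter is the interesting one here, since it is the only non-zero value in the event.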


What I found particularly concerning was that the machine had rebooted despite my having configured Windows not to reboot automatically following a BSOD. A brief search online only compounded my fears of a hardware issue. Some possible causes given were:

  • Possible memory issues.
  • Power issues - either incorrectly configured voltages or instabilities in the supply.
  • Overheating.
  • Pretty much any piece of hardware in the system(!)

To rule out a problem with the system's RAM and CPU, I immediately decided to perform an extended run of MemTest86+, followed by Prime95's small fast Fourier Transform torture test; both ran for several hours without any problems. So, I dug further into the Event Viewer and found a message that revealed a bit more info:


A fatal hardware error has occurred.

Component: AMD Northbridge
Error Source: Machine Check Exception
Error Type: HyperTransport Watchdog Timeout Error
Processor ID: 0


The HyperTransport bus has been used in AMD CPU design since the introduction of the AMD64 architecture, and replaces the old Front-Side Bus used previously. It provides a channel, separate from the memory bus, for the CPU to carry out I/O operations on the various other pieces of hardware that make up a modern PC, in particular, those on the PCI, AGP or PCIe buses (there's a great explanation of the HyperTransport bus over on Hardware Secrets for those interested). Since I had just upgraded my graphics card, my first thought was that there was in fact an issue with the GPU itself.

To try and confirm my suspicions, I opened the minidump file referenced in the system events and performed some basic analysis. This proved fruitless, however: despite being configured to halt and perform a full crash dump, the system was rebooting and producing only "minidumps". This meant that there was never enough information present in the dump for the Windows debugger tool to narrow the problem down to a particular driver.

I contacted the supplier, Overclockers.co.uk, mentioning the issue and they immediately issued an RMA. Because I hadn't really performed any troubleshooting myself and because the card was making mince-meat of most of the games I was throwing at it, I loathed the thought of simply returning it. Fortunately, the Overclockers support team pointed out that I could take as much time as I wanted to try and diagnose the problem; my RMA would still be valid.

This offer proved to be most fortunate, as my attempts to replicate the issue were mostly unsuccessful. The problem did recur, but I was seemingly unable to create a situation where I was certain the BSOD would be triggered and indicate the root cause. Over several weeks, I made several adjustments to my system to try and isolate the problem:

  • Reset the BIOS settings to their defaults.
  • To completely ensure there were no outstanding driver issues (the previously mentioned problem was caused by such a conflict), I performed a fresh re-install of Windows.
  • Switched the Corsair VX550W PSU back to the Enermax NoiseTaker EG701AX-VE W previously installed. Despite this not fixing the issue, I left the older supply installed after I used an online tool to check the suitability of the Corsair PSU against my current system spec. Choosing to allow for 10 USB devices and a small amount of overclocking headroom, the results suggested I would need a PSU capable of outputting 581W. Even adjusting the calculator for only 4 USB devices, the suggested PSU was 560W, still more than the 550W Corsair model could output.
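
To put numbers on the PSU comparison in the last step, here is a rough Python sketch using only the figures quoted above. The helper name `psu_shortfall` is made up for illustration, and the calculator's estimates are taken at face value rather than measured.

```python
def psu_shortfall(rated_watts, recommended_watts):
    """Watts by which the recommendation exceeds the PSU rating (0 if it fits)."""
    return max(0, recommended_watts - rated_watts)


CORSAIR_RATED_W = 550  # Corsair VX550W rating, as stated in the post

# Estimates reported by the online calculator for the two USB device counts.
estimates = {"10 USB devices": 581, "4 USB devices": 560}

shortfalls = {
    label: psu_shortfall(CORSAIR_RATED_W, recommended)
    for label, recommended in estimates.items()
}

for label, watts in shortfalls.items():
    print(f"{label}: short by {watts} W")
```

That works out at 31W and 10W over the Corsair's rating respectively, which is why I left the older supply in place.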

None of these changes resolved the issue and so I decided to ship the GPU back to Overclockers; I had run out of troubleshooting options and I wanted their opinion on the GPU's stability. Even if they found no problem with the GPU, I would have the peace of mind that someone with a superior testing suite could confirm the card to be OK. I provided them with the troubleshooting steps I had undertaken and the following observations about the problem:

  • The issue only occurred when the system was idling, or at the most playing music.
  • There was no guarantee that leaving the system idle for any length of time would result in a reboot. In attempting to replicate the problem, I did not shut down or reboot the machine for over a week and could not trigger a BSOD.
  • No matter how hard I pushed the system, whether in-game or using stress testing/benchmarking tools like FurMark and Unigine, the machine never rebooted or exhibited any strange behaviour of any kind.

After just a couple of days, I received the following response from the Overclockers team:

"Graphics card passes all Tests with no issues, Tests Run: Crysis WarHead, 3D Mark 11, Furmark, 3Dmark Vantage, AVP, Lost Planet 2, Stone Benchmark MSI Kombuster, Resident Evil Benchmark, ATI Mecha, ATI Ladybug, Streetfighter Benchmark, Stone Giant and heaven benchmark tool, these were all tested at 1920 x 1200 resolution running maximum settings, tested on DX9, DX10 and DX11 where applicable"

They were unable to find a fault with the card when under load, which was to be expected. While this wasn't the problem I had been experiencing, I didn't expect them to spend weeks waiting for a BSOD to occur! Even if they had gone to such lengths, I honestly doubt they would have seen the issue occur; I suspect the problem is quite specific to my system's configuration.

Since my card was returned to me and it's been re-installed into my system, the issue hasn't re-surfaced. I'm hoping that the act of re-seating the card may have been enough to resolve the issue, but it's more likely that the particular condition that caused the reboot hasn't arisen yet. I have my fingers crossed for the former, however!