Thursday, April 11, 2013

Stability Testing with StressLinux

I've been experimenting with overclocking my primary gaming machine, and I became concerned about the constant BSODs I was generating while stability testing with Windows-based tools such as Prime95 and Intel CPU Burn. Knowing how fragile a Windows installation can be, I wanted to reduce the risk that my testing would leave me with a non-booting system.

My initial searches turned up some bootable images, such as the Ultimate Boot CD (UBCD), that came equipped with builds of Prime95, but I finally settled on a specialised Linux Live CD: StressLinux. This is an openSUSE-derived distribution that comes bundled with the following utilities:

  • stress - a basic load generating tool that can stress CPU, memory and disk I/O.
  • cpuburn - a collection of utilities for stressing various CPU architectures.
  • hddtemp - a tool for reading hard drive temperatures from the S.M.A.R.T. information present in modern hard drives.
  • lm_sensors - a set of tools for monitoring hardware sensors, such as CPU and motherboard temperatures.
  • mPrime - a Linux version of the popular Prime95 tool.
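
As a rough sketch of how the temperature readings from lm_sensors could be checked programmatically during a test run, the Python snippet below parses text in the style printed by the sensors command. The sample output, the function names and the thresholds are illustrative assumptions rather than readings from my machine; real output differs between chips and drivers.

```python
import re

# Illustrative sample only: real `sensors` output varies by chip and driver.
SAMPLE_SENSORS_OUTPUT = """\
k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +48.5°C  (high = +70.0°C)
"""

# Matches lines such as "temp1:   +48.5°C ..." and captures label and value.
TEMP_LINE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*°C"
)


def parse_temperatures(text):
    """Return {label: degrees Celsius} for every temperature line found.

    Labels repeated across several chips overwrite each other; a fuller
    version would key the readings by chip name as well.
    """
    readings = {}
    for line in text.splitlines():
        match = TEMP_LINE.match(line)
        if match:
            readings[match.group("label").strip()] = float(match.group("value"))
    return readings


def over_limit(readings, limit_c):
    """Return only the readings that exceed limit_c (a hypothetical threshold)."""
    return {label: temp for label, temp in readings.items() if temp > limit_c}


if __name__ == "__main__":
    temps = parse_temperatures(SAMPLE_SENSORS_OUTPUT)
    print(temps)
    print(over_limit(temps, 45.0))
```

To use this against a live system, the sample string could be swapped for the text returned by subprocess.run(["sensors"], capture_output=True, text=True).stdout, assuming the sensors command is installed and configured.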

The reason I favoured the StressLinux distribution over the UBCD was that the combination of the above utilities allowed me to watch my system temps while running the mPrime/Prime95 torture tests, which helped me work out where the instabilities lay. I was pleased to find that the mPrime tool mimicked the functionality of its Windows counterpart, despite being command-line driven:

       Main Menu

    1. Test/Primenet
    2. Test/Worker threads
    3. Test/Status
    4. Test/Continue
    5. Test/Exit
    6. Advanced/Test
    7. Advanced/Time
    8. Advanced/P-1
    9. Advanced/ECM
   10. Advanced/Manual Communication
   11. Advanced/Unreserve Exponent
   12. Advanced/Quit Gimps
   13. Options/CPU
   14. Options/Preferences
   15. Options/Torture Test
   16. Options/Benchmark
   17. Help/About
   18. Help/About PrimeNet Server
Your choice: 15
Number of torture test threads to run (2):
Choose a type of torture test to run.
1 = Small FFTs (maximum FPU stress, data fits in L2 cache, RAM not tested
much).
2 = In-place large FFTs (maximum heat and power consumption, some RAM
tested).
3 = Blend (tests some of everything, lots of RAM tested).
11,12,13 = Allows you to fine tune the above three selections.
Blend is the default. NOTE: if you fail the blend test, but can pass the
small FFT test then your problem is likely bad memory or a bad memory
controller.
Type of torture test to run (3): 3

Accept the answers above? (Y):
[Main thread Apr 11 23:22] Starting workers.
[Worker #1 Apr 11 23:22] Worker starting
[Worker #2 Apr 11 23:22] Worker starting
[Worker #1 Apr 11 23:22] Setting affinity to run worker on logical CPU #1
[Worker #2 Apr 11 23:22] Setting affinity to run worker on logical CPU #2
[Worker #1 Apr 11 23:22] Beginning a continuous self-test to check your computer.
[Worker #1 Apr 11 23:22] Please read stress.txt. Hit ^C to end this test.
[Worker #2 Apr 11 23:22] Beginning a continuous self-test to check your computer.
[Worker #2 Apr 11 23:22] Please read stress.txt. Hit ^C to end this test.
[Worker #1 Apr 11 23:22] Test 1, 6500 Lucas-Lehmer iterations of M12451841 using AMD K10 type-2 FFT length 640K, Pass1=640, Pass2=1K.
[Worker #2 Apr 11 23:22] Test 1, 6500 Lucas-Lehmer iterations of M12451841 using AMD K10 type-2 FFT length 640K, Pass1=640, Pass2=1K.