From: Avery Pennarun (apenwarr_at_nit.ca)
Date: Wed 04 Aug 2004 - 17:23:37 BST

On Wed, Aug 04, 2004 at 11:16:25AM +0300, Ehab Heikal wrote:
> Avery Pennarun wrote:
> >We always run a memory test on every server before it ships, and we return
> >about 25% of motherboards to the manufacturer before shipping because of
> >this :( Some computers do go bad about a year later, eg. because of the
> >"exploding capacitor" problem that started a couple of years ago.
> I know this is not the core of this list but could you elaborate on how
> is hardware bad these days. What kinds of tests do you run to reduce
> this. I see that you have very very valueable know-how and would really
> appreciate it :)

Okay, you asked for it: I'll try to keep it clean, but this will turn into a
bit of a commercial plug for our products.

The most important test you can run is memtest86 (http://www.memtest86.com/)
or another heavy memory testing tool. Many motherboards (not particular
models; just individual boards of many models) just fail this outright,
especially in phases 4 and 5. If it fails memtest86, it *will* corrupt data,
end of story. We always test our servers for 24 hours with memtest86 before
shipping. (You have to do this again every time you make a change. For
example, adding extra memory or a PCI card can upset the electronics and
make the test fail, even if the memory tests out fine on another system.)

We also put a lot of work into our general "burn-in" diagnostics tools,
which stress the system by copying 1/6-disk-sized files around on a reiserfs
while blasting data through the network. We run this for at least 24 hours
as well before shipping, and it often finds harder-to-identify problems (for
example, some of the IDE drivers in Linux have been known to "rarely"
corrupt data, but our tests discovered the bugs and we talked to people
until they were fixed).
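
As a rough illustration of the copy-and-verify idea described above, here is
a minimal Python sketch. It is not the Nitix burn-in tool, which is
proprietary; the function names, file names, and sizes are all hypothetical.
It writes a file of random data, copies it repeatedly, and compares each
copy's checksum against the original. It leaves out the simultaneous network
load, and with files this small the reads will mostly come from the page
cache rather than the disk, so a real test would need files far larger than
RAM (hence something like 1/6 of the disk).

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(workdir, size_bytes, passes):
    """Write a random file, copy it `passes` times, count bad copies.

    Hypothetical helper: each copy is made from the previous copy, and every
    copy is checked against the checksum of the very first file.
    """
    src = os.path.join(workdir, "burnin-0.dat")
    with open(src, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            block = os.urandom(min(remaining, 1 << 20))
            f.write(block)
            remaining -= len(block)
    expected = sha256_of(src)

    mismatches = 0
    for i in range(1, passes + 1):
        dst = os.path.join(workdir, f"burnin-{i}.dat")
        shutil.copyfile(src, dst)
        if sha256_of(dst) != expected:
            mismatches += 1
        os.remove(src)  # keep only the newest copy on disk
        src = dst
    os.remove(src)
    return mismatches


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        bad = copy_and_verify(workdir, size_bytes=8 * (1 << 20), passes=5)
        print(f"{bad} corrupted copies out of 5")
```

Because each copy is made from the previous one, a single corruption carries
through to every later pass, so the count is only a rough indicator; any
nonzero result means the box should not ship.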

Our diagnostic tools are proprietary, but they come on our free bootable
Nitix trial CD image (about 38 megs), which you can use whenever you want:
http://nitix.com. It includes memtest86, too. (Note that if you do the
disk test, it wipes out all data on your disk!!)

Hardware engineers can also hook an oscilloscope to the various critical
signals on the motherboard to check out how clean they are; this is usually
depressing, because *most* power supplies, CPU voltages, and PCI bus signals
are at least partly out-of-spec. You can find hardware that is
actually designed correctly, but it's difficult, because of course that
costs a few extra dollars per board, and companies that charge a few extra
dollars per motherboard quickly go out of business. Usually you have to
settle for hardware that's at least "mostly" within specifications...

When people ask us to "certify" their hardware as compatible with our Linux
distribution, we do all the hardware-engineering tests too, and the complete
set of tests can take several weeks. That's for certifying *types* of
hardware, so you know the majority of boxes made like that will work. Even
without engineering certification, though, you're usually pretty safe if you
run our full (software) burn-in test for a couple of days on each box you
ship. I don't know how many of our customers do, but we try to make it easy
for them.

To learn more about the joys of our hardware certification process, visit
the third-party discussion forums at http://nitix.net-itech.com/.

Have fun,


P.S. ObOnTopicNote: I'm on this list because we're thinking of using
linux-vserver in an upcoming version of Nitix :)

P.P.S. I'd be more than happy if everyone downloaded and ran our burn-in
diagnostics tests on their hardware for free. The more hardware that gets
returned because it's crap, the smarter it becomes, economically, for the
manufacturer to not ship crap in the first place.
