Speaker Steven Ellis
Time 2010-01-19 11:55
Conference LCA2010

Hardware Raid, lots of controller issues.

Hardware had the habbit of reinitializing the raid.

Software Raid, can mix and match hardware types. Performance usually sufficient for modern day to day usage.

SATA PCI card. Sanity check before exposing business to it. Considerable hardware failures in the past.

Hardware raid controllers don’t let you see the low level disks, not possible to see smart data.

Software RAID without partition - your mad! Put in a partition table, it is only small.

Stress test with production data before going live.

Keep old volumes in case we have issues.

Users report files loaded are slightly different to when they were saved. File curruption.

Filesystem check happy, revert to old volumes.

Can’t reproduce problem on our test hardware.

Move back to production server. Still seeing data curruption.

Create dummy test files on test, running md5sum checks on, check runs in a loop. Sometimes md5sums match, sometimes they don’t. Depends on access point used to retrieve data.

Software raid exposes /sys filesystem. Force raid to run a check on itself. This takes a long time.

High mismatch count. 311424, very high, disks out of sync, count should be 0.

Repair disk mismatch. Fix filesystem issues. Wait. wait, test the system. Run checksum loop again. Surprise, still got data curruption.

Bug in Xen? Bug in VM? Reduce the problem, remove Xen. Will still have the problem.

Kernel mailing list, sata mailing list, raid mailing list, raid WD Hard Drive Issues.

Known problems with SATA controller known problems are fixed.

Time Limited Error Recovery (TLER). Enterprise series drives have this enabled by default. Standard Green drives don’t, enable at own risk. WDTLER.EXE. This wasn’t the problem.

TLER: most hardware vendors enable this by default on enterprise drives for SATA drivers, but not consumable drives. If can’t recover within certain period of time, drive is marked as bad, and error is returned to controller, rather then keep trying to find spare drive for a long time. Hard drives fail when all spare blocks are used. Harder to work out what is happening when this happens.

smartctl -a -d ata will list number of blocks that have reallocated. Can’t trust figure from seagate.

You wouldn’t normally expect lack of TLER to cause disk disk curruption, however we needed to eliminate this as a possible reason.

Check hardware conflicts. Check firmware system boards, controllers, IDE/SATA controller conflict, motherboard family, new hard drive with same version of Redhat. Suspect hardware conflict.

Same everything, different motherboard, different IDE, no problems. Internal SATA controller OK, and external SATA controller OK.