Re: Triple RAID and the use of very small numbers
"So you should start by assuming HDD faults of around 5% per year and do the maths from that, not from claimed BER figures."
5% is in the right ballpark for AFR (Annual Failure Rate). That's total failure of the disk, though, not the BER (Bit Error Rate).
Here is a link to the most recent analysis by Backblaze:
Bit/sector errors are going to be considerably more frequent than whole-disk failures (unless you count a 1TB disk completely failing as 8 terabits of errors).
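To put rough numbers on that, here is a back-of-envelope sketch in Python. It assumes a datasheet-style BER of 1 unrecoverable bit per 10^14 bits read and that bit errors are independent; both are my assumptions for illustration, not measured figures, and the function name is made up.

```python
import math

def p_read_error(bytes_read, ber=1e-14):
    """Chance of at least one unrecoverable bit error while reading
    bytes_read bytes, assuming independent errors at rate ber per bit."""
    bits = bytes_read * 8
    # Equivalent to 1 - (1 - ber)**bits, but log1p/expm1 keep precision
    # when ber is tiny and bits is huge.
    return -math.expm1(bits * math.log1p(-ber))

TB = 10**12  # decimal terabyte, as disk vendors count it

for size_tb in (1, 4, 12):
    p = p_read_error(size_tb * TB)
    print(f"{size_tb} TB read end to end: {p:.1%} chance of at least one error")
```

Under those assumptions a single end-to-end read of a 1TB disk already carries a bit under 8% chance of hitting an error, which is more than a whole year of the 5% AFR, and it climbs quickly with disk size.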
It is also worth noting that AFR and BER relate to two distinctly different failure modes. Traditional RAID protects you from complete disk failures (as measured by AFR), but is massively more wobbly when individual sectors duff out. There is also a failure mode that is a subclass of the duff sectors: latent bit errors, which basically means the disk feeds back duff data rather than throwing an error saying the sector was unreadable. This can happen for a number of reasons, including firmware bugs, phantom writes to the wrong sector, head misalignment causing the wrong sector to be read, etc. - and it happens more often than you might think. Here is a link to a very good paper on the subject:
Against these sorts of errors (by far the most dangerous kind), the _only_ available solution is a fully checksumming file system like ZFS, GPFS, or BTRFS (make sure your expectations are suitably low when trying the latter).
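For anyone wondering why a checksum helps where RAID doesn't, here is a toy sketch of the principle in Python. It is not how ZFS or any real file system lays things out on disk; the class and method names are invented for illustration, and SHA-256 is just a convenient stand-in for whatever checksum a given file system uses.

```python
import hashlib

class ChecksummedStore:
    """Toy block store that keeps a checksum apart from the data it covers."""

    def __init__(self):
        self._blocks = {}     # block_id -> bytes as the "disk" holds them
        self._checksums = {}  # block_id -> digest recorded at write time

    def write(self, block_id, data):
        self._blocks[block_id] = bytes(data)
        self._checksums[block_id] = hashlib.sha256(data).digest()

    def flip_bit(self, block_id, offset):
        # Simulate silent corruption: the stored data changes, but the
        # "disk" will still hand it back without reporting any error.
        buf = bytearray(self._blocks[block_id])
        buf[offset] ^= 0x01
        self._blocks[block_id] = bytes(buf)

    def read(self, block_id):
        data = self._blocks[block_id]
        if hashlib.sha256(data).digest() != self._checksums[block_id]:
            raise IOError(f"checksum mismatch on block {block_id}")
        return data

store = ChecksummedStore()
store.write(0, b"important data")
store.flip_bit(0, 3)
try:
    store.read(0)
except IOError as err:
    print("caught:", err)
```

Without the checksum, the read would simply return the corrupted bytes and nothing upstream would be any the wiser. With it, the corruption is at least detected, and a file system with redundancy can then fetch a good copy from elsewhere.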