MTBF – Mean Time Between Failures or why your hard drive won’t last 170 years…

I ran across this a while back:

http://www.electronista.com/articles/08/12/17/toshiba.512gb.ssd/

…in which the author breathlessly proclaims: “Longevity is also a focus
with about one million hours (114 years) between failures thanks to
reduced wear on potentially short-lived memory cells.”

Wrong.

I often get asked about MTBF (Mean Time Between Failures), and it’s
amazing how many “industry people” don’t understand it.

And for those who have already figured out that their 1.5M MTBF drives
don’t last 171 years, but are not sure what that MTBF thing is…
here’s a vastly oversimplified quickie:

Why your hard drive doesn’t last 150 years.

(There are about 8,760 hours in a year, but to make this example
simple, let’s call it 10,000.)

Here’s how MTBF works: it’s an aggregate of many units based on
expected life of a single unit.

Let’s say you have a hard drive that is warranted to last 3 years, or
30,000 hours.

You put it in a server, and behold, it lasts 3 years. You take it out
and put in a new one, and that also lasts 3 years. So you replace it
with a new one, and that too…. well, you get it.

Let’s say you keep doing that and finally, on the 50th unit, only two
years into its life, it breaks.

You now have 3 years or 30,000 hours per unit, times 50 units =
1,500,000.

And that’s your MTBF.

So anyone who says “Wow! MTBF of 1.5 million hours! That means this
thing will last (1.5M / 10000) 150 years!” -clearly- doesn’t know what
they’re talking about.

(I have grossly oversimplified the description of MTBF calculation in order to make the general point that it is NOT life-expectancy. MTBF is more complex than my example, including “infant mortality”
and “wear out” phases; “theoretical” vs “operational” MTBF and so on,
but the gist of what’s here is correct.)
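
For anyone who wants to see the arithmetic spelled out, here is a minimal Python sketch of the simplified example above. The function name and the constant are my own inventions for illustration, and it deliberately uses the same oversimplification as the text (total unit-hours divided by failures), not a real reliability model.

```python
HOURS_PER_YEAR = 10_000  # the article's rounded figure; the real value is about 8,760


def mtbf_hours(hours_per_unit: float, units: int, failures: int) -> float:
    """Total accumulated unit-hours divided by the number of failures observed."""
    if failures <= 0:
        raise ValueError("need at least one failure to compute an MTBF")
    return hours_per_unit * units / failures


# The article's example: 50 units, 30,000 hours each, one failure.
mtbf = mtbf_hours(hours_per_unit=30_000, units=50, failures=1)

# The naive (wrong) reading: treat MTBF as the lifetime of a single drive.
naive_years = mtbf / HOURS_PER_YEAR

# What the example actually assumed about one drive: its 3-year service life.
service_life_years = 30_000 / HOURS_PER_YEAR

print(f"MTBF: {mtbf:,.0f} hours")
print(f"Naive 'lifetime': {naive_years:.0f} years")
print(f"Assumed service life per drive: {service_life_years:.0f} years")
```

The same 1.5 million hours comes out whether you read it as "one drive for 150 years" or "50 drives for 3 years each." Only the second reading matches how the number was built.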

How long will my drive last?

3 years, on average.

OK, if it’s warranted by the manufacturer to last three years, it should last three years. If it’s warranted for 5, it should last 5. You can add maybe as much as a year if you never turn it off, but remember, the warranty is only as long as it is.

Does that sound awfully definitive? And, How Do You Know?

OK: I’ve been at this for over 30 years now, usually with four or five computers at a time, and certainly always with several (sometimes dozens) of backup drives. Servers on 24/7 and my desktops on 15/24.

Here’s what I do with my desktops: turn them on first thing in the morning; turn them off last thing at night. That means they are off about 9 hours a day. This pattern is 365 days a year.

How long do my drives last? 3 years. Every now and then a few months longer. I don’t recall shorter (unless they went out in the first 60 days… and there have only been two of those.) Maybe 6 months longer in a 24/7 server.

Still, being terminally curious, I called an actual drive engineer at (it was) Maxtor (at the time) and asked her specifically about the warm-up / cool-down cycle, and how that affected the drive life. (It was not easy to reach a real engineer. They don’t like to take calls from the general public.) I explained how I cycled my drives on and off, and asked if leaving them on would make them last longer or less. Was I gaining 9 x 365 hours by turning them off?

Her answer was simple: “We consider all that when we anticipate drive life. If you turn them on and off, yes, it will reduce drive life… but by just about the amount of time you actually have them off – about 1/3 of a day, due to the stress on the bearing, mostly. So what that means is that if you turn your machine on and off once a day, the drive will last about three years. And if you leave the drive on 24/7 it will last about 3 years.”

“Think about it: we’re not going to get into a situation where we are constantly underestimating or overestimating our warranty period,” she said. “Either way, it would cost us money.”

So there you are. My own anecdotal evidence and the word straight from the horse’s mouth.

My working assumption ever since is this: turning on a cold drive and letting it fully warm up, and then shutting it off equals about 8 hours of wear. Yes: that means your removable drives for backups and so on. Yes: if you turn it on and off and on and off and on and off, warming up and cooling off several times a day, you are probably shortening the drive life. (That might just explain why notebook drives don’t seem to last as long, eh?)
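
Here is that working assumption as a back-of-the-envelope Python sketch. The 8-hour figure is my rule of thumb from above, not a measured value, and the function name and the backup-drive scenario are made up for illustration.

```python
WEAR_HOURS_PER_POWER_CYCLE = 8  # the article's working assumption, not a measured value


def wear_hours_per_day(hours_on: float, power_cycles: int) -> float:
    """Equivalent hours of wear per day: running time plus a fixed cost per cold start."""
    return hours_on + power_cycles * WEAR_HOURS_PER_POWER_CYCLE


desktop = wear_hours_per_day(hours_on=15, power_cycles=1)  # on 15 hours, one cold start
server = wear_hours_per_day(hours_on=24, power_cycles=0)   # never turned off
backup = wear_hours_per_day(hours_on=3, power_cycles=3)    # hypothetical: three 1-hour sessions

print(f"Desktop: {desktop} wear-hours/day")
print(f"Server:  {server} wear-hours/day")
print(f"Backup:  {backup} wear-hours/day")
```

Under this assumption the desktop (23) and the server (24) come out nearly the same, which is the engineer's point. The drive that is cycled three times a day racks up more wear (27) than either, even though it only runs for three hours.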

Finally, people ask me, well then, why do you turn off your computer at night? Well, 1) if you read closely above, you’ll find that in terms of drive life, it really doesn’t make any difference, and 2) because 9 x 365 saves me $235 per year in electricity.

LATER….

Wow! Did I ever get a bunch of passionate letters telling me what an idiot I am for proclaiming “that drives will last for 3 years.” Well, first, I didn’t. I said that a drive will last, on average, as long as the manufacturer thinks it will last. That’s hardly news.

Many people said “but mine have lasted 10 years!” Well, if you only turn on your computer for 20 minutes every week, they should last for 20 years, eh? “You’re full of it. I’ve had them fail inside a year.”

Sigh. What part of reality did you miss?

Interesting the passions this silly subject evokes…

You could say the same thing about anything physical/mechanical.

Cars? Refrigerators? Airplanes? Washing machines?

Drives are mere mechanical devices. They have a serviceable life expectancy, just like cars and refrigerators. Some people’s cars last for 600,000 miles; some last for 28. So what? Does that mean we can’t ask about life expectancy? Does that mean that all such questions are useless?

Of course not. It’s an expectation, not a guarantee!

Generally speaking, how long will a hard drive last? On average, as long as its warranty. (about right)
Generally speaking, how long will a car last? On average, as long as its warranty. (These days, 100,000 miles.)
Generally speaking, how long will a refrigerator last? On average, as long as its warranty. (often longer, fortunately)

I think that understanding that your drive will likely last about X amount of time IS useful information. No one is -guaranteeing- how long it will last; no one is suggesting that life expectancy is an excuse for failing to back up; no one is even suggesting that drives won’t fail prematurely or last 5 years longer than average.

That’s what “average” means, after all. Hard drives have no more random failure than anything else.

I think the passion connected to drive life is more associated with the trauma associated with the loss of precious data when a drive goes south, and you have no recent (or any) backup.

The other thing is that the products in my examples (cars and so on) don’t totally, completely die – they wear out – and remain usable after the warranty. Hard drives (for the most part) either work or they don’t.

One writer calls drives “a crapshoot,” a sentiment with which others apparently agree.

It’s my experience that a 1 year warrantied drive won’t last as long as a 5 year one. That’s useful information to me.

If you think not, then you should buy the cheapest drives you can find. However, I think it’s fair to say (and quite likely) that your experiences are going to be less satisfactory if you do.

Just like buying cars or refrigerators or anything mechanical.

People make the mistake of thinking I’m saying that their specific drive will last three years. How could I possibly know that? It’s an average.

Generally speaking (on average) there are 12 hours of daylight. Specifically speaking, only 2 days out of 365 actually have exactly 12 hours of daylight. Will your specific, individual 3-year warrantied drive last three years? I can’t possibly say… but I can say that on average, it will.

YMMV.

On Drive Repair Utilities

Most people understandably don’t have a clue about how a hard drive stores information, and therefore don’t have a clue about what it takes to repair it when something goes wrong.

Think of a hard drive like a book. There is a table of contents which points at the chapters. Let’s say that chapter 2 begins on page 30; chapter 3 begins on page 45 and chapter 4 begins on page 70.

Now, if the Table of Contents gets messed up and says that chapter 3 begins on page 72 (instead of page 45), you’re going to start reading in the wrong place.

(Obviously here, I’m equating the actual chapters to files on your hard drive, and the Table of Contents to the drive directory “the catalog.”)

So far that’s not too bad, and products such as DiskWarrior can easily go thru the actual “book” and find where the chapters begin, and replace the defective table of contents with a new one.

This is by far the most common corruption on a hard-drive, and the easiest to fix.
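
To make the book analogy concrete, here is a toy Python sketch of that kind of repair: ignore the damaged table of contents, scan the pages themselves, and build a fresh one. The page layout and function names are invented for the example. This is not how DiskWarrior or any other product actually works internally.

```python
# A toy "book": each page is a string, and a chapter's first page starts with a heading.
# The layout (heading text, page numbers) is invented for illustration.
def make_book() -> list[str]:
    pages = ["..."] * 100
    pages[30] = "CHAPTER 2"
    pages[45] = "CHAPTER 3"
    pages[70] = "CHAPTER 4"
    return pages


def rebuild_contents(pages: list[str]) -> dict[str, int]:
    """Ignore the old table of contents; scan every page and record where chapters start."""
    contents = {}
    for page_number, text in enumerate(pages):
        if text.startswith("CHAPTER "):
            contents[text] = page_number
    return contents


book = make_book()
damaged = {"CHAPTER 2": 30, "CHAPTER 3": 72, "CHAPTER 4": 70}  # chapter 3 points at the wrong page
repaired = rebuild_contents(book)

print("Damaged: ", damaged)
print("Repaired:", repaired)
```

The repair works here only because the chapters themselves are intact and easy to recognize. That is the easy case.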

However, in reality, the situation is considerably more complex, because instead of each page in a given chapter following one after the other, they are scattered all over the book. In the worst case scenario, no two adjacent pages are actually adjacent to each other.

So, in order to keep track of that mess, there is a hidden list of where things are, “attached” to the chapter listing in the Table of Contents. In order to “read” a chapter’s worth of pages in sequence, the computer has to refer back to this hidden list (called a B-Tree) to find each page. These B-Trees point at “Extents,” which are contiguous blocks of space on the drive.

Already finding and fixing a corrupted directory becomes far more difficult than our first simple example.

But still doable, based on certain assumptions…

But we’re not done yet… because this is no ordinary book! The fact is that the reader can write new pages and add them, or delete old pages. Now the process of keeping the contents up to date is even more complex… and made harder because that list of extents may run out of space. So if that’s the case, there is a structure created, called the Extents Overflow, which is also examined to find where the rest of the pieces of the file might be.

You’ve got lists pointing at lists of lists which point at pages in chapters or documents…. it’s “non-trivial” as they say.
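
Here is a rough Python sketch of the "lists pointing at lists" idea. Everything in it is a stand-in: a plain dictionary plays the part of the B-Tree, the limit of three extents per entry is an arbitrary number I picked to keep the example small, and the names are hypothetical.

```python
from dataclasses import dataclass, field

Extent = tuple[int, int]   # (first block, number of contiguous blocks)
INLINE_EXTENT_LIMIT = 3    # hypothetical limit, chosen small for the example


@dataclass
class CatalogEntry:
    name: str
    extents: list[Extent] = field(default_factory=list)  # stored with the entry itself


def add_extent(entry, overflow, extent):
    """Store the extent with the catalog entry, or in the overflow list once it is full."""
    if len(entry.extents) < INLINE_EXTENT_LIMIT:
        entry.extents.append(extent)
    else:
        overflow.setdefault(entry.name, []).append(extent)


def blocks_in_order(entry, overflow):
    """Walk the inline extents, then the overflow extents, to list a file's blocks in reading order."""
    blocks = []
    for start, length in entry.extents + overflow.get(entry.name, []):
        blocks.extend(range(start, start + length))
    return blocks


overflow: dict[str, list[Extent]] = {}
report = CatalogEntry("report.doc")
for extent in [(40, 2), (7, 1), (91, 3), (12, 2)]:
    add_extent(report, overflow, extent)

print(report.extents)                     # the first three extents
print(overflow)                           # the fourth one spilled over
print(blocks_in_order(report, overflow))  # blocks scattered all over the "drive"
```

Reading one file already means consulting two lists. If either one is wrong, the file comes back with the wrong pages in it.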

NOW… suppose you delete one of those pages. Heck, delete several of them (which you think of as a series of contiguous pages, but which are in fact scattered all over the drive) and you have “holes” in the lists of lists…. all of which have to be updated to indicate that they are now empty space on the drive, and can be used for something else.

Whew! NOW… you go add a new file or report or install a new program, and some of it goes in those holes. Well… it goes where the lists -say- that the holes are.

However, if something goes wrong; the power goes out; some installer doesn’t work right… and one of those lists ends up reporting that there is free space when in fact that space is actually used by a program or file…and the new program overwrites it (because the list incorrectly said it was empty)… well NOW you’ve got a real mess.

You’ve got a totally corrupted hard drive, and no amount of Disk Warrior, Drive Genius, TechTool Pro et al is likely to succeed in fixing it. That’s because there’s data where the lists say there is supposed to be data, and data where the lists say the space can be reused. (Deleting things does not do anything at all to the data: it just changes the lists themselves.)
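
The disagreement between the lists can be sketched in a few lines of Python. This is only an illustration of the cross-check idea, with made-up file names and block numbers. I am not claiming any of the utilities named above work this way.

```python
def blocks_of(extents):
    """Expand (start, length) extents into the set of block numbers they cover."""
    return {block for start, length in extents for block in range(start, start + length)}


def wrongly_free(files, free_blocks):
    """Blocks that some file still uses but the free-space list offers for reuse."""
    conflicts = {}
    for name, extents in files.items():
        overlap = blocks_of(extents) & free_blocks
        if overlap:
            conflicts[name] = sorted(overlap)
    return conflicts


files = {
    "letter.txt": [(10, 3)],          # blocks 10, 11, 12
    "photo.jpg": [(20, 2), (30, 2)],  # blocks 20, 21, 30, 31
}
free_blocks = {0, 1, 2, 30, 31, 40}   # 30 and 31 are marked free by mistake

conflicts = wrongly_free(files, free_blocks)
print(conflicts)
```

A check like this can spot the disagreement while both lists still exist. Once something new has actually been written into blocks 30 and 31, the old contents are gone and no cross-check brings them back.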

Some repair utilities have routines that can handle -some- of this kind of corruption, but only if it happened THIS way and not THAT way, while others can handle those lost chains only if they happened THAT way, but not THIS.

And if you run too long with a disk that has corrupted lists of what is free space and what is not… if you run with those lists wrong, then as time goes by, the corruption becomes worse and worse. Eventually you’ll get to a place where absolutely no software will ever be able to repair your drive… and you’re toast.

So: that’s why anecdotal reports that Drive Genius fixed something that Disk Warrior couldn’t don’t mean diddly squat. Had the corruption been the other way around, DW would have fixed it and DG might have failed.

The fact is that the products that are out there now: Disk Warrior, Drive Genius, TechTool Pro, DiskTools Pro… all of them are good at what they do; most of them do fundamentally the same things; and all of them do those things slightly differently.

So anyone saying, as a general, overall statement “Disk Repair Product A works, and Product B doesn’t, so never buy Product B” simply doesn’t understand what is going on.

-Most- of the time, for simple directory corruption, any and all of those products will repair the catalog. Because simple corruption is the most common thing, and because Disk Warrior has chosen to concentrate on that, it has a justified excellent reputation. But there certainly are things it will not fix that the other packages will.

I hope this helps explain why you see so many “A works, B sucks… NO NO: B WORKS, it’s A that sucks!” comments out there.

HTH.

___________________________________________

Tracy,

>Thanks for another of your great contributions to this list.

>It brings up a few questions:

>If a drive is cloned, via SD or CCC, does that make files more contiguous?

Very good question. If you begin with a blank destination drive, then the answer is “yes.” This is a -much- faster way of “defragmenting” a drive (than to use defrag utilities.)

(CCC offers a “block copy” which is an -exact- physical-layout copy of the source. Using it, therefore, dutifully copies all the fragmentation of the original.)
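
A toy Python sketch of the difference, assuming a blank destination with plenty of room and a simple-minded allocator that writes each file as one run (real file systems are smarter and messier than this; the file names and block numbers are invented):

```python
def file_copy(files):
    """Copy file by file onto a blank drive: each file is written as one contiguous run."""
    copied, next_free = {}, 0
    for name, extents in files.items():
        size = sum(length for _, length in extents)
        copied[name] = [(next_free, size)]
        next_free += size
    return copied


def block_copy(files):
    """Copy the physical layout exactly: every extent stays where it was."""
    return {name: list(extents) for name, extents in files.items()}


# Each extent is (first block, number of contiguous blocks).
source = {"movie.mov": [(50, 4), (10, 2), (90, 6)], "notes.txt": [(3, 1)]}

print(file_copy(source))   # one extent per file
print(block_copy(source))  # same three fragments as the source
```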

>It used to be recommended to drag copy a main drive to a blank drive, erase the main drive, then drag copy everything back rather than using one of the utilities to clean this up. Would either method help reduce this problem?

Drag and drop copies of a boot drive will not result in a bootable drive under OSX. File links and structure are broken, and the system will not operate.

>Is a reasonably frequent restore from backups a usable method to lower this kind of corruption? Obviously, this cannot correct a power drop failure to write correctly.

Not really. You’d not want to clone back something you did, say, a week ago, because you’d be losing all the files that were newly created during that week.

Fragmentation is perfectly natural, and not to be feared. The HFS+ system does a grand job of keeping it under control. Defragmentation is normally not needed on a Mac (or at least -very- rarely.) The only common use for it is with data drives devoted to huge files, such as video files which are being edited. A severely fragmented video file will not stream thru the editor as smoothly as a defragmented one.

>One could infer from the article that using several (DW, DG, TTP) might not be a bad idea. Do they play well with each other?

Hmmm… as a “pro” with clients, I have one of each. Whether or not that’s a great idea in general is slightly up in the air. If money is not an issue, then having each wouldn’t hurt… with a caveat that comes from the second part of your question: it’s been my experience that “ganging them up” one after the other with the thought that one might fix what another missed, isn’t the best of all possible ideas.

A few years ago, before these had all matured, it was definitely a -bad- idea, since it frequently ended up making things worse. Now that they have matured, I’d be less hesitant about it… but only “less.” They each have various different routines to handle situations, and it’s almost impossible to say that they will or will not “mesh.”

My own approach is to use one, and then see if the issue is fixed. If it is, I’m done. If not, then I’ll try it again with the same software. If the issue still remains, only then will I try another. YMMV.

Someone asked if my explanation means that one should “regularly defragment your drive”…

Absolutely NOT! First, Apple’s HFS defrags many frequently used files automatically. Second, the fact that the file structure is complex doesn’t mean that defragging makes it that much less complex. Third: defragging is dangerous to your data, and should never be done at all without a current, full, bootable, backup. Fourth, if you’re going to create a clone/backup, you can defrag faster and safer by just wiping your drive and cloning back. Fifth, drives and the HFS are fast enough that you won’t likely notice any difference, unless – Sixth, you’re doing video or audio editing and you have a drive dedicated to it. THEN, you’ll find A/V will stream faster into your editor if the files are contiguous.

So, what do you use?

I own all of them. Some people say that Disk Warrior does what no others do – and that’s not correct. In fact, the analyze/rewrite catalog was first done by TechTool Pro years ago. Now they all do it. That said, Disk Warrior has concentrated almost exclusively on that aspect (although they’ve expanded their tool in later versions) and pretty much has directory corruption repair under control. Because of that, and because it’s generally thought of as a “one-trick-pony,” most folks (including me) go to it first.

That said, Drive Genius and TTP do the same thing… and they do more as well. I like Drive Genius for a quick check of the drive and for block cloning. It has other features which are used less frequently, and seems to perform well. TechTool Pro and DiskTools Pro also have their uses.

Each of these programs overlaps the others to some extent. Do you need all of them? Probably not, unless you’re doing repairs professionally. The general consensus that Disk Warrior is the go-to tool is correct for most cases, since most problems are of the kind it repairs with great success.

But honestly, the others will do the same thing, and offer more features. The thing is, as I’ve explained above, there IS NO “one best” and I wouldn’t hesitate to recommend any of these.

Too much documentation? Here’s my solution.

My own solution to too much documentation (ie instruction manuals, reference files, PDFs and so on) is FoxTrot Pro. (There is also a less expensive personal edition, and an upgrade path.)

I’ve been around the block on this (Spotlight; HoudahSpot; SOHO Notes; Devon Think Pro; Yojimbo; Shove Box; Eagle Filer… and probably several I forgot.)

I still use Devon Think Pro for “heavy lifting” because of its massive range of features, but FoxTrot offers something convenient and easy to use. Sometimes I just want to look something up and get fast, take-me-right-to-it responses. FoxTrot offers that.

Here’s what’s so: You tell FoxTrot what folders to monitor. Just drag your files into the folders. They just sit there, and FT indexes them. It does not copy them somewhere else or otherwise muck about with them. It’s not some special folder either – it may be folders you already have (as it was in my case – a couple of folders of documents that already existed; some PDFs, some sample code, so on.)

That’s the first nice thing: I don’t have to rethink how things are in the first place.

Next, it puts a quick-search into your menu bar, just like Spotlight’s search.

Type in what you’re looking for. For example, I might enter a search for “NKImageView” (which is an object in a programming language.)

FT does not need to be running, but if it’s not, it will launch. That takes about three seconds. If it is running, results are virtually instantaneous. Faster than Spotlight (largely because, I’d assume, you’ve limited the scope by specifying a few particular folders.)

Up pops a screen listing the documents where it’s found. Doubleclick the document, and it opens up a screen with NKImageView highlighted in the table of contents, and a sidebar with a list of every place within that document where the word appears (with a bit of surrounding text.)

Click on the contents or anything in the sidebar, and you’re right at the page.
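
I don’t know how FoxTrot is built inside, but the general idea of indexing a limited set of documents and jumping straight to each hit can be sketched in Python. Everything here is hypothetical: the function names, the sample documents, and the simple word index are mine, not FoxTrot’s.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the (document name, character offset) places it occurs."""
    index = defaultdict(list)
    for name, text in documents.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((name, match.start()))
    return index


def search(index, documents, word, context=20):
    """Return (document name, snippet) pairs, each snippet showing a bit of surrounding text."""
    hits = []
    for name, offset in index.get(word.lower(), []):
        text = documents[name]
        start = max(0, offset - context)
        hits.append((name, text[start:offset + len(word) + context]))
    return hits


documents = {  # stand-ins for the files in the monitored folders
    "views.txt": "Use NKImageView to display a picture. NKImageView scales images.",
    "notes.txt": "Nothing about images here.",
}
index = build_index(documents)
results = search(index, documents, "NKImageView")

for name, snippet in results:
    print(f"{name}: ...{snippet}...")
```

The slow part (reading every document) happens once, when the index is built. After that, a lookup is just a dictionary access, which fits with searches feeling instantaneous when the scope is a handful of folders.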

With FT running, total time from hitting Return to seeing the results? Faster than I can measure. 1/20 of a second? Time to click on the PDF filename and the selection within it? I dunno: how fast can you click?

Bottom line: very fast, very convenient, and at least for me (as a programmer, and a person with too much software documentation [even for non-programming things]) exactly the right solution.

www.ctmdev.com/foxtrot/personal_search/