Hard disk reliability study - 2005-2020

Mrkvonic · Feb 19, 2020

I think you will love this. I've published this ultra-long and insightful reliability study of mechanical hard disks conducted over 15 years (2005-2020) in a home environment, covering desktops, laptops and external devices, including environmental conditions, temperatures, usage patterns, Mean Time To Fail (MTTF), failure probability, other factors and findings, and more. Enjoy most profoundly.

https://www.dedoimedo.com/computers/hard-disk-reliability-study-2005-2020.html

Cheers,
Mrk

itman · Feb 19, 2020

I have a WD Blue; i.e. WD2500KS, circa 2006 that I reinstalled last Sept in my Win 10 build when my much newer Seagate Barracuda drive used for backups failed. WD2500KS Smart only shows 22 ECC errors which haven't changed since installation. HD Tune states this could be a sata cable problem but I am not worried unless those counts start increasing.

Bill_Bright · Feb 19, 2020

I commend you for taking the time and effort to consolidate and aggregate all this data from 15 years of use. I find the results interesting but I am sorry to say, its not very useful or valuable for us normal consumers. That is NOT a criticism of you, all the effort you put into it, or the report. Let me explain.

I have never seen such a report, at least not one like this that centers on long term use in a home environment. And that is exactly what most consumers need. We might see RMA reports from the likes of Newegg or Amazon, but those reports tend to be highly skewed as they include products returned because they were damaged during shipping, were the wrong size or wrong color, missing parts out of the box, or they were returned for other reasons not associated with failure under normal use. They also include DOAs which really don't reflect the reliability of that particular model. And RMA reports never address longevity.

So a report like yours that covers many years is a fantastic idea and I truly applaud your efforts to bring one to us.

The problem is, there just aren't enough samples. A sample size of one is anecdotal at best. Even the finest products from the best manufacturer will occasionally have a sample that fails prematurely. That does not mean that is an unreliable model or brand. To be of real value, there needs lots of samples of the same model number. The more samples the greater the accuracy and value. That's just simple Statistics 101.

What makes Backblaze informative and useful is, in most cases, they evaluate 100s or even 1000s of the same model number over time. But sadly, that is not done in a home environment.

I appreciate you acknowledged the exact model numbers are not listed. But unfortunately that information is critical to be of value to the consumer. For example, your report shows that 2 identical WD Blue 1000 drive were "OK" but another WD Blue 1000 failed. Were they the same model numbers from the same series? We don't know, so we don't know what to buy or what to avoid. There was a WD My Book 500 that failed, but another was OK. Were they the same model numbers? We don't know.

You are 100% correct that revision numbers change too - but that is often to correct firmware bugs or to replace failing components on the controller board with reliable components from a different source - fixes that may make later model numbers very reliable - and thus desirable to us consumers.

"Laptop disk" is listed 5 times. What does that even mean? Are they the same brand? Different brands? We don't know. I note the difference between a "laptop disk" and "PC disk" is often based solely on the type computer it is put in. While you don't normally see 3.5" disks in laptops, it is not uncommon to see the exact same model number 2.5" disks used in notebooks also used in PCs (or in enclosures) Beyond that, nothing makes a disk a laptop disk different from a PC disk. So without a model number, those entries, again, are of no value to us.

I also appreciate you acknowledge your bias towards WD. But Seagate is the world's largest HD manufacturer with 40% of the market share to WD at 36%. One of the more common questions I get when people are seeking advice is, "Seagate or WD?" But your report is void of Seagate drives.

So again, I find your report very interesting and I do appreciate the effort you put into it. But do I love it? Sorry, but no. It really serves no "value" for us.

I understand pride of authorship but please don't take this critique personally. I meant nothing personal.

itman · Feb 19, 2020

@Mrkvonic here's one for you.

For same above noted Win 10 build, recently replaced boot drive and installed Win 10 1909 fresh. Drive is Seagate Constellation 1TB purchased in 2013 but never used. I did run HDD diagnostics on the drive at time of drive purchase and everything checked out. I have been getting an I/O error per Win log entry every time I boot. No more errors thereafter. Using MS standard SATA driver in AHCI mode. Any ideas on this?

Log Name: Microsoft-Windows-Storage-Storport/Operational
Source: Microsoft-Windows-StorPort
Date: 2/19/2020 9:16:27 AM
Event ID: 534
Task Category: Miniport logs an error.
Level: Error
Keywords: Event logged by Miniport
User: N/A
Computer: XXX-PC
Description:
Miniport logs an error for device(Port = 6, Path = 1, Target = 0, Lun = 0).
Id: 24
Error description: Finished I/O with error
Corresponding Class Disk Device Guid: {00000000-0000-0000-0000-000000000000}
Adapter Guid: {f0dc9716-4cec-11ea-a3e1-806e6f6e6963}
Miniport driver name: storahci
VendorId:
ProductId:
DataLength: 176
Data: 0x01000000B00000000100000000000000160000000001000008002882000000005842525301000000B80000000000000058010800000000007B000000020000000A0000000000000000000000800000000100000010000000706D493287B8FFFF00000000000000000000000000000000000000000000000020E0608481A7FFFF40E4608481A7FFFF00000000000000009000000000000000A0000000000000000010000000000000FF51040000000000
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-StorPort" Guid="{c4636a1e-7986-4646-bf10-7bc3b4a76e8e}" />
<EventID>534</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>17</Task>
<Opcode>109</Opcode>
<Keywords>0x800200000000000</Keywords>
<TimeCreated SystemTime="2020-02-19T14:16:27.724567500Z" />
<EventRecordID>432</EventRecordID>
<Correlation />
<Execution ProcessID="0" ThreadID="0" />
<Channel>Microsoft-Windows-Storage-Storport/Operational</Channel>
<Computer>XXX-PC</Computer>
<Security />
</System>
<EventData>
<Data Name="PortNumber">6</Data>
<Data Name="PathID">1</Data>
<Data Name="TargetID">0</Data>
<Data Name="LUN">0</Data>
<Data Name="ClassDeviceGuid">{00000000-0000-0000-0000-000000000000}</Data>
<Data Name="AdapterGuid">{f0dc9716-4cec-11ea-a3e1-806e6f6e6963}</Data>
<Data Name="MiniportName">storahci</Data>
<Data Name="VendorId">
</Data>
<Data Name="ProductId">
</Data>
<Data Name="DataLength">176</Data>
<Data Name="Data">01000000B00000000100000000000000160000000001000008002882000000005842525301000000B80000000000000058010800000000007B000000020000000A0000000000000000000000800000000100000010000000706D493287B8FFFF00000000000000000000000000000000000000000000000020E0608481A7FFFF40E4608481A7FFFF00000000000000009000000000000000A0000000000000000010000000000000FF51040000000000</Data>
<Data Name="Id">24</Data>
<Data Name="Description">Finished I/O with error</Data>
</EventData>
</Event>
Click to expand...

zapjb · Feb 19, 2020

I don't want to know. LOL. I have a hard enough time not being in the present moment.

Mrkvonic · Feb 19, 2020

Bill, statistically, 30 devices is enough to have a meaningful report. Of course, like you said - and I wrote - there are lots of other things to take into account. But it is indeed impossible to cover them by "one" household. I tried my best to account for what's feasible and reasonable. But indeed, I haven't tried too many different manufacturers. Even so I guess this is pretty indicative, even on the conservative side, of what you might get if you go for whatever mix of hard disks you need for your home.

itman - firmware error, driver error, controller error, cable error, disks error, can't say more. My guess, if the disk works fine - probably driver error.

Thanks,
Mrk

Bill_Bright · Feb 19, 2020

Mrkvonic said: ↑

Bill, statistically, 30 devices is enough to have a meaningful report.
Click to expand...

It would be if they were 30 identical drives. Or even if 5 samples of 6 different model numbers. Even 3 samples of 10 model numbers might be meaningful. But here, most are a sample size of one. And they don't specify the model number. So the report does not reveal any trends, good or bad. It does not reveal brands to avoid, models to avoid, series to avoid - or buy.

A single sample that fails could simply be an exception. And exceptions don't make the rule. And the reverse is true too. A single sample that passes all tests could be an exception too.

If I did a study of power supplies that included a single Corsair 500W supply and reported that one Corsair 500W failed, would that suggest all Corsair 500W supplies are of poor quality and unreliable? Not at all because (1) a sample of one does not show a trend and (2) we know Corsair has several series or tiers of PSUs and Corsair's top tier are quality, reliable power supplies. So would that be meaningful information about Corsair 500W supplies? If I did a study of power supplies and didn't include even one EVGA or SeaSonic, would it be meaningful?

So, no. It really isn't "meaningful". Interesting? Absolutely!

What would have been meaningful is if 15 years ago, you got all your friends and colleagues to keep track as you did. Then you consolidated all that data (but, again with model numbers and all the market leaders). That would have been meaningful and of value to us home consumers. So why didn't you do that?

Mrkvonic said: ↑

I tried my best to account for what's feasible and reasonable.
Click to expand...

And that is clearly obvious and appreciated.

Mrkvonic · Feb 19, 2020

The focus is not specifically around brands - it's about failures and data backups - that's the most important thing, I believe.
One, you can't expect disks to warn you when they are going to fail. Two, my percentages/time are thus - hence the data redundancy plan.
Getting people to do this ... well, I might as well run a penal colony

Cheers,
Mrk

Bill_Bright · Feb 19, 2020

Mrkvonic said: ↑

The focus is not specifically around brands - it's about failures and data backups
Click to expand...

I have to admit, I am really confused now. Absolutely, backups are critical. And I absolutely mean backups - as in more than one copy - a point you also stress. And it is sad when everyone knows they should have a robust backup plan, and use it, but most don't. We always hear one of these excuses and not until they lose the only copy of their data that they then might do something about it.

I guess where I'm confused is if model numbers don't matter and brands don't matter, then why break it down at all? Of 30 drives, you had 5 failures. That's a 17% failure rate. Not good! In fact, that's horrible!

The true fact is this; all drives will fail - eventually (unless retired before that eventuality).

IMO, since brand and model numbers don't matter, then it also does not matter if those failures occurred in the 1st year, 5th year or 10th year. The fact 17% of all drives failed during while in use is proof enough to conclude users need to have a robust, multi-copy backup plan, and use it. Breaking it down by brand (in some cases), years in use, type drive, etc. just obfuscate the true conclusion you are trying to make.

JMVHO

Mrkvonic · Feb 19, 2020

That is 17% over 7 years average disk life, so that is about 2.5% annual in the worst case, hence two or three data copies are needed (at least).
Mrk

Bill_Bright · Feb 19, 2020

Mrkvonic said: ↑

hence two or three data copies are needed (at least).
Click to expand...

Two or three backup copies are needed. That is, 3 copies minimum are needed for a "robust" backup plan; the original and 2 backups, preferably with one of those backup copies being stored off-site.

That said, if the focus is about the importance of data backups, it is important to note data loss is NOT just about drive "failure". Data "corruption" can be just as devastating and can occur due to unexpected power loss or malware. Data loss can occur totally through user error by accidentally deleting files, wiping or formatting drives. Data loss can occur due to natural disasters (fire, flood, tornado or hurricane). Or a bad guy can break into the home and steal the computer (and the external backup drive too that folks typically have sitting next to their monitors). Natural disasters and badguys illustrate the need for an "off-site" backup copy.

If the focus was to be about data backups ("the most important thing"), then all that information about those drives, including their size, time in service, etc. makes that focus a bit fuzzy. I mean the title of your article is "Hard disk reliability study - 2005-2020", not "Why you need to have a robust backup plan".

reasonablePrivacy · Feb 20, 2020

Mrkvonic said: ↑

That is 17% over 7 years average disk life, so that is about 2.5% annual in the worst case, hence two or three data copies are needed (at least).
Mrk
Click to expand...

I have read your article "My backup strategy" and I think it has one assumption, that is not true in many cases.

All right, let's go with the very high and probably super-pessimistic 50% figure. Therefore, your one hard disk has a 50% chance of dying within the first three years of its life. If you have two disks containing the same date, the risk goes down to 25%, three disks down to 12.5%, and so forth. This means that if you have six hard disks, the chance of a complete data failure of all your disks in three years is only 1.5%.
Click to expand...

Assumption is: one would not buy new drive immediately after failure of previous drive and continue to use only remaining drives.
If you buy new drive immediately after failure then whole math is a lot more complicated.
What means immediately? There may be a scenario user doesn't know about failure for a few days or weeks. And even if he know then buying may take few days to ship to home. So there is a time window. You would need to calculate risk of all drives failure in that short time.

As @Bill_Bright noted backups don't only protect from drive failures. Backups protect from malware, file corruption due to power loss or software errors, bad guy taking your data drives from you, or mistakes user may do: deleting or modifying data that makes data less useful.

Mrkvonic · Feb 20, 2020

Assumption is - data is present simultaneously on the X devices (copies).
Mrk

Linux Build · Feb 21, 2020

There is a hard drives reliability research based on collected SMART reports over the past 5 years: https://github.com/linuxhw/SMART

reasonablePrivacy · Feb 22, 2020

Mrkvonic said: ↑

Assumption is - data is present simultaneously on the X devices (copies).
Mrk
Click to expand...

What is X? X is a starting number of drives? When starting number of drives increases then probability of at least one drive failure increases too. In that case probability of not having all X drives in operable state should increase, not decrease. The larger X is then the less chance you have to have X drives after time Y (caeteris paribus) compared to the same time Y but smaller drive count W < X.

Mrkvonic · Feb 22, 2020

If one drive has a 10% probability of failure, if you have two such drives with the same data (identical copies), then the probability of loss of data due to both disks failing at the same time is 10% x 10%, so 1%.
Mrk

reasonablePrivacy · Feb 22, 2020

It is hard to discuss with you, because you often reply with just one sentence.

Mrkvonic said: ↑

If one drive has a 10% probability of failure, if you have two such drives with the same data (identical copies), then the probability of loss of data due to both disks failing at the same time is 10% x 10%, so 1%.
Mrk
Click to expand...

10% probability of failure over 3 years I assume.
It is not 1% probability of data loss if you you discover drive failure within minutes, buy new drive (or use other spare drive in operable state) and place backup here within few hours. It is non-zero probability, but considerably less than 1%.

Mrkvonic · Feb 22, 2020

Well, I explained that in detail in both articles, I hope, hence the one sentence wonders.

Yes, there is the window where you have downtime due to a failure and increased temporal risk in the time window where you need to introduce the new component. So you can then go with triple or quadruple redundancy, have spares ready, so the time to resume normal state goes down to maybe minutes.

The time frame that defines the probability for failure affects cost, not your redundancy. The probability determines your acceptable redundancy level.

Mrk

142395 · Feb 23, 2020

Haven't read the all arguments, replying only for some mathematical part. Sure, if a disk fails just after 3 years (I'm not sure of my English, I mean if you buy a disk at Jan. 1, 2010 then it fails exactly at Jan. 1, 2013 but probabilistically, in this case 10%; it's also okay to assume a disk fails within 3 years with the probability but the owner don't buy new one), if you buy 2 disks and have the same data on both, the risk you lose both at the day is 1% while the probability both disks survive is 81% and only either one survive is 9%. Ofc it's not realistic, it'll be better to assume the failure as a function of time (ignore temperature etc. for simplicity). I don't know what function is good for this (if we have enough data, we can estimate the function), but one candidate may be:
Pr(failure) = exp(-(-ln(time))^γ)
where γ is a positive parameter determining the shape of the function. In this case time must be within 0 to 1 so let's assume time = 1 means 30 years - a scenario that after 30 years all disks will fail w/ 100% certainty. If γ = 1 (linear), the probability a disk will fail after 3y is 10% as the previous discussion. But now you can calculate the probability for more realistic cases such as what @reasonablePrivacy talked. but I think γ = 1 is not likely, probably γ > 1 in reality.

For the original discussion, it seems two disks failed after 4y, one after 5y, one after 8y, and the last after 10y. Simply summing up this by 16.7% per 10y and 1.67% per year is wrong because they failed in different timings. Even assuming a disk fails w/ fixed 1.67% probability every after a year INDEPENDENTLY (throwing a dice every after year), after the first year the surviving probability is 98.33%, after two years it's 96.69%, and ten years still 84.5%, not 100 - 16.7 = 83.3%.

stapp · Feb 23, 2020

The post here links to a large amount of stats on hard drive reliability (just scroll down the page a bit)

https://www.wilderssecurity.com/threads/hard-disk-reliability-study-2005-2020.426252/#post-2896622

reasonablePrivacy · Feb 23, 2020

142395 said: ↑

I don't know what function is good for this (if we have enough data, we can estimate the function)
Click to expand...

I think that bathtub curve is usually a candidate. The idea of bathtub curve is that clearly defective machines will have very early failure. Machines that are not clearly defective will survive time of very high failure rate and fail at a constant failure rate later. It does not apply to all machines, but it is a classic example.

Mrkvonic · Feb 23, 2020

Yuki, I agree with you on the curve - it's several independent curves, but for the sake of simplicity, I went with the simplest one.
If we had the exact equation, we could predict disk failures accurately, which is not the case in the industry. Far from it.

I analyzed the cumulative failure by doing:

Sum of failures/divided by #years, so 1/30 disk after 4 years gives it roughly 0.75%, but say 3/30 disks after 5 years, gives it 2%.

Besides, because of the lack of exact math in this case, I try not to focus on the equation. I just look at end states (worst case) scenario. And then, derive from that the expected failure rates, and then decide how important the data needs to be so that I won't lose it if disks go bust.

Mrk
Mrk

reasonablePrivacy · Feb 23, 2020

Mrkvonic said: ↑

Besides, because of the lack of exact math in this case, I try not to focus on the equation.
Click to expand...

I also don't focus on exact math, but I don't agree to ignore human action. Two cases:
1. Somebody buys 2 HDDs. One of them fails after some time, person discovers that, but does nothing (human inaction).
2. Somebody buys 2 HDDs. One of them fails after some time, person discovers that, buys another drive and places backup there as soon as possible (human action by reaction to an event).
Even without close look and mathematical calculations common sense tells us that probability of data loss is different in these scenarios.

142395 · Feb 23, 2020

Well, seeing is believing. Let's assume the failure rate is linear and after 40 years all disks will fail as expectation (Google search suggested 40 - 50y). For simplicity, I use discrete time whose unit is a month.

Case 1: a user always use at most 2 disks (tho personally I'd like to recommend at least 2 simultaneous backups i.e. 3+ disks, let's assume more risky scenario). If one fails, he'll replace it in a month. It's possible both of his disks fail before he replaces.

Case 2: another user bought 5 disks at first and never buy another. All other conditions are the same.

I simulated both cases w/ 50000 users for each. Every month, each disk can fail w/ 0.20833% probability.

Results:

Case 1: 105/50000 users had failure of both disks before 40y. The distribution of time to full-failure for those 105 users is shown below (the vertical axis is the number of users, the horizontal is time range in months). Users bought 1.986 new disks (other than initial 2) on average of 50000 users.
https://i.imgur.com/XipVXQ8.png

Case 2: 5001/50000 users had failure of all disks before 40y. The distribution of time to full-failure for those 5001 users is shown below.
https://i.imgur.com/n1DRITu.png

Bill_Bright · Feb 23, 2020

Mrkvonic said: ↑

If we had the exact equation, we could predict disk failures accurately, which is not the case in the industry. Far from it.
Click to expand...

It is not the case in the industry because developing an exact equation to accurately predict when a specific disk will fail is simply impossible.

That would be great if possible, however. Companies, governments, institutions and individual consumers could "strategically" plan and budget for those future expenditures because they would know exactly when those resources would be needed.

IT managers could schedule downtime to periodically replace the drives before their "expiration dates". Now that would be really nice!

And while there certainly are other factors dictating the need for data backups, unexpected drive failures would no longer be one of them.

Maybe one day. But not today.

Log in or Sign up

Hard disk reliability study - 2005-2020

Mrkvonic Linux Systems Expert

itman Registered Member

Bill_Bright Registered Member

itman Registered Member

zapjb Registered Member

Mrkvonic Linux Systems Expert

Bill_Bright Registered Member

Mrkvonic Linux Systems Expert

Bill_Bright Registered Member

Mrkvonic Linux Systems Expert

Bill_Bright Registered Member

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

Linux Build Registered Member

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

142395 Guest

stapp Global Moderator

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

reasonablePrivacy Registered Member

142395 Guest

Bill_Bright Registered Member

Log in or Sign up

Hard disk reliability study - 2005-2020

Mrkvonic Linux Systems Expert

itman Registered Member

Bill_Bright Registered Member

itman Registered Member

zapjb Registered Member

Mrkvonic Linux Systems Expert

Bill_Bright Registered Member

Mrkvonic Linux Systems Expert

Bill_Bright Registered Member

Mrkvonic Linux Systems Expert

Bill_Bright Registered Member

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

Linux Build Registered Member

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

142395 Guest

stapp Global Moderator

reasonablePrivacy Registered Member

Mrkvonic Linux Systems Expert

reasonablePrivacy Registered Member

142395 Guest

Bill_Bright Registered Member

Useful Searches