About time someone spoke up about this nonsense.
Despite compression, dedupe and the other ways people try to reduce and manage the amount of data they store, it still seems that many storage infrastructure managers waste many thousands of pounds simply by using it according to the vendor's best practice. I spend a lot of my time with clustered file-systems of one …
Having spare capacity is like having a large storage warehouse.
You lose money, yes, but it is there for a reason.
If you are able to use a Just In Time (JIT) system for provisioning space, well, you will save good money. If you try and fail, you will lose much more.
I would very much prefer an 80% max threshold to a half-baked JIT solution.
I think it all boils down to the fact that it is never a good idea to fill up a hard disk (that you use every day) to 100%.
Try it - but not with data that is important to you. You won't like the result. Even NTFS breaks down when reaching the disk storage limit. It's a consequence of how the disk works. The worst of it is that it doesn't happen all at once. For a while, you'll be fine at 99%. Then, one day, your disk will just be unreadable. Game over. And yes, I've seen this happen to friends of mine. More than once.
So, to keep your data safe and your disk in good working condition, you ensure that your data never goes above 90% of your available disk space.
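That 90% rule is easy to automate. A minimal sketch using Python's standard library, assuming a POSIX-style mount point (the helper names are my own, not from any particular tool):

```python
import shutil

def usage_percent(path="/"):
    """Return used space as a percentage of total for the filesystem at path."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total

def over_threshold(path="/", threshold=90.0):
    """True if the filesystem at path is fuller than the safety threshold."""
    return usage_percent(path) > threshold
```

Hang that off a cron job or monitoring agent and you get told before the 99%-then-unreadable scenario above has a chance to play out.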
That obviously scales to disk arrays and massive online storage warehouses, because they all still depend on the same 3.5" HDD that you have at home.
At least, I think that's what this is all about.
"At least, I think that's what this is all about."
The Ignoramus responds:
Well, I understand all that hardware stuff, a little bit about the file systems, and a fair bit about SSD performance when full, but that was the basis of the question. If you're buying several petabytes of storage (or whatever units are reasonable), and you self-provision, then you take the known problems. But if you're buying storage as a service, surely the provisioning arrangements by the vendor OUGHT to be able to magic away the device limits across its multiple customers, and what you buy is what you can use?
I suspect from the comments that the original article refers to buying your own hardware, and if that's the case then the article writer can quit whining. Claimed capacity has always been greater than the truth, ever since a marketing dweeb decided that a megabyte must mean a million bytes, and that was at least three decades ago.
It's about fragmentation.
Despite claims to the contrary, fragmentation is a bad thing on disk arrays (OK, not nearly so bad on SSD ones), especially if they're serving multiple heavy clients (headseek is a bitch).
If you have seriously large storage, your free-space percentage can be vastly reduced. We usually only see performance degradation on Terabyte-class FSes once they hit something higher than 99% (YMMV) - a much greater source of slowdowns is users putting 32,000+ files in one directory (150,000+ in one case) and wondering why their operating system doesn't like it.
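The usual fix for the tens-of-thousands-of-files-in-one-directory problem is to shard files into hashed subdirectories. A rough sketch of the idea (function name mine, not from any particular tool):

```python
import hashlib
import os

def sharded_path(root, filename, levels=2, width=2):
    """Map a filename to root/ab/cd/filename using a hash of the name,
    so no single directory accumulates tens of thousands of entries."""
    digest = hashlib.sha256(filename.encode()).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

With two levels of two hex characters each you get 65,536 buckets, which keeps individual directory sizes sane even for millions of files.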
If you insist on using NTFS all bets are off, but I'm pretty sure that even that doesn't take disks offline at 100% full.
If any vendor told me that the shiny new 400TB (usable) SAN system I was about to pay for can only be filled to 80%, I'd be seeking another vendor (a couple have, and that's why they aren't under consideration anymore).
Then again, if XYZ vendor sold me a 400TiB usable array which had 600TB in it, I would take the 400TB usable statement happily.
When you are only using 75% of the available space, there is a fair chance that the OS can put new data somewhere it can reach quickly after accessing related data. When you are using 95% of the available space, there is a fair chance that the OS will have to scatter data into places that will require head moves and disc rotations to get it all back.
Modern filing systems have at least one separate free space list per core. That way, any core can allocate (or release) some space without having to lock the other cores out of the only free space list. When a file system is mostly full, some of the free space lists will be full, and cores will have to queue to allocate space.
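The per-core free-list idea can be caricatured in a few lines: each core allocates from its own list without taking a lock, and only contends on a shared pool when its local list runs dry. This is a toy model of the concept, not any real filesystem's allocator:

```python
import threading

class PerCoreAllocator:
    """Toy model of per-core free lists: each core draws blocks from its
    own list on the fast path and only touches the shared pool, under a
    lock, when its local list is empty."""
    def __init__(self, blocks, cores):
        per = len(blocks) // cores
        self.local = [list(blocks[i * per:(i + 1) * per]) for i in range(cores)]
        self.shared = list(blocks[cores * per:])   # leftover blocks
        self.lock = threading.Lock()

    def alloc(self, core):
        if self.local[core]:
            return self.local[core].pop()          # fast path, no lock
        with self.lock:                            # slow path: contended
            return self.shared.pop() if self.shared else None

    def free(self, core, block):
        self.local[core].append(block)             # release back locally
```

As the system fills up, more cores fall through to the locked slow path, which is exactly the queueing behaviour described above.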
If your data is static, you can use something like cramfs and use 100% of the disk efficiently. If your use case involves modifications, you have to choose between performance and utilisation. If this is costing you thousands per month then experiment to find which solution gives you the best trade off. You really cannot have a whole cake while rapidly adding and removing slices.
Nice explanation - thanks. (And have an upvote.)
One question I've always had is how this logic applies to modern storage setups. Surely in even a basic RAID 5 situation the idea of sequential writes is a bit moot - no? Extend that out to a SAN arrangement with data distributed, striped and replicated across multiple already striped arrays and surely the idea of sequential data is even more confusing. Extend that further to TIERED storage systems, where different bits of the data are striped across different sets of arrays and may be moved back and forward from NL to SAS to SSD to cache and I just can't see the concept of 'sequential' even having meaning there.
I'm not contradicting you, I just don't understand exactly; how does this work in modern arrays, striping and replicating data in myriad complex and (allegedly) self-optimising ways. Of course, if you use a tiered system then I would assume that free space would be very important to allow data/block mobility across the tiers, but not for the reasons of keeping things sequential.
I very much get your point about free-space lists and that makes a lot of sense to me. Personally I always try for a MINIMUM of 10% buffer across pretty much anything that even vaguely looks like IT equipment and mostly aim for 20%. BUT, much of that is prudence on my part due to my acceptance that I don't understand all the ins and outs of every technology and so am better safe than sorry; I have always suspected that people with more familiarity with things could spec a solution with much less 'waste'.
You can use 99% of the disk, or even 100% if you don't want to write anything else to it, but performance will be degraded. Those are the choices. The advice to use at most 80-90% is based on the idea that you will have tested the performance of the server at 0-10% capacity, when it is easy to find contiguous sectors, and so performance at the raggedy end might be so bad that your server becomes unusable for lack of IO. If IO isn't your concern, and you mainly just need to store lots and lots of bits, have at it. If IO is your concern, buy more/bigger disks.
".....you're losing an extra drive or two in each RAID set....." Ah, the memories of trying to explain to a non-technical project manager why, when we specified mirrored boot for redundancy, we needed two disks in a mirrored boot set to hold only one disk's worth of data. That was two hours I will never get back! He was completely horrified when I explained the arrays we planned to use for the main storage used blocks of four disks in RAID5, meaning - as he saw it - "one disk in four is wasted!" I chickened out of telling him about the spare disks in the frames.
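For what it's worth, the arithmetic that horrified the PM is simple enough to script. A sketch using the textbook formulas, ignoring hot spares and vendor overhead:

```python
def usable_fraction(disks, level):
    """Usable capacity as a fraction of raw, for a few common RAID levels."""
    if level == "raid1":            # mirror: half the raw space
        return 0.5
    if level == "raid5":            # one disk's worth of parity per set
        return (disks - 1) / disks
    if level == "raid6":            # two disks' worth of parity per set
        return (disks - 2) / disks
    raise ValueError(level)

# A 4-disk RAID5 set keeps 3 disks' worth of data: 75% usable.
```

So yes, "one disk in four" really is spent on parity in a 4-disk RAID5 set - the trick is explaining that it's insurance, not waste.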
I suspect the author actually doesn't have any real issues with vendor recommendations, he just thought this was a zany theme to throw out and let the forum chew over.
Someone wake up on the wrong side of the bed this morning?
As a general rule of thumb it's inadvisable to run anything at 100% all the time because you'll fuck it quicker than not. Would you red line your car constantly? Would you run an AC unit at 100% because it just about cools the room or do you install a second unit to alleviate stress on the first? Hell, do you ever sprint everywhere you go? Sure you'll get there quicker, but at the cost of being a sweaty mess?
As for capacity what happens when some muppet uploads an extra 100GB of data to that storage space and there's no room? It either fails or you get charged an extortionate "burst" rate until you bring the amount down or up your package?
I can't agree with this comment. Everything you've mentioned is a "performance" metric, not "capacity." If you are moving to a new apartment but need to put your stuff in storage first, and you know you have just shy of 10x15 of stuff, do you lease a 10x20 storage unit? No, you fit the purchase as closely as you can to the requirements to prevent overspending.
This argument comes down to storage management. If you know your capacity requirements, can accurately forecast the growth, and have the procurement/implementation maturity to get the storage in quickly when it's needed, it's perfectly doable. Obviously if your storage growth is dynamic and the environment isn't very flexible/scalable, you're better off having the padding.
Not everyone's environment lends itself to that model, nor does everyone have the installation/procurement maturity to manage it. Just my view.
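If you do go for the lean model, the key number is your runway: how long until growth eats the headroom. A back-of-envelope sketch, with made-up figures and an assumed linear growth rate:

```python
def months_until_full(used_tb, capacity_tb, growth_tb_per_month, threshold=0.9):
    """Months until usage crosses threshold * capacity, assuming linear
    growth. Returns 0 if already over the threshold."""
    headroom = threshold * capacity_tb - used_tb
    if headroom <= 0:
        return 0
    return headroom / growth_tb_per_month
```

For example, 300TB used of 400TB, growing 10TB a month against a 90% ceiling, leaves six months to get a purchase order through - which is exactly the procurement-maturity question above.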
The analogy with the storage unit fits well:
- If you're not constantly changing what's in there, you'll be fine with a unit that fits barely.
- But if you're constantly bringing in new stuff and try to exchange it against stuff that's all the way in the back, you'll wish you had paid for a bigger unit...
The arguments about keeping it at 80% might be true if you are using one disk. However, enterprise SANs will splatter your data across many spindles, and these simplistic viewpoints don't apply. Even a cheapish disk array will raid and braid your data all over the place, move it about and manage performance "hotspots" automatically.
I have never heard a vendor say keep it at 80% for performance. Storage is such a large area that one rule of thumb does not apply to all; it all depends on the usage pattern.
Obviously you don't let certain partitions fill up completely or it breaks the OS and applications. Also, some admin tools won't work on 100% full file systems, e.g. fsadm (my own blog)
This has always been the problem with antiquated Storage arrays. EMC best practices get you to 75% written capacity and that's it. Netapp--worse usable capacity of any vendor. All their overhead takes your raw to usable down to 50% or so. Hitachi? Performance starts degrading somewhere between 45-55% written capacity. If you believe SPC-1 then 3PAR seems to be the best. Their performance numbers hit at 86% written to. Assuming HP hasn't screwed it up yet, I would say they handle this problem very nicely.
But with SSDs being the wave of the future, I wonder which of the SSD startups scale performance with usable capacity consumption? I think HP announced an all-SSD 3PAR but is that really an SSD array or just a 3PAR with a bunch of SSDs in it? Do any of them have any kind of usable capacity published numbers to look into?
Yeah. Part of Netapp's problem is that the de-dupe metabase and other overhead items that the filer itself uses all live in that unallocated portion of the disk aggregate, AND it doesn't mark it as allocated for the disk tools. So you run into an issue that we had about 4 months ago where the aggregate ran out of space, De-dupes stopped dead cold, and our BI server fell over because the joker who built it thin-provisioned the LUN. Even better, the only way to reclaim space at the aggregate level is by nuking the volume and creating a new volume that's smaller. (The entire admin staff did a 'WTF?!?!' when we found that one out!)
Anon to protect my paycheck.
"Yeah. Part of Netapp's problem is that the de-dupe metabase and other overhead items that the filer itself uses all live in that unallocated portion of the disk aggregate, AND it doesn't mark it as allocated for the disk tools. So you run into an issue that we had about 4 months ago where the aggregate ran out of space, De-dupes stopped dead cold, and our BI server fell over because the joker who built it thin-provisioned the LUN. Even better, the only way to reclaim space at the aggregate level is by nuking the volume and creating a new volume that's smaller. (The entire admin staff did a 'WTF?!?!' when we found that one out!)
Anon to protect my paycheck."
No offense but this reads like "we implemented a NetApp badly, having bought too little capacity, then didn't monitor it or manage it in any way and now we're shocked because things went wrong"
This is not NetApp's problem...
"No offense but this reads like "we implemented a NetApp badly, having bought too little capacity, then didn't monitor it or manage it in any way and now we're shocked because things went wrong"
The implementation was done correctly (I've contacted NetApp support a couple of times and they've told me that the way we did it is fine, despite my protests otherwise), but did we buy too little capacity? Yes. Lack of management? Probably; it's five people managing server and network infrastructure for a 3000+ employee, multi-site company, and we consistently have a pretty decent amount of projects and stuff on our plate. If anything, I'm amused at how quickly Parkinson's Law kicked in.
The only thing I wish our Netapp did was perform space reclamation better and give us a better idea of how much free space we actually have within the GUI without resorting to dropping into an SSH session, but that's just life.
We've picked up a couple of Nimble boxes, largely to see how well they do for our remote sites. So far, my only quibble is that I'd love to see a reporting tool that isn't completely and utterly dependent upon a third-party site, or failing that, a plugin to SolarWinds, SCOM, or some other Ops Management application.
And still anon- that pesky paycheck, you know.
"The only thing I wish our Netapp did was perform space reclamation better and give us a better idea of how much free space we actually have within the GUI without resorting to drop into an SSH session, but that's just life"
The NetApp will never tell you free space because of all the space reduction features making this figure total nonsense. The online toolset will, however, tell you when you're likely to run out of capacity. Access this through the support portal or ask your NetApp partner (the people who sold it to you - they can access this too) to get you the info.
This is a Netapp problem; it's all the hidden and confusing best practice around space reservations that constantly catches people out. Unsurprisingly it gets glossed over during the sales cycle, but not fully understanding how space gets consumed (it's confusing) can at best kill performance and at worst take your applications offline.
Yes, deduplication will stop working when the aggregate is full. In fact, if there's no space to write to, it doesn't matter whose disk array you're using or what you're doing, things will go wrong.
You can't reserve space when you don't know exactly how much space the metadata will need, and that's dependent on the amount of duplicate data. NetApp recommends keeping 5-6% free space when dedupe is on to accommodate it. However, if they reserved 5-6% and the metadata didn't occupy that space, the complaint would have been that the reservation wastes space. So it's a catch-22.
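To put numbers on that rule of thumb, the reserve is a trivial calculation (function names are mine; the 5-6% figure is the recommendation quoted above):

```python
def dedupe_reserve_tb(aggregate_tb, reserve_pct=6.0):
    """Free space to leave for dedupe metadata, per the ~5-6% rule of thumb."""
    return aggregate_tb * reserve_pct / 100.0

def safe_to_enable_dedupe(free_tb, aggregate_tb, reserve_pct=6.0):
    """True if current free space covers the metadata reserve."""
    return free_tb >= dedupe_reserve_tb(aggregate_tb, reserve_pct)
```

On a 100TB aggregate that's 6TB you should treat as spoken for before dedupe is turned on.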
There's nothing wrong with implementing Thin Provisioning, but monitoring the free space and setting aggregate and volume alerts is fundamental to TP technology for all vendors.
Reclaiming space on the aggregate is not a NetApp issue; everybody else has the same one. Once a block has been written, the array considers that block used regardless of what the host filesystem thinks, unless you're dealing with an OS or a hypervisor that provides block reclamation capabilities.
That said, NetApp does provide a block reclamation mechanism for Windows physical and virtual machines, and that could have been used to return free FS blocks to the volume and aggregate.
So, while the implementation may have been fine, the process of monitoring/alerting on TP doesn't appear to have been in place, and that's really what caused this.
"Netapp--worse usable capacity of any vendor."
No, NetApp have a poor usable percentage of capacity. Because they give better performance per spindle though, you can use 4TB drives for everything so they actually have the best usable capacity overall for the money.
The all SSD 3Par is identical apart from a larger cache on the head. There are also plans to up the bandwidth from 8Gbps due to flash being able to use it.
No offense but I've seen Netapp in action. Better performance per spindle it does not deliver. It's good for what Netapp is--a filer. But 3PAR, HDS, IBM (DS) all smoke Netapp by leaps and bounds. If we throw cache hits in the mix, you get better performance from XIV and VMAX. Don't get me wrong Netapp is a fantastic NAS device but it has no real place in this discussion.
That's pretty interesting on the SSD stuff. I'd still like to see it be truly optimized for flash. Throw in some dedupe and it would be pretty interesting to see what HP has put together.
"The all SSD 3Par is identical apart from a larger cache on the head"
Why would you say this? Are you deliberately saying wrong things or do you just not follow these things? The operating system of 3PAR has been rewritten in many places just because of flash. Please do your research first before claiming these things.
HP told me this first hand during a partner training event focussed on 3Par. You want to blame someone blame HP. The code is identical on both SANs. There may well be optimisations for the SSDs but those same optimisations are in the standard SAN as well as the all flash one. The only difference between the two is the hardware and the difference there is cache. Are you deliberately trying to argue or do you genuinely think that HP have forked the entire codebase just for one model?
The explanation you received on the differences was a bit simplistic, since not all optimizations make it into the public domain. Yes, SSD optimization code was back-ported to other 3PAR family members, but that doesn't mean there aren't different streams to handle different hardware features. A single code base and common features doesn't mean you can't optimize for specific use cases; either way, pound for pound a 3PAR will smoke a Netapp on capacity or performance.
Wrong. The 3PAR 7450 has faster CPUs, more cores, a larger cache and a bunch of optimizations, both published and unpublished, specifically for SSD use. All storage is reservation-less, so no need to mess about with all that aggregate planning and stranded space.
It's also a quad-controller system with true symmetric active/active access and will easily outpace a Netapp box on capacity utilization, performance per spindle and overall system performance; the hybrid 7400, with most of the SSD-specific optimizations now back-ported, will also do the same.
SSD is typically not about bandwidth, it's about IOPS, and the 3PAR 7450's 24 x 8Gb ports should be more than adequate for pretty much any use case the 7450 is aimed at.
I think you make an interesting point here. Storage arrays degrade in performance as they fill up. Even high-end enterprise arrays start to see performance trail off with as little as 40% of the array filled. This forces administrators to run the arrays with a large amount of free space, which is stranded capacity. As SSDs don't see this drop-off in performance, you will be able to run the array closer to the red line.
Mix in some inline de-dupe or compression and boom!
If anyone's interested, google XtremIO.
It always gets my back up that they will make a backplane capable of holding, say, 12 disks, then tell you the optimum is to leave 2 or 3 slots free as this makes it faster and more efficient, etc.
Can anyone tell me why the hell, if they are making a backplane that can hold 12 disks, they don't optimise it for that? Even if this means designing for 15 disks and then never utilising the last 3 slots because they will never physically exist.
That was always my problem with the NetApp stuff, especially with the Fibre Channel shelves that held 14 disks. Optimum RAID group sizing was 16 drives according to the ONTAP cookbook, so I needed two shelves with two disks in the second shelf. Explaining that to a director was a pain in the ass; we'd end up buying two full shelves and leaving 12 disks unused until we bought yet another full shelf. It's really not just a NetApp problem, but the RAID group sizes being so large was always a bit of a pain come expansion time.
Moving to virtualized RAID storage systems made a big difference for us. Our usable capacity went up a fair amount, as well the performance per spindle is consistently higher. And we can add them as needed in any amount required either to deliver more IOPS, more capacity, or both.
There are two different values for RAID group size on a NetApp system. You are referring to the default value, which is 16 (it actually depends on several factors, but it is 16 in your case). That's just a default and very different from the optimum value.
In order to get the optimum value, a number of things need to be considered, such as disk type and size, ONTAP version, number of disks in the system, etc. In general, the goal for the optimum value is to have evenly-sized RAID groups within an aggregate and use all the disks available (except for necessary spares). If you didn't follow the process (as described in various documents such as the storage subsystem FAQ) to find the optimum value for your configuration, you'll end up using the default value, and that typically results in a non-optimal configuration. Not NetApp's fault though.
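The "evenly-sized RAID groups" goal can be sketched as a small search: pick the largest group size in an allowed range that divides the data disks into equal groups. A toy version of the idea (the bounds are illustrative, and this is not the actual ONTAP sizing procedure):

```python
def even_raid_group_size(data_disks, max_group=16, min_group=12):
    """Pick the largest group size in [min_group, max_group] that divides
    the disk count into evenly-sized groups; None if no size fits."""
    best = None
    for size in range(min_group, max_group + 1):
        if data_disks % size == 0:
            best = size
    return best
```

For 28 data disks this picks groups of 14 (two even groups) rather than the default of 16, which would leave one lopsided group of 12.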
In a restaurant, you don't get to see the stuff left over at the end of the day. There may be a perfectly good steak left in the fridge that has to be thrown out for elf and safety. Sure, you *could* have eaten it during lunchtime if perfect knowledge of future customers' orders had been available at that point. But there wasn't and so you paid for a piece of steak you never even saw on your plate.
But it's more invisible than the "held back" storage. So there is less complaining. (Same goes for the invisible costs of taxation, bureaucracy, war, nepotism, money printing etc. - people don't complain much about those... but that's another can of worms)
So I love the misconception here that the 20% is being wasted just sitting unused, when its purpose isn't to store data during 90% of its lifetime; its job is for the 10% of a decade-long investment where large-scale failures do occur. I've spent my life in support at different vendors watching people fill this very necessary space up, only to come crying to the vendor when a node or series of drives fails. It makes neither side's job easy when you have to drop-ship hardware and people into a location to save the day over something that could have been handled smoothly if a sysadmin had a bit of discipline and forecasted the storage requirements of the organization properly, or sent lesser-used data to a slower tier of storage.