Speedy storage server sales stumps sysadmin scribe: Who buys this?

Every once in a while I need to ask a question I know is going to get me in a world of trouble. It's the sort of question that triggers panicked emails from corporate PRs, and sometimes even the odd thinly veiled threat for daring to ask such things in a public forum. I'm pretty lucky in that Chris Mellor, Storagebod, and …

COMMENTS

This topic is closed for new posts.
  1. anatak

    didn't you answer your own question ?

    CSAs are more complex. Non IT management does not understand complex IT things. Maintaining a CSA is more difficult. The person who installed the CSA has higher job security. (Slightly cynical view, of course.)

    1. Trevor_Pott Gold badge

      Re: didn't you answer your own question ?

      Not all CSAs are complex. Proximal is "fire and forget". vFlash will be there in August, I'm sure. There are others that are as simple, or close to it. It's only when you start layering on the features that the CSAs get into "job security" territory...and I start wondering "why not just go whole hog, use a server SAN and be done with it?"

    2. Anonymous Coward
      Anonymous Coward

      Re: didn't you answer your own question ?

      Quote: "Non IT management does not understand complex IT things."

      Hey...wake up...."Non IT management" are paying the bills. They don't need to "understand complex IT things". They pay IT for BUSINESS OUTCOMES. They specify what they want, and IT provides the correct tested outcome. End of story. What part of this don't you understand?

    3. Anonymous Coward
      Anonymous Coward

      Re: didn't you answer your own question ?

      "claimed VSAN was better than competing server SANs because it was built right into the hypervisor. It's in the kernel! It's faster!"

      Faster than which SANs? Benchmarks?

      Certainly Windows Server / Storage Server - which gives the fastest NFS benchmarks of any standard OS distribution as far as I know - already runs its drivers in kernel space.

  2. TaabuTheCat

    VMware VFC - beware this bug!

    This one bit me twice - once about three months ago, before anyone at VMware could tell me what was happening. After five hours on the phone with support we just ended up rebuilding vCenter to get running again. And it happened again last week after upgrading my vCenter virtual appliance to U1. This time support was able to find this KB article after several hours on the phone - it turned out to have nothing to do with the upgrade; it was the reboot afterwards that triggered it.

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2072392

  3. Phil Dalbeck

    Perhaps...

    Those expensive CSAs are presumably targeted at enterprises that (for some compliance or political reason) aren't ready to adopt a distributed VSAN architecture? I'm going to go with the good old "financial institutions" piñata here, who are still in many cases using code and processes older than most of their customers.

    I'm guessing that somewhere, carved on a stone tablet in the (presumably very heavy) dungeon master's acceptable risk handbook, is a line that says "thou shalt buy only physical SANs with multiple controllers, and no fewer than six power supplies, and it shall sayeth EMC or NetApp on the front, for all else is witchcraft and heresy" or something.

    I'm also guessing that CSAs are a nice workaround for virtualisation admins - they might be stuck with the stipulated backend storage (at least until the peasants rise up and stick a pitchfork through the dusty old storage manager who has complete control over that side of things) - but they can damn well fling a CSA in front of it, as that's in their domain god dammit, and one day they'll throw off the shackles and deploy a VSAN and cut the dark lords of storage out of the picture entirely. One day...

  4. Gordan

    Assertions

    "I personally see two conflicting assertions here: that VSAN is so much better than fully virtualised server SANs because it runs in the hypervisor, and that the VSAN-less hypervisor is so awesome it can easily handle any workload – like, er, fully virtualised server SANs."

    The hypervisor isn't so awesome that virtualized workloads are overhead-free. At the point of full hardware saturation, the performance hit from running virtualized can be up to 40% on some workload/hardware combinations.

    1. Trevor_Pott Gold badge

      Re: Assertions

      Do you happen to have information on specific workloads I can test in my lab that would prove your claim? I'd love to test that and write it up. Please get in contact if you have details!

      1. Gordan

        Re: Assertions

        As a matter of fact - I do.

        I'll PM you later today with more details on other good test cases, but as a first pass you might want to take a look here:

        http://www.altechnative.net/2012/08/04/virtual-performance-part-1-vmware/

        1. Trevor_Pott Gold badge

          Re: Assertions

          Hey; took a brief look at this, and noticed a few problems straight away. First up:

          "(VMware Player 4.0.4, Xen 4.1.2 (PV and HVM), KVM (RHEL6), VirtualBox 4.1.18)"

          Only Xen and KVM are hypervisors, and they are the two that are the easiest to tune improperly. You don't have ESXi or Hyper-V here, and they are the real test of what virtualisation can do. VMware Player and VirtualBox are not hypervisors...or at least not Type 1 hypervisors. There is going to be a huge penalty for running those. Everyone in the industry knows that. That's why they aren't advocated for production anything.

          I am really shocked you got such bad numbers for Xen and KVM, which leads me to wonder how they were configured...but being Xen and KVM, if you look at them funny they'll run like crap.

          Very interested to see numbers with ESXi and/or Hyper-V!

          1. Gordan

            Re: Assertions

            I guess you didn't look hard enough. If you Ctrl-F and search for "esx" it should find you the relevant part of the page. Including the first line in the article. ESXi scored second least bad, after PV Xen.

            Hyper-V I don't use, so cannot comment on it.

            Xen isn't that configuration error-prone - there isn't that much to configure on it. The only things that make any appreciable difference are pinning cores and making sure you use PV I/O drivers. In the case at hand, the I/O was negligible since everything was done in RAM with primed caches, so PV I/O made no measurable difference. Either way, you wanted a reproducible test case - there is a reasonably well documented one.

            A few notes on the test case in question:

            1) Testing is done by fully saturating the machine. That means running a CPU/memory intensive load with at least twice as many threads as there are threads on the hardware. For example, if the machine has 4 cores, that means setting up a single VM with 4 cores, and running the test with at least 8 CPU-hungry threads (a rough sketch of such a saturation run follows these notes).

            2) Not leaving any cores "spare". If you have, say, a 6-core system, and you leave a core dedicated to the hypervisor (i.e. you give the only VM on the system 5 cores), you are implicitly reducing your capacity by 17%. Therefore, in that configuration, the overhead is 17% before you ever run any tests.

            3) Pinning cores helps, especially in cases like the Core2 which has 2x2 cores, which means every time the process migrates, the CPU caches are no longer primed. This is less of an issue on a proper multi-core CPU, but the problem comes back with a vengeance on NUMA systems, e.g. multi-socket systems with QPI where if your process migrates to a different core, not only do you not have primed CPU caches, but all your memory is 2-3x further away in terms of latency, and things _really_ start to slow down.

            However, many VM admins object to core pinning because it can interfere with VM migration. It's a tradeoff between a bigger performance hit and easier management.

            You may find overheads are slightly lower on more recent hardware (e.g. if you are using a single socket non-QPI system), since the above tests were done on a C2Q which suffers from the extra core migration penalty if the target core isn't on the same die as the source core.

            On something like a highly parallel MySQL test the results tend to be worse than the compile test implies, but you'll have to do your own testing with your own data (replaying the general query log is a good thing to test with) as I don't have any publicly shareable data sets, and I haven't tested how well the synthetic DB benchmarks reflect the real-world hypervisor overheads.
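
            Here is a rough sketch of the kind of saturation run described in 1) above, assuming Python as the harness: it launches twice as many CPU-hungry workers as the machine has hardware threads and times a fixed amount of busy work, so the same script can be run on bare metal and inside a full-size VM and the wall-clock times compared. The work sizes are made-up illustrative values, not the compile benchmark from the linked article.

            # saturate.py - illustrative CPU saturation timing sketch (not the
            # benchmark from the article above). Run the same script on bare metal
            # and inside a VM with the same core count, then compare elapsed times.
            import os
            import time
            from concurrent.futures import ProcessPoolExecutor

            WORK_ITEMS = 64          # total chunks of work (assumed value)
            ITERATIONS = 2_000_000   # busy-loop length per chunk (assumed value)

            def burn(_):
                # CPU-bound busy work; the result is discarded.
                acc = 0
                for i in range(ITERATIONS):
                    acc += i * i
                return acc

            if __name__ == "__main__":
                # Processes rather than threads, so the Python GIL does not
                # serialise the work; 2x hardware threads, per point 1 above.
                workers = (os.cpu_count() or 1) * 2
                start = time.perf_counter()
                with ProcessPoolExecutor(max_workers=workers) as pool:
                    list(pool.map(burn, range(WORK_ITEMS)))
                print(f"workers={workers} elapsed={time.perf_counter() - start:.2f}s")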

            1. Trevor_Pott Gold badge

              Re: Assertions

              @Gordan; yup, I had missed it. In my defense, I hadn't slept in 4 days due to datacenter migration.

              Let's address a few issues in the testing methodology you state:

              "1) Testing is done by fully saturating the machine."

              Testing should always be done by pushing the machine to the red line, otherwise we learn nothing.

              "2) Not leaving any cores "spare"."

              Leaving cores "spare" doesn't present a real test. However, the host instances should have reserved RAM and CPU on any production virtualisation deployment. It's a fairly common mistake not to enable this, and typically results in Xen/KVM showing badly compared to properly deployed instances. The host instances need wiggle room to do their jobs, especially with "noisy" VMs.

              "3) Pinning cores helps, especially in cases like the Core2 which has 2x2 cores, which means every time the process migrates, the CPU caches are no longer primed."

              I pin cores all the time and never run into the issues you describe here. I have flattened multiple generations of systems and still don't see the disparity you do. What I wonder is if it is related to the Core 2.

              Back in the Core 2 days I used AMD stuff, and they were well ahead of Intel in terms of hardware virtualisation support. Today's processors have any number of improvements over that old design and the introduction of proper hardware support in these generations of processors may explain the discrepancy.

              The only time I have ever seen results like you describe is when I am able to saturate the RAM bandwidth. This is entirely possible with DDR 2 systems, especially when you are allowing memory deduplication on the systems, something that - at least in ESXi - is enabled by default.

              I'd also have to look at your I/O subsystems as being suspect. It smells a lot like I/O thrashing. I will see if I can scrape together any equipment from that era and place it against both the AMD Shanghai systems I have as well as my modern Intel Xeons. I am very curious to see what will happen when I pin them.

              1. Gordan

                Re: Assertions

                As I said, the I/O saturation was a non-issue because the write caching was enabled, the data set is smaller than the RAM used for testing, and the data set was primed into the page cache by pre-reading all the files (documented in the article). The iowait time was persistently at 0% all the time.

                I am glad you agree that the machine needs to be pushed to the redline for testing to be meaningful. Many admins aren't sufficiently enlightened to recognize that.

                On the subject of leaving resources dedicated to the host/hypervisor, that is all well and good, but if you are going to leave a core dedicated to the hypervisor, then that needs to be included in the overhead calculations, i.e. if you are running on a 6-core CPU, and leaving one core dedicated to the hypervisor, you need to add 17% to your overhead calculation.

                In terms of migrations and near vs. far memory in NUMA, if you have, say, a 2x6 core dual socket system, and you dedicate one core to the hypervisor and the other 11 cores to the test machine, you are still facing the same problem of hiding the underlying topology so the guest OS kernel is disadvantaged by not being able to make any decisions on what is near and what is far - it all appears flat to it when in reality it isn't. While pinning cores will still help, the situation will nevertheless put the virtualized guest at a disadvantage.

                Heavily parallel loads suffer particularly badly when virtualized because of the extra context switching involved, and the context switching penalty is still a big performance problem on all hypervisors.

                1. Trevor_Pott Gold badge

                  Re: Assertions

                  "On the subject of leaving resources dedicated to the host/hypervisor, that is all well and good, but if you are going to leave a core dedicated to the hypervisor, then that needs to be included in the overhead calculations, i.e. if you are running on a 6-core CPU, and leaving one core dedicated to the hypervisor, you need to add 17% to your overhead calculation."

                  I never said "leave a core dedicated to the hypervisor". I said reserve it some space. Typically 500MHz or so.

                  As for this:

                  "the I/O saturation was a non-issue because the write caching was enabled, the data set is smaller than the RAM used for testing, and the data set was primed into the page cache by pre-reading all the files (documented in the article). The iowait time was persistently at 0% all the time."

                  I would have to conduct my own testing. My lab results consistently show an ability to saturate RAM bandwidth on DDR2 systems. Your results smell like an issue with RAM bandwidth, especially considering that's where you're pulling your I/O. I will look to retry by placing the I/O on a Micron p420M PCI-E SSD instead.

                  I also disagree with your assessment regarding near/far cores on NUMA setups. Just because the hypervisor can obfuscate this for guest OSes doesn't mean you should let it do so for all use cases. If and when you have one of those corner case workloads where it is going to hammer the CPUs in a highly parallel fashion with lots of shared memory between them, then you need to start thinking about how you are assigning cores to your VMs.

                  Hypervisors can dedicate cores. They can also assign affinity in non-dedicated circumstances. So when I test something that I know is going to be hitting the metal enough to suffer from the latency of going across to fetch memory from another NUMA node I start restricting where that workload can play. Just like I would in production.
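
                  As a concrete illustration of that kind of restriction, here is a minimal sketch that pins each vCPU of a Xen guest to physical CPUs on one socket using the standard xl vcpu-pin command. The domain name "dbtest" and the CPU numbers are made-up examples, not anything from this thread; on ESXi the equivalent would be the VM's scheduling affinity settings.

                  # pin_vcpus.py - sketch: pin a Xen guest's vCPUs to one socket's cores.
                  # Assumes the xl toolstack is installed and "dbtest" is an existing domU;
                  # the domain name and physical CPU numbers are illustrative only.
                  import subprocess

                  DOMAIN = "dbtest"                      # hypothetical guest name
                  PHYSICAL_CPUS = [6, 7, 8, 9, 10, 11]   # cores on the second socket (assumed layout)

                  for vcpu, pcpu in enumerate(PHYSICAL_CPUS):
                      # xl vcpu-pin <domain> <vcpu> <physical cpu>
                      subprocess.run(["xl", "vcpu-pin", DOMAIN, str(vcpu), str(pcpu)], check=True)

                  # Show the resulting placement.
                  subprocess.run(["xl", "vcpu-list", DOMAIN], check=True)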

                  Frankly, I'd also start asking pointed questions about why such workloads are running on a CPU at all, and can't I just feed the thing a GPU and be done with it?

                  I flatten my systems all the time, not just in testing, but in production. I run full-bore render engines in a virtualised environment and I just don't see the issues you describe. That makes me very curious where the tipping point between my workloads and your simulation is. What needs to change in order to experience this dramatic drop in capability? Do I need to be on the lookout for it in my future workloads, or is it an artifact of using an ancient CPU or a peculiar testing configuration?

                  I don't have answers to these, but I've added it to the list of things to find out.

                  1. Gordan

                    Re: Assertions

                    "I would have to conduct my own testing. My lab results consistently show an ability to saturate RAM bandwidth on DDR2 systems. Your results smell like an issue with RAM bandwidth, especially considering that's where you're pulling your I/O. I will look to retry by placing the I/o on a Micron p420M PCI-E SSD instead."

                    This implies you are asserting that running virtualized causes a substantial overhead on memory I/O, otherwise saturating memory I/O shouldn't matter. I'm open to the idea that the biggest overhead manifests on loads sensitive to memory bandwidth, although measuring memory bottleneck independently of the CPU bottlenecking isn't trivial.

                    "I also disagree with your assessment regarding near/far cores on NUMA setups. Just because the hypervisor can obfuscate this for guest OSes doesn't mean you should let it do so for all use cases. If and when you have one of those corner case workloads where it is going to hammer the CPUs ina highly parallel fashion with lots of shared memory between then you need to start thinking about how you are assigning cores to your VMs."

                    In some cases you may not have much of a choice, if you need more cores for a VM than a single physical socket has on it. For other cases, maybe you could get a little more mileage out of things by manually specifying the CPU socket/core/thread geometry - if your hypervisor supports that. I'd be interested to see your measurements on how much difference this makes on top of pinning the cores.

                    "Hypervisors can dedicate cores. They can also assign affinity in non-dedicated circumstances. So when I test something that I know is going to be hitting the metal enough to suffer from the latency of going across to fetch memory from another NUMA node I start restricting where that workload can play. Just like I would in production."

                    Sure you can, but testing with a simple baseline use case where you have one host and one big VM seems like a good place to start assessing the least-bad-case scenario for the overhead. As you add more VMs and more arbitration of what runs where, the overhead is only going to go up rather than down.

                    I'm not even saying that the overhead matters in most cases - my workstation at home is a dual 6-core Xeon with two GTX780Ti GPUs (and an additional low-spec one), split up using Xen into three workstations, of which two are gaming capable. The two gaming-spec virtual machines each have a dedicated GPU and 3 pinned cores / 6 threads, both on the same physical socket (but with no overlap on the CPUs). The performance is good enough for any game I have thrown at it, even though I am running at 3840x2400 (T221). So clearly even for gaming-type loads this kind of a setup is perfectly adequate, even though it is certainly not overhead-free. It is "good enough".

                    But in a heavily loaded production database server you don't necessarily have the luxury of being able to sacrifice any performance for the sake of convenience.

                    "Frankly, I'd also start asking pointed questions about why such workloads are running on a CPU at all, and can't I just feed the thing a GPU and be done with it?"

                    That's all well and good if you are running custom code you can write yourself. Meanwhile, the real world is depressingly bogged down in legacy and off-the-shelf applications, very few of which come with GPU offload, and most of which wouldn't benefit due to the size of data they deal with (PCIe bandwidth is typically lower than RAM bandwidth, so once your data doesn't fit into VRAM you are often better off staying on the CPU).

                    "That makes me very curious where the tipping point between my workloads and your simulation is."

                    Databases are a fairly typical worst-case scenario when it comes to virtualization. If you have a large production database server, you should be able to cobble together a good test case. Usually 100GB or so of database and 20-30GB of captured general log works quite well, if your queries are reasonably optimized. Extract SELECT queries from your general log (percona toolkit comes with tools to do this, but I find they are very broken in most versions, so I just wrote my own general log extractor and session generator that just throws SELECTs into separate files on a round-robin basis). You will need to generate at least twice as many files as you have threads in your test configuration (e.g. 24 files for a single 6-core/12-thread Xeon). You then replay those all in parallel, and wait for them to complete. Run the test twice, and record the time of the second run (so the buffer pools are primed by the first run). Then repeat the same with a VM with the same amount of RAM and same number of CPU cores/threads (restrict the RAM amount on bare metal with the mem= kernel parameter, assuming you are testing on Linux). This should give you a reasonably good basis for comparison. Depending on the state of tune of your database, how well indexed your queries are, and how much it all ends up grinding onto disks, I usually see a difference of somewhere in the 35-44% ballpark. Less optimized, poorly indexed DBs show a lower performance hit because they end up being more disk I/O bottlenecked.
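
                    A rough sketch of that replay harness (not my exact tooling - the file names, credentials and session count below are placeholders) might look like the following, assuming the general log has already been reduced to one SELECT statement per line and the stock mysql client can reach the test database:

                    # replay_selects.py - sketch of a round-robin split plus parallel replay.
                    # Assumes queries.sql holds one complete SELECT per line (pre-extracted
                    # from the general log) and that the mysql CLI can reach the test DB.
                    # Session count, file names and credentials are placeholders.
                    import subprocess
                    import time
                    from concurrent.futures import ThreadPoolExecutor
                    from pathlib import Path

                    SESSIONS = 24                  # at least 2x the hardware threads under test
                    SOURCE = Path("queries.sql")   # pre-extracted SELECT statements
                    MYSQL = ["mysql", "-u", "bench", "-pbenchpass", "testdb"]  # placeholder credentials

                    def split_round_robin():
                        outs = [open(f"session_{i:02d}.sql", "w") for i in range(SESSIONS)]
                        with SOURCE.open() as src:
                            for n, line in enumerate(src):
                                outs[n % SESSIONS].write(line)
                        for f in outs:
                            f.close()
                        return [f"session_{i:02d}.sql" for i in range(SESSIONS)]

                    def replay(path):
                        # Each session pipes its file of SELECTs through the mysql client.
                        with open(path) as f:
                            subprocess.run(MYSQL, stdin=f, stdout=subprocess.DEVNULL, check=True)

                    if __name__ == "__main__":
                        files = split_round_robin()
                        for run in (1, 2):   # time the second run, once the buffer pool is primed
                            start = time.perf_counter()
                            with ThreadPoolExecutor(max_workers=SESSIONS) as pool:
                                list(pool.map(replay, files))
                            print(f"run {run}: {time.perf_counter() - start:.1f}s")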

                    1. Trevor_Pott Gold badge

                      Re: Assertions

                      Again, you assert "worst case" that is, to be blunt...dated. I run huge databases virtualised all the time. Ones that pin the system with no ill effects and no noticeable difference to metal. I also strongly disagree with your assertion that you cannot give up an erg of performance in the name of convenience; that may be your personal choice, it certainly isn't mine.

                      As for "running virtualised causing a substantial overhead on memory I/O" I have maintained this particular item for some time. Specifically that "features" within most hypervisors to optimize RAM usage create a dramatic overhead on the system and they need to be weeded out. There is also the issue that many virtualised systems = many OSes caching to RAM. This changes the game versus each system having its own dedicated setup, more than CPU sharing, I believe.

                      Databases used to be a big problem on hypervisors. 3-4 years ago. We've come a long way since then, and it's only the true edge cases that still show issues. That said, isolating an edge case enough to reproduce it on modern equipment and hypervisors is always a fun exercise.

                      So if I seem skeptical, that's why. You write like someone who did a bunch of testing in the ESXi 4 era, went "pfaugh, virtualisation" and then put up a "get off my lawn" sign until the end of time. 2-4 years ago, I wouldn't have put a gigantic 100GB DB2 instance in a hypervisor. Today? Not a problem. Oracle still gives me shit...but that's Oracle. MSSQL doesn't bat an eye about being virtualised and Pervasive runs like a dog no matter what you stuff it into.

                      MySQL can be tuned to run in anything. I have virtualised instances that work fine, others that don't. I haven't, however, seen a difference to metal worth writing about in years.

                      Now, maybe my databases are "poorly optimized." They certainly are I/O bound in the extreme. That said, I test with real-world workloads, not theoretical constructs. As I said above, I'd love to assemble a lab with a real-world workload that can reproduce what you're saying. It sounds fun to explore.

                      That said, if I seem skeptical, please bear in mind that your discussion does mirror dozens of conversations with some rather closed-minded anti-virtualisation folks who can't let go of stuff from the beforetime and look at what is on the table now.

                      Thus testing. IT should be about the numbers, not about religion. Not for you, me, or anyone. Ultimately, that was the point of the article I wrote: there's too much religion in IT. From marketing and sales to even the phoney baloney whitepapers many companies knock together.

                      Let's get down to the testing. Reproducible results from which we can then determine applicability, market impact, use cases and so forth. That's the information needed to properly advise clients. :)

                      1. Gordan

                        Re: Assertions

                        "Again, you assert "worst case" that is, to be blunt...dated. I run huge databases virtualised all the time. Ones that pin the system with no ill effects and no noticeable difference to metal."

                        Whatever happened to your previous statement that it is only by pushing the system to the redline that we learn things? Which of the two assertions isn't true? :)

                        "I also strongly disagree with your assertion that you cannot give up an erg of performance in the name of convenience; that may be your personal choice, it certainly isn't mine."

                        I said that "you don't necessarily have the luxury of being able to sacrifice any performance for the sake of convenience." I didn't say it is always the case, I said it isn't necessarily the case. To give you some real-life examples, if you are already paying £100K+/month on rented bare-metal servers from someone like Rackspace (I have several clients I do database consultancy for that pay at least that much for their server renting from similar providers), losing 40% of performance would also crank up your costs by a similar amount. That's not an insignificant hit to the bottom line.

                        "Specifically that "features" within most hypervisors to optimize RAM usage create a dramatic overhead on the system and they need to be weeded out."

                        If you speak of ballooning and deduplication, I always test with them disabled. I mostly use Xen, and with that what happens is that the domU memory gets allocated when the domU is created, and if ballooning is not enabled there will be no memory shuffling taking place. The domU memory is allocated and never freed or in any way maintained by the dom0 - it's all up to the domU kernel to handle.

                        "There is also the issue that many virtualised systems = many OSes caching to RAM."

                        Again, I am not sure what difference that would make, since the caching is done within the guest, and the disks are typically not shared.

                        I'm not saying that memory I/O isn't the problem - I'm saying that I have not yet heard an explanation for it that makes sense. IMO the biggest difference comes from the increase in the cost of context switching. This has been documented by several people who looked into it. I'm sure you can google it, but here are a few links for a start:

                        http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html (finds context switching is 2-3x more expensive on ESX)
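
                        If you want a feel for the raw context-switch cost yourself, a crude pipe ping-pong between two processes is the usual trick: each round trip forces two switches, so half the round-trip time is a rough per-switch figure. This is only a sketch along those lines, not the benchmark from the link above, and pinning both processes to one core (e.g. with taskset) gives cleaner numbers.

                        # ctx_switch.py - crude context-switch cost estimate via a pipe ping-pong.
                        # Two processes bounce one byte back and forth; each round trip involves
                        # two switches. Run it on bare metal and inside a guest to compare.
                        # The iteration count is an arbitrary illustrative value.
                        import os
                        import time

                        ROUND_TRIPS = 100_000

                        p2c_r, p2c_w = os.pipe()   # parent -> child
                        c2p_r, c2p_w = os.pipe()   # child -> parent

                        if os.fork() == 0:
                            # Child: echo every byte straight back.
                            for _ in range(ROUND_TRIPS):
                                os.read(p2c_r, 1)
                                os.write(c2p_w, b"x")
                            os._exit(0)

                        start = time.perf_counter()
                        for _ in range(ROUND_TRIPS):
                            os.write(p2c_w, b"x")
                            os.read(c2p_r, 1)
                        elapsed = time.perf_counter() - start
                        os.wait()
                        print(f"~{elapsed / ROUND_TRIPS / 2 * 1e6:.2f} us per switch (very rough)")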

                        The databases most of my customers present me with almost always start off badly disk I/O bound, and running so poorly that the entire system is falling apart. By the time I'm done with them - making sure they are appropriately indexed for the queries run against them, rewriting some queries to work around the more egregious unoptimizable cases (usually using materialized views), and a handful of other tweaks - they are typically running in the region of 10-20x faster, and are purely CPU limited (possibly memory I/O limited, but this is quite difficult to differentiate).

                        As for my data being out of date - I'm happy to admit that I have not re-tested the virtualization performance with the MySQL load since November 2012 (ESXi 5.0 or 5.1, I am not 100% sure which).

                        But the most important point I would like to make is this: "Don't listen to my numbers - produce your own based on your workload." By all means, use my methodology if you deem it appropriate, and/or point out the flaw in my methodology. But don't start with the assumption that the marketing brochure speaks the unquestionable truth. Start with a null hypothesis and go from there. Consensus != truth, and from what you said I think we both very much agree on this.

                        1. Trevor_Pott Gold badge

                          Re: Assertions

                          "Again, you assert "worst case" that is, to be blunt...dated. I run huge databases virtualised all the time. Ones that pin the system with no ill effects and no noticeable difference to metal."

                          "Whatever happened to your previous statement that it is only by pushing the system to the redline that we learn things? Which of the two assertions isn't true? :)"

                          I don't see a contradiction. I pin my systems. In testing and in real-world workloads. I believe it is absolutely required. What I don't believe in - and let's be perfectly clear here - is finding edge case scenarios that don't work and brandishing them as a reason to avoid a technology in all instances. If there are edge case scenarios or configurations that don't work, let's find them and then either fix the issue or not use the technology for that application.

                          That said, I can - and do - push my systems to the redline with my workloads and I do not see the results you see. That says to me that your results are dependent on your config and your workload. Thus I cannot use your scenario as a generic "virtualisation imposes a huge penalty" catch-all, nor can I extrapolate from the fact that you encountered a high-overhead scenario to state categorically that this is why something like "virtualised server SANs" are a bad idea.

                          "losing 40% of performance would also crank up your costs by a similar amount. That's not an insignificant hit to the bottom line."

                          I agree, losing 40% of performance is a big hit. That said, my experience from both synthetic lab testing and real-world results does not show a 40% hit, or anywhere near it. Closer to 6% for redline workloads, with 4% being the average.

                          4-6% falls into the "perfectly acceptable tradeoff for convenience" category for me. Again: I cannot accept your "40%" assertion without testing, and thus I will use my own numbers for the time being (and the workloads that I am aware of) and say "virtualisation is a great technology with a more than acceptable tradeoff."

                          "But the most important point I would like to make is this: "Don't listen to my numbers - produce your own based on your workload." By all means, use my methodology if you deem it appropriate, and/or point out the flaw in my methodology. But don't start with the assumption that the marketing brochure speaks the unquestionable truth. Start with a null hypothesis and go from there. Consensus != truth, and from what you said I think we both very much agree on this."

                          When have I ever accepted consensus on anything? Point me to an article where this has occurred. I test things all the time. It's my job. If I disagree with your take on virtualisation it's because your numbers not only aren't close to mine, they're in a different postal code.

                          I don't have a chip on my shoulder about virtualisation, or metal, or really any technology. Frankly, I don't give a damn one way or another. What I am saying is the following:

                          1) Your numbers have to be reproduced before they can be believed

                          2) A determination has to be made of how relevant your workloads are to real-world workloads as run by, well, anyone.

                          3) If you can evidence reproducible workloads that show 40% virtualisation overhead then there are people at VMware that will want to see this, reproduce it themselves and solve the problem by making a better hypervisor. I know many of them. They're good people.

                          In my experience, virtualisation is between 4% and 6% overhead for every workload I've tried. If you've workloads outside that range, I consider them an exception. An interesting one, worthy of investigation, discussion and remediation, but until we get more widespread testing on various workloads to see where they fall between my experience and your own I simply don't have enough data to kybosh hypervisors as a concept.

                          1. Gordan

                            Re: Assertions

                            I look forward to seeing your numbers, then, when you reproduce the test I originally mentioned. We can then debate observations with some more numbers to compare.

    2. vcdxnz001

      Re: Assertions

      If you ever see something near 40% overhead you're doing something very wrong and have a configuration issue. I have only seen things that bad on ESX 2.5, or where things are very badly configured at the application, virtual machine and infrastructure layers - in which case any comparison is invalid anyway - or where the environment is hopelessly overloaded. On vSphere 5.5, with a proper configuration in accordance with common good-practice application and infrastructure configuration, you should always be < 10% overhead. In some edge cases and very rare conditions you might see 20% and need to tweak something to get it back down to < 10%, and that would only be if the system was pushed to 100% utilization. I often do extreme stress testing on vSphere and haven't seen anything like that overhead, but if you have data to back it up I'm sure everyone here and VMware would love to see it. You might like to take a look at these two papers, which show some pretty heavy HPC workload comparisons with native and the benefits of low-latency optimizations.

      https://labs.vmware.com/academic/publications/performance-evaluation-of-hpc-benchmarks-on-vmwares-esxi-server

      http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

  5. Do Not Fold Spindle Mutilate

    Or perhaps ...

    1) The users' requirements of the servers are very skewed, with very important transactions on one server and unimportant stuff on all the others. Thus the people with the budget do not share the money for overall benefit, but choose products which have a direct benefit to their specific server.

    2) The very important transactions are very time-sensitive, and there cannot be delays of the kind that might happen if other servers were doing heavy I/O, such as being completely refreshed. If a server is being used for real-time trading of stocks, then the user might have a fear of delays with unknown causes, which rumour might blame on virtualization.

    3) The CSA is relatively new, so there is heavy advertising to tell potential buyers about the product. I wonder if the purpose is actually to postpone upgrading the server. If the cost of a new server is very significant and requires much management justification, planning, reporting and budgeting, then the CSA might increase transaction throughput enough to postpone the manager's paperwork.

  6. dan1980

    I think one possible explanation is that people can be wary of 'hands-off' solutions that claim to simply work without any real configuration and provide tasty bacon with little effort or money.

    I can vouch 100% that at least one sysadmin out there is wary of such claims . . .

    I think it's also true that there is a bit of a contradiction in IT in that many of us find it difficult to manage and monitor every aspect of a system (due to time, personnel and cost constraints) and yet desire just such control.

    Again, I can vouch for one such sysadmin.

    In my current position, I often find myself unable to adequately trial and assess potential solutions. It's in exactly these situations that I find myself most wary of magical bacon-granting solutions promising all pork and no pain (that might have come out wrong). Instead, I will sometimes prefer more configurable solutions that will allow me to tweak as I go.

    I am not saying that the simpler solutions would not be suitable, just that the control available in more complex solutions gives me some comfort that I will be able to tune it to get the most out of the system or adapt it to any changing needs.

    None of this is to say that one solution is better than the other, just that we all go through these processes with our own biases and the more pressed we are, the more we tend to fall back on them.

  7. Nate Amsden

    read cache does nothing

    for me anyway, with 90-93% write workloads. All the read caching is done in the apps, in memory.

    Write caching is pretty complicated to get right...

    1. Trevor_Pott Gold badge

      Re: read cache does nothing

      Yep, in my testing you have to have at least 30% reads as part of your workload for it to make a tangible difference. As I have said many a time: storage isn't one-size-fits-all. Everyone's a little different and there's more than enough money in the space for everyone.

      ...even if I don't quite understand why anyone (excepting very select niches) would choose some of 'em
