Meltdown/Spectre fixes made AWS CPUs cry, says SolarWinds

Log-sniffing vendor SolarWinds has used its own wares to chronicle the application of Meltdown and Spectre patches on its own Amazon Web Services infrastructure, and the results make for ugly viewing. The image below, for example, depicts the performance of what SolarWinds has described as “a Python worker service tier” on …

  1. Anonymous Coward

    I trust

    AMZN will be reducing AWS prices pro-rata and sending the bill for the difference to Intel.

    1. Anonymous Coward

      Re: I trust

      Yeah, the flying butt monkeys are gonna get right on that for you.

    2. Professor Clifton Shallot

      Re: I trust

      Sadly the end result of this is likely to be Amazon spending a lot more money with Intel for new kit and passing the bill on to us.

      A bit like the banking crisis, causing a massive screw-up doesn't preclude you from benefiting from it due to being the only people who understand and can sort out the mess.

    3. Anonymous Coward

      Re: I trust

      It doesn't seem to have made any difference to Azure. Our times and utilisation on complex batch processes are unchanged.

    4. John Brown (no body) Silver badge
      Joke

      Re: I trust

      "AMZN will be reducing AWS prices pro-rata and sending the bill for the difference to Intel."

      Is this finally the year of flying bacon delivered to the desktop?

      1. Brian Miller

        Re: I trust

        The bacon delivers itself, by flying to your desk. You'll have to excuse the grease splattering from all of its flapping, though. Bit like a hummingbird or a bumblebee, but bacon!

  2. J. R. Hartley Silver badge

    Great Scott

    This is heavy.

    1. Down not across Silver badge

      Re: Great Scott

      Why are things so heavy in the future? Is there a problem with the earth's gravitational pull?

  3. Pascal Monett Silver badge

    I don't get it

    The performance issue is apparently centered around switching context. The initial patches brought performance down, as was forecast.

    I wonder what kind of wizardry could alleviate the performance issue while still preserving the security side of the operation?

    1. Gordan

      Re: I don't get it

      The problem is that context switching is on the order of 100x slower in a VM than on bare metal (turning nanoseconds into microseconds).

      That is why some workloads virtualize with minimal performance hit (few threads, low concurrency, mostly userspace CPU burn), and some workloads virtualize extremely poorly, with a huge performance hit even without the Meltdown patches (anything highly concurrent, such as compile farms and databases). I have measured the performance hit from virtualization on some such workloads at upward of 30% - and that was before the Meltdown patches came into play.
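Gordan's claim can be checked empirically on any given box. Below is a minimal sketch (assuming Python 3 on Linux or another POSIX system) that times a trivial syscall round trip; the absolute number includes Python's own call overhead, so treat it as a relative comparison between bare metal and a VM, or before and after patching, rather than a true syscall cost:

```python
import os
import time

def syscall_cost_ns(iterations=200_000):
    """Average cost of a trivial syscall (os.getppid), each of which
    forces a user->kernel->user transition - the operation the
    Meltdown patches make more expensive."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()
    return (time.perf_counter_ns() - start) / iterations

if __name__ == "__main__":
    print(f"~{syscall_cost_ns():.0f} ns per getppid() round trip")
```

Run on a patched VM, this should read noticeably higher than on unpatched bare metal.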

    2. Mike Dimmick

      Re: I don't get it

      It could be that they had not previously enabled Process-Context Identifiers (PCID), or possibly that the virtual machine monitor did not expose the feature to guests.

      PCID is a relatively recent x86-64 addition. A PCID is a tag on each Translation Lookaside Buffer (TLB) entry that acts as a filter, saying 'this TLB entry belongs to this process'. The hardware will only use a TLB mapping if the tag matches the current process's tag, which allows the TLB to hold mappings for multiple processes or contexts at once.

      Traditionally, on a context switch between processes, the whole TLB had to be flushed, all entries discarded (or marked invalid). That meant that for the initial memory accesses that the process performed, including the instructions to be executed, the hardware would have to walk the page tables to find the mappings from virtual to physical memory, even if it was something that the process had recently accessed the last time any of its threads ran.

      With PCID, the OS doesn't have to flush the TLB on a process switch - only if it's reusing a PCID value from a different process. It can selectively flush entries for a process if it's changing that process's address map, using the INVPCID instruction. This would normally happen in response to a page fault exception.

      You can mark pages as global in the x86 architecture, which means that when you switch to a new process context - register CR3 changed to point to a different set of page tables, causing a TLB flush - the TLB entries for those global pages are retained. Since it's common that the incoming thread was already executing in kernel mode - for many workloads the thread is blocked on a kernel operation, not having been pre-empted in user mode - this saves having to walk the page tables to find the kernel code.

      However, the Meltdown fix now puts kernel code into a separate address space altogether, so that the processor can't speculatively load from kernel addresses. That causes an address-space switch on every user->kernel and kernel->user transition, which itself causes a TLB flush on older hardware or with PCID disabled. So, if the processor doesn't support PCID or it's turned off, the newly-loaded kernel code causes page table walks, and then on return to user mode the page tables have to be walked all over again.

      TL;DR check that your processor supports PCID and the INVPCID instruction, and that it's enabled.
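Mike's TL;DR can be scripted. A minimal sketch, assuming a Linux guest where /proc/cpuinfo is available; the parsing is split out from the file read so it can be checked against canned input:

```python
def cpu_flags(cpuinfo_text):
    """Extract the feature-flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def pcid_status(cpuinfo_text):
    """Report whether the pcid and invpcid features are present."""
    flags = cpu_flags(cpuinfo_text)
    return {"pcid": "pcid" in flags, "invpcid": "invpcid" in flags}

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(pcid_status(f.read()))
    except OSError:
        print("no /proc/cpuinfo here (not Linux?)")
```

Both flags present means the kernel can use the cheaper per-process TLB tagging that Mike describes.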

      1. Korev Silver badge
        Pint

        Re: I don't get it

        I've just learned something new :)

        A pint for Mike ->

  4. Anonymous Coward

    Maybe they should upgrade their instances

    AWS have offered HVM (hardware virtual machine) instances for quite some time now, and the Xen hypervisor advisory on Meltdown was clear that only PV (paravirtualised) instances needed hypervisor changes.

    1. andyp1per

      Re: Maybe they should upgrade their instances

      Agreed, who actually uses PV these days? HVM has been available for years and clearly you are going to need to be on HVM to get the benefits of PCID - which is the only way not to get clobbered by the fixes.

      PV - makes a nice headline for ops software companies but in real life not so much.
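Whether a given instance is PV or HVM can be checked from inside the guest. A best-effort sketch, assuming a Linux kernel recent enough to expose /sys/hypervisor/guest_type when running under Xen (the file is absent elsewhere, hence the fallback):

```python
from pathlib import Path

def xen_guest_type():
    """Best-effort detection of the Xen guest mode from inside the VM.
    Newer kernels expose /sys/hypervisor/guest_type ('PV', 'HVM' or
    'PVH') when running under Xen; returns 'unknown' anywhere else."""
    try:
        return Path("/sys/hypervisor/guest_type").read_text().strip()
    except OSError:
        return "unknown"

if __name__ == "__main__":
    print(f"Xen guest type: {xen_guest_type()}")
```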

  5. PaulR79
    Coat

    Meltdown and Spectre

    Were those names chosen because the increased workload would melt down the CPUs and kill them?

    I have manflu, it's the best I can do. Be gentle.

  6. ecofeco Silver badge

    A wha...?

    “a Python worker service tier” on paravirtualized AWS instances.

    I think I see the problem. Too much Rube Goldberg.

    Seriously, they just make this shit up, don't they?

  7. Yavoy

    Any reports on how it affects azure and Google cloud?

  8. whatsyourShtoile

    the thing is they had all these chips running at 25%

    So maybe they deserve for the utilisation to go up to 50% as some kind of punishment? Otherwise why would you not just use slower chips in the first place?

    1. DougS Silver badge

      Re: the thing is they had all these chips running at 25%

      If you want to have capacity for peak loads you need to be operating at much below peak most of the time. That's one of the main reasons for virtualization and cloud, after all.

    2. Wayne Sheddan

      Re: the thing is they had all these chips running at 25%

      Remember those graphs are for the SolarWinds guest - not the hypervisor. Do we have any visibility of what has happened for AMZN/MS/GCP at the hypervisor level? You could say SolarWinds are using inappropriate instances if they aren't hammering the CPU they've bought off the CSP.

      If I were AMZN/MS/GCP I'd be aiming for a system-wide run queue approximately equal to the number of cores: 100% CPU, with runnable guests waiting one tick to get onto a core. With good tuning of the system tick this would mean effective use of all resources on the hypervisor. I suspect CSPs already try to achieve the same thing - full use of the resource - probably through serious levels of overcommit (why haven't we been doing this on-prem?). I suspect the CPU charts for the actual hypervisors in a CSP would 'scare' your typical on-prem sysadmin. ;-)

      It's likely the patching has only changed the CSPs' 'overcommit' ratios - not the actual hypervisor CPU usage, which is probably 80-100%, 24x7, on all cores.
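The overcommit arithmetic behind this point is easy to sketch. The 48-core host, 3:1 ratio and 25% guest utilisation below are hypothetical figures for illustration, not AWS numbers:

```python
def overcommit_ratio(vcpus_sold, physical_cores):
    """vCPUs provisioned per physical core on a host."""
    return vcpus_sold / physical_cores

def host_utilisation(avg_guest_util, ratio):
    """Physical-core utilisation implied by the average guest
    utilisation under a given vCPU:core overcommit ratio."""
    return min(avg_guest_util * ratio, 1.0)

# Hypothetical host: 48 cores with 144 vCPUs sold (3:1 overcommit)
# and guests averaging 25% busy before the patches.
ratio = overcommit_ratio(144, 48)            # 3.0
before = host_utilisation(0.25, ratio)       # 0.75
after = host_utilisation(0.25 * 1.2, ratio)  # ~0.90 if patches add 20%
print(f"host CPU: {before:.0%} before, {after:.0%} after")
```

On these made-up numbers, a 20% per-guest overhead pushes the host from 75% to 90% busy - which is why a provider running hot has to either eat the headroom or lower the overcommit ratio.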

      1. DougS Silver badge

        Re: the thing is they had all these chips running at 25%

        If the hypervisor was planned for 80-100% 24x7 and they installed a patch that caused everyone's load to increase by 20%, they would be royally fucked!

        I know of course the 25% was the vCPU, not the actual CPU, which as you say will be more heavily loaded. The customers want low utilisation in their vCPUs so they have spare capacity for when they hit their peaks. The provider needs to look at trends to ensure they have high utilisation (otherwise they have idle resources that are a waste of capital) but not too high - otherwise they could get caught with their pants down if something triggers peak loads for their customers: a major news event like a missile alert in Hawaii, a president being assassinated, a stock market crash, etc.

        1. Anonymous Coward

          Re: the thing is they had all these chips running at 25%

          More interestingly, what about the power and thermal design requirements across whole data centre sites? Providers will usually have plenty of spare capacity available, but if overall power usage or temperatures increased by more than a quarter across the whole site, it could have serious implications for power or HVAC systems working efficiently, especially any close to their rated capacity. My team once added a few extra racks of power-hungry servers (maybe ten kilowatts) without provisioning additional power supply to the same floor. We literally blew up the main site distribution board and had to call in electrical experts to deal with it at the substation level, while every other service on site sat down waiting for its power to be restored. Clients weren't prepared to pay for site upgrades until forced.

        2. GreenReaper
          Mushroom

          Re: the thing is they had all these chips running at 25%

          I'd be a little surprised if cores were the limit. It's more often RAM or I/O performance. That's why servers tend to have so many RAM slots. Even tasks you think of as "compute" can be bottlenecked elsewhere.

  9. Sorry that handle is already taken. Silver badge

    Kafka? Cassandra?

    I think these guys are pessimists.

    1. Michael Wojcik Silver badge

      Re: Kafka? Cassandra?

      "Beware of geeks bearing graphs!"

      Gave you a thumbs-up, but actually I think Cassandra was an optimist. She kept telling people what was going to happen, despite a history of being ignored.

  10. Adrian 4 Silver badge
    Headmaster

    expensive fix

    While the initial hack to protect against Meltdown was justified, it's an expensive fix, loading up every system call for the 99.99...9% of innocent applications.

    The pattern of operations needed to provoke the bug and probe kernel memory is pretty specific: hitting blocks of memory with huge numbers of illegal accesses in order to measure fetch times. This pattern ought to be detectable.

    Perhaps a cheaper approach would be to monitor the rate of such accesses - which themselves raise an exception - and quarantine the guilty applications, both with a tarpit approach (make the illegal accesses themselves expensive) and by enabling KPTI for them.

    This would still allow the attack, but would force it to operate slowly in order to evade detection. So slow, perhaps, that it could no longer read a useful section of kernel memory in the time before it changed enough to make the operation worthless. This appears to be a strategy used against rowhammer (https://lwn.net/Articles/704920/).
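The rate-then-quarantine idea can be illustrated with a toy model. This is purely a sketch of the commenter's proposal, not how any kernel actually implements Meltdown mitigation; the one-second window and threshold are made up:

```python
import time

class FaultRateMonitor:
    """Toy model of the proposal: count illegal-access exceptions per
    process over a one-second window and flag a process for quarantine
    (tarpit + per-process KPTI) when the rate looks abnormal. The
    threshold is made up for illustration."""

    def __init__(self, max_faults_per_sec=100):
        self.max_rate = max_faults_per_sec
        self.windows = {}  # pid -> (window_start, fault_count)

    def record_fault(self, pid, now=None):
        """Return True if this process should now be quarantined."""
        now = time.monotonic() if now is None else now
        start, count = self.windows.get(pid, (now, 0))
        if now - start >= 1.0:  # roll over to a fresh window
            start, count = now, 0
        count += 1
        self.windows[pid] = (start, count)
        return count > self.max_rate

monitor = FaultRateMonitor(max_faults_per_sec=3)
verdicts = [monitor.record_fault(pid=42, now=0.5) for _ in range(5)]
print(verdicts)  # a burst of 5 faults in one window trips the limit
```

As the comment notes, an attacker could still stay under the threshold - the aim is only to slow the leak to uselessness, as with the rowhammer defences linked above.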

  11. a_yank_lurker Silver badge

    Statistics

    Right now, we are not seeing good statistics on just how much damage the patches are causing. The data points are few and seem to be biased towards the worst cases. I suspect the effects will range from none/not detectable to eye-popping, but the key is the distribution across server farms once people figure out how to work around the problems. I would not be surprised at something that resembles a Weibull distribution, or a mirrored Weibull (an asymmetric distribution with most values clustered at one end and a longish tail).
