Dinosaur struggles to rent apartment.
100 year old couple fail to conceive.
IBM has scrambled to fix the Meltdown and Spectre bugs, but has struggled to develop processes, reporting tools or reliable patches to get the job done for itself or its clients. Internal documents seen by The Register reveal that Big Blue has ordered staff not to attempt any Meltdown/Spectre patches, but that the advice to do …
... you're not wrong. For quite a while during my time at IBM, we had no central management tool, our license for BladeLogic expired, and there was nothing to replace it, ... so we ran MBSA scans on each server, imported the results into Lotus 1-2-3, then created scripts for each server based on that report, and then ran those scripts manually to patch. Then of course ran MBSA again to prove the server had been patched. We patched manually on each server for about two years, before we finally got IEM fully deployed.
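The loop described above (scan, report, build per-server scripts, patch, re-scan to prove compliance) can be sketched roughly as follows. This is a toy illustration only: the server names, the `PATCH_DB` data, and the `scan_missing_patches`/`apply_patch` helpers are hypothetical stand-ins for MBSA and the hand-written per-server scripts.

```python
# Toy sketch of the scan -> script -> patch -> re-scan loop described above.
# scan_missing_patches() stands in for an MBSA scan; apply_patch() stands in
# for the hand-written per-server patch script. All names are hypothetical.

PATCH_DB = {
    "web01": {"KB4056890"},
    "app01": {"KB4056890", "KB4056892"},
    "db01": set(),  # already fully patched
}

def scan_missing_patches(server):
    """Stand-in for an MBSA scan: return the set of missing patches."""
    return set(PATCH_DB[server])

def apply_patch(server, patch):
    """Stand-in for running the per-server patch script."""
    PATCH_DB[server].discard(patch)

def patch_cycle(servers):
    """Scan, patch whatever is missing, then re-scan to prove compliance."""
    report = {}
    for server in servers:
        for patch in scan_missing_patches(server):
            apply_patch(server, patch)
        # The second scan proves the server is now clean, as described above.
        report[server] = "compliant" if not scan_missing_patches(server) else "FAILED"
    return report

print(patch_cycle(["web01", "app01", "db01"]))
```

The key point of the workflow is the second scan: the patch run is never trusted on its own, compliance is only claimed once an independent re-scan comes back clean.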
Lots of upper management were still on leave, so the middle-management drones who just follow orders didn't know what to do. The few technical managers on deck got bogged down in continual meetings briefing higher-ups, explaining things, walking through technical challenges etc., instead of being able to provide direction down the chain of command. Or their instructions got reparsed through management groupthink and delayed as more upper managers wanted to stamp their approval. I can understand that they want a consistent approach and communication, but there was a project managing this before the issue became public knowledge, and we thought they would have already worked out what communications, directions and actions needed to be communicated.
Meanwhile delivery executives hassle technical teams to patch resulting in some offshore resource finding the patch doesn't install, sees a reference to a registry key to make it work and then applies the patch and bluescreens multiple boxes or breaks the antivirus.
In the absence of a clear directive on what to actually say to the customers who approach me, I just explain the issue in simple terms: that there are many moving parts, a brief overview of what we'll need to do, and that we are confirming we have all the info so we can execute a plan to remediate.
Locally, my team know what to do and have been prepping in the background (system inventories, firmware levels, AV status, patch readiness etc.) so that we'll be ready when we get the go-ahead (still awaiting customer agreement/signoff on the process, as we proposed to separate the Meltdown/Spectre patch deployment from the normal monthly patch deployment). Meanwhile, the offshore team who normally do all the patching are sitting there waiting to be told what to do through the normal chain of command.
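The background prep described above amounts to building a per-server readiness record and flagging anything that would block the deployment. The sketch below shows the idea; the check names and the sample data are illustrative assumptions, not IBM's actual checklist.

```python
# Illustrative readiness check: the check names and pass criteria are
# assumptions for the sake of the example, not an actual IBM checklist.

REQUIRED_CHECKS = ("inventory_done", "firmware_current", "av_healthy", "patch_staged")

def readiness_report(servers):
    """Split servers into (ready, blocked) lists based on per-server flags."""
    ready, blocked = [], []
    for name, checks in servers.items():
        missing = [c for c in REQUIRED_CHECKS if not checks.get(c)]
        (blocked if missing else ready).append((name, missing))
    return ready, blocked

servers = {
    "web01": {"inventory_done": True, "firmware_current": True,
              "av_healthy": True, "patch_staged": True},
    "app01": {"inventory_done": True, "firmware_current": False,
              "av_healthy": True, "patch_staged": False},
}
ready, blocked = readiness_report(servers)
print(ready)    # web01 is good to go
print(blocked)  # app01 is blocked on firmware and patch staging
```

The point of keeping this as data rather than tribal knowledge is that when the go-ahead finally arrives, the "are we ready" answer is a report, not a scramble.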
It's like they've been listening to the Eddy Grant song;
We gonna rock down to Electric Avenue
And then we'll take it higher
And are now waiting for further instructions...
And folks, that's what happens if you get rid of your experienced staff...
Sure, firing your experienced staff and relying on documentation so that the plebs and n00bs can continue working (and rake in money) will only work for a while (and look great with all things beancountery) - until an IBM (incapacitated by meltdown) event happens.
Then, when you need more experienced staff, tough luck finding some.
But it's the IBM way. A project manager's take on the problem: "If it takes 9 months for a woman to conceive and then give birth to a baby", then it stands to reason that all you need to do is contract this out to three women and get the job done in a third of the time. Simple maths.
Talk about project-management timelines. 2000-2006: I was subcontracted to IBM for a total revamp/facelift of ibm.com. Development and project planners laid out a comprehensive plan and reported to "Management" that it would take 12 months from start to "go live." This was a streamlined plan with minimal "fudge factor" built in. "Management" (a person whose last name was Watson - otherwise I seriously doubt she would have lasted long) listened to the planners, then decided that "6 months is plenty of time. Deal with it." Can't begin to tell you how many times I've seen this happen at IBM. BTW, it took 12 months to complete.
AC for obvious reasons: I'm currently a sub on another IBM contract. Yep, got sucked back into the Big Blue bilge water.
"Management" listened to the planners, then decided that "6 months is plenty of time. Deal with it." Can't begin to tell you how many times I've seen this happen at IBM. BTW, it took 12 months to complete.
Unless the particular Mangler was savvy to IBM fudge factors, and simply shaved off the numbers. If she had left it at 12 months it would have taken 18 or 24 months (a longer project timeline would lead upper manglement to think they could "resource" more people).
Probably better for IBM to put out a proper, measured response rather than rushing to deploy something that isn't properly tested and breaks customer stuff. There is plenty of mitigation that can be done to provide some assurance in the short term that systems are not currently being compromised and will not be compromised.
Doesn't stop teams learning and working stuff out in case a rapid response is needed.
Quote: "The documents also say some Red Hat Enterprise Linux servers aren't rebooting after patching." Whilst we haven't seen this happen with anything we've patched so far, it could be concerning if there's any real evidence of it. Anyone experienced this, or is IBM just blowing smoke?
It's true. I upgraded a Softlayer VM today and rebooted, or rather didn't; I had to boot in rescue mode and change the kernel back to the previous one. There is a note tucked away in the general announcement saying not to implement the upgrade on Red Hat Enterprise 6 whilst they investigate.
I also experienced it with another cloud provider so they are not the only ones.
That said, their console access is a Java applet which doesn't appear to work under Windows 10, nor does PPTP, which is required to connect in the first place (a Windows 10 issue). You can use SSL, but that only works in IE. After a number of responses from support they sent me (incorrect) instructions for another SSL client, but no response yet as to how to get round the Java console issue.
So I wasted a couple of hours trying to gain console access before booting in rescue mode and using SSH.
Just in case anyone was waiting for an update: seems that so long as this is just second-hand news about internal IBM documents, we don't want to comment on it. Of course, if you're a paying RH customer and you're concerned about the consequences of updating any of your systems, please contact the relevant support folks and they'll certainly be able to help you with it.
### SPECIAL ADVISORY ###
It is our understanding that this is a library versioning issue in our middleware. We have deployed our finest team of devs to reach a resolution that will work for all our customers of this software, and expect a speedy resolution by 2023, which will give our team the opportunity to develop the appropriate skills while finishing college.
Damn I'm glad I left when I did. I know I wouldn't want to be on that team right now.
We've 8 RHEL 7.3 systems - a testing cluster for the patches:
1) No issues with the RH patches - these are Intel v4 dual-socket boxes.
2) Minor issue with a network driver (10Gb Ethernet), fixed with an additional patch to the driver from HP.
*There is* a performance hit on high-IO processes, and well, Hadoop is massive IO. Still trying to get a consistent number on the real effect here.
If the customer is large enough there might be one team managing backups, one looking after day-to-day support and another working on patching (all offshore). The patching team doesn't necessarily check that the last set of backups was successful before their change. Even smaller accounts might be using shared infrastructure with a separate team looking after backups for multiple customers.
Normally a support team would do a manual backup where the nightly backup had failed (if that had been communicated by the team monitoring backups). However, some customers won't allow backups to run during the day, to avoid potential impact to normal day-to-day operations. A few days of problematic backups, no manual backups (communicated to the account team and customer), and then something goes wrong in a change or a system fails. I've been pulled into more incidents than I care to remember where an ongoing backup issue had not been communicated properly, an update went pear-shaped and the backups were missing data.
Testing is often done by the customer application teams - IBM would do an operating-system Post Implementation Validation (PIV) while the customer is responsible for application PIV. Customer pre-production (DEV, SVP, UAT) PIV often isn't as rigorous as that of production instances. Having worked on changes in all environments myself, I've had the customer report back all okay after 10 minutes in a test environment where the production PIV took 2 hours. I've also seen a lot of incidents where the customer has signed off a PIV in pre-prod and then, a week later, lodged a complaint about a change 'breaking' their application (because they hadn't tested rigorously).
As the technical staff deal directly with the application teams when organising the changes, the communication would mostly be re-iterating the process (for management types who start asking about backups etc., as they aren't familiar with the day-to-day operations). The other issue is that communications in some customer organisations can be just as bad as in any other: the comms go out from account teams to their contacts in the organisation, but sometimes don't get passed down. We've had application teams push back on updates being deployed because they were about to perform a release, even though their own CIO had declared the patch a CIO-override for deployment.
... when I worked in security and compliance, we checked backups had run before we deployed patches. Generally we'd check the last week in the eventlog for any errors, check the backups had been working, and record all running services (yes, I scripted that), and then after we patched I ran a post script that made sure all the services that were running, we still running, and checked for errors in the eventlog again. We got IEM as a patching tool, and supposedly someone was going to build our checks into IEM pre-reqs before the scheduled task would run, but I left before that happened.
Meanwhile, Service Management were also supposed to alert us if backups failed because they got a daily report on that. Not once was I ever notified of a backup failure by Service Management though, they knew we checked, so they never bothered.
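The pre/post check described above (verify recent backups and event logs, snapshot running services, patch, then confirm nothing regressed) has roughly this shape. The snapshot functions below are placeholders standing in for the real event-log and service-control queries that were scripted; the host data is invented for illustration.

```python
# Sketch of the pre-patch snapshot / post-patch verification described above.
# get_running_services() and get_recent_errors() are placeholders for the
# real service-control and event-log queries; the host data is invented.

def get_running_services(host_state):
    return set(host_state["services"])

def get_recent_errors(host_state):
    return list(host_state["eventlog_errors"])

def pre_patch_snapshot(host_state):
    """Refuse to proceed on recent errors; record what must survive the patch."""
    if get_recent_errors(host_state):
        raise RuntimeError("recent event-log errors - resolve before patching")
    return {"services": get_running_services(host_state)}

def post_patch_verify(host_state, snapshot):
    """Every service that was running must still be running; no new errors."""
    stopped = snapshot["services"] - get_running_services(host_state)
    return {"stopped_services": sorted(stopped),
            "new_errors": get_recent_errors(host_state)}

host = {"services": ["w3svc", "mssql", "backupexec"], "eventlog_errors": []}
snap = pre_patch_snapshot(host)
host["services"].remove("backupexec")  # simulate a patch knocking a service over
print(post_patch_verify(host, snap))   # flags the stopped service
```

The value of the snapshot is exactly the failure mode described in the thread: if nobody records what was healthy before the change, nobody notices what the change broke.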
AIX is POWER only (the short-lived AIX-5L port to Itanium has long since disappeared, and AIX/PS2 is a dead product).
AFAIK, nobody has demonstrated that AIX is vulnerable to MELTDOWN (Meltdown relies on virtually the whole kernel memory space being mapped into the address space of user processes, protected only by memory access controls, and AIX does not have that mapping).
I'm guessing, but I think that the Power Linux distributions are removing the mapping of kernel memory from the user-process address space because this is actually a very sensible precaution. (Linus has a lot to answer for here: UNIX systems on other architectures like the PDP-11, VAX, S/370 et al. never mapped kernel memory into the user process's address space, so doing it for Linux was rather short-sighted - although early x86 processors were a bit deficient on the MMU front.)
SPECTRE is a different beast. I would not be surprised to see some elements of SPECTRE affecting Power processors.
the short-lived AIX-5L port to Itanium has long since disappeared, and AIX/PS2 is a dead product
But let us not forget the also-gone original AIX for the RT PC, AIX/370, and AIX/ESA.
Anyway: Meltdown (it's not an acronym, so there's no reason to write it in block caps) only applies to CPU+OS combinations where pages with different read permissions are mapped into a single address space, and speculative execution ignores those read permissions. Currently, that's only x86 and one ARM core family (which isn't in production yet).
The much larger Spectre (also not an acronym) family of attacks are generally possible, in some form, on any CPU that provides speculative execution and any side channel that permits indirect analysis of load contents. The attacks in the Spectre paper use cache timing, but the paper notes some of the other possible side channels.
Spectre has been confirmed against Intel and AMD x86, ARM, POWER, z, some Nvidia GPUs, and by now probably other processor architectures. Because speculative execution is well-known and long-established in high-performance general-purpose processors, Spectre attacks are widely applicable.
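For readers unfamiliar with why speculative execution plus a cache makes a side channel: the toy simulation below models a cache as a set of "warm" lines, lets a simulated speculative access warm exactly one line chosen by a secret value, and then recovers the secret purely from access timing. This is a conceptual model only, under simplified assumptions (one cache line per secret value, fixed hit/miss costs); it is not a working exploit on real hardware.

```python
# Toy model of the cache-timing side channel used by Spectre-class attacks.
# "Speculation" and "timing" are simulated, not real hardware behaviour.

CACHE_LINES = 256
FAST, SLOW = 1, 100  # pretend cycle counts for a cache hit vs a miss

class ToyCache:
    def __init__(self):
        self.warm = set()

    def flush(self):
        self.warm.clear()

    def access(self, line):
        """Return simulated access time; the access also warms the line."""
        t = FAST if line in self.warm else SLOW
        self.warm.add(line)
        return t

def speculative_leak(cache, secret_byte):
    # The CPU speculatively loads probe_array[secret * stride]. The load is
    # rolled back architecturally, but the cache line it touched stays warm.
    cache.access(secret_byte)

def recover_secret(cache):
    # The attacker times a probe of every possible line exactly once;
    # the uniquely fast one reveals the secret.
    times = [cache.access(line) for line in range(CACHE_LINES)]
    return times.index(min(times))

cache = ToyCache()
cache.flush()
speculative_leak(cache, secret_byte=0x42)
print(hex(recover_secret(cache)))  # prints 0x42, recovered from timing alone
```

The essential property is that nothing architectural leaked: only the *timing* of later accesses differs, which is why mitigations target either the speculation itself or the precision of the timers available to attackers.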
I'm interested in the references to any PoC on Power processors that you may have.
I can find a ZDNet article that claims vulnerability, which quotes the IBM PSIRT blog item that is hugely non-specific and does not mention Meltdown or Spectre by name or CVE number. The original Google Project Zero write-up does not list Power as one of the processors it discovered had issues.
Because of the specific mechanism Meltdown uses, until I see someone claiming they have a PoC, this bug on Power will remain in the not-proved category as far as I am concerned.
The write-up for Spectre, however, lists a range of techniques, and lists in passing things like power monitoring, branch prediction table poisoning, and instruction timing exploits, some of which can be made more effective by exploiting speculative execution.
I know that this may be a complacent view, but I believe that IBM's line on Power is that there is a possibility that one of the various techniques detailed in Spectre may well work on Power, and that not issuing a statement or patches would be more damaging to the reputation of IBM and its Power line than issuing fixes that do something (like removing the kernel address-space mapping from user processes), which removes one of the identified issues that cause problems on other processors.
I've seen no indication that anybody has actually come up with a viable method of removing significant amounts of risk of the Spectre vulnerabilities, other than those which serialize instruction execution, effectively disabling speculative execution. These normally involve code or compiler changes, and this will not make any difference if some malware not compiled with these techniques is executed on the system, i.e. it's not a complete solution.
Couple that with the referenced Return Oriented Programming, in which existing sequences of bytes in a process (bytes that may not actually be code, but which happen to represent valid instruction sequences) are identified and then executed using buffer-overflow techniques to jump to those locations, and you have attacks that are extremely difficult to mitigate.
So if you have found any references to any real Power PoC, I would be very interested in reading them.
"but that the advice to do nothing is incorrect and needs to be changed: ...
NO! You are FUCKING WRONG.
WAIT. Sit the hell still and wait for a WORKING remedy. This rushing in with ill-conceived, half-cocked bullshit is bricking shit.
Patching for the sake of patching is the same idiot game we've played before; SSL seems to be the latest cluster-fuck apropos.
Wait for a usable solution (if you're not Intel) and Then Roll It.
There is a Huge duplication of effort here and it's going to step on other initiatives' toes.
I, for one, do not have time to explain to My Powers That Be that I need new motherboards OR a Surface Mount soldering station and a metric shitload of new CPUs and BIOSs.
Chill out, sit down, and just wait a sec. Bejeabus.
It's likely true that there isn't a crisis. These security holes aren't trivial to exploit, yet.
But this should never have been an issue. Vendors had months to work out a response and fixes. IBM is exactly the company that should have process and procedures ready to announce and deploy.
It really does appear that IBM laid off the experienced people that would have been able to understand the implications of these holes, and how best to deploy the fixes. Instead they have a bunch of people in India that can apply a patch and reboot systems, but don't understand the whole-system implication of code being able to look beyond the address space protection.
These security holes aren't trivial to exploit, yet.
They've basically disabled the two high-precision timing mechanisms used in the paper. There are several others available to scripts, and they're well-documented. One of the key papers describing them was linked from the comments on another Reg article recently.
> > We haven't even begun to plan the remediation.
> Ignore it, and let the marketing/sales guys figure out how to spin that as a unique benefit?
Just tell the customers that they have now automatically been migrated onto the "performance plan" versus the "standard plan".
 Comes with small security considerations. Sign here.
 Reduced performance, not yet available.
Biting the hand that feeds IT © 1998–2019