Irate VMware customers were left unable to power up their virtual servers this morning because of a bug that killed their systems when the clock clicked round to 12 August. The bug was sent out to customers in ESX 3.5 update 2, VMware's latest hypervisor, which went out on 27 July. The version could have been downloaded and …
I've got a bad feeling about this....
Curious that this happens just after the putsch that ousted Diane Greene et al. Would someone with such a reputation as a stickler for detail have had this happen on their watch?
Disgruntled developer who didn't get the expected payrise.
We'll never know of course.
I guess it's a good thing that I wasn't able to update VC and ESX last week to this revision level. For once in its miserable life, our Change Management process actually did something useful. I'll have to make sure that management doesn't find out :-)
This Will Hurt VMware...
Being in Australia, we have a short reprieve still ahead of us, but after reading the thread on VMware's site, I already dread going into the office tomorrow.
I canceled scheduled maintenance on all virtual hosts and their VMs on pain of immediate sacking (OK, not really... I'm not that horrid a manager).
But, I will be dancing around the data centre, naked and in body paint with bone-filled rattles to drive out the spirits of VM crashes, which have surely been waiting for just such a moment.
If we make it through the next few days without the board of directors demanding our heads, I'll be hosting an open bar for the entire team.
Saved by a test plan before implementing in production?
<Insert smug look here>
Who planted the new man in charge at VMware...
Who planted the new man in charge at VMware?
Anger isn't the answer
“...we are aggressively working on a fix which should be within a short time frame.”
I'd feel better if they were calmly and systematically working on a fix instead of going at it like angry kittens with a ball of string.
And the cause of this bug is...
The start of the grouse-shooting season or what?
benefits of virtualization ..
Given the relatively low cost of hardware, I do believe the benefits of virtualization are oversold.
Also what's magical about Aug 12 2008, does it cross some boundary in a bit field, as when Windows hangs every 49.7 days ..
Computer Hangs After 49.7 Days
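For what it's worth, the 49.7-day figure is what you get if an uptime counter is held as an unsigned 32-bit number of milliseconds. A quick sketch of the arithmetic, assuming exactly that counter width and unit (the names below are made up for illustration):

```python
# Assumption: the uptime counter is an unsigned 32-bit count of milliseconds.
COUNTER_BITS = 32
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap(bits: int = COUNTER_BITS) -> float:
    """Days until an unsigned millisecond counter of the given width wraps to zero."""
    return (2 ** bits) / MS_PER_DAY


print(round(days_until_wrap(), 1))  # 49.7
```

Whether 12 August 2008 sits on any such boundary is a separate question; this only shows where the 49.7 comes from.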
"VMWare's thread about the issue has been viewed more than 2,500 times." Hits are a terrible metric for popularity, not even worth stating. I don't expect such poor use of data from The Reg.
Wouldn't want to be holding VMware shares after this
Love their products but talk about handing it to MS on a plate
Someone will be getting their coat.
A 12th of August bug?
Could happen to anyone, we shouldn't grouse about it.
Make me happy
VMware and all other virtualization "solutions" in the same category are nothing more than crutches for incompetent system administrators that can not properly load and/or tune their OS and applications. We have had their consultants in here, and sure, they found stuff to virtualize, but when pointed to the servers I am running, they just say "oh, well, those systems would not be suitable for our product " [insert excuse here]. The reason is that they were properly planned, and laid out ahead of time and are running at 75% load, right in the sweet spot where I like it ;)
Of course, that being said I should also point out, I don't do windoze ;)
lol, lol and lol again
If you're going to throw a 2 week old patch onto a system that, to be honest, had no reason to be updated, then you're going to hit problems.
Test, Test and Test again. If you're still paranoid, don't bother.
OK, smug person with the test plan....
So even if I'd had a test plan, and run through it and it all worked fine - how would that save me if the problem is date-related?
Unless you're telling me that part of your test is to run your vmware servers with every possible date - or that you just had a suspicion that August 12th was likely to be a problem....
Is this why kb.vmware.com is unresponsive today, perhaps?
I wanted to look up the article on poor timekeeping in a Linux guest (http://kb.vmware.com/kb/1420 ?) on a Windows host, having this morning seen my kernel rebuilt to 1000Hz lose seven hours since 6pm yesterday... but the kb website also seems to have a responsiveness problem...
Yesterday would have been a good time to short those VMware shares.
For everyone (including those at VMware), how do you test for a date problem? Do you set the date forward one day at a time until you have covered a five year period? Maybe we could step forward one hour (minute, second,) at a time until we know that the product will run at all dates and times? Clearly this is not a case where proper testing would have caught a problem.
Think before using the "you should have tested" out.
Hmm, not sure if it's coincidence but....
Whilst the application that's running in my VMware Server 1.0.5 is still running nicely - any attempt to log into it using the VMware Console is failing.
Trying to boot the VMware instance from the console fails. Fortunately, it's set to start automatically when the host system reboots - so yes - I had to reboot this workstation in order to restart it.
Thank goodness I can ssh into it to stop or start it - but I'm wondering if this is co-incidence.
I doubt strongly that testing would be effective in this case.
It is important to always test things before putting them into use on production servers. My thought has always been test it for a couple days, then hook up a couple of test computers (or a small somewhat ISOLATED segment of the network) and let (L)users brea... err test it some more. If it isn't COMPLETELY ballsed up by a day or two of that then push it to production machines. All told a week or less from download to production or the round file bin aka file 13.
Since this has been out two weeks, that wouldn't have prevented this. I agree with other posters that claiming a test program would prevent this is at best debatable (unless you are running some REALLY protracted testing).
That said, if you have a network environment and aren't testing updates from all vendors before implementing then you are asking for trouble.
All hands, cause given the wide implementation of VMware we are all in for shit for a couple days I suspect.
I think you'll find they were meaning that with a test plan, you'd run it for a month or so. The testing would have picked it up. All depends on your risk management really.
They need to blame this on the San Francisco sys admin
What else is it going to take?
How many more incidents like this have to happen before somebody in an IT ministry somewhere in the world decides it's high time that software vendors were obliged by law to supply Source Code with every product, precisely in order to prevent this sort of scenario?
@AC Testing for date problems
Testing for date problems is not exactly rocket science. I would suggest having a list of random dates you try during testing. In addition to that have several systems running with dates set in the future in test. Maybe 1 week in the future, 3 months in the future and 6 months. Hopefully the 3 and 6 month systems will catch date related bugs before shipment. The 1 week system will give you 1 weeks warning of a test escape.
I wonder why VMware did not do this.
Why hasn't anyone mentioned DRM?
I saw this this morning, but didn't have time to comment. Now I see still no one has mentioned the DRM angle. Surely there is no functional reason for this. It has to be some bit of crap they tried to add in to ensure no evil pirates would run their product - odd considering by far the most likely audience for their product are huge corporations that don't dare do such things.
But as their attempts to virtually take over the world (pun intended) falter they probably blame pirates rather than the fact that their product adds precious little functionality to a data center. Sure it's neat and all, but when companies are cutting programmers it's probably a hard time for gee-whiz software sales.
One of the few marketing terms I know is "backlog". Best I've ever been able to figure out, it's the term sales people use for sales they "should have made", but honest boss "we'll close them next quarter". So, as people figure out their product is not a silver bullet to replace skilled system admins, and it costs a bloody fortune anyway - they probably put 2 and 2 together and got 22. Obviously the problem is we need a more aggressive DRM system.
beta expiry code...
...got left in, which is what caused the product to expire.
I guess the lesson to learn is that it wouldn't hurt to have a stage where you whack the clock forward an arbitrary amount of time (e.g. twice the length of a typical update cycle) and make sure it still runs in your test environment. Particularly given software with subscription based licensing, you should definitely be testing with operating system dates either side of the point where you expect the licence to expire, as they mark a known change in conditions.
No, I wouldn't have thought of this myself. ;-) In any case I believe the smugness above was due to the fact that the bug was made public before that poster's testing cycle happened to finish, so he was just lucky.
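The "whack the clock forward" stage could be sketched roughly as below. The 27 July release and 12 August failure dates come from the article; everything else (the function names, the 14-day update cycle, the expiry check itself) is hypothetical and only stands in for whatever the real product does.

```python
from datetime import date, timedelta

# Hypothetical cut-off, using the date reported in the article; not VMware's actual code.
EXPIRY = date(2008, 8, 12)


def can_power_on(today: date, expiry: date = EXPIRY) -> bool:
    """Stand-in for a time-bombed check: refuses to start on or after the expiry date."""
    return today < expiry


def first_failing_date(release: date, horizon_days: int):
    """Walk the clock forward from the release date; return the first date that fails, or None."""
    for offset in range(horizon_days + 1):
        day = release + timedelta(days=offset)
        if not can_power_on(day):
            return day
    return None


# Assumed 14-day update cycle, so look twice that far ahead.
print(first_failing_date(date(2008, 7, 27), 28))  # 2008-08-12
```

Looking only one cycle ahead (14 days) returns None here, which is the point of doubling the horizon.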
Shot in foot ?
Won't do VMware's rate of adoption much good - I wonder how many of the people who grabbed free 3i licences to make a business case for VMware adoption are actually going to take the testing to full term.
Still, at least you can wind the date back a year to start the vm's...
Don't test for date problem, REVIEW THE CODE.
AJ Stiles sort of said it already. But even if they don't ship the code to the paying customer, code should be reviewed internally by People With Clue.
It's the calendar...
...it's the y2k bug, it's arrived at last. The calendar got all screwed up with religious doings years ago. Know what this means?...Happy new millennium!!! Time to go on the piss before apocalypse gets here!
@ Michael Hoffmann
No, no, no, my good man. The approved witchdoctor garb is a grass skirt; no nudity. In your part of the world, however, a long penis sheath is the de rigueur accessory, worn either on its own or with the skirt.
Sheesh! Geeks! Especially managerial geeks! No fashion sense at all!
Testing? What testing?
In these days with NTP and GPS clocks, who is running a test system that thinks it's next week? That is an easy way to find out about these sorts of problems before they show up.
To compound matters - that "free" ESXi they announced on July 28th
Yes, that's broken too.... all those freebies given out to convince the Hyper-V maybes that VMware is better are now broken as well... Shot themselves in the foot there.
Paris because she is high Quality Ass(urance)
(no, I don't mean that, honest)
Re: Don't test for date problem, REVIEW THE CODE.
There's no good reason not to ship the Source Code to the paying customer.
It doesn't do anything to prevent piracy. And code plagiarism would be obvious anyway, if your competitors were also obliged to ship their Source Code.
All it does is create problems for users.
Until it becomes law to supply Source Code, or a decompiler exists, issues like this -- and worse -- will keep on happening.
@ A J Stiles
Oh, here we go, here come the freetard gang again, with their clarion call of "open source is a panacea for everything". Well it isn't, so STFU.
For a start, not one in a thousand people have the skills, shit, not one in a thousand programmers have the skills, to read through the source listing of a hypervisor and spot a bug like this, unless it's something really glaringly obvious like a great big commented section that says "THIS CODE WILL CAUSE THE SYSTEM TO FAIL ON AUGUST 12". Certainly the average sysadmin has neither the time nor the inclination to do this kind of thing, even if they do have the appropriate skill set.
Plus, if you think someone with a market share like VMWare don't have a code review and testing process that would catch something that was easy to spot, you're clearly living in la la land.
And as a case in point, I'm currently filling in a bug report for a Debian upgrade that totally FUBAR'd my wireless IDS/IPS box, and that code was supposedly QAd by about a thousand developers, so clearly your suggested panacea doesn't work. Period.
If anything, this incident illustrates an issue with VMWare's QA process (although frankly, software is complex, and shit happens), _not_ with the closed source model. So put down your cheerleading pom poms and go back to downloading pr0n for your umbongo desktop. Spankard.
VMware knowledge base down and out
No longer unresponsive, now (deliberately) inaccessible: "This section of the VMware website is currently unavailable while we make important user improvements and upgrades to the site. We apologize for any inconvenience this may cause."
Someone's bonus seems to be at risk.
Anyone else notice...
...this is exactly a year since their IPO:
Aug. 13, 2007 (Bloomberg) -- EMC Corp.'s VMware software business raised $957 million in an initial public offering today, at the top end of the forecasted range.
"if you think someone with a market share like VMWare don't have a code review and testing process that would catch something that was easy to spot, you're clearly living in la la land."
Er, anyone who thinks there's any reliable connection between a company's size/market share/visibility and the quality of their processes and products is surely living in La La Land, no? VMware aren't the only example... one classic (the 49.7 day crash) has already been mentioned here though iirc that was in a Win9x of some flavour.
"the average sysadmin has neither the time nor the inclication to do this kind of thing,"
Which is why enterprise-critical systems shouldn't be designed or deployed by "average" people (not that it stops most companies), they should be designed and deployed by that rare commodity, People With Clue (not me, but I know a few).
"a bug report for a Debian upgrade"
How does a one-off (?) failure of one Linux flavour in one set of circumstances to meet your requirements of the day suddenly mean the whole "open source" model is kaput? There are plenty of happy Linux users out there too (and a few unhappy ones, just like with Microsoft).
Anyway, access to source isn't just an issue of FOSS vs closed source. Back in the day, VMS customers with money and interest and competent (not average) techies could buy the source listings on machine readable media. No FOSS there, but if something catastrophic like this were to happen, the smarter customers would likely be in a position to fix it PDQ if the suppliers didn't.
Got out of bed the wrong side this morning did we?
This is too bad, but I think a lot of folks do not have a realistic understanding of software engineering or systems management. Everyone has bugs, and there is always a chance something bad will slip through.
Shipping source code is kind of a silly idea, it is nearly impossible to find a bug by inspecting a huge source code base, except during focused code reviews by knowledgeable co-workers as it is being developed. Customers don't want to spend resources trying to do that, and the Raymond-esque notion that an army of amateurs can do it is just ridiculous.
You really need a test organization, people who will run the code, stress test it and sleuth out bugs and process reports from customers. You also need a database system to remember and prioritize bugs. When we were working with UNIX programmers from ___ on a project years ago, they were just completely baffled by this concept, they never understood or used the bug tracking system, routinely left the code base in a state where it wouldn't even compile. Didn't leave us with a very positive impression about hacker culture.
Calm down, dear!
@"The Other Steve" -- don't forget to take your meds!
It's true that openness is not a magical solve-all panacea, but no-one said it is. It beats "play and pray" though! The point is well made that mandatory source-code disclosure would serve the interests of those who deploy and use computing resources.
Looks like some smart guy found a workaround...
@AC, Re: benefits of virtualization ...
-- begin quote --
"Also what's magical about Aug 12 2008, does it cross some boundary in a bit field, as when Windows hangs every 49.7 days ..
Computer Hangs After 49.7 Days
-- end quote --
If you're still running Windows 95/98, hanging every 50 odd days is the least of your worries.
not all time-related errors can be exposed by setting the clock forward
The first Patriot missile batteries deployed during Desert Unpleasantness Part I had a timing error that was only exposed if the system was left on for a sufficient length of time, allowing for decimal to binary fraction rounding errors to accumulate through repeated addition. This had at least two consequences:
1) The missile performed perfectly according to its lights, and went a number of meters to one side of the target instead of hitting it.
2) The magnificent explosions hailed by the media as Scud interceptions were really Patriot self-destructs to avoid mischief on ground impact.
The problem was later solved by a software update.
In this particular case, code inspection plus numerical analysis might have reasonably been expected to reveal the problem.
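A rough sketch of how that kind of drift builds up. The comment above gives no numbers, so the 0.1 second tick, the 23 fractional bits and the truncation are all assumptions borrowed from the commonly quoted account of the fault, and the function names are invented for the example.

```python
from fractions import Fraction

# Assumptions (not from the comment): a 0.1 s tick stored with 23 fractional bits, truncated.
FRACTIONAL_BITS = 23
TICK = Fraction(1, 10)


def truncated(value: Fraction, bits: int = FRACTIONAL_BITS) -> Fraction:
    """Chop a value to a binary fraction with the given number of fractional bits."""
    scale = 2 ** bits
    return Fraction(int(value * scale), scale)


def drift_seconds(hours_up: int) -> float:
    """Clock error after counting truncated ticks for the given uptime in hours."""
    ticks = hours_up * 3600 * 10
    # Each tick is short by the same tiny amount, so the error grows linearly with uptime.
    return float(ticks * (TICK - truncated(TICK)))


print(round(drift_seconds(100), 2))  # 0.34
```

Setting the clock forward would not show this: the error depends on how long the counter has been running, not on what the date is.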
Marketing manager asleep at the wheel?
What is disturbing to me is Niemer’s cavalier attitude that nothing major is wrong and that if your organization is affected it’s your own fault for trusting their software. Personally, I would have liked to have heard VM’s marketing manager explain how important their customers are, how seriously they take any problem and how they will spare no resources in fixing the problem. Niemer left me with the impression that if they can find the problem and fix it they will, but otherwise they’re not going to lose any sleep over it.
Hands up those who have ever seen a test system running in shifted time...
...it just doesn't make sense to do that. Takes up a server, which may be Big Iron (thus costly), it won't run all the production stuff anyway and then what kind of bugs is one supposed to catch? And would one even recognize them? One might as well test the CPU adder circuit.
Hell, anyone who has been through a Y2K planning session knows the glazed look across the room when the questions "so what are we looking for" and "so what is the test plan and where are the people to implement it" come up. And in that case, the exact moments of interest were actually known.
Systems running on shifted date+time
In any worthwhile application suite where dates are of any great significance (where shift changes matter, week/month/quarter/year ends matter, leap years matter, etc), the application date (and time) should arguably be isolatable from the OS date+time, specifically so the application's date+time handling can be properly tested without screwing up date+time on the rest of the system.
But where the application design doesn't permit that, you fiddle with the OS date+time for those tests where it really truly matters (or, occasionally where appropriate and available, use a bit of clever software that intercepts selected date+time related system calls without actually really changing the system-wide OS date+time).
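One way to keep the application's date+time separate from the OS clock is to hand the application a clock rather than letting it ask the OS directly. A minimal sketch, with made-up names (Clock, ShiftReport, fixed_clock) and a toy month-end rule standing in for real business logic:

```python
from datetime import datetime, timedelta
from typing import Callable

Clock = Callable[[], datetime]  # hypothetical type: anything that returns "now"


class ShiftReport:
    """Toy application object that asks an injected clock for the time, not the OS."""

    def __init__(self, clock: Clock = datetime.now):
        self._clock = clock

    def is_month_end(self) -> bool:
        """True when tomorrow falls in a different month from today."""
        today = self._clock()
        return (today + timedelta(days=1)).month != today.month


def fixed_clock(moment: datetime) -> Clock:
    """Build a clock that always reports the same moment, for tests."""
    return lambda: moment


# Leap-year month end, checked without touching the system clock.
print(ShiftReport(fixed_clock(datetime(2008, 2, 29, 23, 0))).is_month_end())  # True
print(ShiftReport(fixed_clock(datetime(2008, 2, 28, 23, 0))).is_month_end())  # False
```

In production the default datetime.now is used; only the tests supply a shifted clock, so nothing else on the box sees a wrong date.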
There's no guarantee that such testing would have spotted whatever caused today's VMware hiccup; competent code review sounds more promising.
Another reason for shifted time testing is the small matter of the transitions to and from daylight savings time, especially in applications which may be used across multiple time zones, zones which may not all be changing at the same time, and some of which zones may not even use whole-hour offsets. Maybe here you *do* want the OS to be running on the relevant date+time.
Otherwise you can take the Microsoft/VMware-compatible approach you seem to recommend: write the code, take the money, ship it, and hope.
Have we all done our Y2038 testing by now?
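For anyone who hasn't looked it up: the Y2038 moment is where a signed 32-bit count of seconds since 1 January 1970 UTC runs out. A small sketch of that calculation (the function name is just for illustration):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2 ** 31 - 1  # largest value a signed 32-bit seconds counter can hold


def last_representable() -> datetime:
    """Last UTC moment a signed 32-bit seconds-since-1970 counter can represent."""
    return EPOCH + timedelta(seconds=MAX_INT32)


print(last_representable().isoformat())  # 2038-01-19T03:14:07+00:00
```

A shifted-time test box set just either side of that moment is the obvious place to start.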
So at least 2.5k people have worked out a reason to use a VM... please enlighten those of us who don't think having yet another layer of slow software in the way is a good idea for anything close to production.
I know the VM chaps that keep bugging me in the day-job wheel out a huge list of so called benefits, but none of them seem to stand up to any real serious scrutiny.
* Cost - what's that? Equipment is a business expense, and commodity Iron is cheap.
* Isolation - Err that's otherwise known as chroot, permissions, security.
* Standardisation - Err buy the same Kit / OS (now that's standardisation done properly).
* Consolidation - Err don't buy too much crap in the first place.
* Testing - I'll give them that one, they are slightly useful for testing.
* Mobility - Err that's called redundancy (hot/cold-spare) in the real biz world, or a Disaster Recovery plan. Or better still Load Balanced with capacity.
* Hardware Support - Err that's why you choose your hardware carefully, and even more carefully choose the OS with the driver support. Come on dummies.
One of the more startling problems with VMs that the sellers of VMs neglect to mention is that by using VMs you have all your eggs in one basket. Now that is dangerous.
In my day-job VM's were considered by the high'n'mighty, but I soon put the kybosh on that with some well placed questions to the VM software sales / technical meetings. Everyone came away knowing VM's are for companies that are downsizing.
I have advocated for 20 years that Redundancy & Resilience can not be met with any of the newfangled stuff that comes to market. Good old fashioned planning and preparation is what counts, not being able to move an OS from one box to another because the first has died - Hey, isn't that a Hot/Cold Spare? ... so why have Slow-ware (that's VMs to the uninitiated) in the way?
Guys, invest in a Load Balancer (they call them Application Switches now BTW), you won't regret it, and with a little bit of programming thought, your programmers will see the benefit of being able to scale-out in a very big way.
I know VMs are not the way forward, it's a shame so many others have yet to discover this :(
And no, clustering ain't the answer for the other end of the spectrum (nor cloud computing).
Good luck suckers, you'll need it with any Slow-ware.
@@AC, Re: benefits of virtualization ...
Win98, nice, love it, however not used now for about 9 weeks. 50 days, mmmm, normally the memory leak gets you first lol
Will VM it sometime.
Funny thing is the microsoft site lists "Microsoft Windows 98 Standard Edition" not the first time I have seen that there.
BTW if you are from Microsoft SE is Second Edition
Cost? The power costs alone from running one box rather than 20 are a good reason.
Every single one of the Fortune 100 companies has adopted VMware. And they are not stupid, but I think you are.....
Paris, for she has as many brain cells as you