We've been hearing about a bug in EMC's VNX2 software which causes it to reboot every 80 days. Tweets like this one have been appearing like canaries in a coalmine: @TheJasonNash I just bought a Vnx28000 and expect another one coming soon... Anyways, looking for info on the vnx reboot issue? New to EMC — Nathan Daggett (@ …
80 days... is that all?
You've got to wonder what sort of testing has been done on an enterprise product that reliably crashes after 80 days uptime.
Re: 80 days... is that all?
Of course it sounds pretty funny... you have to reset to avoid unexpected resets...
But I don't believe EMC tested the new-generation VNX for less than 80 days. It sounds like a very familiar bug, annoyingly common today in enterprise systems: an unexpected reboot after a fixed time frame. I had the same problem with Brocade switches, and I believe all those bugs have a common root.
Believing that today's enterprise systems are buggier than yesterday's is for very young systems engineers... I could write a book about this kind of issue from my career, which started a lot of years ago.
> All VNX2 models appear to be affected: VNXe3100 and 3300, VNX 5100, 5300, 5500, 5700 and 7500.
OK, but those are all 1st gen VNX models, not VNX2!
I'd hope EMC don't still have issues like this on the first generation VNX, but regardless the Bull support page linked specifically mentions VNX2 as being the affected platform.
"URGENT : Unexpected SP panic on new VNX2 every 80 days"
Sounds from the article like they've now patched it? The workaround is effectively to reboot it before it reboots itself. Note the 30-minute offset between reboots: each reboot resets the counter, so if you don't manage to apply the patch, at least the controllers will next reboot 30 minutes apart rather than together.
Not very enterprise though, having to lose half your processing power just to reset a ticking time bomb (counter).
My switches used to reboot after 497 days of uptime, due to the Linux uptime counter rolling over, though that bug was fixed about 6 years ago I think. The Linux uptime counter doesn't roll over at 497 days anymore either.
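For anyone wondering where 497 days comes from, here is a quick back-of-the-envelope sketch. It assumes the counter is a 32-bit tick count advancing 100 times per second, which is the usual explanation for this figure; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def rollover_days(counter_bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of the given width wraps back to zero."""
    return (2 ** counter_bits) / ticks_per_second / SECONDS_PER_DAY


# Assumed: 32-bit counter, 100 ticks per second.
print(round(rollover_days(32, 100), 1))  # 497.1
```

A wider counter or a lower tick rate pushes the wrap further out, which is why the fix for this class of bug is usually a 64-bit counter or wrap-safe comparisons.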
I'll take "Memory leak" for $400, Alex
If you're stuck with one of these things, it's probably a good idea to time those reboots for when the SPs are running < 50% load. Only high school football teams can perform at 110%....storage arrays cannot. And I agree with the AC above - is the problem the zillions of lines of the 24 year olde Clariion code, or the 64 bit parts that were strapped onto it so that marketing could call it 'VNX2'?
Re: I'll take "Memory leak" for $400, Alex
Here's the cause, from the horse's mouth:
A logic error in 64-bit math causes a timer overflow within each Storage Processor, resulting in a first stage WatchDog [WDT] panic (which results in an NMI).
Software periodically requests the number of microseconds since boot. This information is continuously fed to another component as an indication that the VNX OE software is not hung. This aids in protection against starvation. When the overflow issue is encountered (every ~80 days), it disrupts the software that is meant to identify both software and hardware hangs. The result is that VNX OE software believes the Storage Processor is hung, resulting in the (deliberate) WatchDog panic.
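To make that a bit more concrete, here is a toy sketch of how a 64-bit overflow in a "microseconds since boot" calculation can fool a hang detector. This is not EMC's code. The multiply-before-divide conversion, the function names, and the tick rate of 2,670,000 per second are all assumptions, with the rate picked only so that the toy overflow lands near 80 days.

```python
MASK64 = (1 << 64) - 1          # emulate unsigned 64-bit wraparound
SECONDS_PER_DAY = 86_400
ASSUMED_TICKS_PER_SECOND = 2_670_000  # hypothetical, not a real VNX value


def micros_buggy(ticks: int, rate: int) -> int:
    """Multiply first: the intermediate product wraps at 2**64."""
    return ((ticks * 1_000_000) & MASK64) // rate


def micros_safe(ticks: int, rate: int) -> int:
    """Split into whole seconds and remainder so nothing overflows early."""
    seconds, remainder = divmod(ticks, rate)
    return seconds * 1_000_000 + remainder * 1_000_000 // rate


def days_until_wrap(rate: int) -> float:
    """Uptime at which ticks * 1_000_000 no longer fits in 64 bits."""
    return (1 << 64) / 1_000_000 / rate / SECONDS_PER_DAY


def looks_hung(previous_us: int, current_us: int) -> bool:
    """Toy watchdog check: time that fails to advance means 'hung'."""
    return current_us <= previous_us


wrap_tick = -(-(1 << 64) // 1_000_000)   # first tick count that overflows
before, after = wrap_tick - 1000, wrap_tick + 1000
rate = ASSUMED_TICKS_PER_SECOND

print(round(days_until_wrap(rate), 1))
print(looks_hung(micros_buggy(before, rate), micros_buggy(after, rate)))
print(looks_hung(micros_safe(before, rate), micros_safe(after, rate)))
```

With the buggy conversion the reported time jumps back to nearly zero at the wrap, so the check decides the processor is hung even though it is perfectly healthy. The safe conversion keeps advancing and the check stays quiet.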
Avamar had an issue like this, but more related to its Linux OS. Pretty much had to do a reboot each year, until the appropriate patches were installed. A few other EMC products, also running under Linux, had this issue.
Luckily for EMC, they are aware of which systems and customer are affected, and were proactive about the fix.
This one, though, is a bit more extreme at only 80 days. Seems like QA as of late is taking a dump here. This, on top of the Avamar/Data Domain issues: what a pain in the arse.
Tech Support Call
TECH SUPPORT: Sir, I'm going to need you to unplug the VNX2 for 30 seconds and plug it back in again
USER: This isn't a cable box you know?
The next version of VNX will include a big, circular reset button in the middle
About ten years ago the popular IBM DS8000 disk system had a similar issue. My memory tells me EMC attacked the product at the time because of that, long after the problem was fixed. The VNX2 bug indicates the law of karma may well be operating.