Utility IT service supplier Flexiant has blamed intermittently slow service over the past two weeks on slow accesses to disks on an Oracle ZFS-based 7400 storage array. There has been slow communication between virtual machines on the servers and the 7400 storage. It is being said that some customers' servers have effectively …
Run, Forrest, run.....
This story basically confirms that "2 weeks of downtime" was unheard of in this industry until the Oracle/Sun 7000 series came along.
Personally I believe this claim, because I think ZFS is still not mature enough and we experienced a specific issue on a 7410C that was literally down for 3 days just from trying to delete a dedup-enabled LUN. Yes, you heard me right: one click to delete a LUN could bring the 7410 cluster down for 3 days, which is why I believe it could cause 2 weeks of downtime for another end user....
I guess the end result will again point to the ZFS module (akd)...... then another software upgrade with new MAJOR bugs waiting to be found....
And then the sales team will tell you that you should never expect 99.8% uptime from this kind of product, and that you should set your expectations somewhere near five eights (88.888%) instead of five nines....
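For scale, here is a rough sketch of what those percentages mean in downtime per year. It assumes a plain 365-day year, and the function name is purely illustrative, not anything from a vendor SLA:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years (an assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of downtime per year implied by a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [
        ("five nines", 99.999),
        ("99.8%", 99.8),
        ("five eights", 88.888),
    ]:
        hours = downtime_hours_per_year(pct)
        print(f"{label}: {hours:.2f} hours/year ({hours / 24:.2f} days)")
```

Five nines works out to roughly five minutes a year, 99.8% to about 17.5 hours, and "five eights" to around 40 days. Two weeks of outage in one year is already well below 99.8%.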
By the way, we have since moved off the 7410C and onto NetApp.
akd != ZFS
While I agree that the 7410 has had a lot of serious suckiness, almost all of it is down to the akd management system being just dumb in design/implementation. That is not a ZFS issue.
ZFS is the file system, and while it may be the underlying reason for your down-time deleting a LUN, far more likely is a fu*k-up with akd. Again.
That is cutting edge!
Not sure I'd trust zfs to that degree just yet but it is very nice technology.
See what it's like in a year.
Can't believe they are still downplaying the problem.
Perhaps Mr Bligh is not kept informed by his staff, or doesn't spend enough time at the coal face?
I am currently just about connected to a single-CPU Debian server on light duties, running with a load average of 154 @ 16:20. Other than that session, it's been inaccessible since this morning, and I have not seen the server running at a sane load average for more than an hour or two since this problem began. In between, it tends to sit above 2, and frequently rises to 3-6.
It is this high load average, frequently causing actual outages lasting several hours (not one hour as they claim), multiple times a day, which leads me to say confidently that the platform has been essentially unusable since I reported the issue and, as near as I can tell, since the problem first emerged two weeks ago (in between I was assuming I had a problem on my own server).
Last night the server was so badly affected that files were disappearing before my eyes, which was pretty scary (though fortunately it turned out they were still in storage). That culminated in the server locking up, requiring a kill which has corrupted our database. It hasn't been up long enough since to attempt a repair.
Even trying to move my data elsewhere has proved impossible, as the server doesn't stay up long enough to transfer the data. We have a backup, but that's on the same platform, as it's primarily still a development server.
I've been told moving my server to different hardware would not make any difference, so presumably this affects all customers.
I reported this over a week and a half ago and have been chasing continuously since. By the way, Mr Bligh, feel free to call me if there is any confusion over that: ticket #9793.
The whole time, I've felt they have tried to spin/downplay the size/impact of the problem, and that appears to be the option they have chosen in responding to this article. I am amazed that in this day and age a CEO will come out with nonsense like this without even checking the facts properly.
[stop press - load is now at 187 and rising @ 16:53.]
ZFS is fine as long as you don't use RAIDZ
We've seen this for about three years now, and still no fix in sight. Basically, the minute you turn on RAID you can throw predictable latency out the window.
Hmm, when Oracle on a certified, supported/stable Linux platform went belly up once a day (memory leak in Oracle itself), Oracle's fix (after two weeks of support hell) was to "reboot at midnight". They then closed the support ticket!
Perhaps they should include "it's not our fault" on their logo! It seems downtime is never due to Oracle being a buggy, memory-leaking mess.
These days we use PostgreSQL. It has its problems, but at least if something goes wrong we can delve into the source (and usually sort it ourselves) or hire a Pg internals specialist if need be. Oracle support is and always will be a joke.
Quick, get the Fishworks team to look at it!
Oh, they left, didn't they?
7410 and iSCSI do not mix well
From the specs on their website: "These provide iSCSI LUNs to the end-user operating systems, looking just like physical disks".
In our experience the 7410's external ZFS slog on SSD (logzilla), which is designed to accelerate and quickly acknowledge sync writes, quickly gets overrun by iSCSI write traffic and then spends all of its time flushing to disk, slowing everything down.
Tread carefully with these devices and what part of the business you are willing to bet on them.