Flynn said apps simply read or write data from an area of memory, using CPU Load Store instructions.
Now where would I have seen something like this before? On an AS/400 maybe? I seem to remember the AS does something like that...
Fusion-io has achieved a billion IOPS from eight servers in a demonstration at the DEMO Enterprise event in San Francisco. The cracking performance needed just eight HP DL370 G6 servers, running Linux 2.6.35.6-45 on two, 6-core Intel processors, 96GB RAM. Each server was fitted with eight 2.4TB ioDrive2 Duo PCIE flash drives; …
My guess is that a user-mode application writes to memory directly (load and store), that the memory is mapped to the device via DMA, and that the DMA channel is managed by ioMemory. It's all one big guess, so would someone - Mr Wozniak? - please either confirm or refute? Also, what latencies are we talking about - under 1us?
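To make the load/store idea concrete, here is a minimal sketch using an ordinary file-backed memory mapping. It is not Fusion-io's ACM or ioMemory interface, which the thread does not describe; the function name `store_and_load` and the temporary backing file are assumptions for illustration only. The point is just that once a region is mapped, the application touches it like memory instead of issuing read/write calls for each transfer.

```python
import mmap
import os
import tempfile

CACHE_LINE = 64  # bytes; the transfer size discussed for the demo


def store_and_load(payload: bytes, offset: int = 0) -> bytes:
    """Write one cache-line-sized payload into a mapped region and read it back.

    Hypothetical helper: a plain temp file stands in for the flash device.
    """
    if len(payload) != CACHE_LINE:
        raise ValueError("payload must be exactly one cache line (64 bytes)")
    fd, path = tempfile.mkstemp()
    try:
        # Size the backing file to one page so the whole mapping is valid.
        os.ftruncate(fd, mmap.PAGESIZE)
        with mmap.mmap(fd, mmap.PAGESIZE) as region:
            # "Store": a slice assignment into the mapping, no write() call.
            region[offset:offset + CACHE_LINE] = payload
            # Ask the OS to push the dirty page to the backing file.
            region.flush()
            # "Load": a slice read from the mapping, no read() call.
            return bytes(region[offset:offset + CACHE_LINE])
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    data = store_and_load(b"A" * CACHE_LINE)
    print(len(data), "bytes round-tripped through the mapping")
```

In Python the slice operations still go through the interpreter, so this only illustrates the programming model, not the latency. Whether the real product maps device memory this way is exactly the open question above.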
Very exciting news, but not sure how practical this is if the whole solution is not mapped to some filesystem (implemented in userspace perhaps?).
dikrek, you have to leave the 4k paradigm behind because we're talking about a new type of memory, a nonvolatile storage class memory. Since we're talking about memory, the 64-byte packet size does make sense for their demo, considering that a cache line is 64 bytes.
Anonymous, their power-cut flush feature is nothing new. For example, HP's user guide for the IO Accelerator talks about it. Below is from the HP IO Accelerator Linux User Guide:
"The Remote Power Cut Module ensures in-flight writes are completed to NAND flash in these catastrophic scenarios. The Remote Power Cut Module is not required, but HP recommends the module. NOTE: The power cut feature is built into PCIe IO Accelerators; therefore, no Remote Power Cut Module is necessary."
Wouldn't database admins talk about TPS? I suppose you might hear a storage administrator who supports DBAs talking about IOPS.
If you were talking about RAM would you use 4k as your packet size? I would assume for single-thread receives, processes, and transmits you would want to use 64 bytes.
Maybe what is being missed, and what I found so interesting about this demo, is that it blended storage and memory concepts, or what IBM is calling Storage Class Memory. This was high-capacity non-volatile storage that was transferring hundreds of millions of packets per second at the CPU's cache line or data block size.
Ok, so based on the test configuration and specs posted by FusionIO, this does not make sense.
I checked the specs and the math on the servers and agree that this is possible to a volatile RAM store, no issue. However, as FusionIO notes, to be an I/O in the traditional sense this must go to non-volatile "storage" of some sort, and I understand that ACM is an OS bypass with smaller than typical I/O request sizes (64 bytes instead of 512 bytes or more).
The HP servers, the Intel chip set, the 6 channels to memory and the IOH on the system described are more than capable of 1 billion 64 byte transactions - no disagreement there.
However, the ioDrive2 Duo is capable of only 700K read IOPs and 937K write IOPs at 512 byte request sizes, per the FusionIO spec posted here:
http://www.fusionio.com/platforms/iodrive2-duo/
So, if we note that ACM uses only 64 bytes per transaction and allow an 8x multiplier for the smaller transfer size (a big assumption, but ok), we still come up short. That gives at most 8 x 937K write IOPs, or 7.5 million write IOPs per card.
As I understand, 8 of these x8 PCIe gen2 cards were loaded in each HP server. So, 8 x 7.5 million is 60 million IOPs per HP server. With 8 of these servers (clustered or operating independently, I'm not sure), this is only 480 million 64 byte IOPs!!
We are missing more than half of the claim based on the best case spec sheet that FusionIO provides.
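For anyone who wants to check the numbers, the calculation above can be written out as a few lines of Python. The 937K write figure is the spec-sheet value quoted above; the 8x multiplier for 64-byte transfers is the same assumption made above, not a published number, and the variable names are just for illustration.

```python
WRITE_IOPS_512B = 937_000       # per ioDrive2 Duo at 512 bytes, as quoted above
SIZE_MULTIPLIER = 512 // 64     # ASSUMPTION: 8x more operations at 64 bytes
CARDS_PER_SERVER = 8
SERVERS = 8
CLAIMED_IOPS = 1_000_000_000

per_card = WRITE_IOPS_512B * SIZE_MULTIPLIER    # 64-byte write IOPs per card
per_server = per_card * CARDS_PER_SERVER        # per HP server
total = per_server * SERVERS                    # all eight servers
shortfall = 1 - total / CLAIMED_IOPS            # fraction of the claim unexplained

print(f"per card:   {per_card:,}")      # per card:   7,496,000
print(f"per server: {per_server:,}")    # per server: 59,968,000
print(f"total:      {total:,}")         # total:      479,744,000
print(f"shortfall:  {shortfall:.0%}")   # shortfall:  52%
```

So the rounded 7.5 million, 60 million and 480 million figures hold up, and roughly 52% of the claimed billion is unaccounted for under these assumptions.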
Furthermore, to be a bit more critical, this is from 8 separate memory mapped nodes, not one giant shared memory, but ok, we can ignore that for the purpose of this question. I view this as showing 60 million IOPs per memory space, or even at best 125 million per memory space, which is impressive for NVM to be sure (even the 60 million).
My issue is that the spec for the cards described does not allow for 1 billion - only 480 million, unless there's more to this than has been shared, or the ioDrive2 Duo published spec on FusionIO's website is wrong (low values for IOPs), or some other non-obvious assumption is in play.
Can someone explain clearly in precise terms how this adds up to 1 billion IOPs?
If I were to purchase the described equipment, could I repeat this?
Could I get results from a trusted 3rd party benchmark like STREAM triad that would support the claim?
Does FusionIO consider an IOP a read-after-write, just a write, or just a read, or can more specifics be provided?
Perhaps it is time that new trusted benchmarks for this exciting new NVM architecture be developed by 3rd parties?
Is FusionIO willing to share their "Custom load generator that exercises memory-mapped I/O at a rate of approximately 125 million operations per second per server"? The claim would be more impressive if they would. Also, is the OS by-pass open source and/or can I run say MMIO on standard Linux to a RAM device and replicate?
I think some clarification should be provided.
Apologies in advance if my math is wrong, but it seems like a simple calculation if the bottleneck is the NAND NVM, as I suspect it is, and not the I/O hub, PCI-express, or the server node itself. With 8 servers x 8 cards = 64 cards, the NAND NVM cards must each be able to provide 1 billion / 64 IOPs, or roughly 16 million 64 byte IOPs each.
Perhaps FusionIO could put out an ACM spec to clear this up? Or open source it? It should be able to run to RAM just as well as NVM so the benchmark and OS by-pass can be understood independent of the ioDrive product.
It sounds like ACM IOPs are more than 8x the standard block/storage 512 byte I/Os?
How does that work? The multiplier seems to be much higher at 16x.
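Working backwards from the claim gives the multiplier directly. This is a rough sketch using only the spec-sheet figures quoted earlier in the thread (700K read, 937K write at 512 bytes) and the 8 servers x 8 cards configuration; the variable names are illustrative and nothing here comes from FusionIO's own ACM numbers.

```python
CLAIMED_IOPS = 1_000_000_000
TOTAL_CARDS = 8 * 8             # 8 servers x 8 ioDrive2 Duo cards each
READ_IOPS_512B = 700_000        # spec-sheet figure quoted in the thread
WRITE_IOPS_512B = 937_000       # spec-sheet figure quoted in the thread

# 64-byte operations each card must sustain to reach the claim.
required_per_card = CLAIMED_IOPS // TOTAL_CARDS

# How far above the 512-byte block spec that requirement sits.
write_multiplier = required_per_card / WRITE_IOPS_512B
read_multiplier = required_per_card / READ_IOPS_512B

print(f"required per card: {required_per_card:,}")   # required per card: 15,625,000
print(f"vs write spec:     {write_multiplier:.1f}x") # vs write spec:     16.7x
print(f"vs read spec:      {read_multiplier:.1f}x")  # vs read spec:      22.3x
```

That is where the roughly 16x comes from if the operations are writes. If they are reads, the gap against the block spec is larger still.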
If anyone from FusionIO wants to jump in and correct me that's fine. But it seems clear to me that we can't look at the Duo drive specs in the data sheet because those cards come default as block storage devices. This was a demonstration showing flash as a literal extension to memory. From CPU cache, to RAM, to Fusion flash but none of it going through the OS's storage subsystems.
Here is an article I found that quotes one of Fusion's technology architects and will give you a different perspective on how they view the world http://lwn.net/Articles/408428/
After reading this, I'm wondering what the lifetime of an SSD is before it fails.
Bypassing the OS for better I/O is nothing new; many databases can use a raw (their own I/O) or cooked (operating system) method to access the disc. Over time that has become less of an issue due to the speed of operating systems and discs. Now we have SSDs, which up the ante and offer another level of performance, at least until the CPUs and I/O buses exceed their speeds. But then there is nothing stopping somebody from doing an SSD/DDR memory slot adapter. Why not, if SSDs are that good and reliable? It would be cheaper, with no overhead of another API and methodology to learn. Makes you think about it more when you look at it like that. Disposable IT kit mentality was never my strong point though.
But if you're thinking about this type of kit, please think twice as long and use the money you save to pay your staff enough so they can configure your servers correctly without needing the cutting edge. Remember, cutting-edge kit can be replaced a lot more easily than cutting-edge staff. That you can take to the bank.