Flash array supplier Kaminario says its new K2 arrays will last for seven years - and claims performance drop-off during a system failure will never be more than 25 per cent. The Performance Consistency Guarantee says Kaminario will send its customers additional array capacity if performance does drop off more than 25 per cent …
Smoke, mirrors and IT sales
That is all.
why log-structured? (and failure cases)
We can't see why this has anything to do with sustaining a level of performance during a system failure, but maybe Reg readers can.
Have a look at the Wikipedia page for "write amplification" to get an idea of what the problem with traditional uses of flash storage is. In a nutshell, writers and updaters of the memory tend to treat it as a normal random-access memory. However, since flash needs to be updated many blocks at a time (in a unit called a "page", I think they call it), if you've just changed one block, then you need to read in all the other blocks in that page, update it and write it back out to a fresh place on the disk. If the power fails in the middle of all of this then it can be tricky to figure out exactly which blocks are now good. Worse, since the R/W pattern tends to be random, other files can be sharing the same page, so any corruption will not necessarily be limited to just the file (or chunk of a database table, etc.) that was being updated at the time.
With log-structured databases, you just imagine the whole disk to be like a circular list. In the simplest case, you just push stuff on the end of it and if there's a power failure, you just rescan the whole list from start to finish and delete any uncommitted writes. Of course, it's more complicated than that since O(n) traversal just to find some bit of info on the disk isn't practical, so most log-structured dbs will have some sort of compaction and indexing threads going in the background. Also, updates are generally timestamped so that later writes in the list override previous values. They'll also generally keep as much of the indexes in RAM as possible so that (notwithstanding initial delay when reading this in from the flash at startup) it's efficient to find the data you're looking for (and writes/updates generally simply involve writing to the head of the circular list, so it's O(1)).
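To make the idea above concrete, here is a minimal sketch of an append-only store with an in-RAM index, crash recovery by rescanning, and compaction. It is a toy held in a Python list rather than on flash, and the class and method names (`AppendOnlyLog`, `put`, `recover`, `compact`) are made up for illustration; it is not how any particular product does it.

```python
class AppendOnlyLog:
    """Toy log-structured key-value store held in a Python list (illustrative only)."""

    def __init__(self):
        self.log = []    # append-only records: (seq, key, value, committed)
        self.index = {}  # in-RAM index: key -> position of the latest committed record
        self.seq = 0     # monotonically increasing stand-in for a timestamp

    def put(self, key, value, committed=True):
        # A write only ever appends at the head of the log, so it is O(1).
        self.seq += 1
        self.log.append((self.seq, key, value, committed))
        if committed:
            self.index[key] = len(self.log) - 1

    def get(self, key):
        # Lookups go through the in-RAM index instead of scanning the log.
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][2]

    def recover(self):
        # After a crash: rescan the log, drop uncommitted records, rebuild the index.
        self.log = [rec for rec in self.log if rec[3]]
        self.index = {}
        for pos, (_seq, key, _value, _committed) in enumerate(self.log):
            self.index[key] = pos  # later records override earlier ones

    def compact(self):
        # Compaction: keep only the latest committed record for each key.
        live = sorted(self.index.values())
        self.log = [self.log[pos] for pos in live]
        self.index = {rec[1]: pos for pos, rec in enumerate(self.log)}
```

A real system would run `compact` in the background and persist the log, but the shape is the same: writes append, reads use the index, recovery is a rescan.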
A quick search for log-structured databases and file systems throws up examples such as Log-Structured Merge Tree (LSM-Tree), Riak's Bitcask, Logbase, Fawn-KV and SILT (Single Index, Large Table, IIRC). Any of the technical papers describing those will most likely explain why log-structured is the way to go with flash-based storage. Maybe my explanation above is enough, though... but definitely read the wiki page on write amplification and things should make a lot more sense.
Oh, just one other point... your actual question was to do with performance after a failure. Chances are they use something like FAWN-KV or SILT: some redundancy is built in, so that there will be backup "silos" for storing the data (much like RAID replication). Using a Distributed Hash Table (DHT) lets all the silos effectively share a common key space, and if one of them goes down, then collectively they can switch over to the alternates, while in the background they'll repartition the DHT space to account for genuine hardware failures (as opposed to transient errors). You'd have to delve into some of the papers on the above systems and (if they exist) the ones describing Kaminario's implementation in particular, but I'd guess that's what they're doing and what they mean by "sustaining performance during system failure".
So wrong I don't even know where to begin.
Frumious' explanation is wrong in so many ways. You don't have to read in the rest of the page and write it back to change one user data block's portion of the page; you write the changed user block, along with other changed user data, to a new page. No one (I hope) actually rewrites whole flash blocks or pages just to change one piece of user data any more. Maybe 10 years ago they did.
And all log-structured systems do is trade garbage collection performed by the SSD for log cleaning performed by the log file system. If write amplification is defined as the number of flash writes divided by the number of user write operations, it doesn't matter that the log system writes sequentially; it will still rewrite the flash multiple times.
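As a rough illustration of that ratio, here is a sketch using a deliberately simplified model of log cleaning: assume every segment the cleaner picks has the same fraction of live data, which must be copied forward before the remaining space can take new user writes. The function names and the uniform-utilisation assumption are mine, for illustration only.

```python
def write_amplification(flash_writes, user_writes):
    # Ratio of writes that actually hit the flash to writes the user asked for.
    return flash_writes / user_writes


def cleaning_write_amplification(live_fraction):
    # Simplified model (assumption): cleaning one segment copies its live data
    # forward and frees the remainder for new user data.
    if not 0 <= live_fraction < 1:
        raise ValueError("live_fraction must be in [0, 1)")
    copied = live_fraction        # live data rewritten by the cleaner
    new_data = 1 - live_fraction  # space freed for new user writes
    return write_amplification(copied + new_data, new_data)


print(cleaning_write_amplification(0.5))   # 2.0
print(cleaning_write_amplification(0.75))  # 4.0
```

Under this model the cost depends only on how full the cleaned segments are, not on whether the writes are sequential.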
It also doesn't look like Frumious really knows how log systems work. (FYI, I had my grad OS course back in the day from one of the BSD LFS authors, so I have a little bit of a clue here.) And the last two paragraphs seem like so much gibberish that I almost wonder if they were produced by some automated buzzword generation algorithm.
Oh, and as for a 7-year guarantee, what business is going to keep around a system which is fully depreciated and which can be replaced by a new system taking up 100 times less space and power? Think about it: with density basically doubling each year, in 7 years you will store in 1U what today takes 2 full racks.
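The arithmetic behind that claim can be checked quickly. The sketch below assumes a standard 42U rack, which is not stated in the comment, and takes the "doubling each year" figure at face value.

```python
RACK_UNITS = 42  # assumption: a full rack is 42U


def future_rack_units(units_today, years, growth_per_year=2):
    # Space needed for the same capacity after `years` of density growth.
    return units_today / (growth_per_year ** years)


growth = 2 ** 7                                   # 128x denser after 7 doublings
needed = future_rack_units(2 * RACK_UNITS, years=7)
print(growth, needed)                             # 128 0.65625
```

So two full racks today would fit in under 1U, which is in line with the "100 times less space" figure.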
So wrong ...
I'm not sure exactly what you're trying to point out as being wrong, apart from what I said about rewriting full pages. To be honest, as I started to write I had a different idea about what the article's author was asking us, and by the end of it I figured he was asking about something slightly different, which I answered with my last paragraph of "gibberish".
The essential idea I was trying to get across at the start was that with flash-based systems you need different strategies for updating data on disk than with traditional block-based storage. You can't just update a structure like a B-Tree or a directory entry in situ because of the penalty that flash memory as a medium imposes on you. I don't disagree with you that we don't use a naive approach for updating a single block in this case. You're totally right to say that instead we group updates and write them all in a single page. But this has implications for filesystem integrity. If you can't mark the original data as obsolete and you can't just erase that whole page, then how do you (a) know which copy is the correct one, and (b) handle problems like loss of power while writing the update? That's why I mentioned timestamps and periodic compaction. That's all I can really say on that because I'm not really sure where I went wrong in explaining it.
Maybe it's the last two paragraphs, but the last paragraph is, I think, the key point I was trying to make. Up to that I was trying to explain the problems with error recovery at the flash level (which implements its own log-structured storage system at the firmware level, as you say), but what I think this Kaminario system is describing is more like FAWN-KV and SILT. Those approaches use relatively large in-memory indexes to find data values on flash, and store all the data (including indexes) in a log-structured storage system on flash. FAWN-KV, in particular, looks a lot like the diagram, which shows each block spread across multiple nodes. The way this is usually done (and is done in FAWN-KV) is to use consistent hashing to spread the data across several nodes/silos. FAWN-KV also includes replication, so that a single hash key is stored to more than one node/silo. That's the essential point I was trying to make regarding node failure and recovery from it: FAWN-KV can recover from this quickly in the short term because an alternate node/silo is there to provide a backup copy of the data, although repartitioning the hash scheme (with associated costs of moving the actual data across nodes) will be necessary in the longer term if a node is really dead.
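For anyone who wants to see the consistent hashing and replication idea in miniature, here is a rough sketch. It is a generic hash ring, not FAWN-KV's or Kaminario's actual code; the `HashRing` class, its methods and the node names are hypothetical, and it leaves out virtual nodes and the data movement that real repartitioning involves.

```python
import bisect
import hashlib


def _hash(value):
    # Map a string to a position on the ring.
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring where each key lives on `replicas` nodes."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((_hash(node), node) for node in nodes)

    def owners(self, key, down=()):
        # Walk clockwise from the key's position and collect the first
        # `replicas` distinct nodes, then skip any that are currently down.
        points = [point for point, _node in self.ring]
        start = bisect.bisect_right(points, _hash(key))
        found = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == self.replicas:
                break
        return [node for node in found if node not in down]

    def remove(self, node):
        # Longer-term repartitioning: drop a dead node from the ring.
        self.ring = [(point, name) for point, name in self.ring if name != node]


ring = HashRing(["silo-a", "silo-b", "silo-c", "silo-d"], replicas=2)
primary, backup = ring.owners("some-key")
# Short term: the primary is down, so reads fall through to the backup.
survivors = ring.owners("some-key", down={primary})
# Longer term: the dead node leaves the ring and the backup becomes primary.
ring.remove(primary)
new_owners = ring.owners("some-key")
```

The short-term failover costs nothing but a lookup; the expensive part in a real system is copying data to the new second owner after `remove`.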
The SILT paper has a section on extending their scheme to include crash tolerance/crash recovery, which, again, I think is what our author here was really trying to get his head around.