gold whop whop whop!
Listen and prepare to behold this vision. Storage arrays will become nearline vaults because storage memory will steal their primary data storage role. There is a thundering great divide, a Grand Canyon, between server memory in the single digit terabyte area and "storage": double digit terabytes and up to multi-petabytes of …
The 1st and Primary question
Do we "really" need so much data?
Some mildly interesting facts on this page about storage requirements in relation to how much space is required for all of the books in "the Library of Congress"
The first figure is the most astounding:
•“Every Six Hours, the NSA Gathers as Much Data as Is Stored in the Entire Library of Congress.”
Re: The 1st and Primary question
Packrat attitude mostly.
They'd rather sift through ten gallons of sludge than let the big one get away.
I remember when all application data was stored in non-volatile memory for processing. Back then it was called a Core Store. Perhaps someone will re-invent backing store sometime soon. Drums anyone?
Single level storage is a very OLD idea
This concept was first developed in-house by IBM in about 1970 as 'Future Series', originally intended to be a replacement for the System 360/370 mainframes. That project was axed in about 1972 before being resurrected in the late 1970s as System/38, which was on sale from 1979 and later morphed into the AS/400 range.
Its key feature was single level virtual storage. All RAM and all disk space was mapped into a single address space, so the only storage access method was virtual memory page reads and writes. There was no separate filing system as we know it because all files were in-memory structures. This worked well and was fast and reliable because RAID 5 disk arrays were used. Replacing disks was very easy - you just migrated disk-resident pages off a disk you wanted to replace. Adding a disk was even easier: plug it in and the load-balancing paging algorithm would start moving pages onto it.
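For anyone who hasn't met the idea, the nearest everyday analogue is probably a memory-mapped file. The sketch below is only that, an analogy on a modern OS and not how System/38 actually did it: the function name demo_single_level and the two-page file are made up for illustration. The program reads and writes what looks like ordinary memory, and the OS pager moves the pages to and from disk.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size of the host system


def demo_single_level(path):
    """Write to a file purely through memory accesses, then read it back."""
    # Create a backing file exactly two pages long.
    with open(path, "wb") as f:
        f.write(b"\x00" * (2 * PAGE))

    # Map the file into the address space and treat it as plain memory.
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 2 * PAGE) as mem:
            mem[0:5] = b"hello"            # lands in the first page
            mem[PAGE:PAGE + 5] = b"world"  # lands in the second page
            mem.flush()                    # ask the OS to write dirty pages out

    # Read the file the conventional way to confirm the pages reached disk.
    with open(path, "rb") as f:
        data = f.read()
    return data[0:5], data[PAGE:PAGE + 5]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo_single_level(os.path.join(d, "store.bin")))
```

There is no read() or write() call on the data path inside the mapping, which is the part that resembles single level storage.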
I'm not particularly an IBM fan, but this was one bit of hardware architecture that they got right.
But this future will never come
RAM sizes today mean that I can use in-memory databases for the kind of data problems I was handling in the 90's. I remember my first *million* row database. How quaint it seems today.
Just as system memory sizes have grown over the decades, the size of the problem that they *can* address has also grown, so we are today still surrounded by spinning rust.
Tomorrow will be the same. Just as today's systems are orders of magnitude greater than a decade ago, BigData is an orders of magnitude greater problem which will continue to require spinning rust.
So I'm not so hopeful that it'll all fit in memory...
Re: But this future will never come
So I'm not so hopeful that it'll all fit in memory...
There's been a trend in research systems at least towards looking at using RAM to store index information while delegating actual data storage to (flash) disks. FAWN-DS (Fast Array of Wimpy Nodes Data Store), for example, reduces the amount of RAM used by each index entry to 6 bytes, while SILT (Small Index, Large Table) achieves even more compression of those index data (somewhere between 1.5 and 2.5 bytes per index entry, iirc). It also helps that these systems are designed from the ground up to work well with flash storage and avoid the write amplification problem (where a single write requires several physical writes due to the need to rewrite entire memory blocks when a single page changes). I'm not sure how many of these design features are implemented in today's commercial-grade systems (like Hadoop's file system) but I'd wager that there are more similarities than differences.
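To make the "index in RAM, data on flash" split concrete, here is a toy sketch. It is not FAWN-DS or SILT and makes no attempt at their compact index encodings: the class name LogStore, the record layout and the use of an in-memory BytesIO as a stand-in for the flash device are all my own assumptions. Values are only ever appended to a log, so nothing is rewritten in place, and RAM holds just a key-to-offset map.

```python
import io
import struct


class LogStore:
    """Toy append-only store: values live in a log, RAM holds key -> offset."""

    HEADER = struct.Struct("<II")  # key length, value length

    def __init__(self, log):
        self.log = log    # any seekable binary file object (stand-in for flash)
        self.index = {}   # in-RAM index: key -> offset of the newest record

    def put(self, key, value):
        # Always append; older versions of the key stay in the log untouched.
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(self.HEADER.pack(len(key), len(value)))
        self.log.write(key)
        self.log.write(value)
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        # One seek to the record, then skip the key and read the value.
        self.log.seek(offset)
        klen, vlen = self.HEADER.unpack(self.log.read(self.HEADER.size))
        self.log.seek(klen, io.SEEK_CUR)
        return self.log.read(vlen)


if __name__ == "__main__":
    store = LogStore(io.BytesIO())
    store.put(b"user:1", b"alice")
    store.put(b"user:1", b"alice-v2")  # overwrite is just another append
    print(store.get(b"user:1"))
```

A real system would also need garbage collection of stale records and a far smaller index entry than a Python dict gives you, which is where the research effort goes.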
If you add to this the fact that clustering your storage nodes is relatively easy using consistent hashing (or a DHT) to spread the storage across many nodes/controllers each with their own RAM and local storage, then I think that such a future is actually quite practical today. A lot more practical than you think.
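For the consistent hashing part, a minimal sketch follows. The class name HashRing, the choice of SHA-256 and the 64 virtual points per node are illustrative assumptions, not taken from any particular product. Each node owns many points on a ring, and a key belongs to the first node point at or after the key's hash.

```python
import bisect
import hashlib


def _point(text):
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes  # virtual points per node, smooths the spread
        self._ring = []       # sorted list of (point, node) tuples
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring entry whose point is >= the key's point, wrapping around.
        i = bisect.bisect_left(self._ring, (_point(key),))
        return self._ring[i % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.lookup("some-object-key"))
```

The useful property is that removing a node only moves the keys that node owned; every other key stays where it was, so adding or retiring storage nodes does not reshuffle the whole cluster.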
Layers upon layers
All that's happening is the next step in an ongoing evolutionary process. Over the past few decades, the number of intermediate steps between slow storage and fast compute has been growing, with on-die CPU cache, level 2 cache, level 3 cache, system RAM, HBA/controller caching, onboard flash cache, storage array cache, on-drive cache, and now array flash storage providing yet another layer designed to improve the speed of transfer from static storage to active compute. The slowest storage has essentially stagnated, from a speed perspective, merely growing in capacity. The next tier up, "fast" spinning disk, is itself turning into yet another intermediary layer for staging data.
All any of this means is the same as it always has: ultimately, the goal is to touch the disk as little as possible and keep the relatively small amount of data you're actually using somewhere else.
Keep It Up...
At this rate, Skynet is almost a certainty. And it will know which hand you use to wipe!
Big question for everyone...
Is 64 bits of address enough?
You only get a range of 16 exbibytes (2^64 bytes, about 1.8 x 10^19).
You never know? IPv6 allows for 128 bits of address, about 3.4 x 10^38, so maybe not?
Of course there is the guy who wanted to be paid by grains on a chess board, one grain then two, then doubling for every succeeding one. Problem was there weren't enough grains in the world to satisfy him, and (as the story goes) he lost his head. Never mind.
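For anyone who wants to check the sums, here is a quick sketch using Python's arbitrary-precision integers. The constant and function names are just labels I picked. A byte-addressed 64-bit space is 2^64 bytes, which works out to 16 EiB, and the chessboard total is one grain short of that same number.

```python
EIB = 2 ** 60          # one exbibyte in bytes

ADDR_64 = 2 ** 64      # distinct byte addresses with 64 address bits
ADDR_128 = 2 ** 128    # distinct addresses with 128 bits (IPv6-sized)


def chessboard_grains(squares=64):
    """Total grains when square i holds 2**i grains (1, 2, 4, ...)."""
    return sum(2 ** i for i in range(squares))


if __name__ == "__main__":
    print(f"64-bit space:  {ADDR_64} bytes = {ADDR_64 // EIB} EiB ({ADDR_64:.2e})")
    print(f"128-bit space: {ADDR_128:.2e} addresses")
    print(f"chessboard:    {chessboard_grains()} grains")
```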
"The only way to deal with slow access to stored data is to take it out of storage altogether and dump it in memory instead."
Hundreds of Gigabytes of storage in memory. Can anyone say stupidly slow read times?