Cache data where it is most effective
Yes, but caching can be done on SAN clients as well as on arrays. Any application that runs against a single-mount filesystem can cache data locally to reduce re-reads. Aggregate local cache scales up easily with the number of SAN clients, and filling out DIMM slots with the best bang-for-buck module size is a cheap way to buy cache. And yes, if the data is on a cluster filesystem, then the benefits of client caching depend a lot more on the type of app and the particular filesystem. Oracle RAC, for instance, manages cache coherence across multiple clients on shared database files at the application level, bypasses OS caching altogether, and AFAIK can pass cached data from one RAC node to another across a fast interconnect like InfiniBand rather than making the second client read from shared database storage. In fact, over InfiniBand the requesting client may not even have to make a system call to receive the data.
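To make the re-read point concrete, here is a minimal sketch of a client-side LRU block cache. Everything in it is hypothetical: `BlockCache`, `fake_san_read` and the 4 KiB block size are stand-ins for illustration, not any real filesystem's or HBA driver's API. It only shows that repeated reads of a small working set stop reaching shared storage once the blocks are in local RAM.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny LRU read cache keyed by block number (illustrative only)."""

    def __init__(self, read_block, capacity_blocks):
        self.read_block = read_block      # callable that fetches a block from shared storage
        self.capacity = capacity_blocks   # how many blocks fit in local RAM
        self.blocks = OrderedDict()       # block_no -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_no):
        if block_no in self.blocks:
            # Hit: mark as most recently used and serve from local RAM.
            self.blocks.move_to_end(block_no)
            self.hits += 1
            return self.blocks[block_no]
        # Miss: go to the SAN, then remember the block.
        self.misses += 1
        data = self.read_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used block
        return data


def demo_reread():
    san_reads = []

    def fake_san_read(block_no):
        # Hypothetical stand-in for a read that crosses the SAN.
        san_reads.append(block_no)
        return b"x" * 4096

    cache = BlockCache(fake_san_read, capacity_blocks=4)
    for block_no in [0, 1, 2, 0, 1, 2, 0, 1, 2]:
        cache.read(block_no)
    return cache.hits, cache.misses, len(san_reads)
```

With a working set that fits in the cache, only the first pass over the blocks reaches storage. On a cluster filesystem the same structure would also need some coherence protocol, which this sketch does not attempt.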
Cache on storage arrays is much more expensive per byte than local client RAM. On midrange arrays with a fixed amount per controller, it is not that large compared to the total cache available on a few well-sized SAN clients. For high-end gear like Hitachi virtualizing controllers, cache upgrades cost so much that a couple of years back the storage admins at my university ended up in a sorry bind: they knew they needed more cache, but simply couldn't raise the money for the upgrade. This scarce and expensive resource is best used to do things that *can't* easily be done with cache on local clients, like:
- reliable (mirrored, nonvolatile) write-behind caching, for write aggregation, write annulment (filesystem journal blocks that are quickly rewritten, etc.), and load smoothing (assuming there's any idle time!)
- speculative readahead of sequential data during idle time; the array is the only thing that can really know if the disks are actually idle
- reducing reads of data that *multiple clients* have in common; for instance, base OS disk images in a copy-on-write VMFS disk hosting setup, or copy-on-write cloned SAN volumes. Or, as on the clusters at my workplace, lots of cluster nodes all reading the same executable and source data when a large parallel job starts. But in that case, the filesystem is all on JBODs, and the Lustre object servers do the read caching, with terabytes of aggregate cache across all the object servers, at low cost per GB of cache.
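The write-behind item in the list above can be sketched roughly as follows. This is a toy model under loud assumptions: `WriteBehindCache` and its callback are hypothetical names, the dirty data sits in ordinary process memory (a real array would mirror it into nonvolatile cache on a second controller, which is not modelled here), and "aggregation" is reduced to merging adjacent block numbers into a single I/O.

```python
class WriteBehindCache:
    """Toy write-behind cache: absorbs rewrites, merges adjacent blocks on flush."""

    def __init__(self, write_to_disk):
        self.write_to_disk = write_to_disk   # callable(start_block, list_of_block_data)
        self.dirty = {}                      # block_no -> latest data not yet on disk
        self.absorbed = 0                    # pending writes annulled by a later rewrite

    def write(self, block_no, data):
        if block_no in self.dirty:
            # The earlier pending write never has to reach the disks.
            self.absorbed += 1
        self.dirty[block_no] = data

    def flush(self):
        """Write out dirty blocks; return the number of back-end I/Os issued."""
        ios = 0
        run_start, run = None, []
        for block_no in sorted(self.dirty):
            if run and block_no == run_start + len(run):
                run.append(self.dirty[block_no])      # extends the current contiguous run
            else:
                if run:
                    self.write_to_disk(run_start, run)
                    ios += 1
                run_start, run = block_no, [self.dirty[block_no]]
        if run:
            self.write_to_disk(run_start, run)
            ios += 1
        self.dirty.clear()
        return ios


def demo_write_behind():
    disk_log = []   # (start_block, number_of_blocks) per back-end I/O
    cache = WriteBehindCache(lambda start, blocks: disk_log.append((start, len(blocks))))
    host_writes = 0
    for version in range(5):                 # a journal block rewritten five times
        cache.write(100, f"journal v{version}")
        host_writes += 1
    for block_no in (10, 11, 12, 20):        # some ordinary data blocks
        cache.write(block_no, f"data {block_no}")
        host_writes += 1
    ios = cache.flush()
    return host_writes, cache.absorbed, ios, disk_log
```

In this toy run, nine host writes collapse into three back-end I/Os: four of the journal rewrites are absorbed, and blocks 10 through 12 go out as one contiguous write. A client can only do this safely if it is willing to lose the dirty data on a crash, which is why the mirrored, nonvolatile version belongs on the array.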
As for high-end arrays beating midrange arrays with the same quantity of disk due to lots of cache: cache size is not the only difference. The number and speed of host-side and drive-side interfaces on the SAN controllers will certainly make a difference for non-random I/O benchmarks, given a large enough number of disks in the array, and the controller architecture then needs to be capable of feeding those interfaces. There are a lot of ways to get more performance from the same (sufficiently large) number of drives.