cratered during file copy?
How were you copying, because if it wasn't unbuffered then no wonder it died...
NVM Express (NVMe) is the next generation specification for accessing non-volatile memory such as flash. Traditional technologies such as SAS and SATA are just too slow. In order to demonstrate how much of a difference NVMe makes, Micron has provided 12 9100 NVMe flash drives, 800GB each in the HHHL (standard PCIe card) format …
Ignore the "how were you copying" - screenshots (that I couldn't see too well on mobile) clearly show Windows Explorer.
That's buffered IO and it absolutely WILL bring a server to its knees. Next time, watch the memory tab go through the roof and when it approaches maximum, that's when your server starts dying. If you're using Windows Explorer for benchmarking to copy files, then you're doing it wrong - the amount of memory in your server is taking up the slack and your results are therefore invalid.
Next time, use "xcopy /j".
Hey guys, I did buffered, I did unbuffered, I did every kind of copy I could imagine. I tried multiple operating systems, I even used LiveCDs in an attempt to remove the local disks and the SATA controllers from any use whatsoever. I tried every conceivable kind of anything I could imagine and I regularly ended up with the SSDs faster than any of the operating systems in play could talk to before running out of CPU.
Interesting. I remember when I got our database SAN, a Dell MD3200 with a mixture of SSD and 15k spinning rust. It runs from quad HBA SAS to a PowerEdge 720 with dual HBA (there are actually 2 of them in a 2 node cluster). It had 192GB RAM too, so well under your spec.
Obviously the first thing you do is bugger the redundancy and see how fast you can MPIO them. The SSDs were RAID 0 on a single LUN. MPIO across the lot and turn off any sort of redundancy on the SAN.
W2k12 (not R2) had no problems with Windows Explorer or robocopy copying big fat ISOs or 100ks (not millions, I accept) of website files. I got 2.5GB/s out of it copying email datastores (the largest file I could find). Nothing borked.
"I regularly ended up with the SSDs faster than any of the operating systems in play could talk to before running out of CPU."
That was my experience on a far smaller scale too. These babies are _FAST_, which is good news when messing around with databases and spooling ~100 simultaneous backups (Bacula) across the network.
I was using Intel NVMe HHHL, but it was clear the card was outrunning the systems for everything real-world I wanted it to do.
Hmm, seems like the old "640K is enough for anybody" is still lurking in the back, somewhere.
Might it have something to do with CRC? Like, Windows is expecting a nice little 100 files a second, gets 1 000 000 and goes "WHAT!!", then kernel panic and full retreat because the code was written with a loop still controlled by a 16-bit integer.
They do look like I'd have some serious problems connecting those to the PCIe bus in a system without a little help.
Maybe they meant this instead? https://regmedia.co.uk/2016/04/13/micron_9100s.jpg
With non-volatile storage devices such as these available it's a pity the concept of a single level store (as implemented in MULTICS [now defunct] or AS/400 [now IBM i]) never caught on in the mass market. These devices would neatly fit in the storage hierarchy between a DRAM write cache (to ease the wear on the flash storage) and the remaining, higher latency stuff. Am I the only one who thinks that it's really a waste having to use these devices as "drives", for lack of a software abstraction that is able to leverage their power?
From: "Adapting to Thrive in a New Economy of Memory Abundance", Kirk M. Bresniker, Sharad Singhal, and R. Stanley Williams, Hewlett Packard Labs - IEEE Computer 2015/12, pp 44-53:
Simultaneous adoption of massive NVM pools unifying storage and main memory, centimeter-scaling photonics, application-specific computation acceleration, and relegation of I/O to peripheral interfaces could indicate a fundamental shift in information processing that harkens back to Turing. With today’s emphasis on cheap computation, scarce volatile memory, and abundant nonvolatile I/O storage, systems must constantly manage data flow into and out of memory. The application code provides the translation mechanisms between the efficient, dense in-memory representation and the serialized, buffered persistent or communication representation, while the OS maintains application state and mediates hardware resources. Without the state provided by the OS and application code, the in-memory representations are meaningless. Data must be computed to be useful, but what happens when a vast in-memory representation lives much longer than the now ephemeral computation? Data might need to carry its own metadata and be packaged with its own applications and OSs. As with Turing’s universal machine, the heart of the new machine will be memory, with demonstrably correct access to data in perpetuity. Given that this concept of computing could be the catalyst for many profound insights, we have christened it Memory-Driven Computing. Having emancipated memory from computation and made it the centerpiece of computing, how do we guarantee its correctness? Augmenting the interfaces to memory with a state-change mechanism based on a functional language could provide a formally provable evolution of data without side effects as well as a self-describing type system to guarantee continuity of data interpretation. Adding strong cryptography and a capabilities-based permission system could give future generations the confidence that our information legacy is trustworthy.
Yes, the architectural concept behind the project dubbed "The Machine" by HP management is certainly interesting, but IIRC it all hinges on the availability of technically and commercially viable memristor memory (it's supposed to be built not around a single level /store/ but really a flat, single level, persistent main memory, as in the diagram linked to in your post) and it remains to be seen if HP, currently a company very much in distress, still has the power to make this a reality.
The problem is, as I (with my limited competence in the field) see it, twofold: 1. On the level of electrical engineering, provide the chips for a single level persistent memory, with good-enough yields at a competitive price point, and 2. provide a software abstraction and SW development model that gives enough benefits to make abandoning legacy code attractive and commercially viable.
Neither is a trivial task, to say the least...
If they do manage (and I hope they do), interesting times could be ahead indeed.
Trevor couldn't hardware RAID them because the drives were on individual PCIe cards.
So unless someone builds a RAID controller that has PCIe slots on it then software RAID is the only way.
(Software RAID is less of an issue than normal, because the CPU has a decent amount of bandwidth to the storage, and plenty of oomph)
When you hit the limitations of PCIe v3 on the mobo and CPU, adding a hardware abstraction will only make it worse. There are ways to improve the situation. Implementing PCIe v4 will pretty much double throughput. Adding lanes might improve throughput. Choosing your motherboard, CPU and design very carefully might improve things. For instance, a single socket would probably make design easier so that Windows doesn't accidentally split IO between two PCI buses with multiple cards. Also make certain that all PCI lanes are actually real PCI lanes; there are various ways to add more lanes at the cost of performance.
The tests looked good and he rightly said at this level you're just moving bottlenecks. I once got 1s latency out of a Violin by increasing queues to ramp up IOPS :)
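For anyone who wants rough numbers behind the PCIe v3 versus v4 point, here is a back-of-envelope sketch in Python. It only accounts for the per-lane transfer rate (8 GT/s for v3, 16 GT/s for v4) and the 128b/130b line encoding, so real-world throughput will be lower once protocol overhead is included. The lane counts in the loop are illustrative assumptions, not the configuration used in the review.

```python
# Approximate one-direction PCIe link bandwidth from transfer rate and encoding.
# Protocol overhead (TLP headers, flow control) is ignored, so treat the
# results as an upper bound rather than a benchmark prediction.
GT_PER_SEC = {3: 8.0, 4: 16.0}   # giga-transfers per second, per lane
ENCODING = 128 / 130             # 128b/130b encoding used by PCIe v3 and v4


def link_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Return approximate bandwidth in GB/s for a link of the given width."""
    return GT_PER_SEC[gen] * ENCODING / 8 * lanes


if __name__ == "__main__":
    for gen in (3, 4):
        for lanes in (4, 8, 16):   # illustrative link widths (assumption)
            bw = link_bandwidth_gbytes(gen, lanes)
            print(f"PCIe v{gen} x{lanes}: {bw:.2f} GB/s")
```

A v3 lane works out to a little under 1 GB/s, and v4 doubles that at the same width, which is why the bus itself becomes the limit once several fast cards share it.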
With Linux: make sure you use the multithreaded IO scheduler when dealing with fast SSDs, otherwise all IO is single-threaded (this is different to cfq/noop/deadline, which are all single-threaded schedulers) and you'll max out (as Trevor discovered).
Which in practice means "add scsi_mod.use_blk_mq=1 to your grub boot options"
It does make a difference and it'd be interesting to see if Trevor can quantify it.
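If you want to see which scheduler each block device is actually using before and after changing the boot option, a small Python sketch like the one below may help. It assumes a Linux box exposing the usual /sys/block/<device>/queue/scheduler files, where the active scheduler is shown in square brackets; the function names are made up for illustration, and the unbracketed single-word case is an assumption about how some kernels report multiqueue devices.

```python
from pathlib import Path


def active_scheduler(text: str) -> str:
    """Pick the bracketed entry from a sysfs line such as 'noop deadline [cfq]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Assumption: some kernels print a single word (e.g. 'none') with no brackets.
    return text.strip()


def report(sys_block: str = "/sys/block") -> dict:
    """Map each block device name to its active scheduler, where readable."""
    result = {}
    root = Path(sys_block)
    if not root.is_dir():
        return result            # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        sched_file = dev / "queue" / "scheduler"
        try:
            result[dev.name] = active_scheduler(sched_file.read_text())
        except OSError:
            continue             # no scheduler file for this device, or unreadable
    return result


if __name__ == "__main__":
    for name, sched in report().items():
        print(f"{name}: {sched}")
```

Run it once before adding the boot option and once after, then compare the output for the devices you care about.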
"Time for better hardware RAID?"
Software RAID has been eating hardware RAID's dogfood for years. The last time I bothered with HW RAID cards was 5 years ago (£1200 apiece). Running SW RAID on the same system was actually faster and had lower latency (SATA SSDs). The only advantage of HW RAID was battery-backed write caching, but once you have SSDs in play that advantage is mostly negated.
You can get PCIe expansion busses but the problem is that the bus itself becomes the bottleneck before very long.
I'm old enough that calling them (or for that matter SSDs, SD cards etc) "drives" just doesn't quite feel correct.
@ ATCSNWT:
This was an example of Pottorture.
Trevor:
Given a distributed computing environment, and these being available in the 2G range, I'm actually considering one per node as scratchpad tmp space - we've an ETL that runs and generates up to 200,000 temporary files in a 100Gb lv that *might* just be precipitating internal timeouts. If we can serialize that we'd be able to keep it under 2G of data at a time. These just might be fast enough for that. Thoughts?
Beer for playing with toys.
When the number of files being read and written goes above a few tens of thousands, Resource Monitor will tie up a core just managing and displaying the list box in the “Disk Activity” section of the UI. Switch to the CPU tab (which doesn’t display that list) and watch it become responsive again.
As ever, the act of observing affects the outcome…
Both ZFS and Storage Spaces were unable to cope with these units any better than regular Windows or Linux software RAID. Well, actually, that's a lie. They coped "better", but not "better enough". The software "lash it all together" solutions were clearly bottlenecks in all cases.
If you had read the review you would have learned that I tried rather a lot of things. For months. Here are the benchmarks I've used.
Databases
Hammerora http://hammerora.sourceforge.net/ Microsoft SQL, MySQL, Postgres, OracleDB (if you have it).
OStress http://blogs.msdn.com/b/psssql/archive/2014/04/24/version-9-04-0013-of-the-rml-utilities-for-x86-and-x64-has-been-released-to-the-download-center.aspx Microsoft SQL, as part of the SQL RML Utilities.
SQLIO http://www.microsoft.com/en-us/download/details.aspx?id=20163 This writes all zeros. It tells us a very specific thing about how "zero blocks" are dealt with. It's tricky. Follow http://www.mssqltips.com/sqlservertip/2127/benchmarking-sql-server-io-with-sqlio/ and http://www.brentozar.com/archive/2008/09/finding-your-san-bottlenecks-with-sqlio/
SQLIOSIM https://support.microsoft.com/en-us/kb/231619?wa=wsignin1.0 this is to test stability, not performance. https://www.simple-talk.com/sql/database-administration/the-sql-server-sqliosim-utility/
General disk tests
FIO http://freecode.com/projects/fio Read http://support.sas.com/resources/papers/proceedings13/479-2013.pdf and all will be revealed.
Iometer http://www.iometer.org/ Various configurations
Exchange
Jetstress 2013 http://www.microsoft.com/en-ca/download/details.aspx?id=36849
Jetstress 2010 http://www.microsoft.com/en-ca/download/details.aspx?id=4167
Background work tests
Using Iometer, determine your peak global IOPS as per the above test. Load the system to 25%, 33%, 50%, and 75% of IOPS capacity. Now run various common administrative tasks and time them.
1) Full VM backup using VM backup software
2) Snapshot
3) Clone
4) Creation of VM from template
5) SQLIO test runs on a single VM (testing mixed workloads!)
6) Exchange Jetstress (testing mixed workloads!)
7) SQLIO and Exchange Jetstress (testing mixed workloads!)
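The background-load levels described above are easy to work out ahead of time. A minimal Python sketch follows; the peak figure in the example is a made-up placeholder, so substitute whatever peak IOPS your own Iometer run reports.

```python
# Percentages of peak IOPS at which to hold the background load.
LOAD_PERCENT = (25, 33, 50, 75)


def target_iops(peak_iops: int, percents=LOAD_PERCENT) -> dict:
    """Return the IOPS target for each background-load level."""
    return {f"{pct}%": peak_iops * pct // 100 for pct in percents}


if __name__ == "__main__":
    peak = 1_000_000   # hypothetical peak, not a figure from the review
    for level, iops in target_iops(peak).items():
        print(f"{level:>4} load -> hold background IO at about {iops:,} IOPS")
```

Hold the load generator at each target in turn, then time the seven administrative tasks against it.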
...and there I was thinking that endurance really isn't that much of an issue with modern SSD technology. The things generally have enough wear space and good enough cycling (can't remember the actual term) algorithms that they will last for 5 years even when treated horribly. That's more than we'd usually trust spinning rust for.
I always want to know performance, but almost more importantly the endurance or lifespan of the SSD drives. I keep melting them to slag and need to know replacement intervals and redundancy planning concerns.
You obviously went to some effort to beat these up but didn't report drive total lifetime write activity. Any comment on how far you got, and if you can just sit and cycle one or more to destruction to satisfy my hunger for magic smoke (and, budget and ops planning for the bosses for whom I might have to buy a bunch of them...)?
Thanks.
I abused the piss out of those things for two months and they have since moved into regular lab use. Josh has two in his video editing desktop that he abuses all day long. I have them scattered about the lab in every server I can find. I have yet to see them go below 99% lifetime, according to the diagnostics.
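For budget planning, a very rough way to turn a "percent lifetime remaining" reading into a replacement interval is straight-line extrapolation. The Python sketch below assumes wear is linear over time and that the diagnostic percentage is trustworthy. Both are assumptions, and since the reading above never dropped below 99%, the result is closer to a floor than a prediction.

```python
def projected_lifetime_months(elapsed_months: float, percent_remaining: float) -> float:
    """Linearly extrapolate total drive life from wear observed so far.

    Assumes wear accumulates at a constant rate (a simplification).
    """
    used = 100.0 - percent_remaining
    if used <= 0:
        return float("inf")      # no measurable wear yet, nothing to extrapolate
    return elapsed_months * 100.0 / used


if __name__ == "__main__":
    # Two months of abuse with 99% remaining, per the comment above.
    months = projected_lifetime_months(2, 99)
    print(f"Projected life at this rate: {months:.0f} months")
```

The useful habit here is logging the diagnostic reading at regular intervals so the projection can be redone as real wear data accumulates.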
I have used both Storage Spaces and the dynamic-disk based RAID. NTFS is the filesystem I tested most, but I did run a few ReFS tests. (ReFS can handle a few million more files, but honestly the difference, at least in Server 2012 R2, isn't that great.) I was using a stripe/RAID0 rather than parity or mirror.