Sorry, hyperconverged players, but ML/DL workloads are not meant for your kit. You can't touch an NVIDIA DGX on performance, let alone simplicity in the SW stack, and those monsters demand NFS.
Cisco and Lenovo have shoved Intel's Optane caching drives in their hyperconverged systems, and Switchzilla has also added Nvidia GPU support to grant AI/ML apps hyperconverged system access. The Optane and NVMe drive injections are supposed to make these hyperconverged systems perform better than all-flash SAS/SATA SSDs and …
Because ML/DL is not only about training at scale. There are a lot of details missing in the article (like operations and collaboration features). Sounds like you are pretty hyped up about AIRI, which is a great solution for training at scale. The Cisco solution is complementary and evolves to offload training to a solution like AIRI. There is actually more simplicity in the Cisco solution for certain things. Everything is packaged to make it easier and more cost-effective to start experimenting and sharing among a team; then, when training demands become critical, a solution like AIRI comes into play. Once you have a trained model, the Cisco solution makes it a simple point-and-click operation to import the working model and create an API for it, for easy integration into applications for inference/consumption.
The solution measures the accuracy over time and collects information that can be used to retrain models to maintain a certain level of accuracy. It also shares a lot of the same libraries and is simpler than if you tried to do the same thing with a bunch of standard servers and P4 cards for distributed inference.
Partially correct. FlashBlades can be integrated into an HX/UCS fabric, and the platform described in this article has an equivalent of the DGX-1 available. It’s an Nvidia/Cisco co-development with NVLINK but with Skylake instead of Broadwell. Better IO too (not really surprising for a Cisco box) and much more RAM than the DGX-1. Rumors are it could run GPUs other than Nvidia’s, but I am not sure how that would work with NVLINK.
So you can order an AIRI or just plug a few nodes into your fabric and have the same thing with better components from the same vendors. Nothing against Nvidia, they absolutely rock, but they are not good at designing servers or integrated systems.
Sorry, but you are misinformed. CUDA is not done via NFS; NFS is used for data access only. Computational scaling is done a bit differently, and Cisco’s data center fabric is particularly well suited for it, although you can achieve similar results with any 40/100G network. I encourage you to look up BitFusion, which has established itself as a standard for remote CUDA execution.
As to AI/ML workloads on HCI, using Volta GPUs HX supports 220 TFLOPS per converged node, 670 TFLOPS per compute node, and 1 PFLOPS per NVLINK node. The NVLINK nodes are DGX-1 equivalents but with Skylake architecture. The number of nodes is currently limited to 64, but that’s for converged and compute nodes; GPU nodes can most likely just be plugged into the fabric.
This is anything but a toy. Combined with the K8s TensorFlow stacks (Kubeflow) natively integrated in CCP and a native integration of Pascal and Volta (and potentially TPU), this is shaping up to be a one-stop shop for advanced workloads.
IMHO, the competition has been asleep at the wheel and allowed Cisco to leapfrog them. It will require significant R&D to catch up.
I didn’t realize HX had nodes based on anything besides C240s and C220s. And I’m sorry for my skepticism, but even if you put in an S-series HX node that could handle 6 V100s, there’s no effing way you’re pushing 1 petaflop out of that when a DGX-1 with 8 cards only does 960 teraflops.
I love what they are doing with Kubernetes and think they have a great message with hybrid GCP, but if we are talking straight performance here I don’t see how HX can stack up on training vs an AIRI or NetApp/NVIDIA solution.
Yes, more than just 220/240. Lots more stuff coming, apparently.
No, not an S-series but an NVLINK node with 8xV100 that is identical to a DGX-1 but based on Skylake.
960 TFLOPS is 0.96 PFLOPS, so I assume the data I saw is rounded up. If I understood it correctly, it’s an “open” DGX-1 with equal or better performance than a DGX-1 due to a better CPU and significantly more RAM (3TB vs 512GB on the DGX). I am not sure if CPU and RAM are necessary on an AI/ML node, but there is still so much crappy code out there that might benefit from it. Either way, building a new platform on Broadwell is a bit stupid, but I guess Nvidia had their reasons. Or they simply did not have enough engineering to build something state of the art themselves and had to rely on some OEM/ODM.
Looking through my notes I see there is a C480M5 compute node for HX with six V100, the C240 have 2 V100, and that UCS DGX with eight V100. There was also talk about a dedicated GPU node but now I am confused if that’s a separate smaller node or that DGX clone. I could swear it was a separate product but I might or might not have had a few beers already :p
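For anyone who wants to sanity-check the numbers being thrown around in this thread, here is a rough sketch that divides each quoted per-node figure by the GPU count from the notes above. It only uses figures posted here; mapping the "converged" node to the 2-GPU C240 is my assumption, and the names `nodes` and `per_gpu_tflops` are just made up for illustration.

```python
# Per-node TFLOPS figures and V100 counts as quoted in this thread.
# Assumption: the "converged" HX node is the C240 with 2 V100s.
nodes = {
    "HX converged (C240, assumed)": (220, 2),
    "HX compute (C480 M5)": (670, 6),
    "NVLINK node": (1000, 8),
    "DGX-1": (960, 8),
}


def per_gpu_tflops(total_tflops, gpu_count):
    """Return the per-GPU throughput implied by a node-level figure."""
    return total_tflops / gpu_count


for name, (total, gpus) in nodes.items():
    print(f"{name}: {per_gpu_tflops(total, gpus):.1f} TFLOPS per V100")
```

The implied per-card figures all land between 110 and 125 TFLOPS, so the 1 PFLOPS number looks like a rounded-up version of the same eight-card box rather than a different class of hardware.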
I asked the Cisco dudes if that UCS DGX box (it had a weird codename starting with M but I forgot what it was) could run other GPUs like Intel and AMD or even a Google TPU but man, they were really evasive about that. My Google account manager told me a week ago that Nvidia received a bunch of TPUs, so at this point I am totally confused.
As to the AIRI, I don’t really see what’s so special about it as we already have our Flashblades connected to our UCS domain and if we can just add those Msomethingsomething nodes and start cranking we’d be ready to rock and roll. And now I have to laugh at myself because looking at our competent and strategic management we’ll probably buy whatever got pitched at the last golf event if you know what I mean.
Yes, C480, that's what I meant! Not S-Series, I was blanking on that for like half an hour.
Interesting. The way it was explained to me, the UCS DGX you are talking about is just the NVIDIA DGX but on Cisco's GPL, the same way you can buy Commvault, Veeam, VMware or about a million other technologies. I don't believe there is any difference, so I highly doubt you're going to be running AMD or TPU on an NVIDIA box :)
You seem to have a great handle on this. How can I pick your brain more on this stuff outside of this venue?