Intel pits QDR-80 InfiniBand against Mellanox FDR

As Joe Yaworski, fabric product marketing manager at Intel, put it to El Reg, the next frontier of system innovation will be in fabrics. As we see it, that frontier may also be the front line of the InfiniBand war between Mellanox and Intel – with one upcoming battle being the former's 56Gb/sec interconnects versus Intel's new …

COMMENTS

This topic is closed for new posts.
  1. frobnicate
    Meh

    This doesn't make sense

    I am sorry. Peak "flops" are achieved during the compute phase of cluster jobs, when the interconnect is mostly idle. The network is used during the IO phase, e.g., to dump checkpoints. Having a larger flops capacity does not imply a faster network. More details about the benchmarks are needed.

    1. Jason Ozolins
      Happy

      Re: This doesn't make sense

      I don't think it doesn't make sense. :-) Looking at this from a sysadmin POV (I'm not an applied maths whiz, but I've worked for and with some):

      - Unless your job is embarrassingly parallel, your cluster nodes will need to communicate with each other, not just with the filesystem.

      - The pattern and amount of that communication depends on the type and scale of the job.

      - As more cores end up inside each compute node, the interconnect has to scale up in speed for some sorts of jobs (definitely for all-to-all patterns - see the sketch below) to get the same throughput per core as when each node had fewer cores. There is also more RAM in each node, and hence more checkpoint data to be saved in the I/O phase - but the I/O phase is more likely to be limited by the filesystem, unless you're using some fancy two-stage checkpoint setup (i.e. a quick dump to a dedicated checkpointing system that can then stage it out to the filesystem).
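
      A minimal sketch of that all-to-all pattern (the chunk size is made up; a real job would have its own):

      ```c
      /* All-to-all sketch: every rank exchanges a chunk with every other rank,
       * so as core (rank) counts per node grow, more of these flows have to
       * squeeze through each node's one interconnect link. */
      #include <mpi.h>
      #include <stdlib.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          int nranks;
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          const int chunk = 4096;  /* bytes per peer - purely illustrative */
          char *sendbuf = calloc((size_t)nranks, chunk);
          char *recvbuf = calloc((size_t)nranks, chunk);

          /* Each rank sends 'chunk' bytes to every other rank. */
          MPI_Alltoall(sendbuf, chunk, MPI_BYTE,
                       recvbuf, chunk, MPI_BYTE, MPI_COMM_WORLD);

          free(sendbuf);
          free(recvbuf);
          MPI_Finalize();
          return 0;
      }
      ```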

    2. Anonymous Coward
      Anonymous Coward

      Re: This doesn't make sense

      That may be the case for 'embarrassingly parallel' jobs where there is little or no inter-node communication during the execution of the job.

      But for jobs where there is a significant amount of message passing while the job runs, the interconnect can have a huge impact on performance.

      Applications such as CFD, FEA, weather/climate modelling, etc. have a single model or mesh distributed over many nodes. As the job runs, messages containing state updates, synchronisation operations and data fly backwards and forwards between the processes on each of the CPU cores. Ideally the job would be arranged so that adjacent points or cells in the model are placed on adjacent cores in the cluster - but for jobs larger than a single node there will still be a requirement to get messages (packets) out of the node, onto the network and to their intended destination as quickly as possible. Message generation rate, latency and, to a lesser extent, bandwidth play an important role in the measured/actual FLOPS achieved by the application, regardless of what the aggregate peak FLOPS of the CPUs actually are.
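
      For what it's worth, that nearest-neighbour exchange looks roughly like this in MPI (a 1-D halo exchange sketch; the halo size is illustrative):

      ```c
      /* 1-D halo exchange sketch: each rank swaps boundary cells with its left
       * and right neighbours every timestep - the nearest-neighbour pattern
       * described above. HALO is an illustrative size. */
      #include <mpi.h>

      #define HALO 128

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          int rank, nranks;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);

          int left  = (rank - 1 + nranks) % nranks;
          int right = (rank + 1) % nranks;

          double send_l[HALO] = {0}, send_r[HALO] = {0};
          double recv_l[HALO], recv_r[HALO];

          /* Small messages like these are bound by latency and message rate,
           * not link bandwidth. */
          MPI_Sendrecv(send_r, HALO, MPI_DOUBLE, right, 0,
                       recv_l, HALO, MPI_DOUBLE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(send_l, HALO, MPI_DOUBLE, left,  1,
                       recv_r, HALO, MPI_DOUBLE, right, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          MPI_Finalize();
          return 0;
      }
      ```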

      1. frobnicate

        Re: This doesn't make sense

        Let's try a very rough back-of-the-envelope estimate. Assuming the network is fully utilised during the compute phase, the IO phase would be at least as long as the compute phase, because the total state of the computation must be dumped, and that is more data than was exchanged during the compute phase. Worse yet, the IO phase cannot be overlapped with the compute phase of another task, because they compete for the fully utilised network. Which means that even in the ideal case, where the storage system is so blazingly fast that the network is the bottleneck of the IO phase, the duty cycle of the system is less than 50%. A reason for a big-lab administrator to have a heart attack. The system seems to be misconfigured.
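
        The same sums as a tiny program, taking the premise at face value (every input is an assumption, picked only to show the shape of the argument):

        ```c
        /* Duty-cycle estimate under the premise above: the checkpoint is at
         * least as much data as the (saturated) network carried during
         * compute, and it must cross the same network. Numbers are assumed. */
        #include <stdio.h>

        int main(void)
        {
            double link_gbps    = 56.0;   /* assumed saturated during compute */
            double compute_secs = 600.0;  /* assumed compute phase length     */

            /* Data the network carried during compute, in GB. */
            double exchanged_gb  = link_gbps * compute_secs / 8.0;
            double checkpoint_gb = exchanged_gb;  /* lower bound, per premise */

            double io_secs = checkpoint_gb * 8.0 / link_gbps;  /* same network */
            double duty    = compute_secs / (compute_secs + io_secs);

            /* Prints a 50% duty cycle - less if the checkpoint is bigger. */
            printf("IO phase: %.0f s, duty cycle: %.0f%%\n",
                   io_secs, duty * 100.0);
            return 0;
        }
        ```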

        1. Anonymous Coward
          Anonymous Coward

          Re: This doesn't make sense

          Why would the IO phase need to write out the sum of all data transmitted during the compute phase?

          Not all of the traffic during the compute phase is data; some is control/synchronisation from MPI. Also, any data transmitted during the compute phase may have been updated, discarded or expired during compute - the end state or result isn't necessarily the sum of all of the data.

          Additionally, the MPI traffic can be quite bursty - so while you want the interconnect to be capable of high performance, its capacity (bandwidth) isn't usually the limiting factor; the speed (latency or message rate) is typically more important.
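
          A crude way to see this is the usual first-order transfer-time model, t = latency + size/bandwidth (the figures below are assumptions, not measurements):

          ```c
          /* For small messages the latency term dwarfs the bandwidth term;
           * only large transfers are bandwidth-bound. Figures are assumed. */
          #include <stdio.h>

          int main(void)
          {
              double latency_us   = 1.5;  /* assumed one-way MPI latency over IB */
              double gbytes_per_s = 7.0;  /* assumed usable link bandwidth       */

              double sizes[] = { 64.0, 4096.0, 1048576.0 };  /* bytes */
              for (int i = 0; i < 3; i++) {
                  /* bytes / (GB/s), expressed in microseconds */
                  double xfer_us = sizes[i] / (gbytes_per_s * 1e3);
                  printf("%9.0f B: %5.1f us latency + %8.3f us transfer\n",
                         sizes[i], latency_us, xfer_us);
              }
              return 0;
          }
          ```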

          It is perfectly acceptable to have MPI (compute) and filesystem (IO) traffic on the same interconnect. IB has QoS features so that certain traffic types, e.g. MPI packets, can be prioritised, ensuring that latency in the compute phase doesn't skyrocket if there is also IO traffic on the network.

          In fact, if you have a multi-user, multi-job cluster with jobs starting and stopping asynchronously, it's pretty much inevitable that you will have a mix of traffic on the cluster at any given time.

  2. Tom Womack
    Boffin

    "many clusters where latency or cost is more important than bandwidth are still being built with Gigabit Ethernet switches"

    Gigabit Ethernet latency is *dreadful*, 180us or more for a ping between two boxes attached to the same switch! You use a gigabit interconnect only when latency is immaterial and bandwidth not terribly important; thankfully a lot of interesting jobs have that property.
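
    For reference, the kind of ping-pong microbenchmark those latency figures come from looks roughly like this (a minimal sketch; real tools such as the OSU benchmarks are more careful):

    ```c
    /* Ping-pong latency sketch: ranks 0 and 1 bounce a 1-byte message and
     * halve the averaged round-trip time. Run with two ranks on two boxes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char byte = 0;
        const int iters = 10000;

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency: %.1f us\n",
                   (t1 - t0) / iters / 2.0 * 1e6);

        MPI_Finalize();
        return 0;
    }
    ```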

    Unfortunately the slower grades of InfiniBand, which were still cheaper and lower-power than 10GBase-T when they started to be phased out, are no longer readily available new.

  3. Anonymous Coward
    Anonymous Coward

    Intel compilers?

    "The Intel compilers know about QDR-80 and how to arrange code so it doesn't try to go over the QPI link."

    The Intel compiler blurb is interesting, since it is the only reference to existing Intel software; presumably that's part of a complete Intel-only cluster solution.

    "Arranging code" helps only with the portions of code that can be statically rearranged, which is only a small fraction of the existing codes out there. Where to route a message (either through QPI or QDR-80) is a runtime determination, so Intel and others are on an equal footing here.
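
    To illustrate the runtime side: on Linux, the NUMA node an HCA hangs off is visible in sysfs at run time, which is the sort of probe any MPI runtime (Intel's or otherwise) can use to steer each rank to its socket-local adapter instead of crossing QPI. A sketch, assuming the standard /sys/class/infiniband layout:

    ```c
    /* Runtime probe: report which NUMA node each InfiniBand HCA sits on,
     * via the standard sysfs layout. An MPI library can use exactly this
     * to bind ranks to the socket-local rail. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *d = opendir("/sys/class/infiniband");
        if (!d) {
            perror("/sys/class/infiniband");
            return 1;
        }

        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (e->d_name[0] == '.')
                continue;

            char path[512];
            snprintf(path, sizeof path,
                     "/sys/class/infiniband/%s/device/numa_node", e->d_name);

            FILE *f = fopen(path, "r");
            if (!f)
                continue;

            int node = -1;
            if (fscanf(f, "%d", &node) == 1)
                printf("HCA %s is attached to NUMA node %d\n", e->d_name, node);
            fclose(f);
        }
        closedir(d);
        return 0;
    }
    ```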
