Benchmark bandit: Numascale unveils 10TB/sec monster

Numascale's non-universal memory architecture has been used to build a 324-CPU system with 108 Supermicro servers sharing a single system image and 20.7TB of memory – scoring a winning McCalpin STREAM benchmark. The system, with its cache-coherent shared memory, ran at 10.096TB/sec for the McCalpin Scale function. It was 53 …
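
For readers unfamiliar with the benchmark: the STREAM Scale kernel multiplies one array by a constant and stores the result in another, and STREAM counts 16 bytes of traffic per element for it (one 8-byte read plus one 8-byte write). Below is a minimal Python sketch of that accounting. The function names are hypothetical, decimal terabytes are an assumption, and nothing here reflects Numascale's actual run configuration.

```python
BYTES_PER_DOUBLE = 8  # STREAM works on 8-byte double-precision elements by default

def scale(b, q):
    """STREAM Scale kernel: a[i] = q * b[i] for every element."""
    return [q * x for x in b]

def scale_bandwidth(n_elements, seconds):
    """Bytes per second as STREAM counts them for Scale:
    one read plus one write per element."""
    return 2 * BYTES_PER_DOUBLE * n_elements / seconds

# Figures from the article: 10.096 TB/sec aggregate across 324 CPUs.
# Assumption: decimal terabytes (1 TB = 10**12 bytes).
per_cpu_gb_per_sec = 10.096e12 / 324 / 1e9
print(f"{per_cpu_gb_per_sec:.1f} GB/sec per CPU")
```

Under those assumptions the aggregate figure works out to roughly 31 GB/sec per CPU.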

  1. astrax

    They need a decent mascot.

    I choose you, Gary Brolsma!

  2. Aitor 1

    Old chip

    I guess this is a proof of concept with old hardware they had, as those chips are ancient.

    Still, quite impressive.

    1. Anonymous Coward

      Re: Old chip

      AMD have been doing NUMA for years. Maybe Intel doesn't have any NUMA options other than Itanium?

      https://en.wikipedia.org/wiki/Non-uniform_memory_access#Cache_coherent_NUMA_.28ccNUMA.29

      1. Bronek Kozicki

        Re: Old chip

        Intel Xeons have supported cache-coherent NUMA for the last 6 years or so (reference)

    2. Nigel Campbell

      Re: Old chip

      Probably to do with HyperTransport, which has a history of being friendly to folks plugging other stuff into CPU sockets. For example, there are some FPGA products (e.g. Altera) that will go into an AMD socket, and AMD has been friendly to this sort of application since the mid-2000s, when they brought out Socket 940.

      Although the article doesn't say so specifically, it does talk about 3 CPUs per server, which suggests the fourth socket is being used for something else, probably the connectivity. I guess the choice of Opteron is because HT is more friendly to this sort of application than QuickPath.

    3. admiraljkb

      Re: Old chip

      Umm, Opteron 6386 is a new chip. Now the system boards could be 5 years old, but that's because it's a stable platform (which now desperately needs refreshing for PCIe3, DDR4 and such).

  3. phil dude

    price vs performance....

    Probably a much cheaper way to build a big memory box - and avoid rewriting your applications just to test them.

    I would like to point out the ~1us MPI latency in the whitepaper, but no good figures for "bare metal"....

    Anyone the wiser?

    P.

  4. Anonymous Coward

    But why?

    Serious question.

    What sort of real-world use case would you throw at this? And I mean, more specific than "Big Data", which comes in many flavors. Does anybody have an example where this Single Image model is a better approach than the Hadoop / Spark style distributed model?

    1. Anonymous Coward

      Re: But why?

      >> Does anybody have an example where this Single Image model is a better approach than the Hadoop / Spark style distributed model?

      It says in the article. Anything that can be implemented using message passing or map-reduce is almost certainly easier to implement using shared memory.

      1. CarbonLifeForm

        Re: But why?

        At first blush, large scale finite difference codes might be a good candidate. I assume the GNU stack is on this box, so you could do an OpenMP implementation pretty easily.

        That's not Big Data mind you, it's good ol' HPC.
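
To make the finite-difference suggestion concrete, here is a minimal sketch of one explicit time step for the 1D heat equation. It is written in plain serial Python, not the C/OpenMP the comment has in mind, and the function name and parameters are hypothetical. The point is the access pattern: every interior point reads its neighbours from one shared array and writes only its own slot, which is why the loop would be a natural candidate for an OpenMP `parallel for` on a shared-memory machine.

```python
def diffuse_step(u, alpha):
    """One explicit finite-difference step of the 1D heat equation.

    u     -- list of values on a uniform grid; the two end points are held fixed
    alpha -- diffusion number (D * dt / dx**2); keep it <= 0.5 for stability
    """
    new = u[:]  # write into a copy so every update reads the old time level
    for i in range(1, len(u) - 1):
        # Each iteration only reads u[i-1], u[i], u[i+1] and writes new[i],
        # so the iterations are independent of one another.
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new

# A unit spike in the middle of the grid spreads to its neighbours.
print(diffuse_step([0, 0, 1, 0, 0], 0.25))
```

In a distributed-memory version the neighbour reads at the edges of each node's chunk would need explicit halo exchanges; on a single system image they are ordinary loads.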

    2. Nigel Campbell

      Re: But why?

      Not so much big data as applications that need a large shared memory space and don't lend themselves to being split up into the sort of isolated steps with discrete inputs and outputs that Hadoop and its ilk support. This encompasses a large set of traditional supercomputing applications that revolve around big dense matrix operations.

      Examples of this include finite element models used in engineering, computational fluid dynamics, or certain types of signal processing applications (e.g. geophysics applications for oil exploration). Any number of scientific applications use matrix operations.

      This type of application computes relationships between n entities by representing the relationships in an n x n matrix. If the relationships are dense enough (i.e. there is a non-zero connection between enough pairs of elements) then the most efficient way of doing this is through a two-dimensional array held in memory. As this is O(n^2) for memory, the data sets can get very large very quickly.
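
As a rough illustration of the O(n^2) point, the sketch below works out how much memory a dense n x n matrix needs and how large n can get before a single matrix fills the 20.7TB quoted in the article. It assumes 8-byte double-precision elements and decimal terabytes, neither of which the article specifies, and the function names are hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: double-precision elements

def dense_matrix_bytes(n, elem_bytes=BYTES_PER_DOUBLE):
    """Memory needed to hold a dense n x n matrix."""
    return n * n * elem_bytes

def max_dense_n(capacity_bytes, elem_bytes=BYTES_PER_DOUBLE):
    """Largest n whose dense n x n matrix fits in capacity_bytes."""
    return math.isqrt(capacity_bytes // elem_bytes)

# 20.7 TB from the article, assuming 1 TB = 10**12 bytes.
capacity = 20_700 * 10**9
n = max_dense_n(capacity)
print(n, dense_matrix_bytes(n))
```

Under these assumptions n tops out at roughly 1.6 million entities, and that is before allowing for any working copies, which is why a single large shared memory space matters for this class of problem.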

  5. vgrig_us

    NUMA

    "Numascale's non-universal memory architecture"

    And I thought NUMA stands for "Non-uniform memory access" - silly me...
