AMD reveals potent parallel processing breakthrough

AMD has released details on its implementation of The Next Big Thing in processor evolution, and in the process has unleashed the TNBT of acronyms: the AMD APU (CPU+GPU) HSA hUMA. Before your eyes glaze over and you click away from this page, know that if this scheme is widely adopted, it could be of great benefit to both …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward
    Anonymous Coward

    This is pleasing

    More good news technology stories like this and I shall renew my subscription.

    1. Fred Flintstone Gold badge
      Coat

      Re: This is pleasing

      Be warned, you need a sense of hUMA.

      No, the *dirty* Mac, thanks.

  2. James Wheeler
    Thumb Up

    About time

    This is going to make GPU coprocessing useful in many instances where it isn't now.

  3. ARP2

    No more worrying about Graphic card memory

    If I understand correctly, if they were to use this architecture on a "traditional" PC, would we need to worry about memory on a graphics card anymore? Or would the CPU/GPU simply split the use of my RAM as they need it? Could that also mean that more RAM = much better graphics performance?

    1. DF118

      Re: No more worrying about Graphic card memory

      I think the point of this is in situations where the CPU and GPU reside on the same die, not where you have a discrete GPU elsewhere on the assembly or even on a daughter board, which is I think what you're talking about.

      1. JEDIDIAH
        Linux

        Re: No more worrying about Graphic card memory

        A co-processor is a co-processor. If it can act in place of the CPU with less nonsense then that's useful regardless of whether or not the co-processor is on the same die. This is just turning a GPU into a fancier math co-processor.

        Surprised it hasn't been done yet actually.

    2. Matt Bryant Silver badge
      Boffin

      Re: No more worrying about Graphic card memory

      "....if they were use this architecture on a "traditional" PC, do we need to worry about memory on a graphics card anymore?...." The article specificly mentions tablets and handsets, which implies this is more a better "system on-a-chip" than a replacement for traditional gaming PC architecture.

      One reason it is unlikely to upset the PC applecart is that plug-in graphics card vendors like having complete control over the discrete memory on their cards: they don't have to wait for CPU or motherboard designers to catch up. If a new memory type that works best for graphics comes out, they don't have to wait for the CPU manufacturers to redesign the memory controllers on their CPUs, or for the mobo designers to issue a new mobo with new RAM slots; they simply add it to their own cards. That is the advantage of discrete graphics memory in PCs. If you had no memory on your graphics card and had to go out over the bus to main memory, performance would suck. And graphics card makers will want to use the latest and greatest, as they need to maintain a performance lead over combined designs like this or they will go out of business.

      I would suggest this is more aimed at tablets and possibly virtual desktop environments, the latter seeing greater efficiencies in memory if they can pool it for all tasks.

      1. Craig 2
        Thumb Up

        Re: No more worrying about Graphic card memory

        What would be nice for gaming with a discrete 3D graphics card is if the new on-processor APU could be re-tasked to help with something like physics or AI.

        1. Palf
          WTF?

          Re: No more worrying about Graphic card memory

          Yes - also it's unclear what degree of control the programmer has over CPU/GPU usage decisions. It would be nice to specify that compute-intensive inner loops be executed mandatorily on the GPU, for example. I get the impression that the GPU is auto-assigned only when the CPU runs out of puff, but that's just a SWAG.
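
          For what it's worth, explicit control along those lines does already exist in some programming models. Below is a minimal sketch using OpenMP's target construct (OpenMP 4.0 onward) to request that a hot loop run on an accelerator; whether an HSA/hUMA runtime would honour the request or quietly fall back to the CPU is implementation-defined, and nothing here is taken from the article.

            /* Sketch: explicitly request accelerator execution for a hot loop.
             * With shared virtual memory the map() clauses could in principle
             * become no-ops, since the device sees the same page tables. */
            #include <stdio.h>

            #define N 1000000

            int main(void)
            {
                static float a[N], b[N], c[N];

                for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

                #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
                for (int i = 0; i < N; i++)
                    c[i] = a[i] * b[i];

                printf("c[42] = %f\n", c[42]);
                return 0;
            }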

    3. ThomH

      Re: No more worrying about Graphic card memory

      By putting the GPU behind the MMU it does technically reduce one of the video memory concerns — you could have a single graphic, however many gigabytes in size, memory-map the file and call that the texture. Attempts by the GPU to read sections not currently paged in would simply raise the usual exception, which would be caught by the OS in the usual way and handled by the existing paging mechanisms. You no longer have to treat texture caching as a separate application-level task.
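
      A minimal POSIX sketch of that idea, purely to make it concrete: the mmap and page-fault behaviour below is standard today, while handing the resulting pointer straight to the GPU is the hypothetical hUMA part (no specific graphics API is implied, and "huge_texture.raw" is just a made-up file name).

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
            /* Hypothetical multi-gigabyte asset. */
            int fd = open("huge_texture.raw", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            fstat(fd, &st);

            /* Map the whole file; nothing is read from disk until a page is touched. */
            void *texels = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (texels == MAP_FAILED) { perror("mmap"); return 1; }

            /* Under shared virtual memory this same pointer could, in principle,
             * be handed to the GPU as the texture; a touch of a non-resident page
             * would fault and be serviced by the ordinary OS pager either way. */
            printf("mapped %lld bytes at %p\n", (long long)st.st_size, texels);

            munmap(texels, st.st_size);
            close(fd);
            return 0;
        }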

      That said, as others have noted the main point of the design is that when you write a parallel for loop in your language of choice to perform some vector operation — especially if it involves no branching — the GPUs can be factored into the workload just as easily as any traditional CPU cores, but so as to perform the work much more efficiently. So writing programs that take advantage of all the available processing becomes a lot easier. Collapsing virtual memory to a single mechanism that your OS vendor has already supplied is just one example.

  4. Anonymous Coward
    Anonymous Coward

    What was UMA architecture then?

    I am not an expert in this field, but I used to hear the distinction between Unified Memory Architecture (UMA) and Non-Uniform Memory Access (NUMA) when it came to GPUs. So how is the present offering from AMD different from UMA?

    1. bazza Silver badge
      Happy

      Re: What was UMA architecture then?

      In their current APUs the GPU doesn't interact with memory in the same way as the CPU does. That's in spite of the fact that they're on the same die and ultimately share the same DDR3 memory bus. In that sense the arrangements are slightly non-uniform, and you have to copy data in order to get it from one realm to another.

      This new idea means that the GPU and CPU interact with memory in exactly the same way, and that makes a big difference. Software is simpler because a pointer in a program in the CPU doesn't need to be converted for the GPU to be able to use it. That helps developers. More importantly the "GPU job setup time" is effectively zero because no data has to be copied in or out first. That speeds up the overall job time.
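
      Roughly, the difference it makes to application code looks like the toy simulation below. Both gpu_run_* functions are made-up stand-ins (plain CPU code pretending to be a runtime), not real APIs; the only point is where the copies go.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Stand-in for the kernel itself: double every element in place. */
        static void kernel(float *data, size_t n)
        {
            for (size_t i = 0; i < n; i++) data[i] *= 2.0f;
        }

        /* Current-style dispatch: stage the data into a separate "device" buffer,
         * run the kernel there, copy the results back. Two copies per job. */
        static void gpu_run_with_copies(float *data, size_t n)
        {
            float *dev = malloc(n * sizeof *dev);
            memcpy(dev, data, n * sizeof *dev);   /* copy in  */
            kernel(dev, n);
            memcpy(data, dev, n * sizeof *dev);   /* copy out */
            free(dev);
        }

        /* hUMA-style dispatch: the CPU pointer is already valid on the GPU, so
         * the job is just queued against it. No staging copies, no address
         * fix-ups done by hand in the driver. */
        static void gpu_run_shared(float *data, size_t n)
        {
            kernel(data, n);
        }

        int main(void)
        {
            float v[4] = { 1, 2, 3, 4 };
            gpu_run_with_copies(v, 4);
            gpu_run_shared(v, 4);
            printf("%.0f %.0f %.0f %.0f\n", v[0], v[1], v[2], v[3]);
            return 0;
        }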

      I like it!

      1. Anonymous Coward
        Anonymous Coward

        Re: What was UMA architecture then?

        But aren't you actually describing NUMA v/s UMA rather than UMA v/s hUMA?

      2. @Yehuda Ben Yehudin.
        WTF?

        Re: What was UMA architecture then?

        Isn't this exactly what the Cell architecture had 10 years ago, a single system memory that is accessed by the SPE accelerators and the CPUs? What's the news here?

        Plus ça change...

    2. Flocke Kroes Silver badge

      Re: What was UMA architecture then?

      The key difference is not on the diagram. When a process on a CPU tries to access some memory, the address that the process selects is a virtual address (back then: a 32-bit number, now often a 64-bit number). The CPU tries to convert the virtual address into a physical address (a different number, sometimes a different size). There are several uses for this rather expensive conversion:

      Each process gets its own mapping from virtual to physical addresses - this makes it very difficult for one process to scribble all over the memory that belongs to a different process.

      The total amount of virtual memory can exceed the amount of physical memory. (Some virtual addresses get marked as a problem. When a process tries to access such a virtual address, the CPU signals this as a problem to the operating system. The operating system suspends the process, assigns a physical address for the virtual address, gets the required data from disk into that physical memory then restarts the process.)

      Sometimes it is just convenient - the mmap function makes a file on a disk look like some memory. If a process tries to read some of the mapped memory, the operating system ensures data from the file is there before the read instruction completes. If a process modifies the contents of mapped memory, the operating system ensures the changes occur to the file on the disk.
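
      A minimal sketch of that last point, using nothing newer than POSIX: the file appears as ordinary memory, reads fault pages in on demand, and a write lands back in the file. "example.dat" is just a made-up name for an existing file; msync merely forces the write-back to happen now rather than later.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("example.dat", O_RDWR);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            fstat(fd, &st);

            char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            p[0] = '!';                     /* modify "memory" ...              */
            msync(p, st.st_size, MS_SYNC);  /* ... and the file on disk changes */

            munmap(p, st.st_size);
            close(fd);
            return 0;
        }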

      In UMA, the CPU and the GPU access the same physical memory, but the GPU only understands physical addresses. When a process wants some work done by the GPU, it must ask the operating system to convert all the virtual addresses to physical addresses. This can go badly wrong because a neat block of virtual addresses could get mapped to a bunch of physical addresses scattered all over the memory map. Worse still, some of the virtual memory could map to files on a disk and not have a physical address at all. The two solutions are to have the operating system copy the scattered data into a neat block of contiguous physical addresses or for the process on the CPU to anticipate the problem and request that some virtual addresses map to a neat contiguous block of physical addresses before creating the data to go there.

      Plan B looks really good until you spot that the operating system might not have such a large block of physical memory unassigned. It would have to create one by suspending the processes that use a block of memory, copying the contents elsewhere, updating the virtual-to-physical maps and then resuming the suspended processes. It gets worse. That huge block of memory cannot be paged out if it is not being used, and the required contents might already be somewhere else in memory, so they will have to be copied into place instead of being mapped.

      All this hassle could be avoided if the GPU understood virtual addresses. That would cut down on the expensive copying (memory bandwidth limits the speed of many graphics-intensive tasks). The downside is that it adds to the burden of the address translation hardware, which already does a huge and complicated task so fast that many programmers do not even know it is there.
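
      For anyone who has never looked at it, the split that hardware performs on every single access is simple enough to show by hand (assuming the common 4kB page size); the expensive part, which the sketch below leaves out, is turning the virtual page number into a physical frame via the page tables and TLB.

        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_SHIFT 12                  /* 4kB = 2^12 bytes */
        #define PAGE_SIZE  (1u << PAGE_SHIFT)

        int main(void)
        {
            uint64_t vaddr  = 0x7f3a1c2d5e84ull;           /* arbitrary example   */
            uint64_t vpage  = vaddr >> PAGE_SHIFT;         /* virtual page number */
            uint64_t offset = vaddr & (PAGE_SIZE - 1);     /* offset within page  */

            printf("virtual page %#llx, offset %#llx\n",
                   (unsigned long long)vpage, (unsigned long long)offset);
            return 0;
        }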

      1. oldcoder

        And don't forget the MASSIVE security failure.

        Having a user application being run by the GPU that bypasses any memory constraints...

        1. Bronek Kozicki
          Thumb Down

          Re: And don't forget the MASSIVE security failure.

          and the "bypass" part came from ... ?

          1. Destroy All Monsters Silver badge
            Pint

            Re: And don't forget the MASSIVE security failure.

            Anyone who remembers the Guru Meditation, Press Left Mouse Button To Continue?

            1. Trib

              Re: And don't forget the MASSIVE security failure.

              I read this article and first thing that came to mind was 'Fat Agnus'!

              1. Destroy All Monsters Silver badge
                Trollface

                Re: And don't forget the MASSIVE security failure.

                I read this article and first thing that came to mind was 'Fat Agnus'!

                This has nothing to do with Gabe Newell!

      2. Anonymous Coward
        Anonymous Coward

        Re: What was UMA architecture then?

        Thanks @Flocke Kroes for answering my question! Don't know why you were given a down vote for your answer!

      3. @Yehuda Ben Yehudin.
        Megaphone

        Re: What was UMA architecture then?

        Right, so that's why the Cell SPE did understand virtual system memory addresses and used those to access system memory. How else can you chase pointers, ensure security, etc., in a reasonable, portable and high-performance way?

        Cell showed that that architecture could be used even for things such as intensive pointer chasing in garbage collection. (Check out their cool Cell GC work in the VEE conference!)

      4. borje

        Re: What was UMA architecture then?

        To work well and with good performance, the correct page size as well as TLB structure is vital.

        x86 systems today work mostly with 4kB pages (there might be a handful of TLB entries that can be used for huge (2MB) pages). Dividing a main memory of maybe 100+ GB into 4kB pages is a huge overhead, and it will be even worse with a combination of CPUs and GPUs.

        A 4kB page size and 1024 TLB entries mean that you can only cover 4MB of virtual memory before you need to start replacing TLB entries (i.e. reading the virtual-to-physical translation from memory before you can access the memory itself, which doubles the number of memory transactions).
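
        On Linux the workaround already exists, provided the administrator has reserved huge pages (e.g. via /proc/sys/vm/nr_hugepages); the sketch below asks for an anonymous mapping backed by the default huge page size (usually 2MB on x86). Each 2MB page covers as much address space as 512 of the 4kB kind, so TLB reach goes up accordingly.

          #define _GNU_SOURCE
          #include <stdio.h>
          #include <sys/mman.h>

          #define SIZE (64UL * 1024 * 1024)   /* 64MB = 32 huge pages, or 16384 small ones */

          int main(void)
          {
              void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
              if (p == MAP_FAILED) {
                  perror("mmap(MAP_HUGETLB)");   /* likely: no huge pages reserved */
                  return 1;
              }
              printf("64MB mapped with huge pages at %p\n", p);
              munmap(p, SIZE);
              return 0;
          }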

        SPARC and POWER today support much larger page sizes (1GB and beyond), and this is something that needs to be done on x86 too.

        1. Anonymous Coward
          Anonymous Coward

          Re: What was UMA architecture then?

          x86 already supports 2MB huge pages (with PAE or in long mode), 4MB huge pages (32-bit without PAE) and even 1GB pages on CPUs with the pdpe1gb feature.

  5. Eddy Ito
    Thumb Up

    Interesting

    So who is interested in the tie-in with SeaMicro tech? Oh, it's going to be in a cluster all right, and even though I don't know what I'd do with it, I still want one.

  6. Rampant Spaniel

    So what I really want to know is: will this actually result in a decent improvement in performance using Photoshop / Lightroom / Premiere Pro? The Mercury engine in Premiere Pro was a nice improvement, and SSDs helped with the other two a little, but these year-on-year ~10% performance jumps from Intel, with AMD struggling to keep up, are getting old. If you want me to drop a few thousand on a new workstation, cough up a decent performance jump. If they turn round with a 16-core APU with 200% of the performance of a 4770 that doesn't require its own power station or AC unit, I shall be suitably humbled and reach for my wallet.

    If this comes out with a mild performance bump, on lithography two steps behind Intel, then honestly I will be disappointed. I would love for AMD to knock it out of the park; they are the only thing that keeps Intel vaguely awake in the desktop space.

    1. Anonymous Coward
      Anonymous Coward

      It could have an impact if the software is written to take advantage of it.

      What concerns me, though: I recall a WebGL exploit a couple of years ago that could extract pieces of video RAM. Admittedly that was nearly two years ago and a lot has changed since then; however, that isn't to say the same kind of vulnerability can't exist in future software.

      What makes this kind of vulnerability dangerous here though is that this sort of architecture potentially opens up your entire system memory to attack via the same vector, since video RAM is essentially your main system RAM.

      The idea is not new though... the SGI O2 had a similar design, as did a lot of late-'90s desktop boards with integrated video devices.

      1. h4rm0ny

        What concerns me, though: I recall a WebGL exploit a couple of years ago that could extract pieces of video RAM. Admittedly that was nearly two years ago and a lot has changed since then; however, that isn't to say the same kind of vulnerability can't exist in future software.

        Issues of this nature were given by Microsoft as the reason they hadn't implemented WebGL in IE for such a long time.

      2. Lennart Sorensen
        Thumb Up

        I believe most of the SGI machines had all of the video card's memory mapped into the CPU memory space, so everything could access everything else.

        Of course, video cards used to have their memory mapped into the PC's memory space; there wasn't as much acceleration then, so giving the CPU a fast way to write updates to the video card made sense. Once we got 3D chips with hundreds of MB of RAM, the 32-bit memory space started getting a bit tight and they stopped doing that for all the memory. No reason a 64-bit machine couldn't allow everything to be mapped into one memory space though, unless you still want to support running 32-bit software.

  7. Crisp

    Things that make you go hUMA...

    This is interesting stuff. It'll be more interesting to see it in practice. Especially as more and more developers have started using the GPU for parallel tasks.

  8. Mips
    Childcatcher

    Acronyms

    AMD need to take care; they could finish up with an ultra-religious faction somewhere in there. Homeland Security will be watching.

    1. Destroy All Monsters Silver badge
      Big Brother

      Anticitizen#1

      DHS is too busy burning through all the beeellions of dosh setting up random Overwatch Checkpoints and training at the range with up to 2 bullets fired for each US citizen yearly.

      1. Palf

        Re: Anticitizen#1

        American logic: learning to shoot things is cheaper than providing its citizenry with health care.

  9. Matt Bryant Silver badge
    Meh

    Bittiness?

    OK, not to get too down on this, but what about memory bandwidth? Maybe an extreme example, but the nVidia Quadro 6000 in my workstation uses GDDR5 memory with a bus width of 384 bits and has a 128-bit graphics engine, whereas its Xeon CPU is 64-bit and uses 64-bit-wide DDR3 SDRAM - two sets of completely different code have to run on each because the architectures are so different. Now, unless AMD are saying they're going to bump up their CPU cores to 128-bit designs with much wider and faster built-in memory controllers, the graphics engine actually has to accept CPU-specified memory that will be painfully slow compared to the discrete memory on a stand-alone graphics card. All in all, it may be great for tablets or handhelds, but not for PCs.

    1. Richard 12 Silver badge

      Re: Bittiness?

      It's for their "APU"s, which are CPUs with an on-chip (possibly on-die?) GPU.

      So their GPU is already using the same physical memory bus and memory hardware as the CPU.

      This isn't for discrete GPUs.

      Looking at the list of partners, seeing ARM is very, very interesting - GPGPU in a Cortex A* is already very cool, and this would not only add go-faster stripes but also severely reduce the CPU power needed.

      Anybody for 2-big.2-little.loads-of-titchy?

  10. Anonymous Coward
    Go

    If both the CPU and GPU can address a 64-bit virtual address range, then why limit the slow store to SSD? Very large data objects could be paged out to the cloud, as cloud storage is good.

    1. asdf
      Stop

      >Very large data objects could be paged out to the cloud, as cloud storage is good.

      Looks like you picked the wrong icon. You really should have picked the Joke Alert icon for that post.

      1. Rampant Spaniel

        No, the cloud would be ideal, as long as you have a decent LTE connection! You could even use Azure to piss off Eadon.

        1. Eddy Ito

          No, I'm afraid that would actually make Eadon happy, perhaps to the point of climax. I mean, just the thought of copy-pasting "FAIL" fifty times might do it for him. Unless he's actually Steve Ballmer doing a parody of a barking mad penguinista, in which case climax is a given.

        2. asdf

          wow

          I thought it was just somebody trolling, but where to begin with the cloud idea? Umm, performance issues, security issues, availability issues, etc. For most employers, very large data objects contain valuable proprietary data that is best kept in-house, if for no other reason.

          1. Rampant Spaniel

            Re: wow

            Seriously? I had thought the LTE might have made it clear to all but the most uphill thinkers. :-)

  11. BornToWin

    AMD is definitely forward thinking

    This development has been in process since 2006 and it's finally starting to come to fruition. It's great to see AMD leading the next big PC performance improvement. They may have had their troubles over the years, no thanks to Intel's illegal practices, but at least AMD continues to deliver the best value products for consumers.

    I personally will be buying a Kaveri-powered laptop as soon as they are available for purchase. Kaveri should offer a dramatic improvement in APU performance, allowing AMD to then offer mid-range and high-end desktop solutions that equal current and future discrete CPU/GPU systems, but at a lower cost and with lower power consumption. It's all good for consumers.

  12. FSM

    APU

    I wish they'd cut out the 'it must have a cool name' bullshit and just call it a Heterogeneous Processing Unit. Sounds good and is based on fact.

    Very encouraging news nonetheless.

  13. 0_Flybert_0

    All I'm wondering is ..

    what do Intel and nVidia have up their sleeves for CPU/GPU shared memory that will beat AMD out of the gate by a year to 18 months on a smaller process .. say 14nm ..

    * HSA Foundation, of which AMD is merely one of many members along with fellow cofounders ARM, Imagination Technologies, Samsung, Texas Instruments, Qualcomm, and MediaTek.*

    Notable lack of Intel .. nVidia .. you'd think Google would be interested in the tech as well ..

    down votes expected .. <sigh>

  14. Shagbag
    Gimp

    Does this mark the death of x86?

    I don't know much about the difference between x86 and ARM, but will this stuff level the playing field between them on computational efficiency?

  15. CastorAcer
    Thumb Up

    Everyone is focussed on the fact that hUMA allows GPU access to DDR3...

    What I'm intrigued to know is whether this would support (using virtual addressing) CPU / GPU access to shared GDDR5 and DDR3 memory pools. That capability would get very interesting...

  16. Sir_bobbyuk
    Paris Hilton

    A replacement for the x86 architecture altogether

    Aren't any of the big chip companies doing a new architecture to replace x86, or would that cause a slight problem?

  17. This post has been deleted by its author

  18. NIghtFlight

    The Commodore Amiga line of computers had unified memory way back in 1985. Such hardware was way ahead of its time.

    I sometimes wonder how Amiga came to fruition, because there was practically nothing else that could touch it:

    . 16/32-bit CPU (Motorola 68xxx series and later IBM PowerPC accelerators)

    . Preemptive multi-tasking built into the operating system (Exec)

    . Custom chipsets for video, audio and input/output

    . WIMP interface.

    . Programs running in their own screens, each with their own resolution.

    . Video-friendly timing, hence no RGB monitor required. Hugely popular with cable networks.

    . Massive array of public-domain software

    I owned several Amigas in addition to x86 hardware, and from a hardware viewpoint the Amiga was in a class of its own, even compared to a Mac or Atari ST. The thing could even run Mac software through emulation, at a fraction of the cost of a real Mac. The biggest problems I faced were the cost of upgrades, along with the public perception (due to marketing) of the Amiga as a kid's game machine rather than a creative powerhouse.

    Sorry if this post was a tad off-topic. I have fond memories of this machine and the things it could do, as well as the great community who stood by it through thick and thin. Seeing innovative products like this from AMD reminds me a little of those days.

    1. Destroy All Monsters Silver badge
      Pint

      Less "unified memory" than 1 CPU (with no memory-management unit, hence the need to save often, save early and the kludge of patching up executables when loading them into memory) plus off-CPU video hardware that could access the lower 512K with DMA.

      All very nice, but not exactly on the level with 2013.

      The problem is that looking back enhances things. Do you remember the beautiful Workbench? Then you look at it for real and you know ... it was nice then, but it sure ain't now.

