Microsoft 'Catapults' geriatric Moore's Law from CERTAIN DEATH

Microsoft has found a way to massively increase the compute capabilities of its data centers, despite the fact that Moore's Law is wheezing towards its inevitable demise. In a paper to be presented this week at the International Symposium on Computer Architecture (ISCA), titled A Reconfigurable Fabric for Accelerating Large- …

COMMENTS

This topic is closed for new posts.
  1. This post has been deleted by its author

  2. Michael H.F. Wilkinson Silver badge

    Interesting stuff

    FPGAs do offer interesting ways to extend the power of computers, especially for SIMD type situations (and there are loads of these). I will see if our library has that paper available.

    I thought Moore's law was not so much an assertion as an observation. I gather he observed a trend in the data so far, and suggested the exponential growth might go on for a while yet. I do not think he envisaged it lasting as long as it has.

    I do tire of people who still suggest Moore's Law will come to the rescue of their pathetically slow algorithms. I always like to point out that even if Moore's Law continues unabated, the amount of data their quadratic, cubic, or even exponential complexity algorithms can handle will not grow in the same way. Instead, the tractable amount of data will grow by only the square or cube root of two for each doubling, and in the exponential case you can add one data item per doubling of speed.
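
    A quick back-of-envelope sketch of that arithmetic (illustrative Go, nothing more):

    package main

    import (
        "fmt"
        "math"
    )

    // If compute capacity doubles, finishing T(n) work in the same wall-clock
    // time means the tractable input size n grows by only a root of two for
    // polynomial algorithms, and by a single item for exponential ones.
    func main() {
        fmt.Printf("O(n^2): n grows %.2fx per doubling\n", math.Pow(2, 1.0/2)) // ~1.41x
        fmt.Printf("O(n^3): n grows %.2fx per doubling\n", math.Pow(2, 1.0/3)) // ~1.26x
        fmt.Println("O(2^n): n grows by just one item per doubling")
    }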

    The end of Moore's Law might put an end to this form of sloppy thinking.

    1. Anonymous Coward
      Anonymous Coward

      Re: Interesting stuff

      >>FPGAs do offer interesting ways to extend the power of computers, especially for SIMD type situations

      I'd say kind of the opposite. FPGAs excel when there's not an efficient way to express a function with typical computer instructions, regardless of how much data is being processed. If anything it would be MI?D instead of SIMD... "Multiple Instruction, __ Data".

      1. Michael H.F. Wilkinson Silver badge

        Re: Interesting stuff

        Maybe SIMD is not the best turn of phrase. As I see it, FPGAs could work well if the processing steps are fairly predetermined. I do not see FPGAs working on the data-driven processing order required for my kind of image and volume filtering work, but then I haven't got much experience with FPGAs, so maybe I am wrong.

  3. Hero Protagonist

    A troupe of boffins?

    Sounds rather undignified, we need a better collective noun than "troupe".

    1. AceRimmer

      Re: A troupe of boffins?

      A Toupée of boffins?

      1. ratfox

        Re: A troupe of boffins?

        A school of boffins? A pod of boffins?

        1. Anonymous Coward
          Anonymous Coward

          Re: A troupe of boffins?

          Surely a shed of boffins...

        2. wowfood

          Re: A troupe of boffins?

          I am personally a fan of a pod of boffins, since we apparently work in pods where I am.

    2. Frumious Bandersnatch

      Re: A troupe of boffins?

      Undignified? But some of the best boffinry comes from monkeying around ...

      1. John Gamble
        Headmaster

        Re: A troupe of boffins?

        "Undignified? But some of the best boffinry comes from monkeying around ..."

        An indignity of boffins. There you go then.

        1. Anonymous Coward
          Anonymous Coward

          Re: Indignity of boffins

          Nah. If we're talking monkeys, it has to be a flange.

  4. JeffyPoooh
    Pint

    Here's a Lesson Learned (from SDR) for anyone going down this road...

    There's an old Rule of Thumb that is deadly-wrong. "Allow 50% of the capacity for growth."

    FAIL. Stupid stupid stupid.

    The very next generation of requirements will require about 10x the processing capabilities. So your 50% spare capacity barely makes it through the maintenance cycles.

    Lie: "It's reprogrammable so you won't be buying another one for decades."

    Fact: They go obsolete as fast as the old way of doing business (CPUs). Faster in fact, because nobody can be bothered to reprogram them.

    Decision makers beware.

    1. Michael H.F. Wilkinson Silver badge

      Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

      Absolutely true. Reprogramming FPGAs is a bottleneck for many. A software platform which could ease that pain would help, but I suspect that might be as difficult as compilers which "automatically" recognize how to parallelise code. I have seen examples of the latter which handled quite a few situations admirably, but feed them e.g. a queue-based algorithm and they are stuck. Many challenging problems require real originality.
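
      To make the contrast concrete, here is a minimal sketch (illustrative Go, with hypothetical workloads) of the two shapes of loop:

      package main

      import "fmt"

      // The first loop has independent iterations - the shape that
      // auto-parallelising compilers can recognise and split up. The second
      // is queue-driven: the work list grows as it is consumed and each step
      // depends on what earlier steps did, so there is no fixed iteration
      // space for a compiler to carve into parallel chunks.
      func main() {
          data := make([]int, 1000)

          // Independent iterations: trivially parallelisable.
          for i := range data {
              data[i] = i * i
          }

          // Queue-based traversal (flood-fill style): resists auto-parallelisation.
          queue := []int{0}
          visited := make(map[int]bool)
          for len(queue) > 0 {
              n := queue[0]
              queue = queue[1:]
              if n >= len(data) || visited[n] {
                  continue
              }
              visited[n] = true
              data[n] = -data[n]              // process this element
              queue = append(queue, n+3, n+7) // successors discovered at run time
          }
          fmt.Println(len(visited), "elements reached")
      }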

      1. Vic

        Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

        A software platform which could ease that pain would help, but I suspect that might be as difficult as compilers which "automatically" recognize how to parallelise code

        Kinda sorta...

        A big part of the problem of FPGA development is that you're not just programming someone else's core, you're actually doing hardware layout. That means that, beyond getting your design algorithmically correct, you also need to lay out the design such that it conforms to timing constraints. This is not trivial as you approach the capability of the device.

        The way this seems to be done is to perform multiple layouts (in parallel if possible), and then to compare each to see which fits the constraints properly. Altera's tool for this is called Design Space Explorer (DSE), Xilinx's is called SmartXPlorer.
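
        In software terms that seed sweep looks something like this (an illustrative Go sketch; placeAndRoute is a made-up stand-in for a tool run, not any vendor's API):

        package main

        import (
            "fmt"
            "math/rand"
        )

        // Kick off several layout attempts in parallel and keep whichever one
        // meets timing best. Real runs burn hours of compute; this stub just
        // fakes a deterministic timing slack per seed.
        func placeAndRoute(seed int) (slackPs int) {
            r := rand.New(rand.NewSource(int64(seed)))
            return r.Intn(400) - 200 // pretend slack in picoseconds; >= 0 means timing met
        }

        func main() {
            type result struct{ seed, slack int }
            results := make(chan result)

            for seed := 1; seed <= 8; seed++ {
                go func(s int) { results <- result{s, placeAndRoute(s)} }(seed)
            }

            best := result{slack: -1 << 31}
            for i := 0; i < 8; i++ {
                if r := <-results; r.slack > best.slack {
                    best = r
                }
            }
            fmt.Printf("best run: seed %d, slack %dps\n", best.seed, best.slack)
            if best.slack < 0 {
                fmt.Println("no run met timing; try more seeds or re-floorplan")
            }
        }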

        I've spent much of the last 3 years implementing & maintaining a Grid Engine cluster specifically to run this sort of tool. It's entirely possible to do, but the complexity of pushing out a new FPGA design should not be underestimated. There's *lots*[1] of compute time burnt in each iteration...

        Vic.

        [1] Especially as the Linux port of some of these tools seems not to be as effective as it might be. Now I'm out of contract on that job, I might have to talk to the vendors to see if I can't get their tools running a little quicker :-)

        1. Charles Manning

          Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

          " That means that, beyond getting your design algorithmically correct, you also need to layout the design such that it conforms to timing constraints. "

          Not exactly...

          You don't do the layout yourself. The tools do that. You just provide the timing constraints and it is the job of the tool to do the floor planning.

          Of course you need to write the HDL code in a way that timing **can** be met.

          For example, instead of having complex blocks of logic that take a long time to crunch, you chop these up into lots of small steps and use pipelining.

          i.e.

          X := A + B + C + D

          might become

          Step 1:

          X1 := A + B

          X2 := C + D

          Step 2:

          X := X1 + X2

          Now each bit of processing is smaller and can therefore be processed within a clock cycle.

          Something like a multiply and accumulate might end up taking many (6-10) pipeline steps, but at least it is FAST.
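
          The same idea can be modelled in software as stages joined by queues (an illustrative Go sketch, not HDL; each channel stands in for the register between two pipeline stages):

          package main

          import "fmt"

          func main() {
              type quad struct{ a, b, c, d int }
              in := make(chan quad)
              mid := make(chan [2]int)
              out := make(chan int)

              // Stage 1: two narrow additions per "cycle".
              go func() {
                  for q := range in {
                      mid <- [2]int{q.a + q.b, q.c + q.d}
                  }
                  close(mid)
              }()

              // Stage 2: combine the partial sums.
              go func() {
                  for p := range mid {
                      out <- p[0] + p[1]
                  }
                  close(out)
              }()

              // Feed a new operand set every cycle; once the pipe is full, one
              // result emerges per cycle even though each stage does little work.
              go func() {
                  for i := 1; i <= 3; i++ {
                      in <- quad{i, i, i, i}
                  }
                  close(in)
              }()

              for x := range out {
                  fmt.Println(x) // 4, 8, 12
              }
          }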

          1. Vic

            Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

            You don't do the layout yourself. The tools do that. You just provide the timing constraints and it is the job of the tool to do the floor planning.

            If you read the rest of my post, you'll see I talk about the tools.

            But floorplanning - you tend to do that yourself for non-trivial designs, as it makes a huge difference both to the execution time of the tool and to the probability of any run actually meeting timing constraints.

            SmartXPlorer and DSE still take a shitload of time to run...

            Vic.

      2. Anonymous Coward
        Anonymous Coward

        Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

        "A software platform which could ease that pain would help"

        I take it you're aware of neither Handel C at one end, nor National Instruments' LabVIEW software-through-pictures-meets-FPGA (http://www.ni.com/fpga/) at the other? And other variants on the theme?

        Previous incarnations of Handel C were associated with Oxford University Computing Lab, and Embedded Solutions Limited (later Celoxica).

        Attempts to market the Handel C product to engineering and scientific markets seem to have been relatively disappointing; last time I looked they were aiming at the City instead, specifically at High Frequency Trading, where pockets are deep and every nanosecond counts. But that didn't seem to work all that well either, and the remains of the company and technology were bought in 2009 by Mentor Graphics, the eCAD company turned mini-conglomerate.

        Afaict both products work technically, but the solution space where they are relevant is not very big (even in comparison with the solution space where graphics card computing is relevant).

        That said, where are all these FPGAs going to get all their data from? If it doesn't fit in on-FPGA memory, then surely main memory speed (bandwidth? latency?) is the bottleneck, as it often is. FPGAs don't really change anything much in that respect, surely?

    2. Gordon 10

      Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

      Doesn't that suggest that the future of CPUs is in Software Defined Computation?

      i.e. future iterations of FPGAs or similar that can load new CPU models on the fly whenever a non-trivial set of operations is detected.

      1. Vic

        Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

        i.e. future iterations of FPGAs or similar that can load new CPU models on the fly whenever a non-trivial set of operations is detected.

        Already happening...

        Vic.

      2. oldcoder

        Re: Here's a Lesson Learned (from SDR) for anyone going down this road...

        Unfortunately, loading the FPGA is slow.

        Synchronization among FPGAs is slow.

        Saving context will be slow...

        The network described looks the same as the old Cray T3. And that beast took several seconds just to get ready to start the job. Granted, once set up it was quite fast. But job setup time took forever - as did shutdown times. A single checkpoint could take 15 minutes if all the processors were used by one job.

        And this doesn't look any different.

  5. Anonymous Coward
    Anonymous Coward

    Slow evolution?

    SRAM-style FPGAs have been slow to achieve their potential. Back in 1986 most programmers only understood how to handle data by serial processing on a CPU. A need to do processing in real time would always end up as "need a faster CPU" - rather than offloading some data handling tasks to FPGAs. Hardware designers mostly used FPGAs for quick prototyping of hardware logic - or to allow logic maintenance updates.

    In general the idea of reconfiguring the FPGA on the fly as part of the application was often ignored. I suspect that CGI companies have been doing work like this for many years.

    1. Pypes

      Re: Slow evolution?

      I thought the British micro scene in the 80's had a fetish for ULA's (rather than ASIC's), which were essentially basic write-once FPGAs.

      1. Vic

        Re: Slow evolution?

        ULA's (rather than ASIC's), which were essentially basic write-once FPGAs.

        No, you're thinking of PLDs.

        ULAs were logic blocks with the function defined by the metal layer; this allowed near-ASICs to be generated comparatively cheaply and with low lead time.

        PLDs are EPROM/EEPROM-configured, and are significantly smaller than FPGAs (and generally have much less granularity in function).

        Vic.

        1. TkH11

          Re: Slow evolution?

          ULA's WERE ASICS. ASICS - application specific integrated circuits - were/are mask programmable, which is precisely what ULA's were.

          1. Vic

            Re: Slow evolution?

            ULA's WERE ASICS. ASICS - application specific integrated circuits - were/are mask programmable, which is precisely what ULA's were.

            The term "ASIC" covers far more than just mask-programmable parts; full-ASIC development would determine the silicon layout underneath the metal layers, and that means a shedload more NRE.

            ULAs had a common silicon layout, with just the metal layer being application-specific. That's why I described them as "near-ASIC".

            Vic.

            1. TkH11

              Re: Slow evolution?

              I was working with ASICs - mask programmable mostly - in the 1990's. The design of the device was undertaken by the customer and a netlist of the logic circuit sent to the ASIC vendor, which then carried out pre-layout validation checks, design rule checks, pre-layout simulation, floor planning, place and route, and then post-layout simulations. The end result of this process was a database which was sent on to the fab plant, which simply created the final few metallisation masks.

              The difference between those ASICs in the 1990's and ULA's such as the Ferranti ULA used in the BBC microcomputer was that in the days of the ULA they didn't have the design tools and CAD software. Design and verification of the mask patterns was done by producing huge printouts and visually inspecting them.

              From a manufacturing perspective, I believe the ULA and mask programmable ASICs were the same.

              1. Vic

                Re: Slow evolution?

                I was working with ASICs - mask programmable mostly - in the 1990's.

                I was working with ASICs - full-custom mostly - in the 1990's.

                The design of the device was undertaken by the customer and a netlist of the logic circuit sent to the ASIC vendor, which then carried out pre-layout validation checks, design rule checks, pre-layout simulation, floor planning, place and route, and then post-layout simulations

                So you're doing floorplanning & place/route. That's way more than just configuring the metal layer. That's floorplanning the layout of the silicon beneath that metal layer. Which is exactly what I said...

                Take a look at Wikipedia's entry on full-custom ASIC. Feel free to do some more searching if you don't find that sufficiently authoritative.

                From a manufacturing perspective, I believe the ULA and mask programmable ASICs were the same.

                Nothing like it. ULAs were only customised when the metal layer was applied - long after the dice were sawn. Full-custom ASIC is a custom piece of silicon from start to finish.

                Vic.

                1. TkH11

                  Re: Slow evolution?

                  Vic,

                  >So you're doing floorplanning & place/route. That's way more than just configuring the metal layer.

                  No, this isn't right. With a semi-custom, mask-programmable ASIC, this isn't what is happening.

                  When you floor plan such an ASIC, you're laying down instructions to the place and route tools to constrain the placement of logic gates.

                  The end result is purely the creation of metallisation masks to do the final interconnect.

                  My credentials:

                  1) Spent two years at university studying full-custom and semi-custom IC design.

                  2) Made an integrated circuit chip in the lab, using photolithography.

                  3) Spent 5 years as an ASIC validation engineer at one of the world's leading ASIC vendors, laying out chips, running pre- and post-layout logic simulations, and doing automatic test pattern generation.

                2. TkH11

                  Re: Slow evolution?

                  Might have made a typing error. With full custom you are right: lower-level masks of the diffusion regions need to be produced, but this is not true for semi-custom.

                  Thanks for the reference to Wiki, I don't need to check. I know what my job was, did it for 5 years, I *am* the authority on it.

                  I had to hand route a 30,000 logic gate Gallium Arsenide device at 97% utilisation, when the routing software couldn't hack it.

                  1. Vic

                    Re: Slow evolution?

                    With full custom you are right: lower-level masks of the diffusion regions need to be produced, but this is not true for semi-custom

                    I wasn't talking about semi-custom...

                    Vic.

                    1. Anonymous Coward
                      Anonymous Coward

                      Re: Slow evolution?

                      There were two competing implementations of the FPGA concept circa 1985.

                      Altera had an FPGA whose user gate configuration was a one-off process. You burned the logic configuration via (IIRC) fusible links - and it was then non-volatile.

                      Xilinx had SRAM technology to control the way the FPGA gates were configured. The chip could be dynamically loaded off a serial EPROM - or via a serial line on an I/O port of a CPU. The FPGA could be dynamically reloaded for different configurations. One of their examples was reloading a comms FPGA between send and receive. The gate count was quite low - in the order of a couple of thousand gates.

                      My 1986 IBM PC ISA bus prototype board was wire-wrapped by hand. It used two Xilinx FPGAs and some quite expensive 15nsec SRAM. The rest of the board was filled with chips for bus and external interfaces. The FPGA logic was first configured to enable the SRAM to be loaded with customised data via the CPU.

                      Then the FPGAs were reloaded with their operational logic. This download bit stream was customised for each run by on-the-fly merging of pre-built bits of several different FPGA configuration files. The result was able to generate a very large number of preset bit patterns for various counters etc.

                      It clocked at 100ns and worked first time using only a home-made 0/1 TTL logic probe to test the wiring. The professional hardware engineers were duly impressed with Xilinx FPGA capabilities.

                      1. Anonymous Coward
                        Thumb Up

                        Re: Slow evolution?

                        Having done such work myself (and absolutely killer wire-wraps), you have my total respect. And that's the reason that the next, obvious, step is programming everything. (I'm surprised that some wag hasn't trotted out Software-Defined Hardware, unless I missed seeing it somewhere.) Rearranging connections, and therefore function, will work for a time to conceal the sloppiness of developers (yeah, right, the "art of computer programming") - but only for a time. Hopefully, by then, you'll be able to rearrange atoms (using nanotech) or living computing nodes, as required.

                        [I've been doing every damn IT thang - hardware/software/network/systems - since 1975 and have zero respect for "practitioners" in software today. BTW, software-defined biologicals should totally frighten anyone, as they'll give "bug" a whole new manifestation!]

            2. TkH11

              Re: Slow evolution?

              ASICs offered lower NRE than full-custom/semi-custom technologies.

              The NRE cost of the vendor's development effort in designing the ASIC device was distributed across all the dies made, across all customers that used that particular ASIC.

              When we talk about NRE, we are generally talking about the NRE to the customer that has created the design to be put onto the ASIC. It's important to distinguish the customer's NRE - the charge they pay to the ASIC vendor for the validation, layout and production of samples of the ASIC - from the development costs of the particular ASIC technology family, which are incurred by the ASIC vendor.

              As far as the customer is concerned, they don't care about the development costs of the technology incurred by the vendor; that cost is distributed across all chips that are made and manifests itself in both the NRE charge for ASIC validation and the individual price per chip the customer pays.

              The point about ASICs is that the dies, unmetallised, were already fabbed, and made in their thousands.

      2. TkH11

        Re: Slow evolution?

        ULA's were write-once, but there's another important distinction: they were mask programmable.

        Digital custom chips fall into two high-level categories: field programmable and mask programmable.

        With mask-programmable parts, the non-recurring engineering costs are much higher; the interconnect of the logic gates/transistors is done at manufacture time.

        Field-programmable chips are mass produced all the way through to packaging the die into the lead frame and final test. All the chips coming off the production line are the same.

        Customisation of the chip - putting the circuit design into it - is undertaken by the customer (and not the chip manufacturer).

  6. Destroy All Monsters Silver badge
    Paris Hilton

    Ah we are back,

    First came across this in BYTE Magazine in the early 90's, where it was used to speed up the prime sieve. I can't remember the company that produced the FPGA ... probably dead by now.

  7. itzman

    The next step will come from ...

    Analysing bloatware.

    And instead of tailoring chips to run it, tailoring the software not to need it.

    Bye bye, X-windows. You served us reasonably well...

    1. Anonymous Coward
      Anonymous Coward

      Re: The next step will come from ...

      Right, because the X server is the obvious bottleneck on a massively parallel backend system. Not.

      As for bloatware, yes, you have a point there. First of all, perhaps it's time to dump a lot of inefficient scripting languages and VMs, whether running Java-style languages (unless the JIT execution can really be proven to be faster than precompiled binaries for the task) or entire OSes, and go back to to-the-metal style coding using C/C++, and research even more improvements to compilers.

      Following that, strip down the OS itself - far too many server OSes run with code they don't need which, while it doesn't do anything useful, still occasionally gets pointlessly spun up and queried (e.g. drivers), or just has to be passed through (e.g. firewalls). This is the case for both Windows and Linux.

      Then reduce the number of daisy-chained API libraries and rewrite important code in a single library, with minimal code jumping and hence paging required.

      Once all that is done then we can start worrying about hardware.

  8. Anonymous Coward
    Anonymous Coward

    Netezza

    Isn't this pretty much what Netezza does?

    Disclaimer: I work for Big Blue, although this is a genuine question, not a rhetorical one.

  9. roselan

    BOC LIC

    Soooo

    Big Out-of-order Core + Little In-order Core + unified IO (ARM big.LITTLE)

    AMD HSA

    Intel Phi

    Nvidia CUDA (Kepler or whatever it is called now)

    God knows what Google's up to

    and now MS FPGA sauce

    They all look quite similar to me.

  10. Anonymous Coward
    Anonymous Coward

    Took long enough to get to the point

    The only interesting part of the article is WHAT Microsoft has programmed its FPGAs to do... nothing else is new. We have had FPGAs for decades. Putting FPGAs in a server farm is a no-brainer of a step that's hardly worth mentioning, as long as you have found a use for said FPGAs. All this stuff about Moore's Law and whatever is pointless word padding.

    Since feature detection is mentioned, I assume these FPGAs are assisting with image searches. That's nice, but I doubt that image searches take up that much of Microsoft's cloud compute power. Thus the benefits of this approach are limited. Not to say they shouldn't do it, but it's certainly not the extension of Moore's Law that the article promised.

  11. Charles Manning

    Microsoft research...

    Re-cutting the edge that has been mown by others for decades.

    FPGAs have been used for data processing applications for ages. They are particularly valuable for doing things like video processing.

    Pipelining means hundreds, or even thousands, of steps can be executed in parallel. I'm currently working on a project where we're applying a long sequence of manipulations to video pixels passing through at 150M pixels per second. As each pixel is having, say, 100 transformations applied, that's 15 Gpixel transformations happening per second. Far, far faster than anything that you can do on a CPU.

    Similar things can be done with networking, search, ...

  12. southpacificpom
    Coat

    Moore

    What was the MS marketing spiel several years ago?

    Do Moore with less...

  13. Kevin McMurtrie Silver badge
    WTF?

    "parallel programming is devilishly difficult"

    But FPGA programming is easy?

  14. maniacminer

    Transputer?

    Haven't we been here before, 30 years ago? Anyone remember Occam? Go back to the late 70s for Hoare's seminal paper on Communicating Sequential Processes, then on to Inmos and the Transputer in the early 80s. So many times has the performance of the computer seemed to reach a plateau, only to suddenly take off again. Is it a bluff this time? Or are we really going to have to re-write operating systems and applications to be genuinely parallel, scalable, resilient, portable and usable?

    Personally, I'm really looking forward to a massively parallel future of software defined silicon :D
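
    For anyone who never met Occam: the CSP model it exposed survives almost unchanged in Go, whose goroutines and channels descend from the same lineage. A minimal sketch, with rough Occam equivalents in the comments:

    package main

    import "fmt"

    // CSP in miniature: processes share nothing and communicate only over
    // typed channels.
    func main() {
        c := make(chan int)

        // Occam's PAR: run the sender as a concurrent process.
        go func() { c <- 6 * 7 }() // like "c ! 42" (output) in Occam

        x := <-c // like "c ? x" (input)
        fmt.Println(x)
    }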

    1. Vic

      Re: Transputer?

      Haven't we been here before, 30 years ago?

      I'm glad I'm not the only one who saw the Transputer link in that slide :-)

      Anyone remember Occam?

      The trouble with Occam is that many people really couldn't get their heads around it. Although it helps with implementing parallel designs, it doesn't do all the work for you - and some of the Occam floating around the world is, shall we say, "less than optimal".

      I don't have exact figures, but it was generally accepted within ST that more Transputers were sold after the name was dropped and 3 of the links were cut off - at that point, it became the ST20, and that forms the core of a significant number of STB designs throughout the world. You've probably got one in your living room.

      One of the big things that affected T4/ST20 popularity was the existence of the C compiler. People were much happier programming in C. If you run "strings" against the binary, you'll see that it's a C-to-Occam translator lying on top of the Occam compiler :-)

      Vic.

  15. Anonymous Coward
    Anonymous Coward

    Wait!

    Are you telling me Bing is running on more than one core!?

    I know, I know, I'm just troublemaking, but oh it's fun.

    1. Phil_Evans

      Re: Wait!

      Not at all, I would cite this as a portent of Bing as the 'fastest search engine that nobody uses'. But then most of the most elementary arguments in this thread leave me with a pained look, so a cheap jibe is all I could manage.

  16. Truth4u

    this will work great

    until someone goofs up the layout of the FPGAs and they start pumping hundreds of gigs of garbage into the CPUs.

  17. Stretch

    So they added a SPARC chip in effect. Hmm.
