More on that monster Cerebras AI chip, Xilinx touts 'world's largest' FPGA, and more

Now that the Hot Chips conference is over until next year, let's bring you up to speed quickly on developments from and around the Silicon Valley event.

The world's biggest chip is a pain in the butt to cool

A real crowd-pleaser this year was Cerebras, an AI hardware startup, which unveiled what is claimed to be the largest …

  1. StargateSg7

    60 GHz GaAs for 128-bits Wide for Signed and Unsigned Integer, Fixed Point and Floating Point numbers.

    1.2 PetaFLOPS per combined CPU/GPU/DSP Super-Server Chip!

    WE STILL WIN !!!

    Hooooooyaaaahhh !!!! We is STILL the BIGGEST 1000 POUND GORILLA of CPU Chip Making !!!

    Intel, AMD, IBM, Cerebras, Xilinx, etc, etc. Eat Yer Heart Out !!!!

    Wait until you see the upcoming 2 THz Version of the above chip with an added layer for a 64k by 64k core 128-bits wide SI/UI/FP/FXP/RGBA/YCbCrA/HSLA array processor component. It will be quite a bit over an ExaFLOP per chip on that one!

    WE STILL DA CPU KING up here in Vancouver!

    .

    Bwaaaaaaaggggghhhhhhhhhhhh

    .

    1. Charles 9
      Joke

      I suppose next you'll be telling us you're on the verge of a breakthrough by which your system will run into the exaflops per individual unit (instead of combined), will be multi-layered, put the secret Utah quantum computer to shame, prove/disprove P=NP, and create the formula for world peace while you're at it.

      1. StargateSg7

        Alrighty then..........!!!!

        Our mega-monster-size supercomputer is a 119 ExaFLOPS SUSTAINED machine using cpu dies running at 128 bits wide at 60 GHz GaAs for Signed and UnSigned Integer, Floating Point, Fixed Point numbers and RGBA/YCbCrA/HSLA pixel types processing. This is a CISC computing device. (i.e. general purpose combined CPU/GPU/DSP Complex Instruction Set Computing)

        The Vector/Array Processor dies are much much simpler and are RISC (Reduced Instruction Set Computing) devices and run at 2 THz (Two TeraHertz) as their cores are much smaller and simpler.

        We are just now combining the two devices into a single combined chip die where one CISC cpu side runs at 60 GHz and the other side is a 2 THz Vector/Array processor that has 65536 by 65536 mini-RISC-cores at 128-bits wide that work SIMULTANEOUS in a synchronized fashion for the data types of Int/FP/FXP/RGBA/YCbCrA/HSLA. This vector/array processor part of the chip has a series of named internal registers and SRAM-like caches assigned to each mini-core at the full 2 THz!

        The RISC array processor side does only very simple Integer and Real number tasks such as Add, Subtract, Multiply, Divide, Root, Square, Cube, Square Root, Nth Root, PowerOf, and then the bitwise tasks such as SHR, SHL, AND, OR, XOR, NOT, ROTATE BITS, SWAP BITS, REVERSE, MOVE, COPY bits AND a hardware-based 2D-XY up-to-16x16-value convolution filter and a 3D-XYZ 16x16x16 value convolution filter. There is no super-pipelining or advanced branch prediction or hyperthreading! Each core does only one task or operation at a time in serial that processes only from One to 256 data values (i.e. an up to 16x16 2D-XY convolution filter). However, each set of cores in a processing block is synchronized with its neighbouring mini-cores, ensuring that ALL data values get processed and finished at the same time! This is somewhat similar to the SIMD-like (Single Instruction, Multiple Data) vector instructions used on common GPUs.
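For anyone wondering what a small 2D-XY convolution filter of that kind boils down to, here is a minimal software sketch. It is plain Python on nested lists, not anything from the poster's hardware; the function name `convolve2d`, the sample `image` and the 2x2 `box` kernel are all made up for illustration. It uses the sliding-window (correlation) form, so the kernel is not flipped, and it only enforces the 16x16 size limit quoted in the post.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image ("valid" region only, kernel not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    if kh > 16 or kw > 16:
        # The post describes filters of up to 16x16 values.
        raise ValueError("kernel exceeds the 16x16 limit described in the post")
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            acc = 0
            # Multiply-accumulate over every kernel position.
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 3x3 input and 2x2 box kernel, purely for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1],
       [1, 1]]
result = convolve2d(image, box)
```

Each output value depends only on its own window, which is why this kind of operation maps naturally onto many simple cores working in lockstep.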

        This part of the die runs at the FULL Two Terahertz on the data registers and convolution data AND accesses the shared memory cache ALSO at 2 THz by temporarily locking the shared sram-like cache block and setting the bits in that locked memory block at the full 2 THz clock rate. The CISC side cannot use that memory block until the RISC vector/array processor unlocks it. And when the RISC side unlocks the shared memory block, the CISC unit can lock and access the data block at its own internal 60 GHz clock rate.

        There is a variable speed cross-bridge where each data processing die puts its final data result from each side's own internal cache memory and data registers into a larger SHARED RAM memory cache at their OWN internal clock speed (i.e. 60 GHz or 2 THz)

        BOTH sides can simultaneously access different portions of the shared memory cache using a lock/unlock memory block semaphore infrastructure at their own clock speed. That on-chip shared memory cache size is in the terabytes range!
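The lock/unlock scheme described above is the familiar per-block mutex pattern. A minimal sketch in Python, assuming one lock per memory block: the `SharedCache` class, the block sizes and the two writer threads standing in for the "RISC" and "CISC" sides are hypothetical, and Python threads obviously say nothing about clock rates.

```python
import threading


class SharedCache:
    """Toy shared memory split into blocks, each guarded by its own lock."""

    def __init__(self, num_blocks, block_size):
        self.blocks = [[0] * block_size for _ in range(num_blocks)]
        self.locks = [threading.Lock() for _ in range(num_blocks)]

    def write_block(self, index, value):
        # Hold the block's lock for the whole update so the other side
        # cannot touch this block until it is released.
        with self.locks[index]:
            block = self.blocks[index]
            for i in range(len(block)):
                block[i] = value

    def read_block(self, index):
        with self.locks[index]:
            return list(self.blocks[index])


cache = SharedCache(num_blocks=2, block_size=4)

# Two writers working on *different* blocks never wait on each other.
risc_side = threading.Thread(target=cache.write_block, args=(0, 7))
cisc_side = threading.Thread(target=cache.write_block, args=(1, 3))
risc_side.start()
cisc_side.start()
risc_side.join()
cisc_side.join()

block0 = cache.read_block(0)
block1 = cache.read_block(1)
```

With one lock per block, contention only arises when both sides want the same block, which matches the "different portions of the shared memory cache" wording.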

        We ALSO added a Vector instruction set that assigns the array processor side as a single 64k by 64k cores synchronous processing block, or as four 32k by 32k cores processing blocks, or as sixteen 16k by 16k cores processing blocks, down to many 1k by 1k cores processing blocks which can be assigned to separate tasks, BUT each PROCESSING BLOCK of multiple cores runs all of its cores simultaneously.
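The block counts quoted above follow from simple division of the 65536 by 65536 array. A quick sketch that checks them, assuming square blocks that tile the array exactly; `block_count` and `layouts` are names invented here.

```python
ARRAY_SIDE = 65536  # "64k by 64k" mini-cores, as stated in the post


def block_count(block_side, array_side=ARRAY_SIDE):
    """Number of square blocks of the given side that tile the whole array."""
    if array_side % block_side != 0:
        raise ValueError("block side must divide the array side evenly")
    per_axis = array_side // block_side
    return per_axis * per_axis


# Block sides mentioned in the post: 64k, 32k, 16k and 1k.
layouts = {side: block_count(side) for side in (65536, 32768, 16384, 1024)}
```

So "many 1k by 1k blocks" would mean 4096 of them if the whole array were split that finely.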

        This makes for easy-to-create and manage synchronized video/audio/DSP processing tasks that require common array lengths of common data types to have simple math operations done on all specified values in an array ALL at ONCE! The synchronization can be such that ONLY when a block-based processing task has finished putting ALL its results in the final results output array, will another processing block access those results as inputs for its own processing task. This makes it EASY to create multiple layers of audio/video filters and effects that finish processing an ENTIRE block of data in a KNOWN amount of time that is in the mere nanoseconds range! This allows for syncing and playback/recording at common video frame rates and/or audio sample rates even when multiple filters and effects are applied to each video frame/audio sample set or applied to multiple groups of video frames/audio sample sets!

        Initial testing has shown the COMBINED processing power is 1.2 PetaFLOPS per chip which means I only need 167 of the combined CISC/RISC cpu dies to equal the 200 PetaFlops of the SUMMIT supercomputer! Right now we have a 119 ExaFLOPS monster which has a SEPARATE rack system for the 60 GHz CISC CPU's and a separate rack system for the 2 THz RISC-based Vector/Array processors.
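The chip count above is easy to check. A small sketch using only the figures quoted in the post (1.2 PetaFLOPS per chip, 200 PetaFLOPS for SUMMIT, 119 ExaFLOPS for the existing machine); the constant names are made up, and applying the per-chip figure to the 119 ExaFLOPS system is purely an assumption for scale, since the post says that machine uses separate racks.

```python
import math

PFLOPS_PER_CHIP = 1.2      # claimed combined CISC/RISC throughput per chip
SUMMIT_PFLOPS = 200        # SUMMIT figure quoted in the post
EXISTING_PFLOPS = 119_000  # 119 ExaFLOPS expressed in PetaFLOPS

# Round up: a fraction of a chip is still a whole chip.
chips_for_summit = math.ceil(SUMMIT_PFLOPS / PFLOPS_PER_CHIP)

# Assumption for scale only: how many 1.2 PFLOPS chips would 119 EFLOPS take?
chips_for_existing = math.ceil(EXISTING_PFLOPS / PFLOPS_PER_CHIP)
```

The first figure comes out at 167, matching the post; the second lands just under 100,000 chips.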

        Now we are COMBINING each chip type onto a single die and EMBEDDING thermal transfer fluid microchannel-based cooling INTO the die itself for maximum heat-wicking capability. We are ALSO embedding multiple Dense Wave Optical Interface ports right onto the die so that each combined chip has DIRECT access to neighbouring CPU chips AND there are multiple pass-through optical transfer lanes so that backbone-type networking is no longer needed and we can organize the resulting supercomputer very much like the human brain as a cross-linked-to-nearest-neighbour-chips optical network topology.

        This ALSO MEANS there is no more of a rat's nest of cables, since we organize each motherboard as processing units of 8 x 8 combined CPU/GPU/DSP/Vector chips that have the optical pathways etched right into the motherboard, which cross-links ALL 64 CPU's on each motherboard together much like neurons AND allows for a higher-level board-to-board cross-link using short dense-wave fibre cables for board-to-board communications to their nearest motherboard neighbours.

        Since each chip has its own on-chip terabytes-sized cache/working memory and has access to a SHARED on-motherboard battery-backed system very-large-sized RAM block, that means each CPU chip and motherboard have their OWN RAM-based storage media for ultimate data storage speed! Only when data is finally finished processing on each single-chip and/or via the group-based 8-by-8-chip-shared-motherboard processing, does the results data get transferred out through the bypass/pass-through optical network lanes to the cheaper and/or slower larger external SSD storage arrays.

        With the etching times needed for such wide-trace-lines of the GaAs substrate process (i.e. a minimum of 280 nm wide circuit lines!) we are looking at a 10+ day window to etch all the traces on each combined CPU chip using a multi-beam etcher. BUT since we now have more than a few thousand of those etchers, we can do around 30 thousand such chips a month. By late 2020 we will have the world's FIRST ZettaFLOP supercomputer!
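The production figures above can be run through a back-of-envelope check. This sketch takes the post's numbers at face value (10 days per chip per etcher, 30 thousand chips a month, 1.2 PetaFLOPS per chip) and assumes a 30-day month and one chip per etcher at a time; none of the variable names come from the post.

```python
import math

PFLOPS_PER_CHIP = 1.2
ZETTAFLOP_IN_PFLOPS = 1_000_000   # 1 ZettaFLOPS = 10^6 PetaFLOPS
CHIPS_PER_MONTH = 30_000          # stated output
DAYS_PER_CHIP = 10                # stated minimum etch time per chip
DAYS_PER_MONTH = 30               # assumption

# Chips needed to reach one ZettaFLOPS at the claimed per-chip rate.
chips_needed = math.ceil(ZETTAFLOP_IN_PFLOPS / PFLOPS_PER_CHIP)

# Months of output at the stated production rate.
months_of_output = chips_needed / CHIPS_PER_MONTH

# Etchers implied by the stated rate, at one chip per etcher at a time.
chips_per_etcher_per_month = DAYS_PER_MONTH / DAYS_PER_CHIP
etchers_implied = math.ceil(CHIPS_PER_MONTH / chips_per_etcher_per_month)
```

On those assumptions a ZettaFLOP takes roughly 833 thousand chips, or a little under 28 months of output, and 30 thousand chips a month implies about ten thousand etchers running flat out.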

        AND since the first 119 ExaFLOPS supercomputer is already human+ equivalence in terms of general intelligence because it's running a physics-based molecular/electrical functional simulation of the K/Na/P/etc gating done in human neurons, a ZettaFLOP supercomputer would allow us to model ALL electrical gating of all human neural tissue, so we will LIKELY get a self-evolving super-intelligence (200+ IQ) within a few weeks of its initial training/teaching!

        .

        Hooooooooyaaaaahhhhh !!!!

        .

        Bring on the super-CPU's bay-beeeee!

        .

        CAN YOU DIG IT ????

        .

        1. Charles 9
          Trollface

          ...

          No.

          You still didn't solve P=NP or world peace yet.

          1. StargateSg7

            A 128-bits-wide Von Neumann architecture machine simply does not have enough bits to solve P=NP problems in a reasonable amount of time.

            We would have to go to an ALL-STATES-AT-ONCE Quantum Computer, and when someone FINALLY gets around to being able to read Q-bits without decohering them, THEN I could realistically solve a P=NP problem such as reducing the number of tries it takes to find a 256-bit decryption key from 2^256 down to less than 2^128 tries!

            Example Problem: Getting the ORIGINAL decrypt Key of the AES-256 encrypted Wikileaks insurance files on a linear time and one-after-another-tries basis is simply BEYOND the capabilities of this system BUT since I know a LOT about the Social proclivities of Julian Assange and his crew, I can break down 2^256 tries to a much more manageable 2^128 tries simply by IGNORING what I know they REASONABLY WILL NOT LIKELY USE as Wikileaks Insurance File Passwords AND having quite a bit of knowledge about WHAT TYPE of pseudo-random number generators they would have LIKELY used to derive "random" passwords!

            Ergo, I can LIKELY use this 128-bits wide supercomputer to TRULY BREAK all 5 of the still encrypted Wikileaks Insurance files because I'm keeping my number of decrypt key tries to 2^128 or less simply BECAUSE I am reasonably SURE that Wikileaks will EXCLUDE certain key lengths and/or text randomizations as their Insurance File password!

            .

            BINGO !!!! I win and YOU Jane and Joe Q. Public get to SEE the ENTIRE contents of the Wikileaks Insurance Files!

            .

          2. StargateSg7

            I should note that a 65536-bits wide integer processing supercomputer COULD solve the 256-bit AES-key issue in exactly 2^128 tries using simple bit-wise calculations in as little as 19 days! That's what our engineers told me.

            They are thinking of creating that ultrawide integer-only computing device by using trapped Xenon Atoms within Quantum Wells that have their current spin state "read" every few femtoseconds using pulsed femtosecond lasers IN PARALLEL so that a Quantum computer-specific problem could be simulated in normal linear time and not polynomial time!

            Since this is a PSEUDO-quantum computer, modern CLASSICAL computation exists and thus problem-solving algorithms can be described in normal C++ code and run as normal on such a machine. Xenon is STABLE and has known spin characteristics MEASURED and IMPARTED (without decoherence!) by a femtosecond laser AND can be manufactured (i.e. trapped) in a quantum well fairly easily.

            And BECAUSE I can put so many Xenon atoms in a small space, I only need to worry about the laser circuits which for reading/imparting 65,536 separate bit values, would be the size of a small warehouse BUT focus into a chip area barely the size of a postage stamp! IT'S DEFINITELY by the parent company! They can EASILY afford it!

            And at a few tens-of-femtosecond read/write operations based upon the spin rate of a Xenon atom, I could run a 65536-bits wide math operation MANY QUINTILLIONS of times per second! Again, it is estimated by our corporate ComSci PhD's and MSc-EE's that we could break AES-256 in linear time in 2^128 tries in about 19 days with a 65536-bits wide CPU!
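For scale, here is what "2^128 tries in about 19 days" works out to per second, next to a rate of one quintillion (10^18) tries per second. This is a rough sketch only: the constant names are invented, one quintillion is just the low end of "MANY QUINTILLIONS", and it assumes one key try per operation with no parallelism beyond that.

```python
TRIES = 2 ** 128                       # key tries claimed in the post
NINETEEN_DAYS_S = 19 * 24 * 60 * 60    # 19 days in seconds
QUINTILLION = 10 ** 18
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Tries per second needed to finish in 19 days.
required_rate = TRIES / NINETEEN_DAYS_S

# Time needed at one quintillion tries per second (assumed rate).
years_at_one_quintillion = TRIES / QUINTILLION / SECONDS_PER_YEAR
```

The required rate is on the order of 10^32 tries per second, some fourteen orders of magnitude above a single quintillion per second.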

            DOES THAT WORK as an explanation for you?

            .

          3. StargateSg7

            No World Peace Will Be Made Possible by what I am concocting ....BUT.... if it helps, I do suggest you put John Lennon's Imagine on an infinite loop and sit calmly on your yoga mat for a few hours per day to make you FEEL LIKE there is world peace.

            The first step to world peace is PERSONAL PEACE !!! Which means quit your job. Take a three month hike on the West Coast trails of Canada AND THEN make some babies with a woman you find and happen to like being around!

            After that, build a log cabin in Northern Coastal British Columbia and raise DUCKS (not chickens!) for their larger eggs and some goats while you and your new wife learn to paint, make log furniture and/or figure out HOW to extract and/or manipulate ultra high levels of thermal energy from STABLE heavy elements using only pulsed light or oscillating EM fields! (upon which you will then freely OPEN SOURCE those methods and designs!)

            And again, bring your yoga mat and infinite loop John Lennon songs to your log-cabin abode just before retiring to your candle-lit evenings and down-comforter-covered nights!

            THAT should be world peace to YOU!

            .

  2. redpawn

    I'm investing

    in power plants and industrial refrigeration once the markets open.

  3. Conundrum1885

    AI Core

    Sounds intriguing. But 15 kW is way too much power.

    I did look into building a smaller version and using the larger superchip (tm) to compile code for it.

    The problem is finding a powerful enough processor/TPU combi that can run in isolation.

    For my purposes a 5 TFlop system may be good enough.

    1. Duncan Macdonald

      Re: AI Core

      Looking at the images from the Hot Chips presentation, there is room on the wafer for several individual dies as well as the 84 die monster. If the makers want to, they could make single dies alongside the monster for very little additional cost. An individual die should have a power consumption low enough (<200 Watts) to put on a PCIe card. A one die system should be enough for many smaller projects that could not afford the monster.
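The under-200-Watt estimate can be sanity-checked from the numbers in this thread. A tiny sketch, assuming the 15 kW figure mentioned above is shared evenly across the 84 dies (it may well include power that never reaches the dies); the names are illustrative.

```python
TOTAL_POWER_W = 15_000   # 15 kW figure quoted earlier in the thread
DIE_COUNT = 84           # dies on the wafer-scale part
PCIE_BUDGET_W = 200      # per-card budget suggested in the comment

# Even split of the total across all dies (an assumption).
power_per_die_w = TOTAL_POWER_W / DIE_COUNT
fits_budget = power_per_die_w < PCIE_BUDGET_W
```

That gives roughly 179 W per die, which is consistent with the under-200 W estimate.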

      1. Anonymous Coward

        Re: AI Core

        If you strip it down to a single core you lose the interconnect fabric which is likely the real star in getting so many cores to work together efficiently.

        On top of that, I'm not aware of any company that has tried multiple designs on one wafer the way you've described (ie complex large chip and use remaining space for smaller chips), as getting the large chip to yield would be a challenge even with redundant areas, and once you have a working design (at a likely cost of $50-100m) you're unlikely to want to risk making wafer layout changes for a few more $$$ unless you know you will be doing high volumes.

        Emulation on x86 is likely to be more productive and significantly cheaper.

        1. Duncan Macdonald

          Re: AI Core

          Each of the 84 dies on the monster chip is the same - and that would also be the case for any dies in the currently unused area. The dies in the currently unused area would not have the die to die interconnect wiring that is part of the monster chip. Individual dies for PCIe cards could also be produced from monster chips that have too many defects to produce a functional monster chip even with the built in redundancy.

  4. jjk

    I'm getting flashbacks

    Sir Clive Sinclair already tried to do wafer scale stuff in the 1980s.

    http://www.computinghistory.org.uk/det/8199/Anamartic-Limited/

    1. Anonymous Coward

      Re: I'm getting flashbacks

      Wasn't that for memory?

      1. Ian Johnston Silver badge

        Re: I'm getting flashbacks

        Clock-free processor, as I recall. Designed by Ivor Catt, who also had some very individual ideas about electromagnetism.

  5. AceRimmer1980
    Windows

    4 of those

    and it *might* be able to run Crysis.

    1. bpfh
      Joke

      Re: 4 of those

      At 25 fps

  6. Daniel von Asmuth
    Flame

    balance of power

    400,000 cores sounds like a quantum leap compared to Intel, AMD or POWER. By comparison, 18 GB sounds more like a cache than main memory.

    1. Charles 9

      Re: balance of power

      Keep in mind, we're talking more "cores" in the GPU sense than in the CPU sense: highly-specialized beasts, and you're probably right that the 18GB is closer to a cache than a general-purpose memory. It's likely more for communication between cores than from the wafer to the outside.

  7. karlkarl Silver badge

    An "AI" chip... Now we just need to learn how to program actual AI rather than glorified sorting algorithms ;)

    1. Charles 9

      Next question becomes: What IS AI beyond sorting algorithms? Followed thereafter by: What exactly is INTELLIGENCE?

      1. Tom 7

        Intelligence is what the next chip will do. It seems when computers do things that it takes intelligent humans to do, then these things no longer require intelligence, because computers can only be stupid according to insecure humans.

    2. JDX Gold badge

      Neural nets aren't sorting algorithms. I'm not sure Machine Learning is either, but that's not an area I feel comfortable making any statements on!

  8. jms222

    Upto

    ok Intel I'll pay $upto for your Upto device.

  9. Anonymous Coward

    Quake 3 FPS

    Almost the only benchmark that matters! Or, perhaps a linux kernel recompile in sub 1-second range?

    1. Anonymous Coward

      Re: Quake 3 FPS

      Nah. How long to boot from power off to a working desktop is my benchmark. That seems to have changed little over the years despite multicore processors vs single core and clock speeds 100x greater than my early computers.

      1. JDX Gold badge

        Re: Quake 3 FPS

        That's because it's IO-limited (disk).

      2. Anonymous Coward

        Re: Quake 3 FPS

        Time spent in BIOS is the main limit on my boot speed at home. But then, I'm not on Windows at home. Dos/Win3.1 boot time on a current i7 is rather hilarious if you can be bothered to set it up.

        I have a SAS card and tape drive in that same machine, which spends most of the time unplugged unless it's actually needed to do a backup... SAS BIOS takes a freaking age. If anyone can recommend a controller that doesn't take 3 mins to spin up it would be very helpful!

  10. Dwarf
    Coat

    Chunky Chips

    For some strange reason, I want some...

    Seriously though, I wonder how they deal with manufacturing defects since you can't just drop one die from the wafer like normal. I guess they have some one-time programmable fuses or similar to kill defective sub-assemblies.

  11. Kevin McMurtrie Silver badge

    15kW

    If it's running at 3V, you'd need 5kA to reach 15kW. That makes me think that the socket for this chip is a giant hollow copper clamp where coolant and current must flow together. I'm not even sure what a 5kA power supply with +/- 0.05V or so regulation on a dynamic load would look like. I'm guessing lots of mid-sized synchronous buck regulators followed by low-dropout linear regulators, each with very precise current balancing.
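The current figure above is just I = P / V. A quick sketch of that, plus two follow-on numbers: the ±0.05 V tolerance as a percentage of 3 V, and how many regulator phases it would take at a per-phase rating. The 50 A per-phase figure is a hypothetical picked for illustration, as are all the names; the 3 V supply is the comment's own assumption.

```python
import math

POWER_W = 15_000          # 15 kW, from the comment
SUPPLY_V = 3.0            # supply voltage assumed in the comment
TOLERANCE_V = 0.05        # +/- regulation figure from the comment
PHASE_CURRENT_A = 50      # hypothetical rating of one buck regulator phase


def supply_current(power_w, voltage_v):
    """Current drawn at a given power and voltage (I = P / V)."""
    return power_w / voltage_v


total_current_a = supply_current(POWER_W, SUPPLY_V)

# Regulation tolerance as a percentage of the nominal rail.
tolerance_pct = TOLERANCE_V / SUPPLY_V * 100

# Phases needed if the load is shared evenly across identical regulators.
phases_needed = math.ceil(total_current_a / PHASE_CURRENT_A)
```

At a lower core voltage the current climbs in proportion, so the 5 kA figure is best read as a floor if the real rail sits below 3 V.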

  12. A Non e-mouse Silver badge

    Sounds a bit like the transputer from the 80s - but on steroids.
