Intel Atom chips have been dying for at least 18 months – only now is truth coming to light

The flaw in Intel's Atom C2000 family of chips has been vexing Intel's hardware customers for at least a year and a half, according to a source at one affected supplier, but it wasn't immediately obvious that Intel's silicon was to blame. The well-placed insider, who spoke to The Register on condition of anonymity, said the …

  1. Nick Collingridge

    Synology seem to be winning the transparency race at the moment. Intel = BIG Fail. They must surely work out sooner or later that the game is up on this, and the longer they fail to engage properly with the issue the more their reputation will suffer.

    1. BillG Silver badge
      Boffin

      Insider's View

      The clock "vendor" Intel uses has a solid history of reliability. The problem, in my opinion, is most probably Intel's implementation of the Atom's on-chip clock domains. Probably specs are loosening after the chip is heated over 18 months.

      I have a hard time believing that Intel didn't catch this during QC testing, as some tests subject the MCU to intense heat over a short period of time. It would have caught this. I suspect middle-level managers inside Intel knew about this and didn't tell upper-level managers.

      1. Brian Miller

        Re: Insider's View

        I have a hard time believing that Intel didn't catch this during QC testing, as some tests subject the MCU to intense heat over a short period of time.

        I can believe that they didn't catch this, as this sounds like long-term component degradation. A short heat test of, say, 72 hours, probably wouldn't catch this. Maybe there is something actually growing on the inside of the chip packaging, too. It could have been fine in the initial samples, and then when they went to production something happened.

        (A former boss of mine worked on the i386 design team, so I learned a few interesting things.)

        1. Alien8n Silver badge

          Re: Insider's View

          I used to work for a mosfet and igbt manufacturer and some of the "solutions" to chip faults were insane. One line of chips was consistently testing at 7V instead of 5V. Turns out the wafer fab had loosened their spec to allow all the failures to pass, only to have them fail when put into final package. I recommended they start testing at 5.5V, which would have removed about 90% of the failures while still worth next to nothing, but the wafer fab didn't want the failures to reflect on their balance sheet. They'd rather the manufacturing plant took the hit and looked bad than them. They took a conscious decision to cost the company millions of dollars a week in wasted production just so they'd look good. Partly this was down to the US Finance Director hating the UK: the wafer fab was in California, the manufacturing plant was in Britain.

      2. Anonymous Coward
        Anonymous Coward

        Re: Insider's View

        I've worked in the past on RTOS embedded systems inside Cisco, with Intel CPUs (primarily during the P6, Netburst, and Core/Core2 era). Reliability/stability was an increasing problem starting with Netburst, although Nehalem amazingly seemed OK in spite of a new arch, while Westmere was not.

        Many of the critical flaws weren't widely disseminated and were resolved with microcode updates that went into fresh BIOS updates on the normal consumer gear (desktop/laptop/server). In the meantime, is the end user really going to notice an extra BSOD? In the embedded space where things HAVE to be UP, we didn't have that luxury. We'd have a 3-4 year lock-in and purchase agreements on what was supposed to be a stable LTS CPU, and Intel would have a critical flaw pop up 6 months in, then change the stepping mid-cycle, or occasionally change the chip entirely while they put out a stream of microcode updates to compensate on already released product. Then somehow we had to get customers to actually apply those updates before support tickets rolled in... (yeah right) With embedded systems, unfortunately, it is very, very costly to resolve once product is shipping. Installs have to be modified, end to end testing re-performed (knocking other revenue generating new features off the QA schedule - $$$), and the factories have to get new images/processes/procedures, and someone has to fly out for a First Article Inspection to verify it's all good before you can ship product again (more $$$). For normal desktop/laptops and servers, it's not that big an issue, but in the embedded space, Intel was killing us with the amount of time it took to work around/support their screwups.

        At the time there was an Intel Board member also on the Cisco Board, so there weren't any options to get out of it when spec'ing the next product hardware, and I tried hard when Netburst Xeons were coming apart at the seams and causing stop-ships every few months. Very difficult to get down to brass tacks with a supplier when you have an incestuous Board of Directors relationship like that...

        1. Doctor Syntax Silver badge

          Re: Insider's View

          "Very difficult to get down to brass tacks with a supplier when you have an incestuous Board of Directors relationship like that"

          it would depend. A board member in that position should be able to short circuit a lot of internal obstacles on the vendor side in the short term and persuade them that quality issues matter in the long term. The long term benefits would be mutual.

          1. Anonymous Coward
            Anonymous Coward

            Re: Insider's View

            "The long term benefits would be mutual."

            I think I see the problem..."long term" anything are someone else's problem these days.

          2. Anonymous Coward
            Anonymous Coward

            Re: Insider's View

            "it would depend. A board member in that position should be able to short circuit a lot of internal obstacles on the vendor side in the short term and persuade them that quality issues matter in the long term. The long term benefits would be mutual."

            Hehe - *Should* is the operative word there. Unfortunately, in reality it meant that the Emperor had no Clothes, and nobody was going to disturb the illusion. Keep in mind it was in the same timeframe that Intel was trying to assure everyone that the Netburst arch (ala P4 for consumers) was awesome, so anything that indicated otherwise they tried to squash. It wasn't until Intel released the Core arch (itself a resurgent P6 arch) that Intel came clean with all the problems and performance issues it had with Netburst/P4 that just couldn't be resolved. Core/Core2 still had some issues, but at least Nehalem was stable, although the next cycles after that didn't appear to be as stable.

            My team's recommendations during the Netburst era didn't get past the Director level before getting squashed, but such is the nature of modern Corp life... If there had not been a mutual Board Member, my area of Cisco would have likely shifted off Intel for our embedded stuff in the mid naughts. We were getting pretty ticked at the drudgery of repeatedly having to re-spin both hardware and software releases for Intel's product issues versus working on our own product features and issues. I'm surprised other companies stayed with them in the embedded space in that timeframe. Surely they didn't have mutual Board Members with everyone. :)

        2. yuhong

          Re: Insider's View

          The other fun thing is the NDA spec updates for engineering sample steppings.

    2. darkknight

      I'm not so sure Synology are winning anything at the moment. This is the reply to a support ticket I opened with them:

      "Thank you for contacting Synology support.

      Regarding this issue, our related team are working on such issue and I will give you the reply ASAP if there is any update.

      However, thank you for bringing this issue to our attention.

      Please feel free to contact us again if you have further questions or suggestions."

      1. darkknight

        To reply to my own post, the current case progress I have with Synology

        Boilerplate rubbish...!

        "Sorry to disturb.

        Intel has recently notified Synology regarding the issue of the processor’s increased degradation chance of a specific component after heavy, prolonged usage.

        Synology has not currently seen any indication that this issue has caused an increase in failure rates for DiskStation or RackStation models equipped with Intel Atom C2000 series processors compared to other models manufactured in the same time frame not equipped with the affected processors.

        It is safe to continue to use your device, however should you encounter any issues, our support teams will do everything they can to expedite your ticket. Technical Support can be reached via www.synology.com/ticket. Synology will post a follow-up on this topic once additional information is available.

        Please feel free to contact us again if you have further questions or suggestions."

        1. darkknight

          and to confirm that Synology are now on my personal "do-not-buy-from-again" list:

          "Thank you for your reply.

          Due to there is no issue on such model recently, please must let us know if you face any problem. We will always adjust our product and protect the interests of customers.

          However, thank you for bringing this issue to our attention again.

          Please feel free to contact us again if you have further questions or suggestions."

      2. bogd

        I got the same canned reply at first. To Synology's credit, though, they followed up a few hours later with a more detailed message:

        "Intel has recently notified Synology regarding the issue of the processor’s increased degradation chance of a specific component after heavy, prolonged usage.

        Synology has not currently seen any indication that this issue has caused an increase in failure rates for DiskStation or RackStation models equipped with Intel Atom C2000 series processors compared to other models manufactured in the same time frame not equipped with the affected processors.

        It is safe to continue to use your device, however should you encounter any issues, our support teams will do everything they can to expedite your ticket. Technical Support can be reached via www.synology.com/ticket. Synology will post a follow-up on this topic once additional information is available."

        While not very encouraging (it does sound like they're not going to replace anything until the unit has actually died), at least it's a bit more info on the topic :)

        1. Alan Brown Silver badge

          > It does sound like they're not going to replace anything until the unit has actually died

          My worry is that even if makers have this policy, they won't have enough stock to handle increasing failure rates.

          Having a critical system go tits-up - then finding that even though there's a support contract, not being able to replace it for a week - is one of the nightmare scenarios.

      3. Chris King Silver badge

        That's why I was cynical of JP's responses in yesterday's article.

        "We're looking into it" would have been plenty good enough for me, but "take a ticket" sounds more like a delaying tactic, especially if they don't seem to have answers.

        I've made sure that everything on my DS1815+ is backed up elsewhere - I've gone from "happy camper" to "Can I even trust the tin ?" in the space of 24 hours, and I'm probably not alone on that score.

        1. darkknight

          No sir, you are not alone.

          I bought my DS1815+ *as* a backup data store.

  2. Doctor Syntax Silver badge

    Is it just this particular Intel line that has a problem? Are there others that haven't come to light yet? This one might have stayed under the radar if it hadn't been for the Cisco announcement.

    1. Anonymous Coward
      Anonymous Coward

      Many knew something smelled.

      This has been on the radar for 2 years, that's why Intel is forced to admit it. It's the sole reason I waited for the D. I'm not sure if the exact problem was known but, while in pursuit to replace a home NAS with a mini-itx, I've read many discussions claiming there was no other possibility for failure rates other than a faulty CPU.

      Supermicro and undoubtedly AsrockRack C2000 boards both have atypical failure rates. In the case of Asrock, I believe the failure rate after 18 months is between 60% and 70% (I'm not exaggerating) due to the thermal conditions of all things considered.

      1. Anonymous Coward
        Anonymous Coward

        Re: Many knew something smelled.

        MyBackDoor: "It's the sole reason I waited for the D"

        I see what you did there!

      2. jmb06

        Re: Many knew something smelled.

        I have one of these CPUs on an ASRock board that just died. The CPU embedded board is 24 months old. I spent $299 on it because it was a "server motherboard". I should have stuck to using cheapo motherboards for $79.

        Let's see how Intel handles RMAs. A lot of people may go AMD if Intel goes cheap on replacements. They should look to Samsung for advice on how to handle replacing a bad product. Fess up and do the right thing.

  3. Anonymous Coward
    Anonymous Coward

    Intel get a life

    Your buggy Atom's and your crappy LGA bend pins

  4. ma1010 Silver badge
    Megaphone

    Corporate weasels just can't learn

    Corporations are like people: if you make a big mistake, you're generally much better off owning up to it, apologizing and undertaking to not do it again.

    But scumbag weasels duck, dodge, lie, deflect and do anything else they can to try to avoid owning up to their screw-ups. Most people have no respect for weasels. Or weasel corporations. Like Intel appears to be turning into (if it wasn't already).

    Even Steve Jobs finally admitted Apple screwed up on the "they're holding it wrong" iPhone. It's time for Intel to step up and take the blame, too.

    1. chivo243 Silver badge

      Re: Corporate weasels just can't learn

      "But scumbag weasels duck, dodge, lie, deflect and do anything else they can to try to avoid owning up to their screw-ups. Most people have no respect for weasels. Or weasel corporations. Like Intel appears to be turning into (if it wasn't already)."

      Just like frat boys... do I need to connect the dots?

    2. Anonymous Coward
      Anonymous Coward

      Apple just as weaselish...

      "Even Steve Jobs finally admitted Apple screwed up on the "they're holding it wrong" iPhone. It's time for Intel to step up and take the blame, too."

      I'd hardly use Apple as a paragon of virtue. They flat out denied the existence of the infamous "touch disease" problem on the iPhone 6 and 6 Plus for several months.

      This was almost universally accepted by third parties to be down to the iPhone 6's infamous "bendy" design causing excessive stress on the board.

      When Apple themselves finally "admitted" it, they tried to pin it on a combination of two causes, the first being that the phone had been dropped and *then* subjected to "further stress".

      Oddly, the part that they were quite clear about- the phone (allegedly) having been dropped- was the one that entirely coincidentally would be indisputably the user's fault. The part about "further stress" on the other hand... well, that's pretty vague, isn't it?

      Possibly because the "further stress" is likely that caused by excessive bending of the case. Of course, it's also entirely coincidental that the part they're being vague about- the part that everyone else pins as the real cause- is the part that might point to Apple being to blame.

      And how plausible is it that "touch disease" is- as Apple claim- caused in every case by the exact combination of the phone being dropped and *then* subjected to "further stress"? It sounds like they want to have their cake and eat it... it's not plausible enough that touch disease is caused by the user dropping it (on its own), but the most likely cause would pin the blame on them, so fudge the issue by implying that the user is still partly to blame- however implausibly- which lets them focus on that and skirt around anything that might show them to be at fault.

      That's about as "weaselish" as it comes.

      1. SpitfireNoNotThePlane
        Meh

        Re: Apple just as weaselish...

        "I'd hardly use Apple as a paragon of virtue. They flat out denied the existence of the infamous "touch disease" problem on the iPhone 6 and 6 Plus for several months."

        Apple was okay back when Jobs was with them. Since he died, it's honestly gone down the shitter. Note how the iPhone 6 was released about three years after he died.

        I know Jobs was a PR and sales man, but that's literally what we're talking about here.

    3. Atilla_the_bun

      Re: Corporate weasels just can't learn

      Ahh, the lovely Apple Antenna fiasco. Remember it well. When they issued special cases for that model iPhone I commented to friends that I bet I knew what happened. They probably engaged one of the best radio engineers on the planet along with a great industrial design engineer for that phone and set them to design that part of the phone/radio and antenna. This they did, and the product eventually was manufactured. When these problems surfaced in the real world, they went back to said brilliant engineers with the issue and explained the problems people had holding the phones. The response was "Nobody told us they'd be holding the phone in their bloody hands!"

      1. admiraljkb

        Re: Corporate weasels just can't learn

        @Atilla_the_bun - The antenna fiasco as I recall was that the proto case (a mod'd iPhone 3 case) they were testing in the field before release, was NOT the case that went into production. In the pursuit of external design secrecy ahead of the big reveal, they ended up using their customers for alpha testing of the antenna in the final case.

  5. sanmigueelbeer Silver badge

    Forgive me if I'm wrong, but hasn't the Atom C2000 been around since 2013? So I would guess Intel has EoS this?

    If this is the case, then I can fully understand why Intel is behaving like this. The logistics to find replacement chips (if the fabrication line is already closed) and the financial cost to compensate the clients will be difficult.

    Intel doesn't want to get swamped with calls from angry manufacturers about end-users swamping their support with RMAs (because they will all put a strain on Intel). Intel wants to control the situation before everything gets to "chaos" mode.

    What I don't understand is why Cisco is the only one to publish a field notice. Why haven't HP or other big names, like Dell, Netgear and Seagate, published technical notices yet?

    1. diodesign (Written by Reg staff) Silver badge

      Re: sanmigueelbeer

      "If this is the case, then I can fully understand why Intel is behaving like this."

      Extremely strange that Intel won't explain this or say this. And some of the affected components started shipping in 2014. If you bought an Atom-powered NAS in 2015, would you want to start 2017 knowing it could die after the next reboot? It's crappy quality.

      "Intel wants to control the situation before everything gets to 'chaos' mode."

      No shit. Sorry, we don't do Intel's PR. Happy to set the cats among the pigeons and get some real answers out of vendors, rather than suppliers hiding behind NDAs while people's devices mysteriously fail.

      C.

      1. Mage Silver badge

        Re: sanmigueelbeer

        Are the replacements going to fail, as they don't have a new stepping yet, or can the people making product change their design slightly?

      2. oldcoder

        Re: sanmigueelbeer

        It doesn't seem any different than the Pentium FPU fiasco.

        Same ducking of responsibility.

        Oh well - Intel lost the power war, ARM won.

        It might also explain why Intel quit making the Atom line.

        1. Tom 7 Silver badge

          Re: sanmigueelbeer

          I want my ARM shares back. Bet the buyers knew of this.

        2. Lennart Sorensen

          Re: sanmigueelbeer

          The pentium was socketed and easy to replace. This one is soldered on the board.

        3. admiraljkb

          Re: sanmigueelbeer

          "It might also explain why Intel quit making the Atom line."

          You might think that, but these were special purpose Atom based SoC versus regular garden variety Atom's. Trust me, the regular Atom line has plenty going against it, mainly the age of the architecture. There was only so far Intel could tweak on it before the engineering costs on an outdated arch (unrelated to their primary bread/butter x86 cpu lines) outweighed the rewards/profit margins.

    2. Lennart Sorensen

      These are chips for embedded systems with long term supply promises. This is very much not a chip that is end of life yet. It was supposed to be available for at least 5 years I suspect, maybe more.

      1. Chris King Silver badge

        SuperMicro quoted 7 years availability for their C2000 boards.

    3. Alan Brown Silver badge

      > So I would guess Intel has EoS this?

      The Cisco story said anything shipped after Nov 2016 was OK.

      That's probably your cutoff on bad silicon, but as a long term supported SoC for embedded systems and low end servers, the failures seen to date are probably only the tip of the iceberg.

      Regarding other comments: These are not your grandfather's anaemic Atoms that used to hobble consumer systems. The Avoton/Rangeley parts were a new generation Atom System-on-chip with performance spec that outruns Xeons prior to the E55xx/56xx parts (ie, anything more than 7-8 years old) whilst using 1/10 of the power and in a lot of cases became the chip of choice even when replacing 3-5 year old systems that would have traditionally been DP.

      They've been Intel's bread and butter server chip for non-compute-intensive operations and embedded work, which means the company is going to take a hammering unless it steps up and 'fesses up. The longer they leave it, the deeper the shitpile's going to be.

  6. Anonymous Coward
    Anonymous Coward

    Maybe everyone from Pentium FDIV bug days has retired?

    I'd have thought it was a durable lesson in transparency: try to hide/minimise a serious defect and end up with major PR damage and expensive recalls. But after 22 years perhaps too much institutional memory has been lost.

    1. Anonymous Coward
      Anonymous Coward

      Re: Maybe everyone from Pentium FDIV bug days has retired?

      The Pentium FDIV bug was one of my favourite times in IT...

      So much overtime, so little actual work....

      1. Anonymous Coward
        Anonymous Coward

        Re: Maybe everyone from Pentium FDIV bug days has retired?

        Yes, I remember getting HR to work out my final contractor rate cut on an Intel Pentium PC, and I ended up buying the company.

    2. Humpty McNumpty

      Re: Maybe everyone from Pentium FDIV bug days has retired?

      For something more recent, how about the Sandy Bridge chipset? In that instance, an issue that according to them was fairly unlikely to manifest itself in the working life of the product was deemed sufficiently serious to recall and rework every motherboard that (barely launched) chipset was on. That is a stark contrast between their approach to a fault they became aware of while the product was in production and their approach to this one.

    3. Jusme

      Re: Maybe everyone from Pentium FDIV bug days has retired?

      > But after 22 years

      Feck, I'm old!

      1. Mage Silver badge
        Windows

        Re: Maybe everyone from Pentium FDIV bug days has retired?

        What about the 1980s '386 mult bug?

        About 30 years ago. Before there were web sites. I remember the company I was working for implementing a work-around on the compiler they were developing. Or talking about it.

        I bet not many people from those days are still working as Engineers.

        This seems informative on the 386 steppings. MS even decided to stop 80386 support with NT 4.0. I'd forgotten that!

        So the Atom C2000 issue isn't new, it's just that now these things can't easily be swapped.

  7. Anonymous Coward
    Anonymous Coward

    "What I don't understand is why is Cisco the only one to publish a field notice? Why hasn't HP or other big names, like Dell, Netgear and Seagate, publish technical notices yet?"

    It's because Cisco has made the decision to go public, while everybody else is still hedging their bets.

    At the end it'll make Cisco look good.

    A few years ago both EMC and NetApp had a large batch of faulty Seagate Drives. The technicalities are irrelevant but the drives would die prematurely due to contamination on the platters.

    EMC made the conscious decision to replace the drives before failure and EMC made a big story out of it in front of the customer, shaming NetApp.

    NetApp thought they could address the issue with a combination of drive firmware and Software to gradually phase out the drives as they fail. NetApp customers experienced multi-disk failures and lost data. EMC customers did not.

    What Cisco is doing now is showing up the competition. Being an enterprise vendor does not just mean producing enterprise grade products. It's about the entire Customer experience and doing the right things.

    Cisco competitors that do not follow suit will get shamed in the media. The smaller vendors you read about in the article who try so hard to be "enterprisey" just cannot afford that sort of service.

    Apple will give you a new phone when it's broken. Others will send it in for repair and 6 weeks later you get it back - while the technician has read all your emails.

    1. P. Lee Silver badge

      >The smaller vendors you read about in the article who try so hard to be "enterprisey" just cannot afford that sort of service.

      I reckon they probably can. Production costs are pretty low on most hardware and they'd get a lot of kudos for doing this right. I'd imagine the main problem is finding a replacement - weren't Intel phasing out Atom? What is Intel going to do for you? Even if they do have a replacement, those embedded boards with the CPU soldered in... they just increased vendor costs.

      1. admiraljkb

        "...weren't Intel phasing out Atom? What is Intel going to do for you? Even if they do have a replacement, those embedded boards with the CPU soldered in... they just increased vendor costs."

        This is (mostly) the embedded space, so Intel has contract obligations to continue to supply these for as long as the customer contract specifies. :) So Cisco/Dell/Synology/etc are probably taken care of. Smaller end users though, that could get interesting. For the second part - yeah, it's OUCH time for Intel. Whatever profits they got off these incredibly low margin and custom chips will get wiped out and then some.

        re the phase out and this is somewhat unrelated to the current issue - unless I misunderstood, Intel is going to use their normal x86 Architecture du jour as the base of all the new low power x86 stuff rather than continue to have the expensive one-off that Atom was.

    2. Uffish

      Re: "Apple will give you a new phone when it's broken."

      That should read "Apple will give you a new phone when it's broken within a very short time period from the purchase date".

      My experience of Apple products is that they really don't suit my lifestyle.

  8. Howard Hanek Bronze badge
    Childcatcher

    Through the Looking Glass?

    Didn't the March Hare's watch turn backwards? Perhaps the masters were just placed upside down during the xray process?

    1. phuzz Silver badge
      Alien

      Re: Through the Looking Glass?

      Are you related to amanfrommars?

  9. Anonymous Coward
    Anonymous Coward

    EU Customers don't need warranty

    The defect exists in the processor contained within the product at the time of manufacture, therefore it is not of "merchantable quality" and it matters not when the failure occurs or if under warranty.

    Even if they play hardball, in the UK, the small claims court costs £25 to £60 for claims of £300 to £1000. You are almost guaranteed to win by default as the manufacturer will likely not turn up or settle first as it's cheaper.

Biting the hand that feeds IT © 1998–2019