Intel's Atom C2000 chips are bricking products – and it's not just Cisco hit

Intel's Atom C2000 processor family has a fault that effectively bricks devices, costing the company a significant amount of money to correct. But the semiconductor giant won't disclose precisely how many chips are affected nor which products are at risk. On its Q4 2016 earnings call earlier this month, chief financial officer …

  1. Andy Tunnah

    I smell an oscar

    Bang up detective and reporting work. All you need now is a script and a role for Mark Ruffalo

    1. AmenFromMars

      Re: I smell an oscar

      Agreed, good work.

    2. Mephistro Silver badge
      Thumb Up

      Re: I smell an oscar

      And another thumbs up for the article!

  2. Your alien overlord - fear me

    "A board level workaround exists" that involves us paying each other millions as bonuses for not paying attention to Quality Control. The rest of the world can just sit there wondering if their kit will boot tomorrow.

    1. Anonymous Coward

      :) I see what you did there. Boredroom level workaround, indeed!

      I think when they say "stepping" they mean the next iteration of the chip. Which means new chips and boards all 'round, unless you are handy doing super-fine, surface mount soldering with your trusty micro-iron! The workaround sounds like strapping the line to another clock source, and Low Pin Count sounds like the processor is not fully enabled yet so just provide a clock to get us out of POST and hand over control to the boot loader and everything will be fine. Also, no way to fix with microcode. Very nasty problem given the 18 month wait period before it manifests!

      Good on El Reg for putting the clues together!

      1. Lennart Sorensen

        The work around means adding some resistors to the design, which on most boards is not something you can just do, since this is a clock line. You can't just add wires and resistors since that would mess with the clock signal. So it is either change the board design to add the resistors, or wait for the next version of the chip (which will probably take months to happen). Of course since the chips are soldered to the board (not a socket), they are not easy to replace either.

        1. Anonymous Coward

          adding resistors

          "The work around means adding some resistors to the design"

          Apologies if I missed something obvious - but is this stated officially anywhere?

          [It's entirely plausible, it's just one of the many failure modes possible when digital designers forget they live in an analogue world]

  3. Spotswood

    Uh oh

    We own several Synology DS1815+ devices each with about 24TB capacity and currently quite full of data. They use the Intel Atom C2538, which is listed as a SoC containing this fault. These are well over 12 months old and therefore approaching the 18 month danger zone.

    This is obviously very concerning.

    I hope Synology are ready to help us.

    1. Chris King Silver badge

      Re: Uh oh

      pfSense also sell boxes based on C2358/C2758 processors.

    2. Nate Amsden Silver badge

      Re: Uh oh

      Looks like those have 3 year warranty so would be surprised if they didn't fix it..but maybe you have to wait for them to fail.

    3. John Smith 19 Gold badge
      WTF?

      "I hope Synology are ready to help us."

      Wouldn't it be an idea to call them first?

    4. Anonymous Coward
      Thumb Up

      Re: Uh oh

      Don't worry. Just put all your data in the cloud.

    5. Synology UK

      Re: Uh oh

      Hello,

      JP here from Synology UK. Unfortunately we didn't receive the contact here so we apologise that no one has responded. Please contact our technical support team via www.synology.com/ticket and they will be able to advise you on the best course of action.

      Kind regards

      JP

      1. Anonymous Coward

        Re: Uh oh

        Hello,

        AC here from Synology UK. We hold all our open tickets and customer details on our own award-winning DS1815+ devices, so will get back to you in 18 months and 1 day.

        Kind regards

        AC.

      2. Chris King Silver badge
        Mushroom

        Re: Uh oh

        Dear JP

        Posting boilerplate messages to random threads doesn't inspire me with confidence in Synology's approach to this problem.

        Right now, I've got a shiny DS1815+ that's humming along very nicely, but I'd really like to know if this thing has one of the affected processors so I can plan accordingly. I understand that RAID Is Not Backup, so losing access to my storage won't hurt me, but it will make my life less convenient.

        This is an inherent fault in manufacture of a component, identified by the manufacturer of that component. I'd really like to know what Synology (and other manufacturers who are reading this discussion - CK) are going to do to rectify it.

        If your attitude is "we'll wait for your kit to die, then we'll replace it" then your box is going back to where I bought it from - and I will make very sure that anyone who asks me for advice on buying a NAS knows that.

        Toodle Pip,

        CK

    6. Chris 244
      FAIL

      Re: Uh oh

      Probably not a good idea to have faith in Synology. They are still promoting said products on their website:

      https://www.synology.com/en-us/products/DS1815+#spec

      And the resellers the Synology website sends you to are still reselling:

      http://www.ncix.com/detail/synology-ds1815-diskstation-8-bay-diskless-71-103588.htm?promoid=1721

    7. Chris King Silver badge
      FAIL

      Re: Uh oh

      Double jeopardy for me... "If my 1815+ crashes, no worries, stuff is available elsewhere in verified backups. I can always build myself a U-NAS box as a replacement and I've got just the motherbo-oh crap, it's a C2758 !"

      Seems 2016 was not a great year for hardware purchases for the home lab.

    8. ryan_c

      Re: Uh oh

      I have a Synology 1815+ that we purchased about 10 months ago. Starting in December I started having random reboots but they weren't caught right away because notifications were not set up correctly and the unit boots so quickly. Towards the end of December I noticed the problem as the frequency had increased quite a bit. I did some searching around and it sounds like quite a few DS1815+ units have bad power supplies. I called up Synology and they confirmed that my issue was a bad power supply and swapped it out. The transfer to the replacement DS1815+ couldn't have been easier. My point with this long-winded comment is that the forum post referenced in this article most likely describes faulty power supplies, which is concerning but not the same issue.

  4. Planty Bronze badge

    This will show up the good vendors vs the bad ones.

    Cisco are top of the good list, and so far the only entry. Any other takers? Or does everyone else think nobody will notice?

    Personally sick of companies that take the sweep it under the carpet and hope nobody notices. Don't they realise that in the internet era nothing can really be covered up? Mass product faults that the manufacturer hopes nobody will notice won't wash anymore... (Panasonic AllPlay, calling you out here...)

    1. Tinslave_the_Barelegged Silver badge

      Re: This will show up the good vendors vs the bad ones.

      > Personally sick of companies that take the sweep it under the carpet and hope nobody notices.

      Recently had this with a camera. A Fuji randomly shutting down. Reading various forums suggested it was a common issue and that the lens (built-in) was the cause. After a struggle contacting Fuji UK, they said there was no problem with the camera or the lens, but to send it back. They replaced the lens, which resolved the issue. Which, of course, didn't exist.

      Why do corporates not take the high ground of admitting clearly to a problem and then resolving it? There's far more to gain and much less to lose that way.

      1. Nick Ryan Silver badge

        Re: This will show up the good vendors vs the bad ones.

        > Why do corporates not take the high ground of admitting clearly to a problem and then resolving it? There's far more to gain and much less to lose that way.

        Why? Litigation society, that's why. It's usually believed that if a company admits a problem then they are admitting liability and opening themselves up to litigation. Which can/will be expensive. Best to err on the side of caution and to never admit to anything. Ever.

        See also: "Dark ages" or "why nothing of great importance happened because much of Europe was concerned with pointless legal matters and why external input was required"

  5. Adrian 4 Silver badge
    Facepalm

    Oh no, not again

    Back in the dawn of PC time, I worked on an 8086 machine designed before everyone was expected to copy IBM. The 8088 and 8086 used a special intel clock driver, the 8284. We had loads of problems with them not oscillating properly ..

    1. Dwarf Silver badge
      Coat

      Re: Oh no, not again

      Perhaps someone should have introduced Intel to OpAmps, they ALWAYS oscillate, even when you don't want them to !

  6. Aitor 1 Silver badge

    Crap support

    The problem with this is they will simply start dying.. and my guess as many other commentards say is that most vendors will do a la la la, and ignore clients. As they are set to lose plenty of money.. unless intel compensates them.

    What SHOULD be done is vendors sending new devices with the corrected processor, so the old ones are returned and either scrapped or refurbished.

    This is quite bad news.. and potentially crippling for intel, not just for the money, but for the lack of confidence. Ppl might just feel more confident putting an nvidia SOC than an Intel one!

    1. a_yank_lurker Silver badge

      Re: Crap support

      Actually the vendors might have a very strong civil suit against Chipzilla for delivering defective products. The vendors are caught in the middle as the ultimate miscreant, Chipzilla, is a direct supplier. So the customer harasses/sues vendor who in turn harasses/sues Chipzilla.

      Note, do not scrimp on QA/QC because the few bucks you save up front will eventually come out of your hide with a very serious multiplication factor.

    2. Anonymous Coward
      FAIL

      Re: Crap support

      You're right. Companies don't spend money on customer support or service any more; that cash is instead split as follows: 85% to the board, 14.5% to marketing and making the website pretty and 0.5% to an offshore team to run the customer twitter account ('can you be typing in your number of customer and bank account and identification of order with a quickness kind sir, and I or they or we will be back with you with a perfect answer in hours of plenty').

      We will find out that most tech companies don't understand what they sell at all, and are just a change of logo, a website and a hefty dose of BS. It won't be an easy lesson.

  7. Doc Ock

    Now would be a good time to buy ARM shares......damit.

    1. Peter Danckwerts

      Pity it's too late to buy ARM shares. It was bought by SoftBank last year.

      1. Doc Ock

        >Pity it's too late to buy ARM shares. It was bought by SoftBank last year.

        Hence the Damit. Do keep up at the back, no offence intended.

  8. Anonymous Coward

    I worked at NetApp when they encountered the PCI/NMI error, where a substandard adhesive caused controllers to throw up protection faults and panic. I have never seen so much effort go into Cover Up, Playing Down, Case Manage and Control Communications (inside as well as outside the organisation).

    The Company went into full damage control mode, so concerned about reputation that the technical fault itself became a secondary issue. For NetApp only a few thousand systems were affected, yet they couldn't keep up with producing/refurbishing the number of fixed boards required. It took months to years to fix the last customers.

    Now imagine intel with millions of C2000's and most of them on SoC's.

    I can tell you this:

    If you are a large customer with a large vendor (e.g. a large Cisco customer) you get fixed first. Cisco say they would prioritise systems by operational age, but that's BS. Customers get prioritised by the size of impact and potential for negative press. Therefore large telcos will come first. Cisco wants to avoid negative press at all costs. "ISP or Mobile Carrier went down due to faulty Cisco gear" would affect a lot of people and generate a lot of negative press.

    If you are a small'ish vendor of C2000 systems -or - you are a customer of those systems - you are screwed!

    That hot potato will stay in your hands until the large vendors and customers are fixed. Next comes the medium businesses and finally the guys at home with their Synology NAS' come last.

    The reason you don't hear a thing from your vendor - is not because they're unaware of the issue - it's because they're developing strategies to minimise their costs. And sorry - they don't give a shit about you (the customer) and the fact that your gear (or business) may fail at any time.

    1. Anonymous Coward

      Been there done that.

      As a vendor there is only so much you can do.. and doing a samsung is going broke.

      A BGA resolder properly done can go to 400$ a piece.. so it makes no sense to do it on synologys...and yet hey, there is your data.

      We have a synology as a single point of failure in our company, just for internal use and replication. While we do have a backup of it (well, 2 to be precise) it will be a nuisance to say the least.

      1. Doctor Syntax Silver badge

        Re: Been there done that.

        "A BGA resolder properly done can go to 400$ a piece.. so it makes no sense to do it on synologys...and yet hey, there is your data."

        So just swap the whole processor board.

        1. Anonymous Coward

          Re: So just swap the whole processor board.

          I'm not familiar with the NAS boxes in question, but as well as swapping the processor board, wouldn't another option be to swap the hard drive(s) to a similar-enough NAS box that wasn't implicated in this affair?

          The valuable-to-customers bit here is probably the data not the hardware, right?

          Just askin' (apologies if it's a daft question).

          1. Doctor Syntax Silver badge

            Re: So just swap the whole processor board.

            "Just askin' (apologies if it's a daft question)."

            Not a daft question. I'm not familiar with the product.

            If the drives are nothing but data and the whole thing is driven by firmware on the processor board then it would be a tad difficult. It would depend on being able to find an alternate device with sufficiently similar firmware which would be entirely down to the software being generic. Without going off & researching that I've no idea whether it is or whether it's proprietary.

            If the drives have an OS on them then it would depend on the OS including the right drivers. There's always a problem, even with general purpose OS's, of having support for newer or even older hardware.

            Short answer, "similar-enough" might not exist.

    2. Anonymous Coward

      Completely agree about the cover-up

      Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)

      1. Anonymous Coward

        Re: Completely agree about the cover-up

        "Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)"

        The problem only occurs there because the Kuga was never designed to run upside down.

        1. Paul Kinsler

          Re: The problem only occurs there because the Kuga was never designed to run upside down.

          And isn't reported in Australia because it gets blamed on bushfires instead. :-)

      2. Anonymous Coward

        Re: Completely agree about the cover-up

        "Just like the cases of flaming Ford Kuga's (check news in New Zealand and South Africa)"

        That was the Voice Control System committing suicide after hearing the accent! :)

    3. Lennart Sorensen

      The only fix so far is to change your own board to add the workaround. New chips don't exist yet so no one is getting those until they exist. So everyone is at their own mercy about how long it takes to change the board design and get new boards made, or they can wait for the new chips and hope for the best in the mean time. Doesn't matter if you are Cisco or some tiny company. Of course I suspect Cisco might very well be able to get a new board revision design made a lot faster than the little guys.

    4. a_yank_lurker Silver badge

      @AC

      Given the actual screwup is Chipzilla, the vendors in many cases do not have any real options until Chipzilla figures out how to fix their mess. Then Cisco can start fixing/replacing gear; they do not have any inventory of good chips. Right now there is no gear except for known defective gear to push out. Cisco has the luxury of nailing Chipzilla with a knockout punch and probably will go after them.

  9. Anonymous Coward

    I remember the NetApp PCI/NMI error. Internally they called it the PCI/Enema and everybody had a good laugh.

    When facing the customer the sales guys pretended not to know anything about it. Actually not just sales, but the entire leadership team, all the way up.

  10. Herby Silver badge

    So when do I short CSCO/INTC stock??

    Given that this seems to happen after 18 months, one might want to calculate the time of first failure, and watch the stock go down. It could get interesting.

    Of course, one wonders WHY the failure manifests itself after 18 months. Is there some flash component that gets used to determine elapsed time? We know the symptoms of the failure, but not the actual root cause (other than a bad chip design, DUH!).

    In any event, not an easy re-work. BGAs are almost impossible, Surface mounts can probably be done in the field, but I wouldn't. Time will tell how this is handled (good, bad, terrible).

    Me? No, I don't own any INTC/CSCO stock.

    1. Richard 12 Silver badge

      Re: So when do I short CSCO/INTC stock??

      Semiconductors of all types wear out over time, as the doping drifts - mostly due to thermal effects, so hotter parts fail faster.

      Package pins are connected to the silicon by really tiny wires that can snap, eg under the stress of warming up or cooling down.

      There are other failure modes such as insulation breakdown, overvoltages and many more.

      It only takes a small miscalculation or manufacturing error to turn a chip with a theoretical 50-year MTBF into a chip with an 18-month MTBF.

      It sounds like this failure may only matter at boot, if true then a device left running will keep going even after the failure - it just won't boot again.

      It is a shame that Intel is saying nothing about the failure rate. Could be 1%, or even 90%. Given the lack of info, it's probably quite high.
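
      As a rough illustration of the MTBF point, here is a minimal sketch that assumes a constant failure rate (the exponential model). That is a simplification: a wear-out fault like this one would have a failure rate that rises with age, and Intel has published no figures. The 50-year and 18-month MTBF values are only the illustrative ones used above, and `fraction_failed` is a made-up helper name.

```python
import math


def fraction_failed(years_in_service: float, mtbf_years: float) -> float:
    """Fraction of units expected to have failed by a given age.

    Assumes a constant failure rate (exponential model), which is a
    simplification and not a published model of the C2000 fault.
    """
    return 1.0 - math.exp(-years_in_service / mtbf_years)


if __name__ == "__main__":
    # Compare the two illustrative MTBF figures at 18 months and 3 years.
    for mtbf in (50.0, 1.5):
        for age in (1.5, 3.0):
            share = fraction_failed(age, mtbf)
            print(f"MTBF {mtbf} y, age {age} y: {share:.1%} failed")
```

      Under that assumption, roughly 63% of units with an 18-month MTBF would have failed by the 18-month mark, against about 3% for a 50-year MTBF.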

    2. bogd

      Re: So when do I short CSCO/INTC stock??

      Funny you should mention stock value - this is the actual title of an article published today: "Intel Is on a Roll After a Difficult Spell, So Buy the Stock Now"

      Unfortunately, I cannot post the link, but here is a nice quote:

      "...the quarter also solidified 2016 as a comeback year for the Silicon Valley company.

      For years, Intel has tried to break into the mobile-phone business. Last year, it finally secured a deal with Apple to provide chips for the iPhone 7."

      Quite funny in context, eh? :)

  11. Anonymous Coward

    Cheating Software ?

    Perhaps intel's planned obsolescence team has made a mistake and set the thresholds too low?

    This should be investigated. Could be the next VW.

  12. Anonymous Coward

    2017 is the new Millennium Bug.. !

  13. weekend
    Unhappy

    My NAS build uses an ASRock c2750di and mysteriously stopped working several months back. I was blaming ASRock as there are a lot of complaints about that motherboard failing.

    Would there be any way to find out if it's because of intel or if it's an unrelated fault?

    I can't afford to have such an expensive board break again and each time I try to come up with a new build that can handle as many hdd's I get carried away and things get expensive... So that machine is still not replaced.

  14. abortnow
    Unhappy

    Aargh!

    I have two potentially affected boxes:

    iXsystems FreeNAS Mini

    CPU: Intel(R) Atom(TM) CPU C2750 @ 2.40GHz (2400.06-MHz K8-class CPU)

    Nothing yet in the FreeNAS forum.

    Netgate pfSense SG-2220 firewall

    CPU: Intel(R) Atom(TM) CPU C2338 @ 1.74GHz (1750.04-MHz K8-class CPU)

    User comments and questions already present in pfSense forum. No response yet from Netgate.

    Plus the FreeNAS Mini XL I have on order (8-(

    Very annoying that this quite expensive kit should have such a problem. Thanks Intel. Some of us have not yet forgotten the Pentium FDIV saga.
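
    For anyone else checking boot messages like the two CPU lines above, a minimal sketch follows. It assumes that any model named "C2" plus three digits belongs to the Atom C2000 family; it only reads the model name and says nothing about stepping or whether a given unit will actually fail. `c2000_model` is a made-up helper name.

```python
import re
from typing import Optional

# Assumption: an "Atom(TM) CPU C2xxx" string indicates a C2000-family part.
_C2000_PATTERN = re.compile(r"Atom\(TM\)\s+CPU\s+(C2\d{3})\b")


def c2000_model(cpu_string: str) -> Optional[str]:
    """Return the C2xxx model found in a CPU identification string, or None."""
    match = _C2000_PATTERN.search(cpu_string)
    return match.group(1) if match else None


if __name__ == "__main__":
    # The two CPU lines quoted in the comment above.
    for line in (
        "Intel(R) Atom(TM) CPU C2750 @ 2.40GHz (2400.06-MHz K8-class CPU)",
        "Intel(R) Atom(TM) CPU C2338 @ 1.74GHz (1750.04-MHz K8-class CPU)",
    ):
        print(c2000_model(line))
```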

  15. ecofeco Silver badge
    Facepalm

    Rut Roh

    Well... smeg.

  16. Anonymous Coward

    @ pfSense, SuperMicro, Synology & others

    By this stage you should probably release some sort of official statement along the lines:

    - we are aware of the issues with the C2000 CPU

    - we are investigating whether any of our products are affected

    - we are working with intel to determine whether our products are affected and how

    - we will communicate the next steps with you (the customer) in a timely manner

    So far I haven't heard anything from these vendors and this is making me and others very nervous.

    I'm a customer of all of these vendors - the first vendor addressing the problem will keep me as a customer!

    1. Synology UK

      Re: @ pfSense, SuperMicro, Synology & others

      Hello,

      JP here from Synology UK. Unfortunately we didn't receive the contact here so we apologise that no one has responded. If anyone owns this unit and has concerns, please contact our technical support team via www.synology.com/ticket and they will be able to advise you on the best course of action.

      Kind regards

      JP

      1. Anonymous Coward

        Re: @ pfSense, SuperMicro, Synology & others

        Nice one JP. Thanks for acknowledging the issue. Let's hope Synology offers affected customers a workable solution.

        The very fact that you responded here shows that Synology cares. That's more than 99% of other vendors have done so far!

        You've done your organisation a great service.

        1. bogd

          Re: @ pfSense, SuperMicro, Synology & others

          There's a very long way from acknowledging the issue to actually fixing it. So far, all we have seen on this thread is a canned message from JP being posted repeatedly (and asking us to contact support).

          Here is what Synology's support has to say on the matter:

          "Intel has recently notified Synology regarding the issue of the processor’s increased degradation chance of a specific component after heavy, prolonged usage.

          Synology has not currently seen any indication that this issue has caused an increase in failure rates for DiskStation or RackStation models equipped with Intel Atom C2000 series processors compared to other models manufactured in the same time frame not equipped with the affected processors.

          It is safe to continue to use your device, however should you encounter any issues, our support teams will do everything they can to expedite your ticket. "

          I read that as "we know about it, if your unit dies we'll replace it, but until then... good luck!"

          1. Unexploded
            FAIL

            Re: @ pfSense, SuperMicro, Synology & others

            >Synology has not currently seen any indication that this issue has caused an increase in failure rates for DiskStation or RackStation models equipped with Intel Atom C2000 series processors compared to other models manufactured

            There's a seven page (and climbing) thread on dead DS1815+s that begs to differ.

            https://forum.synology.com/enu/viewtopic.php?f=7&t=119727&start=90

            Thankfully, I've got everything backed up on my old unraid box, but I now find myself hoping this thing fails after they've got a bug-free replacement but before the warranty is out.

      2. darkknight

        Re: @ pfSense, SuperMicro, Synology & others

        Thank you JP.

        Any idea of the likely backlog in tickets?

        1. Synology UK

          Re: @ pfSense, SuperMicro, Synology & others

          Hi,

          I am afraid I can't advise a specific time on that but we do strive to get back to people as soon as we can.

          There are technical support teams in America, UK, Taiwan, France and Germany so depending on your location, your ticket will be sent to the appropriate office and therefore there shouldn't be (where possible) time zone delays and having multiple teams helps to further reduce response time.

          Kind regards

          JP

    2. Anonymous Coward

      Re: @ pfSense, SuperMicro, Synology & others

      By this stage you should probably release some sort of official statement along the lines:

      - we are aware of the issues with the C2000 CPU

      - but, we already have your money...

      - and our chairman has spent it on gin and hookers...

      - so we don't care... (although his wife probably does...

      - about the hookers, I mean... she's fine with the gin).

      - ...nearly as much as we should do...

      - as you are now a past problem and we look to the future, because that's where the cash is.

      - so who's the winner and who's the loser?

      - we know the answer to this.

    3. galactica_actual

      Re: @ pfSense, SuperMicro, Synology & others

      Here you go: blog.pfsense.org

      1. bogd

        Re: @ pfSense, SuperMicro, Synology & others

        There is something VERY interesting in the post on the pfsense blog:

        "[The systems that experience the problem] will not suddenly stop working, but if the component fails, the system will not successfully reboot"

        This is the first time somebody has said this clearly (from the Cisco FNs it is not clear whether the devices just stop, or they work properly until rebooted).

        If the people at Netgate are right, this might be bad news for people running affected gear. Because it means that the problem could remain hidden for months or years - until you have a power failure (or you initiate a firmware upgrade), and poof! there go both your primary AND your backup unit... (be they spines, NASs, firewalls, or anything else)

        1. Roland6 Silver badge

          Re: @ pfSense, SuperMicro, Synology & others

          "and poof! there go both your primary AND your backup unit..."

          This is probably the real takeaway from this for those really interested in belts and braces DR, Business Continuity, etc. namely: it is not sufficient to simply have a spare/backup unit, nor is it sufficient for this backup unit to be a different model or vendor's product, it needs to use totally different chipsets.

          I therefore wonder how long it will be before those seriously into such matters will be offering tandem/paired systems, one using an Intel chipset say and the other AMD or ARM.
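
          A minimal sketch of how such a check might look, assuming you keep an inventory that maps each unit to its CPU family. The device names and the `shared_cpu_pairs` helper are hypothetical, invented for illustration.

```python
from typing import Dict, List, Tuple


def shared_cpu_pairs(
    pairs: List[Tuple[str, str]],
    cpu_family: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return the (primary, backup) pairs where both units use the same CPU family."""
    return [
        (primary, backup)
        for primary, backup in pairs
        if cpu_family[primary] == cpu_family[backup]
    ]


if __name__ == "__main__":
    # Hypothetical inventory, made up for illustration only.
    cpu_family = {
        "fw-a": "Atom C2000",
        "fw-b": "Atom C2000",
        "nas-a": "Atom C2000",
        "nas-b": "ARM",
    }
    pairs = [("fw-a", "fw-b"), ("nas-a", "nas-b")]
    for primary, backup in shared_cpu_pairs(pairs, cpu_family):
        print(f"{primary} and {backup} share a CPU family")
```

          Any pair it reports could be taken out by one chip-level fault, however different the boxes look from the outside.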

  17. Stuart21551

    Inventors

    do not trust intel

    1. hplasm Silver badge
      Facepalm

      Re: Inventors

      I knew it.

      Intel Inside is a warning sticker!

  18. Tom 64
    FAIL

    So basically...

    The fact that they declined to comment on when these stopped being shipped and stated that the affected product line 'will' be updated to fix this probably means that:

    - it isn't yet fixed

    - these are still being sold

    - there's a lot of these parts in the channel intel don't want back

    Guess they are trying the old head in the sand approach on this one, hoping no-one will kick up too much of a stink. Twats.

    1. Nick Ryan Silver badge
      Joke

      Re: So basically...

      Not at all. Intel did what any sensible tech company would have done and reverted back to using old, stable equipment to perform the cost vs benefit calculations on. Unfortunately this was an old Pentium chip...

    2. Lennart Sorensen

      Re: So basically...

      I suspect their customers might in some cases want to continue to sell products even if they have to replace them and swap the CPU on the one they sell now. It is their choice to take the risk after intel tells them about the problem (And intel will probably insist on them signing something to continue to receive the chips with the known problem to reduce intel's risk at that point). So I would think it is still shipping although probably not in the same quantities as before.

      New chip revisions take time to make and validate, so certainly not fixed yet.

  19. vmistery

    This will be a problem for me as I have a number of these in my network (Supermicro) working as critical routers and all of them are either at 18 months or over. I did have one unexpectedly reboot last week and I bloody hope it isn't about to pop. Time to ensure I have all the configs ready...

  20. John Smith 19 Gold badge
    FAIL

    A time when diversity in the ecosystem is a good thing

    But of course how can you know if all those different mfg's boxes don't have the same chip (or copy of the chip logic) inside?

  21. eldakka Silver badge
    Pint

    Wonder if this is what happened to the ATO's HPE storage arrays? ;)

    1. Anonymous Coward

      While the HP arrays probably perform as if they run Atom CPUs, they're most likely not affected.

  22. David Roberts Silver badge
    Unhappy

    MTBF?

    Interesting time if you are running a really big network (Telco or cloud).

    Beyond a certain point you know you are going to have regular failures.

    1. Ledswinger Silver badge

      Re: MTBF?

      > Beyond a certain point you know you are going to have regular failures.

      But that's true of most components, and for those running really big networks this should (yeah, right) not be a problem since they ought to expect individual devices to go "phut" without destroying the entire network.

      Think HDDs or SSDs in a DC as the best example, but it's true for the majority of components: if you've got enough, there'll always be some failing. The trick is to have failover systems, sufficient analytics to know what's gone down, and the logistics to replace the failed devices. In this case there's an apparent risk of a spiking failure rate, but knowing that, any sane DC or cloud provider would initiate proactive replacement of some of the devices before failure, so as to spread the replacement workload.

      If there's only a few then just replacing all of them makes sense, but if you've got thousands, and your maintenance workload is stable, then spreading that peak makes more sense than panicking to get every one changed this week.
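
      One way to spread that peak, as a minimal sketch: replace the oldest units first, capped at a fixed number per week. The weekly cap, the start date and the `replacement_schedule` helper are all assumptions made for illustration, not anyone's actual procedure.

```python
from datetime import date, timedelta
from typing import Dict, List


def replacement_schedule(
    in_service: Dict[str, date],
    per_week: int,
    start: date,
) -> Dict[date, List[str]]:
    """Plan proactive swaps: oldest units first, at most per_week units each week."""
    if per_week < 1:
        raise ValueError("per_week must be at least 1")
    # Oldest in-service date first, since those units have accumulated the most wear.
    ordered = sorted(in_service, key=lambda name: in_service[name])
    schedule: Dict[date, List[str]] = {}
    for index, name in enumerate(ordered):
        week_start = start + timedelta(weeks=index // per_week)
        schedule.setdefault(week_start, []).append(name)
    return schedule


if __name__ == "__main__":
    # Hypothetical fleet, made up for illustration only.
    fleet = {
        "router-1": date(2015, 6, 1),
        "router-2": date(2015, 1, 1),
        "router-3": date(2016, 1, 1),
    }
    for week, names in replacement_schedule(fleet, 2, date(2017, 2, 13)).items():
        print(week.isoformat(), names)
```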

  23. TrevorH

    I'm pretty sure that I've had one Supermicro A1SAi fail with these exact symptoms already. Was in normal use one day and then the load average went sky high with no warning and a shutdown/reboot killed it stone dead. One replacement motherboard and processor later...

  24. Anonymous Coward
    Anonymous Coward

    This:

    " contain a faulty clock component that is likely to fail at an accelerated rate after 18 months of operation."

    Then This:

    ""a degradation of a circuit element under high use conditions at a rate higher than Intel’s quality goals after multiple years of service."

    Are they using some old pre-gregorian calendar because in my fucking diary, 18 months does not equate to multiple years.

    1. Nick Ryan Silver badge

      Re: This:

      Unfortunately the bullshit bingo marketing statement is technically correct as 1.5 years is a multiple of a single year as we just expect it to mean "more than 1" (which 1.5 most definitely is).

      1. Anonymous Coward
        Anonymous Coward

        Re: This:

        Fair point, well made.

        To be fair I was spleen venting at the "bullshit bingo marketing dept" as you so eloquently and brilliantly describe it. Just sick of reading about crap stuff where QC is given the least priority and it is us that suffer the consequences through downtime, investigation and being at their whim to get stuff sorted. Having to force things like this out because the bastards just won't hold their hands up when shit turns sour annoys me* ...

        *almost anything these days.

    2. Lennart Sorensen

      Re: This:

      They are not saying they will all fail, they are saying that the rate of failure starts to go up more than normal for Intel's chips, due to a design mistake on the LPC signals.

      So you might have a system that fails in 18 months, or you might have one that fails in 36 months, or one that never fails. Intel almost certainly has statistics on how large the expected increase in failures is after a given amount of time, but they aren't likely to share that. Could be the failure rate is 50% higher than normal, or 5000% higher (I have no idea what the normal failure rate for Intel chips is, although based on the ones I have dealt with over the years, I have not seen very many fail). If the normal failure rate was 0.1% and it is now 1% or 5%, well that's certainly a problem, although it might still mean that most systems will be OK. Unfortunately Intel isn't likely to share that level of detail, although I am sure they have done the calculations and hence determined it was bad enough that they had to admit to it.
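For a feel of what those percentages mean across a fleet, here is a rough back-of-envelope sketch. The 0.1%, 1% and 5% rates are the purely illustrative ones from the comment above, not Intel figures; the fleet size of 1,000 and the assumption that units fail independently are added here for the sake of the arithmetic.

```python
def expected_failures(fleet_size: int, failure_rate: float) -> float:
    """Expected number of failed units if each fails independently at failure_rate."""
    return fleet_size * failure_rate


def prob_at_least_one_failure(fleet_size: int, failure_rate: float) -> float:
    """Chance that at least one unit in the fleet fails (independence assumed)."""
    return 1.0 - (1.0 - failure_rate) ** fleet_size


if __name__ == "__main__":
    FLEET_SIZE = 1000  # hypothetical fleet, not from any vendor data
    scenarios = [
        ("normal (0.1%)", 0.001),
        ("elevated (1%)", 0.01),
        ("elevated (5%)", 0.05),
    ]
    for label, rate in scenarios:
        failed = expected_failures(FLEET_SIZE, rate)
        survivors = FLEET_SIZE - failed
        print(f"{label}: about {failed:.0f} failures expected, "
              f"about {survivors:.0f} units still fine")
```

Even at the 5% figure the large majority of boxes keep running, which is the point being made: a real problem for anyone with a big estate, but not a guarantee that any particular unit will die.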

  25. Bodge99

    Yay.. A good way to kill a chip family.. How many board manufacturers state the stepping/revision number of any chip that is fitted to a particular board?

    Won't most folk just avoid any hardware that contains the description "atom" or "C2xxx" ?

    1. Chris King Silver badge

      "How many board manufacturers state the stepping/revision number of any chip that is fitted to a particular board?"

      Most manufacturers who sold Z68 motherboards made a big song-and-dance about having B3-stepping chipsets, because B2 had SATA controllers that failed in a short time. Hey, sounds familiar ! (Just sayin'...)

  26. HmmmYes Silver badge

    I worked for a company that got shat on by Intel.

    Big pitch on this + that.

    New product, based on Intel silicon.

    The chip did not work. It would be hard to say which was the biggest error, there were so many.

    Intel's solution - wait for the next release, which had a different pin out and different power requirements. A mess.

    Moral of the story - do not build products on Intel silicon unless you know it a) works b) will be in production for the life span of your product.

    Intel have a habit of creating teams then breaking them up before products have been in the market for long.

    Intel also seem to have had a lot of problems with verifying their silicon. I remember one of these chip makers' tech discussions where someone from Intel said they only created a software model to test the 386 in 88ish.

  27. darkknight
    Thumb Down

    Great...

    ...so my 15 day old Synology DS1815+, bought as a local backup destination (in addition to a remote one) will fail in about 18 months. Great.

    Synology give a 3 year warranty, but will the device need to fail first? I've a ticket open with their support, I wonder what their answer will be.

    Too bad I am one day over the cooling off period for internet purchases

  28. aregross
    Thumb Up

    Please note, JP from Synology UK has posted twice in here about contacting their Tech Support Team if you have an affected device of theirs. Sounds like they'll make it right. Good on them to be proactive!

    And to Richard 12 about 'don't re-boot the device and you'll be fine (paraphrasing)', my thought is that if the clock stops, the processor stops... =brick

    1. ChrisC

      It depends on when the clock is required by the system. We know it's definitely required when the system starts up, but it's less clear if it's also then still required once the system has started up and the other clock sources have been initialised OK.

      So as Richard 12 suggests, *if* this failing clock is only being used to get the system off the ground from a restart, then the fault may well remain hidden for however long the system can remain up and running. And if this is the case, it'd then beg the question as to just how many of these Atoms have *already* gone into this knackered state without anyone being aware of it...

    2. Lennart Sorensen

      Not rebooting is not good enough. The clock signal is used for quite a few things inside the CPU.

      Having the system sleep when not working will reduce wear on the clock and make it last longer. Makes sense that things that are off last longer than things that are on. :)

    3. Pliny the Whiner

      "Please note, JP from Synology UK has posted twice in here about contacting their Tech Support Team if you have an affected device of theirs. Sounds like they'll make it right. Good on them to be proactive! ..."

      To be precise, the JP Here From Synology UK Chatbot 4000 posted twice, and it'll likely post again. And again and again and again. You see, the Chatbot is built around the Intel C2000 family of "Atom" microprocessors, which is thought to be defective.

    4. bogd

      You really shouldn't believe everything you read coming from various companies' PR persons.... :)

      When JP said "contact us if you have an affected device", he actually meant "if you have a DEAD device". Other than that, I did contact Synology's support, and as far as I can tell they have no plan to either replace the current units, or change the silicon for future ones. Here's what they had to say on the issue:

      "Intel has recently notified Synology regarding the issue of the processor’s increased degradation chance of a specific component after heavy, prolonged usage.

      Synology has not currently seen any indication that this issue has caused an increase in failure rates for DiskStation or RackStation models equipped with Intel Atom C2000 series processors compared to other models manufactured in the same time frame not equipped with the affected processors.

      It is safe to continue to use your device, however should you encounter any issues, our support teams will do everything they can to expedite your ticket."

      This is similar to what I'm seeing on the pfsense blog - replace units as they die, and bear it out until the end of warranty.

      And speaking of pfsense - damn, that Intel NDA seems to be watertight! Nobody can really say anything about the issue....

      1. Roland6 Silver badge

        Re: and bear it out until the end of warranty.

        You need to dig out your sales receipts and any delivery statements to check what warranty you've actually got: was it one year (RoW), two years (EU) or something different?

        http://europa.eu/youreurope/citizens/consumers/shopping/guarantees-returns/faq/index_en.htm

        1. bogd

          Re: and bear it out until the end of warranty.

          The 2 year EU warranty might not help very much... From your actual link:

          "After six months, you can still hold the seller responsible for any defects during the full two-year guarantee period. However, if the seller contests this, you must be able to prove that the defect existed at the time of delivery. This is often difficult, and you will probably have to involve a technical expert."

          Luckily, many of the affected products also carry manufacturer warranties of at least two years (Synology has 2, 3, or even 5 year warranties on their products).

  29. Anonymous Coward
    Anonymous Coward

    Great excuse for network kit that doesn't work, let's dine out on that for a while.

  30. Daniel Bower

    Just went to log a ticket with WD support

    As I have a new DL4100 which uses an affected chip. Their website is down! Coincidence?! (Well yes it probably is but I found it far more amusing than I ought...)

    1. Roland6 Silver badge

      Re: Just went to log a ticket with WD support

      The DL2100 uses a C2350.

      http://support.wdc.com/knowledgebase/answer.aspx?ID=11425

  31. Anonymous Coward
    Anonymous Coward

    That's maybe because..

    "...The Register pinged Dell via email, and it was not immediately available for comment. "

    ICMP is not a valid protocol for sending email. You might want to re-send it via another method ;)

  32. tempemeaty
    Facepalm

    An untimely issue

    From Chip-zilla to Brick-zilla in...oh wait...my clock seems to be off...

  33. I2R
    FAIL

    Wow Intel, what's up? Lots of silence from them on the Puma cable modem problem too! https://www.dslreports.com/forum/r31079834-ALL-SB6190-is-a-terrible-modem-Intel-Puma-6-MaxLinear-mistake.

    Time to avoid their stuff?

  34. fredds
    FAIL

    ironic

    Recently, google was looking at ARM chips for its servers. The following is a quote from the article.

    But Intel has been in this game for a long, long time, and as a consequence can bring process advantages and expertise to bear that mean the chance of Google being able to actually develop a better general-purpose chip than Intel is slim.

    Intel is going to be taping out 14nm low-power chips next year, which will combine excellent performance with a lower-than-usual power draw. ARM processors, by comparison, will be pumped out of fabs operated by TSMC, Global Foundries, and Samsung, among others, which are thought to be running at the high-20nm node at the moment, and may move to 20nm by end of 2014.

    “With over 50 server, storage and communications designs based on 22nm Intel Atom C2000 (Avoton) SoC launched in September, we are well on our way in leading the low-power 64-bit system-on-chips (SoCs) segment,” an Intel spokesman told El Reg. “Today, Intel Atom is still the only available 64-bit server SoC offering leading energy efficiency and performance and we expect that to continue into next year and beyond as we approach yet another generation of 14nm-based SoCs.”

    1. Anonymous Coward
      Anonymous Coward

      Re: ironic

      fredds' quote is from:

      https://www.theregister.co.uk/2013/12/16/google_intel_arm_analysis/

      "ARM processors [..] will be pumped out of fabs operated by TSMC, Global Foundries, and Samsung, among others" [and the rest]

      Whereas the affected C2000 Intel Atoms are the first of their kind, and the FinFET technology+process they use (for the first time inside Intel) are based on stuff bought in from GlobalFoundries and Samsung (among others?).

      See e.g.

      http://techreport.com/review/25311/inside-intel-atom-c2000-series-avoton-processors

      ""Intel produces this SoC on a custom-tuned variant of its 22-nm fabrication process, which has some of the finest geometries in the industry and is the first process to adopt a "3D" or FinFET-style transistor structure.

      We've already seen quite a few bigger cores manufactured at 22-nm, but the benefits of this process are arguably most notable for low-power chips like Avoton. Intel is taking full advantage of its celebrated manufacturing advantage here""

      Maybe Intel's "celebrated manufacturing advantage" works best when someone else's bought-in ideas (FinFET) are carefully thought about at the early stages of chip design time, otherwise as the clock turns midnight the whole system may turn into a pumpkin (in certain unusual circumstances, NDA applies).

  35. Anonymous Coward
    Anonymous Coward

    The bad trail continues. EXCLUSIVE to El Reg:

    HOT off the press (just before midnight (somewhere), yesterday):

    https://www.theregister.co.uk/2017/02/07/intel_atom_failures_go_back_18_months/

    "[...]

    the problem – which results in bricked systems – became apparent to engineers at product makers when the return rate on gear spiked about 18 months ago.

    [...]"

  36. Anonymous Coward
    Anonymous Coward

    Looks like Netgear is recalling

    This looks to me like a recall is in the works:

    https://kb.netgear.com/000037344/Service-Note-for-RN3130-RN3138-WC7500-and-WC7600v2

  37. Anonymous South African Coward Silver badge

    Anybody being bitten by this bug?

    Got a colleague Down Under who's got a couple of bricked Atoms, and they're not happy....


Biting the hand that feeds IT © 1998–2019