FYI: Processor bugs are everywhere – just ask Intel and AMD

In 2015, Microsoft senior engineer Dan Luu forecast a bountiful harvest of chip bugs in the years ahead. "We’ve seen at least two serious bugs in Intel CPUs in the last quarter, and it’s almost certain there are more bugs lurking," he wrote. "There was a time when a CPU family might only have one bug per year, with serious …


stay on top of firmware updates

Semi-regularly, anyway...

Until I joined my present company and moved them out of public cloud onto hosted infrastructure (HP ProLiant) in 2011, firmware updates seemed problematic: difficult to keep track of, and sometimes really difficult to apply.

Enter the ProLiant Service Pack ISO image. Combined with iLO virtual media, it really changed the game for me, anyway, in being able to easily apply firmware updates and know what versions are installed; I can just tell support I am on PSP 2016.10 or something like that. All firmware components get updated, whether BIOS, iLO (out-of-band management), power management firmware, network cards, storage controllers, disk drives, etc.

Oh what a joy... in 2012 a flaw was discovered in the QLogic (HP OEM) NICs, and HP had me apply firmware updates to them. Those updates weren't available through the PSP (yet), so I had to build what I believe was a custom boot CD (FreeDOS or Linux, I forget) in order to apply the updates (ESX 4.1 was the server OS). It took me several hours just to build that; I hadn't done it in years, and my only access was remote over iLO virtual media. But I got it done. It was a harsh reminder of how firmware updates used to go for me. Those QLogic NICs eventually got replaced: manufacturing defect.

At a previous company, in about 2009, they asked me to track down a performance issue on their Dell servers. It ended up being related to Seagate drives, and there was a firmware fix (before that, I don't think I had ever NEEDED to apply a firmware update to a hard disk connected to a server). However, the fix had to be applied via a DOS floppy boot disk (no fancy management on those servers), so the hardware guy had to go to each one and plug in a USB floppy drive to update the firmware. The update fixed the performance issue. Damn Dell and their multi-vendor setup: the servers had at least three different brands of disks in them, even those bought within the same batch of gear. The company had tried to troubleshoot the issue for a year prior to my arrival.

Earlier than that, working with Supermicro gear... just forget it. They even used to (and maybe still do) specifically say DON'T DO A FIRMWARE UPDATE unless you have a problem that support says is fixed by firmware. Not only that, but they often didn't even put a list of changes in the firmware files (as someone who had purchased about 400 servers with Supermicro kit in 2004-2005, I was pretty shocked). My last experience updating Supermicro firmware was, ironically, on my own personal server at a colo. To update the out-of-band management firmware, the first step they tell you to do is reset the configuration to defaults (never really a viable option for remote management). So I did, and I lost connectivity immediately. That was probably two or three years ago now; fortunately there hasn't been a failure since, and I haven't gone on site to try to fix it. The next step is to replace the system; it is getting old.

I know that in fancier setups with blades and the like the process is even simpler and more automated (even more so for VMware shops applying firmware and driver updates in the right order; fortunately I have never had an issue with driver/firmware versions). I have about 40 DL38x systems running about 1,300 VMs, nothing converged here, and I apply firmware updates typically once per year. Before the PSP, servers would typically only get firmware updates when they were first built (if that), or when there was a problem and support said to apply a firmware fix.

I know there were one or two issues with the PSP in the past year or so; HP recalled one of the PSPs, I think. It didn't affect me: I never grab the latest one right away, and always give it at least one to three months to bake in (on top of the time the updates take before they make it into the PSP).

Recently, due to size constraints I guess, HP split the PSPs up, so instead of one ISO I have to use one for G7, one for G8, and one for G9/G10 (I only have G7-G9). Not that big a deal, though.

I had used HP gear back in 2003-2008, though as far as I recall there was no such easy PSP method for installing firmware at the time.


An Added Dimension

"Thanks to growing chip complexity, compounded by hardware virtualization, and reduced design validation efforts, Luu argued, the incidence of hardware problems could be expected to increase."

Let's not forget that people are now actively looking for flaws instead of blindly believing Intel marketing blurbs.


Re: stay on top of firmware updates

It's a good thing too, seeing as HPE's firmware (and drivers) is bug-ridden crap!

Anonymous Coward

Re: stay on top of firmware updates

@Nate Agreed, the HP firmware process rocks now, except for the whole warranty registration thing. Dell PowerEdge servers are similarly easy: download the repo from Dell, then point your server's lifecycle controller at it via FTP/SMB. Tell it to update everything and it does, all remotely via the iDRAC.

Been a long time getting to this point but it's sure nice.

Anonymous Coward

Re: stay on top of firmware updates

Wasn't KCL's data loss meltdown due to a firmware inconsistency when replacing a RAID/disk controller?


Re: stay on top of firmware updates

point your server's lifecycle controller at it via FTP/SMB. Tell it to update everything and it does, all remotely via the iDRAC.

Been a long time getting to this point but it's sure nice.

Until some miscreant uses this nice simple process to update your firmware and install a backdoor. There are advantages to making invisible software hard to update.


Unfortunately we, the plebs don't have any choice in the matter of terrible, terrible chips because there is yet to be one that is fully open-source and can be fabricated by someone other than Intel in Israel.


Open source a chip with billions of transistors?

Seriously... the bugs-per-million-transistors rate is pretty low. If you have ever worked with large SW or HW projects like this, you know it takes a freaking army. And, oh, by the way, who's going to pay for and manage all of the multi-million-dollar mask sets to actually build one, etc.? I know Bobby, your neighbour downstairs, says he's really good, but I'm not using something from him. You might want to go get some other cheese to go with your whine, as the cheese you're eating is not good for you...


Er, there's OpenPOWER from IBM (current and open source), and Sun used to give away SPARC designs for free (I think Oracle still does).

OpenPOWER is particularly attractive; there's an outfit called Raptor Engineering doing a completely open source machine (chips, board schematics, firmware and Linux) based on it. There are lots of reasons to buy one of those!

Anonymous Coward

Re: Open source a chip with billions of transistors?

Oh, another "computing is somehow more complex than any other human endeavour". Bollocks, and it always has been. Computing is one of the few science subjects where we are actually in control. Yes, there are physics problems, but there are in every other applied science too, and they don't have a "well, it's good enough for the sheep" philosophy.

The truth is they have been getting away with selling crap for so long that they think it is their right. Well, it is not.

The days where CPUs were designed on the back of cigarette papers are long gone, now there are plenty of electronic design aids that can minimuise logical errors but since I would suggest layout, obscrufication and clockspeed are given a higher priority than functionality then bad design is seen as okay.

Lastly, do not come at me with "if you have ever worked with large SW or HW projects like this".

Large-scale projects just need to be managed properly, with the right people doing the right jobs. That works for everything else except computing, where actually producing a professional, finished product is seen as optional; due, I must add, to people like you pushing the idea that modular design is somehow completely different in computing than anywhere else it is applied. These problems are not down to physics; they are down to sloppy design, and that is a purely human problem.


@bazza

Now that's intriguing. This is a problem that the crypto community wrestles with all the time. Definitely worth watching; it'll be interesting to see the price points once it's out of pre-order.


the plebs don't have any choice in the matter of terrible, terrible chips

"The plebs" never have any choice in chips because there aren't any convenient chip foundries to pop out just a few on a wafer. Seriously it's a big undertaking, closed or open design.

The closest to any of that are the ARM designs, but then of course the licensing has to be paid, etc. And some of the designs are still vulnerable to Spectre.


Re: Open source a chip with billions of transistors?

"that can minimuise logical errors"

I just think that's too good not to immortalize in another post. Not saying anything about your other points, so carry on.


Modern processors require large design teams and huge compute ranches for simulation and verification. The crazy things modern processors do (particularly the CISC ones) to obtain that performance are truly amazing, but, as with software, the complexity comes with a cost: design defects.

While you could get something fabricated yourself, it won't be a cutting-edge processor.


"The plebs" never have any choice in chips because there aren't any convenient chip foundries to pop out just a few on a wafer. Seriously it's a big undertaking, closed or open design.

RISC-V, it seems there are suppliers already.


RISC-V, it seems there are suppliers already

RISC-V is not a CPU design; it is the specification of an instruction set. You can't take it to a foundry and have them produce you a CPU.


So you can't use SiFive's open-sourced designs based on RISC-V?

(Maybe you can't, but that's what I've understood the news articles about SiFive to mean.)

Anonymous Coward

>Unfortunately we, the plebs don't have any choice in the matter of terrible, terrible chips because there is yet to be one that is fully open-source and can be fabricated by someone other than Intel in Israel.

Amateur hour at the El Reg Commentard VLSI Design Center. Hilarious.

Chip designs are not software. Those who treat them as such come unstuck.

But that said, there are open source processor designs out there, e.g. LEON.

Anonymous Coward

Re: Open source a chip with billions of transistors?

@AC "The days where CPUs were designed on the back of cigarette papers are long gone, now there are plenty of electronic design aids that can minimuise logical errors but since I would suggest layout, obscrufication and clockspeed are given a higher priority than functionality then bad design is seen as okay"

I think this kind of post just goes to show how little most people know about integrated circuit design and manufacture. But because they know how to use/fix an iPhone or laptop, or are "in IT", they think they are suddenly chip design experts.

Anonymous Coward

Re: Open source a chip with billions of transistors?

@AC2 "show how little most people know about integrated circuit design and manufacture"

Yes, that's why there is an article about the leading PC CPU manufacturer's failure to produce working components. Too many people were afraid to question Intel's secrecy, and the result is that Intel sold crap as gold and told us they were doing us a favour.

Favours do not typically cost thousands and then not work properly; that sort of favour is a sign that your "friend" doesn't actually like you at all.

That sort of "friend" is one that no one needs. So don't try to tell me that's a thermometer and you're just checking my temperature; everyone knows when they have been shafted.

Anonymous Coward

"Modern processors require.... yadda yadda" says who? Yes, companies that sell individual chips for thousands can employ huge teams, but clearly the masses of helpers failed to prevent these stupid errors in what was sold as a finished product.

CISC, yes, brings additional inherent complexity, but there should be nothing crazy in the design process. Though clearly you are correct: Intel did put crazy inside their CPUs, and their customers were crazy to buy them.


So you can't use SiFive's open-sourced designs based on RISC-V?

There are a number of preliminary implementations of RISC-V (and really only the user-mode instruction set is fully settled at this stage), but you have to read what they mean by "open source" very carefully. RISC-V is being touted as a common instruction set that can be targeted by open source software, with the aspiration that this leads to a diversity of chip suppliers, freed of the licensing constraints on the ISA. That doesn't say anything about the licensing of the hardware design, though. It also says nothing about the large number of patents on processor design that might constrain any particular implementation seeking to be entirely open.

I've not looked into it in detail, but I'm aware of only one significant project aiming to produce truly "open" hardware based on RISC-V. Although SiFive say they're "changing the way people buy IP", their hardware is not, as far as I can tell, "open source": there's still a licence agreement and a fee, as there would be with, say, ARM, although the process overhead is said to be much lower.


@karlkarl

Uhmm, Intel in <u>Israel</u>..... Really? *Checks memory of visit to a Chipzilla plant* Nope. Not really.

Chipzilla has two plants in Israel (https://en.wikipedia.org/wiki/List_of_Intel_manufacturing_sites). They're mostly making exactly the same stuff that comes out of the fabs in the USA on the 45nm and 22nm nodes. Those plants exist to provide a backup that is not in the same general geographic region as the other plants, since both Hillsboro and Chandler could potentially be affected by the same large natural/political/man-made disaster.

The location has nothing to do with it.


Stop making them faster

Start making them better.

Sometimes it's just better to spend time improving quality rather than adding features and speed.

Anonymous Coward

Re: Stop making them faster

Problem is, technically this is already happening. Raw GHz is not increasing like it was; it's parallelism and branch prediction where it's all at.


Re: Stop making them faster

"It's parallel and future branch prediction where it is all at."

Which are aimed at making them faster in terms of computing power. And look where we've just discovered it's got us.


One bug per year? What has this guy been smoking? Any processor comes with an errata sheet. Some bugs get fixed in firmware, some are not serious enough to do anything about until the next release. Like the fdiv bug. Oops.

Reduced design verification? Not that I've seen. But I'd like to see any DV methodology which includes attacks by hackers.

I wonder what the bug rate is for the equivalent amount of Microsoft code. Remember, processors work pretty well without hardware patches every two weeks.


"Any processor comes with an errata sheet"

And is that a good thing?


Yes, because you know the list of flaws. They really aren't intended for public consumption; it is the people who design the PC hardware, write the BIOS/UEFI, write the operating systems and write the compilers who need to know that stuff. The average Joe who buys a PC with an Intel CPU doesn't need to see the list of two dozen errata for the stepping (which will grow over time as a few more are found). Most of them are corner cases of a corner case, and not worth worrying about (i.e. they'll find a way to mitigate them if they can, but a lot are basically marked "wontfix" because they don't matter in the real world).

There aren't any chips that are errata-free, at least not anything much more complex than a 6502. So while in a perfect world chips would have no errata, it's the same perfect world where software has no bugs. It doesn't exist in the real world where humans design things.


you know the list of flaws. They really aren't intended for public consumption,

Because the "public" might decide all microprocessor mfg are a bit s**t?

Quelle f**king surprise.


Even the 6502

The 6502 had a couple of well-known bugs; you just had to code around those. The early 16-bit chips like the 68000 had bugs. I doubt that there has ever been a bug-free processor.


Re: Even the 6502

I know for certain there are some bug-free, formally verified cores used in security roles. I don't know this for sure, but I'd bet the CPU core Apple is using for its secure enclave was formally verified. The L4 microkernel it is running is formally verified, but without the CPU it runs on also being formally verified, that's not really worth much.

Any CPU you want to formally verify would have to be very small and simple: in-order, single core. Once you go out-of-order or SMP, I'd have to think it would get too complicated for formal verification, even if you could automate most of it.


@Dc SynTax

Any processor comes with an errata sheet

And is that a good thing?

You might excel at syntax, but context is not your forte, right?

One bug per year? What has this guy been smoking? Any processor comes with an errata sheet.

Let me explain what the above means:

Some MS guy comes along and pulls the following out of his backside:

I predict the number of bugs in CPU's will increase [...] [to] one bug per CPU per year.

The guy has never heard of processor errata sheets, which prove we long ago passed the one-bug-per-year milestone ... more like 5 or 10, if you ask me. You don't, and that's fine.

CPU errata sheets are better than "This update fixes an issue in Microsoft Windows" boilerplate patch descriptions we get in Windows Update.

Books and scientific publications come with an ERRATA sheet and I think it is good, because honest.

For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled.


Re: Even the 6502

And even formal verification of the logic doesn't guarantee that you won't have problems like the Atom C2000 clock issue - translating it all into silicon is still a bit of a black art.


Re: Even the 6502 - The early 16 bit chips like the 68000 had bugs

IIRC the early 68000 bug caused two sets of drivers on the internal bus to turn on, one all ones and one all zeroes, leading to a crack down the middle of the case and a smell of decomposing epoxy. Literally Meltdown.


Re: Even the 6502

"And even formal verification of the logic doesn't guarantee that you won't have problems like ..."

...like Spectre? Let's not lose sight of the fact that Spectre is not a bug. The chip is doing exactly what its designers intended. It's just that, with hindsight, they wish they'd intended something less susceptible to side-channel attacks.

Anonymous Coward

Re: Even the 6502

>I know for certain there are some bug-free formally verified cores, used for security roles.

Wanna bet? Formal verification does not mean a design is bug free. Just that it matches the specified design intent.


Re: "I know there are some bug free cores"...

The "Halting problem" may wish to have a conversation with you... or it may not. It's a bit annoying like that.


I don't think you read the Dan Luu post from 2016, recently updated.

https://danluu.com/cpu-bugs/

The guy knows what he's writing about and is worth paying attention to.


Re: Even the 6502 - The early 16 bit chips like the 68000 had bugs

That was the "Halt and Catch Fire" instruction.

Very useful in military applications where you did not want your software leaking from chips with on-board ROM.


"Books and scientific publications come with an ERRATA sheet and I think it is good, because honest."

Agreed. But the problem is the ready acceptance that they should be needed, particularly in this context. Would it not be better if development effort were concentrated on fixing the errata so as to eliminate them rather than adding more features which in turn add more errors?


Re: Even the 6502

Wanna bet? Formal verification does not mean a design is bug free.

Yes it does guarantee the design is bug free. What it does not guarantee is that the actual device is bug free - i.e. when manufacturing issues rear their ugly head like they did with the Intel Atom C2000.


Re: Even the 6502

I won't downvote you for the inaccuracy, because it is shockingly common.

The 68000 was a 32-bit processor with a 16-bit data bus, much like the 8088 was a 16-bit processor with an 8-bit bus.


Re: Even the 6502

The Z80A was bug-free thanks to the sprites. My Spectrum also had the perfect keyboard. I keep telling potential employers that a fully secure business should deploy only Spectrums, as one sysadmin can listen for any modems or tape decks being accessed.

Anonymous Coward

Re: Even the 6502

DougS> Yes it does guarantee the design is bug free. What it does not guarantee is that the actual device is bug free

No. Test will discover on-chip/device issues. Formal verification, in a chip-design-flow context, verifies that your design-intent RTL/HDL matches your gate-level netlist after each flow step, right up until the final netlist generated by your place & route tool.

(Processor design doesn't necessarily follow SOC design methodology 100% but the principles are there and for certain processor IP implemented as ASICs they still apply.)


Re: Even the 6502

Wanna bet? Formal verification does not mean a design is bug free. Just that it matches the specified design intent.

My experience with formal verification[1] is that it leads to *more* bugs.

The reason: the verification process is itself complex and therefore error-prone, and the longwinded processes involved provoke humans into taking their eyes off the ball and possibly even cutting corners.

I recollect a very brief (between-client-projects) involvement with a former employer's formally verified satellite telemetry, tracking and control system. I made myself unpopular when I found an error that I tracked down to an off-by-one in the implementation of the formal tests. Whoever had produced the code in question had naturally concentrated on the hardest part of the job (getting it through the tests) and was evidently too distracted to apply the common sense to see that the outputs were wrong.

[1] admittedly from sometime last century.


It's a good thing that errata sheets for chips are provided.

It's not so good that they are a necessity in the first place.


@Nick Kew - satellite telemetry, tracking and control system

You worked on the Hathaway project at Pacific Tech?

Anonymous Coward

Re: Even the 6502

Yes, the 6502 had logical errors, like ROR not passing the bit back to the left, but at the time, and for its price, it was deemed acceptable until it was fixed in later versions.

Low unit price was a big factor in the 6502's uptake, not to mention that Acorn used the 6502 as the model for the ARM.

Then add the size of the 6502 design team and the lack of modern design aids to the low CPU price, and you can understand why they kept making the faulty part. Basically, what worked on the 6502, for its price and availability, was enough for other people to sell computers designed around it.

Intel, on the other hand, were selling what people had been told they wanted, at the highest price possible, and they have no excuse at all: more than enough money, people, time and resources to do it right.


Re: Even the 6502

"I won't downvote you for the inaccuracy because it is shockingly common.

The 68000 was a 32 bit processor with a 16 bit data bus"

It wasn't really.

Although it had 32-bit registers, it only had a 16-bit ALU. The 8086 also has a 16-bit ALU and is considered a 16-bit processor; the RCA 1802 had 16-bit registers and an 8-bit ALU and is considered 8-bit. As a computer, the really important thing is the ALU width. When we benchmarked the then-current version of the 68000 against the NS 16032, which did have a 32-bit ALU, the 16032 absolutely wiped the floor with it on our 32-bit integer arithmetic test set. National Semi, in their marketing, always described the 16032 (later the 32016) as the first 32-bit microprocessor for this reason. Acorn used the 16032 as a coprocessor in some of their designs to give "workstation" performance, and it's said that while the series was a bit of a flop, some of its design features eventually made their way into the Pentium.

