
Back to article: MIT boffins: Use software to fix errors made by decaying silicon

Smaller transistors means more noise means more errors means the collapse of everything we know and love about computers, such as their infallible ability to run with perfect stability for years on end … right? Well, perhaps not such perfection, but the basic problem remains: ever-shrinking microprocessor feature sizes will some …

COMMENTS

This topic is closed for new posts.

"Modern software barely works when the hardware is correct"

Kudos to anyone who manages an amusing comment that does not reference the fact he works for Microsoft.


How does one know the answer is correct?

So, you've got an instruction to calculate something, you throw that to the ALU, you get an answer.

Presently, we assume that the one we get back is the right answer. This article proposes that we don't, or we somehow check it. So you run the calculation 3 times, and get 3 slightly different answers. Which one do you use? Are we doing better by effectively dividing our CPU efficiency by 3?

Reading further, they suggest switching to the slower "error free" mode. If you've already calculated the right answer there, is there any benefit in running it a second time, only to get a wrong answer to a question whose right answer you already know?

I'm clearly missing something.
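One common answer to the "three slightly different answers" question is to take the median rather than an exact majority, since a noisy unit may never return two bit-identical floating-point results. A minimal sketch, with an entirely invented noise model standing in for the unreliable ALU:

```python
import random
import statistics

def unreliable_alu(x, noise=1e-6):
    """Model an ALU whose result carries a small random error (illustrative only)."""
    return x * x + random.uniform(-noise, noise)

def median_of_three(x):
    """Run the computation three times and return the median result.

    The median discards a single wild outlier without requiring any
    two runs to agree exactly.
    """
    return statistics.median(unreliable_alu(x) for _ in range(3))

result = median_of_three(3.0)  # close to 9.0, within the noise bound
```

The efficiency objection still stands, of course: this costs three runs per answer.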


Re: How does one know the answer is correct?

"Presently, we assume that the one we get back is the right answer. "

Yup - and that led to all sorts of entertainment when a fairly prolific British space scientist pointed out that code run on 32- and 64-bit Linux machines gave different answers.

Further digging showed BOTH sets of answers were wrong(*), thanks to poor coding and bad assumptions in some very widely used software - which is still in wide use in the space science arena.

(*) "Wrong" as in "Close, but not close enough to prevent space probes smacking into Mars occasionally"


Re: How does one know the answer is correct?

The slower "error free" mode is essentially what processor manufacturers already do. They test processors up to a certain speed, and if a part produces correct results on the test harness, they package it at that rating. Occasionally they will sell the ones that only work at slightly slower speeds, and mark them as such.

The point of this paper seems to be to attach a probability of correct computation to each step of the code. That way, some corrective measure can be taken if the probability drops below some limit. For playing YouTube videos of cats the tolerance could be larger (who cares if the tabby cat's colour is slightly off); for a guidance computer the required probability would be much higher.
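That per-step bookkeeping can be caricatured as a wrapper that multiplies per-operation correctness probabilities and falls back to a slow-but-safe recomputation when the bound drops below a task-specific threshold. A toy sketch, not the paper's actual Rely semantics, and with all the probabilities made up:

```python
class Reliable:
    """A value paired with a lower bound on its probability of being correct."""
    def __init__(self, value, p=1.0):
        self.value = value
        self.p = p

def unreliable_add(a, b, p_op=0.99):
    # Each operation on the fast/unreliable unit degrades the bound.
    return Reliable(a.value + b.value, a.p * b.p * p_op)

def checked(result, threshold, safe_recompute):
    # If the tracked bound falls below the task's threshold, redo in safe mode.
    if result.p < threshold:
        return Reliable(safe_recompute(), 1.0)
    return result

x, y = Reliable(2.0), Reliable(3.0)
# Cat video: a 0.90 threshold tolerates the fast result (bound 0.99).
video = checked(unreliable_add(x, y), threshold=0.90, safe_recompute=lambda: 2.0 + 3.0)
# Guidance computer: a 0.999 threshold forces the safe recomputation.
guidance = checked(unreliable_add(x, y), threshold=0.999, safe_recompute=lambda: 2.0 + 3.0)
```

The division of labour matters: the hardware only has to be characterised once, while each application states how much unreliability it can stomach.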


The spirit of the CDC6600 rises again

Legendary for the rule of "Run the program twice to make sure you get the same answer."

A Seymour Cray design, it skimped on a few things to get more speed.

Error checking being one of them.

That said, finding a (fairly) unobtrusive notation is quite impressive.


I can see the point in software recovery of hardware that may degrade over time (Martian landers etc.), but to allow the creation of an inaccurate system and then try to predict what is correct appears backward.

My only caveat is feedback - if we develop digital systems that are so fast that feedback systems drive the output towards the statistically correct result - it could work ... Personally I'd call it an 'analogue computation module' and shove a d-to-a and a-to-d on each end ... simples ...
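The "feedback drives the output towards the statistically correct result" idea can be caricatured as averaging many noisy runs: if the noise is zero-mean, the mean converges on the true answer by the law of large numbers. A sketch under that (invented) noise assumption:

```python
import random

def noisy_compute(x, noise=0.05):
    """A computation on 'analogue' hardware with zero-mean Gaussian noise (illustrative)."""
    return 2 * x + random.gauss(0, noise)

def averaged(x, runs=1000):
    """Average many noisy runs; the mean converges on the true result."""
    return sum(noisy_compute(x) for _ in range(runs)) / runs
```

The catch is the same as with triple runs: trading throughput for confidence.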


@Andy The Hat

Same thought here. It seemed obvious to me that we were going to be facing this soon, and equally obvious that the two ways to go were probabilistic algorithms on reliable hardware, or biting the bullet and going properly analogue where feasible. Both of these have potential major power savings. This third option never occurred to me, and it does seem a little weird, but I'm sure MIT aren't just pretty faces.


Just add hardware

IMO the Stratus Computer approach works best here: you run two MPUs in lock-step off the same clock, both running the same code. You also have fast hardware that compares their outputs. If it sees a difference, it signals an error and turns the board off, keeping its output off the main bus. Nothing is lost, because there's another board doing exactly the same thing and synced with the first board. Meanwhile, the sysadmin sees there's a problem, pulls the failed board and plugs in a replacement, which fires up, gets synchronised and shadows the surviving board.

This approach can be made to work for all components of the computer and, of course, doesn't need anything special in the way of compilers or program organisation. Last but not least, Moore's Law says that if this worked and was affordable in the late '80s, which it did and was, it will be dirt cheap by now. Even in the late '80s it was giving four nines uptime and, IME, was rather more reliable than Tandem NonStop kit, which required proprietary software structures and compilers to provide a similar service.
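The pair-and-compare step described above can be sketched in a few lines; the board functions and failure model here are invented for illustration:

```python
def lockstep_step(board_a, board_b, inputs):
    """Run two lock-stepped replicas on the same inputs and compare outputs.

    Returns (output, failed). On a mismatch, failed is True and the caller
    takes this pair off the bus while the shadow pair carries on.
    """
    out_a, out_b = board_a(inputs), board_b(inputs)
    if out_a != out_b:
        return None, True   # signal error, isolate this pair
    return out_a, False

# A healthy pair agrees; a glitching partner is detected immediately.
healthy = lambda x: x + 1
glitchy = lambda x: x + 2
```

Note that two replicas can only detect an error, not say which board is wrong, which is why the scheme relies on a second shadowing pair rather than a vote.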

Roo

Re: Just add hardware

"IMO the Stratus Computer approach works best here: You run two MPUs in lock-step off the same clock, both running the same code. You also have fast hardware that compares their outputs"

I suspect that is not possible on a lot of MPUs these days because they sit on a pile of caches and can't even give you a guaranteed interrupt response time. Out of order execution won't help either.


Re: Just add hardware

You need a minimum of three MPUs in lock step, coupled to a voting system, and take the majority vote as the correct answer. This has been used in various systems which have to deliver high integrity in an environment with high error rates, such as the space programme.
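Triple modular redundancy in miniature: collect the three replica outputs and take the majority, a sketch that assumes correct units produce bit-identical results:

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over replica outputs (triple modular redundancy).

    With three replicas this tolerates any single faulty unit;
    it raises if no value wins a majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica faulty")
    return value

tmr_vote([7, 7, 7])   # all agree -> 7
tmr_vote([7, 3, 7])   # one faulty replica is outvoted -> 7
```

This is detection *and* correction in one step, at the price of a third unit and the voter itself becoming a single point of failure.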


Re: Just add hardware

Depends on the safety and reliability requirement.


How does one ensure that the hardware decays in the sense that soft-error probability increases, rather than the thing just grinding to a halt like a station wagon that prangs itself as a wheel comes off?

Analogue computers are good at that because they don't use symbols that can be corrupted by a few bit flips, but then they don't need Rely either.


some ways of relaxing reliability

This sounds like it's related to the utterly mind-bogglingly "brilliant" dark-silicon work. Good job by the good professor on getting himself a ton of publicity (and presumably funding), but it seems pretty clear he hasn't ever worked in, say, imaging, where tons of effort is put into killing bad pixels because people are actually very good at spotting them and find them offensive.

My guess is that they are assuming you take your existing test machines, characterise hard and transient failures per die, look at under-voltage characteristics either in simulation or in final silicon, and use that information to optimise power and die yield. How putting a dot on a variable is sufficiently informative to distinguish between "I can tolerate 3 LSB of error in this calculation" and "the result of this calculation can be completely bogus (e.g. NaN, INF, -INF, 42, . . .) 97% of the time", I have no clue. If you knew you had an adder that was, say, stuck at zero in the LSBs you could guarantee the first; characterisation of delay and running at an appropriately over-aggressive lower voltage could be done to meet the second constraint.

At any rate, this all seems like a useless crock. Another day passes and I remain grateful that I dropped out of academia.


grey noise

The folks @ MIT just invented a random number generator

Roo

Re: grey noise

Gray Noise, surely. ;)
