The Register® — Biting the hand that feeds IT

* Posts by Frank Rysanek

37 posts • joined Tuesday 2nd October 2007 13:55 GMT

Frank Rysanek
Thumb Up

Re: Numbering

And, Windows 2003 Server is 5.2. Makes a hell of a difference from XP in some drivers (and no, just changing that version string in the INF file often doesn't get the job done - some kernel API's really are slightly different).

Makes me wonder what Server 2008 reports (no live machine at hand).

Frank Rysanek
Thumb Up

C2D came along with solid-polymer caps (=> capacitor plague was over)

Yes, Core2Duo with 2 GB of RAM is quite good enough for general office work, web browsing, movie playback and the like. And the 45nm generation of Core2 doesn't even eat all that much power. Any CPU's before that, back to say the Pentium 4 heating radiators, have somewhat less performance and eat more electricity. But the C2D is quite okay. And there was another major change, that came along with the C2D: it was the solid-polymer electrolytic caps. This ended the "capacitor plague" era when motherboards only lasted like 2-4 years. With solid-polymer caps and C2D, motherboards do survive the warranty, do survive twice the warranty, do survive much longer. I work for an IPC assembly shop (which is a somewhat conservative industry) and in terms of desktop-style and PICMG motherboards, there are hardly any boards RMA'ed with solid-polymer caps on them.

Frank Rysanek
Devil

20W for that much neural computing horsepower...

To me, it seems pretty wasteful to emulate a rather fuzzy/analog neural network on a razor-sharp vaguely von-Neumannian architecture (albeit somewhat massively parallel). Perhaps just as wasteful as trying to teach a human brain (a vast neural network) to do razor-sharp formal logic - with all its massive ability to create fuzzy and biased associative memories and search/walk the associations afterwards, with all its "common sense" being often at odds with strictly logical reasoning ;-)

Frank Rysanek
Joke

HPC = using electricity to produce heat... co-generate heat and crunching horsepower?

Not my invention, I admit - was it here, where I read about a supercomputer that had a secondary function of a heat source for a university campus?

Now if you needed to build a supercomputer with hundreds of MW of heat dissipation... you could as well use it to provide central heating to a fairly big city. Or several smaller cities. Such as, there's a coal-fired 500MW power plant about 40 km from Prague, with a heat pipeline going all the way to Prague. The waste heat is used for central heating. Not sure if the pipeline still works that way, it was built still within the commie era when such big things were easier to do...

The trouble with waste heat is that it tends to be available at relatively low "lukewarm" temperatures. Computers certainly don't appreciate temperatures above say 40 degrees. Then again, there are heating systems that can work with about 30 degrees Celsius of temperature at their input. Probably floor heating sums it up - not much of a temperature, LAAARGE surface area. No need to heat the water up to 70 degrees or so.

The obvious implication is: generate heat locally, so that you don't need to transport it over a long distance, which is prone to thermal losses and mechanical resistance (for a given thermal power throughput, the lower the temperature, the larger the volume of media per second).
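
The parenthetical above follows from P = m_dot * c * dT: for a fixed thermal power, a smaller temperature drop needs proportionally more flow. A rough sketch in Python, assuming water as the medium with approximate constants; the function name and the example figures are illustrative only, not data from any real pipeline:

```python
# Rough estimate of the water flow needed to carry a given thermal power.
# Assumptions (not from the post): water as the medium, constant specific
# heat and density, no losses along the pipeline.

WATER_SPECIFIC_HEAT = 4186.0   # J/(kg*K), approximate
WATER_DENSITY = 1000.0         # kg/m^3, approximate


def volume_flow_m3_per_s(power_watts: float, delta_t_kelvin: float) -> float:
    """Volume of water per second needed to move power_watts at the given
    temperature drop between supply and return."""
    if delta_t_kelvin <= 0:
        raise ValueError("temperature drop must be positive")
    mass_flow = power_watts / (WATER_SPECIFIC_HEAT * delta_t_kelvin)  # kg/s
    return mass_flow / WATER_DENSITY


if __name__ == "__main__":
    # Hypothetical comparison: the same 500 MW at a large and a small drop.
    for delta_t in (40.0, 10.0):
        flow = volume_flow_m3_per_s(500e6, delta_t)
        print(f"500 MW at a {delta_t:.0f} K drop: {flow:.1f} m^3/s")
```

Quartering the temperature drop quadruples the volume that has to be pumped, which is why lukewarm waste heat is best used close to where it is produced.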

The final conclusion: sell small electric heat generators, maybe starting with a few kW of continuous power consumption, electric-powered and interconnected by fiber optics, where the principle generating heat is HPC number crunching. Build a massive distributed supercomputer :-) Yes I know there are some show-stopping drawbacks - but the concept is fun, isn't it? :-)

Frank Rysanek
Gimp

Re: Asus mainboards?

I recall trying to max out 10Gb Eth several years ago. I had two Myricom NIC's in two machines (MYRI-10G gen."A" at the time), back to back on the 10Gb link. For a storage back-end, I was using an Areca RAID, current at that time (IOP341 CPU) and a handful of SATA drives. I didn't bother to try random load straight from the drives - they certainly didn't have the IOps.... They had enough oomph for sequential load. I used the STGT iSCSI target under Linux. Note that STGT lives in the kernel (no copy-to-user). The testbed machines had some Merom-core CPU's in an i945 chipset. STGT can be configured to use buffering (IO cache) in the Linux it lives within, or to avoid it (direct mode). I had jumbo frames configured (9k). On the initiator side, I just did a 'cp /dev/sda /dev/null' which unfortunately runs in user space...

For sequential load, I was able to squeeze about 500 MBps from the setup, but only in direct mode. Sequential load in buffered mode yielded about 300 MBps. That is simplex. The Areca alone gave about 800 MBps on the target machine.

Random IO faces several bottlenecks: disk IOps, IOps throughput of the target machine's VM/buffering, IOps throughput of the 10Gb cards chosen, vs. transaction size and buffer depth...

Frank Rysanek
Meh

Did I hear a rumor about "moving the VRM on-chip" ?

A few days ago, there were rumours that Haswell would have more say in how its input rail power is regulated - beyond the VID pins of today. Some even suggested that the VRM would move into the CPU package. Does anyone know further details? This would be more of a revolution than skin-tone enhancements...

Frank Rysanek
Thumb Up

Re: Armadillo?

It was an interesting and insightful comment nonetheless - thanks for that.

Frank Rysanek
Meh

Re: Isn't that many "recent technologies" are crap?

My point exactly. I'm a geek by profession, and I've been having a self-perception of being pretty conservative, since my twenties maybe. I try to understand and leverage the underlying basics, rather than adopt any shiny new toy (after a few such toys, it gets old). A lot of the "new" technologies *are* crap. New things introduced for the sake of novelty / eyewash / sales pitch, rather than utility / progress / improvement. How many times can you sell an Office suite, with just a new version sticker on it? A new version of a windowing OS? A "radically novel" user interface? Yet another software development environment? Increasingly nasty licensing schemes and vendor lock-in? Some of the new stuff feels increasingly regressive...

I have the luxury of working in a small company, where I'm free to study and try whatever I want.

I also meet IT and "embedded system integration" pro's in other companies. Speaking of training, in my experience, many of them could use training in "the underlying basics". Stuff as basic as Ethernet, TCP/IP, dynamic behavior of disk drives, disk partitioning and file systems (UNIX/Linux angle), vendor-independent basic OS and networking concepts - just to get some common sense. But maybe it's indeed down to everyone's personal eagerness to "peek under the hood", or ability to take a distance from the product you've just purchased. Or, down to chance, down to opportunity to work with different technologies...

Most of the training commercially available is heavily vendor-specific and brainwashy (Cisco, Microsoft). The other side of things is undoubtedly "freedom to starve to death", as Sir Terry would put it... Once you reach the "intermediate sorcerer" level, you can become a freelancer, pretty much on your own.

In my part of the world, there are a number of "training products" apparently developed with one key goal in mind: to get an EU grant from the "training/education funds of the EU". Hardly any IT training in there, or any other rigorous professional training. Mostly soft skills. The non-IT colleagues attend those trainings voluntarily, even happily. I have other, more entertaining or useful ways of wasting time / procrastinating - or study :-)

Frank Rysanek

Uh oh, whatever that means for CFast ?

A while ago, I was delighted to test-drive my first CFast card. I even managed to find some card readers, sold in Germany and elsewhere under the DeLock brand... CFast seems like a neatly open, future-proof and fairly obvious successor to CompactFlash. I'm starting to wonder if CFast is going to survive, or if it turns out to be a dead end... (owing to big-name camera vendors' marketing decisions). It would certainly be wonderful to have CFast as a ubiquitous boot drive form factor in embedded PC's for the years to come (instead of CompactFlash, now that parallel IDE is finally dead in new PC chipsets).

The SATA interface spec is nowadays capable of 600 MBps. The CFast card that I held in my hands (SLC-based, by Innodisk) was capable of about 90 MBps sequential (Linux dd-like test), featuring a 1st-gen 150 MBps SATA interface.
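
A dd-like sequential read test of the kind mentioned above can be approximated in a few lines of Python. This is only a sketch: the function name and the block size are my own choices, and on a real system the OS page cache will inflate the figure unless it is dropped or bypassed first.

```python
import time


def sequential_read(path: str, block_bytes: int = 1024 * 1024):
    """Read path from start to end in fixed-size blocks.

    Returns (bytes_read, megabytes_per_second), using 1 MB = 10**6 bytes.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_bytes)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    rate = total / 1e6 / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Pointed at a block device (which needs read permission on it), this gives a number comparable to the 90 MBps above only if the test conditions are similar.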

Frank Rysanek
Coat

interesting stuff

A couple questions come to mind... do the machines contain a Flash-based data logger, keeping track of temperatures over the service life / warranty period of the machine? As part of AMT 12.0 maybe? :-)

High-temp computing is interesting stuff. If you pay attention to board-level design, there is a lot you can do to help your design survive longer in operation at higher (broader) temperatures. And there's a lot you can *spoil* by careless board-level and system-level design.

MLCC capacitors (for power blocking) are made of several dielectric materials with quite different sensitivity to temperature - even if we speak "comparable" models in the range of tens of uF per unit, typically used to block low-volt high-amp CPU power. Some drop to ~40% of their capacity at -20 *C, some are much more stable.

Cheap Aluminum electrolytic capacitors also lose capacity at low temperatures and their ESR increases maybe tenfold, but even in conventional Al Elyts the chemistry can be slightly modified (alcohol added?) to make them perform better at low temperatures. (I hope the capacitor plague is over by now.) More importantly nowadays, solid-polymer elyts don't seem to have that low-temp problem at all, and they don't dry out at higher temp either. They last much longer. The downside is, that solid-polymer elyts are not made for voltages above say 30V - so you cannot use them at mains PSU primaries :-( So the PSU may well be the weakest spot in any computer, especially the PSU primary, which must contain conventional Al elyts and is typically a point of hot air exhaust, which certainly doesn't help the Al caps' longevity.

Next, in order to compensate for low-temp effects and gradual ageing, there may be room for designing in more capacity on a motherboard, just to be on the safe side, to have some headroom. Connecting more caps in parallel may bring the added bonus of decreasing the actual ripple current per capacitor, which decreases the capacitors' internal heating -> allows for operation in a higher ambient temperature. (The effect of cap addition -> ESR decrease might actually translate quadratically into the temperature difference / derating, which then translates into the cap's service life along some vaguely exponential curve.)
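
The quadratic effect mentioned in the parenthesis can be shown with simple arithmetic. A minimal sketch, assuming identical capacitors that share the ripple current equally (real boards will not split it this evenly); the function name and the example values are hypothetical:

```python
def per_cap_heating_watts(total_ripple_amps: float, esr_ohms: float,
                          n_caps: int) -> float:
    """I^2 * ESR dissipation in each of n identical paralleled capacitors.

    Assumes the total ripple current divides equally among the capacitors.
    """
    if n_caps < 1:
        raise ValueError("need at least one capacitor")
    per_cap_current = total_ripple_amps / n_caps
    return per_cap_current ** 2 * esr_ohms


if __name__ == "__main__":
    # Hypothetical figures: 6 A of total ripple, 20 milliohm ESR per cap.
    for n in (1, 2, 3):
        print(n, "caps:", per_cap_heating_watts(6.0, 0.02, n), "W each")
```

Doubling the number of capacitors cuts the dissipation in each one to a quarter, so the self-heating drops much faster than the capacitor count grows.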

Effectively it all translates into attention to detail, and into board space occupied by the caps and by on-PCB heat dissipation space. Any additional heatsinks (e.g. on VRM FET's) mean mechanical design, which means substantial added cost (apart from designer headache) - so the board designers typically let the FET's run rather hot, because they're relatively insensitive... And space is always at a premium, especially in the ever-more-compact datacenter gear.

Let me suggest an interesting concept: if you could let your gear run at hotter ambient temperatures, you wouldn't have to use air conditioning (artificial freezing) all that much, you could use plain heat transfer more of the time. As far as I can tell, higher ambient temperatures are prevented by relatively sharp "temperature gradients" in the hardware = poor heatsinking.

Heatsinking seems to be a nasty can of worms for PCB and case designers. Especially fanless designs are highly suspicious in principle... It's interesting to see how different hardware vendors deal with this, deduce who's willing to take radical + effective + systemic steps, and who resorts to eyewash... (put shiny galvanized heatsinks on chipset + VRM, run some pointless heatpipes among them, then cover the biggest heatsink by a company logo badge).

I have to admit that in this respect, the top three name-brand hardware makers are generally in a higher league, compared to the noname market - and have been in a higher league for many years back.

One last observation maybe: even noname motherboards that started coming with solid-polymer elyts in the VRM, last much longer. The transition from the "plagued" Al Elyts to solid-polymer elyts also coincided with a transition from P4 Netburst to Core2Duo (I'm Intel-based, sorry), which overall ran much cooler. My favourite way of building a long-lasting computer used to be: take an LGA775 motherboard that has enough VRM oomph to support the 130W P4's, and slot in a 45nm low-end C2D or Celeron :-) It tends to take a BIOS update, if the board has some older chipset.

Frank Rysanek
Thumb Up

agreed, for the most part

The one advantage of a hardware RAID (if integrated with a proper/compatible chassis) is failure LED's. This is somewhat difficult to get from a software RAID on a plain HBA.

At the low end, you need a disk enclosure/backplane with either SGPIO (combine with Adaptec, Areca or just about anything recent with SFF-8087) or with discrete "failure" signals, one per drive (combine with any Areca controller).

As for firmware continuity, Adaptec AACRAID used to be my favourite during several years, but in the recent years Areca has taken over their crown. Replace a controller with something you find in your dusty stock of spare parts - well that's where the fun begins :-)

Swapping cables around has never been a problem on any HW RAID. One recent experience, with a SAS-based Areca: I built an array in a 24bay external box (attached to an Areca RAID). Then I powered down the box, added another external JBOD, and scattered the drives between the two enclosures. Powered up, and voila, no problem - Areca combined all the drives correctly, from the two enclosures. Or another example: build a RAID in one 24bay enclosure, and then plug in another external SAS enclosure at runtime. No problem - enclosure detected, drives enumerated, ready to configure another RAID volume or whatever...

A quiz question: suppose you buy a new server with two drives in an Adaptec (AACRAID) mirror. Before installing your production OS, you try some recovery exercises, to see how the firmware works. You set up a mirror in the Adaptec firmware, you install an OS maybe, you remove a drive from the mirror and insert another one, to see what it takes to rebuild the mirror. The rebuild goes on just fine. You go ahead with the OS install and turn that into a production machine. You remember to "erase" the drive that you initially pulled, when testing the hot-swap: to be precise, you plug it alone into the Adaptec RAID controller once again, and remove the degraded array stump. Then you plug back your two production drives (the mirror), and put that "cleared" drive aside. After two years, one of the production drives fails - so you fumble in your drawer, produce the spare drive, plug it in, maybe a powercycle... and voila: *the production mirror array is gone* ! Explanation: the most recent configuration change, logged on the drives, happened to be the array removal on your "spare" drive...

Otherwise I agree that for Linux it doesn't make much sense to buy a HW RAID just to mirror two drives to boot from. If you know your way through the install on a mirror, and maybe to install grub manually from a live CD, and especially if you don't plan to spend money on a proper hot-swap enclosure (so that failure LEDs are not an issue either), the Linux native SW RAID will prove similarly useful as any HW RAID firmware. For Windows users willing to spend some money on hot-swap comfort, I tend to suggest the dual-port ARC-1200 with some SAS series enclosure by Chieftec = the ones coming with workable failure LED's (the ARC-1200 is SATA-only).

As for parity-based SW RAID on Linux: if you can find it in dmesg, the MD RAID module does print a simple benchmark of several alternative parity calculation back-ends (plain CPU ALU, MMX, SSE etc) and picks the speediest one. And the reported MBps figure has been well into the GBps area for ages (since the PIII times). 3 GBps on a single core are not a problem - corresponding to 100% CPU utilization for that core.
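
For readers who haven't met it, the parity calculation being benchmarked is, for RAID5, a byte-wise XOR across the data blocks of a stripe. A minimal sketch in plain Python, nowhere near the MMX/SSE speeds quoted above; the function names are made up for illustration and this is not how the MD code is structured:

```python
def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized byte blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]       # one stripe across three drives
    parity = xor_blocks(data)                 # goes to the parity drive
    print(rebuild_missing([data[0], data[2]], parity))  # lost block 1
```

Because XOR is its own inverse, the same routine serves for both writing parity and rebuilding a failed drive.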

For most practical purposes though, you'll be limited by your spindles' (drives') random seek capability. This is about 75 IOps for the basic desktop SATA drives. That is typically the bottleneck with FS-oriented operations. In such a scenario, you won't get anywhere near a HW RAID's CPU throughput limit. And yes, OS-based buffers / disk cache can sometimes help there - provided that you can configure the kernel's VM+iosched to make use of all the RAM (speaking of Linux that is).
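
To put the seek limit into numbers: random throughput is roughly IOps times transaction size. A back-of-the-envelope sketch, assuming the load spreads perfectly across all drives and ignoring caches and RAID write penalties; the function name is illustrative:

```python
def random_throughput_bytes_per_s(iops_per_drive: int, transaction_bytes: int,
                                  n_drives: int = 1) -> int:
    """Idealised random-access throughput of n independent spindles."""
    return iops_per_drive * transaction_bytes * n_drives


if __name__ == "__main__":
    one_drive = random_throughput_bytes_per_s(75, 4096)
    print(f"one drive, 4 KiB transactions: {one_drive / 1024:.0f} KiB/s")
    array = random_throughput_bytes_per_s(75, 64 * 1024, n_drives=8)
    print(f"eight drives, 64 KiB transactions: {array / 1e6:.1f} MB/s")
```

Either figure is orders of magnitude below the GBps parity rates above, which is the point: with random load the spindles run out of breath long before the CPU does.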

Frank Rysanek
Linux

-20% power; nearest competitor

Based on past reading, I believe they compare themselves to the 45nm Atom cores (possibly N270 still with FSB), rather than the ION's video acceleration capability :-) I have to say that the current Nano yields quite some neat number-crunching oomph, considering the 2W (or so) power envelope.

Regarding the open-source woes, actually it's not that bad - I believe their disk controllers are supported by the mainline Linux kernel, the S3 graphics subsystem is basically supported by X.org - correct me if I'm wrong here, I haven't checked for a while. In the low-power segment, DM&P Vortex has perhaps better open-source support (quite a bit of open documentation) and even lower absolute TDP - but also significantly less crunching power (which is no problem in many control applications).

Frank Rysanek
Thumb Up

22nm planar vs. 22nm tri-gate

Exactly my point :-) It really seems to me that they've merely found a way to make the die-shrink work out once again, i.e. once again somewhat in proportion to the basic geometry, before they finally have to give up any hope of dragging Moore's law any further using just silicon and lithography. 32nm->22nm = 0.6875^2 =~ 0.47 . By boasting just a 50% cut in power consumption, they're in fact admitting that they've *almost* made the die-shrink work out up to the theoretical expectations :-D
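
The arithmetic, spelled out as a tiny Python sketch (the function name is mine; it assumes area, and with it the idealised power, scales with the square of the feature size):

```python
def area_scale(old_nm: float, new_nm: float) -> float:
    """Ideal area ratio after a linear shrink from old_nm to new_nm."""
    return (new_nm / old_nm) ** 2


if __name__ == "__main__":
    scale = area_scale(32, 22)
    print(f"32nm -> 22nm: linear {22 / 32:.4f}, area {scale:.4f}")
```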

Thanks for the _deep_ explanation BTW - if it wasn't for El Reg, I'd wade through Intel's PR fog blindfolded till the end of my days :-)

Frank Rysanek
Coat

Single point of failure => what about fault tolerance?

Okay, suppose the software can cope with NUMA on this scale... the next question is, how does RHEL or SLES or Windows Server cope with individual component failures? Is RAM and CPU hot-swap there already? What if a whole multi-socket blade goes down?

(Small hardware = small problems. It's so obviously soothing. Unless you sell too many of them, and there's a systematic design flaw... Have to pet the smartphone in my pocket just to feel basically sane again...)

Frank Rysanek

Wobbly CMOS clock to blame for garbled playback? Not likely...

In a particular generation of SuperMicro dual-Xeon boards (Nocona/Irwindale = socket 604), some devices in the RAM VRM's were dying. Cannot say if it was the old elyt caps or the switching FET's combined with some thermal design omission... multiple different motherboards of that generation showed higher RMA rates. Since those days, I've heard no complaints about SuperMicro - I'm pretty sure they've learned from that historical experience :-)

As for system clock distribution: that's a relatively complex issue. You have multiple levels of clocks in the system (some of them in hardware, some of them in software) and multiple levels of audio data buffers (again HW/SW). Makes me wonder if you were facing buffer underruns, or indeed wobbly playback clock (as in sampling rate). All audio cards that I know of have their own Xtal oscillators for the sampling rate clock - so the system-wide PCI clock should have little effect. BTW, the CMOS RTC clock can hardly be the culprit - the PCI clock and the various hardware timers' clocks (=> also your OS system clock) tick along some other master reference crystal, different from the non-volatile CMOS RTC. And, that multi-output clock synthesizer for the various busses and chipset subsystems can employ a technique called "spread spectrum" on purpose - to duck some EMI radiation limits simply by making the radiated "frequency poles" broader / softer. In some BIOSes, the "spread spectrum feature" can be disabled (in others, it cannot). This "spread spectrum" thing is quite common and perfectly legitimate in modern chipsets/motherboards.

However, I still don't think a little bit of added jitter in the CPU+PCI clocks would hamper your audio playback. Rather, I have a different explanation: IRQ and general bus transfer latencies, resulting in buffer underruns. "A few years back" could quite as well correlate to the transition from the old-fashioned discrete interrupt delivery over dedicated signals, to the new+hip message-signaled delivery, in-band over the "hub link" or whatever the chipset backbone link is called. The change has come in the form of chipsets such as i815 / i845. Previous Intel chipsets and contemporary chipsets from cheaper competition still used the old "out of band" IRQ delivery and were therefore showing better "interrupt latencies" under load. Another factor might be that, at about the same time, motherboard vendors (BIOSes) started to use SMI more extensively for software emulation of some missing features (such as, to emulate legacy keyboard / floppy on top of USB devices) - again resulting in occasional excessive latencies. The RTAI project even had some standard test utilities for this. I recall that some telco voice processing boards for the PCI bus did have a problem with that - and a feasible workaround at the time was to replace the Intel-based mobo's with something SiS-based.

I mean to say that none of this is a problem on part of SuperMicro - it's evolution, and it's common to a particular generation of system chipsets. Blame the chipset makers...

Frank Rysanek

@"one BIOS to rule them all"

> http://www.arm.com/community/software-enablement/uefi.php

Good, thanks for that link :-) ARM joined the "UEFI Forum" in March 2008. It's about time for some open hardware with ARM+UEFI to start to appear on the shelves.

In the PC World, I haven't been aware of UEFI very much. Some name-brand BIOSes do contain that interface, but as far as I can tell, mostly I still walk the PC BIOS side of things... Or maybe it's just that I'm not aware that Windows actually calls the UEFI stuff, rather than legacy BIOS. Maybe when drives >2TB become common as system drives, I will notice :-)

As for Mac hardware (EFI), that's a bit of a special case - the hardware is legcuffed to MacOS, and it takes a bit of tweaking to get e.g. Windows running on that - provided that the proud Mac owner would ever want to dual-boot into Windows, which somehow counters the purpose of buying an Intel Mac :-)

Anyway I don't mean to argue with your good point that attempts at "ARM BIOS" have been there for quite some time...

Frank Rysanek

one BIOS to rule them all

Do you know when I'd start to be afraid, in the shoes of Intel? When ARM publishes a comprehensive, uniform and open BIOS-like interface, for the ARM architecture = a software compatibility standard that would allow you to boot any operating system on a broad range of ARM machines, without you having to customize a bootstrap loader for the particular hardware model (i.e. read HW docs, modify obscure C/ASM code, compile, flash over JTAG or some other hardware probe). Another condition for being afraid would be: it would have to get adopted by the gadget makers.

*that* would be competition to the x86 PC's - to their core virtues: universal compatibility and openness.

Is that the kind of feature any of the gadget makers want? No no, quite the opposite :-) Security (vendor lock-in) by obscurity and "architecture fragmentation" reign supreme. ==> the uniform commodity x86 market is still safely in the hands of Intel (and its formal x86 competition).

Frank Rysanek
Alert

Re: new = less reliable => disk drives :-)

The most recent disk drives on the market, at any given time, are bleeding edge, and tend to have lower reliability. Highest possible areal data density, four double sided platters, quite a bit of heat produced... In the recent history, especially around 1 TB (3.5") the vendors were pulling all their best of cunning tricks to cover up physical defects at runtime, to compensate for poor reliability of the platter surface.

The most reliable drives tend to be the lowest-capacity model still being manufactured at any given time.

As for long-term durability (years, up to a decade or more): on several occasions in the last few years, while diagnosing some RMA'ed drives, or rather drives long over warranty, I've noticed an interesting "syndrome" or phenomenon. During an initial full-surface sequential reading test, the drive reports a couple of bad sectors, scattered across the surface of the drive. On repeated sequential reading tests, it's always the same sectors. Next, I tend to write the drive with all zeroes - to test if it fails when writing as well. The write test gets completed just fine. Next, I try another sequential read - lo and behold, the drive reads just fine! Even upon many repeated sequential readings, e.g. looping the full-surface read test for a week, the drive acts just fine.

My hypothesis: the payload data tend to wear off in the sectors. After years of sitting on the platter surface, the recording fades out - difficult for me to say if this is due to natural properties of the material, or due to writing/deflection magnetic field activity all around during long runtime hours. Note that this fading out does not impact the track alignment marks, comprising the skeleton of the drive's low-level format - those are made on bare platters by the disk vendor using a special machine - those tracking marks are much more durable, and "track not found" is a much more serious error than "error reading this sector".

This might have an interesting implication. To keep your data safe, you may as well want to "refresh" the recording every year or so. Just read the whole drive sector by sector, and write back the sector contents immediately to the same place, as you go along. It could keep the recording alive for many more years. As far as I know, no one does this. RAID firmwares can check the surface periodically in a read-only fashion, looking for sectors that have already failed - but as far as I know, no one has ever tried *refreshing* the recording on the platters just in case.
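
The read-then-write-back loop could look something like the Python sketch below. It is only an illustration of the idea: the function name and chunk size are my own, it has no error handling for unreadable sectors, and it assumes nothing else is writing to the target while it runs. Reads may be served from the OS cache, so whether every sector really comes off the platter and gets physically rewritten depends on the OS and the drive firmware. Try it on a scratch file before pointing it at a device.

```python
import os


def refresh_in_place(path: str, chunk_bytes: int = 1024 * 1024) -> int:
    """Read path chunk by chunk and write each chunk back to the same offset.

    Returns the number of bytes rewritten. The contents are left unchanged.
    """
    total = 0
    with open(path, "r+b") as dev:
        while True:
            offset = dev.tell()
            chunk = dev.read(chunk_bytes)
            if not chunk:
                break
            dev.seek(offset)      # go back to where this chunk came from
            dev.write(chunk)      # and write the same bytes again
            total += len(chunk)
        dev.flush()
        os.fsync(dev.fileno())    # push the rewritten data out of the OS cache
    return total
```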

Frank Rysanek
Linux

Good idea = use HDDs instead of tapes

Now that's an interesting idea - use cheap notebook drives instead of tape cartridges. Maybe the tape cartridge survives a few more G of mechanical shock, but the disk drive, with its magnetic heads and bearings sealed in a dust-tight environment, should OTOH survive many more overwrites and longer hours of continuous operation. Which may also compensate for the higher cost of disk drives vs tape cartridges (and maybe tape drives).

Why use some mechanical or electrical multiplexing of the drive lanes? For a smaller number of drives=cartridges, you can just as well use an expander, it may be cheaper than a robot + tape drives. If you have the know-how, you can build your own virtual library along that principle - buy some cheap hot-swappable case (such as SuperMicro SC216 or SC417, or SC847 for 3.5" drives) and build a virtual library on top of that. It's not much of a problem to shut down or spin up a drive in software, or keep it in some shallower stand-by state, and even to watch out for hot-swaps. An important part remains to be solved though: the management software candy on top, and some backup client software. Something to keep track of your "tape-style disk cartridges" and maybe provide a virtual tape interface on demand to the clients. Someone in the open-source camp with plenty of free time could as well start coding something like that :-) Okay, once you run out of drive bays and you need a robot, it's back to the library makers...

There are potential fields of application where it's intriguing to deploy a big robotised tape library, with several tape drives, to perform long-term archival of some data - such as from continuous video surveillance systems (big brother kind of thing) or medical Xray/CT data. I've been told by practitioners who have attempted that kind of thing, that tapes have downsides in this application. The tape drives wear out much too fast - not up to 24x7 continuous operation in video systems. And in the medical systems, the users tend to get addicted to the possibility of having a patient's past history always at their fingertips, so that the library again just keeps humming all the time and the users are disappointed about the access time ("hey it's all in the computer somewhere anyway, so why does it take so long"). Plus, in the medical imaging technology, the data volume just EXPLODES every time a new machine is installed in the hospital (having a higher resolution, being 3D rather than 2D etc). And, somewhere in between all that, the tape cartridges are not very reliable after all... It's a crazy world...

Frank Rysanek
Linux

ReiserFS ; spindles

The one Linux filesystem, notorious for its capability to work with myriads of small files, is ReiserFS. There are downsides though: the stable ReiserFS v3, included in the vanilla Linux kernel, has a volume size limit of 16 TB. ReiserFS v4 is not in the mainline kernel (in some part due to "functionality redundancy" reasons = inappropriate code structure) and its future is somewhat uncertain - but it is maintained out of tree and source code patches (= "installable package") for current Linux kernel versions are released regularly. Both versions also have other grey corners, just like everything else...

When working with a filesystem that large, I'd be concerned about IOps capability of the underlying disk drives (AKA "spindles"). The question is, how often you need to access those files, i.e. how many IOps your users generate... This problem is generic, ultimately independent of the filesystem you choose.

Frank Rysanek

even just different x86 chipsets?

Someone has previously reported that the code worked fine on one x86 "chipset", but not on another one. And that it was a piece of onboard hardware in the drones... yes it goes against the supposed "data harvesting" categorization of the software product. That reference to two x86 chipsets may be related to another problem, rather than the originally described FP precision issue.

Difficult to say what the word "chipset" is supposed to mean here - whether just the system bridges, or if the CPU is implied as well. E.g., a mobile Core2 certainly has some more oomph than an Atom.

Either way: I recall that the people making the RTAI nanokernel (for hard-realtime control under Linux) do have statistics and testing tools to evaluate a particular chipset's IRQ latency, and it's true that even among "x86 PC compatibles" some brands and models of chipsets show reasonably deterministic responses, while other chipsets show quite some interesting anomalies in interrupt service response time. This was real news with the introduction of Intel 8xx series of chipsets, where Interrupts were delivered inband over the HubLink for the first time in x86 history, rather than out-of-band via discrete signals or a dedicated interrupt bus - so that interrupt messages were competing for bus bandwidth (time) with general bulk payload. At that time, competing "old-fashioned" chipsets such as the SIS 650 had a much better IRQ delivery determinism than the early Intel Pentium 4 chipsets. Some cases of high IRQ latencies are attributed to unaustere use of SMI by the BIOS, e.g. for software emulation of features missing in hardware... but that again tends to go hand in hand with a particular chipset, stems from BIOS modules provided by the chipmaker etc. Don't know what the situation is now, how the different later generations of Intel hardware behave, how the Geode compares for instance...

Heheh, were (GP)GPU's involved by any chance? That's another area where floating point math can screech in its hinges...

Frank Rysanek
Go

RAID software quality from Intel etc

Several issues come to mind.

Historically, Intel has had soft-RAID "support" in several generations of their ICH's - on top of SATA HBA's, up to six drive channels. A few years ago it was called the "Application Accelerator", then it was renamed to "Matrix Storage". I don't know for sure if there's ever been a RAID5/XOR Accelerator in there, or if the RAID feature consisted of some ability to change PCI ID's of the SATA HBA at runtime + dedicated BIOS support (= RAID support consisting of software and PR/advertising, on top of a little chipset hack). Based on the vague response in the article, I'd guess that there's still no RAID5 XOR (let alone RAID6 Reed-Solomon) acceleration in the PCH hardware - what they said means that they're looking at the performance and trying to squeeze out as much as possible out of the software side. Looks like not much is new here on the software part (RAID BIOS + drivers) - the only news is SAS support (how many HBA channels?), which gives you access to some swift and reliable spindles (the desktop-grade SATA spindles are neither), if the ports support multi-lane operation they could be used for external attachment to entry-level HW RAID boxes, and if the claim about expander support is true, you could also attach a beefy JBOD enclosure with many individual drives (unless the setup gets plagued by some expander/HBA/drive compatibility issues, which are not uncommon even with the current "discrete" SAS setups). I'm wondering about "enclosure management" - something rather new to Intel soft-RAID, but otherwise a VERY useful feature (especially the per-drive failure LED's are nice to have).

The one safe claim about Intel on-chip SATA soft-raid has always been "lack of comfort" (lack of features). The Intel drivers + management software, from Application Accelerator to Matrix Storage, have been so spartan that they were not much use, especially in critical situations (drive fails and you need to replace it). I've seen worse (onboard HPT/JMicron I believe), but you can also certainly do much more with a pure-SW RAID stack - take Promise, Adaptec HostRAID or even the LSI soft-RAID for example. It's just that the vanilla Intel implementation has always lacked features (not sure about bugs/reliability, never used it in practice). Probably as a consequence, some motherboard vendors used to supply (and still do supply) their Intel ICH-R-based boards with a 3rd-party RAID BIOS option ROM (and OS drivers). I've seen Adaptec HostRAID and the LSI soft-stack. Some motherboards even give you a choice in the BIOS setup, which soft-stack you prefer: e.g., Intel Matrix Storage or Adaptec HostRAID. Again, based on one note in the article, this practice is likely to continue. I just wish Intel did something to improve the quality of their own vanilla software.

One specific chapter is Linux (FOSS) support. As the commercial software-RAID stacks contain all the "intellectual property" in software, they are very unlikely to get open-sourced. And there's not much point in writing an open-source driver from scratch on top of a reverse-engineered on-disk format. There have been such attempts in the past, and they led pretty much nowhere. Any tiny change in the vendor's closed-source firmware / on-disk format would "break" the open driver. And the open-source volunteers will never be able to write plausible management utils from scratch (unless supported by the respective RAID vendor). Linux and FreeBSD nowadays contain pretty good native soft-RAID stacks and historically the natural tendency has been to work on the native stacks and ignore the proprietary soft-RAID stacks. The Linux/BSD native soft-RAID stacks can run quite fine on top of any Intel ICH, whether it has the -R suffix or not :-)

People who are happy to use a soft-RAID hardly ever care about battery-backed write-back cache. Maybe the data is just not worth the additional money, or maybe it's easy to arrange regular backup in other ways - so that the theoretical risk of a dirty server crash becomes a non-issue. Power outages can be handled by a UPS. It's always a tradeoff between your demands and budget.

As far as performance is concerned:

Parity-less soft-RAIDs are not limited by the host CPU's number-crunching performance (XOR/RS). If you omit the possibility of sub-prime soft RAID stack implementation, the only potential bottleneck that remains is bus throughput: the link from north bridge to south bridge, and the SATA/SAS HBA itself. Some Intel ICH's on-chip SATA HBA's used to behave as if two drives shared a virtual SATA channel (just like IDE master+slave) in the old days - not sure about the modern-day AHCI incarnations. Also the HubLink used to be just 256 MBps thick. Nowadays the DMI is 1 GBps+ (full duplex), which is plenty good enough for 6 modern rotating drives, even if you only care about sequential throughput. Based on practical tests, one thing's for sure: Intel's ICH on-chip SATA HBA's have always been the best performers around in their class - the competition was worse, sometimes much worse.

As for parity-based RAID levels (5, 6, their derivatives and others): a good indicator may be the Linux native MD RAID's boot messages. When booting, the Linux MD driver "benchmarks" the (potentially various) number-crunching subsystems available, such as the inherent x86 ALU XOR vs. MMX/SSE XOR, or several software algorithm implementations, and picks the one which is best. On basic desktop CPU's today (Core2), the fastest benchmark usually says something like 3 GBps, and that's for a single CPU core. I recall practical numbers like 80 MBps RAID5 sequential writing on a Pentium III @ 350 MHz in the old days.
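For anyone wondering what the XOR number-crunching actually amounts to, here is a minimal Python sketch of RAID5-style parity. It is only an illustration of the principle, not how the MD driver is written (that is optimized C/assembly); the function names and the tiny sample stripe are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column (one byte from each)
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Hypothetical stripe: one chunk per data drive
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend the second drive died and rebuild its chunk
recovered = rebuild_missing([data[0], data[2]], parity)
```

The same XOR pass serves both for computing parity on write and for rebuilding a lost chunk, which is why raw XOR throughput is a fair first indicator of parity RAID performance on a given CPU.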

The higher-end internal RAID cards, containing an IOP348 CPU at ~1GHz, tend to be limited to around 1 GBps when _not_ crunching the data with XOR (appears to be a PCI-e x8 bus limit). They're slower when number-crunching.

In reality, for many types of load I would expect the practical limit to be set by the spindles' seeking capability - i.e., for loads that consist of smaller transactions and random seeking. A desktop SATA drive can do about 60-75 random seeks per second, enterprise drives can do up to about 150. SSD's are much faster.

The one thing I've recently been wondering about is this: where did Intel get their SAS HBA subsystem from? Already the IOP348 contains an 8-way SAS HBA. Now the Sandy Bridge PCH should also contain some channels. Are they the same architecture? Are they not? Is that Intel's in-house design? Or, is it an "IP core" purchased from some incumbent in the SCSI/SAS chipmaking business? (LSI Fusion MPT or Agilent/Avago/PMC Tachyon come to mind.) The LSI-based HBA's tend to be compatible with everything around. Most complaints about SAS incompatibility that I've noticed tend to involve an Intel IOP348 CPU (on boards e.g. from Areca or Adaptec) combined with a particular expander brand or drive model / firmware version... Sometimes it was about SATA drives hooked up over a SAS expander etc. The situation gets hazy with other less-known vendors (Broadcom or Vitesse come to mind) producing their own RoC's with on-chip HBA's...

Frank Rysanek
Coat

DNS SRV records for HTTP *still* unsupported... (or are they?)

Audio visualization and 2D acceleration are the hot news TODAY? And a very simple web redundancy framework, potentially very useful, known and wanted by informed people for a decade, is still missing: SRV - just a small update to the DNS resolver. Just a few lines of code. The largely technophobic/ignorant web-surfing masses would appreciate it too, once it got into production use. A bug entry has been in the bugzilla for years, even with some early patches. The ultimate excuse from the Mozilla team has always been that there is no RFC standard. There is for SRV, but not specifically for SRV on HTTP.

https://bugzilla.mozilla.org/show_bug.cgi?id=14328

http://support.mozilla.com/tiki-view_forum_thread.php?comments_parentId=6112&forumId=1

The only party who would certainly not appreciate HTTP SRV, are the vendors of content switches for HTTP load balancing / HA solutions.
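To show how little logic the client side needs, here is a rough Python sketch of the target selection described in the SRV spec (RFC 2782): lowest priority first, weighted random choice within one priority. It is a simplification (the RFC's special handling of zero-weight records is reduced to a plain random pick), it does no actual DNS lookup, and the record tuples and hostnames are hypothetical.

```python
import random

def srv_order(records, rng=random):
    """Return SRV records in the order a client should try them.

    records: list of (priority, weight, port, target) tuples.
    Lower priority is tried first; within one priority, records are
    drawn at random with probability proportional to their weight.
    """
    ordered = []
    for prio in sorted({r[0] for r in records}):
        group = [r for r in records if r[0] == prio]
        while group:
            total = sum(r[1] for r in group)
            if total == 0:
                # all weights zero: plain random pick (simplification)
                pick = rng.randrange(len(group))
            else:
                threshold = rng.uniform(0, total)
                running = 0
                pick = len(group) - 1  # fallback against rounding
                for i, r in enumerate(group):
                    running += r[1]
                    if running >= threshold:
                        pick = i
                        break
            ordered.append(group.pop(pick))
    return ordered

# Hypothetical answer for _http._tcp.example.com
records = [
    (20, 0, 80, "backup.example.com"),
    (10, 60, 80, "www1.example.com"),
    (10, 40, 80, "www2.example.com"),
]
order = srv_order(records, random.Random(1))
```

A browser would then simply walk the list and fall through to the next target on connection failure, which is the whole redundancy framework in a nutshell.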

Mine's the one with BGP4+IPv6 in the pockets and global routing stability painted on the back...

Frank Rysanek

good for the business

If this FUD turns out to be true, some devices will die. For various reasons, I'd expect overly old and maybe overloaded devices to die. The weakest links will blow. This might result in a burst of investment into power transmission and IT / Telco equipment, as well as some techie gadgetry "consumption spending" from end users. Short power blackouts are not a problem. As far as telco/data communications are concerned, the backbones have been based on fiber optics for ages, and in some countries wireless links are also quite widespread. If the old residential telco copper gets disrupted, maybe it's not all bad news :-)

Frank Rysanek

technical details - linux compatibility etc.

With respect to Linux, Adaptec's FSA RAID (aacraid) historically has been and still is one of the first and best supported RAID controller brands, on par with maybe only 3ware. The Linux user-space utils starting with aaccli/arcconf and ending with the "Storage Manager" have always been among the best on the market. The vanilla Linux AACRAID driver even exhibits some "forward compatibility" with newer AACRAID cards - detects them as a generic AACRAID model and tends to work with them just fine.

I remember a time at the end of nineties when AACRAID was a novelty, but it soon became a cornerstone of Linux HW RAID driver support. I still remember how happy I was around 2003/2004 that the old DPT RAID flavour of Adaptec cards was finally gone - especially the last specimen of the DPT ZCR family tended to be unreliable and the firmware features were sub-prime.

3ware used to be the cheapest HW RAID, reliable and compatible, but lacking CPU horsepower. From 9000 series above I lost track, so I cannot judge the current portfolio (they finally seem to have switched to high-performance CPU's, the AMCC-flavour PowerPC).

Regarding Adaptec's own SCSI controllers (AHA/ASC/AIC): I started to avoid them with the second generation of AIC-7902/29320/39320, which apparently can be distinguished by the "A" suffix. The first-gen Adaptec U320 controllers had no problem against LSI U320 targets (I still have one or two pieces), but the latter variety couldn't run properly at U320 against LSI, hence there was a problem getting them to work with CDB16/LBA64, which was a problem with external storage boxes, typically featuring target-mode controllers by LSI. Even the earlier variety of U320 *and* 64bit (PCI) U160 controllers had some problem against ServerWorks chipsets (ceased to be a problem as Intel chipsets finally prevailed in servers).

None of this was a problem with the Adaptec RAID controllers, because

A) you don't attach an external RAID box to an internal PCI RAID

B) on ASR2120/2200 the AIC HBA chip is attached to the host PC via a PCI IOP CPU by Intel, hence no problem with PCI compatibility.

Regarding the Adaptec SATA RAID portfolio and "rebadging a SiliconImage chip": many people still fail to distinguish

A.) a proper hardware RAID controller (with its own CPU, RAM and firmware in Flash)

B.) from a "soft RAID" (just a cheap HBA chip with a companion Flash for the BIOS option ROM).

I cannot tell whether or not it was a marketing error on part of Adaptec to sell cheap soft-raids, which admittedly are a problem in Linux. The AAR-1200 series were a soft RAID (HostRaid in Adaptec lingo). The AAR-2410 / 2420 were/are a proper hardware RAID (aacraid family), in terms of features precisely on par with Adaptec 2120/2130. Actually the SATA implementation is even slightly better in some respects, such as independent drive channels and quicker response to drive failures (that's right, the failure response on SCSI is *slower*). When shopping for an Adaptec controller for Linux, you always had to check that you were buying an "aacraid". You always get what you pay for. The SiliconImage chip itself is pretty good in its class, has no obvious compatibility or performance problems - in that sense, it was certainly a good choice. Obviously not for the Linux folks, who don't like being fooled into buying a software RAID stack that they have to dump anyway (if it can be circumvented at all, starting from the BIOS).

Note that there were even SCSI HBA's wearing the "HostRaid" suffix - some members of the 29320/39320 family. Of course those were easier to identify as "just plain HBA's" by the basic product number.

One last note regarding Adaptec HostRaid: among the many "software RAID HBA" implementations out there, the Adaptec HostRaid BIOS and drivers were among the best. As good as it gets, without a dedicated CPU. Adaptec shipped the HostRaid even with onboard HBA's - initially with the SCSI AIC series, later on the stack also started to appear as just a BIOS option ROM with third-party onboard HBA's (Intel ICH, even Marvell I think). E.g. on some SuperMicro motherboards, you have a choice between an original Intel soft-RAID stack (matrix storage) and the Adaptec HostRaid option ROM. To me, the choice has always been clear - the Adaptec HostRaid, owing to its bug-free BIOS part and excellent OS-based management tools. Unfortunately for Adaptec, the onboard HostRaid stack was almost invisible in the motherboards' marketing material (product web, datasheets, packaging), and actually hardly any end-customers knew enough to tell a difference.

Obviously this train of thought is only valid for Windows. Forget about HostRaid for Linux. If you don't want to pay for a genuine HW RAID, save some money, buy a plain HBA and use a native Linux MD RAID. Some argue that the MD RAID even has advantages over a proprietary HW RAID in terms of both performance and "hardware-independent crash recovery".

I still remember the time when Intel rounded off the i960-based generation of the IOP CPU family and all the RAID vendors depending on that (Adaptec and MegaRAID among others) had a hard time taking the next step - some followed the path to Intel Xscale IOP's, others took other paths. Adaptec finally rolled out its own RoC chips, forming the basis of ASR-2130/2230 (MIPS-based?). Adaptec later returned to Intel with the "universal SATA/SAS family" (so the AACRAID firmware once again ran on Intel Xscale hardware), though actually the first Arm-based AACRAID was the old ASR5400 quad-channel SCSI if memory serves...

It may well be that the discontinuation of i960 by Intel has "mixed the cards" in the RAID game quite a bit. Non-intel CPU's got a chance and some Xscale-only startup competition has been founded, e.g. Areca (though there have been Areca models that do run on non-Xscale CPU's). The dot.com bust, the growing market acceptance of SATA (?) in servers and even the maturing open-source soft-RAID implementations in Linux/xBSD have just been additional nails in the Adaptec coffin.

The current "universal SATA/SAS" family (starting with 3800 series) is actually pretty good. The SFF-8087 with SGPIO support are the best SAS/SATA interconnect ever.

Some of our customers still demand Adaptec as "the top-notch RAID controller brand".

I tend to prefer Areca, which has similar features and IMO a richer yet lighter-weight management interface - but some customers are difficult to convert :-)

The BIOS interface to Adaptec cards has traditionally been fairly spartan (compared to e.g. Areca, but certainly on par with or better than MegaRAID, 3ware and others). Makes me wonder how many people are actually coding the firmware, BIOS and OS-based tools at Adaptec and the other vendors. I wouldn't be surprised if it's just a fairly narrow team of people, maybe down to 2-5. How much fluctuation is there in the core team, across all the ups and downs and mergers? Is anyone of the original AACRAID developers still working on the firmware? To me as a techie, the set of features and capabilities is actually the decisive selling point - rather than press announcements, acquisitions, stock splits, hostile takeovers, board-level coups and all the other corporation games...

Adaptec wanted to buy Symbios? Wow, didn't notice that :-) To me, Symbios has always been a part of LSI, a key part of the LSI SCSI expertise and excellence, up to U320.

Did you say that Adaptec bought some RAID stuff from IBM? I thought the MegaRAID acquisition path was AMI->IBM->LSI :-)

Frank Rysanek
Dead Vulture

R.I.P. QPI

Now, forget about the QPI folks, okay?

QPI has been the major buzzword around Core i7. Now suddenly it's not needed anymore. Not on a single-CPU desktop system, not when the PCI-e root complex has become a part of the CPU, and the CPU can talk "DMI" (notably similar to PCI-e 1.1 x4) straight to the ICH (sorry, PCH). And that's besides having an on-CPU RAM controller, of course.

Frank Rysanek

Re: reliability (in reply to Peter D'Hoye)

Exactly. People who opt for the bleeding edge "bits per square inch" on the platters, combined with four platters in a drive, should be prepared to replace a couple drives over the first month of operation, and some more over the first two years.

Neither SATA protocol-level compatibility, nor the drive size per se (compatibility with RAID firmware) have been too much of a problem lately, even with some lower-end RAID brands. It's nowadays a fairly safe bet that you can plug the latest drive into your two-year-old RAID box and it's gonna work. But the first deliveries of every bleeding-edge HDD model coming out can have maybe 20% of the drives essentially "dead on arrival", i.e. failing in RAID assembly burn-in. That's at least 10 times more compared to trailing-edge drives, such as the 80GB Barracuda 7200.10 being phased out just now.
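To put that 20% into perspective for a whole array, here is a back-of-the-envelope Python sketch. It assumes each drive fails burn-in independently with the same probability, which is a simplification (bad batches tend to cluster); the 8-drive array size is just an example.

```python
def p_at_least_one_bad(n_drives, p_bad):
    """Probability that at least one of n drives fails burn-in,
    assuming independent failures with equal probability p_bad."""
    return 1.0 - (1.0 - p_bad) ** n_drives

def expected_bad(n_drives, p_bad):
    """Expected number of drives failing burn-in."""
    return n_drives * p_bad

# Hypothetical 8-drive unit, bleeding-edge drives at 20% early failures
bleeding_edge = p_at_least_one_bad(8, 0.20)
# Same unit with trailing-edge drives at a tenth of that rate
trailing_edge = p_at_least_one_bad(8, 0.02)
```

Under these assumptions, an 8-drive box built from the bleeding-edge model is more likely than not to need at least one drive swapped during burn-in, while the trailing-edge box usually passes untouched.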

Let me suggest a recipe: always warn the early adopters among your customers. When assembling a RAID unit, give it a thorough burn-in under generated load in your lab and replace any misbehaving drives before you ship the box to your customer. Keep a few pcs of the drives in stock for quick replacements. Favour RAID levels with more parity. If the RAID firmware is capable of that, schedule some periodic surface testing (exhaustive whole-surface reading) to prevent sudden "multiple failures" (bad sectors piling up undiscovered for a long time). When a drive fails, don't blame drive firmware, blame bad sectors on the high-density platters.

Frank Rysanek

The first in what? PR lingo...

The first to use 2TB drives in a RAID unit? Or, rather, the first one to boast that on the web?

There are other reasons to take this message with a grain of salt. I've met customers who specified minimum guaranteed IOps per TB required. The modern desktop drives with 1.5 TB and above may well be below that target even for fairly boring applications such as "file sharing" websites... you get a nominally huge storage box, but effectively you cannot make use of all the free space - you cannot access it fast enough.

That sort of drives can be good enough for round-robin surveillance video archival or maybe HD video capture+editing (provided that the FS doesn't require too many IOps, and that you don't unleash too many parallel users unto the RAID box). The resulting IOps throughput is also a matter of what RAID level you configure...
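The IOps-per-TB argument is easy to check with a few lines of Python. This is a rough sketch only: it assumes about 75 random IOps for a desktop drive, ignores RAID level and caching effects entirely, and the target of 100 IOps per TB is a made-up customer requirement.

```python
def iops_per_tb(drive_iops, drive_tb):
    """Random IOps available per TB of raw capacity on one drive."""
    return drive_iops / drive_tb

def meets_target(drive_iops, drive_tb, required_iops_per_tb):
    """True if the drive satisfies a minimum IOps-per-TB requirement."""
    return iops_per_tb(drive_iops, drive_tb) >= required_iops_per_tb

# Assumed figure: a desktop drive doing ~75 random IOps
density_1500 = iops_per_tb(75, 1.5)   # 1.5 TB drive
density_2000 = iops_per_tb(75, 2.0)   # 2 TB drive

# Hypothetical customer requirement of 100 IOps per TB
ok_2tb = meets_target(75, 2.0, 100)
```

Since the seek rate per spindle stays about the same while capacity grows, every capacity bump lowers the IOps per TB, and adding more drives of the same model does not change the ratio.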

Frank Rysanek

Re: Hardware compatibility

@Toastan Buttar:

> "Every new piece of hardware" already works with Windows and seldom takes "hours of configuration".

I've recently bought a relatively low-end Acer notebook PC. Cheap stuff. I paid attention to it having an Intel CPU+chipset+IGP, but it also has a number of other-brand peripherals. I installed it to dual-boot XP and Fedora 10.

It took me half a day to install XP along with all the drivers. Especially the webcam driver gave me a headache - its power saving glitch prevented XP from shutting down correctly, and it took some time to google out a workaround (prevent PM on the USB port in Windows Device Manager). It *did* take hours, even just downloading and installing all the drivers, even if I disregard that webcam gotcha.

With Fedora 10 x86_64, I booted the "netinst" CD, pointed it to use a single flat partition for its filesystem, selected some apps to install, and went off to do other things. It didn't take more than 15 minutes. In an hour or so, the system was up and running, including WiFi, a multi-combo flash card reader, Realtek HD audio chip, and including the darn cheap webcam! Not to mention a host of apps (Mozilla / Gimp / OOo). All removable media / flash cards work out of the box, just as seamlessly as in Windows, or maybe better. It *was* significantly faster to install than Windows.

Note the look and feel of automatic updates in Fedora - the level of detail of progress reporting, the lack of reboots. Non-english language support (keyboard and display): no problem, either. Do I play games? Not anymore, I don't have the time. Out of curiosity, I did compile UFO AI from SVN source under Fedora, but it's still too buggy to be any serious use :-)

I do have frequent encounters with buggy 3rd-party hardware drivers for Windows. They tend to be stale and buggy versions, especially for cheap noname brands (imagine all the USB gadgets) - or sometimes incorrectly labeled on the device manufacturer's web site, or wrapped in an impenetrable installer archive together with a buggy install script. Generic drivers in Linux tend to work surprisingly well out of the box for that same hardware.

Frank Rysanek
Thumb Down

Pump CO2 underground? How efficient is it? Basic physics

Imagine a 500MW coal-fired power plant. I live near one. Just think about the chimneys. How power-efficient would it be to pump the whole huge volume of exhaust gasses underground? How deep are the gas wells? How many metres of "water column"? It's one bar every 10 m. Imagine that you'd need to compress a power plant chimney's worth of gas up to hundreds of bars to pump it underground. Think about the heat produced, apart from the potential energy stored in the pressure difference. Does that seem worthwhile? Wouldn't it take more energy than the power plant would be able to produce?

Or, would you just take some water from down under, let the exhaust gasses bubble through it until most of the CO2 dissolves (and nitrogen bubbles back up), and pump the water back down to where you got it? That might be a tad more efficient... But, would the water take any further CO2, at our surface pressure?

Any solid data on that? URL pointers? Too lazy to do the basic maths myself...
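For what it's worth, the basic maths can be sketched in a few lines of Python. This is a textbook idealization, not solid data: it treats CO2 as an ideal gas compressed isothermally (which gives a lower bound, and CO2 is far from ideal at high pressure), it uses the "one bar every 10 m" rule from above, and it covers only the CO2, not the nitrogen that makes up most of the flue gas. The 1000 m depth and 300 K temperature are example values.

```python
import math

R = 8.314        # J/(mol*K), universal gas constant
M_CO2 = 0.04401  # kg/mol, molar mass of CO2

def pressure_at_depth_bar(depth_m, surface_bar=1.0):
    """Absolute pressure under a water column: one bar per 10 m."""
    return surface_bar + depth_m / 10.0

def isothermal_work_j_per_kg(p_out_bar, p_in_bar=1.0, temp_k=300.0):
    """Ideal-gas isothermal compression work per kg of CO2.
    Real compressors need noticeably more than this lower bound."""
    return (R * temp_k / M_CO2) * math.log(p_out_bar / p_in_bar)

def kwh_per_kg(joules_per_kg):
    """Convert J/kg to kWh/kg."""
    return joules_per_kg / 3.6e6

# Example values (assumptions): 1000 m of water column, 300 K
p_target = pressure_at_depth_bar(1000)
work = isothermal_work_j_per_kg(p_target)
work_kwh = kwh_per_kg(work)
```

With these assumptions the ideal figure comes out on the order of 0.07 kWh per kg of CO2. To answer the original question, it would have to be compared against how many kg of CO2 the plant emits per kWh generated, which is a figure I don't have at hand.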

Frank Rysanek
Linux

@wireless setup

I've actually just installed Fedora 10 64b on an Acer notebook (a brand being dismissed with a grin by many of my colleagues) and IT ALL WORKS OUT OF THE BOX, including WiFi a/b/g/DraftN (Intel chip), bluetooth and a crappy integrated webcam that has a quirky driver in Windows. I did pay attention to having an Intel chipset in the notebook, and I went for a bargain model with an older 65nm C2D CPU and the chipset is two generations old by now, too.

As for WiFi: in the past I've configured WiFi by hand on a Broadcom-based AP with OpenWRT installed. So the first thing I did in Fedora, I tried a few lines with iwconfig. But somehow I couldn't get past WPA2. Only then did I take a fumble through the system configuration menus on the graphical desktop, and guess what: found a WiFi configuration tool, which did the magic with just a few mouse clicks - I managed to enter the WPA2 PSK at the first attempt.

Wow!

Frank Rysanek
Happy

Further reading

from IDT and PLX - some PCIe switch chips with a prospect of multi-root architectures. Maybe pre-IOV, but the docs attached give a thorough technical background, answer many of my basic questions.

http://www.idt.com/products/getDoc.cfm?docID=18639469

http://www.idt.com/products/getDoc.cfm?docID=18688297

http://www.plxtech.com/products/expresslane/switches.asp

Interestingly for me, Pericom has not much to offer in that vein... Unsurprisingly, neither has Intel. That LSI paper on IOV mentioned before makes me wonder what LSI has up its sleeve.

Frank Rysanek
Alert

PCI-e IOV - multiple root complexes can't talk to each other

Ahh well. So you can put your NIC in an external expansion box, rather than into the server itself. But unless it's a very special NIC that supports IOV, you can only use it from one root complex (= from only one host computer). The IOV standard simply says that the external multi-root PCI-e switch can carry traffic for multiple PCI-e bus trees that don't know about each other (like VLAN's). Each bus tree is "owned" by a particular "OS partition" running on the host computer. At least the part of the bus tree carried by the external switch runs virtualized on the switch, though I guess IO virtualization at PC server chipset level is already in the works too.

Any peripheral board that is to be shared by multiple PCI-e trees must have special "virtualization" support, to be able to keep track of several simultaneous PCI-e conversations with multiple root complexes. Not so very sexy...

I bet the HPC folks would appreciate it much more if the external PCI-e switch could cater for direct root-to-root data transfers - for HPC networking purposes. Imagine DMA from one root complex to another root complex (memory to memory). This doesn't necessarily mean that a standard IP networking stack would be involved - perhaps as a slow fallback solution or for management purposes. Rather, I could fancy some dedicated low-latency mailboxing framework. It would really get the multi-root PCI-e monster fairly close to a NUMA, except that we're still speaking of distinct "OS partitions" in the IOV lingo. The way I understand PCI-e IOV, such direct transfers are impossible. Maybe via an intermediate "virtual NIC" or some other peripheral along those lines (call it a DMA engine with some dedicated memory of its own) implemented within the external PCI-e switch.

The sort of bandwidth available from PCI-e at least makes very good sense for direct RAID storage attachment. Perhaps not via an additional intermediate storage bus technology (that could be useful as a slow lane for some wider-area SAN interconnects).

Frank Rysanek
Stop

erase block size?

So what's the "erase block size" in the upcoming enterprise-grade flash drives? Did I hear you say interleaved erase+write cycles? How many ways/channels?

http://www.storagesearch.com/easyco-flashperformance-art.pdf

Frank Rysanek
Thumb Up

Re: how do they scale up

They probably mean the 5085, with 2x SFF-8088 (external x4 multilane SAS). That's two ports, 128 SAS addresses each, or even more with "fanout" expanders (see e.g. the SAS JBOD enclosures by AXUS).

Each ML SAS port can be connected to a daisy-chain or a tree of cascaded JBOD enclosures.

The internal SFF-8087 can also be used for cascading, provided that you have an expander-based SAS backplane in your server that provides an external SAS expansion port for daisy-chaining.

Those 256 drives may as well mean a firmware-side limitation, rather than the max. number of SAS addresses theoretically possible per daisy-chain / cascaded tree. Still I'd be a little cautious about 256 drives per RAID. There can be real-world glitches that may limit the practically useful degree of cascading, performance with so many drives, choice of RAID level, maximum block device size that your OS can actually take, runtime reliability of such a monster etc.

I also keep hearing rumours of SATA drives being incompatible with some SAS expanders, or that you can only use a single expander per ML SAS port for SATA drives (no JBOD daisy-chaining) etc.

I believe the limit of 256 drives is the same with the older 3xxx series.

The 5085 is on par with an Areca ARC-1680x: same IOP CPU, same number of ports. As far as firmware features and comfort are concerned, nowadays I'd probably opt for the Areca.

Frank Rysanek

It's not just drivers for old hardware, far from that

Quite a lot of legacy business software fails to run correctly on Vista, even fairly simple apps and some handy utils. This is not a matter of hardware drivers - this is a matter of backwards compatibility with user-space software. I know about people who recently got a new name-brand notebook PC with Vista preinstalled, and after losing a day or two trying to make their beloved apps work, they downgraded to XP (having a corporate multi-license) and were up and running in a few hours, including all the software and all the third-party hardware drivers in their most recent versions. And XP *sings* on the Vista-ready hardware :-)

Yes, there's the drawback with message-signaled interrupts, but that's quite negligible on a business desktop/laptop... If the MSI capability was back-ported to XP via SP3, that might be interesting :-)

Obviously this is going to improve over time, as third-party software suppliers provide Vista-compatible updates.

Frank Rysanek

PCIe MSI: a performance-based reason to buy Vista?

XP can't do Message Signaled Interrupts. XP runs in legacy IO-APIC mode. That's why you can't get rid of the insane level of IRQ sharing on modern PCIe hardware under XP or W2k/W2K3 or older Linux.

Vista is reportedly MSI-capable "by heart". Can't say if cooperation is required on part of the HW-specific device drivers (the way it is in Linux), or if MSI is somehow enforced, technically or by WHQL approval.

Linux 2.6 core IRQ routing functionality has been MSI-capable for years, AFAIK, though traditionally the individual device drivers have been lagging behind in making use of those new capabilities. Each HW driver has to explicitly ask for MSI delivery style for its IRQ, upon the driver's initialization. The situation has improved a lot in the latest 2.6-series kernels, as the most important drivers are getting updated.

Modern PC hardware is stuffed with PCI Express busses. PCI Express relies on the purely "message-signaled" interrupt delivery for optimum performance. In "legacy-compatible IO-APIC mode", all PCIe-based devices in the system share only 4 IRQ numbers, and the IRQ delivery performance is further impaired by the multi-hop routing style, where specifically devices connected to the north-bridge get their interrupts delivered to the CPU via the south bridge's IO(x)APIC and back through north bridge.

Note: IO-APIC's have become a legacy affair :-)

IRQ sharing means that the interrupt service routines for the various hardware devices have to be called in vain. Each ISR has to run a couple of random IO transactions across the system busses, to read its device's status registers, only to find out that this ISR invocation has been a "false alert", caused by the IRQ sharing. The bus transactions take time, the CPU is idle until the bus-borne read is accomplished. This latency gets worse if the brief random IO's compete for bus bandwidth with bulkier DMA transfers of payload data (disk IO, networking, graphics). This mode of operation is massively inefficient and painful in terms of CPU load, especially with multi-GHz CPU's. Thank god it only stalls the respective CPU core in today's multi-core systems.
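The cost of those "false alerts" can be illustrated with a toy Python model. This is not kernel code, just a counting exercise: every ISR call is assumed to cost one status register read over the bus, and the four device names sharing one IRQ line are hypothetical.

```python
class Device:
    def __init__(self, name):
        self.name = name
        self.pending = False     # True if this device raised the interrupt
        self.status_reads = 0    # number of (slow) bus reads performed

    def isr(self):
        """Read the status register (one bus transaction) and report
        whether this device actually raised the interrupt."""
        self.status_reads += 1
        if self.pending:
            self.pending = False
            return True
        return False

def dispatch_shared(devices):
    """Shared IRQ line: every registered ISR is called in turn."""
    return [d.name for d in devices if d.isr()]

def dispatch_msi(device):
    """Message-signaled: the vector identifies the device directly."""
    device.isr()
    return [device.name]

# Hypothetical devices sharing a single legacy IRQ number
devices = [Device(n) for n in ("gfx", "raid", "usb", "nic")]
devices[1].pending = True            # only the RAID adapter interrupts
handled = dispatch_shared(devices)
total_reads = sum(d.status_reads for d in devices)
```

In this toy setup one real interrupt costs four status reads on the shared line, three of them wasted, whereas the message-signaled path touches only the device that actually fired.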

Before the IRQ even reaches the CPU (before it gets a chance to launch its set of ISR's), its transaction may have to travel back'n'forth across the link between the north bridge and south bridge, again competing for bus bandwidth with DMA. This impairs interrupt latency.

Now imagine that all of this takes place especially on high-performance devices such as PCIe x16 graphics boards or modern RAID adapters, with some USB UHCI's and per-port PCIe hot-swap IRQ lines thrown in as ballast... Actually if you happen to have some classic PCI-X based (parallel PCI) adapters in your PCIe system, attached via some PXH bridges to the PCIe-only chipset, it's them PCI-X devices who have a chance of getting a dedicated IO(x)APIC input pin, and a dedicated IRQ number on the CPU :-)