* Posts by Frank Rysanek

118 publicly visible posts • joined 2 Oct 2007


But it said so in the manual

Frank Rysanek
Linux

ReiserFS ; spindles

The one Linux filesystem famous for its ability to work with myriads of small files is ReiserFS. There are downsides though: the stable ReiserFS v3, included in the vanilla Linux kernel, has a volume size limit of 16 TB. ReiserFS v4 is not in the mainline kernel (in part for "functionality redundancy" reasons = inappropriate code structure) and its future is somewhat uncertain - but it is maintained out of tree and source code patches (= an "installable package") for current Linux kernel versions are released regularly. Both versions also have other grey corners, just like everything else...

When working with a filesystem that large, I'd be concerned about IOps capability of the underlying disk drives (AKA "spindles"). The question is, how often you need to access those files, i.e. how many IOps your users generate... This problem is generic, ultimately independent of the filesystem you choose.
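Just to put a rough number on that concern, here is a back-of-the-envelope sketch in Python. The seek rates are ballpark assumptions (roughly 75 random seeks/s for a desktop SATA spindle, 150 for an enterprise one), not measurements:

    # ballpark assumptions, not measured figures:
    SEEKS_PER_SEC_DESKTOP = 75      # random seeks/s, desktop SATA spindle
    SEEKS_PER_SEC_ENTERPRISE = 150  # random seeks/s, enterprise spindle

    def hours_to_touch(files, spindles, seeks_per_sec, seeks_per_file=2):
        """How long a mostly-random pass over 'files' small files takes,
        assuming ~2 seeks per file (metadata + data) and perfect load
        balancing across the spindles."""
        total_seeks = files * seeks_per_file
        return total_seeks / (spindles * seeks_per_sec) / 3600.0

    # e.g. 50 million small files spread over 8 desktop-grade spindles:
    print(round(hours_to_touch(50e6, 8, SEEKS_PER_SEC_DESKTOP), 1), "hours")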

CIA used 'illegal, inaccurate code to target kill drones'

Frank Rysanek

even just different x86 chipsets?

Someone has previously reported that the code worked fine on one x86 "chipset", but not on another one - and that it ran on onboard hardware in the drones... which, yes, goes against the supposed "data harvesting" categorization of the software product. That reference to two x86 chipsets may relate to a different problem, rather than the originally described FP precision issue.

Difficult to say what the word "chipset" is supposed to mean here - whether just the system bridges, or if the CPU is implied as well. E.g., a mobile Core2 certainly has some more oomph than an Atom.

Either way: I recall that the people making the RTAI nanokernel (for hard-realtime control under Linux) do have statistics and testing tools to evaluate a particular chipset's IRQ latency, and it's true that even among "x86 PC compatibles" some brands and models of chipsets show reasonably deterministic responses, while other chipsets show quite some interesting anomalies in interrupt service response time. This was real news with the introduction of the Intel 8xx series of chipsets, where interrupts were delivered in-band over the HubLink for the first time in x86 history, rather than out-of-band via discrete signals or a dedicated interrupt bus - so interrupt messages were competing for bus bandwidth (time) with general bulk payload. At that time, competing "old-fashioned" chipsets such as the SiS 650 had much better IRQ delivery determinism than the early Intel Pentium 4 chipsets. Some cases of high IRQ latencies are attributed to liberal use of SMI by the BIOS, e.g. for software emulation of features missing in hardware... but that again tends to go hand in hand with a particular chipset, stems from BIOS modules provided by the chipmaker etc. Don't know what the situation is now, how the different later generations of Intel hardware behave, how the Geode compares for instance...
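Nothing like the RTAI latency test, but even a crude userspace sketch along these lines shows how far a given box drifts from the requested timing (this only measures Python sleep lateness on top of the OS scheduler - real IRQ latency has to be measured in kernel / hard-realtime context):

    import time

    def sleep_jitter(period_us=1000, iterations=2000):
        """Ask for a fixed sleep period and record how late we actually wake up.
        Only a rough proxy for scheduling/timer jitter, not true IRQ latency."""
        period = period_us / 1e6
        worst = avg = 0.0
        for _ in range(iterations):
            t0 = time.monotonic()
            time.sleep(period)
            late = (time.monotonic() - t0) - period
            worst = max(worst, late)
            avg += late / iterations
        return avg * 1e6, worst * 1e6   # microseconds

    if __name__ == "__main__":
        avg, worst = sleep_jitter()
        print("avg lateness %.1f us, worst %.1f us" % (avg, worst))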

Heheh, were (GP)GPU's involved by any chance? That's another area where floating point math can screech in its hinges...

Intel eats crow on software RAID

Frank Rysanek
Go

RAID software quality from Intel etc

Several issues come to mind.

Historically, Intel has had soft-RAID "support" in several generations of their ICH's - on top of SATA HBA's, up to six drive channels. A few years ago it was called the "Application Accelerator", then it was renamed to "Matrix Storage". I don't know for sure if there's ever been a RAID5/XOR accelerator in there, or if the RAID feature consisted of the ability to change the PCI ID's of the SATA HBA at runtime + dedicated BIOS support (= RAID support consisting of software and PR/advertising, on top of a little chipset hack). Based on the vague response in the article, I'd guess that there's still no RAID5 XOR (let alone RAID6 Reed-Solomon) acceleration in the PCH hardware - what they said means that they're looking at the performance and trying to squeeze as much as possible out of the software side. Not much looks new on the software part (RAID BIOS + drivers) - the only news is SAS support (how many HBA channels?). That gives you access to some swift and reliable spindles (desktop-grade SATA spindles are neither); if the ports support multi-lane operation, they could be used for external attachment to entry-level HW RAID boxes; and if the claim about expander support is true, you could also attach a beefy JBOD enclosure with many individual drives (unless the setup gets plagued by some expander/HBA/drive compatibility issues, which are not uncommon even with the current "discrete" SAS setups). I'm wondering about "enclosure management" - something rather new to Intel soft-RAID, but otherwise a VERY useful feature (especially the per-drive failure LED's are nice to have).

The one safe claim about Intel on-chip SATA soft-RAID has always been "lack of comfort" (lack of features). The Intel drivers + management software, from Application Accelerator to Matrix Storage, have been so spartan that they were not much use, especially in critical situations (a drive fails and you need to replace it). I've seen worse (onboard HPT/JMicron I believe), but you can also certainly do much more with a pure-SW RAID stack - take Promise, Adaptec HostRAID or even the LSI soft-RAID for example. It's just that the vanilla Intel implementation has always lacked features (not sure about bugs/reliability, never used it in practice). Probably as a consequence, some motherboard vendors used to supply (and still do supply) their Intel ICH-R-based boards with a 3rd-party RAID BIOS option ROM (and OS drivers). I've seen Adaptec HostRAID and the LSI soft-stack. Some motherboards even give you a choice in the BIOS setup of which soft-stack you prefer: e.g., Intel Matrix Storage or Adaptec HostRAID. Again, based on one note in the article, this practice is likely to continue. I just wish Intel would do something to improve the quality of their own vanilla software.

One specific chapter is Linux (FOSS) support. As the commercial software-RAID stacks contain all the "intellectual property" in software, they are very unlikely to get open-sourced. And there's not much point in writing an open-source driver from scratch on top of a reverse-engineered on-disk format. There have been such attempts in history and they led pretty much nowhere. Any tiny change in the vendor's closed-source firmware / on-disk format would "break" the open driver. And the open-source volunteers will never be able to write plausible management utils from scratch (unless supported by the respective RAID vendor). Linux and FreeBSD nowadays contain pretty good native soft-RAID stacks, and historically the natural tendency has been to work on the native stacks and ignore the proprietary soft-RAID stacks. The Linux/BSD native soft-RAID stacks run quite fine on top of any Intel ICH, whether it has the -R suffix or not :-)

People who are happy to use a soft-RAID hardly ever care about battery-backed write-back cache. Maybe the data is just not worth the additional money, or maybe it's easy to arrange regular backups in other ways - so that the theoretical risk of a dirty server crash becomes a non-issue. Power outages can be handled by a UPS. It's always a tradeoff between your demands and budget.

As far as performance is concerned:

Parity-less soft-RAIDs are not limited by the host CPU's number-crunching performance (XOR/RS). If you set aside the possibility of a sub-par soft-RAID stack implementation, the only potential bottleneck that remains is bus throughput: the link from north bridge to south bridge, and the SATA/SAS HBA itself. In the old days, some Intel ICH's on-chip SATA HBA's used to behave as if two drives shared a virtual SATA channel (just like IDE master+slave) - not sure about the modern-day AHCI incarnations. Also, the HubLink used to be just 266 MBps thick. Nowadays the DMI is 1 GBps+ (full duplex), which is plenty good enough for 6 modern rotating drives, even if you only care about sequential throughput. Based on practical tests, one thing's for sure: Intel's ICH on-chip SATA HBA's have always been the best performers around in their class - the competition was worse, sometimes much worse.

As for parity-based RAID levels (5, 6, their derivatives and others): a good indicator may be the Linux native MD RAID's boot messages. When booting, the Linux MD driver "benchmarks" the (potentially various) number-crunching subsystems available, such as the inherent x86 ALU XOR vs. MMX/SSE XOR, or several software algorithm implementations, and picks the one which is best. On basic desktop CPU's today (Core2), the fastest benchmark usually says something like 3 GBps, and that's for a single CPU core. I recall practical numbers like 80 MBps RAID5 sequential writing on a Pentium III @ 350 MHz in the old days.
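What the MD driver is benchmarking there is essentially this - the parity math itself is trivial. A minimal sketch of RAID5-style XOR parity (not the kernel's code, just an illustration):

    def xor_parity(blocks):
        """Byte-wise XOR of equally sized data blocks = the RAID5 parity block."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    # the parity lets you regenerate any single missing block:
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    p = xor_parity([d0, d1, d2])
    assert xor_parity([d1, d2, p]) == d0   # "rebuild" d0 from the survivors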

The higher-end internal RAID cards, containing an IOP348 CPU at ~1GHz, tend to be limited to around 1 GBps when _not_ crunching the data with XOR (appears to be a PCI-e x8 bus limit). They're slower when number-crunching.

In reality, for many types of load I would expect the practical limit to be set by the spindles' seeking capability - i.e., for loads that consist of smaller transactions and random seeking. A desktop SATA drive can do about 60-75 random seeks per second, enterprise drives can do up to about 150. SSD's are much faster.

The one thing I've recently been wondering about is this: where did Intel get their SAS HBA subsystem from? The IOP348 already contains an 8-way SAS HBA. Now the Sandy Bridge PCH should also contain some channels. Are they the same architecture? Are they not? Is that Intel's in-house design? Or is it an "IP core" purchased from some incumbent in the SCSI/SAS chipmaking business? (LSI Fusion MPT or Agilent/Avago/PMC Tachyon come to mind.) The LSI-based HBA's tend to be compatible with everything around. Most complaints about SAS incompatibility that I've noticed tend to involve an Intel IOP348 CPU (on boards e.g. from Areca or Adaptec) combined with a particular expander brand or drive model / firmware version... Sometimes it was about SATA drives hooked up over a SAS expander etc. The situation gets hazy with other less-known vendors (Broadcom or Vitesse come to mind) producing their own RoC's with on-chip HBA's...

Firefox 4 beta gets hard on Windows

Frank Rysanek
Coat

DNS SRV records for HTTP *still* unsupported... (or are they?)

Audio visualization and 2D acceleration are the hot news TODAY? Meanwhile a very simple web redundancy framework, potentially very useful, known and wanted by informed people for a decade, is still missing: SRV records - just a small update to the DNS resolver, just a few lines of code. The largely technophobic/ignorant web-surfing masses would appreciate it too, once it got into production use. A bug entry has been in Bugzilla for years, even with some early patches. The ultimate excuse from the Mozilla team has always been that there is no RFC standard. There is one for SRV, but not specifically for SRV on HTTP.

https://bugzilla.mozilla.org/show_bug.cgi?id=14328

http://support.mozilla.com/tiki-view_forum_thread.php?comments_parentId=6112&forumId=1

The only party who would certainly not appreciate HTTP SRV is the vendors of content switches for HTTP load balancing / HA solutions.
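For the record, the client-side logic really is that small - a sketch using the dnspython library (hypothetical example.com zone; and of course, with no RFC for SRV-on-HTTP, no browser would query these records today):

    import dns.resolver   # pip install dnspython

    def http_srv_targets(domain):
        """Look up _http._tcp SRV records and return (host, port) candidates,
        lowest priority first, higher weight preferred within a priority.
        (A strictly conforming client would do weighted-random selection.)"""
        answers = dns.resolver.resolve("_http._tcp." + domain, "SRV")
        ordered = sorted(answers, key=lambda r: (r.priority, -r.weight))
        return [(str(r.target).rstrip("."), r.port) for r in ordered]

    # hypothetical zone with two web servers, the second one a backup:
    #   _http._tcp.example.com. IN SRV 10 60 80 www1.example.com.
    #   _http._tcp.example.com. IN SRV 20 0  80 backup.example.com.
    print(http_srv_targets("example.com"))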

Mine's the one with BGP4+IPv6 in the pockets and global routing stability painted on the back...

NASA: Civilization will end in 2013 (possibly)

Frank Rysanek

good for the business

If this FUD turns out to be true, some devices will die. For various reasons, I'd expect overly old and maybe overloaded devices to die. The weakest links will blow. This might result in a burst of investment into power transmission and IT / Telco equipment, as well as some techie gadgetry "consumption spending" from end users. Short power blackouts are not a problem. As far as telco/data communications are concerned, the backbones have been based on fiber optics for ages, and in some countries wireless links are also quite widespread. If the old residential telco copper gets disrupted, maybe it's not all bad news :-)

Adaptec - the eternal wannabe

Frank Rysanek

technical details - linux compatibility etc.

With respect to Linux, Adaptec's FSA RAID (aacraid) historically has been, and still is, one of the first and best-supported RAID controller brands, on par with perhaps only 3ware. The Linux user-space utils, starting with aaccli/arcconf and ending with the "Storage Manager", have always been among the best on the market. The vanilla Linux AACRAID driver even exhibits some "forward compatibility" with newer AACRAID cards - it detects them as a generic AACRAID model and tends to work with them just fine.

I remember a time at the end of the nineties when AACRAID was a novelty, but it soon became a cornerstone of Linux HW RAID driver support. I still remember how happy I was around 2003/2004 that the old DPT RAID flavour of Adaptec cards was finally gone - especially the last specimens of the DPT ZCR family tended to be unreliable and the firmware features were sub-par.

3ware used to be the cheapest HW RAID, reliable and compatible, but lacking CPU horsepower. From the 9000 series onwards I lost track, so I cannot judge the current portfolio (they finally seem to have switched to high-performance CPU's, the AMCC-flavour PowerPC).

Regarding Adaptec's own SCSI controllers (AHA/ASC/AIC): I started to avoid them with the second generation of AIC-7902/29320/39320, which apparently can be distinguished by the "A" suffix. The first-gen Adaptec U320 controllers had no problem against LSI U320 targets (I still have one or two pieces), but the later variety couldn't run properly at U320 against LSI, which in turn made it a problem to get them to work with CDB16/LBA64 - a real issue with external storage boxes, typically featuring target-mode controllers by LSI. Even the earlier variety of U320 *and* 64bit (PCI) U160 controllers had some problems against ServerWorks chipsets (which ceased to be an issue as Intel chipsets finally prevailed in servers).

None of this was a problem with the Adaptec RAID controllers, because

A) you don't attach an external RAID box to an internal PCI RAID

B) on ASR2120/2200 the AIC HBA chip is attached to the host PC via a PCI IOP CPU by Intel, hence no problem with PCI compatibility.

Regarding the Adaptec SATA RAID portfolio and "rebadging a SiliconImage chip": many people still fail to distinguish

A.) a proper hardware RAID controller (with its own CPU, RAM and firmware in Flash)

B.) from a "soft RAID" (just a cheap HBA chip with a companion Flash for the BIOS option ROM).

I cannot tell whether or not it was a marketing error on the part of Adaptec to sell cheap soft-RAIDs, which admittedly are a problem in Linux. The AAR-1200 series were a soft RAID (HostRaid in Adaptec lingo). The AAR-2410 / 2420 were/are a proper hardware RAID (aacraid family), in terms of features precisely on par with the Adaptec 2120/2130. Actually the SATA implementation is even slightly better in some respects, such as independent drive channels and quicker response to drive failures (that's right, the failure response on SCSI is *slower*). When shopping for an Adaptec controller for Linux, you always had to check that you were buying an "aacraid". You always get what you pay for. The SiliconImage chip itself is pretty good in its class and has no obvious compatibility or performance problems - in that sense, it was certainly a good choice. Obviously not for the Linux folks, who don't like being fooled into buying a software RAID stack that they have to dump anyway (if it can be circumvented at all, starting with the BIOS).

Note that there were even SCSI HBA's wearing the "HostRaid" suffix - some members of the 29320/39320 family. Of course those were easier to identify as "just plain HBA's" by the basic product number.

One last note regarding Adaptec HostRaid: among the many "software RAID HBA" implementations out there, the Adaptec HostRaid BIOS and drivers were among the best. As good as it gets without a dedicated CPU. Adaptec shipped HostRaid even with onboard HBA's - initially with the SCSI AIC series; later on the stack also started to appear as just a BIOS option ROM with third-party onboard HBA's (Intel ICH, even Marvell I think). E.g. on some SuperMicro motherboards, you have a choice between the original Intel soft-RAID stack (Matrix Storage) and the Adaptec HostRaid option ROM. To me, the choice has always been clear - the Adaptec HostRaid, owing to its bug-free BIOS part and excellent OS-based management tools. Unfortunately for Adaptec, the onboard HostRaid stack was almost invisible in the motherboards' marketing material (product web, datasheets, packaging), and hardly any end customers actually knew enough to tell the difference.

Obviously this train of thought is only valid for Windows. Forget about HostRaid for Linux. If you don't want to pay for a genuine HW RAID, save some money, buy a plain HBA and use a native Linux MD RAID. Some argue that the MD RAID even has advantages over a proprietary HW RAID in terms of both performance and "hardware-independent crash recovery".
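If you do go the Linux MD route, monitoring is trivial too - a minimal sketch, assuming the usual /proc/mdstat layout (mdadm --monitor does this properly):

    def degraded_md_arrays(path="/proc/mdstat"):
        """Return the md arrays whose member-status field (e.g. [UU_])
        shows a missing or failed device."""
        degraded, current = [], None
        with open(path) as f:
            for line in f:
                if line.startswith("md"):
                    current = line.split()[0]          # e.g. "md0"
                elif current and "blocks" in line and "[" in line:
                    status = line[line.rfind("[") + 1 : line.rfind("]")]
                    if "_" in status:                  # "_" marks a dead slot
                        degraded.append(current)
                    current = None
        return degraded

    print(degraded_md_arrays() or "all md arrays healthy")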

I still remember the time when Intel phased out the i960-based generation of the IOP CPU family and all the RAID vendors depending on it (Adaptec and MegaRAID among others) had a hard time taking the next step - some followed the path to Intel XScale IOP's, others took other paths. Adaptec finally rolled out its own RoC chips, forming the basis of the ASR-2130/2230 (MIPS-based?). Adaptec later returned to Intel with the "universal SATA/SAS family" (so the AACRAID firmware once again ran on Intel XScale hardware), though actually the first ARM-based AACRAID was the old ASR5400 quad-channel SCSI, if memory serves...

It may well be that the discontinuation of the i960 by Intel shuffled the cards in the RAID game quite a bit. Non-Intel CPU's got a chance, and some XScale-based startup competition was founded, e.g. Areca (though there have been Areca models that run on non-XScale CPU's). The dot-com bust, the growing market acceptance of SATA (?) in servers, and even the maturing open-source soft-RAID implementations in Linux/xBSD have just been additional nails in the Adaptec coffin.

The current "universal SATA/SAS" family (starting with the 3800 series) is actually pretty good. The SFF-8087 connectors with SGPIO support are the best SAS/SATA interconnect ever.

Some of our customers still demand Adaptec as "the top-notch RAID controller brand".

I tend to prefer Areca, which has similar features and IMO a richer yet lighter-weight management interface - but some customers are difficult to convert :-)

The BIOS interface to Adaptec cards has traditionally been fairly spartan (compared to e.g. Areca, but certainly on par with or better than MegaRAID, 3ware and others). Makes me wonder how many people are actually coding the firmware, BIOS and OS-based tools at Adaptec and the other vendors. I wouldn't be surprised if it's just a fairly narrow team of people, maybe down to 2-5. How much turnover is there in the core team, across all the ups and downs and mergers? Are any of the original AACRAID developers still working on the firmware? To me as a techie, the set of features and capabilities is actually the decisive selling point - rather than press announcements, acquisitions, stock splits, hostile takeovers, board-level coups and all the other corporation games...

Adaptec wanted to buy Symbios? Wow, didn't notice that :-) To me, Symbios has always been a part of LSI, a key part of the LSI SCSI expertise and excellence, up to U320.

Did you say that Adaptec bought some RAID stuff from IBM? I thought the MegaRAID acquisition path was AMI->IBM->LSI :-)

Intel's nine-piece Lynnfield band takes the stage

Frank Rysanek
Dead Vulture

R.I.P. QPI

Now forget about QPI, folks, okay?

QPI has been the major buzzword around Core i7. Now suddenly it's not needed anymore. Not on a single-CPU desktop system, not when the PCI-e root complex has become a part of the CPU, and the CPU can talk "DMI" (notably similar to PCI-e 1.1 x4) straight to the ICH (sorry, PCH). Besides having an on-CPU RAM controller, of course.

Pillar first past 2TB post

Frank Rysanek

Re: reliability (in reply to Peter D'Hoye)

Exactly. People who opt for the bleeding edge "bits per square inch" on the platters, combined with four platters in a drive, should be prepared to replace a couple drives over the first month of operation, and some more over the first two years.

Neither SATA protocol-level compatibility nor the drive size per se (compatibility with RAID firmware) has been too much of a problem lately, even with some lower-end RAID brands. It's nowadays a fairly safe bet that you can plug the latest drive into your two-year-old RAID box and it's gonna work. But the first deliveries of every bleeding-edge HDD model coming out can have maybe 20% of the drives essentially "dead on arrival", i.e. failing in RAID assembly burn-in. That's at least 10 times more than with trailing-edge drives, such as the 80GB Barracuda 7200.10 being phased out just now.

Let me suggest a recipe: always warn the early adopters among your customers. When assembling a RAID unit, give it a thorough burn-in under generated load in your lab and replace any misbehaving drives before you ship the box to the customer. Keep a few spare drives in stock for quick replacements. Favour RAID levels with more parity. If the RAID firmware is capable of it, schedule some periodic surface testing (exhaustive whole-surface reading) to prevent sudden "multiple failures" (bad sectors piling up undiscovered for a long time). When a drive fails, don't blame the drive firmware, blame bad sectors on the high-density platters.
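If the firmware can't do the periodic surface scan for you, even a dumb sequential read of the whole block device from the host does most of the job. A rough sketch (run it against e.g. /dev/sdb, needs read permission; a real patrol read would also want throttling and SMART checks):

    import sys

    def surface_scan(device, chunk_mb=4):
        """Read the whole block device sequentially and report unreadable
        chunks - a poor man's 'verify' / patrol read."""
        chunk = chunk_mb * 1024 * 1024
        offset, errors = 0, []
        with open(device, "rb", buffering=0) as dev:
            while True:
                try:
                    data = dev.read(chunk)
                except OSError as exc:
                    errors.append((offset, exc))
                    offset += chunk
                    dev.seek(offset)        # skip past the bad region and carry on
                    continue
                if not data:
                    break
                offset += len(data)
        return offset, errors

    if __name__ == "__main__":
        scanned, errs = surface_scan(sys.argv[1])
        print("scanned %d MiB, %d unreadable chunk(s)" % (scanned // 2**20, len(errs)))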

Frank Rysanek

The first in what? PR lingo...

The first to use 2TB drives in a RAID unit? Or, rather, the first one to boast that on the web?

There are other reasons to take this message with a grain of salt. I've met customers who specified a minimum guaranteed number of IOps per TB. The modern desktop drives of 1.5 TB and above may well be below that target even for fairly boring applications such as "file sharing" websites... you get a nominally huge storage box, but effectively you cannot make use of all the free space - you cannot access it fast enough.

That sort of drive can be good enough for round-robin surveillance video archival or maybe HD video capture+editing (provided that the FS doesn't require too many IOps, and that you don't unleash too many parallel users onto the RAID box). The resulting IOps throughput is also a matter of what RAID level you configure...
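The arithmetic behind the "IOps per TB" complaint is brutal enough to spell out (ballpark seek rates assumed, not benchmark results):

    def iops_per_tb(capacity_tb, random_iops):
        return random_iops / capacity_tb

    # ballpark figures:
    print(iops_per_tb(2.0, 75))    # 2 TB desktop SATA drive   -> ~38 IOps per TB
    print(iops_per_tb(0.3, 150))   # 300 GB enterprise drive   -> 500 IOps per TB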

Google polishes Chrome into netbook OS

Frank Rysanek

Re: Hardware compatibility

@Toastan Buttar:

> "Every new piece of hardware" already works with Windows and seldom takes "hours of configuration".

I've recently bought a relatively low-end Acer notebook PC. Cheap stuff. I paid attention to it having an Intel CPU+chipset+IGP, but it also has a number of other-brand peripherals. I installed it to dual-boot XP and Fedora 10.

It took me half a day to install XP along with all the drivers. Especially the webcam driver gave me a headache - its power-saving glitch prevented XP from shutting down correctly, and it took some time to google up a workaround (disable power management on the USB port in Windows Device Manager). It *did* take hours, even just downloading and installing all the drivers, even if I disregard that webcam gotcha.

With Fedora 10 x86_64, I booted the "netinst" CD, pointed it to use a single flat partition for its filesystem, selected some apps to install, and went off to do other things. It didn't take more than 15 minutes. In an hour or so, the system was up and running, including WiFi, a multi-combo flash card reader, Realtek HD audio chip, and including the darn cheap webcam! Not to mention a host of apps (Mozilla / Gimp / OOo). All removable media / flash cards work out of the box, just as seamlessly as in Windows, or maybe better. It *was* significantly faster to install than Windows.

Note the look and feel of automatic updates in Fedora - the level of detail of progress reporting, the lack of reboots. Non-English language support (keyboard and display): no problem either. Do I play games? Not anymore, I don't have the time. Out of curiosity, I did compile UFO AI from SVN source under Fedora, but it's still too buggy to be of any serious use :-)

I do have frequent encounters with buggy 3rd-party hardware drivers for Windows. They tend to be stale and buggy versions, especially for cheap noname brands (imagine all the USB gadgets) - or sometimes incorrectly labeled on the device manufacturer's web site, or wrapped in an impenetrable installer archive together with a buggy install script. Generic drivers in Linux tend to work surprisingly well out of the box for that same hardware.

Carbon capture would create fizzy underground oceans

Frank Rysanek
Thumb Down

Pump CO2 underground? How efficient is it? Basic physics

Imagine a 500MW coal-fired power plant. I live near one. Just think about the chimneys. How power-efficient would it be to pump that whole huge volume of exhaust gases underground? How deep are the gas wells? How many metres of water column? That's one bar for every 10 m. Imagine that you'd need to compress a power plant chimney's worth of gas to hundreds of bars to pump it underground. Think about the heat produced, apart from the potential energy stored in the pressure difference. Does that seem worthwhile? Wouldn't it take more energy than the power plant would be able to produce?

Or, would you just take some water from down under, let the exhaust gases bubble through it until most of the CO2 dissolves (and the nitrogen bubbles back up), and pump the water back down to where you got it? That might be a tad more efficient... But would the water absorb any further CO2 at surface pressure?

Any solid data on that? URL pointers? Too lazy to do the basic maths myself...
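Well, here's a rough ideal-gas sketch of the compression part anyway (the CO2 mass flow is an assumption, clearly marked; real multi-stage compressors with intercooling will do worse than the isothermal ideal):

    import math

    R = 8.314           # J/(mol*K), gas constant
    M_CO2 = 0.044       # kg/mol
    T = 300.0           # K - assume roughly isothermal compression
    P1, P2 = 1e5, 1e7   # Pa: from 1 bar up to 100 bar (~1000 m of water column)

    # ideal-gas isothermal compression work per kg of CO2
    w_per_kg = (R * T / M_CO2) * math.log(P2 / P1)          # J/kg, ~0.26 MJ/kg

    # ASSUMPTION: a 500 MW coal plant emits very roughly 500 tonnes of CO2 per hour
    co2_kg_per_hour = 500e3
    compression_mw = co2_kg_per_hour * w_per_kg / 3600.0 / 1e6

    print("%.2f kWh per kg of CO2" % (w_per_kg / 3.6e6))    # ~0.07 kWh/kg
    print("~%.0f MW of (ideal) compression power" % compression_mw)
    # ~36 MW: nowhere near the plant's 500 MW output, but not negligible either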

Fedora 11 beta bares chest to all-comers

Frank Rysanek
Linux

@wireless setup

I've actually just installed Fedora 10 64-bit on an Acer notebook (a brand dismissed with a grin by many of my colleagues) and IT ALL WORKS OUT OF THE BOX, including WiFi a/b/g/DraftN (Intel chip), Bluetooth and a crappy integrated webcam that has a quirky driver in Windows. I did pay attention to having an Intel chipset in the notebook, and I went for a bargain model with an older 65nm C2D CPU, so the chipset is two generations old by now, too.

As for WiFi: in the past I've configured WiFi by hand on a Broadcom-based AP with OpenWRT installed. So the first thing I did in Fedora was to try a few lines with iwconfig. But somehow I couldn't get past WPA2. Only then did I fumble through the system configuration menus on the graphical desktop, and guess what: I found a WiFi configuration tool, which did the magic with just a few mouse clicks - I managed to enter the WPA2 PSK at the first attempt.

Wow!

VirtenSys PCIe cloud switch arrives

Frank Rysanek
Happy

Further reading

from IDT and PLX - some PCIe switch chips with a prospect of multi-root architectures. Maybe pre-IOV, but the attached docs give a thorough technical background and answer many of my basic questions.

http://www.idt.com/products/getDoc.cfm?docID=18639469

http://www.idt.com/products/getDoc.cfm?docID=18688297

http://www.plxtech.com/products/expresslane/switches.asp

Interestingly for me, Pericom doesn't have much to offer in that vein... Unsurprisingly, neither does Intel. That LSI paper on IOV mentioned before makes me wonder what LSI has up its sleeve.

Frank Rysanek
Alert

PCI-e IOV - multiple root complexes can't talk to each other

Ahh well. So you can put your NIC in an external expansion box, rather than into the server itself. But unless it's a very special NIC that supports IOV, you can only use it from one root complex (= from only one host computer). The IOV standard simply says that the external multi-root PCI-e switch can carry traffic for multiple PCI-e bus trees that don't know about each other (like VLAN's). Each bus tree is "owned" by a particular "OS partition" running on the host computer. At least the part of the bus tree carried by the external switch runs virtualized on the switch, though I guess IO virtualization at PC server chipset level is already in the works too.

Any peripheral board that is to be shared by multiple PCI-e trees must have special "virtualization" support, to be able to keep track of several simultaneous PCI-e conversations with multiple root complexes. Not so very sexy...

I bet the HPC folks would appreciate it much more if the external PCI-e switch could cater for direct root-to-root data transfers - for HPC networking purposes. Imagine DMA from one root complex to another root complex (memory to memory). This doesn't necessarily mean that a standard IP networking stack would be involved - perhaps as a slow fallback solution or for management purposes. Rather, I could fancy some dedicated low-latency mailboxing framework. It would really get the multi-root PCI-e monster fairly close to a NUMA, except that we're still speaking of distinct "OS partitions" in the IOV lingo. The way I understand PCI-e IOV, such direct transfers are impossible - maybe via an intermediate "virtual NIC" or some other peripheral along those lines (call it a DMA engine with some dedicated memory of its own) implemented within the external PCI-e switch.

The sort of bandwidth available from PCI-e at least makes very good sense for direct RAID storage attachment. Perhaps not via an additional intermediate storage bus technology (that could be useful as a slow lane for some wider-area SAN interconnects).

This SSD thing is really catching on

Frank Rysanek
Stop

erase block size?

So what's the "erase block size" in the upcoming enterprise-grade flash drives? Did I hear you say interleaved erase+write cycles? How many ways/channels?

http://www.storagesearch.com/easyco-flashperformance-art.pdf

Adaptec lobs Series 5 RAID controllers at SATA and SAS

Frank Rysanek
Thumb Up

Re: how do they scale up

They probably mean the 5085, with 2x SFF-8088 (external x4 multilane SAS). That's two ports, 128 SAS addresses each, or even more with "fanout" expanders (see e.g. the SAS JBOD enclosures by AXUS).

Each ML SAS port can be connected to a daisy-chain or a tree of cascaded JBOD enclosures.

The internal SFF-8087 can also be used for cascading, provided that you have an expander-based SAS backplane in your server that provides an external SAS expansion port for daisy-chaining.

Those 256 drives may well be a firmware-side limitation, rather than the max. number of SAS addresses theoretically possible per daisy-chain / cascaded tree. Still, I'd be a little cautious about 256 drives per RAID. There can be real-world glitches that limit the practically useful degree of cascading, the performance with so many drives, the choice of RAID level, the maximum block device size that your OS can actually take, the runtime reliability of such a monster etc.

I also keep hearing rumours of SATA drives being incompatible with some SAS expanders, or that you can only use a single expander per ML SAS port for SATA drives (no JBOD daisy-chaining) etc.

I believe the limit of 256 drives is the same with the older 3xxx series.

The 5085 is on par with an Areca ARC-1680x: same IOP CPU, same number of ports. As far as firmware features and comfort are concerned, nowadays I'd probably opt for the Areca.

Dutch Consumer Association declares war on Vista

Frank Rysanek

It's not just drivers for old hardware, far from that

Quite a lot of legacy business software fails to run correctly on Vista, even fairly simple apps and some handy utils. This is not a matter of hardware drivers - this is a matter of backwards compatibility with user-space software. I know people who recently got a new name-brand notebook PC with Vista preinstalled, and after losing a day or two trying to make their beloved apps work, they downgraded to XP (having a corporate multi-license) and were up and running in a few hours, including all the software and all the third-party hardware drivers in their most recent versions. And XP *sings* on the Vista-ready hardware :-)

Yes, there's the drawback with message-signaled interrupts, but that's quite negligible on a business desktop/laptop... If the MSI capability was back-ported to XP via SP3, that might be interesting :-)

Obviously this is going to improve over time, as third-party software suppliers provide Vista-compatible updates.

Microsoft shouts 'Long Live XP'

Frank Rysanek

PCIe MSI: a performance-based reason to buy Vista?

XP can't do Message Signaled Interrupts. XP runs in legacy IO-APIC mode. That's why you can't get rid of the insane level of IRQ sharing on modern PCIe hardware under XP or W2k/W2K3 or older Linux.

Vista is reportedly MSI-capable "by heart". Can't say if cooperation is required on the part of the HW-specific device drivers (the way it is in Linux), or if MSI is somehow enforced, technically or by WHQL approval.

Linux 2.6 core IRQ routing functionality has been MSI-capable for years, AFAIK, though traditionally the individual device drivers have been lagging behind in making use of those new capabilities. Each HW driver has to explicitly ask for MSI delivery for its IRQ upon the driver's initialization. The situation has improved a lot in the latest 2.6-series kernels, as the most important drivers are getting updated.

Modern PC hardware is stuffed with PCI Express busses. PCI Express relies on purely "message-signaled" interrupt delivery for optimum performance. In "legacy-compatible IO-APIC mode", all PCIe-based devices in the system share only 4 IRQ numbers, and the IRQ delivery performance is further impaired by the multi-hop routing style, where devices connected to the north bridge get their interrupts delivered to the CPU via the south bridge's IO(x)APIC and back up through the north bridge.

Note: IO-APIC's have become a legacy affair :-)

IRQ sharing means that the interrupt service routines for the various hardware devices get called in vain. Each ISR has to run a couple of random IO transactions across the system busses, to read its device's status registers, only to find out that this ISR invocation was a "false alert" caused by the IRQ sharing. The bus transactions take time, and the CPU is idle until the bus-borne read is accomplished. This latency gets worse if the brief random IO's compete for bus bandwidth with bulkier DMA transfers of payload data (disk IO, networking, graphics). This mode of operation is massively inefficient and painful in terms of CPU load, especially with multi-GHz CPU's. Thank god it only stalls the respective CPU core in today's multi-core systems.

Before the IRQ even reaches the CPU (before it gets a chance to launch its set of ISR's), its transaction may have to travel back'n'forth across the link between the north bridge and south bridge, again competing for bus bandwidth with DMA. This impairs interrupt latency.

Now imagine that all of this takes place especially on high-performance devices such as PCIe x16 graphics boards or modern RAID adapters, with some USB UHCI's and per-port PCIe hot-swap IRQ lines thrown in as ballast... Actually if you happen to have some classic PCI-X based (parallel PCI) adapters in your PCIe system, attached via some PXH bridges to the PCIe-only chipset, it's them PCI-X devices who have a chance of getting a dedicated IO(x)APIC input pin, and a dedicated IRQ number on the CPU :-)
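On Linux at least, you can see all of this for yourself in /proc/interrupts. A rough parsing sketch, assuming the usual layout (IRQ number, per-CPU counters, chip/trigger info, then comma-separated device names on shared lines):

    def irq_sharing_report(path="/proc/interrupts"):
        """List IRQ lines shared by several devices, and count MSI/MSI-X ones."""
        shared, msi_count = {}, 0
        with open(path) as f:
            ncpu = len(f.readline().split())        # header: CPU0 CPU1 ...
            for line in f:
                fields = line.split()
                if not fields or not fields[0].rstrip(":").isdigit():
                    continue                        # skip NMI/LOC/... summary lines
                irq = fields[0].rstrip(":")
                tail = " ".join(fields[1 + ncpu:])  # chip, trigger, device name(s)
                if "MSI" in tail:
                    msi_count += 1
                devices = [d.strip() for d in tail.split(",")]
                if len(devices) > 1:                # several devices on one line = sharing
                    shared[irq] = devices
        return shared, msi_count

    if __name__ == "__main__":
        shared, msi = irq_sharing_report()
        for irq, devs in sorted(shared.items(), key=lambda kv: int(kv[0])):
            print("IRQ %s shared by: %s" % (irq, ", ".join(devs)))
        print("%d interrupt(s) delivered via MSI/MSI-X" % msi)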
