BBC programmers discover raw sockets.
Back in September, The Register's networking desk chatted to a company called Teclo about the limitations of TCP performance in the Linux stack. That work, described here, included moving TCP/IP processing off to user-space to avoid the complex processing that the kernel has accumulated over the years. It's no surprise, then …
NO! This is precisely not what is being described.
The performance problem is due to the buffer-allocation and copying that goes on in the kernel when receiving and transmitting packets via the LAN interface - this applies just as much to "raw" datagrams as to TCP/IP.
There have been a few iterations around a solution, including PF_RING, but the real problem is that the generic Linux approach to device drivers doesn't really work with high-speed network devices. Worth reading up on Netmap for some more concrete details.
The Linux networking stack is, well, sub-optimal, to put it kindly. Getting it all out of kernel space isn't really the answer in the long term, but it does show the kernel developers a way forward.
EDIT: There is, incidentally, the question of whether TCP/IP is even the right protocol suite for this type of application (too small a window size, poor packet loss recovery), but the driver issues are independent of the high layer protocols...
"The Linux networking stack is, well, sub-optimal, to put it kindly."
A general purpose OS - or general purpose H/W for that matter - isn't optimised for anything in particular. It has to be a jack of all trades balancing performance against security, ease of use, multi-tasking & whatever. If you want optimisation according to some specific criterion you use something special purpose. You want real time response you use a real time OS. You want to mine Bitcoins you use ASICs.
True in general. However, it's not true that because you can't achieve optimal performance for a certain set of criteria that any improvement at all is impossible. There are a lot of ways in which a general purpose OS can reduce network processing overhead - virtual addresses and caches notwithstanding.
These include better use of scatter/gather features of the NIC or DMA controllers, careful control of allocation and copying and not shunting data to and from user space when the ultimate source or destination is another driver (eg streaming from/to a network to/from a file). None of these would render any other part of the operating system less usable. Nor do they require any form of "real time" response.
"These include better use of scatter/gather features of the NIC or DMA "
Which NIC or DMA?
If you tie the S/W to specific H/W, e.g. a particular model of NIC then you lose the ability to plug in different H/W. If you provide for alternative H/W you end up with a modular structure which has its own overhead. I'm not saying that something which has grown at the rate the Linux kernel has is going to be the result of a whole series of ideal decisions (I can think of a few I disagree with) but if you try to do everything there are going to be trade-offs.
>Which NIC or DMA?
If you pick the right driver abstraction you can make use of hardware capabilities when they are available and fall back to something less performant if they aren't: this is well-trodden ground for operating system design. It can even simplify the overall design - it's a slightly different topic, but, for example, the driver model in NetBSD eliminates a lot of redundant architecture-specific code just by better abstracting the individual operations.
If you pick the wrong abstraction, you can never take advantage of hardware acceleration. Some of the choices that have been made in Linux to date have their origins in the mists of Unix past and are not necessarily the right choice for the future. Software evolves over time and I simply don't accept any argument of the type it's too difficult/pointless/not invented here/everything is perfect.
Nothing really new here!
But then I have worked through the late 70s and early 80s developments in computing. It was the rise of cheap computers, on the back of rapid advances in CPU performance, that moved many functions previously handled by dedicated (and hence expensive) network adaptors into the then lightly loaded workstation CPU (this is the real reason the TLI library in Unix exists: it enabled an application to transfer data and hand off processing to the network adapter). Similar design decisions led to disc controller intelligence, graphics processing, modem processing ("soft modem") and other intelligent peripheral logic being moved into the CPU.
Subsequently we have seen the resurrection of dedicated graphics processors and 'intelligent' disk I/O controllers, but not the resurrection of dedicated and intelligent network processors; it seems the BBC and friends have discovered a need for one.
Interestingly, even with intelligent network protocol processors, performance was a big issue, and any vendor looking seriously at high-speed networking always had to tweak the protocols so that they could be implemented in silicon - in fact protocols such as XTP were developed that combined the functions of the network and transport layers; but these really upset the TCP/IP and protocol layering purists and so didn't garner much support... [So just another reason for not really bothering with IPv6.]
Broadcast hardware to do the functions the BBC are playing with already exists on the market today, implemented in dedicated silicon from the likes of Motorola and others, in rack-mount formats capable of handling any bitrate, encoding and distribution you choose. Rack-mount encoders, streamers, switching solutions, watermark insertion servers etc. are all done in dedicated silicon, with just a little OS to manage the ASIC config itself on some management plane while the heavy packet flow happens on dedicated fibre links.
I'd be shocked if SKY or BT are shunting their streams about before distribution using IP running on a Linux-based computer enough for this to be an issue.
"I'd be shocked if SKY or BT are shunting their streams about before distribution using IP running on a Linux-based computer enough for this to be an issue."
You're likely to be surprised, because they do. In fact many encoders and the like are really Linux computers, with dedicated encoder cores only if you are lucky, and many still use off-the-shelf NICs etc. The really expensive ones don't, but they are really expensive...
By the time you get away from content preparation to the real business of content distribution, where throughput is really large, you are very likely to be using COTS hardware where tricks like this could make a significant difference.
Why don't they just fix the Linux network stack so it has a proper modular architecture like Windows does (NDIS) ? TOE and similar hardware acceleration is way behind under Linux as it's a bolt on after-thought. This has been a long standing and widely known Linux weakness - and one that becomes more apparent as we move to ever faster network connections.
Is that still true? I know it was certainly true in the past, but presumably Microsoft has made some improvements to something in Windows while they were ruining the GUI. Anyone seen any benchmarks comparing TCP/IP performance on the latest Linux to Windows Server 2016?
"the Linux Network Stack could use some work but it's still a heck of a lot faster and much more sane than the pile of crud that is the Windows Network Stack."
Clearly you haven't tried using real-world 10Gb, 40Gb and Mellanox-type low-latency connectivity. Windows is significantly faster than unmodified Linux - and more efficient - with significantly lower CPU use.
"I call bullshit. Mellanox cards are specifically designed to *not* use the kernel"
You call wrongly. Mellanox cards can only support hardware offload by hacking the Linux kernel with a hardware specific modification to support this. On Windows they can just inject a filter driver at the right layer in the NDIS stack.
Now I know people have opinions on the licence fee etc
But this is the BBC at its finest. Writing kernel bypasses to get better throughput.
This is why I pay my licence fee for the few incredible moments in my lifetime where I can be proud of a public utility that I fund indirectly showing off the skills of their talented staff
I'll have a look, but I would be extremely surprised if any of my iPlayer-capable devices supported outputting 50p to my TV or to any other display. So I guess it would be 50p displayed on 60p then.
I don't think my Now TV puck or the Chrome Cast stick does 50 Hz. USA is all that matter, innit?
Perhaps if I launch iPlayer on the HTPC (which I have forced to 50Hz).. Hurray, that's gonna be a lovely end user experience!
When most (all?) TV panels and their image processing run natively at 60 Hz, and all these widgets also output at 60 Hz... Just let one device do the fps resampling.
My 60 Hz-native Samsung TV (admittedly old, but a good quality panel) does horrible things if you feed it 25 or 50 Hz, so it only gets fed by a PC through HDMI at 60 Hz. The NVIDIA GPU does an excellent job of interpolating iPlayer content.
Resampling 50 -> 60 is easier than 60 -> 50 (+1 frame every 100 ms will be almost imperceptible...) I don't imagine pixel rise/fall lag can even really keep up with that unless you have a VERY expensive panel.
I don't even think graphics cards do it this way anyway, a decent GPU should utilise some form of frame interpolation. TBH I'd rather have my GPU do this than my TV.
All TVs sold in the UK with flat panels have 60 Hz native panels. The image processing chip inside is internally interpolating the broadcast 50 Hz content to the panel's native refresh rate, irrespective of how you supply it with signal.
You need to spend orders of magnitude more - £3k, £6k - to get a panel which can natively run at 24, 25, 30 Hz (i.e. frames per second) without poorly done image processing. These are usually Grade 1 or Grade 2 broadcast panels, like Sony's BVM and PVM series of broadcast monitors.
If you consider the image processing a decent consumer GPU is capable of, versus the limited horsepower in your TV's silicon, I'd rather watch TV through my GPU (which I do, thanks to an HDMI capture card) than watch it straight through the telly. TVs often do a cheap bob deinterlace to get it to 50 Hz too, and then interpolate to 60.
And then, even being fed 50i, they can still go nuts and momentarily show progressive frames (which is incredibly jarring to watch when it happens) because it thinks it's showing film (25psf) content, until the image processing realises its mistake and reverts to TFF interlaced video.
"When most (all?) TV panels and their image processing run natively at 60 Hz, and all these widgets also output at 60 Hz... Just let one device do the fps resampling."
No they don't. Many run at 100Hz, 200Hz, 300Hz, and 400Hz, as a quick hunt round panel specs tells me.
If your panel is native 60Hz, it's likely a fairly crap one.
Don't forget the majority of the world uses PAL or SECAM based solutions and that most of the world also uses 50Hz AC mains power.
Some (quite expensive) consumer sets will downscale internally to the panel's refresh rate whilst advertising a higher one.
There's a history of TV panels starting out (or being repurposed as) PC display panels - which will run at (probably) 60 Hz, unless they're more modern 100 or 120 Hz (...and even if they're advertised as 100, I'm not even sure they're natively 100 - quite possibly native 120 with some on-chip conversion!).
Taking this into consideration, 50 Hz countries are in the minority in light of global market forces which has always frustrated me. However you look at it, TVs and PC displays are made to a price point which doesn't usually include native support for 50 Hz and its higher rate multiples.
Also, consider the amount of legacy panels still in use in the UK - 720p or 'budget' 1080p panels in budget TVs which are now probably at least a decade old. They are all 60 Hz simply due to the economy of scale to manufacture one type of panel for worldwide use and throw a cheap deinterlacing and frame rate conversion algorithm in for non-NTSC markets.
Most people won't have a true 100 Hz panel in their house. People are still buying budget TVs en masse from Tesco for crying out loud, and those are all made-in-PRC specials which ALL use 60 Hz native panels.
As an aside, anything shy of a 600 Hz refresh rate is useless for true cross-standards use as it will always involve awkward, unequal frame rate conversion (and native capture at high, but not super-high, rates causes other issues with flicker from lighting etc). I agree with those who are frustrated that 600 Hz wasn't adopted as a UHD requirement. (600/24, 600/25 and 600/30 all leave no remainder - 600 is the lowest refresh rate for which that's true.)
http://www.rtings.com/tv/learn/fake-refresh-rates-samsung-clear-motion-rate-vs-sony-motionflow-vs-lg-trumotion has an interesting table showing fake vs. true panel rates. You may be surprised how many panels in models from big box manufacturers don't have refresh rates that match their advertised maximums.
"(+1 frame every 100 ms will be almost imperceptible...)"
No it isn't.
It looks absolutely awful on anything that actually moves.
That's 10 stutters per second.
What makes people think this won't be visible, when we demand at least 24fps for film, and are used to 50 images (deinterlaced) from video?
Not sure what nVidia GPU you use, but I have a passively cooled AMD one that does excellent vector-adaptive deinterlacing of HD. But it doesn't do 50 -> 60 Hz conversion, because I have told it to output 50Hz to the TV. I'm pretty sure it wouldn't do a good job though (perhaps with Windows helping it botch things up).
"I would be extremely surprised if any of my IPlayer capable devices supported outputting 50p to my TV or to any other display"
I would be surprised if they didn't. Even if you are unfortunate enough to be in a region landed with inferior NTSC (Never Twice the Same Colour) broadcasts, supporting 1080p/50Hz is standard elsewhere - and technically simpler than supporting 60Hz.
"So I guess it would be 50p displayed on 60p then."
To do that, they have to slow it down from 50Hz to 48Hz, and then do 3:2 pulldown, giving you a familiarly crap NTSC movie-like experience....
Well, my Chromestick and my Now TV puck really never switch from 60Hz.
These are designed by USA people, who really only care if SOMETHING comes out on a PAL TV, not that it looks good. I bet every single film on Now TV, Amazon and Netflix is done with 3:2 (probably stored as such) and fed out to 60Hz devices. Hence we get 60Hz here too. We certainly don't get 24p films and 50Hz PAL video.
Since the very same devices contain the iPlayer software, it too will use 60Hz. Apparently it would be a major disaster if the TV flickered for a second or so changing to 50Hz, according to US techies I have complained to.
"Sport at 30fps sourced from 50i? Possibly converted from 30fps again to display at 50Hz on your TV."
No one sane is going to capture or broadcast at a non standard rate like 30Hz as it introduces judder.
1080p/50 is recommended by the EBU for HDTV:
We are talking about streaming TV, and the devices used to display such streams. Of course 50Hz is recommended, but the devices don't implement it. So any 50p stream will indeed be a judderfest. And they are.
BBC makes films etc. at 25p, so sure, low frame rates are used. Low frame rates don't cause annoying judder as such; it's temporal errors like pulldown/up that cause issues. Film has always been 24p and works just fine in a real projector.
Except to get a projector which natively supports 24 Hz, you'll have to flash a substantial wad of cash. Most projectors are internally 60 Hz, particularly if they're LCD or LCoS. Even then, all but the more expensive DLP projectors will likely do pulldown or interpolation - their internal image processing will just look nicer.
I forced my old Toshiba LCD TV to do 48Hz, so I could see judder-free 24p Bluray. Have you tried 48Hz?
The TV didn't officially support it, but it seems the PLLs could bend enough.
Of course you need something that can output 48Hz for this. I fiddled with some utility under Windows to do this. This was in the early days when BD players were expensive, so I used a PC for BD.
I too love the BBC but I am surprised at their hypocrisy.
They obviously have the devs and in-house knowledge to hack Linux so isn't it a pity that as far as their customers are concerned they don't recognise the fact that some of us actually use Linux on the desktop?
The BBC has the facility in the iPlayer to allow users to download programmes for later consumption - all, that is, except users of Linux.
Currently they are running a beta programme using HTML5, again for everyone except Linux.
I have written to them on these subjects and have received a polite but firm reply to the effect that "We do not support Linux". Presumably not because it's too difficult, but because they see no future in supporting the OS - yet they use the thing themselves.
As I said, hypocrites.
just large, and with various bits that don't talk to other bits
Net result, the organisation says one thing and does another.
a pretense of having a virtuous character, moral or religious beliefs or principles, etc., that one does not really possess.
I daresay that attribute in a human is derived from the same root cause - bits of the brain that ought to talk to each other but don't.
Knowledge of the means by which a behaviour occurs doesn't alter the behaviour itself or its effect on others.
Also known as "an explanation isn't an excuse"
I think it's very wrong of BBC to support a virus-like program like Chrome above open source browsers. If Firefox is lacking something, they should have provided that something and CONTRIBUTED to open source. Sure, Chrome is nice, but it's also a platform for Google to put its tentacles in your computer forever.
Why isn't the BBC more idealistic? It's funded by a kind of tax after all (pay, or else..).
You do realize, right, that Chrome is just the Chromium open source project? Google adds their own stuff onto certain cuts of the Chromium project's work. There is nothing stopping you downloading Chromium directly if you don't like the Google additions. FF has become dog slow over the years; even with no plugins activated it starts up slower than Chrome with plugins.
I'm no big Firefox fan, but initial startup time isn't a strong criterion for me when picking a browser.
Chrome is not a lot like Chromium, but yes, Chromium is the open source part of Chrome.
Firefox is a memory hog on an epic scale, and the programmers involved can't see anything wrong with that. FAIL!
But all Google products are massive resource hogs too. It takes a lot of resources to spy continuously on the users. For example, I doubled my Android phone's battery life by hunting down and disabling all the Google spying-related stuff I could find.
From the BBC blog:
"We’re currently testing the HTML5 player with:
• Firefox 41
• Opera 32
• Safari on iOS 5 and above
• BlackBerry OS 10.3.1 and above
• Internet Explorer 11 and Microsoft Edge on Windows 10
• Google Chrome on all platforms"
It's down to browser support for the features they want to use rather than your choice of OS. Have you tried it in Chrome on Linux yet?
To be fair, using Linux on the back end does not imply Linux on the front end.
The two levels of usage and use cases are substantially different.
That, plus the blowback they'd get if they chose the wrong desktop version.
And from the comments on the register any version is always the wrong one to someone.
"I have written to them on these subjects and have received a polite but firm reply to the effect that "We do not support Linux" Presumably not because it's too difficult but that they see no future in supporting the OS, yet they use the thing themselves."
There is a marked difference between using something internally and supporting it for external users.
Writing the client is not only the easy part, it's also the smallest. They would need to test it (both internally and externally). This would need to be done far more thoroughly than any tools used internally, and may divert resources from supporting the iPlayer infrastructure. Then they would have to train up any support personnel and create documentation for publishing online. All for an OS that apparently accounts for 1.74% of users (source: https://www.netmarketshare.com/operating-system-market-share.aspx?qprid=10&qpcustomd=0), and at a time when the BBC is under pressure to cut costs.
An example of this in commerce: Pixar are famous not only for creating a lot of very successful movies, but also for creating and selling the Renderman software they used to render those movies. They have also written a lot of other tools, none of which will ever be released. The reason? They do not have the resources required to fully test these tools and support them, and they don't think it's worth their while investing in those resources.
"But this is the BBC at its finest. Writing kernel bypasses to get better throughput."
Network adaptor drivers have had to do this for years under Linux to be able to support the hardware offload features of modern NICs. This isn't a BBC issue, it's a Linux issue.
In the linked article, a diagram shows two copies of the packet being made, one entirely in kernel space and so presumably avoidable by kernel-space changes anyway. I can't see a couple of memcpy() calls being responsible for 90% of the CPU time (the article claims a 10-fold performance improvement) and so I conclude that the user-space implementation is actually missing out something pretty damn enormous.
I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?
>I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?
Isn't a major advantage of using Linux that you can scale out, free of incremental license costs?
On the other hand, there is always more than one way to do it.
I wonder what the saving is? Should we be doing this, or running some ARM cluster?
years ago I had occasion to trace a keypress from the interrupt service routine that handled it all the way through DOS 2.2
It was several thousand instructions before it appeared to the application.
Some, like keyboard mapping, were valid. Others appeared utterly arbitrary, left over from legacy code.
It depends where the buffers are stored. I don't know much about the inner workings of the kernel, but accessing the heap is slooooow. For example, if you're doing a large number of small calculations it is often faster to redo the same calculation numerous times than to cache the result in memory and retrieve it again.
They're trying to send uncompressed 4K video over IP (they probably mean UDP/IP). The blog says the data for each stream has got to be generated sequentially using one core, the application which generates it has to be capable of generating enough data to fill 340,000 packets per second, and one packet has to be sent out every 3 microseconds. I suppose it's easier to do that by writing it to a block of memory with the headers in the right place then passing a pointer to the network card driver which just sends it out. There's probably no kind of security whatsoever though.
I can't see a couple of memcpy() calls being responsible for 90% of the CPU time
Executing one CPU instruction is unlikely to take longer than copying 3 words of data from one place to another, and will often take less time. IP data is not 32 bit aligned and block copy operations are often forced to use 16 or even 8 bit words even if the bus width is 64 bits due to source and destination address misalignment. So copying a kB of data from one memory location two or three times is in fact far more time-consuming than executing a hundred or so instructions (which is about all it takes to format a TCP/IP header). So yes. I can easily see the copying operations taking up 90% of the total time.
Moving video drivers out of the NT kernel and into user-space to speed things up?
Whatever happened with that ... Oh, yes, I remember now. Massive security holes.
And the MS attempt was under the watch of Dave Cutler at the time. If HE couldn't make it work, I'm absolutely certain Auntie Beeb and some wet-behind-the-ears kids paying rent in San Francisco (instead of a mortgage) can't make it work.
But what do I know. I've only been in the business for over 4 decades.
I sort of agree, but I believe it was just the other way around with NT. Moving the video stuff INTO the kernel ring to speed things up, for XP. I'm too lazy to look it up, but I'm pretty sure.
Oh, and apart from "speeding up" I'm sure it was something to do with compatibility with the Win95 driver stuff. That's really gonna introduce a lot of stability...
Having a video driver crash, taking down the entire system, is so Windows. (Which is why NT and 2000 are so un-Windows.)
The speed-up I am referring to is in user-space. MS pulled the video stuff out of the kernel and into user-space in the transition between NT3.5x and the Win2K/XP era. Yes, it made the "seat of the pants" user-feel faster, but security sucked. Auntie Beeb & the Kids in San Francisco obviously have no concept of "those who don't grok history are doomed to repeat it".
I agree with your history, jake, but the Beeb are just writing their own code for their own server - they're not (AFAIK) proposing to push it out to millions of end users (although they are making it publicly available for anyone who has the same problems that they face). I imagine (I used to talk to their security bods) that there are numerous other controls protecting the core systems.
"MS pulled the video stuff out of the kernel and into user-space in the transition between NT3.5x and the Win2K/XP era. Yes, it made the "seat of the pants" user-feel faster, but security sucked!!"
erm, no, a driver in user space would likely improve security versus a kernel mode driver....
Jake, you're exactly backwards. MS pushed the graphics heap into the kernel in NT 4.0 in order to boost performance and it stayed there all the way through NT 5.0 and NT 5.1 (which you might know as Windows 2000 and XP). NT 6.0 moved most of the graphics heap out of the kernel and into userspace, but there is still a component running in kernel mode.
Moving video drivers out of the NT kernel and into user-space to speed things up?
IIRC it went the other way because of the speed of context switching on x86 chips and Microsoft needed a fast system to impress customers. Security? Well they already had the C2 (or whatever it was) certificate. Not that people really cared anyway.
I find that a distinctly odd occurrence. Historically things were put into the kernel because it was faster than userspace. So much faster that it was worth the programming difficulty, potential security holes and risk of locking up the system to do it for some things (excluding things like device drivers, which needed access to raw hardware and had to be in the kernel).
If anyone wanted performance above all else, they used to put it in the kernel. Is it really possible that Linux's network stack has become so inefficient and convoluted, that a userspace stack is actually magnitudes faster? That just sounds nuts. Admittedly the last time I had a good look at the network stack was in Linux 2.4 and early 2.6, so I might be out of date w.r.t the state of the art, but still, things in userspace being faster than kernel space just sounds wrong to me. Am I missing something?
Communication between hardware and kernel is fast indeed; the problem is when the bulk of this communication starts or ends in userspace, which implies a context switch has to be executed and adds to latency (between 1 and 2 microseconds, IIRC). In the case of large-bandwidth or latency-sensitive network communication this can be a big deal, since the context overhead is added on every packet.
Also, given the right setup (e.g. CPU isolation and pinned execution threads to reserved CPUs) there is absolutely no reason for userspace code to be slower than kernel code (also no reason to be other way around, obviously); it is about cost of context switches (also cache hotness and similar things)
"Also, given the right setup (e.g. CPU isolation and pinned execution threads to reserved CPUs) there is absolutely no reason for userspace code to be slower than kernel code (also no reason to be other way around, obviously); it is about cost of context switches (also cache hotness and similar things)"
Yes, the code execution speed in kernel and userspace should be the same - after all, code is code, the CPU doesn't care. It is the transition overhead of switching between kernel/user space which slows things down, along with the switch between supervisor/user mode rings (if you are using a userspace device driver, for instance, which is why they are less performant than kernel drivers).
And yes, highly tuned systems can reduce latency, especially if in addition to pinned execution, you also pin interrupts to certain cores ( core 0, so that you don't route through LAPICs on most hardware, but at this level, your workload type and the actual x86 motherboard you buy makes a hell of a difference to latency, as they all route interrupts differently :-) )
"excluding things like device drivers, which needed access to raw hardware and had to be in the kernel"
That's a misconception.
Obtaining initial access rights to hardware needs to be kernel-approved, but after that there is no reason a user space driver couldn't access the heck out of the hardware.
RAM is hardware as well, by the way, and I'm pretty sure user space code accesses it quite a lot.
It's to do with memory mapping and protecting all but certain memory regions from user mode access. This enforcement is all supported in hardware, and very fast.
I don't think you are correct.
1. A userspace driver has to go through the kernel every time it tries to access the hardware, resulting in a context switch which slows things down compared to direct kernel access
2. A userspace program accessing hardware requires the kernel to drop into (and then out of) supervisor mode each time it does so, these switches in/out of that mode add additional latency compared to a kernel thread, which stays in supervisor mode
3. Userspace code never accesses RAM directly, it does it via the VMM, which itself uses the MMU for translation. The kernel does not use the VMM, so in theory it is a bit faster, but the primary benefit here is being able to directly get/access physical memory addresses, and for things like DMA.
Sure, machines may have gotten so fast that all the above is barely noticeable overhead in general use cases, but it doesn't mean said overhead doesn't exist.
Well, I'm very rusty on all this. But I would have thought that the extent of kernel involvement, and ring transitions, would be dependent on what OS is involved. In the general case, I would have thought that it would be possible to do memory mapped I/O with a minimum of kernel involvement.
Correct me if I'm wrong, but the MMU overhead, isn't it always there, and isn't really an overhead at all? I don't think, for example, that the Linux kernel runs MMU-free (even if they are mapping some things to real actual addresses). But I could be wrong.
You are right that the extent of ring involvement is dependent on the OS (it is also dependent on the CPU arch, actually). Both Linux and Windows use two rings for kernel and userspace; not sure about the others (I remember hearing that OpenBSD uses all 4 of the x86 rings, but no idea if that is true).
I will admit, I was looking at this a while ago, when I implemented RDMA over Firewire as a poor man's Infiniband for clustering, but back then it was not possible to access the hardware from userspace without essentially writing a shim kernel module that would sit and pass the needed data between kernel and user space, and therefore having the overheads I mentioned.
Now, that is Linux-specific; however, any monolithic kernel design by its nature has to have all userspace stuff go through the kernel. GNU Hurd goes to show that it is possible to have user-space device drivers without that overhead, but the kernel has to be designed for it.
The MMU is a hardware device (nowadays integrated into the CPU die) which handles memory translation at the low level. Not only is it already low latency, but both the kernel and userspace use it all the time, so there is no difference between user and kernel space in this respect. The difference is that userspace goes through an additional layer, the VMM (virtual memory manager), so each process sees its own virtual address space. Only the kernel (which doesn't use the VMM) sees the real address space and, lacking the extra indirection that userspace has to travel through, has lower latency.
ok, but the VMM would only be involved in setting up virtualized areas. Not in every MMU access. Am I right?
I can't see the VMM being the deciding issue in where to stuff code: userspace or kernel space?
It's a bit sad if ring-transition overhead is a deciding factor in where to put code. Some aggregation of data before doing the transition seems to be a solution then (i.e. buffering). Guess latency might be an issue.
Does look a bit like the BBC is trying to use general purpose hardware and software to do some very specialised high bandwidth stuff, as some have already pointed out.
But perhaps penny pinching in this way will be a good thing eventually. I'm all for it if we get a better Linux one day! (Well, depending on how much BBC is sinking into this project.)
> 1. A userspace driver has to go through the kernel every time it tries to access the hardware
This isn't necessarily the case. If the kernel can map the address space of the device into the virtual address space of the userland process, there is no reason why the HW shouldn't be accessed from userland. Interrupts, however, are a different matter. I doubt you can safely run userland code from the ICS.
"If the kernel can map the address space of the device into the virtual address space of the userland process there is no reason why the HW shouldn't be accessed from userland."
That may depend on the device in question.
If the device has read access to data that its owning process shouldn't see, or write access to data its owning process shouldn't be able to write, then the user process ends up able to do things it shouldn't be able to do. I.e. system integrity is compromised.
Example: the device has a register which contains, e.g., a DMA start address used to store data received from (or sent to) the network. This is likely a *physical* address, i.e. data at that address may not belong to the userland process. See a problem with that?
" Interrupts however are a different matter. I doubt you can safely run userland code from the ICS."
Why not? So long as the interrupt is initially fielded by the OS itself, which does the necessary memory protection changes etc so the userland code can't access anything it shouldn't? If the userland code gets stuck in a loop, that would be inconvenient but it hasn't totally compromised system integrity, although it is a potential DoS attack, which is why this facility should only be available to an approved subset of applications.
Cite: connect-to-interrupt (CINT$) on RSX, VMS, and maybe others. Ask any decent dinosaur.
Surprised there is no mention of FPGAs. They are useful for doing high speed pre-filtering in effectively parallel logic. Content Addressable Memory (CAM) could be used with them to build fast searchable tables. The speed of that combination was exploited 30 years ago when Xilinx introduced their first 1800 gate devices.
SSM, any-source multicast and RTP is used extensively for audiovisual content and node synchronisation. IP Studio at the lowest level breaks down all AV streams into 'grains' of each data type - audio, video, control. You also need to preserve relative synchronisation of multiple cameras' input frames, (what's called genlocking to a reference signal) and preserve this correlation of the resultant IP streams right through to the vision mixer step.
I wonder if absolutely guaranteed delivery is more critical than speed of delivery: you don't want to lose a *single* frame of video in a broadcast system. I know I'd rather run a slightly larger time and data overhead and know I'm receiving everything. And once you're at the multi-gigabit level, gains to be had from UDP might seem insignificant compared to your available throughput, given UDP's inability to retransmit.
There are significant requirements throughout the IP Studio system for synchronisation of the various component streams and coordination of the grain streams, so it's possibly more efficient to just use TCP.
The R&D white paper documenting the CWG POC is an interesting read if you've not seen before: http://downloads.bbc.co.uk/rd/pubs/whp/whp-pdf-files/WHP289.pdf
Why don't they just use multicast or UDP?
Well, the BBC is doing what they're doing because it gives them zero copy and the ability to amortise kernel/userland context switches over lots of packets. Cloudflare is tinkering in a similar way because it allows them to quickly write unprivileged software to deal with new attack traffic.
UDP-based transmission is all well and good, but the problem these guys are trying to solve is a different one to the one you're thinking of (TCP's reliance on ACKs for flow control, with all that implies).
With "VM" now the kernel, the OS (e.g. Linux) is now in "tool-space". The next logical step is to have direct user-space to Docker-space calls for high-performance operations.
(My last real coding work was using a <1MB Unix SysV kernel, which looks and feels like a VM container to me! And with a 4MHz 68000 you NEVER copied when you could pass pointers!)
If my grey cells haven't gone too grey then I think HP-UX implemented zero copy network stacks about 18 years ago. I suspect that it was just zero copy inside the kernel but still copied in and out of user space from the system call entry/exit functions. HP-UX wasn't the only one. The 2007 implementation of sendfile should allow sharing of the buffers between user space and kernel space using page aliasing. I don't know the intricacies of the x86 MMU but I guess it isn't using a global VAS and so this should be easier.
She's told me, when I asked, they don't have the resources to do a Linux version of iPlayer, yet from this article they clearly have in-house Linux skills. Liar, liar, knickers on fire, Aunty!
Ah well, I'll just keep on using get-iplayer (which I know Aunty Beeb doesn't like) until Aunty Beeb decides to pull her finger out and create a proper iPlayer for us Linux users, then!
I don't understand why the network drivers were written that way in the first place. Ethernet has always used some form of layered protocol, and the most efficient way to build the layers is to add each header to the front of the existing payload, not to create a whole new buffer and copy the payload from one buffer to the next. It just means that whatever creates the initial buffer must leave space at the front for the largest possible combined header size. I've written several LAN stacks for small, slow 8-bit CPUs, and if the payload had to be copied from one area of RAM to another two or three times per packet, the performance hit would have made the device unusable. Even the Rx buffers were set up to have a lot of free space at the start in case the data was going to be retransmitted, in which case the appropriate Tx headers could be prepended and the Rx buffer used as the Tx buffer.
I would have thought that such a specialised application would be best built from scratch rather than running under a general-purpose OS. By all means use an off-the-shelf PC motherboard and components, but boot straight into your bespoke code. GUI controls and other stuff that is not time-critical can be implemented on a completely different machine running any standard OS, with the raw directives & variables fed to the custom machine via a link (e.g. another LAN port, USB, infrared, even serial).
It would be very interesting to watch how the Linux developers try to grind through this problem.
It seems this kind of issue needs to be solved by major re-architecture, which is the complete opposite of how Linux usually progresses: changing state from A to B by a series of small modifications.
It seems Linux has finally clashed with something big enough that it cannot be solved the Linux TraditionalWay(tm).
Biting the hand that feeds IT © 1998–2019