Re: This is why I love the bbc
Not real time of course, but just use get-iplayer. Download in HD overnight and watch at your leisure.
Back in September, The Register's networking desk chatted to a company called Teclo about the limitations of TCP performance in the Linux stack. That work, described here, included moving TCP/IP processing off to user-space to avoid the complex processing that the kernel has accumulated over the years. It's no surprise, then …
From the BBC blog:
"We’re currently testing the HTML5 player with:
• Firefox 41
• Opera 32
• Safari on iOS 5 and above
• BlackBerry OS 10.3.1 and above
• Internet Explorer 11 and Microsoft Edge on Windows 10
• Google Chrome on all platforms"
It's down to browser support for the features they want to use rather than your choice of OS. Have you tried it in Chrome on Linux yet?
To be fair, using Linux on the back end does not imply Linux on the front end.
The two levels of usage and use cases are substantially different.
That, plus the blowback they'd get if they chose the wrong desktop version.
And judging from the comments on The Register, any version is always the wrong one to someone.
"I have written to them on these subjects and have received a polite but firm reply to the effect that "We do not support Linux" Presumably not because it's too difficult but that they see no future in supporting the OS, yet they use the thing themselves."
There is a marked difference between using something internally and supporting it for external users.
Writing the client is not only the easy part, it's also the smallest. They would need to test it (both internally and externally). This would need to be done far more thoroughly than any tools used internally, and may divert resources from supporting the iPlayer infrastructure. Then they would have to train up support personnel and create documentation for publishing online. All for an OS that apparently accounts for 1.74% of users (source: https://www.netmarketshare.com/operating-system-market-share.aspx?qprid=10&qpcustomd=0), and at a time when the BBC is under pressure to cut costs.
An example of this from commerce: Pixar are famous not only for creating a lot of very successful movies, but also for creating and selling the RenderMan software they used to render those movies. They have also written a lot of other tools, none of which will ever be released. The reason? They do not have the resources required to fully test these tools and support them, and they don't think it's worth their while investing in those resources.
"But this is the BBC at its finest. Writing kernel bypasses to get better throughput."
Network adaptor drivers have had to do this for years under Linux to be able to support the hardware offload features of modern NICs. This isn't a BBC issue, it's a Linux issue.
In the linked article, a diagram shows two copies of the packet being made, one entirely in kernel space and so presumably avoidable by kernel-space changes anyway. I can't see a couple of memcpy() calls being responsible for 90% of the CPU time (the article claims a 10-fold performance improvement) and so I conclude that the user-space implementation is actually missing out something pretty damn enormous.
I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?
>I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?
Isn't a major advantage of using linux that you can scale-out, free of incremental license costs?
On the other hand, there is always more than one way to do it.
I wonder what the saving is? Should we be doing this, or running some ARM cluster?
Years ago I had occasion to trace a keypress from the interrupt service routine that handled it all the way through DOS 2.2.
It was several thousand instructions before it appeared to the application.
Some, like keyboard mapping, were valid. Others appeared utterly arbitrary, left over from legacy code.
It depends where the buffers are stored. I don't know much about the inner workings of the kernel, but accessing the heap is slooooow. For example, if you're doing a large number of small calculations it is often faster to redo the same calculation numerous times than to cache the result in memory and retrieve it again.
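That claim is easy to ballpark from userspace. A rough sketch (Python-flavoured, so interpreter overhead dominates and the absolute numbers mean little; in C the answer depends on whether the table is cache-hot) comparing recomputing a cheap result against fetching it from a precomputed table:

```python
import timeit

# Precompute 1000 small results, then compare recomputing them
# against looking them up. Figures are illustrative only.
cache = {i: i * i for i in range(1000)}

recompute_t = timeit.timeit("for i in range(1000): i * i", number=1000)
lookup_t = timeit.timeit("for i in range(1000): cache[i]",
                         globals={"cache": cache}, number=1000)

print(f"recompute: {recompute_t:.4f}s, lookup: {lookup_t:.4f}s")
```

Whether the recompute or the lookup wins varies by machine and workload, which is rather the point: "cache it in memory" is not automatically a win.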
They're trying to send uncompressed 4K video over IP (they probably mean UDP/IP). The blog says the data for each stream has got to be generated sequentially using one core, the application which generates it has to be capable of generating enough data to fill 340,000 packets per second, and one packet has to be sent out every 3 microseconds. I suppose it's easier to do that by writing it to a block of memory with the headers in the right place then passing a pointer to the network card driver which just sends it out. There's probably no kind of security whatsoever though.
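The per-packet budget quoted above is easy to sanity-check (the payload size below is my assumption, not from the blog):

```python
# 340,000 packets per second leaves just under 3 microseconds each,
# matching the blog's "one packet every 3 microseconds" figure.
PACKETS_PER_SECOND = 340_000
budget_us = 1_000_000 / PACKETS_PER_SECOND
print(f"per-packet budget: {budget_us:.2f} us")

# Assuming (my guess) roughly 1400 bytes of payload per packet:
PAYLOAD_BYTES = 1400
gbps = PACKETS_PER_SECOND * PAYLOAD_BYTES * 8 / 1e9
print(f"per-stream rate: {gbps:.2f} Gbit/s")
```

A few gigabits per second per stream, generated on a single core, with under 3 µs to get each packet out: no wonder they don't want a syscall and two copies per packet.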
I can't see a couple of memcpy() calls being responsible for 90% of the CPU time
Executing one CPU instruction is unlikely to take longer than copying three words of data from one place to another, and will often take less time. IP data is not 32-bit aligned, and block copy operations are often forced to use 16- or even 8-bit words, even if the bus width is 64 bits, due to source and destination address misalignment. So copying a kB of data from one memory location to another two or three times is in fact far more time-consuming than executing a hundred or so instructions (which is about all it takes to format a TCP/IP header). So yes, I can easily see the copying operations taking up 90% of the total time.
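For a feel of the ratio, here's a rough userspace sketch comparing one extra 1 kB copy against formatting a 20-byte header (Python's call overhead blurs the absolute numbers, so treat this as a shape-of-the-comparison demo, not a measurement of the kernel):

```python
import struct
import timeit

payload = bytes(1024)  # one kB of packet data

# Cost of one extra copy of the payload:
copy_t = timeit.timeit(lambda: bytes(bytearray(payload)), number=50_000)

# Stand-in for header formatting: pack five 32-bit fields, i.e. 20 bytes,
# the size of a minimal IPv4 or TCP header.
hdr_t = timeit.timeit(lambda: struct.pack("!5I", 1, 2, 3, 4, 5), number=50_000)

print(f"1 kB copy: {copy_t:.4f}s, 20-byte header: {hdr_t:.4f}s")
```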
Moving video drivers out of the NT kernel and into user-space to speed things up?
Whatever happened with that ... Oh, yes, I remember now. Massive security holes.
And the MS attempt was under the watch of Dave Cutler at the time. If HE couldn't make it work, I'm absolutely certain Auntie Beeb and some wet-behind-the-ears kids paying rent in San Francisco (instead of a mortgage) can't make it work.
But what do I know. I've only been in the business for over 4 decades.
I sort of agree, but I believe it was just the other way around with NT. Moving the video stuff INTO the kernel ring to speed things up, for XP. I'm too lazy to look it up, but I'm pretty sure.
Oh, and apart from "speeding up" I'm sure it was something to do with compatibility with the Win95 driver stuff. That's really gonna introduce a lot of stability...
Having a video driver crash, taking down the entire system, is so Windows. (Which is why NT and 2000 are so un-Windows.)
The speed-up I am referring to is in user-space. MS pulled the video stuff out of the kernel and into user-space in the transition between NT3.5x and the Win2K/XP era. Yes, it made the "seat of the pants" user-feel faster, but security sucked. Auntie Beeb & the Kids in San Francisco obviously have no concept of "those who don't grok history are doomed to repeat it".
I agree with your history, jake, but the Beeb are just writing their own code for their own server - they're not (AFAIK) proposing to push it out to millions of end users (although they are making it publicly available for anyone who has the same problems that they face). I imagine (I used to talk to their security bods) that there are numerous other controls protecting the core systems.
"MS pulled the video stuff out of the kernel and into user-space in the transition between NT3.5x and the Win2K/XP era. Yes, it made the "seat of the pants" user-feel faster, but security sucked!!"
erm, no, a driver in user space would likely improve security versus a kernel mode driver....
Jake, you're exactly backwards. MS pushed the graphics heap into the kernel in NT 4.0 in order to boost performance and it stayed there all the way through NT 5.0 and NT 5.1 (which you might know as Windows 2000 and XP). NT 6.0 moved most of the graphics heap out of the kernel and into userspace, but there is still a component running in kernel mode.
Moving video drivers out of the NT kernel and into user-space to speed things up?
IIRC it went the other way because of the speed of context switching on x86 chips and Microsoft needed a fast system to impress customers. Security? Well they already had the C2 (or whatever it was) certificate. Not that people really cared anyway.
I find that a distinctly odd occurrence. Historically things were put into the kernel because it was faster than userspace. So much faster that it was worth the programming difficulty, potential security holes and risk of locking up the system to do it for some things (excluding things like device drivers, which needed access to raw hardware and had to be in the kernel).
If anyone wanted performance above all else, they used to put it in the kernel. Is it really possible that Linux's network stack has become so inefficient and convoluted that a userspace stack is actually orders of magnitude faster? That just sounds nuts. Admittedly the last time I had a good look at the network stack was in Linux 2.4 and early 2.6, so I might be out of date w.r.t. the state of the art, but still, things in userspace being faster than kernel space just sounds wrong to me. Am I missing something?
Communication between hardware and the kernel is fast indeed; the problem is when the bulk of this communication starts or ends in userspace, which implies a context switch has to be executed and adds to latency (1 to 2 microseconds, IIRC). For large-bandwidth or latency-sensitive network communication this can be a big deal, since the context-switch overhead is paid on every packet.
Also, given the right setup (e.g. CPU isolation and execution threads pinned to reserved CPUs) there is absolutely no reason for userspace code to be slower than kernel code (and no reason for the other way around, obviously); it is about the cost of context switches (also cache hotness and similar things).
"Also, given the right setup (e.g. CPU isolation and execution threads pinned to reserved CPUs) there is absolutely no reason for userspace code to be slower than kernel code (and no reason for the other way around, obviously); it is about the cost of context switches (also cache hotness and similar things)."
Yes, the code execution speed in kernel and userspace should be the same; after all, code is code, the CPU doesn't care. It is the transition overhead of switching between kernel and user space which slows things down, along with the switch between supervisor/user mode rings (if you are using a userspace device driver, for instance, which is why they are less performant than kernel drivers).
And yes, highly tuned systems can reduce latency, especially if in addition to pinned execution, you also pin interrupts to certain cores ( core 0, so that you don't route through LAPICs on most hardware, but at this level, your workload type and the actual x86 motherboard you buy makes a hell of a difference to latency, as they all route interrupts differently :-) )
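That transition cost is easy to ballpark from userspace by timing a cheap real syscall against a pure-userspace no-op. Numbers vary wildly by CPU, kernel version, and mitigations; stat-on-/ is just a convenient cheap syscall, not a precision benchmark:

```python
import os
import timeit

N = 100_000

# os.stat("/") performs a real syscall each time; the lambda that does
# nothing stays entirely in userspace. The difference is dominated by
# the user/kernel transition (plus the stat work itself).
syscall_t = timeit.timeit(lambda: os.stat("/"), number=N) / N
noop_t = timeit.timeit(lambda: None, number=N) / N

print(f"syscall ~{syscall_t * 1e6:.2f} us, no-op ~{noop_t * 1e6:.2f} us")
```

Paying that per packet at 340,000 packets per second is exactly the overhead the kernel-bypass approach amortises away.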
"excluding things like device drivers, which needed access to raw hardware and had to be in the kernel"
That's a misconception.
Obtaining initial access rights to hardware needs to be kernel-approved, but after that there is no reason a user space driver couldn't access the heck out of the hardware.
RAM is hardware as well, by the way, and I'm pretty sure user space code accesses it quite a lot.
It's to do with memory mapping and protecting all but certain memory regions from user mode access. This enforcement is all supported in hardware, and very fast.
I don't think you are correct.
1. A userspace driver has to go through the kernel every time it tries to access the hardware, resulting in a context switch which slows things down compared to direct kernel access
2. A userspace program accessing hardware requires the kernel to drop into (and then out of) supervisor mode each time it does so, these switches in/out of that mode add additional latency compared to a kernel thread, which stays in supervisor mode
3. Userspace code never accesses RAM directly, it does it via the VMM, which itself uses the MMU for translation. The kernel does not use the VMM, so in theory it is a bit faster, but the primary benefit here is being able to directly get/access physical memory addresses, and for things like DMA.
Sure, machines may have gotten so fast that all the above is barely noticeable overhead in general use cases, but it doesn't mean said overhead doesn't exist.
Well, I'm very rusty on all this. But I would have thought that the extent of kernel involvement, and ring transitions, would be dependent on what OS is involved. In the general case, I would have thought that it would be possible to do memory mapped I/O with a minimum of kernel involvement.
Correct me if I'm wrong, but the MMU overhead, isn't it always there, and isn't really an overhead at all? I don't think, for example, that the Linux kernel runs MMU-free (even if they are mapping some things to real actual addresses). But I could be wrong.
You are right that the extent of ring involvement is dependent on the OS (it is also dependent on the CPU architecture, actually). Both Linux and Windows use two rings, for kernel and userspace; not sure about the others (I remember hearing that OpenBSD uses all four of the x86 rings, but no idea if that is true).
I will admit, I was looking at this a while ago, when I implemented RDMA over FireWire as a poor man's InfiniBand for clustering. But back then it was not possible to access the hardware from userspace without essentially writing a shim kernel module that would sit and pass the needed data between kernel and user space, thereby incurring the overheads I mentioned.
Now, that is Linux specific, however any monolithic kernel design by its nature has to have all userspace stuff go through the kernel. GNU Hurd goes to show that it is possible to have user-space device drivers without the overhead, but the kernel has to be designed for it.
The MMU is a hardware device (nowadays integrated in the CPU die) which handles memory translation at the low level. Not only is it already low latency, but both the kernel and userspace use it all the time, so there is no difference between user and kernel space in this context. The difference is that userspace goes through an additional layer, the VMM (virtual memory manager), so each process sees its own virtual address space. Only the kernel (which doesn't use the VMM) sees the real address space, and, lacking the extra indirection that userspace has to travel, it has lower latency.
ok, but the VMM would only be involved in setting up virtualized areas. Not in every MMU access. Am I right?
I can't see the VMM being an issue in deciding where to stuff code: userspace or kernel space.
It's a bit sad if ring-transition overhead is a deciding factor in where to put code. Some aggregation of data before doing the transition seems to be the solution then (i.e. buffering). Guess latency might be an issue.
Does look a bit like the BBC is trying to use general purpose hardware and software to do some very specialised high bandwidth stuff, as some have already pointed out.
But perhaps penny pinching in this way will be a good thing eventually. I'm all for it if we get a better Linux one day! (Well, depending on how much BBC is sinking into this project.)
> 1. A userspace driver has to go through the kernel every time it tries to access the hardware
This isn't necessarily the case. If the kernel can map the address space of the device into the virtual addresses space of the userland process there is no reason why the HW shouldn't be accessed from userland. Interrupts however are a different matter. I doubt you can safely run userland code from the ICS.
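That is roughly how Linux's UIO framework works: the kernel validates and maps the device's register window once, and thereafter userspace touches it as plain memory. Here's a sketch of the mechanism using a temporary file as a stand-in for the device region (on real hardware you'd open something like /dev/uio0; the 4 kB window size here is an assumption for illustration):

```python
import mmap
import tempfile

# One-time, kernel-mediated setup: map a 4 kB "register window".
with tempfile.TemporaryFile() as f:
    f.truncate(4096)
    regs = mmap.mmap(f.fileno(), 4096)

    # From here on, no syscall per access: these are plain loads/stores
    # into the mapped region.
    regs[0:4] = (0x1).to_bytes(4, "little")       # "register" write
    status = int.from_bytes(regs[0:4], "little")  # "register" read
    regs.close()

print(f"status register reads back as {status}")
```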
" If the kernel can map the address space of the device into the virtual addresses space of the userland process there is no reason why the HW shouldn't be accessed from userland."
That may depend on the device in question.
If the device has read access to data that its owning process shouldn't see, or write access to data its owning process shouldn't be able to write, then the user process ends up able to do things it shouldn't be able to do. IE system integrity is compromised.
Example: device has a register which contains e.g. a DMA start address which is used to store data received from (or sent to) the network. This is likely a *physical* address, ie data at that address may not belong to the userland process. See a problem with that?
" Interrupts however are a different matter. I doubt you can safely run userland code from the ICS."
Why not? So long as the interrupt is initially fielded by the OS itself, which does the necessary memory protection changes etc so the userland code can't access anything it shouldn't? If the userland code gets stuck in a loop, that would be inconvenient but it hasn't totally compromised system integrity, although it is a potential DoS attack, which is why this facility should only be available to an approved subset of applications.
Cite: connect-to-interrupt (CINT$) on RSX, VMS, and maybe others. Ask any decent dinosaur.
Surprised there is no mention of FPGAs. They are useful for doing high speed pre-filtering in effectively parallel logic. Content Addressable Memory (CAM) could be used with them to build fast searchable tables. The speed of that combination was exploited 30 years ago when Xilinx introduced their first 1800 gate devices.
SSM, any-source multicast and RTP is used extensively for audiovisual content and node synchronisation. IP Studio at the lowest level breaks down all AV streams into 'grains' of each data type - audio, video, control. You also need to preserve relative synchronisation of multiple cameras' input frames, (what's called genlocking to a reference signal) and preserve this correlation of the resultant IP streams right through to the vision mixer step.
I wonder if absolutely guaranteed delivery is more critical than speed of delivery - you don't want to lose a *single* frame of video in a broadcast system. I know I'd rather run a slightly larger time and data overhead and know I'm receiving everything. And once you're at the multi-gigabit level, the gains to be had from UDP might seem insignificant compared to your available throughput, given UDP's inability to retransmit.
There's significant requirements through the IP Studio system for synchronisation of various component streams, and coordination of the grain streams, so it's possibly more efficient to just use TCP.
The R&D white paper documenting the CWG POC is an interesting read if you've not seen before: http://downloads.bbc.co.uk/rd/pubs/whp/whp-pdf-files/WHP289.pdf
Why don't they just use multicast or UDP?
Well, the BBC is doing what they're doing because it gives them zero copy and the ability to amortise kernel/userland context switches over lots of packets. Cloudflare is tinkering in a similar way because it allows them to quickly write unprivileged software that will deal with new attack traffic.
UDP-based transmission is all well and good, but the problem these guys are trying to solve is a different one to the one you're thinking of (TCP's reliance on ACKs for flow control, with all that implies).
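For anyone who hasn't played with it, the difference is visible even in a toy loopback example: a UDP send completes with no handshake, no ACK, and no retransmission machinery (the payload here is arbitrary, and the port is whatever the OS hands out):

```python
import socket

# Receiver: bind to an ephemeral port on loopback.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
port = rx.getsockname()[1]

# Sender: fire-and-forget. sendto() returns as soon as the datagram is
# queued; nothing waits for an acknowledgement, and a lost packet is
# simply gone.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"grain", ("127.0.0.1", port))

data, addr = rx.recvfrom(2048)
print(f"received {data!r} from {addr}")
```

Any reliability or stream synchronisation then has to be built on top (as RTP sequence numbers and timestamps do), which is presumably why IP Studio's choices aren't as simple as "just use UDP".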
Biting the hand that feeds IT © 1998–2019