BBC bypasses Linux kernel to make streaming videos flow

Back in September, The Register's networking desk chatted to a company called Teclo about the limitations of TCP performance in the Linux stack. That work, described here, included moving TCP/IP processing off to user-space to avoid the complex processing that the kernel has accumulated over the years. It's no surprise, then …

Silver badge

Re: This is why I love the bbc

Well, my Chromestick and my Now TV puck really never switch from 60Hz.

These are designed by USA people, who really only care if SOMETHING comes out on a PAL TV, not that it looks good. I bet every single film on Now TV, Amazon and Netflix is done with 3:2 (probably stored as such) and fed out to 60Hz devices. Hence we get 60Hz here too. We certainly don't get 24p films and 50Hz PAL video.

Since the very same devices contain the iPlayer software, it too will use 60Hz. Apparently it would be a major disaster if the TV flickered for a second or so changing to 50Hz, according to US techies I have complained to.

0
0
Reply
Silver badge

Re: This is why I love the bbc

We are talking about streaming TV, and the devices used to display such streams. Of course 50Hz is recommended, but the devices don't implement it. So any 50p stream will indeed be a judderfest. And they are.

BBC makes films etc at 25p, so sure, low frame rates are used. Low frame rates don't cause annoying judder as such; it's temporal errors like pulldown/pullup that cause issues. Film has always been 24p and works just fine in a real projector.

2
0
Reply

Re: This is why I love the bbc

Except to get a projector which natively supports 24 Hz, you'll have to flash a substantial wad of cash. Most projectors are internally 60 Hz, particularly if they're LCD or LCoS. Even then, all but the more expensive DLP projectors will likely do pulldown or interpolation - their internal image processing will just look nicer.

0
1
Reply

Re: This is why I love the bbc

"I have written to them on these subjects and have received a polite but firm reply to the effect that "We do not support Linux" Presumably not because it's too difficult but that they see no future in supporting the OS, yet they use the thing themselves."

There is a marked difference between using something internally and supporting it for external users.

Writing the client is not only the easy part, it's also the smallest. They would need to test it (both internally and externally). This would need to be done far more thoroughly than any tools used internally and may divert resources from supporting the iPlayer infrastructure. Then they would have to train up any support personnel and create documentation for publishing online. All for an OS that apparently accounts for 1.74% of users (source: https://www.netmarketshare.com/operating-system-market-share.aspx?qprid=10&qpcustomd=0) and at a time when the BBC is under pressure to cut costs.

An example of this in commerce: Pixar are famous not only for creating a lot of very successful movies, but also for creating and selling the RenderMan software they used to render those movies. They have also written a lot of other tools, none of which will ever be released. The reason? They do not have the resources required to fully test these tools and support them, and they don't think it's worth their while investing in those resources.

1
0
Reply
Silver badge

Re: This is why I love the bbc

I checked out the iPlayer Russian F1 GP, and it looked absolutely horrible using my Now TV (gen 2) puck.

What device did you use to get 50p, if I may ask?

0
0
Reply
Silver badge

Re: This is why I love the bbc

"(+1 frame every 100 ms will be almost imperceptible...)"

No it isn't.

It looks absolutely awful on anything that actually moves.

That's 10 stutters per second.

What makes people think this won't be visible, when we demand at least 24fps for film, and are used to 50 images (deinterlaced) from video?

Not sure what nVidia GPU you use, but I have a passively cooled AMD one that does excellent vector adaptive deinterlacing of HD. But it doesn't do 50 -> 60 Hz conversion, because I have told it to output 50Hz to the TV. I'm pretty sure it wouldn't do a good job of it though (with Windows perhaps helping it botch things up).

0
0
Reply
Silver badge

Re: This is why I love the bbc

I forced my old Toshiba LCD TV to do 48Hz, so I could see judder-free 24p Blu-ray. Have you tried 48Hz?

The TV didn't officially support it, but it seems the PLLs could bend enough.

Of course you need something that can output 48Hz for this. I fiddled with some utility under Windows to do this. This was in the early days when BD players were expensive, so I used a PC for BD.

0
0
Reply
Anonymous Coward

Re: This is why I love the bbc

"Most projectors are internally 60 Hz"

I just looked at tech specs of the 10 top sellers on Amazon. All of them support native 50Hz.

0
0
Reply

Re: This is why I love the bbc

All TVs sold in the UK with flat panels have 60 Hz native panels. The image processing chip inside is internally interpolating the broadcast 50 Hz content to the panel's native refresh rate, irrespective of how you supply it with signal.

You need to spend orders of magnitude more - £3k, £6k - to get a panel which can natively run at 24, 25, 30 Hz (i.e. frames per second) without poorly done image processing. These are usually Grade 1 or Grade 2 broadcast panels like Sony's BVM and PVM series of broadcast monitors.

If you consider the image processing a decent consumer GPU is capable of, versus the limited horsepower in your TV's silicon, I'd rather watch TV through my GPU (which I do, thanks to an HDMI capture card) than watch it straight through the telly. TVs often do a cheap bob deinterlace to get it to 50 Hz too, and then interpolate to 60.

And then, even being fed 50i, they can still go nuts and momentarily show progressive frames (which is incredibly jarring to watch when it happens) because the set thinks it's showing film (25psf) content, until the image processing realises its mistake and reverts to TFF interlaced video.

0
0
Reply

Re: This is why I love the bbc

Some (quite expensive) consumer sets will downscale internally to the panel's refresh rate whilst advertising a higher one.

There's a history of TV panels starting out as (or being repurposed as) PC display panels - which will run at (probably) 60 Hz, unless they're more modern 100 or 120 Hz panels (...and even if they're advertised as 100, I'm not even sure they're natively 100 - quite possibly native 120 with some on-chip conversion!).

Taking this into consideration, 50 Hz countries are in the minority in light of global market forces, which has always frustrated me. However you look at it, TVs and PC displays are made to a price point which doesn't usually include native support for 50 Hz and its higher multiples.

Also, consider the number of legacy panels still in use in the UK - 720p or 'budget' 1080p panels in budget TVs which are now probably at least a decade old. They are all 60 Hz simply due to the economies of scale of manufacturing one type of panel for worldwide use and throwing in a cheap deinterlacing and frame rate conversion algorithm for non-NTSC markets.

Most people won't have a true 100 Hz panel in their house. People are still buying budget TVs en masse from Tesco for crying out loud, and those are all made-in-PRC specials which ALL use 60 Hz native panels.

As an aside, anything shy of a 600 Hz refresh rate is useless for true cross-standards use, as it will always involve awkward, unequal frame rate conversion (and native capture at high, but not super-high, rates causes other issues with flicker from lighting etc). I agree with those who are frustrated that 600 Hz wasn't adopted as a UHD requirement. (600/24, 600/25 and 600/30 all leave no remainder - and 600 divides evenly by 50 and 60 too - making it the first ideal refresh rate.)

http://www.rtings.com/tv/learn/fake-refresh-rates-samsung-clear-motion-rate-vs-sony-motionflow-vs-lg-trumotion has an interesting table showing fake vs. true panel rates. You may be surprised how many panels in models from big box manufacturers don't have refresh rates that match their advertised maximums.

0
0
Reply

Re: This is why I love the bbc

'All TVs' = 'all cheaper TVs'. Most important word in that sentence omitted. Sod's law.

0
0
Reply
Silver badge

Re: This is why I love the bbc

"All TVs sold in the UK with flat panels have 60 Hz native panels"

Not true of my Panasonic Plasmas or my various Samsung LCDs. They all run at a multiple of 50Hz.

So for instance my UE40D6530 runs natively at 400Hz.

0
0
Reply
Anonymous Coward

Re: This is why I love the bbc

"PC display panels - which will run at (probably) 60 Hz"

Pretty much all of them also run at 75Hz without issue.

0
1
Reply
Gold badge

Smells funny

In the linked article, a diagram shows two copies of the packet being made, one entirely in kernel space and so presumably avoidable by kernel-space changes anyway. I can't see a couple of memcpy() calls being responsible for 90% of the CPU time (the article claims a 10-fold performance improvement), so I conclude that the user-space implementation is actually leaving out something pretty damn enormous.

I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?

3
2
Reply
Silver badge

Re: Smells funny

Don't forget the mandatory DRM

Gotta protect the income of those Hollywood Moguls and their floating Gin Palaces now haven't we?

2
5
Reply
Silver badge

Re: Smells funny

>I wonder if it is something I'd miss, like security, or routing, or...? Does anyone know?

Isn't a major advantage of using Linux that you can scale out, free of incremental license costs?

On the other hand, there is always more than one way to do it.

I wonder what the saving is? Should we be doing this, or running some ARM cluster?

0
0
Reply

Re: Smells funny

Years ago I had occasion to trace a keypress from the interrupt service routine that handled it all the way through DOS 2.2.

It was several thousand instructions before it appeared to the application.

Some, like keyboard mapping, were valid. Others appeared utterly arbitrary and left over from legacy code.

1
0
Reply

Re: Smells funny

When you're trying to send loads of data over a network, copying memory is usually the worst thing you can do. A "couple" of memory copies is actually a few million per second when you're in the Gigabit domain.
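To put rough numbers on that (my own assumed figures, not from the article): at 10 Gbit/s with ~1500-byte packets you're already near a million packets a second, so a couple of copies per packet becomes millions of copies. A quick back-of-the-envelope in C:

```c
/* Rough back-of-the-envelope sketch (assumed figures, not from the article):
 * how many copies per second a naive two-copy path implies at 10 Gbit/s. */
#include <stdio.h>

int main(void)
{
    const double link_bps       = 10e9;   /* assumed 10 Gbit/s link          */
    const double payload_sz     = 1500.0; /* assumed ~MTU-sized packets      */
    const double copies_per_pkt = 2.0;    /* e.g. user->kernel + kernel->NIC */

    double pkts_per_sec   = link_bps / 8.0 / payload_sz;
    double copies_per_sec = pkts_per_sec * copies_per_pkt;
    double extra_bytes    = copies_per_sec * payload_sz;

    printf("packets/s: %.0f\n", pkts_per_sec);                    /* ~833,000     */
    printf("copies/s:  %.0f\n", copies_per_sec);                  /* ~1.7 million */
    printf("extra memory traffic: %.1f GB/s\n", extra_bytes / 1e9); /* ~2.5 GB/s  */
    return 0;
}
```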

4
0
Reply

Re: Smells funny

It depends where the buffers are stored. I don't know much about the inner workings of the kernel, but accessing the heap is slooooow. For example, if you're doing a large number of small calculations it is often faster to do the same calculation numerous times than to cache the result in memory and retrieve it again.

0
0
Reply

Re: Smells funny

What are you going to scale out on to? Even Linux needs hardware to run on and that isn't free.

1
0
Reply
Silver badge

Re: Smells funny

They're trying to send uncompressed 4K video over IP (they probably mean UDP/IP). The blog says the data for each stream has got to be generated sequentially using one core, the application which generates it has to be capable of generating enough data to fill 340,000 packets per second, and one packet has to be sent out every 3 microseconds. I suppose it's easier to do that by writing it to a block of memory with the headers in the right place then passing a pointer to the network card driver which just sends it out. There's probably no kind of security whatsoever though.
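For a feel of what "one packet every 3 microseconds" means in practice, here's a minimal pacing sketch - not the BBC's code, just an illustration; send_packet() is a made-up placeholder for whatever hands the next packet to the NIC:

```c
/* Minimal pacing sketch (not the BBC's code): emit one packet roughly every
 * 3 microseconds by busy-waiting on a monotonic clock.  send_packet() is a
 * hypothetical stand-in for whatever hands the next packet to the NIC. */
#include <stdint.h>
#include <time.h>

extern void send_packet(void);            /* hypothetical: queue next packet */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

void pace_stream(uint64_t n_packets)
{
    const uint64_t interval_ns = 3000;     /* ~3 us/packet, i.e. ~340k pkt/s */
    uint64_t next = now_ns();

    for (uint64_t i = 0; i < n_packets; i++) {
        while (now_ns() < next)            /* busy-wait: sleeping is far too */
            ;                              /* coarse at this granularity     */
        send_packet();
        next += interval_ns;
    }
}
```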

2
0
Reply
Silver badge

Re: Smells funny

"

I can't see a couple of memcpy() calls being responsible for 90% of the CPU time

"

Executing one CPU instruction is unlikely to take longer than copying 3 words of data from one place to another, and will often take less time. IP data is not 32-bit aligned, and block copy operations are often forced to use 16- or even 8-bit words even if the bus width is 64 bits, due to source and destination address misalignment. So copying a kB of data from one memory location to another two or three times is in fact far more time-consuming than executing a hundred or so instructions (which is about all it takes to format a TCP/IP header). So yes. I can easily see the copying operations taking up 90% of the total time.
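If you want to see the copy cost for yourself, a rough micro-benchmark sketch like this (results will vary wildly by CPU, compiler and libc) compares aligned and deliberately misaligned copies of a packet-sized buffer:

```c
/* Rough sketch of the alignment point: time memcpy() of a ~1500-byte packet
 * payload from an aligned and from a deliberately misaligned source buffer.
 * Numbers will vary wildly with CPU, compiler and libc. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define PKT   1500
#define ITERS 1000000

char src[PKT + 64];
char dst[PKT + 64];

static double bench(const char *from, char *to)
{
    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++)
        memcpy(to, from, PKT);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    printf("aligned:    %.3f s\n", bench(src, dst));
    printf("misaligned: %.3f s\n", bench(src + 3, dst + 5)); /* odd offsets */
    printf("%d\n", dst[0]);  /* keep the copies from being optimised away  */
    return 0;
}
```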

1
0
Reply
Silver badge

Didn't microsoft try that?

Moving video drivers out of the NT kernel and into user-space to speed things up?

Whatever happened with that ... Oh, yes, I remember now. Massive security holes.

And the MS attempt was under the watch of Dave Cutler at the time. If HE couldn't make it work, I'm absolutely certain Auntie Beeb and some wet-behind-the-ears kids paying rent in San Francisco (instead of a mortgage) can't make it work.

But what do I know. I've only been in the business for over 4 decades.

5
18
Reply
Silver badge

Re: Didn't microsoft try that?

I sort of agree, but I believe it was just the other way around with NT. Moving the video stuff INTO the kernel ring to speed things up, for XP. I'm too lazy to look it up, but I'm pretty sure.

Oh, and apart from "speeding up" I'm sure it was something to do with compatibility with the Win95 driver stuff. That's really gonna introduce a lot of stability...

Having a video driver crash, taking down the entire system, is so Windows. (Which is why NT and 2000 are so un-Windows.)

6
0
Reply
Silver badge

Re: Didn't microsoft try that?

Moving video drivers out of the NT kernel and into user-space to speed things up?

IIRC it went the other way because of the speed of context switching on x86 chips and Microsoft needed a fast system to impress customers. Security? Well they already had the C2 (or whatever it was) certificate. Not that people really cared anyway.

3
0
Reply

Re: Didn't microsoft try that?

If you are only using the platform to do one thing - stream video at obscene speeds - it's pretty easy to patch up the security.

That is not the same as moving it to user space for a general purpose toy desktop.

1
1
Reply
Silver badge

Re: Didn't microsoft try that?

The speed-up I am referring to is in user-space. MS pulled the video stuff out of the kernel and into user-space in the transition between NT3.5x and the Win2K/XP era. Yes, it made the "seat of the pants" user-feel faster, but security sucked. Auntie Beeb & the Kids in San Francisco obviously have no concept of "those who don't grok history are doomed to repeat it".

0
10
Reply

Re: Didn't microsoft try that?

I agree with your history, jake, but the Beeb are just writing their own code for their own server - they're not (AFAIK) proposing to push it out to millions of end users (although they are making it publicly available for anyone who has the same problems that they face). I imagine (I used to talk to their security bods) that there are numerous other controls protecting the core systems.

1
0
Reply
Anonymous Coward

Re: Didn't microsoft try that?

"MS pulled the video stuff out of the kernel and into user-space in the transition between NT3.5x and the Win2K/XP era. Yes, it made the "seat of the pants" user-feel faster, but security sucked!!"

erm, no, a driver in user space would likely improve security versus a kernel mode driver....

5
0
Reply
Silver badge

Re: Didn't microsoft try that?

You've got it a bit back to front...

https://technet.microsoft.com/library/cc750820.aspx

3
0
Reply

Re: Didn't microsoft try that?

Jake, you're exactly backwards. MS pushed the graphics heap into the kernel in NT 4.0 in order to boost performance and it stayed there all the way through NT 5.0 and NT 5.1 (which you might know as Windows 2000 and XP). NT 6.0 moved most of the graphics heap out of the kernel and into userspace, but there is still a component running in kernel mode.

4
0
Reply
Ogi

Moving out of the kernel to improve performance?

I find that distinctly odd. Historically, things were put into the kernel because it was faster than userspace - so much faster that it was worth the programming difficulty, potential security holes and risk of locking up the system for some things (excluding things like device drivers, which needed access to raw hardware and had to be in the kernel).

If anyone wanted performance above all else, they used to put it in the kernel. Is it really possible that Linux's network stack has become so inefficient and convoluted that a userspace stack is actually orders of magnitude faster? That just sounds nuts. Admittedly, the last time I had a good look at the network stack was in Linux 2.4 and early 2.6, so I might be out of date w.r.t. the state of the art, but still, things in userspace being faster than kernel space just sounds wrong to me. Am I missing something?

0
0
Reply
Silver badge

Re: Moving out of the kernel to improve performance?

Communication between hardware and kernel is fast indeed; the problem is when the bulk of this communication starts or ends in userspace, which implies a context switch has to be executed and adds to latency (between 1 and 2 microseconds, IIRC). In the case of large-bandwidth or latency-sensitive network communication this can be a big deal, since the context-switch overhead is added on every packet.

Also, given the right setup (e.g. CPU isolation and execution threads pinned to reserved CPUs) there is absolutely no reason for userspace code to be slower than kernel code (and no reason for it to be the other way around, obviously); it is about the cost of context switches (also cache hotness and similar things).
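One existing way to amortise that per-packet context switch, without bypassing the kernel's stack at all, is batching syscalls - e.g. Linux's sendmmsg(2). A minimal sketch, assuming an already-connected UDP socket:

```c
/* Minimal sketch: amortise the per-syscall context switch by handing the
 * kernel a batch of datagrams in one go with sendmmsg(2) (Linux-specific).
 * Assumes 'sock' is an already-connected UDP socket and 'pkts' holds the
 * payloads; error handling kept to a minimum. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 64

int send_batch(int sock, char pkts[BATCH][1400], size_t len)
{
    struct mmsghdr msgs[BATCH];
    struct iovec   iov[BATCH];

    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base            = pkts[i];
        iov[i].iov_len             = len;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* One user->kernel transition covers all 64 packets instead of 64. */
    return sendmmsg(sock, msgs, BATCH, 0);
}
```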

3
0
Reply
Silver badge

Re: Moving out of the kernel to improve performance?

"excluding things like device drivers, which needed access to raw hardware and had to be in the kernel"

That's a misconception.

Obtaining initial access rights to hardware needs to be kernel-approved, but after that there is no reason a user space driver couldn't access the heck out of the hardware.

RAM is hardware as well, by the way, and I'm pretty sure user space code accesses it quite a lot.

It's to do with memory mapping and protecting all but certain memory regions from user mode access. This enforcement is all supported in hardware, and very fast.

2
0
Reply
Ogi

Re: Moving out of the kernel to improve performance?

I don't think you are correct.

1. A userspace driver has to go through the kernel every time it tries to access the hardware, resulting in a context switch which slows things down compared to direct kernel access

2. A userspace program accessing hardware requires the kernel to drop into (and then out of) supervisor mode each time it does so, these switches in/out of that mode add additional latency compared to a kernel thread, which stays in supervisor mode

3. Userspace code never accesses RAM directly; it goes via the VMM, which itself uses the MMU for translation. The kernel does not use the VMM, so in theory it is a bit faster, but the primary benefit here is being able to directly obtain and access physical memory addresses, for things like DMA.

Sure, machines may have gotten so fast that all the above is barely noticeable overhead in general use cases, but it doesn't mean said overhead doesn't exist.

3
3
Reply
Ogi

Re: Moving out of the kernel to improve performance?

"Also, given the right setup (e.g. CPU isolation and pinned execution threads to reserved CPUs) there is absolutely no reason for userspace code to be slower than kernel code (also no reason to be other way around, obviously); it is about cost of context switches (also cache hotness and similar things)"

Yes, the code execution speed in kernel and userspace should be the same; after all, code is code, the CPU doesn't care. It is the transition overhead of switching between kernel and user space which slows things down, along with the switch between supervisor/user mode rings (if you are using a userspace device driver, for instance, which is why they are less performant than kernel drivers).

And yes, highly tuned systems can reduce latency, especially if, in addition to pinned execution, you also pin interrupts to certain cores (core 0, so that you don't route through LAPICs on most hardware - but at this level, your workload type and the actual x86 motherboard you buy make a hell of a difference to latency, as they all route interrupts differently :-) ).

1
0
Reply
Silver badge

Re: Moving out of the kernel to improve performance?

Well, I'm very rusty on all this. But I would have thought that the extent of kernel involvement, and of ring transitions, would be dependent on which OS is involved. In the general case, I would have thought it would be possible to do memory-mapped I/O with a minimum of kernel involvement.

Correct me if I'm wrong, but isn't the MMU overhead always there, and not really an overhead at all? I don't think, for example, that the Linux kernel runs MMU-free (even if it maps some things to real physical addresses). But I could be wrong.

1
0
Reply
Ogi

Re: Moving out of the kernel to improve performance?

You are right that the extent of ring involvement is dependent on the OS (it is also dependent on the CPU arch, actually). Both Linux and Windows use two rings, for kernel and userspace; not sure about the others (I remember hearing that OpenBSD uses all 4 of the x86 rings, but no idea if that is true).

I will admit I was looking at this a while ago, when I implemented RDMA over FireWire as a poor man's InfiniBand for clustering, but back then it was not possible to access the hardware from userspace without essentially writing a shim kernel module that would sit and pass the needed data between kernel and user space, thereby incurring the overheads I mentioned.

Now, that is Linux specific; however, any monolithic kernel design by its nature has to have all userspace stuff go through the kernel. GNU Hurd goes to show that it is possible to have user-space device drivers without the overhead, but the kernel has to be designed for it.

The MMU is a hardware device (nowadays integrated on the CPU die) which handles memory translation at the low level. Not only is it already low latency, but both the kernel and userspace use it (all the time), so there is no difference between user and kernel space in this context. The difference is that userspace goes through an additional layer, the VMM (virtual memory manager), so each process sees its own virtual address space. Only the kernel (which doesn't use the VMM) sees the real address space and, lacking the extra indirection that userspace has to go through, has lower latency.

1
0
Reply
Anonymous Coward

Re: Moving out of the kernel to improve performance?

"Am I missing something?" Yes.

1
0
Reply
Silver badge

Re: Moving out of the kernel to improve performance?

OK, but the VMM would only be involved in setting up virtualised areas, not in every MMU access. Am I right?

I can't see the VMM being an issue in deciding where to put code: userspace or kernel space.

It's a bit sad if ring transition overhead is a deciding factor in where to put code. Some aggregation of data before doing the transition seems to be a solution then (i.e. buffering). Guess latency might be an issue.

Does look a bit like the BBC is trying to use general purpose hardware and software to do some very specialised high bandwidth stuff, as some have already pointed out.

But perhaps penny pinching in this way will be a good thing eventually. I'm all for it if we get a better Linux one day! (Well, depending on how much BBC is sinking into this project.)

0
0
Reply
Silver badge
Trollface

Re: Moving out of the kernel to improve performance?

BeOS would have done it out of the box.

1
0
Reply
Silver badge

Re: Moving out of the kernel to improve performance?

> 1. A userspace driver has to go through the kernel every time it tries to access the hardware

This isn't necessarily the case. If the kernel can map the address space of the device into the virtual address space of the userland process, there is no reason why the HW shouldn't be accessed from userland. Interrupts, however, are a different matter. I doubt you can safely run userland code from the ICS.
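Linux's UIO framework already does more or less this. A minimal sketch, assuming a UIO driver is bound to the device and exposes its first register window as /dev/uio0 (the register offset here is made up):

```c
/* Minimal sketch of userland register access via Linux's UIO framework.
 * Assumes a UIO driver is already bound to the device and its first memory
 * region is exposed as /dev/uio0; the register offset is made up. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the device's first register window (mapping 0) into this process. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* After this, loads and stores hit the hardware directly - no syscall,
     * no context switch.  Offset 0x10 is a made-up example register.       */
    uint32_t status = regs[0x10 / 4];
    printf("status register: 0x%08x\n", status);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```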

0
0
Reply
Anonymous Coward

Re: Moving out of the kernel to improve performance?

" If the kernel can map the address space of the device into the virtual addresses space of the userland process there is no reason why the HW shouldn't be accessed from userland."

That may depend on the device in question.

If the device has read access to data that its owning process shouldn't see, or write access to data its owning process shouldn't be able to write, then the user process ends up able to do things it shouldn't be able to do, i.e. system integrity is compromised.

Example: the device has a register which contains e.g. a DMA start address which is used to store data received from (or sent to) the network. This is likely a *physical* address, i.e. data at that address may not belong to the userland process. See a problem with that?

" Interrupts however are a different matter. I doubt you can safely run userland code from the ICS."

Why not? So long as the interrupt is initially fielded by the OS itself, which does the necessary memory protection changes etc. so the userland code can't access anything it shouldn't. If the userland code gets stuck in a loop, that would be inconvenient but it hasn't totally compromised system integrity, although it is a potential DoS attack, which is why this facility should only be available to an approved subset of applications.

Cite: connect-to-interrupt (CINT$) on RSX, VMS, and maybe others. Ask any decent dinosaur.

0
0
Reply

This post has been deleted by its author

Anonymous Coward

FPGAs

Surprised there is no mention of FPGAs. They are useful for doing high-speed pre-filtering in effectively parallel logic. Content Addressable Memory (CAM) could be used with them to build fast searchable tables. The speed of that combination was exploited 30 years ago when Xilinx introduced their first 1800-gate devices.

1
0
Reply
Anonymous Coward

Re: FPGAs

Forget FPGAs. Why don't they just use multicast or UDP? Then they wouldn't have to worry about ACKs. Let the control channel be via TCP, but send the stream via mcast/UDP.
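For what it's worth, the sending side of that is trivial - a minimal sketch of pushing a payload at a UDP multicast group (the group address and port are made-up examples):

```c
/* Minimal sketch of sending a payload to a UDP multicast group.
 * The group address 239.1.1.1 and port 5004 are made-up examples. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in group;
    memset(&group, 0, sizeof group);
    group.sin_family = AF_INET;
    group.sin_port   = htons(5004);
    inet_pton(AF_INET, "239.1.1.1", &group.sin_addr);

    /* No connection setup, no ACKs: receivers that joined the group get the
     * datagram, and anyone who missed it just misses it. */
    const char payload[] = "one video 'grain' would go here";
    if (sendto(sock, payload, sizeof payload, 0,
               (struct sockaddr *)&group, sizeof group) < 0)
        perror("sendto");

    close(sock);
    return 0;
}
```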

1
0
Reply

Re: FPGAs

SSM, any-source multicast and RTP are used extensively for audiovisual content and node synchronisation. IP Studio at the lowest level breaks all AV streams down into 'grains' of each data type - audio, video, control. You also need to preserve relative synchronisation of multiple cameras' input frames (what's called genlocking to a reference signal) and preserve this correlation of the resultant IP streams right through to the vision mixer step.

I wonder if absolutely guaranteed delivery is more critical than speed of delivery - you don't want to lose a *single* frame of video in a broadcast system. I know I'd rather run a slightly larger time and data overhead and know I'm receiving everything. And once you're at the multi-gigabit level, gains to be had from UDP might seem insignificant compared to your available throughput and UDP's inability to retransmit.

There are significant requirements throughout the IP Studio system for synchronisation of various component streams, and coordination of the grain streams, so it's possibly more efficient to just use TCP.

The R&D white paper documenting the CWG POC is an interesting read if you've not seen it before: http://downloads.bbc.co.uk/rd/pubs/whp/whp-pdf-files/WHP289.pdf

0
0
Reply
Anonymous Coward

Re: FPGAs

"Surprised there is no mention of FPGAs. They are useful for doing high speed pre-filtering in effectively parallel logic"

Commonly known as Network Interface Cards with TOE and RDMA....

0
0
Reply
Silver badge

Re: FPGAs

Why don't they just use multicast or UDP?

Well, the BBC is doing what they're doing because it gives them zero copy and the ability to amortise kernel/userland context switches over lots of packets. Cloudflare is tinkering around in a similar way because it allows them to quickly write unprivileged software that will deal with new attack traffic.

UDP-based transmission is all well and good, but the problem these guys are trying to solve is a different one to the one you're thinking of (TCP's reliance on ACKs for flow control, with all that implies).

0
0
Reply
Silver badge
Joke

Steaming Videos

But ... but... doesn't Systemd do this already?

3
1
Reply
