58 application servers and 10 database servers for metadata
Hm. That could be the problem, then...
BBC techies have no idea why the load on its database "went through the roof" last weekend, when Auntie was struck by a huge, two-pronged outage that caused its iPlayer service and website to go titsup. During the downtime, the Beeb was pretty reticent on social media about what had gone wrong, preferring instead to simply …
>Why? Care to elaborate?
My (limited) understanding is that these servers do not exist in such quantity just to provide capacity and redundancy but because the metadata has to be patched together from a wide variety of existing systems that were originally intended to serve other purposes.
Even the programme content itself seems to come from a variety of sources (some from source material, some off-air or at least off-playout-system).
Complexity makes for problems that are hard to both diagnose and fix.
Because typically the only dynamic part is which part of a linear data progression you select, as the answer will in nearly all cases be constant, it being a historical record.
So you'll need some form of nearby storage of a linear constant string of data, probably chopped into manageable chunks, of which only the latest portion is typically hot.
Yes you *could* put it into multiple databases, frontended by multiple application servers.. but there is another hierarchical system with mass low latency access to large datasets and automatic in-memory caching of hot data..
It would make sense that if the cache was wiped, the load on the database servers would suddenly shoot through the roof as every request would have to be served from the original metadata. The cache failure may therefore be the root cause of the problem, not a coincidental second problem.
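To put a rough shape on that: a minimal sketch of a cache-aside lookup, comparing a warm cache with one that has just been wiped. The catalogue size, request count and function name are all made up for illustration; nothing here reflects how the BBC's systems are actually built.

```python
import random

def db_hits(requests, cache):
    """Count requests that fall through to the database (cache-aside pattern)."""
    hits = 0
    for key in requests:
        if key not in cache:
            hits += 1        # cache miss: the database has to answer
            cache.add(key)   # populate the cache for later requests
    return hits

random.seed(1)
# Hypothetical workload: the first 1,000 requests after the site comes back,
# spread at random over 5,000 programme IDs.
catalogue = list(range(5000))
first_batch = [random.choice(catalogue) for _ in range(1000)]

warm = db_hits(first_batch, set(catalogue))  # cache already holds every item
cold = db_hits(first_batch, set())           # cache wiped just before the traffic arrives

print("DB queries with a warm cache: ", warm)
print("DB queries with a wiped cache:", cold)
```

With a warm cache nothing reaches the database; with a wiped one, every first request for an item does, so in this toy model the bulk of the first batch lands on the database until the cache refills.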
An up-vote from me. However, I'm surprised the load is so high that it can't be served by 10 DB servers. Even if each request required a write, I'd expect 10 DB servers to manage between 10,000 and 30,000 requests a second (the ones I'm working on here certainly can), and even more if it was only reading.
So, if we take the higher write count, and assuming the transactions aren't that large, we're looking at 1.8 million requests a minute. I find it hard to believe that the British public could put that kind of load on the system for hour after hour (Wiki says that around 70% of access is from the UK and very little at all if we only look at iPlayer).
I believe the all time peak for hits on the BBC was 1 billion in one day after the 7th of July bombings. So at the rate of transactions/queries outlined above it should only take 9.3 hours of processing time even without caches.
I'm sure it's more complicated than that as some pages require more than one lookup but I'm still unsure why removal of the cache knocked the site out for so long.
PS. The numbers are just rough figures to play with.
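For anyone who wants to check the sums, here is a quick sketch. It assumes one hit equals one database request and that the 10,000 to 30,000 figure is the total across all ten servers; both are assumptions for the sake of the arithmetic, not BBC numbers.

```python
HIGH_REQUESTS_PER_SECOND = 30_000    # upper estimate from the post above
LOW_REQUESTS_PER_SECOND = 10_000     # lower estimate from the post above
PEAK_HITS_PER_DAY = 1_000_000_000    # all-time peak quoted above

requests_per_minute = HIGH_REQUESTS_PER_SECOND * 60
hours_at_high_rate = PEAK_HITS_PER_DAY / HIGH_REQUESTS_PER_SECOND / 3600
hours_at_low_rate = PEAK_HITS_PER_DAY / LOW_REQUESTS_PER_SECOND / 3600

print(requests_per_minute)           # 1800000
print(round(hours_at_high_rate, 1))  # 9.3
print(round(hours_at_low_rate, 1))   # 27.8
```

So the 9.3 hours holds at the top of the range, but at the bottom of the range a billion-hit day would need more than 24 hours of uncached processing, which leaves rather less headroom.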
I agree. It's certainly unlikely and it does sound too simple. But it fits. Each time they switched from their emergency site back to the full site the thing died, presumably because the missing cache caused a database overload. So they then switched to the emergency site for several hours, probably to restore the cache from a backup, meaning it wouldn't have to be rebuilt organically.
Fun as it is shooting in the dark, it would be rather nice for the Beeb's technicians to provide El Reg with a full explanation, so that we can all take away the learnings (you've no idea how much I hate that phrase but I'm sure they use it a lot at the BBC).
"The timing of the outage came just days after the BBC's Internet Blog ... celebrated the fact that it had been nearly a year since the Corporation .... moved live processing into the cloud".
Perhaps somebody got the wrong idea, and the timing this week with the international media contest "how to pizz off 75% of the Russian population" might also have provided fuel. As mentioned above, the caching service might have been targeted and then it's just a question of stressing the load.
...iPlayer Radio now has 1mth catchup. I'm pretty sure it was 1 week before the weekend....
Both of those statements are based on the 'days left to listen' bit below each listing in the category section of the Android app...
Can anybody with a better memory than me confirm it changed at the w/e?
> I noticed that, however as of late last night some programmes from after the outage (e.g. Monday's I'm Sorry I Haven't a Clue) still weren't available so I dispute that the system is back to normal.
The system is still returning to normal service, evidently; further to my previous post, Pick of the Pops [Saturday] became available mid-week, as has the third episode of "It's a Fair Cop" ... which is roughly on schedule however the bbc.co.uk availability data says ep1: three weeks; ep2: four weeks; ep5: five days.
Clue fans will be happy to read, from a get-iplayer search earlier (Fri AM):
11473: I'm Sorry I Haven't A Clue: Series 61 - Episode 3, BBC Radio 4, Comedy,Highlights,Popular,Radio, 3 days 20 hours ago - Harry Hill joins regular panellists Tim, Graeme and Barry. Jack Dee hosts.
11474: I'm Sorry I Haven't A Clue: Series 61 - Episode 4, BBC Radio 4, Comedy,Highlights,Popular,Radio, 0 days 1 hours ago - Harry Hill joins regular panellists Tim, Graeme and Barry. Jack Dee hosts.
I don't give a fig about iPlayer*.
But iPlayer and normal web content should not be the same servers.
(* I don't have enough Cap for ANYONE'S video, not YouTube or Netflix (So I buy DVDs) and most of iPlayer doesn't work outside UK, which is where I happen to be. I get all the Broadcast UK content live fine though none via DAB platform and my media PC can record 2 x DTT, 2 x Satellite (from four satellite positions) and 1 x Analogue Radio simultaneously (100kHz to 1300MHz, Analogue includes up to 8kHz bandwidth narrow band data such as PSK or FSK Weatherfax) UHF Digital reception, Motorised Sat Dish and 28 + 19 + 13 + 9 E sat reception with 16 outlets).
There is no BBC Licence. There is a UK tax collected by BBC that partially funds BBC. They refuse to collect it outside UK.
Of course, in some countries there is a TV tax applicable even if you can't get local reception.
The BBC may be partially funded from UK TV tax. It's not a BBC tax though. It's a tax for being able to receive UK TV in the UK.
"There is a UK tax collected by BBC"
If you think that, then buying bread and milk is a tax on living, buying an air-con unit is a tax on comfort, buying wine is a tax on relaxing after work... You sound like some sort of disgruntled old man here.
And FYI, I have no problem getting iPlayer when I am abroad! Chrome even has plugins to make it seamless.
Cloud will make all your problems go away, trust the cloud, no nasty techies who will tell you not to do things. Databases made of candy, worldwide loadbalancing with every bite!
Don't listen to your internal techies. Throw them away! Let go of your clue. Release yourself from responsibility and self-reliance. Open yourself to the freedom of total dependency. If anything ever goes wrong, you can blame us for everything, rant at us, pick up the phone and scream at us, it's all goooood. Come to ussss... come.....
*starts humming Hotel California*
"I believe the outage was caused by Samantha rummaging around in the record department with one of the archivists."
I believe the outage was caused by Samantha rummaging around the department archivist whilst going for a record - fixed (although really I Haven't a Clue)
The live score page (http://www.bbc.co.uk/sport/cricket/live-scores) currently shows the close of play score for the first day of the second test match between Sri Lanka and South Africa. That's OK.
But it also shows the in play score for day 4 of the first test, which ended with a SA victory on day 5 (20 July). Not quite what I would call a live score.
They still haven't really said where the excessive load came from - a failure internal to the system or something like a DoS external to it ...
If it was an internal failure it's unlikely that both systems (geographically separate hot mirrors) would fail at the same time - unless there's a fundamental bug in the (replicated) system? The whole point of their system redundancy (as I understand it) is to kill one system if the other fails, which they obviously could not do, which implies the caching is not a redundant system ...
All this suggests to me the failure could have been triggered very close to the external gateways which again suggests external influences rather than internal ones.
Would the BBC tell us if they could be crippled by a DoS attack?
It's all shite anyway.
I weaned myself off the shouty drama, posh twit presenters and insane stylised editing years ago.
A couple of years ago I weaned myself off the 'News' Lies. Can't even stand the radio.
Caught myself watching the '6 O'Clock Lies' at a friend's house the other day and the lobotomised presenter was talking about the 'recovery' and 'unprecedented fall in unemployment' with a straight face! ROFL HAHAHAHA!
- He confessed that "restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems)."
BLOODY MICROSOFT HAS A LOT TO ANSWER FOR. No, a reboot is NOT the ultimate solution to most problems. It's a temporary stopgap when you confess that you have no idea what happened and cross your fingers that the unknown problem will never happen again.
Anyone care to have a go at enumerating that list? No wonder they haven't got around to supporting the Xbox One yet.
On the other hand, this might indicate a significant flaw in their system, because it might mean that there is completely separate software to support each version of Android on each different handset!
1200 devices sounds a pretty small list. All those different TVs, PVRs, BluRays, Game Boxes, Tablets, and other non-standard bits of kit over the years. PCs and things using web browsers are easy as the browsers keep getting updated. But think of the TVs whose firmware tends to get abandoned after a couple of years. No more updates and just have to hope that manufacturer followed the spec exactly.
I have been involved with supporting a few bits of abandoned hardware over the years. PVRs built by companies who then go bankrupt. Each change of iPlayer, Freeview, etc all comes with us all crossing our fingers that the old boxes will keep stumbling onwards. Hardware manufacturers don't always follow specs correctly, so I bet the Beeb has all kinds of daft little work-arounds for some of these bits of kit.
IMHO - This new iPlayer update seems to have been bodged through testing. It may look nice on a tablet, but it is hopeless to now try and use on a big screen TV with a remote control. It makes me wonder if this is an area that could have caused an issue. Some of those older hardware products could just plain be having problems accessing iPlayer data and asking stupid questions to the databases. The bizarre lack of usability testing makes me think this could be a distinct possibility...
Still seemed a tad sluggish last night (Wednesday), using my telly, rather than the PC. Still, I suppose we should be grateful for reasonably watchable quality from the Beeb over the net when it's working properly. Have you seen the foul pixellated stuff put out on the ITV player? Faces quite often degenerate into a Google unrecognisable blurred look if moving fast. Tried to watch Darling Buds of May last night and it reminded me of good ol' VHS. Bloomin' 'orrible quality on my Panny HD set. No - on second thoughts, even my old Super VHS tapes look surprisingly good on that. Let's equate it with 8mm movie film.