Before turning on its new custom-built data center in Prineville, Oregon, Facebook simulated the facility inside one of its two existing data center regions. Known as Project Triforce, the simulation was designed to pinpoint places where engineers had unknowingly fashioned the company's back-end services under the assumption …
Basically they tested it in production with no idea what would happen? Good to know they don't actually have any important data to lose.
I know what you mean, but what would you suggest? LoadRunner? I have tested extreme ingestion rates for email archiving, and generating the kind of unique load that is needed takes some doing. We tried signing up for spam and got a decent volume per day, but it didn't reflect the right mix of attachments, so we needed additional sending agents. There is nothing like real user data for understanding performance characteristics; trying to model it is exceedingly difficult.
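To illustrate the modelling problem: synthetic mail load tends to be far too uniform, whereas real attachment sizes are heavy-tailed. This is a minimal sketch under assumed distributions; the function name and parameters are illustrative, not from any real test harness.

```python
# Hypothetical sketch: generate synthetic messages whose attachment
# sizes are heavy-tailed (a few huge, most small), rather than a
# fixed size.  The distribution parameters are assumptions.
import random

random.seed(42)  # reproducible runs for a repeatable load test

def synthetic_message():
    # Log-normal gives the "mostly small, occasionally enormous"
    # shape that uniform or fixed-size generators miss.
    size_kb = int(random.lognormvariate(mu=3, sigma=2))
    return {"attachment_kb": size_kb}

batch = [synthetic_message() for _ in range(10_000)]
sizes = sorted(m["attachment_kb"] for m in batch)
print("median KB:", sizes[len(sizes) // 2], "max KB:", sizes[-1])
```

The point of the skew: a generator that sends the median size all day never exercises the code paths the outliers hit.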
The stress tests are not easy to design. With Facebook's growth rate, they had a lot on the line to get it right. One of the most fascinating things about user activity is that if something is not responsive, it will actually increase load: users click refresh or issue another request. You know what this does to a web server or the database behind it; the first transaction is not cancelled on the server side...
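That amplification effect can be sketched with a toy model. All numbers here are illustrative assumptions, not measured figures: each impatient retry adds a request while the abandoned original still occupies the server.

```python
# Toy model of retry amplification: when a page is slow, users hit
# refresh, but the server keeps working on the abandoned request too.
# Every number below is illustrative.

def requests_in_flight(base_requests, service_time_s, patience_s, max_retries):
    """Users retry while waiting; each abandoned attempt still
    occupies the server until it completes."""
    if service_time_s <= patience_s:
        return base_requests  # responsive server: one request per user
    retries = min(int(service_time_s // patience_s), max_retries)
    return base_requests * (1 + retries)  # originals plus their retries

# 1000 users, 5s patience: a 2s page stays at 1000 in-flight requests,
# but a 30s page balloons to 4000 -- load rises exactly when the
# system can least afford it.
print(requests_in_flight(1000, service_time_s=2, patience_s=5, max_retries=3))
print(requests_in_flight(1000, service_time_s=30, patience_s=5, max_retries=3))
```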
I have also tested a very large email archive (100k mailboxes). What we did was run a pre-live trial with simulated data to check that everything was basically functioning OK. Once that was complete, we got volunteers from different areas of the business to be signed up by their bosses. Once we'd seen that worked, we moved our users into the archive system in tranches of a few thousand at a time, to make sure there were no adverse effects on their work or on the system.
The importance of planning for growth
I have a dev right now who is telling me we can just throw hardware at a design we are building. My old database and design experience tells me no amount of hardware makes up for sloppy code; it follows the 2nd Law of Thermodynamics. We are bringing in an architect and a performance analyst.
Up to a point.
Up to a point throwing hardware at a problem is cheaper than hiring developers. That point is relative to what you're building, of course.
Well said sir!
It's never done is it?
I have this argument with devs and project managers at least every 6-9 months, and have had for the last 16 years in database work. Yes, sometimes it is necessary, but as every DB tuning book has ever preached: start with the app code, get that almost perfect, then start picking on the DB and the hardware. Then, when you're done, go round the loop again as many times as time allows.
Re: Up to a point.
Of course, hiring the RIGHT developers to start with is usually cheaper than hiring the wrong developers and then throwing hardware at the problem this causes...
So precious-snowflake developers will tell you, but it's nonsense: no amount of hardware will make O(n^2) into O(n).
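A minimal sketch of why, using duplicate detection as an illustrative example (the functions and numbers are mine, not from the thread): hardware buys a constant factor, but a better algorithm changes the curve.

```python
# Counting comparisons for duplicate detection, two ways.

def dup_checks_quadratic(n):
    """Naive nested-loop check: compares every pair, O(n^2)."""
    return n * (n - 1) // 2

def dup_checks_linear(n):
    """Hash-set check: one lookup per element, O(n)."""
    return n

for n in (1_000, 1_000_000):
    print(n, dup_checks_quadratic(n), dup_checks_linear(n))

# Going from 1k to 1M rows multiplies the quadratic cost by ~10^6.
# A machine 100x faster only divides it by 100: the algorithm wins.
```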
I work in financial services IT. If we open another datacentre, move a system or systems, or add a fundamental component to our systems (such as when we Euro-enabled), we test, test, test, test every single component, in a dedicated test environment.
It must be really nice that facebook users are worth so little to them that they effectively test in production. Where else would they go, but from facebook? (It's a serious question, I can't really think of anywhere else.)
I don't usually drink beer...
As said by the Most Interesting Man in the World (from those Dos Equis commercials.)
"I don't usually test my code. But when I do, I test it in production."
"Where else would they go?"
Exactly. There *is* nowhere else to go.
I got some "Cannot write to the database" errors today while liking someone's status, so it appears as though not *all* is well...
I'm sure they are very proud of themselves, but let's see what they actually achieved: they were not confident enough that their code base was developed properly for scalability, so instead of building a simulation in a controlled test environment, they just tested in production.
They could have just switched on the third data center. What was the point of the simulation? If it failed, since it was handling real traffic, their production would have been impacted just the same, no?
Seems typical for Facebook, and perhaps for computer programmers fresh from college, with enough money and iron to do what they want without fear of consequences.
Another thing I see with CS guys straight from uni is that you've got to school them in the way things are done in business. You see incredibly piss-poor treatment of students, who these days are multi-thousand-pound-paying customers, by their IT/IS services departments. My partner was at Oxford a few years back, and their webmail would regularly be taken down in the middle of the day for upgrade work. Having never worked in business, she didn't really understand that this isn't normal and shouldn't be put up with. Now move that to a new graduate entering business: there just isn't the uptime and customer-service ethos drummed into you at uni that you need to hit the ground running.
Been there done that...
...at my first job out of university. Mind you, it was a bit of a departmental mindset! To this day my former colleagues and I can't quite figure out how we were allowed to get away with it.
Though in hindsight it did teach us all exactly how not to do things, which I'd like to think we all took note of! Also, fobbing off helpdesk/customer-support bods is a lot easier than fobbing off rather irate traders...
Triforce? Really? /b/-tards around the world will be rejoicing.
And stimulating the interest of /b/-tards is usually not a very good thing for any commercial enterprise.
Of greater concern
...is that this suggests it was very possibly *built by* /b/-tards.
Then again, some might argue that FB's corporate ethos and attitude to their audience has always suggested as much.