It's cheap as chips
We just ran a quick cost forecast in PlanForCloud and it's interesting: If you start with 100GB then add 10GB/month, it would cost $102.60 after 3 years on AWS Glacier vs $1,282.50 on AWS S3!
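For anyone who wants to check those figures, here is a rough sketch of the arithmetic. The per-GB prices ($0.01/GB-month for Glacier, $0.125/GB-month for S3) are assumptions picked because they reproduce the quoted totals, as is the idea that each month's 10GB lands before that month is billed. It ignores S3 price tiers, requests and transfer, and `storage_cost` is just an illustrative name, not anything from PlanForCloud.

```python
# Assumed flat prices per GB-month (not taken from the post; chosen to match its totals).
GLACIER_PER_GB_MONTH = 0.01
S3_PER_GB_MONTH = 0.125


def storage_cost(start_gb, growth_gb_per_month, months, price_per_gb_month):
    """Sum the monthly storage bills for a data set that grows linearly."""
    total = 0.0
    stored = start_gb
    for _ in range(months):
        # Assumption: the month's new data is added before that month is billed.
        stored += growth_gb_per_month
        total += stored * price_per_gb_month
    return total


glacier = storage_cost(100, 10, 36, GLACIER_PER_GB_MONTH)
s3 = storage_cost(100, 10, 36, S3_PER_GB_MONTH)
print(f"Glacier: ${glacier:.2f}, S3: ${s3:.2f}")
```

Under those assumptions the two totals come out at the $102.60 and $1,282.50 quoted above.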
Amazon is digging deeper into the enterprise with a data back-up and archival service designed to help kill off tape. The cloud provider has just launched Glacier, which it says takes the headache out of digital archiving and delivers “extremely low” cost storage. Glacier has been built on the Amazon storage, management and …
Well, it's sort of cheap.
I've just had a good look around their site (and the AWS blog) and have found out a few things.
First, the data is stored redundantly (specifically, it can cope with the simultaneous failure of two stores), and you can choose whether you want it in the US, EU (Ireland, 10% more expensive) or APEC (Singapore, 12% more than the US).
You store data in 'archives'. Once you have uploaded an archive, you cannot change it (though you can add to it and delete the whole thing), you are charged for three months of storage as a minimum, and if you want to download it, you have to get the whole thing. So make sure you split your data up - each archive needs to be a file!
After requesting an 'archive' for download, you have to wait 3-5 hours before you can start to download it. You then have 24 hours to get it.
You need to know what you have stored. A list of the description (if you provide one), creation date and size of each archive is available, but is only updated once per day; if you need any more info you have to download the thing.
You can only download 5% of your stored data per month *pro rated daily* for free. After that, prices go up very fast! As an example, if you stored 1TB of data and wanted to get the whole thing back, you would be charged about $369.80 (excluding taxes; again, 10% more for EU, 12% more for APEC).
So, only good for archiving if you are pretty sure you're not going to want to get most of it back.
Working for the download charge:
Peak hourly retrieval for the month = 36 gigabytes per hour (80Mbps)
Billable peak hourly retrieval = Peak hourly retrieval (36) - Free retrieval hourly allowance (1.7GB) = 34.29
Retrieval fee = Billable peak hourly retrieval (34.29) x Hours in the month (720) x retrieval price ($0.01) = $246.92
Then you add the data download fee at $0.120 per GB. So 1024 x 0.12 = $122.88, and 122.88 + 246.92 = $369.80.
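The same working as a short script, in case anyone wants to plug in their own numbers. This is only a sketch of the calculation as laid out in this post, not Amazon's official billing formula. The function and parameter names are made up for illustration, and the free allowance is taken as 5% of the stored data divided by 30, which is where the 1.7GB figure comes from.

```python
def retrieval_charge(stored_gb, peak_gb_per_hour,
                     hours_in_month=720,
                     retrieval_price=0.01,   # $ per GB of billable peak, per hour of the month
                     transfer_price=0.12,    # $ per GB downloaded
                     free_fraction=0.05,
                     days_in_month=30):
    """Return (retrieval_fee, transfer_fee, total) following the working in this post."""
    # Free allowance as used in the post: 5% of stored data, pro rated over 30 days.
    free_allowance = stored_gb * free_fraction / days_in_month
    billable_peak = max(peak_gb_per_hour - free_allowance, 0.0)
    retrieval_fee = billable_peak * hours_in_month * retrieval_price
    transfer_fee = stored_gb * transfer_price
    return retrieval_fee, transfer_fee, retrieval_fee + transfer_fee


retrieval_fee, transfer_fee, total = retrieval_charge(stored_gb=1024, peak_gb_per_hour=36)
print(f"retrieval ${retrieval_fee:.2f} + transfer ${transfer_fee:.2f} = ${total:.2f}")
```

Because the script does not round the allowance to 1.7GB along the way, its result can differ from the hand-worked figures by a cent or two.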
"After requesting an 'archive' for download, you have to wait 3-5 hours before you can start to download it. You then have 24 hours to get it."
By my reckoning if peak recovery rate is 36GB/hour you're never going to get that TB back within your download window. Am I missing something?
That's a very good point. I naively assumed that it meant you had 24 hours to *start* downloading the job, but after having a look at the actual API reference it looks like at some random time after 24 hours it may just reset the TCP connection and return a 404 for any attempts to resume. That's just plain stupid.
Which unfortunately means that it's essentially unusable if the amount of data you store on it is greater than the maximum you can pull down your internet connection in 24 hours. That is unless you fancy doing a lot of maths to request multiple jobs about 12 hours apart and you can guarantee that you can maintain a constant download rate over the whole period.
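A quick way to sanity-check that for your own link: the sketch below works out how much you can pull down in the 24-hour window at a steady rate. It assumes decimal units (1 GB = 1000 MB), a perfectly constant transfer rate, and that the window really is a hard cut-off, which is only my reading of the API reference. The function names are made up for illustration.

```python
def max_gb_in_window(link_mbps, window_hours=24):
    """GB that can be downloaded in the window at a constant link speed."""
    # Mbit/s -> MB/s (divide by 8) -> MB/hour (x 3600) -> GB/hour (divide by 1000)
    gb_per_hour = link_mbps / 8 * 3600 / 1000
    return gb_per_hour * window_hours


def fits_in_window(archive_gb, link_mbps, window_hours=24):
    """True if one retrieval job of this size could be fully downloaded in time."""
    return archive_gb <= max_gb_in_window(link_mbps, window_hours)


print(max_gb_in_window(80))        # 80Mbps is 36GB/hour, so 864.0 GB in 24 hours
print(fits_in_window(1024, 80))    # False: the 1TB example does not fit
```

So on an 80Mbps link, anything much over 860GB in a single job would not come back in time, before allowing for any dips in throughput.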
Just remember, when your CEO/CIO/CFO gets in your face waving the savings that Amazon has promised, to tell them that the big fat pipe you'll need to use this doesn't come free, and neither does the redundant one you might want to back it up.
Then you might want to look at the Article 29 Working Party report into the Cloud:
http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2012/wp196_en.pdf
Then you might want to order some more disks for your SAN.
And their SLAs guarantee that the data on this life insurance policy or land deed will be available in 99 years' time?
That all my data won't disappear if the US suspects that somebody on Amazon is hosting a pirate movie?
And there is no price rise when I suddenly want to move all my data off their platform to a competitor?
At first, I thought this was a slower, low-cost variant of S3: same concept, but bigger, cheaper SATA disks and more use of RAID than straight duplication. The multi-hour retrieval times quoted would be consistent with tape, but they denied in interviews that it's tape based - some kind of disk library, perhaps, where the disk is stored powered down in a vault somewhere, then spun up when you request your data back? That could explain a few hours - spin up and mount a RAID set, then copy the data off to a staging S3 bucket for you to read from. Throw in some smart placement (keep all your stuff together, destaging it from S3 in big batches) and they should avoid the worst case scenarios (lots of little requests for different archived objects, spread out in time).
A dozen 4TB or two dozen 2TB drives in a pod, with double or triple parity protection, would fit with their 40TB maximum object size plus a bit of overhead - and they've set up infrastructure for hooking up big external drives to S3 already for the Import/Export stuff.
I like the price compared to S3 - but it's $120/yr for a terabyte. Probably about what you'd expect to pay to rent a pair of 1TB SATA drives for the year, sitting in quiet corners of two different Amazon sheds, plus a small share of a couple of shared drives for parity protection?