Deduping supposedly encrypted data? Yet it can spot 20 copies of the same file from different users?
You want to patent that. No really...
Start-up Bitcasa invites you to shove all your data into the cloud and use your hard drive as a cache. It's offering infinite storage capacity, it says, for 10 bucks a month. Really. Bitcasa says it is different from Nirvanix, Mozy and others because it is not a cloud-based backup company. It's different from Dropbox because it …
They use the same cereal box decoder ring for each encryption. Because of this they can't "know anything about files and folders" but they can still spot that template you and 10 others copied from the internet for your PowerPoint slide deck. Template spotted; 10 copies not stored.
See? Easy peasy.
Too bad for them that Cap'n Crunch and I have prior art dating to the early '80s.
But I guess that won't occur to the majority of punters, who will be baffled by all the technospeak but understand that when a phone company says 'unlimited' it doesn't actually mean unlimited, so when someone says 'infinite' obviously they're telling the truth.
Take 10 punters with DIFFERENT powerpoint presentations. Hopefully the encryption will turn them into the same bit stream that can then be de-duplicated and only one copy stored.
Either that or everything is sent to /dev/null and is able to be retrieved as long as it is cached on your hard drive....
Even if they use the same template, the resulting file will be different, so file-level matching based on hashing the encrypted data is out. It says that if you share even a single slide between ppts, they dedup it, which perhaps means block-level dedup with some understanding of filetypes. Likely they "stripe" intelligent chunks (slides, pages, or simply data blocks) across HDDs and match those chunk signatures. They may not know what's in the file server-side, but I'm sure the client side is quite aware in order to pull this off.
I came up with the same idea a while back for my employer and tweeted about it. (Something along the lines of "One day, there will only be one copy of each item of content.") Why replicate when you can stream?
Because you do not have free internet? Just an example - streaming full-time is going to cost money - unless you have an all you can eat plan.
I'm not even going to list the companies. I'd run out of room. Unlimited bandwidth deals. Unlimited storage deals. Infinite (or @#$#$%$^ Unlimited) any bloody thing.
I'd rant. I'd rave. But I really can't be bothered. Apart from saying that the appearance of 'infinite', 'unlimited' or 'forever' in advertising for just about anything these days is a quick way to stop me ever becoming a customer.
That all sounded interesting until the bit about sharing stuff with a URL.
Here's my data and here is the big red target on the side... have fun!
In order to do deduplication they have to know what your data is --- obviously; they need to be able to match the hashes of a block from one user against a block from another user. And, naturally, since they're only storing one instance of the block, that block must be accessible by both users. The only way I can conceive of this working in an encrypted environment is if all users have the same encryption key... which rather defeats the purpose of encryption.
Or am I missing something?
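One known way around the "everyone shares a key" problem is usually called convergent encryption: each block's key is derived from the block's own contents, so identical plaintext gives identical ciphertext without any key being shared between users. Nothing in the article confirms that Bitcasa does this. The sketch below is only an illustration: the XOR keystream is a toy stand-in for a real cipher, and `DedupStore` and the function names are made up for the example.

```python
import hashlib


def keystream(key, length):
    """Toy keystream built from SHA-256 in counter mode (illustration only, not secure)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_cipher(key, data):
    """XOR data with the keystream; the same call encrypts and decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


def convergent_encrypt(block):
    """Derive the key from the block itself, then encrypt the block with it."""
    key = hashlib.sha256(block).digest()
    return key, toy_cipher(key, block)


class DedupStore:
    """Hypothetical server: stores ciphertext only, indexed by the ciphertext's hash."""

    def __init__(self):
        self.blocks = {}

    def put(self, ciphertext):
        block_id = hashlib.sha256(ciphertext).hexdigest()
        # An identical ciphertext from another user lands on the same entry.
        self.blocks.setdefault(block_id, ciphertext)
        return block_id

    def get(self, block_id):
        return self.blocks[block_id]
```

Each user keeps the per-block keys on the client. The catch is the one raised elsewhere in this thread: anyone who already holds a file can derive its key and ciphertext and so confirm whether the server has it.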
Having read the "learn more" link on their site, they appear to suggest that the data is encrypted client side and stored unchanged, i.e. encrypted, on their storage.
If users do not have the same encryption key then I'd have thought the likelihood of finding any duplicated data that users could share would be minimal. If they do have the same key, then identical files encrypted by 2 users could result in the same encrypted data block. As they have no access to the unencrypted data, they can't know if user1's powerpoint slide is the same as user2's unless they use the same key.
I suppose because they are looking below file level then blocks of data could be duplicated across users, and the smaller the block size the likelier that would be, peaking at about 50% when the block size is 1 bit (assuming a random average).
Perhaps I am missing something too. I share Mr Given's Misgivings.
Any proper encryption will give you different output every time you encrypt the file, even if you use the same key, because the Initialization Vector (IV) should be chosen at random.
Not that it looks any different from simple TLS after reading the features...
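The random-IV point is easy to demonstrate. The sketch below uses a toy XOR keystream seeded with a fresh random nonce as a stand-in for a real cipher mode (an assumption for illustration, not anything Bitcasa has described). It shows the same key and the same plaintext producing different ciphertext on every call, which is why a server comparing ciphertext would find nothing to dedupe.

```python
import hashlib
import os


def keystream(key, nonce, length):
    """Toy keystream from SHA-256 over key, nonce and a counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def encrypt(key, plaintext):
    """Pick a fresh random nonce per call and prepend it to the ciphertext."""
    nonce = os.urandom(16)
    stream = keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key, blob):
    """Split off the nonce, regenerate the keystream, and undo the XOR."""
    nonce, body = blob[:16], blob[16:]
    stream = keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))
```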
I just tried to register for their beta and got the message that space is limited. So they must be using some new meaning for the word infinite.
The link has many comments that explain it fairly well.
Let's see, if you have encrypted my data, then it does not bear any reasonable resemblance to the original. Hence, you cannot compare it with someone else's files and determine if I have an identical copy.
Unless of course, you store a signature (hash) of each of my files. That avoids you having to look at the file contents, you can just compare the hashes and determine if two files are identical.
But if you do that you have not encrypted my data, at least given any file I can tell if you have it or not.
But it's even worse than that and I don't know why anyone has not highlighted it. For encryption to be secure, it has to be based on something shared between the cloud and me, plus something else I only know and nobody else. Hence, if encryption keys are shared among clients, you're actually encoding using the same key to everyone.
Which is enormously risky, because if the private key is revealed, all the encrypted data, not only the user's but everyone's, can be decrypted. Ask the HDMI and Sony guys about that.
This is such a scam that I don't know why even The Reg is giving it visibility. Oh, well, to allow for some anonymous coward to post.
All the posts I've read so far seem to talk about asymmetric encryption, or symmetric encryption. It is entirely possible to do what Bitcasa are claiming. Although I wonder if they got it right. Anyways, the classic Needham-Schroeder protocol (assuming you replace nonces with timestamps) provides a good basis for it.
It works like this:
Alice is a subject that submits a file
Bob is a subject that has shared access to Alice's file
Sam is the Bitcasa server
Alice calls Sam and says she'd like to share a file with Bob.
Sam makes up a session key message consisting of Alice's name, Bob's name, a key for them to use, and a timestamp.
Sam encrypts all this under the key he shares with Alice, and he encrypts another copy of it under the key he shares with Bob.
He gives both ciphertexts to Alice.
Alice retrieves the key from the ciphertext that was encrypted for her, and passes on to Bob the ciphertext that was encrypted for him.
Next, Alice creates a hash of the unencrypted file, and sends that to Sam for indexing.
Alice now uploads her file to Sam, encrypted using the key from the ciphertext that was encrypted for her.
Bob has access anytime he likes, using the key from the ciphertext that was encrypted for him.
Simples. If, as I said, they got a). the protocol right and b). the implementation actually reflects the protocol.
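For anyone who wants to follow the Alice/Bob/Sam steps above in code, here is a minimal sketch of that flow. It is an assumption-laden toy: `seal`/`unseal` are a stand-in XOR cipher rather than real encryption, the `Sam` class and its method names are hypothetical, and the timestamp is recorded but not checked.

```python
import hashlib
import json
import os
import time


def _keystream(key, nonce, length):
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(key, plaintext):
    """Toy encryption: random nonce plus XOR keystream (illustration only)."""
    nonce = os.urandom(16)
    return nonce + bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))


def unseal(key, blob):
    nonce, body = blob[:16], blob[16:]
    return bytes(a ^ b for a, b in zip(body, _keystream(key, nonce, len(body))))


class Sam:
    """Hypothetical server holding one long-term key per user and an index of uploads."""

    def __init__(self):
        self.user_keys = {}
        self.index = {}  # plaintext hash -> encrypted file

    def enroll(self, name):
        self.user_keys[name] = os.urandom(32)
        return self.user_keys[name]

    def issue_tickets(self, owner, peer):
        # Session key message: both names, a fresh key, and a timestamp.
        message = json.dumps({
            "from": owner,
            "to": peer,
            "key": os.urandom(32).hex(),
            "ts": time.time(),
        }).encode()
        # One copy under the owner's key, one under the peer's; both go to the owner.
        return seal(self.user_keys[owner], message), seal(self.user_keys[peer], message)

    def store(self, file_hash, ciphertext):
        self.index[file_hash] = ciphertext

    def fetch(self, file_hash):
        return self.index[file_hash]


def session_key(long_term_key, ticket):
    """Recover the session key from the ticket encrypted for this user."""
    return bytes.fromhex(json.loads(unseal(long_term_key, ticket))["key"])
```

Alice calls `issue_tickets("alice", "bob")`, keeps the first ticket, forwards the second to Bob, sends Sam the SHA-256 of the plaintext for indexing, and uploads `seal(session_key, file)`. Note that in this sketch Sam generated the session key, so Sam could read the file too.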
It still doesn't explain how they can deduplicate encrypted files *on a block level*
...when your hashes are computed from the unencrypted source. You could split the file into block-sized chunks and hash those. Or you could treat file contents (e.g. individual PowerPoint slides) separately. In fact you could templatize file chunking based on a new policy downloaded from the server for each and every session. If you wanted.
How the hash values (if that's what they're using) are computed will determine the granularity of deduplication. From there the problem is one of indexing and content management.
The real security issue such a system faces is key management. There will be a public/private keypair (asymmetric) for every user, and another for every user's device. There will be a symmetric key for every file. That's a lot of key management.
I guess the final point is that using a well-thunk-through combination of asymmetric, symmetric and one-way encryption, it's entirely possible to compare segments of files you don't know the contents of.
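To make the granularity point concrete, here is a minimal sketch of hashing fixed-size chunks of the unencrypted source on the client. The 4096-byte block size, SHA-256, and the function names are assumptions picked for the example, not anything Bitcasa has published.

```python
import hashlib


def chunk_hashes(data, block_size=4096):
    """Hash each fixed-size chunk of the plaintext; the last chunk may be shorter."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def shared_chunks(file_a, file_b, block_size=4096):
    """Return the chunk hashes that two files have in common."""
    return set(chunk_hashes(file_a, block_size)) & set(chunk_hashes(file_b, block_size))
```

Fixed-size chunks only match when the shared content sits at the same block alignment in both files. Insert one byte near the start of a file and every later chunk hash changes, which is one reason format-aware chunking (per slide, per page) might be attractive.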
The point was about how encryption and deduplication are basically incompatible. Your example only addresses the sharing scenario, of course assuming that one knows before uploading who are you going to share things with. But does not address the situation where one wants to share the file after it is uploaded.
Neither addresses how the system is going to know which two files from different users are identical. Hashing prior to encryption is at best a leak, because the system knows who else has something that hashes to the same value and thus has a degree of knowledge of the content (i.e., the RIAA/MPAA can ask them who has a file by providing a hash) and at worst a terrible security nightmare, since the hash is calculated by an untrusted party (the client PC).
The possibilities for wreaking havoc are endless. So yeah, unlimited storage (patently false) without knowing what we are storing (false). Pretty much invalidates the whole product.
Dude, without prejudice, I imagine you're probably not familiar with asymmetric and symmetric encryption. If you're interested, check out how PGP managed session keys. Similar concept, different application.
If I upload a file it is encrypted using a symmetric key I own. If I then share that file at a later stage all I have to do is to share the symmetric key using my friend's public key. This would be done at the point where I instruct Bitcasa to "Share file.ptt with UserX".
The system knows two files (or file parts) are identical because their hash values are identical. Yes. This means the hashing password must come from the server.
I did not specify where keys are generated - I don't know that. Either client or server are a good choice, depending on your objective.
There are some very sensitive documents that use a similar protocol to make encrypted content searchable. Your risk analysis will highlight the impact and probability of any weakness. It is then a business decision to mitigate (manage), transfer or accept those risks.
Many problems I've worked on choose to both mitigate and accept - i.e. in the search implementation I did, the search index was also encrypted. AES was fast enough for that. The threat model showed that accepting the remainder of the risk (shared symmetric key for the index) had business legs.
Technology is easy. People and process are not.
I'm not an expert in encryption, but still not convinced that their claims are false. You seem well versed so maybe it's a good time to learn something. Let's see.
Sharing after uploading: so each shared file with each individual has an associated key? Looks like the right way to do it from a security point of view. However, that's an awful lot of keys to handle, which does not look very scalable to me. Way less than "unlimited", which simple laws of physics say was false from the start.
Hashing to de-duplicate: if the hashing password comes from the server, I have to upload the file unencrypted, right? Hence, the service knows the unencrypted contents of my file. Fails on "we don't know what you're storing" part.
To avoid that, the client can encrypt before uploading and then upload the hashes as part of say, metadata. Then the system will know the hash of the raw content only because it trusts the client, so I can make up whatever hash I want and check if the server has it. Great for content providers, I guess, but fails again in the "we don't know what you're storing".
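The probing concern can be sketched in a few lines. Assume a hypothetical server-side index keyed by the hash of the raw content, as uploaded by clients; `HashIndex` and its methods are invented for the example. Anyone who holds a copy of a file, or just its hash, can ask whether it is stored and by whom, without ever decrypting anything.

```python
import hashlib


def fingerprint(data):
    """Hash of the unencrypted content, computed on the client."""
    return hashlib.sha256(data).hexdigest()


class HashIndex:
    """Hypothetical dedup index: plaintext hash -> set of users who uploaded it."""

    def __init__(self):
        self.owners = {}

    def register(self, user, digest):
        self.owners.setdefault(digest, set()).add(user)

    def exists(self, digest):
        # A yes/no answer is already a leak: it confirms the content is stored.
        return digest in self.owners

    def who_has(self, digest):
        return sorted(self.owners.get(digest, ()))
```

A rights holder would only need `fingerprint(their_file)` and a query against the index. The server also has no way to check that a client-supplied hash matches the encrypted upload it arrives with.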
I'm deeply suspicious of this too. The encryption just cannot be for everything.
The techcrunch article even says it "doesn’t know anything about the file itself, really. It doesn’t see the file’s title or know its contents." And you "can share a link (file, folder) with other users". Really? Share a file it "really" doesn't know about? Oh rly! Sounds like marketing bull to me.
If it is all encrypted client side before upload and the hosts "can't access or see" the encrypted data, then passing a link to it when it is sitting "in the cloud" gets decrypted how? Unless, the other user also has to have Bitcasa client and all clients have to use the same key to encrypt/decrypt...
Or it is indeed a bucket of still steaming, finest marketing.
They just use a really fast and effective encryption. I believe it's called "ROT13".
It's super effective.
"Thank you for signing up for the Bitcasa beta. Space is extremely limited and you are at the back of the queue.
To move yourself up in line, send a tweet or post to Facebook including your personal sharing link below. The more people you get to sign up, the sooner you get Bitcasa."
so they want you to spam FB and twitter and get people after you to sign up, how the fuck does that move you up a FIFO Q?
Well if you were say, in the bottom q, and then you persuaded lots of people to sign up they would be even further down the same q, so relatively speaking you would have moved up a notch in a much bigger q.
They'll basically tell you to spam all your friends in order to increase your chances of actually getting on the beta.
 'Tell all your Facebook friends and paste the following URL into your Twitter feed' - and what about people that don't have farcebook and twitter?
This is exactly *why* I do not have Facebook or Twitter...
I'd love to know how they are deduplicating encrypted files?
I can understand transmitting them as random rubbish but it's not possible to dedup data in this form.
Neither encryption nor dedupe are my strong points, but why couldn't you encrypt at file level and dedupe at block level? If a block of encrypted data looks like "01010111", for example, and you dedupe any blocks with that same sequence, surely that is completely abstracted (and therefore irrelevant) to the encryption keys/vectors etc that are in use?
Crunchfund may be an investor, but this feels like the NSA will be the infrastructure provider! *Adjusts aluminium foil skullcap*
Surely to do the de-dupe the data must be unencrypted at their end (or at least accessible, a la Dropbox)?
Sure, putting *ALL* my data on-line where any third party could look at it is smart.
No thanks, I'll pass.
Infinite -> unlimited -> broadband. See where I'm going here?...
I signed up for beta and got the email
"Space is extremely limited and you are at the back of the queue. The more people you get to sign up, the sooner you get Bitcasa. Use the link below to share with friends or post to your social networks."
I think someone attended too many web 2.0 marketing seminars.
Can't work across users unless it's a common keypair - if it can, they can retrieve the data, can't they?
I'm glad to see they moved onto another slam-dunk product after the raging success that was the CrunchTablet.
It can work just fine if you deduplicate stripes of drive data rather than files; there are only so many combinations of #kB of data so I would have thought that with enough drive space even random data (which is what encrypted data should look like) will start to have a few patterns.
The problem is that the number of possible permutations balloons with just one additional byte. Each one multiplies the total possible combinations by 2^8, or 256. Put in perspective, to actually store the two-byte words of every single 16-bit possibility from 0 to 65535 would require 2 x 65536B, or 128KiB (this from just 16 bits--double it and you leapfrog Mebibyte into Gibibyte territory).
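The figures above are quick to check. The helper below (a made-up name) just multiplies the number of distinct n-bit words by the bytes needed to hold one of them.

```python
def bytes_to_store_all_words(bits):
    """Bytes needed to store every possible value of a word of the given width."""
    assert bits % 8 == 0, "sketch assumes whole-byte word sizes"
    return (2 ** bits) * (bits // 8)


KIB = 2 ** 10
GIB = 2 ** 30

sixteen_bit = bytes_to_store_all_words(16)     # 131072 bytes = 128 KiB
thirty_two_bit = bytes_to_store_all_words(32)  # 17179869184 bytes = 16 GiB
```

At the scale of a 4 KiB disk block there are 2^32768 possible values, so random-looking blocks from different users will not repeat by chance in any storage that could be built.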
Just keep sending the data round and round "the internet". Using the cache on all the servers and routers it passes through to store it. Then when you want to retrieve it, just wait until it comes round through your servers on its next "orbit"
It's a mashup of mercury delay line memory, logistics companies using their lorries as warehouses (while they're on the road, delivering your stuff) and the standard internet/cloud marketing BS.
although such capacity is rather small.
"Start-up Bitcasa invites you to shove all your data into the cloud"
And I invite Bitcasa to shove their idea into another place where the sun doesn't shine.
Maybe they are doing dedupe on the client, then encrypting the deduped blocks and only sending the unique blocks?
If they dedup at the hard drive level then there are only so many sequences of bytes you can have. If they do not know what is file and what is directory then your client must hold all of that information & they would not be able to use hash keys to identify files.
With several Petabytes of data even "random" content such as encryption is going to have a few matching patterns; true they cannot identify that you and your Facebook friend have the same slide in a power point slide but they don't need to. I don't see why every client needs to share any keys, I think the power point analogy in the article has sent people down the wrong mental street.
Read more about the pigeonhole principle: in short, even with 512-bit hashes you can have collisions that don't translate to two identical blocks of data.
And with petabytes (if not exabytes) of data hash collisions go down from "improbable" to "likely".
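How likely such collisions are can be estimated with the standard birthday approximation, p ≈ 1 - exp(-n(n-1)/2^(b+1)) for n blocks and a b-bit hash. The sketch below assumes an ideal hash, 4096-byte blocks and one exabyte (10^18 bytes) of stored data; all three are illustrative assumptions. Under them the estimate for wide hashes stays vanishingly small, while a narrow 32-bit hash collides almost certainly.

```python
import math


def collision_probability(num_blocks, hash_bits):
    """Birthday approximation of the chance that any two blocks share a hash."""
    x = num_blocks * (num_blocks - 1) / 2 ** (hash_bits + 1)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-x)


EXABYTE = 10 ** 18
blocks = EXABYTE // 4096  # number of 4096-byte blocks in one exabyte

p_512 = collision_probability(blocks, 512)
p_256 = collision_probability(blocks, 256)
p_32 = collision_probability(blocks, 32)
```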
Anyone find the patents they have filed, I'm missing all 20. Are my search skills that bad or !!!!
...on the subject of infinite space availability... They don't have to actually PROVIDE infinite capacity, since they'll have finite customers. They just need to have enough capacity per-customer to meet the average customer - who probably needs very little storage. So they have to add the drive space (and infrastructure) for, say, 50gb for every new customer - or whatever it averages out to; your usual person hasn't got 100gb of MKVs - and really that's pretty cheap. You amortize it and it comes out OK, I bet.
You can make educated guesses about the transfer costs, too; they already say that high-volume stuff is 'cached' on the local drive. So you store DSCN4012.JPG for six months and they download it to show grandma. You've used 4mb of bandwidth for that chunk for a year.
As long as you can scale your storage to match each customer, and your averages are correct (and presumably they'll get better the more customers you have) there isn't any reason you shouldn't have effectively infinite storage. The service won't be practical for storing big-ass media libraries unless you have unlimited, *fast* network; it won't be practical for things like movie or audio editing or game development; it won't be practical for loading Crysis Tournament and Conquer of War: Lost Coast.
So what's going to go on there? Pictures of grandma and the cat, email folders, Word docs, and the odd music collection - tiny, in the scheme of things.
I don't see why the storage aspect won't work. Encryption, of course, is another matter.
Someone is selling quantum desktops and no one told me? Mean...
Yep, I signed up and got the same "back of queue" message.
It looks as if they failed to plan for demand, so goodness knows what will happen when people start uploading files. Encrypted files that won't de-dupe...
Handy, because I really need to store an 11-dimensional model of the quantum state of every particle in the universe right now.
You know soon as you put a limit on anything in the universe it seems to somehow spell out a missing level you need to add.
1) Offer infinite storage.
2) Renege on infinite storage.
3) Hope that customer has enough crap stored on your servers that it's infeasible for them to jump to storage 2.0 company n+1 currently at phase 1.
Infinite storage is impossible
However there will be limited bandwidth, and limited time.
Like an unlimited mobile phone contract, you cannot talk for more than (60*24*31=) 44640 mins a month, and most people (they hope) talk for about 150-600.
Unlike voice, you can throttle bandwidth for heavy users (as isps do) - or ask heavy users to pay more for more bandwidth - but the storage is still free :)
Deduplication I leave to others, but would wonder how many of us encrypt our MP3/4 collections.
Of ~60GB of files, I have 30GB mp3, 20GB mp4/avi, 8GB photos and 2gb other
So, 1/6th cannot be deduped.
Getting clever, you could look at my music collection and make good guesses on things I would like if there was enough of a user base. Privacy is a subject for historians to invade.