Reader suggestion: Using HDFS as generic iSCSI storage

El Reg reader Greg Koss reckons the Hadoop filesystem can be used to build an iSCSI SAN. Er, say what? Here's his thinking: we could use Hadoop to provide generic storage to commodity systems for general purpose computing. Why would this be a good idea? Greg notes these points: Commodity storage pricing is roughly $60/TB in …

  1. casaloco

    Want

    Something every small business has needed for some time. Quality reliable shared network drives... in the cloud.

  2. Steve Loughran

    HDFS: the subset of posix it implements

    HDFS lacks a couple of features which people expect from a "full" posix filesystem

    1. The ability to overwrite blocks within an existing file.

    2. The ability to seek() past the end of a file and add data there.

    That is: you can only write to the end of a file, be it new or existing.

    Ignoring details like low-level OS integration (there's always the NFS gateway for that), without the ability to write anywhere in a file, things are going to break.
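
    To make that concrete, here is a minimal sketch against the stock Hadoop FileSystem API (the path is made up, and it assumes a cluster reachable via the usual core-site.xml): you can create, append and do seeked reads, but output streams have no seek(), so rewriting bytes in place isn't even expressible.

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnlyDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // picks up core-site.xml / hdfs-site.xml
            Path p = new Path("/demo/data.bin");                 // hypothetical path

            // Writing: create() and append() are the only options, and both write at the end.
            try (FSDataOutputStream out = fs.create(p, true)) {
                out.write(new byte[]{1, 2, 3});
            }
            try (FSDataOutputStream out = fs.append(p)) {        // append must be enabled on the cluster
                out.write(new byte[]{4, 5, 6});                  // lands after the existing bytes
            }
            // FSDataOutputStream has no seek(): there is no call that rewrites byte 0 in place.

            // Reads, by contrast, are fully seekable.
            try (FSDataInputStream in = fs.open(p)) {
                in.seek(3);
                System.out.println(in.read());                   // prints 4
            }
            fs.close();
        }
    }
    ```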

    There's also a side issue: HDFS hates small files. They're expensive in terms of metadata stored in the namenode, and you don't get much in return. It's why filesystem quotas include the number of entries (I believe; not my code).

    What then? Well, I'd be interested in seeing Greg's prototype. Being OSS, Hadoop is open to extension, and there's no fundamental reason why you couldn't implement sparse files (feature #2) and writing within existing blocks via some log-structured, writes-are-always-appends thing: it just needs someone to do it. Be aware that the HDFS project is one of the most cautious groups when it comes to changes; protection against data loss comes way, way ahead of adding cutting-edge features.
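
    As a toy illustration of that log-structured, writes-are-always-appends idea (nothing from the Hadoop codebase, just the shape of the trick): every logical block write is appended to a log, and an index remembers which log record holds the latest version of each block.

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Toy log-structured block store: every "write" is an append; reads go via an index. */
    public class LogStructuredBlocks {
        static final int BLOCK_SIZE = 4096;

        private final List<byte[]> log = new ArrayList<>();       // stand-in for an append-only HDFS file
        private final Map<Long, Integer> index = new HashMap<>(); // logical block -> latest log record

        public void writeBlock(long blockNo, byte[] data) {
            if (data.length != BLOCK_SIZE) throw new IllegalArgumentException("fixed-size blocks only");
            log.add(data.clone());              // append-only: earlier records are never touched
            index.put(blockNo, log.size() - 1); // newest version wins
        }

        public byte[] readBlock(long blockNo) {
            Integer pos = index.get(blockNo);
            return pos == null ? new byte[BLOCK_SIZE]  // unwritten blocks read as zeros
                               : log.get(pos).clone();
        }

        public static void main(String[] args) {
            LogStructuredBlocks store = new LogStructuredBlocks();
            byte[] v1 = new byte[BLOCK_SIZE]; v1[0] = 1;
            byte[] v2 = new byte[BLOCK_SIZE]; v2[0] = 2;
            store.writeBlock(42, v1);
            store.writeBlock(42, v2);                    // an "overwrite" is just another append
            System.out.println(store.readBlock(42)[0]);  // prints 2
        }
    }
    ```

    In a real system the log would live in append-only HDFS files, the index would be persisted or rebuilt by replaying the log, and a compaction pass would reclaim superseded records.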

    Without that, you can certainly write code to snapshot anything to HDFS; in a large cluster they won't even notice you backing up a laptop regularly.
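
    That much really is only a few lines against the existing API. A hedged sketch, with made-up paths and a naive dated-directory layout (and, per the small-files caveat above, you'd want to tar up anything fiddly first):

    ```java
    import java.time.LocalDate;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LaptopBackup {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // One dated directory per run: whole files are copied in and nothing is
            // updated in place, which is exactly the access pattern HDFS is happy with.
            Path dest = new Path("/backups/laptop/" + LocalDate.now());
            fs.mkdirs(dest);
            fs.copyFromLocalFile(false, true, new Path("/home/me/documents"), dest);
            fs.close();
        }
    }
    ```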

    One other thing Hadoop can do is support an object store API (Ozone is the project there), so anything which uses object store APIs can work with data stored in HDFS datanodes, bypassing the namenode (not for failure resilience, but to keep the cost of billions of blob entries down). Anything written for compatible object storage APIs (I don't know what is proposed there) could back up to HDFS without any expectation that this is a posix FS (which it isn't: I regularly have to explain to people why Amazon S3 isn't a replacement for HDFS).

    To close then: if this is a proof of concept, stick the code on github and everyone can take a look.

    stevel (hadoop committer)

    1. Alistair
      Linux

      Re: HDFS: the subset of posix it implements

      Dunno *why* it waited for me to get you an upvote, Steve.

      I've *read* part of the code, since there is an app team in my shop that thinks HDFS is a good place for SAS temp files and report data.

      Update and edit are not options on HDFS at large scale. I noticed the hook for object stores, but even there it looks to me like it's more WORM-style than anything else.

      Explaining that HDFS doesn't do these things takes some time - especially with some of the moronic marketing crap out there. (I've seen stuff that should have made the sales techs blush..... and I think I may have made one or two of them actually blush.)

    2. Dr. Mouse

      Re: HDFS: the subset of posix it implements

      Disclaimer: I know nothing about HDFS. However:

      The only one I can see as an issue with the idea he suggests is #1. He is suggesting iSCSI use, in which case we are talking about a disk image. Therefore small files, seeking past the end (unless you are creating sparse images) and the other points you mention don't apply. We are talking about exporting a single large file over iSCSI to be used as a disk on another system, which will partition and format it with its own filesystem.

  3. Matt Piechota

    Since iSCSI is block based, we're talking about storing a ton of (say) 4k blocks in Hadoop and the iSCSI service asks for or updates block #12309854? It doesn't sound terrible. I could see some concurrency issues with multiple iSCSI services, but I'm not sure they're worse than iSCSI to traditional block storage as they'd still need a cluster-aware file system on the clients.

    The big question is: how/why is this better than something like DRBD?
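
    For the sake of argument, a sketch of that block-numbering idea: map each run of 4k blocks to its own fixed-size chunk object, so a request for block #12309854 resolves to one modest chunk rather than a single multi-terabyte image. The chunk size, paths and layout here are invented, and a real iSCSI target daemon would sit in front of whatever does the actual chunk I/O.

    ```java
    /** Toy mapping from an iSCSI logical block address (LBA) to a chunk file and offset. */
    public class ChunkLayout {
        static final int BLOCK_SIZE = 4 * 1024;        // the iSCSI-visible block size
        static final int BLOCKS_PER_CHUNK = 16 * 1024; // 64 MiB chunks, roughly one HDFS block each

        static String chunkPath(long lba) {
            return String.format("/volumes/vol0/chunk-%08d", lba / BLOCKS_PER_CHUNK); // hypothetical layout
        }

        static long offsetInChunk(long lba) {
            return (lba % BLOCKS_PER_CHUNK) * (long) BLOCK_SIZE;
        }

        public static void main(String[] args) {
            long lba = 12_309_854L;                    // the block number from the comment above
            System.out.println(chunkPath(lba) + " @ " + offsetInChunk(lba));
            // prints /volumes/vol0/chunk-00000751 @ 22405120
        }
    }
    ```

    The catch, per the earlier comments, is that HDFS won't let you rewrite a chunk in place for every 4k update: you either rewrite whole chunks, go log-structured, or use a block store that was built for this, which is roughly where the DRBD comparison comes in.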

  4. Henry Wertz 1 Gold badge

    Seems reasonable to me

    Stevel's comments might be a real problem -- in particular, overwriting blocks within an existing file might be something a lot of software expects to be able to do. And if you have disk images served via iSCSI, I would think this software would be particularly likely to want to scribble into either a disk image or a differences file.

    Otherwise, I think this sounds quite reasonable -- why spend the huge bucks on specialized SAN hardware when this will do the same thing? The one reason ordinarily would be stability, but a) HDFS is proven software with known stability and b) SAN vendors have flubbed it now and then too.

  5. Nate Amsden

    sounds terrible

    Reminds me of a suggestion my VP of technology had at a company I was at several years ago: he wanted to use HDFS as a general purpose file storage system. "Oh, let's run our VMs off of it, etc." I just didn't know how to respond to it. I use the comparison of someone asking for motor oil as the dressing for their salad. I left the company not too long after, and he was more or less forced out about a year later.

    There are distributed file systems which can do this kind of thing but HDFS really isn't built for that. Someone could probably make something work but I'm sure it'd be ugly, slow, and just an example of what not to even try to do.

    If you want to cut corners on storage, there are smarter ways to do it and still have a fast system; one might be to just ditch shared storage altogether (I'm not about to do that, but my storage runs a $250M/year business with high availability requirements and a small team). I believe both vSphere and Hyper-V have the ability to vMotion without shared storage, for example (maybe others do too).

    Or you can go buy some low cost SAN storage; not everyone needs VMAX or HDS VSP type connectivity. Whether it's 3PAR 7200, or low end Compellent, Nimble or whatever... lots of lower cost options available. I would stay away from the real budget shit, e.g. HP P2000 (OEM'd Dot Hill, I believe), just shit technology.

  6. Ben Liddicott

    Then provision disks from the SAN and add them to the Hadoop cluster

    Rinse and repeat. Hey presto!!! Infinite storage!!!!

  7. SecretBatcave

    Yes, but what do you want to do with it?

    I mean, let's be honest, HDFS isn't a generalised block storage system. It's not even particularly well designed for the job it's intended to do.

    If you want to cluster bits of block storage into one name space, there are many, many better ways of doing it.

    For a start HDFS only really works for large files, large files that you want to stream. Random IO is not your friend here. So that makes it useless for VMs.

    If you want VM storage, and you want to host critical stuff on it you need to do two things:

    * capex some decent hardware (two 4U 60-drive iSCSI targets) and let VMware sort out the DR/HA (which it can do very well): 60 grand for hardware plus licences. That'll do 300k IOPS and stream 2-3 gigabytes a second.

    * capex some shit hardware, ZFS, block replicate, and spend loads of money on staff to support it. And the backups.

    Seriously, there are some dirt cheap systems out there that'll do this sort of thing without the testicle ache of trying to figure out why your data has been silently corrupting for the last 3 weeks and your backup has rotated out any good data.

    So you want a custom solution:

    1) GPFS and get support from pixit (don't use IBM, they can hardly breathe they are that stupid) <-fastest

    2) try ceph, but don't forget the backups <- does fancy FEC and object storage

    3) gluster, but that's just terrible. <- supported by Red Hat, but lots of userspace bollocks

    4) lustre, however that's a glorified network RAID0, so don't use shit hardware <- really fast, not so reliable

    5) ZFS with some cheap JBODs (zfs send for DR) <- default option, needs skill or support

    Basically you need to find a VFX shop and ask them what they are doing. You have three schools of thought:

    1) netapp <- solid, but not very dense

    2) GPFS <- needs planning, not off the shelf, great information lifecycle and global namespacing

    3) gluster <- nobody likes gluster.

    Why VFX? Because they are built to a tight budget, to be as fast as possible and as reliable as possible, because we have to support that shit and we want to be in the pub.

  8. Dan 10

    I wondered recently whether HDFS would provide commodity iSCSI storage for an ETL staging layer using an extract-once, read-many approach, but ISTR the random reads and variable file sizes (and other stuff I can't recall) didn't make it very friendly. The more obvious GPFS appears to be a better choice.

  9. Tom 64

    Fine, if you don't care about performance

    HDFS is dog slow. I mean really, really sloooooow.

    I'd probably not consider using it like this; you're much better off with a dedicated NAS unit with RAID if you care at all about performance.

    1. Alistair
      Coat

      Re: Fine, if you don't care about performance

      HDFS is "slow" in certain contexts, but when you use it the way it was designed, it actually rocks. Its the right tool for certain types of work. Sadly most folks think its just a file system.

  10. Anonymous Coward

    Why go from file system to blocks to (on the client) file system? Seems a bit inefficient to me, and I would probably go for straight file level protocols right away and "do away" with some of the complexity.

    If block level protocols are a must, I'd say that there are way better choices for the backend. Just go with the standard file system of the server, or a fancy one like ZFS that has lots of bells n' whistles.

    When it comes to backup & DR, I'd look into whether that could be solved at the application level, or whether the server has some tools that could be used.

    To say anything useful and not overly generic like the above, one would need more information about the environment and the setup.

  11. Anonymous Coward

    Wow, not too many people took the bait and responded to something that is an obviously bad idea.

  12. joe.harlan22
    Thumb Down

    Wrong tool for the job!!!

    Square peg in a round hole much? Honestly, just because HDFS is a great data-hammer doesn't mean all your data is shaped like a nail. I would keep this idea locked firmly in a lab and take it no further.

  13. Anonymous Coward

    Maybe not HDFS

    Depending on your use case maybe Ceph, Gluster, Swift, or Scality would work better. Ceph does block if that's a requirement, but all of these are commodity server based storage grids.

  14. dellock6

    I vote too for Ceph for this kind of use case; it's designed to support (among other things) block devices and has a lot of implementations in production environments. Sorry, but this one with HDFS sounds more like "I want to do something different from anyone else".

  15. ntevanza

    Location, location

    Do VSANs have Hadoop-style location-aware redundancy? If not, why not?

    1. Anonymous Coward

      Re: Location, location

      Ceph does; you specify in the CRUSH map where you want replicas placed. The CRUSH map is a tree whose leaves are ultimately individual HDDs/SSDs. Branches represent things like data centres, buildings, rooms, racks, nodes … whatever your physical topology is.

      The default is to consider all nodes as branches of the root, and the disks as leaves off those branches. It'll pick a branch (node) that presently does not hold a copy of the object to be stored, then choose a leaf (disk). It repeats this for the number of replicas.
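
      For illustration only (this is not Ceph's actual CRUSH algorithm, just the placement rule described above in miniature): pick each replica from a different branch of a topology tree, deterministically from the object name, so no two copies share a failure domain. The topology and names are made up.

      ```java
      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;

      /** Toy location-aware placement: one replica per failure domain (rack), chosen deterministically. */
      public class ToyPlacement {
          // failure domain -> disks in that domain; an invented topology
          static final Map<String, List<String>> TOPOLOGY = new LinkedHashMap<>();
          static {
              TOPOLOGY.put("rack1", List.of("rack1/node1/sda", "rack1/node2/sdb"));
              TOPOLOGY.put("rack2", List.of("rack2/node3/sda", "rack2/node4/sdc"));
              TOPOLOGY.put("rack3", List.of("rack3/node5/sdb"));
          }

          static List<String> place(String objectId, int replicas) {
              List<String> domains = new ArrayList<>(TOPOLOGY.keySet());
              List<String> chosen = new ArrayList<>();
              int start = Math.floorMod(objectId.hashCode(), domains.size());
              for (int r = 0; r < Math.min(replicas, domains.size()); r++) {
                  List<String> disks = TOPOLOGY.get(domains.get((start + r) % domains.size()));
                  chosen.add(disks.get(Math.floorMod(objectId.hashCode(), disks.size()))); // one disk per domain
              }
              return chosen;
          }

          public static void main(String[] args) {
              // Three replicas of one object land on three different racks.
              System.out.println(place("rbd_data.volume0.block42", 3));
          }
      }
      ```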

  16. david@stanaway.net

    TwinStrata

    Sounds like TwinStrata, and other cloud tiering appliances.

  17. Anonymous Coward

    other better ways to solve the problem

    iSCSI is a BLOCK protocol

    Look at the scale-out block folks - Ceph, SwiftStack or MapR.

    HDFS is a file protocol; it was built to ingest web-crawling info and map-reduce it for web search.

    For that task it works great.

    HDFS is woefully short on data protection, because it was built on the assumption that if data was lost it would be refreshed from the upstream source (like another web crawl).

    HDFS zealots will say "rep count 3 cures every data protection situation", but it doesn't.

    don't get me started on Java behavior during node fail.
