Poll: Linux's big data guzzling worries melt away

Concerns about using Linux on servers to crunch huge data workloads are evaporating, according to a survey. Fears that the open-source operating system isn't up to the job of processing big data have fallen by 40 per cent in the last 12 months, according to the Linux Foundation's annual Enterprise Linux report. Last year 20.3 …

COMMENTS

This topic is closed for new posts.
  1. Chemist

    CERN

    The LHC is going to generate 15PB of data per year.

    Their GRID computer :-

    Number of machines: 14,972 processors with 64,623 cores running Linux

    Some facts about storage capacity (November 2011):

    Tape: 41 PB of data, including non-LHC data. (Source)

    Disk: 62K hard drives with a total capacity of 62660 TiB. RAID is used to increase redundancy and stability. (Source)

    That sounds like quite big data

    1. Charles Manning
      Linux

      Nice numbers

      But people often forget that most Linux systems are running on smartphones and similar.

      Linux is dominated by ARM CPUs with flash storage.

      If half a million Linux devices are activated every day, and each one has 1G of storage, then that's half a PB of Linux data deployed per day.

      As for programmers... anyone who can program for Windows can easily learn to program POSIX.

      1. Anonymous Coward
        Mushroom

        NO!

        We don't want them to because, for the most part, they suck.

      2. Chemist

        "If half a million Linux devices are activated every day"

        Sorry, I find that a poor example - the data isn't "under one roof" and available to anyone on the CERN network.

  2. Gordon 10
    WTF?

    Confuzzled

    What's the definition of big data in this instance? My usual interpretation is Hadoop and its relatives and offspring. But this seems more like a sciencey HPC definition.

    Why was there ever any doubt that Linux could handle these loads? It's working for GooHoo and others. It's not as if there are many credible alternatives in this space, unless you count explicit HPC OSes (which I'm not doing).

    What an odd survey.

  3. Peter Gathercole Silver badge
    Happy

    I'm all for Linux in the datacentre...

    ...but (and you knew it was coming), I find that as a sysadmin looking after Linux and UNIX systems, I get much better feedback about the health and stability of my UNIX systems than I do from Linux.

    It's fine as long as everything is running well, but when things start going wrong, the proprietary error logging extensions present in most UNIXes make it much easier to spot and fix problems than on Linux running on generic hardware.

    Such things as ECC memory having to fix memory corruption errors, or disks having to re-read data multiple times or relocate sectors, or CPUs taking a re-startable check-stop. UNIX (in this case AIX) tells me this is happening, even if the system keeps running. If I'm lucky on Linux, I may be able to find out about disk errors by examining the S.M.A.R.T. disk interface, but I would not want to have to do this for all 3000+ disks in the environment I currently work in. And the other errors....
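    A rough sketch of what that S.M.A.R.T. polling looks like when scripted (smartmontools assumed; the device names are illustrative):

        #!/usr/bin/env python3
        # Rough sketch: poll SMART health for a set of disks via smartctl.
        # Assumes smartmontools is installed; the /dev names are illustrative.
        import subprocess

        disks = ["/dev/sda", "/dev/sdb"]  # in practice, enumerate /sys/block

        for dev in disks:
            # 'smartctl -H' prints the drive's overall health assessment;
            # a non-zero exit status signals trouble (see smartctl(8)).
            result = subprocess.run(["smartctl", "-H", dev],
                                    capture_output=True, text=True)
            if result.returncode != 0:
                print(f"WARNING: {dev} failed its SMART health check")
                print(result.stdout)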

    Of course, if you are in a cloudy environment which is designed to be able to cope with systems falling out of the environment (e.g. Google, Amazon), then this may not really be a big problem, and that is probably where Linux is gaining acceptance.

    1. GreenOgre

      The data is there, you just have to ask for it!

      The data is there, you just have to ask the system.

      WMI, IPMI and good ol' SNMP (depending on the h/w manufacturer's tools) are the languages to talk.

      A solid tool like Zabbix or Nagios will do the heavy lifting for you and combine ALL your monitoring, trending and alerting into one open system.
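      For anyone curious what those tools actually run underneath: a check is just a small program that prints one status line and exits 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal sketch in that style (the load-average thresholds are invented):

          #!/usr/bin/env python3
          # Minimal sketch of a Nagios-style check plugin: print one status
          # line, exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN (the plugin
          # API convention). The thresholds below are invented.
          import os
          import sys

          WARN, CRIT = 4.0, 8.0  # 1-minute load average thresholds (assumed)

          try:
              load1, _, _ = os.getloadavg()
          except OSError:
              print("UNKNOWN - load average not available")
              sys.exit(3)

          if load1 >= CRIT:
              print(f"CRITICAL - load {load1:.2f}")
              sys.exit(2)
          if load1 >= WARN:
              print(f"WARNING - load {load1:.2f}")
              sys.exit(1)
          print(f"OK - load {load1:.2f}")
          sys.exit(0)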

      1. Anonymous Coward
        Anonymous Coward

        OK, so where can those interested read more about this? Suggestions welcome. I'm aware of WMI, SNMP, Nagios, and monitoring things like system utilisation, access to SMART for cheap disks, etc (but not seen Zabbix till just now).

        I use Linux at home and at work (sometimes SuSE, mostly Fedora, occasionally Ubuntu) and have done for years, but not in datacentre-type apps, and like the earlier commenter I was also unaware that it had enterprise-class robustness and manageability.

        In fact, at one point I looked briefly at OpenSolaris specifically because it seemed to have better manageability for this kind of thing than the typical off-the-shelf Linux distro (maybe real RHEL is better?).

    2. Microphage

      Linux not for the datacentre.

      This is from Sep 2010, maybe they've fixed it since then.

      "Linux In The Data Center: How Does It Fit?"

      http://www.crn.com/news/applications-os/227400131/linux-in-the-data-center-how-does-it-fit.htm;jsessionid=m3xsUuaiF85Cc615kCzTWQ**.ecappj02

      What's a 're-startable check-stop'?

      1. Anonymous Coward
        Anonymous Coward

        I don't like that article much.

        Anything that starts with Ubuntu and recommends Ubuntu for serious use in comparison with SuSE and RHEL (even if only for SMEs) is a bit dubious in my book.

        Any author that can make the claim "Red Hat isn't turnkey. The company doesn't try to pitch its solutions for small businesses." without noting that quite a few SMEs out there are using RHEL (or its derivatives) supported by other small businesses (maybe SME-specific IT support outfits, like, er, some of the better resellers who should be among CRN's target audience) isn't very credible.

        It mentions shiny new stuff (the cloud) without mentioning the basics (is my hardware failing? Are there any system-level trends I need to address, e.g. utilisation, disk space, performance?).

        Article rating: unsatisfactory. I could have done better (given time) and I'm not even in that sector of the business these days.

        "What's a 're-startable check-stop`?"

        I couldn't quickly spot that phrase in the article (and the search engine we all love finds this thread as the first reference) but I would ass*u*me it's intended to be a reference to checkpoint and restart technology, which means that an application, application suite (or even complete computer system) persistently saves ('checkpoints') its internal state from time to time, so that it can be restarted (resumed) from the last checkpoint if an error should occur, rather than going right back to the beginning of the application run.

        Why? Because if you've a processing job that takes a long time (many hours?) to run, and there is a significant risk of something misbehaving during the run, you may not want to have to start again from the top when it does. Like rolling back a database transaction, except on a wider scale.

        This kind of capability has been touted in some enterprise apps and some long-running technical apps for many years. It isn't actually easy to do well on a system-wide basis i.e. within the OS, although if I remember rightly, Tandem/Compaq's NonStop Clusters for SCO UNIX (?name?) seemed to make a good job of it on paper (I never saw it first hand).
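        The application-level flavour is simple enough to sketch - save your state every so often, and on startup resume from the last saved copy if one exists (a toy example; the checkpoint file and workload are invented):

            #!/usr/bin/env python3
            # Toy checkpoint/restart: persist loop state periodically so a
            # crash resumes from the last checkpoint instead of from zero.
            # The checkpoint path and 'work' are invented for illustration.
            import os
            import pickle

            CHECKPOINT = "job.ckpt"

            # Resume from the last checkpoint if one exists, else start fresh.
            if os.path.exists(CHECKPOINT):
                with open(CHECKPOINT, "rb") as f:
                    state = pickle.load(f)
            else:
                state = {"step": 0, "total": 0}

            while state["step"] < 1_000_000:
                state["total"] += state["step"]  # stand-in for real work
                state["step"] += 1
                if state["step"] % 10_000 == 0:
                    # Write then rename, so the checkpoint is replaced
                    # atomically and never left half-written.
                    with open(CHECKPOINT + ".tmp", "wb") as f:
                        pickle.dump(state, f)
                    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

            print("done:", state["total"])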

        These days the HYPErvisor fans might suggest that checkpoint/restart functionality belongs in the HYPErvisor rather than the application or the base OS, and they might have a point in some cases.

        1. Peter Gathercole Silver badge

          @AC

          A 'checkstop' is a detected CPU failure (such things as an internal register parity error). A re-startable checkstop is one where the instruction being executed can be restarted from the beginning in order to retry the instruction.

          This may be IBM only terminology but I'm surprised you cannot find it in search engines. If you search for 'checkstop', 'powerpc' and 'restart' you will find references, and it is used when discussing mainframe and Power processors.

          From experience, some processors either crash or silently return incorrect results to the application (I've not looked after systems with very recent Intel or AMD processors, so the hardware capability may now be in these). IBM hardware will attempt to re-run the instruction, and if it still generates an error, will de-configure the CPU (if it is a multi-CPU system) while still allowing the system to run. It will probably kill the process that was running when the checkstop happened, but the system will keep running. But even re-startable checkstops are reported through to the error log to warn you that there may be a hardware problem creeping into a system.

          I agree that this was not mentioned in the original article, but I was commenting on my perception that none of the Linux distributions I have seen have the same degree of RAS as the proprietary UNIX systems out there.
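          (For completeness: on kernels with EDAC support, Linux does expose corrected/uncorrected memory error counters through sysfs, which gets part of the way there. A rough sketch - the paths follow the documented EDAC layout, and a single controller mc0 is assumed:)

              #!/usr/bin/env python3
              # Rough sketch: read corrected/uncorrected ECC error counters
              # from the Linux EDAC sysfs interface (requires a kernel and
              # driver with EDAC support). Controller 'mc0' is assumed; real
              # systems may expose several mc* directories.
              from pathlib import Path

              mc = Path("/sys/devices/system/edac/mc/mc0")

              try:
                  ce = int((mc / "ce_count").read_text())  # corrected
                  ue = int((mc / "ue_count").read_text())  # uncorrected
              except OSError:
                  raise SystemExit("EDAC counters not exposed on this host")

              print(f"corrected: {ce}, uncorrected: {ue}")
              if ce or ue:
                  print("memory errors logged - worth checking the DIMMs")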

  4. James 100

    Like the previous posters, I find any such "doubts" rather archaic by now. There's that bunch in Mountain View, with billions of web pages indexed and cached as well as all the Gmail/Picasa/Appengine/Google Storage stuff, a few online backup companies with tens or hundreds of petabytes each, and Amazon holding around half a trillion files in S3 with a few hundred thousand requests per second. Even Microsoft have occasionally been "caught" using it behind the scenes for things like DNS - and there's that little "Akamai" bunch in MA who are serving up 14-15 million web requests a second...

    Do the remaining people with doubts whether the software powering the biggest server farms on earth can do the job it's already been doing for years also have doubts about this new-fangled "elec-tricky" stuff?

    1. Anonymous Coward
      Anonymous Coward

      Quoting Goofle as an example of good engineering/IT practice is very high risk.

      Who cares if their adverts don't get served occasionally? If (or more accurately when) their free email goes astray, what's anyone going to do about it?

      Stock Exchanges are starting to use Linux instead of Tandem NonStop, instead of VMS, and indeed famously instead of Windows (in London, for example).

      When payroll routinely runs on Linux, then I'll be happy. (Yes, I'm aware of Clockwork/PayThyme, thank you; the point is it needs to be ubiquitous.)

    2. LaeMing
      Happy

      Electrickery!

      Give them a break! They are only just coming to terms with the 'Telling-Bone'.

  5. Anonymous Coward
    Anonymous Coward

    It is an odd survey

    Like Gordon 10 says, there's *never* been any real or serious doubt (other than by MS customers being terrified by sales rep generated FUD) that Linux can do big data. It powers the world's most powerful supercomputers after all.

    In huge Linux farms it has been used to crunch staggering amounts of data and even in render farms for the film industry for ages.

    What a silly survey.

    1. LaeMing
      Go

      Well, then the survey is at least showing the MS-Sales-rep FUD-machine is slipping.

  6. DJ Smiley
    IT Angle

    "RHEL 6 was updated to exploit latest multi-core chips, and support 4,096 cores per system image and addressable memory of 1TB – up to 64 cores and 64TB on RHEL 5. RHEL 6 also switched virtualisation technologies from Xen to KVM."

    I think you mean "up from" .. "64GB on RHEL 5"

  7. I_am_Chris

    I don't get it

    What on earth is this article comparing Linux to?

    The only other OS specifically mentioned is windows. There's no effing way that that is better than Linux at handling big data.

    Fail elReg!

  8. Puff65537
    Mushroom

    If you get your pay as a "check" then most likely it comes from ADP, and they have six Linux openings currently, ranging from entry level to senior positions. Note that they pay for high-end hardware (IBM Z).

  9. Christian Berger

    It's hard to believe

    It's hard to believe anybody would even consider using for big data an operating system that, until a few years ago, was still delivered with its main text editor unable to handle files larger than 64k.

This topic is closed for new posts.
