Tell me, professor, what is big data?

Big Data may be misunderstood and overhyped - but the promise of data growth enabling a goldmine of insight is compelling. Professor Mark Whitehorn, the eminent data scientist, author and occasional Register columnist, explains what big data is and why it is important. Sometimes life is generous and hands you an unexpected gift …

COMMENTS

This topic is closed for new posts.
  1. Pete 2 Silver badge

    Signal to noise

    The primary characteristic of big data is that for any particular problem you wish to apply to it, the proportion of interesting information compared to the amount of garbage you have to sift through is very, very small. With old-fashioned databases, designed when storage was expensive, the intended use was to focus on a small number of very well defined questions: where does account number 8892784237 live? How many months in arrears are they? What was their latest purchase?

    While we're getting better at storing big data, which just increases the volume of stuff we collect, we're still in the infancy of extracting relevant, accurate and causally-linked intelligence from it (just ask the NSA or GCHQ). Sure, you can use it to foretell failures, a la HAL in 2001 - so long as your computer doesn't have an agenda of its own - but the number of false positives is very high. There is also the danger with BD of treating everyone as if they were Mr. or Mrs. Average and not designing enough flexibility into its reporting to allow for the possibility that maybe some people are different, or purposely give odd answers.

    So while BD probably has some advantages in industrial processes, and can feed back failure modes to manufacturers so they can design faults out of products, it also leads to an increasing homogenisation when applied to people. Until we can advance the data selection processes to match the data collection abilities there will always be the wrong attributes applied to the wrong people, just because we accidentally triggered something while buying granny's incontinence pants on Amazon.
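The false-positive point above can be made concrete with a little arithmetic. The following is a minimal sketch with purely hypothetical numbers (a 0.1% failure rate, a detector that catches 99% of real failures and wrongly flags 1% of healthy parts); none of these figures come from the article or the thread.

```python
def precision(prevalence, sensitivity, false_positive_rate):
    """Share of raised alarms that correspond to a real event (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Hypothetical inputs: 1 part in 1000 is really about to fail.
p = precision(prevalence=0.001, sensitivity=0.99, false_positive_rate=0.01)
print(f"{p:.1%} of alarms are real")
```

Under those assumptions only about 9% of alarms point at a genuine failure, even though the detector looks accurate on paper; the rarer the event, the worse the ratio gets.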

    1. AceRimmer1980
      Alert

      Re: Signal to noise

      "It is now possible to scan the sensor data looking for temperature and vibration patterns that are characteristic of an imminent component failure and get the part shipped and fitted before it fails."

      Dave, the ooo-arr-35 inside the combine blades is about to fail, I suggest you pop down and take a look.

      1. Pete 2 Silver badge

        Re: Signal to noise

        > Dave, the ooo-arr-35 inside the combine blades is about to fail, I suggest you pop down and take a look

        Gives a whole new meaning to getting bailed, er, baled out

      2. tony2heads
        Alien

        Re: Signal to noise

        Hal 9000 reported that, "I've just picked up a fault in the AE-35 unit. It is going to go 100 percent failure within 72 hours."

        Now what could possibly go wrong

  2. Tom 7

    No - big data is this

    Select 'something I can't put my finger on' from 'any number of disparate documents on any number of computers and hard drives somewhere uncontrolled and undocumented, and even if we do have it, it may actually be in a compressed jpg that someone scanned in, so we have to try and OCR all images'.

    Small data is the dot product of a few well organised tables stored on an old 386, or it would be but for MBAs and similar.

  3. Anonymous Coward

    Big Data is to Data as Cloud is to Network, i.e. a vague marketing idea which boils down to "buy our new stuff, it can do stuff better than the old stuff the other guys sell".

    As for "data is data", as the Economist style guide succinctly puts it: "Data are plural".

  4. Anonymous Coward

    I tend to agree with Tom 7

    'Big Data' is an expression of a problem, not a system.

    The article gives several examples of people who clearly know what they're doing, why they're collecting their data and how they process it. There's plenty of experience of this in manufacturing process control, machinery health monitoring and balancing (my own experience was in helicopters), and (as given in the article) microbiology. In all cases the data is manageable. The primary limitation is storage capacity, but generally that's more of a budgetary issue. This is nothing new and not worthy of a buzzword.

    'Big Data' is the problem of how to catalogue, index, and query data that isn't easily broken down. In that case the article does touch on one example - images - but similarly audio, text etc. are hard to categorise in a way that pre-empts how someone may want to retrieve or collate that information. Then there's the issue of businesses and institutions collecting massive amounts of data simply because they can, with no idea what to do with it. Or some vague notion that it may be useful. They just need a good slap.

    1. AceRimmer

      Re: I tend to agree with Tom 7

      How would you define the velocity, volume and variety of said slaps?

      1. Anonymous Coward

        Re: I tend to agree with Tom 7

        > How would you define the velocity, volume and variety of said slaps?

        African or European?

    2. Paul Kinsler
      Happy

      Re: the problem of how to catalogue, ...

      So if big data isn't "big" then "big data" is a pretty silly name for it. Since supposedly the phrase is used to indicate a soup of factoid elements with hard-to-discern structure, we need a better name. Random stuff assembled together in a way posing as something interesting? Hmm... the art world may have got there first ... how about we just call it "dada" (singular "dadum") instead?

  5. Deft
    Meh

    I'm scared of big data

    I work in the pharmaceutical industry and I have to admit I'm not sure I do a great job of managing my small data, never mind the big stuff.

  6. Colin Millar

    Big data

    Is just small data that you haven't described adequately (yet). The puffers and fluffers of BigData(tm) are always trying to suggest that there is some magic solution which will allow you clarity within your data masses without you having to go to the effort of applying yourself to trying to understand (a) the nature of the information you have and (b) the reason you need to understand it. These people are snake oil salesmen.

    Organising audio and video? Really? I suspect there are more ways of organising audio and video than there are people who own large amounts of audio and video. It is some of the most tagged, categorised and described information in existence. Just because the actual collection of information in a single video file is very complex doesn't make it difficult to handle. People tend to describe data in ways that are useful to them in their own application and to ignore the complexities that are not relevant to them.

    The combine harvester stuff is called data analysis and has been around for a while.

    1. jburnmurdoch

      Re: Big data

      "Organising audio and video? really? I suspect there are more ways of organising audio and video than there are people who own large amounts of audio and video. It is some of the most tagged, categorised and described information in existence."

      Unless I am mistaken, you appear to have missed the author's point, reading his reference to image and audio files as a statement about wanting to organise a music library.

      Tags, categories and 'organising' are exactly what the author is *not* talking about. Here's a real world example of what data scientists are doing with image data: automated analysis of satellite imagery in sub-Saharan Africa to identify bodies of stagnant water likely to be mosquito breeding zones, in turn allowing regional teams to better target their malaria control efforts. There's a little bit more to it than selecting all images tagged "malaria".

  7. Kubla Cant

    In the days before world+dog got into relational databases, data had to be stored in files. When databases started to become popular there was a notorious vendor nostrum: "Just put everything into a database, and then you can get whatever information you need out".

    To an extent, this was true. Extracting meaning from data stored in files was always difficult and resistant to ad-hoc queries. But the implication that a database is a sort of magician's bag, into which you dump masses of disorganised data and from which you pull meaning and truth, never was.

    The advocates of big data seem to be resurrecting the magic bag.

    1. Anonymous Coward

      Re: The advocates of big data seem to be resurrecting the magic bag.

      Perhaps we could be more charitable, and say that "big data" (or "dada" as above :-) is about learning to construct bags that are more useful ("magic") than ordinary dumb bags; or learning ways in which such magic might be gleaned. But then, are the advocates saying something that nuanced?

  8. Ian Michael Gumby
    Boffin

    Big Data is like Pornography...

    ... I knows it when I sees it.

    That's probably the best and most accurate description of Big Data.

    Yes, the 3 Vs were incomplete and there was a joke on LinkedIn about the missing Vs.

    (Or rather how many Vs we could add.)

    The truth is that, like "pornographic", "Big Data" is an amorphous description. You don't necessarily have to have volume to make it Big. You don't have to have velocity. Same for variety. You could have 1 of the 3, 2 of the 3 or all of the 3 and it would be big data. You could have a relatively small amount of data, slow velocity and coming from one source, yet you would require big data tools to solve problems against the data set.

    It's because of this that there isn't a set definition of what is or is not big data.

    BTW, the initial comment about pornography paraphrases US Supreme Court Justice Potter Stewart's "I know it when I see it" from Jacobellis v. Ohio (1964).

  9. Ian Michael Gumby

    Try and look at it this way...

    Look,

    Big Data is very simple.

    You work for company X.

    You make a widget or something. (It doesn't matter what it is.)

    In your daily business of building, storing, selling, shipping and supporting this widget, you generate data.

    Your customers and potential customers have heard your marketing buzz and their views of your website generate data.

    You want to better target your marketing message, so you purchase external data about your customer so that you can better tailor your pitch and increase your sales while reducing your effort and costs.

    So you have all of this data. Most of what you generate may or may not have enough value to cover the expense of capturing and storing it.

    Big Data is nothing more than the tools which let you effectively store and work with the data in hopes of creating value which you couldn't do using traditional tools.

    That's it.

    It's boring except for the fact that you're now managing clusters of machines and storing PBs of data.

    The trick is storing the data so that it's easy to access, and the data science to pull value from the data.

    The hard part is knowing which patterns are real and which are imaginary. ;-)

    1. amanfromMars 1 Silver badge

      Re: Try and look at it this way... for Bigger Picture Power Plays

      Quite so, IMG. Well said, Sir or Madam [well, who knows for real and for sure, what and/or who is sharing info and secrets here on the world wide web of networks and El Reg] ..... I nearly completely concur with all that you said, with just the one little caveat which make a big difference and completely fcuks up systems admins which trawl and try to secure data it has no intellectual property rights to, but would covet because of the creative disruption and even wanton destruction that its free sharing can undoubtedly deliver.

      IT's never ever boring and always fabulously exciting whenever you are managing clusters of virtual machines which are as programmed and programmeable humanised beings which pimp and pump future rich datasets with coded instructions, remotely and anonymously ...... stealthily for leading control of idiot command and intelligent community enterprises.

  10. ecofeco Silver badge

    GIGO

    If something isn't organized in the first place, be it data or ingredients for soup or a shed, it's not very useful to begin with.

    GIGO.

    It seems to be, first and foremost, a problem with people.

  11. Anonymous Coward

    Wrong?

    "This is because SQL is for set manipulation and sets are by definition unordered; there is no concept of sequential rows in a table."

    Really?

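    -- T-SQL (SQL Server). Drop the demo table only if it already exists.
    If Object_Id('dbo.T') Is Not Null Drop Table dbo.T
    Go
    Create Table dbo.T
    (
        I int Identity(1,1) Not Null Primary Key,
        R int Null,
        Z int Null,
        Y As (I * R) % 79 Persisted
    )
    Go
    Create Index IX_T_Y On dbo.T (Y Desc)
    Go
    -- Seed one row, then double the row count six times (64 rows).
    Insert T Default Values
    Insert T (R) Select Null From T
    Insert T (R) Select Null From T
    Insert T (R) Select Null From T
    Insert T (R) Select Null From T
    Insert T (R) Select Null From T
    Insert T (R) Select Null From T
    -- "Quirky update": the variable is decremented once per row visited,
    -- so R and Z record the order in which rows were processed.
    -- That order is observed behaviour, not something the engine guarantees.
    Declare @F Int = 45
    Update T Set @F = R = @F - 1
    Update A Set @F = Z = @F - 1 From T A With (Index = IX_T_Y)
    Select *
    From T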

    Not sure what database the author is using. Perhaps he should get a real one.

    What is a Data Scientist anyway?

    1. BlueGreen

      Re: Wrong?

      @AC I can see your point about using identity but what else are you saying with your other cols and their updating?

      As to the prof's points, for starters "but it wasn’t until 1993 that we even understood transactions properly." - that surprises me, an explanation or pointer would be very interesting.

      Ok, ordering: yes, it's a set but there's nothing in the relational model that prevents one from matching a row with the logically prior row (or indeed, *any* logically prior row - the most recent preceding is nothing special) with a perfectly normal select. One can use a vendor specific jobbie like identity (which, @AC, can cause non-sequential numbering in MSSQL, so let's imagine a fixed version of that), or one can make it a transactional read-increment-write. As these are logical solutions, we can use new semantically equivalent stuff like row_number() or lead/lag (as I understand them to work). Not fast enough? Add an index. Changes no semantics but makes it usable.

      No innate ordering is a strawman here.
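As a rough illustration of the "fixed version" of identity-style numbering mentioned above, the sketch below uses ROW_NUMBER() to derive a dense sequence over gappy ids and then matches each row with its logical predecessor. The table, column names and values are hypothetical, and SQLite (3.25 or newer, for window functions) stands in for whatever database is actually in use.

```python
import sqlite3

# Hypothetical data; ids are deliberately non-sequential, as an identity
# column can be after deletes or rolled-back inserts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100), (2, 120), (5, 90), (9, 150)])

# ROW_NUMBER() yields 1, 2, 3, ... in id order regardless of the gaps,
# so "previous row" becomes an ordinary equality join on rn.
query = """
WITH numbered AS (
    SELECT id, price, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM sales
)
SELECT cur.id, cur.price - prev.price AS change
FROM numbered AS cur
JOIN numbered AS prev ON cur.rn = prev.rn + 1
ORDER BY cur.id
"""
changes = conn.execute(query).fetchall()
print(changes)
```

The first row has no predecessor and so drops out of the inner join; a LEFT JOIN would keep it with a NULL change instead.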

      "This turned out to be the data which is simple (atomic)"

      AFAICT 'atomic' means 'atomic in your domain'. There's nothing in relational theory - at all as far as I know - that prevents larger lumps of data being stored in one field *provided* it's normalised according to that domain. Else you would not be storing a string but a column of characters to be normalised back into a string, because strings aren't atomic, they're made of char, yes? But in most domains, the value of a string is in its entirety, no less. If less, you haven't normalised properly.

      I'm also still puzzled at what big data is, by your definition.

      1. AceRimmer

        Re: Wrong?

        "Ok, ordering: yes, it's a set but there's nothing in the relational model that prevents one from matching a row with the logically prior row "

        Typically when you are comparing a row with "the next row" you have to define 2 sets, both containing the same data which you then join in such a way as to compare a row with its equivalent next row. With a self join method this would mean doing something like:

        select Sale1.Price - Sale2.Price
        from sales as Sale1
        inner join sales as Sale2
        on Sale1.ID = Sale2.ID + 1

        Sale1 and Sale2 might both contain data from the same table but they are defined as separate sets.

        This is NOT the same as defining something as

        Select current.Price - next.Price
        from Sales

        which something like MDX would allow you to do, as MDX works with ordered data sets.
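For comparison, here is a small runnable sketch of the self-join above (with the join condition in an ON clause) next to a LAG() version that reads closer to the "current minus neighbour" form. It assumes a hypothetical sales table with gap-free ids and uses SQLite 3.25 or newer purely as a convenient stand-in; it says nothing about how MDX itself evaluates such expressions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, price INTEGER)")
# Hypothetical prices with dense, sequential ids.
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100), (2, 120), (3, 90), (4, 150)])

# Self-join: two aliases over the same table, matched on id = id + 1.
self_join = conn.execute("""
    SELECT s1.id, s1.price - s2.price
    FROM sales AS s1
    INNER JOIN sales AS s2 ON s1.id = s2.id + 1
    ORDER BY s1.id
""").fetchall()

# Window function: one pass over the table, ordered explicitly by id.
with_lag = conn.execute("""
    SELECT id, diff FROM (
        SELECT id, price - LAG(price) OVER (ORDER BY id) AS diff
        FROM sales
    )
    WHERE diff IS NOT NULL
    ORDER BY id
""").fetchall()

print(self_join == with_lag)
```

Both queries return the same differences on this data. The ordering still has to be stated explicitly (ORDER BY id), which is consistent with the point that a table has no innate row order.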

        1. BlueGreen

          Re: Wrong? @ AceRimmer

          "Sale1 and Sale2 might both contain data from the same table but they are defined as separate sets."

          Even if they were defined thus semantically (and they possibly are), no implementation would actually duplicate the data, so it's a non-issue.

          I'm perfectly aware of how to do self-joins to get prior rows. We do many self-joins, only more complex than yours, because we join on an identifying ID and a timestamp, so we have to match same IDs but with the most recent prior timestamp to get a current/prior match. It's ugly so, ta-daaaa! I've created a view for it and we can now simply say

          select CurrentLocation - PreviousLocation from CurrentAndPreviousLocationsView

          Compare with your example.

          As a view I can munge it for performance, for example adding a sequential row number in the base tables & replacing the join predicate with a join like your example. Or replace it with lead/lag. Whatever. I re-assert that "SQL is wildly incompetent at comparing sequential rows" is untrue as there's nothing in the relational model relevant to this.
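A minimal sketch of what such a view might look like, assuming a hypothetical locations table keyed by an item id and an integer timestamp. Only the view name and the CurrentLocation/PreviousLocation columns come from the comment above; the schema, the data and the use of SQLite are illustrative guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (item_id INTEGER, ts INTEGER, location INTEGER);
INSERT INTO locations VALUES
    (1, 10, 5), (1, 20, 8), (1, 35, 6),
    (2, 12, 40), (2, 30, 47);

-- Pair each reading with the same item's most recent earlier reading.
CREATE VIEW CurrentAndPreviousLocationsView AS
SELECT cur.item_id   AS ItemId,
       cur.ts        AS CurrentTs,
       cur.location  AS CurrentLocation,
       prev.location AS PreviousLocation
FROM locations AS cur
JOIN locations AS prev
  ON prev.item_id = cur.item_id
 AND prev.ts = (SELECT MAX(p.ts) FROM locations AS p
                WHERE p.item_id = cur.item_id AND p.ts < cur.ts);
""")

# Callers see only the simple form; the ugly join lives inside the view.
moves = conn.execute("""
    SELECT ItemId, CurrentLocation - PreviousLocation
    FROM CurrentAndPreviousLocationsView
    ORDER BY ItemId, CurrentTs
""").fetchall()
print(moves)
```

Because callers depend only on the view's columns, the body could later be swapped for a row-number join or LAG() without touching the queries that use it.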

          1. AceRimmer

            Re: Wrong? @ BlueGreen

            What happens when you need to compare the current row with the next 10, 15, 1000 rows?

            The point is, you are always going to have to make SQL do something which is slightly unnatural when you want to compare rows like this.

            No one is saying it can't be done, as both of us have pointed out, it can be done and any SQL expert isn't going to run into any difficulties writing the SQL to do it.

            BUT if all you are doing with the data is ploughing through the sets and analysing them in a sequential manner then SQL is a second best choice.

            "SQL is wildly incompetent at comparing sequential rows"

            I think the Prof was exercising his Journalist Licence when making that statement :)
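For the "next 10, 15, 1000 rows" case raised above, one option in SQL dialects that support window functions is a frame clause, which compares each row against a sliding range of following rows without a chain of self-joins. This is a hedged sketch with hypothetical data and a small window size N, run on SQLite 3.25 or newer; whether it counts as "natural" SQL is left to the reader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100), (2, 120), (3, 90), (4, 150), (5, 110)])

N = 3  # hypothetical look-ahead; an integer constant, so safe to interpolate

# For each row, take the highest price among the next N rows in id order.
query = f"""
SELECT id, price,
       MAX(price) OVER (ORDER BY id
                        ROWS BETWEEN 1 FOLLOWING AND {N} FOLLOWING) AS max_next
FROM sales
ORDER BY id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Rows near the end of the table simply see a shorter frame, and the last row sees an empty one, so its max_next comes back as NULL (None in Python).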

  12. Anonymous Coward

    For some reason, I'd really like to imagine that Professor Whitehorn keeps everything in C:\...

  13. Benchops

    Nooooo!!!!!!!! No. No. No. Really, no.

    > what is big data?

    what are big data?

    > there was data

    there were data

    > Data is

    Data are

    > and it is

    and they are

    > data has

    data have

    > This is

    These are

    > this data

    these data

    ... giving up now

    1. Brian Etherington

      Re: Nooooo!!!!!!!! No. No. No. Really, no.

      Yesssssss!!!!!!!! Yes. Yes. Yes. Really, yes.

      Whilst it is perfectly true that data was originally a plural noun, it is now accepted as a singular one. See, for example:

      bbc.co.uk/worldservice/learningenglish/radio/specials/1535_questionanswer/page58.shtml

      where another Professor (Michael Swan) argues persuasively that:

      "Originally, data was a plural noun, it comes from a Latin word that means things which are given, and that’s plural. The singular of that is datum or datum (different pronunciation). But English speaking people mostly don’t know Latin and so not everybody recognised the word was supposed to be plural.

      It looks singular to an English speaker, so more and more people came to use it as a singular and now that’s quite normal. At the beginning "The data is" was definitely a mistake but it’s so widely used now that it’s no longer possible to say that it’s a mistake. It’s become part of the language. This is actually quite a common reason for language change. People make mistakes and the mistakes are repeated by other people, and finally they no longer count as mistakes. It happens a lot with vocabulary."

      1. Benchops

        Re: Nooooo!!!!!!!! No. No. No. Really, no.

        Lots of plural words are now used in the singular, and they /have/ been accepted as correct usage. Fortunately we don't have a language police like the French, but in that absence I always have a peek at the OED, who seem to like tracking words and stuff. Agenda is a good example of that. However, whilst lots of people use the word data in the singular, including academics who um and aw about whether it should be singular or plural, OED still says it's a plural word (horribly I know of a university tutor who "corrected" foreign students' essays when they used data in the plural). When they say otherwise I guess I can breathe a huge sigh of relief.

        Undoubtedly data will be accepted in the singular in the not too distant future, but it isn't completely there yet, and I always imagine journalists ought to be good at English. (good == correct)

        1. Brian Etherington

          Re: Nooooo!!!!!!!! No. No. No. Really, no.

          The OED has this to say on the subject:

          "In Latin, data is the plural of datum and, historically and in specialized scientific fields, it is also treated as a plural in English, taking a plural verb, as in the data were collected and classified. In modern non-scientific use, however, despite the complaints of traditionalists, it is often not treated as a plural. Instead, it is treated as a mass noun, similar to a word like information, which cannot normally have a plural and which takes a singular verb. Sentences such as data was (as well as data were) collected over a number of years are now widely accepted in standard English."

          1. Benchops

            Re: Nooooo!!!!!!!! No. No. No. Really, no.

            Maybe I should start breathing in for that big sigh of relief ;)

        2. Michael Wojcik Silver badge

          Re: Nooooo!!!!!!!! No. No. No. Really, no.

          > I always imagine journalists ought to be good at English. (good == correct)

          In this context, "correct" is meaningless, unless you're a militant prescriptivist, in which case it means "my personal preferences, which I am deluded to believe are somehow special".

          1. Anonymous Coward

            Re: Nooooo!!!!!!!! No. No. No. Really, no.

            Here "correct" can mean what you think or what I think; without a non-subjective reference, let's look at the closest thing to objectivity which is, as pointed out, the OED. They say that in non-scientific use it has become acceptable. When did computer science stop being a science?

            You don't have to think your preferences are special to think they're right, you just have to have a good way of justifying them.

  14. david_evans

    Protein shape

    Just to say that peptides are marvelous, but the shape of a protein has a not-very-well-understood effect. So maybe that project will be a huge exercise in collecting not terribly helpful info AGAIN. Good luck with that though.

  15. John Benson

    Hi, I agree with Paul that a different term is needed. Before reading this article, all I knew about Big Data was that vendors of Big Storage were talking about it, so by association I assumed that Big Storage implied that you were dealing with Big Data.

    After reading this article, it seems that we need to break the assumed link between Big Storage and Big Data. You could have Big Storage for telephone call records and still be in the world of SQL "small data" (as defined in the article: easy-to-understand, simple relationships). "NoSQL" isn't a good replacement for "Big Data", because it seems to be mostly concerned about scalability and not a fundamental difference in the quality of data. So we obviously need a new term to describe what is A New Level Of Complex Data That Can't Be Easily Generated From Set-Theoretic Operations On Atomic Data.

    Can anybody come up with a better acronym than ANLOCDTCBEGFSTOOAD? Perhaps that would be best, because ANLOCDTCBEGFSTOOAD belongs to me, it's mine and I own it.

  16. Michael Wojcik Silver badge

    Definitions are like standards; there are so many to object to

    It's worth remembering that this article is just one author's opinion on the subject. It is an expert opinion, true, and should be assigned due weight; but there are many experts in this area, and many would disagree with Whitehorn's assessment of what "Big Data" might be.

    And in any event, since no one can impose a single definition of the phrase, it will remain a signifier for whatever its various users want it to signify. There are many who will debate, say, whether R is a preferable language for working with "Big Data", or whether SVMs are the classifier of choice, and so on - clearly all of these are only sensible for certain understandings of what Big Data might be, so insisting that it means "datasets and queries about them not amenable to relational algebra" (or any other definition) is just a way to avoid participating in the conversation.

    Personally, I am more interested in things like new languages (like Julia) and tools (like Tableau) and algorithmic approaches (like lattice algebras) than in whether my Big Data is someone else's Big Data. And I am far more interested in those subjects than in hearing a hundred oh-so-jaded Reg commentators complain that anything associated with "Big Data" is a marketing ploy. Yes, those of you who insist on pointing that out are all very clever; now go away.
