Big Data bites back: How to handle those unwieldy digits

Data is easy. It comes in tables that store facts and figures about particular items – say, people. The columns define the data to be stored about each item (such as FirstName, LastName) and there is one row for each person. Most tabular database engines are relational and we use SQL for querying. So this "Big Data" thang must …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward

    " It’s a big job."

    Surely you mean Big Job?

  2. Jay Zelos
    Thumb Up

    Really good short article. I've never had to work with unstructured data, just tabular, and this was very insightful for me. More of the same please.

  3. Ian Michael Gumby
    Devil

    Meh!

    It's true that it's difficult to explain what is and is not big data.

    Structured vs 'unstructured' or semi-structured doesn't cut it because you can have structured data that is still considered to be Big Data.

    I agree that the Vs don't cut it either.

    A simple working description could be that Big Data is anything that doesn't fit well into an RDBMS.

    You know Big Data has gone mainstream when El Reg starts to cover it.

  4. Steve Knox

    Structure

    is easy to apply to pretty much any data -- if you have the time. I have yet to see an example of data which could not be structured.

    I personally think that big data is defined primarily by the relative amount of data and the speed with which it needs to be processed. In other words, I would define big data as any data set which must be processed in less time than it would take to apply a consistent structure to the volume of data being processed.

    I welcome examples of data which cannot be structured.

    1. BlueGreen

      Re: Structure

      well, may I suggest address data as something which doesn't really fit a row structure well. You typically start with house number/name, address line 1, address line 2 (sometimes address line 3), city/county/state, country (optionally), followed by postcode/zip code. That's to cover UK/US addresses.

      Let's break that down: house number or name - we don't know what type it is, it's basically characters (it could be entered as '3', 'no 3', 'number 3', '55b', or a name like 'meadow farm', or even rural oddities like 'last on henrietta avenue'). Then address line 1/2/3, where the last may be empty (or even the first, or any one, if that's how the user chooses to enter it) - not very relational (you could split it off to another table with an FK, but it never is). Then city/county/state, where there's sod all syntactic information about which is which, so another char string, and then post/zip code, of which again there's no structured interpretation (unless you present two boxes, one for zip, one for postcode, and ask the user to use the correct one, which I've never seen). No scope for a primary key either.

      Sort that one out for me & I'll be happy.

      Have to add that addresses are about the only non-structurable things I've come across, so RDBMs are good IME.

      1. Steve Knox

        Re: Structure

        It's funny you should mention address, as I've recently worked on a project that dealt with address data in a semistructured format. Specifically, we had two tables of (US) addresses, one of which was formatted Street Address (as a single line), City, State, Zip, the other of which was Address Line 1, Address Line 2, Address Line 3, City, State, Zip. Since all of the Address lines were pretty much free-form, there was a lot of mess in there. I ended up matching on City, State, and Zip first, then comparing the Street Address with each of the three lines from the other table, parsing all lines into the following format:

        Street Number

        Street Direction

        Street Name

        Street Type

        Unit Type

        Unit Number

        All of that parsing was relatively straightforward. By far the most difficult piece was accounting for all the different variations in spelling and abbreviation. Do you know how many ways there are to abbreviate "street"!?

        So I sorted that to my satisfaction (over 95% correct match rate). I agree that you will have to compromise on exactly what you can include, but the reason for structuring the data should dictate what is acceptable to lose.
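
        A rough sketch of that kind of parse, for what it's worth; the field names follow the list above, but the abbreviation tables are illustrative stand-ins rather than the real (much longer) lists:

import re

# Illustrative abbreviation tables -- the real lists were far longer.
STREET_TYPES = {"st": "ST", "str": "ST", "street": "ST",
                "ave": "AVE", "av": "AVE", "avenue": "AVE",
                "rd": "RD", "road": "RD"}
UNIT_TYPES = {"apt": "APT", "apartment": "APT", "ste": "STE", "suite": "STE", "unit": "UNIT"}
DIRECTIONS = {"n", "s", "e", "w", "ne", "nw", "se", "sw"}

def parse_street_line(line):
    """Split a free-form US street line into the six fields listed above."""
    tokens = re.sub(r"[.,#]", " ", line).lower().split()
    parsed = {"street_number": None, "street_direction": None, "street_name": [],
              "street_type": None, "unit_type": None, "unit_number": None}
    i = 0
    if tokens and tokens[0].isdigit():
        parsed["street_number"] = tokens[0]
        i = 1
    if i < len(tokens) and tokens[i] in DIRECTIONS:
        parsed["street_direction"] = tokens[i].upper()
        i += 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in UNIT_TYPES:                 # everything after the unit type is the unit number
            parsed["unit_type"] = UNIT_TYPES[tok]
            parsed["unit_number"] = " ".join(tokens[i + 1:]) or None
            break
        if tok in STREET_TYPES and parsed["street_type"] is None:
            parsed["street_type"] = STREET_TYPES[tok]
        else:
            parsed["street_name"].append(tok)
        i += 1
    parsed["street_name"] = " ".join(parsed["street_name"]).upper() or None
    return parsed

print(parse_street_line("123 N. Main Street, Apt 4B"))
# -> {'street_number': '123', 'street_direction': 'N', 'street_name': 'MAIN',
#     'street_type': 'ST', 'unit_type': 'APT', 'unit_number': '4b'}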

      2. Aaron Em

        Sorted! [Was: Re: Structure]

        Hard to blame this one on the relational database model when what it is, in fact, is a failure to exercise proper user interface design. Had you not stopped breaking it down partway through, you'd see your way clear of the problem without having to be told -- but, lucky for you, here's me to do the telling.

        Street number is /^[0-9]+$/, or, if you prefer, /^[0-9]+[a-z-]{0,3}$/ -- and if you get the latter form, you split it at the digit-letter boundary, then stuff the leading number into your "street number" field, and the trailing gubblick minus any punctuation into something like "suite number" or "flat number" or whatever you like.
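
        As a sketch, the split at the digit-letter boundary might look like this (a slight rearrangement of the second regex above, with any punctuation stripped from the trailing part):

import re

def split_street_number(raw):
    """Split e.g. '55b' at the digit-letter boundary.

    Returns (street_number, suffix); the suffix can go into a 'suite number'
    or 'flat number' style field, and may be None.
    """
    m = re.match(r"^([0-9]+)([a-z-]{0,3})$", raw.strip().lower())
    if not m:
        return None, None                     # neither pattern matched
    number, tail = m.groups()
    return number, (tail.replace("-", "") or None)

print(split_street_number("3"))      # ('3', None)
print(split_street_number("55b"))    # ('55', 'b')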

        If you absolutely have to handle house names and rural stuff like "last on the left" along with numbers -- which, I dunno how it is in the UK, but it's extremely rare in the United States to encounter an address which doesn't have a number, even if that number is e.g. "#240, Carrier Route 461" -- then you can present a dropdown or other choose-one-of-many control, with options for "House number" and "House name", and use that info to present the appropriate input control or controls. That way, you get your semantic info (which is what you mean when you say "syntactic information" there), you're unambiguous about what you're receiving, and the user is neither confused nor annoyed by UI controls which he doesn't need to fill in and which don't behave the way he's expecting.

        Address lines: Why do you need them to remain individual lines in the database? Just join them with "; " on the way in, and dump any empty fields ("empty" here matching /^\s*$/) before the join. That way, you don't need to denormalize the address fields or split them into a separate table; you can still do fulltext search across them; and if you ever need to get the multiple lines back again, say for printing an address label or similar, you can just split on "; " and Bob's your uncle. If that's not good enough for you, then split the address lines off into a separate table keyed by address entry ID, like you're talking about wanting people to do -- I mean, if you're trying to build a normalized schema, that's not too far to go, right?

        (OK, I guess someday somebody might put a semicolon into one of the address fields, but you can reject it during form validation -- you are validating your forms, right, before you hand them to your poor unsuspecting backend? If not, you can fuck right off 'til you've learned the basics of your craft!)
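
        A minimal sketch of the join-and-split idea, assuming (as per the parenthetical above) that stray semicolons have already been rejected at form validation:

import re

def pack_address_lines(lines):
    """Join the free-form address lines with '; ', dropping any empty ones."""
    return "; ".join(l.strip() for l in lines if not re.match(r"^\s*$", l))

def unpack_address_lines(packed):
    """Get the individual lines back, e.g. for printing an address label."""
    return packed.split("; ")

packed = pack_address_lines(["Meadow Farm", "  ", "Henrietta Avenue"])
print(packed)                        # Meadow Farm; Henrietta Avenue
print(unpack_address_lines(packed))  # ['Meadow Farm', 'Henrietta Avenue']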

        Each of city, state/province/whatever, and country is a field of its own, which solves your problem of lacking semantic information -- splitting out the fields makes it entirely unambiguous what's what. Ideally, you'd present them in that order, but you probably need to know the country in order to know whether to present state or province, so you probably want to ask for country first -- in fact, since the country has the largest effect on how the address is formatted in any case, you probably want to ask for that before anything else, and present address fields appropriate to the format in question.

        Once you know the country, you know enough to choose between state or province, and also to choose between presenting a "ZIP code" or a "postcode" field -- no need to show both, because that's incompetent, inelegant, and sloppy; there's also no need for a page load in between choosing country and the rest of it, not if you know what you're doing with jQuery, or if not jQuery then whatever inferior Javascript library you prefer to use. Similarly, if you work your way inward from the broadest possible category -- i.e., country first, then province, then &c., &c., at each step you get enough information to decide what to present in the next. With any luck you've only got maybe a half-dozen address formats to cope with, instead of forty, but even in the latter case you can still handle it properly this way, and without inflicting the agony of unmaintainability on yourself -- the effort involved scales linearly with each new address format, rather than exponentially, so it might be a big pain in the ass but at least it's manageable.

        And why are you relying for a primary key on anything the user gives you? That's what an autoincrement field is for!

        As with the apparent lack of guaranteed street numbering, I don't know how it is in the UK, but here in the States, you can skip all this bloody mess entirely by using the USPS's "Web Tools" address normalization API. That takes whatever you care to give it, and responds with either a validated and normalized equivalent of that address, or an "I dunno wtf" if the input it gets is too bogus for it to comprehend. My experience with that API has been very good, and in the rare case where it can't normalize something, I have no problem with presenting the user a "Sorry, but the Postal Service wasn't able to verify your address. Please double-check..." sort of response.

        Again, dunno if the Royal Mail offers anything similar, but if they don't they bloody well should, and if they do then you should bloody well use it. (And if they don't, or they do but you can't, then you're more or less on your own, sure -- but if you listen to what I'm telling you here, you should be able to do a pretty damn solid job of it, even without being backstopped by the agency which defines whatever addressing format you find yourself having to deal with.)

        There! Sorted -- and every bit of what I describe here I have done, in production, on real websites used by real people and developed on behalf of real clients. Go thou and do likewise.

        1. BlueGreen

          @Steve Knox, @Aaron Em

          Thanks for the interesting replies. If, god forbid, I ever have to do that again I will certainly come back to what you & Aaron have said here.

          @Steve Knox

          > I agree that you will have to compromise on exactly what you can include

          now this is a very interesting point. Perhaps we were only trying to structure this info because we were pushing it into a 'structured' database. Perhaps we should have been wiser and just accepted its inherently semi-blobby nature. Perhaps we were simply too deep into the forest to see the wood. Live and learn.

          @Aaron Em

          > Hard to blame this one on the relational database model when what it is, in fact, is a failure to exercise proper user interface design.

          You are quite right but one deals with what gets dropped into one's lap, and if that means existing crappy data from a free-form entry system, you don't get the option.

          [loadsa genuinely useful stuff about how to do it properly, including using postal service facilities]

          > And why are you relying for a primary key on anything the user gives you? That's what an autoincrement field is for!

          Ah no. A primary key should be what is natural for the data to prevent duplication. If it's not intrinsic to the data, and it appears never to be so for addresses, there is no primary key. Fudging one on with an autoincrement gains you absolutely nothing except a handy-for-joins foreign key, which is just fine and necessary but that is not the purpose of a primary key.
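
          For what it's worth, that division of labour can be sketched like so (columns here are hypothetical): the autoincrement surrogate is kept purely as a join handle, while duplicate prevention has to come from a uniqueness constraint over the data itself -- which, as discussed above, only works as well as the address data is normalised:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate: handy for joins, nothing more
        line1      TEXT NOT NULL,
        line2      TEXT NOT NULL DEFAULT '',
        city       TEXT NOT NULL,
        postcode   TEXT NOT NULL,
        UNIQUE (line1, line2, city, postcode)          -- duplicate prevention comes from the data
    )""")

row = ("3 Meadow Farm", "", "Norwich", "NR1 1AA")
conn.execute("INSERT INTO address (line1, line2, city, postcode) VALUES (?,?,?,?)", row)
try:
    conn.execute("INSERT INTO address (line1, line2, city, postcode) VALUES (?,?,?,?)", row)
except sqlite3.IntegrityError as err:
    print("duplicate rejected:", err)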

          Anyway thank you both. Very useful indeed.

    2. Ian Michael Gumby
      Devil

      Re: Structure

      "is easy to apply to pretty much any data -- if you have the time. I have yet to see an example of data which could not be structured."

      Sure you can shoehorn data into some model; however, over time that model becomes very large and unwieldy.

      Try doing a query which requires you to join several tables when there are billions and billions of rows.

      Yes it can be done, but watch your database start to crawl and sputter.

      Let us also not forget that a highly normalized database tends to lose its temporal reference too.

      1. Steve Knox

        Re: Structure

        Bear in mind, Mr. Gumby, that structure does not necessarily mean relational structure. While that's the most common, and often the most efficient, form that data structuring falls under, there are other ways of structuring data (hierarchical structures such as XML, for example.) And normalization is a consideration if and only if the nature of the data and intended use calls for it.

        Even so, you simply underscored my point. The primary concern is not whether the data can be modelled, but whether a given model* is time- and cost-efficient.

        *This assumes that the model is appropriate to the answers you need from the data, because, well, why are you considering it if it isn't?

    3. Michael Wojcik

      Re: Structure

      I welcome examples of data which cannot be structured.

      You'll have to define your terms first, since "data" and "structure" are both far too vague.

      Suppose your data is a stream of noise (there are real applications for this, such as cryptographic nonce generation). You could claim you're "structuring" it just by chunking it into segments of equal length. Does noise count as data? Does chunking count as structure?

      When analysts talk about "unstructured data", they generally mean data which can't tractably be coerced into a structure that's useful for the application. It doesn't matter whether it's conceivable that it can be structured; what matters is whether the reduction in the cost of analysis obtained by that structure is greater than the effort of imposing that structure.

      The kind of structure - relational, hierarchical, etc - is irrelevant in the general question. Either there's a structure which can be achieved with reasonable cost and which represents useful information about the data,[1] or there isn't.

      [1] Note that this is a compression function. All structuring is compression: it encapsulates redundant metadata, which would otherwise have to be maintained alongside the data or derived from it (which is another form of compression). Of course it's possible to express all computing as a compression operation, running either forward or backward in time.

  5. Aaron Em

    "thang"

    Don't.

  6. Graham Wilson
    Facepalm

    This is an ongoing argument with little consensus.

    This debate has its origins as far back as I can remember. Computer designers and programmers like structure, fixed record lengths, tables of defined sizes, etc., but real-world information is all over the place so there's never been a simple solution.

    Take a book for instance, the table of contents usually has a different structure to its index which is different again to the content--chapters, drawings, diagrams etc. This simple example doesn't fit well with fixed records or an RDBMS. Systems designed to work with unstructured and/or big data have failed in the past or have not been popular. The Pick Operating System comes to mind, or databases such as Advanced Revelation, which fitted oddball data into the fixed records of standard filing systems (DOS, NTFS etc.). These never turned out to be very popular.

    Other systems had the BLOB (Binary Large Object) class for images etc., but basically this data type wasn't 'native' but rather was also fitted (kludged) into the usual fixed records of the filing system. In essence, for most, the concept of Multi-Valued, Variable-Length records has always been a yawn.

    It's obvious why it's so but it doesn't make much sense really. If you think about it a little, why we bother to format floppies or hard disks superficially makes sense for all the reasons hardware and software people go on about, but most data does not fit exactly into, say, a 512 or 4096-byte record. Except for music staves and ruled-line exercise books, we don't format paper pages so why bother doing so with hard disks? Conceptually, it's stupid: if we write a file with 3 bytes then the disk image should be 3 bytes and not a 512 byte sector with 3 bytes used and 509 remain as unused slack space. Similarly, a large TIF file in a variable-length system might create a sector of 50,456,678 bytes. There is no conceptual or logical reason why a 3-byte record cannot be followed contiguously by a 50,456,678-byte one on the same disk surface.
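
    To put numbers on the slack-space point, a quick sketch:

import math

def slack_bytes(file_size, sector_size=512):
    """Bytes left unused when a file is stored in whole fixed-size sectors."""
    sectors = math.ceil(file_size / sector_size)
    return sectors * sector_size - file_size

for size in (3, 50_456_678):
    for sector in (512, 4096):
        print(f"{size:>10,} byte file, {sector}-byte sectors: "
              f"{slack_bytes(size, sector):,} bytes of slack")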

    Right, I know many of you will be thinking this guy's wacko. Probably so, but I've made these comments as a veteran programmer of beasts such as the Western Digital FD/WD1771/81/91 and the Intel 8272 floppy disk controller chips, so I clearly know the practical reasons why this nonsense continues. Except for obvious cases--tables, where all data is essentially the same size etc.--fixed records as the default never made sense to me back in the days of the '91s and 8272s, and they still don't. These record 'constructions' (512 bytes etc.) only exist for practical design reasons and bear little resemblance to real-world data.

    So fixed are we in the world of fixed-length records, or multiples thereof, that native processing of large data concepts, a la this article, is, I reckon, still many years off.

    The same stupid logic applies to the 8, 16, 32, 64-bit world. For example, going from an 8/8/8 (24-bit) colour image to a 16/16/16 (48-bit) image is a massive and usually unnecessary jump. Not only will most hardware not cope with such large dynamic ranges, but when the bits exceed say 12/12/12 (36 bits) the registers are left processing much worthless data (as the significant/effective resolution has been well and truly exceeded, petering out long before 16/16/16 is reached). Thus, designing systems that have 13 or 14-bit resolution would make much more sense. But oh no, systems have to be exact multiples of 8 bits or it doesn't look nice.
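
    As a rough sketch of that arithmetic (a hypothetical 1-megapixel, 3-channel image, ignoring headers and compression), anything between 9 and 16 bits per channel costs the same as 16 once it's padded to whole bytes:

# Storage cost per channel depth for a 1-megapixel, 3-channel image.
PIXELS = 1_000_000

for depth in (8, 12, 13, 14, 16):
    packed = PIXELS * 3 * depth                     # bits, packed tightly
    padded = PIXELS * 3 * (((depth + 7) // 8) * 8)  # bits, padded to whole bytes per channel
    print(f"{depth:>2}-bit channels: {packed / 8e6:.1f} MB packed, "
          f"{padded / 8e6:.1f} MB padded to whole bytes")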

    I remember the old CDC-6600 mainframe, it used 12-bit words and 60-bit registers amongst other things. Right, the word and register sizes were designed for the job in hand and not some arbitrary multiple of eight because it sounds nice mathematically.

    Eventually, this problem will be solved (as it was (and is) solved intrinsically in the analog world), but at present there are too many vested interests and too much ingrained thinking for a commonplace solution to pop up anytime soon.

    1. BlueGreen

      Re: This is an ongoing argument with little consensus.

      > Take a book for instance, the table of contents usually has a different structure to its index which is different again to the content--chapters, drawings, diagrams etc. This simple example doesn't fit well with fixed records or an RDBMS.

      Depending on what you're trying to do, seems quite doable. index & Toc are separate tables, content is various tables with links. Not saying you should, just that you can.

      > Except for music staves and ruled-line exercise books, we don't format paper pages so why bother doing so with hard disks?

      Probably because it would be a damn sight slower & more complex to do this at a very low level, which is why sectors are abstracted away so I never have to see or worry about them, and just use the natty "file" thingies.

      > Conceptually, it's stupid: if we write a file with 3 bytes then the disk image should be 3 bytes and not a 512 byte sector with 3 bytes used and 509 remain as unused slack space

      AFAIK win & linux allow packing of many small files into sectors. Anyone know what this is called? Elsewise, if you think it's such a good idea, you're free to implement your own space efficient FS at an application level, or as a library so everyone can use it.

      > that have 13 or 14-bit resolution would make much more sense

      hmm, or you could be happy with 16 bits and 2/3 bits of unnecessary extra depth (although I'd bet someone would say 16 bits isn't enough depth for a colour).

      > the old CDC-6600 mainframe, it used 12-bit words and 60-bit registers amongst other things. Right, the word and register sizes were designed for the job in hand

      yep, that job being a number cruncher. My desktop isn't a special purpose device and I don't want it to be. If you don't know what a cpu is going to be used for then any decision is going to be as wrong as any other, including using 12/13/14 bit words. And IBM's Stretch machine, which had variable-length bytes, was slow, I understand (<http://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_formats>)

      1. Graham Wilson
        Thumb Up

        @BlueGreen -- Re: This is an ongoing argument with little consensus.

        Essentially, I agree with you.

        I've no hardline or extreme views on this issue one way or the other. In fact, perhaps, if pushed, I'd class myself more as an irritated spectator, although I regularly hit the boundaries of most storage/file systems because I've been involved in sorting/storing very diverse forms of data in the same machine/same location.

        Both nomenclature and type/class are problems. Naming conventions are problematic, with little or no agreement on the methods of naming, file naming, truncating and rematching truncated names with originals, etc. And data itself is so variable: for instance, a 950MB TIF file (the largest on this machine) coexists with files of as little as zero length, that is, files which exist only as name metadata.

        Let me give you a practical example. I regularly hit the file-name/path-length limit of 255/260 characters in NTFS. Name a file with a long filename in a top directory, then move it down a few nested directories, and the file becomes inaccessible as filename/'MAX_PATH' now exceeds the allowable limit. OK you say, how about making the filename/path smaller? Yes, I can devise a method that's fine for me, but how does one do it in a standard way that someone who doesn't know my method of truncating/abridging can understand? It's not only an IT/computer problem but a longstanding one for libraries, repositories, armies, NATO and such since time immemorial.

        Let's take a real-world example. I find an old book, say on the Internet Archive, so how do I allocate it a totally unambiguous and unique filename--one that would automatically be identical to that produced by everyone else if they'd acted independently? If everyone has to apply only the most elementary rule, then it can only be done by reproducing the book's title page. Rule: 'title page becomes the filename'. OK, let's go to an example, this one for instance: http://archive.org/details/shorttreatiseona00rums . At first glance, selecting a book with a very long title might seem extreme but in practice it's not. Almost every technical/scientific/geographic/philosophical book written before 1900 had a very long title as a matter of course, and the same today goes for scientific articles and such (and there's also the problem of the abstract info).

        The Internet Archive knows it's not practical to apply the title-page rule to the filename so it allocates a unique identifier which becomes the filename, specifically 'shorttreatiseona00rums'. By adding an extension (.pdf, .djvu, etc.) we get the different available formats. When this unique identifier is used with the IA's RDBMS all problems are solved; we've a very workable library.

        Trouble is, with no agreed universal system of nomenclature, the unique identifier is only known to or accepted at the IA. At the other extreme, if we apply the obvious title-page rule then we'd end up with a filename something like this:

        "A short treatise on the application of steam, whereby is clearly shewn, from actual experiments, that steam may be applied to propel boats or vessels of any burthen against rapid currents with great velocity. The same principles are also introduced with effect, by a machine of a simple and cheap construction, for the purpose of raising water sufficient for the working of grist-mills, saw-mills, &c. and for watering meadows and other purposes of agriculture; Rumsey, James, 1743?-1792; Publisher: Printed by Joseph James: Chestnut-Street Philadelphia, 1788_location: ULS Lib, copy #: xyz123"

        Without any extension, this name is 593 characters long, which is well over double the NTFS limit of 255, and this is only the beginning of the problems, as:

        - NTFS is not a database file system, so even if it accepted longer filenames other user-metadata cannot be included in the file information. (Microsoft promised WinFS, a database filing system, with Vista but it never eventuated.)

        - NTFS/Windows' antiquated and maddening reserved character list means that translations are necessary: the ':' replaced with '--', '/' with '_', '?' with '¿' and so on. This results in inconsistencies and matching errors with other systems.

        - The problem isn't limited to NTFS, very few other file systems have filenames that exceed 255 characters. Nevertheless, Microsoft has acknowledged the problem with ReFS (Resilient File System) and it has extended the filename length to 32k in ReFS (although with Win 8 server it'll be limited to only 255 for compatibility): http://blogs.msdn.com/b/b8/archive/2012/01/16/building-the-next-generation-file-system-for-windows-refs.aspx

        - Thus the solution is to use NTFS with an RDBMS, but alas there's no universally agreed database standard. (There's more but that'll do.)

        So, the horns of the dilemma return, for the time being we're back to where we started.
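
        To make the title-page-rule problem concrete, a small sketch that applies the reserved-character translations listed above to the title quoted earlier and measures the result against the 255-character limit:

NTFS_NAME_LIMIT = 255   # per-name-component limit; the full-path MAX_PATH limit is 260

# The reserved-character translations mentioned above.
TRANSLATIONS = {":": "--", "/": "_", "?": "\u00bf"}

def windows_safe(name):
    for bad, replacement in TRANSLATIONS.items():
        name = name.replace(bad, replacement)
    return name

title_page_name = (
    "A short treatise on the application of steam, whereby is clearly shewn, from actual "
    "experiments, that steam may be applied to propel boats or vessels of any burthen against "
    "rapid currents with great velocity. The same principles are also introduced with effect, "
    "by a machine of a simple and cheap construction, for the purpose of raising water "
    "sufficient for the working of grist-mills, saw-mills, &c. and for watering meadows and "
    "other purposes of agriculture; Rumsey, James, 1743?-1792; Publisher: Printed by Joseph "
    "James: Chestnut-Street Philadelphia, 1788_location: ULS Lib, copy #: xyz123"
)

safe = windows_safe(title_page_name)
print(f"{len(safe)} characters after translation; "
      f"{'over' if len(safe) > NTFS_NAME_LIMIT else 'within'} the {NTFS_NAME_LIMIT}-character limit")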

        -------

        Your comments

        > Depending on what you're trying to do, seems quite doable. index & Toc are separate tables, content is various tables with links. Not saying you should, just that you can.

        Yeah right. However, the book paradigm's been around a long while and it's easy to format in any way the author wants as essentially there are no rules dictating where ink etc. goes on a page. This freedom is still harder to achieve in electronic models (e.g.: one can't scribble or type between the lines or in the margins in most wordprocessors or text editors).

        > Probably because it would be a damn sight slower & more complex to do this at a very low level, which is why sectors are abstracted away so I never have to see or worry about them, and just use the natty "file" thingies.

        Agreed it's more complex, and in the days of the 8272 FDC it would have been difficult to implement and slow. However, today very sophisticated industrial controllers are the norm in all HDs. Someone will have to give me a very good reason why it'd be impractical to implement. Sectoring and formatting is a legacy issue, thus a no-brainer for manufacturers as it's cheaper.

        > AFAIK win & linux allow packing of many small files into sectors. Anyone know what this is called?

        Called variously tail/slack space/block sub-allocation packing, it tries to efficiently use slack space and it's used in some Linux F/S and compression systems.

        > Elsewise, if you think it's such a good idea, you're free to implement your own space efficient FS at an application level, or as a library so everyone can use it.

        Yeah, in another life perhaps. Sooner or later--probably later as this stuff isn't user-candy a la iPhones. Eventually it'll happen as we leave the vestiges of 1950s computing behind.

        > …hmm, or you could be happy with 16 bits and 2/3 bits of unnecessary extra depth (although I'd bet someone would say 16 bits isn't enough depth for a colour).

        A few bits doesn't matter, but it does with, say, 36-bit colour versus 48-bit colour. 24-bit colour is, nowadays, severely limiting at the top end of imaging, but genuine 48-bit is very difficult--I mean a full 16-bit dynamic range per channel and not a 12-bit span shoved into a 16-bit channel. That's why camera manufacturers have RAW formats; writing out junk increases storage space needs and takes extra time.

        >…And IBM's Stretch machine, which had variable-length bytes, was slow, I understand <http://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_formats>

        The 7030's before my time but my uni had a 7040 and a CDC 6600.

        Just because an idea is ancient doesn't mean that it's obsolete. Here's the first IT example that comes to mind. Kim Watt of Breeze Computing wrote a utility for the Tandy TRS-80 called 'Super Utility'. Amongst the routines was a gem called 'Format Without Erase'. Take a floppy/storage device that was getting a bit flaky, and so long as the read data checksummed OK, FWE would rebuild the magnetic image by reading the data into memory, then formatting the disk, and finally laying the data back down on a freshly formatted track.

        I can't tell you how many times I've missed this utility since I switched to an IBM machine; it must have been hundreds and hundreds of times.

        I simply cannot understand why this isn't a major part of the S.M.A.R.T feature of HDs. If S.M.A.R.T provided a monitoring interface with info about data's magnetic threshold then a Format Without Erase feature built into an O/S could run in the background when machine activity was low to protect HDs. A bright idea that's died.

    2. Anonymous Coward

      Re: This is an ongoing argument with little consensus.

      Graham Wilson: "Computer designers and programmers like structure, fixed record lengths, tables of defined sizes, etc., but real-world information is all over the place so there's never been a simple solution."

      There have been record formats that tried to cover a lot of eventualities. Data is treated as a serial bit sequence with no prescribed multiple bit boundaries. Basically a type/length/data construct of bits. The type/length/data fields themselves are recursive, including continuation flags, so that they can expand until limited by the hardware/software architecture size constraints.
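
      A rough byte-oriented sketch of that type/length/data construct (the post describes it at the bit level; this is simplified to whole bytes, with a continuation flag in the length so it can grow as far as needed):

def encode_length(n):
    """Length as a chain of 7-bit groups; the high bit is a 'more follows' flag."""
    out = bytearray()
    while True:
        group, n = n & 0x7F, n >> 7
        out.append(group | (0x80 if n else 0x00))
        if not n:
            return bytes(out)

def decode_length(buf, pos):
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def encode_record(rtype, data):
    """type / length / data, with the length field free to expand."""
    return bytes([rtype]) + encode_length(len(data)) + data

def decode_record(buf, pos=0):
    rtype = buf[pos]
    length, pos = decode_length(buf, pos + 1)
    return rtype, buf[pos:pos + length], pos + length

rec = encode_record(0x01, b"x" * 300)   # the length needs two bytes here
rtype, payload, _ = decode_record(rec)
print(rec[:4].hex(), "->", rtype, len(payload))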

      There will always be format constraints imposed by the hardware/software architectures - or media reliability and size. However these shouldn't need to be visible at application level. Unfortunately there will be some apparently never-ending things that always have to have an agreed limit to precision - like Pi.

      1. Graham Wilson

        @A.C. -- Re: This is an ongoing argument with little consensus.

        Totally agree.

        To an extent that post was to draw a parallel with the way ordinary users perceive large data problems. From experience, IT types fall broadly into two camps: those who appreciate the densest code ever produced in a single line of RPG and those who prefer to mirror the real world in IT. As I've always had to interact with ordinary users, I'd probably put myself into the latter class.

        Until recent years, IT, often by necessity, was a very cryptic process for most people. Nowadays, people expect IT to reflect much of what they've traditionally done sans IT--books, art etc. It stands to reason that users will push systems with bigger/large data models.

  7. Anonymous Coward

    Network protocol information is generally poorly structured and has large volumes that can't be quickly analysed to a human level of abstraction. Network protocol analysers in the past have produced tabular summaries - but with no universal structure that catered for mixed protocols and layers.

    Such tabular analysis can be done - but it takes a lot of different preprocessing algorithms to turn the raw data into something that tabular tools can handle.

  8. Stuart Dole

    Big Data?

    I supervised a project that stored images as BLOBs - that went pretty well (except that Java doesn't like unsigned bytes). Where we got hammered was trying to store numbers with a large dynamic range. For instance, how about a database of physical constants? You'd need to be able to have very large numbers (Avogadro's number), and very small (the Planck constant), maybe throw in "c", the mass of the electron, etc. SQL seems very poorly designed for scientific work - that is, BIG (and small) numbers... One of the roots of this problem is that everything needs to be converted to text to store or retrieve it (at least for the TCP/IP part of the journey), and I've seen floating point values mangled badly in this process.

    1. BlueGreen

      Re: Big Data?

      > Where we got hammered was trying to store numbers with a large dynamic range

      err, what? Doesn't your db allow floats in the standard ieee format (with restrictions such as no nan/infs)? MSSQL does and so does mysql, I would have expected they all do. Unqualified 'float' (ie. not float(n)) should do the trick. Check your docs.
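
      A quick check of that point, using SQLite as a stand-in (its REAL column is an 8-byte IEEE 754 float, much like an unqualified FLOAT elsewhere):

import sqlite3

# Planck's constant to Avogadro's number spans roughly 57 orders of magnitude --
# comfortably inside an IEEE 754 double (about 1e-308 to 1e308).
constants = {"avogadro": 6.02214076e23, "planck": 6.62607015e-34, "c": 299792458.0}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE constant (name TEXT PRIMARY KEY, value REAL)")  # REAL = 8-byte IEEE float
conn.executemany("INSERT INTO constant VALUES (?, ?)", constants.items())

for name, value in conn.execute("SELECT name, value FROM constant"):
    assert value == constants[name]   # bound as binary, not text, so it round-trips exactly
    print(name, repr(value))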

      > One of the roots of this problem is that everything needs to be converted to text to store or retrieve it (at least for the TCP/IP part of the journey),

      No, that has to be rubbish. The network would never do that; it's too low-level & just a byte stream. Your DB driver might but... seems so unlikely.

      Weird.

  9. Paul Johnson (vldbsolutions.com)

    Yet Another Big Data Article...

    There have been, and will no doubt continue to be, many attempts to arrive at a succinct description of what constitutes 'big data'. For me, it's any dataset that can't easily be manipulated given the compute resources on hand. For many decades the IT industry has seen the world through the narrow lens of structured data held in a DBMS and managed by SQL. And mighty lucrative that's been for some.

    Two of the three main V's (volume and velocity) that have been used to describe 'big data' are not enough on their own to compromise large parallel, SQL-based DBMS systems such as Teradata, Netezza and Greenplum. The real challenge comes with the addition of variety to the mix - those use cases where the data does not lend itself to tabularisation, which in turn makes life impossible when SQL is the main/only/preferred data manipulation tool available.

    With the advent of Hadoop, the new paradigm of parallel processing outside of a DBMS using procedural languages (i.e. not SQL) has opened up new data processing possibilities. This has allowed variety - not volume or velocity - to be handled economically for the first time at scale.

    The main issue for me with this new paradigm is that the rest of the data processing world is highly SQL-centric. Just like Cobol on the IBM mainframe, that's not going to change any time soon. The 'new' (Hadoop/NoSQL) and 'old' (DBMS/SQL) data processing worlds will have to learn to play nicely for the former to enter the mainstream, no matter how cheap or in vogue it becomes.

  10. deadlockvictim

    Photoshop Plugins

    Developers have been handling digital pictures programmatically for the last 25+ years. I wonder if the time has come for functions which work like Photoshop plugins.

    If the current paradigms don't work well with new data, then we need to find appropriate ways to work with it.

  11. Anonymous Coward

    "if it isn’t valuable, why are you storing and analysing it?"

    That is a question that isn't asked often enough. Too often the data just gets stored in the hope of maybe someday being useful, for something, whatever that may turn out to be.

    Sometimes this just-in-case storing seems hardly a problem, for example because the data was yours anyway and looks otherwise innocuous--and hopefully it'll stay innocuous. Sometimes, though, it becomes something else entirely. Something Paul Ohm calls a database of ruin.
