NoSQL and Big Data crashed into the ordered world of relational architectures a few years back, thanks to services like Twitter, Facebook and LinkedIn. But while concepts such as key value stores and content-specific stores have certainly enriched our environments, the downside to their arrival is that it has created quite a bit …
Non-issue. ODBC will allow access and data sharing with middleware. We do that daily with Oracle (which we are migrating away from), SAP and a few smaller MySQL instances.
SQL Server is a better choice. The BI side is also much better than SAP's in terms of price.
So a number is atomic and can be operated on with a condition like "salary > 50000". A string is non-atomic because it suits conditions like "third letter is 'n'". But this just reflects two patterns of comparison and decomposition. One might equally say "third digit of salary as binary is 1", or "name < 'Jones'" (neither is a very realistic example, but the point is clear).
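To see how symmetric the two patterns really are, here is a minimal sketch (hypothetical table, names and salaries are my own) showing SQL happily comparing a number whole, decomposing a string, comparing a string whole, and decomposing a number:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("Jones", 48000), ("Smith", 52500), ("Banks", 61000)])

# The "atomic" comparison: salary > 50000
high = [r[0] for r in conn.execute(
    "SELECT name FROM staff WHERE salary > 50000 ORDER BY name")]

# The "non-atomic" comparison: third letter is 'n'
third_n = [r[0] for r in conn.execute(
    "SELECT name FROM staff WHERE substr(name, 3, 1) = 'n' ORDER BY name")]

# And the symmetric cases: a string compared whole, a number decomposed
before_jones = [r[0] for r in conn.execute(
    "SELECT name FROM staff WHERE name < 'Jones' ORDER BY name")]
third_digit_zero = [r[0] for r in conn.execute(
    "SELECT name FROM staff "
    "WHERE substr(CAST(salary AS TEXT), 3, 1) = '0' ORDER BY name")]

print(high, third_n, before_jones, third_digit_zero)
```

All four are legal queries; the engine doesn't care which way round you slice the value, which is the commenter's point.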
Also, the author writes of a function that scans satellite images and looks for aircraft. It might return data like:
Delta winged – 0
Swept winged – 3
Date of Birth
Straight winged - 2
Rotating winged – 2
This looks like a pretty crap function. Why does it return a DOB when looking at aircraft images? Why are two wing types coded identically?
And why did he throw away his original data? That one went right over my head - only a moron assumes that he has extracted every piece of information from original data.
In both models you can only get answers to questions where you (a) have the information necessary to provide the answer and (b) know where the data is. All this crap about schema and non-schema is just babbling nonsense.
If you build an RDB where the structure of the DB or the datatypes you specified in design lock you into the questions, and types of questions, you can ask, then you shouldn't be building RDBs. The only reason to lock down data in the way described is to screw the customer later.
Both of the data models discussed have their uses - the difference is that one approach uses a database and the other one uses data indexing. They are not even mutually exclusive. You could easily use both approaches in the same data model. If you won't use RDBs cos you're a NoSQL fanboi (or vice versa) then you don't know enough to be trusted with anyone's data - go back to school and learn about abstract thinking - most people get it by their teenage years.
"And why did he throw away his original data? That one went right over my head - only a moron assumes that he has extracted every piece of information from original data."
Because the volume of original data may make it financially and/or technically impossible to keep.
Take, for example, the Square Kilometre Array.
In operation, the SKA will generate 1TB/sec of pre-processed data.
In an RDBMS, the level at which the data is deemed to be atomic is determined by the database design. Each column holds one piece of information.
If the system using the database needs to be concerned with the individual characters in a string, then a string is no longer atomic, and storing that data as a string breaks a fundamental RDBMS rule.
To put it another way, you wouldn't store comma-separated values in a database column; you would split those values out into proper columns or rows depending upon the structure of the data. The exception would be if the database, or the systems that used it, were designed not to care what was in the string or that it was comma-separated values; it is just a string to be stored, read and reproduced as-is.
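A quick sketch of that splitting (hypothetical product/colour tables of my own invention): the comma-separated string becomes rows in a child table, and the lookup becomes a plain relational query instead of a string scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE product_colour (
    product_id INTEGER REFERENCES product(id),
    colour     TEXT)""")

# Instead of storing "red,green,blue" in one column...
raw = [(1, "widget", "red,green,blue"), (2, "gadget", "black,red")]
for pid, name, colours in raw:
    conn.execute("INSERT INTO product VALUES (?, ?)", (pid, name))
    # ...split the CSV out into one row per value.
    conn.executemany("INSERT INTO product_colour VALUES (?, ?)",
                     [(pid, c) for c in colours.split(",")])

# "Which products come in red?" is now an equality join,
# not a LIKE '%red%' scan over a packed string.
red = [r[0] for r in conn.execute(
    """SELECT p.name FROM product p
       JOIN product_colour pc ON pc.product_id = p.id
       WHERE pc.colour = 'red' ORDER BY p.name""")]
print(red)  # ['gadget', 'widget']
```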
This is also why database designers HATE free-text fields: they always end up being abused as data stores for non-atomic data, as a shortcut for changing the database and/or user interface.
(I think the DOB was a typo that got past the editor)
Re: Eh? @ AC 15:50
"If only I had thought ahead! If only I had stored the image file itself (in some non-tabular way), "
Suggesting that keeping it was an option that he chose not to exercise.
Re: Eh? @ AC 15:50
Do you think that perhaps, just perhaps, it is possible that the entire situation was invented for the benefit of the article, and that the pictures, the database and the problem of not having stored them don't really exist?
I suspect the DoB is a mistake in the text. I'm assuming the numbers are not codes but counts of the planes of each type detected in the images.
A string is non-atomic because you might like to search inside it. So if you have a table where one column stores 140-character tweets, you might like to search for all tweets that mention "the Register" and get back all the rows that satisfy that condition. In that case the string we are interested in can appear anywhere in the field we are looking at, and the position where it appears is unimportant. For sentiment analysis we may be interested in the position of the string, so we might want to write a query that returns all the tweets with "the Register" at the end of the tweet, because these tend to be more positive. Again we are looking inside the field and are interested in the properties the text has.
Numerical values will normally be compared against external values (count > 10) or against a value in another field. You would very rarely in the real world (and I struggle to find an example) want to return all rows where the tens digit is greater than the units digit, i.e. compare within the decimal number. Nor would we typically look for numbers that contain a given value anywhere in the number. Would we ever write a query that returns all the stores holding a stock of oranges where the number of oranges had a 12 in it, i.e. 12 oranges, 120 oranges, 312, etc.?
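Both patterns can be written down, which makes the asymmetry easy to see. A small sketch (invented tweets and stock figures): the string search is the bread-and-butter query, while the number search is the contrived one nobody ever asks for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (body TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)", [
    ("just read the Register",),
    ("nothing to see here",),
])
conn.execute("CREATE TABLE stock (store TEXT, oranges INTEGER)")
conn.executemany("INSERT INTO stock VALUES (?, ?)",
                 [("north", 120), ("south", 45), ("east", 312)])

# Routine: which tweets mention "the Register", anywhere in the field?
mentions = conn.execute(
    "SELECT body FROM tweets WHERE body LIKE '%the Register%'").fetchall()

# Contrived: which stores hold an orange count with '12' in it
# somewhere (12, 120, 312, ...)? Legal SQL, but a question almost
# nobody ever needs answered.
odd = [r[0] for r in conn.execute(
    "SELECT store FROM stock "
    "WHERE CAST(oranges AS TEXT) LIKE '%12%' ORDER BY store")]

print(mentions, odd)
```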
If you were storing tweets you would want to break each tweet down before storing it:
A back of a fag packet design would give you:
1 table containing all @ mentions in the tweet
1 table containing all hashtags in the tweet
1 table containing all words in the tweet (with a position indicator)
other tables might store all sentences or recognised word combinations in the tweet, of which "the register" might be one. The ETL feed would run against a master data table, which could itself be populated from data mined from the previous million or more tweets.
Storing the data in this way would mean that it could be properly indexed in the RDBMS and avoid the use of the dreaded '%the register%' comparison which would kill query performance.
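A back-of-a-fag-packet sketch of that design (table and column names are my own invention): words stored with a position indicator and indexed, so the phrase lookup becomes equality joins on the index rather than a leading-wildcard LIKE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE tweet_word (tweet_id INTEGER, pos INTEGER, word TEXT);
CREATE INDEX ix_word ON tweet_word(word);
""")

def load(tweet_id, body):
    """Store the raw tweet and break it into (position, word) rows."""
    conn.execute("INSERT INTO tweet VALUES (?, ?)", (tweet_id, body))
    conn.executemany("INSERT INTO tweet_word VALUES (?, ?, ?)",
                     [(tweet_id, i, w) for i, w in
                      enumerate(body.lower().split())])

load(1, "just read the register")
load(2, "register here for tickets")

# Tweets containing the phrase "the register": two indexed equality
# lookups joined on adjacent positions - no '%the register%' scan.
hits = [r[0] for r in conn.execute("""
    SELECT a.tweet_id FROM tweet_word a
    JOIN tweet_word b ON b.tweet_id = a.tweet_id AND b.pos = a.pos + 1
    WHERE a.word = 'the' AND b.word = 'register'""")]
print(hits)  # [1]
```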
Re: Numbers as strings
look up Benford's law:
Re: Eh? @ AC 15:50
Right - but he was plainly talking about a situation likely to arise when designing data models and having to make a choice about what to do with the data - not an extreme situation such as you describe, which is not going to have a standard data treatment anyway.
So - the point, as it applies to the discussion put forward in the article, is still: "throwing the original data away is the wrong choice to make".
"And why did he throw away his original data? That one went right over my head - only a moron assumes that he has extracted every piece of information from original data."
I think that is the point that the author was actually making. This is made clear in the bit where he says:
"You can see where this is going. In selecting the functions I choose to run I have defined exactly the questions that I can ask of the data. If I later want to know the number of King Penguins in the image, tough; I can’t."
"If only I had thought ahead! If only I had stored the image file itself (in some non-tabular way), I could write and run my new function and count the King Penguins. But no, I was foolish; I believed those idiots who told me that the relational model allows any question to be asked of the data."
I think it is relatively important to read the entire article.
RDBMS vs NoSQL
Try not to throw valuable data away. It does not matter whether you store it in an RDBMS or NoSQL: if you failed to store it, you will always fail to find it with a search.
A number of RDBMSs allow you to store semi-structured data (think JSON or XML), just like their NoSQL brethren. This lets you manage structured and semi-structured data in the same system while keeping ACID and all those things that are important. Just don't mess (update/delete entries) with the JSON/XML structures too much, as this can be a big DB hit.
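As a small illustration of semi-structured data living inside a relational system, here is a sketch using SQLite's JSON functions as a stand-in for the JSON/XML support in the bigger engines (this assumes an SQLite build with the JSON functions, which are standard in recent releases; the table and payloads are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO event VALUES (?, ?)", [
    (1, '{"user": "alice", "action": "login"}'),
    (2, '{"user": "bob", "action": "logout", "device": "phone"}'),
])

# A structured query over the semi-structured payloads - note that
# row 2 has an extra "device" key and nothing breaks; there is no
# fixed schema inside the JSON column.
logins = [r[0] for r in conn.execute(
    """SELECT json_extract(payload, '$.user') FROM event
       WHERE json_extract(payload, '$.action') = 'login'""")]
print(logins)  # ['alice']
```

The same rows still sit inside one transactional, ACID-compliant database, which is the point being made above.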
Do indexing - yes, even in NoSQL; it saves your big HBase batch jobs from taking far too long.
Most importantly, understand your data and its relationships. If you don't know that a field exists because your developers are a bit vague, then you will never be able to search on it - unless it is in an RDBMS of course, where you will have that helpful schema to fall back upon.
NoSQL: take any relational dataset, denormalise the living bollocks out of it, store choice bits as objects, and then spend an order of magnitude more dosh than you would scaling an ordinary RDBMS trying to make it scale at all once the questions you based your denormalisation on change. No thanks - caching purposes only.
Database de-normalization Tradeoff
Both strategies will enable you to do what you want; they are just implemented differently. The different implementations are a trade-off between code manageability and data scalability. E.g. the question 'find all the employees in department X who are paid less than $45,000' can be implemented in a single database or in two databases, one storing employee data and the other payment data. In the first case it is a simple Select From Where. The second case requires selecting 'paid less than $45,000' from the payment DB and then doing look-ups in the employee DB.
The first method is better for code maintenance, whilst the second may enable you to scale higher. Even with a single DB you can scale via table partitioning, multiple CPUs and multiple tablespaces located on different SANs. The second solution allows you to scale higher with cheap hardware.
The single-DB solution is better for complex queries such as 'find all the employees in department X who are paid less than $45,000 and have been on X, Y and Z training courses'. This is a simple Select From Where. In the multiple-DB scenario you have to know a little about the underlying data to write an efficient query, e.g. how many people are paid less than $45,000 and how many have been on X, Y and Z training courses. Not knowing this will result in extremely poor DB performance.
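The salary example can be sketched concretely. Here are the two approaches side by side, using two separate SQLite connections to stand in for the split databases (all table names and figures are invented): the split version has to pull the id list out of one store and feed it into the other, which is exactly the application-side join you'd get for free in a single DB:

```python
import sqlite3

# "Employee DB" and "payment DB" as two separate stores.
emp_db = sqlite3.connect(":memory:")
emp_db.execute("CREATE TABLE employee (id INTEGER, name TEXT, dept TEXT)")
emp_db.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                   [(1, "Ann", "X"), (2, "Bob", "X"), (3, "Cat", "Y")])

pay_db = sqlite3.connect(":memory:")
pay_db.execute("CREATE TABLE payment (employee_id INTEGER, salary INTEGER)")
pay_db.executemany("INSERT INTO payment VALUES (?, ?)",
                   [(1, 40000), (2, 50000), (3, 30000)])

# Step 1: find everyone paid less than $45,000 in the payment DB.
underpaid = [r[0] for r in pay_db.execute(
    "SELECT employee_id FROM payment WHERE salary < 45000")]

# Step 2: look those ids up in the employee DB, filtering on department.
# With one database this whole thing would be a single SELECT with a JOIN.
marks = ",".join("?" * len(underpaid))
result = [r[0] for r in emp_db.execute(
    f"SELECT name FROM employee WHERE dept = 'X' AND id IN ({marks})",
    underpaid)]
print(result)  # ['Ann']
```

Note how the efficiency of step 2 depends on how many ids step 1 returns, which is the "know a little about the underlying data" point above.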
In the end a competent database developer knows what tradeoffs to make to ensure scalability whilst maintaining code manageability.
Re: Database de-normalization Tradeoff
"In the end a competent database developer knows what tradeoffs to make to ensure scalability whilst maintaining code manageability."
It seems to me that we have let the children play with our important data. Here is a reply I wrote on another thread several years ago:
I'll need my coat while I wait for the bus to go to the retirement day-centre...
Whilst NoSQL databases might be great for applications like Twitter or Google search data, they simply lack the transaction support needed for enterprise applications such as payroll. If Twitter loses a tweet or corrupts some post, it's not a big problem. Enterprise databases require consistent views of the data, fault tolerance and the ability to roll back in time.
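The transaction support being described looks like this in miniature (a toy payroll transfer, names and amounts invented): either both halves of the transfer commit, or neither does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("payroll", 100000), ("alice", 0)])
conn.commit()

try:
    # Two updates that must succeed or fail together.
    conn.execute("UPDATE account SET balance = balance - 60000 "
                 "WHERE name = 'payroll'")
    conn.execute("UPDATE account SET balance = balance + 60000 "
                 "WHERE name = 'alice'")
    raise RuntimeError("power cut mid-payment")  # simulated failure
except RuntimeError:
    conn.rollback()  # neither half of the transfer survives

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # {'payroll': 100000, 'alice': 0}
```

Without that rollback guarantee, a failure between the two updates leaves money debited from one account and never credited to the other, which is exactly why payroll can't live on an eventually-consistent store.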
NoSQL is a bit of a red herring in this discussion, since it covers an array of technologies that are only united by what they are not. If you are breaking it down like this, something like Mongo or Raven falls closer to a relational DB: their documents have structure, and you have to use that structure to query them in any sort of performant way. Where they differ from a traditional RDBMS is in how they approach CAP, and how that allows them to be scaled.
As a simple mental exercise...
...every time you see "NoSQL", substitute "dBase". I find it tends to break down the arguments quite succinctly and immediately calls to mind all the perils - lack of any real integrity constraints and being strongly tied to a particular system, for starters (true, SQL isn't portable either, but that's a little fettling as opposed to complete rewrites). It also shows that it's not a new idea; indeed, it is one largely rejected 20 years ago.
What's changed? It's a new generation of cowboys that never learned the lessons of the past and that intrinsically assumes that their particular requirements are somehow unique. Instead of "we've got a lot of critical data here so we need to protect it at all costs" the argument becomes "we've got a lot of critical data here so the rules about how to look after it don't apply to us - they're for lesser beings than us".
Re: As a simple mental exercise...
Not necessarily. The difference is mainly: how much do you want to pay for integrity constraints and a normalised schema? If the sheer amount of data to be stored means you cannot afford those two (look at the example of the Square Kilometre Array, above), then the choice is either no data at all, or data without integrity constraints and a normalised schema. This is exactly the kind of scenario "NoSQL" is optimised for.
Practical example: if I were to ask a DBA to run a query returning some 10^7 rows, I am almost sure to hear some grumbling back, most likely related to locks or execution time. However, if I run the same query on a NoSQL store, no one will notice and I will have results in seconds (I know, I've done that). Similarly, if I need to store some 0.5*10^9 rows of data day in, day out, and keep the data available for a few months (or years) to query later, I'm sure NoSQL will oblige, requiring only disk space as and when needed. SQL? I think the DBAs will not be very happy, unless some very interesting (and expensive) technological solutions are put in place.
RDBMS can do "sub-atomic"
A good train of thought, and I like the "atomic" theme, but I think it's over-polarising the issue a little.
It is possible to achieve the image-analysis examples as user-defined functions in Teradata, and Oracle has a massive range of industry-specific "cartridges" that achieve this type of functionality. It depends a lot more on the business reasons for applying it. In discovery mode, a more forgiving schema-lite environment is much more flexible for getting to grips with your problem (e.g. am I developing the best plane-spotting algorithm?). Once the approach has been hardened for use in an operational environment (e.g. face scanning at an airport), it goes onto the RDBMS platform to be integrated into broader analytics.