In the first of two articles on cloud-based data storage, we shall explore the drawbacks of relational databases. There are two types of databases you can use when hosting data in the cloud: a relational database or one that is so not so relational. For example, Amazon offers a choice of SimpleDB (not so relational) or MySQL …

COMMENTS

House rules Send corrections

This topic is closed for new posts.

Tuesday 17th May 2011 14:33 GMT Charlie Clark

A ... worker blames his tools

"The second issue is that relational databases are a poor fit for most software development."

Chris Date and others would beg to differ on this. Relations are objects, they are just not necessarily *objects* in the application layer (programming language runtime) so you need descriptors or utilities to manage the conversion. Unfortunately SQL is not particularly suited for programmatic manipulation. Such a pity as the underlying maths has some nice constructs. MULDIS-D is at least one attempt to communicate from the application layer to the data layer without using SQL.

As to their suitability: there are loads of problem domains for which the relational model is eminently suitable. The fact that you can get them to guarantee data consistency and integrity cannot be undervalued. Most of them come with optimised indexing and data types such as GEODIS for dealing with geographical data very efficiently.

6 0
Tuesday 17th May 2011 15:08 GMT Steven Knox

HA!

"The second issue is that relational databases are a poor fit for most software development. "

Most real-world data is relational. The class defined by the term "relation" is a defined subset of the class defined by the term "object". Any developer who can't go from there to developing efficient OO-code that works well with relational data needs to find a new career.

MOST efficiency problems fitting OO development to relational data come from developers who only understand OO development, and can't do set-based programming. So they write these massive loops to load rows into an object one at a time, perform a minor operation on that object and then save the object back into the row. Switching to a non-relational database hides, but doesn't help this kind of hack programming. Using a simple UPDATE statement does.

5 0
1. Tuesday 17th May 2011 16:02 GMT Is it me?
  
  Have to agree
  
  Most OO developers really don't understand database, or efficient programming. Not so relational databases provide efficient low cost OO code, but compared to native C and API SQL calls, OO performance is S%^t.
  
  But who needs efficient code crafted by skilled IT professionals these days.
  
  1 0
  1. Tuesday 17th May 2011 18:42 GMT David Dawson
    
    who needs efficient code?
    
    Well, the possibly unpalatable answer is, not that many.
    
    In many(most?) applications, development speed is what matters, not so much runtime speed; as computers are really quick and can cope with slow, but readily understand and easily maintained code.
    
    This is why it's used, to improve developer productivity, lower the skills required, so lowering costs; and it still works well enough....
    
    1 2
    1. Tuesday 17th May 2011 20:04 GMT Charlie Clark
      
      Yes, but...
      
      A sequential scan will never beat an index. Intelligently indexing the data in the first place means you can write less "efficient" application code. They are also key to scalability as a sequential scan is probably a good definition of a bottleneck. After that you want to reduce the number of connections between application and database as setting up sockets has a huge cost.
      
      The NoSQL stuff seems to be okay for non-relational work where the indexing is delegated to a full-text anyway - hashtags are the poor man's foreign key - and none of the "stuff" (difficult to call the contents of some online services data) is indispensable.
      
      1 0
      1. Wednesday 25th May 2011 13:52 GMT David Dawson
        
        Eh?
        
        NoSql dbs of various types have very good support for indexing.
        
        Indexing is not a relational concept at all. So, while I certainly agree that a sequential scan is slower than an index, I'm not sure quite what thats got to do with relational versus anything else.
        
        Not all nosql is key==value. Some are, eg memcached. Others are as radically different from each other as radically different from each other as they are from a relational db.
        
        eg, neo4j is a graph database, mongo is a document store. Both are NoSql, but pretty much the only thing they have in common is that they aren't relational DBs. both have very good support for indexes, using different methods.
        
        0 0
2. Tuesday 17th May 2011 18:41 GMT Anonymous Coward
  
  I am not so sure
  
  I think a lot of efficiency problems come from developers who don't understand OO and don't understand databases.
  
  Usually their only interest seems to be CV++, so introduce a whole suite of new technologies that they don't understand as well to slow it all down
  
  1 0
Tuesday 17th May 2011 16:02 GMT Spearchucker Jones

New tech, new following.

"The second issue is that relational databases are a poor fit for most software development. "

Agree with Steven and Charlie. I find these days that most advocates of NOSQL are also fans of the latest crazes like MVVM - and more often than not come from a world in which Hibernate was shoe-horned into solving all problems (badly). Java and .NET are both great at providing gimicks that try to make programming as simple as, say, Ruby. This is not without it's merits - it gets inexperienced devs productive quickly - but it's a roblem when you have more than two or three models in a view. In which case Hibernate is no longer your friend, much less a NOSQL DB. And that's not even mentioning ACID.

3 0
Tuesday 17th May 2011 16:02 GMT Le Dao

Object Relational Cacheing layers abound

Don't understand the article really why is it so limited in scope.

Amazon offer MySQL and SimpleDB yes but you can run Cassandra or SQL Server, db2 pretty much what you like with different degrees of admin and support. Surely these database types do different things well. Relational databases manage data to a set of rules and make it concurrent to many users in a controlled and predictable way. They were not intended to store less structured data such as a behaviour of a customer on a web site (which is teh basis of targeted ads or recommendations) or deal with concurrency of little changing (relatively static data) accessed by large numbers of users that is the backbone of many internet sites. The majority of these databases are opensource NoSQL or not RDBMS systems have evolved due to the limitations of traditional RDBMS in this space and the lack of enterprise software in this space, enterprise will resist adopting these open systems and of course pay database vendors shed loads of wonga to adopt a web scalable solution to support the delivery of content or product information to the users, who then buy through a systems sat on an RDBMS to ensure the data is structured and reported on easily.

Its also difficult to geo cache data close to the internet with RDBMS technology but a breeze with solutions like Cassandra so understanding the problem and characteristics is important in choosing your solution

How else will the Database market maintain its greater than 30% or revenue spend within IT licence model, other than spreading the reach of an RDBMS into the solution space.

The debate for me should be what are the decisions needed to identify the database type you use for a particular aspect of a solution, getting this understood and socialised will help stop people launching systems on the wrong platform or price point.

The comments about OR mapping approaches and the issues of developers connecting the application directly to a database have been around for years, I agree with the comments but do wish developers were taught a mature way to persist data. Using a framework such as Spring / Hibernate in Java or .Net or an ESB pattern or product will avoid the majority of pitfalls due to the learned knowledge these solutions encompass. There will always be some one who wished to reinvent how to make the wheel sadly.

0 0
Tuesday 17th May 2011 18:36 GMT Paul Wells

So part two is all about PICK?

Long before SQL, still faster, still easier to use.

0 0
Tuesday 17th May 2011 18:42 GMT Anonymous Coward

What does this have to do with cloud?

I failed to see the cloud aspect in this article whatsoever. And why is everybody getting pulled into discussing about databases' age old problems that has been discussed, addressed, taught gazillion times?

The only real new deal is how 'cloud' database such as those offered as a service by Amazon compare to traditional setups in terms of ROI and performance. Otherwise I fail to see the relevance in 'cloud development'.

1 0
Tuesday 17th May 2011 18:57 GMT Matt 21

Scaling

RDBMSs scale as well as any IT system. Techniques like Sybase Cluster Server cope very well and allow you to add on and remove extra performance fairly easily.

Putting things in the "cloud" is no magic solution where computers magically add themselves when you need them and dissolve away when yo don't. Any serious surge in power still needs to be planned and when you scale across uneven hardware you end up with some users complaining and some happy with performance.

2 0
1. Tuesday 17th May 2011 20:04 GMT smooth1x
  
  Scaling - better
  
  So does Informix with Object relational abilities built in (including in the optimizer) since 1994. With MACH11 as well as Enterprise Replication and multiple HDR Servers you can have multiple nodes either local or remote or just replicate partial copies. Since IBM bought Informix,tried to kill it (thanks a lot Janet Perna) and failed to market it in case it beats DB2 though no one thinks it will survive hence fewer people choose the better technology.
  
  0 0
Tuesday 17th May 2011 20:04 GMT SJG

I wish it were so simple.

This articles confuses the apparent ease of development with the overall success of a large computer system. There is no question that a NoSQL approach makes things easier for the developer who works with OO languages to build the main front-end system - simply store the full OO model into a single database object. This approach works on both SQL and NoSQL systems and there are plenty of methods to achieve this which completely hide any SQL from the developer.

Unfortunately, computer systems need to do so much more than CRUD on single objects. Just at the simplest level, crediting one account and debiting another account introduces the need for transactional integrity across objects. Producing a report on all orders by product for a day would involve retrieving, unwrapping and accessing all order objects with their embedded order lines and product details before even a simple count by product could be delivered. Both of these relational database with ACID support achieves with ease.

The scalability issues discussed in the document are not relational problems, they are data consistency problems with clustered systems - if NoSQL systems build with ACID data integrity they will also face the same challenged - see http://dbmsmusings.blogspot.com/2010/08/problems-with-acid-and-how-to-fix-them.html for an excellent blg on this.

The major issue raised with mature systems today is not the initial development cost nor the change agility - it is the poor data quality that exists within those systems that prevent much of the information value being realised. In many cases systems become black holes of data where the quality is so bad that reporting tools cannot accurately show the current status of information in the system.

The current trend to move to name-value pairs for all data which is implicit in many of the NoSQL products can only make this problem worse. Whilst Java is strongly typed as is C++, the use of name-value pairs breaks the strong typing such that data issues become much more likely. Even Google's Go still maintains strong typing.

The concepts in discussion here should be strongly-typed vs no typing; transactional integrity with ACID or none; and even strong schema definition vs no schema.

The price you pay for fast and agile development is dropping all those things that protect systems from data corruption, and it's often many years down the road when data corruption comes home to roost. As the vast majority of name-value pair based systems are less than 5 years old, I think the jury is still out on whether they will be able to deliver truly successful large systems.

0 0
Tuesday 17th May 2011 20:05 GMT DeepThought

I've read this twice but...?

I have an MSc in Computer Science specialising in databases, but I don't immediately understand what this article is saying.

I can see that sometimes you will have trouble scaling your database servers unless perhaps you pay a lot of money for sophistcated versions of RDBMSs. However it's not obvious how the simpler file stores on eg AWS overcome the same problems. That really needs to be explained.

Some people like ORM approaches. Some people no doubt still write COBOL programs that happily talk to sequential files. Databases however are still the preferred choice in the majority of situations, so any Cloud service needs to provide RDBMS functionality.

2 0
1. Wednesday 18th May 2011 13:24 GMT Anonymous Coward
  
  Confused? You will be...
  
  Agreed! This article explains nothing...
  
  0 0
Wednesday 18th May 2011 09:03 GMT Adam Inistrator

Programmer half life

SQL programmers spend half their lives writing programs to split documents into relations .. and then spend the other half joining them back up again. NoSQL programmers simply use one bird to kill the other when there is insufficient value in having the data as relations.

0 0
1. Wednesday 18th May 2011 10:26 GMT Anonymous Coward
  
  re: Programmer half life
  
  If that is what you see then you have been working with the wrong SQL developers
  
  1 0
Wednesday 18th May 2011 11:13 GMT akapuya

Relational models are the foundation of most applications, not an object model

The object model is indeed mapping to the relational model, but the right way to design a relational model is not by starting with the object model but with the pure data model. when you have a good relational model, you can use different technology stacks like Java, J2EE, PHP, Ruby to work with the data, it is not right to assume only object view is useful for the data.

Regarding scale, There are few new database implementations like Xeround which is built for performance and scalability by distribution. using such solutions you gain scalability of NoSQL tools but the simplicity and other advantages of SQL.

You are welcome to read this post on NoSQL vs SQL:

http://blog.xeround.com/2010/12/nosql-the-sequel

0 0
Wednesday 18th May 2011 12:07 GMT David Dawson

RDBMS is not king

The NoSQL 'movement' is a reaction against the mantra that relational databases are the best tool for the job of storing data. This is a very subjective statement, and one open to a lot of valid criticism, as the you cannot be so certain that they are suitable for _every_ job.

I think the reaction is a good one, there are horses for courses. RDBMS' of various flavours are extremely powerful tools, and are stuitable for many, or maybe most data storage jobs. They are not suitable for everything, however.

I'm bored of implemeting versioned data, for example. It can be done in an rdbms, of course it can, but the answer is, should it? Is there not a better model? The joins with multiple versioned data sets can be horrendous, and slow!

On the comment above on data consistency, this is a very valid concern. the no sql databases are still quite young, and there are implications of the models they promote are not yet fully understood.

An example from my work where an rdbms was the wrong answer.

I was implementing a message matching system (SMS). Message goes out, and delivery receipt comes back from the network. This needed to be matched against the original message id and the message marked as delivered.

Now, this was implemented in an oracle db. It was OK, we were getting up to 30 matches a second during peak load. This was the major limit in the scalability of the system.

Instead of this, we switched to a model where the outgoing message id was written a file, sorted using GNU sort for a simple sequential match. Every 5 seconds or so the incoming receipts were batch matched against this file. Potential throughput went to multiple thousands per second.

The results were then batch updated every minute or so into the DB for reporting purposes.

So, in summary, the DB was great for reporting and maintaining data. For the match use case, it was a drag, the wrong tool for the job.

This is a_very_ simple example; but then, I'm trying to make the case that you should use the right tool for the job, and a relational database is not always that tool.

I'm not against them, by any means, they're very good at what they do, but so is gnu sort, a file system and a bit of groovy scripting.

0 0
1. Wednesday 18th May 2011 18:20 GMT John Miles
  
  DB Speed
  
  I am a little surprised with your example - Oracle is capable of delivering much better performance than that.
  
  0 0
  1. Wednesday 18th May 2011 19:12 GMT David Dawson
    
    Well...
    
    The answer would be, it depends....
    
    Given the schema we had (very denormalised, heavy data validation), and the data volumes (multi 10s of millions of rows, with many being read, updated and inserted as a background load) and the general system load (this system was built around the database and was generally a heavy user) This was what we got.
    
    Other queries were going on, the 20/ sec wasn't all of the system load by any means, there was (is?) many, many times more going on. It was the sustained average we could expect out of this task, as it was going on constantly, fed by a queue that went up and down through the day, but only went to 0 in the late evening as incoming load dropped off.
    
    So, sorry if I wasn't clear/ lied by omission.
    
    I'm not against rdbms' of any flavour per se. I have just seen that its not always necessary to look at a problem through the prism of a relational model. It was very beneficial to remove the load from the database in the example above, and the system was significantly more performant because of it.
    
    We could've upgraded the database systems, added more capacity, but it hardly seemed worth it when we had a much cheaper, quicker to implement and almost certainly faster solution.
    
    0 0
    1. Wednesday 18th May 2011 20:28 GMT David Dawson
      
      Sorry, that should be ..
      
      Very _normalised_
      
      Oopsie, makes no sense that way round. oh well.
      
      0 0
    2. Wednesday 18th May 2011 22:18 GMT John Miles
      
      Re: Well
      
      that sounds more like what I would expect
      
      I have seen too many performance fixes that didn't understand the problem and the solution makes things much worse long term.
      
      0 0
Wednesday 18th May 2011 12:30 GMT Britt Johnston

aye of customer

Does this cloud thing upload several of my flat tables, match key columns with point and click, and assemble the data in a consolidated and more useful way?

You might have hit on a killer app for SAAS - customers won't care so much about how it works. There could even be humans coordinating the automatons in the background.

0 0
Wednesday 18th May 2011 13:23 GMT Mitch Kent

"Deep Dive"

For the Next article with "Deep Dive" at the forefront, can we please have a little more than a summary?

3 0
Thursday 19th May 2011 03:21 GMT Jean-Luc

There are signs that somebody's out of her depth...

When looking at SQL code, SELECT DISTINCT ... is more often than not a shorthand for "there is a cartesian somewhere in this query, but I can't figure it out".

Selecting data out of a database to modify that data with batches is another (ex: order fulfillment). Your best bet for speed is to leave the data _on_ the server and manipulate it with set-based update/insert/delete. Not "I will select instances, loop through them and associated objects and then write back my changes through my ORM".

With OO, signs are deep class hierarchies with little in the way of object compositions. Using multiple inheritance, when available, without good reasons. Or being Pattern-happy on trivial stuff.

Yes, SQL, even by itself, is difficult to use with certain types of problems: parts-of-parts data, data with date ranges, GIS, data that has to be mapped from/to deep object class hierarchies (but see point above).

Standard databases don't do massive scaling as efficiently as some of the, very interesting, new databases coming out. And they are overkill for much of the CRUD-oriented web frontend work. That's why I am reading up on things like CouchDB and MongoDB. Neat stuff.

Still objects and SQL mostly play well together in my experience. To date I am unsure what the object-relational impedance mismatch is supposed to be about. A big part of the problem is that many coders are not as clever as they could be with either OO or SQL so they find the going harder than it really is. They then blame it on OO-SQL not being aligned, especially when their ORM gets in the way.

1 0