A data modeling challenge to the relational-is-always-best set

Posted May 10, 2011
Filed under: Uncategorized |

Relational databases are great for a lot of jobs, but not nearly as many as they are used for.

I’ll give what I believe is a classic example of relational overuse, and I challenge any relational expert to publicly explain why its a good fit. What I’m looking for is reasons why the relational model is better, not why mysql or Oracle or Postgres is great and why MongoDB or CouchDB or Couchbase or Membase sucks.

The use case: tracking orders for customer service.

Using a document-oriented database, you represent orders as documents. Each order has a repeating group of line items and some customer information, including a bill-to and ship-to address. Each line has a fulfillment status and possibly one or more notes. Retrieving the order to display either to the customer on the web or to a customer service agent requires all of one simple query.

Using a relational database, it’s a bit more complicated:

One table contains the order header
One table contains the order lines
If the database is “properly normalized” (say in 3rd normal form) the order header doesn’t contain the customer’s name but rather a foreign keys that points to a customer table
Similarly, the order lines don’t contain product names, but rather foreign keys which point to a product table
On and on with payment methods, addresses, currencies etc, until in real life (Oracle Applications, Tech Ref Manual here) you wind up with 126 tables for Order management, including 22 for Orders and Returns, 16 for Customers, and 11 for Items

Why so complex? First, the theoretical answer to why Orders and Order Lines is not enough: for your data to be properly normalized (which most relational experts would say is at least third normal form), every non-prime attribute should be non-transitively dependent on every candidate key; for example, the dependency of customer name on orderID is a transitive dependency of customer name on customerID and customerID on orderID. Confused? Don’t feel alone.

What’s the common sense explanation behind the theory? The logic is that each column in a table should depend on “the key, the whole key, and nothing but the key”. If you follow the logic, data will appear only once and only where it needs to be. This avoids what relational folks call “update anomalies” where different tables can have inconsistent versions of the same data. In this case, the customer’s name is not really dependent on the order, but on the customer account. By modeling this “properly”, you can avoid ever having two names for me attached to different orders. There is quite a bit of literature devoted to avoiding this problem; if you’re interested in reading more, you might start with this overview.

Now, here’s the rub: say I decide that after years of marriage I decide to give up my ideal of equality for men and take my wife’s name. Should all my old orders now appear to be ordered under my married name? Probably not, but that creates additional complexity. Now I need to either a) keep a copy of the name in the order, under the theory that the dependency is not transitive or b) add date effectivity to the names with yet more tables, as the name dependency is actually on user and time. Pretty soon, you’re well on your way to becoming Oracle applications with its 126 tables for Order Management.

If you’d like to see another example, Kyle Banker from 10gen has a great blog post with example data and code comparing a (MongoDB-based) document-oriented approach to storing product information to some real-world e-commerce frameworks.

The elephant in the room is whether the theoretical advantage around avoiding update anomalies and more efficient updates is worth the complexity. For my money, the answer is “no way” in the case of storing orders for customer service. If you can make the competing case, I’d love to hear it.

— Max

34 comments so far

Ciff Elam on May 10, 2011

We used to have this argument with “C vs. ASM”….

Hammer, nail, bang.

-XC

Reply
Ron Leisti on May 10, 2011

In this case, the customer’s name is not a transitive dependency on the order ID, it is directly determined by the order ID. This is because the customer’s name can change from order to order (whether the customer changes their name, or their name is simply incorrectly typed in any particular order). Even in the case of an incorrect name, you still want that incorrectness to be reflected in the order, because an order is a record of real-world history (and it shouldn’t be re-written in the future). So in this case, it isn’t that complicated; just add the customer’s name as an attribute of the order.

Reply
- Max Schireson on May 10, 2011
  
  Agree.
  
  Reply
jkf on May 10, 2011

This is a misunderstanding of when you should normalize. If the name should not be updated when the customer changes their name, then the name is what should go on the order not an id. This is not a violation of 3rd normal form at all.

Only if the name should be updated should it should be extracted to another relation. It all depends on the constraints of your situation. This is the only definition of “proper” that there is for relational modelling.

It seems strange to me that avoiding data contradiction is seen as more “complexity”. Constraints allow a programmer to reason about the data safely.

Anyone who has used a poorly designed database were duplicate or conflicting values are allowed understand the nightmare of trying to fix problems or modify code when you don’t know what you can trust.

The relational model is “better” because it lets you describe exactly what is possible in the data and what the data means in terms of the constraints, while a document oriented store does not provide any guarantees to the programmer, which makes things more not less complex.

Reply
- Max Schireson on May 10, 2011
  
  I agree with you that in the case of that dependency, 3rd normal form would actually include the name in the order – but that is in practice not how folks model it. There’s a part of my argument which is more properly a dispute with over-normalization not a dispute with the model – but when a model has been around for decades and is often badly misused (and I’ve seen far far worse than pulling out names when they shouldn’t be updated), I think its fair to question whether the model fits the use case well.
  
  I agree that relational is good at expressing constraints, but I still believe the result is often more complexity not less.
  
  Reply
  - jkf on May 10, 2011
    
    I agree there is a lot of confusion about how to normalize, but the solution is for developers to be educated about the relational model, not to abandon it for a less safe model they will also not understand.
  - Max Schireson on May 10, 2011
    
    I am hugely in favor of education. In 9 years at Oracle I found that many developers didn’t understand normalization or in general how to work with databases well.
    
    I am not at all in favor of abandoning relational, but I think it should not be the only choice and there are many cases for which it is not the best fit.
Rudiger on May 10, 2011

You’ve just discovered that a relational database can have problems representing history; a document-oriented database can have the same problems with “temporality” if it is not properly designed. In reality, a relational database can work perfectly well for historical data if you handle temporal changes (ie. a customer name change) with proper date “as of”-ing. You’d have the customer’s latest name and his history of names in the customer table. The name he used for an order would be linked to the name he had “as of” the order date. For more information, Google “temporal databases”; they’re frequently implemented on top of relational databases.

Reply
- Max Schireson on May 10, 2011
  
  Yes, they are a good solution to this problem, but I believe in many cases you can have a much simpler approach which avoids the problem entirely. Certainly not all cases.
  
  Reply
agirbal on May 10, 2011

Hey Max, great post for an interesting topic.
Though I think the customer name example is not the main argument, I still think the customer should be referred to with a foreign key, even if it creates an extra read.
The huge gain is embedding order lines within the order entry, and still have them fully indexed and easy to query.
Most pages displaying orders will do just 1 read to obtain all order info, instead of (1+N) reads where N is number of products bought (if avg products per order is 5, you just gained ~5x capacity on your server).
Also if products are immutable, you can embed more product info and lower your reads further.

Reply
- Max Schireson on May 10, 2011
  
  Agree, that’s the no-brainer win. But I also believe that once you start down the path, data modelers often go too far towards normalization.
  
  Reply
Vincent DiBartolo on May 10, 2011

Hi Max,

I think you may be examining an extreme rather than a typical use case. Obviously Oracle’s commercial product has to be much more complete than the average because it has to solve so many disparate needs. I designed a Commerce system in the past that’s very much like the one you describe above and I modeled the entire system in maybe a dozen tables. It varies with need.

I feel you’re attempting to point out the weaknesses of the relational model based on an unreasonable functional sample. If your Document model design needed to track user’s last names, you’d have to find a way to shove that in as well. That will also bloat your document model. In the end your argument seems to be more with application requirements than data modeling. Normalization, in whatever model you choose, induces design bloat.

I think if you had the *same* set of requirements and you had a document-oriented solution and compared it to a relational-oriented solution my guess is they’d be similar enough to not cause a ton of disagreement over their validity.

-Vinny

Reply
- Max Schireson on May 10, 2011
  
  There’s clearly quite a bit of additional complexity in the Oracle apps and some of that would absolutely bleed into a document-oriented model. But I still maintain that a document model is simpler and a better fit in this case.
  
  Reply
  - Vincent DiBartolo on May 10, 2011
    
    So let’s hear why you think it’s better : )
    
    Honestly, I use both relational and document model. I think they both have their strengths. But in many cases the models are equivalent and it comes down to either developer tools, developer experience, or other intangibles.
  - Max Schireson on May 10, 2011
    
    Agree both have their strengths. I don’t think it makes sense to talk about why one is better overall, each will have its use cases.
    
    In this case I think document oriented is better primarily because it produces a much simpler model which will solve the problem, and secondarily because it will likely lead to much more efficient data access (less I/Os).
Ingon on May 10, 2011

One possible application of this model is that you can easily ask stuff like – how many customers do I have? Do I have unshipped orders? How much did I make from this particular customer? And many more…

Reply
- agirbal on May 10, 2011
  
  You can actually do all that with a denormalized db model. It is common to underestimate the power of document dbs and think that documents are just blobs.
  For example, lets say the order lines are embedded in the order entry, as an order list:
  { orderId: 123, created: X, customerId: 345, items: [{productId: 897, price: 50}, {productId: 234, price: 20}, …]}
  
  You can have a secondary index on the product id which is defined as {items.productId: 1} in mongodb.
  You can then very efficiently query all orders for a given product, or all customers who bought a product.
  Even for these queries speed is same or faster than if you were querying a fully normalized data.
  
  Reply
Jason Lotito on May 10, 2011

> the dependency of customer name on orderID is a transitive dependency of customer name on customerID and customerID on orderID.

That’s a long way of saying their is a relationship between orders and customers.

> requires all of one simple query.

As does the SQL way. You’ll have to define that relationship somewhere anyways, so if you can’t understand it in SQL, you won’t understand it at the application level.

> What I’m looking for is reasons why the relational model is better

Considering the document model hasn’t been shown to be better in this post, it’s a bold statement.

You present a single case (history tracking) where your document supposedly is superior (ignoring the fact that a change history would need to be supported regardless of the data store, and merely storing the data at the time of the transaction isn’t suitable).

You also make this lackluster claim at the end: ” the answer is “no way” in the case of storing orders for customer service”

That’s highly specific. And fairly useless.

Basically, what you are saying is that it’s useful to store a snapshot of a purchase order. This isn’t an unfair observation. However, it’s illogic to assume that means you don’t need to store referencing information.

Basically, you build the order from SQL relations, and store the resulting purchase order in some document store (this is where NoSQL would be fine). However, should the purchase order change (which happens all the time), you simply rebuild the purchase order. If you didn’t save the relationships, you then need to go through every single purchase order, searching for elements that changed, and change each and every one, and then still recreate a new purchase order (see whenever Apple releases a new product, and all the people orders product A are actually getting product A++).

And honestly, if you find relationships in SQL difficult, I can’t see how you can manage it at the application level with any reliability.

Reply
- Max Schireson on May 10, 2011
  
  I think the query in SQL to retrieve the order can be anything but simple as the data model grows in complexity.
  
  Yes, it is bold of me to ask why the relation model is actually better for this use case. But I’m still waiting for an answer 🙂
  
  Reply
- agirbal on May 10, 2011
  
  I agree with you that it is wrong to push the denormalization too far. The rule of thumb should be: do not duplicate data in the DB that may need updating, even rare updating. Otherwise you will have potentially inconsistent data or some updates will be painfully slow.
  But even with this rule in mind, you can usually cut down the number of tables by a factor of 2 or 3 with a document db. It allows you to embed data that clearly belong to another one, for example order lines for an order, and you end up with a much faster db.
  And again documents are not blobs, you can fully query them. If a product changes, you do _not_ have to go through every order and look for that product to update. Just have an index on the product id within the order line (which is good for querying anyway) and then you can very efficiently update these lines.
  Also not you do _not_ have to recreate the entire order object. With a db like mongo you can use in-place modifiers (like $set) where you can update parts of a documents, without doing any processing on the client side.
  
  Reply
craig on May 10, 2011

AFAIK, most systems that use a relational db for this use case should be saving off a ‘document’ that contains all the info for the order in de-normalized form. This is essential for integrity through price changes, address updates, and even the name change you mention in your post. I put document in quotes because this could be a bunch of tables storing orders and specific de-normalized data or it could be an xml file of the order with a snapshot of the data at the time of the order.

Reply
- Max Schireson on May 10, 2011
  
  No argument here 🙂
  
  Reply
  - ASP.Net Halifax on May 10, 2011
    
    You are 100% correct – combining/storing the different elements in a de-normalized fashion in one ‘document’ is much preferred. Keeping them normalized and processing “on the fly” can look like the right approach at first but the consequences can be disastrous.
Dave on May 10, 2011

The system you’ve presented is what any database person would call a toy system, and really a system of this complexity could be represented with flat files without to much trouble. Try representing a serious system (e.g. the 126 table oracle app you mention) using a documented oriented database, I think you’d come to grief fairly quickly.

Document oriented databases do have there place, but also there limitations, as do rdbms’es, try storing a 3d first person shooter in an rdbms and you’ll soon come to grief. Each has there own application domain

Reply
- Max Schireson on May 10, 2011
  
  Completely agree each has its own domain.
  
  I’ve built complex intelligence systems with 100’s of terabytes of data in document oriented systems. It turns out they were much quicker to build and much easier to maintain than the relational systems they replaced, and the users loved them. No experience with 3d first person shooters though 🙂
  
  Reply
artem on May 10, 2011

OK, here is use case: when you look at the order, you should see the contact (name and phone number) – someone at customer company whom you call when something is wrong with the order.

Now, the customer is calling you and notifying that your previous contact is gone and they’ve assigned new person who will be handling all interaction with you. His phone number is so-and so.

How is this handled in document-oriented data model?

Reply
- Max Schireson on May 10, 2011
  
  Thank you for a substantive use case, much better than generalizations and name calling that sometimes occurs in these debates!
  
  In a document oriented system, you still have to make some choices about your data structures, but I’ll give one reasonable approach:
  1. Mark the previous contact with an end date of today and a status of “inactive”
  2. Add an additional contact with a start date of today and a status of “active”
  
  This of course assumes that you have the application logic to handle these fields. That said I don’t think having those fields within the same document as the order makes it any more complicated – in fact I believe it makes it simpler, and I’ll explain why below. In a more rudimentary version of the system, you might directly update the contact information in the order, but of course that loses the history so in most cases I would lean towards the approach I described.
  
  One advantage of a document oriented database in dealing with this use case is that it doesn’t require fundamental changes (which in turn bring data migration scripts) to handle this if it comes in as a change request. Suppose your current system doesn’t handle this case at all, and you’re told by your users that its important. You can simply add the additional fields and the logic to handle them, and the absence of a start or end date or status can be treated as a active record. You might want to insert default values, but that would be an implementation-dependent optimization of coding ease or performance, not an absolute requirement.
  
  Let’s think about the corresponding case in the relational world. Say you hadn’t planned for it so you simply have one ship-to and one bill-to contactID as fields in the order header. These point to a contacts table. To make the corresponding changes, you need to do something like:
  1. Create a join table which is orderHeader_contacts or some such, with types, statuses and end dates
  2. Write a script to migrate all existing contactIDs in the orderHeaders table to the orderHeader_contacts table and find a time to run that script in production synchronized with your code changes
  3. Change every piece of code which touches orderHeaders to do one additional join to get the contactIDs from orderHeader_contacts
  4. Make sure that the addition of that additional join doesn’t cause performance problems in your code
  
  The data migration script makes life much more complicated. With a large data set, these can run for days, and you need very very clever design for your system to be able to be up and running while the migration is in process. This data migration issue is why craigslist moved from relational (MySQL) to document oriented (mongoDB) for their posting archive.
  
  What do you think of this approach? Does it make you more or less inclined to adopt a document-oriented store for this type of problem?
  
  Thanks again for the good question.
  
  — Max
  
  Reply
  - artem on May 11, 2011
    
    Well I think that keeping history of ‘inactive’ contacts in the order is overengineering – who cares about contacts that are not there anymore? However, history might be interesting if the same person later appears as a contact at another company – assuming that you are actually working with people who represent companies, that is, your case looks more like business-to-business situation. In which case, you might want to track the history of relationship between persons and companies, but this history has no direct relationship to orders.
    
    So in relational business-to-business case, you’d have Orders which refer to Customers which can have one (for simplicity) Person designated as contact. When contact information change, as far as order processing is concerned, you update Customer record to refer to new Contact and that’s it. Yes it’s important to get your data model right before your database grows large – but usually you have enough time to do iterations while data migration scripts run to completion in seconds.
    
    In document-oriented model, you either need to do the join yourself in your application code to show information from order document and customer document together everywhere it’s required. Or you need to go and update all the orders for given customer. Neither option looks attractive.
  - Max Schireson on May 11, 2011
    
    Whether you need the history depends on the usage. If the order was for expensive b-b stuff (say 100 jet engines), you’ve probably kept notes on lots of conversations and its important to have the history of who was the contact. If the order is for a DVD, you probably don’t care.
    
    Often with a document oriented system, you keep enough customer information with the order to not have to do the join for the most common operations. Yes, you might have to update multiple orders – but you might have to anyway in relational: say a B-B case where you have 100 orders from company X, and they’ve grown so Fred is now the contact for only half the orders and Jane is now the contact for the other orders.
    
    Hope this is useful,
    
    — Max
Sakuraba on May 11, 2011

The real question is: What prevents you from modeling your orders using a flat table approach and therefor essentially mimicking a document database? Nobody forces you to denormalize your data into oblivion.

To me the denormalization approach shines when you dont know upgfront which questions your data shall answer in the future. Relational models are not very expressive and can easily become complex, but their ability to answer arbitrary queries is very good.

(side note: in pain due to ORMs and their impedance mismatch to the relational world for years, so i am not a relational fanboy)

Reply
- Max Schireson on May 11, 2011
  
  One huge thing prevents you from modeling it that way: repeating groups. At a minimum you have headers and lines – two tables but maybe 100 rows for a single large order, so some additional complexity and potentially much more I/O. But for example if you want to have multiple contacts on the order in case they change (higher end b-b case discussed in a previous comment), then you have more tables in relational. And if a line breaks into multiple shipments (say you ordered 5 and they have 2 in stock and 3 are backordered), yet another table in relational. So can you limit the normalization in relational? Yes, but unless you give up functionality vs document oriented, only so far.
  
  Helpful?
  
  — Max
  
  Reply
Gary on May 12, 2011

I think as document databases as ‘locally’ structured and relational as ‘globally’ structured. The migration issue you pointed out highlights this. The RDBMS requires that every order can be related to a contact that has a start/end date defiend, while for the document db “the absence of a start or end date or status can be treated as a active record”
In the RDBMS model, the applciation code has to deal with single structure throughout. In document db, the application code has to cope with every possible structure any document might have.
I think the latter works well with applications dealing with ephemeral data (Craigslist being an excellent example). But in applications dealing with data potentially a decade old, catering for the variety in document formats over that timespan will be challenging, and testing it will be even more so.

Reply
- Max Schireson on May 12, 2011
  
  I think the locally structured vs globally structured is a good way to think about it.
  I do think its worth pointing out that a document oriented database, you aren’t required to allow the heterogeneous data structures, its the developer’s choice (rather than the technology’s choice). If you want to run a data migration for each change to keep everything consistent, nothing is stopping your from doing that.
  
  Reply
Brandon on August 5, 2012

Another perspective:

In development you have to take into account the project timeline. While it’s already been pointed out that in this case you would just record the name each time rather than relating it to another table, let’s just assume that you had used a different and more legitimate example for a minute. Or better yet, for the sake of argument, let’s assume your example actually was legitimate…

At this point the development team needs to make a decision on how to handle changing names. Since we’re assuming that we have to relate the names to another table, then the answer would be to create a new person. Of course this means that there’s a disconnect where you can’t accurately query information on this person… however in the real world this is probably the decision that would be made because of project constraints and the fact that this isn’t the sort of issue that you would suddenly abandon your relational database that works better for the other 95% of your operation.

Point being that in the real world you are generally going to want to live with the defeciencies of a model that handles 95% of what you do quite well, rather than adopting a new model designed for the 5% edge cases.

Of course, when the edge case is the bulk of what you are doing (heirarchical data, for example) then it’s time to use that model designed for that edge case.

Reply