A data modeling challenge to the relational-is-always-best set
Relational databases are great for a lot of jobs, but not nearly as many as they are used for.
I’ll give what I believe is a classic example of relational overuse, and I challenge any relational expert to publicly explain why its a good fit. What I’m looking for is reasons why the relational model is better, not why mysql or Oracle or Postgres is great and why MongoDB or CouchDB or Couchbase or Membase sucks.
The use case: tracking orders for customer service.
Using a document-oriented database, you represent orders as documents. Each order has a repeating group of line items and some customer information, including a bill-to and ship-to address. Each line has a fulfillment status and possibly one or more notes. Retrieving the order to display either to the customer on the web or to a customer service agent requires all of one simple query.
Using a relational database, it’s a bit more complicated:
- One table contains the order header
- One table contains the order lines
- If the database is “properly normalized” (say in 3rd normal form) the order header doesn’t contain the customer’s name but rather a foreign keys that points to a customer table
- Similarly, the order lines don’t contain product names, but rather foreign keys which point to a product table
- On and on with payment methods, addresses, currencies etc, until in real life (Oracle Applications, Tech Ref Manual here) you wind up with 126 tables for Order management, including 22 for Orders and Returns, 16 for Customers, and 11 for Items
Why so complex? First, the theoretical answer to why Orders and Order Lines is not enough: for your data to be properly normalized (which most relational experts would say is at least third normal form), every non-prime attribute should be non-transitively dependent on every candidate key; for example, the dependency of customer name on orderID is a transitive dependency of customer name on customerID and customerID on orderID. Confused? Don’t feel alone.
What’s the common sense explanation behind the theory? The logic is that each column in a table should depend on “the key, the whole key, and nothing but the key”. If you follow the logic, data will appear only once and only where it needs to be. This avoids what relational folks call “update anomalies” where different tables can have inconsistent versions of the same data. In this case, the customer’s name is not really dependent on the order, but on the customer account. By modeling this “properly”, you can avoid ever having two names for me attached to different orders. There is quite a bit of literature devoted to avoiding this problem; if you’re interested in reading more, you might start with this overview.
Now, here’s the rub: say I decide that after years of marriage I decide to give up my ideal of equality for men and take my wife’s name. Should all my old orders now appear to be ordered under my married name? Probably not, but that creates additional complexity. Now I need to either a) keep a copy of the name in the order, under the theory that the dependency is not transitive or b) add date effectivity to the names with yet more tables, as the name dependency is actually on user and time. Pretty soon, you’re well on your way to becoming Oracle applications with its 126 tables for Order Management.
If you’d like to see another example, Kyle Banker from 10gen has a great blog post with example data and code comparing a (MongoDB-based) document-oriented approach to storing product information to some real-world e-commerce frameworks.
The elephant in the room is whether the theoretical advantage around avoiding update anomalies and more efficient updates is worth the complexity. For my money, the answer is “no way” in the case of storing orders for customer service. If you can make the competing case, I’d love to hear it.