Kleppmann‘s Designing Data-Intensive Applications is a detailed and comprehensive review of databases, in the widest definition of that term, and the way that we use them.
Klepmann starts by describing how our databases generally don’t offer the guarantees that we typically assume that they do. In my career I have never worked with a relational database in SERIALIZABLE isolation because that guarantee has significant performance cost. What disappoints me regularly are people specifying NOLOCK, invariably with little idea of what nolock does. Yet I think that every transactional application that I’ve worked on pretended it was running in SERIALIZABLE isolation. And NoSQL databases have historically been even more troublesome.
Once scale comes in, Kleppmann brings up CAP theorem: In the event of network failure (partition), a database is either consistent or available. As network failure is inevitable, we either end up with applications that are inconsistent or un-available. Having been forced to work with applications that chose availability over consistency, I have developed a strong preference for being down over being wrong.
The great part of Kleppmann’s book, is that it describes how to beat CAP. If your input data can be modeled so that it is unique and immutable, which is generally the case, then any datastore that provides at least once delivery is consistent enough to build an application that is correct and available even with network partition. The trick is that there needs to be after-the-fact consistency checks running independently, and these checks may need to add reversals and send apologies. As Kleppmann points out, this is already standard practise for businesses. Orders are accepted in good faith, but they occasionally are cancelled by suppliers. Almost any application can model its data as a stream of events, which are inherently unique and immutable. And message queues offer at least once delivery. So applications get to have their cake and eat it too.
The part of this that I love as an analytics professional, is that this sort of event-driven architecture makes data capture so much simpler. It makes no sense to use a change data capture tool on top of the remaining database, if any, that provides the application’s current state. Instead, updates to a data warehouse or data lake come directly from the same stream of events and from the same message queue that the source application is writing.
These events are typically more meaningful than database transactions. Instead of having to reverse-engineer intent from a database transaction that updated a dozen rows in a half-dozen tables, the analyst can work directly with a business-level event which encompasses a shipment or a sale and all of its side-effects.
Not only is this much simpler, it is instant. Conceivably your analytics could be ahead of the application’s database! And all this with an architecture that removes all analytical load on the usually overburdened source application database. Even the ETL never reads the source application datavase.
Fast feedback is powerful. Real-time processing gives you the ability to respond to opportunities that you otherwise would have missed. I’m looking forward to working with more data from systems using this pattern.
Leave a Reply