I just read Jay Kreps’ paper on data integration. It is the best-written and most compelling piece of software engineering writing that I have read in some time. It is dense, but if you consider yourself a technical business intelligence or ETL expert, I have to recommend that you read it.
A few highlights:
- A good change data capture system, or as Kreps calls it, a data log, gives an organisation the abstraction, and all of the capabilities, of a source control system for its data.
- Data integration sits at the base of the knowledge worker’s hierarchy of needs; transformation does not. Centralised transformation still has a place, but there are practical uses for integrated data that has not yet been transformed.
- John Gage famously said “The network is the computer”. He was factually wrong, but he was laying the conceptual framework that shortly thereafter gave us Google, Hadoop, and the internet as we know it today. In similar style, Kreps urges us to think of all of the data in our organisation as a single, somewhat dysfunctional database. Localised parts of that database are accessible, or “well indexed”, for specialised uses. Certain parts are well integrated with other parts. But much of the data is inaccessible or unintegrated. Our job as data professionals is to keep “indexing”, that is, making data visible and integrating it, for as long as there is a positive return on our efforts.
- Finally, if we’re willing to concede, as most of us already believe, that most organisations will be best served by a variety of analysis and visualisation tools, then we need a centralised, untransformed data source. Otherwise integration time, cost, and developer effort scale with the square of the number of integration points, rather than linearly. Or to paraphrase, the advent of big data and related tools has made Bill Inmon correct, regardless of the shortcomings of his philosophy twenty years ago.
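
To make the scaling claim in the last highlight concrete, here is a back-of-the-envelope sketch. The function names and the counting model are my own assumptions, not something taken from the paper: I count one directed pipeline for every ordered pair of systems in the point-to-point case, and one write path plus one read path per system when everything goes through a central log.

```python
def pairwise_links(n_systems: int) -> int:
    """Directed pipelines needed if every system feeds every other one directly."""
    return n_systems * (n_systems - 1)


def hub_links(n_systems: int) -> int:
    """Connections needed if each system only writes to and reads from one central log."""
    return 2 * n_systems  # one producer path and one consumer path per system


if __name__ == "__main__":
    # Compare the two models as the number of systems grows.
    for n in (2, 5, 10, 20):
        print(f"{n:>2} systems: {pairwise_links(n):>3} point-to-point, "
              f"{hub_links(n):>2} via a central log")
```

Under these assumptions the two approaches break even at three systems, and the point-to-point count pulls away quickly after that.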
A lot of these ideas are inherent to using Apache Kafka or AWS Kinesis, but I’m more than a little off balance after reading a well-reasoned argument that Kimball was correct only for a niche use case.
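
For anyone who wants the source control comparison from the first highlight in more tangible form, below is a minimal in-memory sketch. It is purely illustrative and is not how Kafka or Kinesis work internally; `ChangeLog`, `append`, and `state_at` are hypothetical names of my own. The idea it shows is that an append-only log of changes lets you rebuild the state of the data as of any past offset, much like checking out an old commit.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ChangeLog:
    """Hypothetical append-only log of (key, value) changes."""
    entries: list = field(default_factory=list)

    def append(self, key: str, value: Optional[Any]) -> int:
        """Record a change and return its offset. A value of None marks a delete."""
        self.entries.append((key, value))
        return len(self.entries) - 1

    def state_at(self, offset: int) -> dict:
        """Replay the log up to and including `offset` to rebuild the state."""
        state: dict = {}
        for key, value in self.entries[: offset + 1]:
            if value is None:
                state.pop(key, None)  # delete marker removes the key
            else:
                state[key] = value    # later writes overwrite earlier ones
        return state


log = ChangeLog()
first = log.append("customer:1", {"city": "Leeds"})
second = log.append("customer:1", {"city": "York"})
third = log.append("customer:1", None)

# Any historical version can be rebuilt by replaying up to its offset.
as_of_first = log.state_at(first)
as_of_second = log.state_at(second)
as_of_third = log.state_at(third)
```

Nothing is ever updated in place, so every downstream consumer can replay from whatever offset it last saw and arrive at the same state.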