I had the pleasure of speaking to the Wellington AWS User Group last night on the announcements for data services. Here are my personal highlights.
SageMaker has changed the game since its release only a year ago. It is not merely a hosted Jupyter notebook service; it also creates training clusters on demand when training is needed and automates deployment of the resulting models. Additional features keep being added:
DynamoDB added two big features:
I think that RDS on VMware is going to be a huge feature. Not only does this allow on-premises databases to burst capacity into AWS or maintain high availability with AWS, it also provides high quality database administration at a bargain price. And my extensive professional experience as a database administrator and with other database administrators has taught me that database administration is overpriced and of generally disappointing quality. Hopefully this service will bring change to the whole market.
Kafka and Kinesis are direct competitors. Each has its strengths. Since lots of people were running Kafka, AWS is now providing Kafka as a service. Don’t worry, Kinesis isn’t going anywhere.
I’m really excited about Timestream, the new time series database. I do a lot of work with time series data, as do we all. But none of the existing tools in common deployment were engineered for time series data, or were really very good at it. I love Kibana, but it runs on a text search engine. And that a leading quantity prediction tool runs on a text search engine tells you just how badly we need better databases in this area.
In my career, I’ve spent a lot of time trying to get data out of databases and onto other hardware so that the analytics load didn’t shut down the business. One of the better tools for that is Change Data Capture (CDC). And every once in a while someone gets the bad idea of using an application’s internal audit log for CDC. Which I’ve never seen work, because every application’s audit log that I have ever encountered has defects. Those “Created By” and “Updated By” columns are not reliable.
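To make that failure mode concrete, here is a minimal sketch of audit-column CDC going wrong. Everything in it is made up for illustration: the `Row` class, the `updated_at` tick, and the `poll_changes` helper are hypothetical, and the table is just a dictionary. It is not how any particular application or CDC product works, only the general shape of the problem.

```python
from dataclasses import dataclass


@dataclass
class Row:
    id: int
    value: str
    updated_at: int  # logical clock tick, maintained by the application (hypothetical)


def poll_changes(table, since):
    """Naive CDC: return rows whose updated_at is newer than the last poll."""
    return [row for row in table.values() if row.updated_at > since]


# Hypothetical table keyed by primary key, as it looked at the last poll (tick 10).
table = {
    1: Row(1, "alpha", 5),
    2: Row(2, "beta", 7),
    3: Row(3, "gamma", 9),
}

# Three changes happen after that poll:
table[1] = Row(1, "alpha-v2", 11)  # a well-behaved update that bumps updated_at
del table[2]                       # a delete leaves no row behind to poll
table[3] = Row(3, "gamma-v2", 9)   # a bulk fix that forgot to touch updated_at

captured = poll_changes(table, since=10)
captured_ids = [row.id for row in captured]
print(captured_ids)  # [1]
```

Three rows changed and the poll reports one. The delete and the careless update are invisible downstream, which is roughly what I keep running into in real systems.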
QLDB is the solution to this problem. The database keeps the audit log. Amazon writes it once, tests it exhaustively, and we can all trust that it works. And we don’t ruin our beautiful data models by mindlessly tacking columns onto tables.
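As a rough illustration of what "the database keeps the audit log" means, here is a toy append-only journal in plain Python. This is not the QLDB API and makes no claim about how QLDB is built internally; the `Journal` class and its methods are hypothetical names of my own. The hash chaining is there only because AWS describes the QLDB journal as cryptographically verifiable, and a chain of hashes is a simple way to sketch that idea.

```python
import hashlib
import json


def _digest(body):
    """Stable SHA-256 over an entry body (keys sorted so the hash is repeatable)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class Journal:
    """Toy append-only journal: every change is an entry chained to the previous one."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    def append(self, op, key, value=None):
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"seq": len(self.entries), "op": op, "key": key,
                "value": value, "prev": prev_hash}
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True

    def changes_since(self, seq):
        """The change feed: every entry after a given sequence number."""
        return [e for e in self.entries if e["seq"] > seq]

    def current_state(self):
        """The 'table' is just the journal replayed from the start."""
        state = {}
        for e in self.entries:
            if e["op"] == "delete":
                state.pop(e["key"], None)
            else:
                state[e["key"]] = e["value"]
        return state


journal = Journal()
journal.append("insert", "order-1", {"status": "new"})
journal.append("update", "order-1", {"status": "shipped"})
journal.append("delete", "order-1")

print(journal.current_state())                        # {}
print([e["op"] for e in journal.changes_since(0)])    # ['update', 'delete']
print(journal.verify())                               # True
```

The point of the sketch is that the delete shows up in the change feed even though the row is gone from the current state, and nobody had to remember to maintain an "Updated By" column for that to happen.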
Data Lakes are great. Their speed of development, low cost, and exceptional performance put most data warehouses to shame. They make my job easy.
AWS Glue is a great service. I haven’t directly interacted with a Hadoop cluster since Glue was announced. But there is a great deal of repetitive work involved in ingesting data into an AWS Glue-based data lake. Every input needs to be secured, encrypted, and written into Parquet or ORC.
AWS Lake Formation is a code generator that helps with many of the common tasks in a data lake. Those tasks, I can attest, still take days and days in a data lake implementation. That is better than months and months in a data warehouse, but it is still work I’m happy to leave behind.
Those are some of my thoughts on re:Invent. I’ve also attached my slides from the user group.