re:Invent 2018 Data Service Highlights

Posted by Steven on December 12, 2018

I had the pleasure of speaking to the Wellington AWS User Group last night about the re:Invent announcements for data services. Here are my personal highlights.

Re:Invent 2018 Database Announcements from Steven Ensslen

SageMaker

SageMaker has changed the game since its release only a year ago. It is not merely a hosted Jupyter notebook service: it also creates training clusters on demand when training is needed and automates deployment of the resulting models. Features keep being added:

SageMaker Search automates the comparison of different models with the same tags. AWS tracks your model performance so that you don’t have to.
SageMaker added Git support! Working without it was a little painful…
And, shortly before re:Invent, SageMaker automated hyper-parameter tuning. Hyper-parameters, for instance the depth of a neural network, can make a big difference in machine learning performance, but are often either ignored or over-analyzed. Now tuning them is a one-click step, done reliably and efficiently.
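
As a rough illustration of what hyper-parameter tuning automates, here is a toy sketch in plain Python. It is not the SageMaker API: the `toy_validation_loss` function is a hypothetical stand-in for "train a model of this depth and score it", and the exhaustive loop is the simplest possible search, whereas a managed tuner runs real training jobs and picks candidates more cleverly.

```python
from typing import Callable, Dict, Iterable, Tuple


def toy_validation_loss(depth: int) -> float:
    """Hypothetical stand-in for training a network of this depth and scoring it."""
    # Too shallow underfits and too deep overfits; this toy curve bottoms out at depth 4.
    return (depth - 4) ** 2 + 1.0


def tune(objective: Callable[[int], float],
         candidates: Iterable[int]) -> Tuple[int, Dict[int, float]]:
    """Try every candidate value and return the best one plus all the scores."""
    results = {value: objective(value) for value in candidates}
    best = min(results, key=results.get)  # lowest loss wins
    return best, results


if __name__ == "__main__":
    best_depth, scores = tune(toy_validation_loss, range(1, 9))
    print("best depth:", best_depth)
    print("scores:", scores)
```

The point of the sketch is only the shape of the job: a range of candidate values, an objective to minimize, and a record of every trial, which is the bookkeeping people tend to skip when they tune by hand.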

DynamoDB

DynamoDB added two big features:

DynamoDB added transaction support. Now multiple tables can be updated atomically, which takes away the error-prone heavy lifting of having the application handle consistency across tables.
DynamoDB now has on-demand, pay-per-request billing. Provisioning DynamoDB capacity used to be complex, and most AWS customers simply provisioned a good margin above their peak loads. Now you can just pay per use.
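
To make both features concrete, here is a minimal sketch using the request shapes of the boto3 DynamoDB client (`transact_write_items` and `create_table`). The `Orders` and `Inventory` tables, their attributes, and the helper names are hypothetical, invented for illustration. The builders are plain Python, so they can be inspected without AWS credentials; only `place_order` actually talks to AWS.

```python
from typing import Any, Dict, List


def build_order_transaction(order_id: str, sku: str, quantity: int) -> List[Dict[str, Any]]:
    """Build one all-or-nothing write that touches two (hypothetical) tables."""
    return [
        {
            "Put": {
                "TableName": "Orders",
                "Item": {
                    "OrderId": {"S": order_id},
                    "Sku": {"S": sku},
                    "Quantity": {"N": str(quantity)},
                },
                # Cancel the whole transaction if this order already exists.
                "ConditionExpression": "attribute_not_exists(OrderId)",
            }
        },
        {
            "Update": {
                "TableName": "Inventory",
                "Key": {"Sku": {"S": sku}},
                "UpdateExpression": "SET Stock = Stock - :q",
                # Cancel the whole transaction if there is not enough stock.
                "ConditionExpression": "Stock >= :q",
                "ExpressionAttributeValues": {":q": {"N": str(quantity)}},
            }
        },
    ]


def on_demand_table_definition(table_name: str, hash_key: str) -> Dict[str, Any]:
    """Build create_table arguments for a pay-per-use table (no provisioned capacity)."""
    return {
        "TableName": table_name,
        "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
        "AttributeDefinitions": [{"AttributeName": hash_key, "AttributeType": "S"}],
        "BillingMode": "PAY_PER_REQUEST",
    }


def place_order(order_id: str, sku: str, quantity: int) -> Dict[str, Any]:
    """Send the transaction to DynamoDB; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above run without AWS access

    client = boto3.client("dynamodb")
    return client.transact_write_items(
        TransactItems=build_order_transaction(order_id, sku, quantity)
    )
```

Either both writes succeed or neither does, so the application no longer needs its own compensation logic when the inventory check fails after the order has been written.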

RDS on VMware

I think that RDS on VMware is going to be a huge feature. Not only does it allow on-premises databases to burst capacity into AWS, or maintain high availability with AWS, it also provides high-quality database administration at a bargain price. My extensive professional experience as a database administrator, and working with other database administrators, has taught me that database administration is overpriced and of generally disappointing quality. Hopefully this service will bring change to the whole market.

Managed Streaming for Kafka (MSK) (public preview)

Kafka and Kinesis are direct competitors. Each has its strengths. Since lots of people were running Kafka, AWS is now providing Kafka as a service. Don’t worry, Kinesis isn’t going anywhere.

Timestream (preview)

I’m really excited about Timestream, the new time series database. I do a lot of work with time series data, as do we all. But none of the existing tools in common deployment were engineered for time series data, and none of them are really very good at it. I love Kibana, but it runs on a text search engine. That a leading quantity prediction tool runs on a text search engine tells you just how badly we need better databases in this area.

Quantum Ledger Database (preview)

In my career, I’ve spent a lot of time trying to get data out of databases and onto other hardware so that the analytics load didn’t shut down the business. One of the better tools for that is Change Data Capture (CDC). And every once in a while someone gets the bad idea of using an application’s internal audit log for CDC. I’ve never seen that work, because every application audit log that I have ever encountered has defects. Those “Created By” and “Updated By” columns are not reliable.

QLDB is the solution to this problem. The database keeps the audit log. Amazon writes it once, tests it exhaustively, and we can all trust that it works. And we don’t ruin our beautiful data models by mindlessly tacking columns onto tables.
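
To show why a log kept by the database is more trustworthy than columns maintained by the application, here is a small sketch of an append-only, hash-chained journal in plain Python. It is an illustration of the general idea only, not the QLDB API; the `Journal` class and its method names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def _digest(body: Dict[str, Any]) -> str:
    """Hash an entry body in a stable, key-sorted JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class Journal:
    """Append-only change log where each entry commits to the one before it."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def append(self, table: str, key: str, data: Dict[str, Any]) -> str:
        """Record a change and return the hash that seals it."""
        previous = self._entries[-1]["hash"] if self._entries else ""
        body = {"seq": len(self._entries), "table": table, "key": key,
                "data": data, "prev": previous}
        entry_hash = _digest(body)
        self._entries.append({**body, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        previous = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != previous or entry["hash"] != _digest(body):
                return False
            previous = entry["hash"]
        return True

    def changes_since(self, seq: int) -> List[Dict[str, Any]]:
        """Return entries from sequence number `seq` onward, as a CDC reader would."""
        return [entry for entry in self._entries if entry["seq"] >= seq]
```

Because every write goes through `append`, the change history cannot drift out of step with the data the way hand-maintained “Updated By” columns do, and a downstream consumer can pick up from the last sequence number it processed.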

Lake Formation (preview)

Data Lakes are great. Their speed of development, low cost, and exceptional performance put most data warehouses to shame. They make my job easy.

AWS Glue is a great service. I haven’t directly interacted with a Hadoop cluster since Glue was announced. But there is a great deal of repetitive work involved in ingesting data into an AWS Glue based data lake. Every input needs to be secured, encrypted, and written into Parquet or ORC.

AWS Lake Formation is a code generator that helps to perform a lot of the common tasks in a data lake. I can attest that those tasks still take days and days in a data lake implementation. That is better than the months and months they take in data warehouses, but it is still work I’m happy to leave behind.

Those are some of my thoughts on re:Invent. I’ve also attached my slides from the user group.

Posted in: Book Review, Quick

Tags: aws, data lake, dynamodb, Glue, kafka, Kibana, kinesis, re:Invent, sagemaker, timestream
