Posted on March 17, 2018

There are some good posts about using Logstash's JDBC input plugin to migrate your existing data into Elasticsearch. To keep the technical examples simple, those posts demonstrate loading normalized tables into Elasticsearch as-is.

Even with a data warehouse built in a relational database, the data model that serves your application's transaction processing will rarely be a sensible choice for reporting and analysis. Plan to project your data into a new model before you load it into Elasticsearch.

The first steps in any data analysis are to define the questions you are answering and the measure by which you will know whether you have answered them. Expect to create several indices in Elasticsearch, possibly as many as you have questions to answer, and in practice roughly one per major entity in the subject domain. In my experience, the number of indices is a small fraction of the number of tables the data occupies in third normal form.

I mock up each of the indices as a view in the relational database. SQL database views are useful tools in your data migration. They help bridge the communication gap between the SQL database people and the Elasticsearch people. They are easy to implement, to test, and to experiment on. They provide an abstraction layer and a separation of concerns. And they elevate your query to a first-class object.
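As a sketch of what such a mock-up can look like: the `orders`, `customers`, and `order_lines` tables below are hypothetical, but the pattern is to denormalize the joins into one flat row per document you intend to index.

```sql
-- Hypothetical example: flatten a normalized order model into one
-- row per line item, shaped like the documents the index will hold.
CREATE OR REPLACE VIEW order_lines_for_search AS
SELECT
    ol.id        AS line_id,
    o.ordered_at AS ordered_at,    -- datetime -> "date"
    c.name       AS customer_name, -- varchar  -> "keyword"
    ol.quantity  AS quantity,      -- int      -> "integer"
    ol.unit_price AS unit_price    -- decimal  -> "scaled_float"
FROM order_lines ol
JOIN orders o    ON o.id = ol.order_id
JOIN customers c ON c.id = o.customer_id;
```

Each row of the view corresponds to one Elasticsearch document, so the view's column list doubles as a draft of the index mapping.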

One of the benefits of views is that they can be introspected through the information_schema, which most SQL databases can't do for ad-hoc queries. I use the following query to regenerate my index templates as we iterate over the design of the views.

SELECT
    CONCAT('"', column_name, '": { "type": ',
            CASE data_type
                WHEN 'int' THEN '"integer"'
                WHEN 'bigint' THEN '"long"'
                WHEN 'smallint' THEN '"short"'
                WHEN 'tinyint' THEN '"short"'
                WHEN 'decimal' THEN
                    CONCAT('"scaled_float", "scaling_factor": ',
                           POWER(10, numeric_scale))
                WHEN 'double' THEN '"double"'
                WHEN 'json' THEN '"scaled_float", "scaling_factor": 100'
                WHEN 'datetime' THEN '"date"'
                ELSE '"keyword"'
            END,
            ' },') AS elasticsearch_column_definition
FROM
    information_schema.columns
WHERE
    table_schema = DATABASE()
    AND table_name = 'my_view' -- substitute the name of your view
ORDER BY ordinal_position;

There are some peculiarities of my data that come through in this mapping generator. Notably, there is no text type anywhere. Also, all of the data I care about from JSON sources turns out to be scaled_float. Doubtless you'll want to make adjustments.

I encourage everyone to structure their Elasticsearch data differently from their operational data. And I recommend trying database views as an abstraction layer.
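For completeness, here is a sketch of how a view plugs into the migration itself. Everything in this pipeline is a placeholder (connection string, credentials, view name, and index name), but it shows the shape of a Logstash JDBC input reading from a view rather than from base tables:

```
input {
  jdbc {
    # Placeholder connection details -- substitute your own.
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/warehouse"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # The view does the projection; Logstash just reads flat rows.
    statement => "SELECT * FROM order_lines_for_search"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "order-lines"
  }
}
```

Because the projection lives in the view, the Logstash configuration stays a trivial `SELECT *`, and reshaping the data is a database change rather than a pipeline change.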
