Create and populate table

Author

Ricardo Rodrigo Basa

Published

January 7, 2024

For our sample table, we will use a simple 4-column table that is a simulated result of an ETL process from some OLTP DB. We have the following values: - date value: You can think of this as an entry date, or a purchase date - numeric value: You can think of this as quantity of items, or a monetary value - descriptive value: You can think of this as a category code, or a branch code

We will then populate our table with 50 million rows randomly generated. These will be inserted in date order ascending. Why 50 million? So that we give the DB a little bit of a workout. A DB will not break a sweat with hundred-thousand-row tables.

Create our sample table

First, we create our table in our database with the following DDL:

Take note that at this point, other than the primary key, we do not have any indexes defined.

Field Data Type Description
id INT A simple unsigned primary key
fk_id INT A simulated foreign key ID
arbitrary_value INT A value we can use aggregate functions on
entry_date DATETIME A datetime value

Populate our table

Generate data

We will now generate data for our table. To do this we use fixtuRes. We provide a YML configuration file:

# sample_table.yml
sample_table:
  columns:
    fk_id:
      type: integer
      min: 1
      max: 100
    arbitrary_value:
      type: integer
      min: 0
      max: 50
    entry_date:
      type: date
      min: 1973-01-01
      max: 2023-12-31
  arrange: entry_date

We use MockDataGenerator to create our data. This will produce a table ordered by entry_date. Then we add a sequential id column to have the id in the same order as entry_date.

The following is a sample output if we used size = 20:

Connect to DB

We now establish a connection to our MySQL database server. The RMySQL library has been deprecated in favor of the RMariaDB library.

It is always good practice to keep your connection credentials like usernames, passwords, API tokens in your environment variables. Never hard-coded in source code. And never commit your environment variable file to the repo.

Write our data to the DB

While there is a function we can use to write our mock_data to the DB table (dbWriteTable(conn, "sample_table", mock_data)), remember that we have 50 million rows. When mock_data is written to a CSV file, it results in a 1.3 GB file. Writing this to the DB will take some time. To avoid hitting the connection time-out constraint we will write the data by batches of date.

Let’s take a look at what we have.

Dont’ leave any DB connections open!

Now, we have a 50 million row DB table we can play around with.