Create and populate table
For our sample table, we will use a simple 4-column table that simulates the result of an ETL process from some OLTP DB. We have the following values:

- date value: You can think of this as an entry date, or a purchase date
- numeric value: You can think of this as a quantity of items, or a monetary value
- descriptive value: You can think of this as a category code, or a branch code
We will then populate our table with 50 million randomly generated rows, inserted in ascending date order. Why 50 million? To give the DB a bit of a workout; a DB will not break a sweat with hundred-thousand-row tables.
Create our sample table
First, we create our table in our database with DDL along the following lines (a sketch based on the field table below; the NOT NULL constraints are assumptions):
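```sql
-- A minimal DDL sketch matching the field definitions below;
-- exact column options in the original may differ
CREATE TABLE sample_table (
    id              INT UNSIGNED NOT NULL,
    fk_id           INT NOT NULL,
    arbitrary_value INT NOT NULL,
    entry_date      DATETIME NOT NULL,
    PRIMARY KEY (id)
);
```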
Take note that at this point, other than the primary key, we do not have any indexes defined.
Field | Data Type | Description |
---|---|---|
id | INT | A simple unsigned primary key |
fk_id | INT | A simulated foreign key ID |
arbitrary_value | INT | A value we can use aggregate functions on |
entry_date | DATETIME | A datetime value |
Populate our table
Generate data
We will now generate data for our table. To do this, we use `fixtuRes`, providing a YAML configuration file:
```yaml
# sample_table.yml
sample_table:
  columns:
    fk_id:
      type: integer
      min: 1
      max: 100
    arbitrary_value:
      type: integer
      min: 0
      max: 50
    entry_date:
      type: date
      min: 1973-01-01
      max: 2023-12-31
  arrange: entry_date
```
We use `MockDataGenerator` to create our data. This will produce a table ordered by `entry_date`. Then we add a sequential `id` column to have the `id` in the same order as `entry_date`.
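In R, this looks roughly like the following sketch (the configuration file path and the use of `dplyr::row_number()` for the sequential `id` are assumptions):

```r
library(fixtuRes)
library(dplyr)

# Build a generator from the YAML configuration above
generator <- MockDataGenerator$new("sample_table.yml")

# Generate the rows (50 million in our case); the result is ordered
# by entry_date, so a row-number id follows the same order
mock_data <- generator$get_data("sample_table", 50000000) |>
  mutate(id = row_number())
```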
The following is a sample output if we used `size = 20`:
Connect to DB
We now establish a connection to our MySQL database server. The `RMySQL` library has been deprecated in favor of the `RMariaDB` library.
It is always good practice to keep connection credentials such as usernames, passwords, and API tokens in environment variables, never hard-coded in source code. And never commit your environment variable file to the repo.
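A minimal connection sketch, assuming the credentials live in environment variables (the variable names here are placeholders):

```r
library(DBI)

# Connect via RMariaDB; the environment variable names are assumptions
conn <- dbConnect(
  RMariaDB::MariaDB(),
  host     = Sys.getenv("DB_HOST"),
  dbname   = Sys.getenv("DB_NAME"),
  user     = Sys.getenv("DB_USER"),
  password = Sys.getenv("DB_PASSWORD")
)
```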
Write our data to the DB
While there is a function we can use to write our `mock_data` to the DB table (`dbWriteTable(conn, "sample_table", mock_data)`), remember that we have 50 million rows. When `mock_data` is written to a CSV file, it results in a 1.3 GB file. Writing all of this to the DB in one go will take some time, so to avoid hitting the connection time-out constraint we will write the data in batches by date.
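As a sketch, one way to batch is to split on the year of `entry_date` (the chunking key is an assumption; any date-based grouping that keeps batches reasonably small works):

```r
# Split the data by year of entry_date and append one batch at a time,
# so no single write runs long enough to hit the connection time-out
batches <- split(mock_data, format(mock_data$entry_date, "%Y"))

for (batch in batches) {
  dbWriteTable(conn, "sample_table", batch, append = TRUE)
}
```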
Let’s take a look at what we have.
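For example, a quick row-count sanity check (a sketch):

```r
# Confirm how many rows landed in the table
dbGetQuery(conn, "SELECT COUNT(*) AS row_count FROM sample_table")
```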
Don’t leave any DB connections open!
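With DBI this is a single call:

```r
# Release the connection now that the load is complete
dbDisconnect(conn)
```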
Now, we have a 50-million-row DB table we can play around with.