Fake data is very useful in a development environment, for example to test your application or to measure query performance. In this blog post I'll share how I created a simple SQL script for PostgreSQL 🐘 to generate some fake data. Enjoy!

I got this idea while testing query changes and performance improvements on a PostgreSQL database. On the very first attempt my query failed miserably because I didn't have enough data, and creating all the data with its relations by hand was very time consuming. After some research I ended up with the scripts below.
All the commands are run directly in the `psql` command-line interface, so let's connect to it:

```shell
psql -h localhost -U postgres
```
Then I created a PostgreSQL database to isolate my tables:

```sql
CREATE DATABASE test_db;
\c test_db
```
All done, so let’s understand the model used in this blog post.
The Data Model
For the purpose of this blog post I've created a very simple data model, one widely used as an example in SQL snippets and discussions on the internet.

Here is an ERD image representing the model:

And here is the `CREATE TABLE` script:
```sql
CREATE TABLE users(
  id SERIAL PRIMARY KEY,
  email VARCHAR(40) NOT NULL UNIQUE
);

CREATE TABLE posts(
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  title VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE comments(
  id SERIAL PRIMARY KEY,
  user_id INTEGER NOT NULL REFERENCES users(id),
  post_id INTEGER NOT NULL REFERENCES posts(id),
  body VARCHAR(500) NOT NULL
);
```
Note that all tables have an `id SERIAL PRIMARY KEY` column, so PostgreSQL will take care of generating the ids for me. A `User` has an `email`; a `Post` belongs to a `User` and has a `title`; a `Comment` belongs to a `Post`, is made by a `User`, and has a `body` column.

I am using FOREIGN KEYs, of course, and NOT NULL, as I always want basic data validation at the DB level.
Generate Random Data
Let's start with the `users` table. In this case the only field I need to generate is the `email`.

So I started by querying `GENERATE_SERIES(1, 10)` as an easy way to generate 10 rows. Note how easy it is to generate a thousand or a million rows just by changing a single number.

The idea is then to use the `seq` value given by `GENERATE_SERIES` to build unique emails, since I have a UNIQUE constraint. I can also use the `RANDOM()` function to build some random data; in this case I am choosing the email domain from a finite list.
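As a quick illustration of the building block (not part of the original script), `GENERATE_SERIES` on its own returns one row per value in the range, so a single number controls how many rows you get:

```sql
-- Each value in the range becomes one row; GENERATE_SERIES(1, 1000000)
-- would give a million rows just as easily.
SELECT seq FROM GENERATE_SERIES(1, 3) AS seq;
--  seq
-- -----
--    1
--    2
--    3
-- (3 rows)
```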
```sql
INSERT INTO users(email)
SELECT
  'user_' || seq || '@' || (
    CASE (RANDOM() * 2)::INT
      WHEN 0 THEN 'gmail'
      WHEN 1 THEN 'hotmail'
      WHEN 2 THEN 'yahoo'
    END
  ) || '.com' AS email
FROM GENERATE_SERIES(1, 10) seq;

SELECT * FROM users;
```
These are the inserted users:

```
 id |        email
----+---------------------
  1 | user_1@hotmail.com
  2 | user_2@gmail.com
  3 | user_3@yahoo.com
  4 | user_4@hotmail.com
  5 | user_5@hotmail.com
  6 | user_6@gmail.com
  7 | user_7@yahoo.com
  8 | user_8@hotmail.com
  9 | user_9@gmail.com
 10 | user_10@hotmail.com
(10 rows)

Time: 5.159 ms
```
I was strongly inspired by the amazing PG Cast - Generating Fake Email Addresses.
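One caveat worth knowing: casting a double to `INT` in PostgreSQL rounds to the nearest integer, so `(RANDOM() * 2)::INT` produces 1 about half the time and 0 or 2 about a quarter of the time each. If you want a uniform pick over the options, `FLOOR` keeps every bucket the same size. A sketch of the alternative expression:

```sql
-- FLOOR(RANDOM() * 3) is uniform over 0, 1, 2, so each domain is
-- equally likely; (RANDOM() * 2)::INT would favor 'hotmail' here.
SELECT (
  CASE FLOOR(RANDOM() * 3)::INT
    WHEN 0 THEN 'gmail'
    WHEN 1 THEN 'hotmail'
    WHEN 2 THEN 'yahoo'
  END
) AS domain;
```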
That was easy; let's move on and see how to choose random DB references.
Choosing Random Relationships
The `Post` model is the first one with a `FOREIGN KEY`, so I have to use real `users.id` values. I'd also like my test queries to return a varying number of rows, so I want to choose a random `User` for each `Post`. This is a simple way to guarantee a random number of posts per user.
This was a bit more complex to build than the previous one. To make it easier to understand, I used CTEs.

I started by creating a CTE called `expanded` to get a limited sequence of rows using `GENERATE_SERIES`, the same idea as before. The trick here is to cross join with the `users` table in order to get all possible combinations of sequence value and `user_id`. I am pretty sure this is not very efficient, but it solves the problem. I also select a `RANDOM()` value per row, to be used later when choosing a random `user_id`.

In the second CTE, `shuffled`, I use the `MIN` aggregate function, grouping by the `seq` column over the `expanded` result, to get a single chosen random value per `seq`. Then I `INNER JOIN` back with the same `expanded` result to get the chosen row.
Check this out:
```sql
INSERT INTO posts(user_id, title)
WITH expanded AS (
  SELECT RANDOM(), seq, u.id AS user_id
  FROM GENERATE_SERIES(1, 50) seq, users u
),
shuffled AS (
  SELECT e.*
  FROM expanded e
  INNER JOIN (
    SELECT ei.seq, MIN(ei.random)
    FROM expanded ei
    GROUP BY ei.seq
  ) em ON (e.seq = em.seq AND e.random = em.min)
  ORDER BY e.seq
)
SELECT
  s.user_id,
  'It is ' || s.seq || ' ' || (
    CASE (RANDOM() * 2)::INT
      WHEN 0 THEN 'sql'
      WHEN 1 THEN 'elixir'
      WHEN 2 THEN 'ruby'
    END
  ) AS title
FROM shuffled s;

SELECT * FROM posts LIMIT 10;
```
And the inserted posts:

```
 id | user_id |     title
----+---------+----------------
  1 |      10 | It is 1 sql
  2 |       2 | It is 2 sql
  3 |      10 | It is 3 ruby
  4 |       5 | It is 4 sql
  5 |       3 | It is 5 elixir
  6 |       9 | It is 6 elixir
  7 |       4 | It is 7 sql
  8 |       5 | It is 8 ruby
  9 |       7 | It is 9 ruby
 10 |       4 | It is 10 sql
(10 rows)

Time: 3.134 ms
```
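As an aside (not part of the original script), PostgreSQL's `DISTINCT ON` can express the same expand-and-pick idea more compactly: order each `seq` group by a random value and keep only the first row. A sketch of the equivalent selection:

```sql
-- For each seq, the rows from the cross join are ordered randomly and
-- DISTINCT ON (seq) keeps the first one, i.e. one random user per seq.
SELECT DISTINCT ON (seq) seq, u.id AS user_id
FROM GENERATE_SERIES(1, 50) seq, users u
ORDER BY seq, RANDOM();
```

It still builds the full cross join under the hood, so it shares the efficiency caveat of the CTE version.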
For the `comments` table I have two relationships to choose: `user_id` and `post_id`. The solution is pretty much the same; the only change is in the cross join, where we add one more table. The `shuffled` CTE is the same as before:
```sql
INSERT INTO comments(user_id, post_id, body)
WITH expanded AS (
  SELECT RANDOM(), seq, u.id AS user_id, p.id AS post_id
  FROM GENERATE_SERIES(1, 200) seq, users u, posts p
),
shuffled AS (
  SELECT e.*
  FROM expanded e
  INNER JOIN (
    SELECT ei.seq, MIN(ei.random)
    FROM expanded ei
    GROUP BY ei.seq
  ) em ON (e.seq = em.seq AND e.random = em.min)
  ORDER BY e.seq
)
SELECT
  s.user_id,
  s.post_id,
  'Here some comment ' || s.seq AS body
FROM shuffled s;

SELECT * FROM comments LIMIT 10;
```
And the generated random data:
```
 id | user_id | post_id |         body
----+---------+---------+-----------------------
  1 |       9 |       3 | Here some comment 1
  2 |       3 |      32 | Here some comment 2
  3 |       9 |      41 | Here some comment 3
  4 |       8 |      28 | Here some comment 4
  5 |       9 |      40 | Here some comment 5
  6 |       9 |       5 | Here some comment 6
  7 |       9 |       7 | Here some comment 7
  8 |       3 |      48 | Here some comment 8
  9 |       4 |      50 | Here some comment 9
 10 |       7 |      38 | Here some comment 10
(10 rows)

Time: 1.987 ms
```
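If you want to double-check what was generated, a couple of quick queries (my own addition, not part of the original script) show the volume per table and how the posts ended up spread across users:

```sql
-- Row counts per table: should be 10, 50, and 200 with the numbers
-- used in this post.
SELECT (SELECT COUNT(*) FROM users)    AS users,
       (SELECT COUNT(*) FROM posts)    AS posts,
       (SELECT COUNT(*) FROM comments) AS comments;

-- Posts per user: the counts vary because each post picked its user
-- at random.
SELECT user_id, COUNT(*) AS posts
FROM posts
GROUP BY user_id
ORDER BY posts DESC;
```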
This is a very flexible way to generate as much data as you need to test your query performance. I built this for PostgreSQL, but it should be easy to adapt to other SQL databases. Thanks for reading! 👍