Database Design and Normalization

Designing tables

At the start of a new project, or of a new feature within an existing project, one of the first things developers need to define is the set of entities that interact with the system. In our library project, the entities were users, addresses, and books, and we designed tables and relationships based on how we assumed these entities would interact. As the project evolves and requirements change, the design of our tables may lead to inconsistencies and anomalies when running queries.

Take a look at our books table. As designed, a book can have one and only one author, yet many books are written collaboratively.

library=# \d books
                                       Table "public.books"
     Column     |            Type             |                     Modifiers
----------------+-----------------------------+----------------------------------------------------
 id             | integer                     | not null default nextval('books_id_seq'::regclass)
 title          | character varying(100)      | not null
 author         | character varying(100)      | not null
 published_date | timestamp without time zone | not null
 isbn           | integer                     |
Indexes:
    "books_pkey" PRIMARY KEY, btree (id)
    "books_isbn_key" UNIQUE CONSTRAINT, btree (isbn)
Referenced by:
    TABLE "reviews" CONSTRAINT "reviews_book_id_fkey" FOREIGN KEY (book_id) REFERENCES books(id) ON DELETE CASCADE
    TABLE "users_books" CONSTRAINT "users_books_book_id_fkey" FOREIGN KEY (book_id) REFERENCES books(id) ON UPDATE CASCADE

For example, how do we insert the book Smart Money Smart Kids (http://www.amazon.com/gp/product/1937077632) by Dave Ramsey and Rachel Cruz?

library=# INSERT INTO books(title, author, published_date) VALUES ('Smart Money Smart Kids', 'Dave Ramsey', '4/29/2014');
INSERT 0 1
library=# INSERT INTO books(title, author, published_date) VALUES ('Smart Money Smart Kids', 'Rachel Cruz', '4/29/2014');
INSERT 0 1

Now we have duplicate data in our database:

library=# SELECT * FROM books WHERE title = 'Smart Money Smart Kids';
 id |         title          |   author    |   published_date    | isbn
----+------------------------+-------------+---------------------+------
 23 | Smart Money Smart Kids | Dave Ramsey | 2014-04-29 00:00:00 |
 24 | Smart Money Smart Kids | Rachel Cruz | 2014-04-29 00:00:00 |
(2 rows)

Not only do we have two different ids for the same book, but this anomaly could be carried into the users_books table when a book is checked out. Database inconsistencies like these occur all the time in real-life work situations; database developers and administrators have to work together to redesign and normalize the database, and to clean up the data.
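One way to spot this kind of duplication is a GROUP BY/HAVING query. Here is a minimal sketch using an in-memory SQLite database so it runs standalone (the query itself works unchanged in PostgreSQL); the extra third row is illustrative data, not from our library tables:

```python
import sqlite3

# In-memory database with a simplified books table for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.executemany(
    "INSERT INTO books (title, author) VALUES (?, ?)",
    [
        ("Smart Money Smart Kids", "Dave Ramsey"),
        ("Smart Money Smart Kids", "Rachel Cruz"),
        ("Some Other Book", "Another Author"),
    ],
)

# Titles stored more than once are candidates for cleanup
dupes = conn.execute(
    "SELECT title, COUNT(*) FROM books GROUP BY title HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # [('Smart Money Smart Kids', 2)]
```

A query like this is often the first step of the cleanup: it tells you which rows the redesigned schema will need to merge.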

The other way we could have done it was to add another author column and call it author_2. In this case, we would have a single id for the book, but multiple authors.

Table "public.books"
Column         |            Type             |                     Modifiers
---------------+-----------------------------+----------------------------------------------------
id             | integer                     | not null default nextval('books_id_seq'::regclass)
title          | character varying(100)      | not null
author         | character varying(100)      | not null
author_2       | character varying(100)      |
published_date | timestamp without time zone | not null
isbn           | integer                     |


 id |         title          |   author    |   author_2    |   published_date    | isbn
----+------------------------+-------------+---------------+---------------------+------
 23 | Smart Money Smart Kids | Dave Ramsey |  Rachel Cruz  | 2014-04-29 00:00:00 |
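This design has a query-side cost too. A sketch with an in-memory SQLite database (the table and data are illustrative; the SQL is the same in PostgreSQL) shows that every author search must check every author column:

```python
import sqlite3

# Simplified books table using the author_2 design
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT, author_2 TEXT)"
)
conn.execute(
    "INSERT INTO books VALUES (23, 'Smart Money Smart Kids', 'Dave Ramsey', 'Rachel Cruz')"
)

# 'Rachel Cruz' could be in either column, so both must be checked --
# and the query grows another OR clause for every author_N column added later
rows = conn.execute(
    "SELECT title FROM books WHERE author = ? OR author_2 = ?",
    ("Rachel Cruz", "Rachel Cruz"),
).fetchall()
print(rows)  # [('Smart Money Smart Kids',)]
```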

Normalizing a database

Separating this data into multiple tables is called normalizing the database. Normalization is the process of organizing data in tables so that it is stored without redundancy or anomalies. In our example above, adding multiple author columns keeps all the information for a book in a single row, but it restricts future growth: every additional co-author requires another column. Adding multiple records, on the other hand, results in duplicate data, but keeps the table in normal form and permits future growth.

What is "Normal Form"?

Normal form is a term from database theory. A database can be in 1NF (First Normal Form), 2NF (Second Normal Form), 3NF (Third Normal Form), BCNF (Boyce–Codd Normal Form, also known as 3.5NF), and so on. The higher the normal form, the less likely the database is to have inconsistencies. In most cases, 3NF is considered good enough.

As we worked through the design above, our second example was a denormalized database: we added an additional column to hold the second author, which does not allow the number of authors per book to grow. Our first example, even though it had duplicate records for the same book, is in First Normal Form.

Denormalized table

 id |         title          |   author    |   author_2    |   published_date    | isbn
----+------------------------+-------------+---------------+---------------------+------
 23 | Smart Money Smart Kids | Dave Ramsey |  Rachel Cruz  | 2014-04-29 00:00:00 |

First Normal Form

 id |         title          |   author    |   published_date    | isbn
----+------------------------+-------------+---------------------+------
 23 | Smart Money Smart Kids | Dave Ramsey | 2014-04-29 00:00:00 |
 24 | Smart Money Smart Kids | Rachel Cruz | 2014-04-29 00:00:00 |

Second Normal Form

In Second Normal Form, we create a new table to store the authors, and we store the title and the author in this table. Designing only to 2NF is not common practice; it is mentioned here to show how the normal forms progress. Usually Third Normal Form (3NF), which uses identifiers instead of data values to reference entities, is the target.

Books

 id |         title
----+------------------------
 23 | Smart Money Smart Kids

Authors

         title          |   author
------------------------+-------------
 Smart Money Smart Kids | Dave Ramsey
 Smart Money Smart Kids | Rachel Cruz

Third Normal Form

To model the relationship "a book has many authors", we introduce a one-to-many relationship: the authors table from Second Normal Form now references a book by its id instead of repeating its title.

Books

 id |         title
----+------------------------
 23 | Smart Money Smart Kids

Authors

 author_id | book_id |   author
-----------+---------+-------------
         1 |      23 | Dave Ramsey
         2 |      23 | Rachel Cruz

Now we can represent as many authors as we need for the book, and retrieve the results correctly with a JOIN on the authors table. However, Dave Ramsey has written many books, and as more books get added to our database, we get duplicate author data.
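The JOIN described above can be sketched with an in-memory SQLite database (the SQL works the same in PostgreSQL):

```python
import sqlite3

# Third-normal-form tables: authors references books by id
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        book_id   INTEGER REFERENCES books (id),
        author    TEXT
    );
    INSERT INTO books VALUES (23, 'Smart Money Smart Kids');
    INSERT INTO authors VALUES (1, 23, 'Dave Ramsey'), (2, 23, 'Rachel Cruz');
""")

# One row per (book, author) pair comes back from the join
rows = conn.execute("""
    SELECT b.title, a.author
    FROM books b
    JOIN authors a ON a.book_id = b.id
    ORDER BY a.author_id
""").fetchall()
print(rows)
# [('Smart Money Smart Kids', 'Dave Ramsey'), ('Smart Money Smart Kids', 'Rachel Cruz')]
```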

Fourth Normal Form

In Fourth Normal Form (4NF), we break the tables up further into a many-to-many relationship. We add two tables: one to store the author data and one to store the author-book relationship. We end up with three tables, and each entity table (books and authors) holds the data for its own entity only.

Books

 id |         title
----+------------------------
 23 | Smart Money Smart Kids

Authors


 id |     author
----+--------------
  1 |  Dave Ramsey
  2 |  Rachel Cruz

Books_Authors

 author_id | book_id
-----------+---------
         1 |      23
         2 |      23
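The three-table design above can be exercised end to end with an in-memory SQLite database (the SQL carries over to PostgreSQL). Joining through the junction table answers questions in both directions while each author row is stored only once:

```python
import sqlite3

# 4NF layout: two entity tables plus a books_authors junction table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books   (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE authors (id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE books_authors (
        author_id INTEGER REFERENCES authors (id),
        book_id   INTEGER REFERENCES books (id),
        PRIMARY KEY (author_id, book_id)
    );
    INSERT INTO books   VALUES (23, 'Smart Money Smart Kids');
    INSERT INTO authors VALUES (1, 'Dave Ramsey'), (2, 'Rachel Cruz');
    INSERT INTO books_authors VALUES (1, 23), (2, 23);
""")

# All authors of one book: join through the junction table
book_authors = conn.execute("""
    SELECT a.author
    FROM authors a
    JOIN books_authors ba ON ba.author_id = a.id
    JOIN books b ON b.id = ba.book_id
    WHERE b.title = 'Smart Money Smart Kids'
    ORDER BY a.id
""").fetchall()
print(book_authors)  # [('Dave Ramsey',), ('Rachel Cruz',)]

# ...and all books by one author, the reverse direction through the same table
ramsey_books = conn.execute("""
    SELECT b.title
    FROM books b
    JOIN books_authors ba ON ba.book_id = b.id
    JOIN authors a ON a.id = ba.author_id
    WHERE a.author = 'Dave Ramsey'
""").fetchall()
print(ramsey_books)  # [('Smart Money Smart Kids',)]
```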

Denormalizing a database

Sometimes having data in a normalized format takes multiple joins to retrieve, which makes it inefficient. Denormalization is required when the expected data may not be exhibiting the pattern that the tables were designed for. For example, we may only have one book in our whole database with multiple authors. In that case the database can be denormalized back into a 1NF form. In our example, it would mean allowing multiple columns for authors.