You are reading the 1st edition of this book. On November 20th, this book will be replaced by the 2nd edition. The book you are currently reading, the first edition, will be available at a different URL. Please read this post for more information. Note that viewing the post requires registration (free) .

Schemas and Columns

You will need a database to run the commands in this chapter. You can use a new db or continue to use the library database from earlier chapters. As a reminder, a new database can be created using the terminal command createdb db_name. If you want to follow along, then make sure to have the correct data set up for this chapter; it can be found here.

Then use the following command to restore the library database: psql -d library < types.sql The SQL commands in types.sql can also be entered manually into the psql console.

Introduction

If we were using a text file to store a small amount of data that did not require optimal searching, it would make little sense to worry too much about how we store the individual pieces of data. In contrast, a relational database can contain tables with millions of rows. How data is stored in the columns is important, and the right datatype should be selected for the columns of a table.

Information types

What Data Type should I use?

As a schema creator you have a choice in which types of columns you use, and how many. RDBMSs require that you specify a type for every column. The schema creator's goal is optimization. A column that will only contain numerical data is best defined as a numerical type. Otherwise the database engine will need to convert the values to numbers with every query, and that's an expensive overhead.

In order to determine what type of column should be used, start by thinking about the data that will be stored in the column:

  • Is the data numerical (numbers 0-9) or alphanumerical (numbers 0-9, letters a-z and A-Z)?
  • If the data is numerical, does it contain negative numbers? What is the largest value it is likely to hold?
  • Are there multiple possible values?
  • Are there multiple possible values from a limited set of values?
  • Do possible values contain binary data?
  • Is the data related to dates and time?
  • Is the data a flag for a true/false condition?

After you answer the above questions, you can then start to choose a type for your column.

Below are some examples of data and their associated column type:

Data Column Type
Full name String with a sensible maximum length.
Age Integer with sensible maximum value.
Blog post String with higher maximum length.
Month of the year ENUM that contains a set of values for all months.
Money Numeric with a decimal point.

String Types

The string type is a sequence of "bits" for referring to strings of characters, such as text that says "cat", or the content of a novel.

CHAR and VARCHAR

These are some of the most commonly used string types. The important difference is how they're stored. The VAR in VARCHAR is short for 'variable'. A column defined as CHAR(50) will have a size within the table that assumes 50 characters. In the event that the string is shorter than the specified length of the CHAR column, PostgreSQL will simply pad the rest with spaces. In contrast to CHAR, VARCHAR contains an extra piece of data that informs PostgreSQL how many bytes to expect. No padding is added if the string inserted is shorter than the maximum size.

Charsets

SQL lets you select a particular Character Set or charset for databases and tables. Characters in a string are stored in binary format. And the method of encoding and decoding a string can vary, for instance, strings in a different language like Japanese are encoded and decoded differently than strings in english. The default charset in PostgreSQL is UTF8.

In PostgreSQL, charsets are described as character encoding. It can be set using the meta command:

library=# \encoding CHARACTER_SET

For example, if we wanted to set the current character set to Japanese we could write:

library=# \encoding EUC_JP

ENUM

A set of permitted values from which one must be selected.

Let's say you need a column for the day of the week. You could use a string type, such as CHAR or VARCHAR, but this could cause data entry problems. If someone enters a typo, you would have a hard time querying that incorrectly typed data.

For instance, with a 'weekday' column, if someone were to type an upper case or lower case entry, or misspell a day name during a hurried entry, then selecting all of the Monday records would be difficult; you would have to allow for all possible variations, including typos.

The ENUM type contains a pre-set list of values that a column can contain. An ENUM with those pre-set values needs to be created before using that type with CREATE TABLE. For example:

library=# CREATE TYPE day AS ENUM('monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday');
CREATE TYPE
library=# CREATE TABLE messages (
id SERIAL,
day_of_week DAY DEFAULT NULL,
message VARCHAR(255) NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE

If you were to now INSERT and SELECT on the table, you'd get the following:

library=# INSERT INTO messages (day_of_week, message) VALUES ('monday', 'This is a message on Monday');
INSERT 0 1

library=# SELECT * FROM messages;
 id | day_of_week |           message
----+-------------+-----------------------------
  1 | monday      | This is a message on Monday
(1 row)

Trying to insert a value that is not contained within the ENUM definition will cause an error to be thrown.

Numerical Data

Integers

An integer is simply a "whole number" . An example might be 1 or 50, -50, or 792197 depending on what storage type is used. The key point here is to give consideration to the value you'll be storing, and try to keep fields within a reasonable size of what you need to hold. For instance, a human age column is never going to reach higher than 100 (well, it might but that is a rare). It makes little sense to use a type that allows a range up to 1,000,000. Similarly, it makes little sense to use a type to a column that permits negative numbers. For instance, the number of times a book has been checked out of the library will never be negative.

With PostgreSQL, there are a number of predefined types for integer values, each with a defined range.

  • SMALLINT - range: -32768 to 32767
  • INT - range: -2147483648 to 2147483647
  • BIGINT - range: -9223372036854775808 to 9223372036854775807

Signed vs. Unsigned Values

Integer values are often described as being signed or unsigned. The real difference between the two is that a 'signed' numerical value can be a negative number, and an 'unsigned' value can only be a positive number. PostgreSQL does not have unsigned values like some other RDBMSs. An unsigned value can be simulated though, if we use a CHECK constraint that ensures an integer column is only positive.

CREATE TABLE people(age SMALLINT CHECK (age > 0));

Fixed Point Types

DECIMAL

The DECIMAL type takes two arguments, one being the maximum number of digits to hold between 1 and 131072, the second is the number of the digits to the right of the decimal point to store.

library=# CREATE TABLE decimal_values (
id SERIAL,
item VARCHAR(50) NOT NULL,
value DECIMAL(13,2) NOT NULL,
PRIMARY KEY(id)
);
CREATE TABLE

library=# INSERT INTO decimal_values (item, value) VALUES ('Test Item', 12.35);
INSERT 0 1
library=# INSERT INTO decimal_values (item, value) VALUES ('Test Item', 12.355);
INSERT 0 1
library=# INSERT INTO decimal_values (item, value) VALUES ('Test Item', 12.354);
INSERT 0 1
library=# SELECT * FROM decimal_values;
 id |   item    | value
----+-----------+-------
  1 | Test Item | 12.35
  2 | Test Item | 12.36
  3 | Test Item | 12.35
(3 rows)

If you paid close attention, here you'll have noticed something. I set the column data type to DECIMAL(13,2). In the last two examples I entered 12.356 and 12.354 and got 12.36 and 12.35 in return.

PostgreSQL has "rounded" the value specified to try to fit it to the number of decimal places, the 2 in DECIMAL(13,2). The DECIMAL type uses the same kind of rounding you may have been taught in school, where 0-4 will "round down" and 5 and above will "round up" :

  • 0.512 will round to 0.51
  • 0.514 will round to 0.51
  • 0.515 will round to 0.52

Handling monetary values

If your table is handling money and you need extremely precise figures, especially when aggregating data, there are guidelines on how to do so in PostgreSQL. A number of people recommend DECIMAL(13,2) as the type of choice. The value 13 represents the number of digits to hold in total, the 2 specifies how many digits to hold after the decimal point.

Floating Point Types

The defining characteristic of floating type numbers is that they are slightly inexact. Due to how these values are stored internally, some values cannot be stored or used without slight discrepancies taking place. Dealing with these situations is a complex topic that is beyond the scope of this book, and could easily be the topic for an entire course. Wikipedia has some more information about how these inaccuracies have affected the real world.

The PostgreSQL manual says this about floating point types:

If you require exact storage and calculations (such as for monetary amounts), use the numeric type instead.

If you want to do complicated calculations with these types for anything important, especially if you rely on certain behavior in boundary cases (infinity, underflow), you should evaluate the implementation carefully.

Comparing two floating-point values for equality might not always work as expected.

Most of the time, it makes sense to use one of the exact numeric types we've looked at above. However, should you need to use a floating-point type, PostgreSQL supports them with two levels of precision:

  • REAL - precision: 6 decimal digits
  • DOUBLE - precision: 15 decimal digits

Never, ever store money as a floating-point number.

Booleans

BOOLEAN values can be set as either TRUE or FALSE, with these being defined constants (essentially a text synonym) for 1 and 0.

This type is particularly useful for "flags" that indicate things such as enabled/disabled, on/off, has/does not have, etc:

library=# ALTER TABLE users ADD COLUMN enabled BOOLEAN DEFAULT FALSE;
ALTER TABLE

Table "public.users"
Column    |          Type          |                     Modifiers
----------+------------------------+----------------------------------------------------
id        | integer                | not null default nextval('users_id_seq'::regclass)
full_name | character varying(255) | not null
enabled   | boolean                | default false
Indexes:
"users_pkey" PRIMARY KEY, btree (id)

Date and Time Values

DATE

The DATE type is a simple date in MM-DD-YYYY format.

TIMESTAMP

This field comes with some useful functions:

  • date_part(text, timestamp) returns the numerical value, specified by text, where text represents an individual part of a timestamp. For instance, we can grab the numerical value of a month from a timestamp. For example, if the month in a timestamp is February, then the returned result will be 2. The following example would return 2:

    date_part('month', timestamp '2016-02-12 12:21:30')
    
  • NOW() returns the current date and time.

The TIMESTAMP type contains both a simple date and time in YYYY-MM-DD HH:MM:SS format. It allows NOW() as a default value so the field can have a field filled out at the time the row was created. For instance:

library=# ALTER TABLE users ADD COLUMN date_created TIMESTAMP DEFAULT NOW();
ALTER TABLE

An example of how this works in practice:

Table "public.users"
    Column    |            Type             |                     Modifiers
--------------+-----------------------------+----------------------------------------------------
 id           | integer                     | not null default nextval('users_id_seq'::regclass)
 full_name    | character varying(255)      | not null
 enabled      | boolean                     | default false
 date_created | timestamp without time zone | default now()
Indexes:
"users_pkey" PRIMARY KEY, btree (id)

library=# SELECT * FROM users;
 id | full_name  | enabled |        date_created
----+------------+---------+----------------------------
  1 | Jack Smith | f       | 2016-05-10 12:54:42.968132
  2 | John Smith | f       | 2016-05-10 12:54:42.968132
(2 rows)

See how the type for our timestamp column reads, "timestamp without time zone"? In PostgreSQL, the timezone may be included when defining a timestamp column. If we want to include the currently set timezone, then we use TIMESTAMP WITH TIMEZONE when we create a table with a timestamp column. There's also an abbreviation for including the timezone, TIMESTAMPTZ; using this will give us the same result.

TIME

The TIME type can contain both 24 hour time values in HH:MM:SS or larger values for time quantities. When used for this purpose it can store values between '-838:59:59' and '838:59:59'.

INTERVAL

The INTERVAL type is very flexible. It represents a certain period of time that can be in seconds, minutes, or hours. It can also be used to specify a stretch of time in centuries, months, years, and days. Here are some examples:

INTERVAL '3' - An interval of 3 seconds. INTERVAL '5' YEAR - An interval of 5 years. INTERVAL '2-5' YEAR TO MONTH - An interval of 2 years, 5 months

Typically, an interval type will either be in the format of years-months or days hours-minutes-seconds. PostgreSQL lets us specify both together with an alternative syntax:

INTERVAL '2 years 3 months 5 days 10 hours 22 minutes 55 seconds' - 2 years, 3 months, 5 days, 10 hours, 22 minutes, 55 seconds.

More about the INTERVAL type and other date/time types can be found here.

Note: There are several date/time functions that PostgreSQL gives us. Any function that deals with a span of time usually returns a value of type interval (e.g. if you subtract two timestamps you are given an INTERVAL in return).