You will need a database to run the commands in this chapter. You can use a new db or continue to use
library database from earlier chapters. As a reminder, a new database can be created using the terminal command
createdb db_name. If you want to follow along, then make sure to have the correct data set up for this chapter; it can be found here.
Then use the following command to restore the library database:
psql -d library < types.sql
The SQL commands in
types.sql can also be entered manually into the psql console.
If we were using a text file to store a small amount of data that did not require optimal searching, it would make little sense to worry too much about how we store the individual pieces of data. In contrast, a relational database can contain tables with millions of rows. How data is stored in the columns is important, and the right datatype should be selected for the columns of a table.
As a schema creator you have a choice in which types of columns you use, and how many. RDBMSs require that you specify a type for every column. The schema creator's goal is optimization. A column that will only contain numerical data is best defined as a numerical type. Otherwise the database engine will need to convert the values to numbers with every query, and that's an expensive overhead.
In order to determine what type of column should be used, start by thinking about the data that will be stored in the column:
After you answer the above questions, you can then start to choose a type for your column.
Below are some examples of data and their associated column type:
|Full name||String with a sensible maximum length.|
|Age||Integer with sensible maximum value.|
|Blog post||String with higher maximum length.|
|Month of the year||ENUM that contains a set of values for all months.|
|Money||Numeric with a decimal point.|
The string type is a sequence of "bits" for referring to strings of characters, such as text that says "cat", or the content of a novel.
These are some of the most commonly used string types. The important difference is how they're stored. The VAR in VARCHAR is short for 'variable'. A column defined as
CHAR(50) will have a size within the table that assumes 50 characters. In the event that the string is shorter than the specified length of the CHAR column, PostgreSQL will simply pad the rest with spaces. In contrast to CHAR, VARCHAR contains an extra piece of data that informs PostgreSQL how many bytes to expect. No padding is added if the string inserted is shorter than the maximum size.
SQL lets you select a particular Character Set or charset for databases and tables. Characters in a string are stored in binary format. And the method of encoding and decoding a string can vary, for instance, strings in a different language like Japanese are encoded and decoded differently than strings in english. The default charset in PostgreSQL is UTF8.
In PostgreSQL, charsets are described as character encoding. It can be set using the meta command:
library=# \encoding CHARACTER_SET
For example, if we wanted to set the current character set to Japanese we could write:
library=# \encoding EUC_JP
A set of permitted values from which one must be selected.
Let's say you need a column for the day of the week. You could use a string type, such as CHAR or VARCHAR, but this could cause data entry problems. If someone enters a typo, you would have a hard time querying that incorrectly typed data.
For instance, with a 'weekday' column, if someone were to type an upper case or lower case entry, or misspell a day name during a hurried entry, then selecting all of the Monday records would be difficult; you would have to allow for all possible variations, including typos.
The ENUM type contains a pre-set list of values that a column can contain. An ENUM with those
pre-set values needs to be created before using that type with
CREATE TABLE. For example:
library=# CREATE TYPE day AS ENUM('monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'); CREATE TYPE library=# CREATE TABLE messages ( id SERIAL, day_of_week DAY DEFAULT NULL, message VARCHAR(255) NOT NULL, PRIMARY KEY (id) ); CREATE TABLE
If you were to now
SELECT on the table, you'd get the following:
library=# INSERT INTO messages (day_of_week, message) VALUES ('monday', 'This is a message on Monday'); INSERT 0 1 library=# SELECT * FROM messages; id | day_of_week | message ----+-------------+----------------------------- 1 | monday | This is a message on Monday (1 row)
Trying to insert a value that is not contained within the ENUM definition will cause an error to be thrown.
An integer is simply a "whole number" . An example might be 1 or 50, -50, or 792197 depending on what storage type is used. The key point here is to give consideration to the value you'll be storing, and try to keep fields within a reasonable size of what you need to hold. For instance, a human age column is never going to reach higher than 100 (well, it might but that is a rare). It makes little sense to use a type that allows a range up to 1,000,000. Similarly, it makes little sense to use a type to a column that permits negative numbers. For instance, the number of times a book has been checked out of the library will never be negative.
With PostgreSQL, there are a number of predefined types for integer values, each with a defined range.
Integer values are often described as being signed or unsigned. The real difference between the two is that a 'signed' numerical value can be a negative number, and an 'unsigned' value can only be a positive number. PostgreSQL does not have unsigned values like some other RDBMSs. An unsigned value can be simulated though, if we use a
CHECK constraint that ensures an integer column is only positive.
CREATE TABLE people(age SMALLINT CHECK (age > 0));
The DECIMAL type takes two arguments, one being the maximum number of digits to hold between 1 and 131072, the second is the number of the digits to the right of the decimal point to store.
library=# CREATE TABLE decimal_values ( id SERIAL, item VARCHAR(50) NOT NULL, value DECIMAL(13,2) NOT NULL, PRIMARY KEY(id) ); CREATE TABLE library=# INSERT INTO decimal_values (item, value) VALUES ('Test Item', 12.35); INSERT 0 1 library=# INSERT INTO decimal_values (item, value) VALUES ('Test Item', 12.355); INSERT 0 1 library=# INSERT INTO decimal_values (item, value) VALUES ('Test Item', 12.354); INSERT 0 1 library=# SELECT * FROM decimal_values; id | item | value ----+-----------+------- 1 | Test Item | 12.35 2 | Test Item | 12.36 3 | Test Item | 12.35 (3 rows)
If you paid close attention, here you'll have noticed something. I set the column data type to
DECIMAL(13,2). In the last two examples I entered 12.356 and 12.354 and got 12.36 and 12.35 in return.
PostgreSQL has "rounded" the value specified to try to fit it to the number of decimal places, the 2 in
DECIMAL type uses the same kind of rounding you may have been taught in school, where 0-4 will "round down" and 5 and above will "round up" :
If your table is handling money and you need extremely precise figures, especially when aggregating data, there are guidelines on how to do so in PostgreSQL. A number of people recommend DECIMAL(13,2) as the type of choice. The value 13 represents the number of digits to hold in total, the 2 specifies how many digits to hold after the decimal point.
The defining characteristic of floating type numbers is that they are slightly inexact. Due to how these values are stored internally, some values cannot be stored or used without slight discrepancies taking place. Dealing with these situations is a complex topic that is beyond the scope of this book, and could easily be the topic for an entire course. Wikipedia has some more information about how these inaccuracies have affected the real world.
The PostgreSQL manual says this about floating point types:
If you require exact storage and calculations (such as for monetary amounts), use the numeric type instead.
If you want to do complicated calculations with these types for anything important, especially if you rely on certain behavior in boundary cases (infinity, underflow), you should evaluate the implementation carefully.
Comparing two floating-point values for equality might not always work as expected.
Most of the time, it makes sense to use one of the exact numeric types we've looked at above. However, should you need to use a floating-point type, PostgreSQL supports them with two levels of precision:
Never, ever store money as a floating-point number.
BOOLEAN values can be set as either TRUE or FALSE, with these being defined constants (essentially a text synonym) for 1 and 0.
This type is particularly useful for "flags" that indicate things such as enabled/disabled, on/off, has/does not have, etc:
library=# ALTER TABLE users ADD COLUMN enabled BOOLEAN DEFAULT FALSE; ALTER TABLE Table "public.users" Column | Type | Modifiers ----------+------------------------+---------------------------------------------------- id | integer | not null default nextval('users_id_seq'::regclass) full_name | character varying(255) | not null enabled | boolean | default false Indexes: "users_pkey" PRIMARY KEY, btree (id)
The DATE type is a simple date in MM-DD-YYYY format.
This field comes with some useful functions:
textrepresents an individual part of a timestamp. For instance, we can grab the numerical value of a month from a timestamp. For example, if the month in a timestamp is February, then the returned result will be 2. The following example would return 2:
date_part('month', timestamp '2016-02-12 12:21:30')
The TIMESTAMP type contains both a simple date and time in YYYY-MM-DD HH:MM:SS format. It allows NOW() as a default value so the field can have a field filled out at the time the row was created. For instance:
library=# ALTER TABLE users ADD COLUMN date_created TIMESTAMP DEFAULT NOW(); ALTER TABLE
An example of how this works in practice:
Table "public.users" Column | Type | Modifiers --------------+-----------------------------+---------------------------------------------------- id | integer | not null default nextval('users_id_seq'::regclass) full_name | character varying(255) | not null enabled | boolean | default false date_created | timestamp without time zone | default now() Indexes: "users_pkey" PRIMARY KEY, btree (id) library=# SELECT * FROM users; id | full_name | enabled | date_created ----+------------+---------+---------------------------- 1 | Jack Smith | f | 2016-05-10 12:54:42.968132 2 | John Smith | f | 2016-05-10 12:54:42.968132 (2 rows)
See how the type for our timestamp column reads, "timestamp without time zone"?
In PostgreSQL, the timezone may be included when defining a timestamp column. If
we want to include the currently set timezone, then we use
TIMESTAMP WITH TIMEZONE
when we create a table with a timestamp column. There's also an abbreviation
for including the timezone,
TIMESTAMPTZ; using this will give us the same result.
The TIME type can contain both 24 hour time values in HH:MM:SS or larger values for time quantities. When used for this purpose it can store values between '-838:59:59' and '838:59:59'.
The INTERVAL type is very flexible. It represents a certain period of time that can be in seconds, minutes, or hours. It can also be used to specify a stretch of time in centuries, months, years, and days. Here are some examples:
INTERVAL '3' - An interval of 3 seconds.
INTERVAL '5' YEAR - An interval of 5 years.
INTERVAL '2-5' YEAR TO MONTH - An interval of 2 years, 5 months
Typically, an interval type will either be in the format of years-months or days hours-minutes-seconds. PostgreSQL lets us specify both together with an alternative syntax:
INTERVAL '2 years 3 months 5 days 10 hours 22 minutes 55 seconds' - 2 years, 3 months, 5 days, 10 hours, 22 minutes, 55 seconds.
More about the INTERVAL type and other date/time types can be found here.
Note: There are several date/time functions that PostgreSQL gives us. Any function that deals with a span of time usually returns a value of type interval (e.g. if you subtract two timestamps you are given an INTERVAL in return).