More on Select

If you don't already have a library database, then make sure to run createdb library from the terminal before starting this chapter. The library database is needed to work through this chapter's content. The schema and data needed to work through this chapter are stored here. If you haven't been following along feel free to download it, and then use the command below to restore your library database.

psql -d library < advanced_querying.sql

The SELECT command is one of the most important and frequently used features of SQL language. It is also very powerful. In the earlier chapters we saw a glimpse of features we get from SELECT -- namely, selecting specific columns, adding criteria, and sorting. We will now look at additional features available in the syntax of the SELECT command. For the most part, SQL is used to access the right data in order to make critical decisions. SQL WHERE clauses let us focus our attention on the exact dataset we need. We find the data we need by adding conditions to the query.

Conditions with WHERE

Conditions are added to SQL statements with the WHERE clause. Suppose we want to retrieve the records for the user John Smith. Run the query below from your PostgreSQL console. Only the records with the username equal to John Smith will be returned by the database.

library=# SELECT * FROM users WHERE username='John Smith';
 id |  username  | enabled |         last_login
----+------------+---------+----------------------------
  1 | John Smith | t       | 2016-04-06 13:00:16.551272
(1 row)

Only one record set should have returned.

Use of = sign in SQL

In the WHERE clause of SQL queries, = is treated as an 'equality' operator, in that it compares things. In the case above it's checking whether the column 'username' is equal to the value 'John Smith'. In UPDATE queries when used with the SET clause, it becomes the assignment operator, and it sets a value for a field.

Combining queries

The WHERE clause allows combining queries using AND and OR. If you run the query below, two records will be returned. You will see that the 'OR' allows additional data based on the clause.

library=# SELECT * FROM users WHERE username='John Smith' OR username='Jane Smiley';
 id |  username   | enabled |         last_login
----+-------------+---------+----------------------------
  1 | John Smith  | t       | 2016-04-06 13:00:16.551272
  2 | Jane Smiley | t       | 2016-04-06 13:00:16.562224
(2 rows)

If you run the following two statements you will see the difference on how query combining restricts the result by adding the AND clause.

library=# SELECT * FROM books WHERE title LIKE '%My%';
 id |       title        |   author    |       published_date       | isbn
----+--------------------+-------------+----------------------------+------
  1 | My First SQL book  | Mary Parker | 2016-04-06 13:00:16.555379 |
  2 | My Second SQL book | John Mayer  | 2016-04-06 13:00:16.560789 |
  3 | My Third SQL book  | Alice Munro | 2016-04-06 13:00:16.569222 |
(3 rows)
library=# SELECT * FROM books WHERE title LIKE '%My%' AND author='John Mayer';
 id |       title        |   author   |       published_date       | isbn
----+--------------------+------------+----------------------------+------
  2 | My Second SQL book | John Mayer | 2016-04-06 13:00:16.560789 |
(1 row)

Formatting data

Now, suppose you want to either format data to be returned or combine columns and then format data. You can return both the title of the book and the author in one column with the string by added in the middle of the two. Run the query below on your PostgreSQL console.

Concatenation is a useful feature, if you need to return formatted data in your application.

library=# SELECT CONCAT(title,' by ', author) AS "Books By" FROM books;
                               Books By
-----------------------------------------------------------------------
 My First SQL book by Mary Parker
 My Second SQL book by John Mayer
 My Third SQL book by Alice Munro
 Christmas Stories from French and Spanish writers by Antoinette Ogden
 Michael and His Lost Angel by  Henry Arthur Jones
 Weather Warnings for Watchers by   The Clerk
 Cronaca della rivoluzione di Milano by   Leone Tettoni
 Thomas More by Henriette Roland Holst van der Schalk
 Kim by Rudyard Kipling
 Po├ęsies de Daniel Lesueur by   Daniel Lesueur
...
(19 rows)

Aliasing

In the above case the column name has been changed by a technique known as aliasing. Aliasing is used when querying for different types of results and is not restricted to columns that have formatted content.

Example: Run the query below to retrieve the count of users that are enabled. The return column name has been aliased with the name enabled_count

library=# SELECT COUNT(id) AS enabled_count FROM users WHERE enabled = true;
 enabled_count
---------------
             3
(1 row)

If the command above was run without an alias, then the column name would be the original name of the function that was run on that column.

library=# SELECT COUNT(id) FROM users WHERE enabled = true;
 count
-------
     3
(1 row)

Using DISTINCT

Sometimes when we join tables we get duplicate records, especially in left or right joins with one to many relations or many to many relations. SQL provides the DISTINCT qualifier to filter duplicate rows.

In the example below we want to SELECT users that have at least one book. In order to do that, we make an INNER JOIN on the users_books table and retrieve the records. Since John Smith has two books, his name appears twice when the query is run.

Try to run the query below:

library=# SELECT u.username FROM users u
          LEFT JOIN users_books ub ON (ub.user_id = u.id);
  username
-------------
 John Smith
 John Smith
 Jane Smiley
 Alice Munro
(4 rows)

Now, try to run the query below. John Smith appears only once in the result set. Using the DISTINCT clause we are able to eliminate the additional record.

library=# SELECT DISTINCT u.username FROM users u
          LEFT JOIN users_books ub ON (ub.user_id = u.id);
  username
-------------
 Jane Smiley
 Alice Munro
 John Smith
(3 rows)

Using LIMIT

LIMIT is a useful clause in SELECT statements when we want to get a preview or taste of what data is available or would be returned. This is especially useful during development when forming your queries and getting an understanding of the dataset and data quality.

Another place where LIMIT is beneficial is in applications with large result sets. Returning all the data could take up a lot of resources; LIMIT allows the right amount of data to be returned to show the user. It is used in apps that support pagination; there are other techniques involved such as cursor to accomplish this, but the LIMIT clause of the SELECT is the base on which pagination apps are built.

library=# SELECT title, author FROM books LIMIT 5;
                       title                       |       author
---------------------------------------------------+---------------------
 My First SQL book                                 | Mary Parker
 My Second SQL book                                | John Mayer
 My Third SQL book                                 | Alice Munro
 Christmas Stories from French and Spanish writers | Antoinette Ogden
 Michael and His Lost Angel                        |  Henry Arthur Jones
(5 rows)

Subqueries

The results of a SELECT query can be used as a condition in another SELECT query. This is called nesting. The query that is nested is referred to as a subquery.

For example, suppose we need to

Select users that have no books checked out

We would do this by finding users whose user_id is not in the users_books table. The users_books table is the one in which we store the relations between users and books. If no relation is found, that would mean that the user has not checked out any books.

You can run the query below to view the results.

library=# SELECT u.username FROM users u
          WHERE u.id NOT IN (SELECT ub.user_id FROM users_books ub);
  username
-------------
 Alice Munro
(1 row)

In the code above, the NOT IN clause compares the current user_id to all results from the subquery. If that id number isn't part of the subquery results, then the username for current row is added to the result set.

Subqueries vs Joins

As you write more queries, you may find that there is more than one way to write a query and achieve the same results. The most common choices are between subqueries and JOINs.

Suppose we need to select users that have addresses in the database. We know that there is a 1-1 join between the users and addresses tables. We write the query using an INNER JOIN.

Run the query below and you can see the results. We have 3 users in the user table and only 2 are returned.

library=# SELECT users.* FROM users INNER JOIN addresses ON users.id = addresses.user_id;
 id |  username   | enabled |         last_login
----+-------------+---------+----------------------------
  1 | John Smith  | t       | 2016-04-06 13:00:16.551272
  2 | Jane Smiley | t       | 2016-04-06 13:00:16.562224
(2 rows)

The same results can be achieved with the subquery below. In this query, we look to see if a user_id is matched from the address table, in which case the user must have an address.

library=# SELECT * FROM users
          WHERE users.id = (SELECT user_id FROM addresses WHERE user_id = users.id);
 id |  username   | enabled |         last_login
----+-------------+---------+----------------------------
  1 | John Smith  | t       | 2016-04-06 13:00:16.551272
  2 | Jane Smiley | t       | 2016-04-06 13:00:16.562224
(2 rows)

When creating queries that return the same result, you may want to look at the performance of these queries and select the one that is faster. As a general rule, JOINs are faster than subqueries.

SQL provides the EXPLAIN statement, which gives a step by step analysis of how the query will be run internally by PostgreSQL. We won't get into how to use EXPLAIN in this book, but the PostgreSQL documentation has a good overview of its use.