As we stated earlier, the
SELECT command is one of the most important and frequently used features of the SQL language. It is also very powerful. In the earlier chapters we discovered a few of the features we get with
SELECT -- namely, selecting specific columns, selecting specific rows by filtering in various ways, and ordering the results of our queries. In this chapter we'll look at some more of the features available within the syntax of the
We'll introduce how to use functions to process data. We'll also explore how data can be grouped together based on various criteria. First of all though, let's look at how we can further filter our data by adding
DISTINCT clauses to our queries.
SELECT statements we've used so far have returned all of the results that match the conditions of our statement. That's not a problem for our
users table, since we only have a few rows of data in it. When working with large datasets, a common requirement is to only display one portion of the data at a time.
Displaying portions of data as separate 'pages' is a user interface pattern used in many web applications, generally referred to as 'pagination'. An example of this can be seen in the Launch School forum pages, where twelve forum posts are displayed on the first 'page' and you need to navigate to the next 'page' to see the next twelve.
OFFSET clauses of
SELECT are the base on which pagination is built. Let's look at how it works.
Say we are outputting the details of all of our users to a user admin page of our web application. We perhaps use this page to manage individual users and so only want to display one user at a time. We can do this by adding a
LIMIT clause to our
SELECT * FROM users LIMIT 1;
Setting the limit to
1 lets us select a single row, the first row, from our table.
Executing this query should produce the following result:
id | full_name | enabled | last_login ----+------------+---------+---------------------------- 1 | John Smith | f | 2017-10-25 10:26:10.015152 (1 row)
If we want to return the second row instead, we need a way to skip the first row when querying data.
We can do this with the
OFFSET clause. This clause will skip the first N rows. Let's give it a try.
SELECT * FROM users LIMIT 1 OFFSET 1;
id | full_name | enabled | last_login ----+--------------+---------+---------------------------- 2 | Jane Smith | t | 2017-10-25 10:26:50.295461 (1 row)
Here we still only return one row, due to the
LIMIT clause, but since we have
OFFSET our query by
1, it is the second row that is returned instead of the first. For each row we want to skip, we can increment the value of offset by 1.
If we want to return more than one result per page, we can adjust the values of
OFFSET accordingly. For example, we could imagine the query for page 2 of the Launch School general forum looking something like this:
SELECT topic, author, publish_date, category, replies_count, likes_count, last_activity_date FROM posts LIMIT 12 OFFSET 12;
As well as specific use cases such as pagination,
LIMIT can also be useful in development when testing our queries. We can use
LIMIT to get a preview or taste of what data is available or would be returned rather than returning the entire dataset. This is especially useful during development when forming your queries and getting an understanding of the dataset and data quality.
On the subject of data quality issues, a common one that you might encounter is duplicate data in your tables. There are situations where you might need to formulate queries in a certain way in order to deal with duplication. One way to do that is with the
DISTINCT clause, which we'll look at next.
Sometimes duplicate data is unavoidable. For example, we might get duplication when joining more than one table together. We'll delve into working with multiple tables later in this book, for now let's look at a way to deal with duplication by using the
DISTINCT qualifier. To demonstrate how this works let's first add some duplicate data to our
INSERT INTO users (id, full_name) VALUES (4, 'Harry Potter'); INSERT INTO users (id, full_name) VALUES (5, 'Jane Smith');
id | full_name | enabled | last_login ----+--------------+---------+---------------------------- 1 | John Smith | f | 2017-10-25 10:26:10.015152 2 | Jane Smith | t | 2017-10-25 10:26:50.295461 3 | Harry Potter | t | 2017-10-25 10:26:50.295461 4 | Harry Potter | t | 2017-10-25 10:36:38.188286 5 | Jane Smith | t | 2017-10-25 10:36:43.324015 (5 rows)
If we select the
full_name column from our table, we get five rows back, two of which contain duplicate names:
SELECT full_name FROM users;
full_name -------------- John Smith Jane Smith Harry Potter Harry Potter Jane Smith (5 rows)
We can use
DISTINCT as part of our
SELECT query to only return distinct, or unique, values. Try running the following query:
SELECT DISTINCT full_name FROM users;
full_name -------------- John Smith Jane Smith Harry Potter (3 rows)
Here we only get three rows back, since there are only three distinct values in our
DISTINCT can be useful when used in conjunction with SQL functions. Compare, for example, the results of the following two queries using the
SELECT count(full_name) FROM users;
count ------- 5 (1 row)
SELECT count(DISTINCT full_name) FROM users;
count ------- 3 (1 row)
The first query only really tells us how many rows of data we have in our table. The second one tells us how many unique names we have in the
We haven't really looked at SQL functions yet, so let's do that now.
These functions can generally be grouped into different types. Some of the most commonly used types of functions are:
We're not going to go into detail here on each function that exists for every function type. The most important thing is to know what types of functions exist and have a general understanding of how and when to use them. When you need to perform a specific operation, you can always look up a function for that operation in the documentation.
String Functions, as their name suggests, perform some sort of operation on values whose data type is String. Some examples are:
||This returns the length of every user's name. You could also use
||If any of the data in our
Just as string functions perform operations on String data, Date/ Time functions, for the most part, perform operations on date and time data. Many of the date/ time functions take time or timestamp inputs. Our
last_login column, for example, has a data type of
timestamp and so data in that column can act as an argument to such functions:
Aggregate functions perform aggregation; that is, they compute a single result from a set of input values. We briefly looked at one of these,
count, a little earlier in this chapter.
||Returns the number of values in the column passed in as an argument. This type of function can be very useful depending on the context. We could find the number of users who have enabled an account, or even how many users have certain last names if we use the above statement with other clauses.|
||Not to be confused with
||This returns the lowest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string.|
||This returns the highest value in a column for all of the selected rows. Can be used with various data types such as numeric, date/ time, and string.|
||Returns the average (arithmetic mean) of numeric type values for all of the selected rows.|
count function in our table above isn't all that useful. It would simply count how many values we have in the
id column; since we have five rows in our table, we have five values in the
id column, so the result would be
5. You can quickly execute that statement just to check:
SELECT count(id) FROM users;
count ------- 5 (1 row)
Aggregate functions really start to be useful when grouping table rows together. The way we do that is by using the
GROUP BY clause.
Sometimes you need to combine data results together to form more meaningful information. Let's say we wanted to implement one of the examples mentioned earlier, and count the number of users who have accounts that are or are not enabled. How would we approach this?
We could try using a
WHERE clause to first filter the results to rows where the
enabled column has a value of
true, and then use the
count function to count the values in the
id column (which is equal to the number of rows, we could actually use any column here).
SELECT count(id) FROM users WHERE enabled = true;
count ------- 4 (1 row)
This only shows the number of rows where the value of
true though; if we want to know how many rows there are where
false, we have to change our condition to
WHERE enabled = false.
What if we want to show the counts for
false at the same time? What we need is some way to split up the
false rows into two separate groups and then count the rows in each group.
GROUP BY clause allows us to do exactly that:
SELECT enabled, count(id) FROM users GROUP BY enabled;
enabled | count ---------+------- f | 1 t | 4 (2 rows)
Although we don't need to include
enabled in our column list in order to use it in our
GROUP BY clause, doing so makes our output more meaningful as we can clearly see the correlation between the count and the values in
One thing to be aware of when using aggregate functions, is that if you include columns in the column list alongside the function, then those columns must also be included in a
GROUP BY clause. For example, the following statement returns an error:
SELECT enabled, full_name, count(id) FROM users GROUP BY enabled; -- full_name is not grouped
ERROR: column "users.full_name" must appear in the GROUP BY clause or be used in an aggregate function
The PostgreSQL documentation says:
GROUP BYis present, or any aggregate functions are present, it is not valid for the
SELECTlist expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.
Of particular interest here is that all of the columns in the select list must either be included in the
GROUP BY clause, be the result of an aggregate function, or the
GROUP BY clause must be based on the primary key. This requirement ensures that there is a single value for every column in the result set. This rule can be a bit difficult to work with at first. However, practice using
GROUP BY will gradually lead to a working understanding.
Ian Austin, a Launch School student, wrote an excellent article on the
GROUP BY clause and aggregate functions. If you're having trouble understanding
GROUP BY, be sure to read this article - it's worth your time.
In this chapter we've built on our knowledge of
SELECT, and looked at a number of different ways we can make our
SELECT queries more flexible and powerful:
Thus far in this book we've looked at a lot of the things that you need to know in order to work with a single table in a database; from creating, altering, or even deleting the table itself, through to adding data and then querying that data in a number of different ways. There's one thing we haven't looked at yet, and that's changing data that already exists in your table.
In the next chapter we'll do exactly that by looking at the
DELETE commands that will let us modify existing data in our table.
Make sure you are connected to the
encyclopedia database. Write a query to retrieve the first row of data from the
SELECT * FROM countries LIMIT 1;
id | name | capital | population ----+--------+---------+------------ 1 | France | Paris | 67158000 (1 row)
Write a query to retrieve the name of the country with the largest population.
SELECT name FROM countries ORDER BY population DESC LIMIT 1;
name ------ USA (1 row)
Here we first order the rows in descending order by population, and then retrieve the
name value from the first of the ordered rows.
Write a query to retrieve the name of the country with the second largest population.
SELECT name FROM countries ORDER BY population DESC LIMIT 1 OFFSET 1;
name ------ Japan (1 row)
This is similar to the previous solution, but we retrieve the second row, rather than the first, by using
Write a query to retrieve all of the unique values from the
binomial_name column of the
SELECT DISTINCT binomial_name FROM animals;
binomial_name -------------------------- Aquila Chrysaetos Strigops habroptila Falco Peregrinus Columbidae Columbiformes (4 rows)
Columbidae Columbiformes occurs twice (for Dove and for Pigeon), we can prepend the column name with
DISTINCT in our query in order to return only the unique values.
Write a query to return the longest binomial name from the
SELECT binomial_name FROM animals ORDER BY length(binomial_name) DESC LIMIT 1;
binomial_name -------------------------- Columbidae Columbiformes (1 row)
Here we use the
length() function to determine the length of each string in the
binomial_name column, and use the resulting values to order our query (in descending order). By setting a
1, we return only the longest value.
Write a query to return the first name of any celebrity born in 1958.
SELECT first_name FROM celebrities WHERE date_part('year', date_of_birth) = 1958;
first_name ------------ Madonna Prince (2 rows)
Here we use the
date_part() function to extract the year from the
date_of_birth, and then use it within a
Write a query to return the highest maximum age from the
SELECT max(max_age_years) FROM animals;
max ----- 60 (1 row)
Write a query to return the average maximum weight from the
SELECT avg(max_weight_kg) FROM animals;
avg -------------------- 3.1700000000000000 (1 row)
Write a query to return the number of rows in the
SELECT count(id) FROM countries;
count ------- 4 (1 row)
Here we use the
count() function to count the number of rows. We could have used any column to count the rows, e.g.
count(name) would have returned the same result.
Write a query to return the total population of all the countries in the
SELECT sum(population) FROM countries;
sum ----------- 601544589 (1 row)
Write a query to return each unique conservation status code alongside the number of animals that have that code.
SELECT conservation_status, count(id) FROM animals GROUP BY conservation_status;
conservation_status | count ---------------------+------- CR | 1 LC | 4 (2 rows)
The key thing here is the
GROUP BY clause in conjunction with the
count() function. We are effectively combining the rows into groups based on a certain value (in this case
conservation_status), and calling the function on the rows in each group. Without the
GROUP BY clause, the function would treat all the rows as one single group.
Connect to the
ls_burger database. Write a query that returns the average burger cost for all orders that include fries.
SELECT avg(burger_cost) FROM orders WHERE side = 'Fries';
avg -------------------- 4.0000000000000000 (1 row)
Write a query that returns the cost of the cheapest side ordered.
SELECT min(side_cost) FROM orders WHERE side IS NOT NULL;
min ------ 0.99 (1 row)
SELECT side_cost FROM orders WHERE side IS NOT NULL ORDER BY side_cost ASC LIMIT 1;
Write a query that returns the number of orders that include Fries and the number of orders that include Onion Rings.
SELECT side, count(id) FROM orders WHERE side = 'Fries' OR side = 'Onion Rings' GROUP BY side;
side | count -------------+------- Fries | 2 Onion Rings | 1 (2 rows)