Data Analysis

By now you know how to create a database, multiple tables, join them and fill them with data. One of the useful functions of relational databases is their capacity to provide ways to analyze data. A term you'll see used a lot here is 'aggregation', which is an operation that involves using functions that act on the values in a column. An example is SUM(), which returns the sum of all the values of a column.

We're going to walk through how to perform some of the most common types of aggregation using some example data from a bank account. Many banks provide the ability to export an account's transaction history for analysis. These exports are usually in CSV format. While PostgreSQL can import CSV data directly, we've already done that and exported the resulting data using pg_dump.

Setup data

First, create a new database, finance, on your machine.

$ createdb finance

Next, download this file and import it into the new finance database:

psql -d finance < finance.sql

This file creates a table, transactions, and adds rows of transaction data to that table. This data represents the transactions that have taken place in a fictitious account.

The schema for the transactions table is:

finance=# \d transactions
                                      Table "public.transactions"
      Column      |          Type          |                         Modifiers
 id               | integer                | not null default nextval('transactions_id_seq'::regclass)
 paid_in          | numeric(13,2)          | not null
 paid_out         | numeric(13,2)          |
 date             | date                   | not null
 transaction_type | character(25)          | not null
 description      | character varying(255) | not null
    "transactions_pkey" PRIMARY KEY, btree (id)

With a table of generated data showing outgoing payments, you have data to work with, and in this chapter you'll learn how you can use SQL functions to perform analytics that extract useful information from that data.

How do we Analyze this data?

Since it can be hard to visualize a row of data from a Schema, here's an example:

finance=# SELECT * FROM transactions LIMIT 1;
 id | paid_in | paid_out |    date    |     transaction_type      |   description
  1 |    0.00 |    64.54 | 2013-01-01 | debit                     | Online Shopping
(1 row)

In SQL we can use the COUNT() function to tell how many rows are in the table. The COUNT() function returns the number of field values for a column. Since it's already known that the primary key will be unique, you can count the number of rows by using COUNT(id).

finance=# SELECT COUNT(id) FROM transactions;
(1 row)

So from this result, you can see we have 144 rows in this table. To get the number of different transaction descriptions you'll need to use the keyword DISTINCT, since we want unique entries only:

finance=# SELECT COUNT(DISTINCT description) FROM transactions;
(1 row)

SQL lets you group and aggregate data, but it's best illustrated by example. With the query below, you can find out all the different things you were charged for, and how much each cost for the year:

finance=# SELECT SUM(paid_out) AS paid_out_total, description FROM transactions GROUP BY description;
 paid_out_total |            description
          14.50 | MicroServer Rental 1 hour usage
         105.70 | Shared Hosting Service
         560.00 | Social Network Premium Membership
         260.00 | Bar Drinks
        6945.58 | Online Shopping
          59.90 | Amazon Book Purchase
         150.00 | Computer Equipment
          30.00 | Restaurant Dinner
          40.00 | ATM Withdrawal
           5.98 | Digital Movie Rental
         630.00 | Generic Telecoms ISP
          72.99 | Screencast Subscription
(12 rows)

The diagram below depicts how this works internally in SQL. The data in the diagram is not an exact match to transaction data but provided for guidance as an example.

What this SELECT query does is pick a field using the GROUP BY operation, in this case the description, and groups rows by the field specified. Since, in this case, the transactions are grouped by description, we can use the SUM() function to compute the total value of all the paid_out values in the table, which is the total cost from each source.

Without the GROUP BY clause the SUM() function wouldn't breakup the calculation into descriptions, and you'd get a total of all the paid_out values in the table:

finance=# SELECT SUM(paid_out) AS paid_out_to FROM transactions;
(1 row)

Another example that might make grouping clearer - grouping by month; PostgreSQL provides the handy TO_CHAR() and EXTRACT() functions that can be used on a date field to get a month name or number.

EXTRACT() is used to select subfields from date/time values. TO_CHAR(), on the other hand, is a more general function that can be used to convert from various data types to a formatted string. In this case, we're using TO_CHAR() to convert from a DATE to a formatted string with only the month from that field.

finance=# SELECT SUM(paid_out), TO_CHAR(date, 'Month') FROM transactions GROUP BY TO_CHAR(date, 'Month');
   sum   |  to_char
 3453.20 | January
 2154.65 | March
 1807.12 | February
 1435.80 | April
    1.99 | June
   21.89 | May
(6 rows)
finance=# SELECT SUM(paid_out), EXTRACT(MONTH FROM date) FROM transactions GROUP BY EXTRACT(MONTH FROM date);
   sum   | date_part
   21.89 |         5
 3453.20 |         1
 1807.12 |         2
 2154.65 |         3
 1435.80 |         4
    1.99 |         6
(6 rows)

If you want to sum data for just one service, you could do the following:

finance=# SELECT SUM(paid_out), description FROM transactions WHERE description = 'Online Shopping' GROUP BY description;
   sum   |   description
 6945.58 | Online Shopping
(1 row)

You might be curious what the average cost per Online Shopping session was; in this case, you can use the AVG() function.

finance=# SELECT AVG(paid_out), description FROM transactions WHERE description = 'Online Shopping' GROUP BY description;
         avg         |   description
 54.6896062992125984 | Online Shopping
(1 row)

Other functions that you might find useful are MAX(), which returns the largest value from a collection, and MIN(), which returns the smallest.

finance=# SELECT MAX(paid_out) FROM transactions;
(1 row)
finance=# SELECT MIN(paid_out) FROM transactions;
(1 row)

Data analytics with grouping and aggregation is a common task that SQL programmers work on. You'll get more opportunities to practice grouping and aggregation as you go through the SQL course.