Introduction

"The more we can organize, find and manage information, the more effectively we can function in our modern world." - Vint Cerf

The importance of data

As human beings, we are limited in our capacity to remember things. Having tools that allow immediate access and analysis of data enables us to make better decisions. Today, we have systems that allow a doctor to review the patient's medical record before prescribing a new drug. A bioinformatic scientist can query genome data to find the genetic basis of disease. A baseball scout looking at player statistics can assess the value of a player. All these decisions can be successfully arrived at by collecting data, organizing it and then studying it to find patterns and meaning. This data has to be stored in a way that lends itself to studying. In other words, we need to structure it.

This is a video by Hans Rosling. By collecting data about 200 countries over 200 years, he provides a perspective on human progress.

Structured data

Structured data is data that can be stored in tabular format (rows and columns), such as spreadsheets or todo lists, and can then be analyzed in various ways such as sorting alphabetically or finding total sales of a time period. Unstructured data is data that contains facts without any structure such as content inside emails or books or images. Moving unstructured data from all these sources into a table or a spreadsheet allows us to find the data easily and also to manage it better. In this book, we will focus on structured data, since the SQL language is designed for structured data.

Unstructured data

Structured data

Spreadsheet as Database

Suppose we need to store the name and email of reviewers of a popular website. The simplest approach is to open a spreadsheet, maybe in Google Docs, and enter a few names.

  1. John, joe@contoso.com
  2. Alice, alice@contoso.com

Now that we have a spreadsheet, we decide to add a header identifying the columns and an id for each user. So our worksheet looks like

A users table

Next, we need to add reviews of these users. We don't want to clutter the spreadsheet, so we add a new worksheet.

A reviews table

As you may have noticed, the Users worksheet has three columns named id, username and email and the Reviews worksheet has three columns named id , username and content. Most spreadsheets will use multiple worksheets to organize data. Each worksheet, in turn, has unique columns as new pieces of data that should be stored.

The spreadsheet as a whole can be thought of as a database, and the worksheets within the spreadsheet can be used to describe tables within a database. The rows and columns within a worksheet can be seen as analogous to the rows and columns in a table. Each row represents a single set of related data, while the columns represent a standardized way to store data for that particular field.

This simplified analogy serves perfectly to describe, conceptually, a database. Keep this analogy in mind as we go forward in the book:

Spreadsheet Database
Worksheet Table
Worksheet Column Table Column
Worksheet Row Table Record

Why learn SQL?

Now suppose other people in the company are interested in reading the reviews and adding users. Therefore, you decide to share the spreadsheet. Over time, the amount of information starts to increase. You begin to encounter issues with duplicate data, typos, perhaps even formatting issues if multiple people are working on one file. The simple spreadsheet now becomes unwieldy and finding/collecting information requires a lot of scrolling and searching.

At this point, it becomes apparent that we need to move from a spreadsheet to a relational database. A relational database is useful not only because the data is represented in a flat, two-dimensional table, but it provides connections between tables to model the relations. We can use those relations in various ways to cut back on duplicate data, reduce the chance of typos, and enforce good formatting for our data. The structured form given by the combination of SQL with a relational database allows us better control of our data, which in turn helps prevent complications related to that data.

With a relational database, we can very quickly find the users who left a specific review. This is where SQL comes in. SQL, which stands for Structured Query Language, is the programming language used to communicate with a relational database.

Cartoon

SQL can be pronounced as "Sequel" or as "Ess-Kew-Ell". People can be quite pedantic about which is correct. However it's best to just use whichever helps communicate better by adopting the pronunciation of the people around you.

SQL is a powerful language that uses simple English sentences that, with a few lines, allow you to Select (find), Order (organize), and Delete or Add (manage) a large amount of data.

A brief history of SQL

SQL dates back to the 1970s where it was conceived of by E. F. Codd in his paper "A relational model for large data banks". The paper laid the foundations for what would become relational databases. You may hear the term 'relational algebra' sometimes used when working with databases. The mathematical models underlying relational databases go beyond the scope of this book, but it's essentially the theory relational databases are built on. Over time, companies saw value in the concepts described by Codd's paper and in the 70s, many of them began developing the SQL language and producing products called relational databases.

There are slight variations in SQL syntax between different commercial relational databases. We chose to use the relational database product called PostgreSQL for this book because of its wide applicability and open source roots. However, after reading this book you should have a foundational understanding of SQL and relational databases, and should have the knowledge to use any relational database of your choosing.

Relational Databases

Relational databases have become so widespread that you're likely using multiple databases per day without even knowing it. Smaller examples include Firefox, which uses SQLite to keep track of the user's history and data, to banking systems which might be using an Oracle database to store daily transactions.

Another way you may have been using SQL is through a programming language, like Python or Ruby. If you've ever done a tutorial with Ruby on Rails, for example, it's likely the code you wrote generated SQL behind the scenes for you. No matter which language you're using, the database and its data will most likely out-live most of the application code in your program. Creating a well-designed database is like laying the foundations of a house, and learning SQL and relational database concepts will help you build your applications on a strong foundation.

The interface you use to interact with a relational database is a personal choice. You may access the database from the programming language, through a GUI application that allows access and administration, or through the command-line interface.

In this book, we will use the command-line interface. Don't worry if you've never used a command-line interface before, we'll walk you through using the psql console -- it's not as scary as you think. We think that with a command line-interface you'll have a stronger understanding of the database fundamentals and SQL.

Throughout this book, we'll also be using certain vocabulary related to RDBMSs and SQL:

  • when we use the term, "SQL Statement", we mean a SQL command used to access/use the database or the data within that database via the SQL language.
  • the term, "SQL query" or "SQL queries", is a bit more specific. It is a subset of a "SQL Statement". A query is a way to search, to lookup data within a database.