Introduction

"The more we can organize, find and manage information, the more effectively we can function in our modern world." - Vint Cerf

The importance of data

As human beings, we are limited in our capacity to remember things. Having tools that allow immediate access and analysis of data enables us to make better decisions.

Today, we have systems that allow a doctor to review the patient's medical record before prescribing a new drug. A bioinformatic scientist can query genome data to find the genetic basis of disease. A baseball scout looking at player statistics can assess the value of a player.

All these decisions can be successfully arrived at by collecting data, organizing it and then studying it to find patterns and meaning. This data has to be stored in a way that lends itself to studying. In other words, we need to structure it.

This is a video by Hans Rosling. By collecting data about 200 countries over 200 years, he provides a perspective on human progress.

Structured Data

In order to better learn about structured data, it is useful to understand the problem that it aims to solve: the limitations of unstructured data. Unstructured data is data that contains information without any structure, such as content inside emails or books or images.

Unstructured data

Storing data in an unstructured way might work if you only have a small amount, but once the amount of data starts to grow it soon becomes unwieldy. For example, if you want to find a particular note you made in a notebook, you may need to look through the entire book to find it. At this point you need a way of organising, or structuring, your data. One way to structure data is to store it in tabular format (rows and columns), such as in spreadsheets or todo lists.

Structured data

Storing data in a structured way, such as in a table or a spreadsheet, allows us to find the data easily and also to manage it better. The data can be arranged and analyzed in various ways such as sorting alphabetically or totalling a set of values.

A common way of storing data in a structured manner is to use a relational database. A basic definition of a database is simply 'a structured set of data held in a computer'.

Spreadsheet as Database

Suppose we need to store the name and email of reviewers of a popular website. The simplest approach is to open a spreadsheet, maybe in Google Docs or Microsoft Excel, and enter a few names.

  1. John, joe@contoso.com
  2. Alice, alice@contoso.com

Now that we have a spreadsheet, we decide to add a header identifying the columns and an id for each user. So our worksheet looks like

A users table

Next, we need to add reviews of these users. We don't want to clutter the spreadsheet, so we add a new worksheet.

A reviews table

As you may have noticed, the Users worksheet has three columns named id, username and email and the Reviews worksheet has three columns named id , username and content. Most spreadsheets will use multiple worksheets to organize data. Each worksheet, in turn, has unique columns as new pieces of data that should be stored.

The spreadsheet as a whole can be thought of as a database, and the worksheets within the spreadsheet can be used to describe tables within a database. A table, in the context of a database, can be defined as a list of individual but related data entries (the table rows), each of which store values for a set of defined, shared attributes (the table columns).

The rows and columns within a worksheet can be seen as analogous to the rows and columns in a table. Each row represents a single set of related data, while the columns represent a standardized way to store data for that particular attribute.

This simplified analogy serves perfectly to describe, conceptually, a database. Keep this analogy in mind as we go forward in the book:

Spreadsheet Database
Worksheet Table
Worksheet Column Table Column
Worksheet Row Table Record

Relational Database Management Systems

Now suppose other people in the company are interested in reading the reviews and adding users. Therefore, you decide to share the spreadsheet. Over time, the amount of information starts to increase. You begin to encounter issues with duplicate data, typos, perhaps even formatting issues if multiple people are working on one file. The simple spreadsheet now becomes unwieldy and finding/collecting information requires a lot of scrolling and searching. At this point, you would probably benefit from moving to a relational database management system.

A relational database is a database organized according to the relational model of data. In simple terms, the relational model defines a set of relations (which we can think of as analogous to tables) and describes the relationships, or connections, between them in order to determine how the data stored in them can interact. Using the relational model elevates our database from data that is represented in just a flat, two-dimensional table, to one where we can describe the data in a more complex and detailed way. Using a relational database helps us to cut down on duplicated data and provides a much more useful data structure for us to interact with.

A relational database management system, or RDBMS, is essentially a software application, or system, for managing relational databases. An RDBMS allows a user, or another application, to interact with a database by issuing commands using syntax that conforms to a certain set of conventions or standards.

There are many relational database management systems such as SQLite, MS SQL, PostgreSQL and MySQL. Some are lightweight, easy to install and use, while others are robust, scalable, but are complex to install. These various RDBMSes may vary in certain ways, and some of the commands they use may have slight syntactical differences. One thing they have in common however is the underlying language they all use: SQL.

It is worth noting that the relational model is not the only structured data model used by database software. A program like MongoDB, for example, uses a document-oriented data storage model. This, and other systems using non-relational data storage and retrieval models, are often loosely grouped together under the term 'NoSQL'. For the purposes of this book however, we are only interested in the relational model and how SQL can be used to interact with relational databases.

SQL

SQL, which stands for Structured Query Language, is the programming language used to communicate with a relational database.

Cartoon

SQL can be pronounced as "Sequel" or as "Ess-Kew-Ell". People can be quite pedantic about which is correct. However it's best to just use whichever helps communicate better by adopting the pronunciation of the people around you.

SQL is a powerful language that uses simple English sentences that, with a few lines, allow you to Select (find), Insert (add), Update (change), and Delete (remove) a large amount of data.

The goal of this book is to teach you SQL, which would allow you to use any of the RDBMS's mentioned above, and even others not mentioned. In order to teach you SQL, we have chosen to use the PostgreSQL RDBMS because of its wide applicability and open source roots. However, after reading this book you should have a foundational understanding of SQL and relational databases, and should have the knowledge to use any relational database of your choosing.

SQL is a little different to other programming languages you may have encountered. SQL is a declarative language; when you write an SQL statement you describe what needs to be done, but not exactly how to do it -- the exact details of how the query is executed are handled internally by the RDBMS you are using.

A brief history of SQL

SQL dates back to the 1970s where it was conceived of by E. F. Codd in his paper "A relational model for large data banks". The paper laid the foundations for what would become relational databases. You may hear the term 'relational algebra' sometimes used when working with databases. The mathematical models underlying relational databases go beyond the scope of this book, but it's essentially the theory relational databases are built on. Over time, companies saw value in the concepts described by Codd's paper and in the 1970s, many of them began developing the SQL language and producing products called relational databases.

Why learn SQL?

Relational databases have become so widespread that you're likely using multiple databases per day without even knowing it. Smaller examples include Firefox, which uses SQLite to keep track of the user's history and data, to banking systems which might be using an Oracle database to store daily transactions.

Another way you may have been using SQL is through a programming language, like Python or Ruby. If you've ever done a tutorial with Ruby on Rails, for example, it's likely the code you wrote generated SQL behind the scenes for you. No matter which language you're using, the database and its data will most likely out-live most of the application code in your program.

Creating a well-designed database is like laying the foundations of a house, and learning SQL and relational database concepts will help you build your applications on a strong foundation. Since databases are such a key part of almost all web applications, understanding the language of databases and how they work is a vital step towards becoming a well-rounded web-developer.

Summary

In this chapter we've looked at the importance of data, and how this importance underpins web applications and web development. We've introduced the concept of structured data by comparison with unstructured data, and explained that databases are a useful way to structure data that you want to work with in a web application. We've talked about how RDBMSs are software applications for managing relational databases, and that SQL is language that they use to do this.

The focus of this book is on how to use SQL in order to interact with a database. Before we can do that, however, we need to install an RDBMS, specifically PostgreSQL. We'll do that in the next chapter.