SQL Querying on Databases

We start off with a brief introduction to SQL and discuss the various platforms that use it and why it is so important as a skill for data scientists today. SQL is an acronym for Structured Query Language. As the name suggests, SQL was built for querying databases. It is important to note that SQL is simply a language that is used for Relational Database Management Systems (RBDMS). In RBDMS, data is stored in the form of a collection of tables and relations can be formed between the common fields of the different tables. Popularly used RBDMS in industry include Microsoft SQL Server, Oracle, Amazon Redshift, IBM DB2, and MySQL (open-source). All of these data management systems use SQL as the basic interfacing language for retrieving, managing and handling data. For the purpose of practicing the SQL language, we recommend SQLite – a serverless database engine which operates on database files directly, rather than going through a network. This makes it a lot simpler for data enthusiasts or self-learning individuals to keep up with new skills, and practice on the SQL language at their own pace, outside of work environments.

The sample database from Kaggle can be downloaded here. We will be working on this database to provide an introduction to SQL querying. We break this tutorial into four major sections:

  1. Basic querying: using simple SQL queries, we retrieve data and analyze selected columns from the table.
  2. Filtering: we learn to use comparison operators, combine various criteria, and match data of similar patterns for retrieving data that is of interest.
  3. Aggregate Functions: we learn built-in SQL commands for basic arithmetic operations and discuss possible pit-falls regarding how SQL does its arithmetic.
  4. Sorting and Grouping: By grouping certain common fields together, we can summarize and sort the data of interest into a more meaningful data set for analysis.

Basic querying

To open a database in SQLite, we write the following in the terminal command line.

sqlite3 database.sqlite

The following line of code opens up the table and allows us to explore it further by checking the number of tables.

.table

By doing so, we observe 6 tables in this database: “Country”, “Match”, “Player_Attributes”, “Team_Attributes”, “League”, “Player”, and “Team”. To find out the names of the columns in a table and the type of data in each column, we do the following.

.schema Player

In “Player”, the various columns are

sqlite> .schema Player
CREATE TABLE `Player` (
	`id`	INTEGER PRIMARY KEY AUTOINCREMENT,
	`player_api_id`	INTEGER UNIQUE,
	`player_name`	TEXT,
	`player_fifa_api_id`	INTEGER UNIQUE,
	`birthday`	TEXT,
	`height`	INTEGER,
	`weight`	INTEGER
);

Selecting columns

To select the column “birthday” from table “Player”.

SELECT birthday 
FROM player;

The print out is a long list, and often times we do not want a long print out. Hence, we can LIMIT the print out!

SELECT birthday 
FROM player
LIMIT 10;

To select multiple columns “player_api_id” and “birthday”, we can do this.

SELECT player_api_id, birthday 
FROM player
LIMIT 10;

Counting

We can count the number of entries of data in “Player” using COUNT.

SELECT COUNT(*)
FROM Player;

Counting the number of non-missing values in column “birthday”.

SELECT COUNT(birthday)
FROM Player;

Counting number of distinct values in a column “birthday”.

SELECT COUNT(DISTINCT birthday)
FROM Player;

Filtering

Filter to return the name of player “Aaron Meijers” from table “Player” using WHERE.

SELECT player_name 
FROM Player
WHERE player_name = 'Aaron Meijers';

Filter to return all players with a height greater than 180 cm.

SELECT player_name
FROM Player
WHERE height > 180;

Filter to return all columns for players whose heights are greater than 180 cm.

SELECT * 
FROM Player
WHERE height > 180;

Filter to return the names of all players whose weights are between 180 lbs and 200 lbs (inclusive).

SELECT player_name 
FROM Player
WHERE weight >= 180 
AND weight <= 200;

An equivalent code using BETWEEN can also be used, keeping in mind that the BETWEEN command is always inclusive.

SELECT player_name 
FROM Player
WHERE weight BETWEEN 180 AND 200;

Filtering to return the names of all players with a weight between 180 lbs and 200 lbs (inclusive) and a height between 180 cm and 190 cm (inclusive).

SELECT player_name 
FROM Player
WHERE (weight BETWEEN  180 AND 200)
AND (height BETWEEN 180 AND 190);

Filtering with a list of criteria – all players with a weight of 180 lbs, 185 lbs, 190 lbs, 195 lbs, or 200 lbs.

SELECT player_name
FROM Player
WHERE weight in (180, 185, 190, 195, 200);

Counting the number of missing birthdays with IS NULL.

SELECT COUNT(*)
FROM Player
WHERE birthday IS NULL;

Selecting the names of players whose birthdays are not missing (IS NOT NULL).

SELECT player_name
FROM Player
WHERE birthday IS NOT NULL;

Using LIKE or NOT LIKE with % and _. The % wildcard will match zero, one, or many characters that appear after the specified text. For example, the following query matches player names like “David Da Costa”, “David Davis”, “David D”, etc.

SELECT player_name
FROM Player
WHERE player_name LIKE 'David D%';

The _ wildcard will match a single character. For example, the following query matches names like ‘David Degan’, ‘David Dagan’, and so on:

SELECT player_name
FROM Player
WHERE player_name LIKE 'David D_gan';

Aggregate Functions

There are four basic types of aggregate functions: average (AVG), maximum (MAX), minimum (MIN), and sum (SUM). Since these operations are simple to understand, we demonstrate what’s ok and what’s not ok in SQL arithmetic operations. For example, if we want to calculate the percentage of players who scored one or more goals (where 0 equals no goals and 1 equals one or more goals). *Note: player_scored_goals is not actually a column in the table “Player”. We are merely using it for illustrative purposes here.

The following is not ok:

SELECT SUM(player_scored_goals) / COUNT(*) * 100.0 
FROM Player;

The following is ok:

SELECT SUM(player_scored_goals) * 100.0 / COUNT(*) 
FROM Player;

This is because in SQL arithmetic, a mathematical operation between two integers gives a result that is of integer type. Hence, the line of code that’s not ok, would give rise to an integer result in its first division operation. The ok code, first multiplies by a number with decimal place and then divide by an integer. The resulting answer would be one with decimal places, and hence accurate.

Sorting and Grouping

We introduce four main functionalities of SQL in this section: aliasing (AS), ordering (ORDER BY), grouping (GROUP BY), and filtering based on the result of an aggregate function (HAVING).

Aliasing

Retrieve the player names and weight in kilograms (current data is in pounds and 1 lb = 0.453592 kg) and alias the new weight as weight_kg.

SELECT player_name, weight * 0.453592 AS weight_kg 
FROM Player;

Ordering

By default ORDER BY will sort in ascending order. If you want to sort the results in descending order, you can use the DESC keyword. For example,

SELECT player_name
FROM Player
ORDER BY weight DESC;

ORDER BY can also be used to sort on multiple columns. It will sort by the first column specified, then sort by the next, then the next, and so on. For example,

SELECT player_name, height, weight
FROM Player
ORDER BY height, weight;

Grouping

Often one will need to aggregate results. For example, in the table Player, we might want to group players with common birthdays together. In SQL, GROUP BY allows you to group a result by one or more columns, like so:

SELECT birthday, COUNT(*) 
FROM Player
GROUP BY birthday;

Filtering with Aggregate Functions

In SQL, aggregate functions can’t be used in WHERE clauses. For example, If we want to group the players according to their weights, but only include groups greater than 50 people. The following query is invalid:

SELECT weight, count(*) 
FROM Player
GROUP BY weight
WHERE COUNT(player_name) > 50;

The correct version would require the SQL command HAVING:

SELECT weight, count(*) 
FROM Player
GROUP BY weight
HAVING COUNT(player_name) > 50;

Example (using all commands)

Retrieve the average weight and height for players, grouped by birthday, including only groups that have more than 7 players. Order the result by their birthdays in descending order and alias average weight as “avg_weight” and average height as “avg_height”.

SELECT birthday, COUNT(*), AVG(weight) AS avg_weight, AVG(height) AS avg_height
FROM Player
GROUP BY birthday
HAVING COUNT(player_name) > 7
ORDER BY birthday DESC;

The result is

sqlite> SELECT birthday, COUNT(*), AVG(weight) AS avg_weight, AVG(height) AS avg_height
   ...> FROM Player
   ...> GROUP BY birthday
   ...> HAVING COUNT(player_name) > 7
   ...> ORDER BY birthday DESC;
birthday|COUNT(*)|avg_weight|avg_height
1990-03-27 00:00:00|8|159.875|178.1175
1990-01-13 00:00:00|8|166.875|180.975
1989-03-02 00:00:00|10|170.8|181.102
1988-08-31 00:00:00|8|171.625|183.8325

One thought on “SQL Querying on Databases

  1. Pingback: SQL Querying with Joins, Set Theory Clauses, and Subqueries – Data Bay Today

Leave a comment