SQL Querying with Joins, Set Theory Clauses, and Subqueries

In this article, we introduce SQL querying techniques that are more applicable in real-life databases. In a realistic data environment, a data scientist will likely encounter scenarios where one has to retrieve data from multiple sources before analyzing them with statistical methods. In this case, techniques such as joining columns of information from different databases, retrieving data using a union or intersect between several databases, and even nesting SQL queries, can be indispensable skills for efficient data pipelining. For a more gentle introduction to SQL querying, this article that we’ve posted in the past will prove to be an invaluable resource, with detailed introductions on why SQL is such a sought after skill in the data science industry, and how one can practice SQL on their personal laptops without the need for industrial database servers.

The sample database from Kaggle can be downloaded here. We will be working on this database to provide an introduction to SQL querying. We break this tutorial into three major sections:

  • Joins: Introducing inner joins, outer joins, and cross joins to enrich the way that one can retrieve data from multiple databases.
  • Set Theory Clauses: Using UNION, INTERSECT, UNION ALL, and EXCEPT, we show how one can creatively employ set theory for convenient data retrieval.
  • Subqueries: Nested queries, although seemingly complicated, is an essential skill in more complicated situations where the databases are simply not in the desired format which one needs.

Joins

As the name suggests, joins result in the joining of two or more data sets. However, the manner in which the datasets join together will be the reason why there exists the different types of joins:

  • Inner Join
  • Outer Join
  • Cross Join

Inner Join

Using SQLite, a relational database management system, we open the database by doing the following.

sqlite3 database.sqlite -header

The -header tells SQLite that we would like to turn header on, such that header names for each column gets printed when queried. We then type the following to check the various data sets in this database.

.table

We see that there are 6 data sets: Country, League, Match, Team, Player, Player_Attributes, and Team_Attributes. To illustrate what the inner join does, it would be best if we do it with a schematic first.

IDValue
1A
2B
3C
Left_Table

IDValue
1D
3E
4F
Right_Table

The inner join of the two tables using ID, selects all rows from both tables as long as there is a match in ID. The following is the result.

IDLeft ValueRight Value
1AD
3CE
Inner join

Now, we practice with actual data sets from our database! For this illustration, we will use two tables: Player and Player_Attributes. To check the column names of the table, we do the following.

.schema Player
CREATE TABLE `Player` (
	`id`	INTEGER PRIMARY KEY AUTOINCREMENT,
	`player_api_id`	INTEGER UNIQUE,
	`player_name`	TEXT,
	`player_fifa_api_id`	INTEGER UNIQUE,
	`birthday`	TEXT,
	`height`	INTEGER,
	`weight`	INTEGER
);

The above is the print out of all the column names of table Player. If we were to do the same for Player_Attributes, we notice that they both have the player_api_id column, which is a unique id which matches to every player.

Example 1: We would like to create a new table with information about the player’s name, weight, dates, and overall rating from the tables: Player and Player_Attributes, matching on player_api_id with Player as the left table

SELECT p.player_api_id, 
       p.player_name, 
       p.weight, 
       pa.date, 
       pa.potential
FROM Player AS p
       INNER JOIN Player_Attributes AS pa
       ON p.player_api_id = pa.player_api_id;

In the code, we aliased the data set Player as p and Player_Attributes as pa. This makes the coding simpler and allows the code to be clearer in terms of which data column is retrieved from which data set. For instance, p.player_name tells SQL that the column “player_name” is retrieved from the table Player. The keyword ON informs about the column that the data should be matched on, with regards to an inner join. In this case, since both columns have the same name “player_api_id”, we could replace the last line with USING (player_api_id). There will be no confusion since both tables have the same header name for this particular column. If there’re different, then the ON keyword should be used.

Example 2: Self-joining by means of an inner join is possible too! Suppose we would like to have the player_api_id, the minimum overall rating, the maximum overall rating, and the percentage growth of each player’s overall ratings, with only one player_api_id showing on the table.

SELECT   p1.player_api_id, 
         MIN(p1.overall_rating) AS min_rating, 
         MAX(p2.overall_rating) AS max_rating,
         ((MAX(p2.overall_rating) - MIN(p1.overall_rating)) * 100.0 / MIN(p1.overall_rating)) AS perc_growth 
FROM     Player_Attributes as p1
         INNER JOIN Player_Attributes as p2
         ON p1.player_api_id = p2.player_api_id 
GROUP BY p1.player_api_id;

Example 3: We demonstrate the use of CASE WHEN with an illustration. Suppose we would like to classify overall ratings that are > 70 as “Exceptional”, between 60 to 70 (inclusive) to be “Average”, and below 60 is “Below Average”. The resulting table should include player_api_id, name, date, and overall rating in categories described above.

SELECT p1.player_api_id,
       p1.player_name,
       p2.date,
       p2.overall_rating,
       CASE WHEN p2.overall_rating > 70 THEN 'Exceptional'
            WHEN p2.overall_rating > 60 THEN 'AVERAGE'
            ELSE 'BELOW AVERAGE' END
       AS classification
FROM Player as p1
       INNER JOIN Player_Attributes as p2
       ON p1.player_api_id = p2.player_api_id;

Note that in CASE WHEN, all data passes through each WHEN statement, unless it satisfies one of the conditions. This means that since p2.overall_rating > 60 is in the second line of code, data here is ≤ 70. And hence, data which are “AVERAGE” has a 60 < rating ≤ 70.

Outer Joins

There exist four main types of outer joins: left joins, right joins, full joins, and cross joins . Illustrating the left and right joins with the same two tables shown previously:

IDValue
1A
2B
3C
Left Table
IDValue
1D
3E
4F
Right Table

Left join:

IDLeft ValueRight Value
1AD
2BNULL
3CE
Left Join

Right join:

IDLeft ValueRight Value
1AD
3CE
4NULLF
Right Join

The left join returns all records from the left table and the matched records from the right table, vice versa. Unmatched values are automatically set to NULL.

Example 4: Suppose we would like to do a left join between Player (left table) and Player_Attributes retrieving data on name of players, player_api_id, and average overall rating grouped by player_api_id.

SELECT   p1.player_api_id,
         p1.player_name,
         AVG(p2.overall_rating)
FROM     Player AS p1
         LEFT JOIN Player_Attributes as p2
         ON p1.player_api_id = p2.player_api_id
GROUP BY p1.player_api_id;

The number of resulting rows from a left join is always ≥ the number of results in an inner join. Right joins are not as common as left joins. One reason is that one can always write a right join as a left join by doing a left join with reversed table order.

Now, we move on to illustrate the result of a full join for the two tables shown previously in this section.

Left IDRight IDLeft ValueRight Value
11AD
2 NULLBNULL
33CE
NULL4NULLF
Full Join

The full join returns all records when there is a match in left table or right table records. In certain cases, such a join can result in extremely huge data sets, due to the inclusion of all data records between the two tables.

Example 5: Suppose we would like to have a full join between Player and Player_Attributes, retrieving player api id, player names, and average overall ratings of players with first name “David”.

SELECT   p1.player_api_id,
         p1.player_name,
         AVG(p2.overall_rating) AS avg_rating
FROM     Player as p1
         FULL JOIN Player_Attributes as p2
         USING (player_api_id)
WHERE    p1.player_name LIKE 'David %'
GROUP BY p1.player_api_id;

We use USING instead of ON because player_api_id appears as the same name with same data in both tables.

Next, we illustrate a cross join with the following two tables.

ID
1
2
3
Left Table ID
ID
1
3
4
Right Table ID

The result of a cross join:

Left IDRight ID
11
13
14
21
23
24
31
33
34
Cross Join

A cross join is used when one wishes to create combination of every row from two tables. In this case, we are interested to show all combinations of IDs between the two tables. However, do note that cross joins do not use ON or USING.

Example 6: Suppose we are trying to choose a pair of players to be our new strikers for the team. We will need player api ids for all combinations of two players.

SELECT p1.player_api_id AS p1id,
       p2.player_api_id AS p2id
FROM   Player AS p1
       CROSS JOIN Player_Attributes AS p2
WHERE  p1id <> p2id;

Set Theory Clauses

There exist four main types of SQL set theory clauses:

  • UNION: This operation combines two or more data sets, while removing duplicates.
  • UNION ALL: This operation combines two or more data sets, keeping all duplicates.
  • INTERSECT: Returns only common rows between the two SELECTs.
  • EXCEPT: Returns only rows, which are not available in the second SELECT statement.

UNION can also be used to determine all occurrences of a field across multiple tables.

Example 7: Suppose we want to determine all non-duplicate players in either Player or Player_Attributes, retrieving only the player_api_id.

SELECT player_api_id
FROM   Player
UNION 
SELECT player_api_id
FROM   Player_Attributes;

One thing to note is that, for set theory clauses to work, the number and type of columns, in both SELECTs must be the same. In the sense that one is comparing two tables with same headings, and structure but different set of values. Since UNION ALL is pretty obvious, we’ll skip the illustration and say that all data rows (unique or different) in all tables, will appear in the result.

Example 8: To find out which countries appear in both League and Country, we can do an intersect of their country ids.

SELECT country_id 
FROM   League
INTERSECT
SELECT id
FROM   Country;

Example 9: Similarly, to find out which countries are in the table of countries listed in Country but does not appear in the League, we can use EXCEPT. Take note of the ordering of the SELECTs for the tables.

SELECT id 
FROM   Country
EXCEPT
SELECT country_id
FROM   League;

Subquery

Subqueries are in fact nested queries, where one is querying the result of another query. Such instances are rather common, but first let us introduce the concepts of semi-joins and anti-joins before we show how they can be done with subqueries.

IDValue
1A
2B
3C
Left Table
IDValue
1D
3E
4F
Right Table

Semi-Join: Finds common ID and keeps only left value.

IDLeft Value
1A
3C
Semi-Join

Anti-Join: Finds uncommon ID and keeps only left value.

IDLeft Value
2B
Anti-Join

Example 10: Suppose we would like to have player api ids, and player names of all players who have an average overall rating > 70. The data needed will be retrieved from Player and Player_Attributes. *This example illustrates the use of a subquery within a WHERE keyword.

SELECT player_api_id, 
       player_name
FROM   Player
WHERE  player_api_id in
       (SELECT player_api_id
       FROM Player_Attributes
       GROUP BY player_api_id
       HAVING AVG(overall_rating) > 70);

Example 11: We can also include a subquery within a FROM keyword. We repeat example 2 from this article, but this time we achieve the same result using subqueries instead of joins! Suppose we would like to have the player_api_id, the minimum overall rating, the maximum overall rating, and the percentage growth of each player’s overall ratings, with only one player_api_id showing on the table.

SELECT 
    id,
    min_rating,
    max_rating,
    ((max_rating - min_rating) * 100.0 / min_rating) as perc_growth
FROM 
    (SELECT  p1.player_api_id as id,
             MIN(p1.overall_rating) AS min_rating, 
             MAX(p2.overall_rating) AS max_rating 
    FROM     Player_Attributes as p1
             INNER JOIN Player_Attributes as p2
             ON p1.player_api_id = p2.player_api_id 
    GROUP BY p1.player_api_id);

It is important to note that ((max_rating - min_rating) * 100.0 / min_rating) will give the correct result, while ((max_rating - min_rating) / min_rating * 100.0) gives a result of zero! This is because both max_rating and min_rating are integers, and in SQL arithmetic a division between integers gives an integer, which in this case is a zero! Hence, we multiply the numerator with 100.0 to change it to decimal places first before we do the division.

Leave a comment