Wednesday, April 25, 2012

SQL join types

Introduction

There are many kinds of joins in SQL. I often have trouble keeping them straight. This article divides the joins into pairs of opposites, giving brief definitions of each along with their opposites. For the rest of this article, we'll be joining two tables, which we'll call the "left table" and the "right table". We'll be using these tables:
create table rel1 (
  id int primary key, 
  letter varchar(1)
);

create table rel2 (
  id int primary key, 
  fk int, 
    foreign key (fk) references rel1(id),
  mychar varchar(1)
);

select * from rel1;
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  3 | c      |
|  4 | d      |
+----+--------+

select * from rel2;
+----+--------+--------+
| id |    fk  | mychar |
+----+--------+--------+
|  1 |      1 | z      |
|  3 |      4 | o      |
|  5 |      2 | m      |
|  8 |      2 | q      |
+----+--------+--------+
`rel1` will be used as the left table, and `rel2` as the right table unless otherwise noted.

Inner vs. outer

Inner joins return only matching pairs of rows between the two tables. Outer joins also return rows from one or both tables that match 0 rows from the other table. (Note that `on rel1.id = rel2.id` is the "join predicate" in these examples)
select * 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+

select * 
from rel1 
outer join rel2 
  on rel1.id = rel2.id;
+------+--------+------+------+--------+
| id   | letter | id   | fk   | mychar |
+------+--------+------+------+--------+
|    1 | a      |    1 |    1 | z      |
|    2 | b      | NULL | NULL | NULL   |
|    3 | c      |    3 |    4 | o      |
|    4 | d      | NULL | NULL | NULL   |
| NULL | NULL   |    5 |    2 | m      |
| NULL | NULL   |    8 |    2 | q      |
+------+--------+------+------+--------+

Outer join: left and right outer join

Left outer joins return, in addition to the rows returned by the corresponding inner join, rows from the left table that don't match any rows in the right table. Right outer joins are identical to inner joins except that they also return unmatched rows from the right table.
select * 
from rel1 
left outer join rel2 
  on rel1.id = rel2.id;
+----+--------+------+------+--------+
| id | letter | id   | fk   | mychar |
+----+--------+------+------+--------+
|  1 | a      |    1 |    1 | z      |
|  2 | b      | NULL | NULL | NULL   |
|  3 | c      |    3 |    4 | o      |
|  4 | d      | NULL | NULL | NULL   |
+----+--------+------+------+--------+

select * 
from rel1 
right outer join rel2 
  on rel1.id = rel2.id;
+------+--------+----+------+--------+
| id   | letter | id | fk   | mychar |
+------+--------+----+------+--------+
|    1 | a      |  1 |    1 | z      |
|    3 | c      |  3 |    4 | o      |
| NULL | NULL   |  5 |    2 | m      |
| NULL | NULL   |  8 |    2 | q      |
+------+--------+----+------+--------+
Note that the union of a left join and a right join is a full outer join; their intersection is an inner join.

Semi vs. anti

"Semi"-joins return rows from the left table that match rows in the right table. "Anti"-joins return rows from the left table that don't match rows in the right table. Both of these join forms act as a sort of filter for the left table, since no columns from the right table are returned.
-- all rows in rel1 that have a matching row (by fk value) in rel2:
select * 
from rel1 
where id in (select fk from rel2);
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  4 | d      |
+----+--------+

-- all rel1 rows that don't have a matching row (by fk value) in rel2:
select * 
from rel1 
where id not in (select fk from rel2);
+----+--------+
| id | letter |
+----+--------+
|  3 | c      |
+----+--------+
Note that there are multiple ways to implement semi- and anti-joins in SQL:
-- "semi join" implemented using "inner join" and "distinct"
select distinct rel1.* 
from rel1 
inner join rel2 
  on rel1.id = rel2.fk;
+----+--------+
| id | letter |
+----+--------+
|  1 | a      |
|  2 | b      |
|  4 | d      |
+----+--------+

-- "anti join" implemented using "left join" and "is null"
select distinct rel1.* 
from rel1 
left join rel2 
  on rel1.id = rel2.fk 
where rel2.id is null;
+----+--------+
| id | letter |
+----+--------+
|  3 | c      |
+----+--------+

Equi vs. theta

"Equi"-joins are those where the join predicate is an equality. "Theta"-joins are those where the join predicate is an inequality.
-- equi-join
select * 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+

-- theta-join
select * 
from rel1 
inner join rel2 
  on rel1.id != rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  2 | b      |  1 |    1 | z      |
|  3 | c      |  1 |    1 | z      |
|  4 | d      |  1 |    1 | z      |
|  1 | a      |  3 |    4 | o      |
|  2 | b      |  3 |    4 | o      |
|  4 | d      |  3 |    4 | o      |
|  1 | a      |  5 |    2 | m      |
|  2 | b      |  5 |    2 | m      |
|  3 | c      |  5 |    2 | m      |
|  4 | d      |  5 |    2 | m      |
|  1 | a      |  8 |    2 | q      |
|  2 | b      |  8 |    2 | q      |
|  3 | c      |  8 |    2 | q      |
|  4 | d      |  8 |    2 | q      |
+----+--------+----+------+--------+
Note that the union of an equi-join and a theta-join is a cartesian product, if the same tables and join predicate are used.

Implicit vs. explicit

source Explicit joins use one of the "join" keywords, such as "inner join" or "left join", combined with a join predicate:
select * from rel1 inner join rel2 on rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+
Implicit joins do not use a "join" keyword, simply listing the tables in the "from" clause, and place the join predicate in the "where" clause:
select * from rel1, rel2 where rel1.id = rel2.id;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  3 | c      |  3 |    4 | o      |
+----+--------+----+------+--------+
Explicit joins are generally recommended, as they are thought to be more maintainable and more clearly express the programmer's intent. Note that this is purely a syntactic difference.

Cross join

This produces a cartesian product of the tables, joining every row in the left table with every row in the right table.
select * from rel1 inner join rel2 on 1;
+----+--------+----+------+--------+
| id | letter | id | fk   | mychar |
+----+--------+----+------+--------+
|  1 | a      |  1 |    1 | z      |
|  2 | b      |  1 |    1 | z      |
|  3 | c      |  1 |    1 | z      |
|  4 | d      |  1 |    1 | z      |
|  1 | a      |  3 |    4 | o      |
|  2 | b      |  3 |    4 | o      |
|  3 | c      |  3 |    4 | o      |
|  4 | d      |  3 |    4 | o      |
|  1 | a      |  5 |    2 | m      |
|  2 | b      |  5 |    2 | m      |
|  3 | c      |  5 |    2 | m      |
|  4 | d      |  5 |    2 | m      |
|  1 | a      |  8 |    2 | q      |
|  2 | b      |  8 |    2 | q      |
|  3 | c      |  8 |    2 | q      |
|  4 | d      |  8 |    2 | q      |
+----+--------+----+------+--------+

Natural join

The result is the set of rows, from the two tables, that have the same values for columns of the same name. If the tables share all columns names, the result is an intersection. If the tables share no column names, the result is a cartesian product. The number of columns in the result is the number of distinct column names in the tables.
select * 
from rel1 
natural join rel2;
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+

select * 
from rel1 
inner join rel2 
  using(id);
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+

select rel1.*, rel2.fk, rel2.mychar 
from rel1 
inner join rel2 
  on rel1.id = rel2.id;
+----+--------+------+--------+
| id | letter | fk   | mychar |
+----+--------+------+--------+
|  1 | a      |    1 | z      |
|  3 | c      |    4 | o      |
+----+--------+------+--------+
This join is not useful in practice, as it can be completely replaced by queries using (more easily understandable) inner joins. Additionally, it may produce non-intuitive results in that tables are not necessarily joined on foreign keys, which to my mind is the actual "natural" way to join tables. (Recall that rel1.id <=> rel2.fk was the foreign key relationship we defined).

Self join

This is just a special case of a join where the same table is used for both sides of the join. It can be combined with any of the preceding "join" forms.
select * 
from rel2 a 
left join rel2 b 
  on a.id < b.fk;
+----+------+--------+------+------+--------+
| id | fk   | mychar | id   | fk   | mychar |
+----+------+--------+------+------+--------+
|  1 |    1 | z      |    3 |    4 | o      |
|  1 |    1 | z      |    5 |    2 | m      |
|  1 |    1 | z      |    8 |    2 | q      |
|  3 |    4 | o      |    3 |    4 | o      |
|  5 |    2 | m      | NULL | NULL | NULL   |
|  8 |    2 | q      | NULL | NULL | NULL   |
+----+------+--------+------+------+--------+
Note the table aliases, used to distinguish between the two uses of the same table.

Summary

A convenient way to understand joins is in pairs, as I showed:
join typeoppositemajor point of difference
innerouterdo/don't keep non-matching rows
left outerright outeraugment inner join with unmatched rows from left/right table
semiantifilter rows from a single table, keeping/dropping those that match rows in another table
equithetajoin predicate is equality/inequality
implicitexplicitsyntax
Note that cross, natural, and self joins don't have opposites in this sense.

Sources

article about joins
Wikipedia's "join" article

2 comments:

  1. Nice post. I don't understand why equi/theta joins are distinct from inner joints.

    ReplyDelete
  2. Thanks for your comment. The point I was trying to make was that a single SQL statement employing a join may be classified several ways -- for instance, you could have an explicit, inner equi-join all at once -- but you couldn't have both a semi- and an anti-join.

    ReplyDelete