What the SQL?!? Lateral Joins¶
Today's "What the SQL?!?" features the keyword LATERAL
. A prerequisite to
understanding lateral joins are regular joins and subqueries. I'll explain those
briefly to see how LATERAL
can simplify a complicated SQL query.
Please note, our target database is PostgreSQL. These examples may work with
other databases, but might need some massaging to get them to work properly.
Search online for the specific vendor's documentation if errors pop up.
Try searching for "lateral joins $DB_VENDOR_NAME". Not all database vendors
support the keyword LATERAL
.
A Problem to Solve¶
We have a table with system uptimes. The table records a start timestamp and an end timestamp. If the service is still running, the end timestamp is left null because it hasn't ended. We want a query to display an overview this data.
Our final solution will return a row per day and 24 columns containing an uptime percentage for each hour in the day. It will look like the following.
cal_date | hour_0 | hour_1 | hour_2 | hour_3 | ... | hour_21 | hour_22 | hour_23
------------+--------+--------+--------+--------+-----+---------+---------+---------
2017-03-01 | 0 | 0.75 | 0.25 | 0 | ... | 0 | 0 | 0
2017-03-02 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1
(2 rows)
Please note we'll use ...
abbreviate some of the results. All queries are
schema independent and should be copy/paste-able into any psql
session.
Sample Uptime Data¶
The sample uptime data is derived from a virtual table built from the following query:
SELECT
*
FROM (
VALUES
('2017-03-01 01:15:00-06'::timestamp, '2017-03-01 02:15:00-06'::timestamp),
('2017-03-01 08:00:00-06', '2017-03-01 20:00:00-06'),
('2017-03-02 19:00:00-06', null)
) AS t(start_ts, end_ts)
The data looks like:
start_ts | end_ts
---------------------+---------------------
2017-03-01 01:15:00 | 2017-03-01 02:15:00
2017-03-01 08:00:00 | 2017-03-01 20:00:00
2017-03-02 19:00:00 |
(3 rows)
We want to plot the time against a time sliced table representing all the effective hours in the uptime window. We'll make use of another virtual table to build up all the time slices:
SELECT
start_ts,
start_ts + interval '1 hour' AS end_ts
FROM generate_series('2017-03-01'::date,
'2017-03-03'::timestamp - interval '1 hour',
interval '1 hour'
) AS t(start_ts)
This we make use of PostgreSQL's generate_series to return all the hours between a time range. The data looks like:
start_ts | end_ts
---------------------+---------------------
2017-03-01 00:00:00 | 2017-03-01 01:00:00
2017-03-01 01:00:00 | 2017-03-01 02:00:00
2017-03-01 02:00:00 | 2017-03-01 03:00:00
-- ... many more rows ...
2017-03-01 03:00:00 | 2017-03-01 04:00:00
2017-03-02 22:00:00 | 2017-03-02 23:00:00
2017-03-02 23:00:00 | 2017-03-03 00:00:00
(48 rows)
Left Join¶
We use a left join to glue together overlapping time ranges between these two
data sets. We want all the data on the LEFT
side in the FROM
clause to return
regardless of an uptime record existing within its time slice.
SELECT
*
FROM (
-- build virtual table of all hours between
-- a date range
SELECT
start_ts,
start_ts + interval '1 hour' AS end_ts
FROM generate_series(
'2017-03-01'::date,
'2017-03-03'::timestamp - interval '1 hour',
interval '1 hour'
) AS t(start_ts)
) AS cal
LEFT JOIN (
-- build virtual table of uptimes
SELECT *
FROM (
VALUES
('2017-03-01 01:15:00-06'::timestamp, '2017-03-01 02:15:00-06'::timestamp),
('2017-03-01 08:00:00-06', '2017-03-01 20:00:00-06'),
('2017-03-02 19:00:00-06', null)
) AS t(start_ts, end_ts)
) AS uptime ON cal.end_ts > uptime.start_ts AND cal.start_ts <= coalesce(uptime.end_ts, current_timestamp)
The result set shows we have some variety in our sample data. With 3 slices up time and 3 slices of downtime.
start_ts | end_ts | start_ts | end_ts
---------------------+---------------------+---------------------+---------------------
2017-03-01 00:00:00 | 2017-03-01 01:00:00 | |
2017-03-01 01:00:00 | 2017-03-01 02:00:00 | 2017-03-01 01:15:00 | 2017-03-01 02:15:00
2017-03-01 02:00:00 | 2017-03-01 03:00:00 | 2017-03-01 01:15:00 | 2017-03-01 02:15:00
2017-03-01 03:00:00 | 2017-03-01 04:00:00 | |
...
2017-03-01 07:00:00 | 2017-03-01 08:00:00 | |
2017-03-01 08:00:00 | 2017-03-01 09:00:00 | 2017-03-01 08:00:00 | 2017-03-01 20:00:00
...
2017-03-01 20:00:00 | 2017-03-01 21:00:00 | 2017-03-01 08:00:00 | 2017-03-01 20:00:00
2017-03-01 21:00:00 | 2017-03-01 22:00:00 | |
...
2017-03-02 18:00:00 | 2017-03-02 19:00:00 | |
2017-03-02 19:00:00 | 2017-03-02 20:00:00 | 2017-03-02 19:00:00 |
...
2017-03-02 23:00:00 | 2017-03-03 00:00:00 | 2017-03-02 19:00:00 |
(48 rows)
If we try without the LEFT
clause, we'll only see 20 rows containing the up slices.
Time to compute some timing¶
Let's add some times and sensible column names and replace the *
SELECT
-- will use `first_ts` and `last_ts` to calculate uptime duration
CASE WHEN uptime.start_ts IS NOT NULL THEN
greatest(uptime.start_ts, cal.start_ts)
END AS first_ts,
least(cal.end_ts, uptime.end_ts) AS last_ts,
date_trunc('day', cal.start_ts)::date AS cal_date,
extract(hour from cal.start_ts) AS cal_hour,
extract(epoch from age(cal.end_ts, cal.start_ts)) AS cal_seconds
FROM (
-- build virtual table of all hours between
-- a date range
SELECT
start_ts,
start_ts + interval '1 hour' AS end_ts
FROM generate_series('2017-03-01'::date,
'2017-03-03'::timestamp - interval '1 hour',
interval '1 hour'
) AS t(start_ts)
) AS cal
LEFT JOIN (
-- build virtual table of uptimes
SELECT *
FROM (
VALUES
('2017-03-01 01:15:00-06'::timestamp, '2017-03-01 02:15:00-06'::timestamp),
('2017-03-01 08:00:00-06', '2017-03-01 20:00:00-06'),
('2017-03-02 19:00:00-06', null)
) AS t(start_ts, end_ts)
) AS uptime ON cal.end_ts > uptime.start_ts AND cal.start_ts <= coalesce(uptime.end_ts, current_timestamp)
first_ts | last_ts | cal_date | cal_hour | cal_seconds
---------------------+---------------------+------------+----------+-------------
| 2017-03-01 01:00:00 | 2017-03-01 | 0 | 3600
2017-03-01 01:15:00 | 2017-03-01 02:00:00 | 2017-03-01 | 1 | 3600
2017-03-01 02:00:00 | 2017-03-01 02:15:00 | 2017-03-01 | 2 | 3600
| 2017-03-01 04:00:00 | 2017-03-01 | 3 | 3600
| 2017-03-01 05:00:00 | 2017-03-01 | 4 | 3600
| 2017-03-01 06:00:00 | 2017-03-01 | 5 | 3600
| 2017-03-01 07:00:00 | 2017-03-01 | 6 | 3600
| 2017-03-01 08:00:00 | 2017-03-01 | 7 | 3600
2017-03-01 08:00:00 | 2017-03-01 09:00:00 | 2017-03-01 | 8 | 3600
...
2017-03-01 20:00:00 | 2017-03-01 20:00:00 | 2017-03-01 | 20 | 3600
| 2017-03-01 22:00:00 | 2017-03-01 | 21 | 3600
...
| 2017-03-02 19:00:00 | 2017-03-02 | 18 | 3600
2017-03-02 19:00:00 | 2017-03-02 20:00:00 | 2017-03-02 | 19 | 3600
...
2017-03-02 23:00:00 | 2017-03-03 00:00:00 | 2017-03-02 | 23 | 3600
(48 rows)
Subquery, Subquery, What's the Worry?¶
SQL is all about nested subqueries. It's hard to escape without creating views, but who has time to lookup that syntax and get their DBA's permission to run the DDL?!?
Let's add some duration times to the result set. We'll use the traditional sub query for it.
SELECT
-- calculate uptime seconds
coalesce(
extract(epoch FROM age(last_ts, first_ts)),
0
) AS up_seconds,
*
FROM (
SELECT
-- will use `first_ts` and `last_ts` to calculate uptime duration
CASE WHEN uptime.start_ts IS NOT NULL THEN
greatest(uptime.start_ts, cal.start_ts)
END AS first_ts,
least(cal.end_ts, uptime.end_ts) AS last_ts,
date_trunc('day', cal.start_ts)::date AS cal_date,
extract(hour from cal.start_ts) AS cal_hour,
extract(epoch from age(cal.end_ts, cal.start_ts)) AS cal_seconds
FROM (
-- build virtual table of all hours between
-- a date range
SELECT
start_ts,
start_ts + interval '1 hour' AS end_ts
FROM generate_series('2017-03-01'::date,
'2017-03-03'::timestamp - interval '1 hour',
interval '1 hour'
) AS t(start_ts)
) AS cal
LEFT JOIN (
-- build virtual table of uptimes
SELECT *
FROM (
VALUES
('2017-03-01 01:15:00-06'::timestamp, '2017-03-01 02:15:00-06'::timestamp),
('2017-03-01 08:00:00-06', '2017-03-01 20:00:00-06'),
('2017-03-02 19:00:00-06', null)
) AS t(start_ts, end_ts)
) AS uptime ON cal.end_ts > uptime.start_ts AND cal.start_ts <= coalesce(uptime.end_ts, current_timestamp)
) t1
up_seconds
------------
0
2700
900
0
0
...
3600
(48 rows)
Without the subquery we'd be getting into even more nested function calls and
would have to double compute values or have no visibility in the intermediate
steps. We could have calculated up_seconds
directly in the first query which
introduced first_ts
and last_ts
. That would look like this:
SELECT
coalesce(
extract(epoch FROM
age(
least(cal.end_ts, uptime.end_ts),
CASE WHEN uptime.start_ts IS NOT NULL THEN
greatest(uptime.start_ts, cal.start_ts)
END
)
),
0
) AS up_seconds
FROM --- ...
It's not for the weak stomach, but frankly speaking, neither is the subquery...
Enough Nesting, LATERAL
join save me!¶
Lateral joins can give us the best of both worlds: reduced subquery nesting and
traceable computed values. We're going to move the initial computed values like
first_ts
and last_ts
, move them to a virtual table then JOIN LATERAL
so
they can get their own table alias. We'll do it again for up_seconds
and use
first_ts
and last_ts
from its sibling table.
SELECT
t2.up_seconds
FROM (
-- build virtual table of all hours between
-- a date range
SELECT
start_ts,
start_ts + interval '1 hour' AS end_ts
FROM generate_series('2017-03-01'::date,
'2017-03-03'::timestamp - interval '1 hour',
interval '1 hour'
) AS t(start_ts)
) AS cal
LEFT JOIN (
-- build virtual table of uptimes
SELECT *
FROM (
VALUES
('2017-03-01 01:15:00-06'::timestamp, '2017-03-01 02:15:00-06'::timestamp),
('2017-03-01 08:00:00-06', '2017-03-01 20:00:00-06'),
('2017-03-02 19:00:00-06', null)
) AS t(start_ts, end_ts)
) AS uptime ON cal.end_ts > uptime.start_ts AND cal.start_ts <= coalesce(uptime.end_ts, current_timestamp)
JOIN LATERAL (
SELECT
-- will use `first_ts` and `last_ts` to calculate uptime duration
CASE WHEN uptime.start_ts IS NOT NULL THEN
greatest(uptime.start_ts, cal.start_ts)
END AS first_ts,
least(cal.end_ts, uptime.end_ts) AS last_ts,
date_trunc('day', cal.start_ts)::date AS cal_date,
extract(hour from cal.start_ts) AS cal_hour,
extract(epoch from age(cal.end_ts, cal.start_ts)) AS cal_seconds
) t1 ON true
JOIN LATERAL (
-- calculate uptime seconds for the time slice
SELECT
coalesce(
extract(epoch FROM age(last_ts, first_ts)),
0
) AS up_seconds
) t2 ON true
This gives us the same results but without the deep nesting.
up_seconds
------------
0
2700
900
0
0
3600
...
3600
(48 rows)
What's great about this strategy is we can quickly choose which columns to see as we build up the query.
SELECT
t2.up_seconds
...
-- or --
SELECT
t2.*,
t1.*
Let's build up the final calculation using the same strategy:
SELECT
t2.*,
t3.*
FROM ...
JOIN LATERAL (
-- calculate percentage between uptime seconds and available seconds
-- within the time slice
SELECT
up_seconds / cal_seconds AS up_pct
) t3 ON true
up_seconds | up_pct
------------+--------
0 | 0
2700 | 0.75
900 | 0.25
0 | 0
...
3600 | 1
(48 rows)
Plot the Hours¶
Now we have all the computed data we need. Let's plot it as a cross tab (but not
actually use crosstab
)
We'll need to consolidate the long list of data by cal_date
and pivot the
cal_hour
as a column and up_pct
as a value. In case of overlapping uptimes
we'll be pessimists and choose the lowest or min
uptime percentage.
The final query looks like:
SELECT
cal_date,
max(CASE WHEN cal_hour = 0 THEN up_pct END) AS hour_0,
max(CASE WHEN cal_hour = 1 THEN up_pct END) AS hour_1,
max(CASE WHEN cal_hour = 2 THEN up_pct END) AS hour_2,
max(CASE WHEN cal_hour = 3 THEN up_pct END) AS hour_3,
max(CASE WHEN cal_hour = 4 THEN up_pct END) AS hour_4,
max(CASE WHEN cal_hour = 5 THEN up_pct END) AS hour_5,
max(CASE WHEN cal_hour = 6 THEN up_pct END) AS hour_6,
max(CASE WHEN cal_hour = 7 THEN up_pct END) AS hour_7,
max(CASE WHEN cal_hour = 8 THEN up_pct END) AS hour_8,
max(CASE WHEN cal_hour = 9 THEN up_pct END) AS hour_9,
max(CASE WHEN cal_hour = 10 THEN up_pct END) AS hour_10,
max(CASE WHEN cal_hour = 11 THEN up_pct END) AS hour_11,
max(CASE WHEN cal_hour = 12 THEN up_pct END) AS hour_12,
max(CASE WHEN cal_hour = 13 THEN up_pct END) AS hour_13,
max(CASE WHEN cal_hour = 14 THEN up_pct END) AS hour_14,
max(CASE WHEN cal_hour = 15 THEN up_pct END) AS hour_15,
max(CASE WHEN cal_hour = 16 THEN up_pct END) AS hour_16,
max(CASE WHEN cal_hour = 17 THEN up_pct END) AS hour_17,
max(CASE WHEN cal_hour = 18 THEN up_pct END) AS hour_18,
max(CASE WHEN cal_hour = 19 THEN up_pct END) AS hour_19,
max(CASE WHEN cal_hour = 20 THEN up_pct END) AS hour_20,
max(CASE WHEN cal_hour = 21 THEN up_pct END) AS hour_21,
max(CASE WHEN cal_hour = 22 THEN up_pct END) AS hour_22,
max(CASE WHEN cal_hour = 23 THEN up_pct END) AS hour_23
FROM (
-- build virtual table of all hours between
-- a date range
SELECT
start_ts,
start_ts + interval '1 hour' AS end_ts
FROM generate_series('2017-03-01'::date,
'2017-03-03'::timestamp - interval '1 hour',
interval '1 hour'
) AS t(start_ts)
) AS cal
LEFT JOIN (
-- build virtual table of uptimes
SELECT *
FROM (
VALUES
('2017-03-01 01:15:00-06'::timestamp, '2017-03-01 02:15:00-06'::timestamp),
('2017-03-01 08:00:00-06', '2017-03-01 20:00:00-06'),
('2017-03-02 19:00:00-06', null)
) AS t(start_ts, end_ts)
) AS uptime ON cal.end_ts > uptime.start_ts AND cal.start_ts <= coalesce(uptime.end_ts, current_timestamp)
JOIN LATERAL (
SELECT
-- will use `first_ts` and `last_ts` to calculate uptime duration
CASE WHEN uptime.start_ts IS NOT NULL THEN
greatest(uptime.start_ts, cal.start_ts)
END AS first_ts,
least(cal.end_ts, uptime.end_ts) AS last_ts,
date_trunc('day', cal.start_ts)::date AS cal_date,
extract(hour from cal.start_ts) AS cal_hour,
extract(epoch from age(cal.end_ts, cal.start_ts)) AS cal_seconds
) t1 ON true
JOIN LATERAL (
SELECT
coalesce(
extract(epoch FROM age(last_ts, first_ts)),
0
) AS up_seconds
) t2 ON true
JOIN LATERAL (
-- calculate percentage between uptime seconds and available seconds
-- within the time slice
SELECT
up_seconds / cal_seconds AS up_pct
) t3 ON true
GROUP BY cal_date
cal_date | hour_0 | hour_1 | hour_2 | hour_3 | ... | hour_23
------------+--------+--------+--------+--------+ ... +---------
2017-03-01 | 0 | 0.75 | 0.25 | 0 | ... | 0
2017-03-02 | 0 | 0 | 0 | 0 | ... | 1
(2 rows)
More than CTE and Cross Join¶
This example only scratches the surface of LATERAL
s super powers. On the
surface LATERAL
can do things CTE
, cross join, and WINDOW
can do.
PostgreSQL describe LATERAL
as:
Subqueries appearing in FROM can be preceded by the key word LATERAL. This allows them to reference columns provided by preceding FROM items. (Without LATERAL, each subquery is evaluated independently and so cannot cross-reference any other FROM item.)
TL;DR - LATERAL
allows subqueries to reference earlier tables.