An Opinionated Introduction to Data Warehouse Query Strategies, Part 1 of N
I often work with new clients on building out a new data warehouse, or with new analysts accessing an existing data warehouse for the first time. In either case, it’s good to be aware of a few common principles when querying large data warehouse tables, especially those on data warehouse specific platforms such as Redshift, Snowflake and Teradata.
In this and likely a few more related posts, we’ll explore some common query strategies and pitfalls when dealing with data warehouse databases.
Don’t join fact tables
Joining large fact tables (such as an Orders
and a Shipments
fact table in an eCommerce data warehouse) together in attempt to create a report combining a few metrics potentially leads to issues that are later hard to track down.
Don’t :(
select
o.order_date,
sum(o.order_cnt) as order_cnt,
sum(s.shipment_cnt) as shipment_cnt
from
fct_orders o
join
fct_shipments s on o.order_id = s.order_id
group by 1
What do you mean? What’s wrong with this query?
-
Well, for one we might expose ourselves to the possibility of unintentional product joins - if not now, then down the road. For example, you might assume (or even know) that the
Shipments
fact table is at the Order level, i.e. each row in theShipments
fact is for exactly oneOrder
. However, unbeknownst to you, the distribution center modifies theShipments
process such that we find ourselves with multiple shipments per order for a non-trivial number of orders. In that case, the right side of the join toShipments
now not only likely creates query plan complications for our database (see below), but it also causes our query to return duplicate rows for each order from the left side of the join. While in some cases, this may be intentional (in which case the query should probably be explicitly structured that way), in most cases this is a subtle and hard-to-track-down bug. -
Secondly, on MPP databases where table distribution is user managed (e.g. Redshift), unless both fact tables are distributed on the same key and joined on the common distribution key, this likely causes redistribution issues as one of the large fact table now needs to be redistributed on a common key. Depending on the size of the result set, this may or may not be possible on your system.
So, what to do instead? Use Common-Table-Expressions (CTE) to pre-aggregate fact tables to a common level:
Do :)
with orders as
(
select
order_date,
sum(order_cnt) as order_cnt
from
fct_orders
group by 1
),
shipments as
(
select
order_date,
sum(shipment_cnt) as shipment_cnt
from
fct_shipments
group by 1
)
select
o.order_date as order_date,
o.order_cnt as order_cnt,
coalesce(s.shipment_cnt, 0) as shipment_cnt
from
orders o
-- we use a "left outer join" since not all orders
-- for a given order date might have shipped already
left outer join
shipments s on o.order_date = s.order_date
In short, redistribution is costly, and sometimes can be avoided by restructuring queries.
And in general,
Use CTEs
Generally, avoid subqueries and use CTEs instead. CTEs are a great way to reduce complexity and to enable reuse of code blocks within your query. Using CTEs, you can easily structure queries using modular building blocks and treat your SQL like a software developer, while still maintaining set logic.
Do :)
with complicated_logic as
(
...
),
metric_a as
(
select ...
from complicated_logic
where some_parameter = 'something'
),
metric_b as
(
select ...
from complicated_logic
where some_parameter = 'something else'
)
select
coalesce(ma.txn_date, mb.txn_date) as txn_date,
coalesce(ma.metric_a,0) as metric_a,
coalesce(mb.metric_b, 0) as metric_b
from
metric_a ma
-- here, we use a "full outer join"
-- since metric a and metric b might not have
-- values for every transaction date
full outer join
metric_b mb on ma.txn_date = mb.txn_date
On some platforms, CTEs can also be used recursively. But to understand recursion, you first must understand recursion.
Big Table First
Simply put, in a data warehouse query the fact table goes on the left
side of the join, any dimensions go on the right
side.
This makes the intent of the query more explicit, since we’re clearly saying we’re interested in the sum of a fact, by one or more dimensions. The fact is always more important than the dimension, so putting it first in the from
clause is good practice and leads to more readable and self-documenting code.
Do :)
select
f.txn_date,
u.user_type_name,
p.product_category_name,
sum(f.sales_amt) as sales_amt
from
fct_sales f
join
dim_user u on f.user_id = u.user_id
join
dim_product p on f.product_id = p.product_id
group by 1,2,3
Don’t Right Outer Join Anything Ever
If an inner join
is not possible due to missing data in the dimension table, use a left outer join
to join from a fact table to a dimension table, never a right outer join
.
The simple fix is to rearrange your query. In databases, 3 right turns don’t make a left turn.
Filter Early, Filter Often
In most cases, we’re interested in a subset of facts, typically for a certain time period or date range. Thus, we want to filter our, typically, very large fact table as efficiently as possible to just work with that slice of data downstream.
For partitioned tables - and really all your fact tables should be partitioned - filtering on the partition key (sortkey
in Redshift lingo) can have big performance benefits.
To get the most gain out of this, we need to filter as early as possible, as close to the fact table as possible.
Do :)
with metric_a as
(
select
txn_date,
sum(fct_col) as metric_a
from
fct_table_a
where
txn_date between '2018-01-01' and '2018-05-31'
group by 1
),
metric_b as
(
...
)
select
coalesce(ma.txn_date, mb.txn_date) as txn_date,
coalesce(ma.metric_a,0) as metric_a,
coalesce(mb.metric_b, 0) as metric_b
from
metric_a ma
full outer join
metric_b mb on ma.txn_date = mb.txn_date
-- don't filter here since that might not tell the query optimizer
-- to take advantage of partitioning
Don’t Filter on Computed Columns
Where possible, avoid filtering on computed columns. This can throw off the query optimizer on some platforms and can cause your nicely early-filtered query to perform a full table scans despite the filter.
So, instead of this:
Don’t :(
select
txn_date,
sum(fct_col) as metric_a
from
fct_table_a
where
date_trunc(‘month’, txn_date) between '2018-01-01' and '2018-05-01'
group by 1
You’re often better off using your date dimension:
Do :)
select
txn_date,
sum(fct_col) as metric_a
from
fct_table_a f
join
dim_date on f.txn_date = d.calendar_date
where
d.month_start_date between '2018-01-01' and '2018-05-01'
group by 1
For The Same Reason,
Don’t Join on Computed Columns
I see, and do, this all the time, but generally this is a smell that we need to preprocess our data earlier in the pipeline.
Don’t :(
select
txn_date,
sum(fct_col) as metric_a
from
fct_table_a f
join
dim_user u on lower(f.email_address) = lower(u.email_address)
group by 1
Do you have any more opinionated data warehouse query tips? Let me know!
I know I do…to be continued.
Comments