An Opinionated Introduction to Data Warehouse Query Strategies, Part 1 of N
I often work with new clients on building out a new data warehouse, or with new analysts accessing an existing data warehouse for the first time. In either case, it’s good to be aware of a few common principles when querying large data warehouse tables, especially those on data warehouse specific platforms such as Redshift, Snowflake and Teradata.
In this and likely a few more related posts, we’ll explore some common query strategies and pitfalls when dealing with data warehouse databases.
Don’t join fact tables
Joining large fact tables (such as an
Orders and a
Shipments fact table in an eCommerce data warehouse) together in attempt to create a report combining a few metrics potentially leads to issues that are later hard to track down.
select o.order_date, sum(o.order_cnt) as order_cnt, sum(s.shipment_cnt) as shipment_cnt from fct_orders o join fct_shipments s on o.order_id = s.order_id group by 1
What do you mean? What’s wrong with this query?
Well, for one we might expose ourselves to the possibility of unintentional product joins - if not now, then down the road. For example, you might assume (or even know) that the
Shipmentsfact table is at the Order level, i.e. each row in the
Shipmentsfact is for exactly one
Order. However, unbeknownst to you, the distribution center modifies the
Shipmentsprocess such that we find ourselves with multiple shipments per order for a non-trivial number of orders. In that case, the right side of the join to
Shipmentsnow not only likely creates query plan complications for our database (see below), but it also causes our query to return duplicate rows for each order from the left side of the join. While in some cases, this may be intentional (in which case the query should probably be explicitly structured that way), in most cases this is a subtle and hard-to-track-down bug.
Secondly, on MPP databases where table distribution is user managed (e.g. Redshift), unless both fact tables are distributed on the same key and joined on the common distribution key, this likely causes redistribution issues as one of the large fact table now needs to be redistributed on a common key. Depending on the size of the result set, this may or may not be possible on your system.
So, what to do instead? Use Common-Table-Expressions (CTE) to pre-aggregate fact tables to a common level:
with orders as ( select order_date, sum(order_cnt) as order_cnt from fct_orders group by 1 ), shipments as ( select order_date, sum(shipment_cnt) as shipment_cnt from fct_shipments group by 1 ) select o.order_date as order_date, o.order_cnt as order_cnt, coalesce(s.shipment_cnt, 0) as shipment_cnt from orders o -- we use a "left outer join" since not all orders -- for a given order date might have shipped already left outer join shipments s on o.order_date = s.order_date
In short, redistribution is costly, and sometimes can be avoided by restructuring queries.
And in general,
Generally, avoid subqueries and use CTEs instead. CTEs are a great way to reduce complexity and to enable reuse of code blocks within your query. Using CTEs, you can easily structure queries using modular building blocks and treat your SQL like a software developer, while still maintaining set logic.
with complicated_logic as ( ... ), metric_a as ( select ... from complicated_logic where some_parameter = 'something' ), metric_b as ( select ... from complicated_logic where some_parameter = 'something else' ) select coalesce(ma.txn_date, mb.txn_date) as txn_date, coalesce(ma.metric_a,0) as metric_a, coalesce(mb.metric_b, 0) as metric_b from metric_a ma -- here, we use a "full outer join" -- since metric a and metric b might not have -- values for every transaction date full outer join metric_b mb on ma.txn_date = mb.txn_date
On some platforms, CTEs can also be used recursively. But to understand recursion, you first must understand recursion.
Big Table First
Simply put, in a data warehouse query the fact table goes on the
left side of the join, any dimensions go on the
This makes the intent of the query more explicit, since we’re clearly saying we’re interested in the sum of a fact, by one or more dimensions. The fact is always more important than the dimension, so putting it first in the
from clause is good practice and leads to more readable and self-documenting code.
select f.txn_date, u.user_type_name, p.product_category_name, sum(f.sales_amt) as sales_amt from fct_sales f join dim_user u on f.user_id = u.user_id join dim_product p on f.product_id = p.product_id group by 1,2,3
Don’t Right Outer Join Anything Ever
inner join is not possible due to missing data in the dimension table, use a
left outer join to join from a fact table to a dimension table, never a
right outer join.
The simple fix is to rearrange your query. In databases, 3 right turns don’t make a left turn.
Filter Early, Filter Often
In most cases, we’re interested in a subset of facts, typically for a certain time period or date range. Thus, we want to filter our, typically, very large fact table as efficiently as possible to just work with that slice of data downstream.
For partitioned tables - and really all your fact tables should be partitioned - filtering on the partition key (
sortkey in Redshift lingo) can have big performance benefits.
To get the most gain out of this, we need to filter as early as possible, as close to the fact table as possible.
with metric_a as ( select txn_date, sum(fct_col) as metric_a from fct_table_a where txn_date between '2018-01-01' and '2018-05-31' group by 1 ), metric_b as ( ... ) select coalesce(ma.txn_date, mb.txn_date) as txn_date, coalesce(ma.metric_a,0) as metric_a, coalesce(mb.metric_b, 0) as metric_b from metric_a ma full outer join metric_b mb on ma.txn_date = mb.txn_date -- don't filter here since that might not tell the query optimizer -- to take advantage of partitioning
Don’t Filter on Computed Columns
Where possible, avoid filtering on computed columns. This can throw off the query optimizer on some platforms and can cause your nicely early-filtered query to perform a full table scans despite the filter.
So, instead of this:
select txn_date, sum(fct_col) as metric_a from fct_table_a where date_trunc(‘month’, txn_date) between '2018-01-01' and '2018-05-01' group by 1
You’re often better off using your date dimension:
select txn_date, sum(fct_col) as metric_a from fct_table_a f join dim_date on f.txn_date = d.calendar_date where d.month_start_date between '2018-01-01' and '2018-05-01' group by 1
For The Same Reason,
Don’t Join on Computed Columns
I see, and do, this all the time, but generally this is a smell that we need to preprocess our data earlier in the pipeline.
select txn_date, sum(fct_col) as metric_a from fct_table_a f join dim_user u on lower(f.email_address) = lower(u.email_address) group by 1
Do you have any more opinionated data warehouse query tips? Let me know!
I know I do…to be continued.