Overusing CTEs
Breaking free from CTE chains: Why simpler transforms make better pipelines
Common Table Expressions (CTEs) are a powerful SQL feature that let you write complex queries step by step. However, when building data pipelines, excessive use of CTEs can create hidden problems. Let's explore why breaking transforms into separate steps often works better.
Consider this CTE-heavy approach:
WITH cleaned_orders AS (
SELECT date, customer_id, amount
FROM raw_orders
WHERE status != 'cancelled'
),
daily_totals AS (
SELECT
date,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM cleaned_orders
GROUP BY date
),
customer_metrics AS (
SELECT
customer_id,
COUNT(*) as frequency,
AVG(amount) as avg_order
FROM cleaned_orders
GROUP BY customer_id
)
SELECT
d.*,
c.frequency,
c.avg_order
FROM daily_totals d
CROSS JOIN customer_metrics c;
While this works, breaking it into separate transforms is often better:
-- Transform 1: cleaned_orders
CREATE VIEW cleaned_orders AS
SELECT date, customer_id, amount
FROM raw_orders
WHERE status != 'cancelled';
-- Transform 2: daily_totals
CREATE VIEW daily_totals AS
SELECT
date,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM cleaned_orders
GROUP BY date;
-- Transform 3: customer_metrics
CREATE VIEW customer_metrics AS
SELECT
customer_id,
COUNT(*) as frequency,
AVG(amount) as avg_order
FROM cleaned_orders
GROUP BY customer_id;
-- Final join
CREATE VIEW combined_metrics AS
SELECT
d.*,
c.frequency,
c.avg_order
FROM daily_totals d
CROSS JOIN customer_metrics c;
The benefits of this approach include:
- Earlier error detection: Each transform runs independently, so you'll catch issues (like missing columns or type mismatches) at the specific step where they occur.
- Memory management: The database can optimize each step separately, rather than holding the entire CTE chain in memory.
- Reusability: Other pipelines can easily reference intermediate steps like cleaned_orders without duplicating logic.
Readability: Each transform has a clear, single purpose, making the code easier to understand and maintain. - Clearer dependencies: Your DAG (Directed Acyclic Graph) will show explicit connections between transforms, making it easier to spot redundancies or optimization opportunities.
Think of it like building with LEGO - while you could build everything as one connected piece, using separate modules makes it easier to test, modify, and reuse components. Just as you'd break a complex function into smaller, focused functions in programming, break complex data transformations into clear, single-purpose steps.
Remember: CTEs still have their place for temporary calculations or when you truly need everything in one query. The key is using them judiciously rather than as your default approach to complex transformations.
Updated about 1 month ago