Overusing CTEs

Common Table Expressions (CTEs) are a powerful SQL feature that let you write complex queries step by step. However, when building data pipelines, excessive use of CTEs can create hidden problems. Let's explore why breaking transforms into separate steps often works better.

Consider this CTE-heavy approach:

WITH cleaned_orders AS (  
  SELECT date, customer_id, amount  
  FROM raw_orders  
  WHERE status != 'cancelled'  
),  

daily_totals AS (  
  SELECT  
    date,  
    COUNT(*) as order_count,  
    SUM(amount) as total_amount  
  FROM cleaned_orders  
  GROUP BY date  
),  

customer_metrics AS (  
  SELECT  
    customer_id,  
    COUNT(*) as frequency,  
    AVG(amount) as avg_order  
  FROM cleaned_orders  
  GROUP BY customer_id  
)  

SELECT  
  d.*,  
  c.frequency,  
  c.avg_order  
FROM daily_totals d  
CROSS JOIN customer_metrics c;

While this works, breaking it into separate transforms is often better:

-- Transform 1: cleaned_orders  
CREATE VIEW cleaned_orders AS  
SELECT date, customer_id, amount  
FROM raw_orders  
WHERE status != 'cancelled';

-- Transform 2: daily_totals  
CREATE VIEW daily_totals AS  
SELECT  
  date,  
  COUNT(*) as order_count,  
  SUM(amount) as total_amount  
FROM cleaned_orders  
GROUP BY date;

-- Transform 3: customer_metrics  
CREATE VIEW customer_metrics AS  
SELECT  
  customer_id,  
  COUNT(*) as frequency,  
  AVG(amount) as avg_order  
FROM cleaned_orders  
GROUP BY customer_id;

-- Final join  
CREATE VIEW combined_metrics AS  
SELECT  
  d.*,  
  c.frequency,  
  c.avg_order  
FROM daily_totals d  
CROSS JOIN customer_metrics c;

The benefits of this approach include:

Earlier error detection: Each transform runs independently, so you'll catch issues (like missing columns or type mismatches) at the specific step where they occur.
Memory management: The database can optimize each step separately, rather than holding the entire CTE chain in memory.
Reusability: Other pipelines can easily reference intermediate steps like cleaned_orders without duplicating logic.
Readability: Each transform has a clear, single purpose, making the code easier to understand and maintain.
Clearer dependencies: Your DAG (Directed Acyclic Graph) will show explicit connections between transforms, making it easier to spot redundancies or optimization opportunities.

Think of it like building with LEGO - while you could build everything as one connected piece, using separate modules makes it easier to test, modify, and reuse components. Just as you'd break a complex function into smaller, focused functions in programming, break complex data transformations into clear, single-purpose steps.

Remember: CTEs still have their place for temporary calculations or when you truly need everything in one query. The key is using them judiciously rather than as your default approach to complex transformations.