Defining Your Cleaning Business Logic

Data cleaning as foundational warehouse design

Setting up a data warehouse isn't just about storing data - it's about establishing the foundations for how your organization understands and interacts with that data. Standardization decisions, particularly for critical entities like customer data, should be treated as a key part of your warehouse design phase, requiring careful consultation with stakeholders across the business.

Consider this common evolution of customer data handling:

-- Original view after rushed implementation  
CREATE VIEW cleaned_customers AS  
SELECT  
  TRIM(INITCAP(first_name)) as first_name,  
  TRIM(INITCAP(last_name)) as last_name,  
  title  
FROM raw_customers;

-- Marketing adds their requirements later  
CREATE VIEW marketing_customers AS  
SELECT  
  first_name,  
  last_name,  
  CASE  
    WHEN LOWER(title) IN ('mr', 'mister') THEN 'Mr'  
    WHEN LOWER(title) IN ('mrs', 'missus') THEN 'Mrs'  
    ELSE title  
  END as title  
FROM cleaned_customers;

-- Finance adds different requirements  
CREATE VIEW finance_customers AS  
SELECT  
  first_name,  
  last_name,  
  COALESCE(title, 'Unknown') as title,  
  -- New requirement: handle middle names  
  CASE  
    WHEN first_name LIKE '% %'  
    THEN SUBSTRING(first_name FROM POSITION(' ' IN first_name))  
  END as middle_name  
FROM cleaned_customers;

Instead, work with stakeholders upfront to understand requirements and create a comprehensive solution:

-- Thoroughly designed standardization after stakeholder consultation  
CREATE VIEW cleaned_customers AS  
SELECT  
  -- Names split according to legal team requirements  
  TRIM(SPLIT_PART(raw_name, ' ', 1)) as first_name,  
  CASE  
    WHEN array_length(string_to_array(raw_name, ' '), 1) > 2  
    THEN TRIM(SPLIT_PART(raw_name, ' ', 2))  
  END as middle_name,  
  TRIM(SPLIT_PART(raw_name, ' ', -1)) as last_name,  
  -- Title standardization matching CRM requirements  
  CASE  
    WHEN title IS NULL OR title = '' THEN 'Unknown'  
    WHEN LOWER(title) IN ('mr', 'mister') THEN 'Mr'  
    WHEN LOWER(title) IN ('mrs', 'missus') THEN 'Mrs'  
    WHEN LOWER(title) IN ('ms', 'miss') THEN 'Ms'  
    WHEN LOWER(title) = 'dr' THEN 'Dr'  
    ELSE INITCAP(TRIM(title))  
  END as title,  
  -- Additional fields requested by compliance  
  created_at,  
  last_modified_at  
FROM raw_customers;

Key principles for this approach

Stakeholder Consultation

Meet with teams across the business to understand

  • How they use customer data
  • What cleaning rules they currently apply
  • What issues they've encountered
  • Any compliance requirements

Documentation

Record not just the logic but the reasoning in your transform metadata:

  • Why specific standardization rules were chosen
  • Which teams requested which features
  • Any compliance or legal requirements incorporated

Refactoring Readiness

As new requirements emerge, don't be afraid to refactor the core view:

  • Better to modify the foundation than build workarounds
  • Each refactor is an opportunity to improve documentation
  • Changes benefit all downstream users immediately

Following the "shift left" principle, this standardization should happen as early as possible in your pipeline, becoming part of your warehouse's core design rather than an afterthought. When new requirements emerge - like handling international characters or new title formats - update the core cleaning logic rather than creating workarounds in downstream transformations.

Think of it like designing a building's foundation - invest time upfront to understand requirements, but also build in the flexibility to adapt as needs change. It's better to carefully modify the foundation than to keep adding precarious extensions on top.