Defining Your Cleaning Business Logic
Data cleaning as foundational warehouse design
Setting up a data warehouse isn't just about storing data - it's about establishing the foundations for how your organization understands and interacts with that data. Standardization decisions, particularly for critical entities like customer data, should be treated as a key part of your warehouse design phase, requiring careful consultation with stakeholders across the business.
Consider this common evolution of customer data handling:
-- Original view after rushed implementation
CREATE VIEW cleaned_customers AS
SELECT
TRIM(INITCAP(first_name)) as first_name,
TRIM(INITCAP(last_name)) as last_name,
title
FROM raw_customers;
-- Marketing adds their requirements later
CREATE VIEW marketing_customers AS
SELECT
first_name,
last_name,
CASE
WHEN LOWER(title) IN ('mr', 'mister') THEN 'Mr'
WHEN LOWER(title) IN ('mrs', 'missus') THEN 'Mrs'
ELSE title
END as title
FROM cleaned_customers;
-- Finance adds different requirements
CREATE VIEW finance_customers AS
SELECT
first_name,
last_name,
COALESCE(title, 'Unknown') as title,
-- New requirement: handle middle names
CASE
WHEN first_name LIKE '% %'
THEN SUBSTRING(first_name FROM POSITION(' ' IN first_name))
END as middle_name
FROM cleaned_customers;
Instead, work with stakeholders upfront to understand requirements and create a comprehensive solution:
-- Thoroughly designed standardization after stakeholder consultation
CREATE VIEW cleaned_customers AS
SELECT
-- Names split according to legal team requirements
TRIM(SPLIT_PART(raw_name, ' ', 1)) as first_name,
CASE
WHEN array_length(string_to_array(raw_name, ' '), 1) > 2
THEN TRIM(SPLIT_PART(raw_name, ' ', 2))
END as middle_name,
TRIM(SPLIT_PART(raw_name, ' ', -1)) as last_name,
-- Title standardization matching CRM requirements
CASE
WHEN title IS NULL OR title = '' THEN 'Unknown'
WHEN LOWER(title) IN ('mr', 'mister') THEN 'Mr'
WHEN LOWER(title) IN ('mrs', 'missus') THEN 'Mrs'
WHEN LOWER(title) IN ('ms', 'miss') THEN 'Ms'
WHEN LOWER(title) = 'dr' THEN 'Dr'
ELSE INITCAP(TRIM(title))
END as title,
-- Additional fields requested by compliance
created_at,
last_modified_at
FROM raw_customers;
Key principles for this approach
Stakeholder Consultation
Meet with teams across the business to understand
- How they use customer data
- What cleaning rules they currently apply
- What issues they've encountered
- Any compliance requirements
Documentation
Record not just the logic but the reasoning in your transform metadata:
- Why specific standardization rules were chosen
- Which teams requested which features
- Any compliance or legal requirements incorporated
Refactoring Readiness
As new requirements emerge, don't be afraid to refactor the core view:
- Better to modify the foundation than build workarounds
- Each refactor is an opportunity to improve documentation
- Changes benefit all downstream users immediately
Following the "shift left" principle, this standardization should happen as early as possible in your pipeline, becoming part of your warehouse's core design rather than an afterthought. When new requirements emerge - like handling international characters or new title formats - update the core cleaning logic rather than creating workarounds in downstream transformations.
Think of it like designing a building's foundation - invest time upfront to understand requirements, but also build in the flexibility to adapt as needs change. It's better to carefully modify the foundation than to keep adding precarious extensions on top.
Updated about 1 month ago