A lot of the time spent creating a data pipeline will be in the initial data cleansing.
The raw data now exists within the data warehouse, but still requires cleaning. Depending on the data source, the connectors are able to automatically cast fields to their correct data type, as well as remove special characters from field names and replace them with underscores. Any camel case field names, for example
camelCaseField, may also have underscores added between the identified words, where for the previous example this would now be
For the cases where the connector is unable to do this, manually choosing the correct data type will be required. These fields would be manually casted in the transform layer of the Kleene app.
Sometimes, the data returned from source will have issues preventing the field from being casted to a particular data type. This can happen with date fields from Google Sheets, where multiple different date formats are used, so the data warehouse has trouble casting all the different formats into one date data type. In this case, the values in the field would need to be cleaned before any data type casting can commence. An example of this is given below for a field containing dates of format
dd/mm/yyyy, where the latter does not follow a consistent number of digits, e.g. 3/6/2022 will appear instead of 03/06/2022:
(CASE WHEN contains(ACTUALSHIPDATE,'-') THEN ACTUALSHIPDATE WHEN contains(ACTUALSHIPDATE,'/') THEN (right(ACTUALSHIPDATE::text,4)::text ||'-'||LPAD(split_part(left(ACTUALSHIPDATE::text , len(ACTUALSHIPDATE::text)-5),'/',2),2,0) ||'-'||LPAD(split_part(left(ACTUALSHIPDATE::text , len(ACTUALSHIPDATE::text)-5),'/',1),2,0)) END)::date as ACTUAL_SHIP_DATE
Separate to data type casting and field value cleaning is handling semi-structured data. This entails the parsing out JSON. Semi-structured data will always need to be handled before moving the data downstream toward the insights layer. In Snowflake, the key function to handle semi-structured data is:
- Link to Snowflake documentation.
- When using this function on a field, it can be treated as casting the field to data type variant.
If there are any arrays (denoted by
[ ]) within the table, then they are not to be handled at the cleaned step, so instead leave them as they are, but feel free to rename the field to something that is easier to read or makes more sense. Arrays are handled at the Subentity stage.
All cleaned tables reside in their respective source specific 'cleaned' schema. The naming convention would be:
Once the data has been cleaned, the main bulk of the business-logic transformation begins with the subentity stage.
Updated about 2 months ago