Raw data starts life in a source system, outside the data warehouse. The source may be a CRM such as Salesforce or HubSpot, or a database such as MySQL. The data can usually be retrieved via an API or a database connection, but it will not always arrive in an immediately usable format.
Before data is landed in the data warehouse, the Kleene connectors extract it into a filestore (e.g. GCS, Amazon S3 or Azure Blob Storage). The filestore holds the data in raw formats such as .csv, .tsv and .json; this is the data lake. The filestore is a stable, cloud-hosted storage system that also acts as a backup in case the data warehouse ever goes down.
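As a rough sketch of what "landing in the filestore" means, the snippet below writes the same extracted records in two common raw formats into a data-lake-style folder layout. A local temporary directory stands in for the cloud bucket, and the source name, table name and file names are all illustrative, not Kleene's actual layout.

```python
import csv
import json
from pathlib import Path
from tempfile import mkdtemp

# Hypothetical records pulled from a CRM API; in practice a connector
# performs the extraction.
records = [
    {"id": "1", "email": "ada@example.com", "created_at": "2023-05-01"},
    {"id": "2", "email": "alan@example.com", "created_at": "2023-05-02"},
]

# A local directory stands in for the cloud bucket (GCS / S3 / Azure Blob).
lake = Path(mkdtemp()) / "raw" / "salesforce" / "contacts"
lake.mkdir(parents=True)

# Land the same extract in two raw formats.
(lake / "contacts.json").write_text(json.dumps(records))
with (lake / "contacts.csv").open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

print(sorted(p.name for p in lake.iterdir()))
```

Note that every value is kept as a string: at this stage the goal is a faithful, cheap copy of the source, not a typed, queryable dataset.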
The Kleene connectors then copy this data into one centralised location: the data warehouse. In the process, the data is converted from its raw formats into a relational format that can be queried using the SQL flavour of the data warehouse.
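The copy step can be sketched as follows, with an in-memory sqlite3 database standing in for the data warehouse (a real warehouse and its SQL flavour will differ, and the table and column names here are made up). The point is that rows from a raw file become a relational table that SQL can query.

```python
import csv
import io
import sqlite3

# A raw CSV extract as it might sit in the filestore.
raw_csv = "id,email\n1,ada@example.com\n2,alan@example.com\n"

# sqlite3 stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_contacts (id TEXT, email TEXT)")

# Copy each raw row into the centralised, queryable table. Every column
# lands as TEXT: typing is deferred to the cleaning step.
rows = list(csv.DictReader(io.StringIO(raw_csv)))
conn.executemany(
    "INSERT INTO raw_contacts (id, email) VALUES (:id, :email)", rows
)

count = conn.execute("SELECT COUNT(*) FROM raw_contacts").fetchone()[0]
print(count)  # 2
```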
At this point the data is not yet ready for drawing insights. It is still in its raw stage, so most fields are stored as strings rather than, for instance, integers, dates or booleans. The raw data may also contain nested JSON objects and nested arrays.
These intricacies will need to be handled within the pipeline before any data can be used for analysis.
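A minimal sketch of that handling, on one made-up raw row: every scalar arrives as a string and one field holds a nested JSON payload, and the cleaning step casts each value to its proper type and flattens the nested object. The field names are illustrative only.

```python
import json
from datetime import date

# One raw-stage row: scalars are strings, "address" is nested JSON.
raw_row = {
    "id": "42",
    "is_active": "true",
    "signed_up": "2023-05-01",
    "address": '{"city": "London", "postcode": "EC1A 1BB"}',
}

# Unnest the JSON payload into a plain dict.
address = json.loads(raw_row["address"])

clean_row = {
    "id": int(raw_row["id"]),                               # string -> integer
    "is_active": raw_row["is_active"] == "true",            # string -> boolean
    "signed_up": date.fromisoformat(raw_row["signed_up"]),  # string -> date
    "city": address["city"],                                # flattened fields
    "postcode": address["postcode"],
}

print(clean_row)
```

In practice these casts are usually expressed in the warehouse's SQL (e.g. `CAST`, JSON-extraction functions) rather than in Python, but the transformations are the same.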
As explained in Your Sources and Extracts, all raw data lands in schemas that follow this naming convention:
Once the raw data has landed, the next step is to clean it.