Raw data originates in source systems outside the data warehouse. The source may be a CRM such as Salesforce or HubSpot, or a database such as MySQL. Data can usually be retrieved via an API or a database connection, but it will not always arrive in an immediately usable format.

Before data lands in the data warehouse, the Kleene connectors extract it into a filestore (e.g. GCS, Amazon S3, or Azure Blob Storage). The filestore holds data in raw formats such as .csv, .tsv, and .json; this is the data lake. The filestore is a stable cloud storage layer that also acts as a backup should the data warehouse ever go down.

The Kleene connectors then copy this data into one centralised location: the data warehouse. Along the way, the data is converted from its raw formats into relational tables that can be queried using the data warehouse's SQL flavour.
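For illustration, loading staged files from the data lake into a warehouse table often looks something like the Snowflake-style COPY INTO statement below. The stage, path, and table names are hypothetical, and Kleene's connectors perform this step for you; this is only a sketch of the general mechanism.

-- Hypothetical example: load raw CSV files staged in the filestore
-- into a warehouse table. Kleene's connectors automate this step.
COPY INTO hubspot_raw.contacts_raw
FROM @filestore_stage/hubspot/contacts/    -- external stage over the data lake
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);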

At this point the data is not yet ready for drawing insights: it is still in its raw stage, so most fields are stored as strings rather than, for instance, integers, dates, or booleans. The raw data may also contain nested JSON and nested arrays.
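One quick way to confirm this is to inspect the column types of a newly landed table via information_schema, as in the sketch below. The schema and table names are hypothetical, and identifier casing varies by warehouse.

-- Hypothetical example: check the data types of a freshly ingested table.
-- Expect mostly string (text/varchar) and variant columns at this stage.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'hubspot_raw'
  AND table_name = 'contacts_raw'
ORDER BY ordinal_position;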


[Screenshot: newly ingested data showing nested JSON and arrays. These fields hold information at different levels of granularity, so they require cleaning and flattening before any insights can be drawn from them.]


These intricacies will need to be handled within the pipeline before any data can be used for analysis.
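A typical cleaning transform casts the string fields to proper types and flattens nested JSON into one row per element. Below is a minimal, Snowflake-flavoured sketch assuming a hypothetical contacts_raw table with an emails field containing a JSON array; other warehouses offer similar constructs (e.g. UNNEST in BigQuery).

-- Hypothetical example: cast raw strings to proper types and flatten
-- a nested JSON array into one row per array element.
SELECT
    c.id::integer              AS contact_id,
    c.created_at::timestamp    AS created_at,
    c.is_deleted::boolean      AS is_deleted,
    e.value:label::string      AS email_label,
    e.value:address::string    AS email_address
FROM hubspot_raw.contacts_raw AS c,
     LATERAL FLATTEN(input => parse_json(c.emails)) AS e;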

As explained in Your Sources and Extracts, all raw data lands in schemas and tables that follow this naming convention:

<source_name>_raw.<table_name>_raw
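For example, a Salesforce Account extract would land as the hypothetical salesforce_raw.account_raw, queryable like any other table:

-- Hypothetical instance of the raw naming convention:
SELECT * FROM salesforce_raw.account_raw LIMIT 10;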

The next step after landing the raw data is to clean it.


What’s Next