The most atomic object in the transform layer is a transform, with the next level up being a transform group, which was mentioned previously in Groups.
After creating a transform group, you can create and save individual transforms. Creating new transforms is roughly the same process as creating transform groups, except you must be within a transform group to do so by clicking on a similarly located plus sign on the top-right corner of the transform screen.
After creating a new transform, a console will appear where SQL can be written. Anything saved in a transform will be executed in SQL. This is an important point to remember and consider, as this allows the transform layer to be flexible in its use.
Once transforms are created and saved within groups, the next task would be to schedule these to run automatically (if required).
If individual transforms within groups are required to be scheduled at specific times, then the group schedule can be overwritten. However, if the transform has upstream dependencies, then that transform is locked into being dependency-based. It is recommended to schedule on the group level, so that the Kleene app scheduler can determine the order of execution and run the jobs when the upstream jobs have finished, as opposed to scheduling transforms to run at specific times and run the risk of transform jobs overlapping each other.
To schedule a transform, select the transform you wish to schedule, then navigate to the top right-hand corner of the transform editor and click SETTINGS.
A new menu will appear on screen with options very similar to that shown for the transform groups in Groups. However, there are many more options that relate to the dependencies of the transform, where you can read more about this here. But to schedule a transform, navigate to the bottom of this menu and click on "Schedule (UTC time zone)", where the interface for choosing a time in the week to schedule the transform will appear.
A transform can be used to create, replace, update and delete (CRUD) data on a scheduled basis. Anything written in the editing box of a transform will hence be executed as SQL, which means you can create tables, insert into tables and run other SQL commands as if you are in the SQL console. However, bear in mind that you cannot observe the output immediately like in the SQL console, since queries in a transform are sent directly to the data warehouse with limited output.
Since the transforms run any text inside as SQL, they can also be used to run any warehouse commands on a schedule, so for the case of a Snowflake data warehouse, the transforms can be used to schedule a query that changes the warehouse size at certain times in the day. The query for this may look like:
ALTER WAREHOUSE elt SET WAREHOUSE_SIZE = MEDIUM;
There are additional options to configure a transform on the top right-hand side of the transform screen. From left to right, these correspond to:
- RUN - the user has a choice of running the transform in three different ways:
- All upstream transform dependencies are run and ends on the selected transform.
- The selected transform is run and then all downstream transforms are run.
- Only the selected transform is run.
- SAVE - Saves the transform and all the current text inside the editor.
- SETTINGS - allows the user to configure:
- The transform name.
- The transform description.
- The transform input and output (this shows the upstream and downstream dependencies of the transform, so this is useful for debugging).
- Transform tags.
- Transforms can be tagged so that the search functionality returns a tagged set of transforms.
- Schedule, which would be rarely used if group scheduling is used.
These transform options can be seen as below:
Clicking on the settings button will reveal additional options:
- Undo changes
- Show history
- Move to another group
- Show in DAG
- Validate dependencies
- Delete transform
Options one, three, four and six should be self-explanatory. The second is interesting in that it allows you to view the previous versions of the transform. Versions are defined by changes being made to the SQL in the transform and then saved. When viewing the different versions, the additions are shown in green and the subtractions are shown in red.
To roll the transform back to a particular version, select the version and click the ROLLBACK TO THIS VERSION button on the top right-hand corner of the screen.
The fifth option is to check the upstream dependencies of the transform are met, i.e. exist. This option may only be useful when manually defining the dependencies of a transform and then checking all upstream dependencies that exist.
Transforms are pivotal features when building a data pipeline, but they are prone to failure/cause issues due to various reasons, which may include poorly formatted data being extracted, data not being extracted, transform timeouts etc. In the event of failure, it is important to understand where the problems occur and be able to fix the issues fast by knowing what the problem is. One such practice to enable this type of response is unit testing.
Updated 2 months ago