Amazon S3
Set up
Source
To get set up with the Amazon S3 connector, you will need the following:
- Region (e.g. eu-west-1)
- AWS Access Key
- AWS Secret Key
To obtain these AWS credentials, see the Amazon documentation here.
Extract
For each extract, the following information is required:
- S3 Bucket Name
- Load type
- Load a single file - you must then enter the path to the specific file
- Load all files inside a folder - you must then enter the path to the folder
- Load a single file inside a ZIP - you must then enter the path to the zip and separately enter the path to the file within the zip
- Load all files inside a ZIP - you must then enter the path to the zip
Data structure
When loading all files from inside a folder, the data structure of each file must be identical.
- File Type
- CSV
- JSON
- Parquet
- TXT
- XLS
- XLSX
- XML
- DynamoDB export
- AVRO
- Values delimiter - If TXT is chosen as the file type, a delimiter must also be chosen. The accepted delimiters are:
- Tab
- Pipe - |
- Comma - ,
- Header - If this is checked, the output columns are named after the key names in the first row of data in the input files. Otherwise, the columns are named column1, column2, … columnN.
- Sheet Name - Applies only to XLS and XLSX files. Enter a value to load the data from the specified sheet. If this is left blank, Kleene loads the data from the first sheet.
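The Header and Values delimiter options can be illustrated with a minimal Python sketch. It is not Kleene's implementation: the function name `read_rows` and its parameters are hypothetical, and it only mirrors the column naming rule described above (names from the first row when Header is checked, otherwise column1 … columnN).

```python
import csv
import io


def read_rows(text, has_header, delimiter=","):
    """Return (column_names, rows) for a block of delimited text.

    Hypothetical helper: mimics the documented Header option only.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return [], []
    if has_header:
        # Header checked: the first row supplies the column names.
        return rows[0], rows[1:]
    # Header unchecked: generate column1, column2, ... columnN.
    names = [f"column{i}" for i in range(1, len(rows[0]) + 1)]
    return names, rows


sample = "id|name\n1|Ada\n2|Grace\n"
with_header = read_rows(sample, has_header=True, delimiter="|")
without_header = read_rows(sample, has_header=False, delimiter="|")
print(with_header[0])
print(without_header[0])
```

Note that with Header unchecked, the first line of the file is treated as data, so it appears as a row in the output.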
The following options and behavior apply when “Load all files inside a folder” has been selected:
- Load only files newer than X number of days ago - any files older than X number of days ago will be ignored
Once the files have been loaded, the list of ingested files will be stored within the _KLEENE_FILENAME column in the destination warehouse table.
- Filter Files - When this option is selected, only files that follow a certain pattern are loaded. Enter a wildcard in the Regex pattern input to return certain files in a folder. For example, if you enter *.csv in the Regex pattern input, all files that follow the pattern <something>.csv will be extracted.
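To see how the two folder options combine, here is a rough Python sketch, under stated assumptions. The function `select_files` and its parameters are hypothetical, the pattern is treated as a glob-style wildcard (as the *.csv example suggests) and matched against the file name rather than the full path, and the age check compares each file's last-modified time against a cutoff. Kleene's actual matching rules may differ.

```python
from datetime import datetime, timedelta, timezone
from fnmatch import fnmatchcase


def select_files(files, pattern=None, max_age_days=None, now=None):
    """Pick the keys to load from (key, last_modified) pairs.

    Hypothetical helper; last_modified values are timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    selected = []
    for key, modified in files:
        # "Load only files newer than X days ago": skip anything older.
        if max_age_days is not None:
            if modified < now - timedelta(days=max_age_days):
                continue
        # "Filter Files": assumed to match the file name, not the folder path.
        name = key.rsplit("/", 1)[-1]
        if pattern is not None and not fnmatchcase(name, pattern):
            continue
        selected.append(key)
    return selected
```

Passing `now` explicitly keeps the function deterministic, which makes the cutoff easy to check against a known list of files.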
Additional Info
- For incremental loads, only files that have not been loaded previously will be processed.
- Zipped files can be loaded as well. Files that end with the .zip extension are automatically unzipped and loaded.
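As an illustration of the ZIP load types, the following Python sketch reads either one named file or every file from an archive held in memory. The helper `read_zip_members` is hypothetical and only approximates the "single file inside a ZIP" and "all files inside a ZIP" behavior described earlier; it says nothing about how Kleene unzips files internally.

```python
import io
import zipfile


def read_zip_members(zip_bytes, member=None):
    """Return {member_name: bytes} for one member, or for all files if member is None.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        if member is not None:
            names = [member]
        else:
            # Skip directory entries; only real files carry data.
            names = [i.filename for i in archive.infolist() if not i.is_dir()]
        return {name: archive.read(name) for name in names}


# Build a small archive in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/a.csv", "id\n1\n")
    archive.writestr("data/b.csv", "id\n2\n")
zip_bytes = buffer.getvalue()

print(sorted(read_zip_members(zip_bytes)))
print(read_zip_members(zip_bytes, member="data/b.csv"))
```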
Limitations
- When “Load all files inside a folder” has been selected, only files with extensions that end with the respective file type are processed. For example, when File Type: JSON is selected, only files ending with .JSON are processed.
- For DynamoDB export files, a file extension of “JSON.gz” is expected for the input files.
- For JSON files, a list of JSON objects is expected on the top level, where each object is a single row in the output. If the top-level object is a single JSON object or hashmap, only a single row will be output.
- For XML files, no additional un-nesting or processing is done, and only one row is output. The entire XML document is output in JSON format as a single row in the table column named content.
- Format and file limitations are rooted in the Snowflake environment. Please see a list of these here.
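The JSON limitation above is easy to trip over, so here is a small Python sketch of the row behavior it describes. The function `json_to_rows` is hypothetical and simply mirrors the documented rule: a top-level list yields one row per object, while a single top-level object yields exactly one row.

```python
import json


def json_to_rows(text):
    """Return the rows a JSON document would produce under the rule above.

    Hypothetical helper for illustration only.
    """
    parsed = json.loads(text)
    if isinstance(parsed, list):
        # Top-level list: each element becomes its own row.
        return parsed
    # Top-level object: the whole document becomes a single row.
    return [parsed]


as_list = json_to_rows('[{"id": 1}, {"id": 2}]')
as_object = json_to_rows('{"items": [{"id": 1}, {"id": 2}]}')
print(len(as_list), len(as_object))
```

If records are wrapped in an outer object, as in the second call, they would need to be rewritten as a top-level list before loading in order to get one row per record.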
Updated 8 months ago