Version: 1.5.0

Amazon Redshift

Amazon AWS Data Warehouse

Synopsis

Creates a target that loads data into Amazon Redshift using S3 staging and the COPY command. The target handles data file creation, S3 upload, and Redshift COPY operations with support for multiple tables and file formats. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.

Schema

- name: <string>
  description: <string>
  type: amazonredshift
  pipelines: <pipeline[]>
  status: <boolean>
  properties:
    key: <string>
    secret: <string>
    session: <string>
    region: <string>
    endpoint: <string>
    username: <string>
    password: <string>
    database: <string>
    schema: <string>
    port: <numeric>
    staging_bucket: <string>
    staging_prefix: <string>
    iam_role: <string>
    table: <string>
    tables: <table[]>
    format: <string>
    compression: <string>
    batch_size: <numeric>
    max_size: <numeric>
    part_size: <numeric>
    timeout: <numeric>
    field_format: <string>
    interval: <string|numeric>
    cron: <string>
    debug:
      status: <boolean>
      dont_send_logs: <boolean>

Configuration

The following fields are used to define the target:

| Field | Required | Default | Description |
|---|---|---|---|
| name | Y | - | Target name |
| description | N | - | Optional description |
| type | Y | - | Must be amazonredshift |
| pipelines | N | - | Optional post-processor pipelines |
| status | N | true | Enable/disable the target |

AWS Credentials

| Field | Required | Default | Description |
|---|---|---|---|
| key | N* | - | AWS access key ID for S3 authentication |
| secret | N* | - | AWS secret access key for S3 authentication |
| session | N | - | Optional session token for temporary credentials |
| region | Y | - | AWS region (e.g., us-east-1, eu-west-1) |

* = Conditionally required. AWS credentials (key and secret) are required unless using IAM role-based authentication on AWS infrastructure.
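
When the agent runs on AWS infrastructure with an attached IAM role (for example, an EC2 instance profile), key and secret can be omitted and only region is needed for S3 access. A minimal sketch under that assumption; the target name, endpoint, and role ARN are placeholders:

targets:
  - name: instance_profile_redshift
    type: amazonredshift
    properties:
      # key and secret omitted: S3 access comes from the instance's IAM role
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "events"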

Redshift Connection

| Field | Required | Default | Description |
|---|---|---|---|
| endpoint | Y | - | Redshift cluster endpoint (without port or database) |
| database | Y | - | Redshift database name |
| username | Y | - | Redshift database username |
| password | Y | - | Redshift database password |
| schema | N | public | Redshift schema name |
| port | N | 5439 | Redshift port number |

S3 Staging Configuration

| Field | Required | Default | Description |
|---|---|---|---|
| staging_bucket | Y | - | S3 bucket name for staging files |
| staging_prefix | N | redshift-staging/ | S3 key prefix for staging files |
| iam_role | N | - | IAM role ARN for Redshift to access S3 (recommended) |
| part_size | N | 5 | S3 multipart upload part size in MB |
Note: Using an IAM role is recommended for production. If not provided, AWS credentials will be used in the COPY command.
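
part_size is not used in the examples below; it only tunes the multipart upload of staging files to S3. A sketch of the staging-related properties, assuming a 10 MB part size is wanted for larger files (an excerpt of the properties block, not a complete target):

    properties:
      staging_bucket: "my-redshift-staging"
      staging_prefix: "redshift-staging/"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      part_size: 10   # MB per multipart upload part; the default is 5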

Table Configuration

| Field | Required | Default | Description |
|---|---|---|---|
| table | N* | - | Single table name for data loading |
| tables | N* | - | Array of table configurations for multiple tables |
| format | N | parquet | File format: parquet, json, avro, csv |
| compression | N | - | File compression format |

* = Either table or tables must be specified.
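
compression is not shown in the later examples. A sketch of the table-related properties, assuming gzip is an accepted value for a text format such as csv (the format/compression pairing here is an assumption, not a confirmed combination):

    properties:
      table: "events"
      format: "csv"
      compression: "gzip"   # assumed value; check the compression formats supported for your format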

Table Array Configuration

When using the tables array, each table can have the following properties:

| Field | Required | Default | Description |
|---|---|---|---|
| table | Y | - | Table name |
| schema | N | Parent schema | Schema name for this table |
| format | N | Parent format | File format for this table |
| compression | N | Parent compression | Compression format for this table |

Batch Configuration

| Field | Required | Default | Description |
|---|---|---|---|
| batch_size | N | - | Maximum number of events per file |
| max_size | N | - | Maximum file size in bytes |
| timeout | N | 300 | Timeout in seconds for COPY operations |
| field_format | N | - | Data normalization format. See applicable Normalization section |

Scheduler

| Field | Required | Default | Description |
|---|---|---|---|
| interval | N | realtime | Execution frequency. See Interval for details |
| cron | N | - | Cron expression for scheduled execution. See Cron for details |
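
cron does not appear in the examples below; the Scheduled Loading example uses interval instead. A sketch of a cron-driven load, assuming a standard five-field expression (every 15 minutes here is only an illustration), as an excerpt of the properties block:

    properties:
      table: "events"
      cron: "*/15 * * * *"   # load every 15 minutes instead of streaming in realtime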

Debug Options

| Field | Required | Default | Description |
|---|---|---|---|
| debug.status | N | false | Enable debug logging |
| debug.dont_send_logs | N | false | Process logs but don't send to target (testing) |

Details

Data Loading Process

The Redshift target loads data in four stages:

  1. File Creation: Accumulates data in memory and creates files in the specified format
  2. S3 Upload: Uploads the file to the staging S3 bucket
  3. COPY Command: Executes a Redshift COPY command to load data from S3 into the table
  4. Cleanup: Deletes the staging file from S3 after successful load

Supported File Formats

The target supports the following file formats for COPY operations:

Parquet (default)

  • Columnar format optimized for analytics
  • Best compression and query performance
  • Recommended for most use cases

JSON

  • Semi-structured data format
  • Uses auto option for schema detection
  • Useful for nested or variable schemas

Avro

  • Binary format with schema evolution support
  • Uses auto option for schema detection
  • Good for schema changes over time

CSV

  • Simple text format
  • Compatible with many tools
  • Less efficient than columnar formats

Authentication Methods

S3 Authentication

  • Static credentials (access key and secret key)
  • IAM role-based authentication on AWS infrastructure

Redshift Authentication

  • Username and password required
  • Connection uses SSL by default

COPY Command Authentication

  • IAM role (recommended): Redshift assumes role to access S3
  • Static credentials: Embedded in COPY command (not recommended for production)

Connection Pool

The target maintains a connection pool to Redshift with the following settings:

  • Maximum connections: 5
  • Minimum connections: 1
  • Connection lifetime: 1 hour
  • Idle timeout: 30 minutes

S3 Staging

Files are staged in S3 before loading into Redshift. The S3 key structure is:

{staging_prefix}{table_name}/{file_path}
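
For example, with the default staging_prefix and a table named events, staged objects are written under keys beginning with redshift-staging/events/.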

After successful COPY, the staging file is automatically deleted from S3.

Multiple Tables

You can configure multiple tables with different formats and schemas:

tables:
  - table: events
    schema: public
    format: parquet
  - table: logs
    schema: analytics
    format: json

Data is routed to tables based on the SystemS3 field in the log message.

Error Handling

The target handles errors at each stage:

  • S3 upload failures return an error and are retried
  • COPY command failures return detailed error messages
  • Connection pool handles reconnections automatically
  • Staging files are retained on error for troubleshooting

Performance Considerations

File Size: Larger files generally provide better COPY performance. Use batch_size and max_size to control file sizes.

Format: Parquet provides the best compression and load performance for analytical queries.

Parallelism: The queue parallelism setting controls concurrent workers. Redshift COPY operations are I/O intensive, so setting parallelism to 1 is recommended to avoid connection pool exhaustion.

IAM Role: Using an IAM role for COPY operations is more secure and performs better than embedding credentials.

Examples

Basic Configuration

The minimum configuration for a Redshift target:

targets:
  - name: basic_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      table: "events"

With IAM Role

Configuration using IAM role for COPY operations (recommended):

targets:
  - name: iam_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "events"

Multiple Tables

Configuration with multiple tables and different formats:

targets:
  - name: multi_table_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      schema: "public"
      format: "parquet"
      tables:
        - table: "events"
          format: "parquet"
        - table: "logs"
          format: "json"
        - table: "metrics"
          schema: "analytics"
          format: "parquet"

JSON Format

Configuration using JSON format:

targets:
  - name: json_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "json_events"
      format: "json"

With Batch Limits

Configuration with batch size and file size limits:

targets:
  - name: batched_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "events"
      batch_size: 100000
      max_size: 104857600

Custom Schema

Configuration with custom schema:

targets:
  - name: custom_schema_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      schema: "events_schema"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "application_logs"

With Staging Prefix

Configuration with custom staging prefix:

targets:
  - name: custom_prefix_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      staging_prefix: "datastream/staging/"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "events"

With Field Normalization

Using field normalization for standard format:

targets:
  - name: normalized_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "normalized_events"
      field_format: "cim"

With Checkpoint Pipeline

Configuration with checkpoint pipeline for reliability:

targets:
  - name: reliable_redshift
    type: amazonredshift
    pipelines:
      - checkpoint
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "critical_events"

Scheduled Loading

Configuration with scheduled data loading:

targets:
  - name: scheduled_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "hourly_events"
      interval: "1h"

Custom Port

Configuration with custom Redshift port:

targets:
  - name: custom_port_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      port: 5440
      staging_bucket: "my-redshift-staging"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      table: "events"

Debug Configuration

Configuration with debugging enabled:

targets:
  - name: debug_redshift
    type: amazonredshift
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "analytics"
      username: "admin"
      password: "MyPassword123"
      staging_bucket: "my-redshift-staging"
      table: "test_events"
      debug:
        status: true
        dont_send_logs: true

Production Configuration

Configuration for production with optimal settings:

targets:
  - name: production_redshift
    type: amazonredshift
    pipelines:
      - checkpoint
    properties:
      key: "AKIAIOSFODNN7EXAMPLE"
      secret: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
      region: "us-east-1"
      endpoint: "prod-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com"
      database: "production"
      username: "datastream_user"
      password: "SecurePassword123"
      schema: "events"
      staging_bucket: "prod-redshift-staging"
      staging_prefix: "datastream/"
      iam_role: "arn:aws:iam::123456789012:role/RedshiftCopyRole"
      format: "parquet"
      batch_size: 500000
      max_size: 524288000
      timeout: 600
      field_format: "cim"
      tables:
        - table: "application_events"
          format: "parquet"
        - table: "security_events"
          format: "parquet"
        - table: "audit_logs"
          schema: "audit"
          format: "json"