May 22, 2020

AWS RedShift

RedShift
  • Data Warehouse
  • Meant to support OLAP (not OLTP); column-oriented with a massively parallel scale-out architecture
  • OLAP is meant for analytics and aggregation of data, while OLTP is meant for transactional workloads
  • Has a leader (master) node and compute (slave) nodes
  • It does not require/create indexes or materialised views, so it is faster and uses less storage than traditional relational databases
  • Supports columnar formats like Parquet and ORC
  • Instead, it has a dist_key and a sort_key (see the DDL sketch after this list)
  • dist_key
    • The column on which the data is distributed across the nodes
    • Rows with the same value of this column are guaranteed to be on the same node
  • sort_key
    • The column on which data is sorted on each node
    • Only one sort_key definition is permitted per table (a compound sort key may list multiple columns)
  • Redshift doesn't complain about duplicate data, even on a primary key
    • Advantages
      • Faster, since it need not check whether the primary key value already exists
      • Better performance; the declared primary key is still used as a hint for query optimisation
    • Disadvantages
      • Chances of improper (duplicate) data
      • It's up to the user to send proper data to Redshift; improper data has to be handled before sending it to the Redshift cluster
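
  For illustration, a minimal DDL sketch showing how a dist_key and sort_key are declared
  (the factsales columns here are assumptions, not the actual schema used later):

      create table factsales (
          sale_id      bigint,
          product_id   integer,
          sale_date    date,
          amount       decimal(12,2)
      )
      distkey (product_id)    -- rows with the same product_id land on the same node
      sortkey (sale_date);    -- data is stored sorted by sale_date on each node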

1) Create a Redshift cluster

2) Connect to the cluster and create tables
   (using SQL Workbench (recommended) or any DB visualizer;
    otherwise use the Redshift Query Editor itself)
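
   For example, the dimproduct table loaded in step 5 could be created with something like this
   (the column names and types are assumptions about a typical product dimension, not the actual file layout):

      create table dimproduct (
          product_id    integer,
          product_name  varchar(200),
          category      varchar(100),
          list_price    decimal(10,2)
      );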

3) Create an IAM role for Redshift with S3 Read only access

4) Attach the IAM S3 role to the Redshift cluster

5) Suppose your data is in S3; load it with the COPY command

  copy dimproduct    -- <table name in Redshift>
  from 's3://redshift-load-queue-test/dimproduct.csv'
  iam_role 'arn:aws:iam::<account-id>:role/RedShift-S3-Role'
  region 'us-east-1'
  format csv
  delimiter ','
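
  A quick sanity check after the load (and note: if dimproduct.csv had a header row,
  you would also add ignoreheader 1 to the COPY above):

      select count(*) from dimproduct;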

6) Load large data as gzip files
   sales1.txt.gz, sales2.txt.gz, sales3.txt.gz, sales4.txt.gz
   Instead of pointing COPY at a single file, create one manifest file that lists all of them
   manifest.txt
         {
          "entries": [
              {"url":"s3://redshift-load-queue-test/sales1.txt.gz", "mandatory":true},
              {"url":"s3://redshift-load-queue-test/sales2.txt.gz", "mandatory":true},
              {"url":"s3://redshift-load-queue-test/sales3.txt.gz", "mandatory":true},
              {"url":"s3://redshift-load-queue-test/sales4.txt.gz", "mandatory":true}
          ]   
         }

  copy factsales
  from 's3://redshift-load-queue-test/manifest.txt'
  iam_role 'arn:aws:iam::<account-id>:role/RedShift-S3-Role'
  region 'us-east-1'
  GZIP
  delimiter '|'
  manifest
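
  One way to confirm that every file in the manifest actually loaded is the STL_LOAD_COMMITS
  system table (a rough check; filter on whatever pattern matches your file names):

      select * from stl_load_commits
      where filename like '%sales%';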


7) Copy JSON data
   # JSON data should not be wrapped in a top-level list/array
   # Each record should be an individual JSON object (see the sample after the COPY command below)

   copy dimdate
   from 's3://redshift-load-queue-prabhath/dimdate.json'
   region 'us-east-1'
   iam_role 'arn:aws:iam::<account-id>:role/RedShift-S3-Role'
   json 'auto'
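
   Roughly how dimdate.json should look, one JSON object per record (the fields here are only
   an assumed example of a date dimension, not the actual file contents):

      {"date_key": 20200522, "full_date": "2020-05-22", "day_of_week": "Friday"}
      {"date_key": 20200523, "full_date": "2020-05-23", "day_of_week": "Saturday"}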

8) Find out any load errors
   select * from stl_load_errors
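
   A slightly more targeted version, pulling only the columns that usually matter, most recent errors first:

      select starttime, filename, line_number, colname, err_reason
      from stl_load_errors
      order by starttime desc
      limit 10;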

9) Integrate Kinesis Data Firehose with Redshift as the destination
    Given the details, Firehose forms this COPY command for you

      COPY firehose_test_table (ticker_symbol, sector, change, price)
      FROM 's3://redshift-load-queue-<>/stream2020/05/22/04/test-stream-4-2020-05-22-04-19-21-db425054-695f-4fd1-8721-fc07cdeea369.gz'
      CREDENTIALS 'aws_iam_role=arn:aws:iam::<account-id>:role/RedShift-S3-Role'
      JSON 'auto' gzip;
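
    The destination table has to exist in Redshift before Firehose can deliver into it;
    a sketch matching the columns in the COPY above (the types are assumptions based on the
    Kinesis demo stock-ticker data):

      create table firehose_test_table (
          ticker_symbol  varchar(10),
          sector         varchar(50),
          change         float,
          price          float
      );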

