RedShift
- Data Warehouse
- Meant to support OLAP (not OLTP): column-oriented, with a massively parallel processing (MPP) scale-out architecture
- OLAP is meant for analytics and aggregation of data; OLTP is meant for transactional workloads
- Leader & compute nodes: the leader node plans queries and distributes work to the compute nodes
- It does not require you to create indexes or materialised views; its columnar storage means queries scan less data than in traditional row-oriented relational databases
- Can load and query columnar file formats like Parquet and ORC
- But it has dist_key and sort_key
- dist_key
- It is the column used to distribute rows across the nodes
- Rows with the same value in this column are guaranteed to be on the same node (useful for co-located joins)
- sort_key
- It is the column on which data is sorted on each node
- Only one sort key is permitted per table (a compound sort key may span multiple columns)
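The two keys above are declared at table-creation time. A minimal sketch, using a hypothetical fact table whose column names are illustrative only:

    -- Hypothetical table for illustration; column names are assumptions
    CREATE TABLE factsales (
        sale_id     BIGINT,
        product_id  INTEGER,
        sale_date   DATE,
        amount      DECIMAL(12,2)
    )
    DISTKEY (product_id)   -- rows with the same product_id land on the same node
    SORTKEY (sale_date);   -- data on each node is stored sorted by sale_date

Choosing a frequently joined column as DISTKEY and a frequently filtered column as SORTKEY is the usual starting point.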
- RedShift does not enforce primary key constraints, so it will not complain about duplicate data even on a primary key column
- Advantages
- Faster loads, since it need not check whether the primary key already exists
- Performance, query optimization
- Disadvantages
- Chance of improper (duplicate) data
- It is up to the user to send clean data to RedShift; duplicates must be handled before loading into the RedShift cluster
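One common way to handle this is a staging-table pattern: load into an empty temp table, then merge into the target. A sketch, with table and column names assumed for illustration:

    -- Staging-table merge to avoid duplicates (names are hypothetical)
    BEGIN;

    -- Load new data into an empty staging table first
    CREATE TEMP TABLE dimproduct_stage (LIKE dimproduct);
    -- COPY dimproduct_stage FROM 's3://...' ... ;

    -- Delete rows that already exist in the target, then insert the rest
    DELETE FROM dimproduct
    USING dimproduct_stage
    WHERE dimproduct.product_id = dimproduct_stage.product_id;

    INSERT INTO dimproduct SELECT * FROM dimproduct_stage;

    END;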
1) Create a RedShift cluster
2) Connect to cluster and Create tables
(Use SQL Workbench - recommended - or any DB visualizer;
if not, use the RedShift Query Editor itself)
3) Create an IAM role for Redshift with S3 Read only access
4) Attach IAM S3 role to RedShift
5) Suppose your data is in S3, load your data from S3
copy dimproduct -- <table_name_from_redshift>
from 's3://redshift-load-queue-test/dimproduct.csv'
iam_role '<IAM Role created in step-3>/RedShift-S3-Role'
region 'us-east-1'
format csv
delimiter ','
6) Upload large data sets as gz files
sales1.txt.gz, sales2.txt.gz, sales3.txt.gz, sales4.txt.gz
Instead of running one COPY per file, create a single manifest file
manifest.txt
{
"entries": [
{"url":"s3://redshift-load-queue-test/sales1.txt.gz", "mandatory":true},
{"url":"s3://redshift-load-queue-test/sales2.txt.gz", "mandatory":true},
{"url":"s3://redshift-load-queue-test/sales3.txt.gz", "mandatory":true},
{"url":"s3://redshift-load-queue-test/sales4.txt.gz", "mandatory":true}
]
}
copy factsales
from 's3://redshift-load-queue-test/manifest.txt'
iam_role '<IAM Role created in step-3>'
region 'us-east-1'
GZIP
delimiter '|'
manifest
7) Copy JSON data
# JSON data should not be wrapped in a list/array
# It should be individual objects, one per line
copy dimdate
from 's3://redshift-load-queue-prabhath/dimdate.json'
region 'us-east-1'
iam_role '<IAM Role created in step-3>/RedShift-S3-Role'
json as 'auto'
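With json 'auto', RedShift matches JSON object keys to column names. For example, dimdate.json should look like this - one object per line, not a JSON array (field names here are assumptions for illustration):

    {"date_key": 20200522, "day": 22, "month": 5, "year": 2020}
    {"date_key": 20200523, "day": 23, "month": 5, "year": 2020}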
8) Find out any load errors
select * from stl_load_errors order by starttime desc
9) Integrate Kinesis Data Firehose with RedShift as the destination
Given the details, Firehose forms this COPY command for you
COPY firehose_test_table (ticker_symbol, sector, change, price)
FROM 's3://redshift-load-queue-<>/stream2020/05/22/04/test-stream-4-2020-05-22-04-19-21-db425054-695f-4fd1-8721-fc07cdeea369.gz'
CREDENTIALS 'aws_iam_role=<IAM Role created in step-3>/RedShift-S3-Role'
JSON 'auto' gzip;