May 22, 2020

AWS RedShift

RedShift
  • Data Warehouse
  • Meant to support OLAP (not OLTP); column-oriented with a massively parallel scale-out architecture
  • OLAP is meant for analytics and aggregation of data, while OLTP is meant for transactional workloads
  • Has a leader (master) node and compute (slave) nodes
  • It does not require/create indexes or materialised views, so it is faster and uses less storage than traditional relational databases
  • Supports columnar formats like Parquet and ORC
  • Instead, it has a dist_key and a sort_key (see the DDL sketch after this list)
  • dist_key
    • The column on which the data is distributed across the nodes
    • Rows with the same value of this column are guaranteed to be on the same node
  • sort_key
    • The column on which data is sorted on each node
    • Only one sort_key definition is permitted per table (a compound sort key may list multiple columns)
  • Redshift doesn't complain about duplicate data, even on a primary key
    • Advantages
      • Faster, since it need not check whether the primary key value already exists
      • Better performance; the declared primary key is still used as a hint for query optimisation
    • Disadvantages
      • Chances of improper (duplicate) data
      • It's up to the user to send proper data to Redshift; improper data has to be handled before sending it to the Redshift cluster
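
  For illustration, a minimal DDL sketch showing how a dist_key and sort_key are declared
  (the factsales columns here are assumptions, not the actual schema used later):

      create table factsales (
          sale_id      bigint,
          product_id   integer,
          sale_date    date,
          amount       decimal(12,2)
      )
      distkey (product_id)    -- rows with the same product_id land on the same node
      sortkey (sale_date);    -- data is stored sorted by sale_date on each node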

1) Create a Redshift cluster

2) Connect to the cluster and create tables
   (using SQL Workbench (recommended) or any DB visualizer;
    otherwise use the Redshift Query Editor itself)
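
   For example, the dimproduct table loaded in step 5 could be created with something like this
   (the column names and types are assumptions about a typical product dimension, not the actual file layout):

      create table dimproduct (
          product_id    integer,
          product_name  varchar(200),
          category      varchar(100),
          list_price    decimal(10,2)
      );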

3) Create an IAM role for Redshift with S3 Read only access

4) Attach the IAM S3 role to the Redshift cluster

5) Suppose your data is in S3; load it with the COPY command

  copy dimproduct    -- <table name in Redshift>
  from 's3://redshift-load-queue-test/dimproduct.csv'
  iam_role 'arn:aws:iam::<account-id>:role/RedShift-S3-Role'
  region 'us-east-1'
  format csv
  delimiter ','
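
  A quick sanity check after the load (and note: if dimproduct.csv had a header row,
  you would also add ignoreheader 1 to the COPY above):

      select count(*) from dimproduct;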

6) Load large data as gzip files
   sales1.txt.gz, sales2.txt.gz, sales3.txt.gz, sales4.txt.gz
   Instead of pointing COPY at a single file, create one manifest file that lists all of them
   manifest.txt
         {
          "entries": [
              {"url":"s3://redshift-load-queue-test/sales1.txt.gz", "mandatory":true},
              {"url":"s3://redshift-load-queue-test/sales2.txt.gz", "mandatory":true},
              {"url":"s3://redshift-load-queue-test/sales3.txt.gz", "mandatory":true},
              {"url":"s3://redshift-load-queue-test/sales4.txt.gz", "mandatory":true}
          ]   
         }

  copy factsales
  from 's3://redshift-load-queue-test/manifest.txt'
  iam_role 'arn:aws:iam::<account-id>:role/RedShift-S3-Role'
  region 'us-east-1'
  GZIP
  delimiter '|'
  manifest
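
  One way to confirm that every file in the manifest actually loaded is the STL_LOAD_COMMITS
  system table (a rough check; filter on whatever pattern matches your file names):

      select * from stl_load_commits
      where filename like '%sales%';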


7) Copy JSON data
   # JSON data should not be wrapped in a top-level list/array
   # Each record should be an individual JSON object (see the sample after the COPY command below)

   copy dimdate
   from 's3://redshift-load-queue-prabhath/dimdate.json'
   region 'us-east-1'
   iam_role 'arn:aws:iam::<account-id>:role/RedShift-S3-Role'
   json 'auto'
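
   Roughly how dimdate.json should look, one JSON object per record (the fields here are only
   an assumed example of a date dimension, not the actual file contents):

      {"date_key": 20200522, "full_date": "2020-05-22", "day_of_week": "Friday"}
      {"date_key": 20200523, "full_date": "2020-05-23", "day_of_week": "Saturday"}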

8) Find out any load errors
   select * from stl_load_errors
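
   A slightly more targeted version, pulling only the columns that usually matter, most recent errors first:

      select starttime, filename, line_number, colname, err_reason
      from stl_load_errors
      order by starttime desc
      limit 10;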

9) Integrate Kinesis Data Firehose with Redshift as the destination
    Given the details, Firehose forms this COPY command for you

      COPY firehose_test_table (ticker_symbol, sector, change, price)
      FROM 's3://redshift-load-queue-<>/stream2020/05/22/04/test-stream-4-2020-05-22-04-19-21-db425054-695f-4fd1-8721-fc07cdeea369.gz'
      CREDENTIALS 'aws_iam_role=arn:aws:iam::<account-id>:role/RedShift-S3-Role'
      JSON 'auto' gzip;
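
    The destination table has to exist in Redshift before Firehose can deliver into it;
    a sketch matching the columns in the COPY above (the types are assumptions based on the
    Kinesis demo stock-ticker data):

      create table firehose_test_table (
          ticker_symbol  varchar(10),
          sector         varchar(50),
          change         float,
          price          float
      );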

