A Sail Recipe: Tackling an Out-of-Control Redshift Bill
The LakeSail Team
September 6, 2024
Amazon Redshift is a common data analytics solution for AWS users. However, ingesting data into Redshift can be slow and complicated, and Redshift’s costs can quickly add up. In our experience, Redshift’s performance hasn’t been ideal, especially with complex queries and large datasets, adding to the frustration of managing its high costs.
Let’s explore a more efficient and cost-effective solution with Sail. You will be able to work directly with your data in Amazon S3, saving you both time and money.
Getting Started with Sail
Sail provides a simple setup for working with your data using its Python package. You can install Sail via pip
.
pip install 'pysail[spark]'
Alternatively, you can build and run a standalone Sail server executable for your environment. Detailed setup instructions are available here.
The following code snippet shows how you can query S3 data using the PySpark API, but with the compute engine powered by Sail. The dataset is generated for the derived TPC-H benchmark and is stored in Parquet format.
from pysail.spark import SparkConnectServer
from pyspark.sql import SparkSession
server = SparkConnectServer()
server.start()
_, port = server.listening_address
sail = SparkSession.builder.remote(f"sc://localhost:{port}").getOrCreate()
# Please configure AWS credentials in your environment.
# Please replace the S3 path with your dataset path.
df = sail.read.parquet("s3://bucket/customer.parquet")
df.show(10)
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey| c_name| c_address|c_nationkey| c_phone|c_acctbal|c_mktsegment| c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
| 30073|Customer#000030073| GTfpr7pFq3,0BNe9T| 11|21-840-931-1593| 5727.78| HOUSEHOLD|shall cajole care...|
| 30074|Customer#000030074| Y W2iBIxG,dia| 19|29-829-714-4352| 3811.88| FURNITURE|instructions are ...|
| 30075|Customer#000030075|kwqiDzOqJWhwk1rc1...| 3|13-475-838-2563| 223.81| FURNITURE|out the sly, pend...|
| 30076|Customer#000030076|Z M04qnvrCsVV9 Xl...| 18|28-384-478-4237| 8527.38| AUTOMOBILE|leep final deposi...|
| 30077|Customer#000030077|a9D4kG3U9bO MBiFH...| 6|16-420-733-4992| 9722.76| BUILDING| packages. quickl...|
| 30078|Customer#000030078|M8hHAU5k8clhlkrfj...| 19|29-148-512-9148| 3444.28| MACHINERY|packages. furious...|
| 30079|Customer#000030079|ZUGk8 m2mLx L3GaG...| 22|32-313-457-9545| 3916.80| FURNITURE|ld accounts play ...|
| 30080|Customer#000030080|ZQcqu IyNlesIUVv...| 4|14-214-576-4578| 3624.97| MACHINERY|cies haggle. foxe...|
| 30081|Customer#000030081|TGJtjMz2f BsWDEih...| 16|26-608-630-1157| 4827.24| MACHINERY|d the express dep...|
| 30082|Customer#000030082| n0AqDEC0HrcNBm2RJ| 11|21-399-468-2712| 4377.42| BUILDING|ly final foxes ha...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
Working Directly with Data in Sail
DataFrame Operations
With Sail, you can run DataFrame operations directly on your data, making data manipulation much easier.
df.sort(df.c_custkey.desc()).show(10)
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey| c_name| c_address|c_nationkey| c_phone|c_acctbal|c_mktsegment| c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
| 1500000|Customer#001500000|fTuxEKUZOnUfDbkRC...| 17|27-810-721-6824| 347.33| BUILDING|ts. slyly express...|
| 1499999|Customer#001499999|nTTtbgPm4mrXK3idT...| 5|15-643-345-7071| 1206.37| MACHINERY|uietly ironic fox...|
| 1499998|Customer#001499998|wQJcJHZwvh6NFdG1r...| 3|13-635-256-6036| 4932.42| BUILDING|ts integrate. sly...|
| 1499997|Customer#001499997| cW CfLAKhyizxDxoaQ2| 7|17-937-698-8566| 9164.57| AUTOMOBILE|s. carefully regu...|
| 1499996|Customer#001499996|,vTPcqLlxjBwpooT9...| 5|15-464-905-5862| -956.85| FURNITURE|ing to the furiou...|
| 1499995|Customer#001499995| UOZucLgE0s7w| 12|22-365-240-6224| 2311.72| FURNITURE|le among the slyl...|
| 1499994|Customer#001499994|1BRchCLSk69oVZi97...| 20|30-682-509-6416| 927.67| AUTOMOBILE|s boost at the re...|
| 1499993|Customer#001499993|xFmUnRIQJmXCmsB99...| 24|34-920-200-4606| 4223.59| FURNITURE| even accounts. q...|
| 1499992|Customer#001499992|T6 zTSc9v3o3eECnu...| 10|20-108-454-6654| 1238.67| BUILDING|unts boost blithe...|
| 1499991|Customer#001499991| g esnsF RH8Y5m| 16|26-289-217-5275| 4660.16| HOUSEHOLD|onic packages int...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
SQL Operations
Sail also allows you to use SQL directly to interact with your data.
df.createOrReplaceTempView("customer")
sql = """
SELECT *
FROM customer
ORDER BY c_custkey ASC
LIMIT 10
"""
sail.sql(sql).show()
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey| c_name| c_address|c_nationkey| c_phone|c_acctbal|c_mktsegment| c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
| 1|Customer#000000001| IVhzIApeRb ot,c,E| 15|25-989-741-2988| 711.56| BUILDING|to the even, regu...|
| 2|Customer#000000002|XSTf4,NCwDVaWNe6t...| 13|23-768-687-3665| 121.65| AUTOMOBILE|l accounts. blith...|
| 3|Customer#000000003| MG9kdTD2WBHm| 1|11-719-748-3364| 7498.12| AUTOMOBILE| deposits eat sly...|
| 4|Customer#000000004| XxVSJsLAGtn| 4|14-128-190-5944| 2866.83| MACHINERY| requests. final,...|
| 5|Customer#000000005|KvpyuHCplrB84WgAi...| 3|13-750-942-6364| 794.47| HOUSEHOLD|n accounts will h...|
| 6|Customer#000000006|sKZz0CsnMD7mp4Xd0...| 20|30-114-968-4951| 7638.57| AUTOMOBILE|tions. even depos...|
| 7|Customer#000000007|TcGe5gaZNgVePxU5k...| 18|28-190-982-9759| 9561.95| AUTOMOBILE|ainst the ironic,...|
| 8|Customer#000000008|I0B10bB0AymmC, 0P...| 17|27-147-574-9335| 6819.74| BUILDING|among the slyly r...|
| 9|Customer#000000009|xKiAFTjUsCuxfeleN...| 8|18-338-906-3675| 8324.07| FURNITURE|r theodolites acc...|
| 10|Customer#000000010|6LrEaV6KR6PLVcgl2...| 5|15-741-346-9870| 2753.54| HOUSEHOLD|es regular deposi...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
Writing Data
Once you’ve processed your data, you can write it back to S3 easily.
# Please configure AWS credentials in your environment.
# Please replace the S3 path with your dataset path.
output_path = "s3://bucket/output.parquet"
sail.sql(sql).write.format("parquet").save(output_path)
After writing the data, you can quickly verify that everything was saved correctly.
sail.read.parquet(output_path).show()
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey| c_name| c_address|c_nationkey| c_phone|c_acctbal|c_mktsegment| c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
| 1|Customer#000000001| IVhzIApeRb ot,c,E| 15|25-989-741-2988| 711.56| BUILDING|to the even, regu...|
| 2|Customer#000000002|XSTf4,NCwDVaWNe6t...| 13|23-768-687-3665| 121.65| AUTOMOBILE|l accounts. blith...|
| 3|Customer#000000003| MG9kdTD2WBHm| 1|11-719-748-3364| 7498.12| AUTOMOBILE| deposits eat sly...|
| 4|Customer#000000004| XxVSJsLAGtn| 4|14-128-190-5944| 2866.83| MACHINERY| requests. final,...|
| 5|Customer#000000005|KvpyuHCplrB84WgAi...| 3|13-750-942-6364| 794.47| HOUSEHOLD|n accounts will h...|
| 6|Customer#000000006|sKZz0CsnMD7mp4Xd0...| 20|30-114-968-4951| 7638.57| AUTOMOBILE|tions. even depos...|
| 7|Customer#000000007|TcGe5gaZNgVePxU5k...| 18|28-190-982-9759| 9561.95| AUTOMOBILE|ainst the ironic,...|
| 8|Customer#000000008|I0B10bB0AymmC, 0P...| 17|27-147-574-9335| 6819.74| BUILDING|among the slyly r...|
| 9|Customer#000000009|xKiAFTjUsCuxfeleN...| 8|18-338-906-3675| 8324.07| FURNITURE|r theodolites acc...|
| 10|Customer#000000010|6LrEaV6KR6PLVcgl2...| 5|15-741-346-9870| 2753.54| HOUSEHOLD|es regular deposi...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
Cost Analysis
Amazon Redshift
The minimum requirement for a Redshift RA3 cluster is two nodes. Two ra3.4xlarge
nodes in the us-east-1 region costs $4,759.60 a month ($3.26 per hour, per node). This provides a total of 24 vCPUs and 192 GB of memory for the cluster. Additionally, there is an extra charge of $0.024 per GB-month for data storage.
Sail
Consider running one r8g.8xlarge
EC2 Instance for Sail. This gives you 32 vCPUs and 256 GB of memory for $1,376.14 a month ($1.88512 per hour). Compared with Redshift, you get 8 extra vCPUs and 64 GB of extra memory, with no additional cost for data storage. This means you would save at least $3,383.46 per month by switching to Sail, making it 71% cheaper than Redshift while also providing more memory and CPU power.
If you don’t need as much compute power, you can run Sail on a smaller instance, such as r8g.4xlarge
. This gives you 16 vCPUs and 128 GB of memory for $688.07 a month ($0.94256 per hour). This results in a saving of $4,071.53 per month compared to Redshift, making Sail 85% cheaper.
Conclusion
By switching from Amazon Redshift to Sail, you can reduce your cloud costs by 71% or more while gaining access to more powerful compute resources. In future posts, we will explore how Sail’s performance compares to Redshift. As Sail delivers significant performance gains and efficiency, these improvements could lead to requiring a much smaller instance, resulting in even greater cost savings.