A Sail Recipe: Tackling an Out-of-Control Redshift Bill

The LakeSail Team
September 6, 2024

Amazon Redshift is a common data analytics solution for AWS users. However, ingesting data into Redshift can be slow and complicated, and Redshift’s costs can quickly add up. In our experience, Redshift’s performance hasn’t been ideal, especially with complex queries and large datasets, adding to the frustration of managing its high costs.

Let’s explore a more efficient and cost-effective solution with Sail. You will be able to work directly with your data in Amazon S3, saving you both time and money.

Getting Started with Sail

Sail provides a simple setup for working with your data using its Python package. You can install Sail via pip.

bash
pip install 'pysail[spark]'

Alternatively, you can build and run a standalone Sail server executable for your environment. Detailed setup instructions are available here.

The following code snippet shows how you can query S3 data using the PySpark API, but with the compute engine powered by Sail. The dataset is generated for the derived TPC-H benchmark and is stored in Parquet format.

python
from pysail.spark import SparkConnectServer
from pyspark.sql import SparkSession

server = SparkConnectServer()
server.start()
_, port = server.listening_address

sail = SparkSession.builder.remote(f"sc://localhost:{port}").getOrCreate()

# Please configure AWS credentials in your environment.
# Please replace the S3 path with your dataset path.
df = sail.read.parquet("s3://bucket/customer.parquet")
df.show(10)
text
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey|            c_name|           c_address|c_nationkey|        c_phone|c_acctbal|c_mktsegment|           c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|    30073|Customer#000030073|   GTfpr7pFq3,0BNe9T|         11|21-840-931-1593|  5727.78|   HOUSEHOLD|shall cajole care...|
|    30074|Customer#000030074|       Y W2iBIxG,dia|         19|29-829-714-4352|  3811.88|   FURNITURE|instructions are ...|
|    30075|Customer#000030075|kwqiDzOqJWhwk1rc1...|          3|13-475-838-2563|   223.81|   FURNITURE|out the sly, pend...|
|    30076|Customer#000030076|Z M04qnvrCsVV9 Xl...|         18|28-384-478-4237|  8527.38|  AUTOMOBILE|leep final deposi...|
|    30077|Customer#000030077|a9D4kG3U9bO MBiFH...|          6|16-420-733-4992|  9722.76|    BUILDING| packages. quickl...|
|    30078|Customer#000030078|M8hHAU5k8clhlkrfj...|         19|29-148-512-9148|  3444.28|   MACHINERY|packages. furious...|
|    30079|Customer#000030079|ZUGk8 m2mLx L3GaG...|         22|32-313-457-9545|  3916.80|   FURNITURE|ld accounts play ...|
|    30080|Customer#000030080|ZQcqu  IyNlesIUVv...|          4|14-214-576-4578|  3624.97|   MACHINERY|cies haggle. foxe...|
|    30081|Customer#000030081|TGJtjMz2f BsWDEih...|         16|26-608-630-1157|  4827.24|   MACHINERY|d the express dep...|
|    30082|Customer#000030082|   n0AqDEC0HrcNBm2RJ|         11|21-399-468-2712|  4377.42|    BUILDING|ly final foxes ha...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+

Working Directly with Data in Sail

DataFrame Operations

With Sail, you can run DataFrame operations directly on your data, making data manipulation much easier.

python
df.sort(df.c_custkey.desc()).show(10)
text
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey|            c_name|           c_address|c_nationkey|        c_phone|c_acctbal|c_mktsegment|           c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|  1500000|Customer#001500000|fTuxEKUZOnUfDbkRC...|         17|27-810-721-6824|   347.33|    BUILDING|ts. slyly express...|
|  1499999|Customer#001499999|nTTtbgPm4mrXK3idT...|          5|15-643-345-7071|  1206.37|   MACHINERY|uietly ironic fox...|
|  1499998|Customer#001499998|wQJcJHZwvh6NFdG1r...|          3|13-635-256-6036|  4932.42|    BUILDING|ts integrate. sly...|
|  1499997|Customer#001499997| cW CfLAKhyizxDxoaQ2|          7|17-937-698-8566|  9164.57|  AUTOMOBILE|s. carefully regu...|
|  1499996|Customer#001499996|,vTPcqLlxjBwpooT9...|          5|15-464-905-5862|  -956.85|   FURNITURE|ing to the furiou...|
|  1499995|Customer#001499995|        UOZucLgE0s7w|         12|22-365-240-6224|  2311.72|   FURNITURE|le among the slyl...|
|  1499994|Customer#001499994|1BRchCLSk69oVZi97...|         20|30-682-509-6416|   927.67|  AUTOMOBILE|s boost at the re...|
|  1499993|Customer#001499993|xFmUnRIQJmXCmsB99...|         24|34-920-200-4606|  4223.59|   FURNITURE| even accounts. q...|
|  1499992|Customer#001499992|T6 zTSc9v3o3eECnu...|         10|20-108-454-6654|  1238.67|    BUILDING|unts boost blithe...|
|  1499991|Customer#001499991|     g  esnsF RH8Y5m|         16|26-289-217-5275|  4660.16|   HOUSEHOLD|onic packages int...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+

SQL Operations

Sail also allows you to use SQL directly to interact with your data.

python
df.createOrReplaceTempView("customer")
sql = """
    SELECT *
    FROM customer
    ORDER BY c_custkey ASC
    LIMIT 10
"""
sail.sql(sql).show()
text
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey|            c_name|           c_address|c_nationkey|        c_phone|c_acctbal|c_mktsegment|           c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|        1|Customer#000000001|   IVhzIApeRb ot,c,E|         15|25-989-741-2988|   711.56|    BUILDING|to the even, regu...|
|        2|Customer#000000002|XSTf4,NCwDVaWNe6t...|         13|23-768-687-3665|   121.65|  AUTOMOBILE|l accounts. blith...|
|        3|Customer#000000003|        MG9kdTD2WBHm|          1|11-719-748-3364|  7498.12|  AUTOMOBILE| deposits eat sly...|
|        4|Customer#000000004|         XxVSJsLAGtn|          4|14-128-190-5944|  2866.83|   MACHINERY| requests. final,...|
|        5|Customer#000000005|KvpyuHCplrB84WgAi...|          3|13-750-942-6364|   794.47|   HOUSEHOLD|n accounts will h...|
|        6|Customer#000000006|sKZz0CsnMD7mp4Xd0...|         20|30-114-968-4951|  7638.57|  AUTOMOBILE|tions. even depos...|
|        7|Customer#000000007|TcGe5gaZNgVePxU5k...|         18|28-190-982-9759|  9561.95|  AUTOMOBILE|ainst the ironic,...|
|        8|Customer#000000008|I0B10bB0AymmC, 0P...|         17|27-147-574-9335|  6819.74|    BUILDING|among the slyly r...|
|        9|Customer#000000009|xKiAFTjUsCuxfeleN...|          8|18-338-906-3675|  8324.07|   FURNITURE|r theodolites acc...|
|       10|Customer#000000010|6LrEaV6KR6PLVcgl2...|          5|15-741-346-9870|  2753.54|   HOUSEHOLD|es regular deposi...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+

Writing Data

Once you’ve processed your data, you can write it back to S3 easily.

python
# Please configure AWS credentials in your environment.
# Please replace the S3 path with your dataset path.
output_path = "s3://bucket/output.parquet"
sail.sql(sql).write.format("parquet").save(output_path)

After writing the data, you can quickly verify that everything was saved correctly.

python
sail.read.parquet(output_path).show()
text
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|c_custkey|            c_name|           c_address|c_nationkey|        c_phone|c_acctbal|c_mktsegment|           c_comment|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+
|        1|Customer#000000001|   IVhzIApeRb ot,c,E|         15|25-989-741-2988|   711.56|    BUILDING|to the even, regu...|
|        2|Customer#000000002|XSTf4,NCwDVaWNe6t...|         13|23-768-687-3665|   121.65|  AUTOMOBILE|l accounts. blith...|
|        3|Customer#000000003|        MG9kdTD2WBHm|          1|11-719-748-3364|  7498.12|  AUTOMOBILE| deposits eat sly...|
|        4|Customer#000000004|         XxVSJsLAGtn|          4|14-128-190-5944|  2866.83|   MACHINERY| requests. final,...|
|        5|Customer#000000005|KvpyuHCplrB84WgAi...|          3|13-750-942-6364|   794.47|   HOUSEHOLD|n accounts will h...|
|        6|Customer#000000006|sKZz0CsnMD7mp4Xd0...|         20|30-114-968-4951|  7638.57|  AUTOMOBILE|tions. even depos...|
|        7|Customer#000000007|TcGe5gaZNgVePxU5k...|         18|28-190-982-9759|  9561.95|  AUTOMOBILE|ainst the ironic,...|
|        8|Customer#000000008|I0B10bB0AymmC, 0P...|         17|27-147-574-9335|  6819.74|    BUILDING|among the slyly r...|
|        9|Customer#000000009|xKiAFTjUsCuxfeleN...|          8|18-338-906-3675|  8324.07|   FURNITURE|r theodolites acc...|
|       10|Customer#000000010|6LrEaV6KR6PLVcgl2...|          5|15-741-346-9870|  2753.54|   HOUSEHOLD|es regular deposi...|
+---------+------------------+--------------------+-----------+---------------+---------+------------+--------------------+

Cost Analysis

Amazon Redshift

The minimum requirement for a Redshift RA3 cluster is two nodes. Two ra3.4xlarge nodes in the us-east-1 region costs $4,759.60 a month ($3.26 per hour, per node). This provides a total of 24 vCPUs and 192 GB of memory for the cluster. Additionally, there is an extra charge of $0.024 per GB-month for data storage.

Sail

Consider running one r8g.8xlarge EC2 Instance for Sail. This gives you 32 vCPUs and 256 GB of memory for $1,376.14 a month ($1.88512 per hour). Compared with Redshift, you get 8 extra vCPUs and 64 GB of extra memory, with no additional cost for data storage. This means you would save at least $3,383.46 per month by switching to Sail, making it 71% cheaper than Redshift while also providing more memory and CPU power.

If you don’t need as much compute power, you can run Sail on a smaller instance, such as r8g.4xlarge. This gives you 16 vCPUs and 128 GB of memory for $688.07 a month ($0.94256 per hour). This results in a saving of $4,071.53 per month compared to Redshift, making Sail 85% cheaper.

Conclusion

By switching from Amazon Redshift to Sail, you can reduce your cloud costs by 71% or more while gaining access to more powerful compute resources. In future posts, we will explore how Sail’s performance compares to Redshift. As Sail delivers significant performance gains and efficiency, these improvements could lead to requiring a much smaller instance, resulting in even greater cost savings.

LakeSail, Inc. © 2024