Jason Thorsness

Jul 12 24

S3 Trampoline

In my previous article, I posted a warning about cross-AZ transfer fees in AWS. These fees are charged for data transfer between AZs at a rate of $0.01 per GB. This can add up quickly when transferring large amounts of data — my 1 TiB demo cost me over $10! But by transferring the data through S3 instead of directly, not only is the fee avoided, but the data transfer is actually faster as well due to the efficiency of SingleStore Pipelines. I am calling this technique the S3 Trampoline, although I am sure there are some more experienced AWS customers who would call it obvious.

Cheaper with an S3 Trampoline

The key fact underlying this technique is that S3 does not charge for cross-AZ traffic. Data can be written from an EC2 instance in one AZ and read from a different AZ containing the SingleStore cluster. The data can be deleted as soon as it is loaded, so storage costs are negligible.

To use S3 this way and actually avoid the fee, you’ll need to set up a VPC gateway endpoint for the regional S3 service. You can learn about this here. It’s a simple procedure in the AWS console: from the VPC section, pick Endpoints, then Create Endpoint, and choose the regional S3 service, for example com.amazonaws.us-west-2.s3 of type Gateway.

Once the endpoint is added, wait a few minutes and an entry will appear automatically in your VPC’s route table as a prefix list. Once it appears, all S3 traffic from EC2 instances in that VPC will route through the gateway, and cheaper will be achieved.
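
If you prefer the CLI, the same setup is a single call. This is just a sketch; the VPC and route table IDs below are placeholders, so substitute your own:

# create the S3 gateway endpoint and attach it to a route table (IDs are placeholders)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-west-2.s3 \
  --route-table-ids rtb-0123456789abcdef0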

Faster with an S3 Trampoline

The general procedure to load data from an EC2 instance into SingleStore via S3 is as follows:

  1. Select a format supported by SingleStore Pipelines. CSV is a good choice, just like it was in my previous article for LOAD DATA LOCAL INFILE.
  2. Instead of sending buffers via LOAD DATA LOCAL INFILE, write the buffers as objects to S3.
  3. Create and start a pipeline in SingleStore that reads from the S3 bucket (see the sketch after this list).
  4. Delete files from S3 once the pipeline has finished with them.
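
As a rough sketch of steps 3 and 4, the pipeline DDL looks something like the following. The bucket, table, and credential values are placeholders matching the megapush example later in this article, and the exact CSV options depend on how the buffers were written:

-- read CSV objects from the bucket and load them into the target table
CREATE PIPELINE pushed_pipeline AS
LOAD DATA S3 'megapush'
CONFIG '{"region": "us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "AAA", "aws_secret_access_key": "BBB", "aws_session_token": "CCC"}'
INTO TABLE pushed
FIELDS TERMINATED BY ',';

START PIPELINE pushed_pipeline;

-- once the load completes, stop and drop the pipeline, then delete the objects (step 4)
STOP PIPELINE pushed_pipeline;
DROP PIPELINE pushed_pipeline;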

When using LOAD DATA LOCAL INFILE, the data is sent from the client to a SingleStore aggregator node and then to the leaf nodes, where it is stored. When using a pipeline, the leaf nodes read the data directly from S3 and then distribute it as needed. In practice the SingleStore engine can optimize this reading and distribution more effectively, and in my tests I’ve observed an end-to-end advantage of around 10% over the LOAD DATA LOCAL INFILE method.
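
For contrast, the baseline from the previous article pushes each buffer through an aggregator with a statement along these lines (the file path is illustrative only; megapush sends its buffers from memory over the client connection):

-- baseline: data flows client -> aggregator -> leaves
LOAD DATA LOCAL INFILE '/tmp/buffer.csv'
INTO TABLE pushed
FIELDS TERMINATED BY ',';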

Let’s See It In Action

I updated the megapush program to use the S3 Trampoline method when an S3 bucket name is provided at the end of the command line arguments.

First, set up a bucket and get some credentials. One way to do this is with an example script that can be run from AWS CloudShell. The script creates a bucket called “megapush” and an associated user and role, then fetches temporary credentials for the megapush program to use.

git clone https://github.com/jasonthorsness/megapush.git
cd megapush
chmod +x ./create_bucket.sh
./create_bucket.sh

Next, from a fast client machine — I used a c6a.8xlarge:

sudo yum install -y git
sudo yum install -y golang
git clone https://github.com/jasonthorsness/megapush.git
cd megapush
go build -o megapush .

Execute megapush with the bucket credentials and bucket name at the end:

AWS_REGION=us-west-2 \
AWS_ACCESS_KEY_ID=AAA \
AWS_SECRET_ACCESS_KEY=BBB \
AWS_SESSION_TOKEN=CCC \
./megapush \
  20 \
  yourendpoint.aws-oregon-3.svc.singlestore.com \
  3306 \
  admin \
  BA3UY7pOJ92bqKtn4CB50AN2tarzv58t \
  test \
  pushed \
  2147483648 \
  512 \
  megapush

Here’s how my run went, against the same S-16 instance I used in my previous article:

Generating test data locally (filling buffers)
20/20
Connecting to S3 and ensuring empty bucket
Start time:  2024-07-12 19:01:58
Starting upload to S3
Creating pipeline
Starting pipeline
1024 / 1024 GiB 16m10s
Stopping pipeline
Dropping pipeline
Waiting for cleanup of bucket
End time:  2024-07-12 19:18:08
Elapsed time:  16m10.188659959s
done

I consistently achieve higher performance using this method for a data load of this size. Not quite under 15 minutes, which is a target I am still working towards 😉.

Once finished, run the cleanup script included in the repository to delete the resources created by the create_bucket.sh script.
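
The exact invocation depends on the script’s name in the repository; assuming a counterpart to create_bucket.sh, it would look something like this:

# the script name below is a guess; check the megapush repository for the actual cleanup script
chmod +x ./delete_bucket.sh
./delete_bucket.sh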

Closing

When super-low latency is important, it’s hard to beat a direct transfer of a buffer via an INSERT statement or LOAD DATA LOCAL INFILE. But for larger data loads, where overall throughput over many minutes is what matters, it’s hard to beat SingleStore Pipelines — even if the data isn’t already in S3 and needs to be written there as part of the transfer!
