In my previous article, I posted a warning about cross-AZ transfer fees in AWS. These fees are charged for data transfer between availability zones at a rate of $0.01 per GB, which adds up quickly at scale: my 1 TiB demo cost me over $10 (1 TiB is roughly 1,100 GB, so about $11 at $0.01/GB). But by routing the data through S3 instead of transferring it directly, not only is the fee avoided, the transfer is actually faster thanks to the efficiency of SingleStore Pipelines. I am calling this technique the S3 Trampoline, although I am sure some more experienced AWS customers would call it obvious.
The key fact underlying this technique is that S3 does not charge for cross-AZ traffic. Data can be written to S3 from an EC2 instance in one AZ and read back by the SingleStore cluster in a different AZ. The data can be deleted as soon as it is loaded, so storage costs are negligible.
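In concrete terms, the round trip looks like this (the bucket and file names here are just illustrative; the delete at the end is what keeps the storage cost near zero):
aws s3 cp data.csv s3://megapush/data.csv   # written from an instance in, say, us-west-2a
# ... SingleStore, running in a different AZ, loads the object ...
aws s3 rm s3://megapush/data.csv            # delete immediately after the load completes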
To use S3 correctly and achieve this benefit, you'll need to set up a VPC gateway endpoint for the regional S3 service (the AWS documentation covers gateway endpoints in detail). This is a simple procedure from the AWS console: from the VPC section, pick Endpoints, then Create Endpoint, and then pick the regional gateway, for example com.amazonaws.us-west-2.s3 of type Gateway.
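If you'd rather script it than click through the console, the equivalent AWS CLI call looks roughly like this (the VPC and route table IDs are placeholders for your own):
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-west-2.s3 \
  --route-table-ids rtb-0123456789abcdef0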
Once added, wait a few minutes and the endpoint will appear automatically in your VPC's route table as a prefix list. Once it appears, all traffic from your EC2 instances in that VPC to S3 will route through the gateway, and you get the cheaper path.
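To confirm the route was added, describe the route table (again with a placeholder ID) and look for an entry whose destination is a pl- prefix list pointing at the endpoint:
aws ec2 describe-route-tables --route-table-ids rtb-0123456789abcdef0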
The general procedure to load data from an EC2 instance into SingleStore via S3 is as follows (the run log later in this article walks through these same steps):
1. Upload the data from the EC2 instance to an S3 bucket.
2. Create a SingleStore pipeline that reads from the bucket.
3. Start the pipeline and wait for it to load everything.
4. Stop and drop the pipeline.
5. Delete the objects from the bucket.
When using LOAD DATA LOCAL INFILE, the data is sent from the client to a SingleStore aggregator node and then on to the leaf nodes, where it is stored. When using a pipeline, the leaf nodes read the data directly from S3 and then distribute it among themselves as needed. In practice the SingleStore engine can optimize this reading and distribution more effectively, and in my tests I observed an end-to-end advantage of around 10% over the LOAD DATA LOCAL INFILE method.
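For reference, here is a minimal sketch of the pipeline statements involved, following the SingleStore CREATE PIPELINE syntax. The pipeline name and the CSV field format are my own assumptions for illustration; the bucket, table, and credential placeholders match the demo below:
CREATE PIPELINE pushed_from_s3 AS       -- hypothetical pipeline name
LOAD DATA S3 'megapush'                 -- bucket created by create_bucket.sh
CONFIG '{"region": "us-west-2"}'
CREDENTIALS '{"aws_access_key_id": "AAA",
              "aws_secret_access_key": "BBB",
              "aws_session_token": "CCC"}'
INTO TABLE pushed
FIELDS TERMINATED BY ',';               -- assuming CSV-formatted objects

START PIPELINE pushed_from_s3;
-- ... wait for the load to complete, then tear down ...
STOP PIPELINE pushed_from_s3;
DROP PIPELINE pushed_from_s3;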
I updated the megapush program to use the S3 Trampoline method when an S3 bucket name is provided at the end of the command line arguments.
First, set up a bucket and get some credentials. One way to do this is via an example script that can be executed from AWS CloudShell. This script creates a bucket called "megapush" with an associated user/role, and obtains temporary credentials for use by the megapush program.
git clone https://github.com/jasonthorsness/megapush.git
cd megapush
chmod +x ./create_bucket.sh
./create_bucket.sh
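I won't reproduce the script here (the version in the repository is authoritative), but a rough, hypothetical sketch of the kind of setup it performs looks like this:
# sketch only; the role name and policy files are placeholders
aws s3 mb s3://megapush --region us-west-2
aws iam create-role --role-name megapush-role \
  --assume-role-policy-document file://trust.json
aws iam put-role-policy --role-name megapush-role \
  --policy-name megapush-s3 --policy-document file://policy.json
# temporary credentials (the AAA/BBB/CCC values used later)
aws sts assume-role --role-arn arn:aws:iam::<account-id>:role/megapush-role \
  --role-session-name megapush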
Next, from a fast client machine — I used a c6a.8xlarge:
sudo yum install -y git
sudo yum install -y golang
git clone https://github.com/jasonthorsness/megapush.git
cd megapush
go build -o megapush .
Execute megapush with the bucket credentials and bucket name at the end:
AWS_REGION=us-west-2 \
AWS_ACCESS_KEY_ID=AAA \
AWS_SECRET_ACCESS_KEY=BBB \
AWS_SESSION_TOKEN=CCC \
./megapush \
  20 \
  yourendpoint.aws-oregon-3.svc.singlestore.com \
  3306 \
  admin \
  BA3UY7pOJ92bqKtn4CB50AN2tarzv58t \
  test \
  pushed \
  2147483648 \
  512 \
  megapush
Here’s how my run went, against the same S-16 instance I used in my previous article:
Generating test data locally (filling buffers)
20/20
Connecting to S3 and ensuring empty bucket
Start time: 2024-07-12 19:01:58
Starting upload to S3
Creating pipeline
Starting pipeline
1024 / 1024 GiB 16m10s
Stopping pipeline
Dropping pipeline
Waiting for cleanup of bucket
End time: 2024-07-12 19:18:08
Elapsed time: 16m10.188659959s
done
I consistently achieve higher performance with this method for this sizeable data load. Not quite under 15 minutes yet, which is a target I am still working towards 😉.
Once finished, you can run the cleanup script I also included in the repository to delete the resources created by create_bucket.sh.
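Again, the real script is in the repository; as a rough, hypothetical sketch, the cleanup amounts to something like this (names match the placeholder sketch above):
aws s3 rb s3://megapush --force   # empty and remove the bucket
aws iam delete-role-policy --role-name megapush-role --policy-name megapush-s3
aws iam delete-role --role-name megapush-role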
When super-low latency is important, it's hard to beat a direct transfer of a buffer via an INSERT statement or LOAD DATA LOCAL INFILE. But for larger loads, when overall throughput over many minutes is what matters, it's hard to beat SingleStore Pipelines, even if the data isn't already in S3 and needs to be written there as part of the transfer!