A Study on using Google Cloud Storage with the S3 Compatibility API
Google Cloud Storage’s XML API provides interoperability with some of the client libraries that use S3. If you have existing applications that read and write data using the S3 API, SDKs, or client libraries, you can point them at GCS with configuration changes and minimal code change. While this compatibility offers ease, it is important to be aware of where things break. This post outlines some of those scenarios.
Sections
The post has 4 major sections
1. Where things break ( TL;DR )
2. Server Side Config
3. Client Side Config
4. Code Samples
Where things break
ACLs
There are some minor differences in the way AWS’s predefined/canned ACLs work and the way GCP’s canned ACLs work. Before that, a small refresher on ACL concepts.
ACLs have 2 Properties
1. Grantees ( Who gets access )
2. Scope ( How much access do they get )
ACLs are of 2 types
1. Canned ACLs ( Predefined Scopes & Grantees )
2. Custom ACLs ( Custom Scopes & Grantees )
ACLs can be Applied at 2 Levels
1. Bucket
2. Object
Differences in Canned ACLs
Compatibility breaks when upstream programs rely on the canned ACLs that have no equivalent on the other side (the rows marked with a dash below).
| AWS Canned ACL | GCP Canned ACL | Applies To |
|---------------------------|---------------------------|---------|
| private | private | Both |
| public-read | public-read | Both |
| public-read-write | public-read-write | Both |
| aws-exec-read | - | Both |
| authenticated-read | authenticated-read | Both |
| bucket-owner-read | bucket-owner-read | Object |
| bucket-owner-full-control | bucket-owner-full-control | Object |
| log-delivery-write | - | Bucket |
| - | project-private | Both |
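For the canned ACLs that do exist on both sides, the usual boto3 calls pass straight through (assuming uniform bucket-level access is not enabled on the bucket, since that disables ACLs entirely). A minimal sketch, reusing the placeholders from the Client Side Configuration section below:
import boto3
s3 = boto3.client('s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)
# 'public-read' exists on both sides, so this upload works unchanged;
# an AWS-only ACL such as 'aws-exec-read' has no GCS equivalent and fails
s3.put_object(Bucket=BUCKET_NAME, Key=OBJECT_NAME, Body=b'hello world', ACL='public-read')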
CORS
2 Types of CORS
1. Simple
2. Preflighted
While GCS and AWS S3 expose the same concepts in their CORS configuration, the field names and the way the configuration is specified differ, so reusing the S3 client libraries is not possible in this case.
| AWS CORS | GCP CORS |
|----------------|-----------------|
| AllowedHeaders | ResponseHeaders |
| AllowedMethods | Methods |
| AllowedOrigins | Origins |
| MaxAgeSeconds | MaxAgeSec |
So when setting up a CORS configuration on the bucket through the S3 client libraries, the request is rejected with a malformed-XML error, of the same kind as the lifecycle error shown in the next section.
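Setting CORS therefore has to go through a GCS-native path instead of boto3. A minimal sketch using the google-cloud-storage Python library (this client goes through the JSON API, so the field names are camelCase variants of the XML ones in the table above; the bucket name and origins are placeholders):
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
# Equivalent of Origins / Methods / ResponseHeaders / MaxAgeSec in the XML API
bucket.cors = [{
    'origin': ['https://example.com'],
    'method': ['GET', 'PUT'],
    'responseHeader': ['Content-Type'],
    'maxAgeSeconds': 3600,
}]
bucket.patch()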
Object Lifecycle Policy
While setting Object Lifecycle Policies is supported by the XML API, the request structure is different in the case of GCS, and you will receive this common error:
ClientError: An error occurred (MalformedLifecycleConfiguration) when calling the PutBucketLifecycleConfiguration operation: The XML you provided was not well-formed or did not validate against our published schema.
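As with CORS, the practical workaround is to manage lifecycle rules through a GCS-native client rather than boto3. A rough sketch with google-cloud-storage (the 30-day delete rule is only an illustrative assumption):
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket(BUCKET_NAME)
# Illustrative rule: delete objects older than 30 days
bucket.add_lifecycle_delete_rule(age=30)
bucket.patch()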
Object Integrity Check
If your application uses object integrity checks in its logic during upload, then you may want to read this. We encounter 2 broad scenarios in object integrity checks:
1. Single Part Upload
2. Multi-Part Upload
There are different types of file integrity checks:
1. CRC32C
2. MD5
3. ETags
How do these components relate? While uploading, the client needs to validate the checksum reported by the server against one computed locally, and how that checksum is derived differs between AWS and GCP, and between single-part and multi-part uploads.
AWS Multi-Part Upload
On multipart uploads, the ETag is computed by taking the binary encoding of each part’s MD5 hash, concatenating them together, doing an MD5 of that, hex-encoding the result, then appending a hyphen followed by the number of parts.
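As a concrete illustration of that rule, the sketch below reproduces the multipart ETag for a local file; PART_SIZE is a placeholder and must match the part size actually used for the upload:
import hashlib

def aws_multipart_etag(file_path, part_size):
    # Collect the binary MD5 digest of every part
    part_digests = []
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    # MD5 the concatenated digests, hex-encode, append '-<number of parts>'
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return combined + '-' + str(len(part_digests))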
GCP Multi-Part Upload
The crc32c of a GCP composite object is computed from the individual parts’ crc32c values; it corresponds to the CRC32C of the complete concatenated content and is reported base64 encoded.
A table comparing the differences:
| Provider | Single Part | Multi-Part |
|-----|-------------|---------------------------------------|
| GCP | MD5, CRC32C | b64 encoded CRC32C (CRC32Cs of Parts) |
| AWS | MD5 | Hex encoded MD5 ( MD5 of Parts ) |
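For the GCS side, a single-part upload can be validated by computing the base64-encoded CRC32C locally and comparing it with the hash GCS reports for the object (for example in the x-goog-hash response header). A rough sketch using the crcmod package, which is an assumption here (google-crc32c would work similarly):
import base64
import struct
import crcmod.predefined

def gcs_crc32c_b64(file_path):
    # CRC32C (Castagnoli), the checksum GCS uses
    crc32c = crcmod.predefined.mkCrcFun('crc-32c')
    with open(file_path, 'rb') as f:
        checksum = crc32c(f.read())
    # GCS reports the value as a base64-encoded big-endian 32-bit integer
    return base64.b64encode(struct.pack('>I', checksum)).decode('utf-8')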
Server Side Configuration
Step 1: Create a Custom Role (not mandatory); skip this step if a pre-defined role can be assigned to the principal.
Step 2: Create a Service Account/Accounts
Step 3: Add the Storage Admin role to the Service Account, or assign the Custom Role that was created
Step 4: Go to Cloud Storage, Copy Storage Endpoint
Step 5: Create HMAC Keys
Once you have the HMAC keys and the interoperability endpoint set up for that project, you are all set to use S3 interoperability.
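HMAC keys can be created from the console (Cloud Storage → Settings → Interoperability) or programmatically. A rough sketch with the google-cloud-storage library; PROJECT_ID and SERVICE_ACCOUNT_EMAIL are placeholders:
from google.cloud import storage
client = storage.Client(project=PROJECT_ID)
# Returns the key metadata plus the secret; the secret is only shown once
hmac_key, secret = client.create_hmac_key(service_account_email=SERVICE_ACCOUNT_EMAIL)
print('Access ID:', hmac_key.access_id)
print('Secret:', secret)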
Client Side Configuration
A shout-out to this Stack Overflow post for the configuration snippets below.
As a Resource
s3_resource = boto3.resource(service_name='s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)
As a Session
from boto3.session import Session
session = Session()
s3_session = session.resource(service_name='s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)
As a Client
s3_client = boto3.client('s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)
Or alternatively change the boto config file (used by the legacy boto library and gsutil):
cat /etc/boto.cfg
Add the right values
[Credentials]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
s3_host = storage.googleapis.com
Code Samples
Skipping the simple List Bucket, List Objects based examples as they are repetitive.
1. Create Bucket
2. Multipart Upload
3. Signed URLs
4. Object Versioning Enable
Create Bucket
Please note that if no location is specified, GCS creates the bucket in a multi-region location by default; specifying a LocationConstraint creates the bucket in that specific region.
import boto3
s3 = boto3.resource(service_name='s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)
s3.create_bucket(Bucket=BUCKET_NAME, CreateBucketConfiguration={'LocationConstraint': GCP_REGION_NAME})
Multi-part Upload
import os
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.resource(service_name='s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)

# Lower the multipart threshold so multipart upload kicks in for smaller files
config = TransferConfig(multipart_threshold=1024 * 25,
                        max_concurrency=10,
                        multipart_chunksize=1024 * 25,
                        use_threads=True)

file_path = os.path.dirname(__file__) + FILE_NAME
s3.Object(BUCKET_NAME, OBJECT_NAME).upload_file(file_path,
                                                ExtraArgs={'ContentType': 'xxx/yyy'},
                                                Config=config)
Signed URLs
A signed URL provides limited permissions and a limited time window to make a request. Signed URLs contain authentication information in their query string, allowing users without credentials to perform specific actions on a resource.
import boto3

s3 = boto3.resource(service_name='s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)

# generate_presigned_url returns the signed URL as a plain string
response = s3.meta.client.generate_presigned_url('get_object', Params={'Bucket': BUCKET_NAME, 'Key': OBJECT_NAME}, ExpiresIn=EXPIRATION)
print(response)
Object Versioning
Enable object versioning in a bucket
import boto3

s3 = boto3.resource(service_name='s3', endpoint_url=GCP_URL, aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY, region_name=GCP_REGION_NAME)

versioning = s3.BucketVersioning(BUCKET_NAME)
versioning.enable()

# status is a property, not a method; it returns 'Enabled' or 'Suspended'
print(versioning.status)