PolarSPARC
AWS Simple Storage Service (S3) - Quick Notes
Bhaskar S | Updated: 12/29/2023
AWS Simple Storage Service
AWS Simple Storage Service (S3 for short) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
The following is the summary of the various features/capabilities of S3:
Is a global public service that can be accessed from the Internet
Very high durability (protection against data loss or data corruption) of 99.999999999 percent (11 9s) of objects across multiple Availability Zones
Data stored in a container called a Bucket
Think of a bucket as a folder
Objects are stored as files in a bucket
Each bucket can store an UNLIMITED number of objects
Object sizes can range from 0 bytes to 5 TB
Buckets are defined at the REGION level
Buckets must have globally UNIQUE name across all Regions and all AWS accounts
Bucket names must be between 3 and 63 characters long and can contain lowercase letters, numbers, hyphens, and periods
There is NO hierarchy structure (sub-bucket) within a bucket
One can mimic a hierarchy by creating a Folder within a bucket
Objects (files) have a key (like a URL to the file), which is the full path to the object (including the bucket name)
Objects can be accessed by a unique key using one of the two forms:
https://<bucket>.s3.<region>.amazonaws.com/<key>
https://s3.<region>.amazonaws.com/<bucket>/<key>
The object key is composed of a Prefix (the part between the bucket name and the object name) followed by the object name
There are NO limits to the number of prefixes in a bucket (see the example after this list)
Delivers a strong read-after-write consistency
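For example (using a hypothetical bucket and key): given the object URL https://example-bucket.s3.us-east-1.amazonaws.com/logs/2023/app.log, the bucket is example-bucket, the object key is logs/2023/app.log, and the prefix is logs/2023/.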
Performance
The following are the various features/capabilities on S3 Performance:
Automatically scales to high request rates with a latency of 100-200 ms for the first byte from a bucket
At least 5,500 read (GET/HEAD) requests AND 3,500 update (PUT/COPY/POST/DELETE) requests per second per prefix in a bucket
Multipart upload recommended for files greater than 100 MB and a MUST for files greater than 5 GB
Multipart upload implies breaking a large file into parts and parallelizing the uploads to speed up transfers and increase throughput (see the sketch after this list)
Transfer Acceleration is used to transfer files to a CloudFront edge location, which then forwards the data to the destination bucket in the target Region over the AWS high-speed, low-latency private network backbone
One needs to enable Transfer Acceleration at the bucket level and ONLY pay for the data transfers that are accelerated
Byte-Range Fetches enables one to parallelize the retrieval operation by GETting specific byte ranges, which is better from resiliency aspect (one can retry only the failed parts)
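A minimal boto3 sketch of the above (bucket, key, and file names are hypothetical; boto3 performs a multipart upload automatically above the configured threshold):

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client('s3')

    # Force multipart upload for files larger than 100 MB, with parallel part uploads
    config = TransferConfig(multipart_threshold=100 * 1024 * 1024, max_concurrency=8)
    s3.upload_file('large-file.bin', 'example-bucket', 'data/large-file.bin', Config=config)

    # Byte-Range Fetch: retrieve only the first 1 MB of the object
    resp = s3.get_object(Bucket='example-bucket', Key='data/large-file.bin', Range='bytes=0-1048575')
    chunk = resp['Body'].read()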
Access Control
The default access for a bucket or an object is PRIVATE (only the resource owner has access).
The following are the methods of controlling access to buckets and objects:
Identity-based policies can be attached to users, groups, roles, or other AWS resources, granting access to buckets and objects
Identity-based policies can also be used when different buckets have different permission requirements
Resource-based policies could be Access Control Lists (or ACLs) or Bucket Policies
Resource-based ACL policies are NOT the preferred approach
Resource-based bucket policies apply at the bucket level and enable one to define access policy rules that apply to all objects in that bucket
Bucket policies allow one to grant CROSS-ACCOUNT access
The following is an example of a policy granting public read access to a bucket (note that "Version" is the fixed policy language version, not a user-chosen date):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicAccess",
      "Effect": "Allow",
      "Principal": "*",
      "Action": [ "s3:GetObject" ],
      "Resource": [ "arn:aws:s3:::example-bucket/*" ]
    }
  ]
}
Bucket policies are the preferred approach when one wants to keep the access control policies within the S3 environment
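A minimal boto3 sketch (bucket name is hypothetical) of attaching the above policy; note that the bucket's Block Public Access settings may also need to be relaxed:

    import json
    import boto3

    s3 = boto3.client('s3')

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicAccess",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-bucket/*"]
        }]
    }

    # Attach the bucket policy as a JSON string
    s3.put_bucket_policy(Bucket='example-bucket', Policy=json.dumps(policy))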
Access Points
Access Points simplify data access for any AWS application or service that stores data in S3. Access points are named network endpoints that are attached to buckets and can be used to perform S3 object operations. Each access point has distinct permissions and network controls that S3 applies for any request that is made through that access point. Each access point enforces a customized access point policy that works in conjunction with the bucket policy that is attached to the underlying bucket.
The following are the various features/capabilities of S3 Access Points:
Can be used to simplify access control management
Can be used to grant permissions based on bucket prefix
Can be used to ONLY accept requests from a VPC to restrict S3 data access to a private network
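A minimal boto3 sketch (account ID, access point name, bucket, and VPC ID are placeholders) of creating a VPC-restricted access point:

    import boto3

    s3control = boto3.client('s3control')

    # Create an access point that only accepts requests from the given VPC
    s3control.create_access_point(
        AccountId='123456789012',
        Name='example-vpc-ap',
        Bucket='example-bucket',
        VpcConfiguration={'VpcId': 'vpc-0123456789abcdef0'}
    )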
Static Website Hosting
One can use an S3 bucket to host a static website that is accessible from the Internet. It can only serve static web content as individual webpages, which may contain client-side scripts.
The website URL depends on the Region and can be in one of the two forms (note that website endpoints support ONLY HTTP):
http://<bucket>.s3-website.<region>.amazonaws.com
http://<bucket>.s3-website-<region>.amazonaws.com
Note that S3 static website hosting does NOT support server-side scripting.
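A minimal boto3 sketch (bucket and document names are hypothetical) of enabling static website hosting:

    import boto3

    s3 = boto3.client('s3')

    # Enable static website hosting with an index document and an error document
    s3.put_bucket_website(
        Bucket='example-bucket',
        WebsiteConfiguration={
            'IndexDocument': {'Suffix': 'index.html'},
            'ErrorDocument': {'Key': 'error.html'}
        }
    )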
Cross-Origin Resource Sharing (CORS)
It is a web browser-based security mechanism for allowing requests from the visited origin to other origins. An origin is the combination of the protocol (http/https), the domain (example.com), and the port. A web browser makes a preflight check to determine if a request to the other origin is allowed.
The following are the various features/capabilities of CORS:
If a client makes a cross-origin request on a static website bucket, the correct CORS headers need to be enabled on the bucket
One can allow either a specific origin or specify "*" to allow all origins
The cross-origin requests will not be fulfilled unless the target origin allows the request using the CORS header Access-Control-Allow-Origin
The CORS configuration on the target origin is enabled by setting the following options in a JSON rule, each of which controls the corresponding response header (see the sketch after this list):
AllowedOrigins (the Access-Control-Allow-Origin header)
AllowedMethods (the Access-Control-Allow-Methods header)
AllowedHeaders (the Access-Control-Allow-Headers header)
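A minimal boto3 sketch (bucket name and origin are hypothetical) of setting the above CORS rule:

    import boto3

    s3 = boto3.client('s3')

    # Allow GET requests from a single specific origin
    s3.put_bucket_cors(
        Bucket='example-bucket',
        CORSConfiguration={
            'CORSRules': [{
                'AllowedOrigins': ['https://www.example.com'],
                'AllowedMethods': ['GET'],
                'AllowedHeaders': ['*']
            }]
        }
    )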
Versioning
Versioning in S3 is a means of keeping multiple versions of an object in the same bucket. One can use the versioning feature to preserve, retrieve, and restore every version of every object stored in the buckets.
The following are the various features/capabilities of S3 Versioning:
Enabled at the bucket level
Once enabled, it CANNOT be disabled, but ONLY suspended
Objects stored in a bucket prior to enabling versioning have a version ID of null
Overwriting an object results in a new object version in the bucket. Note that the previous version(s) will also exist
Allows one to recover from unintentional user actions (accidental deletes or overwrites)
Deleting an object does not remove the object. Instead S3 inserts a DELETE MARKER and hides all the older version(s)
Older versions of an overwritten or deleted object can be retrieved by specifying a version ID
Deleting a specific version of an object causes that version to be permanently deleted (see the sketch after this list)
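A minimal boto3 sketch (bucket name and prefix are hypothetical) of enabling versioning and listing the versions of objects:

    import boto3

    s3 = boto3.client('s3')

    # Enable versioning on the bucket
    s3.put_bucket_versioning(
        Bucket='example-bucket',
        VersioningConfiguration={'Status': 'Enabled'}
    )

    # List all versions of objects under a prefix
    resp = s3.list_object_versions(Bucket='example-bucket', Prefix='docs/')
    for v in resp.get('Versions', []):
        print(v['Key'], v['VersionId'], v['IsLatest'])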
Multi-Factor Authentication (MFA) Delete
MFA Delete is an optional additional layer of protection from unintentional delete of an object version or changing the versioning state of the bucket.
The following are the various features/capabilities of S3 MFA Delete:
Only the bucket owner can enable/disable this option
Can be enabled ONLY using the AWS CLI (see the example command after this list)
Forces clients to provide the MFA code from a device before performing operations on S3
MFA code MUST be provided when one wants to permanently delete an object version
MFA code MUST be provided when disabling the versioning on a bucket where it is already enabled
Must set the HTTP header x-amz-mfa in all requests to S3
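Since this can ONLY be done via the AWS CLI, the command would look as follows (the bucket name, MFA device ARN, and code are placeholders):

    aws s3api put-bucket-versioning --bucket example-bucket \
        --versioning-configuration Status=Enabled,MFADelete=Enabled \
        --mfa "arn:aws:iam::123456789012:mfa/root-device 123456"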
Replication
One needs to explicitly enable Replication for an S3 bucket. There are two types of replication - Cross Region Replication (or CRR) and Same Region Replication (or SRR). A sketch of enabling replication follows the list below.
The following are the various features/capabilities of S3 Replication:
Versioning must be ENABLED on both the source and the destination bucket
Buckets can be in different AWS accounts
Must have an appropriate IAM role ENABLED that allows S3 to perform the replication
After enabling replication, only new or updated objects will be replicated
One can replicate the delete marker as well if the setting Delete Marker Replication is enabled
Deleting specific versions on the source will not delete on the target
Deletion of the delete markers will not be replicated
There is no support for chaining, meaning that if bucket-1 is replicated to bucket-2 and bucket-2 is replicated to bucket-3, bucket-1 objects are NOT replicated to bucket-3
For replication between two Regions, AWS uses Asynchronous Replication
Used for disaster recovery
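A minimal boto3 sketch (bucket names, account ID, and role ARN are placeholders) of enabling replication with delete marker replication:

    import boto3

    s3 = boto3.client('s3')

    # Replicate all objects to the destination bucket using a pre-created IAM role
    s3.put_bucket_replication(
        Bucket='example-source-bucket',
        ReplicationConfiguration={
            'Role': 'arn:aws:iam::123456789012:role/s3-replication-role',
            'Rules': [{
                'ID': 'replicate-all',
                'Status': 'Enabled',
                'Priority': 1,
                'Filter': {},
                'DeleteMarkerReplication': {'Status': 'Enabled'},
                'Destination': {'Bucket': 'arn:aws:s3:::example-destination-bucket'}
            }]
        }
    )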
Storage Classes
S3 offers different storage classes (or tiers) as follows:
Standard - General Purpose
For frequently accessed data
99.99 % availability
Low latency and high throughput
Can sustain concurrent loss of 2 facilities
Useful for Big Data Analytics, Mobile and Gaming, Content Distribution
Standard-Infrequent Access (IA)
For data that is less frequently accessed, but requires rapid access when needed
Charged separately for retrieval
99.9 % availability
Minimum storage duration of 30 days
Useful for Disaster Recovery, Backups
One Zone-Infrequent Access
For less frequently accessed reproducible/derived data
Stores the object data in only one Availability Zone
99.5 % availability
Minimum storage duration of 30 days
Useful for secondary backup of on-prem data
Glacier Instant Retrieval
Millisecond retrieval
Charged separately for retrieval
99.9 % availability
Minimum storage duration of 90 days
Useful for data that is retrieved once a quarter
Glacier Flexible Retrieval
Charged separately for retrieval
99.99 % availability
Minimum storage duration of 90 days
Has tiers of retrieval - Expedited (1 to 5 minutes), Standard (3 to 5 hours), and Bulk (5 to 12 hours)
Useful for data archival
Glacier Deep Archive
Charged separately for retrieval
99.99 % availability
Minimum storage duration of 180 days
Has tiers of retrieval - Standard (12 hours), and Bulk (48 hours)
Useful for long-term storage for compliance and regulatory needs
Intelligent Tiering
Charged a small monthly monitoring and auto-tiering fee
Automatically moves objects between access tiers based on usage patterns
No retrieval charges
The frequent access tier (automatic) is the default tier
The infrequent access tier (automatic) is for objects not accessed for 30 days
The archive instant access tier (automatic) is for objects not accessed for 90 days
The archive access tier (optional) is for objects not accessed in a configurable window from 90 days to 700+ days
The deep archive access tier (optional) is for objects not accessed in a configurable window from 180 days to 700+ days
Note that the durability is the SAME (11 9s) across all the storage classes. A sketch of uploading an object into a specific storage class follows.
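A minimal boto3 sketch (bucket, key, and file names are hypothetical) of uploading an object directly into the Standard-IA storage class:

    import boto3

    s3 = boto3.client('s3')

    # Upload an object directly into the Standard-IA storage class
    s3.put_object(
        Bucket='example-bucket',
        Key='backups/archive-2023.tar',
        Body=open('archive-2023.tar', 'rb'),
        StorageClass='STANDARD_IA'
    )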
Lifecycle Rules
Enables one to specify the Lifecycle Rules of object(s) in a bucket. Data objects have a natural lifecycle - starting from frequently accessed (hot), to less frequently accessed (warm), and finally to archive or backup (cold).
The following are the various features/capabilities of S3 Lifecycle Rules:
Used in conjunction with versioning
Lifecycle rules apply to a bucket
Rules can be created for specific prefixes and object tags
The following are the notable points about Transition actions:
Transition can be Standard -> Standard IA -> Intelligent Tiering -> One-Zone IA -> Glacier Instant -> Glacier -> Glacier Deep Archive
Cannot transition from any other storage tier to Standard
Cannot transition from Intelligent Tiering to Standard-IA
Cannot transition from One-Zone IA to Standard-IA or Intelligent Tiering or Glacier Instant
The following are the notable points about Expiration actions:
Can delete old versions of an object
Can delete an object after a certain period of time, say after 365 days (see the sketch after this list)
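A minimal boto3 sketch (bucket name and prefix are hypothetical) of a lifecycle rule combining Transition and Expiration actions:

    import boto3

    s3 = boto3.client('s3')

    # Transition objects under the logs/ prefix to Standard-IA after 30 days,
    # to Glacier Flexible Retrieval after 90 days, and expire them after 365 days
    s3.put_bucket_lifecycle_configuration(
        Bucket='example-bucket',
        LifecycleConfiguration={
            'Rules': [{
                'ID': 'log-lifecycle',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'logs/'},
                'Transitions': [
                    {'Days': 30, 'StorageClass': 'STANDARD_IA'},
                    {'Days': 90, 'StorageClass': 'GLACIER'}
                ],
                'Expiration': {'Days': 365}
            }]
        }
    )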
Select and Glacier Select
The following are the various features/capabilities of S3 Select and Glacier Select:
Allows one to use SQL expressions to select a subset of data from a large object (such as a compressed CSV file) in a bucket
Allows one to retrieve less data using SQL by performing server-side filtering
Reduces network transfer and lowers the CPU cost on the client side (see the sketch after this list)
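A minimal boto3 sketch (bucket, key, and column names are hypothetical) of a server-side filter on a GZIP-compressed CSV object:

    import boto3

    s3 = boto3.client('s3')

    # Filter rows server-side using SQL before transferring them to the client
    resp = s3.select_object_content(
        Bucket='example-bucket',
        Key='data/sales.csv.gz',
        ExpressionType='SQL',
        Expression="SELECT s.item, s.price FROM s3object s WHERE CAST(s.price AS FLOAT) > 100",
        InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}, 'CompressionType': 'GZIP'},
        OutputSerialization={'CSV': {}}
    )

    # The response is an event stream; collect the Records payloads
    for event in resp['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode('utf-8'))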
Data Encryption
Data Encryption refers to protecting data while it is in transit (as it travels to and from S3) and at rest (while it is stored on disks in S3). One can protect data in transit by using Secure Sockets Layer/Transport Layer Security (SSL/TLS) or client-side encryption. For protecting data at rest in S3, we have the following four options:
Server-Side Encryption with S3 Managed Keys (SSE-S3)
Enabled by default for new buckets and objects
Keys are handled, managed and owned by AWS
Objects are encrypted on the server-side
Uses AES 256 bit encryption
Must set the HTTP header "x-amz-server-side-encryption": "AES256" to request AWS to encrypt the object
Server-Side Encryption with Key Management Service (KMS) Managed Keys (SSE-KMS)
Uses Key Management Service (KMS) to manage encryption keys
Has a default KMS key associated with S3 which can be used for object encryption
Provides an audit trail (via CloudTrail) of who used the keys
Must set the HTTP header "x-amz-server-side-encryption": "aws:kms" to request AWS to encrypt the object (see the sketch after this section)
Server-Side Encryption with Customer-Provided Keys (SSE-C)
The customer is responsible for managing and providing the encryption keys
S3 performs the encryption and decryption of objects using the provided encryption keys
S3 DOES NOT store the customer encryption key
The encryption key must be passed in an HTTP header with every request
Client Side Encryption
The customers fully manage the keys and the encryption cycle
The customer is responsible for the encryption/decryption of the object before sending/after retrieving to/from S3
The customers can leverage the Amazon S3 Client-side Encryption library
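A minimal boto3 sketch (bucket, key, and file names are hypothetical) of requesting SSE-KMS encryption on upload:

    import boto3

    s3 = boto3.client('s3')

    # Request SSE-KMS encryption for the object (using the default S3 KMS key)
    s3.put_object(
        Bucket='example-bucket',
        Key='secure/report.pdf',
        Body=open('report.pdf', 'rb'),
        ServerSideEncryption='aws:kms'
    )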
Pre-Signed URL
The following are the various features/capabilities of S3 Pre-Signed URL:
Allows clients to get temporary access to an object in a private S3 bucket
With the shared pre-signed URLs, the clients inherit the permissions of the creator for the GET/PUT operations
Pre-signed URLs have an expiration duration after which they are NO longer valid (see the sketch after this list)
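A minimal boto3 sketch (bucket and key are hypothetical) of generating a pre-signed GET URL:

    import boto3

    s3 = boto3.client('s3')

    # Generate a pre-signed GET URL that is valid for one hour
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'example-bucket', 'Key': 'private/report.pdf'},
        ExpiresIn=3600
    )
    print(url)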
Access Logs
The following are the various features/capabilities of S3 Access Logs:
One can enable this option for audit purposes to log all access to a bucket
One MUST create a separate Logging bucket for this
Any request made to S3, from any account, authorized or not, will be logged into the logging bucket
The logging bucket MUST be in the same Region as the source bucket
One MUST grant permission to the S3 Log Delivery group on the logging bucket (see the sketch after this list)
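A minimal boto3 sketch (bucket names are hypothetical) of enabling access logs to a separate logging bucket:

    import boto3

    s3 = boto3.client('s3')

    # Send access logs for the source bucket to a separate logging bucket
    s3.put_bucket_logging(
        Bucket='example-bucket',
        BucketLoggingStatus={
            'LoggingEnabled': {
                'TargetBucket': 'example-logging-bucket',
                'TargetPrefix': 'access-logs/'
            }
        }
    )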
Event Notifications
The following are the various features/capabilities of S3 Event Notifications:
With this option, events are emitted by S3 when an object is created, removed, restored, replicated, etc
Filtering can be applied on objects, such as only image files (with the .jpg extension)
The target of the event can be SNS, SQS, or a Lambda Function
One needs to grant IAM permissions for the events to be processed. This can be achieved using the resource access policy on the SNS topic, the resource access policy on the SQS queue, or the resource policy attached to the Lambda Function
Useful for generating thumbnails of images whenever images are uploaded to a bucket
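A minimal boto3 sketch (bucket name and Lambda ARN are hypothetical) of wiring the thumbnail use case above:

    import boto3

    s3 = boto3.client('s3')

    # Invoke a thumbnail Lambda Function whenever a .jpg object is created
    s3.put_bucket_notification_configuration(
        Bucket='example-bucket',
        NotificationConfiguration={
            'LambdaFunctionConfigurations': [{
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:make-thumbnail',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.jpg'}]}}
            }]
        }
    )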
Object Lambda
The following are the various features/capabilities of S3 Object Lambda:
Allows one to use Lambda Functions to process the output of S3 GET requests and modify it before returning to the client
One needs to set up an S3 Object Lambda Access Point in addition to the standard access point for the bucket
An Object Lambda Access Point is associated with exactly one standard access point and thus with one S3 bucket
One can use the AWS pre-built Lambda Functions or use a custom Lambda Function
Useful for removing Personally Identifiable Information (PII) data from an object (see the sketch after this list)
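A minimal sketch of a custom Lambda Function behind an Object Lambda Access Point (the redaction logic is a placeholder; the event fields follow the getObjectContext structure that S3 passes to the function):

    import boto3
    import urllib.request

    def handler(event, context):
        ctx = event['getObjectContext']

        # Fetch the original object via the pre-signed URL supplied by S3
        original = urllib.request.urlopen(ctx['inputS3Url']).read().decode('utf-8')

        # Placeholder transformation - redact a hypothetical PII marker
        transformed = original.replace('SSN', '***')

        # Return the transformed object to the caller
        boto3.client('s3').write_get_object_response(
            RequestRoute=ctx['outputRoute'],
            RequestToken=ctx['outputToken'],
            Body=transformed
        )
        return {'statusCode': 200}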