PolarSPARC
Essential Cloud Infrastructure: Core Services - Summary Notes - Part 2
Bhaskar S | 11/02/2019
Storage and Database Services
This following table highlights the different storage service types (Object, Relational, Non-relational, and Warehouse), what each service is good for, and its intended use:
This following decision tree helps identify the solution that best fits an application need:
Cloud Storage
Cloud Storage is GCP's Object Storage service that allows worldwide storage and retrieval of any amount of data at any time
This following illustration shows the details about Cloud Storage:
Cloud Storage can be used for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via a direct download
The key features of the Cloud Storage are - scalable to exabytes of data, time to first byte is in milliseconds, has very high availability across all storage classes (see below), and has a single API across those storage classes (see below)
Cloud Storage is a collection of Buckets that one places Objects into
One can create directories so to speak, but really a directory is just another object that points to different objects in the Bucket
One has specific URLs to access objects
Cloud Storage has 4 types of Storage Classes - Regional, Multi-Regional, Nearline, and Coldline
This following illustration talks about the different Storage classes:
Regional storage enables one to store data at lower cost with the trade-off of the data being stored in a specific regional location instead of having redundancy distributed over a large geographic area
Regional storage is the recommended option when storing frequently accessed data in the same Region as the Compute Engine instances. This provides one with better performance for data intensive computations
One should also choose Regional storage for data governance reasons when the data needs to remain in a specific region
Multi-Regional storage is geo-redundant meaning the Cloud Storage stores users data redundantly in at least 2 geographic locations separated by at least 100 miles
Multi-Regional storage can be placed only in multi-regional locations such as the United States, the European Union, or Asia
Multi-Regional storage is appropriate for storing data that is frequently accessed (from different locations) such as website content, interactive workloads, or data supporting mobile and gaming applications
Nearline storage is a low cost highly durable storage service for storing infrequently accessed data
Nearline storage is a good choice when one plans to read or modify their data less than once a month due to its lower storage cost. However, there is an associated retrieval cost
Nearline storage is also a great choice if one wants to continuously add files to Cloud Storage and plan to access those files once a month for analysis
Nearline storage is also recommended for backups and serving long-tail multimedia content
Coldline storage is a very low cost, highly durable storage service for data archival, online backup, and disaster recovery
With Coldline storage, data is available within milliseconds
Coldline storage is the best choice for data that one plans to access at most once a year due to its lower storage costs. However, there is a higher retrieval cost
Coldline storage is the recommended solution when one wants to archive data or access it only in a disaster recovery event
All of the Storage Classes have 99.999999999% (eleven 9's) durability. What that means is that one would not lose the data, but may not be able to access it in a very rare event
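As an illustration (the bucket names are hypothetical), the Storage Class and location can be specified when creating a Bucket with the gsutil mb command:
# Create a Regional Bucket in us-east1
gsutil mb -c regional -l us-east1 gs://my-ps-regional-bucket
# Create a Nearline Bucket in the US multi-regional location
gsutil mb -c nearline -l US gs://my-ps-nearline-bucket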
This following illustration depicts the concepts of the Cloud Storage:
Cloud Storage is broken down into Buckets, Objects, and Access
Buckets are required to have a globally unique name and cannot be nested
The data that one puts into the Buckets are Objects that inherit the Storage Class of the Bucket
The Objects could be text files, document files, video files, etc., and there is no minimum size for those Objects
To access the data one can use the gsutil command or via the REST APIs
When a user uploads an Object to a Bucket, if they don't specify a Storage Class for the Object, the Object is assigned the Bucket's Storage Class as the default
This following illustration talks about changing the default Storage Class:
One can change the default Storage Class of a Bucket, but cannot change a Regional storage to a Multi-Regional storage and vice versa
Both Multi-Regional and Regional Buckets can be changed to Coldline or Nearline
One can change the Storage Class of an Object that already exists in a Bucket without moving the Object to a different Bucket or changing the URL to the Object
Setting a per Object Storage Class can be beneficial. For example, if a user has Objects in their Bucket that they want to keep but don't expect to access frequently, they can minimize costs by changing the Storage Class of those specific Objects to Nearline storage or Coldline storage
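As a sketch (the bucket and object names are hypothetical), the default Storage Class of a Bucket can be changed with gsutil defstorageclass, and the Storage Class of an individual Object with gsutil rewrite:
# Change the default Storage Class of the Bucket to Nearline
gsutil defstorageclass set nearline gs://my-ps-bucket-01
# Change the Storage Class of an existing Object to Coldline (the Object is rewritten in place)
gsutil rewrite -s coldline gs://my-ps-bucket-01/setup.html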
This following illustration talks about the access control relating to Buckets:
One can use IAM for the Project to control the following - which individual user or service account can see the Bucket, list the Objects in the Bucket, view the names of the Objects in the Bucket, or create new Buckets
Roles are inherited from the Project to Bucket to Object
Access Control Lists (ACLs) offer finer controls
Signed URLs provide a cryptographic key that gives time-limited access to a Bucket or Object
This following illustration talks about Access Control List:
ACL is a mechanism one can use to define who has access to their Buckets and Objects as well as what level of access they have
The maximum number of ACL entries one can create for a Bucket or an Object is 100
Each ACL consists of one or more entries. Each entry consists of 2 pieces of information - a Scope, which defines who can perform the specified actions (a user or group), and a Permission, which defines what actions can be performed (read or write)
The allUsers scope represents anyone who is on the Internet, with or without a Google account
The allAuthenticatedUsers scope represents anyone who is authenticated with a Google account
This following illustration talks about Signed URLs:
In situations when users do not have Google accounts, it is easier and more efficient to grant limited time access tokens that can be used by any user instead of using account-based authentication for controlling resource access. Signed URLs allow one to do this for Cloud Storage
For Signed URLs, one creates a URL that grants read access to a specific Cloud Storage resource and specifies when the access expires. That URL is signed using a private key associated with the service account. When the request is received, Cloud Storage can verify that the access-granting URL was issued on behalf of a trusted security principal (the service account)
Signed URLs should expire after some reasonable amount of time, for example, after 10 months
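For example (a sketch assuming a service account key file named key.json and a hypothetical bucket and object), a Signed URL valid for 10 minutes can be generated with gsutil signurl:
# Generate a Signed URL that expires in 10 minutes using the service account private key
gsutil signurl -d 10m key.json gs://my-ps-bucket-01/setup.html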
This following illustration talks about the additional features of Cloud Storage:
Cloud Storage supports Customer-Supplied Encryption Keys, allowing one to use their own encryption keys instead of Google-managed keys (the same capability is also used when attaching persistent disks to VMs)
Cloud Storage provides Object Lifecycle Management which lets one automatically delete or archive Objects
Cloud Storage supports Object Versioning which allows one to maintain multiple versions of Objects in their Buckets. However, there is cost associated with Versioning. One is charged for the Versions as if they were multiple files
Cloud Storage offers Directory Synchronization between a VM directory and a Bucket
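As an illustration of Directory Synchronization (the local directory and bucket path are hypothetical), gsutil rsync keeps a Bucket path in sync with a VM directory:
# Recursively synchronize the local directory ./website to the Bucket path
gsutil rsync -r ./website gs://my-ps-bucket-01/website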
This following illustration talks about Object Versioning:
In Cloud Storage, Objects are Immutable meaning an uploaded Object cannot change through its storage lifetime
To support the retrieval of Objects that are deleted or overwritten, Cloud Storage offers the Object Versioning feature
Object Versioning can be enabled for a Bucket
With Object Versioning enabled, Cloud Storage creates an archived version of an Object each time the live version of the Object is overwritten or deleted
The archived version retains the name of the Object, but is uniquely identified by a Generation Number, as shown in Fig.11 above as Object A (g1)
When Object Versioning is enabled, one can list archived versions of an object, restore the live version of an Object to an older state, or permanently delete an archived version as needed
Object Versioning can be turned on or off for a Bucket at anytime
Turning Object Versioning off leaves existing Object versions in place and causes the Bucket to stop accumulating new archived Object versions
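For example (assuming Versioning is enabled on a hypothetical bucket), archived versions can be listed and an older generation restored as the live version by copying it over the Object:
# List all versions (generations) of an Object
gsutil ls -a gs://my-ps-bucket-01/setup.html
# Restore an archived generation as the new live version (the generation number is hypothetical)
gsutil cp gs://my-ps-bucket-01/setup.html#1572199690809910 gs://my-ps-bucket-01/setup.html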
This following illustration talks about Object Lifecycle Management:
To support common use cases like setting a Time To Live (TTL) for Objects, archiving older versions of Objects, or downgrading Storage Classes of Objects to help manage costs, Cloud Storage offers Object Lifecycle Management
One can assign a Lifecycle Management configuration to a Bucket. The configuration is a set of rules that applies to all of the Objects in the Bucket
When an Object meets the criteria of one of the rules, Cloud Storage automatically performs a specified action on the Object. Some examples include: downgrade the Storage Class of Objects older than a year to Coldline storage, delete Objects created before a specific date, keep only the 3 most recent versions of each Object in a Bucket with Versioning enabled, etc
Object inspection occurs in asynchronous batches so the rules may not be applied immediately
Updates to the Lifecycle Management configuration may take up to 24 hours to go into effect. This means that when one changes the Lifecycle Management configuration, Lifecycle Management may still perform actions based on the old configuration for up to 24 hours
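As a sketch, a Lifecycle Management configuration is a JSON document containing a set of rules; the hypothetical example below downgrades Objects older than 365 days to Coldline storage and keeps only the 3 most recent versions of each Object in a versioned Bucket (it would be applied with gsutil lifecycle set, as shown in the hands-on section below):
{
  "rule": [
    {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"}, "condition": {"age": 365}},
    {"action": {"type": "Delete"}, "condition": {"numNewerVersions": 3}}
  ]
}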
This following illustration talks about Object change notification:
Object Change Notification can be used to notify an application when an Object is updated or added to a Bucket through a Watch Request
Completing a Watch Request creates a new Notification Channel. The notification channel is the means by which a notification message is sent to an application watching a Bucket
Currently the only type of notification channel supported is a WebHook
After a notification channel is initiated, Cloud Storage notifies the application anytime an Object is added, updated, or removed from the Bucket
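For example (the application endpoint URL is hypothetical), a Watch Request can be created from the command line with gsutil notification watchbucket:
# Create a notification channel (WebHook) that watches the Bucket for Object changes
gsutil notification watchbucket https://example.com/notify gs://my-ps-bucket-01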
This following illustration talks about data import services:
To upload terabytes or even petabytes of data there are 3 service options - Transfer Appliance, Storage Transfer Service, and Offline Media Import
Transfer Appliance is a hardware appliance one can use to securely migrate large volumes of data from hundreds of terabytes up to one petabyte to GCP without disrupting business operations
The Storage Transfer Service enables high-performance imports of online data. That data source can be another Cloud Storage Bucket, an Amazon S3 Bucket, or an HTTP/HTTPS location
Offline Media Import is a third-party service where physical media such as storage arrays, hard-disk drives, tapes, and USB flash drives is sent to a provider who uploads the data
This following illustration talks about data consistency in Cloud Storage:
Uploads to Cloud Storage are Strongly Consistent, meaning the Object is immediately available for download as well as metadata operations from any location where Google offers service. This is true whether one creates a new Object or overwrites an existing Object
Strong Global Consistency also extends to deletion operations on Objects
Bucket and Object listing are also Strongly Consistent
This following illustration depicts a decision tree for choosing a Storage Class:
Consider Coldline storage if data will be read less than once a year
Consider Nearline storage if data will be read less than once a month
Consider choosing Multi-Regional or Regional (based on locality needs) if data will be read and written often
Cloud SQL
This following illustration talks about the Cloud SQL service:
Cloud SQL is a fully managed service of either MySQL or PostgreSQL databases. This means that patches and updates are automatically applied, but one still has to administer the SQL users with the native authentication tools that come with these databases
This following illustration talks about the Cloud SQL instances:
Cloud SQL delivers high performance and scalability with up to 30 terabytes of storage capacity, 40,000 IOPS, and 416 gigabytes of RAM per instance. One can easily scale up to 64 processor cores and scale out with read replicas
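As a sketch (the instance name, machine tier, and region are hypothetical), a Cloud SQL instance can be created with the gcloud CLI:
# Create a MySQL 5.7 Cloud SQL instance in us-central1
gcloud sql instances create my-ps-sql-01 --database-version=MYSQL_5_7 --tier=db-n1-standard-2 --region=us-central1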
This following illustration talks about Cloud SQL services:
Cloud SQL offers a Replica service that can replicate data between multiple Zones, which is useful for automatic failover if an outage occurs
Cloud SQL also provides automated and on-demand backups with point-in-time recovery
Cloud SQL can also scale up, which does require a machine restart, or scale out using read Replicas
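For example (the instance names are hypothetical), a read Replica can be created against an existing primary instance:
# Create a read replica of the primary Cloud SQL instance
gcloud sql instances create my-ps-sql-01-replica --master-instance-name=my-ps-sql-01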
This following illustration talks about connecting to Cloud SQL instance:
When connecting from an application that is hosted within the same GCP Project as the Cloud SQL instance and is co-located in the same Region, choosing a Private IP connection (traffic not exposed to the Internet) will provide one with the most performant and secure connection
There are 3 options for connecting an application that is in another Region or Project, or outside of GCP, to the Cloud SQL instance. In those cases, use the Cloud Proxy, use a manual SSL connection, or use an unencrypted connection from a specific authorized IP address
Cloud Proxy handles authentication, encryption, and key rotation on behalf of a user
For manual control over the SSL connection, one can generate and periodically rotate the certificates themselves
One can use an unencrypted connection by authorizing a specific IP address to connect to the Cloud SQL instance over its external IP address
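As an illustration of the Cloud Proxy option (the project, region, instance name, and local port are hypothetical), the Cloud SQL Proxy binary runs locally and the application connects through it over localhost:
# Start the Cloud SQL Proxy, forwarding local port 3306 to the instance
./cloud_sql_proxy -instances=my-project:us-central1:my-ps-sql-01=tcp:3306 &
# Connect through the proxy using the standard MySQL client
mysql -h 127.0.0.1 -P 3306 -u root -p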
This following illustration shows the decision tree that helps one find the right data storage service with full relational capability:
Consider Cloud Spanner if one needs more than 30 terabytes of storage space, over 4,000 concurrent connections to the database, or if the application needs to scale and be available globally
Consider hosting one's own database on a VM using Compute Engine if there are specific OS requirements, custom database configuration requirements, or special backup requirements
For Cloud SQL, each CPU core is subject to a 250 MB/s network throughput cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 2000 MB/s
Cloud Spanner
This following illustration talks about Cloud Spanner:
Cloud Spanner combines the benefits of relational database structure with non-relational horizontal scale. It provides petabytes of capacity and offers transactional consistency at global scale, schemas, SQL, and automatic synchronous replication for high availability
Cloud Spanner Multi-Regional or Regional instances have different monthly up-time SLAs
This following illustration compares the characteristics of Cloud Spanner with other database services:
Like a relational database, Cloud Spanner has schema, SQL, and strong consistency
Like a non-relational database, Cloud Spanner offers high availability, horizontal scalability, and configurable replication
Cloud Spanner features allow for mission-critical use-cases such as building consistent systems for transactions and inventory management in the financial services and retail industries
This following illustration talks about Cloud Spanner replication:
A Cloud Spanner instance replicates data across Zones, which can be within one Region or across several Regions. The database placement is configurable, meaning one can choose which Region to put the database in. This architecture allows for high availability and global placement
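As a sketch (the instance and database names are hypothetical), the regional or multi-regional placement is chosen via the instance configuration when creating a Cloud Spanner instance:
# Create a single-node regional Cloud Spanner instance in us-central1
gcloud spanner instances create my-ps-spanner-01 --config=regional-us-central1 --nodes=1 --description="PolarSPARC Spanner"
# Create a database on the instance
gcloud spanner databases create my-ps-db --instance=my-ps-spanner-01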
This following illustration talks about Cloud Spanner data synchronization:
The replication of data in Cloud Spanner is synchronized across Zones using Google's global fiber network. Using atomic clocks ensures atomicity whenever the data is updated
This following illustration shows when to choose Cloud Spanner:
Consider Cloud Spanner in the following situations: one has outgrown relational databases, is sharding databases for higher throughput and performance, needs transactional consistency, global data, and strong consistency, or just wants to consolidate databases
Cloud Firestore
This following illustration talks about Cloud Firestore NoSQL database:
Cloud Firestore is a fast, fully managed, serverless, Cloud-native NoSQL document database that simplifies storing, syncing, and querying data for mobile, web, and IoT apps at global scale
Cloud Firestore supports ACID transactions. In other words if any of the operations in the transaction fail and cannot be retried, the whole transaction will fail
Cloud Firestore supports automatic Multi-Regional replication and Strong Consistency so the data is safe and available even when disasters strike
Cloud Firestore allows one to run sophisticated queries against the NoSQL data without any degradation in performance
This following illustration indicates Cloud Firestore is the next-generation Cloud Datastore:
Cloud Firestore is actually the next generation of Cloud Datastore
Cloud Firestore can operate in Datastore mode, making it backwards compatible with Cloud Datastore. By creating a Cloud Firestore database in Datastore mode, one can access Cloud Firestore's improved storage layer while keeping Cloud Datastore system behavior
Cloud Firestore queries are all Strongly Consistent, whereas Cloud Datastore was limited to Eventual Consistency
With Cloud Firestore, transactions are no longer limited to 25 entity groups
With Cloud Firestore, writes to an entity group are no longer limited to 1 per second
Cloud Firestore in native mode introduces new features such as a new Strongly Consistent storage layer, a collection and document data model, real-time updates, and mobile and web client libraries
A general guideline is to use Cloud Firestore in Datastore mode for new server projects and native mode for new mobile and web applications
This following illustration shows when to choose Cloud Firestore:
Consider using Cloud Firestore if the schema might change, one needs an adaptable database, one needs to scale to zero, or one wants low maintenance overhead scaling up to terabytes
Cloud Bigtable
This following illustration talks about another NoSQL called Cloud Bigtable:
Cloud Bigtable is a fully managed NoSQL wide-column database with petabyte-scale and very low latency. It seamlessly scales for throughput and learns to adjust to specific access patterns. However, it does NOT support transactional consistency
Cloud Bigtable is a great choice for both operational and analytical applications including IoT, user analytics, and financial data analysis because it supports high read and write throughput at low latency
Cloud Bigtable is a great storage engine for machine learning applications
Cloud Bigtable supports the open-source industry standard HBase API
This following illustration talks about Cloud Bigtable storage model:
Cloud Bigtable stores data in massively scalable tables, each of which is a sorted key-value map
The table is composed of rows, each of which typically describes a single entity, and columns which contain individual values for each row
Each row is indexed by a single Row Key
Columns that are related to one another are typically grouped together into a Column Family
Each column is identified by a combination of the column family and a Column Qualifier which is a unique name within the column family
Each row/column intersection can contain multiple Cells, or versions, at different timestamps, providing a record of how the stored data has been altered over time
Cloud Bigtable tables are sparse meaning if a Cell does not contain any data, it does not take up any space
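As an illustration of this storage model (assuming an existing Bigtable instance configured for the cbt tool via ~/.cbtrc; the table, row key, column family, and qualifier are hypothetical), a table, a column family, and a cell can be created and read back as follows:
# Create a table and a column family
cbt createtable user-events
cbt createfamily user-events activity
# Write a cell value (row key, family:qualifier=value)
cbt set user-events user1234#20191102 activity:page=/home
# Read the row back
cbt lookup user-events user1234#20191102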
This following illustration talks about how processing and storage are separate in Cloud Bigtable:
As Fig.32 above shows, processing, which is handled by a front-end server pool and nodes, is kept separate from the storage
A Cloud Bigtable table is sharded into blocks of contiguous rows called Tablets (similar to HBase Regions) to help balance the workload of queries
Tablets are stored on Colossus which is Google's file system in SSTable format
An SSTable provides a persistent, ordered, immutable map from keys to values where both keys and values are arbitrary byte strings
This following illustration talks about Cloud Bigtable node rebalancing:
Cloud Bigtable learns to adjust to specific access patterns. If a certain Bigtable node is frequently accessing a certain subset of data, Cloud Bigtable will update the indexes so that other nodes can distribute that workload evenly (see Fig.33 above)
This following illustration shows how Cloud Bigtable scales linearly:
Cloud Bigtable throughput scales linearly as more nodes are added
This following illustration shows when to choose Cloud Bigtable:
Consider using Cloud Bigtable if one needs to store more than one terabyte of structured data, have very high volumes of writes, need read write latency of less than 10 milliseconds along with strong consistency, or need a storage service that is compatible with the HBase API
The smallest Cloud Bigtable cluster one can create has 3 nodes and can handle 30,000 operations per second
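For example (the instance, cluster, and zone names are hypothetical), a minimal 3-node SSD production Cloud Bigtable instance can be created with gcloud:
# Create a 3-node SSD Cloud Bigtable production instance
gcloud bigtable instances create my-ps-bigtable-01 --display-name="PolarSPARC Bigtable" --cluster=my-ps-bt-cluster --cluster-zone=us-central1-b --cluster-num-nodes=3 --cluster-storage-type=SSD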
Cloud Memorystore
This following illustration talks about Cloud Memorystore:
Cloud Memorystore is a fully managed Redis service that provides an in-memory data store service built on scalable, secure, and highly available infrastructure managed by Google
Applications running on GCP can achieve extreme performance by leveraging the highly scalable and available secure Redis service without the burden of managing complex Redis deployments
Cloud Memorystore automates complex tasks like enabling high availability, failover, patching, and monitoring
High availability instances are replicated across 2 Zones and provide a 99.9 percent availability SLA
Cloud Memorystore can support instances of up to 300 gigabytes and network throughput of 12 gigabits per second
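As a sketch (the instance name, size, and region are hypothetical), a Standard Tier (highly available) Cloud Memorystore for Redis instance can be created with gcloud:
# Create a 5 GB Standard Tier (cross-zone replicated) Redis instance in us-central1
gcloud redis instances create my-ps-redis-01 --size=5 --region=us-central1 --tier=standard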
Hands-on with Cloud Storage
Download a sample file from the Internet using curl
curl http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html > setup.html
Create a Cloud Storage Bucket called my-ps-bucket-01
gsutil mb gs://my-ps-bucket-01
Copy the downloaded setup.html to the Bucket my-ps-bucket-01
gsutil cp setup.html gs://my-ps-bucket-01
The following will be the typical output:
Copying file://setup.html [Content-Type=text/html]...
/ [1 files][ 56.5 KiB/ 56.5 KiB]
Operation completed over 1 objects/56.5 KiB.
Get the default ACL for the Object setup.html from the Bucket my-ps-bucket-01
gsutil acl get gs://my-ps-bucket-01/setup.html
The following will be the typical output:
[ { "entity": "project-owners-874190793151", "projectTeam": { "projectNumber": "874190793151", "team": "owners" }, "role": "OWNER" }, { "entity": "project-editors-874190793151", "projectTeam": { "projectNumber": "874190793151", "team": "editors" }, "role": "OWNER" }, { "entity": "project-viewers-874190793151", "projectTeam": { "projectNumber": "874190793151", "team": "viewers" }, "role": "READER" }, { "email": "student-00-8509ae9e0f48@qwiklabs.net", "entity": "user-student-00-8509ae9e0f48@qwiklabs.net", "role": "OWNER" } ]
Set the ACL to private for the Object setup.html in the Bucket my-ps-bucket-01
gsutil acl set private gs://my-ps-bucket-01/setup.html
The following will be the typical output:
Setting ACL on gs://my-ps-bucket-01/setup.html...
/ [1 objects]
Operation completed over 1 objects.
Once again, get the ACL for the Object setup.html from the Bucket my-ps-bucket-01
gsutil acl get gs://my-ps-bucket-01/setup.html
The following will be the typical output:
[ { "email": "student-00-8509ae9e0f48@qwiklabs.net", "entity": "user-student-00-8509ae9e0f48@qwiklabs.net", "role": "OWNER" } ]
Update the ACL to make the Object setup.html in the Bucket my-ps-bucket-01 publicly readable
gsutil acl ch -u AllUsers:R gs://my-ps-bucket-01/setup.html
The following will be the typical output:
Updated ACL on gs://my-ps-bucket-01/setup.html
Once again, get the ACL for the Object setup.html from the Bucket my-ps-bucket-01
gsutil acl get gs://my-ps-bucket-01/setup.html
The following will be the typical output:
[ { "email": "student-00-8509ae9e0f48@qwiklabs.net", "entity": "user-student-00-8509ae9e0f48@qwiklabs.net", "role": "OWNER" }, { "entity": "allUsers", "role": "READER" } ]
Get the Lifecycle Management policy on the Bucket my-ps-bucket-01
gsutil lifecycle get gs://my-ps-bucket-01
The following will be the typical output:
gs://my-ps-bucket-01/ has no lifecycle configuration.
Create a lifecycle policy file called lifecycle.json with the following contents:
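{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 31}
    }
  ]
}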
Set the Lifecycle Management policy on the Bucket my-ps-bucket-01 using the policy file lifecycle.json
gsutil lifecycle set lifecycle.json gs://my-ps-bucket-01
The following will be the typical output:
Setting lifecycle configuration on gs://my-ps-bucket-01/...
Once again, get the Lifecycle Management policy on the Bucket my-ps-bucket-01
gsutil lifecycle get gs://my-ps-bucket-01
The following will be the typical output:
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 31}}]}
Get the Versioning status on the Bucket my-ps-bucket-01
gsutil versioning get gs://my-ps-bucket-01
The following will be the typical output:
gs://my-ps-bucket-01: Suspended
Enable Versioning on the Bucket my-ps-bucket-01
gsutil versioning set on gs://my-ps-bucket-01
The following will be the typical output:
Enabling versioning for gs://my-ps-bucket-01/...
Once again, get the Versioning status on the Bucket my-ps-bucket-01
gsutil versioning get gs://my-ps-bucket-01
The following will be the typical output:
gs://my-ps-bucket-01: Enabled
Once again, copy the downloaded setup.html to the Bucket my-ps-bucket-01
gsutil cp setup.html gs://my-ps-bucket-01
The following will be the typical output:
Copying file://setup.html [Content-Type=text/html]...
Created: gs://my-ps-bucket-01/setup.html#1572201130871188
Operation completed over 1 objects/56.1 KiB.
List all the versions of setup.html in the Bucket my-ps-bucket-01
gsutil ls -a gs://my-ps-bucket-01/setup.html
The following will be the typical output:
gs://my-ps-bucket-01/setup.html#1572199690809910
gs://my-ps-bucket-01/setup.html#1572201130871188
References
Coursera - Essential Cloud Infrastructure: Core Services
Essential Cloud Infrastructure: Core Services - Summary Notes - Part 1