Essential Cloud Infrastructure: Core Services - Summary Notes

The Resource Manager lets a user hierarchically manage resources by Project, Folder, and Organization
The following illustration shows the Resource Manager from a policy perspective:

Fig.1

Policies contain a set of Roles and Members and policies are set on resources
Resources inherit policies from their parent. In other words, the effective policy on a resource is the union of its parent's policy and the policy set on the resource itself
If a parent policy is less restrictive, it overrides the more restrictive resource policy
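The inheritance rule above can be sketched in Python as a set union. This is only an illustration under stated assumptions, not the IAM API: the resource names, the `parent` map, the member addresses, and the `effective_policy` helper are all hypothetical, and a policy is modeled simply as a set of (role, member) bindings.

```python
# Hypothetical hierarchy: each resource maps to its parent (None for the root).
parent = {
    "org": None,
    "folder-a": "org",
    "project-1": "folder-a",
}

# Hypothetical policies set directly on each resource, as (role, member) bindings.
policies = {
    "org": {("roles/viewer", "alice@example.com")},
    "folder-a": set(),
    "project-1": {("roles/editor", "bob@example.com")},
}


def effective_policy(resource):
    """Return the union of the resource's own bindings and all ancestors' bindings."""
    bindings = set()
    node = resource
    while node is not None:
        bindings |= policies.get(node, set())
        node = parent[node]
    return bindings


print(sorted(effective_policy("project-1")))
```

Because the result is a union, a binding granted at the Organization level in this sketch cannot be taken away by a narrower policy further down, which is the "less restrictive parent wins" behavior described above.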
The following illustration shows the Resource Manager from a billing perspective:

Fig.2

IAM policies are inherited top to bottom while billing is accumulated from the bottom up
Resource consumption is measured in quantities like rate of use or time, number of items, or feature use
Because a resource belongs to only one Project, a Project accumulates the consumption of all its resources
Each Project is associated with one Billing Account, which means that an Organization contains all Billing Accounts
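The bottom-up accumulation described above can be sketched as two summing passes. This is a minimal illustration, not the Cloud Billing API: the usage records, the Project and Billing Account names, and the `accumulate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical usage records: (project, resource, cost in USD).
usage = [
    ("project-1", "vm-a", 12.50),
    ("project-1", "disk-a", 3.25),
    ("project-2", "vm-b", 20.00),
]

# Hypothetical mapping of each Project to its single Billing Account.
billing_account_of = {"project-1": "billing-x", "project-2": "billing-x"}


def accumulate(usage_records):
    """Sum resource costs per Project, then Project totals per Billing Account."""
    per_project = defaultdict(float)
    for project, _resource, cost in usage_records:
        per_project[project] += cost
    per_account = defaultdict(float)
    for project, total in per_project.items():
        per_account[billing_account_of[project]] += total
    return dict(per_project), dict(per_account)


per_project, per_account = accumulate(usage)
print(per_project)
print(per_account)
```

Each resource contributes to exactly one Project, and each Project to exactly one Billing Account, so no cost is counted twice.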
The following illustration shows the Organization node:

Fig.3

An Organization node is the root node for all GCP resources
In Fig.3 above, an individual, Bob, has control of the organizational domain through the Organization Admin role. Bob has delegated privileges and access to the individual Projects to Alice by making her a Project Creator
The following illustration shows how resource consumption accumulates to a Project:

Fig.4

Because a Project accumulates the consumption of all its resources, it can be used to track resources and quota usage
Projects let a user enable billing, manage permissions and credentials, and enable services and APIs
A Project can be identified by the Project Name, which is a human-readable way to identify a user's Projects
There is also the Project Number which is automatically generated by the server and assigned to a Project
There is the Project ID which is a unique ID that is generated from a Project Name
A user can find these 3 identifying attributes on the dashboard of the GCP console or by querying the Resource Manager API
The following illustration shows the resource hierarchy:

Fig.5

From a physical organization standpoint, resources are categorized as Global, Regional, or Zonal
Images, snapshots, and networks are Global resources
External IP addresses are Regional resources
Instances and disks are Zonal resources
Regardless of the type, each resource is organized into a Project. This enables each Project to have its own billing and reporting structure

The following illustration describes Project Quotas:

Fig.6

All resources in GCP are subject to Project Quotas (limits)
Quotas (limits) typically fall into one of 3 categories:
How many resources a user can create per Project (Ex: one can have only 5 VPC networks per Project)
How quickly a user can make API requests in a Project (Ex: one can make only 5 administrative actions per second per Project when using the Cloud Spanner API)
How many resources a user can create per Region (Ex: one can have only 24 vCPUs per Region)
As the use of GCP expands over time, a user's Quotas may increase accordingly
If a user expects a notable upcoming increase in usage, they can proactively request Quota adjustments from the Quotas page in the GCP Console
The following illustration describes the need for Project Quotas:

Fig.7

Project Quotas exist to prevent runaway consumption in case of an error or malicious attack (Ex: imagine one accidentally creates 100 VM instances instead of 10)
As Quotas are related to billing, they also prevent billing spikes or surprises
Quotas force sizing consideration and periodic review
Quotas are the maximum amount of resources one can create for that resource type as long as those resources are available
Quotas do NOT guarantee that resources will be available at all times
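A quota behaves like a ceiling that a request is checked against. The sketch below is a minimal illustration, not a GCP API: the `quota_limits` table and the `within_quota` helper are hypothetical, and the limits simply reuse the example figures quoted earlier in these notes.

```python
# Hypothetical quota limits, reusing the example figures from these notes.
quota_limits = {"vpc_networks": 5, "vcpus_per_region": 24}


def within_quota(metric, current_usage, requested):
    """Return True if current usage plus the request stays within the limit."""
    return current_usage + requested <= quota_limits[metric]


print(within_quota("vcpus_per_region", current_usage=16, requested=8))
print(within_quota("vcpus_per_region", current_usage=16, requested=16))
```

The first call prints True and the second prints False. Passing this check only means the request fits under the ceiling; as noted above, it does not guarantee that the resources are actually available.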

The following illustration describes the use of Labels:

Fig.8

Labels are a utility for organizing GCP resources. Labels are key-value pairs that one can attach to resources such as VMs, disks, snapshots, and images
A user can create and manage Labels using the GCP Console, gcloud command or the Resource Manager API
Each resource can have up to 64 Labels
Labels can be used in scripts to help analyze costs or to run bulk operations on multiple resources
The following illustration shows some examples of Labels:

Fig.9

Add Labels based on team or cost center to distinguish instances owned by different teams. This type of Label can be used for cost accounting or budgeting (Ex: team: marketing, team: research)
One can also use Labels to distinguish components (Ex: component: redis, component: frontend)
One can use Labels based on environment or stage (Ex: environment: prod, environment: test)
One should also consider using Labels to define an owner or a primary contact for a resource (Ex: owner: gaurav, contact: OPM)
One can add Labels to resources to define their state (Ex: state: in-use, state: readyfordeletion)
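As a sketch of how Labels can drive scripts for cost analysis or bulk operations, the Python below filters and groups an in-memory inventory. It does not call any GCP API: the `resources` list, its costs, and the `select` and `cost_by_label` helpers are hypothetical, and only the Label keys and values come from the examples above.

```python
from collections import defaultdict

# Hypothetical inventory: each resource carries a dict of Labels and a monthly cost.
resources = [
    {"name": "vm-1", "labels": {"team": "marketing", "environment": "prod"}, "cost": 40.0},
    {"name": "vm-2", "labels": {"team": "research", "environment": "test"}, "cost": 25.0},
    {"name": "disk-1", "labels": {"team": "marketing", "state": "readyfordeletion"}, "cost": 5.0},
]


def select(items, **wanted):
    """Return the resources whose Labels contain every wanted key-value pair."""
    return [r for r in items if all(r["labels"].get(k) == v for k, v in wanted.items())]


def cost_by_label(items, key):
    """Sum cost per value of one Label key; resources without the key are skipped."""
    totals = defaultdict(float)
    for r in items:
        if key in r["labels"]:
            totals[r["labels"][key]] += r["cost"]
    return dict(totals)


print([r["name"] for r in select(resources, state="readyfordeletion")])
print(cost_by_label(resources, "team"))
```

The first call picks out candidates for a bulk cleanup, and the second gives a per-team cost breakdown of the kind used for cost accounting.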
The following illustration compares Labels vs Tags:

Fig.10

Labels are user-defined strings in key-value format that are used to organize resources and they can propagate through billing
Tags, on the other hand, are user-defined strings that are applied to instances only and are mainly used in networking, such as for applying Firewall Rules
To help with project planning and controlling costs, one can set a budget. Setting a budget lets a user track how their spend is growing towards that amount
Set a budget name and specify which Project the budget applies to. Then one can set the budget at a specific amount or match it to the previous month's spend
After setting a budget amount, one can set the budget alerts. These alerts send emails to billing admins after spend exceeds a percentage of the budget or a specified amount
The email contains the Project Name, the percent of the budget that was exceeded, and the budget amount
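The alert rule above amounts to comparing spend against fractions of the budget. The following is a minimal sketch, not the Cloud Billing budgets API: the `triggered_alerts` helper, the default thresholds of 50%, 90%, and 100%, and the dollar figures are assumptions chosen for illustration.

```python
def triggered_alerts(budget, spend, thresholds=(0.5, 0.9, 1.0)):
    """Return the threshold fractions of the budget that spend has exceeded."""
    return [t for t in thresholds if spend > budget * t]


# Hypothetical numbers: a 1000 USD budget with 920 USD spent so far.
print(triggered_alerts(budget=1000.0, spend=920.0))
```

With these numbers the 50% and 90% alerts have fired and the 100% alert has not.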

The following illustration describes Site Reliability Engineering (SRE):

Fig.13

Monitoring is important to Google because it is at the base of Site Reliability Engineering (SRE)
SRE is a discipline that applies aspects of software engineering to operations, with the goal of creating ultra-scalable and highly reliable software systems
The following illustration describes Stackdriver Monitoring:

Fig.14

Stackdriver dynamically configures monitoring after resources are deployed and has intelligent defaults that allow one to easily create charts for basic monitoring activities
This allows one to monitor their platform, system, and application metrics by ingesting data such as metrics, events, and metadata. One can then generate insights from this data through dashboards, charts, and alerts
The following illustration describes the Stackdriver Workspace:

Fig.15

A Workspace is the root entity that holds monitoring and configuration information in Stackdriver Monitoring
Each Workspace can have between 1 and 100 monitored Projects including one or more GCP Projects and any number of AWS accounts
A Workspace contains the custom dashboards, alerting policies, uptime checks, notification channels, and group definitions that one uses with their monitored projects
A Workspace can access metric data from its monitored Projects
The metrics data and log entries remain in the individual Projects
The first monitored GCP Project in a Workspace is called the Hosting Project and must be specified when the Workspace is created. The name of that Project becomes the name of the Workspace
All Stackdriver users who have access to a Workspace have access to all its data by default. This means that a Stackdriver role assigned to one person on one Project applies equally to all Projects monitored by that Workspace
In order to give people different roles per Project and to control visibility to data, consider placing the monitoring of those Projects in separate Workspaces
Create alerting policies to notify a person when specific conditions are met. When a condition is met, users can be automatically notified through e-mail, SMS, or other channels
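An alerting policy pairs a condition with notification channels. The sketch below illustrates that idea only and is not the Stackdriver Monitoring API: the `condition_met` and `notify` helpers, the 80% threshold, the "three consecutive samples" rule, and the sample values are all assumptions.

```python
def condition_met(samples, threshold, min_consecutive):
    """Return True if the last min_consecutive samples (>= 1) all exceed threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def notify(channels, message):
    """Stand-in for real notification channels: return the messages that would be sent."""
    return [f"{channel}: {message}" for channel in channels]


# Hypothetical CPU utilization samples (fractions), newest last.
cpu = [0.42, 0.85, 0.91, 0.95]
if condition_met(cpu, threshold=0.8, min_consecutive=3):
    print(notify(["email", "sms"], "CPU utilization above 80%"))
```

Requiring several consecutive samples above the threshold is one common way to avoid notifying on a single short spike.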
Uptime Checks can be configured to test the availability of public services from locations around the world
The type of Uptime Check can be set to HTTP, HTTPS or TCP
The resource to be checked can be an App Engine application, a VM instance, a URL of a host, or an AWS instance or Load Balancer
Stackdriver Monitoring can access some metrics without a monitoring agent, such as CPU utilization, some disk traffic metrics, network traffic, and uptime information
To access additional system resources and application services, one should install the Monitoring Agent
The Stackdriver Monitoring Agent is supported for Compute Engine and EC2 instances
The following illustration shows the commands to install the Monitoring Agent:

Fig.16

The Monitoring Agent can be installed with two simple commands (as shown in Fig.16 above). This assumes that the VM instance is running Linux, is being monitored by a Workspace, and has the proper credentials for the agent
If the standard metrics provided by Stackdriver Monitoring do not fit a user's needs, they can create Custom metrics
The following illustration describes the use of Custom metrics:

Fig.17

Stackdriver Logging allows one to store, search, analyze, monitor, and alert on logged data and events from GCP and AWS
The following illustration describes Stackdriver Logging:

Fig.18

Stackdriver Logging is a fully managed service that performs at scale and can ingest application and system log data from thousands of VMs
Logging includes storage for logs, a user interface called the Logs Viewer, and an API to manage logs programmatically
Stackdriver Logging lets a user read and write log entries, search and filter logs, and create log-based metrics
Logs are only retained for 30 days, but a user can export logs to Cloud Storage Buckets, BigQuery datasets, and Cloud Pub/Sub Topics
Exporting logs to BigQuery allows a user to analyze logs and also visualize them in Data Studio
BigQuery runs extremely fast SQL queries on gigabytes to petabytes of data. This allows a user to analyze logs, for example network traffic to better understand traffic growth and forecast capacity, network usage to optimize network traffic expenses, or network forensics to analyze incidents
Using Cloud Pub/Sub a user can stream logs to applications or endpoints
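The search, filter, and log-based metric features mentioned above can be pictured with a small in-memory example. This is a sketch under stated assumptions, not the Stackdriver Logging API or its filter syntax: the entry fields, the `filter_entries` helper, and the `count_by` helper are hypothetical.

```python
from collections import Counter

# Hypothetical log entries; real entries carry many more fields.
entries = [
    {"severity": "ERROR", "resource": "vm-1", "message": "disk full"},
    {"severity": "INFO", "resource": "vm-1", "message": "started"},
    {"severity": "ERROR", "resource": "vm-2", "message": "timeout"},
    {"severity": "ERROR", "resource": "vm-1", "message": "disk full"},
]


def filter_entries(log, **fields):
    """Return the entries whose fields equal every given value."""
    return [e for e in log if all(e.get(k) == v for k, v in fields.items())]


def count_by(log, key):
    """A counter-style log-based metric: number of entries per value of key."""
    return Counter(e[key] for e in log)


errors = filter_entries(entries, severity="ERROR")
print(count_by(errors, "resource"))
```

A log-based metric in this sense is simply a count of the entries that match a filter, which can then be charted or alerted on like any other metric.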
The Stackdriver Logging Agent is supported for Compute Engine and EC2 instances
The following illustration shows the commands to install the Logging Agent:

Fig.19

Stackdriver Error Reporting counts, analyzes, and aggregates the errors in a user's running Cloud services
The following illustration describes Stackdriver Error Reporting:

Fig.20

A centralized Error Management interface displays the results with sorting and filtering capabilities, and one can even set up real-time notifications when new errors are detected
Stackdriver Error Reporting is generally available for the App Engine Standard Environment and is a beta feature for the App Engine Flexible Environment, Compute Engine, and AWS EC2
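Counting and aggregating errors, and spotting new ones, can be sketched as grouping by a key. This is an illustration only and does not reflect how Stackdriver Error Reporting groups errors internally: grouping by (type, location), the `aggregate` helper, and the sample events are assumptions.

```python
def aggregate(errors, known_groups):
    """Count errors per (type, location) group and list groups not seen before."""
    counts = {}
    new_groups = []
    for err in errors:
        group = (err["type"], err["location"])
        if group not in counts and group not in known_groups:
            new_groups.append(group)
        counts[group] = counts.get(group, 0) + 1
    return counts, new_groups


# Hypothetical error events and a set of groups that were already reported.
events = [
    {"type": "KeyError", "location": "cart.py:42"},
    {"type": "KeyError", "location": "cart.py:42"},
    {"type": "TimeoutError", "location": "pay.py:17"},
]
known = {("KeyError", "cart.py:42")}

counts, new_groups = aggregate(events, known)
print(counts)
print(new_groups)
```

Only the groups in `new_groups` would warrant a "new error detected" notification; repeats of a known group just increase its count.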
Stackdriver Trace is a distributed tracing system that collects latency data from user applications and displays it in the GCP console
The following illustration describes Stackdriver Trace:

Fig.21

A user can track how requests propagate through their application and receive detailed, near real-time performance insights
Stackdriver Trace automatically analyzes all of a user's application traces to generate in-depth latency reports that surface performance degradations. It can capture traces from App Engine, HTTP load balancers, and applications instrumented with the Stackdriver Trace API
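A latency report usually summarizes collected request latencies as percentiles. The sketch below shows one simple way to compute them and is not how Stackdriver Trace builds its reports: the nearest-rank method, the `percentile` helper, and the sample latencies are assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds collected from traces.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 18, 17, 19]

print(percentile(latencies_ms, 50))
print(percentile(latencies_ms, 95))
```

Here the median is 15 ms while the 95th percentile is 240 ms, which shows why a report that surfaces tail latency can reveal a degradation that the typical request hides.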
Stackdriver Debugger is a feature of GCP that lets a user inspect the state of a running application in real time, without stopping or slowing it
The following illustration describes Stackdriver Debugger:

Fig.22

The Stackdriver Debugger adds less than 10 milliseconds to the request latency when the application state is captured
Stackdriver Debugger allows one to understand the behavior of their code in production and analyze its state to locate hard-to-find bugs
With just a few mouse clicks, a user can take a snapshot of their running application state or inject a new logging statement
Stackdriver Debugger supports multiple languages, including Java, Python, Go, Node.js, and Ruby