PolarSPARC
AWS Analytics - Quick Notes
Bhaskar S | 01/12/2024
Amazon Athena
The following is the summary of the various features/capabilities of Athena:
Is a serverless interactive Query Service that makes it easy to analyze data directly from S3 using standard SQL
Helps one analyze unstructured, semi-structured, and structured data stored in S3
Supports formats such as CSV and JSON, as well as Columnar data formats such as Apache Parquet and Apache ORC
Integrates with AWS QuickSight for easy data visualization
Useful for business intelligence, reporting, and analyzing/querying logs from AWS services (VPC Flow Logs, ELB, CloudTrail)
Since cost is based on the amount of data scanned, prefer Columnar formats (Parquet, ORC) for cost optimization
Use AWS Glue to convert objects to Columnar format (Parquet, ORC)
Use data compression (gzip, snappy, etc) for faster retrievals and lower costs
Partition datasets in S3 using virtual columns (ex: s3://bucket/stocks/year=1990/month=1/)
Use larger files (greater than 128 MB) for optimal performance
Supports federated queries across data sources beyond S3 (such as RDS, DynamoDB, and on-prem sources) using a Data Source Connector that runs on Lambda
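The partitioning and cost notes above can be sketched as follows; this is an illustrative sketch (the bucket, dataset, and file names are hypothetical), showing the Hive-style `year=/month=` key layout that lets Athena prune partitions and scan less data:

```python
# Sketch: building a partitioned S3 key layout that Athena can prune.
# The bucket name, dataset name, and file name are hypothetical.

def partitioned_key(bucket: str, dataset: str, year: int, month: int, name: str) -> str:
    """Build an S3 key using Hive-style virtual partition columns (year=, month=)."""
    return f"s3://{bucket}/{dataset}/year={year}/month={month}/{name}"

key = partitioned_key("my-bucket", "stocks", 1990, 1, "prices.parquet")
print(key)  # s3://my-bucket/stocks/year=1990/month=1/prices.parquet

# A query that filters on the partition columns only scans matching prefixes:
query = (
    "SELECT symbol, close FROM stocks "
    "WHERE year = 1990 AND month = 1"
)
```

A query without the `WHERE` clause on the partition columns would scan every partition, which is exactly the cost the layout is designed to avoid.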
Amazon OpenSearch
The following is the summary of the various features/capabilities of OpenSearch:
Is a fully managed, petabyte scale service that makes it easy to deploy, operate, and scale OpenSearch clusters (OpenSearch is an open source fork of Elasticsearch)
Is a fully open source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis
Useful for searching, analyzing, and visualizing text and unstructured data
Can search on any field (even partial matches)
Support for queries using SQL syntax (using a plugin)
Backup using Snapshots
Two types of cluster modes - Managed cluster or Serverless cluster
Can ingest data from Kinesis Data Firehose, CloudWatch Logs, etc
Provides OpenSearch Dashboards for data visualization
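The search and SQL-plugin notes above can be made concrete with a minimal sketch; the index name (`logs`) and field name (`message`) are hypothetical, and the query body shape follows the standard OpenSearch query DSL:

```python
import json

# Sketch: an OpenSearch full-text query body. The index name "logs" and
# the field name "message" are hypothetical examples.
query_body = {
    "query": {
        "match": {                      # full-text match: analyzes the input
            "message": "timeout error"  # and matches docs containing the terms
        }
    },
    "size": 10,  # return at most 10 hits
}

# Roughly the same search, expressed through the SQL plugin instead:
sql_query = "SELECT * FROM logs WHERE message LIKE '%timeout%' LIMIT 10"

print(json.dumps(query_body))
```

In practice the JSON body would be sent to the cluster's `_search` endpoint (e.g. via the `opensearch-py` client), while the SQL string goes through the SQL plugin endpoint.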
Amazon Elastic Map Reduce (EMR)
The following is the summary of the various features/capabilities of Elastic Map Reduce:
Is a managed cluster platform that simplifies running big data frameworks such as Apache Hadoop or Apache Spark to process and analyze vast amounts of data
Used for processing, transforming, and moving data for analytics and business intelligence
The cluster can be made up of many EC2 Instances
Takes care of all the provisioning and configuration of the big data frameworks
Can integrate with Spot Instances for cost optimization
The Primary Node manages the cluster, coordinates the data and task distribution, and monitors the health
The Core Nodes run tasks and store data in HDFS
The optional Task Nodes only run tasks and typically run on Spot Instances
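The three node roles above map directly onto the instance groups in an EMR cluster request. The sketch below shows the general shape of such a request (instance types, counts, and names are hypothetical, and no AWS call is made here):

```python
# Sketch: the shape of an EMR cluster request with the three node roles.
# Instance types, counts, and names are hypothetical; nothing is launched.
job_flow_request = {
    "Name": "demo-cluster",
    "ReleaseLabel": "emr-6.15.0",          # bundles the big data framework versions
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",  # manages the cluster
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",       # runs tasks + stores HDFS data
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            {"Name": "task", "InstanceRole": "TASK",       # tasks only; Spot for cost
             "InstanceType": "m5.xlarge", "InstanceCount": 2, "Market": "SPOT"},
        ],
    },
}

roles = {g["InstanceRole"] for g in job_flow_request["Instances"]["InstanceGroups"]}
print(roles)
```

A request of this shape would be passed to boto3's `emr` client (`run_job_flow`); note the `Market: SPOT` setting confined to the Task group, matching the cost-optimization note above.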
Amazon QuickSight
The following is the summary of the various features/capabilities of QuickSight:
Is a fully managed, serverless Business Intelligence service that can be used to create interactive dashboards
Can connect to various data sources, such as AWS services (Athena, RDS, Aurora, Redshift, S3, OpenSearch, etc), 3rd party SaaS sources (Salesforce, JIRA, etc), and on-prem databases
Is fast, auto scalable, embeddable in websites, with per session pricing
The in-memory computation engine, called SPICE, responds with blazing speed when data is imported into QuickSight
AWS Glue
The following is the summary of the various features/capabilities of Glue:
Is a serverless Extract, Transform, and Load (ETL) service that makes it easy for users to discover, prepare, move, and load data from multiple sources
One can visually create, run, and monitor ETL pipelines to load data into a Redshift data warehouse
Runs the ETL jobs on a fully managed, scale-out Apache Spark environment
Uses Job Bookmarking to keep track of where a job is and pick up from where it left off (rather than starting from scratch) in cases of restarts
Can be used to convert data to Columnar format (Parquet, ORC)
The Data Catalog contains the metadata of datasets and is an index to the location, schema, and runtime metrics of the data from all the data stores
A Data Crawler is typically run to take an inventory of the data from all the data stores and write into the Data Catalog
Leverages the metadata from the Data Catalog for the data that is used as sources and targets to run the ETL jobs
The Data Catalog is used by other services such as Athena, EMR, etc., for data discovery
An Elastic View creates a virtual table (materialized view) that combines data across multiple data stores using SQL
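The Job Bookmarking behavior described above can be illustrated with a small simulation; Glue persists the bookmark state for you, but the resume logic is the same idea (the file names here are hypothetical):

```python
# Sketch: the idea behind Glue Job Bookmarking, simulated with an in-memory
# bookmark. Glue itself persists this state between job runs for you.

def run_job(files: list[str], bookmark: dict) -> list[str]:
    """Process only files not seen by a previous run, then advance the bookmark."""
    already_done = set(bookmark.get("processed", []))
    new_files = [f for f in files if f not in already_done]
    # ... transform each new file here (e.g. convert CSV to Parquet) ...
    bookmark["processed"] = sorted(already_done | set(new_files))
    return new_files

bookmark: dict = {}
first = run_job(["day1.csv", "day2.csv"], bookmark)
second = run_job(["day1.csv", "day2.csv", "day3.csv"], bookmark)
print(second)  # ['day3.csv'] -- the restart picks up where the job left off
```

This is why a restarted Glue job does not reprocess data it already loaded: the bookmark records what was consumed, and each run filters its input against it.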
AWS Lake Formation
The following is the summary of the various features/capabilities of Lake Formation:
Is a fully managed service that makes it easy to set up, secure, and manage a data lake stored in S3
A Data Lake is a governed centralized repository of all data (structured and unstructured) that can be used for analytics and machine learning
Helps one discover the data sources, catalog, cleanse, transform, and ingest data into the data lake
A Blueprint is a data management template for a data source that enables one to create a workflow to ingest data into the data lake
Provides pre-defined blueprints for several source types, such as CloudTrail logs, S3, RDS, and on-prem SQL/NoSQL databases
Has its own permissions model that augments the IAM permissions model to enable fine-grained access control (at row and column level) to data stored in the data lake
Leverages Glue capabilities (data crawler, data catalog, ETL) to ingest data into the data lake
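The effect of the fine-grained (column-level) permissions mentioned above can be simulated in a few lines; the roles, grants, and column names here are entirely hypothetical, and in the real service Lake Formation enforces this filtering before results reach the caller:

```python
# Sketch: the *effect* of Lake Formation column-level permissions, simulated
# as a filter over query results. Roles, grants, and columns are hypothetical.

GRANTS = {"analyst": {"symbol", "close"}}  # role -> columns it may see

def select(role: str, rows: list[dict]) -> list[dict]:
    """Return rows with only the columns the role is granted access to."""
    allowed = GRANTS.get(role, set())
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

rows = [{"symbol": "ABC", "close": 10.5, "trader": "alice"}]
print(select("analyst", rows))  # the 'trader' column is filtered out
```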
Amazon Managed Streaming for Apache Kafka (MSK)
The following is the summary of the various features/capabilities of Managed Streaming for Apache Kafka:
Is a fully managed service that enables one to build and run applications that use Apache Kafka to process streaming data
Provides the control-plane operations for creating, updating, and deleting clusters
Allows one to use the data-plane operations for producing and consuming data
The cluster is deployed in a customer VPC that spans multiple availability zones
Creates and manages Broker Nodes and ZooKeeper Nodes in different availability zones in the VPC
The serverless option manages the provisioning and scaling of both compute and the storage
The default message size is 1 MB and can be configured to go up to 10 MB (Kinesis Data Streams has a 1 MB limit)
Data retention can be as long as one needs (Kinesis Data Streams has a 1 year limit)
Consumers can be Apache Flink, Glue, Lambda, or custom code
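The 1 MB default and 10 MB configurable maximum mentioned above are controlled through standard Kafka broker properties, which MSK exposes via a custom cluster configuration. A minimal sketch of such a configuration (the 10 MB value is just the upper bound from the note above):

```python
# Sketch: raising the broker-side message size limit for an MSK cluster.
# message.max.bytes and replica.fetch.max.bytes are real Kafka broker
# properties; the 10 MB value illustrates the configurable upper bound.

MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # 10 MB

server_properties = "\n".join([
    f"message.max.bytes={MAX_MESSAGE_BYTES}",        # broker accepts up to 10 MB
    f"replica.fetch.max.bytes={MAX_MESSAGE_BYTES}",  # replicas must fetch as much
])
print(server_properties)
```

Producers and consumers must be configured consistently (e.g. the producer's `max.request.size`), otherwise large messages are rejected on the client side before reaching the broker.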
Amazon Managed Service for Apache Flink
The following is the summary of the various features/capabilities of Managed Service for Apache Flink:
Was formerly known as Kinesis Data Analytics for Apache Flink
Is a fully managed, auto scaled compute cluster for running any Flink application
Can source data from either Kinesis Data Streams or Amazon MSK (Kafka)
Use Java, Scala, or SQL to process and analyze streaming data
Useful for time-series analytics, real-time dashboards and metrics, etc
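The time-series analytics use case above typically means windowed aggregation over a stream. The sketch below simulates in plain Python the tumbling-window average a Flink application would compute (timestamps and readings are hypothetical):

```python
from collections import defaultdict

# Sketch: a tumbling-window average, the kind of aggregation a Flink
# streaming job performs. Simulated here over a static list of
# (timestamp_seconds, value) events; the readings are hypothetical.

def tumbling_window_avg(events: list[tuple[int, float]], window_secs: int) -> dict[int, float]:
    """Group events into fixed, non-overlapping windows and average each."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in events:
        buckets[ts - ts % window_secs].append(value)  # window start time
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

events = [(0, 1.0), (5, 3.0), (12, 10.0)]
print(tumbling_window_avg(events, 10))  # {0: 2.0, 10: 10.0}
```

In a real Flink job the same logic would be expressed with the DataStream API's window operators, with the service managing parallelism and state for you.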
AWS Batch
The following is the summary of the various features/capabilities of Batch:
Allows one to run batch computing workloads of any scale
It automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads
A Batch Job is a unit of work (such as a shell script, an executable, or a Docker container) that can be submitted for execution
A job is defined using a Job Definition, which is a template of the resources (cpu, memory, storage) needed for the job and other job dependencies
A submitted job goes into a Job Queue until it is scheduled onto the compute environment
Launches, manages, and terminates resources needed for a job to run
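The flow above (Job Definition templates resources, jobs wait in a Job Queue until the compute environment has capacity) can be sketched with a toy scheduler; all images, vCPU counts, and memory values are hypothetical:

```python
from dataclasses import dataclass

# Sketch: how a Job Definition templates resources and a Job Queue holds
# jobs until the compute environment has capacity. Values are hypothetical.

@dataclass
class JobDefinition:
    image: str        # container image to run
    vcpus: int        # cpu requirement from the job definition
    memory_mib: int   # memory requirement from the job definition

@dataclass
class ComputeEnvironment:
    free_vcpus: int

def schedule(queue: list[JobDefinition], env: ComputeEnvironment) -> list[JobDefinition]:
    """Dequeue jobs in order while the compute environment has vCPUs to spare."""
    running = []
    while queue and queue[0].vcpus <= env.free_vcpus:
        job = queue.pop(0)
        env.free_vcpus -= job.vcpus
        running.append(job)
    return running

queue = [JobDefinition("etl:latest", 2, 4096), JobDefinition("report:latest", 4, 8192)]
running = schedule(queue, ComputeEnvironment(free_vcpus=4))
print(len(running), len(queue))  # only the first job fits; the second stays queued
```

The real service does much more (priorities, fair-share scheduling, Spot provisioning), but the queue-until-capacity behavior is the core idea.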
References
Official AWS Athena Documentation
Official AWS OpenSearch Documentation
Official AWS EMR Documentation
Official AWS QuickSight Documentation
Official AWS Glue Documentation
Official AWS Lake Formation Documentation
Official AWS Kafka Documentation