AWS Resource Requirements
Lakehouse Optimizer requires and uses the following resources.
Required AWS Resources
1 Resource Group that contains the following:
Amazon EC2 Linux VM (Ubuntu/CentOS):
OS: Ubuntu Linux 20.04 or CentOS Linux 7.9
Recommended Type: t3.2xlarge or similar with minimum 8 cores
Docker Engine (version 23.0 or later) installed
50GB EBS Volume
Amazon RDS for SQL Server:
Instance type: db.t2.xlarge or similar (4 cores/16GB RAM)
Daily automated backups
Web Edition, engine version 15.00 (SQL Server 2019)
RDS requires two subnets in different availability zones
Security group requires inbound TCP on 1433
An application database (suggested name 'bplm' or 'lakehouse-monitor')
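The RDS instance described above can be provisioned with the AWS CLI. This is a minimal sketch, not the official deployment procedure; the identifier, subnet group, security group, and credentials are placeholder assumptions, and the DB subnet group must already span two availability zones. Note that RDS for SQL Server does not create an application database at launch, so 'bplm' is created afterwards with a SQL client.

```shell
# List currently available SQL Server 2019 Web Edition engine versions first,
# since RDS expects a full version string (e.g. 15.00.xxxx.x.v1):
aws rds describe-db-engine-versions --engine sqlserver-web \
  --query 'DBEngineVersions[].EngineVersion'

# Placeholder values throughout -- adjust to your environment.
aws rds create-db-instance \
  --db-instance-identifier lhm-sqlserver \
  --db-instance-class db.t2.xlarge \
  --engine sqlserver-web \
  --engine-version "<version-from-the-list-above>" \
  --license-model license-included \
  --master-username lhmadmin \
  --master-user-password 'CHANGE_ME' \
  --allocated-storage 100 \
  --backup-retention-period 7 \
  --db-subnet-group-name lhm-db-subnets \
  --vpc-security-group-ids sg-0123456789abcdef0
```

`--backup-retention-period 7` enables the daily automated backups listed above; once the instance is available, connect on port 1433 and run `CREATE DATABASE bplm;`.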
AWS Secrets Manager
Used to store sensitive passwords
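A secret can be pre-created with the AWS CLI; the secret name below is a hypothetical example, as the deployment scripts may use their own naming convention.

```shell
# Placeholder name and value -- the LHM deployment scripts may manage
# secrets under different names.
aws secretsmanager create-secret \
  --name lhm/sql-password \
  --description "Lakehouse Optimizer SQL Server password" \
  --secret-string 'CHANGE_ME'
```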
Amazon DynamoDB
On-Demand Capacity Mode (can be changed to Provisioned Mode after a period of 1-2 months of monitoring)
Standard Table class
On-demand backup (daily); TTL enabled on ALL tables (3 days max). The data is no longer required once the scheduled LHM analyzer runs have moved the aggregated data to SQL Server.
one set of tables per LHM deployment, in one AWS region (multi-region support in future releases)
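The table settings above (on-demand capacity, Standard table class, TTL) map to the following AWS CLI calls. The table name and key schema are placeholders for illustration; LHM creates its own tables, so this only shows which settings correspond to which flags.

```shell
# Placeholder table name and key schema.
aws dynamodb create-table \
  --table-name lhm-example \
  --attribute-definitions AttributeName=pk,AttributeType=S \
  --key-schema AttributeName=pk,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --table-class STANDARD

# Enable TTL on the table; the TTL attribute name is an assumption here.
aws dynamodb update-time-to-live \
  --table-name lhm-example \
  --time-to-live-specification "Enabled=true, AttributeName=ttl"
```

Switching to Provisioned Mode later is a single `aws dynamodb update-table --billing-mode PROVISIONED` call once 1-2 months of monitoring have established a baseline.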
Amazon SQS:
Free Tier: the first 1 million requests per month are free
Amazon Route 53:
create a DNS entry for the VM’s public IP address/hostname
the LHM install script installs Certbot to generate an SSL certificate for the VM; this requires a human-readable URL
alternatively, provide an SSL certificate from a trusted Certificate Authority at deployment time
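Creating the DNS entry can be sketched with the AWS CLI; the hosted zone ID, hostname, and IP address below are placeholders for your own zone and the VM's public IP.

```shell
# Placeholder zone ID, record name, and IP -- substitute your own.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "lhm.example.com",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```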
AWS Service Limitations
When choosing a region to deploy the AWS services into, be mindful of the services available in that region and their service quotas. More information on service quotas is available in this AWS PDF: https://docs.aws.amazon.com/pdfs/general/latest/gr/aws-general.pdf#aws-service-information
If necessary, request a quota increase before deploying the prerequisite AWS services by following the steps outlined here: https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html.
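Quota increases can also be requested from the CLI. A sketch, assuming the EC2 On-Demand vCPU quota as the example; the quota code shown is an assumption and should be confirmed with `list-service-quotas` first.

```shell
# Look up the quota code for the limit you need to raise:
aws service-quotas list-service-quotas --service-code ec2 \
  --query 'Quotas[].[QuotaCode,QuotaName]'

# Example request -- L-1216C47A is assumed to be the "Running On-Demand
# Standard instances" vCPU quota; verify against the listing above.
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 64
```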
AWS EC2 VM
Deployment with Docker Containers
Public IP, security group:
allow inbound traffic on ports 443 and 80 (TCP) for web traffic
inbound port 22 (TCP) for SSH configuration; can be closed later
Cloud Provider: AWS
Required
OS: Ubuntu Linux 20.04 or CentOS Linux 7.9
Docker Engine (version 23.0 or later) installed
| Host VM Specs | CPU | Memory (GB) | Disk (GB) |
|---|---|---|---|
| Recommended | 8 | 28+ | 30 |
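The security group rules listed earlier (80/443 for web traffic, 22 for SSH) can be applied with the AWS CLI. The group ID and CIDR ranges below are placeholders; in particular, restrict the SSH rule to your admin network rather than 0.0.0.0/0.

```shell
# Placeholder security group ID and CIDRs.
SG_ID=sg-0123456789abcdef0

# Web traffic.
for port in 80 443; do
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$port" --cidr 0.0.0.0/0
done

# SSH from an assumed admin CIDR; revoke once configuration is done:
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" --protocol tcp --port 22 --cidr 198.51.100.0/24
# aws ec2 revoke-security-group-ingress \
#   --group-id "$SG_ID" --protocol tcp --port 22 --cidr 198.51.100.0/24
```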
Databricks Service Principal
Lakehouse Optimizer leverages a Databricks service principal for API calls. This principal requires account admin privileges as well as workspace admin on any monitored workspaces. Lakehouse Optimizer uses OAuth authentication for service principals. Follow steps 1 and 2 in the Databricks documentation linked below to create the service principal and an OAuth secret for it. Save this secret for later use during deployment, where it is stored in Secrets Manager for use by the Lakehouse Optimizer application. Make sure you also add the service principal as a member of each workspace that you want monitored by Lakehouse Optimizer.
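As a sanity check after creating the OAuth secret, a token can be requested directly. This sketch assumes the Databricks account-level machine-to-machine OAuth token endpoint; the account ID and client credentials are placeholders taken from your Databricks account console.

```shell
# Placeholders -- fill in from the Databricks account console.
CLIENT_ID="<service-principal-application-id>"
CLIENT_SECRET="<oauth-secret>"
ACCOUNT_ID="<databricks-account-id>"

# Client-credentials grant against the account OAuth endpoint; a JSON
# response containing an access_token confirms the secret works.
curl -s -u "${CLIENT_ID}:${CLIENT_SECRET}" \
  -d 'grant_type=client_credentials&scope=all-apis' \
  "https://accounts.cloud.databricks.com/oidc/accounts/${ACCOUNT_ID}/v1/token"
```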
AWS Permission Policies
See Single AWS Account access policies for LHO for regular deployment or Cross AWS Account access policies for BPLM deployment for cross account AWS deployment.
Lakehouse Optimizer Agent Permissions
As part of fully configuring the Lakehouse Optimizer, an agent is deployed to your Databricks workflows and clusters. That agent requires write access to DynamoDB and Simple Queue Service (SQS), as outlined under the “LHO Agent Policy”. There are three options to enable this agent; the preferred approach is to configure both options 1 and 2, which can be configured automatically via the deployment scripts:
1. Create an IAM user and attach the LHO Agent Policy. This is the catch-all option, as it future-proofs monitoring for new compute resources created in any monitored workspace. Have an access key and secret for this user ready at deployment time, as you will be prompted to enter them. The access key and secret are stored in AWS Secrets Manager; they will also appear in plain text in the global init script, which is only accessible by workspace admins. This scenario also acts as a fallback if option 2 or 3 is not configured for all desired compute resources within monitored Databricks workspaces.
2. Create a role for the agent in IAM with the LHO Agent Policy attached, configuring that role’s trust policy to trust the instance profiles used in your Databricks environment(s) (the instance profile role will “assume” the LHO Agent IAM role, i.e., AWS IAM role chaining). This is the AWS-preferred access pattern. Only compute resources with instance profiles will be analyzed.
3. Extend the policies of all instance profiles used by compute resources to include the “LHO Agent Policy”. The drawbacks of this approach are that only compute resources using these configured instance profiles will be reported on, and it is currently a manual configuration.
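The role chaining in option 2 can be sketched as follows. The role name, account ID, and instance-profile role name are hypothetical placeholders, and the referenced LHO Agent Policy ARN must come from your own deployment.

```shell
# Trust policy letting a Databricks instance-profile role assume the
# agent role (AWS IAM role chaining). All names/ARNs are placeholders.
cat > lho-agent-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::111122223333:role/databricks-instance-profile-role"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

aws iam create-role \
  --role-name lho-agent-role \
  --assume-role-policy-document file://lho-agent-trust.json

# Then attach the LHO Agent Policy (placeholder ARN):
aws iam attach-role-policy --role-name lho-agent-role \
  --policy-arn arn:aws:iam::111122223333:policy/LHOAgentPolicy
```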
Tags to activate in Cost Manager
To obtain consumption data for Databricks-managed resources from AWS, cost allocation tags must be properly configured in AWS:
Enable Cost Explorer
Activate user-defined cost allocation tags:
ClusterId, DatabricksInstancePoolId, SqlEndpointId - these tags are automatically added to VM resources by Databricks
BplmWorkspaceId - used for reporting workspace storage and NAT Gateway costs
BplmMetastoreId - used for reporting the Unity Catalog metastore storage cost
These operations require an AWS payer account and usually take a day for AWS to complete.
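Activating the tags listed above can be done from the CLI as well as the Billing console; this sketch must be run from the management (payer) account, and a tag key can only be activated after it has appeared in billing data.

```shell
# Run from the payer account; tag keys must already exist in billing data.
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status \
    TagKey=ClusterId,Status=Active \
    TagKey=DatabricksInstancePoolId,Status=Active \
    TagKey=SqlEndpointId,Status=Active \
    TagKey=BplmWorkspaceId,Status=Active \
    TagKey=BplmMetastoreId,Status=Active
```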
AWS Networking Diagram
AWS Marketplace Considerations
Below are the mandatory billable services created during an AWS Marketplace deployment.
Networking components that will incur charges:
Virtual Private Cloud
Internet Gateway
A Route 53 DNS entry
EC2 - Compute resource used to host application container.
RDS - Required to store application data.
Secrets Manager - Used to store secrets securely used by the application, such as the SQL password or application registration client secret if using Entra ID single sign-on.
DynamoDB
Simple Queue Service