UCM Analysis Deepdive Guide

https://www.youtube.com/watch?v=4YdRK9k0SBQ

Assessment Steps & Understanding Errors

The Lakehouse Optimizer’s Unity Catalog Assessment can only see what Databricks allows it to see; it will return an error whenever Databricks won’t allow it to see something it requests.

If you are lucky, the run bar will be green at the end and say “Assessment Completed”, meaning the assessment process could see everything it needed to and your results are as accurate as possible.

However, more often than not it will say “Assessment Completed with Warnings” indicating that Databricks would not or could not provide all the information requested by the assessment process.

 

The assessment process is made up of a series of steps; to view them, click the “Show Previous Steps” link under the “Assessment Run Steps” section.

Most likely the aforementioned warnings will be part of the “Retrieve table extended information” step.

This step can run into data inconsistencies, usually either orphaned data references or unexpected file types that are not in Delta format but still hold actual data, such as CSV or Parquet files.

In such cases, the Databricks API returns an error, the Lakehouse Optimizer cannot accurately assess whether there is something to migrate, and the process logs the error it received when it requested the information from Databricks.
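
As a rough illustration of the checks involved, the sketch below probes a Hive Metastore table much the way this step does: look up its registered location, confirm the path still exists, and confirm the files there are actually Delta. It assumes a Databricks notebook where spark and dbutils are available, and the table name is a hypothetical placeholder.

    # Sketch of the checks that can fail during "Retrieve table extended information".
    # Assumes a Databricks notebook (spark/dbutils available); the table name is hypothetical.
    table_name = "hive_metastore.sales.orders_raw"  # hypothetical

    detail = spark.sql(f"DESCRIBE EXTENDED {table_name}").collect()
    location = next((r["data_type"] for r in detail if r["col_name"] == "Location"), None)
    provider = next((r["data_type"] for r in detail if r["col_name"] == "Provider"), None)

    try:
        files = dbutils.fs.ls(location)   # fails if the DBFS/external path was deleted (orphan)
        is_delta = any(f.name == "_delta_log/" for f in files)
        print(f"provider={provider}, delta_log_present={is_delta}")
    except Exception as e:
        print(f"Orphaned reference or inaccessible location: {e}")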

Investigating the Root Cause of Assessment Errors

Any errors or warnings shown should be investigated, as they indicate some amount of information is missing from the assessment results.

To do so we can scroll down to the bottom of the assessment page and click “Assessment Logs”.

Most of these errors will fall into one of the following categories (a rough way to check each one yourself is sketched after the list):

  1. Table does not exist

  2. Table is not a Delta table

  3. Access is denied to an external location
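
Along those lines, here is a minimal sketch of checking each category by hand, assuming a recent Databricks runtime with the Delta Lake Python API available and using a hypothetical table name:

    # Sketch of manually checking the three common failure categories.
    # Table name is hypothetical; assumes three-part names work with the catalog API.
    from delta.tables import DeltaTable

    table_name = "hive_metastore.finance.ledger_old"   # hypothetical

    if not spark.catalog.tableExists(table_name):
        print("1. Table does not exist")
    else:
        location = (spark.sql(f"DESCRIBE EXTENDED {table_name}")
                    .where("col_name = 'Location'")
                    .first()["data_type"])
        try:
            if not DeltaTable.isDeltaTable(spark, location):
                print("2. Not a Delta table (e.g. CSV or Parquet files at this location)")
        except Exception as e:
            # Storage permission failures surface here when the external location is unreadable.
            print(f"3. Access denied to the external location: {e}")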

Error Example

If we go to this schema and table in the Hive Metastore we will see the same error, and if we look in the Databricks File System we can see that this DBFS location no longer exists. This is simply an orphaned table reference that didn’t get deleted when its data was.

If we delete this table reference and rerun the assessment, we will have one less error.
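
A hedged sketch of that cleanup follows, with a hypothetical schema and table name; always confirm the data is really gone before dropping anything in your own environment.

    # Sketch of confirming and removing an orphaned Hive Metastore table reference.
    # Schema/table name is hypothetical; verify the location is really gone before dropping.
    table_name = "hive_metastore.staging.events_2019"   # hypothetical orphan

    location = (spark.sql(f"DESCRIBE EXTENDED {table_name}")
                .where("col_name = 'Location'")
                .first()["data_type"])

    try:
        dbutils.fs.ls(location)
        print("Data still exists; do not drop.")
    except Exception:
        # The path is gone, so only the metadata reference remains.
        spark.sql(f"DROP TABLE {table_name}")
        print(f"Dropped orphaned reference {table_name}")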

Hive Metastore Inventory

The Hive Metastore inventory catalogs all the different parts of your current Hive Metastore.

Number of Tables is the Foundation

The tables are of course the most important element, as the data tables are what will actually be moved from the Hive Metastore to Unity Catalog, so this number is the first indicator of the level of effort.

If the number is one hundred thousand, then migration will take a while; if it is a couple hundred, as you see here, it will take much less time.

It largely depends on the expertise of the person migrating the data, but an experienced Databricks engineer should be able to migrate a few thousand tables a week using standard scripting techniques.
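
As an illustration only, a scripted bulk copy of managed Delta tables might look something like the sketch below, which uses Delta deep clone and hypothetical catalog and schema names; external tables, views, non-Delta tables, and grants need separate handling (for example, Databricks’ SYNC command for external tables).

    # Minimal sketch of a scripted bulk migration loop using Delta deep clone.
    # Catalog/schema names are hypothetical; views, external and non-Delta tables
    # are not covered by this loop and need their own approach.
    source_schema = "hive_metastore.sales"      # hypothetical
    target_schema = "main.sales"                # hypothetical Unity Catalog schema

    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {target_schema}")

    for row in spark.sql(f"SHOW TABLES IN {source_schema}").collect():
        name = row["tableName"]
        spark.sql(
            f"CREATE TABLE IF NOT EXISTS {target_schema}.{name} "
            f"DEEP CLONE {source_schema}.{name}"
        )
        print(f"Cloned {source_schema}.{name} -> {target_schema}.{name}")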

Scenario: Many More Items (Notebooks, Files) than Tables

Another big factor, immediately evident in this example, is when workspace items like notebooks and files greatly outnumber the tables.

When we see this, it generally means there is a lot of ad hoc code in this workspace, and often that means lots of nearly identical notebooks, differing by only a line of code, referencing the same tables over and over again.

This adds complexity to migration because we can expect a lot of ad hoc table references that need to be handled inside the code for things to work and run in Unity Catalog.

That’s where one of the Lakehouse Optimizer Assessment’s key features comes in: the Active HMS Footprint.

Active Hive Metastore Footprint

The Active HMS Footprint highlights which tables, jobs, clusters, files and libraries have been actively used recently.

That doesn’t mean everything other than these 40 tables and these 15 files or notebooks can be written off as old and stale. It does mean that if you want to ensure your regular usage in this workspace is covered, you can focus on the active elements first to get up and running on Unity Catalog as soon as possible, and leave the stale tables and notebooks until a time when they are used again, if ever.

The Other Inventory Categories: Grants, Groups, Users, APCs, Jobs

The rest of the inventory items are all elements that interact with the tables differently in Unity Catalog.   

Users in Unity Catalog are raised to the account level so you can manage them all in one place rather than individually in each workspace.  

Grants are encouraged to be handled more at the group level and less individually, so you might find you can consolidate 300 or even 3,000 individual grants into a handful of groups.
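
As a hedged sketch of what that consolidation can look like once a group exists: the group, catalog, and schema names below are hypothetical, and group membership itself is managed at the account level rather than in this snippet.

    # Sketch of replacing many individual user grants with a single group grant.
    # Group and schema names are hypothetical; the group also needs USE CATALOG
    # on the parent catalog, and membership is managed at the account level.
    analysts_group = "data-analysts"            # hypothetical account-level group
    target_schema = "main.sales"                # hypothetical Unity Catalog schema

    spark.sql(f"GRANT USE SCHEMA ON SCHEMA {target_schema} TO `{analysts_group}`")
    spark.sql(f"GRANT SELECT ON SCHEMA {target_schema} TO `{analysts_group}`")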

Recommendations

Our recommendations create a virtual checklist of what needs to happen in a migration, with links to Databricks documentation to help you walk through the process of migrating the actual data if you are planning to do it on your own, or just want to know more about the process. 

Export Assessment to Unity Catalog

If you really want to dive into the individual data, you can export the results of the report into Unity Catalog, where they can be viewed and queried.

If you are going to be performing your own migration, then these results can serve as the foundation of the scripts you will write to perform the actual actions that make up the migration. 

Inside Databricks we can open up the full list of tables, databases, grants, and all the other moving parts of the Hive Metastore, see the current information for each, and access it all through scripts inside Databricks.

With that, you can begin to build configurations for scripts that move and rename these elements based on their attributes; for example, you might want to consolidate all grants with a specific permission into one group.
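
For example, here is a sketch of reading the exported results back and deriving a small grant-consolidation starting point; the catalog, table, and column names are hypothetical placeholders rather than the actual export schema, so substitute whatever names your export produced.

    # Sketch of querying exported assessment results to drive migration scripting.
    # The catalog/table/column names below are hypothetical placeholders.
    from pyspark.sql import functions as F

    grants = spark.table("lho_assessment.results.hms_grants")    # hypothetical export table

    # Collect every principal holding SELECT on each database into one candidate group,
    # as a starting point for grant consolidation.
    candidate_groups = (grants
        .where(F.col("action_type") == "SELECT")                 # hypothetical column
        .groupBy("database")                                      # hypothetical column
        .agg(F.collect_set("principal").alias("principals")))     # hypothetical column

    candidate_groups.show(truncate=False)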

Whatever your programmatic solution, this assessment provides a good foundation to build on.