Configure the Parquet store file system
Conduit storage location
Dataset caches and materialization assets are stored in Parquet format, hence the name “Parquet store”.
In Conduit, the following file systems (storage types) are supported for storing data source caches and data materialization:
Azure Blob Storage (abfs)
S3 (s3)
Google Cloud Storage (gcs)
HDFS (hdfs)
Local file system (file)
Supported storage file systems
Azure Cloud Storage
General Prerequisites
The Azure Blob storage account must already exist.
The storage account must have "Enable hierarchical namespace" checked.
All other settings can be left at their defaults, unless your own system configuration (not Conduit) requires otherwise,
e.g. networking, data protection, tags and others.
See the section below on how to create a storage account.
The Azure Blob container must already exist.
No special settings are required; all settings can be left at their default values.
See the section below on how to create a container.
How to create an Azure Blob storage account:
Step 1)
Step 2) enable hierarchical namespace
How to create an Azure Blob container:
Navigate to your Azure Blob Storage account and click on "Containers" on the left panel.
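For readers who prefer the command line to the portal, the same prerequisites can likely be met with the Azure CLI. The following is only a sketch, assuming the az tool is installed and logged in. The function names are hypothetical, and the SKU shown is an arbitrary choice, not a Conduit requirement.

```shell
#!/bin/sh
# Sketch only: assumes the Azure CLI (az) is installed and logged in.
# Function names are hypothetical; all values are supplied by the caller.

# Create a StorageV2 account with hierarchical namespace enabled.
create_hns_storage_account() {
    account="$1"; resource_group="$2"; location="$3"
    az storage account create \
        --name "$account" \
        --resource-group "$resource_group" \
        --location "$location" \
        --sku Standard_LRS \
        --kind StorageV2 \
        --enable-hierarchical-namespace true
}

# Create a container with default settings in that account.
create_container() {
    account="$1"; container="$2"
    az storage container create \
        --name "$container" \
        --account-name "$account"
}

# Print whether hierarchical namespace is enabled on the account.
check_hns_enabled() {
    account="$1"; resource_group="$2"
    az storage account show \
        --name "$account" \
        --resource-group "$resource_group" \
        --query isHnsEnabled \
        --output tsv
}
```

A possible call sequence, with placeholder names, would be create_hns_storage_account mystorageacct my-resource-group westeurope followed by create_container mystorageacct mycontainer. Nothing is executed until the functions are called.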
Azure Blob storage authentication
Azure Blob storage can be configured as the storage type in two ways, depending on the type of authentication used:
access keys authentication
Azure managed identity authentication
1. Access keys authentication
Prerequisites:
have access to Azure Blob access keys
more information on generating access keys can be found here.
The storage account must have hierarchical namespace enabled.
Settings
/etc/bpcs/docker/bde-server.env
Once all prerequisites are fulfilled, update the following configuration with the proper values and add it to bde-server.env:
FS_TYPE=abfs
FS_ABFS_STORAGE_ACCOUNT={ Azure Blob storage account }
FS_ABFS_CONTAINER={ Azure Blob container }
FS_ABFS_ACCESS_KEY={ Azure Blob access key }
FS_DEFAULTFS=abfs://{ Azure Blob container }@{ Azure Blob storage account }.dfs.core.windows.net
CONDUIT_AZURE_CLOUD_TYPE=AzureCloud
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.windows.net
Remove the curly brackets { } when filling in the values. See the examples section below.
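The FS_DEFAULTFS value is composed from the container and storage account names. As a rough illustration, the following shell sketch builds it from two variables and prints the resulting lines. The names mystorageacct and mycontainer are hypothetical placeholders, and the access key is deliberately left out.

```shell
#!/bin/sh
# Hypothetical placeholder values; substitute your own names.
storage_account="mystorageacct"
container="mycontainer"

# Compose the default file system URI in the form used above:
# abfs://<container>@<storage account>.dfs.core.windows.net
default_fs="abfs://${container}@${storage_account}.dfs.core.windows.net"

# Print the lines as they would appear in bde-server.env (no curly brackets).
printf 'FS_TYPE=abfs\n'
printf 'FS_ABFS_STORAGE_ACCOUNT=%s\n' "$storage_account"
printf 'FS_ABFS_CONTAINER=%s\n' "$container"
printf 'FS_DEFAULTFS=%s\n' "$default_fs"
```

Note that the container name comes before the @ sign and the storage account name after it.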
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
2. Azure Managed Identity authentication
This type of authentication is used when Conduit services are deployed on a virtual machine running in Azure. More information about managed identities for Azure can be found here.
Prerequisites
Enable the system-assigned managed identity. Follow the steps from here.
The storage account must have hierarchical namespace enabled.
The resource group of the virtual machine on which Conduit services are running must have the following role: StorageBlobDataContributor (listed in the Azure Portal as "Storage Blob Data Contributor").
In the Azure Portal, navigate to "All services" -> "Resource groups" -> select the resource group used by the Conduit services VM -> "Access control (IAM)" -> search for or add "StorageBlobDataContributor".
Settings
/etc/bpcs/docker/bde-server.env
Once all prerequisites are fulfilled, update the following configuration with the proper values and add it to bde-server.env:
Remove the curly brackets { } when filling in the values. See the examples section below.
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
Azure Government Storage
General Prerequisites
Follow the steps from the "Azure Cloud Storage" section (see above).
Settings
Once all prerequisites are fulfilled, update the following configuration with the proper values and add it to bde-server.env:
Remove the curly brackets { } when filling in the values. See the examples section below.
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
S3
The configuration of this file system can be done in 2 ways, depending on the type of authentication used:
1. Access Key authentication
More information on access key generation can be found here.
The service account in use must have the following permission: AmazonS3FullAccess
Please navigate to the following path:
The following configuration must be added to bde-server.env:
Remove the curly brackets { } when filling in the values. See the examples section below.
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
2. IAM metadata authentication
More information about this type of authentication can be found here.
The service account in use must have the following permission: AmazonS3FullAccess
Please navigate to the following path:
The following configuration must be added to bde-server.env:
Remove the curly brackets { } when filling in the values. See the examples section below.
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
Google Cloud Storage (GCS)
The configuration of this file system can be done in 2 ways, depending on the type of authentication used:
1. File credential authentication (using P12 certificate)
More information about this type of authentication can be found here.
The service account in use must have the following permission: StorageAdmin
Please navigate to the following path:
The following configuration must be added to bde-server.env:
If the configuration is new or the keyfile needs to be changed, add the new keyfile to the following directory:
/etc/bpcs/docker/conduit/gcs/keyfile/
Remove the curly brackets { } when filling in the values. See the examples section below.
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
2. IAM metadata authentication
More information about this type of authentication can be found here.
The service account in use must have the following permission: StorageAdmin
Please navigate to the following path:
The following configuration must be added to bde-server.env:
Remove the curly brackets { } when filling in the values. See the examples section below.
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
HDFS
Please navigate to the following path:
The following configuration must be added to bde-server.env
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
Local file system
Please navigate to the following path:
The following configuration must be added to bde-server.env:
If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables.
If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
Conduit installation
During Conduit installation, the user is asked to choose one of the storage types above as the Parquet store location. The installation script then configures the Conduit system according to the chosen option.
The prerequisites for the chosen file system type must be satisfied before proceeding with the installation.
Please see section "Supported storage file systems" for a guide to creating the appropriate configuration for each storage type.
Once the configuration values are prepared, enter them in the installation dialogs.
Configuration update after installation
The following describes the steps required to reconfigure supported storage types for dataset cache location in Conduit.
If the Conduit Parquet store configuration needs to be modified after installation, this can be done through the environment variables in the bde-server.env file.
Please navigate to the following path:
Step 1) Delete old configurations
If the configuration already exists and needs to be modified, all environment variables in bde-server.env that start with FS_ (usually found at the bottom of the file) must be deleted first.
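One possible way to do this from a shell is sketched below. It keeps a backup copy and drops every line that starts with FS_. The function name remove_fs_vars and the SERVER_PORT variable in the demonstration are hypothetical, and the demonstration runs against a scratch file rather than the real bde-server.env.

```shell
#!/bin/sh
# Remove every line that starts with FS_ from an env file.
# A backup of the original is kept next to it with a .bak suffix.
remove_fs_vars() {
    env_file="$1"
    cp "$env_file" "$env_file.bak"
    # grep exits non-zero when no line survives the filter, so ignore its status.
    grep -v '^FS_' "$env_file.bak" > "$env_file" || true
}

# Demonstration on a scratch file; SERVER_PORT is a made-up unrelated variable.
demo_file="$(mktemp)"
printf 'SERVER_PORT=8080\nFS_TYPE=abfs\nFS_ABFS_CONTAINER=mycontainer\n' > "$demo_file"
remove_fs_vars "$demo_file"
cat "$demo_file"   # only the line that does not start with FS_ remains
```

To apply it to a real installation, the function would be called with the path of the bde-server.env file, typically with root privileges. Variables that do not start with FS_ are left untouched.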
Step 2) Edit new configurations
See section "Supported storage file systems" above for how to obtain the new configuration for each storage type.
Edit the bde-server.env file with the new values and save it.
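Because the configuration templates use curly brackets as placeholders, a quick check for leftover brackets before restarting can catch an unedited value. The following is a minimal sketch with a hypothetical function name, check_placeholders. It assumes that no legitimate value in the file contains { or }.

```shell
#!/bin/sh
# Report lines that still contain { or } in an env file.
# Returns 0 when the file is clean, 1 when placeholders remain.
check_placeholders() {
    env_file="$1"
    if grep -n '[{}]' "$env_file"; then
        echo "Unresolved { } placeholders in $env_file" >&2
        return 1
    fi
    return 0
}

# Demonstration on a scratch file with one unedited template line.
scratch="$(mktemp)"
printf 'FS_TYPE=abfs\nFS_ABFS_CONTAINER={ Azure Blob container }\n' > "$scratch"
if check_placeholders "$scratch"; then
    placeholder_status="clean"
else
    placeholder_status="needs editing"
fi
echo "$placeholder_status"
```

The matching lines are printed with their line numbers, which makes it easier to find the value that still needs to be filled in.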
Step 3) Clean Conduit storage metadata service
The previous Hive metastore volume must also be cleaned, using the following command (the container must be stopped first):
Step 4) Restart Conduit services
After this step, the Parquet store uses the new storage type.
Example of Parquet store configurations
Please navigate to the following path:
Azure Cloud Storage
Azure Government Storage
S3
Google Cloud Storage
HDFS
Local file system