Configure parquet store file system

Conduit storage location

Dataset caches and materialization assets are stored in Parquet format, hence the name “Parquet store”.

In Conduit, the following file systems (storage types) are supported for storing data source caches and data materialization: 

  • Azure Blob Storage (abfs) 

  • S3 (s3)

  • Google Cloud Storage (gcs)

  • HDFS (hdfs) 

  • local file system (file).
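As a quick sanity check before installing or reconfiguring, the storage type can be validated against the list above. The following is a minimal sketch, assuming that FS_TYPE in bde-server.env takes the identifiers shown in parentheses (this guide only shows FS_TYPE=abfs explicitly; the other values are an assumption). The function name check_fs_type is hypothetical and not part of Conduit.

```python
# Storage type identifiers as listed in this guide.
# Assumption: FS_TYPE accepts exactly these values.
SUPPORTED_FS_TYPES = {"abfs", "s3", "gcs", "hdfs", "file"}


def check_fs_type(env_text):
    """Return the FS_TYPE value found in env-file text, or raise ValueError."""
    for line in env_text.splitlines():
        line = line.strip()
        if line.startswith("FS_TYPE="):
            value = line.split("=", 1)[1].strip()
            if value not in SUPPORTED_FS_TYPES:
                raise ValueError("unsupported FS_TYPE: %r" % value)
            return value
    raise ValueError("FS_TYPE is not set")
```

Calling check_fs_type on the contents of bde-server.env returns the configured type, or raises an error if the value is missing or not one of the listed types.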

 


Supported storage file systems


Azure Cloud Storage

General Prerequisites

 

  • An Azure Blob storage account must already exist.

    • The storage account must have "Enable hierarchical namespace" checked.

    • The other settings can be left at their defaults, unless the user's own system configuration (not Conduit) requires otherwise.

      • E.g. networking, data protection, tags and others.

    • See the section below on how to create a storage account.

  • An Azure Blob container must already exist.

    • No special settings are required; all settings can be left at their default values.

    • See the section below on how to create a container.

 

How to create an Azure Blob storage account:

 

Step 1)

 

Step 2) enable hierarchical namespace

 

How to create an Azure Blob container:

 

Navigate to your Azure Blob Storage account and click on "Containers" on the left panel.

 

 

Azure Blob storage authentication

Azure Blob storage can be configured as the storage type in two ways, depending on the type of authentication used:

  1. Access keys authentication

  2. Azure Managed Identity authentication

 

1. Access keys authentication

 

Prerequisites:

  • Have access to the Azure Blob access keys.

    • More information on generating access keys can be found here.

 

  • The storage account must have hierarchical namespace enabled.

 

Settings 

/etc/bpcs/docker/bde-server.env

Once all prerequisites are fulfilled, update the following configuration with the proper values and add it to bde-server.env:

 

FS_TYPE=abfs
FS_ABFS_STORAGE_ACCOUNT={ Azure Blob storage account }
FS_ABFS_CONTAINER={ Azure Blob container }
FS_ABFS_ACCESS_KEY={ Azure Blob access key }
FS_DEFAULTFS=abfs://{ Azure Blob container }@{ Azure Blob storage account }.dfs.core.windows.net
CONDUIT_AZURE_CLOUD_TYPE=AzureCloud
CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.windows.net

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".
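To illustrate how the placeholders are filled in and how FS_DEFAULTFS is derived from the container and storage account names, here is a minimal sketch. The function abfs_env_lines and the sample values (mystorageacct, conduit-cache, the access key placeholder) are hypothetical; only the variable names and the URI pattern come from the configuration above.

```python
def abfs_env_lines(account, container, access_key):
    """Build the bde-server.env lines for access keys authentication.

    The values are inserted without curly brackets, and FS_DEFAULTFS
    follows the pattern abfs://<container>@<account>.dfs.core.windows.net.
    """
    default_fs = "abfs://%s@%s.dfs.core.windows.net" % (container, account)
    return [
        "FS_TYPE=abfs",
        "FS_ABFS_STORAGE_ACCOUNT=%s" % account,
        "FS_ABFS_CONTAINER=%s" % container,
        "FS_ABFS_ACCESS_KEY=%s" % access_key,
        "FS_DEFAULTFS=%s" % default_fs,
        "CONDUIT_AZURE_CLOUD_TYPE=AzureCloud",
        "CONDUIT_AZURE_CLOUD_STORAGE_ENDPOINT_SUFFIX=core.windows.net",
    ]


# Hypothetical sample values, for illustration only.
print("\n".join(abfs_env_lines("mystorageacct", "conduit-cache", "<access-key>")))
```

The printed lines can be pasted into bde-server.env once the real access key replaces the placeholder.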

 

2. Azure Managed Identity authentication

This type of authentication is used when Conduit services are deployed on a virtual machine running in Azure. More information about managed identities for Azure can be found here.

 

Prerequisites

  • Enable the system-assigned managed identity. Follow the steps from here.

  • The storage account must have hierarchical namespace enabled.

  • The resource group of the virtual machine where Conduit services are running must have the following role: Storage Blob Data Contributor

    • In the Azure Portal, navigate to "All services" -> "Resource groups" -> select the resource group used by the Conduit services VM -> "Access control (IAM)" -> search for or add "Storage Blob Data Contributor".

 

Settings

/etc/bpcs/docker/bde-server.env

Once all prerequisites are fulfilled, update the following configuration with the proper values and add it to bde-server.env:

 

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

 


Azure Government Storage

General Prerequisites

  • Follow the steps from the "Azure Cloud Storage" section (see above).

 

Settings

Once all prerequisites are fulfilled, update the following configuration with the proper values and add it to bde-server.env:

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".


S3 

This file system can be configured in two ways, depending on the type of authentication used:

1. Access Key authentication

  • More information on access key generation can be found here.

  • The service account in use must have the following permission: AmazonS3FullAccess

  • Please navigate to the following path:

  • The following configuration must be added to bde-server.env:

 

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

 

2. IAM metadata authentication

  • More information about this type of authentication can be found here.

  • The service account in use must have the following permission: AmazonS3FullAccess

  • Please navigate to the following path:

  • The following configuration must be added to bde-server.env

 

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".


 

Google Cloud Storage (GCS) 

This file system can be configured in two ways, depending on the type of authentication used:

1. File credential authentication (using P12 certificate)

  • More information about this type of authentication can be found here

  • The service account in use must have the following permission: StorageAdmin

  • Please navigate to the following path:

  • The following configuration must be added to bde-server.env

 

  • If the configuration is new or the keyfile needs to be changed, add the new file to the following directory: /etc/bpcs/docker/conduit/gcs/keyfile/

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

 

2. IAM metadata authentication

  • More information about this type of authentication can be found here.

  • The service account in use must have the following permission: StorageAdmin

  • Please navigate to the following path:

  • The following configuration must be added to bde-server.env

 

 

  • Remove the curly brackets { } when filling in the values. See the examples section below.

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".


HDFS 

Please navigate to the following path:

The following configuration must be added to bde-server.env

 

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".

 


Local file system 

Please navigate to the following path:

The following configuration must be added to bde-server.env:

  • If installing Conduit for the first time, then please see section "Conduit installation" for how to use the above variables. 

  • If you want to update an already installed Conduit, then please continue with the steps described in section: "Configuration update after installation".


Conduit installation

During Conduit installation, the user is asked to choose one of the types above as the Parquet store location. The installation script then configures the Conduit system according to the chosen option.
The prerequisites for the chosen file system type must be satisfied before proceeding with the installation.

 

Please see the section "Supported storage file systems" for a guide to creating the appropriate configuration for each storage type.

 

Once the configuration variables are prepared, enter their values in the installation dialogues.


Configuration update after installation

The following steps describe how to reconfigure the storage type used for the dataset cache location in Conduit.

If the Conduit Parquet store configuration needs to be modified after installation, this can be done through the environment variables in the bde-server.env file.

Please navigate to the following path:

 

Step 1) Delete old configurations

If a configuration already exists and needs to be modified, all environment variables in bde-server.env that start with FS_ (usually found at the bottom of the file) must be deleted first.
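The deletion in this step can also be scripted. The following is a minimal sketch, not an official Conduit tool: the function names drop_fs_variables and clean_env_file are hypothetical, and the path in the comment is the bde-server.env location given in the Azure settings sections of this guide. A backup copy is written first so the old values can be recovered.

```python
import shutil
from pathlib import Path


def drop_fs_variables(env_text):
    """Return env-file text with every line starting with FS_ removed."""
    kept = [
        line
        for line in env_text.splitlines()
        if not line.lstrip().startswith("FS_")
    ]
    return "\n".join(kept) + "\n"


def clean_env_file(path):
    """Back up the env file, then rewrite it without the FS_ variables."""
    env_path = Path(path)
    backup_path = env_path.with_name(env_path.name + ".bak")
    shutil.copy2(env_path, backup_path)
    env_path.write_text(drop_fs_variables(env_path.read_text()))


# Example call (not executed here):
# clean_env_file("/etc/bpcs/docker/bde-server.env")
```

Note that this sketch only removes variables with the FS_ prefix, as the step describes. Variables with other prefixes, such as the CONDUIT_AZURE_ ones shown in the Azure settings, are left untouched and may need to be reviewed manually.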

Step 2) Edit new configurations

See the section "Supported storage file systems" above for how to obtain the new configuration for each storage type.

 

Edit the bde-server.env file with the new values and save it.

 

Step 3) Clean Conduit storage metadata service

The previous Hive metastore volume must also be cleaned, using the following command (the container must be stopped first):

 

 

Step 4) Restart Conduit services

After this step, the Parquet store uses the new storage type.

 

 


Example of Parquet store configurations

Please navigate to the following path:

Azure Cloud Storage

 

Azure Government Storage

 

S3

 

Google Cloud Storage

 

HDFS

 

Local file system

 
