Partitioning Your Data

The partition option in Conduit controls how the loading, storing, and processing of a data set are distributed within the Conduit SQL engine. How the partition is defined when a connector is created can have a significant impact on Conduit's performance for that connector.

Generally speaking, the number of partitions for a specific data set should be a multiple of the number of processors on the nodes in the cluster. There are many variables to consider when defining the number of partitions for a data set, including:

Size of the data set
Type of data within the data set
Number of nodes within the cluster
Number of processors within each node
Memory available to each node

Too few partitions lead to inefficient resource allocation for distributed operations on the data set. With too many partitions, on the other hand, operation planning spends too much time working out how to distribute the computational tasks of executing the query.

When deploying Conduit on a VM with the minimum requirements (Ubuntu 16.04, 4 cores, 16 GB RAM), 4 partitions will likely improve caching and query speed for a sizable data set.
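
The multiple-of-processors guideline can be expressed as simple arithmetic. The following Python sketch is illustrative only: the function name and its parameters are hypothetical and are not part of Conduit. It rounds a desired partition count up to the nearest multiple of the total number of cores in the cluster, and it does not account for the other variables listed above (data size, data type, memory).

```python
import math


def round_up_to_core_multiple(desired_partitions: int, nodes: int, cores_per_node: int) -> int:
    """Round a desired partition count up to a multiple of the total core count.

    Hypothetical helper for illustration; not a Conduit API.
    """
    if desired_partitions < 1 or nodes < 1 or cores_per_node < 1:
        raise ValueError("all arguments must be positive integers")
    total_cores = nodes * cores_per_node
    # Smallest multiple of total_cores that is >= desired_partitions.
    return math.ceil(desired_partitions / total_cores) * total_cores


# Single VM with 4 cores (the minimum-requirements case): 4 partitions.
print(round_up_to_core_multiple(1, nodes=1, cores_per_node=4))

# Assumed 3-node cluster with 4 cores per node, aiming for about 10 partitions: 12.
print(round_up_to_core_multiple(10, nodes=3, cores_per_node=4))
```

Treat the result as a starting point and adjust it based on observed caching and query performance.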

Partition Size can be adjusted for a data set in the Connector Wizard, on the Advanced tab.

Configuring a Partition Column is recommended for large data sets when connector caching is enabled or when join queries with other data source types are expected.