Application

The application is an enterprise-grade, highly customizable ETL application - delivered as a single, self-contained JAR.

It ships out-of-the-box with connectors that extract and transform metadata from multiple sources:

  • DatabricksConnector

Enterprise users can create and execute services - each specifying one of the available service types. While the application may also support more general services (e.g. maintenance services), the basic idea is that services are typically connectors that, for example, extract metadata from a source, apply ingestion and transformation rules, and upload the metadata to a destination application.


Sophisticated features round off the complete package:

  • Dump and restore capabilities
  • Embedded working database
  • Advanced pattern filters
  • Standardized authentication methods
  • Smart, agent-based reconciliation mode
  • Workflow support

The application has a low memory footprint and can scale to arbitrarily large volumes of metadata without ever materializing the full volume of data in memory. It uses an embedded database with a landing repository to store raw extracts, and a staging repository to hold transformed entities before the final upload - all without the need to manage external databases.


Installing the application

The application is packaged and delivered as a single, executable JAR. The bundle includes the executables, resources, configurations, 3rd-party libraries and even an embedded database system and a web server. No further JARs are required to start the application.

The application does not rely on any external components, such as a database management system or an external servlet container - everything is contained in the delivered bundle.

Attention

Specific services might require additional JARs (for example, a specific JDBC driver). These can be loaded dynamically by specifying them in the configuration property connector.config.jars. The class path cannot be manipulated with java -classpath to add additional JARs.


As a self-contained executable JAR, the application does not require any installation - just start the JAR 🙂.

A Java 17 JRE (Java Runtime Environment) is required to start the application.
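
The installed Java version can be checked with, for example:

java -version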


Starting the application

The application is started as an executable JAR using java -jar:

java -jar dataspot-connector.jar

In general, when the application starts, it automatically

  • loads the application configuration,
  • loads the additional JARs,
  • and creates the working database, if it doesn't exist.

System properties (JVM) can be specified on the command-line as java -Dkey=value (for example, to configure the proxy settings, to define placeholders, or to define configuration properties). They must be passed before the command-line argument -jar.

Example: System property

java -Dutilities.dump=true -jar dataspot-connector.jar


CLI application

After startup, the command-line interface (CLI) application loads and validates a single service from the specified service file, executes the service, and finally terminates.

java -jar dataspot-connector.jar <options>

The options are passed in GNU-style syntax --key=value.

  • --service=<name> (mandatory) - The service name. The service is located in the service file (option --file).
  • --file=<file> (optional) - The absolute or relative path to the service file containing the service (option --service).
  • --verbose (optional) - Enables verbose output mode.

Tooltip

If the service file isn't specified, the default service file defined in connector.config.service.file is used.


Example: CLI application

java -jar dataspot-connector.jar --service=MyService --file=/home/connector/services/MyService.yaml


Working database

The application uses a working database for persisting entities during processing. The working database is also used by the application to store the execution details and the statuses of the currently running or finished services.

Note

A connector typically extracts metadata from an external source and stores it in the landing repository, transforms the data into the staging repository, and finally uploads it to the target system. The landing and staging repositories are located in the working database.


The working database is an H2 relational database that uses a single file for storage. The working database is stored in the file specified by the application configuration property connector.config.database.file or, if not specified, in the default database file.

Attention

The H2 database engine is an open-source relational database management system written in Java. It is embedded in the application and runs inside the same JVM, so the working database requires no separate database management system or process to be installed. The required JARs are packaged and delivered with the application.


The working database contains only transient, intermediate data that is processed before being moved to a final destination. It is therefore safe to delete the working database file at any time, for example if it grows too large. When the application starts, it automatically creates the working database file, if it doesn't exist.

Tooltip

Alternatively to deleting the entire working database file, consider creating and executing an ApplicationReorg service to reorganize the working database by deleting finished services as well as their entities from the landing and staging repositories.
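
For illustration, a minimal ApplicationReorg service could look like this (the service name MyReorg is just an example; ApplicationReorg-specific options are not shown):

services:
  MyReorg:
    type: ApplicationReorg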


HTTP/HTTPS proxy settings

If required, the proxy settings of the application can be configured either by setting system properties or by setting environment variables. In either case, HTTP or HTTPS requests are automatically redirected to the specified proxy server.

Proxy system properties

The proxy settings can be specified using the following standardized system properties (JVM) in Java.

Tooltip

System properties (JVM) can be specified on the command-line as java -Dkey=value.


  • http.proxyHost (HTTP) - The host name or IP address of the HTTP proxy server.
  • http.proxyPort (HTTP) - The port number of the HTTP proxy server (default: 80).
  • http.proxyUser (HTTP) - The username, if the HTTP proxy server requires authentication.
  • http.proxyPassword (HTTP) - The password, if the HTTP proxy server requires authentication.
  • https.proxyHost (HTTPS) - The host name or IP address of the HTTPS proxy server.
  • https.proxyPort (HTTPS) - The port number of the HTTPS proxy server (default: 443).
  • https.proxyUser (HTTPS) - The username, if the HTTPS proxy server requires authentication.
  • https.proxyPassword (HTTPS) - The password, if the HTTPS proxy server requires authentication.
  • http.nonProxyHosts (HTTP/HTTPS) - A list of host name or IP patterns (and ports) that should bypass the proxy.

Note

The list separator in the system property http.nonProxyHosts is |.


The system property http.nonProxyHosts defines a list of exceptions that should not be routed to the proxy server. This list provides a way to exclude traffic to certain destinations (e.g. localhost, 127.0.0.1, or *.internal.example.com) from passing through the proxy server. The excluded domains or IP addresses are specified as a list of domain[:port] values. HTTP or HTTPS requests to a destination that matches an entry in http.nonProxyHosts are not redirected to the proxy server.

Tooltip

If a port is specified, the exception applies only to that specific port, e.g. example.com:8080 applies only to port 8080 on example.com. If no port is specified, the exception applies to all ports.


Example: Proxy system properties

java -Dhttp.proxyHost=myproxy.com -Dhttp.proxyPort=8080 -Dhttp.proxyUser=myuser -Dhttp.proxyPassword=s3cr3t -Dhttp.nonProxyHosts="localhost|*.internal.example.com"

Proxy environment variables

For convenience, the application supports the widely used environment variables http_proxy, https_proxy and no_proxy. When the application starts, it automatically reads these environment variables and converts them to the corresponding standardized system properties in Java.

Attention

The application extracts the hosts, ports, usernames, and passwords from the environment variables and sets the corresponding system properties, unless they are already defined. The system properties take precedence over the environment variables, i.e. if a system property is already defined, its value is not overwritten.


  • http_proxy - format [protocol://][username:password@]host[:port] - sets http.proxyHost, http.proxyPort, http.proxyUser, http.proxyPassword
  • https_proxy - format [protocol://][username:password@]host[:port] - sets https.proxyHost, https.proxyPort, https.proxyUser, https.proxyPassword
  • no_proxy - format domain[:port],domain[:port],... - sets http.nonProxyHosts (subdomain wildcards are converted)

Note

The list separator in the environment variable no_proxy is ,.


The specification of username:password@ in http_proxy and https_proxy is optional (if the proxy server does not require authentication, the username and password can be omitted). The specification of :port in http_proxy, https_proxy and no_proxy is optional.

Example: Proxy environment variables

http_proxy=http://myproxy.com:8080/
https_proxy=https://myuser:s3cr3t@myproxy.com/
no_proxy=localhost,*.internal.example.com

Configuration

The application configuration contains static application settings, such as the locations of application directories and resources or global execution options.

When the application starts, it automatically reads the application configuration from the following sources:

  • System properties - Settings defined as system properties (JVM) using java -Dkey=value.
  • application.properties - Settings defined in the .properties configuration file.
  • application.yaml - Settings defined in the .yaml configuration file.

Note

The sources are evaluated in the above order, from top to bottom. If a property is specified in multiple sources, the system properties take precedence over the properties in application.properties, which in turn take precedence over the properties in application.yaml.


Attention

The files application.properties and application.yaml are read from the current working directory. Any relative paths in the configuration are relative to the current working directory. The working directory can be - but is not necessarily - the directory of the application JAR dataspot-connector.jar.


Example: File application.yaml

connector:
  config:
    dump:
      directory: ./temp
    execution:
      thread:
        count: 32

Example: File application.properties

connector.config.dump.directory=./temp
connector.config.execution.thread.count=32

Example: System properties (JVM)

java -Dconnector.config.dump.directory=./temp -Dconnector.config.execution.thread.count=32

The following application configuration properties are supported.

Note

While application configuration properties can be specified as system properties, in application.properties, or in application.yaml, for the sake of simplicity, the application configuration examples in the following sections will only be illustrated in application.yaml.


Dump

Services can generate dumps during execution.

🔑 Property connector.config.dump.directory

The absolute or relative path to the dump directory.

optional

The default is ./.

The dump filename is determined by connector.config.dump.template.

Note

The dump directory is created automatically, if it doesn't exist.

Example: Property connector.config.dump.directory

connector:
  config:
    dump:
      directory: ./temp

🔑 Property connector.config.dump.template

The dump filename template containing placeholders.

optional

The default is ${type}-${name}-${dump}-${id}.

The dump filename is determined by replacing the placeholders in the template with specific values, such as the service name and type, the dump type, and the date/time.

  • ${type} - The service type (DatabricksConnector).
  • ${name} - The service name (the unique name of the service within the service file).
  • ${dump} - The dump type (e.g. landing, staging, or payload).
  • ${id} - The job execution ID.
  • ${date} - The current date in YYYY.MM.DD format.
  • ${time} - The current time in HH.MM.SS.SSSSSS format.

The dump directory is determined by connector.config.dump.directory.

Example: Property connector.config.dump.template

connector:
  config:
    dump:
      template: ${name}_${type}-${dump}-${date}T${time}
      # e.g. Test_DatabaseConnector-landing-2025.06.06T12.40.46.568000.json

Note

The template may also contain directories, allowing the dump files to be stored in separate folders, depending on the service name, service type or dump type. For example, the template ${type}/${name}-${dump}-${date}T${time} would segregate the dump files by service types, i.e. the dump files of each service type would be stored in a separate directory. For each service type, the directory is created automatically, if it doesn't exist.
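
For example, such a template could be configured as follows (this simply materializes the template from the note above in the application configuration):

connector:
  config:
    dump:
      template: ${type}/${name}-${dump}-${date}T${time}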


Service file

Services are defined in service files in YAML format.

🔑 Property connector.config.service.file

The absolute or relative path to the default service file.

optional

The default is ./services.yaml.

If a service file isn't specified when executing a service, the default service file is used.

Example: Property connector.config.service.file

connector:
  config:
    service:
      file: /home/connector/services.yaml
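
With this configuration, the --file option can be omitted when executing a service - for example (assuming MyService is defined in /home/connector/services.yaml):

java -jar dataspot-connector.jar --service=MyService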

Placeholders

Placeholders are tokens used in service configurations to represent string values which are stored in external sources.

🔑 Property connector.config.placeholders.files

The list of absolute or relative paths to additional properties files containing placeholders.

optional

The default is [] (no additional properties files).

Each additional properties file contains a list of keys and values. They are automatically loaded by the application and their values can be referenced in service files using placeholders.

Example: Additional properties file oauth.properties

oauth.providerUrl=https://login.microsoftonline.com/b0ebd953-fb5f-425e-98fc-ec46bf8ce2f1
oauth.clientId=6731de76-14a6-49ae-97bc-6eba6914391e
oauth.clientSecret=uT09P~lyF2T_Upb~q5P9r-i7iSuIFSnH0nA54cKE

Attention

As a recommendation, credentials or sensitive data - such as passwords, access tokens or client secrets - should not be stored in service files (where they might be compromised). Instead, they should reside separately in external sources.


Example: Property connector.config.placeholders.files

connector:
  config:
    placeholders:
      files:
        - ./resources/oauth.properties
        - ./resources/secrets.properties

Parallel processing

Parallel processing is supported by specific service types (DatabricksConnector), where each thread processes a subset of the workload.

🔑 Property connector.config.execution.thread.count

The number of threads used for parallel processing.

optional

The default is 16.

Example: Property connector.config.execution.thread.count

connector:
  config:
    execution:
      thread:
        count: 32

🔑 Property connector.config.execution.thread.timeout

The maximum execution time (in milliseconds) for parallel processing.

optional

The default is -1 (unlimited).

If the maximum execution time is exceeded during parallel processing, the service execution is aborted.

Example: Property connector.config.execution.thread.timeout

connector:
  config:
    execution:
      thread:
        timeout: 60000 # 60 seconds

Additional JARs

The application can load additional JARs required by specific services.

🔑 Property connector.config.jars

The list of additional JARs to load.

optional

The default is [] (no additional JARs).

Note

Additional JARs might be required by specific services. For example, a DatabaseConnector that connects to a specific database will require a suitable JDBC driver. However, JDBC drivers are not delivered with the application. Instead, the suitable driver is loaded dynamically from an additional JAR.


When the application starts, it automatically loads the additional JARs from the specified files, directories, and URLs.

  • type (mandatory) - The JAR type [jdbc].
  • file (optional) - The absolute or relative path of the JAR file.
  • directory (optional) - The absolute or relative path of the directory containing JAR files. All JARs in the directory are loaded.
  • url (optional) - The URL of the JAR file.
  • className (optional) - The (fully qualified) name of the main class. If specified, the class is loaded from the JAR. Otherwise, the first suitable class from the JAR is loaded.

For each entry in the list, the JARs specified by file, directory, and url are loaded with a common, custom Java class loader.

Note

In case multiple JARs depend on each other (e.g. a JDBC JAR might have dependencies on further 3rd-party libraries), loading all JARs with a common Java class loader ensures these JARs and their transitive dependencies load and link correctly.


Example: Property connector.config.jars

connector:
  config:
    jars:
      # load a single JDBC driver from a file
      - type: jdbc
        file: /home/connector/postgresql-42.7.5.jar
      # load a JDBC driver and all libraries from the directory
      - type: jdbc
        directory: /home/connector/google-big-query
      # load a single JDBC driver from a URL
      - type: jdbc
        url: https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/9.2.0/mysql-connector-j-9.2.0.jar

Why is this even necessary?

The application is delivered as an executable JAR and started with java -jar dataspot-connector.jar. When starting an application from an executable JAR with the option -jar, the class path cannot be manipulated by adding additional JARs. Java ignores the option -classpath and only uses the class path defined in the JAR itself. Therefore, external JARs (e.g. JDBC drivers) cannot be added with the option -classpath. Instead, additional JARs can be loaded dynamically by specifying them in the application configuration property connector.config.jars.
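
As a sketch (the driver path is taken from the example above), the first command silently ignores the driver, while declaring it in connector.config.jars loads it correctly:

# does NOT work: -classpath is ignored when -jar is used
java -classpath /home/connector/postgresql-42.7.5.jar -jar dataspot-connector.jar --service=MyService

# works: declare the driver in connector.config.jars (see the example above)
java -jar dataspot-connector.jar --service=MyService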


Working database

The application uses a working database for persisting entities during processing.

🔑 Property connector.config.database.file

The absolute or relative path to the working database file.

optional

The default is ./dsconnector.

Example: Property connector.config.database.file

connector:
  config:
    database:
      file: /home/connector/db

Logging

The application writes logs to capture events, helping system administrators and developers understand what is currently happening.

🔑 Property logging.level.root

The logging level of the application [debug, info, warn, error, off].

optional

The default is info.

The logging level will usually be set to a higher, less verbose level (e.g. warn or error) in the application configuration.

Example: Property logging.level.root

logging:
  level:
    root: warn

For troubleshooting or maintenance purposes, the logging level could be set to a lower, more verbose level (e.g. info or debug) for a single service execution. In this case, the logging level should not be modified in the application configuration - and be applied to all service executions - but rather be set as a system property (JVM) for this single, specific execution.

java -Dlogging.level.root=info
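
For example, a single service could be executed with more verbose logging as follows (a sketch combining the system property with the CLI options described above):

java -Dlogging.level.root=debug -jar dataspot-connector.jar --service=MyService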

Services

A service is a named instance of a service type (DatabricksConnector) with a specific configuration. Services are defined in service files and can be executed using the CLI application.

Creating a service

A service is created by defining its name, a service type, and the specific configuration in a service file.

Example: Service MyService

services:
  MyService:
    type: DatabaseConnector

The name and service type are mandatory. The specific configuration depends on the service type.

Service type

Each service type implements a specific task in a predefined, fixed sequence of steps (e.g. extraction, transformation, or reorganization). The application supports the following service types:

  • DatabaseConnector - Connects to a database using JDBC and extracts metadata from the database catalog.
  • StorageConnector - Connects to a storage system (e.g. AWS S3, Google Cloud Storage, Azure Storage) and extracts metadata from directories and files (e.g. CSV, Parquet).
  • DatabricksConnector - Connects to a Databricks Unity Catalog and extracts metadata and runtime lineage information.
  • ApplicationReorg - Reorganizes the application by deleting expired jobs.

Service file

Services are defined in service files in YAML format. The root property services contains a map of service configurations, with the map key being the service name.

Example: Service file with a single service

services:
  MyDatabaseService:
    type: DatabaseConnector
    source:
      url: jdbc:sqlserver://myserver:1433;DatabaseName=mydatabase

A service file may contain multiple services, possibly with different service types. Within the service file, each service must be identified by a unique name.

Example: Service file with multiple services

services:
  MyDatabaseService:
    type: DatabaseConnector
    source:
      url: jdbc:sqlserver://myserver:1433;DatabaseName=mydatabase

  MyStorageService:
    type: StorageConnector
    source:
      url: s3a://my-s3-bucket

Executing a service

A service can be executed using the command-line interface (CLI) application by specifying the service name and the service file.
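
For example, to execute MyStorageService from the multi-service file shown above (assuming the file is saved as /home/connector/services.yaml):

java -jar dataspot-connector.jar --service=MyStorageService --file=/home/connector/services.yaml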

Tooltip

If the service file isn't specified, the default service file defined in connector.config.service.file is used.

The application loads and executes the service:

  • The application validates the service configuration.
  • The application starts a job that, depending on the service type, performs a predefined, fixed sequence of steps.
  • Each step of the job typically processes a different section of the service configuration.
  • Finally, the job finishes successfully or terminates with an error message.

Each service type specifies a predefined, fixed sequence of steps. A service type (e.g. DatabaseConnector) might extract and transform metadata from a source, while another service type (e.g. ApplicationReorg) might perform maintenance work.

Note

During execution, the placeholders in the service configuration are resolved by searching for them in the configured placeholder sources. The execution details and the statuses of the currently running or finished services are stored in the working database.


Configuration

A service is configured by defining its unique name, the service type, and the configuration.

Services have the following configuration in common. Specific service types have additional configurations.

Tooltip

Properties marked with * are required for the service to run.

🔑 Property type*

The service type.

required


The service type identifies the predefined, fixed sequence of steps that are performed by the job when the service is executed.

Note

The property type is the only property common to all services. The entire rest of the service configuration depends solely on the selected service type.


Example: Property type

services:
  MyService:
    type: DatabaseConnector

While each service type has its own specific configuration, the following general concepts apply to all service configurations.

Placeholders

Placeholders are tokens used in service configurations to represent string values which are stored in external sources. Placeholders are specified in the format ${key}. Their actual values are resolved when the service is executed.

  • ${key} - The key is an identifier that maps to a value in one of the external sources.

Attention

As a recommendation, credentials or sensitive data - such as passwords, access tokens or client secrets - should not be stored in service files (where they might be compromised). Instead, they should reside separately in external sources.


Example: Placeholders

services:
  MyService:
    type: DatabaseConnector
    upload:
      authentication:
        method: password
        username: ${basic.username}
        password: ${basic.password}

Placeholder sources

When a service is executed, the application attempts to resolve each placeholder ${key} and replace it with an actual value by searching for key in the following sources:

Source

Description

connector.config.placeholders.files

Additional properties files specified in connector.config.placeholders.files.

Command-line arguments

Placeholders defined as command-line arguments using --key=value.

System properties

Placeholders defined as system properties (JVM) using java -Dkey=value

Environment variables

Placeholders defined as environment variables (e.g. export key=value on Linux)

application.properties

Placeholders defined in the configuration file application.properties.

application.yaml

Placeholders defined in the configuration file application.yaml.

Note

The sources are searched in the above order, from top to bottom. The additional properties files specified in the application configuration connector.config.placeholders.files have the highest precedence. The configuration file application.yaml has the lowest precedence.

If a placeholder ${key} is not found in any of the sources, the placeholder isn't replaced but remains in the string as ${key}.

Example: Additional properties file basic.properties

basic.username=myuser
basic.password=mypassword

Example: Command-line arguments

--basic.username=myuser --basic.password=mypassword

Example: System properties

-Dbasic.username=myuser -Dbasic.password=mypassword

Example: Environment variables

export BASIC_USERNAME=myuser
export BASIC_PASSWORD=mypassword

Tooltip

Placeholders defined as environment variables are specified in snake_case or SCREAMING_SNAKE_CASE.


Built-in placeholders

The following properties have built-in, predefined placeholders.

  • utilities.dump - ${utilities.dump}
  • utilities.restore.landing - ${utilities.restore.landing}
  • utilities.restore.staging - ${utilities.restore.staging}
  • upload.options.dryRun - ${upload.options.dryRun}

Properties with built-in placeholders can be set without ever specifying them in the service file. If the property is not specified in the service file, the built-in placeholder is resolved, by default:

  • If the built-in placeholder is found in one of the external sources, the property is set to the resolved value.
  • Otherwise, the built-in placeholder is ignored (i.e. the property is set to the property's default value).
Tooltip

Built-in placeholders allow certain features, typically related to maintenance (e.g. writing dumps or performing dry runs), to be enabled or disabled without ever modifying the service file.


Example: Built-in placeholder

A property (e.g. utilities.dump) could be specified in a service file with a fixed value (e.g. true). Changing the property's value would always involve modifying the service file:

services:
  MyService:
    type: DatabaseConnector
    utilities:
      dump: true

Alternatively, the property could be specified with a custom placeholder (e.g. ${dump.enabled}) and subsequently be set without modifying the service file, but by defining the placeholder value in an external source (e.g. java -Ddump.enabled=true):

services:
  MyService:
    type: DatabaseConnector
    utilities:
      dump: ${dump.enabled}

More conveniently, using its built-in placeholder (e.g. ${utilities.dump}), the property can be set by removing it from the service file altogether and instead only defining the built-in placeholder value in an external source (e.g. java -Dutilities.dump=true):

services:
  MyService:
    type: DatabaseConnector
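
For example, the dump could then be enabled for a single execution by defining the built-in placeholder on the command line, without touching the service file:

java -Dutilities.dump=true -jar dataspot-connector.jar --service=MyService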

Pattern filters

A pattern filter is an advanced pattern matching mechanism used for properties that define filters (e.g. names, types) in service configurations. Rather than using a single regular expression to match a given value, a pattern filter allows the value to be matched against multiple regular expressions.

Example: Pattern filter

names:
  accept:
    - Finance
    - Sales.*   # match the name 'Finance' and names starting with 'Sales'
  reject:
    - SalesTest
    - .*_Temp   # except the name 'SalesTest' and names ending with '_Temp'

A given value matches the pattern filter if it is accepted and is not rejected:

  • Evaluate the accept list
    • If the accept list is null, all values are accepted.
    • If the accept list is empty, no values are accepted.
    • Otherwise, the value is accepted if it matches any regular expression in the accept list.
  • Evaluate the reject list
    • If the reject list is null, no values are rejected.
    • If the reject list is empty, no values are rejected.
    • Otherwise, the value is rejected if it matches any regular expression in the reject list.
Note

An empty pattern filter (accept and reject are null) matches any value (all values are accepted and no values are rejected).


Patterns are automatically anchored by adding ^ at the beginning and $ at the end (corresponding to start and end). These anchor characters should not be specified in the pattern. The entire value, from beginning to end, must match the pattern - nothing can come before or after the pattern.

Tooltip

To match a value that merely contains a specific pattern (e.g. TEMP) - allowing something to come before or after the pattern - the pattern should be embedded in .* (e.g. .*TEMP.*).


Example: Pattern filter

types:
  accept:
    - .*TABLE   # match types ending with 'TABLE'
  reject:
    - SYSTEM.*
    - .*TEMP.*  # except types starting with 'SYSTEM' or containing 'TEMP'

Note

Both the value and the pattern may be null. The value null only matches a pattern if the pattern is also null - and vice versa.


In the simplest case, a pattern filter could define an accept list with a single regular expression and no reject list. This is equivalent to matching against a single regular expression.

Example: Pattern filter

names:
  accept: Sales # match the name 'Sales'

Tooltip

Notice how accept is a single-value list - containing only the single value Sales.


Single-value lists

Specific properties in service configurations may contain lists of values. Lists are typically formatted using -, even when they contain only a single value.

Example: Single-value lists

extensions:
  - parquet # primitive property

datatypes:
  - stereotype: type
    restricted: true # structured property

For convenience, a single-value list in a service configuration can be formatted as a single value, rather than as a list with a single value.

Info

The single value is automatically treated as a list behind the scenes.


Example: Single-value lists (compact format)

extensions: parquet # primitive property

datatypes:
  stereotype: type
  restricted: true # structured property

Tooltip

Notice how this applies not only to primitive properties, such as strings or numbers, but also to structured properties.

Connectors

In general, connectors are a category of service types that extract metadata from an external source, transform the metadata, and finally upload the transformed metadata to a destination application (dataspot).

  • DatabaseConnector - Connects to a database using JDBC and extracts metadata from the database catalog.
  • StorageConnector - Connects to a storage system (e.g. AWS S3, Google Cloud Storage, Azure Storage) and extracts metadata from directories and files (e.g. CSV, Parquet).
  • DatabricksConnector - Connects to a Databricks Unity Catalog and extracts metadata and runtime lineage information.

While concrete connectors (e.g. DatabaseConnector, StorageConnector, DatabricksConnector) differ in their supported sources and ingestion options, this section describes the common architecture and configuration shared by all connectors. Refer to specific connectors for details on their respectively supported sources and ingestion options.

Architecture

Connectors adhere to an architectural blueprint, implementing a workflow with similar steps, and sharing the same components.

The overall architecture of the dataspot. connectors

Workflow

Connectors implement a workflow with similar steps, where metadata is extracted from a source, transformed, and asynchronously uploaded to a destination.

  • Extract - The connector connects to the specified source and extracts metadata in accordance with the selected extraction options and filters. The connector writes the extracted metadata to the landing repository.
  • Transform - The connector reads the extracted metadata from the landing repository and transforms the metadata in accordance with the selected transformation options. The connector writes the transformed metadata to the staging repository.
  • Upload - The connector reads the transformed metadata from the staging repository and creates the payload. The connector connects to the specified destination application using the selected authentication settings and upload options. The connector sends the payload to the destination application, where an asynchronous import job is submitted to process the uploaded payload.
  • Poll Job - The connector polls the asynchronous import job and displays the job progress and statistics. When the import job has finished, the connector terminates. If the import job fails, the connector displays the error logs and exits.

The landing and staging repositories are used as intermediate storage for the metadata entities, avoiding holding these entities in memory. Connectors have a minimal memory footprint - writing each entity to the repository as soon as it's extracted or transformed - allowing processing to scale to arbitrarily large volumes.

Tooltip

The import job is a regular, asynchronous job in dataspot. It can be monitored in the dashboard of the user.


Repositories

A connector typically uses the working database to store entities during processing:

  • Landing - Stores the metadata extracted from the source.
  • Staging - Stores the metadata transformed from the landing repository.

Connectors can stream large-scale metadata without ever materializing the full volume in memory, by using a landing repository to store raw extracts, and a staging repository to hold transformed entities before the final upload.

Note

The metadata in the landing and staging repositories is automatically deleted when the connector terminates, regardless of whether the service completed successfully or not.


Dump and restore

Connectors can write dump files during processing or resume from previous dumps.

If the property utilities.dump is enabled, the extraction, transformation, and upload steps automatically write dump files of the landing repository, the staging repository, and the payload.

Tooltip

The dump directory is determined by connector.config.dump.directory and is created automatically, if it doesn't exist.


  • Extract - Landing repository - dump type landing - JSON (.json)
  • Transform - Staging repository - dump type staging - JSON (.json)
  • Upload - Payload - dump type payload - JSON + gzip (.json.gz)

Tooltip

The dump filename is determined by connector.config.dump.template where the placeholder ${dump} is automatically replaced by the corresponding dump type (landing, staging, or payload).


If the property utilities.restore.landing is defined, the connector restores the landing repository from the specified landing dump file, rather than extracting metadata from the source. In this case, the extraction step is skipped altogether and processing resumes with transforming the landing repository that was restored from the dump file.

If the property utilities.restore.staging is defined, the connector restores the staging repository from the specified staging dump file, rather than extracting metadata from the source and transforming it. In this case, the extraction and transformation steps are skipped altogether and processing resumes with uploading the staging repository that was restored from the dump file.

Note

For troubleshooting and traceability, the dumps of the landing and staging repositories also include the relevant ingestion options from the configuration. When the landing or staging repository is restored from a dump file, the relevant ingestion options are also restored and processing resumes with these options.
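
For illustration, a connector service could enable dumps, or resume from a previously written landing dump, roughly as sketched below (the dump file path is hypothetical; utilities.dump and utilities.restore.landing are the documented properties):

services:
  MyService:
    type: DatabricksConnector
    utilities:
      dump: true
      # alternatively, resume from an existing landing dump instead of extracting:
      # restore:
      #   landing: ./temp/DatabricksConnector-MyService-landing-42.json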


DatabricksConnector

DatabricksConnector is a connector that connects to a Databricks Unity Catalog instance, extracts metadata from the workspace, and transforms and uploads the metadata to a destination application using the upload API.

Note

DatabricksConnector supports Databricks Unity Catalog instances running on AWS, Azure, and Google cloud platforms.


Functionality

DatabricksConnector follows the general connector architecture and workflow.

The metadata is extracted from the workspace catalog and transformed to the destination application, as follows:

Source → dataspot.

  • Catalog → Collection
  • Schema → Collection
  • Table → UmlClass
  • Column → UmlAttribute
  • Foreign key → UmlAssociation
  • View definition → Derivation
  • Data type → UmlDatatype
  • Runtime lineage → Transformation Rule

Note

Runtime data lineage is the automatic capture of data flow metadata during query execution on Databricks - tracking table-to-table and column-level dependencies, along with associated notebooks, jobs, and dashboards, in near real time. Databricks Unity Catalog aggregates this lineage across all attached workspaces into a unified metastore-wide graph.


The transformed metadata is uploaded to the destination application by calling the upload API. The reconciliation options of the upload API specify how uploaded metadata is reconciled with existing metadata. The workflow options of the upload API specify the workflow statuses of inserted, updated, or deleted metadata.

Configuration

A DatabricksConnector service is configured by defining its unique name, the service type DatabricksConnector, and the configuration.

Example: DatabricksConnector

services:
  MyService:
    type: DatabricksConnector

In addition to the general connector configuration that specifies the destination application, DatabricksConnector has the following configuration to specify the source and the ingestion options.

Tooltip

Properties marked with * are required for DatabricksConnector to run.


Source

DatabricksConnector connects to a Databricks Unity Catalog instance using the specified workspace URL and authentication settings.

🔑 Property source.url*

The workspace URL of the Databricks Unity Catalog instance.

required


DatabricksConnector connects to the Databricks Unity Catalog instance specified by the workspace URL.

Example: Property source.url

services:
  MyService:
    type: DatabricksConnector
    source:
      url: https://dbc-00182f59-66eb.cloud.databricks.com

🔑 Property source.warehouseId*

The warehouse ID.

required


DatabricksConnector extracts runtime data lineage using the specified warehouse ID.

Example: Property source.warehouseId

services:
  MyService:
    type: DatabricksConnector
    source:
      url: https://dbc-00182f59-66eb.cloud.databricks.com
      warehouseId: b079313fa6222089

Authentication

DatabricksConnector can specify the authentication settings of the Databricks Unity Catalog instance.

🔑 Property source.authentication

The authentication settings of the Databricks Unity Catalog instance.

optional

The default is null (no authentication).

If an authentication is defined, DatabricksConnector connects to the Databricks Unity Catalog instance with the specified authentication. Otherwise, DatabricksConnector connects without authentication.

🔑 Property source.authentication.method

The authentication method.

required

The property is required if source.authentication is specified.

DatabricksConnector supports the following authentication methods:

  • Token - source.authentication.method: token
  • OAuth 2.0 - source.authentication.method: oauth

Example: Property source.authentication.method

services:
  MyService:
    type: DatabricksConnector
    source:
      authentication:
        method: token

Token

DatabricksConnector can use a token for connecting to the Databricks Unity Catalog instance.

🔑 Property source.authentication.token

The personal access token (PAT).

required

The property is required if source.authentication.method is token.

DatabricksConnector uses the specified token for authentication.

Example: Property source.authentication.token

services:
  MyService:
    type: DatabricksConnector
    source:
      authentication:
        method: token
        token: ${databricks.pat}

OAuth 2.0

DatabricksConnector can use OAuth 2.0 authentication for connecting to the Databricks Unity Catalog instance. The application supports non-interactive (machine to machine) grants to obtain an access token as a client application.

🔑 Property source.authentication.clientId

The OAuth 2.0 client ID.

required

The property is required if source.authentication.method is oauth.

DatabricksConnector uses the client ID to authenticate.

Note

A provider URL must not be specified, as it is automatically inferred from the workspace URL.


Example: Property source.authentication.clientId

services:
  MyService:
    type: DatabricksConnector
    source:
      authentication:
        method: oauth
        clientId: 6731de76-14a6-49ae-97bc-6eba6914391e

🔑 Property source.authentication.credentials.type

The credentials type.

required

The property is required if source.authentication.method is oauth.

DatabricksConnector supports the following credentials types to obtain an access or ID token:

  • Client credentials with client secret - source.authentication.credentials.type: clientSecret

Example: Property source.authentication.credentials.type

services:
  MyService:
    type: DatabricksConnector
    source:
      authentication:
        method: oauth
        clientId: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: clientSecret

‣ Client credentials with client secret

The OAuth 2.0 client credentials grant with a client secret is a non-interactive (machine to machine) authentication. DatabricksConnector authenticates as a client application, rather than as an end-user, to obtain an access token.

🔑 Property source.authentication.credentials.clientSecret

The client secret.

required

The property is required if source.authentication.credentials.type is clientSecret.

DatabricksConnector uses the specified client secret to authenticate.

Example: Property source.authentication.credentials.clientSecret

services:
  MyService:
    type: DatabricksConnector
    source:
      authentication:
        method: oauth
        clientId: 6731de76-14a6-49ae-97bc-6eba6914391e
        credentials:
          type: clientSecret
          clientSecret: ${databricks.clientSecret}