External Orchestrator Integrations

Currently, Gradient is optimized to work with predefined Databricks Jobs and Databricks Workflows. However, the Sync Python Library lets you integrate Databricks pipelines that are orchestrated by third-party tools such as Airflow and Azure Data Factory.

Typically, these tools invoke the run_submit endpoint of the Databricks Jobs API to initiate one-time runs on Databricks-provisioned compute.

If your tool uses this invocation method, you can follow the pattern below to submit your event logs to Gradient for evaluation, generate a recommendation, and apply that recommendation to your next job run.

Retrieving a recommendation

To retrieve a recommendation, use the Sync Python Library to execute the following steps:

1) Initiate recommendation creation:

The Gradient create_project_recommendation function generates a recommendation for the specified project.

sync.api.projects.create_project_recommendation(project_id: str)
  • Parameters:

    • project_id (str, optional) – ID of the project to which the recommendation belongs (found on the project page in the Gradient UI); defaults to None

  • Returns: recommendation ID

  • Return type: Response[str]

2) Wait for the recommendation to complete:

The previous call is asynchronous, so the wait_for_recommendation function is used to wait for the recommendation to complete.

sync.api.projects.wait_for_recommendation(project_id: str, recommendation_id: str)
  • Parameters

    • project_id (str) – project ID

    • recommendation_id (str) – recommendation ID

  • Returns: recommendation object

  • Return type: Response[dict]
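
Putting the two calls together, a minimal sketch of the retrieval step might look like the following. The PROJECT_ID placeholder and the assumption that the library's Response wrapper exposes its payload via a result attribute are for illustration only:

from sync.api import projects

PROJECT_ID = "<your-gradient-project-id>"  # placeholder: shown on the Gradient UI project page

# Kick off recommendation generation (asynchronous);
# assumption: Response exposes its payload via .result
recommendation_id = projects.create_project_recommendation(PROJECT_ID).result

# Block until the recommendation is ready
recommendation = projects.wait_for_recommendation(PROJECT_ID, recommendation_id).result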

Perform these steps before calling the Databricks run_submit function, so that the retrieved recommendation can be applied to the cluster configuration you pass to run_submit.

Once the Databricks run_submit function is invoked with the new cluster configuration, a run ID is returned. Upon successful completion of the run, you are ready to submit the Spark event log to Gradient for analysis.
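
As an illustration, here is a hedged sketch of submitting a one-time run through the Databricks Jobs API (runs/submit) with the recommended configuration applied. The workspace host, token, notebook path, and the key layout of the recommendation body are assumptions for illustration only:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<databricks-personal-access-token>"            # placeholder

# Assumption: the recommendation body carries the recommended cluster
# settings; the exact key layout may differ in your library version.
recommended_cluster = recommendation["recommendation"]["configuration"]

payload = {
    "run_name": "gradient-optimized-run",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/path/to/notebook"},
            "new_cluster": recommended_cluster,
        }
    ],
}

response = requests.post(
    f"{HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
run_id = response.json()["run_id"]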

Submit Logs to Gradient

Using the Databricks run ID obtained via the run_submit function and the parameters described below, call the create_submission_for_run function from the Sync Python Library.

  • run_id – The one-time Databricks run ID returned by the run_submit function

  • plan_type – Your Databricks account tier ("Standard", "Enterprise", or "Premium")

  • compute_type – The Databricks compute type; currently only "Jobs Compute" is supported

  • project_id – The Gradient project ID. See Manual single job import for instructions on creating a project.

  • allow_incomplete_cluster_report – Set this to True to allow log submission with incomplete cluster information

For AWS Databricks

sync.api.awsdatabricks.create_submission_for_run(
        run_id,
        plan_type="Standard",          # or "Enterprise" / "Premium"
        compute_type="Jobs Compute",   # currently the only supported value
        project_id=project["id"],
        allow_incomplete_cluster_report=True)

For Azure Databricks

sync.api.azuredatabricks.create_submission_for_run(
        run_id,
        plan_type="Standard",          # or "Enterprise" / "Premium"
        compute_type="Jobs Compute",   # currently the only supported value
        project_id=project["id"],
        allow_incomplete_cluster_report=True)
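
Both variants return the library's Response wrapper. Assuming, as in the retrieval sketch above, that it exposes result and error fields, a basic post-submission check might look like:

from sync.api import awsdatabricks

submission = awsdatabricks.create_submission_for_run(
    run_id,
    plan_type="Standard",
    compute_type="Jobs Compute",
    project_id=PROJECT_ID,
    allow_incomplete_cluster_report=True,
)
# Assumption: a failed submission surfaces through the error field
if submission.error:
    raise RuntimeError(f"Gradient log submission failed: {submission.error}")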

When implementing this pattern in your non-Databricks-Workflows pipelines, use your orchestrator's standard mechanism for passing generated data (i.e., the recommendation ID and run ID) between tasks (e.g., Airflow XComs, pipeline variables, persist and read, etc.), as in the sketch below.
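
For example, with Airflow's TaskFlow API each task's return value is pushed to XCom automatically, so the recommendation ID and run ID flow between tasks without extra plumbing. The DAG below is an illustrative sketch, not a complete pipeline:

from pendulum import datetime

from airflow.decorators import dag, task

PROJECT_ID = "<your-gradient-project-id>"  # placeholder

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def gradient_run_submit_pipeline():

    @task
    def create_recommendation() -> str:
        from sync.api import projects
        # Returned value is pushed to XCom automatically
        return projects.create_project_recommendation(PROJECT_ID).result

    @task
    def run_job(recommendation_id: str) -> str:
        # Wait for the recommendation, apply it to the cluster
        # configuration, and call run_submit, returning the run ID
        # (see the sketches above); elided here for brevity.
        ...

    @task
    def submit_logs(run_id: str) -> None:
        from sync.api import awsdatabricks
        awsdatabricks.create_submission_for_run(
            run_id,
            plan_type="Standard",
            compute_type="Jobs Compute",
            project_id=PROJECT_ID,
            allow_incomplete_cluster_report=True,
        )

    submit_logs(run_job(create_recommendation()))

gradient_run_submit_pipeline()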

Follow the link(s) below for tool-specific examples:

Apache Airflow for Databricks
