
Set up Databricks Workspace

Webhooks provide an easy one-click experience for onboarding new jobs.
The Databricks workspace setup is a one-time setup for your organization. With the webhook tutorial below, all users within an organization will be able to:
  • Onboard new jobs onto Gradient with a single click through the Gradient UI
  • Onboard jobs at scale
  • Integrate Gradient without any modifications to your Databricks workflow tasks
Before you begin!
  • Ensure that you've created a Sync API key, since you'll need it here
  • Install the Sync CLI on your dev box using the instructions here
  • A user with admin access to your Databricks workspace is required to complete the steps below
  • Verify that your workspace allows outbound and inbound traffic from your Databricks clusters. The Gradient integration makes calls to AWS APIs and to Sync services hosted at https://api.synccomputing.com. IP whitelisting may be required.

Step 1: Configure the webhook

Prior to configuring the notification destination in the Databricks Workspace, we need to retrieve the webhook URL and credentials from the Gradient API. We can use the Sync CLI to do this.

1.1 Create new webhook credentials:

sync-cli workspaces reset-webhook-creds <workspace-id>
Your <workspace-id> is the "o" parameter in your Databricks URL.
Where to find the workspace-id parameter in your Databricks address
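If you'd rather extract the workspace ID programmatically, a minimal sketch is below. The example URL is hypothetical; substitute your own Databricks address.

```python
from urllib.parse import urlparse, parse_qs

def workspace_id_from_url(url: str) -> str:
    """Extract the "o" query parameter (the workspace ID) from a Databricks URL."""
    params = parse_qs(urlparse(url).query)
    return params["o"][0]

# Hypothetical Databricks address; yours will differ.
print(workspace_id_from_url("https://dbc-d85uga-1d40.cloud.databricks.com/?o=839284039492"))
```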
Example output:
sync-cli workspaces reset-webhook-creds 839284039492
{
  "username": "290s381e-8ep4-4d6a-84d4-433d84897fsc",
  "password": "jc0dUD8zd44Uwid26jGI",
  "url": "https://api.synccomputing.com/integrations/v1/databricks/notify"
}
The webhook credentials returned by this command cannot be retrieved again, so store them somewhere safe!
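Since the credentials are shown only once, you may want to capture the CLI's JSON output and save it to a file readable only by you. A minimal sketch, assuming the JSON shape matches the example output above (the credential values are placeholders):

```python
import json
import os

# Placeholder output captured from `sync-cli workspaces reset-webhook-creds`.
raw = """{
  "username": "290s381e-8ep4-4d6a-84d4-433d84897fsc",
  "password": "jc0dUD8zd44Uwid26jGI",
  "url": "https://api.synccomputing.com/integrations/v1/databricks/notify"
}"""

creds = json.loads(raw)

# Write the credentials to a file with owner-only permissions.
path = "gradient-webhook-creds.json"
with open(path, "w") as f:
    json.dump(creds, f, indent=2)
os.chmod(path, 0o600)

print(creds["url"])
```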

1.2 Create a new webhook destination.

With the webhook URL and credentials, a workspace admin can now create a webhook notification destination. In your Databricks console, go to Admin > Notification destinations > Add destination.
Set the following parameters in the UI:
  • Name: "Gradient"
  • Username: Use the "username" generated from the previous output
  • Password: Use the "password" generated from the previous output
  • URL: Use the "url" generated from the previous output
New notification destination to set up a webhook

Step 2: Create workspace configuration

Next, you need to configure your Databricks workspace with the webhook and Sync credentials:
Run the sync-cli command create-workspace-config
sync-cli workspaces create-workspace-config \
--databricks-plan-type <plan-type> \
--databricks-webhook-id <webhook-id> \
<workspace-id>
  • <plan-type> - Select between Standard, Premium, Enterprise
  • <webhook-id> - Go back to admin > Notification destinations and edit the "Gradient" webhook. Next to the "Edit destination settings" title, there's a copy button. Click it to copy the Webhook ID (see image below)
Where to find the webhook-id (note: you have to click the copy button)
Once the command is run, you will need to provide the CLI with the following information:
  1. Databricks host
  2. Databricks token
  3. Sync API key ID
  4. Sync API key secret
  5. AWS instance profile ARN (for Databricks on AWS only)
  6. Databricks plan type
  7. Webhook ID (same step as <webhook-id> above)
Example output:
% sync-cli workspaces create-workspace-config \
--instance-profile-arn arn:aws:iam::481126062844:instance-profile/sync-minimum-access \
--databricks-plan-type Enterprise \
--databricks-webhook-id 8bd3b048-e496-4u09-b9de-4e2298e117y6 \
656201176161048
Databricks host (prefix with https://) [https://dbc-d85uga-1d40.cloud.databricks.com]:
Databricks token:
Sync API key ID [SXbT6fduHB8FfPPy5psUdP5g7cS9SPm]:
Sync API key secret:
{
  "workspace_id": "3522015453188848",
  "databricks_host": "**********",
  "databricks_token": "**********",
  "sync_api_key_id": "**********",
  "sync_api_key_secret": "**********",
  "instance_profile_arn": "arn:aws:iam::123123565455:instance-profile/sync-minimum-access",
  "webhook_id": "7465b068-e490-4a87-b9ce-4e8740e123c6",
  "plan_type": "Standard"
}

Step 3: Integrate workspace

The next step is to download the code used to submit the Spark event logs to Gradient. Once again, we will use the CLI to perform the following tasks:
  1. Adds/updates the init script to the workspace “/Sync Computing” directory
  2. Adds/updates secrets used by the init script and the Sync reporting job
  3. Adds/updates the job run recording/reporting notebook to the workspace in the “/Sync Computing” directory
  4. Adds/updates the Databricks Secrets scope, "Sync Computing | <your Sync tenant id>", used by Gradient to store credentials and configurations
  5. Creates/updates a job with the name “Sync Computing: Record Job Run” that sends up the event log and cluster report for each prediction
  6. Creates/updates and pins an all-purpose cluster with the name “Sync Computing: Job Run Recording” for the prediction job
Run the command sync-cli workspaces apply-workspace-config <workspace-id>
Example Output
% sync-cli workspaces apply-workspace-config 789564875555745
Workspace synchronized

Step 4: Verify permissions to Gradient-generated artifacts

The final step is to ensure that all the newly created artifacts are accessible during job runs. By default, Databricks jobs run with the permissions of the job owner.
Therefore, you should ensure that the owner, directly or through group permissions, can access the following artifacts:

1. The “/Sync Computing” directory

You should be able to see and access the "Sync Computing" directory in your Workspace. See the screenshot below.
Workspace directory where the Sync Computing directory should be seen

2. The "Sync Computing | <your Sync tenant id>" secret scope

You should be able to see and have access to the "Sync Computing | <your Sync tenant id>" secret scope. Check if you can view the scope with the list-scopes command below:
databricks secrets list-scopes
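If your workspace has many scopes, a small filter over the listing can confirm the Gradient scope exists. A sketch, assuming you have captured the scope names from `databricks secrets list-scopes` into a list (the tenant ID below is hypothetical):

```python
def find_sync_scope(scope_names, tenant_id):
    """Return the Gradient secret scope for the given Sync tenant ID, or None."""
    expected = f"Sync Computing | {tenant_id}"
    return next((name for name in scope_names if name == expected), None)

# Hypothetical scope listing, e.g. parsed from `databricks secrets list-scopes` output.
scopes = ["my-team-scope", "Sync Computing | 1234-abcd", "legacy-scope"]
print(find_sync_scope(scopes, "1234-abcd"))  # -> Sync Computing | 1234-abcd
```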

3. The “Sync Computing: <your Sync tenant id> Job Run Recording” cluster

You should be able to see and run the "Sync Computing: <your Sync tenant id> Job Run Recording" cluster in the Databricks console under Compute > All-purpose Compute.
Gradient requires cloud permissions to access cluster information. An instance profile with the correct permissions is required. Please see "AWS additional steps" for instructions on how to create an appropriate instance profile.
The Job Run Recording cluster
Your workspace should now be configured to send logs using Databricks webhook notifications.
The "Sync Computing: Job Run Recording" cluster is created using the configuration below. If your workspace has any policies enabled that would restrict creation of this cluster, the setup process cannot proceed. In this case, please reach out to us at [email protected] for further assistance.
{
  "cluster_name": "Sync Computing | <sync-tenant-id>: Job Run Recording",
  "spark_version": "13.3.x-scala2.12",
  "aws_attributes": {
    "instance_profile_arn": "<your instance profile ARN>"
  },
  "node_type_id": "m4.large",
  "driver_node_type_id": "m4.large",
  "custom_tags": {
    "sync:tenant-id": "<sync-tenant-id>"
  },
  "spark_env_vars": {
    "DATABRICKS_HOST": "{{secrets/Sync Computing | <sync-tenant-id>/DATABRICKS_HOST}}",
    "DATABRICKS_TOKEN": "{{secrets/Sync Computing | <sync-tenant-id>/DATABRICKS_TOKEN}}",
    "SYNC_API_KEY_ID": "{{secrets/Sync Computing | <sync-tenant-id>/SYNC_API_KEY_ID}}",
    "SYNC_API_KEY_SECRET": "{{secrets/Sync Computing | <sync-tenant-id>/SYNC_API_KEY_SECRET}}",
    "SYNC_API_URL": "https://api.synccomputing.com"
  },
  "autotermination_minutes": 10,
  "enable_elastic_disk": false,
  "disk_spec": {
    "disk_type": {
      "ebs_volume_type": "GENERAL_PURPOSE_SSD"
    },
    "disk_count": 1,
    "disk_size": 32
  },
  "enable_local_disk_encryption": false,
  "data_security_mode": "NONE",
  "runtime_engine": "STANDARD",
  "num_workers": 0
}
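The {{secrets/<scope>/<key>}} placeholders in spark_env_vars are Databricks secret references that are resolved when the cluster starts. As a sanity check, you could verify that every secret reference in the config points at the expected scope. A sketch, using an abbreviated copy of the environment block above with a placeholder tenant ID:

```python
import re

def secret_refs(spark_env_vars):
    """Extract (scope, key) pairs from Databricks {{secrets/<scope>/<key>}} references."""
    refs = []
    for value in spark_env_vars.values():
        m = re.fullmatch(r"\{\{secrets/(.+)/([^/]+)\}\}", value)
        if m:
            refs.append((m.group(1), m.group(2)))
    return refs

# Abbreviated spark_env_vars with a hypothetical tenant ID.
env = {
    "DATABRICKS_HOST": "{{secrets/Sync Computing | 1234-abcd/DATABRICKS_HOST}}",
    "DATABRICKS_TOKEN": "{{secrets/Sync Computing | 1234-abcd/DATABRICKS_TOKEN}}",
    "SYNC_API_URL": "https://api.synccomputing.com",  # plain value, not a secret reference
}

for scope, key in secret_refs(env):
    assert scope == "Sync Computing | 1234-abcd", f"unexpected scope for {key}"
print(secret_refs(env))
```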

Step 5: Select your cloud provider below to complete the installation