Gradient requires a one-time setup for your Databricks workspace. The workspace integration will enable both monitoring and optimization capabilities.
In the steps below, we'll walk you through creating a Sync account, creating your Sync API Keys, and then adding your Databricks workspace to the Gradient platform all through the UI.
Gradient is an infrastructure management and optimization platform that continuously monitors and learns users' Databricks Jobs so that it can optimize the data infrastructure to hit cost and runtime goals. It supports both co-pilot and autopilot modes. Use it as a co-pilot to receive passive recommendations for optimizations you can apply in a click, or enable auto-apply for optimization at scale.
Gradient uses a closed-loop feedback system to automatically build custom tuned machine learning models for each Databricks Job it is managing, using historical run logs. Through this mechanism, Gradient continuously drives Databricks Jobs cluster configurations to hit user defined business goals, such as maximum costs and runtimes.
Managing and optimizing Databricks Job clusters is tedious, time intensive, and difficult for data and platform engineers; there are far too many Spark configurations and infrastructure choices to know what's right and what makes sense. Additionally, just when they've gone through the effort of optimizing a job, something changes that wipes away all that hard work.
To make matters worse, changing infrastructure incorrectly can also crash jobs with out-of-memory errors, a major risk to production pipelines that often blocks engineers from optimizing in the first place.
If engineers do try to manage clusters themselves, it takes time away from delivering new products and features. And managing at scale, where hundreds or thousands of jobs are running, is simply not feasible for a team of any size.
Gradient provides data teams with an easy and scalable solution that significantly reduces engineering time spent on cluster optimization, while cutting costs and improving runtimes. It can even automatically manage clusters for all of your jobs - with no code changes.
Data Engineers - Avoid spending time tuning and optimizing clusters while still achieving optimal cost and runtime performance.
Data Platform Managers - Ensure your team's Databricks Jobs are achieving high level business objectives without having to bug your engineers or change any code. This becomes particularly important for teams who are looking to scale their Databricks usage.
VP of Engineering / CTOs - Gradient works for you, not the cloud providers. It was built to help you efficiently produce data products that meet your business goals.
Find your top jobs to optimize and discover new opportunities to improve your efficiency even further. This page is refreshed daily so you always get up-to-date insights and historical tracking.
After logging into Gradient, click on the "Discover" link on the left hand navigation. Click on the "Add Workspace" button to bring up the credentials prompt, as seen below. You will need to enter the following information:
Databricks Workspace ID - Found in your Databricks URL in the browser address bar as the "o" parameter, e.g. "o=9172567527460388"; in this case you would enter the number "9172567527460388".
Databricks Host - Can be found in the address bar of your web browser when at your Databricks workspace. It should look like this: https://dbc-6c213588-2400.cloud.databricks.com/
Compute Provider - Select which cloud your Databricks workspace is run on
That's it! You're done!
The Discover page displays various pieces of information gathered from your workspace, as explained below. Each "Finding" widget on the right is clickable and will filter the jobs list based on the parameters of the finding.
Top jobs to optimize with Gradient
Jobs with Photon enabled
Jobs with Autoscaling enabled
All purpose compute jobs
Jobs submitted from run-submit (meaning they could come from an external orchestrator like Airflow)
Timeline of total number of jobs
Timeline of total core hours (proportional to your costs)
Databricks Token - You will need Databricks admin access to generate a personal access token for your Databricks workspace. Copy and paste your token value here. We recommend leaving the "Lifetime" field blank so the token does not expire and interrupt any service down the line.
Identify the jobs you want to import into Gradient to start automatically optimizing their clusters. If this is your first time setting up Gradient, please complete your workspace setup first. If you've already onboarded a workspace, proceed directly to importing your jobs.
Gradient uses a proprietary machine learning algorithm trained on historical event log information to find the best configurations for your job; the algorithm creates and maintains a custom ML model for each job.
The algorithm has two phases:
Learning phase: Gradient will test your job performance against a few different configurations to understand how your job responds in terms of cost and runtime.
Optimizing phase: Once the learning phase is complete, Gradient will use the model built internally to drive the job cluster to more optimal configurations given the SLA requirements of the user. Even when optimizing, Gradient will continuously learn from each job run and improve the model for the job.
Projects are how Gradient organizes your Databricks Jobs and enables continuous optimization. All job runs and optimization recommendations for the job are available under the associated Project to give you a holistic picture of how the job is performing. Additionally, Projects help unlock more optimization potential and key features which may be important for your infrastructure via Project Settings.
Each Project is continually updated with the most recent recommendation provided by Gradient, allowing you to review cost and runtime metrics over time for the configured Spark workload.
Timeline Visualizations - Monitor your jobs' cost and runtime metrics over time to understand behaviors and watch for anomalies due to code changes, data size changes, spot interruptions, or other causes.
Metrics Visualizations - Easily view spark metrics for your runs and correlate that with Gradient's recommendations, all visualized beautifully in a single pane.
Auto-apply Recommendation - Recommendations can be automatically applied to your jobs after each run for a "set and forget" experience.
AWS and Azure support - Granular cost and cluster metrics are gathered from popular cloud providers.
Auto Databricks jobs import & setup - Provide your Databricks host and token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient.
You set max runtime, Gradient minimizes costs - Simply set your ideal max runtime SLA (service level agreement) and we’ll configure the cluster to hit your goals at the lowest cost.
Aggregated cost metrics - Gradient conveniently combines both Databricks DBU costs and Cloud costs to give you a complete picture of your spend.
Custom integration with Sync CLI - Sync CLI and APIs can be used to support custom integration with users' environments.
Databricks autoscaling optimization - Optimize your min and max workers for your job. It turns out autoscaling parameters are just another set of numbers that need tuning. Check out our previous blog post.
EBS recommendations - Optimize your AWS EBS using recommendations provided by Gradient and save on costs.
Gradient is a SaaS platform that remotely monitors and applies recommendations to users' Databricks clusters at each job start. The high-level closed-loop flow of information across user and Gradient environments is shown below.
See our FAQ for more detailed information into what information is sent and collected on the Gradient side.
Databricks workspace setup for Azure users.
This doc shows you how to register an app in Microsoft Entra. After you complete the steps below, you will have the Subscription Id, Client Id, Tenant Id, and Client Secret that you need to create an Azure Databricks Workspace integration in Gradient.
To grant Gradient permission to collect the data needed to generate a recommendation, you can register an app using Microsoft Entra.
Entra, the Microsoft identity platform, performs identity and access management (IAM) only for registered applications. The next few steps will walk you through registering Gradient in order to grant it read access to Azure Databricks resources.
Name the app and choose the appropriate account type. (For most organizations, this is set to single tenant for the current directory)
Write down or save the values for Client ID and Tenant ID.
A client secret is a string value the Azure application uses for authentication in place of a certificate. Gradient will use this secret to retrieve the cluster information needed to make a recommendation.
To create a secret, click on Certificates & Secrets > + New Client Secret
Record the value of the secret as it is required by the log submission process.
The app requires read access to the subscription. The easiest way to provide access is to assign the Reader role to the app. You can refer to the full documentation here.
From your Subscription settings, click on the Access control (IAM) link on the side tab. From here, add a new role assignment.
You can find the Reader role by searching or scrolling through your list of roles. Once you find it, click on the role.
From the members tab, add the app you created in that last section by clicking the + select members link and searching for your app by name.
Complete the assignment by verifying your settings and clicking the Review + assign button.
Gradient requires four pieces of information for access:
Subscription ID - This can be copied from the Subscriptions service in the Azure console.
Tenant ID - This can be copied from the Active Directory overview page in the Azure console.
Client ID - The ID of the Active Directory principal with which the library should authenticate. This can be copied from the Active Directory “App registrations” page in the Azure console.
Client secret - This can be generated from the “Certificates & secrets” page for the app.
Once entered, click on "Save & Test Access" to proceed in the Gradient UI.
Click on the link below to proceed to the webhook setup.
For guided assistance, please reach out to
For more advanced use cases beyond the quick-start, see the
Install the Sync CLI on your local machine to help set up the rest of the Gradient installation process. The Sync CLI also enables other advanced functions with Gradient.
Network restrictions: If your company restricts non-whitelisted external IP addresses from your Databricks clusters, be sure to request permission to access: https://api.synccomputing.com
Linux machine: The Sync CLI is best run on a Linux machine.
Start by making sure your environment meets all the prerequisites. The Gradient CLI is part of the Sync Library, which requires Python v3.7 or above and runs only on Linux/Unix-based systems.
Creating a virtual environment is good practice whenever you install a new Python tool, as it avoids conflicts between projects and makes environment management simpler.
Here, we will create a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.
Activating your new virtual environment.
Next use the pip package installer to install the latest version of the Sync Library.
You can confirm that the installation was successful by checking the CLI executable's version with the --version or --help options.
Configuring the CLI with your credentials and preferences is the final step for the installation and setup for the Sync CLI. To do this, run the configure command:
You will be prompted for the following values:
The Sync Python library provides developers with the fundamental tools to integrate the Sync continuous optimization solution in a number of ways:
Apache Airflow Integration - A guide to help integrate Gradient with Apache Airflow for Databricks
Projects are how Gradient continuously optimizes and monitors repeated production Databricks workloads. To implement Projects, the Sync Library must be integrated into your orchestration system (e.g. Airflow, Databricks Workflows).
Once integrated, the Gradient UI will provide high-level metrics and easy-to-use controls to monitor and manage your Apache Spark clusters.
Use the Databricks Auto Import wizard to easily create multiple projects, each linked to a Databricks Job in your workspace.
The Auto Import wizard connects to your specified Databricks workspace using a Databricks Personal Access Token obtained during the Add Databricks Workspace step.
NOTICE: The import wizard will make the following changes to your selected Databricks Jobs:
Add the web-hook notification destination to the job so that Gradient is notified on every successful run
Update the job cluster with the init script, env vars, and instance profile to collect worker instance and volume information.
Review the compatible Databricks jobs, select the jobs for which you would like to create a Gradient project, and click Create Projects. By creating a project, the following properties will be added for each of the selected jobs.
You should now see the project(s) you created on your Projects summary dashboard. New projects will have a status of "Pending Setup" until the project is configured to receive logs for recommendations.
An API key is required to programmatically interact with Gradient using our library, our CLI, or the Gradient REST API. You can create an API key after your Sync account has been created and authorized.
Go to the Gradient application and log in.
Navigate to the Org Settings tab and click on the Generate Personal Key button to generate your API key. You can always come back to the Org Settings tab to view your API keys.
To ensure a successful operation of Gradient, there are a few final steps and checks that will depend on how you currently configure your jobs.
After these steps, you should have a successful first run of Gradient. If you need any help, feel free to reach out to us via intercom or email us at support@synccomputing.com.
For the Databricks Job you want to optimize, go to the Databricks console Job's page.
Instance profiles: Users must have an instance profile correctly configured with access to AWS's S3 and describe cluster functions. See the AWS additional steps instructions for more information.
Enable Logging: Logging must be enabled for your existing job, either DBFS or S3 log location is fine
The following items should have been automatically completed via the job import step. These are just steps to verify the setup completed correctly. Ideally, no further action is required.
Webhook notification: Ensure that the job has webhooks enabled.
Spark environmental variables: Ensure the Spark environmental variables are populated with secrets.
Cluster init scripts: Ensure the Sync init script is selected
Your Databricks job should be fully connected to Gradient. Simply run your job as you normally would via the Databricks UI and click on "Run"
After your job is completed, a secondary "Record Job Run" job will be started which will automatically collect the logs generated and transmit them to Gradient.
The "Record Job Run" job can take 10-15 minutes to complete.
After the completion of the "Record Job Run" job, head over to the Gradient UI, where you should see the first datapoint populated:
The Databricks workspace setup is a one-time setup for your organization. With the webhook tutorial below, all users within an organization will be able to:
Onboard new jobs onto Gradient with a single click through the Gradient UI
Onboard jobs at mass scale
Integrate Gradient without any modifications to your Databricks workflows tasks
Setting up a Databricks Workspace integration involves three steps:
Giving Gradient access to your Databricks Workspace (such as Databricks host, token, and other details)
Giving Gradient access to your cloud provider to fetch metadata on compute infra (such as EC2 instances, EBS volumes, etc)
Configuring a Databricks Webhook to notify Gradient about your job start and stop events
This doc covers the first step. At the bottom of this doc is a link to the doc that will help you complete the second step.
In the integrations page, click on "ADD" to see the "Add Databricks Workspace" console.
We need to know how to connect to your Databricks Workspace. Provide details of your Databricks Workspace and choose the Sync API Key to use with this workspace integration.
Databricks Workspace ID - Found in your Databricks URL in the browser address bar as the "o" parameter, e.g. "o=9172567527460388"; in this case you would enter the number "9172567527460388".
Databricks Host - Can be found in the address bar of your web browser when at your Databricks workspace. It should look like this: https://dbc-6c213588-2400.cloud.databricks.com/
Databricks Token - You will need Databricks admin access to generate a personal access token for your Databricks workspace. Copy and paste your token value here. We recommend setting the "Lifetime" field blank so the token does not expire and interrupt any service down the line.
Sync API Key - Select the Sync API key from the drop down menu. If you haven't created one yet, you can create one here.
Databricks Plan Type - Select your plan type which will impact the pricing used to calculate your Databricks costs.
We need to know how to get logs and collect data for your job runs.
Select one of the options below to continue setting up your Databricks Workspace integration depending on your cloud provider.
With your first datapoint in the Gradient UI, you are now ready to generate your first recommendation and apply it.
The Generate button will create a new recommendation based on the logs submitted from your last successful job run. If this is your first recommendation, your Gradient status will be "learning", meaning Gradient will train an internal model based on a few test runs of your job.
On the right side of the Gradient UI, click on the "Apply" button to automatically update your Databricks job with the recommendation.
Go back to the Databricks console and click on the "run" button for the job being optimized. The Gradient UI should then be populated with its 2nd data point.
To avoid manually applying recommendations, you can also enable Auto-Apply in the "Edit settings" button in the Gradient project page.
If this option is enabled, recommendations will be automatically applied after each run of your job.
Click on the slider to enable Auto-Apply Recommendation. A warning page will pop up to verify this feature. Click on Save.
The Gradient Agent needs AWS access to retrieve instance market information during job execution. To access this information, Gradient uses Boto3 which will leverage permissions granted through the cluster's instance profile. See Example AWS Profile below for required permissions.
Gradient reads and writes logs to the storage path defined in the cluster delivery configuration. If the logs are configured to be delivered to an S3 location, the cluster instance profile must have permission to read and write data to the S3 destination and it must include putObjectAcl permission.
In your AWS console, go to IAM > Roles and click on Create role
Select AWS service
as the entity type and EC2 as the service
Gradient does not need any additional permissions at this point. Implement any default permissions you may need. If none are needed, click next.
Insert a name for the role. Below we use sync-minimum-access. Click on Create role once completed.
Click into the Role you just created, and under Permissions, click on Add permission > create inline policy
Click on the JSON editor
Copy and paste the code block below into the JSON policy editor.
Be sure to update <your-s3-bucket-path> to be the same S3 bucket path where you store your Databricks logs (screenshot from the Databricks cluster).
Click on Next. On the next page click on Create Policy.
In the Databricks admin page, go to Instance profiles and click on "Add instance profile"
On the next page copy and paste the "Instance profile ARN" and "IAM role ARN" values from the AWS console Role's page. Click "add" to complete.
Done! You should now be able to select this instance profile in the cluster page of your jobs
Webhooks provide an easy 1-click experience to on-board new jobs
The Databricks workspace setup is a one-time setup for your organization. With the webhook tutorial below, all users within an organization will be able to:
Onboard new jobs onto Gradient with a single click through the Gradient UI
Onboard jobs at mass scale
Integrate Gradient without any modifications to your Databricks workflows tasks
Before you begin!
Ensure that you've created a Sync API Key since you'll need that here
A user with admin access to your Databricks workspace is required to complete the steps below
Verify your workspace allows outbound and inbound traffic from your Databricks clusters. The Gradient integration process makes calls to AWS APIs and Sync services hosted at https://api.synccomputing.com. IP Whitelisting may be required.
Prior to configuring the notification destination in the Databricks Workspace, we need to retrieve the webhook URL and credentials from the Gradient API. We can use the Sync CLI to do this.
Your <workspace-id> is the "o" parameter on your Databricks URL
Example output:
With the webhook URL and credentials, a workspace admin can now create a webhook notification destination. In your Databricks console go to admin > notification destinations > add destination
Set the following parameters in the UI:
Name: "Gradient"
Username: Use the "username" generated from the previous output
Password: Use the "password" generated from the previous output
URL: Use the "url" generated from the previous output
Next, you need to configure your Databricks workspace with the webhook and Sync credentials:
Run the sync-cli command create-workspace-config
<plan-type> - Select between Standard, Premium, and Enterprise
<webhook-id> - Go back to admin > Notification destinations and edit the "Gradient" webhook. Next to the "Edit destination settings" title, there's a copy button. Click it to copy the Webhook ID (see image below)
Once the command is run, you will need to provide the CLI with the following information:
Databricks host
Databricks token
Sync API key ID
Sync API key secret
AWS instance profile ARN (for Databricks on AWS only. See AWS Instance Profile)
Databricks plan type
Webhook ID (same step as <webhook-id> above)
Example output:
The next step is to download the code used to submit the Spark event logs to Gradient. Once again, we will use the CLI to perform the following tasks:
Adds/updates the init script to the workspace “/Sync Computing” directory
Adds/updates secrets used by the init script and the Sync reporting job
Adds/updates the job run recording/reporting notebook to the workspace in the “/Sync Computing” directory
Adds/updates the Databricks Secrets scope, "Sync Computing | <your Sync tenant id>", used by Gradient to store credentials and configurations
Creates/updates a job with the name “Sync Computing: Record Job Run” that sends up the event log and cluster report for each prediction
Creates/updates and pins an all-purpose cluster with the name “Sync Computing: Job Run Recording” for the prediction job
Run the command sync-cli workspaces apply-workspace-config <workspace-id>
Example Output
The final step is to ensure that all the newly created artifacts are accessible during job runs. By default Databricks jobs have the permissions of the job owner.
Therefore, you should ensure that the owner, directly or through group permissions, can access the following artifacts:
You should be able to see and access the "Sync Computing" directory in your Workspace. See the screenshot below.
You should be able to see and have access to the "Sync Computing | <your Sync tenant id>" secret scope. Check if you can view the scope with the list-scopes command below:
You should be able to see and run the "Sync Computing: <your Sync tenant id> Job Run Recording" cluster in the Databricks console under Compute > All-purpose Compute.
Your workspace should now be configured to send logs using Databricks web-hook notifications.
In this guide we are going to set up your AWS EventBridge to send EC2-related events to Gradient.
Gradient can use AWS EventBridge to accurately monitor your clusters at scale. It uses AWS EventBridge to capture events related to EC2 state changes and tag changes on a resource. These events look like:
To enable these events, you will need:
EventBridge Rules in the AWS console. You should have a default bus there already:
The Gradient workspace integration modal with the custom event pattern for EventBridge:
Unless specified you can use the default values from the AWS Console.
Start with clicking Create Rule in the AWS Console and let's go:
The only important settings here are Rule with an event pattern and your default bus, then click Next:
Under Event Source, select AWS events or EventBridge partner events, then scroll down to Creation Method:
Under Creation Method, select Custom pattern (JSON editor). You will need to copy the Event Pattern from the Gradient modal (from Step 2) and paste it under Event Pattern in the AWS console. Then click Next:
Select EventBridge event bus and Event bus in a different account or Region, copy the bus ARN from the Gradient modal (from Step 3) and paste it under Event bus as target in the AWS console.
Click Skip to Review and create. You can add tags later if needed.
On this page you can review the rule settings entered previously, when done click Create rule to finish setting up the rule:
Once the rule is created go to the Gradient modal and complete the Databricks Workspace integration by clicking the Add button (or the Update button if you are editing an existing Databricks Workspace integration).
In the next window, use the values generated to create a new webhook destination and retrieve the required "Webhook ID" value.
In your Databricks console go to settings > workspace admin > notifications. On the notifications page go to notification destinations > manage > add destination. Set the following parameters in the "Create a new destination" panel in the Databricks console:
Name: "Sync Computing"
Username: Copy the "username" generated above
Password: Copy the "password" generated above
URL: Copy the "url" generated above
After your webhook destination is generated, click to re-open it and copy the webhook ID below.
Paste the webhook ID into the "Add Databricks Workspace" panel in the Gradient platform. Click "Add" to complete the setup!
Cloud Provider - Select AWS as your cloud provider
AWS Region - This is necessary only if you specify AWS as your cloud provider.
Logs and Data Collection - Choose how you want to provide logs to Gradient. We recommend Sync-Hosted collection which manages collecting logs with just a few configurations on your end. Self-Hosted requires additional set up on your end.
Monitoring Type - Choose how you want Gradient to monitor your Databricks clusters.
For the recommended Sync-Hosted collection method, AWS IAM roles and permissions must be set up to complete the rest of the workspace integration, as seen in the screenshot below.
Monitoring Type EventBridge Rule is currently available under Private Preview.
We recommend using EventBridge Rule monitoring for monitoring your Databricks clusters. Only available if you also pick Sync-Hosted collection.
Copy and paste the JSON into a new AWS IAM policy permission, as seen in the example screenshot below. Give the policy a name, such as sync-external-access
Create a new AWS IAM role with the "Custom trust policy" trust entity and paste the JSON in the policy field, as seen in the example screenshot below. Give the role a name, such as sync-external-user-role
In the next step of the AWS IAM role creation, add the permission created previously (in the example above, sync-external-access) to the new AWS IAM role. Example screenshot below:
Give the new IAM role a name, such as sync-external-user-role, and create the new role.
Go back to the AWS IAM role you just created (in the previous example, sync-external-user-role), copy the ARN, and paste it into the last field in the Gradient dialog box.
Once entered, click on "Save & Test Access" to proceed in the Gradient UI.
If you picked EventBridge Rule monitoring continue with the setup on this page:
If you picked Webhook monitoring continue with the setup on this page:
Gradient calculates Return on Investment (ROI) using sophisticated metrics that adapt to your workload characteristics. This documentation explains how Gradient determines and reports ROI, including our methodology for choosing the most appropriate metrics for different scenarios.
Gradient reports two key ROI metrics:
Savings to Date: Actual savings achieved through Gradient optimization
Projected 12 Month Savings: Estimated savings over the next year based on current patterns
Gradient uses two different approaches to calculate ROI, choosing the most appropriate one based on your workload characteristics:
Cost Change Percentage: Direct comparison of costs before and after Gradient
Cost per Gigabyte (Cost/GB) Change Percentage: Normalized metric that accounts for varying data sizes
Gradient automatically selects the most appropriate metric using the following logic (a sketch of this selection follows the list below):
First, we compute the correlation between input data size and runtime (Pearson correlation coefficient)
Then we use the correlation coefficient to select the appropriate metric between "Cost change %" and "Cost/GB change %"
If correlation ≥ 0.7 (strong correlation)
We use the maximum value between "Cost change %" and "Cost/GB change %"
If correlation < 0.7
We prefer to use "Cost change %"
However, if costs have increased due to increased data size then we fall back to "Cost/GB change %"
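As a rough, non-authoritative sketch of the selection logic above (not Gradient's actual implementation), the snippet below assumes per-run lists of input sizes and runtimes plus the two pre-computed percentages; the helper name and the simple "data size grew" check are illustrative assumptions.

```python
from statistics import correlation, mean  # Pearson correlation requires Python 3.10+


def select_roi_metric(input_gb, runtimes, cost_change_pct, cost_per_gb_change_pct):
    """Pick the reported ROI metric using the rules described above.

    input_gb / runtimes: per-run input data sizes (GB) and runtimes for the project.
    cost_change_pct / cost_per_gb_change_pct: the two pre-computed candidate metrics,
    where a positive value means costs went up.
    """
    corr = correlation(input_gb, runtimes)  # strength of the data-size vs. runtime relationship
    if corr >= 0.7:
        # Strong correlation: use the maximum of the two percentages
        return max(cost_change_pct, cost_per_gb_change_pct)

    # Weak correlation: prefer the plain cost change...
    half = len(input_gb) // 2
    data_size_grew = mean(input_gb[half:]) > mean(input_gb[:half])  # crude growth check
    if cost_change_pct > 0 and data_size_grew:
        # ...unless costs rose because the job is simply processing more data
        return cost_per_gb_change_pct
    return cost_change_pct


# Example: runtimes track data size closely, so the larger of the two metrics is reported
print(select_roi_metric([10, 20, 40, 80], [5, 9, 21, 44], -12.0, -18.0))
```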
The Cost/GB metric is particularly useful when:
Your data size varies significantly between runs
Overall costs are increasing due to larger data volumes
You need to measure efficiency improvements independently of data size
Think of Cost/GB like a car's miles per gallon (MPG): Even if you're driving more miles (processing more data), you can still measure if you're using fuel (resources) more efficiently. A lower Cost/GB indicates better efficiency, even if total costs are higher.
Gradient calculates ROI at two levels:
Project Level: Using the formulas above for individual workloads
Organization Level: Aggregating savings across all projects
Consider Data Size Variations
Monitor both cost changes and Cost/GB metrics
Understand which metric Gradient is using for your workload
Look for efficiency improvements even when total costs increase
Review Correlation Metrics
Understand how your workload's runtime correlates with data size
This helps explain which ROI calculation method Gradient is using
Monitor Trends
Track both immediate savings and projected annual savings
Consider seasonal patterns in your workload frequency
Review historical trends to understand optimization impact
Cluster costs for a Spark application run can vary due to autoscaling, Spot interruptions, and Spot fallback to On-demand.
During the Optimizing phase of a project, Gradient displays recommendation costs as a range to help you understand the variable cost of your Spark application.
When a Databricks cluster is running there are two main sources of charges:
Cloud provider charges are dominated by the instance rental and storage costs. Each unique cloud resource (e.g. an EC2 instance or an EBS volume) has a certain charge rate [usd/hr], and the cost of each resource is the charge rate multiplied by the rental duration.
Databricks platform charges. Just like the cloud provider, Databricks has a charge rate per instance, which is a function of the instance type, your Databricks Plan type, and the Databricks runtime engine (STANDARD or PHOTON). The instance rental is the only cloud resource that has an associated Databricks charge – storage comes for free!
The charge for a cluster can be estimated by adding together the individual costs of each resource. We use the list price for the charge rate of each resource, and resource durations are gathered by attaching an init_script to your cluster which periodically polls for the resources that compose it.
At the end of your job, that data is shipped to us where we can formulate a robust timeline of when each resource was added and removed from the cluster.
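As a minimal sketch of that arithmetic (not Sync's actual cost model), each resource contributes its charge rate multiplied by its rental duration, with instances also carrying a Databricks rate; the field names and example rates below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    """One cloud resource observed on the cluster (an instance or a storage volume)."""
    cloud_rate_usd_per_hr: float   # list-price cloud charge rate for the resource
    dbu_rate_usd_per_hr: float     # Databricks charge rate (0 for storage volumes)
    duration_hr: float             # time between the resource joining and leaving the cluster


def estimate_cluster_cost(resources):
    """Estimated cost = sum over resources of (cloud rate + Databricks rate) * duration."""
    return sum(
        (r.cloud_rate_usd_per_hr + r.dbu_rate_usd_per_hr) * r.duration_hr
        for r in resources
    )


# Example: two workers for 1.5 hours plus an EBS volume (no Databricks charge on storage)
cluster = [
    ResourceUsage(cloud_rate_usd_per_hr=0.40, dbu_rate_usd_per_hr=0.22, duration_hr=1.5),
    ResourceUsage(cloud_rate_usd_per_hr=0.40, dbu_rate_usd_per_hr=0.22, duration_hr=1.5),
    ResourceUsage(cloud_rate_usd_per_hr=0.01, dbu_rate_usd_per_hr=0.00, duration_hr=1.5),
]
print(f"Estimated cluster cost: ${estimate_cluster_cost(cluster):.2f}")
```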
The dominant source of error in Sync’s cost estimate comes from the resource duration. Sync only polls the cluster resources during the times for which the init_script is active. However, as depicted in the figure below, there is some time before the init script runs where charges still accumulate.
We have found from studying our own internal jobs that the unaccounted time tends to be about 2-3 minutes per instance. For the shortest clusters, this error will result in our estimate being well below what you see in your billing. However, as the cluster duration gets longer the relative error diminishes and we expect the values to match quite well.
We understand the importance of our estimates reflecting what you see in your billing, even for the shortest clusters. With that in mind, we will always strive to improve the accuracy of our estimates for all clusters, and we will keep our users informed of future improvements as they come.
From the Spark Eventlog
Event timestamps to estimate application start & end time
Various cluster configurations are checked
Some cloud provider info, such as region
From the cluster report (init script output)
AWS API “describe instances” responses to get a record of the cluster instance composition
AWS API “describe volumes” responses to get a record of the cluster storage composition
Databricks job, cluster, and task information (all from databricks api calls). Things like instance types, cluster id, and more are gathered from here.
The above data is combined to produce outputs like runtime and cost estimates.
During the learning phase, Gradient will test out a few different configurations to understand how your job responds in terms of cost and runtime. Because of this, costs may momentarily increase.
Gradient uses your Databricks token to access and integrate with your Jobs so that tracking and updating cluster configurations can be done entirely through the UI, making users' lives much easier.
Running extra training steps outside of your normal workflow will increase costs via those few extra job runs. For an initial proof-of-concept, this is a risk-free way of trying out Gradient.
Gradient optimizes clusters based on the actual code and data of your job. If your DEV environment's workloads are an exact clone of your PROD workloads, then yes Gradient will work.
Users with highly sensitive and tight SLA driven PROD workloads typically prefer to run Gradient in a cloned DEV environment.
If your DEV environment's workloads are different than your PROD environment (e.g. uses a smaller data size, or different code) then running Gradient in DEV will only optimize the cluster for your DEV workloads which likely would not transfer to your PROD workloads.
Training and testing recommendations in a DEV environment adds additional overall cost, since you have to pay for the test job runs themselves. This will eat into your overall ROI gains with Gradient.
If users allow Gradient to "learn" while in production, you will utilize job runs you have to run anyway. This significantly reduces the cost overhead of optimization and dramatically increases your overall ROI.
During the learning phase, Gradient will try different configurations to help characterize your job which could result in fluctuations in cost and runtime.
In your Databricks console, navigate to the webhook that you created. This is under Admin Settings -> Notifications -> Manage button for Notification destinations.
Next to the "Edit destination settings" title, there's a copy button. Click it to copy the Webhook ID (see image below)
Gradient needs to collect cluster and event logs for your job runs. There are two ways that Gradient can do this automatically:
[Recommended] Sync Hosted: the infrastructure used to collect logs is hosted in Sync's environment. When Gradient receives a notification that your job run has started, then it uses this infrastructure to monitor and fetch (or pull) logs, from your environment, to Sync.
Self Hosted: the infrastructure used to collect logs is hosted in your environment. When Gradient receives a notification that your job has completed, then it uses this infrastructure to send (or push) logs, from your environment, to Sync.
Self Hosted collection uses a Sync-provisioned all-purpose cluster in your Databricks environment to perform log collection once a job run completes. The all-purpose cluster runs a notebook that utilizes a Sync Gradient Python library to send the logs to Gradient.
Sync Hosted collection is much simpler and doesn't have the overhead of Self Hosted collection. Instead, it requires a few cloud permissions in order for Gradient to collect information about the cloud resources used to run a job. The infrastructure and code to perform log collection is entirely within Sync's environment.
Running Spark on Spot instances carries complicated risks. When Spot instances are revoked randomly, the impact on runtime can be dramatic and unpredictable. Even a single worker being pulled can offset the cost advantage of Spot instances. We have found that, sometimes, an optimized and reliable on-demand cluster can be cheaper than using Spot instances, which can be counterintuitive to many users.
Integrating Gradient into your Terraform process typically involves the following steps:
Include Workspace and Job Configuration in your Terraform Plan
Configure Terraform to ignore recommendation fields when detecting drift
Let Gradient “auto-apply” recommendations directly to your Databricks Job via Databricks API
Gradient utilizes Databricks webhook notification destinations to be notified upon the start of managed Databricks Jobs. Each notification destination should be incorporated into your infrastructure management process to maintain Gradient configuration within your Databricks workspace definition. See example.
Additionally, each workflow cluster being managed by Gradient should reference this webhook.
If you are using terraform plan to detect configuration drift of resources created by Terraform, we recommend using one of the following methods to omit Databricks Job cluster configurations generated by Gradient. This prevents the most recent cluster configuration from being overwritten by Terraform.
Specifying 'ignore_changes = all' under the 'lifecycle' definition of the entire cluster configuration will result in the entire cluster configuration being ignored by the drift detection process.
Explicitly specifying which configurations to ignore allows configurations not managed by Gradient to be evaluated by the drift detection process. However, it is important to note that these configurations may change as new features are added to Gradient.
This function returns a Python dictionary containing the recommended cluster configuration for the project. Parse and persist this data in the format required by your infrastructure management process.
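The retrieval function itself lives in the Sync Library (linked above). As a hypothetical illustration of the "parse and persist" step only, the sketch below assumes the returned dictionary exposes a cluster section and writes a few fields to a Terraform variables file; the dictionary keys and file layout are assumptions, not the library's actual schema.

```python
import json


def persist_recommendation(recommendation, path="gradient.auto.tfvars.json"):
    """Persist selected cluster fields from a Gradient recommendation for Terraform to consume.

    `recommendation` is the Python dictionary described above. The keys used here
    ("configuration", "new_cluster", "node_type_id", "num_workers") are illustrative placeholders.
    """
    cluster = recommendation.get("configuration", {}).get("new_cluster", {})
    tfvars = {
        "driver_node_type_id": cluster.get("driver_node_type_id"),
        "node_type_id": cluster.get("node_type_id"),
        "num_workers": cluster.get("num_workers"),
    }
    with open(path, "w") as f:
        json.dump(tfvars, f, indent=2)  # referenced as variable values in your Terraform plan
```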
To avoid manually applying recommendations, you can also enable Auto-Apply in the "Edit settings" button in the Gradient project page. If this option is enabled, recommendations will be automatically applied after each run of your job.
Currently, Gradient is optimized to work with predefined Databricks Jobs and Databricks Workflows. However, the Sync Python Library allows you to integrate your Databricks pipelines when using 3rd party tools like Airflow and Azure Data Factory. Typically, these tools use run_submit() or DatabricksSubmitRunOperator(), found in Databricks' APIs and Databricks' Airflow provider, to initiate runs using Databricks.
If you are using a tool that uses this job invocation method, you can follow the pattern below to submit your event logs to Gradient for evaluation, generate a recommendation, and apply that recommendation to your next job run.
These instructions guide you through the Gradient integration for Airflow DAGs containing DatabricksSubmitRunOperator() tasks through the use of a pre-execute hook.
The pre-execute hook for DatabricksSubmitRunOperator() creates/fetches the relevant project. It then retrieves a recommendation from Gradient for an optimized cluster config for this project. The recommendation overrides the cluster config being passed to DatabricksSubmitRunOperator(). The task then runs with this optimized cluster config instead of the original untuned cluster config.
Decide DAG parameters:
App ID - An App ID is a human readable unique identifier supplied with each Databricks Job run. Its purpose is to provide criteria by which to group execution metrics. DatabricksSubmitRunOperator tasks that utilize multiple clusters are not currently supported.
Auto apply - Whether or not you want recommendations automatically applied
Databricks workspace id
1. Import the syncsparkpy library function.
2. Add the pre_execute hook kwarg to the DatabricksSubmitRunOperator task and set it to the library function imported in step 1.
Gradient additions are annotated below.
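The original annotated example is not reproduced here; as a hedged sketch of the pattern only, the DAG below assumes the hook function comes from the syncsparkpy import in step 1 (the hook body, cluster spec, and notebook path are placeholder assumptions, not the library's real identifiers).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


def gradient_pre_execute_hook(context):
    """Placeholder for the syncsparkpy hook imported in step 1 (not the real implementation).

    The real hook creates/fetches the Gradient project for this task's App ID, pulls the
    latest recommendation, and overwrites the operator's cluster config before submission.
    """
    task = context["task"]
    # Illustration only: the library function mutates task.json["new_cluster"] in place.
    _ = task


with DAG(dag_id="gradient_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    run_spark_job = DatabricksSubmitRunOperator(
        task_id="run_spark_job",
        json={
            "run_name": "gradient-example-run",
            "new_cluster": {  # original, untuned cluster config
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
            "notebook_task": {"notebook_path": "/Workspace/example/notebook"},
        },
        # Gradient addition: the pre_execute hook swaps in the recommended cluster config
        pre_execute=gradient_pre_execute_hook,
    )
```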
From the Projects tab, click on the button.
However, when users are ready to apply Gradient in production, we recommend utilizing runs you were already going to perform to minimize these training costs. See our guide for details.
If this is your use case, please reach out to Sync to find a good solution
If your jobs have strict SLA requirements, we recommend working with Sync to see how we can ensure your SLA limits remain in compliance. Reach out to us via intercom or email us at support@synccomputing.com.
If you choose not to "ignore changes" and want to reintegrate the recommendations back into their Terraform resource, you can retrieve the latest recommendation using the following function in the Sync Library:
This setting is applicable only to Databricks Workflows. Auto-Apply is not applicable if you're using the or Databricks API.
Databricks Workspace Integrated with Gradient - This process requires a Databricks Workspace Integration to be configured. Detailed instructions are available .
Environment variables for SYNC_API_KEY_ID and SYNC_API_KEY_SECRET. The values for these variables can be found in Gradient under Org Settings -> API Keys. For managing environment variables in Airflow, refer to the Airflow documentation.
The syncsparkpy library has been installed and configured in your Airflow instance (see the CLI installation steps above).
Cluster Log Location - A Databricks-supported path for cluster log delivery
Once the code above is implemented in your DAG, head over to the Projects dashboard in Gradient. There you'll be able to easily review and make changes to the cluster configuration as needed.
Select all of the jobs you want Gradient to optimize and onboard them through the job import method in the Gradient UI.
Be sure to enable auto-apply in Gradient for the jobs you want to optimize so Gradient will automatically apply the recommendations to your clusters.
If you have any SLA requirements for your jobs, be sure to set them in Gradient.
This section provides some troubleshooting tips related to Gradient.
This error is returned when Gradient cannot access one of the environment variables that are required to return a recommendation. If you are using the CLI, make sure you have run the configure command and provided all the required values. If you are using the programmatic Python interface, make sure the environment you are working in has the following environment variables set:
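At a minimum, the Sync API credentials listed in the Airflow prerequisites earlier (SYNC_API_KEY_ID and SYNC_API_KEY_SECRET) must be present. As an illustrative check only, not part of Gradient itself, the snippet below verifies they are visible to your Python process:

```python
import os

# Sync API credentials the library expects in the environment (see Org Settings -> API Keys)
required = ["SYNC_API_KEY_ID", "SYNC_API_KEY_SECRET"]

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
print("Sync API credentials found in the environment.")
```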
If you receive an AWS permissions error, please review your AWS policy to ensure the following permissions are granted to the role executing the Gradient prediction.
Select an optimization window timeframe (e.g. first week of the month). By limiting the optimization window to a finite, but periodic, time frame, it allows engineers to still be in control with what the optimization is doing.
Select N number of jobs you want to optimize that are good candidates for tuning in your PROD environment.
By good candidates we mean you're OK if there's some variance in runtime and cost in your PROD environment during the optimization phase.
During the optimization window, Gradient will try different configurations and runtime/cost may go up or down during this period. Be sure that the stakeholders of your jobs are OK with this.
Enable auto-apply for those jobs to allow Gradient to automatically update your jobs during the optimization window. Let your engineers check-in on those jobs during the window via the Gradient UI to monitor progress.
At the end of the time frame, pick the configuration that led to a cost and runtime you prefer. Apply those settings in your Databricks jobs, and disable auto-apply in the Gradient UI to lock in the configurations and prevent future changes.
When a new optimization window arrives, go back to step 2 and select a new batch of jobs to optimize.
Clone your current Production jobs into a development environment, including the input data and any other dependencies.
Onboard these development jobs into Gradient through the standard job import methods.
Be sure to enable auto-apply so the Gradient recommendations can be automatically applied during each iteration.
Run each of your jobs up to 10 times to complete both the learning and optimizing phases of the Gradient algorithm. Support scripts like the auto-training notebook can be used to speed up this process.
Once optimization is complete, review the configurations and select the one that matches your business needs. Copy the cluster configurations to your production jobs.
To access the Project Settings, click on the "Edit settings" button located in the upper right corner of the Project Details page.
The Project Settings allow you to set the recommendation configurations for your project:
SLA - Set these values if you have an SLA that you want Gradient to consider when making optimization recommendations. Typically, longer SLAs allow Gradient to find lower cost cluster configurations, while shorter SLAs may cause higher cost recommendations. No matter what SLA is specified, Gradient will always try to find the lowest cost cluster.
Maintain Scaling Type - If enabled, Gradient will maintain the same Autoscaling option as the original job settings. Meaning, if Autoscaling is enabled, Gradient will provide a recommendation with Autoscaling (e.g. Autoscaling --> Autoscaling). If disabled, Gradient may recommend switching scaling types (e.g. Autoscaling --> Fixed Cluster).
Auto-Apply Recommendations - If enabled, Gradient will automatically update the Databricks Job's cluster configurations with the latest generated recommendation.
The Project Settings page also allows you to Delete the Project.
The Account tab is where you can manage your password for non-Google authenticated accounts.
The Account tab is also where you can manage your API Keys.
We recommend rotating your API keys routinely, according to your company's security policies. This is part of best security practices and is commonly required by operational control frameworks, such as SOC 2, HIPAA, or ISO 27001.
Cost breakdowns for onboarded workloads - users can see a breakdown of cloud vs. Databricks costs on the Projects page
A new phase indicator makes it easy to quickly see when Gradient is Learning vs. Optimizing
New Insights column in the Projects table gives you insights about your workloads - such as whether there's variable data size, whether the project is dormant, and many others
EventBridge for AWS Databricks Workspace integration is now GA!
Updated table on the Discover page now supports sorting, filtering, rearranging, and customizing columns! You can also export the data as CSV
Big improvements to importing jobs from Databricks
Updated projects table - sort, filter, rearrange and customize columns, export data as CSV
Gradient UI updates to use blue as the primary color
Summary tab in project details provides up to date high-level metrics on project performance
[Private Preview] Workspace integration can now use AWS EventBridge for cluster monitoring
Projects page now has customizable columns! Select the columns you want to view and then use them in a filter to find exactly what you're looking for.
Projects details page update provides an improved layout showing cost, runtime, and job status. Charts show up to 60 runs and provide visual differentiation between the zoomed-out view (viewing all runs) and the zoomed-in / focused view (viewing a single run).
Projects page updates to show KPIs, consumption charts, and the ability to search and filter projects
Sync-Hosted log and data collection for AWS Databricks and Azure Databricks - onboard your jobs in 5 minutes!
Spark and Gradient metrics for your projects, enhancing the single pane experience of Gradient
AWS Region is now required when creating an AWS Databricks workspace integration
[Private preview] Worker instance recommendations (instance size)
Sync-hosted log and data collection for AWS Databricks
Edit workspaces
Access checks during workspace integration
Support for instance fleets
Databricks Workspace integration
Improved jobs import flow
Account page is now Org Settings!
Org Settings consolidates personal info, API keys, and lists users in the account
Gradient public preview
Gradient support for AWS Databricks and Azure Databricks
Project level cost and runtime graphs
AWS EBS recommendations
Cost ranges for recommendations for improved cost accuracy when using Spot nodes
Project setting for Auto-apply
Apply button for recommendations
Databricks quick-start notebooks added
Databricks Webhook integration to support at-scale onboarding
Databricks quick-start notebooks added
Official quick-start access modified to include the quick-start notebooks and focus on Databricks
Release Notes: Gradient
Many companies run their data workloads in their production environment in a variety of ways. While there are many ways users can run Gradient in production with the versatile Sync-CLI, below are a few paths that have resonated with users:
Ideal for - Users who want to optimize many jobs at scale and are comfortable with variations in cost/runtime during training.
Ideal for - Users who want more control and oversight over changes that Gradient is making. Typically these jobs may be more sensitive to cost and runtime variations.
Ideal for - Users who cannot experiment with configurations in production at all and need new configurations to be fully vetted before deploying to production. Typically these are the most conservative users in regards to their production jobs.
If the use cases below don't quite match your use case, please feel free to reach out and we'd be happy to help find a solution. Either contact us via the intercom app or email us at support@synccomputing.com.
- Allow Gradient to continuously monitor, control, and optimize your production jobs.
- Optimize your production jobs during regularly scheduled maintenance windows, allowing Gradient to control your jobs during a finite period so users can monitor and approve the changes.
- Allow Gradient to optimize jobs on a clone of your production jobs in a non-production environment. Optimized cluster configurations can then be transferred over to your production jobs.
Please go to our Trust Center Portal where you can request access to our
Security and policy documents
SOC2 Type 2 report
Once on the portal, select the docs you want to access and submit your request. You'll hear back from us within 24 hours.
Click on the link to access the Trust Center Portal now:
We collect Personal Information that clients provide us through our Sites, and in connection with other business dealings we may have with clients. Such information may include First and last name, Company name, Title, Email address, IP address, Login user name, Mailing address, Telephone number, Fax number, and Personal preferences regarding products and services. We use client Personal Information primarily to facilitate our ongoing and proposed business dealings (“Business Use”).
“Business Use” includes the creation of user profiles, establishing and maintaining client accounts so that we may provide products or services requested by our clients, registering clients as users of these products or services so that the client may access them through our Sites or otherwise, communicating with clients about updates, maintenance, outages, or other technical matters concerning these products or services, providing clients with training and support regarding usage of these products or services, notifying clients about changes to any of the policies and procedures for the use of these products or services, verifying the accuracy of account and technical contact information we have on file for clients, responding to questions or inquiries that clients may have about our products or services.
We may also use client Personal Information as required to comply with laws and regulations relating to the products or services we provide in any jurisdictions in which we or our affiliated companies operate, including the United States. We may use Usage Information internally within Sync Computing to help us improve our products or services or to develop new products or services. For Marketing Purposes, and with client consent or as otherwise permitted by applicable law, we may use client Personal Information for purposes relating to marketing our content, products, and services, or those of our business partners.
All client data is encrypted in transit and at rest.
Client data is stored in secure data centers hosted by AWS and Heroku.
In-Transit encryption protocols include HTTPS and SSL/TLS
Data stored in the cloud is stored using AES-256 encryption.
Data is automatically encrypted before being written to disk.
Single sign-on (SSO) and multi-factor authentication (MFA) support.
With SSO the user authentication process is delegated to identity providers that support the Security Assertion Markup Language (SAML) 2.0 standard.
Clients are capable and encouraged to leverage MFA using their SSO provider.
At Sync Computing, we encourage all employees to participate in helping secure our client data and company assets. Where applicable by law, Sync Computing performs background screenings on personnel before they join the organization. All Sync Computing personnel regularly complete security and privacy awareness training.
Application security is of vital importance to Sync Computing. We incorporate security throughout our Software Development Lifecycle (SDLC), from the design of our products to the deployment of our software into our production environment.
We leverage a variety of third-party security partners to support our expectations of secure SDLC processes and secure production SaaS application environments.
Secure development and change management methods are outlined in our policies and procedures, and every engineer is required to acknowledge and adhere to these methods. Our policies and procedures determine when and how changes occur.
Sync Computing designs our application to be highly available and leverages Cloud Service Provider (CSP) technologies to attain availability objectives. Some of the CSP technologies that Sync Computing leverages are redundant storage, content distribution networks, auto-scaling technologies, and others.
Sync Computing has obtained our SOC2 Type 2 Report for the Security, Availability, and Confidentiality Trust Services Criteria.
Developed by the American Institute of Certified Public Accountants (AICPA), a SOC 2 Report confirms the results of a comprehensive audit that focuses on the system-level controls that process customer data.
SOC 2 reports cover the design and documentation of controls and provide evidence of how the organization operated the documented controls, either at a given point in time or over an extended period of time.
There are two different types of SOC 2 reports.
A SOC 2 Type 1 report describes a service provider’s systems and whether the system is suitably designed to meet relevant trust principles.
A SOC 2 Type 2 report details the operational effectiveness of those systems and includes a historical element that shows how controls were managed by a business over a period of time.
Sync Computing is committed to establishing trust with our customers, delivering innovative technology and accurate predictions and optimization recommendations for Apache Spark workloads. We regularly test our infrastructure and applications rigorously to isolate and remediate vulnerabilities. We also work with industry security teams and third-party specialists to keep our users and their data safe.
As a certified SOC 2 compliant solutions provider, we maintain multiple layers of protection across a distributed, reliable infrastructure. All Sync Computing data is stored in secure data centers managed and secured by Amazon Web Services (AWS) and Heroku.