Gradient requires a one-time setup for your Databricks workspace. The workspace integration will enable both monitoring and optimization capabilities.
In the steps below, we'll walk you through creating a Sync account, creating your Sync API Keys, and then adding your Databricks workspace to the Gradient platform all through the UI.
Gradient is an infrastructure management and optimization platform that continuously monitors and learns users' Databricks Jobs so that it can optimize the data infrastructure to hit cost and runtime goals. It supports both co-pilot and autopilot modes. Use it as a co-pilot to receive passive recommendations for optimizations you can apply in a click, or enable auto-apply for optimization at scale.
Gradient uses a closed-loop feedback system to automatically build custom tuned machine learning models for each Databricks Job it is managing, using historical run logs. Through this mechanism, Gradient continuously drives Databricks Jobs cluster configurations to hit user defined business goals, such as maximum costs and runtimes.
Managing and optimizing Databricks Job clusters is tedious, time intensive, and difficult for data and platform engineers; there are far too many Spark configurations and infrastructure choices to know what's right and what makes sense. Additionally, just when they've gone through the effort of optimizing a job, something changes that wipes away all that hard work.
To make matters worse, changing infrastructure incorrectly can also crash jobs with out-of-memory errors, a major risk to production pipelines that often blocks engineers from optimizing in the first place.
If engineers do try to manage clusters themselves, it takes time away from delivering new products and features. And managing at scale, where hundreds or thousands of jobs are running, is simply not feasible for a team of any size.
Gradient provides data teams with an easy and scalable solution that significantly reduces engineering time spent on cluster optimization, while cutting costs and improving runtimes. It can even automatically manage clusters for all of your jobs - with no code changes.
Data Engineers - Avoid spending time tuning and optimizing clusters while still achieving optimal cost and runtime performance.
Data Platform Managers - Ensure your team's Databricks Jobs are achieving high level business objectives without having to bug your engineers or change any code. This becomes particularly important for teams who are looking to scale their Databricks usage.
VP of Engineering / CTOs - Gradient works for you, not the cloud providers. It was built to help you efficiently produce data products that meet your business goals.
Find your top jobs to optimize and discover new opportunities to improve your efficiency even further. This page is refreshed daily so you always get up-to-date insights and historical tracking.
After logging into Gradient, click on the "Discover" link on the left hand navigation. Click on the "Add Workspace" button to bring up the credentials prompt, as seen below. You will need to enter the following information:
Databricks Workspace ID - Found in your Databricks URL in the browser address bar as the "o" parameter, e.g. "o=9172567527460388"; in this case you would enter the number "9172567527460388".
Databricks Host - Can be found in the address bar of your web browser when at your Databricks workspace. It should look like this: https://dbc-6c213588-2400.cloud.databricks.com/
Compute Provider - Select which cloud your Databricks workspace is run on
That's it! You're done!
The Discover page displays various pieces of information gathered from your workspace, as explained below. Each "Finding" widget on the right is clickable and will filter the jobs list based on the parameters of the finding.
Top jobs to optimize with Gradient
Jobs with Photon enabled
Jobs with Autoscaling enabled
All purpose compute jobs
Jobs submitted from run-submit (meaning they could come from an external orchestrator like Airflow)
Timeline of total number of jobs
Timeline of total core hours (proportional to your costs)
Databricks Token - You will need Databricks admin access to generate a personal access token for your Databricks workspace. Copy and paste your token value here. We recommend leaving the "Lifetime" field blank so the token does not expire and interrupt any service down the line.
Identify the jobs you want to import into Gradient to start automatically optimizing their clusters. If this is your first time setting up Gradient, please complete your workspace setup first. If you've already onboarded a workspace, proceed directly to importing your jobs.
Gradient uses a proprietary machine learning algorithm trained on historical event log information to find the best configurations for your job; the algorithm creates and maintains a custom ML model for each job.
The algorithm has two phases:
Learning phase: Gradient will test your job performance against a few different configurations to understand how your job responds in terms of cost and runtime.
Optimizing phase: Once the learning phase is complete, Gradient will use the model built internally to drive the job cluster to more optimal configurations given the SLA requirements of the user. Even when optimizing, Gradient will continuously learn from each job run and improve the model for the job.
Projects are how Gradient organizes your Databricks Jobs and enables continuous optimization. All job runs and optimization recommendations for the job are available under the associated Project to give you a holistic picture of how the job is performing. Additionally, Projects help unlock more optimization potential and key features which may be important for your infrastructure via Project Settings.
Each Project is continually updated with the most recent recommendation provided by Gradient, allowing you to review cost and runtime metrics over time for the configured Spark workload.
Timeline Visualizations - Monitor your jobs' cost and runtime metrics over time to understand behaviors and watch for anomalies due to code changes, data size changes, spot interruptions, or other causes.
Metrics Visualizations - Easily view spark metrics for your runs and correlate that with Gradient's recommendations, all visualized beautifully in a single pane.
Auto-apply Recommendation - Recommendations can be automatically applied to your jobs after each run for a "set and forget" experience.
AWS and Azure support - Granular cost and cluster metrics are gathered from popular cloud providers.
Auto Databricks jobs import & setup - Provide your Databricks host and token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient.
You set max runtime, Gradient minimizes costs - Simply set your ideal max runtime SLA (service level agreement) and we’ll configure the cluster to hit your goals at the lowest cost.
Aggregated cost metrics - Gradient conveniently combines both Databricks DBU costs and Cloud costs to give you a complete picture of your spend.
Custom integration with Sync CLI - Sync CLI and APIs can be used to support custom integration with users' environments.
Databricks autoscaling optimization - Optimize your min and max workers for your job. It turns out autoscaling parameters are just another set of numbers that need tuning. Check out our previous blog post.
EBS recommendations - Optimize your AWS EBS using recommendations provided by Gradient and save on costs.
Gradient is a SaaS platform that remotely monitors and applies recommendations to users' Databricks clusters at each job start. The high-level closed-loop flow of information across user and Gradient environments is shown below.
See our FAQ for more detailed information into what information is sent and collected on the Gradient side.
Databricks workspace setup for Azure users.
This doc shows you how to register an app in Microsoft Entra. After you complete the steps below, you will have the Subscription Id, Client Id, Tenant Id, and Client Secret that you need to create an Azure Databricks Workspace integration in Gradient.
To grant Gradient permission to collect the data needed to generate a recommendation, you can register an app using Microsoft Entra.
Entra, the Microsoft identity platform, performs identity and access management (IAM) only for registered applications. The next few steps will walk you through registering Gradient in order to grant it read access to Azure Databricks resources.
Name the app and choose the appropriate account type. (For most organizations, this is set to single tenant for the current directory)
Write down or save the values for Client ID and Tenant ID.
A client secret is a string value the Azure application uses for authentication in place of a certificate. Gradient will use this secret to retrieve the cluster information needed to make a recommendation.
To create a secret, click on Certificates & Secrets > + New Client Secret
Record the value of the secret as it is required by the log submission process.
The app requires read access to the subscription. The easiest way to provide access is to assign the Reader role to the app. You can refer to the full documentation here.
From your Subscription settings, click on the Access control (IAM) link on the side tab. From here, add a new role assignment.
You can find the Reader role by searching or scrolling through your list of roles. Once you find it, click on the role.
From the members tab, add the app you created in that last section by clicking the + select members link and searching for your app by name.
Complete the assignment by verifying your settings and clicking the Review + assign button.
Gradient requires four pieces of information for access:
Subscription ID - This can be copied from the Subscriptions service in the Azure console.
Tenant ID - This can be copied from the Active Directory overview page in the Azure console.
Client ID - The ID of the Active Directory principal with which the library should authenticate. This can be copied from the Active Directory “App registrations” page in the Azure console.
Client secret - This can be generated from the “Certificates & secrets” page for the app.
Once entered, click on "Save & Test Access" to proceed in the Gradient UI.
Click on the link below to proceed to the webhook setup.
For guided assistance, please reach out to
For more advanced use cases beyond the quick-start, see the
Install the Sync CLI on your local machine to help set up the rest of the Gradient installation process. The Sync CLI also enables other advanced functions with Gradient.
Network restrictions: If your company restricts non-whitelisted external IP addresses from your Databricks clusters, be sure to request permission to access: https://api.synccomputing.com
Linux machine: The Sync CLI is best run on a Linux machine.
Start by making sure your environment meets all the prerequisites. The Gradient CLI is part of the Sync Library, which requires Python v3.7 or above and runs only on Linux/Unix-based systems.
Creating a virtual environment is good practice whenever you install a new Python tool, as it avoids conflicts between projects and makes environment management simpler.
Here, we will create a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.
Activating your new virtual environment.
Next use the pip package installer to install the latest version of the Sync Library.
You can confirm that the installation was successful by checking the CLI executable's version with the --version or --help options.
Configuring the CLI with your credentials and preferences is the final step for the installation and setup for the Sync CLI. To do this, run the configure command:
You will be prompted for the following values:
The Sync Python library provides developers with the fundamental tools to integrate the Sync continuous optimization solution in a number of ways:
Apache Airflow Integration - A guide to help integrate Gradient with Apache Airflow for Databricks
Projects are how Gradient continuously optimizes and monitors repeated production Databricks workloads. To implement Projects, the Sync Library must be integrated into your orchestration system (e.g. Airflow, Databricks Workflows).
Once integrated, the Gradient UI will provide high-level metrics and easy-to-use controls to monitor and manage your Apache Spark clusters.
Use the Databricks Auto Import wizard to easily create multiple projects, each linked to a Databricks Job in your workspace.
The Auto Import wizard connects to your specified Databricks workspace using a Databricks Personal Access Token obtained during the Add Databricks Workspace step.
NOTICE: The import wizard will make the following changes to your selected Databricks Jobs:
Add the web-hook notification destination to the job so that Gradient is notified on every successful run
Update the job cluster with the init script, env vars, and instance profile to collect worker instance and volume information.
Review the compatible Databricks jobs, select the jobs for which you would like to create a Gradient project, and click Create Projects. By creating a project, the following properties will be added for each of the selected jobs.
You should now see the project(s) you created on your Projects summary dashboard. New projects will have a status of "Pending Setup" until the project is configured to receive logs for recommendations.
An API key is required to programmatically interact with Gradient using our library, our CLI, or the Gradient REST API. You can create an API key after your Sync account has been created and authorized.
Go to the Gradient application and log in.
Navigate to the Org Settings tab and click on the Generate Personal Key button to generate your API key. You can always come back to the Org Settings tab to view your API keys.
To ensure a successful operation of Gradient, there are a few final steps and checks that will depend on how you currently configure your jobs.
After these steps, you should have a successful first run of Gradient. If you need any help, feel free to reach out to us via intercom or email us at support@synccomputing.com.
For the Databricks Job you want to optimize, go to the Databricks console Job's page.
Instance profiles: Users must have an instance profile correctly configured with access to AWS's S3 and describe cluster functions. See the AWS additional steps instructions for more information.
Enable Logging: Logging must be enabled for your existing job, either DBFS or S3 log location is fine
The following items should have been automatically completed via the job import step. These are just steps to verify the setup completed correctly. Ideally, no further action is required.
Webhook notification: Ensure that the job has webhooks enabled.
Spark environmental variables: Ensure the Spark environmental variables are populated with secrets.
Cluster init scripts: Ensure the Sync init script is selected
Your Databricks job should be fully connected to Gradient. Simply run your job as you normally would via the Databricks UI and click on "Run"
After your job is completed, a secondary "Record Job Run" job will be started which will automatically collect the logs generated and transmit them to Gradient.
The "Record Job Run" job can take 10-15 minutes to complete.
After the completion of the "Record Job Run" job, head over to the Gradient UI, where you should see the first datapoint populated:
The Databricks workspace setup is a one-time setup for your organization. With the webhook tutorial below, all users within an organization will be able to:
Onboard new jobs onto Gradient with a single click through the Gradient UI
Onboard jobs at mass scale
Integrate Gradient without any modifications to your Databricks workflows tasks
Setting up a Databricks Workspace integration involves three steps:
Giving Gradient access to your Databricks Workspace (such as Databricks host, token, and other details)
Giving Gradient access to your cloud provider to fetch metadata on compute infra (such as EC2 instances, EBS volumes, etc)
Configuring a Databricks Webhook to notify Gradient about your job start and stop events
This doc covers the first step. At the bottom of this doc is a link to the doc that will help you complete the second step.
In the integrations page, click on "ADD" to see the "Add Databricks Workspace" console.
We need to know how to connect to your Databricks Workspace. Provide details of your Databricks Workspace and choose the Sync API Key to use with this workspace integration.
Databricks Workspace ID - Found in your Databricks URL in the browser address bar as the "o" parameter, e.g. "o=9172567527460388"; in this case you would enter the number "9172567527460388".
Databricks Host - Can be found in the address bar of your web browser when at your Databricks workspace. It should look like this: https://dbc-6c213588-2400.cloud.databricks.com/
Databricks Token - You will need Databricks admin access to generate a personal access token for your Databricks workspace. Copy and paste your token value here. We recommend setting the "Lifetime" field blank so the token does not expire and interrupt any service down the line.
Sync API Key - Select the Sync API key from the drop down menu. If you haven't created one yet, you can create one here.
Databricks Plan Type - Select your plan type which will impact the pricing used to calculate your Databricks costs.
We need to know how to get logs and collect data for your job runs.
Select one of the options below to continue setting up your Databricks Workspace integration depending on your cloud provider.
With your first datapoint in the Gradient UI, you are now ready to generate your first recommendation and apply it.
The Generate button will create a new recommendation based on the logs submitted from your last successful job run. If this is your first recommendation, your Gradient status will be "learning", meaning Gradient will train an internal model based on a few test runs of your job.
On the right side of the Gradient UI, click on the "Apply" button to automatically update your Databricks job with the recommendation.
Go back to the Databricks console and click on the "run" button for the job being optimized. The Gradient UI should then be populated with its 2nd data point.
To avoid manually applying recommendations, you can also enable Auto-Apply in the "Edit settings" button in the Gradient project page.
If this option is enabled, recommendations will be automatically applied after each run of your job.
Click on the slider to enable Auto-Apply Recommendation. A warning page will pop up to verify this feature. Click on Save.
The Gradient Agent needs AWS access to retrieve instance market information during job execution. To access this information, Gradient uses Boto3 which will leverage permissions granted through the cluster's instance profile. See Example AWS Profile below for required permissions.
Gradient reads and writes logs to the storage path defined in the cluster delivery configuration. If the logs are configured to be delivered to an S3 location, the cluster instance profile must have permission to read and write data to the S3 destination and it must include putObjectAcl permission.
In your AWS console, go to IAM > Roles and click on Create role
Select AWS service
as the entity type and EC2 as the service
Gradient does not need any additional permissions at this point. Implement any default permissions you may need. If none are needed, click next.
Insert a name for the role. Below we use sync-minimum-access. Click on Create role once completed.
Click into the Role you just created, and under Permissions, click on Add permission > create inline policy
Click on the JSON editor
Copy and paste the code block below into the JSON policy editor.
Be sure to update <your-s3-bucket-path> to be the same S3 bucket path where you store your Databricks logs (screenshot from the Databricks cluster).
Click on Next. On the next page click on Create Policy.
In the Databricks admin page, go to Instance profiles and click on "Add instance profile"
On the next page copy and paste the "Instance profile ARN" and "IAM role ARN" values from the AWS console Role's page. Click "add" to complete.
Done! You should now be able to select this instance profile in the cluster page of your jobs
Webhooks provide an easy 1-click experience to on-board new jobs
The Databricks workspace setup is a one-time setup for your organization. With the webhook tutorial below, all users within an organization will be able to:
Onboard new jobs onto Gradient with a single click through the Gradient UI
Onboard jobs at mass scale
Integrate Gradient without any modifications to your Databricks workflows tasks
Before you begin!
Ensure that you've created a Sync API Key since you'll need that here
A user with admin access to your Databricks workspace is required to complete the steps below
Verify your workspace allows outbound and inbound traffic from your Databricks clusters. The Gradient integration process makes calls to AWS APIs and Sync services hosted at https://api.synccomputing.com. IP Whitelisting may be required.
Prior to configuring the notification destination in the Databricks Workspace, we need to retrieve the webhook URL and credentials from the Gradient API. We can use the Sync CLI to do this.
Your <workspace-id> is the "o" parameter on your Databricks URL
Example output:
With the webhook URL and credentials, a workspace admin can now create a webhook notification destination. In your Databricks console go to admin > notification destinations > add destination
Set the following parameters in the UI:
Name: "Gradient"
Username: Use the "username" generated from the previous output
Password: Use the "password" generated from the previous output
URL: Use the "url" generated from the previous output
Next, you need to configure your Databricks workspace with the webhook and Sync credentials:
Run the sync-cli command create-workspace-config
<plan-type> - Select between Standard, Premium, and Enterprise
<webhook-id> - Go back to admin > Notification destinations and edit the "Gradient" webhook. Next to the "Edit destination settings" title, there's a copy button. Click it to copy the Webhook ID (see image below)
Once the command is run, you will need to provide the CLI with the following information:
Databricks host
Databricks token
Sync API key ID
Sync API key secret
AWS instance profile ARN (for Databricks on AWS only. See AWS Instance Profile)
Databricks plan type
Webhook ID (same step as <webhook-id> above)
Example output:
The next step is to download the code used to submit the Spark event logs to Gradient. Once again, we will use the CLI to perform the following tasks:
Adds/updates the init script to the workspace “/Sync Computing” directory
Adds/updates secrets used by the init script and the Sync reporting job
Adds/updates the job run recording/reporting notebook to the workspace in the “/Sync Computing” directory
Adds/updates the Databricks Secrets scope, "Sync Computing | <your Sync tenant id>", used by Gradient to store credentials and configurations
Creates/updates a job with the name “Sync Computing: Record Job Run” that sends up the event log and cluster report for each prediction
Creates/updates and pins an all-purpose cluster with the name “Sync Computing: Job Run Recording” for the prediction job
Run the command sync-cli workspaces apply-workspace-config <workspace-id>
Example Output
The final step is to ensure that all the newly created artifacts are accessible during job runs. By default Databricks jobs have the permissions of the job owner.
Therefore, you should ensure that the owner, directly or through group permissions, can access the following artifacts:
You should be able to see and access the "Sync Computing" directory in your Workspace. See the screenshot below.
You should be able to see and have access to the "Sync Computing | <your Sync tenant id>" secret scope. Check if you can view the scope with the list-scopes command below:
You should be able to see and run the "Sync Computing: <your Sync tenant id> Job Run Recording" cluster in the Databricks console under Compute > All-purpose Compute.
Your workspace should now be configured to send logs using Databricks web-hook notifications.
In this guide we are going to set up your AWS EventBridge to send EC2-related events to Gradient.
Gradient can use AWS EventBridge to accurately monitor your clusters at scale. It uses AWS EventBridge to capture events related to EC2 state changes and tag changes on a resource. These events look like:
To enable these events, you will need:
EventBridge Rules in the AWS console. You should have a default bus there already:
The Gradient workspace integration modal with the custom event pattern for EventBridge:
Unless specified you can use the default values from the AWS Console.
Start with clicking Create Rule in the AWS Console and let's go:
The only important settings here are Rule with an event pattern and your default bus, then click Next:
Under Event Source, select AWS events or EventBridge partner events, then scroll down to Creation Method:
Under Creation Method, select Custom pattern (JSON editor). You will need to copy the Event Pattern from the Gradient modal (from Step 2) and paste it under Event Pattern in the AWS console. Then click Next:
Select EventBridge event bus and Event bus in a different account or Region, copy the bus ARN from the Gradient modal (from Step 3) and paste it under Event bus as target in the AWS console.
Click Skip to Review and create. You can add tags later if needed.
On this page you can review the rule settings entered previously, when done click Create rule to finish setting up the rule:
Once the rule is created go to the Gradient modal and complete the Databricks Workspace integration by clicking the Add button (or the Update button if you are editing an existing Databricks Workspace integration).
In the next window, use the values generated to create a new webhook destination and retrieve the required "Webhook ID" value.
In your Databricks console go to settings > workspace admin > notifications. On the notifications page go to notification destinations > manage > add destination. Set the following parameters in the "Create a new destination" panel in the Databricks console:
Name: "Sync Computing"
Username: Copy the "username" generated above
Password: Copy the "password" generated above
URL: Copy the "url" generated above
After your webhook destination is generated, click to re-open it and copy the webhook ID below.
Paste the webhook ID into the "Add Databricks Workspace" panel in the Gradient platform. Click "Add" to complete the setup!
Cloud Provider - Select AWS as your cloud provider
AWS Region - This is necessary only if you specify AWS as your cloud provider.
Logs and Data Collection - Choose how you want to provide logs to Gradient. We recommend Sync-Hosted collection which manages collecting logs with just a few configurations on your end. Self-Hosted requires additional set up on your end.
Monitoring Type - Choose how you want Gradient to monitor your Databricks clusters.
For the recommended Sync-Hosted collection method, AWS IAM roles and permissions must be set up to complete the rest of the workspace integration, as seen in the screenshot below.
Monitoring Type EventBridge Rule is currently available under Private Preview.
We recommend using EventBridge Rule monitoring for monitoring your Databricks clusters. Only available if you also pick Sync-Hosted collection.
Copy and paste the JSON into a new AWS IAM policy permission, as seen in the example screenshot below. Give the policy a name, such as sync-external-access
Create a new AWS IAM role with the "Custom trust policy" trust entity and paste the JSON in the policy field, as seen in the example screenshot below. Give the role a name, such as sync-external-user-role
In the next step of the AWS IAM role creation, add the permission created previously (in the example above, sync-external-access) to the new AWS IAM role. Example screenshot below:
Give the new IAM role a name, such as sync-external-user-role, and create the new role.
Go back to the AWS IAM role you just created (in the previous example, sync-external-user-role), copy the ARN, and paste it into the last field in the Gradient dialog box.
Once entered, click on "Save & Test Access" to proceed in the Gradient UI.
If you picked EventBridge Rule monitoring continue with the setup on this page:
If you picked Webhook monitoring continue with the setup on this page:
Gradient calculates Return on Investment (ROI) using sophisticated metrics that adapt to your workload characteristics. This documentation explains how Gradient determines and reports ROI, including our methodology for choosing the most appropriate metrics for different scenarios.
Gradient reports two key ROI metrics:
Savings to Date: Actual savings achieved through Gradient optimization
Projected 12 Month Savings: Estimated savings over the next year based on current patterns
Gradient uses two different approaches to calculate ROI, choosing the most appropriate one based on your workload characteristics:
Cost Change Percentage: Direct comparison of costs before and after Gradient
Cost per Gigabyte (Cost/GB) Change Percentage: Normalized metric that accounts for varying data sizes
Gradient automatically selects the most appropriate metric using the following logic (a sketch of this selection follows the list below):
First, we compute the correlation between input data size and runtime (Pearson correlation coefficient)
Then we use the correlation coefficient to select the appropriate metric between "Cost change %" and "Cost/GB change %"
If correlation ≥ 0.7 (strong correlation)
We use the maximum value between "Cost change %" and "Cost/GB change %"
If correlation < 0.7
We prefer to use "Cost change %"
However, if costs have increased due to increased data size then we fall back to "Cost/GB change %"
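As a rough, non-authoritative sketch of the selection logic above (not Gradient's actual implementation), the snippet below assumes per-run lists of input sizes and runtimes plus the two pre-computed percentages; the helper name and the simple "data size grew" check are illustrative assumptions.

```python
from statistics import correlation, mean  # Pearson correlation requires Python 3.10+


def select_roi_metric(input_gb, runtimes, cost_change_pct, cost_per_gb_change_pct):
    """Pick the reported ROI metric using the rules described above.

    input_gb / runtimes: per-run input data sizes (GB) and runtimes for the project.
    cost_change_pct / cost_per_gb_change_pct: the two pre-computed candidate metrics,
    where a positive value means costs went up.
    """
    corr = correlation(input_gb, runtimes)  # strength of the data-size vs. runtime relationship
    if corr >= 0.7:
        # Strong correlation: use the maximum of the two percentages
        return max(cost_change_pct, cost_per_gb_change_pct)

    # Weak correlation: prefer the plain cost change...
    half = len(input_gb) // 2
    data_size_grew = mean(input_gb[half:]) > mean(input_gb[:half])  # crude growth check
    if cost_change_pct > 0 and data_size_grew:
        # ...unless costs rose because the job is simply processing more data
        return cost_per_gb_change_pct
    return cost_change_pct


# Example: runtimes track data size closely, so the larger of the two metrics is reported
print(select_roi_metric([10, 20, 40, 80], [5, 9, 21, 44], -12.0, -18.0))
```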
The Cost/GB metric is particularly useful when:
Your data size varies significantly between runs
Overall costs are increasing due to larger data volumes
You need to measure efficiency improvements independently of data size
Think of Cost/GB like a car's miles per gallon (MPG): Even if you're driving more miles (processing more data), you can still measure if you're using fuel (resources) more efficiently. A lower Cost/GB indicates better efficiency, even if total costs are higher.
Gradient calculates ROI at two levels:
Project Level: Using the formulas above for individual workloads
Organization Level: Aggregating savings across all projects
Consider Data Size Variations
Monitor both cost changes and Cost/GB metrics
Understand which metric Gradient is using for your workload
Look for efficiency improvements even when total costs increase
Review Correlation Metrics
Understand how your workload's runtime correlates with data size
This helps explain which ROI calculation method Gradient is using
Monitor Trends
Track both immediate savings and projected annual savings
Consider seasonal patterns in your workload frequency
Review historical trends to understand optimization impact
Cluster costs for a Spark application run can vary due to autoscaling, Spot interruptions, and Spot fallback to On-demand.
During the Optimizing phase of a project, Gradient displays recommendation costs as a range to help you understand the variable cost of your Spark application.
When a Databricks cluster is running there are two main sources of charges:
Cloud provider charges are dominated by the instance rental and storage costs. Each unique cloud resource (e.g. an EC2 instance or an EBS volume) has a certain charge rate [usd/hr], and the cost of each resource is the charge rate multiplied by the rental duration.
Databricks platform charges. Just like the cloud provider, Databricks has a charge rate per instance, which is a function of the instance type, your Databricks Plan type, and the Databricks runtime engine (STANDARD or PHOTON). The instance rental is the only cloud resource that has an associated Databricks charge – storage comes for free!
The charge for a cluster can be estimated by adding together the individual costs of each resource. We use the list price for the charge rate of each resource, and resource durations are gathered by attaching an init_script to your cluster which periodically polls for the resources that compose it.
At the end of your job, that data is shipped to us where we can formulate a robust timeline of when each resource was added and removed from the cluster.
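As a minimal sketch of that arithmetic (not Sync's actual cost model), each resource contributes its charge rate multiplied by its rental duration, with instances also carrying a Databricks rate; the field names and example rates below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResourceUsage:
    """One cloud resource observed on the cluster (an instance or a storage volume)."""
    cloud_rate_usd_per_hr: float   # list-price cloud charge rate for the resource
    dbu_rate_usd_per_hr: float     # Databricks charge rate (0 for storage volumes)
    duration_hr: float             # time between the resource joining and leaving the cluster


def estimate_cluster_cost(resources):
    """Estimated cost = sum over resources of (cloud rate + Databricks rate) * duration."""
    return sum(
        (r.cloud_rate_usd_per_hr + r.dbu_rate_usd_per_hr) * r.duration_hr
        for r in resources
    )


# Example: two workers for 1.5 hours plus an EBS volume (no Databricks charge on storage)
cluster = [
    ResourceUsage(cloud_rate_usd_per_hr=0.40, dbu_rate_usd_per_hr=0.22, duration_hr=1.5),
    ResourceUsage(cloud_rate_usd_per_hr=0.40, dbu_rate_usd_per_hr=0.22, duration_hr=1.5),
    ResourceUsage(cloud_rate_usd_per_hr=0.01, dbu_rate_usd_per_hr=0.00, duration_hr=1.5),
]
print(f"Estimated cluster cost: ${estimate_cluster_cost(cluster):.2f}")
```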
The dominant source of error in Sync’s cost estimate comes from the resource duration. Sync only polls the cluster resources during the times for which the init_script is active. However, as depicted in the figure below, there is some time before the init script runs where charges still accumulate.
We have found from studying our own internal jobs that the unaccounted time tends to be about 2-3 minutes per instance. For the shortest clusters, this error will result in our estimate being well below what you see in your billing. However, as the cluster duration gets longer the relative error diminishes and we expect the values to match quite well.
We understand the importance of our estimates reflecting what you see in your billing, even for the shortest clusters. With that in mind, we will always strive to improve the accuracy of our estimates for all clusters, and we will keep our users informed of future improvements as they come.
From the Spark Eventlog
Event timestamps to estimate application start & end time
Various cluster configurations are checked
Some cloud provider info, such as region
From the cluster report (init script output)
AWS API “describe instances” responses to get a record of the cluster instance composition
AWS API “describe volumes” responses to get a record of the cluster storage composition
Databricks job, cluster, and task information (all from databricks api calls). Things like instance types, cluster id, and more are gathered from here.
The above data is combined to produce outputs like runtime and cost estimates.
During the learning phase, Gradient will test out a few different configurations to understand how your job responds in terms of cost and runtime. Because of this, costs may momentarily increase.
Gradient uses your Databricks token to access and integrate with your Jobs so that tracking and updating cluster configurations can be done entirely through the UI, making users' lives much easier.
Running extra training steps outside of your normal workflow will increase costs via those few extra job runs. For an initial proof-of-concept, this is a risk-free way of trying out Gradient.
Gradient optimizes clusters based on the actual code and data of your job. If your DEV environment's workloads are an exact clone of your PROD workloads, then yes Gradient will work.
Users with highly sensitive and tight SLA driven PROD workloads typically prefer to run Gradient in a cloned DEV environment.
If your DEV environment's workloads are different than your PROD environment (e.g. uses a smaller data size, or different code) then running Gradient in DEV will only optimize the cluster for your DEV workloads which likely would not transfer to your PROD workloads.
Training and testing recommendations in a DEV environment adds additional overall cost, since you have to pay for the test job runs themselves. This will eat into your overall ROI gains with Gradient.
If users allow Gradient to "learn" while in production, you will utilize job runs you have to run anyway. This significantly reduces the cost overhead of optimization and dramatically increases your overall ROI.
During the learning phase, Gradient will try different configurations to help characterize your job which could result in fluctuations in cost and runtime.
In your Databricks console, navigate to the webhook that you created. This is under Admin Settings -> Notifications -> Manage button for Notification destinations.
Next to the "Edit destination settings" title, there's a copy button. Click it to copy the Webhook ID (see image below)
Gradient needs to collect cluster and event logs for your job runs. There are two ways that Gradient can do this automatically:
[Recommended] Sync Hosted: the infrastructure used to collect logs is hosted in Sync's environment. When Gradient receives a notification that your job run has started, then it uses this infrastructure to monitor and fetch (or pull) logs, from your environment, to Sync.
Self Hosted: the infrastructure used to collect logs is hosted in your environment. When Gradient receives a notification that your job has completed, then it uses this infrastructure to send (or push) logs, from your environment, to Sync.
Self Hosted collection uses a Sync-provisioned all-purpose cluster in your Databricks environment to perform log collection once a job run completes. The all-purpose cluster runs a notebook that utilizes a Sync Gradient Python library to send the logs to Gradient.
Sync Hosted collection is much simpler and doesn't have the overhead of Self Hosted collection. Instead, it requires a few cloud permissions in order for Gradient to collect information about the cloud resources used to run a job. The infrastructure and code to perform log collection is entirely within Sync's environment.
Running Spark on Spot instances carries complicated risks. When Spot instances are revoked randomly, the impact on runtime can be dramatic and unpredictable. Even a single worker being pulled can offset the cost advantage of Spot instances. We have found that, sometimes, an optimized and reliable on-demand cluster can be cheaper than using Spot instances, which can be counterintuitive to many users.
Integrating Gradient into your Terraform process typically involves the following steps:
Include Workspace and Job Configuration in your Terraform Plan
Configure Terraform to ignore recommendation fields when detecting drift
Let Gradient “auto-apply” recommendations directly to your Databricks Job via Databricks API
Gradient utilizes Databricks webhook notification destinations to be notified upon the start of managed Databricks Jobs. Each notification destination should be incorporated into your infrastructure management process to maintain Gradient configuration within your Databricks workspace definition. See example.
Additionally, each workflow cluster being managed by Gradient should reference this webhook.
If you are using terraform plan to detect configuration drift of resources created by Terraform, we recommend using one of the following methods to omit Databricks Job cluster configurations generated by Gradient. This prevents the most recent cluster configuration from being overwritten by Terraform.
Specifying 'ignore_changes = all' under the 'lifecycle' definition of the entire cluster configuration will result in the entire cluster configuration being ignored by the drift detection process.
Explicitly specifying which configurations to ignore allows configurations not managed by Gradient to be evaluated by the drift detection process. However, it is important to note that these configurations may change as new features are added to Gradient.
This function returns a Python dictionary containing the recommended cluster configuration for the project. Parse and persist this data in the format required by your infrastructure management process.
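The retrieval function itself lives in the Sync Library (linked above). As a hypothetical illustration of the "parse and persist" step only, the sketch below assumes the returned dictionary exposes a cluster section and writes a few fields to a Terraform variables file; the dictionary keys and file layout are assumptions, not the library's actual schema.

```python
import json


def persist_recommendation(recommendation, path="gradient.auto.tfvars.json"):
    """Persist selected cluster fields from a Gradient recommendation for Terraform to consume.

    `recommendation` is the Python dictionary described above. The keys used here
    ("configuration", "new_cluster", "node_type_id", "num_workers") are illustrative placeholders.
    """
    cluster = recommendation.get("configuration", {}).get("new_cluster", {})
    tfvars = {
        "driver_node_type_id": cluster.get("driver_node_type_id"),
        "node_type_id": cluster.get("node_type_id"),
        "num_workers": cluster.get("num_workers"),
    }
    with open(path, "w") as f:
        json.dump(tfvars, f, indent=2)  # referenced as variable values in your Terraform plan
```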
To avoid manually applying recommendations, you can also enable Auto-Apply in the "Edit settings" button in the Gradient project page. If this option is enabled, recommendations will be automatically applied after each run of your job.
Currently, Gradient is optimized to work with predefined Databricks Jobs and Databricks Workflows. However, the Sync Python Library allows you to integrate your Databricks pipelines when using 3rd party tools like Airflow and Azure Data Factory. Typically, these tools use run_submit() or DatabricksSubmitRunOperator(), found in Databricks' APIs and Databricks' Airflow provider, to initiate runs using Databricks.
If you are using a tool that uses this job invocation method, you can follow the pattern below to submit your event logs to Gradient for evaluation, generate a recommendation, and apply that recommendation to your next job run.
These instructions guide you through the Gradient integration for Airflow DAGs containing DatabricksSubmitRunOperator() tasks through the use of a pre-execute hook.
The pre-execute hook for DatabricksSubmitRunOperator() creates/fetches the relevant project. It then retrieves a recommendation from Gradient for an optimized cluster config for this project. The recommendation overrides the cluster config being passed to DatabricksSubmitRunOperator(). The task then runs with this optimized cluster config instead of the original untuned cluster config.
Decide DAG parameters:
App ID - An App ID is a human readable unique identifier supplied with each Databricks Job run. Its purpose is to provide criteria by which to group execution metrics. DatabricksSubmitRunOperator tasks that utilize multiple clusters are not currently supported.
Auto apply - Whether or not you want recommendations automatically applied
Databricks workspace id
1. Import the syncsparkpy library function.
2. Add the pre_execute hook kwarg to the DatabricksSubmitRunOperator task and set it to the library function imported in step 1.
Gradient additions are annotated below.
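The original annotated example is not reproduced here; as a hedged sketch of the pattern only, the DAG below assumes the hook function comes from the syncsparkpy import in step 1 (the hook body, cluster spec, and notebook path are placeholder assumptions, not the library's real identifiers).

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


def gradient_pre_execute_hook(context):
    """Placeholder for the syncsparkpy hook imported in step 1 (not the real implementation).

    The real hook creates/fetches the Gradient project for this task's App ID, pulls the
    latest recommendation, and overwrites the operator's cluster config before submission.
    """
    task = context["task"]
    # Illustration only: the library function mutates task.json["new_cluster"] in place.
    _ = task


with DAG(dag_id="gradient_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    run_spark_job = DatabricksSubmitRunOperator(
        task_id="run_spark_job",
        json={
            "run_name": "gradient-example-run",
            "new_cluster": {  # original, untuned cluster config
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 4,
            },
            "notebook_task": {"notebook_path": "/Workspace/example/notebook"},
        },
        # Gradient addition: the pre_execute hook swaps in the recommended cluster config
        pre_execute=gradient_pre_execute_hook,
    )
```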
From the Projects tab, click on the button.
However, when users are ready to apply Gradient in production, we recommend utilizing runs you were already going to perform to minimize these training costs. See our guide for details.
If this is your use case, please reach out to Sync to find a good solution
If your jobs have strict SLA requirements, we recommend working with Sync to see how we can ensure your SLA limits remain in compliance. Reach out to us via intercom or email us at support@synccomputing.com.
If you choose not to "ignore changes" and want to reintegrate the recommendations back into their Terraform resource, you can retrieve the latest recommendation using the following function in the Sync Library:
This setting is applicable only to Databricks Workflows. Auto-Apply is not applicable if you're using the or Databricks API.
Databricks Workspace Integrated with Gradient - This process requires a Databricks Workspace Integration to be configured. Detailed instructions are available .
Environment variables for SYNC_API_KEY_ID and SYNC_API_KEY_SECRET. The values for these variables can be found in Gradient under Org Settings -> API Keys. For managing environment variables in Airflow, refer to the Airflow documentation.
The syncsparkpy library has been installed and configured in your Airflow instance (see the CLI installation steps above).
Cluster Log Location - A Databricks-supported path for cluster log delivery
Once the code above is implemented in your DAG, head over to the Projects dashboard in Gradient. There you'll be able to easily review and make changes to the cluster configuration as needed.
Select all of the jobs you want Gradient to optimize and onboard them through the job import method in the Gradient UI.
Be sure to enable auto-apply in Gradient for the jobs you want to optimize so Gradient will automatically apply the recommendations to your clusters.
If you have any SLA requirements for your jobs, be sure to set them in Gradient.
This section provides some troubleshooting tips related to Gradient.
This error is returned when Gradient cannot access one of the environment variables that are required to return a recommendation. If you are using the CLI, make sure you have run the configure command and provided all the required values. If you are using the programmatic Python interface, make sure the environment you are working in has the following environment variables set:
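At a minimum, the Sync API credentials listed in the Airflow prerequisites earlier (SYNC_API_KEY_ID and SYNC_API_KEY_SECRET) must be present. As an illustrative check only, not part of Gradient itself, the snippet below verifies they are visible to your Python process:

```python
import os

# Sync API credentials the library expects in the environment (see Org Settings -> API Keys)
required = ["SYNC_API_KEY_ID", "SYNC_API_KEY_SECRET"]

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
print("Sync API credentials found in the environment.")
```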
If you receive an AWS permissions error, please review your AWS policy to ensure the following permissions are granted to the role executing the Gradient prediction.
Select an optimization window timeframe (e.g. first week of the month). By limiting the optimization window to a finite, but periodic, time frame, it allows engineers to still be in control with what the optimization is doing.
Select N number of jobs you want to optimize that are good candidates for tuning in your PROD environment.
By good candidates we mean you're OK if there's some variance in runtime and cost in your PROD environment during the optimization phase.
During the optimization window, Gradient will try different configurations and runtime/cost may go up or down during this period. Be sure that the stakeholders of your jobs are OK with this.
Enable auto-apply for those jobs to allow Gradient to automatically update your jobs during the optimization window. Let your engineers check-in on those jobs during the window via the Gradient UI to monitor progress.
At the end of the time frame, pick the configuration that led to a cost and runtime you prefer. Apply those settings in your Databricks jobs, and disable auto-apply in the Gradient UI to lock in the configurations and prevent future changes.
When a new optimization window arrives, go back to step 2 and select a new batch of jobs to optimize.
Clone your current Production jobs into a development environment, including the input data and any other dependencies.
Onboard these development jobs into Gradient through the standard job import methods.
Be sure to enable auto-apply so the Gradient recommendations can be automatically applied during each iteration.
Run each of your jobs up to 10 times to complete both the learning and optimizing phases of the Gradient algorithm. Support scripts like the auto-training notebook can be used to speed up this process.
Once optimization is complete, review the configurations and select the one that matches your business needs. Copy the cluster configurations to your production jobs.
To access the Project Settings, click on the "Edit settings" button located in the upper right corner of the Project Details page.
The Project Settings allow you to set the recommendation configurations for your project:
SLA - Set these values if you have an SLA that you want Gradient to consider when making optimization recommendations. Typically, longer SLAs allow Gradient to find lower cost cluster configurations, while shorter SLAs may cause higher cost recommendations. No matter what SLA is specified, Gradient will always try to find the lowest cost cluster.
Maintain Scaling Type - If enabled, Gradient will maintain the same Autoscaling option as the original job settings. Meaning, if Autoscaling is enabled, Gradient will provide a recommendation with Autoscaling (e.g. Autoscaling --> Autoscaling). If disabled, Gradient may recommend switching scaling types (e.g. Autoscaling --> Fixed Cluster).
Auto-Apply Recommendations - If enabled, Gradient will automatically update the Databricks Job's cluster configurations with the latest generated recommendation.
The Project Settings page also allows you to Delete the Project.
The Account tab is where you can manage your password for non-Google authenticated accounts.
The Account tab is also where you can manage your API Keys.
We recommend rotating your API keys routinely, according to your company's security policies. This is part of best security practices and is commonly required by operational control frameworks, such as SOC 2, HIPAA, or ISO 27001.
Cost breakdowns for onboarded workloads - users can see a breakdown of cloud vs. Databricks costs on the Projects page
A new phase indicator makes it easy to quickly see when Gradient is Learning vs. Optimizing
New Insights column in the Projects table gives you insights about your workloads - such as whether there's variable data size, whether the project is dormant, and many others
EventBridge for AWS Databricks Workspace integration is now GA!
Updated table on the Discover page now supports sorting, filtering, rearranging, and customizing columns! You can also export the data as CSV
Big improvements to importing jobs from Databricks
Updated projects table - sort, filter, rearrange and customize columns, export data as CSV
Gradient UI updates to use blue as the primary color
Summary tab in project details provides up to date high-level metrics on project performance
[Private Preview] Workspace integration can now use AWS EventBridge for cluster monitoring
Projects page now has customizable columns! Select the columns you want to view and then use them in a filter to find exactly what you're looking for.
Projects details page update provides an improved layout showing cost, runtime, and job status. Charts show up to 60 runs and provide visual differentiation between the zoomed-out view (viewing all runs) and the zoomed-in / focused view (viewing a single run).
Projects page updates to show KPIs, consumption charts, and the ability to search and filter projects
Sync-Hosted log and data collection for AWS Databricks and Azure Databricks - onboard your jobs in 5 minutes!
Spark and Gradient metrics for your projects, enhancing the single pane experience of Gradient
AWS Region is now required when creating an AWS Databricks workspace integration
[Private preview] Worker instance recommendations (instance size)
Sync-hosted log and data collection for AWS Databricks
Edit workspaces
Access checks during workspace integration
Support for instance fleets
Databricks Workspace integration
Improved jobs import flow
Account page is now Org Settings!
Org Settings consolidates personal info, API keys, and lists users in the account
Gradient public preview
Gradient support for AWS Databricks and Azure Databricks
Project level cost and runtime graphs
AWS EBS recommendations
Cost ranges for recommendations for improved cost accuracy when using Spot nodes
Project setting for Auto-apply
Apply button for recommendations
Databricks quick-start notebooks added
Databricks Webhook integration to support at-scale onboarding
Databricks quick-start notebooks added
Official quick-start access modified to include the quick-start notebooks and focus on Databricks
Release Notes: Gradient
Many companies run their data workloads in their production environment in a variety of ways. While there are many ways users can run Gradient in production with the versatile Sync-CLI, below are a few paths that have resonated with users:
Ideal for - Users who want to optimize many jobs at scale and are comfortable with variations in cost/runtime during training.
Ideal for - Users who want more control and oversight over changes that Gradient is making. Typically these jobs may be more sensitive to cost and runtime variations.
Ideal for - Users who cannot experiment with configurations in production at all and need new configurations to be fully vetted before deploying to production. Typically these are the most conservative users in regards to their production jobs.
If the use cases below don't quite match your use case, please feel free to reach out and we'd be happy to help find a solution. Either contact us via the intercom app or email us at support@synccomputing.com.
- Allow Gradient to continuously monitor, control, and optimize your production jobs.
- Optimize your production jobs during regularly scheduled maintenance windows, allowing Gradient to control your jobs during a finite period so users can monitor and approve the changes.
- Allow Gradient to optimize jobs on a clone of your production jobs in a non-production environment. Optimized cluster configurations can then be transferred over to your production jobs.
Please go to our Trust Center Portal where you can request access to our
Security and policy documents
SOC2 Type 2 report
Once on the portal, select the docs you want to access and submit your request. You'll hear back from us within 24 hours.
Click on the link to access the Trust Center Portal now:
We collect Personal Information that clients provide us through our Sites, and in connection with other business dealings we may have with clients. Such information may include First and last name, Company name, Title, Email address, IP address, Login user name, Mailing address, Telephone number, Fax number, and Personal preferences regarding products and services. We use client Personal Information primarily to facilitate our ongoing and proposed business dealings (“Business Use”).
“Business Use” includes the creation of user profiles, establishing and maintaining client accounts so that we may provide products or services requested by our clients, registering clients as users of these products or services so that the client may access them through our Sites or otherwise, communicating with clients about updates, maintenance, outages, or other technical matters concerning these products or services, providing clients with training and support regarding usage of these products or services, notifying clients about changes to any of the policies and procedures for the use of these products or services, verifying the accuracy of account and technical contact information we have on file for clients, responding to questions or inquiries that clients may have about our products or services.
We may also use client Personal Information as required to comply with laws and regulations relating to the products or services we provide in any jurisdictions in which we or our affiliated companies operate, including the United States. We may use Usage Information internally within Sync Computing to help us improve our products or services or to develop new products or services. For Marketing Purposes, and with client consent or as otherwise permitted by applicable law, we may use client Personal Information for purposes relating to marketing our content, products, and services, or those of our business partners.
All client data is encrypted in transit and at rest.
Client data is stored in secure data centers hosted by AWS and Heroku.
In-Transit encryption protocols include HTTPS and SSL/TLS
Data stored in the cloud is stored using AES-256 encryption.
Data is automatically encrypted before being written to disk.
Single sign-on (SSO) and multi-factor authentication (MFA) support.
With SSO the user authentication process is delegated to identity providers that support the Security Assertion Markup Language (SAML) 2.0 standard.
Clients are capable and encouraged to leverage MFA using their SSO provider.
At Sync Computing, we encourage all employees to participate in helping secure our client data and company assets. Where applicable by law, Sync Computing performs background screenings on personnel before they join the organization. All Sync Computing personnel regularly complete security and privacy awareness training.
Application security is of vital importance to Sync Computing. We incorporate security throughout our Software Development Lifecycle (SDLC), from the design of our products to the deployment of our software into our production environment.
We leverage a variety of third-party security partners to support our expectations of secure SDLC processes and secure production SaaS application environments.
Secure development and change management methods are outlined in our policies and procedures, and every engineer is required to acknowledge and adhere to these methods. Our policies and procedures determine when and how changes occur.
Sync Computing designs our application to be highly available and leverages Cloud Service Provider (CSP) technologies to attain availability objectives. Some of the CSP technologies that Sync Computing leverages are redundant storage, content distribution networks, auto-scaling technologies, and others.
Sync Computing has obtained our SOC2 Type 2 Report for the Security, Availability, and Confidentiality Trust Services Criteria.
Developed by the American Institute of Certified Public Accountants (AICPA), a SOC 2 Report confirms the results of a comprehensive audit that focuses on the system-level controls that process customer data.
SOC 2 reports cover the design and documentation of controls and provide evidence of how the organization operated the documented controls, either at a given point in time or over an extended period of time.
There are two different types of SOC 2 reports.
A SOC 2 Type 1 report describes a service provider’s systems and whether the system is suitably designed to meet relevant trust principles.
A SOC 2 Type 2 report details the operational effectiveness of those systems and includes a historical element that shows how controls were managed by a business over a period of time.
Sync Computing is committed to establishing trust with our customers, delivering innovative technology and accurate predictions and optimization recommendations for Apache Spark workloads. We regularly test our infrastructure and applications rigorously to isolate and remediate vulnerabilities. We also work with industry security teams and third-party specialists to keep our users and their data safe.
As a certified SOC 2 compliant solutions provider, we maintain multiple layers of protection across a distributed, reliable infrastructure. All Sync Computing data is stored in secure data centers managed and secured by Amazon Web Services (AWS) and Heroku.