The Gradient Platform

Introduction

Gradient is an infrastructure management and optimization platform that continuously monitors and learns from users' Databricks Jobs so that it can optimize the underlying data infrastructure to hit cost and runtime goals. It supports both co-pilot and autopilot modes. Use it as a co-pilot to receive passive recommendations for optimizations you can apply in a click, or enable auto-apply for optimization at scale.

Gradient uses a closed-loop feedback system to automatically build custom tuned machine learning models for each Databricks Job it is managing, using historical run logs. Through this mechanism, Gradient continuously drives Databricks Jobs cluster configurations to hit user defined business goals, such as maximum costs and runtimes.

What problem is Gradient solving?

Managing and optimizing Databricks Job clusters is tedious, time intensive, and difficult for data and platform engineers; there are far too many Spark configurations and infrastructure choices to know what's right and what makes sense. Additionally, just when they've gone through the effort of optimizing a job, something changes that wipes away all that hard work.

To make matters worse, changing infrastructure incorrectly can lead to crashed jobs due to out-of-memory errors, a major risk to production pipelines that often blocks engineers from optimizing in the first place.

If engineers do want to try managing clusters, it comes at the expense of time that could be spent delivering new products and features. Furthermore, managing at scale, where hundreds or thousands of jobs are running, is simply not feasible for a team of any size.

Gradient provides data teams with an easy and scalable solution that can significantly diminish engineering time spent on cluster optimization, while cutting costs and improving runtimes. It can even automatically manage clusters for all of your jobs - with no code changes.

Who is it for?

  • Data Engineers - Avoid spending time tuning and optimizing clusters while still achieving optimal cost and runtime performance.

  • Data Platform Managers - Ensure your team's Databricks Jobs are achieving high level business objectives without having to bug your engineers or change any code. This becomes particularly important for teams who are looking to scale their Databricks usage.

  • VP of Engineering / CTOs - Gradient works for you and not the cloud providers. It was built to help you efficiently produce data products that meet your business goals.

How Does it Work?

Internal model

Gradient uses a proprietary machine learning algorithm trained on historical event log information to find the best configurations for your job; the algorithm creates and maintains a custom ML model for each job.

The algorithm has two phases:

  • Learning phase: Gradient tests your job against a few different configurations to understand how it responds in terms of cost and runtime.

  • Optimizing phase: Once the learning phase is complete, Gradient uses its internal model to drive the job cluster toward more optimal configurations given the user's SLA requirements. Even while optimizing, Gradient continuously learns from each job run and improves the model for that job.
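The two phases above can be sketched as a simple closed loop. The sketch below is purely illustrative: `RunResult`, `optimize_job`, and `run_job` are hypothetical names, not Gradient's actual API, and the real system uses an ML model rather than a direct lookup over run history.

```python
# Illustrative sketch of a two-phase learn-then-optimize loop.
# All names here are hypothetical; Gradient's internals are proprietary.
from dataclasses import dataclass

@dataclass
class RunResult:
    config: dict       # cluster configuration that was tested
    cost: float        # observed cost of the run
    runtime_s: float   # observed wall-clock runtime in seconds

def optimize_job(run_job, candidate_configs, max_runtime_s, n_learning_runs=3):
    """Learning phase: probe a few configurations to see how the job
    responds. Optimizing phase: pick the cheapest configuration whose
    observed runtime meets the SLA, folding every new run back into
    the history (continuous learning)."""
    history = []
    # Learning phase: test the job against a few different configurations.
    for config in candidate_configs[:n_learning_runs]:
        history.append(run_job(config))
    # Optimizing phase: choose the cheapest config that meets the SLA.
    feasible = [r for r in history if r.runtime_s <= max_runtime_s]
    pool = feasible or history  # fall back to all runs if none meet the SLA
    best = min(pool, key=lambda r: r.cost)
    # Continuous learning: the next run under the chosen config is
    # recorded too, so the history keeps improving.
    history.append(run_job(best.config))
    return best.config
```

The key design idea this illustrates is the SLA-constrained objective: runtime is a constraint, cost is what gets minimized.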

Projects

Projects are how Gradient organizes your Databricks Jobs and enables continuous optimization. All job runs and optimization recommendations for the job are available under the associated Project to give you a holistic picture of how the job is performing. Additionally, Projects help unlock more optimization potential and key features which may be important for your infrastructure via Project Settings.

Each Project is continually updated with the most recent recommendation provided by Gradient, allowing you to review cost and runtime metrics over time for the configured Spark workload.

Main Features

  • Timeline Visualizations - Monitor your jobs' cost and runtime metrics over time to understand behaviors and watch for anomalies due to code changes, data size changes, spot interruptions, or other causes.

  • Metrics Visualizations - Easily view Spark metrics for your runs and correlate them with Gradient's recommendations, all visualized in a single pane.

  • Auto-apply Recommendation - Recommendations can be automatically applied to your jobs after each run for a "set and forget" experience.

  • AWS and Azure support - Granular cost and cluster metrics are gathered from popular cloud providers.

  • Auto Databricks jobs import & setup - Provide your Databricks host and token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient.

  • You set max runtime, Gradient minimizes costs - Simply set your ideal max runtime SLA (service level agreement) and we’ll configure the cluster to hit your goals at the lowest cost.

  • Aggregated cost metrics - Gradient conveniently combines both Databricks DBU costs and Cloud costs to give you a complete picture of your spend.

  • Custom integration with Sync CLI - Sync CLI and APIs can be used to support custom integration with users' environments.

  • Databricks autoscaling optimization - Optimize your min and max workers for your job. It turns out autoscaling parameters are just another set of numbers that need tuning. Check out our previous blog post.

  • EBS recommendations - Optimize your AWS EBS using recommendations provided by Gradient and save on costs.
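As a rough illustration of the auto-import feature above, the first step any such tool performs is enumerating a workspace's jobs given a host and token. The sketch below uses the public Databricks Jobs 2.1 REST API directly; `list_databricks_jobs` is a hypothetical helper, and Gradient's own import mechanics may differ.

```python
# Hypothetical sketch: list a workspace's jobs via the public
# Databricks Jobs 2.1 REST API, given a host URL and access token.
import json
import urllib.request

def list_databricks_jobs(host: str, token: str, limit: int = 25) -> list:
    """Fetch job definitions from GET /api/2.1/jobs/list."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/list?limit={limit}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # The API returns {"jobs": [...], "has_more": ...}.
        return json.load(resp).get("jobs", [])
```

A real import would also page through results (`has_more`) and filter to qualified jobs before registering them as Projects.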

High Level System Diagram

Gradient is a SaaS platform that remotely monitors and applies recommendations to users' Databricks clusters at each Job start. The high-level closed-loop flow of information across user and Gradient environments is shown below.

See our FAQ for more detail on what information is sent and collected on the Gradient side.