
How Does it Work?

Internal model

Gradient uses a proprietary machine learning algorithm trained on historical event log information to find the best configurations for your job; the algorithm creates and maintains a custom ML model for each job.
The algorithm has two phases:
  • Learning phase: Gradient will test your job performance against a few different configurations to understand how your job responds in terms of cost and runtime.
  • Optimizing phase: Once the learning phase is complete, Gradient uses the model it has built to drive the job cluster toward better configurations that meet your SLA requirements. Even while optimizing, Gradient continues to learn from each job run and to improve the model for the job.
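
To make the two phases concrete, here is a toy sketch in Python. It is not Gradient's algorithm, which is proprietary; the `JobModel` class, the three-run learning threshold, and the "cheapest observed run that meets the SLA" rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """One observed job run (hypothetical record, not a Gradient type)."""
    workers: int
    runtime_min: float
    cost_usd: float


@dataclass
class JobModel:
    """Toy per-job model: it simply remembers the runs observed for one job."""
    learning_runs_needed: int = 3  # assumed threshold, purely illustrative
    history: list = field(default_factory=list)

    @property
    def phase(self) -> str:
        if len(self.history) < self.learning_runs_needed:
            return "learning"
        return "optimizing"

    def record(self, result: RunResult) -> None:
        # Every run is recorded, so the model keeps learning in both phases.
        self.history.append(result)

    def next_workers(self, candidates: list, max_runtime_min: float) -> int:
        if self.phase == "learning":
            # Learning: try a worker count that has not been observed yet.
            tried = {r.workers for r in self.history}
            untried = [c for c in candidates if c not in tried]
            return untried[0] if untried else candidates[0]
        # Optimizing: cheapest observed configuration that meets the SLA.
        within_sla = [r for r in self.history if r.runtime_min <= max_runtime_min]
        if not within_sla:
            # Nothing meets the SLA yet, so fall back to the fastest run seen.
            return min(self.history, key=lambda r: r.runtime_min).workers
        return min(within_sla, key=lambda r: r.cost_usd).workers
```

In this sketch, the model proposes each candidate worker count once, then switches to picking the cheapest configuration whose observed runtime stayed within the limit. A real system would predict the behavior of untested configurations rather than only replaying observed ones.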

Projects

Projects are how Gradient organizes your Databricks Jobs and enables continuous optimization. All job runs and optimization recommendations for a job are available under its associated Project, giving you a holistic picture of how the job is performing. Project Settings also unlock further optimization potential and key features that may be important for your infrastructure.
Each Project is continually updated with the most recent recommendation provided by Gradient, allowing you to review cost and runtime metrics over time for the configured Spark workload.

Main Features

  • Timeline Visualizations - Monitor your jobs' cost and runtime metrics over time to understand their behavior and watch for anomalies caused by code changes, data size changes, spot interruptions, or other factors.
  • Auto-apply Recommendations - Recommendations can be automatically applied to your jobs after each run for a "set and forget" experience.
  • AWS and Azure support - Granular cost and cluster metrics are gathered from popular cloud providers.
  • Auto Databricks jobs import & setup - Provide your Databricks host and token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient.
  • You set max runtime, Gradient minimizes costs - Simply set your ideal max runtime SLA (service level agreement) and we’ll configure the cluster to hit your goals at the lowest cost.
  • Aggregated cost metrics - Gradient conveniently combines both Databricks DBU costs and Cloud costs to give you a complete picture of your spend.
  • Custom integration with Sync CLI - Sync CLI and APIs can be used to support custom integration with users' environments.
  • Databricks autoscaling optimization - Optimize your min and max workers for your job. It turns out autoscaling parameters are just another set of numbers that need tuning. Check out our previous blog post.
  • EBS recommendations - Optimize your AWS EBS using recommendations provided by Gradient and save on costs.
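
Two of the features above, aggregated cost metrics and minimizing cost under a max runtime, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary fields, and all rates and figures are hypothetical and do not come from Gradient or the Sync CLI.

```python
def total_cost_usd(dbus, dbu_rate_usd, instance_hours, instance_rate_usd):
    """Combine Databricks DBU cost and cloud instance cost into one figure."""
    return dbus * dbu_rate_usd + instance_hours * instance_rate_usd


def cheapest_within_sla(candidates, max_runtime_min):
    """Return the lowest-cost candidate whose runtime meets the limit, or None."""
    eligible = [c for c in candidates if c["runtime_min"] <= max_runtime_min]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["total_cost_usd"])


# Hypothetical autoscaling settings with made-up measurements.
candidates = [
    {"min_workers": 1, "max_workers": 4, "runtime_min": 75.0,
     "total_cost_usd": total_cost_usd(10, 0.15, 4, 0.50)},
    {"min_workers": 2, "max_workers": 8, "runtime_min": 40.0,
     "total_cost_usd": total_cost_usd(14, 0.15, 5, 0.50)},
    {"min_workers": 4, "max_workers": 16, "runtime_min": 25.0,
     "total_cost_usd": total_cost_usd(22, 0.15, 8, 0.50)},
]

best = cheapest_within_sla(candidates, max_runtime_min=60)
```

With a 60-minute limit, the cheapest setting overall (1 to 4 workers) is excluded because it runs too long, so the 2 to 8 worker setting is chosen. This shows why the min and max worker counts are treated as tunable parameters rather than fixed defaults.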

High Level System Diagram

Gradient is a SaaS platform that remotely monitors users' Databricks clusters and applies recommendations to them at each Job start. The high-level, closed-loop flow of information between the user and Gradient environments is shown below.
See our FAQ for details on what information is sent to and collected by Gradient.