Why do cluster recommendations have cost ranges?

Cluster costs for a Spark application run can vary due to autoscaling, Spot interruptions, and Spot fallback to On-demand.
During the Optimizing phase of a project, Gradient displays recommendation costs as a range to help you understand the variable cost of your Spark application.
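
As a rough illustration of why a single run can land anywhere within a range, the sketch below brackets the worker cost between a best case (cluster stays at its minimum size, all Spot) and a worst case (cluster scales to its maximum size, all On-demand). This is not Gradient's actual cost model. The `ClusterScenario` fields, the prices, and the decision to ignore driver and storage costs are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterScenario:
    """Hypothetical inputs for one Spark application run."""
    min_workers: int                 # autoscaling lower bound
    max_workers: int                 # autoscaling upper bound
    spot_price_per_hour: float       # assumed Spot price per worker
    on_demand_price_per_hour: float  # assumed On-demand price per worker
    runtime_hours: float             # assumed run duration


def cost_range(s: ClusterScenario) -> Tuple[float, float]:
    """Return (low, high) worker cost for the run.

    low:  cluster never scales up and every worker stays on Spot.
    high: cluster runs at max size and every worker falls back to On-demand.
    Driver and storage costs are ignored to keep the sketch small.
    """
    low = s.min_workers * s.spot_price_per_hour * s.runtime_hours
    high = s.max_workers * s.on_demand_price_per_hour * s.runtime_hours
    return low, high


if __name__ == "__main__":
    scenario = ClusterScenario(
        min_workers=2,
        max_workers=8,
        spot_price_per_hour=0.10,
        on_demand_price_per_hour=0.25,
        runtime_hours=2.0,
    )
    low, high = cost_range(scenario)
    print(f"Estimated worker cost: ${low:.2f} to ${high:.2f}")
```

A real run usually falls somewhere between the two bounds, depending on how long the cluster spends scaled up and how many Spot instances are interrupted.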

What data do you collect?

  • From the Spark Eventlog
    • Event timestamps to estimate application start & end time
    • Various cluster configuration settings
    • Some cloud provider info, such as region
  • From the cluster report (init script output)
    • AWS API “describe instances” responses to get a record of the cluster instance composition
    • AWS API “describe volumes” responses to get a record of the cluster storage composition
    • Databricks job, cluster, and task information (all from Databricks API calls). Details such as instance types and cluster ID are gathered from here.
    • The above data is combined to estimate things like runtime and cost
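
To make the first bullet concrete, here is a minimal sketch of reading application start and end timestamps from a Spark event log. Spark writes the log as one JSON object per line, and the `SparkListenerApplicationStart` and `SparkListenerApplicationEnd` events each carry a `Timestamp` field in milliseconds since the epoch. The function name `estimate_app_window` is hypothetical, and this is not how Gradient itself is implemented.

```python
import json
from typing import Iterable, Optional, Tuple


def estimate_app_window(lines: Iterable[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (start_ms, end_ms) found in Spark event log lines.

    Either value is None when the corresponding event is missing,
    for example when the application did not shut down cleanly.
    """
    start: Optional[int] = None
    end: Optional[int] = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupt lines
        if not isinstance(event, dict):
            continue
        name = event.get("Event")
        if name == "SparkListenerApplicationStart":
            start = event.get("Timestamp")
        elif name == "SparkListenerApplicationEnd":
            end = event.get("Timestamp")
    return start, end


if __name__ == "__main__":
    sample = [
        '{"Event": "SparkListenerApplicationStart", "App Name": "demo", "Timestamp": 1700000000000}',
        '{"Event": "SparkListenerJobStart", "Job ID": 0}',
        '{"Event": "SparkListenerApplicationEnd", "Timestamp": 1700000600000}',
    ]
    start_ms, end_ms = estimate_app_window(sample)
    if start_ms is not None and end_ms is not None:
        print(f"Runtime: {(end_ms - start_ms) / 1000:.0f} s")
```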

Why did my costs go up during the learning phase?

During the learning phase, Gradient tests a few different configurations to understand how your job responds in terms of cost and runtime. Because of this, costs may temporarily increase.

Why do you need a Databricks token?

Gradient uses your Databricks token to access and integrate with your Jobs, so that cluster configurations can be tracked and updated entirely through the UI, making users' lives much easier.

Won't the extra training runs increase costs?

Running extra training steps outside of your normal workflow will increase costs by the price of those few extra job runs. For an initial proof of concept, this is a risk-free way of trying out Gradient.
However, when you are ready to apply Gradient in production, we recommend using runs you were already going to perform to minimize these training costs. See our guide for running Gradient in production for more details.

Can I train Gradient in my DEV environment?

Gradient optimizes clusters based on the actual code and data of your job. If your DEV environment's workloads are an exact clone of your PROD workloads, then yes, Gradient will work.
Users with highly sensitive PROD workloads driven by tight SLAs typically prefer to run Gradient in a cloned DEV environment.
If your DEV environment's workloads differ from your PROD workloads (e.g. they use a smaller data size or different code), then running Gradient in DEV will only optimize the cluster for your DEV workloads, and the result likely will not transfer to your PROD workloads.
If this is your use case, please reach out to Sync to find a good solution: [email protected]

Why does Sync recommend running in our PROD environment?

Training and testing recommendations in a DEV environment adds cost, since you have to pay for each test job run itself. This eats into your overall ROI gains with Gradient.
If you allow Gradient to "learn" while in production, it uses job runs you have to run anyway. This significantly reduces the cost overhead of optimization and increases your overall ROI.

What are the risks of letting Gradient run in PROD?

During the learning phase, Gradient will try different configurations to help characterize your job, which could result in fluctuations in cost and runtime.
If your jobs have strict SLA requirements, we recommend working with Sync to make sure your jobs stay within their SLA limits. Reach out to us via Intercom or email at [email protected]