Why do cluster recommendations have cost ranges?

Cluster costs for a Spark application run can vary due to autoscaling, Spot interruptions, and Spot fallback to On-demand.

During the Optimizing phase of a project, Gradient displays recommendation costs as a range to help you understand the variable cost of your Spark application.

How does Gradient calculate my costs?

When a Databricks cluster is running, there are two main sources of charges:

  1. Cloud provider charges, which are dominated by instance rental and storage costs. Each unique cloud resource (e.g. an EC2 instance or an EBS volume) has a charge rate in USD/hr, and the cost of each resource is the charge rate multiplied by the rental duration.

  2. Databricks platform charges. Just like the cloud provider, Databricks has a charge rate per instance, which is a function of the instance type, your Databricks Plan type, and the Databricks runtime engine (STANDARD or PHOTON). The instance rental is the only cloud resource that has an associated Databricks charge; storage comes for free!

The charge for a cluster can be estimated by adding together the individual costs of each resource. We use the list price for the charge rate of each resource, and resource durations are gathered by attaching an init_script to your cluster which periodically polls for the resources that compose it.
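As a rough illustration of the calculation described above, the sketch below sums charge rate times rental duration over every resource and adds a Databricks platform charge for instances only. The `Resource` class, its field names, and every rate shown are hypothetical placeholders chosen for this example; they are not Sync's code or real list prices.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """One cloud resource in the cluster (all values are illustrative)."""
    name: str
    kind: str                      # "instance" or "volume"
    cloud_rate_usd_hr: float       # cloud provider charge rate
    hours: float                   # rental duration
    dbx_rate_usd_hr: float = 0.0   # Databricks charge rate; only instances have one


def estimate_cluster_cost(resources):
    """Add the cloud and Databricks charges of every resource."""
    cloud = sum(r.cloud_rate_usd_hr * r.hours for r in resources)
    platform = sum(
        r.dbx_rate_usd_hr * r.hours for r in resources if r.kind == "instance"
    )
    return cloud + platform


# Hypothetical cluster: a driver, one worker, and the worker's volume.
resources = [
    Resource("driver", "instance", 0.40, 1.0, dbx_rate_usd_hr=0.15),
    Resource("worker-1", "instance", 0.40, 0.5, dbx_rate_usd_hr=0.15),
    Resource("worker-1-ebs", "volume", 0.02, 0.5),
]
total = estimate_cluster_cost(resources)
print(f"Estimated cluster cost: ${total:.3f}")
```

The volume contributes only a cloud charge, which mirrors the point that storage carries no Databricks charge.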

At the end of your job, that data is shipped to us where we can formulate a robust timeline of when each resource was added and removed from the cluster.
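One way to picture how periodic polls turn into a resource timeline is sketched below. It assumes each poll yields a timestamp and the set of resource IDs present at that moment; this snapshot format and the `build_timeline` helper are assumptions made for illustration, not the format the init script actually emits.

```python
def build_timeline(snapshots):
    """Map each resource ID to (first_seen, last_seen) poll timestamps.

    `snapshots` is a list of (timestamp_seconds, set_of_resource_ids),
    a hypothetical format chosen for this example.
    """
    timeline = {}
    for ts, resource_ids in sorted(snapshots, key=lambda s: s[0]):
        for rid in resource_ids:
            # Keep the earliest sighting, update the latest one.
            first, _ = timeline.get(rid, (ts, ts))
            timeline[rid] = (first, ts)
    return timeline


# Hypothetical polls taken every 30 seconds.
snapshots = [
    (0, {"i-driver"}),
    (30, {"i-driver", "i-worker1"}),
    (60, {"i-driver", "i-worker1", "i-worker2"}),
    (90, {"i-driver", "i-worker1"}),
]
timeline = build_timeline(snapshots)
durations = {rid: last - first for rid, (first, last) in timeline.items()}
print(durations)
```

In this simplified version a resource seen in only one poll gets a duration of zero, so durations are only resolved to within the polling interval.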

Why doesn’t Sync’s cluster cost match my Databricks billing?

The dominant source of error in Sync’s cost estimate comes from the resource duration. Sync polls the cluster resources only while the init_script is active. However, as depicted in the figure below, charges still accumulate for some time before the init_script runs.

We have found from studying our own internal jobs that the unaccounted time tends to be about 2-3 minutes per instance. For the shortest clusters, this error will result in our estimate being well below what you see in your billing. However, as the cluster duration gets longer the relative error diminishes and we expect the values to match quite well.
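To see why the relative error shrinks as clusters run longer, the short calculation below divides a fixed unaccounted time by the billed instance duration. It assumes a constant charge rate and uses 2.5 minutes as the midpoint of the 2-3 minute range above; the function name and the example durations are illustrative only.

```python
def relative_underestimate(billed_minutes, unaccounted_minutes=2.5):
    """Fraction of the billed instance time that the estimate misses."""
    return unaccounted_minutes / billed_minutes


for minutes in (10, 60, 240):
    share = relative_underestimate(minutes)
    print(f"{minutes:>4} min instance: estimate is ~{share:.1%} low")
```

Under these assumptions, a 10-minute instance is underestimated by about 25%, while a four-hour instance is underestimated by about 1%.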

We understand the importance of our estimates reflecting what you see in your billing, even for the shortest clusters. With that in mind, we will always strive to improve the accuracy of our estimates for all clusters, and we will keep our users informed of future improvements as they come.

What data do you collect?

  • From the Spark Eventlog

    • Event timestamps to estimate application start & end time

    • Various cluster configurations are checked

    • Some cloud provider info, such as region

  • From the cluster report (init script output)

    • AWS API “describe instances” responses to get a record of the cluster instance composition

    • AWS API “describe volumes” responses to get a record of the cluster storage composition

    • Databricks job, cluster, and task information (all from Databricks API calls). Things like instance types, cluster ID, and more are gathered from here.

    • The above data is combined to do things like runtime & cost estimation
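To give a sense of what a record of the cluster instance composition can look like, the sketch below counts instances by type in a dictionary shaped like an AWS EC2 “describe instances” response (`Reservations`, then `Instances`, then `InstanceType`). The sample data and the `summarize_instances` helper are made up for illustration; which fields Sync actually reads is not stated here.

```python
def summarize_instances(response):
    """Count instances by type in a describe-instances style response."""
    counts = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            itype = instance["InstanceType"]
            counts[itype] = counts.get(itype, 0) + 1
    return counts


# Hypothetical response, trimmed to the fields used above.
sample_response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0aaa", "InstanceType": "m5.large"}]},
        {
            "Instances": [
                {"InstanceId": "i-0bbb", "InstanceType": "i3.xlarge"},
                {"InstanceId": "i-0ccc", "InstanceType": "i3.xlarge"},
            ]
        },
    ]
}
composition = summarize_instances(sample_response)
print(composition)
```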

Why did my costs go up during the learning phase?

During the learning phase, Gradient will test out a few different configurations to understand how your job responds in terms of cost and runtime. Because of this, costs may temporarily increase.

Why do you need a Databricks token?

Gradient uses your Databricks token to access and integrate with your Jobs so that tracking and updating cluster configurations can be done entirely through the UI, making users' lives much easier.

Won't the extra training runs increase costs?

Running extra training steps outside of your normal workflow will increase costs via those few extra job runs. For an initial proof-of-concept, this is a risk-free way of trying out Gradient.

However, when users are ready to apply Gradient in production, we recommend utilizing runs you were already going to perform to minimize these training costs. See our guide for running Gradient in production for more details.

Can I train Gradient in my DEV environment?

Gradient optimizes clusters based on the actual code and data of your job. If your DEV environment's workloads are an exact clone of your PROD workloads, then yes, Gradient will work.

Users with highly sensitive PROD workloads driven by tight SLAs typically prefer to run Gradient in a cloned DEV environment.

If your DEV environment's workloads differ from your PROD workloads (e.g. they use a smaller data size or different code), then running Gradient in DEV will only optimize the cluster for your DEV workloads, and the result likely would not transfer to your PROD workloads.

If this is your use case, please reach out to Sync at support@synccomputing.com to find a good solution.

Why does Sync recommend running in our PROD environment?

Training and testing recommendations in a DEV environment adds cost, since you have to pay for the test job runs themselves. This will eat into your overall ROI gains with Gradient.

If you allow Gradient to "learn" while in production, you will utilize job runs you have to run anyway. This significantly reduces the cost overhead of optimization and dramatically increases your overall ROI.

What are the risks of letting Gradient run in PROD?

During the learning phase, Gradient will try different configurations to help characterize your job, which could result in fluctuations in cost and runtime.

If your jobs have strict SLA requirements, we recommend working with Sync to ensure your jobs stay within their SLA limits. Reach out to us via Intercom or email at support@synccomputing.com.

Where is my Databricks Webhook ID?

In your Databricks console, navigate to the webhook that you created. This is under Admin Settings -> Notifications -> Manage button for Notification destinations.

Next to the "Edit destination settings" title, there's a copy button. Click it to copy the Webhook ID (see image below).

What is the difference between Sync Hosted vs. Self Hosted?

Gradient needs to collect cluster and event logs for your job runs. There are two ways that Gradient can do this automatically:

  1. [Recommended] Sync Hosted: the infrastructure used to collect logs is hosted in Sync's environment. When Gradient receives a notification that your job run has started, it uses this infrastructure to monitor and fetch (or pull) logs from your environment to Sync.

  2. Self Hosted: the infrastructure used to collect logs is hosted in your environment. When Gradient receives a notification that your job has completed, it uses this infrastructure to send (or push) logs from your environment to Sync.

Self Hosted collection uses a Sync-provisioned all-purpose cluster in your Databricks environment to perform log collection once a job run completes. The all-purpose cluster runs a notebook that utilizes a Sync Gradient Python library to send the logs to Gradient.

Sync Hosted collection is much simpler and doesn't have the overhead of Self Hosted collection. Instead, it requires a few cloud permissions in order for Gradient to collect information about the cloud resources used to run a job. The infrastructure and code to perform log collection is entirely within Sync's environment.
