Cluster costs for a Spark application run can vary due to autoscaling, Spot interruptions, and Spot fallback to On-demand.
During the Optimizing phase of a project, Gradient displays recommendation costs as a range to help you understand the variable cost of your Spark application.
- From the Spark eventlog:
  - Event timestamps, used to estimate application start and end times
  - Various cluster configuration settings
  - Some cloud provider info, such as region
- From the cluster report (init script output):
  - AWS API "describe instances" responses, to get a record of the cluster's instance composition
  - AWS API "describe volumes" responses, to get a record of the cluster's storage composition
  - Databricks job, cluster, and task information (all from Databricks API calls) — instance types, cluster ID, and more are gathered from here
- The above data is combined to perform runtime and cost estimation
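To make the "instance composition" item above concrete, here is a minimal, illustrative sketch (not Gradient's actual code) of how a parsed EC2 "describe instances" response can be summarized into a record of instance types and lifecycles. The sample response structure follows the EC2 API; note that EC2 omits the `InstanceLifecycle` field for On-demand instances.

```python
from collections import Counter

def summarize_instances(response: dict) -> Counter:
    """Count cluster instances by (instance type, lifecycle)."""
    counts = Counter()
    for reservation in response.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            # EC2 only sets InstanceLifecycle for Spot (or scheduled)
            # instances; absence means On-demand.
            lifecycle = inst.get("InstanceLifecycle", "on-demand")
            counts[(inst["InstanceType"], lifecycle)] += 1
    return counts

# Hypothetical response: one On-demand driver, two Spot workers.
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceType": "i3.xlarge"},
            {"InstanceType": "i3.xlarge", "InstanceLifecycle": "spot"},
            {"InstanceType": "i3.xlarge", "InstanceLifecycle": "spot"},
        ]}
    ]
}
summary = summarize_instances(sample)
```

A record like this, combined with the "describe volumes" output and the eventlog timestamps, is what makes cost estimation possible: instance type and lifecycle determine the hourly rate, and the timestamps determine how long that rate applied.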
During the learning phase, Gradient will test out a few different configurations to understand how your job responds in terms of cost and runtime. Because of this, costs may momentarily increase.
Gradient uses your Databricks token to access and integrate with your Jobs, so tracking and updating cluster configurations can be done entirely through the UI, making users' lives much easier.
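For a sense of what token-based integration looks like, below is a hedged sketch of a request against the Databricks Jobs 2.1 API's `jobs/update` endpoint, which is the standard way to change a job's settings. The workspace URL, job ID, and payload shape are illustrative placeholders (real jobs may nest cluster specs per task or under `job_clusters`), and this is not Gradient's actual implementation.

```python
import json
import urllib.request

def build_update_request(host: str, token: str, job_id: int,
                         node_type: str, num_workers: int) -> urllib.request.Request:
    """Build a POST to /api/2.1/jobs/update that changes a job's cluster.

    Simplified payload: real job settings may nest the cluster spec
    per task or under job_clusters rather than at the top level.
    """
    payload = {
        "job_id": job_id,
        "new_settings": {
            "new_cluster": {
                "node_type_id": node_type,
                "num_workers": num_workers,
            }
        },
    }
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/update",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",  # the Databricks token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The point of the sketch is simply that a bearer token scoped to your workspace is enough to both read a job's current cluster configuration and push an updated one, which is why no manual editing in the Databricks UI is needed.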
Running extra training steps outside of your normal workflow will increase costs via those few extra job runs. For an initial proof of concept, this is a risk-free way to try out Gradient.
However, when you are ready to apply Gradient in production, we recommend utilizing runs you were already going to perform to minimize these training costs. See our guide for running Gradient in production for more details.
Gradient optimizes clusters based on the actual code and data of your job. If your DEV environment's workloads are an exact clone of your PROD workloads, then yes, Gradient will work.
Users with highly sensitive, tight-SLA PROD workloads typically prefer to run Gradient in a cloned DEV environment.
If your DEV environment's workloads differ from your PROD environment's (e.g. smaller data sizes, or different code), then running Gradient in DEV will only optimize the cluster for your DEV workloads, and those optimizations likely will not transfer to PROD.
Training and testing recommendations in a DEV environment adds cost, since you have to pay for the test job runs themselves. This eats into your overall ROI gains with Gradient.
If you allow Gradient to "learn" while in production, it utilizes job runs you have to perform anyway. This significantly reduces the cost overhead of optimization and dramatically increases your overall ROI.
During the learning phase, Gradient will try different configurations to help characterize your job, which could result in fluctuations in cost and runtime.