Under The Hood
Inside Gradient, a project loop continuously monitors and optimizes a workload, iterating each time the workload runs. The basic steps of the loop are outlined below:
1) Input information: A user provides their project settings and project data. For Apache Spark, this data consists of Spark event logs and cluster metrics.
2) Runtime prediction: The core of the continuous optimizer is a mathematical model of the Apache Spark internals. Specifically, the stages and tasks are simulated from the user's input event log. To evaluate a different Apache Spark configuration or cloud instance type, a transformation is applied to the tasks to simulate the runtime of the application under the new configuration.
3) Cost modeling: The estimated cost of the job is based on the basic formula (machine hours) × (charge rate). Although the formula is simple, evaluating it requires several API calls to multiple platforms (e.g. Databricks, AWS) to pull low-level machine usage information and real-time pricing information, which are then used to predict the cost of various configurations.
4) Recommendation selection: The final step is selecting the configuration that is predicted to meet the user's goals, as defined in the project settings, at the lowest cost. These changes can then be applied to the next job run. That run starts the loop again, and the accumulated historical information can be used to find further optimization opportunities.
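Steps 2 through 4 can be illustrated with a minimal sketch. It is not Gradient's implementation: the runtime model here is a deliberately simple stand-in (fixed serial time plus task time divided evenly across machines), and every name, price, and number in it (Candidate, predict_runtime_hours, estimate_cost, select_recommendation, the 0.5 charge rate) is hypothetical. Only the cost formula, (machine hours) × (charge rate), and the idea of picking the lowest-cost configuration that meets the user's goal come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical cluster configuration to evaluate."""
    name: str
    num_machines: int
    charge_rate: float  # made-up price, in USD per machine-hour


@dataclass(frozen=True)
class Prediction:
    candidate: Candidate
    runtime_hours: float
    cost: float


def predict_runtime_hours(serial_hours: float,
                          task_hours: Sequence[float],
                          num_machines: int) -> float:
    """Toy runtime model (an assumption, not the real simulator):
    serial work is fixed, task work divides evenly across machines."""
    return serial_hours + sum(task_hours) / num_machines


def estimate_cost(runtime_hours: float, candidate: Candidate) -> float:
    """Cost = (machine hours) * (charge rate)."""
    machine_hours = runtime_hours * candidate.num_machines
    return machine_hours * candidate.charge_rate


def select_recommendation(candidates: Sequence[Candidate],
                          serial_hours: float,
                          task_hours: Sequence[float],
                          max_runtime_hours: float) -> Optional[Prediction]:
    """Return the cheapest prediction whose runtime meets the goal,
    or None if no candidate meets it."""
    predictions = []
    for candidate in candidates:
        runtime = predict_runtime_hours(serial_hours, task_hours,
                                        candidate.num_machines)
        predictions.append(
            Prediction(candidate, runtime, estimate_cost(runtime, candidate)))
    feasible = [p for p in predictions if p.runtime_hours <= max_runtime_hours]
    return min(feasible, key=lambda p: p.cost, default=None)


# Illustrative inputs only; in practice these would come from event logs
# and pricing APIs rather than hard-coded values.
candidates = [
    Candidate("small", num_machines=2, charge_rate=0.5),
    Candidate("medium", num_machines=4, charge_rate=0.5),
    Candidate("large", num_machines=8, charge_rate=0.5),
]
best = select_recommendation(candidates, serial_hours=0.5,
                             task_hours=[2.0, 2.0, 4.0],
                             max_runtime_hours=3.0)
if best is not None:
    print(best.candidate.name, best.runtime_hours, best.cost)
    # prints: medium 2.5 5.0
```

In this toy setup the "small" cluster is cheapest overall (4.5) but misses the 3-hour runtime goal, so the "medium" cluster is selected. Feeding the observed runtime of the next run back into the inputs would correspond to the next iteration of the loop.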
The latest accuracy metrics, as of June 2023, are as follows. After a single iteration of the project loop on the TPC-DS benchmark, 87% of tested jobs achieved actual cost savings when the prediction moved from an autoscaled cluster to a fixed cluster. When the prediction moved from one fixed cluster to another, 65% of tested jobs achieved actual cost savings.
With additional iterations, the goal of Gradient is to improve on these gains as more information is gathered about each project.