Spark & Troubleshooting
February 4, 2022
Siddhartha Vemuganti
Data Engineering & Cloud Architecture
Spark's features and troubleshooting challenges
Spark has become the go-to tool/platform for processing “Big Data” and making the data available for Analytics, AI/ML and other applications.
In this blog post, let's look at potential solutions to some of the common questions that come up while troubleshooting Spark applications. These suggestions are based on past production deployment experience, and are intended to help optimize job and cluster performance and deliver cost savings to the organization or client implementing Spark in their ecosystem.
Spark job-related issues are typically handled by Data Engineering, Data Scientists and Analysts, while cluster/stack issues are handled by Operations/Administration and Data Engineering.
The assumption is that you are already using Spark and are familiar with its rich features and architecture. Those core features are what made it the market leader, but they can also make troubleshooting Spark applications complex. Below are some of the complexities to keep in mind while implementing Spark.
- In-memory execution engine
- Spark jobs fail for any number of reasons, but one recurring cause is a lack of sufficient available memory. Because execution happens in memory, generally only traces of temporary data written to disk survive to refer back to when figuring out the issue. Inefficient planning can also lead to a spike in operating costs on the cloud.
- Parallel processing
- Because data is partitioned and processed in parallel across the cluster, tracing an application issue becomes difficult - like looking for a needle in a haystack.
- Multiple versions
- Spark versions from 1.x to 3.x, along with platform-specific variants (on-prem and cloud implementations such as AWS, Azure, GCP and Databricks), come with different sets of capabilities.
- Configuration options - An “R & D” approach
- Getting the hundreds of configuration options, including hardware and software environment settings, "optimized" and "right" 👍😅, without over-allocating resources (memory and CPUs), sometimes feels like you are tweaking an Enigma machine 😲 (a small configuration sketch follows this list)
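To make the configuration point concrete, here is a minimal PySpark sketch of how a handful of these options are typically set when building a session. The values shown are illustrative assumptions, not recommendations for any particular workload.

```python
from pyspark.sql import SparkSession

# A small, illustrative subset of the hundreds of available options.
# Every value below is a placeholder to be tuned for your own cluster.
spark = (
    SparkSession.builder
    .appName("config-sketch")
    .config("spark.executor.instances", "10")       # how many executors to request
    .config("spark.executor.cores", "5")            # cores per executor
    .config("spark.executor.memory", "19g")         # heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # off-heap overhead per executor
    .config("spark.sql.shuffle.partitions", "200")  # partitions created after shuffles
    .getOrCreate()
)
```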
At some point or another, we have all scoured 🔎 documentation and online forums such as Stack Overflow to solve the issue at hand. In keeping with that practice, let's get into a few common questions and potential answers regarding Spark jobs and clusters.
Spark Jobs
How many executors, cores and how much memory should be allocated for each job?
These decisions depend on factors such as the number of nodes, the memory on each node, the total cores in the cluster and the number of jobs planned on the cluster. Together these determine core availability and the memory available for computation after accounting for overhead, cluster daemons such as the cluster manager, and system resources.
Data partitioning is the beginner's guide to achieving optimized usage of the cluster (cores, memory per core). A common recommendation is to start with two to three cores per executor, and to use no more than five cores per executor, for good throughput.
Some good rules of thumb to remember, which help arrive at an informed answer (a worked example follows the list), are:
- Number of available executors = Total cluster cores / Number of cores per executor
- Number of executors per node = Number of available executors / Number of cluster nodes
- Memory per executor = (Node memory / Number of executors per node) - heap overhead (~7%)
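Applied to a hypothetical cluster, the arithmetic looks like the sketch below. Every number in it (node count, cores per node, memory per node) is an assumption for illustration only.

```python
# Illustrative sizing sketch based on the rules of thumb above.
# All cluster numbers here are hypothetical assumptions.

nodes = 10                 # worker nodes in the (hypothetical) cluster
cores_per_node = 16        # cores on each node
memory_per_node_gb = 64    # RAM on each node, in GB
cores_per_executor = 5     # upper bound suggested above

# Reserve 1 core and 1 GB per node for the OS and cluster daemons.
usable_cores = nodes * (cores_per_node - 1)
usable_memory_gb = memory_per_node_gb - 1

executors_total = usable_cores // cores_per_executor   # available executors
executors_per_node = executors_total // nodes          # executors on each node
raw_memory_per_executor = usable_memory_gb / executors_per_node
memory_per_executor_gb = raw_memory_per_executor * (1 - 0.07)  # subtract ~7% heap overhead

print(f"--num-executors {executors_total - 1}")  # leave one executor's worth for the YARN AM
print(f"--executor-cores {cores_per_executor}")
print(f"--executor-memory {int(memory_per_executor_gb)}g")
```

With these assumed numbers the sketch lands at 29 executors, 5 cores each and roughly 19 GB of heap per executor.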
How can I identify and correct data skew?
- Low CPU utilization, slow-progressing stages and stalling tasks (all visible in the Spark UI), often accompanied by OOM issues, are the first signs to check for data skew. Spark jobs either fail or stall because they are trying to move large partitions where one key, or a few keys, hold a large share of the total data.
- Use a broadcast join when joining a skewed dataset to a small one, and if "hot spots" occur, repartitioning or coalescing the data may be options to consider (see the sketch after this list)
- Handle null values, if any, while preprocessing the data
- Use the salting technique on join keys to allow for an even distribution of data
- Avoid small files, as they generate a lot of metadata and slow down the cluster’s performance
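Here is a minimal PySpark sketch of the broadcast-join and key-salting approaches mentioned above. The table paths, the join key user_id and the salt factor are assumptions for illustration, not values from any specific pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-mitigation-sketch").getOrCreate()

# Hypothetical inputs: a large, skewed fact table and a small dimension table.
large_df = spark.read.parquet("s3://bucket/events")      # assumed path
small_df = spark.read.parquet("s3://bucket/dim_users")   # assumed path

# 1) Broadcast join: ship the small table to every executor so the skewed
#    keys on the large side never have to be shuffled.
joined = large_df.join(F.broadcast(small_df), on="user_id", how="left")

# 2) Salting: spread a hot key across N buckets by appending a random salt
#    on the large side and exploding the small side to match every bucket.
N = 16  # salt factor, an assumption; tune to the degree of skew
salted_large = large_df.withColumn(
    "salted_key", F.concat_ws("_", "user_id", (F.rand() * N).cast("int"))
)
salted_small = (
    small_df
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))
    .withColumn("salted_key", F.concat_ws("_", "user_id", "salt"))
    .drop("salt")
)
salted_join = salted_large.join(salted_small, on="salted_key", how="left")
```

On Spark 3.x, Adaptive Query Execution can also split skewed shuffle partitions automatically via spark.sql.adaptive.skewJoin.enabled, which is worth trying before hand-rolling salting logic.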
Other issues can arise as you determine which jobs need to be executed on a specific cluster, i.e. either at the cluster level or at the stack level. It is therefore important for all teams in a Spark job's lifecycle to know not only the type of data being processed and the job's demands on the cluster (number of executors, number of cores, amount of memory allocated), but also the cluster design, the types of nodes that make up the cluster, resource availability, storage and so on.
Some of the many challenges that may arise are:
Cluster/Stack
- How should I size my cluster and which servers/instance types should I choose?
- It depends on many factors and decisions, starting with business needs, on-prem or cloud platform, budget, data (size, volume, frequency, type, historical, initial, incremental, etc.), use case, storage needs, SLAs, solution architecture, application design and other dependent upstream or downstream systems.
- How can I view what is happening throughout the Spark cluster and applications?
- Access to the Spark UI gives a good perspective on job performance, information about processing stages and so on. In addition, the YARN Web UI can provide cluster-level information and show how Spark jobs use cluster resources
- Access to the YARN REST API opens the door to monitoring job-level metrics via job-specific ports when deploying jobs on a YARN cluster - a treasure trove of data to analyze and fine-tune things at all levels (see the sketch below)
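As a quick illustration, the sketch below pulls per-application metrics for running jobs from the YARN ResourceManager REST API. The ResourceManager host name is an assumption; the /ws/v1/cluster/apps endpoint and the fields read from the response are part of the standard YARN API.

```python
import requests

# Hypothetical ResourceManager address; 8088 is the default RM web port.
RM = "http://resourcemanager.example.com:8088"

# Standard YARN ResourceManager REST endpoint for cluster applications.
resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
resp.raise_for_status()

apps = (resp.json().get("apps") or {}).get("app", [])
for app in apps:
    # A few of the per-application metrics YARN reports.
    print(
        app["id"],
        app["name"],
        f'{app["allocatedMB"]} MB',
        f'{app["allocatedVCores"]} vcores',
        f'{app["elapsedTime"] / 1000:.0f} s elapsed',
    )
```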
Conclusion
An organization that encourages and supports a culture of knowledge sharing, and that teams up developers, engineers and data scientists with the operations team, will not only allocate resources more optimally thanks to clear communication, but will also mitigate problems and issues across jobs and clusters as effectively as possible and in a short amount of time. The outcome is stronger, more productive teams, with the work-life balance to deliver on business needs and the organizational bottom line - a win-win for all.