AWS’s Elastic MapReduce (EMR) can occasionally get stuck in a Resizing status while the capacity of an instance group is being changed. In these cases, the actual number of running instances won’t match the requested number. Elastigroup’s EMR Auto-Recovery process is designed to handle this situation.
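To make the mismatch concrete, here is a minimal sketch that scans instance group records for groups that are resizing while their running count differs from the requested count. It assumes records shaped like the `InstanceGroups` entries returned by the EMR `ListInstanceGroups` API (`Id`, `RequestedInstanceCount`, `RunningInstanceCount`, `Status.State`). The helper name `find_resizing_mismatches` and the sample data are hypothetical, and this is not a description of how Elastigroup itself detects the condition.

```python
from typing import Any


def find_resizing_mismatches(instance_groups: list[dict[str, Any]]) -> list[str]:
    """Return the IDs of groups that are RESIZING with running != requested.

    Hypothetical helper: it only reports the mismatch at one point in time
    and does not know how long the group has been resizing.
    """
    mismatched = []
    for group in instance_groups:
        state = group.get("Status", {}).get("State")
        requested = group.get("RequestedInstanceCount", 0)
        running = group.get("RunningInstanceCount", 0)
        if state == "RESIZING" and running != requested:
            mismatched.append(group["Id"])
    return mismatched


# Made-up sample data for illustration only.
sample_groups = [
    {"Id": "ig-AAA", "RequestedInstanceCount": 10, "RunningInstanceCount": 7,
     "Status": {"State": "RESIZING"}},
    {"Id": "ig-BBB", "RequestedInstanceCount": 4, "RunningInstanceCount": 4,
     "Status": {"State": "RUNNING"}},
]

mismatched_ids = find_resizing_mismatches(sample_groups)
print(mismatched_ids)
```

A group reported here is only a candidate: a healthy resize also shows a temporary mismatch, which is why a time limit is needed before treating it as stuck.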
Here’s what a stuck EMR looks like without Elastigroup’s EMR Auto-Recovery:
How It Works
When a change in instance group capacity is applied, a process monitors the status of the resize. If the resize exceeds the time limit of 30 minutes, Spotinst Elastigroup automatically stops the resizing process on that instance group and creates a new instance group with the same configuration to fall back to.
The original instance group is “banned” for 2 hours, and all new instance launches are applied to the new instance group. For example, if 3 instances that were requested in the original instance group are still missing, they are launched as part of the new instance group.
Once the resizing process is finished, scaling operations will be resumed.
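The flow described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the `InstanceGroup` class, the `recover_if_stuck` function, and the way the stopped resize is modeled (settling the request at the running count) are all hypothetical and are not Elastigroup's actual implementation or API. Only the 30-minute time limit and the 2-hour ban come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESIZE_TIME_LIMIT = timedelta(minutes=30)  # time limit from the text
BAN_DURATION = timedelta(hours=2)          # ban period from the text


@dataclass
class InstanceGroup:
    """Hypothetical in-memory model of an EMR instance group."""
    group_id: str
    requested: int
    running: int
    resize_started_at: Optional[datetime] = None
    banned_until: Optional[datetime] = None

    def is_banned(self, now: datetime) -> bool:
        return self.banned_until is not None and now < self.banned_until


def recover_if_stuck(group: InstanceGroup, now: datetime,
                     new_group_id: str) -> Optional[InstanceGroup]:
    """Return a fallback group if the resize exceeded the time limit, else None."""
    if group.resize_started_at is None:
        return None  # no resize in progress
    if now - group.resize_started_at <= RESIZE_TIME_LIMIT:
        return None  # still within the time limit
    missing = max(group.requested - group.running, 0)
    # Assumption: stopping the resize settles the request at the running count.
    group.requested = group.running
    group.resize_started_at = None
    group.banned_until = now + BAN_DURATION
    # The missing instances are requested in a new group instead.
    return InstanceGroup(group_id=new_group_id, requested=missing,
                         running=0, resize_started_at=now)


start = datetime(2024, 1, 1, 12, 0)
original = InstanceGroup("ig-original", requested=10, running=7,
                         resize_started_at=start)

# 31 minutes later the resize is still incomplete, so recovery kicks in.
check_time = start + timedelta(minutes=31)
fallback = recover_if_stuck(original, check_time, "ig-fallback")
print(fallback.requested, original.is_banned(check_time))
```

In this example the original group was missing 3 of its 10 requested instances, so the fallback group requests those 3, and the original group stays banned until two hours after the check.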
Here’s a diagram of the Auto-Recovery process:
- Create a Wrapped EMR Cluster on Elastigroup to run task nodes for your existing EMR cluster on Spot instances.
- Clone your existing EMR cluster into an Elastigroup.
- Learn about Elastigroup’s Scaling Policies for EMR.