Primary node election in distributed computing

Sat, 05 Sep 2020

Primary node can help with coordination of multiple node deployed in a cloud. One of the projects I've worked on had a NodeJS worker that ran multiple types of tasks. I wanted to upgrade this setup in order to be easily scalable, have a primary node (or coordinator) that triggers the tasks and continue processing even when some of the nodes fail.

The Checklist

  • All nodes may participate in an "election" to choose the coordinator
  • Support any number of nodes, including 1 node setup
  • Handle node fail (even the coordinator node) without impacting the flow
  • Allow new nodes to join the party
  • Don't depend on an expensive technology (paid or resource hungry)

Besides the main scope of the solution I also needed to ensure that the election of the coordinator follows these 3 basic principles:

  • Termination: the election process must complete in a finite time
  • Uniqueness: only one node can be coordinator
  • Agreement: all other nodes know who's the coordinator.

After I established that I have all the scenarios in mind, I started investigating different algorithms and solutions that the market uses (Apache Zookeeper, Port Locking, Ring Networks). However, most of these require a lot of setup or were incompatible in a multi server setup and I also wanted to embrace a KISS approach so continue reading to see the solution.

The Primary Node Election Algorithm

  1. Node generates a random numeric id

  2. Node retrieves a COORDINATOR_ID key from Redis

  3. If key !NULL

    • We have a coordinator
    • Wait Z minutes (e.g. Z = 1 hour)
    • GoTo Step2
  4. If key NULL

    • No coordinator announce

    • Push the id from Step1 in a Redis list

    • Waits X seconds (depending on how long the deployment takes, e.g. 10 seconds)

    • Retrieve all items in the list and extract the highest number

    • If result === node id

      • Current node is primary
      • Set Redis key COORDINATOR_ID with expiry Z+X
      • Do all the hard work :)
    • Wait Z minutes

    • GoTo Step2

Downside of this solution is that if the coordinator node fails, it actually takes 2*Z until a new election takes place.

There's room for improvement so please don't hesitate to leave a feedback :)

Categories: how-to, javascript, linux, nodejs