My lesson learned with distributed lock

sonht1109, distributed lock, redis

A distributed lock really helped me solve my issue, but that doesn't mean the issue went away the moment I started using one.

System Overview

Okay. Imagine that I'm working on a feature that distributes vouchers to users. For example, when a user signs up, they receive a voucher item. The simplified flow is shown in the picture below.

[Image: my lesson learned with distributed lock]

So the input is a user_id and the output is a voucher item distributed to that user.

Let's see how Voucher Service distributes a voucher item to a user. In the real world, this service is so complicated that none of us really knows how it works hahaha, but just imagine that we have one table called voucher_item.

voucher_item

id   user_id   code
1    1         BV1LN2KJ
2    null      ASXKKQRG

If user_id is not null (e.g. 1), it means that user has already claimed the voucher item; in this case user 1 has voucher item 1. Otherwise, the voucher is waiting to be claimed.

So when Voucher Service consumes a message, it queries the first valid voucher that has {user_id: null} and then updates that voucher with the corresponding user_id. Done.
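The claim step can be sketched roughly like this. It is only an in-memory illustration, not the real Voucher Service: the VoucherItem type, the claimVoucher function, and the array standing in for the voucher_item table are all names I made up for the example.

```typescript
// Hypothetical in-memory stand-in for the voucher_item table.
type VoucherItem = { id: number; userId: number | null; code: string };

const voucherItems: VoucherItem[] = [
  { id: 1, userId: 1, code: "BV1LN2KJ" },
  { id: 2, userId: null, code: "ASXKKQRG" },
];

// Query the first unclaimed voucher ({user_id: null}) and assign it to the user.
function claimVoucher(table: VoucherItem[], userId: number): VoucherItem | undefined {
  const voucher = table.find((item) => item.userId === null);
  if (voucher === undefined) {
    return undefined; // no voucher left to claim
  }
  voucher.userId = userId;
  return voucher;
}

const claimed = claimVoucher(voucherItems, 2);
console.log(claimed?.code); // ASXKKQRG
```

Note that the read and the update are two separate steps here, which is fine as long as only one handler runs at a time.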

[Image: my lesson learned with distributed lock]

Scenario

Then I received a task related to vouchers, but it's not about signup. It's about employee import: an admin imports a list of employees into the system and then gives all of them vouchers. Actually, it's quite similar to the signup feature, but the input is now an array of users instead of one user.

But, as I said, Voucher Service is too complicated, so it's not smart to implement another function to handle that. The easiest way I came up with is to push a bunch of messages into the queue. Let's say the admin imports 10 users, from 1 to 10, so we push 10 separate messages into the queue to reuse the available logic.
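A minimal sketch of that fan-out, assuming the existing consumer only needs a user_id per message. The VoucherMessage shape, the Queue interface, and the InMemoryQueue class are hypothetical stand-ins; a real broker client would take their place.

```typescript
// Hypothetical message shape: the existing consumer only needs a user id.
type VoucherMessage = { userId: number };

// Hypothetical queue client interface.
interface Queue {
  publish(message: VoucherMessage): void;
}

// In-memory queue, used only to make the sketch runnable.
class InMemoryQueue implements Queue {
  readonly messages: VoucherMessage[] = [];

  publish(message: VoucherMessage): void {
    this.messages.push(message);
  }
}

// Employee import: one message per imported user, so the consumer logic stays unchanged.
function publishVoucherMessages(queue: Queue, userIds: number[]): void {
  for (const userId of userIds) {
    queue.publish({ userId });
  }
}

const queue = new InMemoryQueue();
const importedUserIds = Array.from({ length: 10 }, (_, i) => i + 1); // users 1..10
publishVoucherMessages(queue, importedUserIds);
console.log(queue.messages.length); // 10
```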

[Image: my lesson learned with distributed lock]

Okay, let's keep the other steps (3) and (4) as they are; no need to change anything. We have a queue, so messages will be handled sequentially. I tested it on my local machine and it worked perfectly, so let's ask the QA team to test. Hahaha easy easy.

But then, "Nah, please fix that. It doesn't work on the dev env", the QA team said.

*The dev env is the environment that QA uses for testing.

What's wrong here?

Based on the QA team's actual result, they tested with only 2 employees:

employees

id
1
2

Hey hey, are you kidding me? Sounds like magic. I tested it and it worked perfectly on my local machine, right?

Let's see. If user 1 already received the notification, it means the voucher was already claimed by user 1 (see the logic flow above). But somehow, user 2 claimed the voucher of user 1. Then I realized that the biggest difference between my local machine and the dev env is that the dev env has 2 pods of Voucher Service (auto scaling). Okay, let's update the diagram.

[Image: my lesson learned with distributed lock]

Much clearer now. The 2 messages can be handled at almost the same time, and the 2 message handlers can point to the same voucher item.

Okay, so our issue is that 2 processes both get access to the same resource. Definitely, a distributed lock will save us in this case.
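To make the race concrete, here is a small in-memory simulation of two pods handling two messages at almost the same time. It is a sketch, not the real service: the table contents, the voucher codes, and the 5ms delay standing in for the database round trip are all made up.

```typescript
type VoucherItem = { id: number; userId: number | null; code: string };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Read-then-update handler, with a small delay between the read and the write.
async function handleMessage(table: VoucherItem[], userId: number): Promise<number> {
  const voucher = table.find((item) => item.userId === null); // read
  if (voucher === undefined) throw new Error("no voucher left");
  await sleep(5); // simulated database round trip (made-up value)
  voucher.userId = userId; // write
  return voucher.id;
}

// Two pods consuming two messages at almost the same time.
async function simulateTwoPods(): Promise<{ pod1: number; pod2: number; table: VoucherItem[] }> {
  const table: VoucherItem[] = [
    { id: 1, userId: null, code: "CODE-A" }, // hypothetical codes
    { id: 2, userId: null, code: "CODE-B" },
  ];
  const [pod1, pod2] = await Promise.all([handleMessage(table, 1), handleMessage(table, 2)]);
  return { pod1, pod2, table };
}

simulateTwoPods().then(({ pod1, pod2 }) => {
  console.log(pod1 === pod2); // true: both handlers picked the same voucher item
});
```

Both handlers read before either one writes, so they both see voucher 1 as unclaimed, and voucher 2 is never touched.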

Actually, I intended to implement the distributed lock myself using Redis setnx, but then I was recommended to use Redlock, which is designed to work with multiple Redis nodes.

Let's apply the lock in Voucher Service. We can easily find the nodejs redlock package and install it.

But where should I put the lock? Hmm, it should be placed right after getting the voucher item from the database, since we can use voucher_item.id as the redlock key.

[Image: my lesson learned with distributed lock]

One more time, QA said "Nah sir, the issue is still happening. Please fix that."

What's wrong again?

Is redlock lying to me? Nah. Let's see what happens. Since the 2 messages can be handled at almost the same time, I came up with this theory.

[Image: my lesson learned with distributed lock]

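Yahhhhhh. Seems reasonable. The first handler acquires the lock, updates the voucher item, and releases the lock so fast that the second handler, which has already read the same voucher item, never finds the lock held.

Okay. Let's hold the lock for some extra time before releasing it, and the issue should be solved.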
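The theory can be replayed with a tiny simulation. This is a sketch under assumptions: FakeLock is an in-memory stand-in and NOT the redlock API, the delays are made up, and the second pod is assumed to reach the lock step 10ms after it read the voucher item.

```typescript
type VoucherItem = { id: number; userId: number | null; code: string };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// In-memory stand-in for the lock store. This is NOT the redlock API.
class FakeLock {
  private readonly held = new Set<string>();

  tryAcquire(key: string): boolean {
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }

  release(key: string): void {
    this.held.delete(key);
  }
}

// Read first, then lock on voucher_item.id, update, and release right away.
async function handleMessage(
  table: VoucherItem[],
  lock: FakeLock,
  userId: number,
  delayBeforeLockMs: number,
): Promise<number> {
  const voucher = table.find((item) => item.userId === null); // read happens outside the lock
  if (voucher === undefined) throw new Error("no voucher left");

  await sleep(delayBeforeLockMs); // time until this handler reaches the lock step

  const key = `voucher_item:${voucher.id}`;
  if (!lock.tryAcquire(key)) throw new Error(`lock busy: ${key}`);
  try {
    voucher.userId = userId; // update based on a possibly stale read
  } finally {
    lock.release(key); // released immediately after the update
  }
  return voucher.id;
}

async function simulate(): Promise<{ pod1: number; pod2: number; table: VoucherItem[] }> {
  const table: VoucherItem[] = [
    { id: 1, userId: null, code: "CODE-A" }, // hypothetical codes
    { id: 2, userId: null, code: "CODE-B" },
  ];
  const lock = new FakeLock();
  const [pod1, pod2] = await Promise.all([
    handleMessage(table, lock, 1, 0),
    handleMessage(table, lock, 2, 10),
  ]);
  return { pod1, pod2, table };
}

simulate().then(({ pod1, pod2 }) => {
  console.log(pod1, pod2); // 1 1: the lock was never contended, yet both pods claimed voucher 1
});
```

The lock itself works. The problem is that the read happened before the lock, so the second pod locks and updates a voucher item that is no longer free.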

[Image: my lesson learned with distributed lock]

I changed the lock release time from t1 to t2. But you may wonder, what should the gap (t2 - t1) be? I think it should be at least equal to the execution time of the logic between acquiring the lock and releasing it. For example, if the logic above takes 10ms to acquire the lock + update the voucher item + release the lock, then t2 - t1 should be >= 10ms.
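Here is a rough simulation of holding the lock until t2. It rests on one big assumption that is mine, not something the real setup guarantees: a handler that fails to get the lock gives up and starts again from the query step, so it re-reads the table. If it instead waited for the lock and continued with the row it had already read, the stale read would still be there. FakeLock is an in-memory stand-in (NOT the redlock API), and holdMs plays the role of t2 - t1.

```typescript
type VoucherItem = { id: number; userId: number | null; code: string };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// In-memory stand-in for the lock store. This is NOT the redlock API.
class FakeLock {
  private readonly held = new Set<string>();

  tryAcquire(key: string): boolean {
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }

  release(key: string): void {
    this.held.delete(key);
  }
}

async function handleMessage(
  table: VoucherItem[],
  lock: FakeLock,
  userId: number,
  delayBeforeLockMs: number,
  holdMs: number,
): Promise<number> {
  while (true) {
    const voucher = table.find((item) => item.userId === null); // read outside the lock
    if (voucher === undefined) throw new Error("no voucher left");

    await sleep(delayBeforeLockMs);

    const key = `voucher_item:${voucher.id}`;
    if (!lock.tryAcquire(key)) {
      // Assumption: give up, wait a bit, and reprocess from the query step.
      await sleep(holdMs);
      continue;
    }
    try {
      voucher.userId = userId;
      await sleep(holdMs); // keep holding the lock until t2 instead of releasing at t1
    } finally {
      lock.release(key);
    }
    return voucher.id;
  }
}

async function simulate(holdMs: number): Promise<{ pod1: number; pod2: number }> {
  const table: VoucherItem[] = [
    { id: 1, userId: null, code: "CODE-A" }, // hypothetical codes
    { id: 2, userId: null, code: "CODE-B" },
  ];
  const lock = new FakeLock();
  const [pod1, pod2] = await Promise.all([
    handleMessage(table, lock, 1, 0, holdMs),
    handleMessage(table, lock, 2, 20, holdMs), // pod 2 reaches the lock step 20ms after its read
  ]);
  return { pod1, pod2 };
}

simulate(60).then(({ pod1, pod2 }) => {
  console.log(pod1, pod2); // 1 2: pod 2 found the lock held, re-read, and took voucher 2
});
```

Calling simulate(0) instead releases the lock right away, and both pods end up with voucher 1 again. In this sketch the hold time has to cover the delay between the other handler's read and its lock attempt.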

Conclusion

That's all I want to share. The lesson learned here: choose the lock release time carefully. It will save your life.


Thank you for reading. All comments and feedback are welcome.


© 2023 - 2024 by sonht1109