by Yan Cui
How to schedule ad-hoc tasks with DynamoDB TTL and Lambda
CloudWatch Events lets you easily create cron jobs with Lambda. However, it’s not designed for running lots of ad-hoc tasks, each to be executed once, at a specific time. The default limit on CloudWatch Events is a lowly 100 rules per region per account. It’s a soft limit, so it’s possible to request a limit increase. But the low initial limit suggests it’s not designed for use cases where you need to schedule millions of ad-hoc tasks.
CloudWatch Events is designed for executing recurring tasks.
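For context, a “rule” here is just a schedule expression pointed at a Lambda function. Below is a minimal boto3 sketch of what one of those 100 rules might look like; the rule name, cron expression, and function ARN are made-up placeholders for illustration only.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical rule and function names, used only for illustration.
RULE_NAME = "nightly-cleanup"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:nightly-cleanup"

# A recurring schedule: run at 03:00 UTC every day.
rule = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="cron(0 3 * * ? *)",
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the function...
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId=f"{RULE_NAME}-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# ...and point the rule at it.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "1", "Arn": FUNCTION_ARN}],
)
```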
The Problem
It’s possible to do this in just about every programming language. For example, .Net has the Timer class and JavaScript has the setInterval function. But I often find myself wanting a service abstraction to work with. There are many use cases for such a service, for example:
- A tournament system for games would need to execute business logic when the tournament starts and finishes.
- An event system (think eventbrite.com or meetup.com) would need a mechanism to send out timely reminders to attendees.
- A to-do tracker (think Wunderlist) would need a mechanism to send out reminders when a to-do task is due.
However, AWS does not offer a service for this type of workload. CloudWatch Events is the closest thing, but as discussed, it’s not intended for these use cases. You can, however, implement them with cron jobs. But such implementations have other challenges.
I have implemented such a service abstraction a few times in my career already. I experimented with a number of different approaches:
- a cron job (with CloudWatch Events)
- wrapping the .Net Timer class as an HTTP endpoint
- using SQS Visibility Timeout to hide tasks until they’re due (see the sketch after this list)
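As a rough illustration of that last approach, here is a sketch of a consumer that keeps messages hidden until they are due. The queue URL, message shape, and run_task function are assumptions made for this example, and note that a single visibility timeout is capped at 12 hours.

```python
import json
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scheduled-tasks"  # placeholder

def poll_once():
    """Receive tasks, run the ones that are due, and re-hide the rest."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])  # e.g. {"execute_at": 1700000000, ...}
        seconds_until_due = int(task["execute_at"] - time.time())

        if seconds_until_due <= 0:
            run_task(task)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        else:
            # Hide the message again until (roughly) its due time; a single
            # visibility timeout maxes out at 12 hours, so far-future tasks
            # have to be re-hidden repeatedly.
            sqs.change_message_visibility(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=min(seconds_until_due, 43200),
            )

def run_task(task):
    # Hypothetical stand-in for the actual business logic.
    print("executing", task)
```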
And lately, I have seen a number of folks use DynamoDB Time-To-Live (TTL) to implement these ad-hoc tasks. In this post, we will take a look at this approach and see where it might be applicable to you.
How do we measure the approach?
For this type of ad-hoc task, we normally care about:
- Precision: how close to my scheduled time is the task executed? The closer the better.
- Scale (number of open tasks): can the solution scale to support many open tasks, i.e. tasks that are scheduled but not yet executed?
- Scale (hotspots): can the solution scale to execute many tasks around the same time? E.g. millions of people set a timer to remind themselves to watch the Super Bowl, so all the timers fire within close proximity to kickoff time.
DynamoDB TTL as a scheduling mechanism
At a high level, the approach looks like this:
- A scheduled_items DynamoDB table which holds all the tasks that are scheduled for execution.
- A scheduler function that writes the scheduled task into the scheduled_items table, with the TTL set to the scheduled execution time (see the sketch after this list).
- An execute-on-schedule function that subscribes to the DynamoDB Stream for scheduled_items and reacts to REMOVE events. These events correspond to when items have been deleted from the table.
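As a sketch of the first two pieces, here is what the table setup and the scheduler function might look like in Python. The item attributes (task_id, ttl, payload, scheduled_for) are assumptions made for this illustration, not taken from the original post.

```python
import uuid
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scheduled_items")

def enable_ttl():
    """One-off setup: tell DynamoDB which attribute holds the expiry timestamp."""
    boto3.client("dynamodb").update_time_to_live(
        TableName="scheduled_items",
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
    )

def scheduler(event, context):
    """Writes a task into scheduled_items with the TTL set to its execution time.

    Assumes the incoming event carries 'execute_at' (epoch seconds) and an
    optional 'payload'; the attribute names are made up for this sketch.
    """
    execute_at = int(event["execute_at"])
    table.put_item(
        Item={
            "task_id": str(uuid.uuid4()),  # assumed partition key
            "ttl": execute_at,             # DynamoDB deletes the item at/after this time
            "scheduled_for": execute_at,   # kept so the consumer can measure lag
            "payload": event.get("payload", {}),
        }
    )
```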
Scalability (number of open tasks)
Since the number of open tasks just translates to the number of items in the scheduled_items table, this approach can scale to millions of open tasks.
DynamoDB can handle large throughputs (thousands of TPS) too. So this approach can also be applied to scenarios where thousands of items are scheduled per second.
Scalability (hotspots)
When many items are deleted at the same time, they are simply queued in the DynamoDB Stream. AWS also auto-scales the number of shards in the stream, so as throughput increases, the number of shards goes up accordingly.
But events are processed in sequence, so it can take some time for your function to process an event, depending on:
- its position in the stream, and
- how long it takes to process each event.
So, while this approach can scale to support many tasks all expiring at the same time, it cannot guarantee that tasks are executed on time.
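For reference, here is a minimal sketch of what the execute-on-schedule function might look like, assuming the stream is configured with the OLD_IMAGE (or NEW_AND_OLD_IMAGES) view type and the item shape from the scheduler sketch above.

```python
import time

def execute_on_schedule(event, context):
    """Triggered by the scheduled_items DynamoDB Stream."""
    now = int(time.time())
    for record in event["Records"]:
        # INSERT and MODIFY events also arrive on the stream; only deletions matter here.
        if record["eventName"] != "REMOVE":
            continue

        # TTL deletions carry userIdentity.principalId == "dynamodb.amazonaws.com",
        # which lets you tell them apart from manual deletes if you need to.
        old_image = record["dynamodb"].get("OldImage", {})
        scheduled_for = int(old_image.get("scheduled_for", {}).get("N", "0"))
        print(f"task was due at {scheduled_for}, picked up {now - scheduled_for}s late")

        run_task(old_image)

def run_task(old_image):
    # The stream hands you raw DynamoDB JSON, e.g. {"task_id": {"S": "..."}};
    # this is where the actual business logic would go.
    print("executing task", old_image.get("task_id", {}).get("S"))
```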
Precision
This is the big question about this approach. According to the official documentation, expired items are deleted within 48 hours. That is a huge margin of error!
As an experiment, I set up a Step Functions state machine to:
- add a configurable number of items to the scheduled_items table, with TTLs expiring between 1 and 10 minutes (see the sketch after this list)
- track the time each task is scheduled for and when it’s actually picked up by the execute-on-schedule function
- wait for all the items to be deleted
I performed several runs of tests. The results were consistent regardless of the number of items in the table: on average, a task was executed over 11 minutes AFTER its scheduled time.
I repeated the experiments in several other AWS regions.
I don’t know why there is such a marked difference between US-EAST-1 and the other regions. One explanation is that the TTL process requires a bit of time to kick in after a table is created. Since I was developing against the US-EAST-1 region initially, its TTL process had been “warmed” compared to the other regions.
Conclusions
Based on the results of my experiment, it appears that using DynamoDB TTL as a scheduling mechanism cannot guarantee reasonable precision.
On the one hand, the approach scales very well. On the other, the scheduled tasks are executed at least several minutes behind schedule, which renders it unsuitable for many use cases.
Translated from: https://www.freecodecamp.org/news/how-to-schedule-ad-hoc-tasks-with-dynamodb-ttl-and-lambda-421fa5778993/