How to Build a Rock-Solid Application

An Overview of Different App Design Options

When we design software, we constantly think about error cases. Errors have a huge impact on the way we design and architect a solution, so much so that there is a philosophy known as Let It Crash.

Let It Crash is the Erlang way of treating failures: let the failing process crash and allow a supervisor to restart it from a clean state.

Errors can occur anywhere, and the more your application grows, the more points of failure you need to keep under control. External service calls, email delivery, and database queries are all operations that can fail.

Kinds of Failures

A failure can have different origins, which lead to different impacts on your service availability. Think of a scenario where we’re running too many SQL queries and the database server starts throttling the application. In that case, we could retry the query, or add a catch block to identify the failing queries and provide a sensible response to the user.

These kinds of errors are called Transient Errors: the database server is temporarily overloaded, but it will recover soon.

Transient errors are not caused by a problem in the application itself. They usually come from external conditions such as network failures, overloaded servers, or service rate limits. For that reason, it’s safe for a client to ignore such an error and retry the failed operation after a while.

These errors are much more frequent within cloud native applications, because the apps are split into different services and deployed on different servers that communicate over the network.

Identifying Transient Errors

Transient errors can usually be detected automatically. We can recognise them by inspecting the transport layer metadata (for example HTTP errors, network errors, timeouts) or by checking whether they are explicitly marked as transient (such as rate limits).
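
As a rough illustration of that automatic detection, the sketch below classifies a failed HTTP call as transient or not. The function name and the list of status codes are assumptions made for this example; the right list depends on the services you call.

```php
<?php
// Hypothetical helper: the status list is an assumption, tune it per service.
// 408 Request Timeout, 429 Too Many Requests (rate limit),
// 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout.
const TRANSIENT_HTTP_STATUSES = [408, 429, 502, 503, 504];

function isTransientFailure(?int $httpStatus, bool $timedOut = false): bool
{
    // A timeout, or no HTTP status at all, points at the network
    // rather than at a bug in the request.
    if ($timedOut || $httpStatus === null) {
        return true;
    }

    return in_array($httpStatus, TRANSIENT_HTTP_STATUSES, true);
}
```

A caller could retry only when this returns true and surface every other error immediately, since retrying a 404 or a validation error will not change the outcome.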

Treating the Errors

There are different actions we can take in case of an error. One trivial approach is to just retry the request, API call, or query.
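
A minimal sketch of this trivial approach could look like the following. The retry() helper and its $maxAttempts limit are assumptions added so that the example always terminates; a literal "just retry" would loop until the call succeeds.

```php
<?php
/**
 * Naive "just retry": call $operation again immediately after each failure.
 * $maxAttempts is an assumption of this sketch, not part of the original idea.
 */
function retry(callable $operation, int $maxAttempts = 3)
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1');
    }

    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $operation();
        } catch (\Exception $e) {
            $lastError = $e; // remember the failure and try again right away
        }
    }

    // Every attempt failed: hand the last error back to the caller.
    throw $lastError;
}
```

Note that there is no pause between attempts, so every failure immediately produces another request.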

Even though this solution might be fine in many cases, there are a lot of cases where it can degrade the app’s performance.

Let’s take the case of a network failure. Indefinitely retrying API calls to a disconnected service results in continuous network timeouts, and the application is stuck waiting for a response for a very long time.

Before going ahead with complex implementations, let’s evaluate the pros and cons of the “just-retry” option.

PROS

Trivial implementation.
Stateless (every retry request is isolated and you don’t need any extra information).

CONS

For heavily loaded applications, the caller will continuously send requests to the degraded server, which can result in a denial of service.
Cannot provide a response until the server comes back.

This simple retry strategy can be considered as a very first approach to solving the issue. For low traffic apps it would work, but if you have a more complex architecture, it’s definitely not enough.

So let’s discuss a more resilient approach.

Stealing an Idea from the IEEE

The next stop on your journey to a reliable application is to avoid wasted time and make the application more responsive. The exponential backoff algorithm could be the right tool for the job.

The concept of exponential backoff comes directly from the Ethernet network protocol (IEEE 802.3), where it’s used for packet collision resolution.

For our purposes, exponential backoff can be used to avoid wasting time between timed-out calls and to avoid hammering an overloaded server with a continual flow of requests that cannot be served.

Binary exponential backoff for packet collisions can be summarized with the following definition:

After c collisions, a random number of slot times between 0 and 2^c - 1 is chosen. For the first collision, each sender will wait 0 or 1 slot times. After the second collision, the senders will wait anywhere from 0 to 3 slot times. After the third collision, the senders will wait anywhere from 0 to 7 slot times (inclusive), and so forth. As the number of retransmission attempts increases, the number of possibilities for delay increases exponentially. (Exponential backoff, Wikipedia)
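
The definition above can be sketched in a few lines of PHP. The function name backoffSlots() and the $maxExponent cap (10 here) are assumptions of this sketch, added so the window cannot grow without bound.

```php
<?php
/**
 * Number of slot times to wait after $collisions collisions:
 * a uniformly random integer between 0 and 2^c - 1 (inclusive).
 * $maxExponent is an assumption of this sketch that bounds the window.
 */
function backoffSlots(int $collisions, int $maxExponent = 10): int
{
    if ($collisions < 1) {
        throw new InvalidArgumentException('collisions must be at least 1');
    }

    $exponent = min($collisions, $maxExponent);

    // 1 << $exponent is 2^exponent; random_int() includes both bounds.
    return random_int(0, (1 << $exponent) - 1);
}
```

Because the wait is random, two senders that collided are unlikely to pick the same slot again, and the growing window spreads them further apart after each new collision.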

This algorithm can be quickly adapted to many use cases. The following example is a PHP message handler class that waits exponentially longer between retries for a response from an API endpoint.

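<?php
/**
 * Assume that we're using a message bus which is able to
 * retry failed messages with a custom retry delay.
 * Message, Result, and TimeoutException are assumed to come from that bus.
 */
class FetchCarMessageHandler
{
    /** API client exposing get(); assumed to be injected by the application. */
    private $client;

    public function __construct($client)
    {
        $this->client = $client;
    }

    public function handle(Message $msg)
    {
        try {
            $id = (int) $msg->getContent();
            $cars = $this->client->get('/car/' . $id);

            return Result::success($cars);
        } catch (TimeoutException $e) {
            // Fall back to 1 in case no backoff has been recorded yet,
            // so that doubling never stays at zero.
            $lastBackoff = max(1, (int) $msg->getLastBackoff());

            // The infrastructure layer will automagically retry the message after XYZ seconds
            return Result::retryAfter($lastBackoff * 2, $msg);
        }
    }
}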
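
When there is no message bus to schedule the retries, the same doubling idea can run inside the caller. The following is a minimal sketch under that assumption: retryWithBackoff(), its parameters, and the default values are all hypothetical, and the $sleep callback is injectable only so the example can be exercised without real waiting.

```php
<?php
/**
 * Retries $operation, doubling the delay after each failure up to $maxDelayMs.
 * All names and defaults here are assumptions made for this sketch.
 */
function retryWithBackoff(
    callable $operation,
    int $maxAttempts = 5,
    int $baseDelayMs = 100,
    int $maxDelayMs = 5000,
    ?callable $sleep = null
) {
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1');
    }

    // Default to a real pause; usleep() takes microseconds.
    $sleep = $sleep ?? static function (int $ms): void {
        usleep($ms * 1000);
    };

    $delayMs = $baseDelayMs;
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (\Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // give up and let the caller decide what to do
            }
            $sleep($delayMs);
            $delayMs = min($delayMs * 2, $maxDelayMs); // 100, 200, 400, ...
        }
    }
}
```

Capping the delay keeps a long outage from pushing the wait to unreasonable values, and the attempt limit means the caller eventually gets an error instead of waiting forever.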

Retry vs Exponential Backoff

The previous two strategies are both sub-optimal. They can eventually produce a response to give back to the client, but they rely on continuously calling the external service until a successful response is received.

We may be lucky and receive a response after a couple of retries, or we could fall into an infinite retry-wait-retry-wait loop and never receive the response. You know, Murphy’s law is always here: “Anything that can go wrong will go wrong.”

As you might imagine, a service-oriented infrastructure that continuously retries requests to its dependent services whenever one of them fails is, at scale, the perfect recipe for application collapse.

We need a stronger strategy to maintain infrastructure resilience.

Electronics May Help Us

In case of continuous errors, the easy thing to do is clear: we do not want to loop and keep retrying calls to an external service. We’ll simply stop calling it, borrowing the concept of Circuit Breakers from electronics.

From Electronics to Computer Science

A circuit breaker is a component that wraps a protected call to an external service and can monitor the responses by checking the service health. Exactly like an electronic component, a software circuit breaker could be open or closed. An open status would mean that the service behind the circuit is down, and a closed status would mean that the service is up.

So the circuit breaker can autonomously monitor the service status and decide to open or close the circuit. In case of disconnection or server overload, the client stops sending new requests, and the degraded service can use its resources to come back to a healthy state.

In case of an open circuit, we can quickly answer the client with a fallback response: for example cached data, default data, or whatever makes sense for the particular application.

Let’s see a real example from the e-commerce world. We’re going to use the circuit breaker method to protect the product listing API call.

<?php
class CircuitBreaker
{
    private $maxFailures;
    private $service;
    private $redisClient;

    public function __construct(int $maxFailures, callable $service)
    {
        $this->maxFailures = $maxFailures;
        $this->service = $service;
        // RedisClient is assumed to be provided by the application (get/incr/expire).
        $this->redisClient = new RedisClient();
    }

    // The circuit stays closed (service considered up) while failures are below the limit.
    private function isUp(string $key)
    {
        return (int) $this->redisClient->get($key) < $this->maxFailures;
    }

    // Count one failure and let the counter expire after $ttl seconds.
    private function fail(string $key, int $ttl)
    {
        $this->redisClient->incr($key, 1);
        $this->redisClient->expire($key, $ttl);
    }

    public function __invoke()
    {
        [$arguments, $defaultResponse] = func_get_args();
        // md5() needs a string, so serialize the argument list first.
        $key = md5(serialize($arguments));

        if (!$this->isUp($key)) {
            return $defaultResponse;
        }

        try {
            return call_user_func_array($this->service, $arguments);
        } catch (\Throwable $e) {
            $this->fail($key, 10);

            return $defaultResponse;
        }
    }
}

The circuit breaker transparently handles all errors and returns the default response when an API call fails. It also lets us define a maximum number of failures, after which the service is no longer called, to avoid piling up failed calls.

In this case, protecting a third-party service API call is a very simple task: we just need to provide the callback and the maximum number of failures allowed. Once that limit is reached, the circuit breaker stays open for 10 seconds and the default response is given back to the client, as in the example below.

<?php
$productListing = new CircuitBreaker(10, function ($searchKey) {
    // $result is given from the API call; searchProducts() is a hypothetical helper
    $result = searchProducts($searchKey);

    return $result;
});
$productsToShow = $productListing(['t-shirt'], []);
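
To experiment with the pattern without a Redis server, the failure counter can live in memory. The class below is a single-process sketch, not a replacement for the shared Redis counter: InMemoryCircuitBreaker, its call() method, and the injectable $clock are all names invented for this example.

```php
<?php
/**
 * In-memory circuit breaker sketch (single process, no Redis).
 * $clock is injectable so time can be controlled when experimenting.
 */
final class InMemoryCircuitBreaker
{
    private $service;
    private $clock;
    private int $maxFailures;
    private float $coolDown;
    private int $failures = 0;
    private float $lastFailureAt = 0.0;

    public function __construct(callable $service, int $maxFailures, float $coolDown, ?callable $clock = null)
    {
        $this->service = $service;
        $this->maxFailures = $maxFailures;
        $this->coolDown = $coolDown;
        $this->clock = $clock ?? static function (): float {
            return microtime(true);
        };
    }

    public function isOpen(): bool
    {
        // Forget old failures once the cool-down has elapsed,
        // similar to the expiring Redis counter.
        if ($this->failures > 0 && ($this->clock)() - $this->lastFailureAt >= $this->coolDown) {
            $this->failures = 0;
        }

        return $this->failures >= $this->maxFailures;
    }

    public function call(array $arguments, $fallback)
    {
        if ($this->isOpen()) {
            return $fallback; // short-circuit: the service is not called at all
        }

        try {
            return ($this->service)(...$arguments);
        } catch (\Throwable $e) {
            $this->failures++;
            $this->lastFailureAt = ($this->clock)();

            return $fallback;
        }
    }
}
```

With a limit of two failures, the third call returns the fallback without touching the service, and the circuit closes again once the cool-down has passed. Because the counter is per process, several workers would each open their own circuit independently.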

Conclusion

Whether you’re designing an SOA, microservices, or a cloud native application, you should be ready to tackle failure cases in the right way. Errors and failures are in the same room from the day you launch your app.

Here are some references on the well-known tactics for building a truly rock-solid app:

https://docs.microsoft.com/en-us/azure/architecture/patterns/retry
https://en.m.wikipedia.org/wiki/Exponential_backoff
https://martinfowler.com/bliki/CircuitBreaker.html