Fallacies of Distributed Systems
Fallacies of distributed systems are a set of assertions made by L. Peter Deutsch and others at Sun Microsystems describing false assumptions that programmers new to distributed applications invariably make.
The mass adoption of microservices has forced more engineers to understand the implications of that decision within their systems.
I often see these 8 fallacies ignored or downplayed when discussing system design.
Fallacies of Distributed Systems Infographic
I thought it might be fun to cover them and their potential mitigations.
What is a microservice?
Microservices, also known as the microservice architecture, is an architectural style that structures an application as a collection of services that are:
*Highly maintainable and testable
*Loosely coupled
*Independently deployable
*Organized around business capabilities
*Owned by a small team
The microservice architecture enables the rapid, frequent and reliable delivery of large, complex applications. It also enables an organization to evolve its technology stack.
The network is reliable
To build a reliable system, you have to understand and come to terms with the fact that any particular communication can fail. Therefore, we need to provide a way for systems to deal with this potential miscommunication. Ultimately, this comes down to retransmission, which can come in many forms.
One such pattern is the store and forward pattern. Instead of sending the data directly to the downstream server, we first store it locally or elsewhere, then forward it and keep the stored copy until delivery succeeds. This also allows for recovery in catastrophic scenarios, a guarantee that a simple retry loop would lack.
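As a rough illustration of store and forward, here is a minimal sketch in Python. The `Outbox` class, its table name, and the `send` callback are all hypothetical names chosen for this example, and SQLite stands in for whatever durable store you would actually use. The point it tries to show is the ordering: persist first, forward second, delete only after the downstream call succeeds.

```python
import sqlite3
from typing import Callable


class Outbox:
    """Persist messages first, then forward them; delete only after success."""

    def __init__(self, path: str = ":memory:") -> None:
        # Assumption: pass a file path instead of ":memory:" if the messages
        # need to survive a process restart.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def store(self, body: str) -> None:
        """Record the message locally before any network call is attempted."""
        self.db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
        self.db.commit()

    def pending(self) -> int:
        """Number of messages still waiting to be forwarded."""
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def forward(self, send: Callable[[str], None]) -> int:
        """Try to deliver stored messages in order; return how many were sent."""
        sent = 0
        rows = self.db.execute("SELECT id, body FROM outbox ORDER BY id").fetchall()
        for msg_id, body in rows:
            try:
                send(body)
            except ConnectionError:
                break  # downstream unavailable: keep the rest for a later attempt
            self.db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
            self.db.commit()
            sent += 1
        return sent
```

Calling `forward` again later (on a timer, for instance) retries whatever is still pending. Note that a crash between `send` and the `DELETE` would resend the message, so this sketch gives at-least-once delivery and the receiver should tolerate duplicates.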
There are multitudes of technologies that fit this pattern: RabbitMQ, ActiveMQ, and various proprietary solutions from your favourite cloud vendor.
Latency is zero
Pictured on the left is the time to access memory in a modern system; on the right is the time it takes to make a round trip across the world.
I like to think about latency as strictly overhead to get any request done. A message can be large or small, and the latency is unchanged. Unlike bandwidth, latency usually has to do with the speed of light and the communication distance (or path). So the distance between the two systems plays a significant role here.
Latency is omnipresent. It occurs in all communication.
Ideally, this overhead should be as small as possible. Latency is very similar to unloading groceries from the car. The time it takes you to travel from the kitchen to the car is latency.
Do you want to grab as much as you can in one trip, or do you want to bring the items individually, taking several hundred round trips to unload the car?
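To put rough numbers on the grocery analogy, here is a small back-of-the-envelope model in Python. It is deliberately simplified: it assumes every round trip costs a fixed latency and that the payload moves at a constant bandwidth, and the function name and parameters are made up for this sketch.

```python
import math


def total_time(items: int, item_bytes: int, batch_size: int,
               rtt_s: float, bandwidth_bytes_per_s: float) -> float:
    """Estimate transfer time as (round trips x latency) + (bytes / bandwidth)."""
    trips = math.ceil(items / batch_size)          # how many times we walk to the car
    latency_cost = trips * rtt_s                   # paid once per trip, regardless of load
    transfer_cost = (items * item_bytes) / bandwidth_bytes_per_s
    return latency_cost + transfer_cost


# 300 items of 1,000 bytes each, 100 ms round trip, 1 MB/s link (illustrative values)
one_by_one = total_time(300, 1_000, batch_size=1, rtt_s=0.1, bandwidth_bytes_per_s=1_000_000)
all_at_once = total_time(300, 1_000, batch_size=300, rtt_s=0.1, bandwidth_bytes_per_s=1_000_000)
```

With these illustrative values, carrying items one at a time takes roughly 30 seconds, almost all of it latency, while a single batched trip takes well under a second. The bytes moved are identical in both cases; only the number of round trips changes.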
What is a CDN?
Content delivery networks and edge computing are essentially trying to make the distance between the fridge and the trunk as short as possible. By duplicating the data closer to where it is needed, we significantly reduce latency.
Bandwidth is infinite
Assuming that you can continue to increase the data size on a channel without limit can be quite the mistake. This problem only rears its head when scale difficulties enter the conversation and specific communication channels hit their limits.
I first ran into this problem when I accidentally increased the payload that my homepage needed to function by a factor of 10. This specific API was an uncached call for 3 MB on every page load, which also included a round trip to the database for the entire payload.
We quickly hit several bandwidth limits in our system, which brought the site down.
Now you may be thinking, "You just told me to take as much as I could on each round trip to reduce the effects of latency." That is true, but it does have its limits. This depends highly on your system’s design and respective priorities, but being aware of the trade-off is critically important.
The network is secure
Assuming you can trust the network you are on, or the people you are building your system for, can be a crucial mistake.
Nowadays, this has become even more apparent with the advent of crowdsourced bug bounty programs and significant exploits in the news every day.
Taking a security-first stance when designing your systems will pay dividends in the future. Even taking the time to assess your current system for security vulnerabilities can be a great place to start and will quickly produce a short list for improvement.
Topology doesn’t change
Network structure won’t always be the same. For example, if a critical piece of infrastructure goes down, can the traffic continue to flow to the appropriate destinations? Do we have a single point of failure?
With the advent of Docker and Kubernetes, changing network topology has become so easy that we take it for granted, almost dangerously so.
Tools like Zookeeper and Consul really help resolve problems around service discovery and allow applications to react to changes in the layout and makeup of our systems.
Building systems that can react to these changes in topology can be tricky, but ultimately results in much more resilient systems.
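The idea behind service discovery can be sketched in a few lines of Python. Everything here is hypothetical: `Registry` is an in-memory stand-in written for this example, not the API of Zookeeper or Consul, and `call_service` is an invented helper. What it tries to show is that the caller resolves addresses at call time instead of hard-coding them, so it can keep working when instances come and go.

```python
from typing import Callable, Dict, List, Optional


class Registry:
    """Hypothetical in-memory stand-in for a service registry."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, []).remove(address)

    def lookup(self, service: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry's state.
        return list(self._instances.get(service, []))


def call_service(registry: Registry, service: str,
                 request: Callable[[str], str]) -> str:
    """Resolve addresses at call time and try each one until an instance answers."""
    last_error: Optional[Exception] = None
    for address in registry.lookup(service):
        try:
            return request(address)
        except ConnectionError as exc:
            last_error = exc  # this instance is unreachable, try the next one
    raise ConnectionError(f"no reachable instance of {service!r}") from last_error
```

Because the lookup happens on every call, an instance that is deregistered or replaced is picked up without redeploying the caller. A real registry would add health checks and change notifications on top of this.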
There is one administrator
This one took me some time to grasp. Essentially, it says that you can’t control everything.
As your systems grow, they will rely on other systems outside your control. So take a second to think about all the dependencies you have, everything from your code down to the servers you run it on.
It’s essential to have a clear way of managing your systems and their respective configurations. As the number of systems with various configurations increases, they become hard to manage and track. Infrastructure as Code (IaC) can help codify those variations in your systems.
Having a good way of diagnosing issues when they come up is just as important; monitoring and observability are critical tools that can save you hours.
Appropriate decoupling can also help ensure overall system resiliency and uptime.
What is Infrastructure as Code (IaC)?
Infrastructure as Code (IaC) is the managing and provisioning of infrastructure through code instead of through manual processes.
With IaC, you create configuration files that contain your infrastructure specifications, which makes it easier to edit and distribute configurations. It also ensures that you provision the same environment every time. By codifying and documenting your configuration specifications, IaC aids configuration management and helps you avoid undocumented, ad hoc configuration changes.
Transport cost is zero
We often think that the resources we use to send data between systems are a simple business cost. When things are small, this overhead and cost can be negligible.
Still, as systems grow, that cost may be worth optimizing. Message formats like JSON can be a bit heavy compared to transfer-optimized formats like MessagePack or the Protocol Buffers encoding that gRPC uses by default.
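To get a feel for the size difference, here is a small Python comparison. It uses the standard library `struct` module as a stand-in for a compact binary encoding; it is not MessagePack or Protocol Buffers, and the `reading` record and its field names are invented for this example.

```python
import json
import struct
from typing import Dict, Union

Number = Union[int, float]

# Hypothetical sensor reading used only for this comparison.
reading: Dict[str, Number] = {
    "sensor_id": 1234,
    "timestamp": 1700000000,
    "temperature": 21.5,
}

# "<IId": little-endian, two unsigned 32-bit ints and one 64-bit float (4 + 4 + 8 bytes)
BINARY_LAYOUT = "<IId"


def encode_json(record: Dict[str, Number]) -> bytes:
    """Text encoding: field names travel with every single message."""
    return json.dumps(record).encode("utf-8")


def encode_binary(record: Dict[str, Number]) -> bytes:
    """Fixed layout: both sides agree on field order, so no names are sent."""
    return struct.pack(BINARY_LAYOUT, record["sensor_id"],
                       record["timestamp"], record["temperature"])


def decode_binary(payload: bytes) -> Dict[str, Number]:
    sensor_id, timestamp, temperature = struct.unpack(BINARY_LAYOUT, payload)
    return {"sensor_id": sensor_id, "timestamp": timestamp, "temperature": temperature}


json_size = len(encode_json(reading))
binary_size = len(encode_binary(reading))  # 16 bytes for this layout
```

For this record the JSON payload is several times larger than the 16-byte packed form, mostly because the field names are repeated in every message. The binary form pays for that saving with a schema both sides must share and a payload that is no longer human-readable.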
Being aware of such costs is essential; however, optimizing them has its trade-offs. Doing so early may create more headache than it’s worth in the near term.
The network is homogeneous
I have written my fair share of shims in my day, taking one format of data and transforming it into another.
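For readers who have not written one, a shim can be as small as the Python sketch below. Both formats are made up for illustration: a pipe-delimited legacy record on one side and a dictionary with typed fields on the other.

```python
from typing import Dict, Union


def legacy_to_modern(line: str) -> Dict[str, Union[int, float, str]]:
    """Shim: translate a hypothetical 'id|name|cents' record into a typed dict."""
    user_id, name, cents = line.strip().split("|")
    return {
        "id": int(user_id),
        "name": name,
        "balance": int(cents) / 100,  # the legacy side stores money as integer cents
    }
```

For example, `legacy_to_modern("42|Ada|1250")` yields a record with `id` 42, `name` "Ada", and a `balance` of 12.5. Keeping the translation in one small function means neither system has to know about the other's format.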
We like everything to be clean and tidy, but the real world is far from it. Being interoperable is essential.
This flexibility ensures our systems continue to function when the “new hot framework” comes into play or when you need to run your new system in environments it wasn’t intended for. (Obviously, interoperability has its limits.)
Knowing that not all systems are the same, and not coupling your solution to one aspect, can save you time and headaches down the road.
References
The Eight Fallacies of Distributed Systems
Finally
Interested readers can also read the English original, “Fallacies of Distributed Systems”.