While most articles about deep learning focus on the modeling part, there are only a few about how to deploy such models to production. Some of them say “production”, but they often simply take the un-optimized model and embed it into a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.
The “production” approach
If you search for how to deploy TensorFlow, Keras or PyTorch models into production, there are a lot of good tutorials, but sometimes you come across very simple examples claiming to be production ready. These examples often use the raw Keras model, a Flask web server, and containerize everything into a Docker container. They use Python to serve predictions. The code for these “production” Flask web servers looks like this:
from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")  # loaded once at startup

@app.route("/", methods=["POST"])
def index():
    data = request.json
    # preprocess() turns the JSON payload into a model-ready input (defined elsewhere)
    prediction = model.predict(preprocess(data))
    return jsonify({"prediction": str(prediction)})
Furthermore, they often show how to containerize the Flask server and bundle it with your model into a Docker image. These approaches also claim that they can easily scale by increasing the number of Docker instances.
Now let us recap what happens here and why it is not “production” grade.
Not optimizing models
First, the model is usually used as it is, which means the Keras model from the example was simply exported with model.save(). The model includes all the parameters and gradients that are necessary to train the model, but not required for inference. Also, the model is neither pruned nor quantized. As a result, un-optimized models have higher latency, need more compute and are larger in file size.
Example with EfficientNet-B5:
- h5 Keras model: 454 MByte
- Optimized TensorFlow model (no quantization): 222 MByte
Using Flask and the Python API
The next problem is that plain Python and Flask are used to load the model and serve predictions. There are a lot of problems here.
First let’s look at the worst thing you can do: loading the model for each request. In the code example above, the model is loaded when the script starts, but in other tutorials this part is moved into the predict function. That means the model is loaded every single time you make a prediction. Please do not do that.
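To make the anti-pattern concrete, here is a sketch of what that looks like, continuing the Flask snippet above (it reuses the same app, keras and hypothetical preprocess() names):

# Anti-pattern sketch: do NOT do this. Reloading the model adds disk I/O and
# graph construction to every single request.
@app.route("/slow", methods=["POST"])
def slow_index():
    model = keras.models.load_model("model.h5")  # reloaded on every request
    prediction = model.predict(preprocess(request.json))
    return jsonify({"prediction": str(prediction)})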
That being said, let’s look at Flask. Flask includes an easy-to-use web server for development. On the official website, you can read the following:
While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.
That said, you can run Flask as a WSGI app in, for example, Google App Engine. However, many tutorials do not use Google App Engine or NGINX; they just use Flask as it is and put it into a Docker container. And even when they do use NGINX or another web server, they usually turn off multi-threading completely.
Let’s look a bit deeper into the problem here. If you use TensorFlow, it handles the compute resources (CPU, GPU) for you. If you load a model and call predict, TensorFlow uses the compute resources to make the prediction. While this happens, the resource is in use, i.e. locked. As long as your web server serves only a single request at a time, you are fine, because the model was loaded in that thread and predict is called from that thread. But once you allow more than one request at a time, your web server stops working, because you simply cannot access a TensorFlow model from different threads. In other words, in this setup you cannot process more than one request at once. Doesn’t really sound scalable, right?
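If you do want to serve a Flask app properly, a production WSGI server such as Gunicorn is the usual route. A minimal configuration sketch, assuming the snippet above lives in app.py, could look like this (the worker and thread counts are illustrative, not a recommendation):

# gunicorn.conf.py -- sketch; every worker process loads its own copy of the model
workers = 2          # separate processes, each with its own TensorFlow runtime
threads = 1          # one thread per worker avoids concurrent predict() calls on the same model
bind = "0.0.0.0:8000"
timeout = 120        # CPU inference on large models can take a while

Started with something like gunicorn -c gunicorn.conf.py app:app, this removes the development-server limitation, but note that every extra worker duplicates the model in memory, which is exactly the kind of waste a dedicated model server avoids.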
Example:
- Flask development web server: 1 simultaneous request
- TensorFlow model server: parallelism configurable
Scaling “low-load” instances with Docker
OK, the web server does not scale, but what about scaling the number of web servers? In a lot of examples this approach is the solution to the scaling problem of single instances. There is not much to say about it; sure, it works. But scaling this way wastes money, resources and energy. It’s like having a truck, putting a single parcel in it, and once there are more parcels, getting another truck, instead of using the existing truck more cleverly.
Example latency:
- Flask serving as shown above: ~2 s per image
- TensorFlow model server (no batching, no GPU): ~250 ms per image
- TensorFlow model server (no batching, GPU): ~100 ms per image
Not using GPUs/TPUs
GPUs made deep learning possible, as they can run operations massively in parallel. When using Docker containers to deploy deep learning models to production, most examples do NOT utilize GPUs; they don’t even use GPU instances. The prediction time for each request is orders of magnitude slower on CPU machines, so latency is a big problem. Even with powerful CPU instances you will not achieve results comparable to small GPU instances.
Just a side note: in general it is possible to use GPUs in Docker if the host has the correct drivers installed. Docker is completely fine for scaling up instances; just make sure you scale up the correct instances.
Example costs:
- 2 CPU instances (16 cores, 32 GByte, a1.4xlarge): $0.816/h
- 1 GPU instance (32 GByte RAM, 4 cores, Tesla M60, g3s.xlarge): $0.75/h
It’s already solved
As you can see, loading a trained model and putting it into a Flask Docker container is not an elegant solution. If you want deep learning in production, start from the model, then think about servers, and finally about scaling instances.
Optimize the model
Unfortunately, optimizing a model for inference is not as straightforward as it should be. However, it can easily reduce inference time by multiples, so it’s worth it without a doubt. The first step is freezing the weights and removing all the training overhead. This can be achieved with TensorFlow directly, but requires you to convert your model into either an estimator or a TensorFlow graph (SavedModel format) if you come from a Keras model. TensorFlow itself has a tutorial for this. To further optimize, the next step is to apply model pruning and quantization, where insignificant weights are removed and the model size is reduced.
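As a rough illustration, here is a minimal sketch of that workflow with the TensorFlow 2 APIs; the file paths and model name are placeholders, and the quantization step uses the TFLite converter as one possible post-training option:

import tensorflow as tf

# Load the trained Keras model and export it as a SavedModel,
# dropping the optimizer state that is only needed for training.
model = tf.keras.models.load_model("model.h5")
model.save("export/efficientnet/1", include_optimizer=False, save_format="tf")

# Optional: post-training quantization via the TFLite converter.
converter = tf.lite.TFLiteConverter.from_saved_model("export/efficientnet/1")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("model_quantized.tflite", "wb") as f:
    f.write(converter.convert())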
Use model servers
When you have an optimized model, you can look at different model servers, which are meant for deep learning models in production. For TensorFlow and Keras, TFX (TensorFlow Extended) offers the TensorFlow model server, TensorFlow Serving. There are also others like TensorRT, Clipper, MLflow and DeepDetect.
The TensorFlow model server offers several features. It serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models, with no downtime when deploying a new version and while still being able to use the old version. It also has an optional REST API endpoint in addition to the gRPC API. The throughput is orders of magnitude higher than with a Flask API, as it is written in C++ and uses multi-threading. Additionally, you can enable batching, where the server combines multiple single predictions into one batch for very high-load settings. And finally, you can put it into a Docker container and scale even further.
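To give an idea of the REST endpoint, here is a small client sketch; it assumes a model exported as above is already being served under the name efficientnet on localhost:8501, and the dummy input (sized for EfficientNet-B5) stands in for real preprocessing:

import json

import numpy as np
import requests

# Dummy batch of one 456x456 RGB image; replace with real preprocessing.
batch = np.zeros((1, 456, 456, 3), dtype=np.float32)

payload = json.dumps({"signature_name": "serving_default",
                      "instances": batch.tolist()})
response = requests.post(
    "http://localhost:8501/v1/models/efficientnet:predict", data=payload)
predictions = response.json()["predictions"]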
Hint: tensorflow_model_server is available on every AWS EC2 Deep Learning AMI image; with TensorFlow 2 it’s called tensorflow2_model_server.
Use GPU instances
And lastly, I would recommend using GPUs or TPUs for inference environments. The latency is much lower and the throughput much higher with such accelerators, while saving energy and money. Note that an accelerator only helps if your software stack can actually use the power of the GPU (optimized model + model server). In AWS you can look into Elastic Inference or just use a GPU instance with a Tesla M60 (g3s.xlarge).
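A quick sanity check worth doing in the serving environment (just a small sketch) is to confirm that TensorFlow actually sees the GPU before assuming it is being used:

import tensorflow as tf

# Prints an empty list if the container or driver setup hides the GPU.
print(tf.config.list_physical_devices("GPU"))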
Originally posted on digital-thnking.de