How to Not Deploy Keras / TensorFlow Models

While most articles about deep learning focus on the modeling part, only a few cover how to deploy such models to production. Some of them say “production”, but they often simply take the un-optimized model and embed it into a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.

The “production” approach

If you search for how to deploy TensorFlow, Keras or PyTorch models into production, there are a lot of good tutorials, but sometimes you come across very simple examples that claim to be production-ready. These examples often use the raw Keras model, wrap it in a Flask web server and containerize it into a Docker container. They use Python to serve predictions. The code for these “production” Flask web servers looks like this:

from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
# The model is loaded once, when the module is imported.
model = keras.models.load_model("model.h5")

@app.route("/", methods=["POST"])
def index():
    data = request.json
    # preprocess() is assumed to turn the request payload into model input.
    prediction = model.predict(preprocess(data))
    return jsonify({"prediction": str(prediction)})

Furthermore, they often show how to containerize the Flask server and bundle it with your model into a Docker image. These approaches also claim that they can easily scale by increasing the number of Docker instances.

Now let us recap what happens here and why it is not “production” grade.

Not optimizing models

First, the model is usually used as it is, which means the Keras model from the example was simply exported by model.save(). The model includes all the parameters and gradients that are necessary to train it, but not required for inference. Also, the model is neither pruned nor quantized. As a result, un-optimized models have higher latency, need more compute and are larger in file size.
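
As a rough illustration, here is a minimal sketch of how that training state ends up in the exported file. The EfficientNet-B5 constructor is only a stand-in for whatever trained model you have, and include_optimizer is a standard Keras save option:

import tensorflow as tf

# Stand-in for an already trained Keras model (hypothetical example).
model = tf.keras.applications.EfficientNetB5(weights=None)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# Typical tutorial export: also serializes the optimizer and its state.
model.save("model.h5")

# For inference you do not need that training state, so drop it on export;
# pruning and quantization (see below) reduce the size further.
model.save("model_inference.h5", include_optimizer=False)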

Example with EfficientNet-B5:

  • h5 Keras model: 454 MByte
  • Optimized TensorFlow model (no quantization): 222 MByte

Using Flask and the Python API

The next problem is that plain Python and Flask are used to load the model and serve predictions. There are a lot of problems with this.

First, let’s look at the worst thing you can do: loading the model for each request. In the code example above, the model is loaded when the script starts, but in other tutorials this part is moved into the predict function. That loads the model every single time you make a prediction. Please do not do that.
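
To make the contrast explicit, here is a hedged sketch of both variants; the file name and the function names are placeholders:

from tensorflow import keras

# Anti-pattern: the model is read from disk again on every single request.
def predict_slow(data):
    model = keras.models.load_model("model.h5")
    return model.predict(data)

# Better: load once at startup, then reuse the same instance for every request.
model = keras.models.load_model("model.h5")

def predict_fast(data):
    return model.predict(data)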

That being said, let’s look at Flask. Flask includes a powerful and easy-to-use web server for development. On the official website, you can read the following:

While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.

That said, you can run Flask as a WSGI app, e.g. on Google App Engine. However, many tutorials do not use Google App Engine or NGINX; they just use Flask as it is and put it into a Docker container. And even when they do use NGINX or another web server, they usually turn off multi-threading completely.

Let’s look a bit deeper into the problem here. If you use TensorFlow, it handles compute resources (CPU, GPU) for you. If you load a model and call predict, TensorFlow uses those compute resources to make the predictions. While this happens, the resource is in use, i.e. locked. When your web server only serves a single request at a time, you are fine, as the model was loaded in this thread and predict is called from this thread. But once you allow more than one request at a time, your web server stops working, because you simply cannot access a TensorFlow model from different threads. That being said, in this setup you cannot process more than one request at once. Doesn’t really sound scalable, right?
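
To illustrate the effect, here is a sketch in which the serialization is made explicit with a lock; whether the lock is yours or hidden inside the framework, concurrent requests end up waiting in line:

import threading
from tensorflow import keras

model = keras.models.load_model("model.h5")
model_lock = threading.Lock()

def predict(data):
    # Only one prediction can run at a time; additional requests
    # simply queue up here instead of running in parallel.
    with model_lock:
        return model.predict(data)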

Example:

  • Flask development web server: 1 simultaneous request
  • TensorFlowX model server: parallelism configurable

Scaling “low-load” instances with Docker

Ok, the web server does not scale, but what about scaling the number of web servers? In a lot of examples this approach is the solution to the scaling problem of single instances. There is not much to say about it; it works, sure. But scaling this way wastes money, resources and energy. It’s like having a truck and putting a single parcel in it, and once there are more parcels, getting another truck instead of using the existing truck smarter.

Example latency:

  • Flask serving as shown above: ~2s per image
  • TensorFlow model server (no batching, no GPU): ~250ms per image
  • TensorFlow model server (no batching, GPU): ~100ms per image

Not using GPUs/TPUs

GPUs made deep learning possible, as they can run operations massively in parallel. When Docker containers are used to deploy deep learning models to production, most examples do NOT utilize GPUs; they don’t even use GPU instances. The prediction time for each request is orders of magnitude slower on CPU machines, so latency is a big problem. Even with powerful CPU instances you will not achieve results comparable to the small GPU instances.

Just a side note: in general it is possible to use GPUs in Docker, if the host has the correct driver installed. Docker is completely fine for scaling up instances, but scale up the correct instances.

Example costs:

  • 2 CPU instances (16 cores, 32 GByte, a1.4xlarge): $0.816/h
  • 1 GPU instance (32 GByte RAM, 4 cores, Tesla M60, g3s.xlarge): $0.75/h

It’s already solved

As you can see, loading a trained model and putting it into a Flask Docker container is not an elegant solution. If you want deep learning in production, start from the model, then think about servers, and finally about scaling instances.

Optimize the model

Unfortunately, optimizing a model for inference is not as straightforward as it should be. However, it can easily reduce inference time by multiples, so it is worth it without a doubt. The first step is freezing the weights and removing all the training overhead. This can be achieved with TensorFlow directly, but requires you to convert your model into either an estimator or into a TensorFlow graph (SavedModel format) if you are coming from a Keras model. TensorFlow itself has a tutorial for this. To further optimize, the next step is to apply model pruning and quantization, where insignificant weights are removed and the model size is reduced.
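
Here is a minimal sketch of these steps with standard TensorFlow 2 APIs; the paths and model names are placeholders, and pruning (via the tensorflow_model_optimization toolkit) is left out:

import tensorflow as tf

# Placeholder: the trained Keras model from the earlier example.
model = tf.keras.models.load_model("model.h5")

# Export as a SavedModel, the format expected by the TensorFlow model server.
# The version number in the path lets the server manage model versions.
tf.saved_model.save(model, "export/efficientnet/1")

# Optional post-training quantization via the TFLite converter,
# which shrinks the model further (mostly interesting for CPU/edge targets).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
with open("model_quantized.tflite", "wb") as f:
    f.write(converter.convert())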

Use model servers

When you have an optimized model, you can look at different model servers, which are meant for deep learning models in production. For TensorFlow and Keras, TensorFlowX offers the TensorFlow model server. There are also others like TensorRT, Clipper, MLFlow and DeepDetect.

The TensorFlow model server offers several features. It serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models, without downtime when deploying a new version, while still being able to use the old version. Besides the gRPC API, it also has an optional REST API endpoint. The throughput is orders of magnitude higher than with a Flask API, as the server is written in C++ and uses multi-threading. Additionally, you can even enable batching, where the server combines multiple single predictions into one batch for very high-load settings. And finally, you can put it into a Docker container and scale even more.
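
For illustration, here is a minimal client against the optional REST endpoint of a running tensorflow_model_server, using the requests library; the model name, host and input shape are assumptions for this sketch, and 8501 is the documented default REST port:

import json
import numpy as np
import requests

# Assumption: a tensorflow_model_server instance is serving the SavedModel
# exported above under the name "efficientnet".
url = "http://localhost:8501/v1/models/efficientnet:predict"

batch = np.random.rand(1, 456, 456, 3).tolist()  # placeholder input batch
response = requests.post(url, data=json.dumps({"instances": batch}))
predictions = response.json()["predictions"]
print(predictions)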

Hint: tensorflow_model_server is available on every AWS EC2 Deep Learning AMI image; with TensorFlow 2 it is called tensorflow2_model_server.

Use GPU instances

And lastly, I would recommend using GPUs or TPUs for inference environments. Latency is much lower and throughput much higher with such accelerators, while saving energy and money. Note that they only help if your software stack can actually utilize the power of GPUs (optimized model + model server). In AWS you can look into Elastic Inference or just use a GPU instance with a Tesla M60 (g3s.xlarge).

Originally posted on digital-thnking.de

Article on Towards Data Science: https://towardsdatascience.com/how-to-not-deploy-keras-tensorflow-models-4fa60b487682
