How to Use the Magic of Pipelines

You have surely heard of pipelines or ETL (Extract, Transform, Load), seen a pipeline method in some library, or come across a tool for building pipelines. Yet you may not be using them. So let me introduce you to the fantastic world of pipelines.

Before learning how to use pipelines, we have to understand what a pipeline is.

A pipeline is a way to wrap and automate a process, so that the process always runs in the same way, with the same functions and parameters, and the outcome always follows a predetermined standard.
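To make the idea concrete, here is a minimal sketch in plain Python. The helper `compose_steps` and the three text-cleaning steps are hypothetical names chosen for illustration; they do not come from any library. The point is only that once the steps are wrapped together, every input goes through the same functions in the same order.

```python
from functools import reduce


def compose_steps(*steps):
    """Wrap a fixed sequence of functions into one callable (hypothetical helper)."""
    def run(value):
        # Feed the output of each step into the next one, in order.
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Three illustrative steps; any single-argument functions would do.
def strip_whitespace(text):
    return text.strip()


def lowercase(text):
    return text.lower()


def split_words(text):
    return text.split()


clean_text = compose_steps(strip_whitespace, lowercase, split_words)

print(clean_text("  Pipelines Are FUN  "))  # ['pipelines', 'are', 'fun']
```

Because the order and the functions are fixed inside `clean_text`, changing the process means changing it in one place only.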

So, as you may guess, the goal is to apply pipelines at every stage of development, to help guarantee that the process that actually runs never drifts from the one that was designed.

Two uses of pipelines are particularly important in data science, whether in production or during modelling and exploration, and both make our lives much easier.

The first is the data ETL. In production the ramifications are far greater, and so is the level of detail it demands, but it can be summed up as:

E (Extract): How am I going to collect the data? It may come from one or several sites, from one or more databases, or even from a simple CSV file read with pandas. We can think of this stage as the data reading phase.

T (Transform): What do I need to do for the data to become usable? This can be thought of as the conclusion of the exploratory data analysis: once we know what to do with the data (remove features, transform categorical variables into binary data, clean strings, etc.), we compile it all into a function that guarantees the cleaning is always done in the same way.

L (Load): This simply saves the data in the desired format (CSV, database, etc.) somewhere, either in the cloud or locally, so that it can be used anytime, anywhere.

Creating this process is simple enough that it can be done by taking the exploratory data analysis notebook and putting the pandas read_csv call inside a function, writing the several functions that prepare the data and combining them into one, and finally creating a function that saves the result of the previous one.

With this in place, we can create a main function in a Python file and run the whole ETL with one line of code, without risking accidental changes to the process. Not to mention the advantage of changing or updating everything in a single place.
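The three ETL stages and the one-line entry point could be sketched as follows. This is an illustration under assumptions, not the code of any particular project: the column names (`age`, `city`, `notes`), the file names, and the function names `extract`, `transform`, `load` and `run_etl` are all made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd


def extract(path):
    """E: read the raw data (here, a single CSV file)."""
    return pd.read_csv(path)


def transform(df):
    """T: apply the cleaning steps decided during exploration."""
    df = df.drop(columns=["notes"])                   # remove an unused feature
    df["city"] = df["city"].str.strip().str.lower()   # clean strings
    return pd.get_dummies(df, columns=["city"])       # categorical -> binary columns


def load(df, path):
    """L: save the cleaned data in the desired format."""
    df.to_csv(path, index=False)


def run_etl(source, target):
    """The whole ETL behind a single call."""
    load(transform(extract(source)), target)


# Small demonstration on a throwaway file with made-up rows.
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "raw.csv"
clean_path = workdir / "clean.csv"
raw_path.write_text("age,city,notes\n31, Lisbon ,a\n45,porto,b\n28,Lisbon,c\n")

run_etl(raw_path, clean_path)
result = pd.read_csv(clean_path)
print(result)
```

Any change to the cleaning rules now happens inside `transform` only, and every later run picks it up.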

The second, and likely the most advantageous, pipeline helps solve one of the most common problems in machine learning: parametrization.

How many times have we faced these questions: which model to choose? Should I use normalization or standardization?

Libraries such as scikit-learn offer the Pipeline class, in which we can chain pre-processing steps such as normalization, standardization, or even a custom transformer with a model, list the parameter values to try for each step, and wrap everything in a cross-validated grid search. All the combinations are then tested and the results returned, or only the best one, as in the following code:

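from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def build_model(X, y):
    # tokenize is a custom tokenizer function that must be defined elsewhere
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(estimator=RandomForestClassifier())),
    ])

    # specify parameters for grid search; uncomment entries to widen the search
    # (with every entry commented out, only the default settings are evaluated)
    parameters = {
        # 'vect__ngram_range': ((1, 1), (1, 2)),
        # 'vect__max_df': (0.5, 0.75, 1.0),
        # 'vect__max_features': (None, 5000, 10000),
        # 'tfidf__use_idf': (True, False),
        # 'clf__estimator__n_estimators': [50, 100, 150, 200],
        # 'clf__estimator__max_depth': [20, 50, 100, 200],
        # 'clf__estimator__random_state': [42],
    }

    # create grid search object
    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=1)
    return cv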
At this stage, the sky is the limit! There is no limit on the number of parameters inside the pipeline. However, depending on the dataset and the chosen parameters, the search can take an eternity to finish. Even so, it is a very good tool for narrowing down the search.
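The two questions raised earlier (which model, and normalization or standardization) can themselves be put into the grid. The sketch below is one way to do it, under assumptions: the synthetic dataset from `make_classification`, the two candidate models, and the parameter values are chosen only for illustration. It relies on the fact that a whole pipeline step can be listed as a grid parameter, next to the nested `step__parameter` names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data standing in for the output of the data ETL.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Placeholder steps; the grid below swaps them out.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

# One dictionary per model, so each model gets only its own parameters.
param_grid = [
    {
        "scaler": [MinMaxScaler(), StandardScaler()],  # normalization vs standardization
        "clf": [LogisticRegression(max_iter=1000)],
        "clf__C": [0.1, 1.0, 10.0],
    },
    {
        "scaler": [MinMaxScaler(), StandardScaler()],
        "clf": [RandomForestClassifier(random_state=42)],
        "clf__n_estimators": [10, 50],
    },
]

search = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)   # best scaler, model and parameter values found
print(search.best_score_)    # mean cross-validated score of that combination
```

This grid has 2 x 3 + 2 x 2 = 10 candidates, each fitted 3 times for cross-validation, which shows how quickly the running time grows as parameters are added.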

We can add one function to read the data that comes out of the data ETL and another to save the trained model, and we have a model ETL, wrapping up this stage.
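A model ETL of this shape might look like the sketch below. Everything specific in it is an assumption made for the example: the `label` column, the tiny made-up dataset, the choice of `LogisticRegression`, and saving with the standard-library `pickle` module (other serialization options exist).

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def load_data(path):
    """Read the output of the data ETL and split features from the target."""
    df = pd.read_csv(path)
    return df.drop(columns="label"), df["label"]


def build_model():
    """Pre-processing and model wrapped in one pipeline."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ])


def save_model(model, path):
    """Persist the fitted pipeline to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def main(data_path, model_path):
    """Model ETL: read data, train, save."""
    X, y = load_data(data_path)
    model = build_model().fit(X, y)
    save_model(model, model_path)
    return model


# Demonstration on a throwaway file with made-up rows.
workdir = Path(tempfile.mkdtemp())
data_path = workdir / "clean.csv"
model_path = workdir / "model.pkl"
data_path.write_text(
    "x1,x2,label\n0.1,1.0,0\n0.2,0.9,0\n0.3,1.1,0\n0.9,0.1,1\n1.0,0.2,1\n1.1,0.0,1\n"
)

model = main(data_path, model_path)

# Reload the saved pipeline and use it for prediction.
with open(model_path, "rb") as f:
    restored = pickle.load(f)
features, _ = load_data(data_path)
print(restored.predict(features))
```

Since the scaler is saved together with the classifier, the reloaded pipeline applies the same pre-processing at prediction time as it did during training.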

Beyond everything we have talked about, the greatest advantage of creating pipelines is that the reproducibility and maintainability of your code improve greatly.

So, what are you waiting for to start creating pipelines?

An example of these can be found in this project.

Translated from: https://towardsdatascience.com/how-to-use-the-magic-of-pipelines-6e98d7e5c9b7
