Building a monorepo for Data Science with Pantsbuild

At HousingAnywhere, one of the first major obstacles we faced when scaling the Data team was building a centralised repository to contain our ever-growing collection of machine learning applications. Many of these projects share dependencies with one another, which means code refactoring can become a pain and consume a lot of time. In addition, as we’re very opposed to Data Scientists’ tendency to copy/paste code, we needed a unified location where we can store reusable functions that can be easily accessed.

The perfect solution to our use case was building a monorepo. In this article, I’ll go through how a simple monorepo can be built using the build automation system Pantsbuild.

What is a monorepo?

A monorepo is a repository where code for many projects is stored together. Having a centralised repository for your team comes with a number of benefits:

  • Reusability: Allows projects to share functions; in the case of Data Science, code for preprocessing data, calculating metrics and even plotting graphs can be shared across projects.

  • Atomic changes: It only takes one operation to make changes across multiple projects.

  • Large-scale refactoring: Can be done easily and quickly, with confidence that projects will still work afterwards.

Monorepo, however, is not a one-size-fits-all solution, as it comes with a number of disadvantages:

  • Security issues: There are no means to expose only parts of the repository.

  • Big codebase: As the repo grows in size, it can cause problems as developers have to check out the entire repository.

At HousingAnywhere, our team of Data Scientists finds the monorepo to be the perfect solution for the Data team’s use cases. Many of our machine learning applications have smaller projects that spin off from them. The monorepo enables us to quickly integrate these new projects into the CI/CD pipeline, reducing the time spent setting up a pipeline individually for each new project.

We tried out a number of build automation systems, and the one we stuck with is Pantsbuild. Pants is one of the few such systems that supports Python natively, and it is an open-source project widely used by Twitter, Toolchain, Foursquare, Square, and Medium.

Recently, Pants was updated to v2, which only supports Python at the moment, but that isn’t too much of a limitation for Data Science projects.

Some basic concepts

There are a couple of concepts in Pants that you should understand beforehand:

  • Goals help users tell Pants what actions to take, e.g. test

  • Tasks are the Pants modules that run actions

  • Targets describe what files to take those actions upon. These targets are defined in a BUILD file

  • Target types define the types of operations that can be performed on a target e.g. you can perform tests on test targets

  • Addresses describe the location of a target in the repo (see the example command after this list)

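Putting goals and addresses together: a command like the one below (using a target we will define later in this article) invokes the test goal on the target at address utils:utils_test, i.e. the target named utils_test in utils/BUILD:

./pants test utils:utils_test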

For more information, I highly recommend reading this documentation where the developers of Pants have done an excellent job in explaining these concepts in detail.

An example repository

In this section, I’ll go through how you can easily set up a monorepo using Pants. First, make sure these requirements are met to install Pants:

  • Linux or macOS.

  • Python 3.6+ discoverable on your PATH.

  • Internet access (so that Pants can fully bootstrap itself).

Now, let’s set up a new repository:

mkdir monorepo-example
cd monorepo-example
git init

Alternatively, you can clone the example repo via:

git clone https://github.com/uiucanh/monorepo-example.git

Next, run these commands to download the setup file:

printf '[GLOBAL]\npants_version = "1.30.0"\nbackend_packages = []\n' > pants.toml
curl -L -o ./pants https://pantsbuild.github.io/setup/pants && \
chmod +x ./pants

Then, bootstrap Pants by running ./pants --version. You should receive 1.30.0 as output.

Let’s add a couple of simple apps to the repo. First, we’ll create a utils/data_gen.py and a utils/metrics.py that contain a couple of util functions:

# utils/data_gen.py
import numpy as np


def generate_linear_data(n_samples: int = 100, n_features: int = 1,
                         x_min: int = -5, x_max: int = 5,
                         m_min: int = -10, m_max: int = 10,
                         noise_strength: int = 1, seed: int = None,
                         bias: int = 10):
    # Set the random seed
    if seed is not None:
        np.random.seed(seed)

    X = np.random.uniform(x_min, x_max, size=(n_samples, n_features))
    m = np.random.uniform(m_min, m_max, size=n_features)
    y = np.dot(X, m).reshape((n_samples, 1))

    if bias != 0:
        y += bias

    # Add Gaussian noise
    y += np.random.normal(size=y.shape) * noise_strength

    return X, y


def split_dataset(X: np.ndarray, y: np.ndarray,
                  test_size: float = 0.2, seed: int = 0):
    # Set the random seed
    np.random.seed(seed)

    # Shuffle dataset
    indices = np.random.permutation(len(X))
    X = X[indices]
    y = y[indices]

    # Splitting
    X_split_point = int(len(X) * (1 - test_size))
    y_split_point = int(len(y) * (1 - test_size))

    X_train, X_test = X[:X_split_point], X[X_split_point:]
    y_train, y_test = y[:y_split_point], y[y_split_point:]

    return X_train, X_test, y_train, y_test
# utils/metrics.py
import numpy as np


def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def r2(y_test: np.ndarray, y_pred: np.ndarray):
    y_mean = np.mean(y_test)
    ss_tot = np.square(y_test - y_mean).sum()
    ss_res = np.square(y_test - y_pred).sum()
    result = 1 - ss_res / ss_tot
    return result
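
As a quick sanity check (my own snippet, not part of the example repo), these helpers can be exercised directly:

# Hypothetical smoke test for the utils above.
from utils.data_gen import generate_linear_data, split_dataset

X, y = generate_linear_data(n_samples=10, seed=42)
X_train, X_test, y_train, y_test = split_dataset(X, y, test_size=0.2)
print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)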

Now, we’ll add an application first_app/app.py that imports this code. The app takes the data from generate_linear_data, passes it to a Linear Regression model and outputs the Mean Absolute Percentage Error.

# first_app/app.py
import os
import sys

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.data_gen import generate_linear_data, split_dataset  # noqa
from utils.metrics import mean_absolute_percentage_error, r2  # noqa
from sklearn.linear_model import LinearRegression  # noqa


class Model:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.m = LinearRegression()
        self.y_pred = None

    def split(self, test_size=0.33, seed=0):
        self.X_train, self.X_test, self.y_train, self.y_test = split_dataset(
            self.X, self.y, test_size=test_size, seed=seed)

    def fit(self):
        self.m.fit(self.X_train, self.y_train)

    def predict(self):
        self.y_pred = self.m.predict(self.X_test)


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    print("MAPE:", mean_absolute_percentage_error(m.y_test, m.y_pred))


if __name__ == '__main__':
    main()

And another app, second_app/app.py, that uses the first app’s code:

# second_app/app.py
import sys
import os

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.metrics import r2  # noqa
from utils.data_gen import generate_linear_data, split_dataset  # noqa
from first_app.app import Model  # noqa


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    result = r2(m.y_test, m.y_pred)
    print("R2:", result)
    return result


if __name__ == '__main__':
    _ = main()

Then we add a couple of simple tests for these apps, for example:

# first_app/app_test.py
import numpy as np

from first_app.app import Model


def test_model_working():
    X, y = np.array([[1, 2, 3], [4, 5, 6]]), np.array([[1], [2]])
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    assert m.y_pred is not None

In each of these directories, we’ll need a BUILD file. These files contain information about your targets and their dependencies. In them, we’ll declare the requirements these projects need, as well as the test targets.

Let’s start from the root of the repository:

python_requirements()

This BUILD file contains a macro, python_requirements(), that creates multiple targets to pull third-party dependencies from a requirements.txt in the same directory. It saves us from having to declare each requirement manually, like this:

python_requirement_library(
    name="numpy",
    requirements=[
        python_requirement("numpy==1.19.1"),
    ],
)
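
For reference, the requirements.txt at the root of the repo might look like the following. The numpy pin comes from the target above and the pytest version from the test output further down; the scikit-learn pin is my assumption:

numpy==1.19.1
scikit-learn==0.23.2
pytest==5.3.5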

The BUILD file in utils would look like this:

# utils/BUILD
python_library(
    name = "utils",
    sources = [
        "data_gen.py",
        "metrics.py",
    ],
    dependencies = [
        # The `//` signals that the target is at the root of your project.
        "//:numpy"
    ]
)

python_tests(
    name = 'utils_test',
    sources = [
        "data_gen_test.py",
        "metrics_test.py",
    ],
    dependencies = [
        ":utils",
    ]
)

Here we have two targets. The first is a Python library that contains the Python code listed in sources, i.e. our two utility files. It also specifies the requirement needed to run this code: numpy, one of the third-party dependencies we defined in the root BUILD file.

The second target is the collection of tests we defined earlier; its dependency is the previous Python library. Running these tests is as simple as running ./pants test utils:utils_test or ./pants test utils:: from the root. The second colon tells Pants to run all the test targets in that BUILD file. The output should look like this:

============== test session starts ===============
platform darwin -- Python 3.7.5, pytest-5.3.5, py-1.9.0, pluggy-0.13.1
cachedir: .pants.d/test/pytest/.pytest_cache
rootdir: /Users/ducbui/Desktop/Projects/monorepo-example, inifile: /dev/null
plugins: cov-2.8.1, timeout-1.3.4
collected 3 items
utils/data_gen_test.py . [ 33%]
utils/metrics_test.py .. [100%]

Similarly, we’ll create two BUILD files, one for first_app and one for second_app:

# first_app/BUILD
python_library(
    name = "first_app",
    sources = ["app.py"],
    dependencies = [
        "//:numpy",
        "//:scikit-learn",
        "//:pytest",
        "utils",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":first_app",
    ]
)

In the second_app BUILD file, we declare the library from first_app above as a dependency of this library. This means that all the dependencies of that library, together with its sources, will also be dependencies of second_app.

second_app BUILD文件中,我们从上面的first_app声明该库作为该库的依赖项。 这意味着该库中的所有依赖项及其源将成为first_app的依赖项。

# second_app/BUILD
python_library(
    name = "second_app",
    sources = ["app.py"],
    dependencies = [
        "first_app",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":second_app",
    ]
)

Similarly, we also add some test targets to these BUILD files, and they can be run with ./pants test first_app:: or ./pants test second_app::.

The final directory tree should look like this:

.
├── BUILD
├── first_app
│ ├── BUILD
│ ├── app.py
│ └── app_test.py
├── pants
├── pants.toml
├── requirements.txt
├── second_app
│ ├── BUILD
│ ├── app.py
│ └── app_test.py
└── utils
├── BUILD
├── data_gen.py
├── data_gen_test.py
├── metrics.py
└── metrics_test.py

The power of Pants comes from its ability to trace transitive dependencies between projects and to find the test targets affected by a change. The developers of Pants provide this nifty bash script that can be used to track down and run the affected test targets:

#!/bin/bash

set -x
set -o
set -e

# Disable Zinc incremental compilation to ensure no historical cruft pollutes the build used for CI testing.
export PANTS_COMPILE_ZINC_INCREMENTAL=False

changed=("$(./pants --changed-parent=origin/master list)")
dependees=("$(./pants dependees --dependees-transitive --dependees-closed ${changed[@]})")
minimized=("$(./pants minimize ${dependees[@]})")
./pants filter --filter-type=-jvm_binary ${minimized[@]} | sort > minimized.txt

# In other contexts we can use --spec-file to read the list of targets to operate on all at
# once, but that would merge all the classpaths of all the test targets together, which may cause
# errors. See https://www.pantsbuild.org/3rdparty_jvm.html#managing-transitive-dependencies.
# TODO(#7480): Background cache activity when running in a loop can sometimes lead to race conditions which
# cause pants to error. This can probably be worked around with --no-cache-compile-rsc-write. See
# https://github.com/pantsbuild/pants/issues/7480.

for target in $(cat minimized.txt); do
  ./pants test $target
done
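
To try the script, save it somewhere in the repo and make it executable; the path build-support/ci.sh below is my own choice, as the article doesn’t name one:

chmod +x build-support/ci.sh
./build-support/ci.sh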

To showcase its power, let’s run an example. We’ll create a new branch, make a modification to data_gen.py (e.g. change a default parameter of generate_linear_data) and commit:

git checkout -b "example_1"
git add utils/data_gen.py
git commit -m "support/change-params"

Now, running the bash script produces a minimized.txt that lists the test targets of all impacted projects, which are then executed:

first_app:app_test
second_app:app_test
utils:utils_test
[Figure: transitive dependency graph of the example repo]

Looking at the graph above, we can clearly see that changing utils affects all of the nodes above it, including first_app and second_app.
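
You can also query this relationship directly with the same dependees goal the script uses; the command below is a sketch, and the output noted in the comment is what I’d expect rather than a captured run:

./pants dependees --dependees-transitive utils:utils
# expected to list the first_app and second_app targets,
# since both transitively depend on the utils library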

Let’s do another example; this time we’ll only modify second_app/app.py. Switch branches, commit and run the script again. Inside minimized.txt, we’ll only get second_app:app_test, as it’s the topmost node.
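
A minimal sketch of those steps, mirroring the first example (the branch name and commit message are placeholders):

git checkout -b "example_2"
git add second_app/app.py
git commit -m "support/change-second-app"
# re-run the dependency-tracing script, e.g. ./build-support/ci.sh from above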

And that’s it. Hopefully I’ve managed to demonstrate how useful Pantsbuild can be for Data Science monorepos. Together with a properly implemented CI/CD pipeline, it can vastly improve the speed and reliability of development.

Translated from: https://towardsdatascience.com/building-a-monorepo-for-data-science-with-pantsbuild-2f77b9ee14bd
