At HousingAnywhere, one of the first major obstacles we had to face when scaling the Data team was building a centralised repository to contain our ever-growing machine learning applications. Many of these projects share dependencies with one another, which means code refactoring can become painful and time-consuming. In addition, as we’re very opposed to Data Scientists’ tendency to copy/paste code, we need a unified location where we can store reusable functions that can be easily accessed.
The perfect solution to our use case was building a monorepo. In this article, I’ll go through how a simple monorepo can be built using the build automation system Pantsbuild.
What is a monorepo?
A monorepo is a repository where code for many projects is stored together. Having a centralised repository for your team comes with a number of benefits:
Reusability: Allows projects to share functions; in the case of Data Science, code for preprocessing data, calculating metrics, and even plotting graphs can be shared across projects.
Atomic changes: It only takes one operation to make changes across multiple projects.
Large-scale refactoring: Can be done easily and quickly, ensuring projects still work afterwards.
A monorepo, however, is not a one-size-fits-all solution, as it comes with a number of disadvantages:
Security issues: There are no means to expose only parts of the repository.
Big codebase: As the repo grows in size, it can cause problems as developers have to check out the entire repository.
At HousingAnywhere, our team of Data Scientists found a monorepo to be the perfect solution for the Data team’s use cases. Many of our machine learning applications have smaller projects that spin off from them. The monorepo enables us to quickly integrate these new projects into the CI/CD pipeline, reducing the time spent setting up a pipeline individually for each new project.
We tried out a number of build automation systems, and the one we stuck with is Pantsbuild. Pants is one of the few systems that supports Python natively, and it is an open-source project widely used by Twitter, Toolchain, Foursquare, Square, and Medium.
Pants has recently been updated to v2, which only supports Python at the moment, but that isn’t much of a limitation for Data Science projects.
Some basic concepts
There are a couple of concepts in Pants that you should understand beforehand (illustrated with a small sketch after this list):
Goals help users tell Pants what actions to take, e.g. test
Tasks are the Pants modules that run actions
Targets describe what files to take those actions upon. These targets are defined in a BUILD file
Target types define the types of operations that can be performed on a target, e.g. you can perform tests on test targets
Addresses describe the location of a target in the repo
For more information, I highly recommend reading this documentation, where the developers of Pants have done an excellent job of explaining these concepts in detail.
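To make these concepts concrete, here is a hypothetical BUILD file for a made-up greeter project; the names are purely illustrative, but the syntax matches the Pants v1 BUILD files used later in this article:

# greeter/BUILD -- hypothetical example, for illustration only
python_library(
    name = "greeter",             # a target grouping the sources below
    sources = ["greeter.py"],
)

python_tests(
    name = "greeter_test",        # a test target that the `test` goal can act upon
    sources = ["greeter_test.py"],
    dependencies = [":greeter"],  # ":greeter" is an address, relative to this BUILD file
)

Running ./pants test greeter:greeter_test then ties everything together: the test goal triggers Pants’ test-running tasks on the target located at the address greeter:greeter_test.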
An example repository
In this section, I’ll go through how you can easily set up a monorepo using Pants. First, make sure these requirements are met to install Pants:
- Linux or macOS.
- Python 3.6+ discoverable on your PATH.
- Internet access (so that Pants can fully bootstrap itself).
Now, let’s set up a new repository:
mkdir monorepo-example
cd monorepo-example
git init
Alternatively, you can clone the example repo via:
git clone https://github.com/uiucanh/monorepo-example.git
Next, run these commands to download the setup file:
printf '[GLOBAL]\npants_version = "1.30.0"\nbackend_packages = []\n' > pants.toml
curl -L -o ./pants https://pantsbuild.github.io/setup/pants && \
    chmod +x ./pants
Then, bootstrap Pants by running ./pants --version. You should receive 1.30.0 as output.
Let’s add a couple of simple apps to the repo. First, we’ll create a utils/data_gen.py and a utils/metrics.py that contain a couple of util functions:
import numpy as np


def generate_linear_data(n_samples: int = 100, n_features: int = 1,
                         x_min: int = -5, x_max: int = 5,
                         m_min: int = -10, m_max: int = 10,
                         noise_strength: int = 1, seed: int = None,
                         bias: int = 10):
    # Set the random seed
    if seed is not None:
        np.random.seed(seed)

    X = np.random.uniform(x_min, x_max, size=(n_samples, n_features))
    m = np.random.uniform(m_min, m_max, size=n_features)
    y = np.dot(X, m).reshape((n_samples, 1))

    if bias != 0:
        y += bias

    # Add Gaussian noise
    y += np.random.normal(size=y.shape) * noise_strength

    return X, y


def split_dataset(X: np.ndarray, y: np.ndarray,
                  test_size: float = 0.2, seed: int = 0):
    # Set the random seed
    np.random.seed(seed)

    # Shuffle dataset
    indices = np.random.permutation(len(X))
    X = X[indices]
    y = y[indices]

    # Splitting
    X_split_point = int(len(X) * (1 - test_size))
    y_split_point = int(len(y) * (1 - test_size))
    X_train, X_test = X[:X_split_point], X[X_split_point:]
    y_train, y_test = y[:y_split_point], y[y_split_point:]

    return X_train, X_test, y_train, y_test
import numpy as np


def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def r2(y_test: np.ndarray, y_pred: np.ndarray):
    y_mean = np.mean(y_test)
    ss_tot = np.square(y_test - y_mean).sum()
    ss_res = np.square(y_test - y_pred).sum()
    result = 1 - ss_res / ss_tot
    return result
Now, we’ll add an application first_app/app.py that imports this code. The app takes data from generate_linear_data, passes it to a Linear Regression model, and outputs the Mean Absolute Percentage Error.
import os
import sys

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.data_gen import generate_linear_data, split_dataset  # noqa
from utils.metrics import mean_absolute_percentage_error, r2  # noqa
from sklearn.linear_model import LinearRegression  # noqa


class Model:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.m = LinearRegression()
        self.y_pred = None

    def split(self, test_size=0.33, seed=0):
        self.X_train, self.X_test, self.y_train, self.y_test = split_dataset(
            self.X, self.y, test_size=test_size, seed=seed)

    def fit(self):
        self.m.fit(self.X_train, self.y_train)

    def predict(self):
        self.y_pred = self.m.predict(self.X_test)


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    print("MAPE:", mean_absolute_percentage_error(m.y_test, m.y_pred))


if __name__ == '__main__':
    main()
And another app second_app/app.py that uses the first app’s code:
import sys
import os

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.metrics import r2  # noqa
from utils.data_gen import generate_linear_data, split_dataset  # noqa
from first_app.app import Model  # noqa


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    result = r2(m.y_test, m.y_pred)
    print("R2:", result)
    return result


if __name__ == '__main__':
    _ = main()
Then we add a couple of simple tests for these apps, for example:
import numpy as np

from first_app.app import Model


def test_model_working():
    X, y = np.array([[1, 2, 3], [4, 5, 6]]), np.array([[1], [2]])
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    assert m.y_pred is not None
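The utils tests referenced in the BUILD files below aren’t shown in the original post; here is a minimal sketch of what utils/metrics_test.py could contain, based on the two functions defined above:

import numpy as np

from utils.metrics import mean_absolute_percentage_error, r2


def test_mean_absolute_percentage_error():
    # Both samples are off by 10%, so the mean error is 10%
    y_true = np.array([100.0, 200.0])
    y_pred = np.array([110.0, 180.0])
    assert np.isclose(mean_absolute_percentage_error(y_true, y_pred), 10.0)


def test_r2_perfect_prediction():
    # A perfect prediction leaves no residuals, so R^2 is exactly 1
    y = np.array([1.0, 2.0, 3.0])
    assert np.isclose(r2(y, y), 1.0)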
In each of these directories, we’ll need a BUILD file. These files contain information about your targets and their dependencies. In them, we’ll declare the requirements needed for these projects as well as the test targets.
Let’s start from the root of the repository:
python_requirements()
This BUILD file contains a macro python_requirements() that creates multiple targets to pull third-party dependencies from a requirements.txt in the same directory. It saves us the time of having to do it manually for each requirement, like this one for numpy:
python_requirement_library(
name="numpy",
requirements=[
python_requirement("numpy==1.19.1"),
],
)
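For reference, the requirements.txt that the macro reads could look something like the following; the article only pins numpy==1.19.1, so the other version numbers here are assumptions based on the libraries used:

numpy==1.19.1
scikit-learn==0.23.2
pytest==5.3.5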
The BUILD file in utils looks like this:
python_library(
    name = "utils",
    sources = [
        "data_gen.py",
        "metrics.py",
    ],
    dependencies = [
        # The `//` signals that the target is at the root of your project.
        "//:numpy"
    ]
)

python_tests(
    name = 'utils_test',
    sources = [
        "data_gen_test.py",
        "metrics_test.py",
    ],
    dependencies = [
        ":utils",
    ]
)
Here we have two targets. The first is a Python library containing the Python code defined in sources, i.e. our two utility files. It also specifies the requirement needed to run this code: numpy, one of the third-party dependencies we defined in the root BUILD file.
The second target is the collection of tests we defined earlier; their dependency is the Python library above. Running these tests is as simple as ./pants test utils:utils_test or ./pants test utils:: from the root. The second colon tells Pants to run all the test targets in that BUILD file. The output should look like this:
============== test session starts ===============
platform darwin -- Python 3.7.5, pytest-5.3.5, py-1.9.0, pluggy-0.13.1
cachedir: .pants.d/test/pytest/.pytest_cache
rootdir: /Users/ducbui/Desktop/Projects/monorepo-example, inifile: /dev/null
plugins: cov-2.8.1, timeout-1.3.4
collected 3 items
utils/data_gen_test.py . [ 33%]
utils/metrics_test.py .. [100%]
Similarly, we’ll create two BUILD files for first_app and second_app:
python_library(
    name = "first_app",
    sources = ["app.py"],
    dependencies = [
        "//:numpy",
        "//:scikit-learn",
        "//:pytest",
        "utils",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":first_app",
    ]
)
In the second_app BUILD file, we declare the first_app library above as a dependency of this library. This means that all the dependencies of that library, together with its sources, become dependencies of second_app.
python_library(
    name = "second_app",
    sources = ["app.py"],
    dependencies = [
        "first_app",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":second_app",
    ]
)
Similarly, we also add some test targets to these BUILD files; they can be run with ./pants test first_app:: or ./pants test second_app::.
The final directory tree should look like this:
.
├── BUILD
├── first_app
│   ├── BUILD
│   ├── app.py
│   └── app_test.py
├── pants
├── pants.toml
├── requirements.txt
├── second_app
│   ├── BUILD
│   ├── app.py
│   └── app_test.py
└── utils
    ├── BUILD
    ├── data_gen.py
    ├── data_gen_test.py
    ├── metrics.py
    └── metrics_test.py
The power of Pants comes from its ability to trace transitive dependencies between projects and pick out the test targets affected by a change. The developers of Pants provide this nifty bash script that can be used to track down affected test targets:
#!/bin/bash

set -x
set -o
set -e

# Disable Zinc incremental compilation to ensure no historical cruft pollutes the build used for CI testing.
export PANTS_COMPILE_ZINC_INCREMENTAL=False

changed=("$(./pants --changed-parent=origin/master list)")
dependees=("$(./pants dependees --dependees-transitive --dependees-closed ${changed[@]})")
minimized=("$(./pants minimize ${dependees[@]})")
./pants filter --filter-type=-jvm_binary ${minimized[@]} | sort > minimized.txt

# In other contexts we can use --spec-file to read the list of targets to operate on all at
# once, but that would merge all the classpaths of all the test targets together, which may cause
# errors. See https://www.pantsbuild.org/3rdparty_jvm.html#managing-transitive-dependencies.
# TODO(#7480): Background cache activity when running in a loop can sometimes lead to race conditions which
# cause pants to error. This can probably be worked around with --no-cache-compile-rsc-write. See
# https://github.com/pantsbuild/pants/issues/7480.
for target in $(cat minimized.txt); do
    ./pants test $target
done
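Saved as an executable file at the repo root (the filename here is our choice, not from the original post), the script slots straight into a CI step:

# Make sure origin/master exists locally so --changed-parent can diff against it,
# then run only the test targets affected by this branch's changes.
git fetch origin master
./test_affected.sh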
To showcase its power, let’s run an example. We’ll create a new branch, make a modification to data_gen.py (e.g. changing the default parameters of generate_linear_data) and commit:
git checkout -b "example_1"
git add utils/data_gen.py
git commit -m "support/change-params"
Now, running the bash script, we’ll see a minimized.txt that contains all the impacted projects and the test targets that will be executed:
first_app:app_test
second_app:app_test
utils:utils_test
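The original post shows a dependency-graph figure at this point; here is a rough sketch of the same relationships, with each arrow pointing from a target to the one it depends on:

second_app
    │
    ▼
first_app
    │
    ▼
  utils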
Looking at the graph above, we can clearly see that changing utils affects all of the nodes above it, including first_app and second_app.
Let’s do another example; this time we’ll only modify second_app/app.py. Switch branches, commit, and run the script again. Inside minimized.txt, we’ll only get second_app:app_test, as it’s the topmost node.
And that’s it. Hopefully I’ve managed to demonstrate how useful Pantsbuild can be for Data Science monorepos. Together with a properly implemented CI/CD pipeline, it can vastly improve the speed and reliability of development.
Translated from: https://towardsdatascience.com/building-a-monorepo-for-data-science-with-pantsbuild-2f77b9ee14bd