数据科学家数据工程师

While working in a software project it is very common and, in fact, a standard to start right away versioning code, and the benefits are already pretty obvious for the software community: it tracks every modification of the code in a particular code repository. If any mistake is made, developers can always travel through time and compare earlier versions of the code in order to solve the problem while minimizing disruption to all the team members. Code for software projects is the most precious asset and for that reason must be protected at all costs!

在软件项目中工作时，它是非常普遍的，实际上是立即开始版本控制代码的标准，对于软件社区来说，好处已经非常明显：它跟踪特定代码存储库中对代码的每次修改。如果有任何错误，开发人员可以随时浏览并比较早期版本的代码，以解决问题，同时最大程度地减少对所有团队成员的破坏。软件项目代码是最宝贵的资产，因此必须不惜一切代价保护它！

Well, for Data Science projects, data can also be considered the crown jewels, so why us, as Data Scientists, don’t treat as the most precious thing on earth through versioning control?
好吧，对于数据科学项目，数据也可以被视为皇冠上的明珠，那么为什么我们作为数据科学家不通过版本控制将其视为地球上最宝贵的东西呢？

For those familiar with Git, you might be thinking, “Git cannot handle large files and directories.. at least it can’t with the same performance as it deals with small code files. So how can I version control my data in the same old fashion we version control code?”. Well, this is now possible, and it’s easy as just typing git cloneand see the data files and ML model files saved in the workspace, and all this magic can be achieved with DVC.

对于熟悉Git的人来说，您可能会想： “ Git无法处理大文件和目录。至少，它不能具有与处理小代码文件相同的性能。 那么如何以与版本控制代码相同的旧版本来控制数据呢？”。 嗯，这已经成为可能，而且很容易，只需键入git clone并查看保存在工作区中的数据文件和ML模型文件，并且所有这些魔力都可以通过DVC来实现。

DVC快速入门 (Quick start with DVC)

First things first, we have to get DVC installed in our machines. It’s pretty straightforward and you can do it by following these steps.

首先，我们必须在计算机中安装DVC。这非常简单，您可以按照以下步骤进行操作。

As I’ve already mentioned, tools for data version control such as DVC makes it possible to build large projects while making it possible to reproduce the pipelines. Using DVC it’s very simple to add datasets into a git repository, and when I mean by simple, is as easy as typing the line below:

正如我已经提到的那样，用于数据版本控制的工具(例如DVC)使构建大型项目成为可能，同时又可以重现管道。使用DVC，将数据集添加到git存储库非常简单，而我的意思很简单，就像键入以下行一样：

dvc add path/to/dataset

Regardless of the size of the dataset, the data is added to the repository. Assuming that we also want to push the dataset into the cloud, it is also possible with the below command:

无论数据集的大小如何，数据都会添加到存储库中。假设我们也想将数据集推送到云中，也可以使用以下命令：

dvc push path/to/dataset.dvc

Out of the box, DVC supports many cloud storage services such as S3, Google Storage, Azure Blobs, Google Drive, etc… And since the dataset was pushed to the cloud through the version control system, if I clone the project into another machine, I’m able to download the data, or any other artifact, using the following command:

DVC开箱即用，支持许多云存储服务，例如S3，Google Storage，Azure Blob，Google Drive等。由于数据集是通过版本控制系统推送到云的，因此如果我将项目克隆到另一台计算机上，我可以使用以下命令下载数据或任何其他工件：

dvc pull

Well, now that you know how to start with DVC, I suggest you to go and further explore the tool, or similar ones. Version control should be your best friend as a Data Scientist, as they allow not only to version datasets but also to create reproducible pipelines, while keeping all the developments traceable and reproducible.

好了，既然您知道如何开始使用DVC，我建议您继续研究该工具或类似工具。作为数据科学家，版本控制应该是您最好的朋友，因为它们不仅允许版本数据集，而且允许创建可复制的管道，同时保持所有开发的可追溯性和可复制性。

If this hasn’t yet convinced, next I’ll tell why you must start versioning control your data!!

如果尚未确定，接下来我将告诉您为什么必须开始版本控制您的数据！

为什么要开始使用数据版本控制？ (Why should I start using data version control?)

1.保存并复制所有数据实验 (1. Save and reproduce all of your data experiments)

As Data Scientists we know that to develop a Machine Learning model, is not all about code, but also about data and the right parameters. A lot of times, in order to find the perfect match, experimentation is required, which makes the process highly iterative and extremely important to keep track of the changes made as well as their impacts on the end results. This becomes even more important in a complex environment where multiple data scientists are collaborating. In that sense, if we are able to have a snapshot of the data used to develop a certain version of the model and have it versioned, it makes the process of iteration and model development not only easier but also trackable.

作为数据科学家，我们知道开发机器学习模型不仅与代码有关，而且与数据和正确的参数有关。很多时候，为了找到完美的匹配，需要进行实验，这使得该过程具有高度的重复性，并且对于跟踪所做的更改及其对最终结果的影响非常重要。在由多个数据科学家协作的复杂环境中，这一点变得更加重要。从这个意义上讲，如果我们能够拥有用于开发模型的特定版本的数据的快照并对其进行版本化，那么它不仅使迭代和模型开发过程变得更加容易而且可跟踪。

2.调试和测试 (2. Debugging and testing)

While playing around in Kaggle competitions many times we do not understand the real challenges inherent to the development of an ML-based solution while working with production systems. In fact, one of the biggest challenges is to deal with the variety of data sources and the amount of data that we’ve available. Sometimes can be a bit daunting to reproduce the results of experimentation if we are not even able to retrieve the exact dataset that has been used. Data version control can ease these issues and make the process of machine learning solutions development must simpler, organized, and reproducible.

当多次参加Kaggle比赛时，我们不了解在与生产系统一起工作时开发基于ML的解决方案所固有的真正挑战。实际上，最大的挑战之一是处理各种数据源和我们可用的数据量。如果我们甚至无法检索已使用的确切数据集，有时要重现实验结果可能会有些艰巨。数据版本控制可以缓解这些问题，并使机器学习解决方案的开发过程必须更简单，更有条理并且可重现。

3.合规与审计 (3. Compliance and auditing)

Privacy regulations, such as GDPR, already request companies and organizations to demonstrate compliance and history of the available data sources. The ability to track data version provided by version control tools is the first step to have companies data sources ready for compliance, and an essential step in maintaining a strong and robust audit train and risk management processes around data.

隐私法规(例如GDPR)已经要求公司和组织证明合规性和可用数据源的历史记录。跟踪版本控制工具提供的数据版本的能力是使公司数据源准备好合规的第一步，并且是维持围绕数据的强大而强大的审核培训和风险管理流程的重要步骤。

4.协调软件和数据科学团队 (4. Align software and data science teams)

Sometimes, to have Data Science and Software teams talking the same language can be quite challenging and can highly depend on the profiles involved in the interactions between the teams. To start implementing some of the good practices from the software into the data science processes, can help not only to align the work between the teams involved, but also to accelerate the development and integration of the solutions.

有时，让数据科学和软件团队说相同的语言可能会非常具有挑战性，并且在很大程度上取决于团队之间交互所涉及的配置文件。从软件到数据科学流程开始实施一些良好实践，不仅可以帮助使相关团队之间的工作保持一致，还可以加快解决方案的开发和集成。

结论 (Conclusions)

Data science is had to productize, and one of the main reasons for that is because there are too many mutable elements, such as data. The concept of versioning for data science applications can be interpreted in many possible ways, from models to data versioning. This article aimed to cover the importance and benefits of versioning data for the data science teams, but there are many more aspects that we should pay attention to as Data Scientists. In the end, keeping an eye on continuous delivery principles is very important for the success of ML-based solutions!

数据科学必须进行生产，其主要原因之一是因为可变元素(例如数据)太多。从模型到数据版本控制，可以采用许多可能的方式来解释数据科学应用程序的版本控制概念。本文旨在介绍对数据科学团队进行数据版本控制的重要性和好处，但是作为数据科学家，我们还有许多方面应注意。最后，密切注意连续交付原则对于基于ML的解决方案的成功非常重要！

Fabiana Clemente is CDO at YData.

Fabiana Clemente 是 YData的 CDO 。

Improved data for AI

改善AI数据

YData provides a data-centric development platform for Data Scientists to work to high-quality and synthetic data.

YData为数据科学家提供了以数据为中心的开发平台，以处理高质量和合成数据。