4 Reasons Why Data Scientists Should Version Data

In software projects it is standard practice to put code under version control from the very start, and the benefits are well known to the software community: every modification to the code in a repository is tracked. If a mistake is made, developers can go back in time and compare earlier versions of the code to fix the problem while minimizing disruption to the rest of the team. Code is the most precious asset of a software project, and for that reason it must be protected at all costs!

Well, for Data Science projects, data can also be considered the crown jewels, so why don't we, as Data Scientists, treat it as the most precious thing on earth by putting it under version control?

For those familiar with Git, you might be thinking, "Git cannot handle large files and directories, at least not with the same performance it has with small code files. So how can I version control my data in the same way we version control code?" Well, this is now possible: it can be as easy as typing git clone, followed by dvc pull, and seeing the data files and ML model files appear in the workspace. All this magic can be achieved with DVC.

Quick start with DVC

First things first, we have to get DVC installed on our machines. It's pretty straightforward, and you can do it by following the installation steps in the DVC documentation.

As I've already mentioned, data version control tools such as DVC make it possible to build large projects while keeping their pipelines reproducible. With DVC it is very simple to add datasets to a Git repository, and by simple I mean as easy as typing the line below:

dvc add path/to/dataset

Regardless of the size of the dataset, the data is now tracked in the repository: DVC records it in a small path/to/dataset.dvc file that Git versions in place of the data itself. If we also want to push the dataset to the cloud, assuming remote storage has already been configured for the project, we can do so with the command below:

dvc push path/to/dataset.dvc

Out of the box, DVC supports many cloud storage services such as S3, Google Cloud Storage, Azure Blob Storage, Google Drive, etc. And since the dataset was pushed to the cloud through the version control system, if I clone the project onto another machine, I can download the data, or any other artifact, using the following command:

dvc pull
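
To build some intuition for the add, push, and pull cycle above, here is a minimal Python sketch of the general idea behind it: the data is stored under its content hash, and only a tiny pointer file would need to live in Git. This is an illustration under stated assumptions, not DVC's actual implementation. The function names, the ".pointer" suffix, the use of SHA-256, and the local folder standing in for cloud storage are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

POINTER_SUFFIX = ".pointer"  # hypothetical suffix, chosen for this sketch only


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add(dataset: Path, cache: Path) -> Path:
    """Copy the data into a local cache and write a small pointer file next to it."""
    checksum = file_sha256(dataset)
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copy2(dataset, cache / checksum)
    pointer = dataset.with_name(dataset.name + POINTER_SUFFIX)
    pointer.write_text(checksum)
    return pointer


def push(pointer: Path, cache: Path, remote: Path) -> None:
    """Upload the cached data referenced by the pointer (a folder plays the cloud)."""
    checksum = pointer.read_text()
    remote.mkdir(parents=True, exist_ok=True)
    shutil.copy2(cache / checksum, remote / checksum)


def pull(pointer: Path, remote: Path) -> Path:
    """Restore the data file in the workspace from the remote, using only the pointer."""
    checksum = pointer.read_text()
    target = pointer.with_name(pointer.name[: -len(POINTER_SUFFIX)])
    shutil.copy2(remote / checksum, target)
    return target


def roundtrip_demo() -> bytes:
    """Add and push a dataset, delete it, then pull it back as a fresh clone would."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        workspace, cache, remote = root / "ws", root / "cache", root / "remote"
        workspace.mkdir()
        dataset = workspace / "dataset.csv"
        dataset.write_bytes(b"id,value\n1,42\n")

        pointer = add(dataset, cache)
        push(pointer, cache, remote)

        dataset.unlink()  # a fresh clone has the pointer file but not the data
        return pull(pointer, remote).read_bytes()
```

The point of the design is that the pointer file stays a few bytes long no matter how large the dataset is, which is why it can be committed to Git like any other source file.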

Well, now that you know how to get started with DVC, I suggest you go and explore the tool further, or similar ones. Data version control tools should be your best friend as a Data Scientist, as they allow you not only to version datasets but also to create reproducible pipelines, while keeping all developments traceable.

If this hasn't convinced you yet, next I'll tell you why you should start version controlling your data!

Why should I start using data version control?

1. Save and reproduce all of your data experiments

As Data Scientists we know that developing a Machine Learning model is not only about code, but also about data and the right parameters. Finding the right combination often requires experimentation, which makes the process highly iterative and makes it extremely important to keep track of the changes made as well as their impact on the end results. This becomes even more important in a complex environment where multiple data scientists are collaborating. In that sense, if we have a versioned snapshot of the data used to develop a certain version of the model, iteration and model development become not only easier but also trackable.
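
As a rough illustration of what tying a model version to a data snapshot can look like, the sketch below stores a fingerprint of the training data together with the parameters and metrics of each run. It is only a sketch: the ExperimentRecord class, its field names, and the choice of SHA-256 are assumptions made for this example and do not come from any particular tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    """One experiment run: which data, which parameters, which results."""

    data_sha256: str
    params: dict
    metrics: dict


def fingerprint(data: bytes) -> str:
    """Return a content hash that changes whenever the data changes."""
    return hashlib.sha256(data).hexdigest()


def record_experiment(data: bytes, params: dict, metrics: dict) -> ExperimentRecord:
    """Bundle the data fingerprint with the parameters and metrics of a run."""
    return ExperimentRecord(fingerprint(data), params, metrics)


def to_json(record: ExperimentRecord) -> str:
    """Serialize a record so it can be committed alongside the code."""
    return json.dumps(asdict(record), sort_keys=True)
```

Two runs that report different metrics with the same params but a different data_sha256 immediately show that the data, not the code, explains the difference.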

2. Debugging and testing

When playing around in Kaggle competitions, we often do not see the real challenges inherent in developing an ML-based solution for production systems. In fact, one of the biggest challenges is dealing with the variety of data sources and the amount of data available. It can be daunting to reproduce the results of an experiment if we cannot even retrieve the exact dataset that was used. Data version control can ease these issues and make the development of machine learning solutions much simpler, more organized, and reproducible.
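
One simple way to picture "retrieving the exact dataset" during debugging is a guard that refuses to re-run an experiment on data that does not match the recorded version. The following is a minimal sketch, assuming the expected hash was saved when the experiment was first run. The verify_dataset function and the DatasetMismatchError class are hypothetical names introduced for this example.

```python
import hashlib


class DatasetMismatchError(Exception):
    """Raised when the data at hand is not the version an experiment used."""


def verify_dataset(data: bytes, expected_sha256: str) -> None:
    """Fail loudly if the dataset differs from the recorded snapshot."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise DatasetMismatchError(
            f"expected dataset {expected_sha256[:12]}, got {actual[:12]}"
        )
```

Calling verify_dataset at the top of a reproduction script turns a silent change in the data into an explicit error, before any time is spent chasing differences in the results.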

3. Compliance and auditing

Privacy regulations such as GDPR already require companies and organizations to demonstrate compliance and the history of their available data sources. The ability to track data versions, provided by version control tools, is the first step toward having a company's data sources ready for compliance, and an essential step in maintaining a strong and robust audit trail and risk management processes around data.

4. Align software and data science teams

Getting Data Science and Software teams to speak the same language can be quite challenging, and it depends heavily on the profiles of the people involved in the interactions between the teams. Bringing some of the good practices of software development into data science processes can help not only to align the work of the teams involved, but also to accelerate the development and integration of solutions.

Conclusions

Data science is hard to productize, and one of the main reasons is that there are too many mutable elements, such as data. The concept of versioning for data science applications can be interpreted in many ways, from model versioning to data versioning. This article aimed to cover the importance and benefits of versioning data for data science teams, but there are many more aspects that we should pay attention to as Data Scientists. In the end, keeping an eye on continuous delivery principles is very important for the success of ML-based solutions!

Fabiana Clemente is CDO at YData.

Improved data for AI

YData provides a data-centric development platform for Data Scientists to work with high-quality and synthetic data.

Translated from: https://medium.com/swlh/4-reasons-why-data-scientists-should-version-data-672aca5bbd0b
