5 Essential Data Vs You Have to Take Care Of

I am pretty sure that somewhere on your data journey you have come across courses, videos, articles, or use cases where someone takes some data, builds a classification or regression model, and shows you great results. You learn how the model works and why it works that way and not another, and everything seems fine. You think you just learned a new thing (and you did), you are happy about that (yes, you are! I am not kidding around here, you're doing great!), and you move on to the next piece of content.

But later on (everyone's "later on" has a different length) you start to ask additional questions: Where did that data come from? If I have more data, will the model run as smoothly as it did during the demonstration? Does real-world data exist in such a format? Can I get similar data, and if I can, will it be as easy to process? What did the results of that model mean? Can I present that data in a prettier way? And so on and so on and so on.

When I started to learn about data analytics, data science, and the world of data in general, I was always amazed by the results people would get after processing some piece of data, running a machine learning model, or getting keys from word buckets. But every time I tried to do something on my own, a new obstacle would appear: the data I wanted to analyze was too much or not enough, a model would run with one piece of data but not with another, and so on.

After running into all these difficulties and learning to deal with them the hard way, I would like to share with you the 5 essential Vs of data that you have to take care of before you start your data project or solution.

1st V: Volume

When we talk about "volume" in regard to data, we have to be aware of the amount of data that has to be handled in the project. Should we use several servers to handle that volume and distribute the load between them? Or is our own computer with its own hard disk quite enough to solve the problem?

2nd V: Velocity

Velocity is the speed with which data travels through our model, project, or solution: the speed with which it is ingested, processed, and delivered to the end client. We have to know whether this is real-time data, near real-time data, or just historical data that is not going anywhere soon, which we can work through slowly and efficiently 😉

3rd V: Variety

Data comes from various sources and in various types: structured, semi-structured, and not structured at all (officially "unstructured" XD). And boy, have I gotten burned by it a lot. My pipeline would expect one data type (because I tested it with a sample and it worked), and then it would give me an error because there was an additional type or structure not yet supported by my solution. These kinds of things have to be defined in the beginning: you have to know the level of variety of the data you are working with.
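
One way to catch this kind of surprise early is to check each incoming record against the types the pipeline was tested with, before processing it. The sketch below is only an illustration of that idea: the field names, `EXPECTED_SCHEMA`, and `find_schema_problems` are hypothetical and do not come from any particular library or project.

```python
# Hypothetical schema: the field names and types the pipeline was tested with.
EXPECTED_SCHEMA = {"employee_id": int, "name": str, "salary": float}


def find_schema_problems(record, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Fields the pipeline has never seen are reported too, not silently dropped.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


if __name__ == "__main__":
    # An illustrative record with a wrong type and an extra field.
    sample = {"employee_id": "17", "name": "Ana", "salary": 3200.0, "team": "HR"}
    for problem in find_schema_problems(sample):
        print(problem)
```

Returning a list of problems instead of raising on the first one makes it easier to see the full extent of the variety in a new source before deciding what the pipeline should support.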

4th V: Veracity

Is the data I am working with trustworthy? Is it still correct after all the manipulations and cleaning? Was the transformation pipeline correct? These are the questions we ask when we talk about the veracity of data. We can collect all the data we need, and it won't be that difficult, but will it be accurate and consistent, and not falsely altered? That's another challenge. We are all aware that in order to get insights from data we have to do a little preprocessing, and we have to make sure that this process does not skew the data.
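
A simple way to keep an eye on this is to compare a few summary statistics of a column before and after a preprocessing step. The following is a minimal sketch under stated assumptions: the helper names, the 10% tolerance, and the cleaning rule in the demo are all made up for illustration.

```python
import statistics


def summarize(values):
    """Basic summary of a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def relative_change(before, after):
    """Relative change of each summary statistic after preprocessing."""
    changes = {}
    for key, old in before.items():
        new = after[key]
        if old == 0:
            changes[key] = 0.0 if new == 0 else float("inf")
        else:
            changes[key] = abs(new - old) / abs(old)
    return changes


def skewed_statistics(raw, cleaned, tolerance=0.10):
    """Names of statistics that moved by more than `tolerance` (assumed 10%)."""
    changes = relative_change(summarize(raw), summarize(cleaned))
    return sorted(name for name, change in changes.items() if change > tolerance)


if __name__ == "__main__":
    raw = [30.0, 31.0, 29.0, 30.0, 300.0]   # one suspicious outlier
    cleaned = [v for v in raw if v < 100]    # hypothetical cleaning rule
    print(skewed_statistics(raw, cleaned))
```

A flagged statistic is not necessarily a mistake (removing an outlier is supposed to move the mean), but it is a prompt to confirm that the change is the one you intended.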

5th V: Value

And the last V stands for value, because at the end of the day the whole point of all this is to get value from data. That includes creating reports and dashboards, finding useful insights that can improve the business, and highlighting critical areas so that more informed decisions can be made.

You may object that those are the 5 Vs of big data, and you will be right. Yes, they are the 5 Vs of big data, but not only of big data. Any data project has to deal with these 5 Vs. In a big data project they will be more complicated to handle; in a small data project all 5 Vs will simply be easier to manage.

For example, I was working on a data solution for an HR department, and in the beginning we had to address the 5 Vs of the data. Even though we didn't have terabytes of data, we had a lot of small Excel files where the data was previously stored and distributed (volume). There were 3 different sources of data to collect from: Excel files, the corporate DB, and the corporate CRM (variety). The data would be updated on a daily basis, and users wanted up-to-date data as quickly as possible, with a maximum delay of 30 minutes. That's not even close to real-time, but we still had to make sure that the pipeline executed fast enough (velocity). Data coming from the Excel files would always be altered by a human at some point, and there was always a dispute over which update takes precedence, so we had to deal with that too (veracity). And in order to get value from the data, we had to find a way to visualize it and let the end user explore it and draw their own conclusions (value).
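
A requirement like the 30-minute maximum delay can be turned into a small freshness check. This is only a sketch of the idea, not the code used in the project: `MAX_DELAY`, `is_fresh`, and the timestamps in the demo are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Maximum acceptable delay between a source update and the data users see.
MAX_DELAY = timedelta(minutes=30)


def is_fresh(source_updated_at, pipeline_finished_at, max_delay=MAX_DELAY):
    """True if the pipeline finished after the update and within the allowed delay."""
    delay = pipeline_finished_at - source_updated_at
    return timedelta(0) <= delay <= max_delay


if __name__ == "__main__":
    # Illustrative timestamps: the source changed at 09:00, the run ended at 09:20.
    updated = datetime(2020, 8, 10, 9, 0, tzinfo=timezone.utc)
    finished = datetime(2020, 8, 10, 9, 20, tzinfo=timezone.utc)
    print(is_fresh(updated, finished))
```

A run that finished before the source was updated counts as not fresh here, because that update has not been delivered to users yet.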

We invested time in the beginning to find a solution for every V of our data, and having done that, we were able to finish our project just in time, even with lovely documentation.

So even if you are just going to process the Titanic dataset, think of the 5 Vs. It will take you 2 minutes, but you will be ready for the unpredictable, even though you already know who's gonna die there XD.

Originally published at https://sergilehkyi.com on August 10, 2020.

Translated from: https://medium.com/swlh/5-essential-data-vs-you-have-to-take-care-of-b4e03e8964c1
