5 Essential Data Vs You Have to Take Care Of
I am pretty sure that on your data journey you have come across some courses, videos, articles, maybe use cases, where someone takes some data, builds a classification/regression model, shows you great results, you learn how that model works and why it works that way and not another, and everything seems to be fine. You think you just learned a new thing (and you did), you are happy about that (yes, you are! I am not kidding around here, you're doing great!) and you continue to the next piece of content.
But later on you start to ask additional questions (everyone has a different length of that “later on”), like: where did that data come from? If I have more data, will that model run as smoothly as it did during the demonstration? Does data in the real world exist in such a format? Can I get similar data, and if I can, will it be as easy to process? What did the results of that model mean? Can I present that data in a prettier way? And so on and so on and so on.
When I started to learn about data analytics, data science, and the world of data in general, I was always amazed by the results people would get after processing some piece of data, running a machine learning model, getting keys from word buckets, etc. But every time I tried to do something on my own, a new obstacle would appear: the data I wanted to analyze was too much or not enough, a model would run with one piece of data but not with another, and so on.
After having all these difficulties and learning to deal with them the hard way, I would like to share with you the essential 5 Vs of data that you have to take care of before you start your data project/solution.
1st V — Volume
When we talk about “volume” in regard to data, we have to be aware of the amount of data that has to be handled in the project — should we use several servers to handle that volume and distribute the load between them? Or is our own computer with its own hard disk quite enough to solve the problem?
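To make this a bit more concrete, here is a minimal sketch (in Python with pandas, which I'll use for all the snippets below) of one way to estimate the volume up front and pick a loading strategy. The data directory and the memory budget are just assumptions for the illustration, not anything prescribed:

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")                 # hypothetical folder with the raw CSV files
MEMORY_BUDGET_BYTES = 2 * 1024**3       # rough limit for comfortable in-memory work

csv_files = list(DATA_DIR.glob("*.csv"))
total_bytes = sum(f.stat().st_size for f in csv_files)

if total_bytes < MEMORY_BUDGET_BYTES:
    # Small enough: load everything at once on a single machine.
    df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
else:
    # Too big for one DataFrame: read in chunks (or hand the job over to a
    # distributed engine such as Dask or Spark running on several servers).
    for f in csv_files:
        for chunk in pd.read_csv(f, chunksize=100_000):
            ...  # process each chunk incrementally here
```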
2nd V — Velocity
Velocity is the speed with which data travels through our model/project/solution: the speed with which it is ingested, processed and delivered to the end client. We have to be aware whether this is real-time data, near real-time, or maybe just historic data which is not going anywhere soon, so we can work through it slowly and efficiently 😉
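If your velocity requirement is just a batch that has to stay within a freshness window, a cheap first step is to time the pipeline against that window. This is only a sketch; run_pipeline() and the 30-minute target are placeholders I made up for the example:

```python
import time
from datetime import timedelta

FRESHNESS_TARGET = timedelta(minutes=30)   # how stale the data is allowed to get

def run_pipeline() -> None:
    """Placeholder for the real ingest -> transform -> deliver steps."""
    time.sleep(1)

while True:
    started = time.monotonic()
    run_pipeline()
    elapsed = timedelta(seconds=time.monotonic() - started)

    if elapsed > FRESHNESS_TARGET:
        # The batch itself is slower than the freshness requirement:
        # time to rethink the design (incremental loads, streaming, more workers).
        print(f"Pipeline took {elapsed}, which exceeds the {FRESHNESS_TARGET} target")

    # Wait out whatever is left of the window before the next run.
    time.sleep(max(0.0, (FRESHNESS_TARGET - elapsed).total_seconds()))
```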
3rd V — Variety
Data comes from various sources and in various types: structured, semi-structured and not structured at all (officially, unstructured XD), and boy, have I got burned on it a lot. My pipeline would expect one data type (because I tested it with a sample and it worked) and then give me an error because there is an additional type or structure that is not yet supported by my solution. This kind of thing has to be defined in the beginning: you have to know the level of variety of the data you are working with.
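One cheap way to protect yourself from that surprise is to validate the schema of every incoming batch before the pipeline touches it, so it fails fast with a readable message instead of halfway through a transformation. A minimal sketch, assuming made-up column names and dtypes:

```python
import pandas as pd

# Hypothetical schema for illustration only.
EXPECTED_SCHEMA = {
    "employee_id": "int64",
    "hire_date": "datetime64[ns]",
    "salary": "float64",
}

def validate_schema(df: pd.DataFrame) -> None:
    """Raise a descriptive error instead of failing somewhere deep in the pipeline."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

    for column, expected_dtype in EXPECTED_SCHEMA.items():
        actual_dtype = str(df[column].dtype)
        if actual_dtype != expected_dtype:
            raise TypeError(
                f"Column '{column}' has dtype {actual_dtype}, expected {expected_dtype}"
            )

# Usage: validate right after ingestion, before any transformation, e.g.
# validate_schema(pd.read_csv("new_batch.csv", parse_dates=["hire_date"]))
```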
4th V — Veracity
Is the data I am working with worth trusting? Is it trustworthy? Is it still correct after all the manipulations and cleaning? Was the transformation pipeline correct? These are the questions we ask when we talk about the veracity of the data. We can collect all the data we need and it won't be that difficult, but will it be accurate and consistent, and won't it be falsely altered — that's another challenge. We are all aware that in order to get insights from the data we have to perform a little preprocessing, and we have to make sure that the process does not skew the data.
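In practice, a handful of cheap consistency checks after each transformation step catches most of these problems. Here is a minimal sketch; the column names and the tolerance are assumptions for the example, not a prescription:

```python
import pandas as pd

def check_veracity(raw: pd.DataFrame, transformed: pd.DataFrame) -> None:
    # No rows silently dropped or duplicated by the transformation.
    assert len(transformed) == len(raw), (
        f"Row count changed: {len(raw)} -> {len(transformed)}"
    )

    # Key identifiers must stay non-null and unique.
    assert transformed["employee_id"].notna().all(), "Null employee_id after transform"
    assert transformed["employee_id"].is_unique, "Duplicate employee_id after transform"

    # Aggregates should survive the cleaning within a small tolerance.
    raw_total = raw["salary"].sum()
    new_total = transformed["salary"].sum()
    assert abs(raw_total - new_total) < 1e-6 * max(abs(raw_total), 1), (
        "Salary total changed during preprocessing"
    )
```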
5th V — Value
And the last V goes for value, because at the end of the day the whole point of all this is to get value from data. That includes creating reports and dashboards, finding useful insights that can improve the business, and highlighting critical areas to make more informed decisions.
You may object that those are the 5 Vs of big data, and you would be right. Yes, those are the 5 Vs of big data, but not only. Any data project has to deal with these 5 Vs; a big data project will just be more complicated to handle, while a small data project will be easier to manage across all 5 Vs.
For example, I was working on a data solution for the HR department, and in the beginning we had to address the 5 Vs of the data. Even though we didn't have terabytes of data, we had a lot of small Excel files where the data was previously stored and distributed (volume). There were 3 different sources of data to collect from: Excel files, the corporate DB and the corporate CRM (variety). The data would be updated on a daily basis and users would want the actual data as quickly as possible, with a maximum delay of 30 minutes — it's not even close to real-time, but we still had to make sure that the pipeline executed fast enough (velocity). Data coming from the Excel files would always be altered by a human at some point in time, and there was always a dispute about which update goes first, so we had to deal with that too (veracity). And in order to get value from the data we had to find a way to visualize it and create the possibility for the end user to explore it and draw their own conclusions (value).
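For a sense of what that ingestion step looked like in spirit (not the actual project code), here is a rough sketch of pulling the three kinds of sources together with pandas; the paths, the connection string and the CRM export are invented for the illustration:

```python
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

# 1) Many small Excel files scattered around (the "volume" in our case).
excel_frames = [pd.read_excel(f) for f in Path("hr_exports").glob("*.xlsx")]

# 2) The corporate database (hypothetical connection string).
engine = create_engine("postgresql://user:password@corporate-db/hr")
db_frame = pd.read_sql("SELECT * FROM employees", engine)

# 3) A periodic export from the corporate CRM (hypothetical file dump).
crm_frame = pd.read_json("crm_export.json")

# One combined table to feed the rest of the pipeline.
combined = pd.concat(excel_frames + [db_frame, crm_frame], ignore_index=True)
```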
We invested our time in the beginning to find solutions for every V of our data, and having done that we were able to finish our project just in time — even with lovely documentation.
So even if you are just going to process the Titanic dataset, think of the 5 Vs. It will take you 2 minutes, but you will be ready for the unpredictable, even though you know who's gonna die there XD.
Originally published at https://sergilehkyi.com on August 10, 2020.
Translated from: https://medium.com/swlh/5-essential-data-vs-you-have-to-take-care-of-b4e03e8964c1