Data Science and Machine Learning are hard sports to play. It’s difficult enough to motivate yourself to sit down and learn some maths, let alone to become an expert on the matter.
I began my journey into machine learning with a prediction problem. I was tasked with predicting a variable, but had around 100 other variables I could use. As a fresh graduate, I understandably took this as a regression problem and despite my colleagues being seemingly impressed, in all honesty, my result was pretty bad. I knew I could do better.
From there, I read, I experimented, I read some more, then experimented some more. This led to a bit of a journey where I quit that job, went back into education, then back into industry, and along the way I’ve been lucky enough to work with people who’ve shaped the field of Artificial Intelligence.
In what follows, I present 5 difficulties that Machine Learning practitioners and Data Scientists deal with on a daily basis. I offer sympathy to those who need it!
Difficulty 1: Adapting to the Problem Domain
How many mathematicians study Linguistics? How many mathematicians study Healthcare? So why are we any good at solving problems in these fields?
The art of being a Mathematician comes from the ability to abstract a problem in a manner that makes it solvable. In Linguistics, we can treat each “phone” as a discrete variable and create a model that estimates the joint distribution over phones. In Healthcare, we can build a model that picks up latent features in X-rays that indicate a disease.
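To make the linguistics abstraction a little more concrete, here is a minimal sketch that estimates the joint distribution of adjacent phone pairs by relative frequency. The function name `joint_phone_distribution` and the toy transcription are hypothetical, chosen purely for illustration; a real model would need far more data and some form of smoothing.

```python
from collections import Counter


def joint_phone_distribution(phones):
    """Estimate P(phone_i, phone_i+1) from adjacent pairs by relative frequency."""
    pairs = list(zip(phones, phones[1:]))
    if not pairs:
        return {}
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical toy transcription, treated as a sequence of discrete symbols.
phones = ["k", "ae", "t", "k", "ae", "b"]
dist = joint_phone_distribution(phones)

for pair, prob in sorted(dist.items()):
    print(pair, round(prob, 2))
```

Each phone is just a discrete symbol here, so the "model" is nothing more than a table of pair probabilities that sum to one.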
But dude, it’s tough.
To be a successful machine learning researcher, you have to be really willing to put the time and effort into fully immersing yourself in the domain knowledge. Many of the game-changers in the field have broken ground in areas that they already had experience in. DeepMind’s founder Demis Hassabis ran a games company before returning to UCL to study Neuroscience, which ultimately led to his developments in Reinforcement Learning and his advances in games like Atari and Go.
Not all of us are as fortunate as Demis in having a background in the field that we’re trying to revolutionise. Often a project comes up at work that we have to try to figure out, and next week we may have another task. Project switching has its pros and cons, but ultimately the depth you can go to suffers.
It definitely helps if you know a little bit about your niche before you apply some ML, but for what it’s worth, I sympathise with your struggles.
Difficulty 2: Identifying and Ignoring the Noise
Noise is ever-present in statistics, machine learning and data science. Honestly, it’s everywhere. From dirty data, to rogue data points, to literature built on weak foundations, to models capturing latent bias: noise is literally everywhere.
Machine Learning models are generally trained by minimising the sum of squared errors (or some form of misclassification measure), but when you’re researching a new topic or getting feedback from a colleague, noise can be pretty hard to define, and the last thing you want to do is chase it down a rabbit hole.
There are a few ways to get around it:
- Speak to reliable people often, keep them close
- Learn how to spot nonsense, keep it at a distance
- Fail often, fail fast.
Experiment more, speak to people more, try more things and eventually you’ll begin to recognise and ‘smell’ noise. You’ll avoid it, and progress more quickly.
As an example: many algorithms report high accuracy only because the event they predict happens so infrequently. E.g. a model which predicts how many people in London get struck by lightning on a daily basis will almost certainly be 99.9999% correct without any training, simply by always predicting that nobody is. Spotting the “noise” here means recognising that people don’t get struck by lightning that often, and adjusting your model for it.
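The lightning example can be sketched in a few lines. The numbers below are made up to reproduce the 99.9999% figure (one strike in a million hypothetical person-days), and the helper functions `accuracy` and `recall` are written out by hand for illustration rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positive cases that the model actually caught."""
    caught = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(caught) / len(caught) if caught else 0.0


# Assumed toy data: 1,000,000 person-days with a single lightning strike.
n = 1_000_000
y_true = [0] * (n - 1) + [1]

# An untrained "model" that always predicts "no strike".
y_pred = [0] * n

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.4%}, recall = {rec:.0%}")
```

The accuracy looks superb, yet the model never catches the one event that matters, which is why a metric such as recall is a more honest yardstick when the positive class is this rare.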
Difficulty 3: Getting a Good Education
Education is so important in this field because the domain of knowledge required is so broad. From computer science, to maths, to algorithms, to statistics: there’s a lot to cover in a relatively short amount of time.
Formal education (like University) is one thing, but education in machine learning goes well beyond that. Practitioners have to develop the ability to learn things quickly by themselves and to implement them well.
The reason why this is so important (and so difficult) is that it’s tempting at times to find a GitHub repository where someone else has spent some time solving the same problem you have, pull their code and apply it to your problem. The solution may look OK, but plenty of things can get missed along the way, and there’s no substitute for having the fundamental understanding.
Difficulty 4: Publishing Negative Results
Negative results happen all the time. They’re hard, but they happen. You have to recognise that negative results are also results and that they should be welcomed.
Machine Learning has two sides to it: the theoretical and the applied side. Theorists will publish less frequently with the hope of making a bigger splash and applied academics will tend to publish more often but solve bigger problems.
However, in the pursuit of experimentation or of publishing, a lot of negative results are put to one side and rarely discussed. This leads to other practitioners repeating the same experiments and, in aggregate, a lot of time is wasted. This inefficiency also breeds a form of ego where people are respected only for the ‘positive’ results they’ve discovered, rather than for the results they can confirm to be simply incomplete.
Everyone benefits if we can classify problems better.
Difficulty 5: Keeping on Top of the Research
Did I mention that there’s a lot of it?
Google has published over 340 publications THIS YEAR as of writing this article.
Google don’t mess around either: their research is always very good. And that’s before counting all the other publications and universities in the world. How am I meant to keep on top of all this research?
You kind of…just…have to find a way.
I read a lot and spend most of my day looking out for new approaches and methodologies to solve the problems I’m facing, but at times you can get lost in a swathe of research, or not even find the right articles, because there’s so much research that it’s hard to identify what’s useful.
Using citations is a great way to filter research, and staying on top of the most cited papers every year definitely helps, but to find an ‘edge’ or discover ‘novel’ applications of models, you just have to do the legwork and read as much as you can.
Ultimately, in my opinion, to be a successful Machine Learning Researcher or Data Scientist you need to be able to teach yourself. You just have to find a reason to know how a neural network works or why a Random Forest sucks in some cases, and use this to drive your understanding.
The reason is that it’s such a multi-disciplinary subject, and it moves in leaps and bounds every year. I graduated from my master’s programme in 2016 and since then the whole AI sphere has been reinvented 3 times over.
Thanks for reading! If you have any messages, please let me know!
Keep up to date with my latest articles here!
Original article: https://medium.com/swlh/how-hard-is-it-to-be-a-real-data-scientist-85ab88f451f