Common Interview Questions for Big Data Data Scientists
During my time as a Data Scientist, I had the chance to interview my fair share of candidates for data-related roles. While doing this, I started noticing a pattern: some kinds of (simple) mistakes were overwhelmingly frequent among candidates! In striking disagreement with a famous quote by Tolstoy, it seems to me that “most unhappy mistakes in case studies look alike”.
In my mind, I started picturing the kind of candidate that I would hire in a heartbeat. No, not a Rockstar/Guru/Evangelist with 12 years of professional experience managing Kubernetes clusters and working with Hadoop/Spark, while simultaneously contributing to TensorFlow’s development, obtaining 2 PhDs, and publishing at least 3 Deep Learning papers per year. Nope; I would just instantly be struck by a person who at least does not make the kind of mistakes I am about to describe… And I can imagine the same happening in other companies, with other interviewers.
Although this is a personal and quite opinionated list, I hope these few tips and tricks can be of some help to people at the start of their data science career! I am listing here only the more DS-related things that came to my mind, but of course writing Pythonic, readable, and expressive code is also something that will immensely please whoever is interviewing you!
Sloppy use of Pandas
Let’s face it: for most of your day-to-day tasks as a data scientist you will be manipulating tables, slicing them, grouping them by the values contained in a column, applying transformations to them, and so on. This almost automatically implies that Pandas is one of the most important foundational tools for a data scientist, and if you are able to showcase some mastery with it, well, people will take you quite seriously.
On the contrary, if you systematically do very low-level manipulations on your DataFrames where a built-in Pandas command exists, you will potentially raise all kinds of red flags.
Here are a few tricks to improve with Pandas:
USE IT!
- Whenever you have to do any manipulation of a DataFrame or Series, stop for a couple of minutes and read the docs to check whether there are already built-in methods that can save you 90% of the work. Even if you don’t find them, in the process of reading through the documentation you will learn tons of stuff that will very likely come in handy in the future.
- Read tutorials written by trustworthy people and see how they perform common operations. In particular, Part II of Tom Augspurger’s Modern Pandas tutorial is a good place to start. Even better, read not just Part II but the whole series. Also, this talk by Vincent D. Warmerdam is worth looking at.
- If you have to perform some complicated, maybe not built-in, transformation of your data, consider wrapping it in a function! After you do that, .pipe(...) and .apply(...) are your friends.
- Final tip: do not use inplace=True anywhere. Contrary to popular belief, it generally brings no performance benefit, and it naturally makes you write unclear code, as it hinders your ability to chain methods. Hopefully this feature will be discontinued sometime in the future.
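The last two tips can be sketched together as a single method chain. This is a minimal illustration, not a recipe: the DataFrame, its column names, and the helper add_revenue are made up for the example.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame(
    {
        "city": ["Rome", "Rome", "Oslo", "Oslo", None],
        "price": [10.0, 12.0, 8.0, None, 9.0],
        "quantity": [1, 3, 2, 5, 4],
    }
)


def add_revenue(frame: pd.DataFrame, price_col: str, qty_col: str) -> pd.DataFrame:
    """Custom (non built-in) transformation wrapped in a function."""
    return frame.assign(revenue=frame[price_col] * frame[qty_col])


# Each step returns a new object, so the steps chain without inplace=True.
summary = (
    df.dropna(subset=["city", "price"])
    .pipe(add_revenue, price_col="price", qty_col="quantity")
    .groupby("city", as_index=False)
    .agg(total_revenue=("revenue", "sum"), n_orders=("revenue", "size"))
    .sort_values("total_revenue", ascending=False)
    .reset_index(drop=True)
)

print(summary)
```

Because nothing is modified in place, df still holds the original five rows after the chain has run, which makes it easy to re-run or reorder individual steps.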
Information leaking from the test set
The test set is sacred; while building models or selecting the best one you got so far, it should not even be looked at. Think about it: the reason why we have a test set in the first place is that we want an unbiased estimate of the generalization error of a model. If we are allowed a sneak peek into “the future” (i.e., data that we fundamentally should not have access to during training and model building), it’s almost guaranteed that we will be influenced by it and bias our error estimates.
Although I’ve never seen anybody directly fit a model on the test set, candidates quite commonly performed hyperparameter tuning and model selection by looking at some metric on the test set. Please do not do that; rather, set aside part of the data as a validation set or, even better, perform cross-validation.
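One way to keep the test set out of model selection is to split it off first and tune only with cross-validation on the remaining data. A minimal sketch with scikit-learn follows; the synthetic dataset, the model, and the parameter grid are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real case-study dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Set the test set aside first; it is not touched again until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are chosen by 5-fold cross-validation on the training data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The test set is used exactly once, to report the final estimate.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```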
Another quite common source of leakage from the test set is fitting scalers (like sklearn.preprocessing.StandardScaler) or oversampling routines (e.g., imblearn.over_sampling.SMOTE) on the whole dataset. Again, feature engineering, resampling, and so on are part of how a model is built and trained: keep the test set out of it.
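For preprocessing, a pipeline keeps the fitting of the scaler inside each training fold. The sketch below uses scikit-learn only, and the data and the model are placeholders. Resamplers such as SMOTE need the pipeline class shipped with imbalanced-learn rather than scikit-learn's, which is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Leaky version (avoid): StandardScaler().fit(X) would use test rows
# to compute the mean and standard deviation.

# The scaler is a step of the model, so it is re-fitted on the training
# part of every cross-validation fold and never sees held-out rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the training set; the test set is only transformed and scored.
model.fit(X_train, y_train)
print(cv_scores.mean(), model.score(X_test, y_test))
```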
Flaw of averages
Although summary statistics, like averages, quantiles, and so on, are useful to get a first impression of the data, don’t make the mistake of reducing distributions to a single number when this doesn’t make sense. A classic cautionary example to showcase this is Anscombe’s quartet, but my favorite is the Datasaurus Dozen.
More often than not, the distribution of your data points matters more than their average value, and especially in some applications the shape of the tails of your distributions is what at the end of the day governs decisions.
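As a small illustration of why the average alone can mislead, the sketch below builds two samples with the same expected value but very different tails. The distributions, their parameters, and the "response time" framing are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Two made-up "response time" samples with the same expected value (1.0).
steady = rng.normal(loc=1.0, scale=0.1, size=n)
# Lognormal mean is exp(mean + sigma**2 / 2) = exp(-0.5 + 0.5) = 1.
heavy_tailed = rng.lognormal(mean=-0.5, sigma=1.0, size=n)

for name, sample in [("steady", steady), ("heavy_tailed", heavy_tailed)]:
    q50, q99 = np.quantile(sample, [0.5, 0.99])
    print(f"{name:>12}: mean={sample.mean():.2f} median={q50:.2f} p99={q99:.2f}")
```

The two means are nearly identical, while the 99th percentiles differ by a large factor, so any decision driven by the tail would be badly served by the average.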
If you show that you take these kinds of issues into consideration, and don’t even blink when somebody mentions Jensen’s inequality, only good things can happen.
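Jensen’s inequality says that for a convex function f, E[f(X)] ≥ f(E[X]): plugging the average into a nonlinear function is not the same as averaging the function. A quick numerical check, with an arbitrary convex cost function and distribution chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: demand X is random and cost grows quadratically with it.
demand = rng.exponential(scale=2.0, size=200_000)


def cost(x):
    """A convex function of demand (the square)."""
    return x ** 2


cost_of_average = cost(demand.mean())  # f(E[X])
average_cost = cost(demand).mean()     # E[f(X)]

# For convex f, the average cost is never below the cost of the average.
print(f"f(E[X]) = {cost_of_average:.2f}, E[f(X)] = {average_cost:.2f}")
```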
Blind use of libraries
When you are given a case study, you often have an advantage you can capitalize on: you choose the model(s) to use. That means that you can anticipate some of the questions interviewers might ask you!
For example, if you end up using an XGBClassifier for your task, try to understand how it works, as deeply as you can. Everyone knows it’s based on decision trees, but which other “ingredients” does it need? Do you know how XGBoost handles missing values? Could you explain Bagging and Boosting in layman’s terms?
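To make the bagging versus boosting distinction concrete, here is a deliberately simplified sketch built from scikit-learn decision trees. It is not how XGBoost is implemented (XGBoost adds regularization, second-order gradient information, and its own handling of missing values, among other things). The data, tree depth, number of trees, and learning rate are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

n_trees = 50

# Bagging: independent trees, each fit on a bootstrap resample;
# their predictions are averaged.
bagged_trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    bagged_trees.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
bagging_pred = np.mean([tree.predict(X) for tree in bagged_trees], axis=0)

# Boosting: trees are fit one after another, each on the residuals left by
# the ensemble so far; a learning rate shrinks every correction.
learning_rate = 0.1
boosting_pred = np.full(len(y), y.mean())
for _ in range(n_trees):
    residuals = y - boosting_pred
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    boosting_pred += learning_rate * tree.predict(X)


def mse(pred):
    """Mean squared error on the training data."""
    return float(np.mean((y - pred) ** 2))


print("baseline:", mse(np.full(len(y), y.mean())))
print("bagging: ", mse(bagging_pred))
print("boosting:", mse(boosting_pred))
```

In layman’s terms: bagging asks many independent trees and averages their answers, while boosting lets each new tree correct the mistakes of the ones before it.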
Even if you end up using linear regression, you should have a clear idea about what is happening under the hood, and the meaning behind the parameters you set. If you say “I set the learning rate to X”, and somebody follows with “What’s a learning rate?”, it’s quite bad if you cannot at least spend a few words on it.
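As an illustration of what a learning rate is, here is a small gradient descent for linear regression written with NumPy. Plain linear regression is often solved in closed form, so this only applies when the model is fit iteratively. The data, step sizes, and number of steps are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(scale=0.1, size=200)


def fit_linear_gd(X, y, learning_rate, n_steps=100):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        error = w * X[:, 0] + b - y
        grad_w = 2.0 * np.mean(error * X[:, 0])
        grad_b = 2.0 * np.mean(error)
        # The learning rate scales how far each step moves along the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


for lr in (0.001, 0.1):
    w, b = fit_linear_gd(X, y, learning_rate=lr)
    print(f"learning_rate={lr}: w={w:.3f}, b={b:.3f}")
```

With the same number of steps, the smaller learning rate is still far from the true slope of 3, while the larger one has converged: that trade-off between speed and stability is the few words an interviewer is hoping to hear.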
Poor visualization choices
Choosing the correct options for your plots goes a long way too. Ultimately, I think the most common mistakes here are due to poor choice of normalization or not using the correct scales for the axes.
Let’s look at an example: a snippet of code that just creates two arrays, a and b, with samples from an exponential distribution, and then plots them with a call to plt.hist([a,b]).
I have seen some variation of this an enormous number of times. Basically, what we would really like to do is compare the distribution of something between two groups, but in this plot we are only showing raw counts of observed values. If one of the groups has more samples than the other, a plot like this tells us nothing about the underlying distributions. A better choice is to normalize what we are displaying in a sensible way: in this case, just setting the parameter density=True rescales the raw counts so that each histogram integrates to 1, which makes the two groups directly comparable.
Nice! Now we can explicitly see that, after all, a and b are samples from the same distribution. There is still something that I dislike here: a lot of white space, and the fact that for values of a or b larger than 4, I cannot really see any bar clearly. Luckily, since 1614 logarithms are a common mathematical operation… So common that we even have a dedicated keyword argument in plt.hist(...), log=True, that just transforms our linear y-axis to a logarithmic one.
Notice that this is by no means a “perfect” plot: our axes are unlabeled, there is no legend, and it just looks kinda ugly! But hey, at least we can extract insights that we would never have been able to see with just a call to plt.hist([a,b]).
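The three variants of the histogram discussed in this section can be sketched as follows. The sample sizes, the scale of the exponential distribution, the number of bins, and the seed are assumptions, so the figures will not match the article's exactly.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Two groups drawn from the same exponential distribution, with different sizes.
a = rng.exponential(scale=1.0, size=10_000)
b = rng.exponential(scale=1.0, size=1_000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1) Raw counts: the larger group dominates the plot.
axes[0].hist([a, b], bins=20)
axes[0].set_title("raw counts")

# 2) density=True: each group is normalized to integrate to 1.
axes[1].hist([a, b], bins=20, density=True)
axes[1].set_title("density=True")

# 3) log=True: logarithmic y-axis, so the sparse tail becomes visible.
densities, bin_edges, _ = axes[2].hist([a, b], bins=20, density=True, log=True)
axes[2].set_title("density=True, log=True")

for ax in axes:
    ax.set_xlabel("value")
axes[0].set_ylabel("count")
axes[1].set_ylabel("density")
axes[2].set_ylabel("density (log scale)")

fig.tight_layout()
```

Call fig.savefig(...) or plt.show() to view the result. Adding a legend via the label argument of hist would address the remaining complaint about readability.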
Conclusion
What all the mistakes listed above have in common is that they are easily avoidable with some thought and knowledge of the subject, so my advice for your next data science case study is: relax, focus, try to be one step ahead of whatever mind game they’re playing with you, and Google for stuff (a lot!). Interviewing can be stressful, but if both parties are fair (especially the people interviewing and coming up with assignments) it’s almost never wasted time.
Any feedback on this article would be much appreciated; did I miss anything that you think is particularly important?
To conclude, I wish you all the best in your career, whatever job you happen to be doing now! Maybe see you at an interview :-)
Translated from: https://towardsdatascience.com/acing-a-data-science-job-interview-b37e8b68869b