Common Interview Questions for Big Data Data Scientists: Interviewing for a Data Science Job

During my time as a Data Scientist, I had the chance to interview my fair share of candidates for data-related roles. While doing this, I started noticing a pattern: some kinds of (simple) mistakes were overwhelmingly frequent among candidates! In striking disagreement with a famous quote by Tolstoy, it seems to me, “most unhappy mistakes in case studies look alike”.

In my mind, I started picturing the kind of candidate that I would hire in a heartbeat. No, not a Rockstar/Guru/Evangelist with 12 years of professional experience managing Kubernetes clusters and working with Hadoop/Spark, while simultaneously contributing to TensorFlow’s development, obtaining 2 PhDs, and publishing at least 3 Deep Learning papers per year. Nope; I would just instantly be struck by a person who at least does not make the kind of mistakes I am about to describe… And I can imagine the same happening in other companies, with other interviewers.

Although this is a personal and quite opinionated list, I hope these few tips and tricks can be of some help to people at the start of their data science career! I am listing only the more DS-related things that came to mind, but of course writing Pythonic, readable, and expressive code will also immensely please whoever is interviewing you!

Sloppy use of Pandas

Let’s face it: for most of your day-to-day tasks as a data scientist you will be manipulating tables, slicing them, grouping them by the values contained in a column, applying transformations to them, and so on. This almost automatically implies that Pandas is one of the most important foundational tools for a data scientist, and if you are able to showcase some mastery with it, well, people will take you quite seriously.

On the contrary, if you systematically do very low-level manipulations on your DataFrames where a built-in Pandas method exists, you will potentially raise all kinds of red flags.

Here are a few tricks to improve with Pandas:

  1. USE IT!

  2. Whenever you have to do any manipulation of a DataFrame or Series, stop for a couple of minutes and read the docs to check whether there are already built-in methods that can save you 90% of the work. Even if you don’t find them, in the process of reading through the documentation you will learn tons of stuff that will very likely come in handy in the future.

  3. Read tutorials written by trustworthy people and see how they perform common operations. In particular, Part II of Tom Augspurger’s Modern Pandas tutorial is quite a good place to start. Even better, read not just Part II, but the whole series. Also, this talk by Vincent D. Warmerdam is worth looking at.

  4. If you have to perform some complicated, maybe not built-in, transformation of your data, consider wrapping it in a function! After you do that, .pipe(...) and .apply(...) are your friends.

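As a minimal sketch of tip 4, the snippet below wraps a custom transformation in a function and plugs it into a method chain with .pipe(...). The function name add_price_per_unit, the column names, and the toy data are all hypothetical and chosen only for illustration.

```python
import pandas as pd


def add_price_per_unit(df: pd.DataFrame, price_col: str, qty_col: str) -> pd.DataFrame:
    """Return a copy of df with a derived price-per-unit column (hypothetical helper)."""
    return df.assign(price_per_unit=df[price_col] / df[qty_col])


# Hypothetical toy data, only for illustration.
orders = pd.DataFrame(
    {
        "customer": ["a", "a", "b", "b"],
        "total_price": [10.0, 30.0, 8.0, 12.0],
        "quantity": [2, 3, 4, 3],
    }
)

# The custom step sits in the chain next to built-in methods.
summary = (
    orders
    .pipe(add_price_per_unit, price_col="total_price", qty_col="quantity")
    .groupby("customer", as_index=False)
    .agg(mean_price_per_unit=("price_per_unit", "mean"))
)
print(summary)
```

Because the helper takes a DataFrame and returns a new one, it can be tested on its own and reused in other chains.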

Final tip: do not use inplace=True anywhere. Contrary to popular belief, it doesn’t bring any performance bonus and it naturally makes you write unclear code, as it hinders your ability to chain methods. Hopefully this feature will be discontinued sometime in the future.

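To illustrate the point about chaining, here is a small sketch of a cleaning step written without inplace=True. The DataFrame and its column names are made up for the example. Each method returns a new object, so the steps read top to bottom and the original data stays untouched.

```python
import pandas as pd

# Hypothetical raw data with a messy column name and missing values.
raw = pd.DataFrame({"Name ": ["x", "y", None], "Score": [1.0, None, 3.0]})

clean = (
    raw
    .rename(columns=lambda c: c.strip().lower())  # tidy up column names
    .dropna(subset=["name"])                      # drop rows without a name
    .fillna({"score": 0.0})                       # fill missing scores
    .reset_index(drop=True)
)
print(clean)
```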

Information leaking from the test set

The test set is sacred; while building models or selecting the best one you got so far, it should not even be looked at. Think about it: the reason why we have a test set in the first place is that we want to have an unbiased estimate of the generalization error of a model. If we are allowed to get a sneak peek into “the future” (i.e., data that during training and model building fundamentally we should not have access to) it’s almost guaranteed that we will get influenced by that, and bias our error estimates.

Although I’ve never seen anybody directly fit a model on the test set, candidates quite commonly performed hyperparameter tuning and model selection by looking at some metric on the test set. Please do not do that: set aside part of the data as a validation set instead or, even better, perform cross-validation.

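One way this could look with scikit-learn is sketched below, under the assumption of a synthetic dataset and an arbitrary, illustrative parameter grid. The test set is split off first, tuning uses cross-validation on the training data only, and the test set is scored exactly once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split off the test set first and do not touch it during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameters are chosen by 5-fold cross-validation on the training data.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Only now, with the model fixed, estimate the generalization error.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```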

Another quite common thing which causes leakage of information from the test set is fitting scalers (like sklearn.preprocessing.StandardScaler) or oversampling routines (e.g., imblearn.over_sampling.SMOTE) on the whole dataset. Again, feature engineering, resampling, and so on are part of how a model is built and trained: keep the test set out of it.

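A common way to avoid this kind of leakage is to put the preprocessing inside a pipeline, so that it is fit only on the data the model is trained on. The sketch below assumes a synthetic dataset and uses only scikit-learn. For resampling steps such as SMOTE, imbalanced-learn provides its own Pipeline class that plays the same role.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler is part of the model: within each CV split it is refit
# on the training folds only, never on the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit: scaling statistics come from the training set alone.
model.fit(X_train, y_train)
scaler = model.named_steps["standardscaler"]
print(cv_scores.mean(), model.score(X_test, y_test))
```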

Flaw of averages

Although summary statistics, like averages, quantiles, and so on, are useful to get a first impression of the data, don’t make the mistake of reducing distributions to a single number when this doesn’t make sense. A classic cautionary example to showcase this is Anscombe’s quartet, but my favorite is the Datasaurus Dozen.

[Image. Source: Autodesk Research]

More often than not, the distribution of your data points matters more than their average value, and especially in some applications the shape of the tails of your distributions is what at the end of the day governs decisions.

If you show that you take these kinds of issues into consideration, and don’t even blink when somebody mentions Jensen’s inequality, only good things can happen.

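A small numerical sketch of both points, using made-up normal samples and a hypothetical convex cost function: two samples can share (almost) the same mean and still have very different tails, and for a convex function the average of the function is at least the function of the average (Jensen's inequality).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical samples with the same mean but different spread.
narrow = rng.normal(loc=10.0, scale=1.0, size=100_000)
wide = rng.normal(loc=10.0, scale=5.0, size=100_000)


def cost(x):
    """A hypothetical convex cost, e.g. a penalty growing quadratically."""
    return x ** 2


print("means:           ", narrow.mean(), wide.mean())
print("99th percentiles:", np.quantile(narrow, 0.99), np.quantile(wide, 0.99))

# Jensen's inequality for convex cost: E[cost(X)] >= cost(E[X]).
print("cost of the mean:", cost(wide.mean()))
print("mean of the cost:", cost(wide).mean())
```

Plugging the average into the cost function underestimates the expected cost here, and the gap grows with the spread of the distribution.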

Blind use of libraries

When you are given a case study, you often have an advantage you can capitalize on: you choose the model(s) to use. That means that you can anticipate some of the questions interviewers might ask you!

For example, if you end up using an XGBClassifier for your task, try to understand how it works, as deeply as you can. Everyone knows it’s based on decision trees, but which other “ingredients” do you need for it? Do you know how XGBoost handles missing values? Could you explain Bagging and Boosting in layman’s terms?

Even if you end up using linear regression, you should have a clear idea about what is happening under the hood, and the meaning behind the parameters you set. If you say “I set the learning rate to X”, and somebody follows with “What’s a learning rate?”, it’s quite bad if you cannot at least spend a few words on it.

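As an illustration of what "a few words on the learning rate" might be backed by, here is a minimal sketch of linear regression fit by gradient descent. This assumes a gradient-based fit (closed-form least squares has no learning rate); the data, the function name fit_linear_gd, and the step counts are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3 * x + 1 plus a little noise.
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)


def fit_linear_gd(X, y, learning_rate, n_steps=500):
    """Fit intercept and slope by gradient descent on the mean squared error."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        grad = 2.0 / len(y) * Xb.T @ (Xb @ w - y)  # gradient of the MSE
        w -= learning_rate * grad                  # step size = learning rate
    return w


w_good = fit_linear_gd(X, y, learning_rate=0.1)
w_slow = fit_linear_gd(X, y, learning_rate=0.0001)
print("learning_rate=0.1:   ", w_good)
print("learning_rate=0.0001:", w_slow)
```

The learning rate scales each update: with the larger value the estimate reaches the true coefficients within the step budget, while the very small one is still far from them after the same number of steps.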

Poor visualization choices

Choosing the correct options for your plots goes a long way too. Ultimately, I think the most common mistakes here are due to poor choice of normalization or not using the correct scales for the axes.

Let’s look at an example; the following snippet of code

just creates two arrays with samples from an exponential distribution; then, it generates the following plot

[Figure: histogram of a and b, raw counts]

I have seen some variation of this an enormous number of times; basically, what we would really like to do is compare the distribution of something between two groups, but in this plot we are only showing raw counts of observed values. If one of the groups has more samples than the other, a plot like this tells us nothing about the underlying distributions. A better choice would be to normalize what we are displaying in a sensible way: in this case, just setting the parameter density=True rescales the raw counts so that each histogram integrates to 1 (a probability density), and gives us the following:

[Figure: histogram of a and b with density=True]

Nice! Now we can explicitly see that, after all, a and b are samples from the same distribution. There is still something that I dislike here: a lot of white space, and the fact that for values of a or b larger than 4, I cannot really see any bar clearly. Luckily, logarithms have been a common mathematical operation since 1614. So common, in fact, that plt.hist(...) has a dedicated keyword argument (log=True) that switches the linear y-axis to a logarithmic one:

[Figure: histogram of a and b with density=True and a logarithmic y-axis]

Notice that this is by no means a “perfect” plot: the axes are unlabeled, there is no legend, and it just looks kinda ugly! But hey, at least we can extract insights that we would never have been able to see with just a call to plt.hist([a,b]).

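The three plots described above can be sketched roughly as follows. This is not the original snippet: the sample sizes, seed, and bin count are assumptions chosen so that the two groups have very different sizes, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Two samples from the same exponential distribution, with different sizes
# (the sizes are an assumption made for this sketch).
a = rng.exponential(scale=1.0, size=5_000)
b = rng.exponential(scale=1.0, size=500)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Raw counts: the larger group dominates the picture.
axes[0].hist([a, b], bins=30, label=["a", "b"])
axes[0].set_title("raw counts")

# 2. density=True: each histogram is normalized to integrate to 1.
densities, bin_edges, _ = axes[1].hist(
    [a, b], bins=30, density=True, label=["a", "b"]
)
axes[1].set_title("density=True")

# 3. log=True: logarithmic y-axis, so the tail bars become visible.
axes[2].hist([a, b], bins=30, density=True, log=True, label=["a", "b"])
axes[2].set_title("density=True, log=True")

for ax in axes:
    ax.set_xlabel("value")
    ax.legend()
```

Labels and a legend are added here as well, since they cost one line each.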

Conclusion

What all the above-listed mistakes have in common is that they are easily avoidable with some thought and knowledge of the subject, so my advice for your next data science case study is: relax, focus, try to be one step ahead of whatever mind game they’re playing with you, and Google for stuff (a lot!). Interviewing can be stressful, but if both parties are fair (especially people interviewing and coming up with assignments) it’s almost never lost time.

Any feedback on this article would be much appreciated; did I miss anything that you think is particularly important?

To conclude, I wish you all the best in your career, whatever job you happen to be doing now! Maybe see you at an interview :-)

Translated from: https://towardsdatascience.com/acing-a-data-science-job-interview-b37e8b68869b
