Data Science Process: Summary
Once you begin studying data science, you will hear about something called the ‘data science process’. This expression refers to a five-stage process that data scientists usually follow when working on a project. In this post I will walk through each stage and describe what is involved and which technologies are normally used.
1. Data Acquisition
When you are just studying data science, your data may already be given to you by your instructors. You can also find a lot of beautiful datasets on Kaggle.com or Google Dataset Search. In this case data acquisition is pretty simple: just download the dataset and you’re all set to go.
In real life it is a little trickier. To obtain data in the format you need, you will probably be using APIs or web scraping, together with a basic knowledge of HTML. In one of my earlier posts I described how I obtained data about beauty products from Sephora.com using Selenium and BeautifulSoup.
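To make the scraping idea concrete, here is a minimal sketch of pulling structured records out of HTML. It is not the code from the Sephora post: the markup in `SAMPLE_HTML`, the class names `product`, `name`, and `price`, and the helper `parse_products` are all made up for illustration, and it uses Python's built-in `html.parser` instead of BeautifulSoup so that it runs without installing anything.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded product page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Lip Balm</span><span class="price">$8.00</span></li>
  <li class="product"><span class="name">Face Cream</span><span class="price">$24.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per <li class="product">
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self.products.append({})
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_products(html):
    """Return a list of {"name": str, "price": float} records."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": p["name"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
    ]


products = parse_products(SAMPLE_HTML)
print(products)
```

On a real site you would first download the page (or drive a browser with Selenium), and the tags and class names would be whatever that site happens to use.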
Technologies used: HTML, SQL, Selenium, BeautifulSoup.
2. Data Cleaning
Again, if the dataset was given to you by your instructors, or you got it from one of the websites mentioned above, there’s a good chance that your data is already clean. In most cases, however, some cleaning will be required. You need to handle missing values (and be smart about it), make sure that all columns have the correct data types (date-time, integers, floats, strings, etc.), and check that column names don’t contain spaces (especially important if you’re using NLP to perform analysis and modeling). Check out my post Beginner’s guide to data cleaning for more information.
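The three cleaning steps mentioned above can be sketched with Pandas roughly like this. The table and its column names are invented for the example, and filling with the median and dropping rows are just one possible choice; the right way to handle missing values depends on your data.

```python
import pandas as pd

# Made-up raw data with the problems described above:
# spaces in column names, numbers and dates stored as text, missing values.
raw = pd.DataFrame({
    "Product Name": ["A", "B", "C", "D"],
    "Price USD": ["10.5", "20", None, "7.25"],
    "Launch Date": ["2020-01-15", "2020-02-01", "2020-03-10", None],
})

clean = raw.copy()

# 1. Replace spaces in column names with underscores.
clean.columns = [c.strip().lower().replace(" ", "_") for c in clean.columns]

# 2. Convert columns to the correct data types.
clean["price_usd"] = pd.to_numeric(clean["price_usd"], errors="coerce")
clean["launch_date"] = pd.to_datetime(clean["launch_date"], errors="coerce")

# 3. Handle missing values: fill the price with the median,
#    and drop rows that have no launch date.
clean["price_usd"] = clean["price_usd"].fillna(clean["price_usd"].median())
clean = clean.dropna(subset=["launch_date"]).reset_index(drop=True)

print(clean.dtypes)
print(clean)
```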
Technologies used: Pandas, NumPy
3. EDA
EDA stands for Exploratory Data Analysis. At this stage of the process you need to get to know your data. What is the shape of the table? How many rows and columns are there? What are the data types (to make sure you cleaned properly)? How are the numeric values distributed? Is there some sort of correlation or multicollinearity? Is there class imbalance, if you want to perform classification? You need to answer all these questions and more before you get to the next stage. I would just write down all the questions I have and try to answer them one by one. This stage is also very important if you are about to present the results to a non-technical audience. While exploring your data in a meaningful way, you will create beautiful visualizations. And someone with no background in math and coding will respond better to an interactive 3D map than to you saying “My adjusted R² is 0.92!”.
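Most of these questions map onto one-liners in Pandas. Here is a small sketch on a made-up table (the columns `price`, `rating`, and `bestseller` are hypothetical); it covers shape, data types, distributions, correlation, and class balance, and leaves the plotting out.

```python
import pandas as pd

# Made-up data: two numeric features and a binary class label.
df = pd.DataFrame({
    "price": [10.0, 12.0, 15.0, 20.0, 22.0, 30.0],
    "rating": [3.9, 4.0, 4.2, 4.5, 4.4, 4.8],
    "bestseller": [0, 0, 0, 0, 1, 1],
})

# What is the shape of the table? What are the data types?
print(df.shape)
print(df.dtypes)

# How are the numeric values distributed?
print(df.describe())

# Is there correlation between the numeric features?
corr = df[["price", "rating"]].corr()
print(corr)

# Is there class imbalance? Share of rows in each class.
class_share = df["bestseller"].value_counts(normalize=True)
print(class_share)
```

Two features that are very strongly correlated with each other are a hint of multicollinearity, and a class share far from 50/50 is a hint that you will need to handle imbalance before classification.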
Technologies used: Pandas, NumPy, Matplotlib, Seaborn, Plotly (GO and Express)
4. Modeling
This is the most fun part (IMO). After all the preparation, you get to create a machine learning or deep learning model that will make some sort of prediction. This can be a simple linear regression, multiple regression, classification, time series, NLP analysis, or a huge computer vision project with image recognition. Describing how each and every one of these works is beyond the scope of this post, but check out my earlier post about how to talk about regression with babies and I’m-really-bad-at-math people.
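As a taste of the simplest case, here is a sketch of a simple linear regression fitted with NumPy's `polyfit`. The numbers are made up, and holding out the last two points is only a stand-in for a proper train/test split; in a real project you would more likely reach for Scikit-Learn.

```python
import numpy as np

# Made-up data: y is roughly 2 * x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8, 13.1])

# Hold out the last two points to check predictions on unseen data.
x_train, x_test = x[:4], x[4:]
y_train, y_test = y[:4], y[4:]

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Predict the held-out points and measure the mean absolute error.
y_pred = slope * x_test + intercept
mae = np.mean(np.abs(y_pred - y_test))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, MAE={mae:.2f}")
```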
Technologies used: Scikit-Learn, SciPy, NumPy, Keras, TensorFlow, PyTorch, XGBoost, and many, many more (it really depends on what you’re trying to model).
5. Model Interpretation and Applications
The results of your model are probably going to look something like this:
What the heck does this all mean? You can’t just go to the investors and the marketing department and say something like ‘my validation accuracy reached 93% after I handled the class imbalance’ or ‘the proportion of the variance in the dependent variable y explained by the independent variables X is given by an R-squared of 0.75’. You will immediately hear back “English, please!”
The goal of the final stage of the data science process is to learn how to translate back from Math to English. It doesn’t matter how high or low your adjusted R² or validation accuracy is if you can’t explain what it means in real life.
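Here is one way that translation could look in code: compute R² and adjusted R² from their standard formulas, then turn the number into a sentence a non-technical audience can follow. The helper names (`r_squared`, `adjusted_r_squared`, `explain`), the data, and the wording of the sentence are all my own illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R² for the number of features used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def explain(r2, target, features):
    """Turn an R² value into a plain-English sentence."""
    return f"About {r2:.0%} of the variation in {target} can be explained by {features}."


# Made-up actual values and model predictions.
y_true = [10, 20, 30, 40, 50]
y_pred = [12, 18, 33, 37, 50]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)

print(f"R² = {r2:.3f}, adjusted R² = {adj_r2:.3f}")
print(explain(r2, "price", "product rating"))
```

The last line is the one you take to the marketing department; the first one stays in your notebook.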
The results of this whole data science process can be wrapped up in a presentation or they can be used to build a useful web application or some other sort of software. You will need basic knowledge of web development to make it happen, but if I built an app in four days, you certainly can too! Here’s a post about how I did it.
Technologies used: Your knowledge of math for data interpretation, Flask and Dash for creating a front-end.
This is a quick summary of what the data science process looks like. Of course, there’s more to it in real life, but if you’re just learning, it’s a nice structure to stick to. Enjoy your data!
Translated from: https://medium.com/the-innovation/data-science-process-summary-865abd16183d