Data Science Process: Summary

Once you begin studying data science, you will hear about something called the ‘data science process’. The expression refers to a five-stage process that data scientists usually follow when working on a project. In this post I will walk through each stage, describing what is involved and which technologies are normally used.

1. Data Acquisition

While you are still studying data science, your instructors may simply hand you the data. You can also find plenty of good datasets on Kaggle.com or Google Dataset Search. In that case data acquisition is pretty simple: just download the dataset and you’re all set to go.

In real life it is a little trickier. To obtain data in the format you need, you will probably use APIs or web scraping, along with a basic knowledge of HTML. In one of my earlier posts I described how I obtained data about beauty products from Sephora.com using Selenium and BeautifulSoup.

Technologies used: HTML, SQL, Selenium, BeautifulSoup.
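
To make the scraping idea concrete, here is a minimal sketch. It is not the code from the Sephora post: it uses only Python’s built-in `html.parser` instead of Selenium and BeautifulSoup so that it runs without extra installs, and the HTML snippet, class names, and `ProductParser` helper are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a downloaded product page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Lip Balm</span><span class="price">$12.00</span></li>
  <li class="product"><span class="name">Face Cream</span><span class="price">$38.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text inside <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished rows, one dict per <li>
        self._field = None   # field the parser is currently inside, if any
        self._row = {}       # row being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Text outside the spans we care about is ignored.
        if self._field:
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._row:
            self.products.append(self._row)
            self._row = {}


def scrape_products(html):
    """Return a list of {"name": str, "price": float} dicts."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": p["name"], "price": float(p["price"].lstrip("$"))}
        for p in parser.products
    ]


products = scrape_products(SAMPLE_HTML)
print(products)
```

On a real site you would first fetch the page (and check that the site allows scraping), and the tag structure would almost certainly be messier than this.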

2. Data Cleaning

Again, if your instructors gave you the dataset, or you got it from one of the websites mentioned above, there’s a good chance that your data is already clean. In most real cases, however, some cleaning is required. You need to handle missing values (and be smart about it), make sure every column has the correct datatype (date-time, integer, float, string, etc.), and make sure no column name contains spaces (especially important if you’re using NLP to perform analysis and modeling). Check out my post Beginner’s guide to data cleaning for more information.

Technologies used: Pandas, NumPy
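
The three cleaning steps mentioned above (column names, datatypes, missing values) might look roughly like this in Pandas. The table and its column names are invented for the example, and filling the missing price with the median is just one option, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical raw table with the problems described above:
# spaces in column names, numbers and dates stored as text, a missing value.
raw = pd.DataFrame({
    "Product Name": ["Lip Balm", "Face Cream", "Serum"],
    "Price USD": ["12.00", "38.50", None],
    "Launch Date": ["2020-01-15", "2020-03-01", "2020-06-30"],
})


def clean(df):
    df = df.copy()
    # 1. Column names: lowercase, spaces replaced with underscores.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # 2. Datatypes: text -> float and text -> datetime.
    df["price_usd"] = pd.to_numeric(df["price_usd"], errors="coerce")
    df["launch_date"] = pd.to_datetime(df["launch_date"])
    # 3. Missing values: fill the gap with the median price (an assumption;
    #    dropping the row or using a model-based fill are other choices).
    df["price_usd"] = df["price_usd"].fillna(df["price_usd"].median())
    return df


cleaned = clean(raw)
print(cleaned.dtypes)
print(cleaned)
```

Whether to fill, drop, or flag missing values depends on why they are missing, which is the “be smart about it” part.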

3. EDA

EDA stands for Exploratory Data Analysis. At this stage of the process you need to get to know your data. What is the shape of the table? How many rows and columns are there? What are the data types (to make sure you cleaned properly)? How are the numeric values distributed? Is there some sort of correlation/multicollinearity? Is there class imbalance if you want to perform classification? You need to answer all these questions and more before you get to the next stage. I would just write down all the questions I have and try to answer them one by one. This stage is also very important if you are about to present the results to a non-technical audience. While exploring your data in a meaningful way, you will create beautiful visualizations. And someone with no background in math and coding will respond better to an interactive 3D map than to you saying “My adjusted R² is 0.92!”.

[Image: screenshot from one of my project presentations]

Technologies used: Pandas, NumPy, Matplotlib, Seaborn, Plotly (GO and Express)
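
Most of the EDA questions above map onto a single Pandas call each. The sketch below uses a tiny invented table (`price`, `rating`, `bestseller` are assumed column names) and leaves out the plotting step, which you would normally add with Matplotlib, Seaborn, or Plotly.

```python
import pandas as pd

# Hypothetical cleaned table; "bestseller" is the class label.
df = pd.DataFrame({
    "price": [12.0, 38.5, 25.0, 60.0, 18.0, 45.0],
    "rating": [4.1, 4.6, 4.3, 4.8, 4.0, 4.7],
    "bestseller": [0, 1, 0, 1, 0, 0],
})

# What is the shape of the table? How many rows and columns are there?
n_rows, n_cols = df.shape

# What are the data types?
dtypes = df.dtypes

# How are the numeric values distributed?
summary = df[["price", "rating"]].describe()

# Is there correlation between the numeric columns?
correlations = df[["price", "rating"]].corr()

# Is there class imbalance? (share of rows per class)
class_balance = df["bestseller"].value_counts(normalize=True)

print(n_rows, n_cols)
print(dtypes)
print(summary)
print(correlations)
print(class_balance)
```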

4. Modeling

This is the most fun part (IMO). After all the preparation you get to create a machine learning/deep learning model that will make some sort of predictions. This can be a simple linear regression, multiple regression, classification, time series, NLP analysis, or a huge computer vision project with image recognition. Describing how each and every one of these works is beyond the scope of this post, but check out my earlier post about how to talk about regression with babies and I’m-really-bad-at-math people.

Technologies used: Scikit-Learn, SciPy, NumPy, Keras, TensorFlow, PyTorch, XGBoost, and many, many more (really depends on what you’re trying to model).
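
As a taste of the simplest case, here is a sketch of a simple linear regression fitted with plain NumPy. The data is synthetic (generated with a known slope and intercept, both chosen for this example), and in practice you would more likely reach for Scikit-Learn; NumPy is used here only to keep the example short and dependency-light.

```python
import numpy as np

# Synthetic data: y depends linearly on x, plus random noise.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=200)

# Hold out the last 20% of the rows to check the model on unseen data.
split = int(0.8 * len(x))
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Fit y = slope * x + intercept by ordinary least squares.
X_train = np.column_stack([x_train, np.ones_like(x_train)])
(slope, intercept), *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluate on the held-out rows with root mean squared error.
predictions = slope * x_test + intercept
rmse = float(np.sqrt(np.mean((y_test - predictions) ** 2)))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, test RMSE={rmse:.2f}")
```

The fitted slope and intercept should land close to the values used to generate the data, which is a handy sanity check before moving to real datasets.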

5. Model Interpretation and Applications

The results of your model are probably going to look something like this:

[Image: screenshot of my project, binary classification with XGBoost]

What the heck does this all mean? You can’t just go to the investors and the marketing department and say something like ‘my validation accuracy reached 93% after I handled the class imbalance’ or ‘the independent variables X explain 75% of the variance in the dependent variable y (R-squared of 0.75)’. You will immediately hear back “English, please!”.

The goal of the final stage of the data science process is to learn how to translate back from Math to English. It doesn’t matter how high or low your adjusted R² or validation accuracy is if you can’t explain what it means in real life.
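
One way to practice that translation is to compute the metric and the plain-English sentence side by side. The sketch below uses the standard definitions of R² and adjusted R²; the `explain` helper, its wording, and the sample numbers are my own illustration.

```python
def r_squared(y_true, y_pred):
    """R² = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R² for the number of features used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def explain(r2, target, features):
    """Hypothetical helper: turn R² into a sentence for a non-technical audience."""
    return (
        f"About {r2:.0%} of the variation in {target} is accounted for by "
        f"{features}; the rest comes from factors the model does not capture."
    )


# Made-up actual and predicted values.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

r2 = r_squared(y_true, y_pred)
print(f"R² = {r2:.3f}")
print(explain(0.75, "product price", "brand, size, and rating"))
```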

The results of this whole data science process can be wrapped up in a presentation or they can be used to build a useful web application or some other sort of software. You will need basic knowledge of web development to make it happen, but if I built an app in four days, you certainly can too! Here’s a post about how I did it.

Technologies used: Your knowledge of math for data interpretation, Flask and Dash for creating a front-end.

This is a quick summary of what a data science process looks like. Of course, there’s more to it in real life, but if you’re just learning, it’s a nice structure to stick to. Enjoy your data!

Translated from: https://medium.com/the-innovation/data-science-process-summary-865abd16183d
