When Not to Use Data Science

Opinion

Table of Contents

  1. Introduction
  2. Examples
  3. When You Should Use Data Science
  4. Summary

Introduction

Data Science and Machine Learning are useful fields that apply a range of tools to predict, suggest, classify, and ultimately solve common business problems. You can create highly accurate models that automate previously manual tasks. Data Science can be powerful, saving companies money and time. However, you do not necessarily need Data Science to solve every problem you encounter. In some situations human intervention is more important, or the problem does not lend itself to a generalized model.

I will describe five examples of when not to use Data Science. As a Data Scientist, I have slowly learned, through experience, to recognize when Data Science and Machine Learning are not necessary. I hope I can offer some insight and intuition for your own future situations.

Examples

There are many examples of when to use Data Science and when not to. Here are some situations that come to mind where a Data Science model is not necessary and could possibly make things worse:

  • Classifying some health implications

Depending on the severity of an incorrect prediction, using Data Science in some parts of the healthcare field can be extremely costly. An example of a harmless incorrect prediction would be a model classifying a t-shirt as a sweater, or vice versa. Consumers seeing this incorrect suggestion on an e-commerce site would be unfortunate, but it would not result in harm. Now imagine you created a model to classify cancer. If you classify someone as not having cancer when they actually do, the result can be extremely harmful. Perhaps human intervention or human-in-the-loop (a combination of Data Science and human effort) is the best route here, rather than Data Science alone. A good rule of thumb is to ask:

If this model prediction is incorrect, what will be the consequences?


However, Data Science, Machine Learning, and AI are constantly evolving, and you can expect new technologies and improvements in model accuracy to emerge quickly.
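
The rule of thumb above can be made concrete with a little arithmetic. The following is a minimal sketch, not a clinical or business method: the function name, the error counts, and the per-error costs are all hypothetical, chosen only to contrast a low-stakes classifier with a high-stakes one.

```python
def total_error_cost(false_negatives, false_positives, fn_cost, fp_cost):
    """Rough total cost of a classifier's mistakes, in arbitrary cost units."""
    return false_negatives * fn_cost + false_positives * fp_cost


# Hypothetical numbers: both models make the same number of mistakes,
# but each mistake carries a very different consequence.
tshirt_cost = total_error_cost(
    false_negatives=50, false_positives=50, fn_cost=1, fp_cost=1
)
screening_cost = total_error_cost(
    false_negatives=50, false_positives=50, fn_cost=10_000, fp_cost=100
)

print(tshirt_cost, screening_cost)
```

With identical error counts, the second total is dominated by the missed diagnoses. That is the kind of situation where human review may be worth more than a small gain in accuracy.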

  • When you don’t have enough data

This situation is more common. When you build a model, you want to make sure you have sufficient data. Bad data in means a bad model out, and the same can be said of too little data. The model may even seem to perform well yet fail to generalize to new situations. You could be overfitting, or simply not exposing the model to enough training instances. Before you spend development time and resources on a model, check whether you have enough data.
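
One way to see a model that "seems to perform well" without generalizing is to compare its accuracy on the data it was trained on with its accuracy on held-out data. The sketch below is a toy illustration under stated assumptions: the data is made up, and the 1-nearest-neighbor classifier is chosen only because it memorizes its training set. It is not a recommendation for any particular model.

```python
def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in data whose label the 1-NN model gets right."""
    correct = sum(1 for x, y in data if predict_1nn(train, x) == y)
    return correct / len(data)


# Made-up (value, label) pairs; the point (3.0, 1) acts as a noisy label.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1), (6.0, 1)]
validation = [(1.4, 0), (2.8, 0), (3.2, 0), (4.6, 1), (5.5, 1)]

train_acc = accuracy(train, train)            # the model has memorized these
validation_acc = accuracy(train, validation)  # unseen points
gap = train_acc - validation_acc

print(train_acc, validation_acc, gap)
```

A large gap between the two numbers on a small dataset is a hint that more data, or a simpler approach, may be needed before investing further.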

  • When it’s a one-off task

This example is a little more dependent on the specific situation. A non-technical stakeholder or leader in your company may ask you to build a Data Science model, and you should perhaps ask yourself whether Data Science is necessary. If you are not outputting results daily, weekly, or even monthly, you may not want to spend the time creating a complex model that includes scheduled ingestion of new data.

You could apply similar skills to answer the business problem and suggest to the stakeholder that, since you only need a single output CSV file, for example, you can answer the question with a simple Python function. (You may not need to go into depth with your stakeholders about why you are not using a Data Science model, as some stakeholders just want a result and do not care how you got it.) You may just need a small function that manually mimics the ideas of a Data Science model. If you know the situation well, you could create bins or weights yourself, apply them to features or columns, and come up with your own score. Here is an example of what I am describing:

Example: 0.50 * (feature_1) + 0.20 * (feature_2) + 0.30 * (feature_3) = score (scaled)

While this might not be the most ‘accurate’ approach, if you need a quick way to organize data, a function like this or something similar could be sufficient.
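
The hand-weighted score described above could be sketched in plain Python as follows. This is one possible reading, not the author's code. It assumes that "scaled" means each feature is min-max scaled to the range 0 to 1 before weighting. The helper names and the sample rows are hypothetical, and only the weights 0.50, 0.20, and 0.30 come from the example.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Weights taken from the example above; they sum to 1.
WEIGHTS = {"feature_1": 0.50, "feature_2": 0.20, "feature_3": 0.30}


def weighted_scores(rows, weights=WEIGHTS):
    """Return one hand-weighted score per row (each row is a dict of features)."""
    scaled = {
        name: min_max_scale([row[name] for row in rows]) for name in weights
    }
    return [
        sum(weights[name] * scaled[name][i] for name in weights)
        for i in range(len(rows))
    ]


# Hypothetical rows, standing in for the columns of a small dataset.
rows = [
    {"feature_1": 0, "feature_2": 0, "feature_3": 0},
    {"feature_1": 10, "feature_2": 5, "feature_3": 2},
    {"feature_1": 5, "feature_2": 5, "feature_3": 1},
]
scores = weighted_scores(rows)
print(scores)
```

Because the weights sum to 1 and every feature is scaled to 0 to 1, the scores also fall between 0 and 1, which makes them easy to sort or bin before writing a one-off CSV.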

  • When you don’t have labeled data

Sometimes you want to classify thousands of observations, but too much of your dataset is unlabeled. There are ways around this problem, such as labeling software or unsupervised techniques that create new labels. However, if labeling by human effort or through software services takes up too much time and money, you may want to reassess the situation. Perhaps you need to do more data engineering first, such as accessing an API, before you implement a Data Science model.

  • When your budget is tight

Depending on how much data you are ingesting and predicting on, training a model can be expensive. Your company may not have enough resources yet, and an expensive Data Science model may not be feasible.

This point goes along with ‘when you do not have enough time’ as well. You may have a deadline fast approaching, and methods other than Data Science, such as Python functions and rules, can be beneficial.

When You Should Use Data Science

There are countless situations in which you should use Data Science and Machine Learning. Essentially, you could flip the above examples, or consider whether your problem is an unsupervised, supervised, or time-series one, among others.

You can also take the above examples and combine Data Science techniques with manual processes. Human-in-the-loop is becoming more common as a good bridge between the two practices.
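
A common way to combine a model with a manual process is to accept only the predictions the model is confident about and route the rest to a person. The sketch below assumes the model exposes a confidence value between 0 and 1. The function name, the queue labels, and the 0.90 threshold are hypothetical and would need tuning for a real use case.

```python
def route_prediction(label, confidence, threshold=0.90):
    """Accept confident predictions; send the rest to a human reviewer."""
    if confidence >= threshold:
        return ("auto", label)
    return ("human_review", label)


# Hypothetical model outputs as (predicted label, confidence) pairs.
predictions = [("sweater", 0.97), ("t-shirt", 0.62), ("sweater", 0.90)]
routed = [route_prediction(label, conf) for label, conf in predictions]
print(routed)
```

Raising the threshold sends more cases to people and fewer to automation, so it can be set according to how costly a wrong prediction would be.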

Some specific examples of when to use Data Science include, but are not limited to:


  • Recommending movies to users
  • Forecasting sales for a company
  • Analyzing sentiment of reviews
  • Predicting temperature for a given month
  • Etc.

The examples of ‘when not to use Data Science’ are not meant to discourage you from using Data Science, but to stress that ‘just because you can does not mean you should’. Ultimately, it depends on your specific situation and what the output will affect. Each example could therefore be argued to be a valid use case for Data Science, given the right environment.

Summary

Photo by Andreas Klassen on Unsplash [2].

There are caveats to all of these examples, and you may end up using Data Science in these situations after all. Data Science is evolving and new facets are emerging. Keep in mind that this article is opinion-oriented and these points or examples can change quickly. Feel free to comment below on when you think you should or should not use Data Science in a given situation. To summarize, here are the five examples of when you should not use Data Science:

  • Classifying some health implications
  • When you don’t have enough data
  • When it’s a one-off task
  • When you don’t have labeled data
  • When your budget is tight

I hope you enjoyed my article. Thank you for reading!


Translated from: https://towardsdatascience.com/when-not-to-use-data-science-f2e42a3a77d3
