Data Science or Computer Science: Data Science 101

What is data science?

If you have just woken up from a 10-year coma and have no idea what data science is, don’t worry, there’s still time. Many years ago, statisticians had some pretty good ideas for analysing data and getting insights from it, but they lacked the computational power to carry them out, so their hands were tied. Then one day computers caught up and made all their dreams come true. All of a sudden, we not only had more data available than ever before, but also machines powerful enough to run heavy calculations on it, allowing statisticians to try out all these new algorithms. Data science is the hip daughter born of this marriage between statistics and computer science. In other words, it is the science of extracting useful patterns from data sets using computing power.

What is it used for?

One of the reasons data science is so popular nowadays is the growing number of applications that keep emerging.

Marketing and sales

A typical use case for data science in marketing is product recommendation. When you look at a product on Amazon and the site suggests another product you might like, there is an algorithm behind that recommendation. It predicts that you will like those products based on what other customers who viewed the same product actually bought.
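To make the idea concrete, here is a minimal sketch of a "customers who bought this also bought" recommender. It is not Amazon's actual algorithm: it simply counts how often two products show up in the same order, and the `baskets` data and the function names are made up for illustration.

```python
from collections import Counter
from itertools import permutations


def build_co_purchase_counts(baskets):
    """Count how often each pair of products appears in the same order."""
    counts = {}
    for basket in baskets:
        # set() so a product bought twice in one order is counted once
        for product, other in permutations(set(basket), 2):
            counts.setdefault(product, Counter())[other] += 1
    return counts


def recommend(counts, product, top_n=2):
    """Return the products most often bought together with `product`."""
    ranked = sorted(
        counts.get(product, Counter()).items(),
        key=lambda item: (-item[1], item[0]),  # most frequent first, ties by name
    )
    return [name for name, _ in ranked[:top_n]]


# Hypothetical purchase histories, one list per order.
baskets = [
    ["tent", "sleeping bag", "headlamp"],
    ["tent", "sleeping bag"],
    ["tent", "stove"],
    ["headlamp", "batteries"],
]

counts = build_co_purchase_counts(baskets)
print(recommend(counts, "tent"))
```

Real recommenders also use views, ratings and many more signals, but the counting step above is the core intuition.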

Finance

The most common way banks use data science is credit risk analysis. Back in the day, when someone asked for a loan, the banker usually took a good look at their financial record to decide whether to grant it. Nowadays, sophisticated statistical models that are constantly updated give a good estimate of the probability of default, making the whole process a lot faster and more reliable.
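As a rough illustration of what "estimated probability of default" means, the sketch below uses a logistic model, one common choice for this kind of task (the article does not say which models banks actually use). The two features, the weights and the bias are hand-picked assumptions, not values fitted on real data.

```python
import math

# Hypothetical coefficients, chosen by hand for illustration only.
# A real model would learn them from historical loan data.
WEIGHTS = {"debt_to_income": 3.0, "late_payments": 0.8}
BIAS = -4.0


def default_probability(debt_to_income, late_payments):
    """Logistic model: map a weighted sum of features to a value in (0, 1)."""
    score = (
        BIAS
        + WEIGHTS["debt_to_income"] * debt_to_income
        + WEIGHTS["late_payments"] * late_payments
    )
    return 1.0 / (1.0 + math.exp(-score))


careful = default_probability(debt_to_income=0.2, late_payments=0)
stretched = default_probability(debt_to_income=0.9, late_payments=3)
print(f"careful borrower:   {careful:.2f}")
print(f"stretched borrower: {stretched:.2f}")
```

The bank can then compare the estimated probability with a threshold of its choosing to accept, reject or price the loan.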

Healthcare

Healthcare is one of the most promising industries when it comes to data science. Connected wearables such as smartwatches generate a lot of data, including calories burned, miles walked and heart rate. One possible application is tracking variables that can help explain some diseases, or even reminding you to see a doctor if you show a behavior that might indicate a health issue.

What questions does it answer?

We can split data science tasks into two main groups: supervised and unsupervised learning.


Supervised learning

Supervised learning comprises all tasks for which we have a target variable, that is, some feature in our data that we already know we want to predict. Examples include explaining house prices based on their characteristics (such as the number of rooms and floors) or predicting the likelihood that a customer will stop using our services.
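Here is a small sketch of the house price example, assuming the simplest possible setup: one feature (number of rooms), one target (price) and a straight line fitted by ordinary least squares. The numbers are invented, and a real project would use many more features and a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: the target variable (price) is known for each house.
rooms = [2, 3, 4, 5]
prices = [150, 200, 250, 300]  # in thousands

slope, intercept = fit_line(rooms, prices)

# Use the fitted line to predict the price of a house we have not seen.
predicted_price = slope * 6 + intercept
print(f"price = {slope:.1f} * rooms + {intercept:.1f}")
print(f"predicted price for 6 rooms: {predicted_price:.1f}")
```

What makes this "supervised" is that `prices` is known for the training examples, so the model can be checked against the right answers.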

Unsupervised learning

These are the tasks we turn to when we are not sure what question we are asking. A typical case is clustering, where we just want to find patterns in the data that are not necessarily related to one specific variable (customer segmentation, for instance).
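For customer segmentation, a common starting point is k-means clustering. The sketch below is a stripped-down, one-dimensional version under simplifying assumptions: customers are described only by a made-up yearly spend, and the starting centroids are passed in by hand instead of being chosen at random.

```python
def k_means_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest centroid, then
    move each centroid to the mean of its cluster, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical yearly spend per customer; note there is no target variable.
yearly_spend = [20, 25, 30, 200, 220, 210]

centroids, clusters = k_means_1d(yearly_spend, centroids=[20, 220])
print(centroids)
print(clusters)
```

Nobody told the algorithm which customers are "small" or "big" spenders; the two groups come out of the data itself.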

Who does it?

Besides the knowledge required in statistics and computer science, data science also calls for business awareness: no matter how good your algorithms are, they will be useless if they are not applicable to the business domain. People who work with data usually fall into three categories, depending on which of those three areas of expertise they focus on most:

Data analyst

Sometimes also called a business analyst, this person knows how to talk to people who don’t work directly with data. They are usually in charge of translating business needs into data requirements (and data insights into business recommendations). They have an overall understanding of the main data science algorithms, and usually have really good data visualization skills.

Data engineer

This is the person who makes sure the data is collected from all its sources and integrated almost seamlessly into the company’s tech environment, and that all the algorithms developed run well and fast. They almost always come from a tech background, and sometimes have to create dedicated tools to display the data processes, especially if these are to be shared with other stakeholders in the company.

Data scientist

As you can guess from the name, this person has a deeper understanding of how most algorithms work and which ones are best for each situation. They probably know more about statistics than the data analyst and the data engineer, but less about the ins and outs of the business or of putting processes into production. Some companies prefer to hire PhDs for this position, but that is not always the case.

Where is it going?

In the next few years, we will see much progress in many different domains. By using data, cities will be able to better manage their traffic, their energy consumption and even the allocation of their police units. With wearables, we’ll be able to exercise, eat and sleep better. And there might be many other possibilities that we haven’t even thought of.

However, we will also find out that not everything can be improved with data, and we will soon discover where the limit lies. Every human activity and natural phenomenon has an important random component that no machine learning algorithm will ever capture, no matter how sophisticated it is.

This data-driven culture might also cause some important behavioural changes. People are starting to realize how much of their personal lives is being tracked by big companies and the government, and most do not seem to enjoy it. This might lead people to voluntarily downgrade their tech devices, use tools to prevent data collection, and even reduce their overall technology usage. Governments are already aware of these concerns, and privacy regulation is getting stricter all over the world. Let’s see in the years to come how this will shape society (the Black Mirror series offers interesting insights into these possibilities).

How to do it?

If you want to learn more about it, I recommend the MIT Press Essential Knowledge series book “Data Science”, by John D. Kelleher and Brendan Tierney. It is a very good introduction to the subject that avoids getting too technical and helps you see whether data science is really for you.

Next in line is “Data Science for Business” by Foster Provost and Tom Fawcett. This one is more focused on business applications and goes deeper into the details of the algorithms. It will give you a really good grasp of all the possibilities enabled by data-driven decision making.

Then, once you’ve got the basics covered, it’s time to study for real: you will almost certainly need to learn to code (if you don’t know how already). The main languages you should focus on are SQL and either R or Python. SQL is used to query databases and extract the data you need in the right shape. The other two are used for applying the algorithms and creating plots. R was created with a focus on statistics, whereas Python is a more general-purpose programming language. To start with, just choose one of the two to concentrate your efforts on and, if needed, learn the other one later.
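To show how the two languages divide the work, here is a small sketch using Python’s built-in `sqlite3` module: SQL pulls the data out of a database in the right shape, and Python does the follow-up calculation. The `orders` table and its contents are invented for the example, and the database lives only in memory.

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# SQL: extract the data in the shape we need (one row per customer).
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

# Python: carry on with the analysis on the extracted rows.
totals = dict(rows)
average_total = sum(totals.values()) / len(totals)
print(rows)
print(f"average spend per customer: {average_total:.2f}")
```

In a real job the database would be a company data warehouse rather than an in-memory toy, but the split between querying and analysing stays the same.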

A good way to start practicing your skills is Kaggle.com, where you can play with toy datasets and take part in real competitions. It will help you put your knowledge to the test and also build a portfolio of your own. However, keep in mind that eventually you will need to work with real-life cases, which are a different beast.

Conclusion

Now that you know some of the data science lingo, you can go out there and do your own research. The available resources are pretty much endless, and new information comes out every day, so make sure you stay up to date on new methods and possibilities.

Original article: https://towardsdatascience.com/data-science-101-99e34bea86c
