How to Become a Data Scientist: What You Need to Know

Data science is an emerging field that extracts useful trends and insights from both structured and unstructured data. It is an interdisciplinary field that uses scientific methods, algorithms, and graphs to uncover patterns within the chaos and to build useful things from those patterns.

As a data scientist, you need to know some basic mathematics and programming, and you need a keen eye for spotting patterns and trends. Because the field is interdisciplinary, data scientists find themselves working on many different aspects of technology.

Before we get into what you need to become a data scientist, let’s first talk about what a job in data science entails.

What do data scientists do?

Working in data science is similar to riding a roller-coaster. Some aspects of the job are slow and steady, while others are fast and crazy. Other parts feel like going around a loop: you repeat the same things over and over again.

Whenever a data scientist starts a new project, they will go through a known set of steps to get to their final conclusion.

Any data science project starts with data and ends with data, and in between, the magic happens.

If you look around the internet, you will find many articles that break a data science project into different numbers of steps. However, regardless of the number of steps, the core aspects are the same. For me, any data science project goes through six main steps.

Step №1: Understand the data background.

Whenever we start a data science project, we are usually aiming to solve a problem, enhance performance, or predict future trends. To do any of that, we first need to grasp the history of the source of the data and how it’s produced.

Step №2: Collect data.

Once we understand the background of the data, we need to collect it so we can start working on it. Depending on the nature of the project, there are different ways to gather data. We can get it from a database, from an API, or, if you’re a beginner or just practicing your skills, from an open data source. Another option is to scrape the web for publicly available information.
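
As a rough illustration of the database route, here is a minimal sketch using Python's built-in `sqlite3` module. The `measurements` table, its columns, the sample rows, and the `load_measurements` helper are all hypothetical and exist only for this example; a real project would connect to its own database instead of an in-memory one.

```python
import sqlite3


def load_measurements(conn, min_value):
    """Return (city, value) rows whose value is at least min_value."""
    query = (
        "SELECT city, value FROM measurements "
        "WHERE value >= ? ORDER BY value"
    )
    # A parameterized query keeps user input out of the SQL string.
    return conn.execute(query, (min_value,)).fetchall()


# Build a throwaway in-memory database so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("Oslo", 3.5), ("Cairo", 29.0), ("Lima", 18.2)],
)
conn.commit()

rows = load_measurements(conn, 10)
print(rows)
```

The same pattern (connect, query with parameters, fetch rows) carries over to other databases, although the connection call differs from driver to driver.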

Step №3: Clean and transform the data.

Most of the time, if not always, the data we collect from the source is raw and unprocessed. Data in that state is not suitable for algorithms or for the later steps. So, the first thing we do when we get new data is clean it up, categorize and tag it, and make sense of it.
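
To make the cleaning step concrete, here is a small sketch in plain Python. The record layout (a name and an age) and the `clean_records` helper are assumptions made up for this example; the cleaning rules that matter in practice depend entirely on the dataset.

```python
def clean_records(records):
    """Trim whitespace, normalize case, drop incomplete rows and duplicates."""
    cleaned = []
    seen = set()
    for name, age in records:
        if name is None or age is None:
            continue  # drop rows with missing fields
        name = name.strip().title()
        try:
            age = int(age)  # int() tolerates surrounding whitespace
        except ValueError:
            continue  # drop rows whose age is not a number
        key = (name, age)
        if key not in seen:  # keep only the first copy of each record
            seen.add(key)
            cleaned.append(key)
    return cleaned


raw = [
    ("  alice ", "34"),
    ("BOB", "n/a"),
    ("Alice", 34),
    (None, "20"),
    ("carol", "41 "),
]
cleaned = clean_records(raw)
print(cleaned)
```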

Step №4: Analyze and explore the data.

Once our data is clean and structured, we can start analyzing it and looking for patterns. A common way to do this is to visualize the data and look for repetitions or spikes.
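
Spikes can also be flagged numerically before plotting anything. The sketch below marks values that sit more than a chosen number of standard deviations from the mean. The `find_spikes` helper, the threshold of 2.0, and the `daily_visits` numbers are illustrative assumptions, not a recommended rule for every dataset.

```python
from statistics import mean, stdev


def find_spikes(values, threshold=2.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [
        i for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Made-up daily visit counts with one obvious outlier.
daily_visits = [100, 102, 98, 101, 99, 250, 100, 103, 97, 100]
spikes = find_spikes(daily_visits)
print(spikes)
```

A simple rule like this is sensitive to the outliers themselves, since they inflate both the mean and the standard deviation, so it is best treated as a first pass before a closer look.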

Step №5: Model the data.

We finally reach the magical step! After we explore and analyze our data, it’s time to feed it into a machine learning algorithm and use the resulting model to predict future outcomes. This is truly the power of data science.
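
As a toy version of "fit a model, then predict", the sketch below fits a straight line by ordinary least squares in plain Python. The `fit_line` and `predict` helpers and the `months`/`sales` numbers are invented for illustration; in practice a library such as scikit-learn would usually handle this step.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


# Made-up monthly sales that grow by 2 units per month.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(months, sales)
forecast = predict(model, 6)
print(model, forecast)
```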

Step №6: Visualize and communicate results.

Finally, the most crucial step of the process is to visualize and present the results of the project effectively.

Once those steps are done, a new project comes in, and it’s time to start all over again.

What skills are needed for data science?

Every step of the data project lifecycle requires a specific set of knowledge and skills. To make the connection clearer, I will pair each phase of the project with the skills needed to complete it.

  • To perform data investigation, you only need a curious mind, a pen, and some paper. Sit down and either ask whoever provides the data some questions to understand it better or, if it is an open dataset, read the documentation that accompanies it.

  • To perform data collection, you will need to know how to communicate with databases and APIs. Understanding the basic structure and mechanics of these technologies will make your data collection a breeze. If you’re using an open dataset, knowing how to search for datasets and where the good sources are can make a huge difference.

  • To perform data cleaning, you need a good grasp of basic data mining and cleaning techniques. You will need to tag your data and categorize it properly. Moreover, you can use regular expressions to look for misspellings, or use special tools built to make this process easier.

  • To perform data exploration, you will need some basic statistics and probability theory. Some knowledge of data visualization and experimental design can help you a lot at this stage.

  • To perform data modeling, you will need to know a few machine learning algorithms and how they work. You don’t need to understand everything 100%; if you can use them correctly and apply them to the correct form of data, you will be fine.

  • Finally, to perform data communication, you can apply some science communication 101: know your audience and their background knowledge, and choose simple words to explain complex concepts. Additionally, effective data visualization can make or break your project at this stage.
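
The regular-expression idea from the data cleaning skill above can be sketched as follows. The pattern, the `normalize_language` helper, and the sample answers are hypothetical; they only cover a few misspellings of "Python" that might appear in a free-text survey field.

```python
import re

# Hypothetical pattern for a handful of misspellings of "Python":
# "y" or "i" as the first vowel, an optional "h", and "o" or "a" before the "n".
PYTHON_VARIANTS = re.compile(r"\bp[yi]th?[oa]n\b", re.IGNORECASE)


def normalize_language(answer):
    """Replace known misspellings of 'Python' with the canonical spelling."""
    return PYTHON_VARIANTS.sub("Python", answer.strip())


answers = ["pyton", "Pithon ", "PYTHON", "R", "python and R"]
normalized = [normalize_language(a) for a in answers]
print(normalized)
```

A hand-written pattern like this only catches the variants it was written for, so it works best alongside a quick look at the distinct values actually present in the column.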

Technical tools

Some of the skills we just talked about require a programming language, an algorithm, or special packages.

  • Programming languages: Python, R.

  • For handling and creating databases: MySQL, PostgreSQL, MongoDB, or SQLite in Python. If you’re using R, then you can use RMySQL.

  • Packages for data exploration and transformation: pandas, NumPy, or SciPy in Python; ggplot2 and dplyr in R.

  • Python libraries for visualizations: Matplotlib, Plotly, Pygal.

  • Basic machine learning packages: scikit-learn in Python and caret in R.

Conclusion

You don’t need to know everything about statistics, math, and machine learning, or be a professional programmer, to get started with data science. You only need the basics of each. As you work on different projects and build your profile, your knowledge base will expand, and your “data science sense” will improve with practice.

So, don’t be intimidated by the field, or by how many things you need to “master” to be a good data scientist. Just start with the basics and work your way through to the advanced topics. Be patient and give it your all, and you will get there.

Translated from: https://towardsdatascience.com/what-do-you-need-to-know-to-become-a-data-scientist-1ed52e0e1ad
