Data Scientists and Data Analysts
According to a recent survey conducted by Dimensional Research, only 50 percent of data analysts’ time is actually spent analyzing data. What’s the other half spent on? Data cleanup: the tedious and repetitive work that must be done before you can dig into the fancy data science stuff. I’m talking about deduplication, fuzzy matching, and replacing invalid characters; basically, all the data wrangling and munging you need to do to make the data easier to understand and work with.
Typically, data manipulation is accomplished in one of two ways, each of which has pros and cons. The first method relies primarily on SQL, which is great for doing the joins, unions, and deduplications that are the bread and butter of data cleanup. For those specific actions that SQL is unable to perform, such as extracting word counts from unstructured text, you simply embed user-defined functions (UDFs) written in a general-purpose programming language, usually Python.
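As a rough illustration of the SQL-first pattern, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a real warehouse. The `reviews` table, its sample rows, and the `word_count` function are hypothetical names invented for this example; an MPP system would register UDFs through its own mechanism rather than `create_function`.

```python
import sqlite3


def word_count(text):
    """Count whitespace-separated words; treat NULL as zero words."""
    if text is None:
        return 0
    return len(text.split())


conn = sqlite3.connect(":memory:")

# Expose the Python function to SQL under the name word_count (1 argument).
conn.create_function("word_count", 1, word_count)

# Hypothetical sample data, including one exact duplicate row and one NULL.
conn.executescript("""
    CREATE TABLE reviews (id INTEGER, body TEXT);
    INSERT INTO reviews VALUES
        (1, 'great product fast shipping'),
        (1, 'great product fast shipping'),
        (2, 'would not buy again'),
        (3, NULL);
""")

# SQL handles the deduplication; the UDF handles the text processing.
rows = conn.execute("""
    SELECT DISTINCT id, word_count(body) AS n_words
    FROM reviews
    ORDER BY id
""").fetchall()

print(rows)  # [(1, 4), (2, 4), (3, 0)]
```

The split of responsibilities is the point here: the query stays declarative, and Python appears only for the one step SQL cannot express.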
The second approach uses a general-purpose programming language, such as Python or Scala, as the “point of entry” for working with data. Operations that you would do in SQL, like joins, are provided by a data frame library like pandas. Many data scientists naturally gravitate to this approach because they have more experience with Python or Scala, and they view SQL as a lesser tool meant primarily for business analysts. However, they are missing out on some big benefits of the SQL-first approach:
- The most common data-cleanup operations produce simpler code in SQL. Simpler code is easier for others to understand and makes it harder for you to make mistakes;
- SQL is ubiquitous among data analysts, so it’s easier to share code with analysts;
- It’s easier to hire for SQL expertise than for Python or Scala expertise.
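To give a sense of what the first point can look like in practice, the sketch below expresses a typical cleanup (keep only the most recent record per customer, drop exact duplicates, then join to another table and aggregate) as a single SQL statement. It again uses the built-in `sqlite3` module as a stand-in for a warehouse, and the `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical sample data: customer 1 has an outdated record,
# customer 2 has an exact duplicate row.
db.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT, updated_at TEXT);
    INSERT INTO customers VALUES
        (1, 'a@example.com',     '2020-01-01'),
        (1, 'a.new@example.com', '2020-03-01'),
        (2, 'b@example.com',     '2020-02-01'),
        (2, 'b@example.com',     '2020-02-01');

    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
""")

result = db.execute("""
    WITH latest AS (
        -- Keep the newest record per customer; DISTINCT drops exact duplicates.
        SELECT DISTINCT c.id, c.email
        FROM customers AS c
        WHERE c.updated_at = (
            SELECT MAX(c2.updated_at) FROM customers AS c2 WHERE c2.id = c.id
        )
    )
    SELECT l.id, l.email, SUM(o.amount) AS total
    FROM latest AS l
    JOIN orders AS o ON o.customer_id = l.id
    GROUP BY l.id, l.email
    ORDER BY l.id
""").fetchall()

print(result)  # [(1, 'a.new@example.com', 15.0), (2, 'b@example.com', 7.5)]
```

An analyst who knows SQL can read and modify this query directly, which is the sharing benefit the second point describes.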
The benefits I just described are “human-focused,” but there is a very important infrastructure benefit as well. Massively Parallel Processing (MPP) systems, like Snowflake and BigQuery, will automatically distribute your code across an arbitrarily large compute cluster if you write it in SQL.
On the other hand, if you use Python or Scala dataframes as your primary programming model, you will often need to specify data distributions and other details of how the system spreads your computation across nodes. The resulting execution plan is usually less efficient than what a SQL-based system would have produced, because of write barriers as well as extra serialization and deserialization steps. This efficiency gap becomes increasingly important as you work with larger data sets. That’s not to say it’s impossible to distribute your workload effectively when using a dataframe-based system, but you’ll be doing infrastructure work that doesn’t add value instead of spending your time getting insights from data.
Lastly and most importantly, by making SQL your foundation, you can avoid creating two competing camps within your organization: data scientists versus analysts. With everyone in alignment about how data manipulation is accomplished, your team can focus on the deep data analysis that’s increasingly important in business today.
Translated from: https://towardsdatascience.com/aligning-your-analysts-and-data-scientists-around-data-manipulation-fefe80d46c51