Why Apache Spark is a Crossover Hit for Data Scientists

Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist — or so I’ve been told — but what I do is actually quite different from what other “data scientists” do. For example, there are those practicing “investigative analytics” and those implementing “operational analytics.” (I’m in the second camp.)

Data scientists performing investigative analytics use interactive statistical environments like R to perform ad-hoc, exploratory analytics in order to answer questions and gain insights. By contrast, data scientists building operational analytics systems have more in common with engineers. They build software that creates and queries machine-learning models that operate at scale in real-time serving environments, using systems languages like C++ and Java, and often use several elements of an enterprise data hub, including the Apache Hadoop ecosystem.

And there are subgroups within these groups of data scientists. For example, some analysts who are proficient with R have never heard of Python or scikit-learn, or vice versa, even though both provide libraries of statistical functions that are accessible from a REPL (Read-Evaluate-Print Loop) environment.

A World of Tradeoffs

It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. If I primarily work in Java, should I really need to know a language like Python or R in order to be effective at exploring data? Coming from a conventional data analyst background, must I understand MapReduce in order to scale up computations? The array of tools available to data scientists tells a story of unfortunate tradeoffs:

  • R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that are required before the real analysis work can begin. As a language, it’s not similar to the mainstream languages developers know.
  • Python is a general purpose programming language with excellent libraries for data analysis like Pandas and scikit-learn. But like R, it’s still limited to working with an amount of data that can fit on one machine.
  • It’s possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level and difficult to express complex computations in.
  • Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I do wonder with envy sometimes: why can’t we have a nice REPL-like investigative analytics environment like the Python and R users have? That’s still scalable and distributed? And has the nice distributed-collection design of Crunch? And can equally be used in operational contexts?

Common Ground in Spark

These are the desires that make me excited about Apache Spark. While discussion about Spark for data science has mostly noted its ability to keep data resident in memory, which can speed up iterative machine learning workloads compared to MapReduce, this is perhaps not even the big news, at least not to me. It does not solve every problem for everyone. However, Spark has a number of features that make it a compelling crossover platform for investigative as well as operational analytics:

  • Spark comes with a machine-learning library, MLlib, albeit bare bones so far.
  • Being Scala-based, Spark embeds in any JVM-based operational system, but can also be used interactively in a REPL in a way that will feel familiar to R and Python users.
  • For Java programmers, Scala still presents a learning curve. But at least, any Java library can be used from within Scala.
  • Spark’s RDD (Resilient Distributed Dataset) abstraction resembles Crunch’s PCollection, which has proved a useful abstraction in Hadoop that will already be familiar to Crunch developers. (Crunch can even be used on top of Spark.)
  • Spark imitates Scala’s collections API and functional style, which is a boon to Java and Scala developers, but also somewhat familiar to developers coming from Python (see the short sketch after this list). Scala is also a compelling choice for statistical computing.
  • Spark itself, and Scala underneath it, are not specific to machine learning. They provide APIs supporting related tasks, like data access, ETL, and integration. As with Python, the entire data science pipeline can be implemented within this paradigm, not just the model fitting and analysis.
  • Code that is implemented in the REPL environment can be used mostly as-is in an operational context.
  • Data operations are transparently distributed across the cluster, even as you type.
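To make the collections point concrete, here is a minimal sketch, assuming the spark-shell and its built-in SparkContext sc: the same map and filter calls read the same way on a plain Scala List and on a distributed RDD.

    // Local Scala collection: square each number, keep the even results
    val localSquares = List(1, 2, 3, 4).map(x => x * x).filter(_ % 2 == 0)

    // The same operations on a distributed RDD (sc is the shell's SparkContext);
    // collect() brings the distributed result back to the driver as an Array
    val distributedSquares = sc.parallelize(1 to 4).map(x => x * x).filter(_ % 2 == 0).collect()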

Spark, and MLlib in particular, still has a lot of growing to do. For example, the project needs optimizations, fixes, and deeper integration with YARN. It doesn’t yet provide nearly the depth of library functions that conventional data analysis tools do. But as a best-of-most-worlds platform, it is already sufficiently interesting for a data scientist of any denomination to look at seriously.

In Action: Tagging Stack Overflow Questions

A complete example will give a sense of using Spark as an environment for transforming data and building models on Hadoop. The following example uses a dump of data from the popular Stack Overflow Q&A site. On Stack Overflow, developers can ask and answer questions about software. Questions can be tagged with short strings like “java” or “sql”. This example will build a model that can suggest new tags to questions based on existing tags, using the alternating least squares (ALS) recommender algorithm; questions are “users” and tags are “items”.

Getting the Data

Stack Exchange provides complete dumps of all data, most recently from January 20, 2014. The data is provided as a torrent containing different types of data from Stack Overflow and many sister sites. Only the file stackoverflow.com-Posts.7z needs to be downloaded from the torrent.

Despite its .7z extension, this file is just a bzip2-compressed file. Spark, like Hadoop, can directly read and split some compressed files, but in this case it is necessary to uncompress a copy onto HDFS. In one step, that’s:
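A minimal sketch of that step, assuming bzip2 is installed locally and /user/ds is a writable HDFS directory; adjust names and paths to your environment:

    # Decompress on the fly and stream the XML straight into HDFS
    bzip2 -dc stackoverflow.com-Posts.7z | hadoop fs -put - /user/ds/Posts.xml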

Uncompressed, it consumes about 24.4GB, and contains about 18 million posts, of which 2.1 million are questions. These questions carry about 9.3 million tags, drawn from approximately 34,000 unique tags.

Set Up Spark

Given that Spark’s integration with Hadoop is relatively new, it can be time-consuming to get it working manually. Fortunately, CDH hides that complexity by integrating Spark and managing setup of its processes. Spark can be installed separately with CDH 4.6.0, and is included in CDH 5 Beta 2. This example uses an installation of CDH 5 Beta 2.

This example uses MLlib, which uses the jblas library for linear algebra, which in turn calls native code using LAPACK and Fortran. At the moment, it is necessary to manually install the Fortran library dependency to enable this. The package is called libgfortran or libgfortran3, and should be available from the standard package manager of major Linux distributions. For example, for RHEL 6, install it with:
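A sketch of that step; the exact package name can differ slightly by distribution and version:

    # Install the GNU Fortran runtime library on a RHEL 6 worker
    sudo yum install libgfortran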

This must be installed on all machines that have been designated as Spark workers.

Log in to the machine designated as the Spark master with ssh. It will be necessary, at the moment, to ask Spark to let its workers use a large amount of memory. The code in MLlib that is used in this example, in version 0.9.0, has a memory issue, one that is already fixed for the next release. To configure for more memory and launch the shell:
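The exact mechanism depends on the Spark and CDH versions; a sketch in the spirit of Spark 0.9, assuming the master runs on this host and that the MASTER and SPARK_MEM environment variables are honored by this installation:

    # Point the shell at the Spark master and raise the per-worker memory,
    # then start the interactive Scala shell
    export MASTER=spark://$(hostname):7077
    export SPARK_MEM=8g
    spark-shell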

Interactive Processing in the Shell

The shell is the Scala REPL. It’s possible to execute lines of code, define methods, and in general access any Scala or Spark functionality in this environment, one line at a time. You can paste the following steps into the REPL, one by one.

First, get a handle on the Posts.xml file:
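A sketch, assuming the uncompressed file was copied to /user/ds/Posts.xml on HDFS as in the earlier step; the name postsXML is just a choice for this walkthrough:

    // Create an RDD whose elements are the lines of the XML file on HDFS
    val postsXML = sc.textFile("hdfs:///user/ds/Posts.xml")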

In response, the REPL will print the new variable’s name and type.

The text file is an RDD (Resilient Distributed Dataset) of Strings, which are the lines of the file. You can query it by calling methods of the RDD class. For example, to count the lines:
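With the postsXML handle defined above, that is simply:

    // Trigger a distributed count of the lines
    postsXML.count()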

This command yields a great deal of output from Spark as it counts lines in a distributed way, and finally prints 18066983.

The next snippet transforms the lines of the XML file into a collection of (questionID,tag) tuples. This demonstrates Scala’s functional programming style, along with a few of its quirks. (Explaining them is out of scope here.) RDDs behave like Scala collections, and expose many of the same methods, like map.

(You can copy the source for the above from here.)
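As a rough, illustrative sketch of that transformation (not the original’s exact code), assume each line of Posts.xml holds a single <row .../> element whose Tags attribute encodes tags as &lt;tag&gt; entities, and keep only questions carrying at least four tags:

    // Extract (questionID, tag) pairs from each XML row; questions with
    // fewer than four tags are dropped for this example
    val postIDTags = postsXML.flatMap { line =>
      val idTagsRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
      val tagRegex = "&lt;([^&]+)&gt;".r
      idTagsRegex.findFirstMatchIn(line).toList.flatMap { m =>
        val questionID = m.group(1).toInt
        val tags = tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList
        if (tags.size >= 4) tags.map(tag => (questionID, tag)) else Nil
      }
    }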

You will notice that this returns immediately, unlike the previous command. So far, nothing requires Spark to actually perform this transformation. It is possible to force Spark to perform the computation by, for example, calling a method like count. Or Spark can be told to compute and persist the result through checkpointing, for example.

The MLlib implementation of ALS operates on numeric IDs, not strings. The tags (“items”) in this data set are strings. It will be sufficient here to hash tags to a nonnegative integer value, use the integer values for the computation, and then use a reverse mapping to translate back to tag strings later. Here, a hash function is defined since it will be reused shortly.
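A sketch of such a helper, together with the reverse mapping used later to translate hashed IDs back to tag strings; the names nnHash and tagHashes are choices made for this walkthrough:

    // Hash a tag to a nonnegative Int so MLlib's ALS can use it as an item ID
    def nnHash(tag: String): Int = tag.hashCode & 0x7FFFFFFF

    // Reverse mapping from hashed ID back to tag string, for display later
    val tagHashes = postIDTags.map(_._2).distinct().map(tag => (nnHash(tag), tag))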

Now, you can convert the tuples from before into the format that the ALS implementation expects, and the model can be computed:
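A sketch using MLlib’s ALS, with illustrative hyperparameters; the rank of 40 and 10 iterations are placeholders, not tuned values:

    import org.apache.spark.mllib.recommendation._

    // One Rating per (question, tag) pair; the "rating" is an implicit 1.0
    val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))

    // Factor the question-tag matrix; arguments are rank and number of iterations
    val model = ALS.trainImplicit(alsInput, 40, 10)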

This will take minutes or more, depending on the size of your cluster, and will spew a large amount of output from the workers. Take a moment to open the Spark master web UI, which is linked from Cloudera Manager and runs by default at http://[master]:18080. There will be one running application. Click through, then click “Application Detail UI”. In this view it’s possible to monitor Spark’s distributed execution of lines of code in ALS.scala.

When it is complete, a factored matrix model is available in Spark. It can be used to predict question-tag associations by “recommending” tags to questions. At this early stage of MLlib’s life, there is not yet a proper recommend method that would give suggested tags for a question. However, it is easy to define one:
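Here is a sketch of one way to write it, assuming the model, tagHashes, and nnHash values defined above; this is illustrative rather than the original post’s exact method:

    // Suggest the top tags for one question by scoring it against every tag ID
    def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
      // Predict a score for this question paired with each distinct tag ID
      val predictions = model.predict(tagHashes.map(t => (questionID, t._1)))
      // Keep the highest-scoring tags; ~34,000 predictions fit easily on the driver
      val topN = predictions.collect().sortBy(-_.rating).take(howMany)
      // Translate hashed IDs back to tag strings; each lookup is a distributed search
      topN.map(r => (tagHashes.lookup(r.product).head, r.rating))
    }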

And to call it, pick any question with at least four tags, like “How to make substring-matching query work fast on a large table?” and get its ID from the URL. Here, that’s 7122697:
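Using the recommend method sketched above:

    // Print the top suggested tags, with scores, for question 7122697
    recommend(7122697).foreach(println)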

This method will take a minute or more to complete, which is slow. The lookups in the last line are quite expensive since each requires a distributed search. It would be somewhat faster if this mapping were available in memory. It’s possible to tell Spark to do this:
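With the tagHashes mapping from the earlier sketch, that is just:

    // Keep the (hashed ID, tag) mapping in memory after it is next computed
    tagHashes.cache()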

Because of the magic of Scala closures, this does in fact affect the object used inside the recommend method just defined. Run the method call again and it will return faster. In both cases, the result is a short list of suggested tags with scores.

(Your result will not be identical, since ALS starts from a random solution and iterates.) The original question was tagged “postgresql”, “query-optimization”, “substring”, and “text-search”. It’s reasonable that the question might also be tagged “sql” and “database”. “oracle” makes sense in the context of questions about optimization and text search, and “ruby-on-rails” often comes up with PostgreSQL, even though these tags are not in fact related to this particular question.

Something for Everyone

Of course, this example could be more efficient and more general. But for the practicing data scientists out there — whether you came in as an R analyst, Python hacker, or Hadoop developer — hopefully you saw something familiar in different elements of the example, and have discovered a way to use Spark to access some benefits that the other tribes take for granted.

Learn more about Spark’s role in an EDH, and join the discussion in our brand-new Spark forum.

Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Apache Mahout. Sean was primary author of recommender components in Mahout, and has been an active committer and PMC member for the project. He is co-author of Mahout in Action.

-----

FWD from post: http://blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
