databricks

by Shubhi Asthana

How to get started with Databricks

When I started learning Spark with PySpark, I came across the Databricks platform and explored it. The platform made it easy to set up an environment to run Spark dataframes and practice coding. This post walks through some steps that can help you get started with Databricks.

Databricks is a platform that runs on top of Apache Spark. It comes with a notebook system already set up. You can easily provision clusters in the cloud, and it also includes an integrated workspace for exploration and visualization.

You can also schedule any existing notebook or locally developed Spark code to go from prototype to production without re-engineering.

1. Set up a Databricks account

To get started with the tutorial, navigate to this link and select the free Community Edition to open your account. This option gives you a single cluster with up to 6 GB of free storage and lets you create a basic Notebook. You’ll need a valid email address to verify your account.

You will observe this screen once you successfully log in to your account.

2. Creating a new Cluster

We start by creating a new cluster to run our programs on. Click on “Cluster” on the main page and type in a name for the new cluster.

Next, you need to select the “Databricks Runtime” version. Databricks Runtime is a set of core components that run on clusters managed by Databricks. It includes Apache Spark, but also adds a number of components and updates to improve the usability and performance of the tool.

You can select any Databricks Runtime version; I selected 3.5 LTS (includes Apache Spark 2.2.1, Scala 2.11). You also have a choice between Python 2 and 3.

It’ll take a few minutes to create the cluster. After some time, you should be able to see an active cluster on the dashboard.

3. Creating a new Notebook

Let’s go ahead and create a new Notebook on which you can run your program.

From the main page, hit “New Notebook” and type in a name for the Notebook. Select the language of your choice; I chose Python here. You can see that Databricks supports multiple languages, including Scala, R, and SQL.

Once the details are entered, you will see that the layout of the notebook is very similar to a Jupyter notebook. To test the notebook, let’s import pyspark.

The command ran in 0.15 seconds, and the output also gives the name of the cluster on which it ran. If there are any errors in the code, they are shown below the command box.

You can hit the keyboard icon on the top right corner of the page to see operating system-specific shortcuts.

The most important shortcuts here are:

Shift+Enter to run a cell
Ctrl+Enter to run the current cell without moving to the next cell

Note that these shortcuts are for Windows. Click the keyboard icon to check the shortcuts for your own OS.

4. Uploading data to Databricks

Head over to the “Tables” section on the left bar, and hit “Create Table.” You can upload a file, or connect to a Spark data source or some other database.

Let’s upload the commonly used iris dataset file here (if you don’t have the dataset, use this link).

Once you upload the data, create the table with the UI so you can visualize it and preview it on your cluster. You can then inspect the attributes of the table. Spark will try to detect the datatype of each column, and it lets you edit the detected type too.

Now I need to add headers for the columns, so I can identify each column by its header instead of _c0, _c1 and so on.

I named the columns Sepal Length, Sepal Width, Petal Length, Petal Width and Class. Here, Spark incorrectly detected the datatype of the first four columns as String, so I changed it to the desired datatype, Float.
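
The same renaming and casting can also be done in code rather than in the UI. The following is only a sketch under assumptions: the snake_case column names and the helper apply_headers_and_types are hypothetical names chosen for this example, and df is assumed to be a Spark dataframe with exactly five columns (_c0 to _c4).

```python
# Hypothetical column names for this sketch; the UI step above used
# "Sepal Length", "Sepal Width", and so on.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
NUMERIC_COLUMNS = IRIS_COLUMNS[:4]  # the four measurements that should be Float


def apply_headers_and_types(df):
    """Rename _c0.._c4 to IRIS_COLUMNS and cast the measurements to float.

    Assumes df is a Spark dataframe with exactly five columns.
    """
    renamed = df.toDF(*IRIS_COLUMNS)  # replaces all column names at once
    for name in NUMERIC_COLUMNS:
        # withColumn with an existing name replaces that column
        renamed = renamed.withColumn(name, renamed[name].cast("float"))
    return renamed


print(IRIS_COLUMNS)
print(NUMERIC_COLUMNS)
```

In a notebook you would call apply_headers_and_types(df) and then check the result with printSchema().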

5. How to access data from Notebook

Spark is a framework that can be used to analyze big data using SQL, machine learning, graph processing or real-time streaming analysis. We will be working with SparkSQL and Dataframes in this tutorial.

Let’s get started working with the data in the Notebook. The data that we uploaded is now stored in tabular format. We need a SQL query to read the data and put it in a dataframe.

Type df = sqlContext.sql("SELECT * FROM iris_data") to read the iris data into a dataframe.
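
If you expect to query more than one table, it can help to build the query string in a small helper. This is a minimal sketch: build_query and its limit parameter are hypothetical names introduced here, not part of Spark or Databricks, and iris_data is the table name used above.

```python
def build_query(table_name, limit=None):
    """Return a SELECT * statement for table_name, optionally capped at `limit` rows."""
    query = "SELECT * FROM {}".format(table_name)
    if limit is not None:
        query += " LIMIT {}".format(int(limit))
    return query


# In a Databricks notebook, sqlContext is already defined, so this would read the table:
# df = sqlContext.sql(build_query("iris_data"))
print(build_query("iris_data"))
print(build_query("iris_data", limit=5))
```

Only pass table names you control to a helper like this, since the name is pasted into the SQL text as-is.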

To view the first five rows in the dataframe, I can simply run the command:

display(df.limit(5))

Notice the bar chart icon at the bottom. Once you click it, you can view the data that you have imported into Databricks. To view a bar chart of the complete data, run display(df) instead of display(df.limit(5)).

The dropdown button allows you to visualize the data in different charts like bar, pie, scatter, and so on. It also gives you plot options to customize the plot and visualize specific columns only.

You can also display matplotlib and ggplot figures in Databricks. For a demonstration, see Matplotlib and ggplot in Python Notebooks.

To view all the columns of the data, simply type df.columns.

To count the total number of rows in the dataframe (and see how long a full scan from remote disk/S3 takes), run df.count().
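
The notebook reports the run time of each cell, but you can also measure it yourself. Here is a small sketch: timed is a hypothetical helper written for this example, and a plain Python computation stands in for df.count so the snippet also runs outside Databricks.

```python
import time


def timed(action):
    """Call action() and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start


# In the notebook, with df loaded from the table:
# row_count, seconds = timed(df.count)

# Stand-in action so the sketch runs anywhere:
total, seconds = timed(lambda: sum(range(1000)))
print(total, round(seconds, 4))
```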

6. Converting a Spark dataframe to a Pandas dataframe

Now, if you are comfortable using pandas dataframes and want to convert your Spark dataframe to pandas, you can do so with the following commands:

import pandas as pd
# toPandas() collects the whole Spark dataframe onto the driver as a pandas dataframe
pandas_df = df.toPandas()

Now you can use pandas operations on the pandas_df dataframe.
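
As a rough illustration of what that looks like, the sketch below runs two common pandas operations. It does not use the real table: a few hand-typed rows shaped like the iris data stand in for the result of df.toPandas(), and the column names are assumptions made for this example. Keep in mind that toPandas() brings every row to the driver, so it is best suited to data that fits comfortably in memory.

```python
import pandas as pd

# Stand-in for pandas_df = df.toPandas(); the rows and column names are
# illustrative only and are not read from the Databricks table.
pandas_df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.4],
        "petal_length": [1.4, 1.4, 4.7, 4.5],
        "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
    }
)

# Summary statistics for the numeric columns
print(pandas_df.describe())

# Mean sepal length per class
mean_by_class = pandas_df.groupby("class")["sepal_length"].mean()
print(mean_by_class)
```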

7. Viewing the Spark UI

The Spark UI contains a wealth of information needed for debugging Spark jobs. There are a bunch of great visualizations, so let’s take a quick look at them.

To open the Spark UI, go to the top of the page, where there are menu options such as “File,” “View,” “Code,” “Permissions,” and others. You will find the name of the cluster at the top next to “Attached,” with a dropdown button next to it. Hit the dropdown button and select “View Spark UI.” A new tab will open with lots of information about your Notebook.

The UI gives plenty of information about each job executed on the cluster, along with its stages, the environment, and the SQL queries that were run. This can help users debug their applications. The UI also gives a good visualization of Spark streaming statistics. To learn more about each aspect of the Spark UI, refer to this link.

Once you are done with the Notebook, you can publish it or export it in different file formats, so that somebody else can use it through a unique link. I have attached my Notebook in HTML format.

Wrapping up

This is a short overview of how you can get started with Databricks quickly and run your programs. The advantage of using Databricks is that it offers an end-to-end service for building analytics, data warehousing, and machine learning applications. The entire Spark cluster can be managed, monitored, and secured using the Databricks self-service model.

Here are some interesting links for Data Scientists and for Data Engineers. Also, here is a tutorial which I found very useful and is great for beginners.
