Scraping Berlin Hostels and Building a Tableau Viz With It

One of the coolest things about working on a personal project is that we can explore topics we actually care about. In my case, I had the chance to backpack around the world for more than a year between 2016 and 2017, and it was one of the best experiences of my life.

During my travels, I stayed in A LOT of hostels. From Hanoi to the Iguassu Falls, passing through Tokyo, Delhi and many other places, one always needs a place to rest after a long day exploring the city. Funnily enough, it was in some of those hostels that I got interested in learning how to code, which started me on the path to becoming the data analyst I am today.

So, I’m very interested in understanding what makes one hostel better than another, how to compare them, and so on. Thinking about that is how I came up with the idea for this tutorial. Today we are going to do two things:

  • Scrape data from Hostel World, using Berlin as our case study, and save it into a data frame.
  • Use that data to build a Tableau dashboard that lets us select hostels based on different criteria.

Why Berlin hostels? Because Berlin is an amazing city, and it has plenty of hostels for us to explore. There are many websites for finding hostels, and we will use my favorite, Hostel World. I have used it many times, and it’s the one I trust for the accuracy of the information it provides.

Photo by Ricardo Gomez Angel on Unsplash.

My goal is to show you that the whole process of collecting, transforming, and visualizing data can be done in a simple yet effective way, so you can start doing your own projects. To get the most out of this tutorial, you should be familiar with Python and pandas, and comfortable with basic HTML and Tableau concepts.

You can follow along with the notebook containing the code here, and access the Tableau Dashboard here.

Always Explore the Website First!

I highly recommend that you take some time to explore the structure of the website before you start coding. If you’re using Chrome, just right-click on the page and select “Inspect”. This is what you get:

The HTML structure of our target page. Source: author.

Think of the HTML structure as a tree, with all its branches holding the information of the page. Try to find which class holds the hostel name, the ratings, and so on. More importantly, notice how the information for each hostel sits in its own “branch”, or container. That means that once we figure out how to access one container, we can apply the same logic to all the other hostels/containers.
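
To make the container idea concrete, here is a minimal sketch on a tiny, made-up HTML snippet. The markup and the example.com links are invented for illustration; only the class names are borrowed from the code used later in this tutorial, and the real page is far more nested.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup (not copied from the real page)
html = """
<div class="results">
  <div class="fabresult rounded clearfix hwta-property">
    <h2><a href="https://example.com/hostel-a">Hostel A</a></h2>
    <div class="hwta-rating-score"> 9.1 </div>
  </div>
  <div class="fabresult rounded clearfix hwta-property">
    <h2><a href="https://example.com/hostel-b">Hostel B</a></h2>
    <div class="hwta-rating-score"> 8.4 </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each hostel lives in its own container, so one lookup returns one element per hostel
containers = soup.find_all(class_="hwta-property")

# The same navigation works inside every container
names = [c.h2.a.text for c in containers]
ratings = [c.find(class_="hwta-rating-score").text.strip() for c in containers]

print(len(containers), names, ratings)
```

Searching for a single class such as hwta-property matches any tag that carries it, which tends to be a little less brittle than matching the full class string.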

In the code below I show you how to get the raw information, then how to figure out how many pages of hostels there are (we will need that to iterate later), and finally how to isolate the information about the first hostel so we can explore it. Take your time to read the code and comments; I wrote them specially for you:

# importing the libraries used for the scraping
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re

# getting the html info to be used
url = 'https://www.hostelworld.com/hostels/Berlin'
response = get(url)

# create soup
soup = BeautifulSoup(response.text, 'html.parser')

# creating individual containers; each one holds the information about one hostel
holstel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

# figuring out how many pages of hostels are available; we need this to iterate over the pages later
total_pages = soup.find_all(class_="pagination-page-number")
final_page = pd.to_numeric(total_pages[-1].text)
print(final_page)

# checking how many hostels we have on the first page
print(len(holstel_containers))

first_hostel = holstel_containers[0]
print(first_hostel.prettify())

The output of this code will be first a “3”, the number of pages with hostel info, then a “30”, the number of hostels per page, and finally a long bunch of HTML, which is the information about the first hostel on the list. The information we will extract today is the following:

  • Name
  • Link
  • Distance from centre (km)
  • Average rating
  • Number of reviews
  • Average price in USD

Using our super HTML skills, we figured out that the code below extracts all of that. If you have already used Beautiful Soup, could you get the same information in a different way? If so, I would love to see it in the comments.

# hostel name
first_hostel.h2.a.text

# hostel link
first_hostel.h2.a.get('href')

# distance from city centre in km
first_hostel.find(class_="addressline").text[12:18].replace('k', '').replace('m', '').strip()

# average rating
first_hostel.find(class_='hwta-rating-score').text.replace('\n', '').strip()

# number of reviews
first_hostel.find(class_="hwta-rating-counter").text.replace('\n', '').strip()

# average price per night in USD
first_hostel.find(class_="price").text.replace('\n', '').strip()[3:]

Note that we need some Python string essentials, like replace, strip and slicing, along with a few methods from the Beautiful Soup package, mostly find, find_all and get. Knowing how to combine them takes some practice, but I can guarantee that once you understand the idea, it is pretty simple.
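
As a small illustration of the string side of this, the sketch below cleans a few raw strings shaped like the text a tag might return. The sample values and the clean_text helper are made up for the example; the actual text on the page may be laid out differently.

```python
def clean_text(raw: str) -> str:
    """Drop newlines and surrounding whitespace from a tag's text."""
    return raw.replace("\n", "").strip()


# Hypothetical raw values, not copied from the website
raw_rating = "\n      9.1\n    "
raw_price = "\n  US$18.50\n"
raw_distance = " 2.3km "

rating = clean_text(raw_rating)
# Slicing off the first three characters removes the currency prefix
price = clean_text(raw_price)[3:]
# Chained replace calls strip the unit letters before trimming whitespace
distance = raw_distance.replace("k", "").replace("m", "").strip()

print(rating, price, distance)
```

Note that the results are still strings at this point; converting them to numbers is a separate step.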

Now that we know how to access the information we need in the first container, we will apply the same logic to all the hostels on the first page, and then to all the pages with hostel information. How do we do that? First by using the familiar for loop, then by saving the information into lists that start out empty, and finally by using those lists to create a data frame:

# first, create the empty lists
hostel_names = []
hostel_links = []
hostel_distance = []
hostel_ratings = []
hostel_reviews = []
hostel_prices = []

# iterate over the pages and create the containers, using the final_page value we got at the beginning
for page in np.arange(1, final_page + 1):
    url = 'https://www.hostelworld.com/hostels/Berlin?page=' + str(page)
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    holstel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

    # iterate over the results on each page
    for item in range(len(holstel_containers)):
        hostel_names.append(holstel_containers[item].h2.a.text)
        hostel_links.append(holstel_containers[item].h2.a.get('href'))
        hostel_distance.append(holstel_containers[item].find(class_="addressline").text[12:18].replace('k', '').replace('m', '').strip())
        hostel_ratings.append(holstel_containers[item].find(class_='hwta-rating-score').text.replace('\n', '').strip())
        hostel_reviews.append(holstel_containers[item].find(class_="hwta-rating-counter").text.replace('\n', '').strip())
        hostel_prices.append(holstel_containers[item].find(class_="price").text.replace('\n', '').strip()[3:])

    time.sleep(2)  # pause between pages so we don't push too hard on the website

# using the lists to create a brand new dataframe
hw_berlin = pd.DataFrame({
    'hostel_name': hostel_names,
    'distance_centre_km': hostel_distance,
    'average_rating': hostel_ratings,
    'number_reviews': hostel_reviews,
    'average_price_usd': hostel_prices,
    'hw_link': hostel_links
})

hw_berlin.head()
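
One thing the loop above takes for granted is that every container has every field. If a hostel had no rating yet, find would return None and the .text call would raise an AttributeError. The sketch below shows one possible way to guard against that; the text_or_none helper and the HTML snippet are hypothetical, written only for this illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(container, css_class):
    """Return the stripped text of the first tag with css_class, or None if it is absent."""
    tag = container.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


# Made-up markup: the second hostel has no rating element
html = """
<div class="hwta-property">
  <h2><a href="/hostel-a">Hostel A</a></h2>
  <div class="hwta-rating-score">9.1</div>
</div>
<div class="hwta-property">
  <h2><a href="/hostel-b">Hostel B</a></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One dictionary per hostel keeps the fields of a row together
records = []
for container in soup.find_all(class_="hwta-property"):
    records.append({
        "hostel_name": container.h2.a.get_text(strip=True),
        "average_rating": text_or_none(container, "hwta-rating-score"),
    })

print(records)
```

A list of dictionaries like this can be passed straight to pd.DataFrame, with missing values showing up as empty cells instead of stopping the scrape halfway.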

And now we can appreciate the beauty of what we have just created:

First lines of the Berlin Hostels data frame. Source: author.

After that we just need to clean up the data a little bit, removing non-numerical characters and converting strings, saved initially as object, to numbers. Finally, we will save our results into a .csv file.

# removing non-numerical characters from the column distance_centre_km
hw_berlin.distance_centre_km = [re.sub(r'[^0-9.]', '', x) for x in hw_berlin.distance_centre_km]

# converting numerical columns to the proper format
list_to_convert = ['distance_centre_km', 'average_rating', 'number_reviews', 'average_price_usd']

for column in list_to_convert:
    hw_berlin[column] = pd.to_numeric(hw_berlin[column], errors='coerce')

# saving the final version into a .csv file
hw_berlin.to_csv('hw_berlin_basic_info.csv')
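
After a conversion with errors='coerce', it can be worth checking how many values failed to parse, since they silently become NaN. The sketch below does that on a few made-up rows (not scraped values), and also shows the optional index=False argument of to_csv, which leaves the row index out of the file.

```python
import io

import pandas as pd

# Made-up rows for illustration only
df = pd.DataFrame({
    "distance_centre_km": ["2.3", "0.8", ""],
    "average_rating": ["9.1", "No rating", "8.4"],
})

for column in ["distance_centre_km", "average_rating"]:
    # Anything that cannot be parsed as a number becomes NaN instead of raising
    df[column] = pd.to_numeric(df[column], errors="coerce")

print(df.dtypes)
print(df.isna().sum())  # number of values that could not be converted, per column

# Writing to an in-memory buffer here; a file path works the same way
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Whether to drop the index is a matter of taste; without index=False the file simply gets one extra, unnamed first column.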

Tableau Fun Time!

Tableau is one of the most powerful BI tools available today, and it offers a free version, Tableau Public, that allows you to do A LOT of cool stuff. However, it can become pretty complex very fast, even for some basic graphs. I cannot cover all the steps I took here, as there were a lot of click-and-drag actions. It’s different from code, where you can just type it out and reproduce everything.

So, if you are new to Tableau and want to understand how I built my visualization, the way to do that is to download the .twb file, which is available here, open it on your computer, and do what we call “reverse engineering”, which basically means inspecting and playing with the files I’ve created yourself. Trust me, this is the most effective way to learn Tableau, and even when you can see the engineering behind it, it can be hard to reproduce the same visualization. Shall we give it a try?

Tableau offers different filters that help you to slice and visualize our recently scraped data. Source: author.

As data or business analysts, our job is basically to make data readable and easy to manipulate. The visualization I’ve built for this tutorial offers you that: you can slice and play with the hostels based on the different criteria we have available, filtering the options and finding the ones you are interested in, just like a stakeholder would do. Besides the filters, I’ve also included a scatter plot where we can check the relationship between price and reviews.
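
If you want to sanity-check what the dashboard shows, the same kind of slicing can be roughly reproduced in pandas. This is only a sketch: the rows below are invented, the thresholds are arbitrary, and the column names are the ones used in the data frame built earlier.

```python
import pandas as pd

# Invented rows with the same column names as the scraped data frame
hw = pd.DataFrame({
    "hostel_name": ["Hostel A", "Hostel B", "Hostel C"],
    "distance_centre_km": [0.8, 2.3, 5.1],
    "average_rating": [9.1, 7.9, 8.6],
    "number_reviews": [1200, 300, 4500],
    "average_price_usd": [22.0, 14.5, 18.0],
})

# Equivalent of two dashboard filters: minimum rating and maximum distance
shortlist = hw[(hw["average_rating"] >= 8.5) & (hw["distance_centre_km"] <= 3)]
print(shortlist["hostel_name"].tolist())

# A single number summarising the price vs. reviews scatter plot
corr = hw["average_price_usd"].corr(hw["number_reviews"])
print(round(corr, 2))
```

A correlation coefficient is a much coarser summary than the scatter plot itself, so treat it as a quick cross-check rather than a replacement for the chart.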

The dashboard is pretty simple, and I’ve made it that way on purpose: I would like to see you build one yourself and share the link to your results in the comments. What other information can you get from the data we’ve scraped? Could you do the same analysis with hostels in Paris, New York or Rio de Janeiro? I’ll leave those questions for you to answer with your own code and dashboard.
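
As a starting point for trying another city, here is a minimal sketch that builds the list of page URLs for a given city. It assumes other cities follow the same URL pattern as the Berlin address used in this tutorial, which I have not verified; the page_urls helper is hypothetical, and city names containing spaces may need a different form.

```python
BASE_URL = "https://www.hostelworld.com/hostels/"


def page_urls(city: str, final_page: int) -> list[str]:
    """Build one URL per results page, assuming the Berlin URL pattern holds for other cities."""
    return [f"{BASE_URL}{city}?page={page}" for page in range(1, final_page + 1)]


urls = page_urls("Berlin", 3)
for url in urls:
    print(url)
```

The number of pages still has to be read from the first page of each city, just as we did for Berlin.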

That’s all for today! I hope this tutorial will help you to get more knowledge about data scraping and Tableau. Feel free to connect with me on LinkedIn and to check my other texts and code on my Medium and GitHub profiles.

Translated from: https://towardsdatascience.com/scraping-berlin-hostels-and-building-a-tableau-viz-with-it-a73ce5b88e22
