Creating a cross-database union in Tableau
One of the coolest things about doing personal projects is that we can explore topics of our own interest. In my case, I had the chance to backpack around the world for more than a year between 2016 and 2017, and it was one of the best experiences of my life.
During my travels, I stayed in A LOT OF HOSTELS. From Hanoi to the Iguassu Falls, passing through Tokyo, Delhi and many other places, one always needs a place to rest after a long day exploring a city. Funny enough, it was in some of those hostels that I got interested in learning how to code, which set me on the path to becoming the data analyst I am today.
So I'm very interested in understanding what makes one hostel better than another, how to compare them, and so on, and after thinking about that I came up with this tutorial idea. Today we are going to do two things:
- Scrape data from Hostel World, using Berlin as our study case, and save it into a data frame.
- Use that data to build a Tableau dashboard that will allow us to select hostels based on different criteria.
Why Berlin hostels? Because Berlin is an amazing city, and there are a lot of hostel options there for us to explore. There are many different websites for finding hostels, and we will use my favorite, Hostel World, which I have used many times myself and which I trust for the accuracy of the information it provides.
My goal is to show you that we can do the whole collect/transform/visualize process in a simple yet effective way, so you can start doing your own projects. To fully enjoy this tutorial, it's important that you are familiar with Python and pandas, and comfortable with basic HTML and Tableau concepts.
You can follow along with the notebook containing the code here, and access the Tableau dashboard here.
Always Explore the Website First!
I highly recommend that you take some time to explore the structure of the website before you start coding. If you're using Chrome, just right-click and select "Inspect". This is what you get:
Think of the HTML structure as a tree, with all its branches holding the information on the page. Try to find which class holds information about the hostel name, ratings, etc. More importantly, notice how the information for each hostel has its own "branch", or container. That means that once we figure out how to access one, we can apply the same logic to all the other hostels/containers.
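To make the container idea concrete, here is a toy sketch with made-up markup (not Hostel World's actual HTML): each repeated div becomes its own searchable branch, and the same access pattern works on every one of them.

```python
from bs4 import BeautifulSoup

# made-up listing markup: each hostel lives in its own container
html = """
<div class="result">
  <h2><a href="/hostel-a">Hostel A</a></h2>
  <span class="rating">9.1</span>
</div>
<div class="result">
  <h2><a href="/hostel-b">Hostel B</a></h2>
  <span class="rating">8.4</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
containers = soup.find_all(class_='result')

print(len(containers))                           # one container per hostel
print(containers[0].h2.a.text)                   # name inside the first branch
print(containers[1].find(class_='rating').text)  # rating inside the second branch
```

Once the access pattern works on one container, a loop over `containers` extracts the same fields from all of them.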
In the code below I show you how to get the raw information, then how to figure out how many pages of hostels we have, since we will need that to iterate later, and then how to separate out the information about the first hostel in order to explore it. Take your time to read the code and comments, I wrote them specially for you:
# importing the libraries to use for the scraping
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import re

# getting the html info to be used
url = 'https://www.hostelworld.com/hostels/Berlin'
response = get(url)

# create soup
soup = BeautifulSoup(response.text, 'html.parser')

# creating individual containers; each one holds information about one hostel
hostel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

# figuring out how many pages of hostels are available,
# which we will need when iterating over pages
total_pages = soup.find_all(class_='pagination-page-number')
final_page = pd.to_numeric(total_pages[-1].text)
print(final_page)

# checking how many hostels we have on the first page
print(len(hostel_containers))

first_hostel = hostel_containers[0]
print(first_hostel.prettify())
The output of this code will be, first, a "3", the number of pages with hostel info; then a "30", the number of hostels per page; and finally a long chunk of HTML, which is the information about the first hostel on the list. The information we will extract today is the following:
- Name
- Link
- Distance from centre (km)
- Average rating
- Number of reviews
- Average price in USD
Using our super HTML skills, we figured out that the code to extract all of that is the one below. If you have already used Beautiful Soup, could you get the same information a different way? If so, I would love to see that in the comments.
# hostel name
first_hostel.h2.a.text

# hostel link
first_hostel.h2.a.get('href')

# distance from the city centre in km
first_hostel.find(class_='addressline').text[12:18].replace('k', '').replace('m', '').strip()

# average rating
first_hostel.find(class_='hwta-rating-score').text.replace('\n', '').strip()

# number of reviews
first_hostel.find(class_='hwta-rating-counter').text.replace('\n', '').strip()

# average price per night in USD
first_hostel.find(class_='price').text.replace('\n', '').strip()[3:]
Note that we will need some string essentials, like replace and strip, along with some methods from the Beautiful Soup package, mostly find, find_all and get. Knowing how to combine them takes some practice, but I can guarantee that, once you understand the idea, it is pretty simple.
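As a small illustration of that combination (the snippet below uses invented markup, not the site's real addressline content), find grabs the tag and plain string methods clean up its text:

```python
from bs4 import BeautifulSoup

# an invented messy snippet, similar in spirit to what the scraper handles
snippet = '<span class="addressline">Mitte, Berlin - 1.2km from centre</span>'
tag = BeautifulSoup(snippet, 'html.parser').find(class_='addressline')

# chain string methods to peel away the units and whitespace
distance = tag.text.split('-')[-1].replace('km from centre', '').strip()
print(distance)  # '1.2'
```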
Now that we know how to access the information we need in the first container, we will apply the same logic across all the hostels on the first page, and then across all the pages with hostel information. How do we do that? First by using our very well known for loop, then saving the information into empty lists, and finally using those lists to create a data frame:
# first, create the empty lists
hostel_names = []
hostel_links = []
hostel_distance = []
hostel_ratings = []
hostel_reviews = []
hostel_prices = []

# iterate over the pages, using the final_page value we got at the beginning
for page in np.arange(1, final_page + 1):
    url = 'https://www.hostelworld.com/hostels/Berlin?page=' + str(page)
    response = get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    hostel_containers = soup.find_all(class_='fabresult rounded clearfix hwta-property')

    # iterate over the results on each page
    for item in range(len(hostel_containers)):
        hostel_names.append(hostel_containers[item].h2.a.text)
        hostel_links.append(hostel_containers[item].h2.a.get('href'))
        hostel_distance.append(hostel_containers[item].find(class_='addressline').text[12:18].replace('k', '').replace('m', '').strip())
        hostel_ratings.append(hostel_containers[item].find(class_='hwta-rating-score').text.replace('\n', '').strip())
        hostel_reviews.append(hostel_containers[item].find(class_='hwta-rating-counter').text.replace('\n', '').strip())
        hostel_prices.append(hostel_containers[item].find(class_='price').text.replace('\n', '').strip()[3:])

    time.sleep(2)  # pause so we don't push too hard on the website

# using the lists to create a brand new dataframe
hw_berlin = pd.DataFrame({'hostel_name': hostel_names,
                          'distance_centre_km': hostel_distance,
                          'average_rating': hostel_ratings,
                          'number_reviews': hostel_reviews,
                          'average_price_usd': hostel_prices,
                          'hw_link': hostel_links})
hw_berlin.head()
And now we can appreciate the beauty of what we have just created:
After that we just need to clean up the data a little, removing non-numerical characters and converting strings, initially saved as objects, to numbers. Finally, we will save our results into a .csv file.
# removing non-numerical characters from the distance_centre_km column
hw_berlin.distance_centre_km = [re.sub('[^0-9.]', '', x) for x in hw_berlin.distance_centre_km]

# converting the numerical columns to the proper format
list_to_convert = ['distance_centre_km', 'average_rating', 'number_reviews', 'average_price_usd']
for column in list_to_convert:
    hw_berlin[column] = pd.to_numeric(hw_berlin[column], errors='coerce')

# saving the final version into a .csv file
hw_berlin.to_csv('hw_berlin_basic_info.csv')
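One extra step worth taking (not part of the original notebook) is to read the file back and confirm the numeric columns survived the round trip; the small frame below is invented just for the check. Passing index=False also keeps pandas from writing a stray index column.

```python
import pandas as pd

# invented rows standing in for the scraped data
hw_berlin = pd.DataFrame({
    'hostel_name': ['Hostel A', 'Hostel B'],
    'distance_centre_km': [1.2, 3.4],
    'average_rating': [9.1, 8.4],
})
hw_berlin.to_csv('hw_berlin_basic_info.csv', index=False)

# read it back and verify the dtypes are numeric, not object
check = pd.read_csv('hw_berlin_basic_info.csv')
print(check.dtypes)
```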
Tableau Fun Time!
Tableau is one of the most powerful BI tools available today, and it offers a free version, Tableau Public, that allows you to do A LOT of cool stuff. However, it can become pretty complex very fast, even for some basic charts. I cannot cover all the steps I took here, as it involved a lot of click-and-drag actions. It's different from code, where you can just type and reproduce everything.
So, if you are new to Tableau and want to understand how I built my visualization, the way to do that is to download the .twb file, which is available here, open it on your computer, and do what we call "reverse engineering", which is basically checking and playing with the file I've created. Trust me, this is the most effective way to learn Tableau, and even when you can see the engineering behind it, it can be hard to reproduce the same visualization. Shall we give it a try?
As data or business analysts, we basically need to make data readable and easy to manipulate. The visualization I've built for this tutorial offers you just that: you can slice and play with the hostels based on the different criteria we have available, filtering the options and finding the ones you're interested in, just like a stakeholder would. Besides the filters, I've also included a scatter plot where we can check the relationship between price and reviews.
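If you want a quick preview of that price-versus-reviews relationship before opening Tableau, a few lines of pandas plotting will do; the numbers below are invented stand-ins for the scraped hw_berlin frame.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen so this also runs without a display
import matplotlib.pyplot as plt

# invented sample standing in for the scraped data
hw_berlin = pd.DataFrame({
    'average_price_usd': [18.0, 25.5, 32.0, 21.0],
    'number_reviews': [1200, 450, 300, 2100],
})

# scatter plot of price against review count, saved to a png
ax = hw_berlin.plot.scatter(x='average_price_usd', y='number_reviews')
ax.set_title('Price vs. number of reviews')
plt.savefig('price_vs_reviews.png')
```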
The dashboard is pretty simple, and I did it that way on purpose; I would like to see you doing it yourself and sharing a link to your results in the comments. What other kinds of information can you get from the data we've scraped? Could you do the same analysis with hostels in Paris, New York or Rio de Janeiro? I'll leave those questions for you to answer with your own code and dashboard.
That's all for today! I hope this tutorial helps you learn more about data scraping and Tableau. Feel free to connect with me on LinkedIn and to check out my other texts and code on my Medium and GitHub profiles.
Translated from: https://towardsdatascience.com/scraping-berlin-hostels-and-building-a-tableau-viz-with-it-a73ce5b88e22