Scrape Company Reviews & Ratings from Indeed in 2 Minutes
Web Scraping, Data Science
In this tutorial, I will show you how to perform web scraping using an Anaconda Jupyter notebook and the BeautifulSoup library.
We'll scrape company reviews and ratings from the Indeed platform, load them into a pandas DataFrame, and then export them to a .CSV file.
Let's get straight down to business. However, if you're looking for a guide to understanding web scraping in general, I advise you to read this article from Dataquest.
Let's start by importing our 3 libraries:
from bs4 import BeautifulSoup
import pandas as pd
import requests
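If any of these libraries (or the lxml parser we'll use later) are missing from your environment, a typical install command is the following (adjust to your setup, e.g. conda instead of pip):

pip install beautifulsoup4 pandas requests lxml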
Then, let's go to the Indeed website and examine which information we want. We will be targeting the Ernst & Young firm page, which you can check at the following link:
https://www.indeed.com/cmp/Ey/reviews?fcountry=IT
Based on my location, the country is set to Italy, but you can choose and control that if you want.
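If you want reviews from another country, you can swap the fcountry code in the URL. A minimal sketch (the alternative codes here are assumptions; check which codes Indeed accepts for your region):

country = 'IT'  # e.g. 'US' or 'FR' (hypothetical alternatives; verify on the site)
url = f'https://www.indeed.com/cmp/Ey/reviews?fcountry={country}'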
In the next picture, we can see the pieces of information that we can tackle and scrape:
1- Review Title
2- Review Body
3- Rating
4- The role of the reviewer
5- The location of the reviewer
6- The review date
However, you can notice that points 4, 5 & 6 are all on one line and will be scraped together. This can cause a bit of confusion for some people, but my advice is to scrape first and solve problems later. So, let's try to do this.
After knowing what we want to scrape, we need to find out how much we need to scrape: do we want only 1 review? 1 page of reviews, or all pages of reviews? I guess the answer should be all pages!
If you scroll down the page and go over to page 2, you will find that the link for that page becomes the following:
https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start=20
Then try to go to page 3, and you will find the link becomes the following:
https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start=40
Looks like we have a pattern here: page 2 = 20, page 3 = 40, then page 4 = 60, right? All the way until page 8 = 140.
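In other words, page n maps to start = (n - 1) * 20. A quick sketch to confirm the offsets line up (page 8 is the last page in my case; your review count may differ):

# page n -> start = (n - 1) * 20
for page_num in range(1, 9):
    print(f'page {page_num} -> start={(page_num - 1) * 20}')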
Let's get back to coding, and start by defining the dataframe that you want:
df = pd.DataFrame({'review_title': [], 'review': [], 'author': [], 'rating': []})
In the next code block, I will make a for loop that starts from 0, jumps by 20, and stops at 140.
1- Inside that for loop we will make a GET request to the web server, which will download the HTML contents of a given web page for us.
2- Then, we will use the BeautifulSoup library to parse this page and extract the text from it. We first have to create an instance of the BeautifulSoup class to parse our document.
3- Then, by inspecting the HTML, we choose the classes from the web page; classes are used when scraping to specify the particular elements we want to scrape.
4- And then we can conclude by adding the results to the DataFrame we created before.
"I added a picture below of how the code should look, in case you copied it and some spaces were added wrong."
for i in range(0, 160, 20):  # start = 0, 20, 40, ..., 140 (pages 1 to 8)
    url = f'https://www.indeed.com/cmp/Ey/reviews?fcountry=IT&start={i}'
    header = {"User-Agent": "Mozilla/5.0 Gecko/20100101 Firefox/33.0 GoogleChrome/10.0"}
    page = requests.get(url, headers=header)
    soup = BeautifulSoup(page.content, 'lxml')
    # all the reviews sit inside the page's main container div
    results = soup.find("div", {"id": 'cmp-container'})
    elems = results.find_all(class_='cmp-Review-container')
    for elem in elems:
        title = elem.find(attrs={'class': 'cmp-Review-title'})
        review = elem.find('div', {'class': 'cmp-Review-text'})
        author = elem.find(attrs={'class': 'cmp-Review-author'})
        rating = elem.find(attrs={'class': 'cmp-ReviewRating-text'})
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, pd.DataFrame([{'review_title': title.text,
                                           'review': review.text,
                                           'author': author.text,
                                           'rating': rating.text}])],
                       ignore_index=True)
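Two optional safeguards you may want inside that loop (my own additions, not part of the original code): fail fast if Indeed refuses a request, and pause briefly between pages so you don't hammer the server. Assuming the same url and header as above:

import time

page = requests.get(url, headers=header)
page.raise_for_status()  # raises an HTTPError if the request was blocked (e.g. a 403)
time.sleep(1)            # brief pause between page requests, out of politeness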
DONE. Let's check our dataframe:
df.head()
Now, once scraped, let's try to solve the problem we have.
Notice the author column had 3 different pieces of information separated by (-).
So, let's split them:
author = df['author'].str.split('-', expand=True)
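To see what expand=True produces, here is a quick illustration with a made-up author string (the job, location, and date are hypothetical):

sample = pd.DataFrame({'author': ['Consultant - Milano - 5 January 2020']})
print(sample['author'].str.split('-', expand=True))
# column 0: 'Consultant ', column 1: ' Milano ', column 2: ' 5 January 2020'

Note the pieces keep their surrounding spaces, which is harmless for our purposes here.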
Now, let’s rename the columns and delete the last one.
现在,让我们重命名列并删除最后一列。
author = author.rename(columns={0: 'job', 1: 'location', 2: 'time'})
del author[3]  # drop the extra column left over when a row splits into more than 3 pieces
Then let's join those new columns to our original dataframe and delete the old author column:
df1 = pd.concat([df, author], axis=1)
del df1['author']
Let's examine our new dataframe:
df1.head()
Let's re-organize the columns and remove any duplicates:
df1 = df1[['job', 'review_title', 'review', 'rating', 'location', 'time']]
df1 = df1.drop_duplicates()
Then, finally, let's save the dataframe to a CSV file:
df1.to_csv('EY_indeed.csv')
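As a quick sanity check, you can read the file back and confirm it matches the dataframe (index_col=0 because to_csv wrote the index as the first column):

check = pd.read_csv('EY_indeed.csv', index_col=0)
print(check.shape)            # should match df1's row and column counts
print(check.columns.tolist())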
You should now have a good understanding of how to scrape and extract data from Indeed. If you are a bit familiar with web scraping, a good next step is to pick a site and try some web scraping on your own.
Happy Coding :)
Translated from: https://towardsdatascience.com/scrape-company-reviews-ratings-from-indeed-in-2-minutes-59205222d3ae