Web Scraping an E-Commerce Website Using Selenium
In this article we will go through the web scraping process of an e-commerce website. I have designed this particular post to be beginner-friendly, so if you have no prior knowledge of web scraping or Selenium, you can still follow along.
To understand web scraping, we need to understand HTML code basics. We will cover that as well.
Basics of HTML
There are a lot of things to talk about concerning HTML basics, but we will focus on the things that are helpful (at least most of the time) in web scraping.
HTML element (Fig 1 RIGHT) — an HTML element is the collection of a start tag, its attributes, an end tag and everything in between.
Attributes — special words used inside a start tag to control the element's behaviour. An attribute and its value are used together to reference a tag and its content, for example for styling. The most important attributes we will use in web scraping are class, id and name.
class and id attributes — HTML elements can have one or more classes, separated by spaces (see Fig 1 LEFT above). On the other hand, HTML elements must have unique id attributes, that is, an id cannot be used to reference more than one HTML element.
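Fig 1 itself is not reproduced in this text version of the article. As a stand-in, here is a minimal HTML page whose tags, classes and ids are assumptions reconstructed from the selectors and outputs used in the snippets further down; it is only meant to make those examples easier to follow.

<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<!-- two elements share the class "container1"; the id "JDCf" is unique -->
<div class="container1">Content here</div>
<div id="JDCf">Content 2 here</div>
<div class="container1 col4">Content 3 here</div>
</body>
</html>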
Simple Web Scraping
Before we go into the actual scraping of an e-commerce site, let us scrape the page shown in the Figure below (rendered from the HTML code in Fig 1 LEFT).
From the Figure (Fig 2) above note the following:
- This is the Uniform Resource Locator (URL). In this particular case, the locator points to HTML code stored locally.
- The button labelled 2 is very important when you are hovering over the page to identify the elements you are interested in. Once your object of interest is highlighted, the corresponding tag element will also be highlighted.
- This is the page source code. It is just HTML code like that in Fig 1 LEFT. You can view the page source by pressing Ctrl+Shift+I to inspect the page, or by right-clicking on the site and choosing Inspect Element or Inspect, whichever is available in the options.
Prerequisites
To conduct web scraping, we need the selenium Python package (if you don't have it, install it using pip) and a browser webdriver. For selenium to work, it must have access to the driver. Download the web driver matching your browser from here: Chrome, Firefox, Edge and Safari. Once the web driver is downloaded, save it and note the path. By default, selenium will look for the driver in the current working directory, so you may want to save the driver in the same directory as the Python script. You are, however, not obliged to do this; you can save it anywhere and provide the full path to the executable when defining PATH in the snippet below.
from selenium import webdriver
import time

PATH = "./chromedriver"
driver = webdriver.Chrome(PATH)
driver.get(url="file:///home/kiprono/Desktop/untitled.html")
time.sleep(5)
driver.close()
Lines 1 and 2 import the necessary libraries.
Lines 4 and 5 — Define the path to the web driver you downloaded and instantiate a Chrome driver. I am using the Chrome web driver, but you can just as well use Firefox, Microsoft Edge or Safari.
Line 6 — The driver launches a Chrome session (line 5) and fetches the page at the given URL (line 6).
Lines 7 and 8 — Line 7 pauses Python execution for 5 seconds before the browser is closed in line 8. Pausing is important so that you can glance at what is happening in the browser, and closing ensures that the browsing session is ended, otherwise we would end up with many open Chrome windows. Sleeping may also be important when waiting for a page to load. However, there is a more proper way of initiating a wait, sketched below.
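As a sketch of that other way (it is not used in the scripts in this article), Selenium supports explicit waits through WebDriverWait and expected_conditions, which block until a condition is met instead of sleeping for a fixed time. The URL and the locator below are placeholders taken from the book site scraped later:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome("./chromedriver")
driver.get("http://books.toscrape.com/")
# Wait up to 10 seconds for at least one product card to be present in the DOM,
# then continue immediately once it appears (a TimeoutException is raised otherwise).
wait = WebDriverWait(driver, 10)
first_book = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "product_pod")))
print(first_book.text)
driver.quit()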
Locating the Elements
This is the most important part of web scraping. In this section we will learn how to locate HTML elements by using different attributes.
Recall: Elements of a web page can be identified by using a class, id, tag, name and/or xpath. Ids are unique but classes are not. This means that a given class can identify more than one web element, whereas one id identifies one and only one element.
One HTML element can be identified using any of the following methods
- driver.find_element_by_id
- driver.find_element_by_name
- driver.find_element_by_xpath
- driver.find_element_by_tag_name
- driver.find_element_by_class_name
Multiple HTML elements can be identified using any of the following (the result is a list of elements found)
- driver.find_elements_by_name
- driver.find_elements_by_xpath
- driver.find_elements_by_tag_name
- driver.find_elements_by_class_name
Note: id cannot be used to identify multiple elements, because an id can only identify one element.
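One practical difference worth keeping in mind (an aside that is not spelled out in the original article): the find_element_* methods raise an exception when nothing matches, whereas the find_elements_* methods simply return an empty list. A minimal sketch against the same local page used in the snippet below:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome("./chromedriver")
driver.get("file:///home/kiprono/Desktop/untitled.html")
# find_elements_* returns a (possibly empty) list and never raises for "not found"
matches = driver.find_elements_by_class_name("no-such-class")
print(len(matches))  # 0
# find_element_* raises if no element matches
try:
    driver.find_element_by_class_name("no-such-class")
except NoSuchElementException:
    print("find_element_* raises NoSuchElementException when nothing matches")
driver.close()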
from selenium import webdriver
import time

PATH = "./chromedriver"
driver = webdriver.Chrome(PATH)
driver.get(url="file:///home/kiprono/Desktop/untitled.html")
print("Element identified by id:", driver.find_element_by_id("JDCf").text)
print("Element identified by class:", driver.find_element_by_class_name("container1").text)
print("Element identified by class:", driver.find_element_by_class_name("col4").text)
print("Element identified by tag name:", driver.find_element_by_tag_name("h1").text)
print("Element identified by xpath:", driver.find_element_by_xpath("/html/body/p").text)
time.sleep(5)
driver.close()
Output:
Element identified by id: Content 2 here
Element identified by class: Content here
Element identified by class: Content 3 here
Element identified by tag name: My First Heading
Element identified by xpath: My first paragraph.
Line 8 — Note that container1 is a class attribute value identifying two elements; driver.find_element_by_class_name returns the first element found.
To extract the text from an HTML element we use the .text attribute, as shown in the code snippet above.
Line 11 — To locate an element by xpath, inspect the site elements, right-click on the source code matching the element of interest and copy the XPath, as shown in the Figure (Fig 3) below.
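For reference, a copied XPath often looks like the absolute path used above (/html/body/p). You can also write relative XPath expressions anchored on an attribute, which tend to be more robust when the page layout changes. A small sketch against the same local page (this aside is not part of the original article):

from selenium import webdriver

driver = webdriver.Chrome("./chromedriver")
driver.get(url="file:///home/kiprono/Desktop/untitled.html")
# Absolute XPath: brittle, it breaks if the page structure changes
print(driver.find_element_by_xpath("/html/body/p").text)
# Relative XPath anchored on the id attribute: equivalent to find_element_by_id("JDCf")
print(driver.find_element_by_xpath("//*[@id='JDCf']").text)
# Relative XPath on the class attribute; contains() also matches elements with several classes
print(driver.find_element_by_xpath("//*[contains(@class, 'container1')]").text)
driver.close()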
We can identify and loop through both elements identified by the container1 class as shown below:
multiple_elements = driver.find_elements_by_class_name("container1")
for element in multiple_elements:
    print(element.text)
Output:
Content here
Content 3 here
Scraping the Actual Site
Now that you have a feel for how Selenium works, let us go ahead and scrape the actual site we are after. We will be scraping an online book store, books.toscrape.com.
We will proceed as follows.
- Scrape the details of each book on the page. Each page has 20 books. The details of each book can be found by following the URL on each card, so to get the book details we need these links.
- Scrape the books on each and every page. This means we will have one loop to scrape each book on a page and another loop to iterate through the pages.
- Moving from one page to another involves a modification of the URL, in a way that makes it trivial to predict the link to any page.
Here are the pages:
Page 1 URL: http://books.toscrape.com/ (the following link also works for page 1: http://books.toscrape.com/catalogue/page-1.html)
Page 2 URL: http://books.toscrape.com/catalogue/page-2.html
Page 3 URL: http://books.toscrape.com/catalogue/page-3.html
Page 4 URL: http://books.toscrape.com/catalogue/page-4.html
- and so on
Clearly, we can notice a pattern, which implies that looping through the pages will be simple because we can generate these URLs as we move along the loop.
Scraping one book
On inspecting the site, here is the HTML code for the highlighted region (representing one book):
<article class="product_pod">
<div class="image_container">
<a href="the-nameless-city-the-nameless-city-1_940/index.html">
<img src="../media/cache/f4/79/f479de5f305c2ac0512702cf7155bb74.jpg" alt="The Nameless City (The Nameless City #1)" class="thumbnail">
</a>
</div>
<p class="star-rating Four">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3>
<a href="the-nameless-city-the-nameless-city-1_940/index.html" title="The Nameless City (The Nameless City #1)">The Nameless City (The ...</a>
</h3>
<div class="product_price">
<p class="price_color">£38.16</p>
<p class="instock availability">
<i class="icon-ok"></i>
In stock</p>
<form>
<button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
</form>
</div>
</article>
Before we go into coding, let's make some observations:
- The book in question is inside an article tag. The tag has a class attribute with the value product_pod.
- What we need from this card is the URL, that is, the href in the a tag. To get to the href we need to move down the hierarchy as follows: class="product_pod" > h3 tag > a tag, and then get the value of the href attribute.
- In fact, all books on all pages belong to the same class product_pod and sit inside an article tag.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re

# Set up the path to the chrome driver
PATH = "/home/kiprono/chromedriver"
driver = webdriver.Chrome(PATH)
# parse the page source using the get() function
driver.get("http://books.toscrape.com/catalogue/category/books_1/page-1.html")


# We find all the books in the page and just use the first one
incategory = driver.find_elements_by_class_name("product_pod")[0]
# locate the URL to open the contents of the book
a = incategory.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
driver.get(a)
# locate our elements of interest on the page containing the book details
title = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
price = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
stock = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
stars = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")
stock = int(re.findall("\d+", stock.text)[0])

# This is a function to convert stars from string expressions to int
def StarConversion(value):
    if value == "One":
        return 1
    elif value == "Two":
        return 2
    elif value == "Three":
        return 3
    elif value == "Four":
        return 4
    elif value == "Five":
        return 5

stars = StarConversion(stars.split()[1])

description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p")

upc = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")

tax = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")

category_a = driver.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")

# gather all of our details of interest into a dictionary r
r = {
    "Title": title.text,
    "Stock": stock,
    "Stars": stars,
    "Price": price.text,
    "Tax": tax.text,
    "UPC": upc.text,
    "Description": description.text
}
# print all contents of the dictionary
print(r)

time.sleep(3)
driver.quit()
Let's go through some lines so that you understand what the code is actually doing.
Lines 16 through 18 — We are moving down the HTML code to get the URL for the book. Once we get the link, we open it on line 19. Note that line 16 locates all the books on the page, because all books belong to the same class product_pod; that is why we index the result (index 0) to get only one book.
It is also important to note that the number of stars a book has comes as a property of the p tag. Here is where the star rating is located:
<p class="star-rating Four">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
This book is rated 4 stars, but that fact is hidden as a value of the class attribute. You use get_attribute("class") to access such information. On extracting that attribute on line 24, you will get a string like this:
star-rating Four
Therefore we had to split the string before passing the result to the function in lines 28–38 to get the actual star rating as a number.
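For example, the split and the conversion can be sketched in isolation; a dictionary lookup (an alternative shown here for illustration, not what the script uses) is a compact substitute for the if/elif chain in StarConversion:

stars_attr = "star-rating Four"   # value of the class attribute
word = stars_attr.split()[1]      # "Four"
rating = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}[word]
print(rating)                     # 4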
If you extract the text content of the element located on line 23, you will end up with a string such as:
In stock (22 available)
What we need is just the number 22, meaning that there are 22 copies available. We achieve that on line 25, where we use a regular expression to extract the number.
re.findall("\d+", "In stock (22 available)")

Output:
['22']
In regular expressions, \d means the digits [0–9] and + means one or more occurrences of a character in that class; that is, in our case, we want to capture any number (irrespective of the number of digits) in the string.
Scrape all books in one page
Recall that line 16 above locates all the books on the page. Therefore, in order to scrape all the books on one page, we need to loop through the list generated on that line, as shown below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re

def StarConversion(value):
    if value == "One":
        return 1
    elif value == "Two":
        return 2
    elif value == "Three":
        return 3
    elif value == "Four":
        return 4
    elif value == "Five":
        return 5

# Scrape one category # Travel
# Set up the path to the chrome driver
PATH = "/home/kiprono/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("http://books.toscrape.com/catalogue/category/books_1/page-1.html")

# Let's find all books in the page
incategory = driver.find_elements_by_class_name("product_pod")

# Generate a list of links for each and every book
links = []
for i in range(len(incategory)):
    item = incategory[i]
    # get the href property
    a = item.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
    # Append the link to the list of links
    links.append(a)

all_details = []
# Let's loop through each link to access the page of each book
for link in links:
    # get one book url
    driver.get(url=link)
    # title of the book
    title = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
    # price of the book
    price = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
    # stock - number of copies available for the book
    stock = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
    # Stock comes as a string
    stock = int(re.findall("\d+", stock.text)[0])
    # Stars - the actual stars are in the tag's class attribute
    stars = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")
    # convert string to number. Stars come as One, Two, Three ... We need 1, 2, 3, ...
    stars = StarConversion(stars.split()[1])
    # Description
    try:
        description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p")
        description = description.text
    except:
        description = None
    # UPC ID
    upc = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")
    # Tax imposed on the book
    tax = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")
    # Category of the book
    category_a = driver.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")
    # Define a dictionary with the details we need
    r = {
        "1Title": title.text,
        "2Category": category_a.text,
        "3Stock": stock,
        "4Stars": stars,
        "5Price": price.text,
        "6Tax": tax.text,
        "7UPC": upc.text,
        "8Description": description
    }
    # append r to all details
    all_details.append(r)

time.sleep(4)
driver.close()
Once you have understood the previous example of scraping one book, this one should be easy to follow, because the only difference is that we loop through all the books on the page to extract their links (the first for loop) and then loop through those links to extract the information we need (the second for loop).
In this snippet we have also introduced a try-except block to catch cases where the information we want is not available. Specifically, some books are missing the description section.
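Isolated from the full script, the pattern looks like the sketch below (it assumes the driver is already on a book's page, as in the loop above). The script uses a bare except; catching Selenium's NoSuchElementException specifically is a slightly stricter variant of the same idea:

from selenium.common.exceptions import NoSuchElementException

try:
    description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p").text
except NoSuchElementException:
    # Some books have no description paragraph, so store None instead of failing
    description = None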
I am sure you will also enjoy watching Selenium open the pages as the script runs. Enjoy!
Scraping all books in all pages
The key concept to understand here is that we need to loop through each book on each page, that is, two loops are involved. As stated earlier, we know that the page URLs follow a pattern; for example, in our case we have:
Page 1: http://books.toscrape.com/index.html or http://books.toscrape.com/catalogue/page-1.html
Page 2: http://books.toscrape.com/catalogue/page-2.html
Page 3: http://books.toscrape.com/catalogue/page-3.html
- and so on until page 50 (the site has 50 pages).
We can easily create a Python for-loop to generate such URLs. Let's see how we can generate the first 10:
for c in range(1, 11):
    print("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))
Output:
http://books.toscrape.com/catalogue/category/books_1/page-1.html
http://books.toscrape.com/catalogue/category/books_1/page-2.html
http://books.toscrape.com/catalogue/category/books_1/page-3.html
http://books.toscrape.com/catalogue/category/books_1/page-4.html
http://books.toscrape.com/catalogue/category/books_1/page-5.html
http://books.toscrape.com/catalogue/category/books_1/page-6.html
http://books.toscrape.com/catalogue/category/books_1/page-7.html
http://books.toscrape.com/catalogue/category/books_1/page-8.html
http://books.toscrape.com/catalogue/category/books_1/page-9.html
http://books.toscrape.com/catalogue/category/books_1/page-10.html
Therefore, we can scrape all the books on all pages simply as below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re

# Scrape one category # Travel
# Set up the path to the chrome driver
PATH = "/home/kiprono/chromedriver"
driver = webdriver.Chrome(PATH)
# parse the page source using the get() function
driver.get("http://books.toscrape.com/catalogue/category/books_1/index.html")

def StarConversion(value):
    if value == "One":
        return 1
    elif value == "Two":
        return 2
    elif value == "Three":
        return 3
    elif value == "Four":
        return 4
    elif value == "Five":
        return 5

#next_button = driver.find_element_by_class_name("next").find_element_by_tag_name("a").click()
all_details = []
for c in range(1, 51):
    try:
        # get the page
        driver.get("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))
        print("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))
        # Let's find all books in the page
        incategory = driver.find_elements_by_class_name("product_pod")
        # Generate a list of links for each and every book
        links = []
        for i in range(len(incategory)):
            item = incategory[i]
            # get the href property
            a = item.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
            # Append the link to the list of links
            links.append(a)
        # Let's loop through each link to access the page of each book
        for link in links:
            # get one book url
            driver.get(url=link)
            # title of the book
            title = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
            # price of the book
            price = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
            # stock - number of copies available for the book
            stock = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
            # Stock comes as a string so we need this regex to extract the digits
            stock = int(re.findall("\d+", stock.text)[0])
            # Stars - the actual stars are values of the class attribute
            stars = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")
            # convert string to number. Stars come as One, Two, Three ... We need 1, 2, 3, ...
            stars = StarConversion(stars.split()[1])
            # Description
            try:
                description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p")
                description = description.text
            except:
                description = None
            # UPC ID
            upc = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")
            # Tax imposed on the book
            tax = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")
            # Category of the book
            category_a = driver.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")
            # Define a dictionary with the details we need
            r = {
                "1Title": title.text,
                "2Category": category_a.text,
                "3Stock": stock,
                "4Stars": stars,
                "5Price": price.text,
                "6Tax": tax.text,
                "7UPC": upc.text,
                "8Description": description
            }
            # append r to all details
            all_details.append(r)
    except:
        # Let's just close the browser if we run into an error
        driver.close()

# save the information into a CSV file
df = pd.DataFrame(all_details)
df.to_csv("all_pages.csv")
time.sleep(3)
driver.close()
The only difference between this snippet and the previous one is that we are looping through the pages with an outer for loop, and that we also write all the scraped information into a CSV file named all_pages.csv.
We are also using try-except to handle exceptions that may arise in the process of scraping. In case an exception is raised, we just exit and close the browser in the except block. A sketch of a variation on this pattern is shown below.
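A variation on this error handling (not what the original script does) is to wrap the whole run in try/finally, so that whatever has been scraped so far is always saved and the browser is always closed, even if an error interrupts the loop. A minimal sketch that only collects book titles:

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome("/home/kiprono/chromedriver")
all_details = []
try:
    for c in range(1, 51):
        driver.get("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))
        for book in driver.find_elements_by_class_name("product_pod"):
            # Only the title is collected here; the full per-book scrape would go in its place
            title = book.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_attribute("title")
            all_details.append({"Title": title})
except Exception as e:
    print("Stopping early:", e)
finally:
    # Runs whether or not an exception occurred
    pd.DataFrame(all_details).to_csv("all_pages_titles.csv")
    driver.quit()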
Conclusion
Web scraping is an important way of collecting data from the internet. Different websites have different designs, and therefore there is no single scraper that can be used on every site. The most essential skill is to understand web scraping at a high level: knowing how to locate web elements and being able to identify and handle errors when they arise. This kind of understanding comes with practising web scraping on different sites.
Lastly, some websites do not permit their data to be scraped, especially scraping and then publishing the results. It is therefore important to check a site's policies before scraping.
As always, thank you for reading :-)
Translated from: https://towardsdatascience.com/web-scraping-e-commerce-website-using-selenium-1088131c8541