Web Scraping an E-Commerce Website Using Selenium

In this article, we will go through the process of scraping an e-commerce website. I have designed this particular post to be beginner-friendly, so if you have no prior knowledge about web scraping or Selenium, you can still follow along.

To understand web scraping, we need to understand HTML code basics. We will cover that as well.

Basics of HTML

There is a lot to say about HTML basics, but we will focus on the things that are helpful (at least most of the time) in web scraping.

Fig 1 — LEFT: a simple HTML code; RIGHT: an HTML element (Source: Author)
  • HTML element (Fig 1 RIGHT) — an HTML element is the collection of a start tag, its attributes, an end tag and everything in between.

  • Attributes — special words used inside a start tag to control the element’s behavior. An attribute and its value are used together to reference a tag and its content for styling. The most important attributes we will use in web scraping include class, id and name.

  • class and id attributes — HTML elements can have one or more classes, separated by spaces (see Fig 1 LEFT above). On the other hand, HTML elements must have unique id attributes, that is, an id cannot be used to reference more than one HTML element (see the sample page sketched below).

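To make these ideas concrete, here is a rough sketch of what a simple page like the one in Fig 1 LEFT might look like. The exact file used later in this article (untitled.html) is not reproduced in the original post, so treat this markup as an illustrative reconstruction, chosen only to be consistent with the class names (container1, col4), the id (JDCf) and the outputs shown further below:

<!DOCTYPE html>
<html>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <!-- two elements share the class "container1"; the id "JDCf" is unique -->
    <div class="container1">Content here</div>
    <div id="JDCf">Content 2 here</div>
    <div class="container1 col4">Content 3 here</div>
  </body>
</html>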

Simple Web Scraping

Before we go into the actual scraping of an e-commerce site, let us scrape the site shown in the figure below (rendered from the HTML code in Fig 1 LEFT).

Fig 2: Inspecting the sample page in the browser

From the figure (Fig 2) above, note the following:

  1. This is the Uniform Resource Locator (URL). In this particular case, the locator points to HTML code stored locally.

  2. The button labelled 2 is very important when you are hovering over the page to identify the elements of interest. Once your object of interest is highlighted, the corresponding tag in the source will also be highlighted.

  3. This is the page source code. It is just HTML code like in Fig 1 LEFT. You can view this page source by pressing Ctrl+Shift+I to inspect the page, or by right-clicking on the site and choosing Inspect Element or Inspect (whichever is available among the options).

Prerequisites

To conduct web scraping, we need the selenium Python package (if you don’t have the package, install it using pip) and a browser webdriver. For selenium to work, it must have access to the driver. Download the web driver matching your browser from here: Chrome, Firefox, Edge or Safari. Once the web driver is downloaded, save it and note the path. By default, selenium will look for the driver in the current working directory, so you may want to save the driver in the same directory as the Python script. You are, however, not obliged to do this: you can save it anywhere and provide the full path to the executable on line 5 below.

from selenium import webdriver
import time

PATH = "./chromedriver"
driver = webdriver.Chrome(PATH)
driver.get(url="file:///home/kiprono/Desktop/untitled.html")
time.sleep(5)
driver.close()
  • Lines 1 and 2 import the necessary libraries.

  • Lines 4 and 5 — Define the path to the web driver you downloaded and instantiate a Chrome driver. I am using the Chrome web driver, but you can just as well use Firefox, Microsoft Edge or Safari.

  • Line 6 — The driver launches a Chrome session when it is instantiated in line 5 and fetches the URL source in line 6.

  • Lines 7 and 8 — Line 7 pauses Python execution for 5 seconds before the browser is closed in line 8. Pausing is important so that you can glance at what is happening in the browser, and closing ensures that the browsing session is ended; otherwise we would end up with many open Chrome windows. Sleeping can also matter when waiting for a page to load, but there is a more proper way of initiating a wait (see the sketch below).

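For completeness, here is a minimal sketch of that more proper approach using Selenium's explicit waits. The element waited for (the h1 tag) is only an illustrative choice:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

PATH = "./chromedriver"
driver = webdriver.Chrome(PATH)
driver.get(url="file:///home/kiprono/Desktop/untitled.html")
# wait up to 10 seconds for the <h1> element to appear instead of sleeping blindly
heading = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)
print(heading.text)
driver.close()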

Locating the Elements

This is the most important part of web scraping. In this section we need to learn how to get HTML elements by using different attributes.

Recall: elements of a web page can be identified by using a class, id, tag, name and/or xpath. Ids are unique but classes are not. This means that a given class can identify more than one web element, whereas one id identifies one and only one element.

A single HTML element can be identified using any of the following methods:

  • driver.find_element_by_id
  • driver.find_element_by_name
  • driver.find_element_by_xpath
  • driver.find_element_by_tag_name
  • driver.find_element_by_class_name

Multiple HTML elements can be identified using any of the following (the result is a list of the elements found):

  • driver.find_elements_by_name
  • driver.find_elements_by_xpath
  • driver.find_elements_by_tag_name
  • driver.find_elements_by_class_name

Note: id cannot be used to identify multiple elements because id can only identify one element.

from selenium import webdriver
import time

PATH = "./chromedriver"
driver = webdriver.Chrome(PATH)
driver.get(url="file:///home/kiprono/Desktop/untitled.html")
print("Element identified by id:", driver.find_element_by_id("JDCf").text)
print("Element identified by class:", driver.find_element_by_class_name("container1").text)
print("Element identified by class:", driver.find_element_by_class_name("col4").text)
print("Element identified by tag name:", driver.find_element_by_tag_name("h1").text)
print("Element identified by xpath:", driver.find_element_by_xpath("/html/body/p").text)

time.sleep(5)
driver.close()

Output:

Element identified by id: Content 2 here
Element identified by class: Content here
Element identified by class: Content 3 here
Element identified by tag name: My First Heading
Element identified by xpath: My first paragraph.
  • Line 8 — Note that container1 is a class attribute value identifying two elements, and driver.find_element_by_class_name returns the first element found.

  • To extract the text from an HTML element, we use the .text attribute as shown in the code snippet above.

  • Line 11 — To locate an element by XPath, inspect the site elements, right-click on the source code matching the element of interest and copy the XPath, as shown in the figure (Fig 3) below.

Fig 3: Getting the XPath of a particular element

We can identify and loop through both elements identified by the container1 class as shown below:

multiple_elements = driver.find_elements_by_class_name("container1")
for element in multiple_elements:
    print(element.text)

Output:

Content here
Content 3 here

Scraping the Actual Site

Now that you have a feel for how Selenium works, let us go ahead and scrape the actual site. We will be scraping an online book store [link].

Fig 4: The site we want to scrape (Source: Author)

We will proceed as follows.

  • Scrape the details of each book on the page. Each page has 20 books. The details of each book can be found by following the URL on each card, so to get the book details we need these links.

  • Scrape the books on each and every page. This means that we will have one loop to scrape each book on a page and another to iterate through the pages.

  • Moving from one page to another involves a modification of the URL in a way that makes it trivial to predict the link to any page.

Here are the pages:

  • Page 1 URL : http://books.toscrape.com/ . The following link also works for page 1 : http://books.toscrape.com/catalogue/page-1.html

  • Page 2 URL : http://books.toscrape.com/catalogue/page-2.html

  • Page 3 URL : http://books.toscrape.com/catalogue/page-3.html

  • Page 4 URL : http://books.toscrape.com/catalogue/page-4.html

  • and so on

Clearly, we can notice a pattern, implying that looping through the pages will be simple because we can generate these URLs as we move along the loop.

Scraping One Book

Fig 5: Inspection of the elements for one book

On inspecting the site, here is the HTML code for the highlighted region (representing one book):

<article class="product_pod">
  <div class="image_container">
    <a href="the-nameless-city-the-nameless-city-1_940/index.html">
      <img src="../media/cache/f4/79/f479de5f305c2ac0512702cf7155bb74.jpg" alt="The Nameless City (The Nameless City #1)" class="thumbnail">
    </a>
  </div>
  <p class="star-rating Four">
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
    <i class="icon-star"></i>
  </p>
  <h3>
    <a href="the-nameless-city-the-nameless-city-1_940/index.html" title="The Nameless City (The Nameless City #1)">The Nameless City (The ...</a>
  </h3>
  <div class="product_price">
    <p class="price_color">£38.16</p>
    <p class="instock availability">
      <i class="icon-ok"></i>
      In stock</p>
    <form>
      <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
    </form>
  </div>
</article>

Before we go into coding, let's make some observations:

  • The book in question is inside an article tag. The tag has a class attribute with the value product_pod.

  • What we need from this card is the URL, that is, the href in the a tag. To get to the href we need to move down the hierarchy as follows: class="product_pod" > h3 tag > a tag, and then get the value of the href attribute.

  • In fact, all books on all pages belong to the same class product_pod and sit inside an article tag (see the quick check below).

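As a quick sanity check (a sketch, assuming the driver has already opened one of the listing pages), you can confirm that locating this class picks up all 20 cards on the page:

# assumes `driver` is already on a listing page such as page-1.html
books = driver.find_elements_by_class_name("product_pod")
print(len(books))  # expected: 20 books per page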

Fig 6: A page with the details of one book (annotated to show our details of interest)

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re

# Set up the path to the chrome driver
PATH = "/home/kiprono/chromedriver"
driver = webdriver.Chrome(PATH)

# parse the page source using get() function
driver.get("http://books.toscrape.com/catalogue/category/books_1/page-1.html")

# We find all the books in the page and just use the first one
incategory = driver.find_elements_by_class_name("product_pod")[0]
# locate the URL to open the contents of the book
a = incategory.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
driver.get(a)
# locate our elements of interest on the page containing book details
title = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
price = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
stock = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
stars = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")
stock = int(re.findall("\d+", stock.text)[0])

# This is a function to convert stars from string expressions to int
def StarConversion(value):
    if value == "One":
        return 1
    elif value == "Two":
        return 2
    elif value == "Three":
        return 3
    elif value == "Four":
        return 4
    elif value == "Five":
        return 5

stars = StarConversion(stars.split()[1])

description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p")
upc = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")
tax = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")
category_a = driver.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")

# put all of our details of interest into a dictionary r
r = {
    "Title": title.text,
    "Stock": stock,
    "Stars": stars,
    "Price": price.text,
    "Tax": tax.text,
    "UPC": upc.text,
    "Description": description.text
}
# print all contents of the dictionary
print(r)

time.sleep(3)
driver.quit()

Let's go through some lines so that you understand what the code is actually doing:

  • Lines 16 through 18 — We are moving down the HTML code to get the URL for the book; once we have the link, we open it in line 19. Note that line 16 locates all books on the page (all the books belong to the same class product_pod), which is why we index it (index 0) to get only one book.

  • It is also important to note that the number of stars a book has comes as part of the class attribute of the p tag. Here is where the star rating is located:

<p class="star-rating Four">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

This book is rated 4 stars, but that fact is hidden as a value of the class attribute. You use get_attribute("class") to access such information. On extracting the attribute on line 24, you will get a string like this:

star-rating Four

Therefore, we have to split this string before using the function defined in lines 28–38 to get the actual star rating as a number (see the example below).

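For example, splitting the class string shown above and passing the second token to StarConversion yields the numeric rating:

stars = "star-rating Four"
print(StarConversion(stars.split()[1]))  # prints 4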

  • If you extract the text content of the element located in line 23, you will end up with a string such as:

In stock (22 available)

What we need is just the number 22, meaning that there are 22 copies available. We achieve that in line 25, where we use a regular expression to extract the number.

re.findall("\d+", "In stock (22 available)")

Output:

['22']
  • In regular expressions, \d matches the digits [0–9] and + means one or more occurrences of a character in that class; in our case, we want to capture any number (irrespective of the number of digits) in the string (see the example below).

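For instance, the same pattern captures a count with any number of digits just as well (the value 117 here is purely illustrative):

import re
print(re.findall("\d+", "In stock (117 available)"))  # ['117']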

Scraping All Books on One Page

Recall that line 16 above locates all the books on the page. Therefore, in order to scrape all the books on one page, we need to loop through the list generated on that line, as shown below.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re

def StarConversion(value):
    if value == "One":
        return 1
    elif value == "Two":
        return 2
    elif value == "Three":
        return 3
    elif value == "Four":
        return 4
    elif value == "Five":
        return 5

# Set up the path to the chrome driver
PATH = "/home/kiprono/chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("http://books.toscrape.com/catalogue/category/books_1/page-1.html")

# Lets find all books in the page
incategory = driver.find_elements_by_class_name("product_pod")

# Generate a list of links for each and every book
links = []
for i in range(len(incategory)):
    item = incategory[i]
    # get the href property
    a = item.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
    # Append the link to list links
    links.append(a)

all_details = []
# Lets loop through each link to access the page of each book
for link in links:
    # get one book url
    driver.get(url=link)
    # title of the book
    title = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
    # price of the book
    price = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
    # stock - number of copies available for the book
    stock = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
    # Stock comes as a string
    stock = int(re.findall("\d+", stock.text)[0])
    # Stars - actual stars are in the class attribute
    stars = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")
    # convert string to number. Stars are like One, Two, Three ... We need 1, 2, 3, ...
    stars = StarConversion(stars.split()[1])
    # Description
    try:
        description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p")
        description = description.text
    except:
        description = None
    # UPC ID
    upc = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")
    # Tax imposed on the book
    tax = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")
    # Category of the book
    category_a = driver.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")
    # Define a dictionary with the details we need
    r = {
        "1Title": title.text,
        "2Category": category_a.text,
        "3Stock": stock,
        "4Stars": stars,
        "5Price": price.text,
        "6Tax": tax.text,
        "7UPC": upc.text,
        "8Description": description
    }
    # append r to all details
    all_details.append(r)

time.sleep(4)
driver.close()

Once you have understood the previous example of scraping one book, this one should be easy to follow, because the only difference is that we first loop through all the books on the page to extract their links and then loop through those links to extract the information we need.

In this snippet we have also introduced a try-except block to catch cases where the information we want is not available. Specifically, some books are missing the description section.

I am sure you will also enjoy watching Selenium open the pages as the script runs. Enjoy!

Scraping All Books on All Pages

The key concept to understand here is that we need to loop through each book on each page; that is, two loops are involved. As stated earlier, the page URLs follow a pattern; for example, in our case we have:

  • Page 1 : http://books.toscrape.com/index.html or http://books.toscrape.com/catalogue/page-1.html

  • Page 2: http://books.toscrape.com/catalogue/page-2.html

  • Page 3: http://books.toscrape.com/catalogue/page-3.html

  • and so on until page 50 (the site has 50 pages).

We can easily create a Python for-loop to generate such URLs. Let's see how we can generate the first 10:

for c in range(1, 11):
    print("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))

Output:

http://books.toscrape.com/catalogue/category/books_1/page-1.html
http://books.toscrape.com/catalogue/category/books_1/page-2.html
http://books.toscrape.com/catalogue/category/books_1/page-3.html
http://books.toscrape.com/catalogue/category/books_1/page-4.html
http://books.toscrape.com/catalogue/category/books_1/page-5.html
http://books.toscrape.com/catalogue/category/books_1/page-6.html
http://books.toscrape.com/catalogue/category/books_1/page-7.html
http://books.toscrape.com/catalogue/category/books_1/page-8.html
http://books.toscrape.com/catalogue/category/books_1/page-9.html
http://books.toscrape.com/catalogue/category/books_1/page-10.html

Therefore, we can scrape all books on all pages simply as below:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import numpy as np
import re

# Set up the path to the chrome driver
PATH = "/home/kiprono/chromedriver"
driver = webdriver.Chrome(PATH)
# parse the page source using get() function
driver.get("http://books.toscrape.com/catalogue/category/books_1/index.html")

def StarConversion(value):
    if value == "One":
        return 1
    elif value == "Two":
        return 2
    elif value == "Three":
        return 3
    elif value == "Four":
        return 4
    elif value == "Five":
        return 5

# next_button = driver.find_element_by_class_name("next").find_element_by_tag_name("a").click()
all_details = []
for c in range(1, 51):
    try:
        # get the page
        driver.get("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))
        print("http://books.toscrape.com/catalogue/category/books_1/page-{}.html".format(c))
        # Lets find all books in the page
        incategory = driver.find_elements_by_class_name("product_pod")
        # Generate a list of links for each and every book
        links = []
        for i in range(len(incategory)):
            item = incategory[i]
            # get the href property
            a = item.find_element_by_tag_name("h3").find_element_by_tag_name("a").get_property("href")
            # Append the link to list links
            links.append(a)
        # Lets loop through each link to access the page of each book
        for link in links:
            # get one book url
            driver.get(url=link)
            # title of the book
            title = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/h1")
            # price of the book
            price = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[1]")
            # stock - number of copies available for the book
            stock = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[2]")
            # Stock comes as a string, so we use this regex to extract the digits
            stock = int(re.findall("\d+", stock.text)[0])
            # Stars - actual stars are values of the class attribute
            stars = driver.find_element_by_xpath("//*[@id='content_inner']/article/div[1]/div[2]/p[3]").get_attribute("class")
            # convert string to number. Stars are like One, Two, Three ... We need 1, 2, 3, ...
            stars = StarConversion(stars.split()[1])
            # Description
            try:
                description = driver.find_element_by_xpath("//*[@id='content_inner']/article/p")
                description = description.text
            except:
                description = None
            # UPC ID
            upc = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[1]/td")
            # Tax imposed on the book
            tax = driver.find_element_by_xpath("//*[@id='content_inner']/article/table/tbody/tr[5]/td")
            # Category of the book
            category_a = driver.find_element_by_xpath("//*[@id='default']/div/div/ul/li[3]/a")
            # Define a dictionary with the details we need
            r = {
                "1Title": title.text,
                "2Category": category_a.text,
                "3Stock": stock,
                "4Stars": stars,
                "5Price": price.text,
                "6Tax": tax.text,
                "7UPC": upc.text,
                "8Description": description
            }
            # append r to all details
            all_details.append(r)
    except:
        # Lets just close the browser if we run into an error
        driver.close()

# save the information into a CSV file
df = pd.DataFrame(all_details)
df.to_csv("all_pages.csv")

time.sleep(3)
driver.close()

The only difference between this snippet and the previous one is the fact that we are now looping through the pages with the outer for loop, and the fact that we also write all the scraped information into a CSV file named all_pages.csv.

We are also using try-except to handle exceptions that may arise in the process of scraping. In case an exception is raised, we just close the browser in the except block.

Fig 7: Head view of the resulting CSV file (all_pages.csv)
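To reproduce this head view, you can load the resulting file back with pandas (assuming the script above has finished and all_pages.csv is in the working directory):

import pandas as pd

df = pd.read_csv("all_pages.csv")
print(df.head())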

Conclusion

Web scraping is an important process for collecting data from the internet. Different websites have different designs, and therefore there is no single scraper that can be used on every site. The most essential skill is to understand web scraping at a high level: knowing how to locate web elements and being able to identify and handle errors when they arise. This kind of understanding comes from practising web scraping on different sites.

Lastly, some websites do not permit their data to be scraped, especially scraping and publishing the results. It is therefore important to check a site's policies before scraping it.

As always, thank you for reading :-)

Translated from: https://towardsdatascience.com/web-scraping-e-commerce-website-using-selenium-1088131c8541
