Learn Web Scraping in 15 Minutes
What is Web Scraping?
Web scraping, also known as web data extraction, is the process of retrieving or “scraping” data from a website. The collected information is then exported into a format that is more useful for the user, typically a spreadsheet or an API. Although web scraping can be done manually, automated tools are usually preferred because they are cheaper and work at a faster rate.
Is Web Scraping Legal?
The simplest way to check is to look at the website’s robots.txt file, which you can find by appending “/robots.txt” to the site’s domain. If the rules for ‘User-agent: *’ disallow the paths you want to crawl, then you’re not allowed to scrape them. For this article, I am scraping the Flipkart website, so the file to check is at www.flipkart.com/robots.txt.
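As a quick illustration, here is a minimal sketch using Python’s built-in urllib.robotparser module to check whether a given user agent may fetch a URL; the search URL below is just the laptops page used later in this article, and the answer depends on the live robots.txt at the time you run it.

from urllib import robotparser

# Point the parser at the site's robots.txt (Flipkart, as used in this article)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.flipkart.com/robots.txt")
rp.read()

# Ask whether the generic user agent "*" may fetch the laptops search page
print(rp.can_fetch("*", "https://www.flipkart.com/search?q=laptops"))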
Libraries used for Web Scraping
BeautifulSoup: BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your favourite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
Pandas: Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
Why BeautifulSoup?
It is an excellent tool for pulling information out of a webpage. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to pick out specific information. For more details, refer to the BeautifulSoup documentation.
Scraping the Flipkart Website
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
First, we import BeautifulSoup and the requests library, the two essential libraries for web scraping, along with csv and pandas for handling the extracted data.
requests: requests is one of the packages that makes HTTP in Python pleasant to work with; it is built on top of the urllib3 module.
req = requests.get("https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1") # URL of the website which you want to scrape
content = req.content # Get the content
To get the contents of the specified URL, we submit a request using the requests library. The URL here is a Flipkart search results page for laptops.
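It is also worth confirming that the request actually succeeded before parsing anything. Here is a small sketch under the assumption that the site serves the page normally; the User-Agent string is purely illustrative, since some sites respond differently to default client strings:

url = ("https://www.flipkart.com/search?q=laptops"
       "&otracker=search&otracker1=search&marketplace=FLIPKART"
       "&as-show=on&as=off&page=1")
headers = {"User-Agent": "Mozilla/5.0"}  # illustrative; default client strings are sometimes blocked
req = requests.get(url, headers=headers, timeout=10)
req.raise_for_status()  # raise an HTTPError on 4xx/5xx instead of parsing an error page
content = req.content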
This Flipkart page lists 24 different laptops. Looking at it, we try to extract the different features of each laptop: the description (model name along with the specification), Processor (Intel/AMD, i3/i5/i7/Ryzen 3/Ryzen 5/Ryzen 7), RAM (4/8/16 GB), Operating System (Windows/Mac), Disk Drive Storage (SSD/HDD, 256 GB/512 GB/1 TB), Display (13.3/14/15.6 inches), Warranty (Onsite/Limited Hardware/International), Rating (4.1–5), and Price (Rupees).
soup = BeautifulSoup(content, 'html.parser')
print(soup.prettify())
<!DOCTYPE html>
<html lang="en">
<head>
<link href="https://rukminim1.flixcart.com" rel="dns-prefetch"/>
<link href="https://img1a.flixcart.com" rel="dns-prefetch"/>
<link href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/css/app.chunk.21be2e.css" rel="stylesheet"/>
<link as="image" href="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fk-logo_9fddff.png" rel="preload"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="102988293558" property="fb:page_id"/>
<meta content="658873552,624500995,100000233612389" property="fb:admins"/>
<meta content="noodp" name="robots"/>
<link href="https://img1a.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon">
....
....
</script>
<script async="" defer="" id="omni_script" nonce="7596241618870897262" src="//img1a.flixcart.com/www/linchpin/batman-returns/omni/omni16.js">
</script>
</body>
</html>
Here we pass the content variable and the parser to use, in this case Python’s built-in HTML parser. soup is now a BeautifulSoup object wrapping our parsed HTML, and soup.prettify() prints the page’s entire markup in a readable, indented form.
Extracting the Descriptions
When you right-click an element and choose “Inspect”, the browser’s inspector box opens. There we observe that the class name of the description divs is ‘_3wU53n’, so we use the find_all method to extract the descriptions of the laptops.
desc = soup.find_all('div', class_='_3wU53n')
desc
[<div class="_3wU53n">HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 14s-cs3010TU Laptop</div>,
<div class="_3wU53n">HP 14q Core i3 8th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0029TU Thin and Light Laptop</div>,
<div class="_3wU53n">Asus VivoBook 15 Ryzen 3 Dual Core - (4 GB/1 TB HDD/Windows 10 Home) M509DA-EJ741T Laptop</div>,
<div class="_3wU53n">Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...</div>,
....
....
<div class="_3wU53n">MSI GP65 Leopard Core i7 10th Gen - (32 GB/1 TB HDD/512 GB SSD/Windows 10 Home/8 GB Graphics/NVIDIA Ge...</div>,
<div class="_3wU53n">Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/2 GB Graphics) X509JB-EJ591T Laptop</div>]
We extract the descriptions with find_all, grabbing every div tag whose class name is ‘_3wU53n’; this returns all matching div tags. Because class is a reserved keyword in Python, BeautifulSoup takes the class_ keyword argument instead. An equivalent CSS-selector form is sketched below.
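If you prefer CSS selectors, the same query can be written with BeautifulSoup’s select method, which returns the same list of tags (a one-line sketch):

desc = soup.select('div._3wU53n')  # CSS selector: every div with class _3wU53n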
descriptions = [] # Create a list to store the descriptions
for i in range(len(desc)):
    descriptions.append(desc[i].text)
len(descriptions)
24 # Number of laptops
descriptions
['HP 14s Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home) 14s-cs3010TU Laptop',
'HP 14q Core i3 8th Gen - (8 GB/256 GB SSD/Windows 10 Home) 14q-cs0029TU Thin and Light Laptop',
'Asus VivoBook 15 Ryzen 3 Dual Core - (4 GB/1 TB HDD/Windows 10 Home) M509DA-EJ741T Laptop',
'Acer Aspire 7 Core i5 9th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA Geforce GTX 1650...',
....
....
'MSI GP65 Leopard Core i7 10th Gen - (32 GB/1 TB HDD/512 GB SSD/Windows 10 Home/8 GB Graphics/NVIDIA Ge...',
'Asus Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/2 GB Graphics) X509JB-EJ591T Laptop']
We create an empty list to store the descriptions of all the laptops, iterate through the tags, and use the .text attribute to extract only the text content of each, appending it to the descriptions list on every iteration. After the loop, the list holds the description of every laptop (model name along with specifications). Child tags can also be reached with dot access, as sketched below.
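For instance, here is a minimal sketch of dot access and the .text attribute on the parsed page (the exact output depends on the live page):

print(soup.title)        # the first <title> tag in the document
print(soup.title.text)   # just its text content
print(desc[0].text)      # text of the first description div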
We apply the same approach to extract all the other features.
Extracting the Specifications
We observe that the various specifications sit under the same div, and the class names are identical across all five features (Processor, RAM, Disk Drive, Display, Warranty).
All the features live inside ‘li’ tags sharing the single class name ‘tVe95H’, so we need a small trick to separate them: grab all the tags at once, then route each one into the right list based on a keyword in its text.
commonclass = soup.find_all('li', class_='tVe95H') # All the spec li tags share this class

# Create empty lists for the features
processors = []
ram = []
os = []
storage = []
inches = []
warranty = []

for i in range(0, len(commonclass)):
    p = commonclass[i].text # Extracting the text from the tags
    if("Core" in p):
        processors.append(p)
    elif("RAM" in p):
        # If "RAM" is present in the text then append it to the ram list;
        # the remaining branches do the same for the other features
        ram.append(p)
    elif("HDD" in p or "SSD" in p):
        storage.append(p)
    elif("Operating" in p):
        os.append(p)
    elif("Display" in p):
        inches.append(p)
    elif("Warranty" in p):
        warranty.append(p)
The .text attribute extracts the text from each tag, giving us the Processor, RAM, Disk Drive, Display, and Warranty values. We can then confirm that each list has exactly one entry per laptop.
print(len(processors))
print(len(warranty))
print(len(os))
print(len(ram))
print(len(inches))
24
24
24
24
24
Extracting the Price
price = soup.find_all('div', class_='_1vC4OE _2rQ-NK')
# Extracting price of each laptop from the website
prices = []
for i in range(len(price)):
    prices.append(price[i].text)
len(prices)
24
prices
['₹52,990',
'₹34,990',
'₹29,990',
'₹56,990',
'₹54,990',
....
....
'₹78,990',
'₹1,59,990',
'₹52,990']
In the same manner, we extract the price of each laptop and append it to the prices list. Note that the scraped prices are strings like ‘₹52,990’; a sketch for turning them into numbers follows.
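Because the scraped prices are display strings, here is a small sketch for converting them to integers, assuming every entry follows the ‘₹1,23,456’ pattern seen above:

def parse_price(price_str):
    # '₹1,59,990' -> 159990: drop the rupee sign and the comma separators
    return int(price_str.replace('₹', '').replace(',', ''))

numeric_prices = [parse_price(p) for p in prices]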
rating = soup.find_all('div', class_='hGSR34')
# Extracting the ratings of each laptop from the website
ratings = []
for i in range(len(rating)):
    ratings.append(rating[i].text)
len(ratings)
37
ratings
['4.4',
'4.5',
'4.4',
'4.4',
'4.2',
'4.5',
'4.4',
'4.5',
'4.4',
'4.2',
....
....]
Here the length of ratings comes out to 37 rather than 24. What’s the reason behind it?
The recommended laptops on the page use the same rating class name as the 24 featured laptops, so find_all picks up their ratings as well, inflating the count from the expected 24 to 37. One way around this is to scope the search to each product card, as sketched below.
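A hedged sketch of the fix: search for the rating inside each product card rather than across the whole page. The card class name ‘_1UoZlX’ below is an assumption for illustration only (Flipkart’s generated class names change frequently), so inspect the live page for the real one:

cards = soup.find_all('div', class_='_1UoZlX')  # hypothetical product-card class; check the live page
ratings = []
for card in cards:
    r = card.find('div', class_='hGSR34')  # look for the rating div inside this card only
    ratings.append(r.text if r else None)  # None when a card has no rating yet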
Last but not least, we merge all the features into a single data frame and store the data in the required format!
d = {'Description': descriptions, 'Processor': processors, 'RAM': ram, 'Operating System': os, 'Storage': storage, 'Display': inches, 'Warranty': warranty, 'Price': prices}
# ratings is left out here since it has 37 entries instead of 24
dataset = pd.DataFrame(data = d)
The final dataset
Saving the dataset to a CSV file
dataset.to_csv('laptops.csv')
This writes the whole dataset to a CSV file.
To verify it, we read the saved CSV file back in a Jupyter Notebook.
df = pd.read_csv('laptops.csv')
df.shape
(24, 9)
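The shape reports 9 columns rather than our 8 features because to_csv also wrote the DataFrame’s index as an unnamed extra column; passing index=False when saving avoids this (a small sketch):

dataset.to_csv('laptops.csv', index=False)  # skip writing the row index as a column
pd.read_csv('laptops.csv').shape            # now (24, 8)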
As this is a dynamic website, the content keeps changing, so your exact results will differ!
You can always refer to my GitHub Repository for the entire code.
Connect with me on LinkedIn here
“For every $20 you spend on web analytics tools, you should spend $80 on the brains to make sense of the data.” — Jeff Sauer
I hope you found the article insightful. I would love to hear feedback so I can improve it and come back with better content.
Thank you so much for reading!
Original article: https://towardsdatascience.com/learn-web-scraping-in-15-minutes-27e5ebb1c28e