Scraping Links from a Webpage with Python
Prerequisites:
urllib3: A powerful, user-friendly HTTP client for Python. Its features include thread safety, client-side SSL/TLS verification, connection pooling, and file uploads with multipart encoding. Note that the program in this article uses urllib.request from the Python standard library, which requires no installation; urllib3 is a separate third-party package.
Installing urllib3:
$ pip install urllib3
BeautifulSoup: A Python library for pulling data out of HTML and XML files, which makes it well suited to scraping information from webpages.
Installing BeautifulSoup:
$ pip install beautifulsoup4
Commands used:
html = urllib.request.urlopen(url).read(): Opens the URL and reads the entire response body, newlines included, into a single bytes object.
soup = BeautifulSoup(html, 'html.parser'): Parses the whole document with Python's built-in HTML parser and returns a BeautifulSoup object that can be searched.
tags = soup('a'): Returns a list of all the anchor (<a>) tags in the document; it is shorthand for soup.find_all('a').
tag.get('href', None): Returns the value of the tag's href attribute, or None if the tag has no href.
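The parsing commands above can be tried without network access by feeding BeautifulSoup an inline HTML snippet. This is a minimal sketch: the SAMPLE_HTML content and its example.com links are made up for illustration and stand in for the bytes that read() would return.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, inlined so the example needs no network access.
# It is a bytes literal to mimic what urlopen(url).read() returns.
SAMPLE_HTML = b"""
<html><body>
  <a href="https://example.com/docs">Docs</a>
  <a href="/about">About</a>
  <a name="top">Anchor without an href</a>
</body></html>
"""

# BeautifulSoup accepts bytes as well as str.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Calling the soup object is shorthand for soup.find_all("a").
tags = soup("a")

# tag.get returns None when the attribute is absent.
links = [tag.get("href", None) for tag in tags]
print(links)  # ['https://example.com/docs', '/about', None]
```

The third tag has no href attribute, so its entry is None; a real scraper would usually filter such entries out before using the list.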
Python program to scrape links from a webpage
# import statements
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
# Get links
# URL of a WebPage
url = input("Enter URL: ")
# Open the URL and read the whole page
html = urllib.request.urlopen(url).read()
# Parse the string
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
# Returns a list of all the <a> tag objects
tags = soup('a')
# Print the link from each tag in the list
for tag in tags:
    # Get the value of the href attribute (None if it is missing)
    print(tag.get('href', None))
Output:
Enter URL: https://www.google.com/
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
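The output above mixes absolute URLs with relative paths such as /preferences?hl=en, which are only meaningful relative to the page they came from. One possible way to normalize them is urllib.parse.urljoin from the standard library. The sketch below is an illustration, not part of the original program: the absolutize helper is a hypothetical name, and the href list is a hand-copied subset of the output.

```python
from urllib.parse import urljoin

# The page the links were scraped from.
base_url = "https://www.google.com/"

# A few hrefs copied from the output above, plus None to represent
# an anchor tag that has no href attribute.
hrefs = [
    "https://maps.google.com/maps?hl=en&tab=wl",
    "/preferences?hl=en",
    "/intl/en/about.html",
    None,
]


def absolutize(base, links):
    """Hypothetical helper: resolve each link against base, skipping empty ones."""
    return [urljoin(base, link) for link in links if link]


for link in absolutize(base_url, hrefs):
    print(link)
```

urljoin leaves URLs that are already absolute unchanged and resolves the relative paths against the base, so /preferences?hl=en becomes https://www.google.com/preferences?hl=en.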
Translated from: https://www.includehelp.com/python/scraping-links-from-a-webpage.aspx