Python web scraping: downloading the images in Zhihu answers

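1. First, install a Python environment on your computer

I provide code for both Python 2.7 and 3.6, but this article uses Python 3.6 only.

Set up a Python 3.6 environment dedicated to scraping. If you already have Python 3.6 installed, skip this step and go straight to installing the required Python libraries.

Once the installation is done, open a terminal and check the Python version:

python --version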
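The same check can be made from inside a script, so the scraper fails early on an old interpreter. This is a minimal sketch: the helper name `is_supported` and the constant `MIN_VERSION` are made up for illustration, and the 3.6 minimum is taken from the version this article targets.

```python
import sys

MIN_VERSION = (3, 6)  # assumption: the article's code targets Python 3.6 or newer


def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the given version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if not is_supported():
    raise SystemExit("Python 3.6 or newer is required, found " + sys.version.split()[0])
print("Python version OK:", sys.version.split()[0])
```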

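2. selenium

Zhihu's front end is built with React, and page content is revealed step by step as the user scrolls and clicks. To collect a large number of images, we use the selenium library to simulate those scroll and click actions in a real browser.

Install selenium with pip:

pip install -U selenium

Once it is installed, I recommend reading the selenium documentation on how to use it. In short, selenium needs a browser driver to run; for Chrome that is chromedriver. After downloading it, Mac users can copy it into /usr/bin or /usr/local/bin (recent macOS versions write-protect /usr/bin, so /usr/local/bin is the practical choice), or put it in a custom directory and add that directory to PATH. The same applies to Windows users: place the driver in a directory that is on PATH.

chromedriver must match your installed Chrome version. Driver download pages:

Chrome: download link

Firefox: download link

Safari: download link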
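To confirm that the driver really is reachable before launching the browser, a small check like the following can help. It is only a sketch using the standard library's `shutil.which`; the helper name `find_driver` is hypothetical, and `chromedriver` is assumed to be the executable's file name.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver not found on PATH; copy it into a PATH directory first")
else:
    print("chromedriver found at", path)
```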

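3. BeautifulSoup

Scraped pages are usually minified, with indentation and whitespace stripped, so their structure is hard to read. The BeautifulSoup library parses the HTML and can print it back out in a structured, indented form.

pip install beautifulsoup4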
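To see what this structuring looks like, here is a small sketch. The HTML fragment is made up for illustration and is not taken from Zhihu; `html.parser` is the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# A made-up, minified fragment standing in for a downloaded page.
minified = '<div class="answer"><p>hello</p><img src="a.jpg"/></div>'

soup = BeautifulSoup(minified, "html.parser")  # parse with Python's built-in parser
pretty = soup.prettify()                       # one tag per line, indented by depth
print(pretty)
```

The parsed `soup` object is also what we query later, so prettifying is for human reading only and is not required for extraction.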

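Section 2 – Code walkthrough

from selenium import webdriver   # drives a real browser
import time                      # pauses while the page loads
import urllib.request            # downloads the image files
from bs4 import BeautifulSoup    # parses and prettifies HTML
import html.parser               # also makes the html package (html.unescape) available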

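2.1 Set the target URL

In the main function, open the Chrome driver and load the URL.

def main():
    # ********* Open chrome driver and type the website that you want to view ***********************
    driver = webdriver.Chrome()  # open the browser

    # the page whose images you want to download
    driver.get("https://www.zhihu.com/question/35242408")  # "What is it like to have a huge collection of stickers?"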

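2.2 Simulate scrolling and clicking

Inside main we define a function that repeats the scroll-and-click sequence. Scrolling is done with driver.execute_script. Inspecting the page shows a "view more answers" button at the bottom of a Zhihu question, so we select that button with a CSS selector and click it. The find_element_by_css_selector helper was removed in Selenium 4.3, so the code uses driver.find_element, which works in both old and new versions. Here we load only five pages; five pages, about 100 answers, often already yield around 1,000 images.

    # ****************** Scroll to the bottom, and click the "view more" button *********
    def execute_times(times):
        for i in range(times):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page
            time.sleep(2)  # wait for the page to load
            try:
                # select and click the "load more" button at the bottom of the page;
                # "css selector" is the string value of selenium's By.CSS_SELECTOR
                driver.find_element("css selector", "button.QuestionMainAction").click()
                print("page" + str(i))  # print the page number
                time.sleep(1)  # wait for the page to load
            except Exception:
                break  # button missing or not clickable: stop loading more

    execute_times(5)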
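A fixed number of rounds is one way to drive the loop. As an alternative sketch (not the author's method), the function below keeps scrolling until document.body.scrollHeight stops growing. The name `scroll_until_stable` and its `max_rounds` and `pause` parameters are hypothetical; `driver` is assumed to be any object offering selenium's `execute_script` method. It only scrolls and does not click the "view more" button, so it would complement the loop above rather than replace it.

```python
import time


def scroll_until_stable(driver, max_rounds=10, pause=2.0):
    """Scroll to the bottom until the document height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return rounds
```

The `max_rounds` cap matters on pages that load content endlessly; without it the loop might never finish.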

2.3 Prettify the HTML page and save it

Whenever we scrape a page, the first thing to do is save the page HTML. To make it readable by eye, we use BeautifulSoup to turn the minified HTML into an indented, structured file before saving.

    # **************** Prettify the html file and store raw data file *****************************************
    result_raw = driver.page_source  # raw HTML of the page
    result_soup = BeautifulSoup(result_raw, 'html.parser')
    result_bf = result_soup.prettify()  # prettified HTML
    # the folders in this path must be created beforehand
    with open("./output/rawfile/raw_result.txt", 'w', encoding='utf-8') as raw_file:
        raw_file.write(result_bf)  # the with block closes the file on exit
    print("Store raw data successfully!!!")
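Since the output folders must exist before the files are written, it can be convenient to create them from the script itself. This is a minimal sketch: the helper name `ensure_output_dirs` is hypothetical, and the folder names are taken from the paths used in this article.

```python
import os

# Folder names taken from the paths used in this article.
OUTPUT_DIRS = ("output/rawfile", "output/image")


def ensure_output_dirs(base=".", subdirs=OUTPUT_DIRS):
    """Create each output folder under `base`; existing folders are left alone."""
    created = []
    for sub in subdirs:
        path = os.path.join(base, sub)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created
```

Calling `ensure_output_dirs()` once at the start of main would be enough; running it again is harmless.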

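2.4 Extract the <noscript> nodes from the answers

Before scraping any page, the first step is always to study its structure and tailor the approach to it. A page usually contains many images; this Zhihu page has lots of user avatars as well as the images embedded in answers. We do not want the avatars, only the images inside the answers, so we cannot simply grab every <img> tag. A closer look shows that the answer images sit inside <noscript> nodes and that their content is escaped (encoded as HTML entities), so it has to be decoded. The documented function for this is html.unescape.

    # **************** Find all <noscript> nodes and store them *****************************************
    # the folders in this path must be created beforehand
    with open("./output/rawfile/noscript_meta.txt", 'w', encoding='utf-8') as noscript_meta:
        noscript_nodes = result_soup.find_all('noscript')  # find all <noscript> nodes
        noscript_inner_all = ""
        for noscript in noscript_nodes:
            noscript_inner = noscript.get_text()  # get the content inside the <noscript> node
            noscript_inner_all += noscript_inner + "\n"
        noscript_all = html.unescape(noscript_inner_all)  # decode the HTML entities
        noscript_meta.write(noscript_all)  # store the decoded content
    print("Store noscript meta data successfully!!!")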
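To make the decoding step concrete, the sketch below runs it on one made-up string shaped like the inner content of a <noscript> node. The sample URL and the `ImgSrcCollector` class are illustrative only; the class uses the standard library's `HTMLParser` as a stand-in for the BeautifulSoup lookup in the article.

```python
import html
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Made-up escaped text, shaped like the inner content of a <noscript> node.
escaped = '&lt;img src="https://example.com/pic_1.jpg" width="600"&gt;'

decoded = html.unescape(escaped)  # turns '&lt;' into '<' and '&gt;' into '>'
collector = ImgSrcCollector()
collector.feed(decoded)
print(collector.sources)
```

Before decoding, the text contains no real tags at all, which is why searching the raw <noscript> content for <img> finds nothing.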

2.5 Download the images

With all the img nodes in hand, downloading is easy: urllib.request.urlretrieve does the whole job. I also did a little cleanup here, saving every URL separately with a sequence number; you can skip that step and download directly.

    # **************** Store meta data of imgs *****************************************
    img_soup = BeautifulSoup(noscript_all, 'html.parser')
    img_nodes = img_soup.find_all('img')
    with open("./output/rawfile/img_meta.txt", 'w', encoding='utf-8') as img_meta:
        count = 0
        for img in img_nodes:
            if img.get('src') is not None:
                img_url = img.get('src')
                line = str(count) + "\t" + img_url + "\n"  # sequence number, tab, URL
                img_meta.write(line)
                urllib.request.urlretrieve(img_url, "./output/image/" + str(count) + ".jpg")  # download the images one by one
                count += 1
    print("Store meta data and images successfully!!!")
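The code above saves every file with a .jpg extension, whatever the real image type is. If that matters, the file name could be derived from the URL instead. This is a sketch under assumptions: the helper name `image_filename` is hypothetical, and it assumes the image URL path ends in the real extension, which may not hold for every image.

```python
import os
from urllib.parse import urlparse


def image_filename(url, index, default_ext=".jpg"):
    """Build '<index><ext>' using the extension found in the URL path."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    if not ext:
        ext = default_ext  # fall back when the URL path has no extension
    return str(index) + ext
```

In the download loop, `"./output/image/" + image_filename(img_url, count)` would then take the place of the hard-coded `.jpg` path.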

And don't forget the final part:

if __name__ == '__main__':
    main()

Here is the complete source code:

from selenium import webdriver
import time
import urllib.request
from bs4 import BeautifulSoup
import html.parser


def main():
    # ********* Open chrome driver and type the website that you want to view ***********************
    driver = webdriver.Chrome()  # open the browser

    # the page whose images you want to download
    driver.get("https://www.zhihu.com/question/35242408")  # "What is it like to have a huge collection of stickers?"

    # ****************** Scroll to the bottom, and click the "view more" button *********
    def execute_times(times):
        for i in range(times):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom
            time.sleep(2)  # wait for the page to load
            try:
                # "css selector" is the string value of selenium's By.CSS_SELECTOR
                driver.find_element("css selector", "button.QuestionMainAction").click()
                print("page" + str(i))
                time.sleep(1)
            except Exception:
                break  # button missing or not clickable: stop loading more

    execute_times(7)

    # **************** Prettify the html file and store raw data file *****************************************
    result_raw = driver.page_source  # raw HTML of the page
    result_soup = BeautifulSoup(result_raw, 'html.parser')
    result_bf = result_soup.prettify()  # prettified HTML
    # the folders in this path must be created beforehand
    with open("D:/newworld/output/rawfile/raw_result.txt", 'w', encoding='utf-8') as raw_file:
        raw_file.write(result_bf)
    print("Store raw data successfully!!!")

    # **************** Find all <noscript> nodes and store them *****************************************
    with open("D:/newworld/output/rawfile/noscript_meta.txt", 'w', encoding='utf-8') as noscript_meta:
        noscript_nodes = result_soup.find_all('noscript')  # find all <noscript> nodes
        noscript_inner_all = ""
        for noscript in noscript_nodes:
            noscript_inner = noscript.get_text()  # get the content inside the <noscript> node
            noscript_inner_all += noscript_inner + "\n"
        noscript_all = html.unescape(noscript_inner_all)  # decode the HTML entities
        noscript_meta.write(noscript_all)  # store the decoded content
    print("Store noscript meta data successfully!!!")

    # **************** Store meta data of imgs *****************************************
    img_soup = BeautifulSoup(noscript_all, 'html.parser')
    img_nodes = img_soup.find_all('img')
    with open("D:/newworld/output/rawfile/img_meta.txt", 'w', encoding='utf-8') as img_meta:
        count = 0
        for img in img_nodes:
            if img.get('src') is not None:
                img_url = img.get('src')
                line = str(count) + "\t" + img_url + "\n"
                img_meta.write(line)
                urllib.request.urlretrieve(img_url, "D:/newworld/output/zhihumeinv/" + str(count) + ".jpg")  # download the images one by one
                count += 1
    print("Store meta data and images successfully!!!")


if __name__ == '__main__':
    main()
