- Date implemented: 2021-05-30
- Difficulty: ★★★☆☆☆
- Goal: collect all comments from the Facebook Comments Plugin (留言外挂程序).
- Complete code: https://github.com/TRHX/Python3-Spider-Practice/tree/master/CommentPlugin/facebook-comments
- Collection of other hands-on crawler code (continuously updated): https://github.com/TRHX/Python3-Spider-Practice
- Hands-on crawler column (continuously updated): https://itrhx.blog.csdn.net/article/category/9351278
Table of contents
- 【1x00】Preface
- 【2x00】Logic Analysis
- 【2x01】First Page
- 【2x02】Next Page
- 【2x03】Replies to Other Comments
- 【3x00】Complete Code
- 【4x00】Data Screenshots
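【1x00】Preface
The collection code in this article works for comments rendered by the Facebook Comments Plugin. It is intended solely for exchanging Python programming techniques.
Official site of the Facebook Comments Plugin: https://developers.facebook.com/products/social-plugins/comments
This article uses https://www.chinatimes.com/realtimenews/20210529003827-260407 as its example.
【2x00】Logic Analysis
【2x01】First Page
Right-click the Facebook Comments Plugin area of the page and choose "View frame source" to see the source of the first page of comments. Opening that URL directly also shows the comments.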
The URL of this page is: https://www.facebook.com/plugins/feedback.php?app_id=1379575469016080&channel=https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df22d8c81d4ce144%26domain%3Dwww.chinatimes.com%26origin%3Dhttps%253A%252F%252Fwww.chinatimes.com%252Ff5f738a4fa595%26relation%3Dparent.parent&container_width=924&height=100&href=https%3A%2F%2Fwww.chinatimes.com%2Frealtimenews%2F20210529003827-260407&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width
Formatted, it breaks down into the following parameters:
https://www.facebook.com/plugins/feedback.php?
app_id: 1379575469016080
channel: https://staticxx.facebook.com/x/connect/xd_arbiter/?version=46#cb=f22d8c81d4ce144
domain: www.chinatimes.com
origin: https%3A%2F%2Fwww.chinatimes.com%2Ff5f738a4fa595
relation: parent.parent
container_width: 924
height: 100
href: https://www.chinatimes.com/realtimenews/20210529003827-260407
locale: zh_TW
numposts: 5
order_by: reverse_time
sdk: joey
version: v3.2
width
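Of these parameters, app_id has to be fetched, domain is the site's domain name, and href is the URL of the page. In testing, the remaining parameters had no effect on the result and can be copied as they are.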
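The parameter list above can be assembled with the standard library. This is a minimal sketch, assuming the `cb` and `origin` tokens inside `channel` can simply be copied from the captured request (the article reports that they do not affect the result). `build_feedback_url` is a hypothetical helper name, not part of any Facebook API.

```python
from urllib.parse import urlencode, urlsplit


def build_feedback_url(page_url: str, app_id: str = "") -> str:
    """Assemble the plugin iframe URL for a page (hypothetical helper)."""
    domain = urlsplit(page_url).netloc
    # Assumption: the cb/origin tokens are copied from the captured request,
    # not computed; only the domain is substituted.
    channel = (
        "https://staticxx.facebook.com/x/connect/xd_arbiter/?version=46"
        f"#cb=f22d8c81d4ce144&domain={domain}"
        f"&origin=https%3A%2F%2F{domain}%2Ff5f738a4fa595"
        "&relation=parent.parent"
    )
    params = {
        "app_id": app_id,  # may be empty for pages without fb:app_id
        "channel": channel,
        "container_width": 924,
        "height": 100,
        "href": page_url,
        "locale": "zh_TW",
        "numposts": 5,
        "order_by": "reverse_time",
        "sdk": "joey",
        "version": "v3.2",
        "width": "",
    }
    # urlencode percent-encodes every value, so the already-encoded origin
    # inside channel becomes %253A..., as in the captured URL.
    return "https://www.facebook.com/plugins/feedback.php?" + urlencode(params)


url = build_feedback_url(
    "https://www.chinatimes.com/realtimenews/20210529003827-260407",
    app_id="1379575469016080",
)
print(url)
```

The only difference from the captured URL is the tail: `urlencode` emits `width=` where the browser sent a bare `width`.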
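Searching the original page for the app_id value shows that it sits in a meta tag, so an XPath expression can match it directly. Note that, in testing, some pages that use the plugin have no app_id at all, and the comments can still be fetched without it, so the lookup error has to be handled.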
```python
try:
    app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
except IndexError:
    pass
```
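To locate the first page of comments, search the response for the Unicode escape sequence of some comment text. The match shows where the data lives, and the segment containing the comment information can then be extracted directly.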
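The slicing step can be sketched as follows, run against a tiny synthetic stand-in for the response (the real page is far larger, and its exact layout is not shown in this article). The `"comments":` and `"meta"` markers are the ones used in the complete code; `extract_first_page` is a hypothetical name.

```python
import json


def extract_first_page(script_text: str) -> dict:
    """Slice out the JSON object that follows '"comments":' (hypothetical helper)."""
    begin = script_text.find('"comments":') + len('"comments":')
    end = script_text.find('"meta"', begin)
    # The slice ends with the comma that separates "comments" from "meta".
    return json.loads(script_text[begin:end].strip().rstrip(','))


# Synthetic sample: comment text appears as \uXXXX escapes in the raw response.
sample = (
    '{"comments":{"commentIDs":["1_2"],"idMap":{"1_2":'
    '{"body":{"text":"\\u8b9a"}}}},"meta":{}}'
)
data = extract_first_page(sample)
# json.loads turns the \uXXXX escape back into a normal Python string.
text = data["idMap"]["1_2"]["body"]["text"]
print(data["commentIDs"], len(text))
```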
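【2x02】Next Page
Clicking 载入其他留言 (Load more comments) triggers a new request such as https://www.facebook.com/plugins/comments/async/4045370158886862/pager/reverse_time/, sent with POST. The number after async in the URL is the targetID, which can be read from the response data (the complete code reads it from the first-page response).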
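A minimal sketch of reading the targetID and building the pager URL. `find_value` mirrors the helper of the same name in the complete code; the `fragment` string is synthetic and much shorter than a real response, and the cursor value in it is made up.

```python
def find_value(html: str, key: str, num_chars: int, separator: str) -> str:
    """Return the text that follows `key`, skipping `num_chars` characters."""
    begin = html.find(key) + len(key) + num_chars
    end = html.find(separator, begin)
    return html[begin:end]


PAGER_URL = "https://www.facebook.com/plugins/comments/async/{}/pager/reverse_time/"

# Synthetic fragment; the real response embeds these keys in a larger script.
fragment = '"targetID":"4045370158886862","afterCursor":"AQHReYdc"'

# Skipping 3 characters steps over the '":"' between the key and its value.
target_id = find_value(fragment, "targetID", 3, '"')
after_cursor = find_value(fragment, "afterCursor", 3, '"')
pager_url = PAGER_URL.format(target_id)
print(pager_url)
```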
The Form data is as follows:
app_id: 1379575469016080
after_cursor: AQHReYdcksX9wFZEKA3MgNmN8PCRr7N3tFfZZuIKpCKnIuv-SxCycw4uZ1LqhtMr7RVkGyqACNdpkd9uJJ1jk6ne9g
limit: 10
__user: 0
__a: 1
__dyn: 7xe6EgU4e1QyUbFp62-m1FwAxu13wKxK7Emy8W3q322aewTwl8eU4K3a3W1DwUx60Vo1upE4W0LEK1pwo8swaq1xwEwhU1382gKi8wnU1e42C0BE1co3rw9O0RE5a1qw8W0b1w
__csr:
__req: 1
__hs: 18777.PHASED:plugin_feedback_pkg.2.0.0.0
dpr: 1
__ccg: EXCELLENT
__rev: 1003879025
__s: ::lw3b8e
__hsi: 6968076253228168178
__comet_req: 0
locale: zh_TW
lsd: AVp5kXcGShk
jazoest: 2975
__sp: 1
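app_id is the same as before. The after_cursor value can be found by searching the previous page's comment data; in other words, each page of data carries an after_cursor value, which becomes the Form data parameter of the next page's request. In testing, the values of the other parameters did not affect the final result.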
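The cursor hand-off can be sketched without any network access. The sketch below assumes, as the complete code does, that a pager response is JSON prefixed with `for (;;);` and that the next cursor lives at `payload.afterCursor`. `parse_pager_response` is a hypothetical name, and both sample responses are synthetic.

```python
import json
from typing import Optional, Tuple


def parse_pager_response(text: str) -> Tuple[dict, Optional[str]]:
    """Return (payload, next_cursor); next_cursor is None on the last page."""
    prefix = "for (;;);"
    if prefix not in text:
        return {}, None
    body = json.loads(text.split(prefix, 1)[1])
    payload = body.get("payload", {})
    return payload, payload.get("afterCursor") or None


# Synthetic responses, heavily trimmed.
page = 'for (;;);{"payload":{"commentIDs":["1_2","1_3"],"idMap":{},"afterCursor":"AQH_next"}}'
last = 'for (;;);{"payload":{"commentIDs":["1_3","1_4"],"idMap":{}}}'

payload, cursor = parse_pager_response(page)
# The complete code skips the first ID on later pages because it is a duplicate.
new_ids = payload["commentIDs"][1:]
_, final_cursor = parse_pager_response(last)
print(new_ids, cursor, final_cursor)
```

A crawl loop would keep posting with the returned cursor as after_cursor until the function yields `None`.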
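【2x03】Replies to Other Comments
Replies come in two kinds: those that are visible directly, and those that appear only after clicking "More replies". The first kind can be read directly from the data. The second requires another request, to a URL such as https://www.facebook.com/plugins/comments/async/comment/4045370158886862_4046939882063223/pager/ , sent the same way as the next-page request. The ID after comment in the URL is again a targetID, and the after_cursor Form data parameter can be read from the parent comment's data.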
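A minimal sketch of deciding whether a comment needs the extra request. It assumes the comment layout used in the complete code, where hidden replies are signalled by `public_replies.afterCursor`; `reply_request` is a hypothetical name and both sample comments are synthetic.

```python
from typing import Optional, Tuple

REPLY_URL = "https://www.facebook.com/plugins/comments/async/comment/{}/pager/"


def reply_request(comment_id: str, comment: dict) -> Optional[Tuple[str, str]]:
    """Return (url, after_cursor) for hidden replies, or None if there are none."""
    cursor = comment.get("public_replies", {}).get("afterCursor")
    if not cursor:
        return None
    return REPLY_URL.format(comment_id), cursor


# Synthetic comments: one with replies behind "More replies", one without.
with_hidden = {"public_replies": {"totalCount": 12, "afterCursor": "AQH_reply"}}
without = {"public_replies": {"totalCount": 1, "commentIDs": ["1_9"]}}

hidden = reply_request("4045370158886862_4046939882063223", with_hidden)
visible_only = reply_request("4045370158886862_4046939882063224", without)
print(hidden, visible_only)
```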
【3x00】Complete Code
Complete code on GitHub (a star brings good luck):
https://github.com/TRHX/Python3-Spider-Practice/tree/master/CommentPlugin/facebook-comments
```python
# ====================================
# --*-- coding: utf-8 --*--
# @Time    : 2021-05-30
# @Author  : TRHX • 鲍勃
# @Blog    : www.itrhx.com
# @CSDN    : itrhx.blog.csdn.net
# @FileName: facebook.py
# @Software: PyCharm
# ====================================

import requests
import json
import time
from lxml import etree

# ============================== Test links ============================== #
# https://www.chinatimes.com/realtimenews/20210529003827-260407
# https://tw.appledaily.com/life/20210530/IETG7L3VMBA57OD45OC5KFTCPQ/
# https://www.nownews.com/news/5281470
# https://www.thejakartapost.com/life/2019/06/03/how-to-lose-belly-fat-in-seven-days.html
# https://mcnews.cc/p/25224
# https://news.ltn.com.tw/news/world/breakingnews/3550262
# https://www.npf.org.tw/1/15857
# https://news.pts.org.tw/article/528425
# https://news.tvbs.com.tw/life/1518745
# ============================== Test links ============================== #

PAGE_URL = 'https://www.chinatimes.com/realtimenews/20210529003827-260407'
PROXIES = {'http': 'http://127.0.0.1:10809', 'https': 'http://127.0.0.1:10809'}
# PROXIES = None  # Set to None if no proxy is needed


class FacebookComment:
    def __init__(self):
        self.json_name = 'facebook_comments.json'
        self.domain = PAGE_URL.split('/')[2]
        self.iframe_referer = 'https://{}/'.format(self.domain)
        self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        self.channel_base_url = 'https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df17861604189654%26domain%3D{}%26origin%3Dhttps%253A%252F%252F{}%252Ff9bd3e89788d7%26relation%3Dparent.parent'
        self.referer_base_url = 'https://www.facebook.com/plugins/feedback.php?app_id={}&channel={}&container_width=924&height=100&href={}&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width'
        self.comment_base_url = 'https://www.facebook.com/plugins/comments/async/{}/pager/reverse_time/'
        self.reply_base_url = 'https://www.facebook.com/plugins/comments/async/comment/{}/pager/'
        self.target_id = ''
        self.referer = ''
        self.app_id = ''

    @staticmethod
    def find_value(html: str, key: str, num_chars: int, separator: str) -> str:
        pos_begin = html.find(key) + len(key) + num_chars
        pos_end = html.find(separator, pos_begin)
        return html[pos_begin: pos_end]

    @staticmethod
    def save_comment(filename: str, information: str) -> None:
        # information is a JSON string; one record is appended per line
        with open(filename, 'a+', encoding='utf-8') as f:
            f.write(information + '\n')

    def get_app_id(self) -> None:
        headers = {'user-agent': self.user_agent}
        response = requests.get(url=PAGE_URL, headers=headers, proxies=PROXIES)
        html = response.text
        content = etree.HTML(html)
        try:
            app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
            self.app_id = app_id
        except IndexError:
            pass

    def get_first_parameter(self) -> str:
        channel_url = self.channel_base_url.format(self.domain, self.domain)
        referer_url = self.referer_base_url.format(self.app_id, channel_url, PAGE_URL)
        headers = {
            'authority': 'www.facebook.com',
            'upgrade-insecure-requests': '1',
            'user-agent': self.user_agent,
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-dest': 'iframe',
            'referer': self.iframe_referer,
            'accept-language': 'zh-CN,zh;q=0.9'
        }
        response = requests.get(url=referer_url, headers=headers, proxies=PROXIES)
        data = response.text
        after_cursor = self.find_value(data, "afterCursor", 3, separator='"')
        target_id = self.find_value(data, "targetID", 3, separator='"')
        # rev = find_value(data, "consistency", 9, separator='}')

        # Extract and save the initial comments
        tree = etree.HTML(data)
        script = tree.xpath('//body/script[last()]/text()')[0]
        html_begin = script.find('"comments":') + len('"comments":')
        html_end = script.find('"meta"')
        result = script[html_begin:html_end].strip()
        result_dict = json.loads(result[:-1])
        comment_type = 'first'
        self.processing_comment(result_dict, comment_type)

        self.target_id = target_id
        self.referer = referer_url
        return after_cursor

    def get_comment(self, after_cursor: str, comment_url: str) -> None:
        """
        :param after_cursor: str, the cursor for the next page
        :param comment_url: str, the URL of the comment page
        :return: None
        """
        num = 1
        while after_cursor:
            post_data = {
                'app_id': self.app_id,
                'after_cursor': after_cursor,
                'limit': 10,
                'iframe_referer': self.iframe_referer,
                '__user': 0,
                '__a': 1,
                '__dyn': '7xe6EgU4e3W3mbG2KmhwRwqo98nwgUbErxW5EyewSwMwyzEdU5i3K1bwOw-wpUe8hwem0nCq1ewbWbwmo62782CwOwKwEwhU1382gKi8wl8G0jx0Fw9q0B82swdK0D83mwkE5G0zE16o',
                '__csr': '',
                '__req': num,
                '__beoa': 0,
                '__pc': 'PHASED:plugin_feedback_pkg',
                'dpr': 1,
                '__ccg': 'GOOD',
                # '__rev': rev,
                # '__s': ':mfgzaz:f4if6y',
                # '__hsi': '6899699958141806572',
                '__comet_req': 0,
                'locale': 'zh_TW',
                # 'jazoest': '22012',
                '__sp': 1
            }
            headers = {
                'user-agent': self.user_agent,
                'content-type': 'application/x-www-form-urlencoded',
                'accept': '*/*',
                'origin': 'https://www.facebook.com',
                'sec-fetch-site': 'same-origin',
                'sec-fetch-mode': 'cors',
                'sec-fetch-dest': 'empty',
                'referer': self.referer,
                'accept-language': 'zh-CN,zh;q=0.9'
            }
            response = requests.post(url=comment_url, headers=headers, proxies=PROXIES, data=post_data)
            data = response.text
            if 'xml version' in data:
                html_data = data.split('\n', 1)[1]
            else:
                html_data = data

            if 'for (;;);' in html_data:
                json_text = html_data.split("for (;;);")[1]
                json_dict = json.loads(json_text)
                # print(json_dict)
                comment_type = 'second'
                self.processing_comment(json_dict, comment_type)
                try:
                    after_cursor = json_dict['payload']['afterCursor']
                except KeyError:
                    after_cursor = False
                # try:
                #     rev = json_dict['hsrp']['hblp']['consistency']['rev']
                # except KeyError:
                #     rev = ''
            else:
                after_cursor = False
            num += 1

    def processing_comment(self, comment_dict: dict, comment_type: str) -> None:
        """
        :param comment_dict: dict, all comment data; the structure may differ between pages
        :param comment_type: str, marks first-page versus later-page comments
        :return: None
        """
        try:
            comment_dict = comment_dict['payload']
        except KeyError:
            comment_dict = comment_dict

        # 'first' means the first page, so keep every ID;
        # otherwise drop the duplicated first ID
        if comment_type == 'first':
            comment_ids = comment_dict['commentIDs']
        else:
            comment_ids = comment_dict['commentIDs'][1:]

        # First pass: save all top-level comments
        self.extract_comment(comment_dict, comment_ids)

    def extract_comment(self, comment_dict: dict, comment_ids: list) -> None:
        """
        :param comment_dict: dict, all comment data
        :param comment_ids: list, the IDs of all comments
        :return: None
        """
        for i in range(len(comment_ids)):
            # ================== info ================== #
            crawl_timestamp = int(time.time())
            crawl_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

            # ================= comment ================ #
            comment = comment_dict['idMap'][comment_ids[i]]
            comment_id = comment_ids[i]
            target_id = comment['targetID']
            created_timestamp = comment['timestamp']['time']
            created_time_text = comment['timestamp']['text']
            created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(float(created_timestamp)))
            comment_type = comment['type']
            ranges = comment['ranges']
            like_count = comment['likeCount']
            has_liked = comment['hasLiked']
            can_like = comment['canLike']
            can_edit = comment['canEdit']
            hidden = comment['hidden']
            high_lighted_words = comment['highlightedWords']
            spam_count = comment['spamCount']
            can_embed = comment['canEmbed']
            try:
                reply_count = comment['public_replies']['totalCount']
            except KeyError:
                reply_count = 0
            report_uri = 'https://www.facebook.com' + comment['reportURI']
            content = comment['body']['text']

            # ================= author ================= #
            author_id = comment['authorID']
            author = comment_dict['idMap'][author_id]
            author_name = author['name']
            thumb_src = author['thumbSrc']
            uri = author['uri']
            is_verified = author['isVerified']
            author_type = author['type']

            comment_result_dict = {
                'info': {
                    'pageURL': PAGE_URL,                     # original page URL
                    'crawlTimestamp': crawl_timestamp,       # crawl timestamp
                    'crawlTime': crawl_time                  # crawl time
                },
                'comment': {
                    'type': comment_type,                    # type
                    'commentID': comment_id,                 # comment ID
                    'targetID': target_id,                   # target ID; for a reply to comment A, this is A's comment ID
                    'createdTimestamp': created_timestamp,   # comment timestamp
                    'createdTime': created_time,             # comment time
                    'createdTimeText': created_time_text,    # comment time (date text)
                    'likeCount': like_count,                 # number of likes on this comment
                    'replyCount': reply_count,               # number of replies under this comment
                    'spamCount': spam_count,                 # times this comment was marked as spam
                    'hasLiked': has_liked,                   # whether you have liked this comment
                    'canLike': can_like,                     # whether this comment can be liked
                    'canEdit': can_edit,                     # whether this comment can be edited
                    'hidden': hidden,                        # whether this comment is hidden
                    'canEmbed': can_embed,                   # whether this comment can be embedded in other pages
                    'ranges': ranges,                        # meaning unknown
                    'highLightedWords': high_lighted_words,  # highlighted words in this comment
                    'reportURI': report_uri,                 # link for reporting this comment
                    'content': content,                      # text of this comment
                },
                'author': {
                    'type': author_type,                     # type
                    'authorID': author_id,                   # ID of the comment's author
                    'authorName': author_name,               # user name of the comment's author
                    'isVerified': is_verified,               # whether the author is verified
                    'uri': uri,                              # the author's Facebook profile
                    'thumbSrc': thumb_src                    # URL of the author's avatar
                }
            }

            print(comment_result_dict)
            self.save_comment(self.json_name, json.dumps(comment_result_dict, ensure_ascii=False))

            # Second pass: save second-level comments (replies that are visible
            # without clicking "More replies").
            # Criterion: commentIDs exists
            try:
                reply_ids = comment['public_replies']['commentIDs']
                self.extract_comment(comment_dict, reply_ids)
            except KeyError:
                pass

            # Third pass: save third-level comments (replies that are visible
            # only after clicking "More replies").
            # Criterion: afterCursor exists
            try:
                reply_after_cursor = comment['public_replies']['afterCursor']
                reply_id = comment_ids[i]
                reply_url = self.reply_base_url.format(reply_id)
                self.get_comment(reply_after_cursor, reply_url)
            except KeyError:
                pass

    def run(self) -> None:
        self.get_app_id()
        after_cursor = self.get_first_parameter()
        if len(after_cursor) < 20:
            print('\n{} comments collected!'.format(PAGE_URL))
        else:
            comment_url = self.comment_base_url.format(self.target_id)
            self.get_comment(after_cursor, comment_url)
            print('\n{} comments collected!'.format(PAGE_URL))


if __name__ == '__main__':
    FC = FacebookComment()
    FC.run()
```
【4x00】Data Screenshots