Scraping Comments from the Facebook Comments Plugin with Python

Date implemented: 2021-05-30
Difficulty: ★★★☆☆☆
Goal: collect every comment shown by the Facebook Comments Plugin on a page.
Full code: https://github.com/TRHX/Python3-Spider-Practice/tree/master/CommentPlugin/facebook-comments
Other crawler practice code (continuously updated): https://github.com/TRHX/Python3-Spider-Practice
Crawler practice column (continuously updated): https://itrhx.blog.csdn.net/article/category/9351278

Table of contents

- 【1x00】Preface
- 【2x00】Logic analysis
  - 【2x01】First page
  - 【2x02】Next page
  - 【2x03】Replies to other comments
- 【3x00】Full code
- 【4x00】Data screenshots

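【1x00】Preface

The scraping code in this article works for collecting comments from the Facebook Comments Plugin. It is intended solely for discussing Python programming techniques!

Facebook Comments Plugin official site: https://developers.facebook.com/products/social-plugins/comments

This article uses https://www.chinatimes.com/realtimenews/20210529003827-260407 as the example.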

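【2x00】Logic analysis

【2x01】First page

Right-click the Facebook Comments Plugin area of the page and choose to view the frame source. This shows the source of the first page of comments, and visiting that frame's URL directly displays the comment data.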

The URL of this frame is: https://www.facebook.com/plugins/feedback.php?app_id=1379575469016080&channel=https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df22d8c81d4ce144%26domain%3Dwww.chinatimes.com%26origin%3Dhttps%253A%252F%252Fwww.chinatimes.com%252Ff5f738a4fa595%26relation%3Dparent.parent&container_width=924&height=100&href=https%3A%2F%2Fwww.chinatimes.com%2Frealtimenews%2F20210529003827-260407&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width

Decoded and split out, it carries the following parameters:

https://www.facebook.com/plugins/feedback.php?
app_id: 1379575469016080
channel: https://staticxx.facebook.com/x/connect/xd_arbiter/?version=46#cb=f22d8c81d4ce144
domain: www.chinatimes.com
origin: https%3A%2F%2Fwww.chinatimes.com%2Ff5f738a4fa595
relation: parent.parent
container_width: 924
height: 100
href: https://www.chinatimes.com/realtimenews/20210529003827-260407
locale: zh_TW
numposts: 5
order_by: reverse_time
sdk: joey
version: v3.2
width

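Of the parameters above, app_id has to be obtained from the page, domain is the site's domain name, and href is the URL of the page. In testing, the remaining parameters had no effect on the result and can be copied as they are.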
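As a rough illustration of the paragraph above, the first-page URL can be assembled from just the page URL and app_id. This is a minimal sketch, not the author's implementation: the helper name `build_feedback_url` is hypothetical, and the `cb` and `origin` tokens are simply copied from the captured request, on the assumption (stated above) that they do not affect the result.

```python
from urllib.parse import urlencode, urlparse

FEEDBACK_ENDPOINT = "https://www.facebook.com/plugins/feedback.php"


def build_feedback_url(page_url: str, app_id: str = "") -> str:
    """Assemble the first-page plugin URL for page_url (hypothetical helper)."""
    domain = urlparse(page_url).netloc
    # The channel value is itself a URL; domain, origin and relation live in
    # its fragment. The cb/origin tokens are copied from the captured request.
    channel = (
        "https://staticxx.facebook.com/x/connect/xd_arbiter/?version=46"
        f"#cb=f22d8c81d4ce144&domain={domain}"
        f"&origin=https%3A%2F%2F{domain}%2Ff5f738a4fa595&relation=parent.parent"
    )
    params = {
        "app_id": app_id,  # may be empty: some pages have no fb:app_id
        "channel": channel,
        "container_width": 924,
        "height": 100,
        "href": page_url,
        "locale": "zh_TW",
        "numposts": 5,
        "order_by": "reverse_time",
        "sdk": "joey",
        "version": "v3.2",
    }
    # The captured URL ends with a bare "width" key, so it is appended by hand.
    return f"{FEEDBACK_ENDPOINT}?{urlencode(params)}&width"


url = build_feedback_url(
    "https://www.chinatimes.com/realtimenews/20210529003827-260407",
    "1379575469016080",
)
print(url)
```

Letting `urlencode` percent-encode the channel value also reproduces the double encoding (`%253A`) seen in the captured URL, so no hand-encoded template string is needed.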

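Searching the original page for the app_id value shows that it sits in a meta tag, so it can be matched directly with XPath. Note that, in testing, some pages that use this plugin have no app_id at all, and the comments can still be fetched without it, so the lookup error needs to be handled.

try:
    app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
except IndexError:
    pass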


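For the comments on the first page, search the response for the Unicode-escaped form of a comment's text. This locates the matching content, and the segment that contains the comment data can then be extracted directly.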
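The extraction step can be sketched as follows. The full code slices the script text between `"comments":` and `"meta"`. This sketch swaps in a different technique, `json.JSONDecoder.raw_decode`, which parses exactly one JSON value starting at a given index and so does not depend on what follows the block. The `sample` string is a made-up, heavily shortened stand-in for the real response, and `extract_comments_block` is a hypothetical name.

```python
import json


def extract_comments_block(script_text: str) -> dict:
    """Parse the JSON object that follows the "comments": key (hypothetical helper)."""
    marker = '"comments":'
    start = script_text.find(marker)
    if start == -1:
        raise ValueError("no comments block found")
    start += len(marker)
    # raw_decode needs the value to begin exactly at the index it is given,
    # so skip any whitespace after the colon first.
    while start < len(script_text) and script_text[start].isspace():
        start += 1
    block, _end = json.JSONDecoder().raw_decode(script_text, start)
    return block


# Made-up sample: comment text appears as \uXXXX escapes, as in the real response.
sample = (
    'handle({"comments":{"commentIDs":["1_2"],'
    '"idMap":{"1_2":{"body":{"text":"\\u4f60\\u597d"}}}},'
    '"meta":{"targetID":"1"}});'
)
block = extract_comments_block(sample)
print(block["commentIDs"], block["idMap"]["1_2"]["body"]["text"])
```

Decoding through `json` also turns the Unicode escapes back into readable text, which is why searching the raw response requires the escaped form.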


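【2x02】Next page

Clicking "Load more comments" triggers a new request to a URL such as https://www.facebook.com/plugins/comments/async/4045370158886862/pager/reverse_time/, sent as a POST. The string of digits after async in the URL is the targetID, which can be read from the data returned by the first-page request.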
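Reading the targetID and building the pager URL might look like the sketch below. It assumes the value appears in the response as a plain `"targetID":"..."` pair (the full code relies on the same layout when it skips three characters after the key). The regex-based `find_quoted_value` is a hypothetical alternative to the author's `find_value`, and `sample` is a made-up fragment.

```python
import re

PAGER_URL = "https://www.facebook.com/plugins/comments/async/{}/pager/reverse_time/"


def find_quoted_value(text: str, key: str) -> str:
    """Return the string value of the first "key":"value" pair, or '' if absent."""
    match = re.search(rf'"{re.escape(key)}"\s*:\s*"([^"]*)"', text)
    return match.group(1) if match else ""


# Made-up fragment of a first-page response.
sample = '{"meta":{"targetID":"4045370158886862","afterCursor":"AQHReYdcksX9"}}'
target_id = find_quoted_value(sample, "targetID")
print(PAGER_URL.format(target_id))
```

Returning an empty string for a missing key makes the "no more pages" case easy to test with a simple truthiness check.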


The form data is as follows:

app_id: 1379575469016080
after_cursor: AQHReYdcksX9wFZEKA3MgNmN8PCRr7N3tFfZZuIKpCKnIuv-SxCycw4uZ1LqhtMr7RVkGyqACNdpkd9uJJ1jk6ne9g
limit: 10
__user: 0
__a: 1
__dyn: 7xe6EgU4e1QyUbFp62-m1FwAxu13wKxK7Emy8W3q322aewTwl8eU4K3a3W1DwUx60Vo1upE4W0LEK1pwo8swaq1xwEwhU1382gKi8wnU1e42C0BE1co3rw9O0RE5a1qw8W0b1w
__csr: 
__req: 1
__hs: 18777.PHASED:plugin_feedback_pkg.2.0.0.0
dpr: 1
__ccg: EXCELLENT
__rev: 1003879025
__s: ::lw3b8e
__hsi: 6968076253228168178
__comet_req: 0
locale: zh_TW
lsd: AVp5kXcGShk
jazoest: 2975
__sp: 1

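app_id is the same as before. Searching shows that the after_cursor value can be found in the previous page's comment data. In other words, each page of data contains a cursor value that becomes the after_cursor form parameter of the next page's request. In testing, the values of the other parameters did not affect the final result.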
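The cursor chaining described above can be sketched as a small generator. This is an illustration under assumptions, not the author's code: `iter_pages` and `parse_pager_response` are hypothetical names, the network call is passed in as a `fetch` function so the loop can be exercised offline, and the canned responses are made up. The `for (;;);` prefix and the `payload` / `afterCursor` keys are the ones the full code handles.

```python
import json
from typing import Callable, Iterator

GUARD = "for (;;);"


def parse_pager_response(text: str) -> dict:
    """Drop everything up to and including the 'for (;;);' prefix, then decode JSON."""
    _, found, body = text.partition(GUARD)
    if not found:
        raise ValueError("unexpected response format")
    return json.loads(body)


def iter_pages(first_cursor: str, fetch: Callable[[str], str]) -> Iterator[dict]:
    """Yield one payload per page, following afterCursor until it disappears."""
    cursor = first_cursor
    while cursor:
        payload = parse_pager_response(fetch(cursor)).get("payload", {})
        yield payload
        cursor = payload.get("afterCursor")  # missing on the last page


# Made-up canned responses keyed by cursor, standing in for real POST requests.
fake_pages = {
    "c1": 'for (;;);{"payload":{"commentIDs":["a","b"],"afterCursor":"c2"}}',
    "c2": 'for (;;);{"payload":{"commentIDs":["b","c"]}}',
}
pages = list(iter_pages("c1", fake_pages.__getitem__))
print([page["commentIDs"] for page in pages])
```

In real use, `fetch` would POST the form data with `after_cursor` set to the given cursor and return the response text. The sample pages repeat the ID "b" to mirror the full code, which drops the first comment ID of every page after the first because it duplicates the previous page.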


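【2x03】Replies to other comments

Replies come in two kinds: those that are visible right away, and those that only appear after clicking "more replies". The first kind can be read directly from the data already fetched. The second kind requires another request, to a URL such as https://www.facebook.com/plugins/comments/async/comment/4045370158886862_4046939882063223/pager/ , sent the same way as the next-page request. The string of digits after comment in the URL again acts as the targetID (the full code fills it with the parent comment's ID), and the after_cursor form parameter can be taken from the parent comment's data.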
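The two kinds of replies can be told apart per comment, roughly as below. This is a sketch under assumptions: `plan_replies` is a hypothetical helper, the field names `public_replies`, `commentIDs` and `afterCursor` are taken from the full code, and the sample comment (including its reply ID and cursor) is made up.

```python
from typing import List, Optional, Tuple

REPLY_URL = "https://www.facebook.com/plugins/comments/async/comment/{}/pager/"


def plan_replies(comment_id: str, comment: dict) -> Tuple[List[str], Optional[Tuple[str, str]]]:
    """Return (reply IDs already in the data, (url, cursor) for hidden replies or None)."""
    replies = comment.get("public_replies", {})
    inline_ids = replies.get("commentIDs", [])
    cursor = replies.get("afterCursor")
    follow_up = (REPLY_URL.format(comment_id), cursor) if cursor else None
    return inline_ids, follow_up


# Made-up parent comment with one visible reply and more behind "more replies".
sample_comment = {
    "public_replies": {
        "totalCount": 12,
        "commentIDs": ["4045370158886862_4046950000000001"],
        "afterCursor": "AQH-example-cursor",
    }
}
inline_ids, follow_up = plan_replies("4045370158886862_4046939882063223", sample_comment)
print(inline_ids)
print(follow_up)
```

A comment without `public_replies` yields an empty list and `None`, so callers can skip the extra request.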

【3x00】Full code

Full code on GitHub (a star brings good luck):
https://github.com/TRHX/Python3-Spider-Practice/tree/master/CommentPlugin/facebook-comments

# ====================================
# --*-- coding: utf-8 --*--
# @Time    : 2021-05-30
# @Author  : TRHX • 鲍勃
# @Blog    : www.itrhx.com
# @CSDN    : itrhx.blog.csdn.net
# @FileName: facebook.py
# @Software: PyCharm
# ====================================


import requests
import json
import time
from lxml import etree


# ==============================  Test links  ============================== #
# https://www.chinatimes.com/realtimenews/20210529003827-260407
# https://tw.appledaily.com/life/20210530/IETG7L3VMBA57OD45OC5KFTCPQ/
# https://www.nownews.com/news/5281470
# https://www.thejakartapost.com/life/2019/06/03/how-to-lose-belly-fat-in-seven-days.html
# https://mcnews.cc/p/25224
# https://news.ltn.com.tw/news/world/breakingnews/3550262
# https://www.npf.org.tw/1/15857
# https://news.pts.org.tw/article/528425
# https://news.tvbs.com.tw/life/1518745
# ==============================  Test links  ============================== #


PAGE_URL = 'https://www.chinatimes.com/realtimenews/20210529003827-260407'
PROXIES = {'http': 'http://127.0.0.1:10809', 'https': 'http://127.0.0.1:10809'}
# PROXIES = None  # Set to None if no proxy is needed


class FacebookComment:
    def __init__(self):
        self.json_name = 'facebook_comments.json'
        self.domain = PAGE_URL.split('/')[2]
        self.iframe_referer = 'https://{}/'.format(self.domain)
        self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
        self.channel_base_url = 'https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df17861604189654%26domain%3D{}%26origin%3Dhttps%253A%252F%252F{}%252Ff9bd3e89788d7%26relation%3Dparent.parent'
        self.referer_base_url = 'https://www.facebook.com/plugins/feedback.php?app_id={}&channel={}&container_width=924&height=100&href={}&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width'
        self.comment_base_url = 'https://www.facebook.com/plugins/comments/async/{}/pager/reverse_time/'
        self.reply_base_url = 'https://www.facebook.com/plugins/comments/async/comment/{}/pager/'
        self.target_id = ''
        self.referer = ''
        self.app_id = ''

    @staticmethod
    def find_value(html: str, key: str, num_chars: int, separator: str) -> str:
        # Return the text that starts num_chars after key and ends before separator
        pos_begin = html.find(key) + len(key) + num_chars
        pos_end = html.find(separator, pos_begin)
        return html[pos_begin: pos_end]

    @staticmethod
    def save_comment(filename: str, information: str) -> None:
        # Append one JSON string per line
        with open(filename, 'a+', encoding='utf-8') as f:
            f.write(information + '\n')

    def get_app_id(self) -> None:
        headers = {'user-agent': self.user_agent}
        response = requests.get(url=PAGE_URL, headers=headers, proxies=PROXIES)
        html = response.text
        content = etree.HTML(html)
        try:
            app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
            self.app_id = app_id
        except IndexError:
            # Some pages have no app_id; the comments can still be fetched
            pass

    def get_first_parameter(self) -> str:
        channel_url = self.channel_base_url.format(self.domain, self.domain)
        referer_url = self.referer_base_url.format(self.app_id, channel_url, PAGE_URL)
        headers = {
            'authority': 'www.facebook.com',
            'upgrade-insecure-requests': '1',
            'user-agent': self.user_agent,
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'cross-site',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-dest': 'iframe',
            'referer': self.iframe_referer,
            'accept-language': 'zh-CN,zh;q=0.9'
        }
        response = requests.get(url=referer_url, headers=headers, proxies=PROXIES)
        data = response.text
        after_cursor = self.find_value(data, "afterCursor", 3, separator='"')
        target_id = self.find_value(data, "targetID", 3, separator='"')
        # rev = find_value(data, "consistency", 9, separator='}')

        # Extract and save the initial (first-page) comments
        tree = etree.HTML(data)
        script = tree.xpath('//body/script[last()]/text()')[0]
        html_begin = script.find('"comments":') + len('"comments":')
        html_end = script.find('"meta"')
        result = script[html_begin:html_end].strip()
        result_dict = json.loads(result[:-1])
        comment_type = 'first'
        self.processing_comment(result_dict, comment_type)

        self.target_id = target_id
        self.referer = referer_url

        return after_cursor

    def get_comment(self, after_cursor: str, comment_url: str) -> None:
        """
        :param after_cursor: str, the cursor of the next page
        :param comment_url: str, the URL of the comment page
        :return: None
        """
        num = 1
        while after_cursor:
            post_data = {
                'app_id': self.app_id,
                'after_cursor': after_cursor,
                'limit': 10,
                'iframe_referer': self.iframe_referer,
                '__user': 0,
                '__a': 1,
                '__dyn': '7xe6EgU4e3W3mbG2KmhwRwqo98nwgUbErxW5EyewSwMwyzEdU5i3K1bwOw-wpUe8hwem0nCq1ewbWbwmo62782CwOwKwEwhU1382gKi8wl8G0jx0Fw9q0B82swdK0D83mwkE5G0zE16o',
                '__csr': '',
                '__req': num,
                '__beoa': 0,
                '__pc': 'PHASED:plugin_feedback_pkg',
                'dpr': 1,
                '__ccg': 'GOOD',
                # '__rev': rev,
                # '__s': ':mfgzaz:f4if6y',
                # '__hsi': '6899699958141806572',
                '__comet_req': 0,
                'locale': 'zh_TW',
                # 'jazoest': '22012',
                '__sp': 1
            }
            headers = {
                'user-agent': self.user_agent,
                'content-type': 'application/x-www-form-urlencoded',
                'accept': '*/*',
                'origin': 'https://www.facebook.com',
                'sec-fetch-site': 'same-origin',
                'sec-fetch-mode': 'cors',
                'sec-fetch-dest': 'empty',
                'referer': self.referer,
                'accept-language': 'zh-CN,zh;q=0.9'
            }
            response = requests.post(url=comment_url, headers=headers, proxies=PROXIES, data=post_data)
            data = response.text
            if 'xml version' in data:
                html_data = data.split('\n', 1)[1]
            else:
                html_data = data

            if 'for (;;);' in html_data:
                json_text = html_data.split("for (;;);")[1]
                json_dict = json.loads(json_text)
                # print(json_dict)
                comment_type = 'second'
                self.processing_comment(json_dict, comment_type)
                try:
                    after_cursor = json_dict['payload']['afterCursor']
                except KeyError:
                    after_cursor = False
                # try:
                #     rev = json_dict['hsrp']['hblp']['consistency']['rev']
                # except KeyError:
                #     rev = ''
            else:
                after_cursor = False
            num += 1

    def processing_comment(self, comment_dict: dict, comment_type: str) -> None:
        """
        :param comment_dict: dict, all comment data; the structure may differ between pages
        :param comment_type: str, marks whether the comments come from the first page or a later one
        :return: None
        """
        try:
            comment_dict = comment_dict['payload']
        except KeyError:
            # First-page data is not wrapped in 'payload'
            pass

        # 'first' means first-page comments: store them all.
        # Otherwise drop the first ID, which repeats the previous page.
        if comment_type == 'first':
            comment_ids = comment_dict['commentIDs']
        else:
            comment_ids = comment_dict['commentIDs'][1:]

        # First pass: store all top-level comments
        self.extract_comment(comment_dict, comment_ids)

    def extract_comment(self, comment_dict: dict, comment_ids: list) -> None:
        """
        :param comment_dict: dict, all comment data
        :param comment_ids: list, the IDs of all comments
        :return: None
        """
        for i in range(len(comment_ids)):
            # ==================  info  ================== #
            crawl_timestamp = int(time.time())
            crawl_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

            # =================  comment  ================ #
            comment = comment_dict['idMap'][comment_ids[i]]
            comment_id = comment_ids[i]
            target_id = comment['targetID']
            created_timestamp = comment['timestamp']['time']
            created_time_text = comment['timestamp']['text']
            created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(float(created_timestamp)))
            comment_type = comment['type']
            ranges = comment['ranges']
            like_count = comment['likeCount']
            has_liked = comment['hasLiked']
            can_like = comment['canLike']
            can_edit = comment['canEdit']
            hidden = comment['hidden']
            high_lighted_words = comment['highlightedWords']
            spam_count = comment['spamCount']
            can_embed = comment['canEmbed']
            try:
                reply_count = comment['public_replies']['totalCount']
            except KeyError:
                reply_count = 0
            report_uri = 'https://www.facebook.com' + comment['reportURI']
            content = comment['body']['text']

            # =================  author  ================= #
            author_id = comment['authorID']
            author = comment_dict['idMap'][author_id]
            author_name = author['name']
            thumb_src = author['thumbSrc']
            uri = author['uri']
            is_verified = author['isVerified']
            author_type = author['type']

            comment_result_dict = {
                'info': {
                    'pageURL': PAGE_URL,                      # original page URL
                    'crawlTimestamp': crawl_timestamp,        # crawl timestamp
                    'crawlTime': crawl_time                   # crawl time
                },
                'comment': {
                    'type': comment_type,                     # type
                    'commentID': comment_id,                  # comment ID
                    'targetID': target_id,                    # target ID; for a reply to comment A, this is A's comment ID
                    'createdTimestamp': created_timestamp,    # comment timestamp
                    'createdTime': created_time,              # comment time
                    'createdTimeText': created_time_text,     # comment time (date text)
                    'likeCount': like_count,                  # number of likes on this comment
                    'replyCount': reply_count,                # number of replies to this comment
                    'spamCount': spam_count,                  # number of times this comment was flagged as spam
                    'hasLiked': has_liked,                    # whether you have liked this comment
                    'canLike': can_like,                      # whether this comment can be liked
                    'canEdit': can_edit,                      # whether this comment can be edited
                    'hidden': hidden,                         # whether this comment is hidden
                    'canEmbed': can_embed,                    # whether this comment can be embedded in other pages
                    'ranges': ranges,                         # meaning unknown
                    'highLightedWords': high_lighted_words,   # highlighted words in this comment
                    'reportURI': report_uri,                  # link for reporting this comment
                    'content': content,                       # text of this comment
                },
                'author': {
                    'type': author_type,                      # type
                    'authorID': author_id,                    # ID of the comment's author
                    'authorName': author_name,                # user name of the comment's author
                    'isVerified': is_verified,                # whether the author is verified
                    'uri': uri,                               # the author's Facebook profile
                    'thumbSrc': thumb_src                     # the author's avatar URL
                }
            }

            print(comment_result_dict)
            self.save_comment(self.json_name, json.dumps(comment_result_dict, ensure_ascii=False))

            # Second pass: store second-level comments (replies that are
            # visible without clicking "more replies").
            # Criterion: commentIDs is present.
            try:
                reply_ids = comment['public_replies']['commentIDs']
                self.extract_comment(comment_dict, reply_ids)
            except KeyError:
                pass

            # Third pass: store third-level comments (replies that only
            # appear after clicking "more replies").
            # Criterion: afterCursor is present.
            try:
                reply_after_cursor = comment['public_replies']['afterCursor']
                reply_id = comment_ids[i]
                reply_url = self.reply_base_url.format(reply_id)
                self.get_comment(reply_after_cursor, reply_url)
            except KeyError:
                pass

    def run(self) -> None:
        self.get_app_id()
        after_cursor = self.get_first_parameter()
        if len(after_cursor) < 20:
            print('\n{} comment collection finished!'.format(PAGE_URL))
        else:
            comment_url = self.comment_base_url.format(self.target_id)
            self.get_comment(after_cursor, comment_url)
            print('\n{} comment collection finished!'.format(PAGE_URL))


if __name__ == '__main__':
    FC = FacebookComment()
    FC.run()
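
Because `save_comment` appends one JSON string per line, the output file is best read back line by line rather than with a single `json.load`. The sketch below is only an illustration of that: `load_comments` is a hypothetical helper, and the records written to the temporary file are made-up stand-ins with the same `author` / `comment` layout as the script's output.

```python
import json
import os
import tempfile
from collections import Counter


def load_comments(path: str) -> list:
    """Read a file that holds one JSON object per line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up records shaped like the script's output, written the same way it writes.
sample_records = [
    {"author": {"authorName": "Alice"}, "comment": {"content": "first"}},
    {"author": {"authorName": "Bob"}, "comment": {"content": "second"}},
    {"author": {"authorName": "Alice"}, "comment": {"content": "third"}},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "facebook_comments.json")
    with open(path, "a+", encoding="utf-8") as f:
        for record in sample_records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    records = load_comments(path)

# Example use: count comments per author.
counts = Counter(record["author"]["authorName"] for record in records)
print(counts.most_common())
```

Since the script opens the file in append mode, rerunning it on the same page adds duplicate lines; deduplicating on `commentID` after loading is one way to handle that.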