这一集我们讲一个比较简单的域名校验,可能你没有听过这个名字,因为这个名字是我编的,那么它究竟是什么呢?又为什么说它是掩耳盗铃呢?我们来看看下面的案例:
- 必应搜索页隐藏内容
- 虎嗅新闻跳转404
import requests
import chardet
from bs4 import BeautifulSoup,Commentdef remove_css(html):soup = BeautifulSoup(html, 'html.parser')# print(soup.text)# 删除<style>标签# for style_tag in soup('style'):# style_tag.decompose()# 删除<link>标签# for link_tag in soup('link'):# link_tag.decompose()# 删除<symbol>标签for symbol_tag in soup('symbol'):symbol_tag.decompose()# 删除<script>标签for script_tag in soup('script'):script_tag.decompose()# 删除<svg>标签for script_tag in soup('svg'):script_tag.decompose()# 删除注释comments = soup.find_all(string=lambda text: isinstance(text, Comment))for comment in comments:comment.extract()return str(soup)def download_page(url,file_name):headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36'}r = requests.get(url=url, headers=headers)encoding = chardet.detect(r.content)["encoding"]if encoding.lower() == "gb2312":encoding = 'gb18030'html = r.content.decode(encoding)with open(file_name,'w',encoding='utf-8') as f:f.write(html)# f.write(remove_css(html))url = 'https://cn.bing.com/search?q=%E5%AE%A4%E6%B8%A9%E8%B6%85%E5%AF%BC&form=QBLH&sp=-1&lq=0&pq=%E5%AE%A4%E6%B8%A9%E8%B6%85%E5%AF%BC&sc=10-4&qs=n&sk=&cvid=DA87FC09FB9F4425908E34195B622973&ghsh=0&ghacc=0&ghpl='
download_page(url=url,file_name='1.biying.html')
url = 'https://www.huxiu.com/article/1870796.html'
download_page(url=url,file_name='2.huxiu.html')
这两个页面获取到之后都无法正常显示,需要去掉请求到的页面里的js代码,就正常了
为什么说有掩耳盗铃的嫌疑呢?因为数据是真正获取到的,只是不给看到。
视频教程地址:https://www.bilibili.com/video/BV1RN411h78z/