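CefSharp, a .NET wrapper around the Chromium Embedded Framework, is in my experience much more convenient for scraping than Selenium, which was built for page testing and only later pressed into service as a scraping tool. When we automate a page, after sending a message or submitting a form we need to know whether the action succeeded, because some pages fail to submit due to network latency. That means reading the server's actual response accurately. Reading the outcome from page pop-ups is tedious, and a single request can have several outcomes: success, failure, timeout, and so on. Being able to read the response to the request directly makes that judgment far easier.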
For example, this is the response to the GET request generated by clicking the Baidu search box:
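CefSharp provides callbacks for the different stages of a page's lifecycle, similar to the lifecycle hooks in the Vue front-end framework. It lets you modify the submitted parameters (the postData) when a POST or GET request is sent, and also intercept and modify images, JS files, CSS stylesheets, and so on. This article only records how to obtain the JSON, XML, or HTML data returned for a GET or POST request.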
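All of these customizations are based on the IResourceRequestHandler interface. First, create a new class that implements it and exposes each callback as an overridable method.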
public class ResourceRequestHandler : IResourceRequestHandler
{
    // Each explicit interface implementation forwards to the protected virtual method
    // of the same name, which carries the documentation and can be overridden.
    ICookieAccessFilter IResourceRequestHandler.GetCookieAccessFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request)
    {
        return GetCookieAccessFilter(chromiumWebBrowser, browser, frame, request);
    }

    /// <summary>
    /// Called on the CEF IO thread before a resource request is loaded. To optionally filter cookies for the request return a
    /// <see cref="ICookieAccessFilter"/> object.
    /// </summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - can be modified in this callback.</param>
    /// <returns>To optionally filter cookies for the request return a ICookieAccessFilter instance otherwise return null.</returns>
    protected virtual ICookieAccessFilter GetCookieAccessFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request)
    {
        return null;
    }

    IResourceHandler IResourceRequestHandler.GetResourceHandler(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request)
    {
        return GetResourceHandler(chromiumWebBrowser, browser, frame, request);
    }

    /// <summary>
    /// Called on the CEF IO thread before a resource is loaded. To specify a handler for the resource return a
    /// <see cref="IResourceHandler"/> object.
    /// </summary>
    /// <param name="chromiumWebBrowser">The browser UI control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - cannot be modified in this callback.</param>
    /// <returns>
    /// To allow the resource to load using the default network loader return null otherwise return an instance of
    /// <see cref="IResourceHandler"/> with a valid stream.
    /// </returns>
    protected virtual IResourceHandler GetResourceHandler(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request)
    {
        return null;
    }

    IResponseFilter IResourceRequestHandler.GetResourceResponseFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
    {
        return GetResourceResponseFilter(chromiumWebBrowser, browser, frame, request, response);
    }

    /// <summary>Called on the CEF IO thread to optionally filter resource response content.</summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - cannot be modified in this callback.</param>
    /// <param name="response">the response object - cannot be modified in this callback.</param>
    /// <returns>Return an IResponseFilter to intercept this response, otherwise return null.</returns>
    protected virtual IResponseFilter GetResourceResponseFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
    {
        return null;
    }

    CefReturnValue IResourceRequestHandler.OnBeforeResourceLoad(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IRequestCallback callback)
    {
        return OnBeforeResourceLoad(chromiumWebBrowser, browser, frame, request, callback);
    }

    /// <summary>
    /// Called on the CEF IO thread before a resource request is loaded. To redirect or change the resource load optionally modify
    /// <paramref name="request"/>. Modification of the request URL will be treated as a redirect.
    /// </summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - can be modified in this callback.</param>
    /// <param name="callback">Callback interface used for asynchronous continuation of url requests.</param>
    /// <returns>
    /// Return <see cref="CefReturnValue.Continue"/> to continue the request immediately. Return
    /// <see cref="CefReturnValue.ContinueAsync"/> and call <see cref="IRequestCallback.Continue"/> or
    /// <see cref="IRequestCallback.Cancel"/> at a later time to continue or cancel the request asynchronously. Return
    /// <see cref="CefReturnValue.Cancel"/> to cancel the request immediately.
    /// </returns>
    protected virtual CefReturnValue OnBeforeResourceLoad(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IRequestCallback callback)
    {
        return CefReturnValue.Continue;
    }

    bool IResourceRequestHandler.OnProtocolExecution(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request)
    {
        return OnProtocolExecution(chromiumWebBrowser, browser, frame, request);
    }

    /// <summary>
    /// Called on the CEF UI thread to handle requests for URLs with an unknown protocol component. SECURITY WARNING: YOU SHOULD USE
    /// THIS METHOD TO ENFORCE RESTRICTIONS BASED ON SCHEME, HOST OR OTHER URL ANALYSIS BEFORE ALLOWING OS EXECUTION.
    /// </summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - cannot be modified in this callback.</param>
    /// <returns>
    /// return true to attempt execution via the registered OS protocol handler, if any. Otherwise return false.
    /// </returns>
    protected virtual bool OnProtocolExecution(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request)
    {
        return false;
    }

    void IResourceRequestHandler.OnResourceLoadComplete(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, UrlRequestStatus status, long receivedContentLength)
    {
        OnResourceLoadComplete(chromiumWebBrowser, browser, frame, request, response, status, receivedContentLength);
    }

    /// <summary>
    /// Called on the CEF IO thread when a resource load has completed. This method will be called for all requests, including
    /// requests that are aborted due to CEF shutdown or destruction of the associated browser. In cases where the associated browser
    /// is destroyed this callback may arrive after the <see cref="ILifeSpanHandler.OnBeforeClose"/> callback for that browser. The
    /// <see cref="IFrame.IsValid"/> method can be used to test for this situation, and care
    /// should be taken not to call <paramref name="browser"/> or <paramref name="frame"/> methods that modify state (like LoadURL,
    /// SendProcessMessage, etc.) if the frame is invalid.
    /// </summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - cannot be modified in this callback.</param>
    /// <param name="response">the response object - cannot be modified in this callback.</param>
    /// <param name="status">indicates the load completion status.</param>
    /// <param name="receivedContentLength">is the number of response bytes actually read.</param>
    protected virtual void OnResourceLoadComplete(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, UrlRequestStatus status, long receivedContentLength)
    {
    }

    void IResourceRequestHandler.OnResourceRedirect(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, ref string newUrl)
    {
        OnResourceRedirect(chromiumWebBrowser, browser, frame, request, response, ref newUrl);
    }

    /// <summary>
    /// Called on the CEF IO thread when a resource load is redirected. The <paramref name="request"/> parameter will contain the old
    /// URL and other request-related information. The <paramref name="response"/> parameter will contain the response that resulted
    /// in the redirect. The <paramref name="newUrl"/> parameter will contain the new URL and can be changed if desired.
    /// </summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object - cannot be modified in this callback.</param>
    /// <param name="response">the response object - cannot be modified in this callback.</param>
    /// <param name="newUrl">[in,out] the new URL and can be changed if desired.</param>
    protected virtual void OnResourceRedirect(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, ref string newUrl)
    {
    }

    bool IResourceRequestHandler.OnResourceResponse(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
    {
        return OnResourceResponse(chromiumWebBrowser, browser, frame, request, response);
    }

    /// <summary>
    /// Called on the CEF IO thread when a resource response is received. To allow the resource load to proceed without modification
    /// return false. To redirect or retry the resource load optionally modify <paramref name="request"/> and return true.
    /// Modification of the request URL will be treated as a redirect. Requests handled using the default network loader cannot be
    /// redirected in this callback.
    ///
    /// WARNING: Redirecting using this method is deprecated. Use OnBeforeResourceLoad or GetResourceHandler to perform redirects.
    /// </summary>
    /// <param name="chromiumWebBrowser">The ChromiumWebBrowser control.</param>
    /// <param name="browser">the browser object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="frame">the frame object - may be null if originating from ServiceWorker or CefURLRequest.</param>
    /// <param name="request">the request object.</param>
    /// <param name="response">the response object - cannot be modified in this callback.</param>
    /// <returns>
    /// To allow the resource load to proceed without modification return false. To redirect or retry the resource load optionally
    /// modify <paramref name="request"/> and return true. Modification of the request URL will be treated as a redirect. Requests
    /// handled using the default network loader cannot be redirected in this callback.
    /// </returns>
    protected virtual bool OnResourceResponse(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
    {
        return false;
    }

    /// <summary>
    /// Called when the unmanaged resource is freed.
    /// Unmanaged resources are ref counted and freed when
    /// the last reference is released, this works differently
    /// to .Net garbage collection.
    /// </summary>
    protected virtual void Dispose()
    {
    }

    void IDisposable.Dispose()
    {
        Dispose();
    }
}
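The listing above closely mirrors the ResourceRequestHandler class that CefSharp itself ships in the CefSharp.Handler namespace, so on a CefSharp version that includes it, deriving from that class may save the boilerplate. The following is only a sketch under that assumption: the class name LoggingResourceRequestHandler is made up for illustration, and the base class is fully qualified to avoid clashing with the hand-written class of the same name.

```csharp
using System;
using CefSharp;

// Hypothetical example class: overrides a single callback and inherits the
// default behaviour for everything else from CefSharp's own base class.
public class LoggingResourceRequestHandler : CefSharp.Handler.ResourceRequestHandler
{
    protected override void OnResourceLoadComplete(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, UrlRequestStatus status, long receivedContentLength)
    {
        // Log one line per finished request: method, URL, HTTP status and load status.
        Console.WriteLine($"{request.Method} {request.Url} -> {response?.StatusCode} ({status}, {receivedContentLength} bytes)");
    }
}
```

Whether to copy the class or derive from the shipped one is a matter of taste; copying makes every callback visible in your own project, while deriving keeps the project smaller.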
The response body itself is received in the methods of the IResponseFilter interface, so create another class that implements IResponseFilter.
using System;
using System.Collections.Generic;
using System.IO;
using CefSharp;

public class TestJsonFilter : IResponseFilter
{
    // Every byte that passes through the filter is accumulated here.
    public List<byte> DataAll = new List<byte>();

    public FilterStatus Filter(Stream dataIn, out long dataInRead, Stream dataOut, out long dataOutWritten)
    {
        dataInRead = 0;
        dataOutWritten = 0;
        try
        {
            if (dataIn == null || dataIn.Length == 0)
            {
                return FilterStatus.Done;
            }

            // dataIn can be larger than dataOut, so copy only what fits in this call.
            long count = Math.Min(dataIn.Length, dataOut.Length);
            byte[] bs = new byte[count];
            dataIn.Read(bs, 0, bs.Length);
            dataOut.Write(bs, 0, bs.Length); // pass the data through to the page unchanged
            DataAll.AddRange(bs);            // keep a copy for later inspection

            dataInRead = count;
            dataOutWritten = count;

            // If part of dataIn is still unread, ask to be called again with the rest.
            return count < dataIn.Length ? FilterStatus.NeedMoreData : FilterStatus.Done;
        }
        catch (Exception)
        {
            // Report the failure instead of claiming bytes that were never copied.
            return FilterStatus.Error;
        }
    }

    public bool InitFilter()
    {
        return true;
    }

    public void Dispose()
    {
    }
}
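The reason a filter may be called several times for one response is that the output buffer can be smaller than the input. The sketch below illustrates that bookkeeping with plain MemoryStream objects and no CefSharp types; ChunkCopy and CopyChunk are hypothetical names, and reusing one input stream across calls is a simplification of what the browser actually does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ChunkCopy
{
    // Copies as much unread input as fits into the output, records the copied bytes
    // in sink, and returns true while unread input remains ("need more data").
    public static bool CopyChunk(Stream dataIn, Stream dataOut, List<byte> sink,
                                 out long dataInRead, out long dataOutWritten)
    {
        long remainingIn = dataIn.Length - dataIn.Position;
        long roomOut = dataOut.Length - dataOut.Position;
        int count = (int)Math.Min(remainingIn, roomOut);

        var buffer = new byte[count];
        int read = dataIn.Read(buffer, 0, count);
        dataOut.Write(buffer, 0, read);
        for (int i = 0; i < read; i++) sink.Add(buffer[i]);

        dataInRead = read;
        dataOutWritten = read;
        return dataIn.Position < dataIn.Length;
    }

    public static void Main()
    {
        var input = new MemoryStream(new byte[] { 1, 2, 3, 4, 5 });
        var sink = new List<byte>();

        // Fixed-size 3-byte buffers stand in for the browser's output stream.
        bool more = CopyChunk(input, new MemoryStream(new byte[3]), sink, out long r1, out long w1);
        Console.WriteLine($"{r1} {w1} {more}");   // 3 3 True
        more = CopyChunk(input, new MemoryStream(new byte[3]), sink, out long r2, out long w2);
        Console.WriteLine($"{r2} {w2} {more}");   // 2 2 False
        Console.WriteLine(string.Join(",", sink)); // 1,2,3,4,5
    }
}
```

Five input bytes need two passes through a three-byte output buffer, and the sink ends up with the complete body in order, which is the property the response filter relies on.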
Then create one more class that helps read the captured response back.
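public class FilterManager
{
    // Maps a request identifier to the filter that records its response.
    private static Dictionary<string, IResponseFilter> dataList = new Dictionary<string, IResponseFilter>();

    public static IResponseFilter CreateFilter(string guid)
    {
        lock (dataList)
        {
            var filter = new TestJsonFilter();
            dataList[guid] = filter; // overwrite instead of throwing if the key already exists
            return filter;
        }
    }

    // The method name keeps its original spelling because other code calls it.
    // Returns null when no filter was created for this identifier.
    public static IResponseFilter GetFileter(string guid)
    {
        lock (dataList)
        {
            IResponseFilter filter;
            return dataList.TryGetValue(guid, out filter) ? filter : null;
        }
    }
}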
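One thing the manager above never does is remove entries, so the dictionary grows with every request in a long-running session. A possible variation, sketched here with hypothetical names (FilterRegistry, Register, Take) and without any CefSharp types, hands the filter back and forgets it in the same step.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical generic registry: an entry lives only until it is taken.
public class FilterRegistry<TFilter> where TFilter : class
{
    private readonly Dictionary<string, TFilter> filters = new Dictionary<string, TFilter>();
    private readonly object gate = new object();

    public void Register(string id, TFilter filter)
    {
        lock (gate) { filters[id] = filter; }
    }

    // Returns the registered filter and removes it; returns null if the id is unknown.
    public TFilter Take(string id)
    {
        lock (gate)
        {
            TFilter filter;
            if (filters.TryGetValue(id, out filter))
            {
                filters.Remove(id);
                return filter;
            }
            return null;
        }
    }

    public int Count
    {
        get { lock (gate) { return filters.Count; } }
    }
}

public static class FilterRegistryDemo
{
    public static void Main()
    {
        var registry = new FilterRegistry<string>();
        registry.Register("42", "payload");
        Console.WriteLine(registry.Take("42"));         // payload
        Console.WriteLine(registry.Take("42") == null); // True
        Console.WriteLine(registry.Count);              // 0
    }
}
```

Taking the filter once the load has completed would keep the dictionary bounded by the number of requests still in flight.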
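Next, override two methods, GetResourceResponseFilter and OnResourceLoadComplete; the response can be read in OnResourceLoadComplete. The request parameter of that method is an object describing the request (URL, method, and so on), while the response body comes from the filter registered for that request. Set a breakpoint on the line if (request.Url.ToLower().Contains("sugrec")) to inspect the contents of request, then add your own conditions to filter the responses. Here I first check that the request is a GET, then filter by a keyword in the request URL, and finally print the result to the console window attached to the WinForms application.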
public class WinFormResourceRequestHandler : ResourceRequestHandler
{
    protected override IResponseFilter GetResourceResponseFilter(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response)
    {
        // Attach a recording filter to every response, keyed by the request identifier.
        var filter = FilterManager.CreateFilter(request.Identifier.ToString());
        return filter;
    }

    protected override void OnResourceLoadComplete(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, IResponse response, UrlRequestStatus status, long receivedContentLength)
    {
        // First narrow by request method (POST or GET).
        if (request.Method == "GET")
        {
            // Then narrow by a keyword in the URL.
            if (request.Url.ToLower().Contains("sugrec"))
            {
                var filter = FilterManager.GetFileter(request.Identifier.ToString()) as TestJsonFilter;
                if (filter == null)
                {
                    return; // no response body was recorded for this request
                }
                UTF8Encoding encoding = new UTF8Encoding();
                // The captured response body is decoded here.
                var data = encoding.GetString(filter.DataAll.ToArray());
                System.Console.WriteLine("Response body: " + data);
            }
        }
    }
}

public class WinFormsRequestHandler : RequestHandler
{
    protected override IResourceRequestHandler GetResourceRequestHandler(IWebBrowser chromiumWebBrowser, IBrowser browser, IFrame frame, IRequest request, bool isNavigation, bool isDownload, string requestInitiator, ref bool disableDefaultHandling)
    {
        //NOTE: In most cases you examine the request.Url and only handle requests you are interested in
        if (request.Url.ToLower().Contains("login"))
        {
            using (var postData = request.PostData)
            {
                if (postData != null)
                {
                    var elements = postData.Elements;
                    var charSet = request.GetCharSet();
                    foreach (var element in elements)
                    {
                        if (element.Type == PostDataElementType.Bytes)
                        {
                            // The submitted POST body as text; inspect or log it here.
                            var body = element.GetBody(charSet);
                        }
                    }
                }
            }
        }
        return new WinFormResourceRequestHandler();
    }
}
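Once the raw bytes are captured, they still have to be decoded and interpreted to tell a successful submission from a failed one. The following is a minimal sketch of that last step using only .NET types; ResponseText, Decode and Classify are hypothetical helpers, the "status" field is an invented example rather than anything a real server is known to return, and System.Text.Json is assumed to be available (it is built into .NET Core 3.0 and later, and a NuGet package on .NET Framework).

```csharp
using System;
using System.Text;
using System.Text.Json;

public static class ResponseText
{
    // Decodes the body with the given charset name, falling back to UTF-8
    // when the name is empty or not supported on this runtime.
    public static string Decode(byte[] body, string charset)
    {
        Encoding encoding;
        try
        {
            encoding = string.IsNullOrWhiteSpace(charset) ? Encoding.UTF8 : Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            encoding = Encoding.UTF8;
        }
        return encoding.GetString(body);
    }

    // Returns the value of a hypothetical top-level "status" string field,
    // "unknown" if the field is absent, or "not-json" if the text is not JSON.
    public static string Classify(string text)
    {
        try
        {
            using (JsonDocument doc = JsonDocument.Parse(text))
            {
                JsonElement root = doc.RootElement;
                JsonElement status;
                if (root.ValueKind == JsonValueKind.Object
                    && root.TryGetProperty("status", out status)
                    && status.ValueKind == JsonValueKind.String)
                {
                    return status.GetString();
                }
                return "unknown";
            }
        }
        catch (JsonException)
        {
            return "not-json";
        }
    }

    public static void Main()
    {
        byte[] body = Encoding.UTF8.GetBytes("{\"status\":\"ok\"}");
        Console.WriteLine(Classify(Decode(body, "utf-8"))); // ok
        Console.WriteLine(Classify("<html></html>"));       // not-json
    }
}
```

In the handler, the byte array would come from the filter's DataAll list, and the charset name could be taken from the response if the server declares one.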
After running the program:
How do we use these custom classes?
using System;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using CefSharp;
using CefSharp.WinForms;
using CefSharp.Handler;
using System.Runtime.InteropServices;
using System.Threading;
using System.Text.RegularExpressions;
using System.Security.Cryptography.X509Certificates;
using System.IO;

public partial class Form1 : Form
{
    ChromiumWebBrowser different;

    [DllImport("kernel32.dll")]
    public static extern bool AllocConsole();

    [DllImport("kernel32.dll")]
    public static extern bool FreeConsole();

    public Form1()
    {
        InitializeComponent();
        AllocConsole(); // Attach a console window for displaying output
    }

    private void Form1_Load(object sender, EventArgs e)
    {
        CefSettings settings = new CefSettings();
        // Cookie storage directory: C:\Users\××× (Windows user name)\AppData\Local\Know
        settings.CachePath = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData) + @"\Know";
        Cef.Initialize(settings); // Initialize the Cef component

        different = new ChromiumWebBrowser("https://www.baidu.com");
        different.RequestHandler = new WinFormsRequestHandler(); // Apply the interception rules
        // Open new pages in the current window (CefLifeSpanHandler is not listed in this article)
        different.LifeSpanHandler = new CefLifeSpanHandler();
        different.BrowserSettings = new BrowserSettings()
        {
            WebGl = CefState.Enabled,
            ImageLoading = CefState.Enabled,
            RemoteFonts = CefState.Enabled,
            AcceptLanguageList = "zh-CN"
        };
        tableLayoutPanel1.Controls.Add(different, 0, 1); // Add the browser control to the layout container
    }

    // Called before the window closes
    private void Form1_FormClosing(object sender, FormClosingEventArgs e)
    {
        FreeConsole(); // Release the attached console, otherwise an error is raised
    }
}
Reference: https://www.cnblogs.com/heifengwll/p/13277232.html
How to intercept and replace page resources such as JS and CSS: "CefSharp请求资源拦截及自定义处理" (Tencent Cloud Developer Community)