by Praveen Dubey
How to use the browser console to scrape and save data in a file with JavaScript
A while back I had to crawl a site for links, and then use those page links to crawl the data of each page with Selenium or Puppeteer. The setup of the content on the site was a bit uncanny, so I couldn’t start directly with Selenium and Node. Also, unfortunately, the amount of data on the site was huge. I had to quickly come up with an approach to first crawl all the links, then pass those on for the detailed crawling of each page.
That’s where I learned this cool stuff with the browser Console API. You can use this on any website without much setup, as it’s just JavaScript.
Let’s jump into the technical details.
High Level Overview
For crawling all the links on a page, I wrote a small piece of JS in the console. This JavaScript crawls all the links (it takes 1–2 hours, as it also handles pagination) and dumps a json file with all the crawled data. The thing to keep in mind is that you need to make sure the website behaves like a single page application, i.e. that it does not reload the page as you navigate, if you want to crawl more than one page. If it does reload, your console code will be gone.
Medium does not refresh the page in some scenarios. For now, let’s crawl a story and automatically save the scraped data in a file from the console after scraping.
But before we do that, here’s a quick demo of the final execution.
1. Get the console object instance from the browser
// Console API to clear console before logging new data
if (typeof console._commandLineAPI !== 'undefined') {
    console.API = console._commandLineAPI; // Chrome
} else if (typeof console._inspectorCommandLineAPI !== 'undefined') {
    console.API = console._inspectorCommandLineAPI; // Safari
} else if (typeof console.clear !== 'undefined') {
    console.API = console;
}
The code simply tries to get the console object instance based on the user’s current browser. You can skip the detection and directly assign the instance for your browser.
For example, if you’re using Chrome, the code below should be sufficient.
if (typeof console._commandLineAPI !== 'undefined') {
    console.API = console._commandLineAPI; // Chrome
}
2. Defining the Junior helper function
I’ll assume that you have a Medium story open in your browser by now. Lines 6 to 12 define the DOM element attributes which can be used to extract the story title, clap count, user name, profile image URL, profile description, and read time of the story, respectively.
These are the basic things which I want to show for this story. You can add a few more elements like extracting links from the story, all images, or embed links.
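As a rough illustration of what such a helper looks like, here is a hypothetical sketch. The selectors below are illustrative guesses, not Medium’s actual attributes (those change frequently), so inspect the page and substitute the real selectors before running it.

```javascript
// Small utility: read the trimmed text content of the first element
// matching a CSS selector, or null if nothing matches.
function textOf(selector) {
  // Guard so the helper degrades gracefully outside a browser.
  if (typeof document === 'undefined') return null;
  const node = document.querySelector(selector);
  return node ? node.textContent.trim() : null;
}

// Collect the story fields into one object. Every selector here is a
// placeholder; replace with the attributes you find in DevTools.
function scrapeStory() {
  return {
    title: textOf('h1'),                                     // story title
    claps: textOf('button[data-action="show-recommends"]'),  // clap count (hypothetical)
    author: textOf('a[data-action="show-user-card"]'),       // user name (hypothetical)
    readTime: textOf('span[title]'),                         // read time (hypothetical)
  };
}
```

Running `scrapeStory()` in the console of an open story returns one plain object, ready to be pushed into a collection.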
3. Defining our Senior helper function — the beast
As we are crawling the page for different elements, we will save them in a collection. This collection will be passed to one of the main functions.
We have defined a function named console.save. The task of this function is to dump a csv / json file with the data passed.
It creates a Blob object with our data. A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format.
The created Blob is attached to a link tag <a> on which a click event is triggered.
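A minimal sketch of such a console.save is shown below. It is an assumption-laden version, not the exact code from the article’s gist: it serializes the data, wraps it in a Blob, and clicks a temporary <a> tag to trigger the download. The download step is guarded as browser-only, and the serialized payload is returned for quick inspection.

```javascript
// Sketch: dump the passed data into a downloadable JSON file.
console.save = function (data, filename) {
  if (!data) {
    console.error('console.save: no data passed');
    return;
  }
  filename = filename || 'console.json';
  if (typeof data === 'object') {
    data = JSON.stringify(data, null, 4); // pretty-print objects/arrays
  }
  // Browser-only part: outside a browser there is no document to attach to.
  if (typeof document !== 'undefined') {
    const blob = new Blob([data], { type: 'text/json' });
    const a = document.createElement('a');
    a.download = filename;                 // suggested file name
    a.href = URL.createObjectURL(blob);    // blob:// URL for the payload
    a.dispatchEvent(new MouseEvent('click')); // simulate the user click
  }
  return data; // serialized payload, handy for a quick sanity check
};
```

Calling `console.save([1, 2, 3], 'demo.json')` in the console then saves the array straight to a file.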
Here is a quick demo of console.save with a small array passed as data.
Putting together all the pieces of the code, this is what we have:
- Console API Instance
- Helper function to extract elements
- Console Save function to create a file
Let’s execute our console.save() in the browser to save the data in a file. For this, you can go to a story on Medium and execute this code in the browser console.
I have shown a demo of extracting data from a single page, but the same code can be tweaked to crawl multiple stories from a publisher’s home page. Take freeCodeCamp as an example: you can navigate from one story to another and come back (using the browser’s back button) to the publisher home page without the page being refreshed.
Below is the bare minimum code you need to extract multiple stories from a publisher’s home page.
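The gathering step of that multi-story flow can be sketched as follows. The selector is an assumption for illustration; adjust it to whatever anchors the publication home page actually renders. In a single-page app, clicking through to each collected link and going back keeps your console code alive between stories.

```javascript
// Hypothetical sketch: collect candidate story links from a
// publication's home page.
function collectStoryLinks() {
  if (typeof document === 'undefined') return []; // not in a browser
  const anchors = document.querySelectorAll('a[data-action="open-post"]');
  // De-duplicate, since a story card often wraps several anchors
  // pointing at the same URL.
  return Array.from(new Set(Array.from(anchors, a => a.href)));
}
```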
Let’s see the code in action for getting the profile description from multiple stories.
For any such application, once you have scraped the data, you can pass it to our console.save function and store it in a file.
The console save function can be quickly attached to your console code and can help you dump the data into a file. I am not saying you have to use the console for scraping data, but sometimes this will be a much quicker approach, since we are all very familiar with working with the DOM using CSS selectors.
You can download the code from GitHub.
Thank you for reading this article! Hope it gave you a cool idea to scrape some data quickly without much setup. Hit the clap button if you enjoyed it! If you have any questions, send me an email (praveend806 [at] gmail [dot] com).
Resources to learn more about the Console:
- Using the Console | Tools for Web Developers | Google Developers: Learn how to navigate the Chrome DevTools JavaScript Console. developers.google.com
- Browser Console: The Browser Console is like the Web Console, but applied to the whole browser rather than a single content tab. developer.mozilla.org
- Blob: A Blob object represents a file-like object of immutable, raw data. Blobs represent data that isn't necessarily in a JavaScript-native format. developer.mozilla.org
Translated from: https://www.freecodecamp.org/news/how-to-use-the-browser-console-to-scrape-and-save-data-in-a-file-with-javascript-b40f4ded87ef/