思路:1.获取拉勾网搜索到职位的页数
2.调用接口获取职位id
3.根据职位id访问页面,匹配出关键字
url访问采用unirest,由于拉钩反爬虫,短时间内频繁访问会被限制访问,所以没有采用多线程,而且每个页面访问时间间隔设定为10s,通过nokogiri解析页面,正则匹配只获取技能要求中的英文单词,可能存在数据不准确的情况
数据持久化到excel中,采用ruby erb生成word_cloud报告
爬虫代码:
require 'unirest' require 'uri' require 'nokogiri' require 'json' require 'win32ole'@position = '测试开发工程师' @city = '杭州'# 页面访问 def query_url(method, url, headers:{}, parameters:nil)case methodwhen :getUnirest.get(url, headers:headers).bodywhen :postUnirest.post(url, headers:headers, parameters:parameters).bodyend end# 获取页数 def get_page_num(url)html = query_url(:get, url).force_encoding('utf-8')html.scan(/<span class="span totalNum">(\d+)<\/span>/).first.first end# 获取每页显示的所有职位的id def get_positionsId(url, headers:{}, parameters:nil)response = query_url(:post, url, headers:headers, parameters:parameters)positions_id = Array.newresponse['content']['positionResult']['result'].each{|i| positions_id << i['positionId']}positions_id end# 匹配职位英文关键字 def get_skills(url)puts "loading url: #{url}"html = query_url(:get, url).force_encoding('utf-8')doc = Nokogiri::HTML(html)data = doc.css('dd.job_bt')data.text.scan(/[a-zA-Z]+/) end# 计算词频 def word_count(arr)arr.map!(&:downcase)arr.select!{|i| i.length>1}counter = Hash.new(0)arr.each { |k| counter[k]+=1 }# 过滤num=1的数据counter.select!{|_,v| v > 1}counter2 = counter.sort_by{|_,v| -v}.to_hcounter2 end# 转换 def parse(hash)data = Array.newhash.each do |k,v|word = Hash.newword['name'] = kword['value'] = vdata << wordendJSON data end# 持久化数据 def save_excel(hash)excel = WIN32OLE.new('Excel.Application')excel.visible = falseworkbook = excel.Workbooks.Add()worksheet = workbook.Worksheets(1)# puts hash.size(1..hash.size+1).each do |i|if i == 1# puts "A#{i}:B#{i}"worksheet.Range("A#{i}:B#{i}").value = ['关键词', '频次']else# puts i# puts hash.keys[i-2], hash.values[i-2]worksheet.Range("A#{i}:B#{i}").value = [hash.keys[i-2], hash.values[i-2]]endendexcel.DisplayAlerts = falseworkbook.saveas(File.dirname(__FILE__)+'\lagouspider.xls')workbook.saved = trueexcel.ActiveWorkbook.Close(1)excel.Quit() end# 获取页数 url = URI.encode("https://www.lagou.com/jobs/list_#@position?city=#@city&cl=false&fromSearch=true&labelWords=&suginput=") num = get_page_num(url).to_i puts "存在 #{num} 个信息分页"skills = Array.new (1..num).each do |i|puts "定位在第#{i}页"# 获取positionsidurl2 = URI.encode("https://www.lagou.com/jobs/positionAjax.json?city=#@city&needAddtionalResult=false")headers = {Referer:url, 'User-Agent':i%2==1?'Mozilla/5.0':'Chrome/67.0.3396.87'}parameters = {first:(i==1), pn:i, kd:@position}positions_id = get_positionsId(url2, headers:headers, parameters:parameters)positions_id.each do |id|# 访问具体职位页面,提取英文技能关键字url3 = "https://www.lagou.com/jobs/#{id}.html"skills.concat get_skills(url3)sleep 10end endcount = word_count(skills) save_excel(count) @data = parse(count)
效果展示: