Web Scraping: Fetching Job Listings from Lagou.com

The focus is on demonstrating the format of the parameters passed to urllib.request.Request(), such as url, data, and headers.
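Before the full demo, here is a minimal sketch of how these parameters fit together. Nothing is sent over the network; the header value is a placeholder, and the key point is that passing `data=` switches urllib to a POST request, while omitting it yields a GET.

```python
import urllib.parse
import urllib.request

url = "https://www.lagou.com/jobs/positionAjax.json"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder User-Agent

# urlencode turns the dict into "first=false&pn=1&kd=..." and
# encode() produces the bytes that Request expects for data=.
form = urllib.parse.urlencode({"first": "false", "pn": 1, "kd": "爬虫"}).encode("utf-8")

post_req = urllib.request.Request(url, data=form, headers=headers)
get_req = urllib.request.Request(url, headers=headers)

print(post_req.get_method())  # POST
print(get_req.get_method())   # GET
```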

Demo (POST request):

import urllib.request
import urllib.parse
import json, jsonpath, csv

url = "https://www.lagou.com/jobs/positionAjax.json?city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false"
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.单线程",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Content-Length": "38",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "_ga=GA1.2.1963509933.1531996888; user_trace_token=20180719184127-4a8c7914-8b40-11e8-9eb6-525400f775ce; LGUID=20180719184127-4a8c7df2-8b40-11e8-9eb6-525400f775ce; JSESSIONID=ABAAABAAAIAACBI0F0B14254DA54E3CCF3B1F22FE32B179; _gid=GA1.2.1918046323.1536408617; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1536408620; X_HTTP_TOKEN=339034308973d0bd323cc0b9b6b3203a; LG_LOGIN_USER_ID=24096d6ba723e146bd326de981ab924b23c1f21775136c3a8be953e855211e61; _putrc=95519B7FB60FCF58123F89F2B170EADC; login=true; unick=%E9%A9%AC%E7%BB%A7%E4%B8%9A; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=55; gate_login_token=3b4fa15daef090780ae377bbcd66dc83af9af0cc6a7f1dd697770790f3b9f9ef; index_location_city=%E4%B8%8A%E6%B5%B7; TG-TRACK-CODE=search_code; _gat=1; LGSID=20180908221639-cd6d2a72-b371-11e8-b62b-5254005c3644; PRE_UTM=; PRE_HOST=; PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_python%3FlabelWords%3D%26fromSearch%3Dtrue%26suginput%3D; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_%25E7%2588%25AC%25E8%2599%25AB%3Fcity%3D%25E4%25B8%258A%25E6%25B5%25B7%26cl%3Dfalse%26fromSearch%3Dtrue%26labelWords%3D%26suginput%3D; SEARCH_ID=e559a417b4464fd9bc0b439a67ef0a5a; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1536416580; LGRID=20180908222259-afed9b74-b372-11e8-b62b-5254005c3644",
    "Host": "www.lagou.com",
    "Origin": "https://www.lagou.com",
    "Referer": "https://www.lagou.com/jobs/list_%E7%88%AC%E8%99%AB?city=%E4%B8%8A%E6%B5%B7&cl=false&fromSearch=true&labelWords=&suginput=",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "X-Anit-Forge-Code": "0",
    "X-Anit-Forge-Token": "None",
    "X-Requested-With": "XMLHttpRequest"}
# params = {"city": "上海", "needAddtionalResult": "false"}
list_position = []
for pn in range(1, 5):
    data = {
        "first": "false",
        "pn": pn,
        "kd": "爬虫"
    }
    # params = urllib.parse.urlencode(params)
    # url = url + params
    data = urllib.parse.urlencode(data).encode('utf-8')
    req = urllib.request.Request(url, data=data, headers=headers)
    print('Requesting page %d' % pn)
    str_data = urllib.request.urlopen(req).read()
    with open('03.html', 'wb') as f:
        f.write(str_data)
    # Parse the JSON response into a Python object
    data_list = json.loads(str_data)
    matches = jsonpath.jsonpath(data_list, "$..result")
    if not matches:  # Lagou returns an error payload (no "result") when it throttles requests
        print('Page %d returned no result; the request may have been blocked' % pn)
        continue
    job_list = matches[0]

    for item in job_list:
        position_dict = {}
        position_dict['positionName'] = item.get('positionName')
        position_dict['createTime'] = item.get('createTime')
        position_dict['url'] = 'https://www.lagou.com/jobs/' + str(item.get('positionId')) + '.html'

        position_dict['salary'] = item.get('salary')
        position_dict['workYear'] = item.get('workYear')
        position_dict['companySize'] = item.get('companySize')
        list_position.append(position_dict)

# Save to a JSON file (ensure_ascii=False keeps the Chinese text readable)
with open('03.json', 'w', encoding='utf-8') as f:
    json.dump(list_position, f, ensure_ascii=False)

# Save to a CSV file; writing with the default gbk encoding on Windows raised:
# 'gbk' codec can't encode character '\u200b' in position 0: illegal multibyte sequence
csv_writer = csv.writer(open('04.csv', 'w', encoding='utf-8', newline=''))
sheets = list_position[0].keys()  # header row
row_content = []
for item in list_position:
    row_content.append(item.values())  # row contents
try:
    csv_writer.writerow(sheets)
    csv_writer.writerows(row_content)
except Exception as e:
    print(e)
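As an alternative to building the header and rows by hand, the CSV step can be sketched with csv.DictWriter, which writes the header and rows straight from the dicts. The sample data below stands in for the scraped results; a StringIO buffer is used so the sketch is self-contained, but the same code works with `open('04.csv', 'w', encoding='utf-8', newline='')`, which avoids both the gbk UnicodeEncodeError noted above and blank lines on Windows.

```python
import csv
import io

# Sample rows standing in for list_position
list_position = [
    {"positionName": "爬虫工程师", "salary": "15k-25k"},
    {"positionName": "数据工程师", "salary": "20k-30k"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(list_position[0].keys()))
writer.writeheader()          # header row from fieldnames
writer.writerows(list_position)  # one row per dict
print(buf.getvalue())
```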



Reposted from: https://www.cnblogs.com/We612/p/9978288.html
