Batch-checking standards for updates: my first web crawler

crazydingo posted on 2020-8-14 10:52

Last edited by crazydingo on 2021-12-23 13:41

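It has been a long time, and I am finally back. Friends keep asking me about this and hoping I would package it up. I had left it unpackaged to make it easier for everyone to use, but since people want it, I have packaged a copy.
Here it is for you to use.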

https://crazydingo.lanzoul.com/i5Mfdxw8coj
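-------------------------------------------------------------
For work I need to check whether our testing standards are still current. There are a lot of them, about 1,200, and the check has to be done every quarter. Doing it by hand takes far too much time and effort, so I decided to try it with Python.
I looked at many sites. Many of them now have anti-crawler measures or limit access per IP. The "famous" 工标网 limits each IP to 200 pages per day.
I found one site that I originally planned to parse with a webbrowser-style approach. After analyzing it, I saw that the site delivers its data as JSON, so I parsed that directly.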

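Open the page, press F12, and look at the XHR entries. There is a pile of requests in there; click into them and you can see the content I need to query.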

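Finally, inspect that request to see the URL being fetched and its parameters. With those, the data can be scraped.
Analyze the returned data and pick out the fields you need.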
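The request seen in the XHR list can be rebuilt in a few lines. This is only a sketch: it uses the standard library's urllib instead of requests so it runs without extra packages, it builds the request without sending it, and the standard number 'GB/T 1234-2008' is a made-up example. The URL and the id value are copied from the script in this post.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint and form fields as they appear in the script in this post.
URL = 'http://www.njbz365.com/njbzb/shopCartManage/getStanDetailInfo1.do'


def build_request(stan_num, record_id=2427529):
    """Build (but do not send) the form POST that the page issues via XHR."""
    form = {'id': record_id, 'stanNum': stan_num}
    body = urlencode(form).encode('utf-8')
    # Supplying data makes urllib treat this as a POST request.
    return Request(URL, data=body)


req = build_request('GB/T 1234-2008')  # hypothetical standard number
print(req.get_method())   # POST
print(req.data.decode())  # id=2427529&stanNum=GB%2FT+1234-2008
```

To actually send it, pass req to urllib.request.urlopen and decode the response with json.load.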
Because there are so many standards, I import them from Excel and write the query results back to Excel.
# coding: utf-8
import time

import requests
import xlrd
import xlwt

# Note: xlrd 2.0 and later can only read .xls files; for this .xlsx input,
# use xlrd 1.2 or save the workbook as .xls first.
std_file = 'std_check.xlsx'
url = 'http://www.njbz365.com/njbzb/shopCartManage/getStanDetailInfo1.do'

std_check_re = xlwt.Workbook()
sheet2 = std_check_re.add_sheet('结果', cell_overwrite_ok=True)  # sheet name: "Results"

# (header, column width in characters, key in the JSON response)
columns = [
    ('标准代号', 20, None),              # standard number (taken from the input sheet)
    ('标准中文名', 40, 'SN_CHN'),         # Chinese title
    ('标准英文名', 40, 'SN_EN'),          # English title
    ('标准状态', 10, 'SN_STATE'),         # status
    ('实施时间', 20, 'CARRY_OUT_DATE'),   # implementation date
    ('废止时间', 20, 'ABOLISH_DATE'),     # withdrawal date
    ('发布公告', 40, 'stanRel'),          # release announcement
]


def check_excel():
    wb = xlrd.open_workbook(filename=std_file)
    print(wb.sheet_names())
    sheet1 = wb.sheet_by_index(0)
    print(sheet1.nrows)

    # Header row and column widths (xlwt widths are in 1/256 of a character).
    for col, (header, width, _) in enumerate(columns):
        sheet2.col(col).width = 256 * width
        sheet2.write(0, col, header)

    # Start from row 1, skipping row 0 of the input sheet.
    for i in range(1, sheet1.nrows):
        time.sleep(3)  # throttle the requests
        check_std = sheet1.row_values(i)[0]  # standard number, assumed to be in the first column
        print(i, 'Now is checking', check_std)
        data = {"id": 2427529, "stanNum": check_std}
        wbdata = requests.post(url, data=data).json()
        info = wbdata['content']['map']
        sheet2.write(i, 0, check_std)
        for col, (_, _, key) in enumerate(columns):
            if key is not None:
                sheet2.write(i, col, info[key])
        # Save after every row so results so far are kept if a later row fails.
        std_check_re.save('check_result.xls')


check_excel()
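
The script above indexes wbdata['content']['map'] directly, so a reply that lacks one of those keys would raise an exception and stop the whole batch. Below is a small sketch of a more forgiving lookup. I have not confirmed what the site returns for an unknown standard, so the missing-key and null cases handled here are assumptions, and the sample payload is made up.

```python
# Keys read by the script above, in output-column order.
FIELDS = ['SN_CHN', 'SN_EN', 'SN_STATE', 'CARRY_OUT_DATE', 'ABOLISH_DATE', 'stanRel']


def extract_fields(wbdata, default=''):
    """Return the values for FIELDS, using `default` where something is missing."""
    content = wbdata.get('content') or {}
    info = content.get('map') or {}
    row = []
    for key in FIELDS:
        value = info.get(key)
        row.append(default if value is None else value)
    return row


# Hypothetical payload with only two of the six fields present.
sample = {'content': {'map': {'SN_CHN': 'Example standard', 'SN_STATE': 'current'}}}
print(extract_fields(sample))  # ['Example standard', '', 'current', '', '', '']
print(extract_fields({}))      # ['', '', '', '', '', '']
```

Each returned list lines up with columns 1 to 6 of the result sheet, so a row with gaps can still be written and the loop can move on to the next standard.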

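Source code attached.
The attachment contains the source code and the matching Excel template.
I developed it in PyCharm.

I have been lurking on 吾爱 for a few years and this is my first post, so any guidance and corrections are welcome.

If you find this helpful, please give it a free rating. Thanks.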

darkadxz posted on 2022-3-7 16:17

I modified the original poster's code a bit so that it can look up the latest version of a standard by its name:
import time

import requests
import xlrd
import xlwt

std_file = 'std_check.xls'
search_url = 'http://www.njbz365.com/njbzb/solrData/search.do'

std_check_re = xlwt.Workbook()
sheet2 = std_check_re.add_sheet('结果', cell_overwrite_ok=True)  # sheet name: "Results"

# (header, column width in characters)
columns = [
    ('序号', 5),           # row number
    ('查询标准代号', 20),   # standard number queried
    ('标准编号', 20),       # standard number returned
    ('标准中文名', 40),     # Chinese title
    ('状态', 10),           # status
    ('标准类型', 10),       # standard category
    ('实施日期', 10),       # implementation date
]


def check_excel():
    wb = xlrd.open_workbook(filename=std_file)
    print(wb.sheet_names())
    sheet1 = wb.sheet_by_index(0)
    print('共有', sheet1.nrows - 1, '条数据')  # "N records in total"

    # Header row and column widths.
    for col, (header, width) in enumerate(columns):
        sheet2.col(col).width = 256 * width
        sheet2.write(0, col, header)

    for i in range(1, sheet1.nrows):
        time.sleep(3)
        check_std = sheet1.row_values(i)[0]  # standard number, assumed to be in the first column
        # Search key: drop whitespace, the part after '-', and '/T',
        # e.g. 'GB/T 1234-2008' -> 'GB1234'.
        e = ''.join(str(check_std).split()).split('-')[0].replace('/T', '')
        print(i, 'Now is checking', check_std)
        params = {'searchString': e, 'isTilu': 'true', 'isContent': 'true'}
        wbdata = requests.post(search_url, params=params).json()
        result = wbdata['result']
        # IMPL_DATE is divided by 1000, i.e. treated as a millisecond timestamp.
        timeArray = time.localtime(result['IMPL_DATE'] / 1000)
        otherStyleTime = time.strftime("%Y-%m-%d", timeArray)
        sheet2.write(i, 0, i)
        sheet2.write(i, 1, check_std)
        sheet2.write(i, 2, result['STAN_NUM'])       # standard number
        sheet2.write(i, 3, result['STAN_CNNAME'])    # Chinese title
        sheet2.write(i, 4, result['STAN_STATUS'])    # status
        sheet2.write(i, 5, result['STAN_CATEGORY'])  # standard category
        sheet2.write(i, 6, otherStyleTime)           # implementation date


check_excel()
std_check_re.save('check_result.xls')
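
The string handling and the date conversion in the script above can be pulled out into two small helpers and tried without touching the network. This is a sketch, not the poster's exact code: the function names are mine, 'GB/T 1234-2008' is a made-up standard number, and reading IMPL_DATE as a millisecond timestamp is inferred from the division by 1000 in the script.

```python
from datetime import datetime, timezone


def search_key(std_number):
    """Reduce a standard number to a search key, e.g. 'GB/T 1234-2008' -> 'GB1234'."""
    compact = ''.join(str(std_number).split())  # drop all whitespace
    without_year = compact.split('-')[0]         # drop everything after the first '-'
    return without_year.replace('/T', '')        # drop the '/T' marker


def format_ms_date(ms, tz=None):
    """Format a millisecond timestamp as YYYY-MM-DD (local time when tz is None)."""
    return datetime.fromtimestamp(ms / 1000, tz).strftime('%Y-%m-%d')


print(search_key('GB/T 1234-2008'))                 # GB1234
print(format_ms_date(1577836800000, timezone.utc))  # 2020-01-01
```

Passing timezone.utc makes the printed date independent of the machine's time zone; leaving tz out matches the time.localtime behaviour of the script.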

shimeng0624 posted on 2020-8-14 11:05

Thanks for generously sharing.

南岸 posted on 2020-8-14 11:17

Nice work, keep it up.

littlebear945 posted on 2020-8-14 11:25

How do I use this? Can I run the query directly with the Excel file you posted?

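crazydingo posted on 2020-8-14 11:29

littlebear945 posted on 2020-8-14 11:25
How do I use this? Can I run the query directly with the Excel file you posted?

This is the source code. Just run it in PyCharm; I have not compiled it into an executable.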

Aska posted on 2020-8-14 11:29

Respect. Keep it up!

crazydingo posted on 2020-8-14 23:54

Is this just not very useful? Nobody seems to be reading it...

gzh820101 posted on 2020-8-22 16:50

Exactly what I needed, thank you very much.

北北121 posted on 2020-12-6 02:59

crazydingo posted on 2020-8-14 23:54
Is this just not very useful? Nobody seems to be reading it...

The script fails when I run it. Could someone please take a look? Thanks.
C:\Users\Administrator\PycharmProjects\untitled1\venv\Scripts\python.exe D:/00/std-check/std-check.py
Traceback (most recent call last):
  File "D:/00/std-check/std-check.py", line 2, in <module>
    import requests
ModuleNotFoundError: No module named 'requests'

Process finished with exit code 1

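szy521spy posted on 2020-12-7 14:28

Wow, this is exactly what I have been looking for. Your job sounds just like mine: I also work with testing standards, and I also use 工标网. Haha, very useful. Thank you!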


吾爱破解 - 52pojie.cn's Archiver
