About my first scrape of a question bank
Last edited by 笔墨纸砚 on 2020-11-13 14:44.

This post walks through scraping a question bank with Python. The example code is explained in detail, so it should be a useful reference for study or work.

Basic environment:
[*]python 3.6
[*]pycharm
[*]requests
[*]csv
[*]time
The modules above can be installed with pip. Tools used: Fiddler, Postman, PyCharm.

Target page:
Clicking around the site, only this one page showed the question, the options, and the answer. A packet capture turned up this request: http://202.207.240.251/ksxt/Center/Question/getNextQuestion.html?seaKey=&professional_level_id=&knowledge_point_id=&question_type_id=&question_no=76531&numrow=1
Every time the numrow value changes by one, the question changes too. So to grab everything we need one request per question: http://202.207.240.251/ksxt/Center/Question/getNextQuestion.html?seaKey=&professional_level_id=&knowledge_point_id=&question_type_id=&question_no=76531&numrow=i, where i is driven by a loop.
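The numrow loop can also be sketched with the params argument of requests instead of concatenating the URL by hand. This is just a sketch; the parameter names and values are taken straight from the captured URL above:

```python
# Sketch: build the per-question query parameters instead of
# concatenating the URL string by hand. Parameter names come
# from the captured request above.
BASE_URL = "http://202.207.240.251/ksxt/Center/Question/getNextQuestion.html"

def build_params(i):
    """Query parameters for the i-th question (numrow=i)."""
    return {
        "seaKey": "",
        "professional_level_id": "",
        "knowledge_point_id": "",
        "question_type_id": "",
        "question_no": "76531",
        "numrow": str(i),
    }

# requests.get(BASE_URL, params=build_params(i), headers=headers)
# then produces the same URL as the capture, with numrow swapped
# out on each pass of the loop.
```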
From the page info (screenshot omitted) it turns out there are 5741 questions in total.
Requesting the page:
import requests

url = "http://202.207.240.251/ksxt/Center/Question/getNextQuestion.html?seaKey=&professional_level_id=&knowledge_point_id=&question_type_id=&question_no=76531&numrow=1"
headers = {
    'Host': '202.207.240.251',  # no leading space, or the header is sent malformed
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36 Edg/86.0.622.63',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'http://202.207.240.251/tyut_ksxt/Home/Questions/index.html',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Cookie': 'KSXTSESSID='  # paste your own session id here
}
response = requests.get(url, headers=headers)  # a GET needs no request body
print(response.text)
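Before writing any parsing code it helps to confirm the shape of the JSON the endpoint returns. The sample below is what one response body is assumed to look like, reconstructed from the keys used later in this post; verify it against your own capture before relying on it:

```python
import json

# Assumed shape of one response body, based on the keys used by the
# parsing code later in this post -- check your own capture.
sample = json.loads("""
{
  "question_stem": "sample stem",
  "professional_level": "sample level",
  "knowledge_point": "sample point",
  "question_type": "single choice",
  "question_result": [{"choice": "option A"}, {"choice": "option B"}],
  "answer": "A"
}
""")

# response.json() returns a dict like this; the options live in a list:
for option in sample["question_result"]:
    print(option["choice"])
```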
Looping over all the questions:

import requests

for i in range(5742):  # numrow runs from 0 through 5741
    url = "http://202.207.240.251/ksxt/Center/Question/getNextQuestion.html?seaKey=&professional_level_id=&knowledge_point_id=&question_type_id=&question_no=76531&numrow=" + str(i)
    # headers is the same dict as in the single request above
    response = requests.get(url, headers=headers)
    html_data = response.json()
Parsing the data (this runs inside the loop; data_list is the question dict taken from html_data, so check your own capture for the exact key):

data_list = html_data  # or the relevant sub-dict of html_data

dit = {}
dit['题目'] = data_list['question_stem']
dit['专业类别'] = data_list['professional_level']
dit['试题类别'] = data_list['knowledge_point']
dit['选题类型'] = data_list['question_type']
# question_result is a list of option dicts; map each one onto a letter.
# (The original code dropped the list index here, so every letter got the
# same value -- enumerate fixes that and stops after F.)
for x, option in enumerate(data_list['question_result'][:6]):
    dit['ABCDEF'[x]] = option['choice']
dit['答案'] = data_list['answer']
Saving the data (also inside the loop; mode='a' appends one row per question):

import csv

f = open('实验室.csv', mode='a', encoding='utf-8-sig', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['题目', '专业类别', '试题类别', '选题类型', 'A', 'B', 'C', 'D', 'E', 'F', '答案'])
csv_writer.writerow(dit)
f.close()
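One detail worth checking: DictWriter.writerow writes only data rows, so the CSV above never gets a header line, and mode='a' keeps appending duplicates on reruns. A sketch of one way to handle both, writing the header once before the loop and pacing requests with the time module listed in the environment (the loop body is a placeholder, not the real fetch):

```python
import csv
import time

fieldnames = ['题目', '专业类别', '试题类别', '选题类型',
              'A', 'B', 'C', 'D', 'E', 'F', '答案']

with open('实验室.csv', mode='w', encoding='utf-8-sig', newline='') as f:
    csv_writer = csv.DictWriter(f, fieldnames=fieldnames)
    csv_writer.writeheader()           # header row written exactly once
    for i in range(3):                 # range(5742) in the real run
        # ...fetch and parse question i into dit here (placeholder below)...
        dit = {'题目': f'question {i}', '答案': 'A'}
        csv_writer.writerow(dit)       # missing fields are left blank
        time.sleep(0.1)                # small delay to be polite to the server
```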
PS: Let me also say that the tutorials online are not very complete, and I ran into a lot of problems while figuring this out on my own, so I'm recording the hardships of my Python learning here. I'm just a beginner, so please don't flame me, forum experts. Thanks again! {:301_1008:}

QingYi. posted on 2020-11-12 20:50:
You're seriously good; I suppose that's how the Xuexitong question bank got scraped back in the day.

emmm, it's just a crawler, but the forum formatting is so hard to get right.

笔墨纸砚 posted on 2020-11-12 20:51:
emmm, it's just a crawler, but the forum formatting is so hard to get right.

Lay it out in MarkdownPad first, then copy it in to post.

Crawlers really do have a wide range of uses.
Looks impressive; I can't read the code though.
I need this kind of thing often, thanks.
Learned a lot; I'm just starting out but this helps, thanks.
I'm learning Python right now, and it's really fun.
Seriously impressive.
Really nice work.