This post was last edited by howyouxiu on 2020-7-27 20:49
Disclaimer: this article is reposted from @我是小人物. If it breaks the rules, please delete the post.
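Results (only part is shown for reasons of length; there are more than two thousand universities nationwide):
This is the original web page data:
Approach: the page source shows that the data is static and not loaded through asynchronous requests, so the page links can simply be constructed directly.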
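Because every listing page is static, the full set of page links can be generated up front. Here is a minimal sketch, assuming the start value in the URL steps by 20 per page and stops before 2820, as in the script in this post. The names `BASE_URL` and `page_urls` are illustrative and not part of the original code.

```python
# Illustrative helper (not from the original post): build every listing URL.
BASE_URL = ('https://gaokao.chsi.com.cn/sch/search--ss-on,searchType-1,'
            'option-qg,start-{}.dhtml')


def page_urls(total=2820, page_size=20):
    """Yield one listing URL per page, stepping the start value by page_size."""
    for start in range(0, total, page_size):
        yield BASE_URL.format(start)


urls = list(page_urls())
print(len(urls))   # number of pages that would be requested
print(urls[0])     # first page, start-0
print(urls[-1])    # last page
```

Keeping the URL generation separate from the download step makes it easy to check the page count before sending any request.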
Comparing the page URLs shows that the only part that needs to be constructed is the one marked with the red box (the start value), which increases by 20 from one page to the next. XPath is a convenient way to extract table-style data. Source code:
[Python]
import requests
from lxml import etree
import openpyxl

# Column headers: school name, location, supervising department,
# school type, education level, satisfaction score
title = ['院校名称', '院校所在地', '教育主管部门', '院校类型', '学历层次', '满意度']
workbook = openpyxl.Workbook()
sheet = workbook.worksheets[0]
sheet.append(title)


def writefile(school, destination, party, schooltype, floattype, score):
    # Append one spreadsheet row per school found on the current page.
    for i in range(len(school)):
        sheet.append([school[i], destination[i], party[i], schooltype[i], floattype[i], score[i]])


def replacet(who):
    # Strip spaces and newlines from every string in the list, in place.
    for i in range(len(who)):
        who[i] = who[i].replace(' ', '').replace('\n', '')
    return who


def get(url):
    # Download one listing page and write its table columns to the sheet.
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/78.0.3904.97 Safari/537.36", }
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response, etree.HTMLParser())
    school = html.xpath('//div/table/tr/td[1]/a/text()')
    destination = html.xpath('//div/table/tr/td[2]/text()')
    party = html.xpath('//div/table/tr/td[3]/text()')
    schooltype = html.xpath('//div/table/tr/td[4]/text()')
    floattype = html.xpath('//div/table/tr/td[5]/text()')
    score = html.xpath('//div/table/tr/td[9]/a/text()')
    school = replacet(school)
    destination = replacet(destination)
    party = replacet(party)
    schooltype = replacet(schooltype)
    floattype = replacet(floattype)
    score = replacet(score)
    writefile(school, destination, party, schooltype, floattype, score)


if __name__ == '__main__':
    # The start value runs from 0 to 2800 in steps of 20 (one page each).
    for p in range(0, 2820, 20):
        print('Starting offset {}'.format(p))
        try:
            get('https://gaokao.chsi.com.cn/sch/search--ss-on,searchType-1,option-qg,start-{}.dhtml'.format(p))
            print('Offset {} saved'.format(p))
        except Exception as exc:
            # Report the failure and carry on with the next page.
            print('Offset {} failed: {}'.format(p, exc))
    workbook.save('2020高考高校信息库.xlsx')
    workbook.close()
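One thing to watch in the column-by-column approach: if a cell on some row is empty (for example, a school with no satisfaction link), the column lists end up with different lengths and the rows no longer line up. A possible alternative, sketched here rather than taken from the original post, is to walk the table row by row. The sample markup is made up for illustration and the real page layout may differ. `cell_text` and `parse_rows` are hypothetical names.

```python
from lxml import etree

# Made-up sample markup for illustration; the real page may be laid out differently.
SAMPLE = """
<div><table>
  <tr><th>name</th><th>location</th><th>score</th></tr>
  <tr><td><a href="#">Alpha</a></td><td> Beijing </td><td><a href="#">4.5</a></td></tr>
  <tr><td><a href="#">Beta</a></td><td> Shanghai </td><td></td></tr>
</table></div>
"""


def cell_text(cell):
    """Join all text inside a cell and strip spaces and newlines."""
    return ''.join(cell.itertext()).replace(' ', '').replace('\n', '')


def parse_rows(markup):
    """Return one list of cleaned cell strings per data row."""
    tree = etree.HTML(markup)
    rows = []
    for tr in tree.xpath('//div/table//tr'):
        cells = tr.xpath('./td')
        if not cells:  # the header row uses <th>, so it has no <td> cells
            continue
        rows.append([cell_text(td) for td in cells])
    return rows


rows = parse_rows(SAMPLE)
print(rows)
```

Each inner list can be passed straight to `sheet.append`, and an empty cell stays an empty string in its own column instead of shifting the values after it.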
Done! Source code and Excel file download link: https://haogeshare.lanzouj.com/iixmYf0zruh