ScienceNet fund-project crawler: can anyone advise? The url and cookies change on every page
I can already scrape a single page, but with 3,000-odd funded projects I can't go and copy the url and cookies by hand for every page, and the crux is that it's not only page that changes in the url. I've put my code for page 2 below; I'd be grateful if someone could build on it. (I've only just started Python, so the code isn't great. I wrote it in JupyterLab, and I'm trying to scrape funded projects in the graphene (石墨烯) field.)
import pandas as pd
import json
import re
import requests
from bs4 import BeautifulSoup
# an earlier signed url for the same page 2: note that expires and signature already differ from the url below
# url = 'http://fund.sciencenet.cn/search/researchField?expires=1608433738&keyWord%5B0%5D=%E7%9F%B3%E5%A2%A8%E7%83%AF&page=2&submit=list&signature=1d0c1eeb26addb779421ed659081351e508702375836415309737ef51c6624df'
url = 'http://fund.sciencenet.cn/search/researchField?expires=1608434986&keyWord%5B0%5D=%E7%9F%B3%E5%A2%A8%E7%83%AF&page=2&submit=list&signature=48e969313ac8d84bf7e3397e755e81e5e6ee0e9aa0c632dcca2a3d01be5281ac'
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0'}
# session cookies copied from the browser dev tools; they go stale, which is part of the problem
cookies = {
    'Hm_lpvt_25531f8e8fec2c71a8a263d5514be1dd': '1608434571',  # the post had 'Hm_Ipvut_...', almost certainly a mistyped 'Hm_lpvt_...'
    'Hm_lvt_25531f8e8fec2c71a8a263d5514be1dd': '1608304704,1608304715,1608339657,1608433111',
    '_session': 'eyJpdiI6ImVFRE5wSlN5QlZrTFB1Y242Z3pKWWc9PSIsInZhbHVlIjoiRG12SEhcL1RxZ3JKQllSd1hBVkdGS1wvaStKZFVudHBJeUNFSlVJa2N0UGsrMkdnOG9CU09PbUE4a0VVMzRoMXdKIiwibWFjIjoiYWVmZGM0NjZhMWRhYzdkZmY3YWQ1MjRlZWE2NjYwNTRmMjU3MjMwZmNiMTk3NTRkN2VlNWUwODAzMzBkZmE5MyJ9',
    'acw_tc': '65c86a0a16084331106646731e0c269f12794977e998bbf8e4c8ec01a97040',
    'remember_web_59ba36addc2b2f9401580f014c7f58ea4e30989d': 'eyJpdiI6InErdWhvc3dRN0h2NWNUbE1ZYm1tOHc9PSIsInZhbHVlIjoiajBuY0R3TndrVW1ORXBEME93WG04ZUUwdEpFb3cwQ29JdURudlJwUHoxTGh4b09rWTllS1ZUOXR0blBTMEtiZHU5QmY4WlFWOWJqcDVFdGNGTjU2QjdaMWJWeVNMb0hMd21cL2tRU3BIUE1JTjhkdGI2dHNoTWRMdm9kQTBXeTVpVHpUc3VNOXgwRVltNGlTWnkyeVArRForVGhkMjZzSUpWNXZmcXFpV3VOZz0iLCJtYWMiOiI0ZGJhMjM1MWVlMjM5MjljNjA1ZDM5Y2Q0OGVkMjlmM2VlNGU5YTNlODU0MTljNThmNWNiOTFiZDhkNzVlYWI0In0%3D'
}
r = requests.get(url, headers=header, cookies=cookies)
r.encoding = 'utf-8'
print(r.status_code)   # expect 200 while the signed url and cookies are still fresh
html = r.text
soup = BeautifulSoup(html, 'lxml')
# the original did t = str(soup.prettify): that stores the bound method, whose repr
# happens to embed the html, which is why the regexes below still matched.
# str(soup) yields the same html without the accident.
t = str(soup)
# scraping function: pull each field off the current page with regexes and save it
def contend():
    # open an existing csv that already holds one row per project on this page
    df = pd.read_csv(r"C:\Users\Administrator\Desktop\新建文件夹\pa2.csv", encoding='gbk')
    # title: the anchor text sits on its own line in the page source
    reg0 = 'target="_blank">\n(.*?)\n'
    df['标题'] = pd.Series(re.findall(reg0, t)).str.strip()  # strip surrounding whitespace
    # amount
    reg = '<span>金额:<b>(.*?)</b>'
    df['金额'] = pd.DataFrame(re.findall(reg, t))
    # principal investigator
    reg1 = '<span class="author">负责人:<i>(.*?)</i>'
    df['负责人'] = pd.DataFrame(re.findall(reg1, t))
    # applicant institution
    reg2 = '<span>申请单位:<i>(.*?)</i>'
    df['申请单位'] = pd.DataFrame(re.findall(reg2, t))
    # research type
    reg3 = '研究类型:<i>(.*?)</i>'
    df['研究类型'] = pd.DataFrame(re.findall(reg3, t))
    # grant number
    reg4 = '项目批准号:<b>(.*?)</b>'
    df['项目批准号'] = pd.DataFrame(re.findall(reg4, t))
    # approval year
    reg5 = '<span>批准年度:<b>(.*?)</b>'
    df['批准年度'] = pd.DataFrame(re.findall(reg5, t))
    # keywords: DOTALL lets the pattern span line breaks
    reg6 = '<span>关键.*?<i>\n (.*?)\n </'
    df['关键词'] = pd.DataFrame(re.compile(reg6, re.DOTALL).findall(t))
    # persist the scraped page
    df.to_csv(r"C:\Users\Administrator\Desktop\新建文件夹\第2页.csv")
    return df  # hand the assembled table back to the caller
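Usage is then a single call per fetched page. One caveat carried over from the original code: pa2.csv must already exist with one row per project on the page, since pandas aligns each new column on the existing index (missing matches become NaN):

df2 = contend()   # writes 第2页.csv and returns the assembled table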
坏人。丶 replied on 2020-12-20 12:45:
...If you want the experts to step in, at least post your code properly. The rich-text editor has a code-insert option.

zhb1996 replied:
OK, thanks, fixed it. I didn't even know you could insert code here.
坏人。丶 replied on 2020-12-20 12:46:
If everything apart from page stays the same, just build each url in a for loop, concatenating 'page=' + i.
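A minimal sketch of that suggestion, reusing requests, header and cookies from the post above. It assumes page is the only moving part, which (as the next reply points out) does not actually hold for this site:

for i in range(1, 11):   # first 10 pages, purely as a demo
    url = ('http://fund.sciencenet.cn/search/researchField'
           '?keyWord%5B0%5D=%E7%9F%B3%E5%A2%A8%E7%83%AF&submit=list&page=' + str(i))
    r = requests.get(url, headers=header, cookies=cookies)
    # ... parse r.text exactly as the code above does ...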
zhb1996 replied on 2020-12-20 13:06:
That's the thing: it's not only page that changes...
Reply (quoting zhb1996's 13:06 post):
Take a look on GitHub at how other people have scraped this site: https://github.com/Huster-SC/Sciencenet--/blob/master/main.py
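The crux is that every page url carries fresh expires and signature parameters, so hand-built urls go stale. A hedged sketch of one common pattern for such sites (an assumption here, not verified against sciencenet): each results page embeds already-signed pagination links, so the crawler follows the next-page anchor instead of constructing urls by hand. The 下一页 anchor text is likewise an assumption:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0'}
cookies = {}        # fill in fresh session cookies from the browser, as in the original post
url = 'http://fund.sciencenet.cn/search/researchField?...&page=1&...'   # a freshly signed page-1 url copied from the browser

for _ in range(10):                                    # cap at 10 pages for the demo
    r = requests.get(url, headers=header, cookies=cookies)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'lxml')
    # ... pull 标题 / 金额 / 负责人 etc. out of soup here, as contend() does ...
    nxt = soup.find('a', string='下一页')               # assumed label of the next-page anchor
    if not (nxt and nxt.get('href')):
        break
    url = urljoin(url, nxt['href'])                    # the href should carry fresh expires/signature values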