Python: how do I accumulate strings on each loop iteration?

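double07 posted on 2021-4-3 12:50

Last edited by double07 on 2021-4-3 12:51

Every loop iteration overwrites the previous content. How should I adjust the code so that everything from the loop is stored in list['工作职责'] (job responsibilities), alongside the earlier '工作地点' (work location), '月薪' (monthly salary), and '职位' (job title) entries in the same dictionary?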

kaideng posted on 2021-4-3 13:26

Try .append()?

yingl7 posted on 2021-4-3 14:08

How about trying list['工作职责'] += i?
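
A minimal sketch of how the two suggestions differ from a plain assignment, using a hypothetical list of text fragments in place of real XPath results. Assigning inside the loop keeps only the last fragment, while appending to a list or concatenating with += keeps all of them. The names fragments and record are made up for illustration.

```python
# Hypothetical stand-in for the text nodes returned by an XPath query.
fragments = ["  1. Design APIs ", "  2. Write tests ", "  3. Review code "]

# Plain assignment: each iteration replaces the previous value.
record = {}
for text in fragments:
    record["工作职责"] = text.strip()
overwritten = record["工作职责"]

# .append(): start from an empty list and collect every fragment.
record = {"工作职责": []}
for text in fragments:
    record["工作职责"].append(text.strip())
as_list = record["工作职责"]

# +=: start from an empty string and concatenate, one fragment per line.
record = {"工作职责": ""}
for text in fragments:
    record["工作职责"] += text.strip() + "\n"
as_text = record["工作职责"]
```

Both fixes need the key to be initialised before the loop; calling .append() or using += on a key that does not exist yet raises KeyError.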

double07 posted on 2021-4-3 14:21

Here is the source code. How should it be modified?
# Import modules
import requests
import pandas
import time
from lxml import etree

p=0
data_list=[]
dqs=""
pubtime="3"
salary="30"
industries="200"
position=''
curPage=1
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}

# Fetch the page content
def gethtml(url):
    response=requests.get(url,headers=headers)
    r_response=response.content.decode().replace("https://www.liepin.com", "")
    return r_response

# Parse the page data
def parse_url(r):
    html=etree.HTML(r)
    b=html.xpath('//ul[@class="sojob-list"]/li')
    for i in b:
        list={}
        # xpath() returns a list, so take the first match before calling strip()
        list['职位'] = i.xpath('./div/div/h3/a/text()')[0].strip()          # job title
        list['招聘企业'] = i.xpath('./div/div/p/a/text()')[0].strip()       # hiring company
        list['工作地点'] = i.xpath('.//*[@class="area"]/text()')[0].strip() # work location
        list['月薪']= i.xpath('./div/div/p/span/text()')[0].strip()         # monthly salary
        list['发布时间'] = i.xpath('./div/div/p/time/text()')[0].strip()    # posting time
        href_list="https://www.liepin.com"+i.xpath("./div/div/h3/a/@href")[0].strip()
        # Fetch and parse the detail page of this posting
        href_r=requests.get(href_list,headers=headers)
        href_text=href_r.content.decode()
        href_Parse=etree.HTML(href_text)
        job_list=href_Parse.xpath('//div/div/text()')
        for j in job_list:
            # This assignment replaces the value on every iteration (the problem being asked about)
            list['工作职责']=j.strip()   # job responsibilities
        data_list.append(list)
    return data_list

# Pagination
def next_page():
    url_np='https://www.liepin.com/zhaopin/?compkind=&dqs{}=&pubTime={}&pageSize=40&salary={}%24&compTag=&sortFlag=15&degradeFlag=0&compIds=&subIndustry=&jobKind=&industries={}&compscale=&key={}&curPage={}'
    # Assumption: the right-hand side is missing from the archived post;
    # a single-page list built from the settings above is used as a placeholder.
    url_list=[url_np.format(dqs,pubtime,salary,industries,position,curPage)]
    return url_list

# Main program
def run_liep():
    page = next_page()
    time.sleep(1)
    p = 0
    for i in page:
        p+=1
        print('Fetching page {}'.format(p))
        gh=gethtml(i)
        gp=parse_url(gh)
        gp = pandas.DataFrame(gp)
        gp.to_excel('./liepin.xlsx', index=False)
    return gp

if __name__ == '__main__':
    print(run_liep())
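
The right-hand side of url_list did not survive in the archived post. As a hedged sketch of what such a step could look like, the snippet below builds one URL per page from a template with str.format. The shortened template, the function name build_page_urls, and the page range are all assumptions, not the poster's code.

```python
# Assumption: a shortened, illustrative template; the real query string is longer.
URL_TEMPLATE = "https://www.liepin.com/zhaopin/?key={}&curPage={}"

def build_page_urls(keyword, first_page, last_page):
    """Return one search URL for each page number in [first_page, last_page]."""
    return [URL_TEMPLATE.format(keyword, page)
            for page in range(first_page, last_page + 1)]

urls = build_page_urls("python", 0, 2)
```

Whether the site counts pages from 0 or 1 is not stated in the thread, so the range passed in would need to be checked against the real site.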

ligxi posted on 2021-4-3 15:54

Don't use the built-in name list as a variable name; shadowing it can cause unexpected problems.
# Import modules
import requests
import pandas
import time
from lxml import etree

p = 0
data_list = []
dqs = ""
pubtime = "3"
salary = "30"
industries = "200"
position = ''
curPage = 1
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
}

# Fetch the page content
def gethtml(url):
    response = requests.get(url, headers=headers)
    r_response = response.content.decode().replace("https://www.liepin.com", "")
    return r_response

# Parse the page data
def parse_url(r):
    html = etree.HTML(r)
    b = html.xpath('//ul[@class="sojob-list"]/li')
    for i in b:
        lst = {}
        # xpath() returns a list, so take the first match before calling strip()
        lst['职位'] = i.xpath('./div/div/h3/a/text()')[0].strip()          # job title
        lst['招聘企业'] = i.xpath('./div/div/p/a/text()')[0].strip()       # hiring company
        lst['工作地点'] = i.xpath('.//*[@class="area"]/text()')[0].strip() # work location
        lst['月薪'] = i.xpath('./div/div/p/span/text()')[0].strip()        # monthly salary
        lst['发布时间'] = i.xpath('./div/div/p/time/text()')[0].strip()    # posting time
        href_list = "https://www.liepin.com" + i.xpath("./div/div/h3/a/@href")[0].strip()
        # Fetch and parse the detail page of this posting
        href_r = requests.get(href_list, headers=headers)
        href_text = href_r.content.decode()
        href_Parse = etree.HTML(href_text)
        # Drop text nodes that are empty or whitespace only
        job_list = filter(lambda x: x.strip() != '', href_Parse.xpath('//div/div/text()'))
        # Assumption: the right-hand side is missing from the archived post;
        # a list of the stripped fragments matches the list output discussed below.
        lst['工作职责'] = [j.strip() for j in job_list]   # job responsibilities
        # for j in job_list:
        # lst['工作职责'].append(j.strip())
        # print(lst)
        data_list.append(lst)
    print(data_list)
    return data_list

# Pagination
def next_page():
    url_np = 'https://www.liepin.com/zhaopin/?compkind=&dqs{}=&pubTime={}&pageSize=40&salary={}%24&compTag=&sortFlag=15&degradeFlag=0&compIds=&subIndustry=&jobKind=&industries={}&compscale=&key={}&curPage={}'
    # Assumption: the right-hand side is missing from the archived post;
    # a single-page list built from the settings above is used as a placeholder.
    url_list = [url_np.format(dqs, pubtime, salary, industries, position, curPage)]
    return url_list

# Main program
def run_liep():
    page = next_page()
    time.sleep(1)
    p = 0
    for i in page:
        p += 1
        print('Fetching page {}'.format(p))
        gh = gethtml(i)
        gp = parse_url(gh)
        gp = pandas.DataFrame(gp)
        gp.to_excel('./liepin.xlsx', index=False)
    return gp

if __name__ == '__main__':
    print(run_liep())
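
As a small illustration of the naming advice at the top of this reply: list is a built-in name rather than a reserved keyword, so Python allows rebinding it, but the built-in constructor is then hidden in that scope. The function names below are hypothetical.

```python
def shadowed():
    list = {}                # rebinds the name "list" inside this function
    try:
        return list("abc")   # now tries to call a dict, not the built-in
    except TypeError as exc:
        return type(exc).__name__

def not_shadowed():
    lst = {}                 # a different name leaves the built-in usable
    return list("abc")

result_shadowed = shadowed()
result_ok = not_shadowed()
```

In the scraper the shadowing stays local to parse_url, so it may never cause a visible failure, but any later call to list(...) in that function would break in the same way.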

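double07 posted on 2021-4-3 16:34

ligxi posted on 2021-4-3 15:54
Don't use the built-in name list as a variable name; shadowing it can cause unexpected problems.
# Import modules
import reques ...

Going by the code, I have three questions:
1. Lines 39-43 of the code are the newly added detail-page scraping. With these lines added, scraping one page went from about 2 seconds to about 2 minutes. Have you seen the same thing?
2. The data this code produces is a list (see the screenshot below). Can the list brackets be removed so that the responsibilities are shown line by line as 1, 2, 3, 4, and so on?
3. Why does the list come back empty for some postings? (see the screenshot below)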

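ligxi posted on 2021-4-3 17:09

double07 posted on 2021-4-3 16:34
Going by the code, I have three questions:
1. Lines 39-43 of the code are the newly added detail-page scraping. With these lines added, scraping one page ...

1. No. Unless extra time.sleep calls are adding delays between requests, it should not be that slow; the time is simply spent waiting for the pages to come back.
2. Post-process the list yourself and read the items out one by one. Or switch to some other nested data type and handle it as you see fit.
3. An empty list means either that the XPath expression does not match (it seems that not every detail page has the same structure), or that you are scraping too fast and your IP has been blocked, so the site returns empty content.

Addendum: don't scrape too fast. I tested it myself and the IP does get blocked. The time.sleep in the code might as well not be there.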
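
For point 3, one defensive pattern is to treat an empty match list as missing data instead of indexing it blindly. The helper below is a hypothetical sketch; it accepts any list of strings, such as the result of an lxml xpath() call that selects text() nodes, and returns a default when nothing usable matched.

```python
def first_text(matches, default=""):
    """Return the first non-blank string in matches, stripped, or default."""
    for text in matches:
        cleaned = text.strip()
        if cleaned:
            return cleaned
    return default

# Plain lists stand in for XPath results here.
found = first_text(["\n  ", "  Shanghai "])
missing = first_text([], default="N/A")
```

Recording a visible default such as "N/A" makes it easier to spot which postings need a different XPath expression or were served an empty page.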

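double07 posted on 2021-4-3 17:35

Last edited by double07 on 2021-4-3 17:44

ligxi posted on 2021-4-3 17:09
1. No. Unless extra time.sleep calls are adding delays between requests, it should not be that slow; the time is simply spent waiting for the pages to come back.
2. Post-process the list ...
Where should I add time.sleep to fix the problem of scraping too fast?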
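
One possible answer, sketched under assumptions: put the pause inside the loop that issues the requests, so that it runs once per request instead of once per program run. The function fetch_all and its fetch and delay parameters are hypothetical; fetch stands for any callable that downloads one URL, such as gethtml above.

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results

# Demo with a fake fetch function so that nothing is downloaded.
pages = fetch_all(["a", "b", "c"], fetch=str.upper, delay=0.01)
```

In the script above, the same idea would apply to the detail-page request inside the for loop of parse_url, since that is where most of the requests are made. A suitable delay depends on the site and is not given in the thread.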

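double07 posted on 2021-4-3 17:59

For the screenshot above, I can't think of a good way to remove the list brackets [] and turn the items into standalone strings. Any hints?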
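
A hedged sketch of one way to do this: str.join turns a list of strings into a single string, and enumerate can number the items. The function name and the sample data are hypothetical.

```python
def to_numbered_text(items):
    """Join the non-empty items into one string, one numbered item per line."""
    cleaned = [item.strip() for item in items if item.strip()]
    return "\n".join(f"{n}. {text}" for n, text in enumerate(cleaned, start=1))

duties = ["Design APIs", "  Write tests ", ""]
text = to_numbered_text(duties)
```

Storing the joined string in lst['工作职责'] instead of the list should give a plain text cell in the DataFrame; whether the line breaks are displayed in Excel depends on the cell's wrap setting.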


吾爱破解 - 52pojie.cn's Archiver
