Python Self-Study Log - Scraping Online Jokes - 吾爱破解 - 52pojie.cn

BoBuo posted on 2021-9-26 13:47

Python Self-Study Log -- Scraping Online Jokes

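I'm a beginner teaching myself Python. Some of the joke pages cannot be reached, and handling the failures with try: takes a very long time. I'd appreciate any pointers from the experts.

# Scrape jokes from Qiushibaike
import requests
from lxml import etree

# Set the User-Agent
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

# Number of pages to scrape
page = int(input("Enter the number of pages you want: "))

# Build the URL of each list page
url2 = []
for x in range(1, page + 1):
    url2.append("https://www.qiushibaike.com/8hr/page/" + str(x))
# print(url2)

# Read each list page and collect the links to the individual jokes
for url in url2:
    response = requests.get(url, headers=headers).text
    html = etree.HTML(response)
    result1 = html.xpath('//div//a[@class="recmd-content"]/@href')
    # print(result1)

    # Visit each joke page and print its text
    for site in result1:
        xurl = "https://www.qiushibaike.com" + site
        # print(xurl)
        response2 = requests.get(xurl).text
        html2 = etree.HTML(response2)
        result2 = html2.xpath("//div[@class='content']")
        try:
            # xpath() returns a list of elements, so take the first match
            # and join its text nodes (a list has no .text attribute)
            print("".join(result2[0].itertext()).strip())
        except Exception as e:
            print("Error: Qiubai's spaceship has run into a small problem...")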

我今天是大佬 posted on 2021-9-27 09:10

You can solve this with multithreading.
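
One way the multithreading suggestion might look is sketched below with the standard library's ThreadPoolExecutor. The helper names fetch and fetch_all, the max_workers value of 8, and the 5-second timeout are assumptions for illustration, not code from the original post.

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url):
    """Hypothetical helper: return (url, text); text is None on failure."""
    try:
        # timeout=5 is an assumed value so a dead page cannot block a worker
        return url, requests.get(url, timeout=5).text
    except requests.exceptions.RequestException:
        return url, None


def fetch_all(urls, max_workers=8):
    """Download all urls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Because the threads spend most of their time waiting on the network, slow or unreachable pages overlap instead of being waited on one after another. The joke page links collected in result1 could be passed to fetch_all as full URLs.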

844043335 posted on 2021-9-26 14:23

哲少 posted on 2021-9-26 16:18

Plus multithreading.

GiaoMan-wei posted on 2021-9-26 17:05

This output... what exactly is it scraping?

dingallen216 posted on 2021-9-26 16:33

The reply above is correct: the time isn't spent handling the exception, it's spent waiting for the connection.

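junjie0927 posted on 2021-9-26 14:31

Change
response2=requests.get(xurl).text
to
response2=requests.get(xurl,timeout = 5).text

This way, if the connection cannot be made within 5 seconds, an error is raised immediately instead of waiting. That greatly shortens the run time.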
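
Since the timeout raises an exception, the request itself needs to sit inside the try block for the script to keep going. A minimal sketch of that idea follows; the helper name fetch_text and the call to raise_for_status() are assumptions added for illustration, not part of the posted code.

```python
import requests


def fetch_text(url, headers=None, timeout=5):
    """Hypothetical helper: return the page text, or None on any failure."""
    try:
        # A single timeout value applies to both the connect and read phases
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx as failures too
        return response.text
    except requests.exceptions.RequestException as e:
        # Timeout, ConnectionError and HTTPError all derive from this class
        print("Request failed:", url, e)
        return None
```

In the loop over result1, a None return value could then be used to skip the page with continue.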

BoBuo posted on 2021-9-26 17:12

junjie0927 posted on 2021-9-26 14:31
response2=requests.get(xurl).text
to
response2=requests.get(xurl,timeou ...

Oh, thanks!

BoBuo posted on 2021-9-26 17:23

844043335 posted on 2021-9-26 14:23
Hi, 菠萝儿

Hi, hello

BoBuo posted on 2021-9-27 12:41

我今天是大佬 posted on 2021-9-27 09:10
You can solve this with multithreading.

Thanks, brother.
