Python novel scraping: how to handle second-level (paginated) chapter pages

fa00x posted on 2020-8-30 19:02

Last edited by fa00x on 2020-8-30 22:51

import requests
from pyquery import PyQuery as pq

for i in range(1128871, 1128911):
    url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + ".html"
    headers1 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; '
    }
    h = requests.get(url=url, headers=headers1)
    h.encoding = 'utf-8'
    doc = pq(h.content)
    lis1 = doc('#cont-text').text()    # chapter text
    name = doc("#content h1").text()   # chapter title
    print(name)
    # Append title and text of every chapter to one file.
    with open('./222.txt', 'a+', encoding='utf-8') as f:
        f.write(name)
        f.write('\n')
        f.write(str(lis1))
        f.write('\n')

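The problem: this only scrapes the first page of each chapter.
Some chapters are split into second-level pages; this one, for example, has 4 pages: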

https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_2.html

https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_3.html

https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_4.html

With the code structure above, how can I get these second-level URLs?

fa00x posted on 2020-8-30 19:07

Last edited by fa00x on 2020-8-30 19:36

for i in range(1128871, 1128911):
    for j in range(1, 5):
        url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + "_" + str(j) + ".html"
        print(url)

Why does the result not include the first page?

第一章黑星奇景（第2/4页）
第一章黑星奇景（第3/4页）
第一章黑星奇景（第4/4页）
第二章超级人类（第2/2页）
第三章银心魅影（第2/4页）
第三章银心魅影（第3/4页）
第三章银心魅影（第4/4页）
第四章真理之光（第2/2页）
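
One possible explanation, judging only from the URLs quoted in this thread: the first page of a chapter is `<id>.html` with no suffix, and only pages 2 onward carry `_2`, `_3` and so on, so a loop that always appends `_j` never requests the first page. Below is a minimal sketch of a URL builder under that assumption. The names `BASE` and `page_url` are made up for this illustration and do not come from the site or from any library.

```python
BASE = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"


def page_url(chapter_id, page):
    """Return the URL of one page of a chapter.

    Assumption taken from the URLs in this thread: page 1 has no suffix,
    later pages are '<id>_<page>.html'.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    suffix = "" if page == 1 else f"_{page}"
    return f"{BASE}{chapter_id}{suffix}.html"


# URLs for pages 1 to 4 of chapter 1128874.
urls = [page_url(1128874, p) for p in range(1, 5)]
```

With this helper, `range(1, 5)` covers the unsuffixed first page as well as `_2` to `_4`; chapters with fewer pages still need a check for pages that do not exist.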

Zeaf posted on 2020-8-30 19:13

Add a check: just test what https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_5.html returns, that should be enough.

fa00x posted on 2020-8-30 19:20

Last edited by fa00x on 2020-8-30 19:36

Zeaf posted on 2020-8-30 19:13
Add a check: just test what https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_5.html returns ...

The result has no first-page content; everything starts from page 2?

D.A. posted on 2020-8-30 19:40

Why not get the links directly with XPath?
For example, links = xml.xpath('//ul[@class="pagination pagination-sm"]/li/a/@href') returns the relative links 1128874_2.html, 1128874_3.html and 1128874_4.html, which you then join with the source URL.
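
The idea above can be sketched roughly as follows. To stay runnable without extra packages, this sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup; with lxml, as in the code posted later in this thread, the full expression ending in `/@href` can be passed to `.xpath()` instead. The `SAMPLE` markup is a hypothetical, simplified stand-in for the real page, and `pagination_links` is a name invented here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified pagination markup; the real page may differ.
SAMPLE = """
<div id="content">
  <ul class="pagination pagination-sm">
    <li><a href="1128874_2.html">2</a></li>
    <li><a href="1128874_3.html">3</a></li>
    <li><a href="1128874_4.html">4</a></li>
  </ul>
</div>
"""


def pagination_links(markup, page_url):
    """Return absolute URLs for every link inside the pagination list."""
    root = ET.fromstring(markup.strip())
    anchors = root.findall(".//ul[@class='pagination pagination-sm']/li/a")
    # urljoin resolves each relative href against the URL of the current page.
    return [urljoin(page_url, a.get("href")) for a in anchors if a.get("href")]


first_page = "https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874.html"
links = pagination_links(SAMPLE, first_page)
```

Under these assumptions the first page is the starting URL itself, so it has to be scraped before the extracted links are followed.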

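不想当咸鱼 posted on 2020-8-30 19:41

You can open the first page through its link, then check whether the page contains #page. There are two of them (top and bottom); take either one. It is a ul unordered list, and from it you can iterate directly from the second page onward. That is the rough idea.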
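
Since the pagination list is said to appear twice on the page, collecting links from both copies would return every href twice. A small sketch of order-preserving de-duplication follows; `unique_in_order` is a hypothetical helper name, and the `found` list is made-up example input rather than real scraped data.

```python
def unique_in_order(hrefs, skip=()):
    """Drop repeated hrefs, keeping first-seen order; optionally skip some."""
    skipped = set(skip)
    # dict.fromkeys keeps only the first occurrence of each key, in order.
    return [h for h in dict.fromkeys(hrefs) if h not in skipped]


# Example input as it might look when both pagination lists are collected.
found = [
    "1128874_2.html", "1128874_3.html", "1128874_4.html",
    "1128874_2.html", "1128874_3.html", "1128874_4.html",
]
# The first page is skipped in case the pagination also links back to it.
pages = unique_in_order(found, skip=["1128874.html"])
```

Taking only one of the two lists, as suggested above, avoids the duplicates in the first place; the helper is a fallback for selectors that match both.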

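Ldfd posted on 2020-8-30 19:47

The OP's approach is odd to begin with. Your for ... in range loop is not a good idea: it does not generalize, for example when the page numbers are not consecutive, or for other novels.
You can get the href values with XPath,
e.g. //*[@id="content"]/div/ul/li[*]/a/@href
which gives 1128872_2.html
1128872_3.html
1128872_4.html
After that it is simple.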

Ldfd posted on 2020-8-30 19:54

for i in range(2, 5):
    status = requests.get(f'https://www.daocaorenshuwu.com/book/yinhezhixin2/1128872_{i}.html').status_code
    if status != 404:
        print('OK')
This works too.
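
The status-code check can be turned into a loop that stops at the first missing page instead of using a fixed range. The sketch below is only an illustration: `count_pages` and `fake_status` are invented names, the check is `!= 200` rather than the `!= 404` used above, and it assumes the site really answers missing pages with a non-200 status. If the site returns an error page with status 200, this check would not work. The status lookup is passed in as a function so the example runs offline.

```python
def count_pages(chapter_id, get_status, max_pages=20):
    """Count a chapter's pages by probing '_2', '_3', ... until one is missing.

    get_status(url) must return an HTTP status code.
    Page 1 (the unsuffixed URL) is assumed to exist.
    """
    base = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"
    pages = 1
    for n in range(2, max_pages + 1):
        if get_status(f"{base}{chapter_id}_{n}.html") != 200:
            break  # first missing page: stop probing
        pages = n
    return pages


def fake_status(url):
    """Offline stand-in for requests.get(url).status_code."""
    existing = ("_2.html", "_3.html", "_4.html")
    return 200 if url.endswith(existing) else 404


total = count_pages(1128872, fake_status)
```

With requests, the second argument could be something like `lambda u: requests.get(u).status_code`. Each probe is a full request, so reading the pagination links from the first page costs fewer requests.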

fa00x posted on 2020-8-30 20:47

D.A. posted on 2020-8-30 19:40
Why not get the links directly with XPath?
For example, links = xml.xpath("//ul[@class="pagination pagination-sm"]/li/a/@hr ...

Still no first page.

nightcat posted on 2020-8-30 21:12

Last edited by nightcat on 2020-8-30 21:21

import requests
import lxml.etree
from fake_useragent import UserAgent
from pyquery import PyQuery as pq

ua = UserAgent(verify_ssl=False).chrome
header = {'user-agent': ua}


def start_url(url):
    # Fetch the first page of a chapter and save its title and text.
    response = requests.get(url, headers=header)
    selector = lxml.etree.HTML(response.text)
    title = pq(response.content)("#content h1").text()
    body = pq(response.content)('#cont-text').text()
    body = body.replace(r'DaoCaoRen.getCode("ui-content");', '')

    file = f'{title}.txt'
    download_file(file, body, title)

    # Follow the pagination links found on the first page and append
    # their text to the same file.
    for i in selector.xpath('//*[@id="content"]/div/ul/li/a/@href'):
        new_url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i)
        new_body = content(new_url)
        new_body = new_body.replace(r'DaoCaoRen.getCode("ui-content");', '')
        download_file(file, conttext=new_body)


def content(url):
    # Return the text of a single page.
    response = requests.get(url, headers=header)
    new_body = pq(response.content)('#cont-text').text()
    return new_body


def download_file(file, conttext, title=None):
    # Append to the chapter file; the title is written only when given.
    with open(file, 'a+', encoding='utf-8') as f:
        if title:
            f.write(f'{title}\n')
        f.write(conttext)


if __name__ == '__main__':
    for i in range(1128871, 1128911):
        url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + ".html"
        start_url(url)

吾爱破解 - 52pojie.cn's Archiver