[Solved] Python: problem scraping a novel's second-level (paginated) chapter pages

fa00x · posted on 2020-8-30 19:02

Last edited by fa00x on 2020-8-30 22:51

[Python]

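import requests
from pyquery import PyQuery as pq

# The request headers are the same for every page, so build them once.
headers1 = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; '
}

for i in range(1128871, 1128911):
    # Only the first page of each chapter is requested here.
    url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + ".html"
    h = requests.get(url=url, headers=headers1)
    # h.content is raw bytes, so setting h.encoding (which only affects h.text) is not needed.
    doc = pq(h.content)
    lis1 = doc('#cont-text').text()
    name = doc("#content h1").text()
    print(name)
    # Append the chapter title and text to a single output file.
    with open('./222.txt', 'a+', encoding='utf-8') as f:
        f.write(name)
        f.write('\n')
        f.write(str(lis1))
        f.write('\n')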

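The problem: this only scrapes the first page of each chapter.
Some chapters are split into further pages at a second level; this one has 4 pages.
(Attached screenshot: 火狐截图_2020-08-30T10-59-07.285Z.png)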

https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_2.html

https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_3.html

https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_4.html

Given the structure of my code above, how can I get these second-level page URLs?

fa00x · posted on 2020-8-30 19:07

Last edited by fa00x on 2020-8-30 19:36

[Python]

for i in range(1128871, 1128911):
    for j in range(1, 5):
        url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + "_" + str(j) + ".html"
        print(url)

The output doesn't include the first page? (In the titles below, 第2/4页 means page 2 of 4.)

第一章黑星奇景（第2/4页）
第一章黑星奇景（第3/4页）
第一章黑星奇景（第4/4页）
第二章超级人类（第2/2页）
第三章银心魅影（第2/4页）
第三章银心魅影（第3/4页）
第三章银心魅影（第4/4页）
第四章真理之光（第2/2页）
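
A likely explanation, judging only from the URLs quoted earlier in this thread: the first page of a chapter lives at `<id>.html` with no suffix, while pages 2 and up use the `_<n>` suffix, so a loop that always appends `_<j>` never requests page 1. The sketch below builds both kinds of URL; the function name `page_url` is made up for illustration, and the URL rule is an assumption inferred from the thread, not something the site documents.

```python
BASE_URL = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"


def page_url(chapter_id, page):
    """Build the URL of one page of a chapter.

    Assumption (inferred from the URLs in this thread): page 1 has no
    suffix, later pages use "_<page>".
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    suffix = "" if page == 1 else f"_{page}"
    return f"{BASE_URL}{chapter_id}{suffix}.html"


if __name__ == "__main__":
    # Print the four page URLs of chapter 1128874.
    for page in range(1, 5):
        print(page_url(1128874, page))
```

With a helper like this, the inner loop can start at page 1 without producing a `_1.html` address.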

Zeaf · posted on 2020-8-30 19:13

Add a check: just test what https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_5.html returns, and that should do it, right?

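fa00x · posted on 2020-8-30 19:20

Last edited by fa00x on 2020-8-30 19:36

Zeaf posted on 2020-8-30 19:13
Add a check: just test what https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_5.html returns, and that should ...

The results don't contain the first page's content; everything starts from page 2?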

D.A. · posted on 2020-8-30 19:40

Why not get the links directly with XPath?
For example, links = xml.xpath('//ul[@class="pagination pagination-sm"]/li/a/@href') returns the relative links 1128874_2.html, 1128874_3.html and 1128874_4.html, which you then join with the base URL.
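
A minimal sketch of this idea, run against a hand-written HTML fragment rather than the live site. The markup is an assumption that only mirrors the class name used in the XPath above, so the real page may differ; `pagination_hrefs` is an illustrative name. It needs the third-party lxml package.

```python
import lxml.etree

# Hypothetical fragment imitating the pagination list; not copied from the site.
SAMPLE_HTML = """
<div id="content">
  <ul class="pagination pagination-sm">
    <li><a href="1128874_2.html">2</a></li>
    <li><a href="1128874_3.html">3</a></li>
    <li><a href="1128874_4.html">4</a></li>
  </ul>
</div>
"""


def pagination_hrefs(html):
    """Return the href of every link inside the pagination list."""
    tree = lxml.etree.HTML(html)
    # Single quotes on the outside keep the double quotes in the predicate intact.
    hrefs = tree.xpath('//ul[@class="pagination pagination-sm"]/li/a/@href')
    return [str(href) for href in hrefs]


if __name__ == "__main__":
    print(pagination_hrefs(SAMPLE_HTML))
```

Note that the XPath string is wrapped in single quotes; nesting double quotes inside a double-quoted Python string would be a syntax error.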

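不想当咸鱼 · posted on 2020-8-30 19:41

You can first open the first page through its link, then check whether that page contains #page. There are two of them (top and bottom); take either one. It is a ul (unordered list). Then iterate starting from the second page. That is the rough idea.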

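Ldfd · posted on 2020-8-30 19:47

The OP's approach is odd to begin with. Your for ... in range loop is not a good idea: it does not generalize, for example when the page IDs are not sequential, or for other novels.
You can get the hrefs with XPath,
e.g. //*[@id="content"]/div[3]/ul/li[*]/a/@href
which yields 1128872_2.html
1128872_3.html
1128872_4.html
After that it is simple.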
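
What remains is turning those relative hrefs into full URLs. Here is a small sketch using only the standard library. `chapter_page_urls` is an illustrative name, and the de-duplication step is a precaution in case the same link is matched more than once (another reply mentions that the page carries the pagination twice, at the top and at the bottom).

```python
from urllib.parse import urljoin


def chapter_page_urls(first_page_url, hrefs):
    """Return the first page plus every linked page, in order, without repeats."""
    urls = [first_page_url]
    # urljoin resolves each relative href against the first page's address.
    urls += [urljoin(first_page_url, href) for href in hrefs]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    first = "https://www.daocaorenshuwu.com/book/yinhezhixin2/1128872.html"
    hrefs = ["1128872_2.html", "1128872_3.html", "1128872_4.html", "1128872_2.html"]
    for url in chapter_page_urls(first, hrefs):
        print(url)
```

Putting the first page at the head of the list means the scraper handles every page of a chapter in one loop.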

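Ldfd · posted on 2020-8-30 19:54

import requests

for i in range(2, 5):
    # Request each candidate page and look only at the HTTP status code.
    status = requests.get(f'https://www.daocaorenshuwu.com/book/yinhezhixin2/1128872_{i}.html').status_code
    if status != 404:
        print('OK')
This works too.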
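
The status-code check above can be turned into a loop that stops at the first missing page instead of hard-coding the range. A rough sketch follows. `extra_page_urls`, `http_status` and the `get_status` parameter are illustrative names, and the code assumes that the site answers a nonexistent page with a non-200 status, which this thread does not confirm.

```python
import requests

BASE_URL = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"


def http_status(url):
    """Return the HTTP status code for url (performs a real request)."""
    return requests.get(url, timeout=10).status_code


def extra_page_urls(chapter_id, get_status=http_status, max_pages=50):
    """Collect the URLs of pages 2, 3, ... until one does not answer 200.

    Assumption: a missing page yields a non-200 status (for example 404).
    max_pages is a safety cap so the loop always terminates.
    """
    urls = []
    for page in range(2, max_pages + 1):
        url = f"{BASE_URL}{chapter_id}_{page}.html"
        if get_status(url) != 200:
            break
        urls.append(url)
    return urls
```

Passing the status lookup in as a parameter makes it possible to try the loop with a stand-in function before pointing it at the real site. Each probe costs one extra request per chapter compared with reading the pagination links from the first page.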

fa00x · posted on 2020-8-30 20:47

D.A. posted on 2020-8-30 19:40
Why not get the links directly with XPath?
For example, links = xml.xpath('//ul[@class="pagination pagination-sm"]/li/a/@hr ...

Still no first page.

nightcat · posted on 2020-8-30 21:12

Last edited by nightcat on 2020-8-30 21:21

[Python]

import requests
import lxml.etree
from fake_useragent import UserAgent
from pyquery import PyQuery as pq

BASE_URL = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"
# Script residue that ends up in the extracted text and has to be stripped.
JUNK = 'DaoCaoRen.getCode("ui-content");'

ua = UserAgent(verify_ssl=False).chrome
header = {'user-agent': ua}


def start_url(url):
    """Save the first page of a chapter, then follow its pagination links."""
    response = requests.get(url, headers=header)
    selector = lxml.etree.HTML(response.text)
    doc = pq(response.content)
    title = doc("#content h1").text()
    body = doc('#cont-text').text().replace(JUNK, '')

    # One file per chapter, named after the title of the first page.
    file = f'{title}.txt'
    download_file(file, body, title)

    # The pagination list holds relative links such as 1128874_2.html.
    for href in selector.xpath('//*[@id="content"]/div[3]/ul/li/a/@href'):
        new_body = content(BASE_URL + str(href))
        download_file(file, new_body)


def content(url):
    """Return the cleaned text of one follow-up page."""
    response = requests.get(url, headers=header)
    return pq(response.content)('#cont-text').text().replace(JUNK, '')


def download_file(file, conttext, title=None):
    """Append the text (and the title, if given) to the chapter file."""
    with open(file, 'a+', encoding='utf-8') as f:
        if title:
            f.write(f'{title}\n')
        # End each page with a newline so consecutive pages do not run together.
        f.write(f'{conttext}\n')


if __name__ == '__main__':
    for i in range(1128871, 1128911):
        start_url(f"{BASE_URL}{i}.html")
