Python: scraping a novel's second-level (sub-page) chapter URLs
Last edited by fa00x on 2020-8-30 22:51

import requests
from pyquery import PyQuery as pq

for i in range(1128871, 1128911):
    url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + ".html"
    headers1 = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; '
    }
    h = requests.get(url=url, headers=headers1)
    h.encoding = 'utf-8'
    doc = pq(h.content)
    lis1 = doc('#cont-text').text()    # chapter body text
    name = doc("#content h1").text()   # chapter title
    print(name)
    with open('./222.txt', 'a+', encoding='utf-8') as f:
        f.write(name)
        f.write('\n')
        f.write(str(lis1))
        f.write('\n')
Problem: this only grabs the first page of each chapter.
Some chapters have a second-level directory with up to 4 sub-pages, e.g.:
https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_2.html
https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_3.html
https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_4.html
Given URLs built like this, how do I get these second-level URLs within the code structure above?
Last edited by fa00x on 2020-8-30 19:36

for i in range(1128871, 1128911):
    for j in range(1, 5):
        url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + "_" + str(j) + ".html"
        print(url)

The results are missing the first page?
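The reason is visible in the original code: on this site the first page of a chapter is {id}.html with no suffix, and only pages 2 and up use the {id}_{j}.html form, so the inner loop's _1 URL never matches page 1. A small sketch of building the full URL list for one chapter (the function name chapter_page_urls is made up for illustration):

```python
def chapter_page_urls(chapter_id, pages):
    """Page 1 of a chapter has no _n suffix; sub-pages start at _2."""
    base = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"
    urls = [f"{base}{chapter_id}.html"]            # first page, no suffix
    urls += [f"{base}{chapter_id}_{j}.html"        # pages 2..n
             for j in range(2, pages + 1)]
    return urls

for u in chapter_page_urls(1128874, 4):
    print(u)
```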
Chapter 1 黑星奇景 (page 2/4)
Chapter 1 黑星奇景 (page 3/4)
Chapter 1 黑星奇景 (page 4/4)
Chapter 2 超级人类 (page 2/2)
Chapter 3 银心魅影 (page 2/4)
Chapter 3 银心魅影 (page 3/4)
Chapter 3 银心魅影 (page 4/4)
Chapter 4 真理之光 (page 2/2)
How about adding a check? Just test what https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_5.html returns.
Zeaf posted on 2020-8-30 19:13:
How about adding a check? Just test what https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874_5.html returns ...

The results still have no first page; every chapter's content starts from page 2?
Get the links directly via XPath?
For example links = xml.xpath('//ul[@class="pagination pagination-sm"]/li/a/@href') (note the quoting: single quotes around the expression so the double quotes inside survive) will pick up the relative links 1128874_2.html, 1128874_3.html, 1128874_4.html, which you then join onto the base URL. You can first request the chapter's first page and check whether it contains #page. There are two of those (one at the top and one at the bottom of the page); grab either one, it is a plain ul unordered list, then iterate from the second page onward. That's the rough idea. The OP's approach is strange to begin with, and the for ... in range loop isn't great (it has no generality: the page numbers could be out of order, or for other novels ...
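D.A.'s XPath idea, sketched against a toy HTML fragment (the pagination class name is taken from the reply above; the live page's markup may differ):

```python
import lxml.etree
from urllib.parse import urljoin

# Toy fragment mimicking the pagination block described above.
sample = '''<html><body><div id="content">
<ul class="pagination pagination-sm">
  <li><a href="1128874_2.html">2</a></li>
  <li><a href="1128874_3.html">3</a></li>
  <li><a href="1128874_4.html">4</a></li>
</ul>
</div></body></html>'''

tree = lxml.etree.HTML(sample)
hrefs = tree.xpath('//ul[@class="pagination pagination-sm"]/li/a/@href')
# Relative hrefs are resolved against the chapter's first-page URL.
page_url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/1128874.html"
links = [urljoin(page_url, h) for h in hrefs]
print(links)
```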
You can get the href values via XPath,
e.g. //*[@id="content"]/div/ul/li[*]/a/@href
which yields 1128872_2.html, 1128872_3.html, 1128872_4.html, and then it's easy:

import requests

for i in range(2, 5):
    status = requests.get(f'https://www.daocaorenshuwu.com/book/yinhezhixin2/1128872_{i}.html').status_code
    if status != 404:
        print('OK')
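Zeaf's status-code probe can be folded into a loop that stops at the first 404 and never skips page 1 (which has no _n suffix). A sketch; collect_chapter_urls and live_status are hypothetical names, and the HTTP check is passed in as a callable so the loop logic can be exercised without network access:

```python
BASE = "https://www.daocaorenshuwu.com/book/yinhezhixin2/"

def collect_chapter_urls(chapter_id, status_of, max_pages=20):
    """Return page 1 plus every _j sub-page until the first 404.
    `status_of(url)` must return an HTTP status code."""
    urls = [f"{BASE}{chapter_id}.html"]       # page 1, no suffix
    for j in range(2, max_pages + 1):
        url = f"{BASE}{chapter_id}_{j}.html"
        if status_of(url) == 404:
            break
        urls.append(url)
    return urls

def live_status(url):
    import requests  # deferred so the offline demo below runs without it
    return requests.get(url).status_code

# Offline demo: pretend the chapter has sub-pages _2.._4 and _5 is a 404.
fake_status = lambda u: 404 if u.endswith('_5.html') else 200
print(collect_chapter_urls(1128874, fake_status))
```

In real use you would pass live_status (or a version with proper headers) instead of the fake.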
That works too.

D.A. posted on 2020-8-30 19:40:
Get the links directly via XPath?
For example links = xml.xpath("//ul[@class="pagination pagination-sm"]/li/a/@hr ...

Still no first page.

Last edited by nightcat on 2020-8-30 21:21
import requests
import lxml.etree
from fake_useragent import UserAgent
from pyquery import PyQuery as pq

ua = UserAgent(verify_ssl=False).chrome
header = {'user-agent': ua}

def start_url(url):
    response = requests.get(url, headers=header)
    selector = lxml.etree.HTML(response.text)
    title = pq(response.content)("#content h1").text()
    body = pq(response.content)('#cont-text').text()
    body = body.replace(r'DaoCaoRen.getCode("ui-content");', '')  # strip injected JS call
    file = f'{title}.txt'
    download_file(file, body, title)
    # second-level sub-pages linked from the pagination list
    for i in selector.xpath('//*[@id="content"]/div/ul/li/a/@href'):
        new_url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i)
        new_body = content(new_url)
        new_body = new_body.replace(r'DaoCaoRen.getCode("ui-content");', '')
        download_file(file, conttext=new_body)

def content(url):
    response = requests.get(url, headers=header)
    new_body = pq(response.content)('#cont-text').text()
    return new_body

def download_file(file, conttext, title=None):
    with open(file, 'a+', encoding='utf-8') as f:
        if title:
            f.write(f'{title}\n')
        f.write(conttext)

if __name__ == '__main__':
    for i in range(1128871, 1128911):
        url = "https://www.daocaorenshuwu.com/book/yinhezhixin2/" + str(i) + ".html"
        start_url(url)
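One caveat with the code above: per D.A.'s earlier note, the pagination list appears twice on the page (top and bottom), so the xpath in start_url may yield every sub-page href twice and append each sub-page to the file twice. Whether the live page really duplicates the links is an assumption; an order-preserving dedup over the hrefs is cheap insurance:

```python
def unique_in_order(hrefs):
    """Drop duplicate hrefs while keeping first-seen order
    (dict keys preserve insertion order in Python 3.7+)."""
    return list(dict.fromkeys(hrefs))

print(unique_in_order(['1128872_2.html', '1128872_3.html',
                       '1128872_2.html', '1128872_3.html']))
```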