Starting a big project: writing a novel downloader for the whole web

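wujie462 · posted on 2019-12-25 07:22

I don't like reading novels, but this makes decent practice for a Python crawler.
So it is for learning purposes only.
If anything here infringes your rights, please contact me.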

麦田孤望者 · posted on 2020-2-6 18:19

Got one done.

[Python]

import os
import re
import sys
from urllib.parse import quote

import requests

class Main():

    # Read once when the class body runs, because url_dict below embeds it in the POST form data.
    bookname = input('Enter the book title: ')

    # name: [home page, search URL, page encoding, HTTP method, extra]
    # For GET sources, extra is the regex that cuts the result block out of the search page
    # (only the first two sources have one so far). For POST sources, extra is the form data.
    url_dict = {
        '八一中文网1':['https://www.zwdu.com/','https://www.zwdu.com/search.php?keyword=','utf-8','GET','<div class="result-game-item-detail">(.*?)<div class="search-result-page">'],
        '八一中文网2':['http://www.81zw.in/','http://www.81zw.in/plus/search.php?kwtype=0&searchtype=&q=','gbk','GET','<ul class="ul_b_list">(.*?)</ul>'],
        '八一中文网3':['https://www.81xs.cc/','https://www.81xs.cc/s.php?ie=gbk&q=','gbk','GET'],
        '八一中文网4':['https://www.81zw.com.tw/','https://www.81zw.com.tw/modules/article/search.php?searchkey=','gbk','GET'],

        '笔趣读1':['http://www.biqudu.tv/','http://www.biqudu.tv/s.php?q=','utf-8','GET'],
        '笔趣读2':['https://www.biqugetv.com/','https://www.biqugetv.com/search.php?keyword=','utf-8','GET'],
        # The GBK bytes are passed as-is so that requests percent-encodes them exactly once.
        '笔趣读3':['http://www.biquduge.com/','http://www.biquduge.com/modules/article/search.php','gbk','POST',{'searchkey':bookname.encode('gbk'),'searchtype': 'articlename'}],
        '笔趣读4':['https://m.biqu55.com/','https://m.biqu55.com/home/search','utf-8','POST',{'action': 'search','q':bookname}],

        '中文阅读网':['https://www.zwydw.com/','https://www.zwydw.com/s.php?ie=gbk&q=','gbk','GET'],
    }

    def choose_from(self):
        from_list = list(self.url_dict)

        # Keep asking until the input is 'q' or a valid source number.
        while True:
            for x,i in enumerate(from_list):
                print(x+1,i)
            choose = input('Enter the number of the book source you want (only source 1 is supported for now): ')

            if choose == 'q':
                sys.exit()
            if choose.isdigit() and 1 <= int(choose) <= len(from_list):
                return from_list[int(choose)-1]

            os.system('cls' if os.name == 'nt' else 'clear')
            print('Please enter again! (only source 1 is supported for now; press q to quit)\n')

    def get_base_url(self,storename):
        info_list = self.url_dict[storename]
        encoding,method = info_list[2],info_list[3]
        if method == 'POST':
            # Index 4 holds the form data; POST sources have no result regex yet.
            res = requests.post(info_list[1],data=info_list[4])
            search_txt = None
        else:
            # Percent-encode the title in the site's own encoding.
            res = requests.get(info_list[1]+quote(self.bookname,encoding=encoding))
            # Sources without a fifth entry have no result regex yet.
            search_txt = info_list[4] if len(info_list) > 4 else None
        # Decode the page with the encoding recorded for this source.
        res.encoding = encoding
        return res.text,search_txt

    def get_book_list(self,html,params):
        if params is None:
            print('Parsing results is not supported for this book source yet.')
            return []

        def findall(tx1,tx2,tx3):
            # Everything between the markers tx1 and tx2 inside tx3, shortest match first.
            return re.findall(re.compile('{}(.*?){}'.format(tx1,tx2),re.S),tx3)

        books = []
        for i in re.findall(re.compile(params,re.S),html):
            bookname = findall(' title="','" c',i)[0]
            url = findall('<a cpos="title" href="','" title=',i)[0]
            describe = findall('<p class="result-game-item-desc">','</p>',i)[0]
            author = findall('<span>','</span>',findall('<p class="result-game-item-info-tag">','</p>',i)[0])[0].strip()
            books.append((bookname,url,author,describe))

        print(books)
        return books

if __name__ == '__main__':
    a = Main()
    b = a.get_base_url(a.choose_from())
    c = a.get_book_list(b[0],b[1])
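
For anyone following along, here is a small offline sketch of the marker-based extraction that get_book_list relies on. The HTML fragment is made up to match the markers in the code above and is not real output from any of the sites. The helper name `between` is hypothetical too. Unlike the `findall` helper above, it escapes the markers with re.escape so that characters such as `?` or `(` in a marker are matched literally.

```python
import re


def between(start, end, text):
    """Return every shortest run of text found between the literal markers start and end."""
    pattern = re.compile('{}(.*?){}'.format(re.escape(start), re.escape(end)), re.S)
    return pattern.findall(text)


# Made-up fragment shaped like the markers used in get_book_list; not real site output.
sample = '''
<a cpos="title" href="https://example.com/book/1/" title="Sample Book" class="result-title">Sample Book</a>
<p class="result-game-item-desc">A short description.</p>
<p class="result-game-item-info-tag"><span class="label">Author:</span><span> Someone </span></p>
'''

title = between(' title="', '" c', sample)[0]
url = between('<a cpos="title" href="', '" title=', sample)[0]
describe = between('<p class="result-game-item-desc">', '</p>', sample)[0]
# Two-step lookup: first isolate the info paragraph, then take the first bare <span> inside it.
info = between('<p class="result-game-item-info-tag">', '</p>', sample)[0]
author = between('<span>', '</span>', info)[0].strip()

print((title, url, author, describe))
```

Each lookup takes index [0], so a page whose layout differs from the markers raises IndexError. Checking for an empty list first would be the safer habit.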

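wujie462 · posted on 2019-12-26 06:52

Last edited by wujie462 on 2019-12-26 06:55

0x004 Building a proxy IP pool
Sites use all sorts of anti-scraping mechanisms, and the only countermeasure I can think of is a proxy IP pool.

Quoted from this article: https://blog.csdn.net/qq_42776455/article/details/83047883
Take a look if you're interested. I'm lazy, so I just copied and pasted it.

[Python]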

import json
import socket
import requests

proxy_url = 'https://raw.githubusercontent.com/fate0/proxylist/master/proxy.list'

def verify(ip,port,proxy_type):
    # A plain TCP connect stands in for telnetlib.Telnet, which was removed in Python 3.13.
    try:
        socket.create_connection((ip,port),timeout=3).close()
    except OSError:
        print('unconnected')
    else:
        proxies = {'type': proxy_type, 'host': ip, 'port': port}
        # Append one JSON object per line.
        with open('verified_proxies.json','a+') as f:
            f.write(json.dumps(proxies) + '\n')
        print("Written: %s" % proxies)

def getProxy(proxy_url):
    response = requests.get(proxy_url)
    for proxy_str in response.text.split('\n'):
        if not proxy_str.strip():
            continue  # skip blank lines, such as the one after the final newline
        proxy_json = json.loads(proxy_str)
        verify(proxy_json['host'],proxy_json['port'],proxy_json['type'])


if __name__ == '__main__':
    getProxy(proxy_url)
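
The script above only writes verified_proxies.json and nothing reads it back yet. Here is a rough sketch of how the file could be loaded and handed to requests. The function names `load_verified_proxies` and `to_requests_proxies` are made up for illustration. It also assumes that the 'type' field ('http' or 'https') names the kind of traffic the proxy accepts and that the proxy itself is reached over plain HTTP. Neither is confirmed by the quoted article.

```python
import json


def load_verified_proxies(path='verified_proxies.json'):
    """Read the one-JSON-object-per-line file written by verify()."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # ignore blank lines
                entries.append(json.loads(line))
    return entries


def to_requests_proxies(entry):
    """Turn {'type': ..., 'host': ..., 'port': ...} into the mapping requests expects.

    Assumption: 'type' is the scheme of the traffic to route through the proxy,
    and the proxy itself is contacted over plain HTTP.
    """
    return {entry['type']: 'http://{}:{}'.format(entry['host'], entry['port'])}
```

A request through a random pool entry would then look like `requests.get(url, proxies=to_requests_proxies(random.choice(load_verified_proxies())), timeout=5)`. Free proxies go stale quickly, so expect to retry with another entry when a request fails.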

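wujie462 · posted on 2019-12-25 07:24

Last edited by wujie462 on 2019-12-26 14:25

Strongly condemned: the OP wrote a clickbait title.
OK, the title really is a bit clickbait.
My goals: a UI, downloads from some lesser-known novel sites, and search support.
I'm a complete beginner, so updates will be extremely slow.
Estimated time: a few years or so.
If you know Python, even just a little, and are willing to help me, leave a comment below.
OVER~~~~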

wujie462 · posted on 2019-12-25 07:42

Last edited by wujie462 on 2019-12-25 07:56

0x001 Fetching the page source of the search results

[Python]

# -*- coding:UTF-8 -*-
import requests

# Send a browser User-Agent so the site does not reject the request outright.
h = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

if __name__ == "__main__":
    target = 'https://www.feisuzw.com/book/search.aspx?SearchKey=%B8%B2%CA%D6&SearchClass=1&SeaButton='
    req = requests.get(url=target, headers=h)
    print(req.text)
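
The SearchKey value in the URL above is hard-coded. It looks like a keyword percent-encoded as GBK, which is an assumption based on the byte pattern rather than anything the site documents. On that assumption, here is a minimal sketch for building the same URL from an arbitrary keyword. The names `SEARCH_URL` and `build_search_url` are made up for illustration.

```python
from urllib.parse import quote

# Same search URL as above, with the keyword slot left open.
SEARCH_URL = 'https://www.feisuzw.com/book/search.aspx?SearchKey={}&SearchClass=1&SeaButton='


def build_search_url(keyword, encoding='gbk'):
    """Percent-encode keyword in the given encoding and drop it into the search URL."""
    return SEARCH_URL.format(quote(keyword, encoding=encoding))
```

The `encoding` argument of `quote` defaults to UTF-8, so passing 'gbk' explicitly matters here. A UTF-8 encoded keyword would produce different bytes, and a site expecting GBK would most likely find nothing.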

yjn866y · posted on 2019-12-25 07:50

That is no small project....

wujie462 · posted on 2019-12-25 07:58

0x002 Locating the search result links

wujie462 · posted on 2019-12-25 08:02

0x003 Getting the results

[Python]

# -*- coding:UTF-8 -*-
from bs4 import BeautifulSoup
import requests

h = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
}

if __name__ == "__main__":
    target = 'https://www.feisuzw.com/book/search.aspx?SearchKey=%B8%B2%CA%D6&SearchClass=1&SeaButton='
    req = requests.get(url=target, headers=h)
    # The 'lxml' parser needs the lxml package installed.
    soup = BeautifulSoup(req.text, 'lxml')
    # The first <a> under each #CListTitle element is the link to a search result.
    urls = soup.select('#CListTitle > a:nth-child(1)')
    for i in urls:
        print(i.get('href'))
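
The printed href values may be relative paths rather than full URLs. I have not checked which form the site uses, so treat that as an assumption. If they are relative, they need to be resolved against the search page before they can be requested. A minimal sketch, with the made-up names `SEARCH_PAGE` and `absolute_links`:

```python
from urllib.parse import urljoin

# Page the links were scraped from; relative hrefs are resolved against it.
SEARCH_PAGE = 'https://www.feisuzw.com/book/search.aspx'


def absolute_links(hrefs, base=SEARCH_PAGE):
    """Resolve each href against base; hrefs that are already absolute pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]
```

With the loop above, this would be called as `absolute_links(i.get('href') for i in urls)`.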

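35925 · posted on 2019-12-25 08:16

Looking forward to the finished masterpiece. Here's a like.

rayxsun · posted on 2019-12-25 08:19

It's a rather big project. Fill it in bit by bit.

qlcyl110 · posted on 2019-12-25 08:23

Following this. Keep it up!

没有星星的夜空 · posted on 2019-12-25 08:27

吾爱 is all the better with you here~~~~~
Looking forward to the OP's masterpiece.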
