吾爱破解 - 52pojie.cn


Views: 3098 | Replies: 13

[Python repost] [Python crawler] A beginner's attempt at scraping web novels

laotun posted on 2020-3-2 09:33
I've just started learning to write crawlers. This one scrapes novels and only works on a single site. It's my first attempt, so please point out anything I got wrong. The code is pasted below.
[Python]
from lxml import etree
from urllib import parse
import requests
import re


# Search: look the book up, print its info, then hand off to the downloader
def search():
    txt = input("Enter the book's full title: ")
    # quoted twice on purpose: this site's search expects the keyword URL-encoded twice
    txt = parse.quote(txt)
    txt = parse.quote(txt)
    url = 'https://www.bookbao8.com/Search/q_' + txt
    u = requests.request('GET', url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    })
    url = etree.HTML(u.text)
    url = url.xpath("//div[@class='txt']/span[@class='t']/a/@href")
    url = url[0]  # first search hit
    url = 'https://www.bookbao8.com' + url
    u = requests.request('GET', url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
    })
    txt = etree.HTML(u.text)
    # Title
    name = txt.xpath("//div[@id='info']/h1/text()")
    name = ''.join(name)
    book_name = name
    name = 'Title: ' + name + '\n'
    # Author
    author = txt.xpath("//div[@id='info']/p/a/text()")
    author = author[0]
    author = 'Author: ' + author + '\n'
    # Category
    sort = txt.xpath("//div[@id='info']/p/a/text()")
    sort = sort[1]
    sort = 'Category: ' + sort + '\n'
    # Combined info
    search_content = name + author + sort
    print(search_content)

    num = int(input("Enter 1 to download: "))
    down(u, num, book_name)


def down(u, num, name):
    file = open('%s.txt' % name, 'a', encoding='utf-8')
    if num == 1:
        href = etree.HTML(u.text)
        href = href.xpath("//div[@class='wp b2 info_chapterlist']/ul/li/a/@href")
        for h in href:
            url = requests.request('GET', 'https://www.bookbao8.com/' + h, headers={
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
            })
            # Chapter body: regex out the <dd id="contents"> block, then strip the markup
            text = re.search(r'<dd id="contents">(.*?)</dd>', url.text, re.S)
            text = text.group(1)  # group(1) is the body without the surrounding <dd> tags
            text = re.sub(r'&nbsp;| ', '', text)  # drop the &nbsp; indentation
            text = re.sub(r'<br />', '', text)
            # Chapter title
            title = etree.HTML(url.text)
            title = title.xpath("//div[@class='bdsub']/dl/dd/h1/text()")
            title = ''.join(title)

            file.write(title)
            file.write('\n')
            file.write(text)
            file.write('\n')
            print("%s downloaded" % title)
        file.close()
        print("Download complete!")
    else:
        exit()
        
        
search()
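A note for other beginners reading along: the regex-plus-re.sub chain above works, but since lxml is already imported, the chapter body can also be pulled out in one step with XPath's string(), which flattens the element's text and silently drops the <br /> tags. A minimal sketch against a made-up snippet that mimics the site's <dd id="contents"> layout (the markup below is hypothetical, not fetched from bookbao8.com):

```python
from lxml import etree

# Hypothetical chapter page fragment in the same shape the crawler targets
html = '''<div class="bdsub"><dl>
<dd><h1>Chapter 1</h1></dd>
<dd id="contents">&nbsp;&nbsp;First line.<br />&nbsp;&nbsp;Second line.</dd>
</dl></div>'''

doc = etree.HTML(html)
# string() concatenates all text inside the node, so the <br /> tags vanish
# without any re.sub calls
content = doc.xpath("string(//dd[@id='contents'])")
content = content.replace('\xa0', '')  # lxml decodes &nbsp; to \xa0
title = doc.xpath("string(//div[@class='bdsub']/dl/dd/h1)")
print(title)
print(content)
```

This also avoids the pitfall of `re.search(...)[0]`, which returns the whole match including the tags.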

Ratings

Participants: 1 | 吾爱币 +1 | Karma +1 | Reason
loner. +1 +1 I agree!



OP | laotun posted on 2020-3-2 11:12
Last edited by laotun on 2020-3-2 18:43

I've updated the code. The previous version was getting detected as a crawler, so I added randomized request headers. The site also returns 502 a lot while scraping, so I added a 502 check, plus a pause of 5 seconds after each chapter download.
[Python]
from lxml import etree
from urllib import parse
from fake_useragent import UserAgent
import requests, re, time

ua = UserAgent()


# Search: same flow as before, but retry on 502 and randomize the User-Agent
def search():
    txt = input("Enter the book's full title: ")
    # quoted twice on purpose: this site's search expects the keyword URL-encoded twice
    txt = parse.quote(txt)
    txt = parse.quote(txt)
    url = 'https://www.bookbao8.com/Search/q_' + txt
    while True:
        u = requests.request('POST', url, headers={'User-Agent': ua.random})
        url = etree.HTML(u.text)
        x = url.xpath("//body/center/h1/text()")
        x = ''.join(x)
        if x != '502 Bad Gateway':
            break
    url = url.xpath("//div[@class='txt']/span[@class='t']/a/@href")
    url = url[0]  # first search hit
    url = 'https://www.bookbao8.com' + url
    while True:
        u = requests.request('POST', url, headers={'User-Agent': ua.random})
        txt = etree.HTML(u.text)
        x = txt.xpath("//body/center/h1/text()")
        x = ''.join(x)
        if x != '502 Bad Gateway':
            break
    # Title
    name = txt.xpath("//div[@id='info']/h1/text()")
    name = ''.join(name)
    book_name = name
    name = 'Title: ' + name + '\n'
    # Author
    author = txt.xpath("//div[@id='info']/p/a/text()")
    author = author[0]
    author = 'Author: ' + author + '\n'
    # Category
    sort = txt.xpath("//div[@id='info']/p/a/text()")
    sort = sort[1]
    sort = 'Category: ' + sort + '\n'
    # Combined info
    search_content = name + author + sort
    print(search_content)

    num = int(input("Enter 1 to download: "))
    down(u, num, book_name)


def down(u, num, name):
    file = open('%s.txt' % name, 'a', encoding='utf-8')
    if num == 1:
        href = etree.HTML(u.text)
        href = href.xpath("//div[@class='wp b2 info_chapterlist']/ul/li/a/@href")
        for h in href:
            while True:
                url = requests.request('POST', 'https://www.bookbao8.com/' + h, headers={'User-Agent': ua.random})
                x = etree.HTML(url.text)
                x = x.xpath("//body/center/h1/text()")
                x = ''.join(x)
                if x != '502 Bad Gateway':
                    break
            # Chapter body: regex out the <dd id="contents"> block, then strip the markup
            text = re.search(r'<dd id="contents">(.*?)</dd>', url.text, re.S)
            text = text.group(1)  # group(1) is the body without the surrounding <dd> tags
            text = re.sub(r'&nbsp;| ', '', text)  # drop the &nbsp; indentation
            text = re.sub(r'<br />', '', text)
            # Chapter title
            title = etree.HTML(url.text)
            title = title.xpath("//div[@class='bdsub']/dl/dd/h1/text()")
            title = ''.join(title)

            file.write(title)
            file.write('\n')
            file.write(text)
            file.write('\n')
            print("%s downloaded" % title)
            time.sleep(5)  # pause 5 seconds between chapters to go easy on the server
        file.close()
        print("Download complete!")
    else:
        exit()


search()
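The same `while True` 502 loop appears three times in the update; it could be factored into one helper. A sketch of that idea (the `fetch` name and its parameters are mine, not from the post), which also caps the number of retries and checks the HTTP status code instead of parsing the error page's h1:

```python
import time
import requests

def fetch(url, session=None, max_tries=5, delay=2):
    """Retry until the response is not a 502, like the while-True loops above,
    but with a bounded number of attempts and a pause between them."""
    s = session or requests
    for attempt in range(max_tries):
        # placeholder User-Agent; the post uses fake_useragent's ua.random here
        resp = s.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        if resp.status_code != 502:
            return resp
        time.sleep(delay)
    raise RuntimeError('still got 502 after %d tries: %s' % (max_tries, url))
```

Passing a `requests.Session()` as `session` would also reuse the TCP connection across chapter downloads instead of opening a new one per request.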
sndncel posted on 2020-3-2 11:16
from lxml import etree
from urllib import parse
import requests
import re


# Search
def search():
    txt = input("Enter the book's full title: ")
    txt = parse.quote(txt)
    txt = parse.quote(txt)
    url = 'https://www.bookbao8.com/Search/q_' + txt
    Why are there two identical lines here.....
hitlerfs posted on 2020-3-2 10:26
zhangbaoyu posted on 2020-3-2 10:35
Even with it posted, I wouldn't know how to use it.
KevinStark posted on 2020-3-2 10:45
No idea what it does, but it looks impressive.
xu474242 posted on 2020-3-2 11:07
Nice work. I don't have the patience to learn this right now.
OP | laotun posted on 2020-3-2 11:13

This code was getting detected as a crawler, so I've updated it.
OP | laotun posted on 2020-3-2 11:24
sndncel posted on 2020-3-2 11:16
from lxml import etree
from urllib import parse
import requests

This site's search keyword goes through URL encoding twice.
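To illustrate the double encoding the OP describes (the sample keyword below is my own, not from the thread): the first quote() turns each UTF-8 byte into a %XX escape, and the second pass then encodes those % signs themselves as %25.

```python
from urllib import parse

word = '三国演义'          # any non-ASCII title shows the effect
once = parse.quote(word)   # each UTF-8 byte becomes a %XX escape
twice = parse.quote(once)  # the % signs themselves become %25
print(once)
print(twice)
```

So the two consecutive `parse.quote(txt)` calls in the code are deliberate, not a copy-paste mistake.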
loner. posted on 2020-3-2 12:31
Notice: this author has been banned or deleted; the content was automatically hidden.