吾爱破解 - 52pojie.cn


[Python / Repost] Scraping a novel from Gulong Shuwu (古龙书屋)

wx9265661 posted on 2020-3-22 10:18
My girlfriend wanted to read a novel but we couldn't find a ready-made copy, so I figured I'd scrape it with Python.
The source code is below; the user-agent part is borrowed from another forum member's post (@fengmodel).
Since I was in a hurry to get the novel downloaded, I didn't add any exception handling. Running it, I found one problem: about halfway through, the program freezes as if hung. It throws no error and doesn't exit, but if I stop it manually and rerun, it works again. Across the two runs I did get all the chapters, but I'd still like to track down the cause. Could the forum experts help take a look?
[Python] code:
import random
import requests
import time
from bs4 import BeautifulSoup


def UserAgent_random():
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1464.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.16 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.3319.102 Safari/537.36',
        'Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 '
        'Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 '
        'Safari/537.36',
        'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0.6',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36',
        'Mozilla/5.0 (X11; CrOS i686 3912.101.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 '
        'Safari/537.36']

    UserAgent = {'User-Agent': random.choice(user_agent_list)}
    return UserAgent


def next_page(soup):
    pager = soup.find(name='div', attrs={'class': 'pager'})
    for a in pager.findAll(name='a'):
        if a.string == '下一章':
            return str(a['href'])


def download_page(soup):
    head = '【' + str(soup.h1.string) + '】' + '\n'  # chapter title
    paragraph.append(head)
    content_text = soup.find(name='div', attrs={'class': 'content'})
    for i in content_text.findAll(name='p'):
        paragraph.append(str(i.string) + '\n')
    paragraph.append('\n\n\n\n')


if __name__ == '__main__':
    url = 'https://m.gulongsw.com'
    url_r = '/xs_968/938982.html'
    # final_url = '/xs_968/1008623.html'
    while url_r != '/xs_968/':
        paragraph = []
        UserAgent = UserAgent_random()
        real_html = requests.get(url + url_r, headers=UserAgent).text
        soup = BeautifulSoup(real_html, 'html.parser')
        download_page(soup)
        url_r = next_page(soup)
        print('loading ' + paragraph[0].strip())  # show which chapter is being saved
        time.sleep(5)
        with open('novel.txt', 'a', encoding='utf-8') as f:
            for p in paragraph:
                f.write(p)
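As a side note, the `next_page` logic can be exercised offline against a hand-written snippet that mimics the site's pager markup. The structure below is an assumption inferred from the selectors used in the code above, not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Assumed pager markup, inferred from the selectors in next_page().
sample = '''
<div class="pager">
  <a href="/xs_968/938981.html">上一章</a>
  <a href="/xs_968/">目录</a>
  <a href="/xs_968/938983.html">下一章</a>
</div>
'''

def next_page(soup):
    # Find the pager div, then the link labeled 下一章 ("next chapter").
    pager = soup.find(name='div', attrs={'class': 'pager'})
    for a in pager.findAll(name='a'):
        if a.string == '下一章':
            return str(a['href'])

soup = BeautifulSoup(sample, 'html.parser')
print(next_page(soup))  # → /xs_968/938983.html
```

Testing the selector this way makes it easy to see when a markup change on the site (rather than a network issue) is what breaks the crawl.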


猫南北爱上狗东西 posted on 2020-3-22 20:05
Just add a timeout with retries and it'll be fine. You don't need time.sleep; this is already slow enough.
[Python] code:
if __name__ == '__main__':
    url = 'https://m.gulongsw.com'
    url_r = '/xs_968/938982.html'
    # final_url = '/xs_968/1008623.html'

    from requests.adapters import HTTPAdapter

    
    while url_r != '/xs_968/':
        paragraph = []
        UserAgent = UserAgent_random()
        s = requests.Session()
        s.mount('http://', HTTPAdapter(max_retries=3))
        s.mount('https://', HTTPAdapter(max_retries=3))
        try:
            real_html = s.get(url + url_r, headers=UserAgent, timeout=5).text
        except requests.exceptions.RequestException as e:
            print(e)
            continue  # skip parsing when the request ultimately fails
        soup = BeautifulSoup(real_html, 'html.parser')
        download_page(soup)
        url_r = next_page(soup)
        with open('novel.txt', 'a', encoding='utf-8') as f:
            for p in paragraph:
                f.write(p)
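Building on that reply, the Session and adapter could also be created once, outside the loop, so the connection is reused across chapters, and urllib3's `Retry` adds an exponential back-off between attempts. A sketch (the helper name and parameters here are illustrative, not from the original post):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=3, backoff=1.0):
    """Build one reusable Session with a retry/back-off policy mounted."""
    s = requests.Session()
    retry = Retry(total=total_retries,
                  backoff_factor=backoff,               # exponential back-off between attempts
                  status_forcelist=[500, 502, 503, 504])  # also retry on these server errors
    adapter = HTTPAdapter(max_retries=retry)
    s.mount('http://', adapter)
    s.mount('https://', adapter)
    return s

s = make_session()
# Inside the loop you would then call:
#   real_html = s.get(url + url_r, headers=UserAgent, timeout=5).text
print(s.get_adapter('https://m.gulongsw.com').max_retries.total)  # → 3
```

Reusing one Session avoids re-doing the TLS handshake for every chapter, which also makes the crawl a little gentler on the server.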
xiaoshan1818 posted on 2020-3-22 10:21
 OP | wx9265661 posted on 2020-3-22 10:25
zhangxu888 posted on 2020-3-22 10:30
Everything else is manageable; locating the elements is the hard part.
董志刚 posted on 2020-3-22 10:32
Nicely written. I couldn't write this myself; I really need to study.
 OP | wx9265661 posted on 2020-3-22 10:39
Quoting zhangxu888 (2020-3-22 10:30): Everything else is manageable; locating the elements is the hard part.

Could it be that the site noticed I'm a crawler?
 OP | wx9265661 posted on 2020-3-22 10:41
Quoting 董志刚 (2020-3-22 10:32): Nicely written. I couldn't write this myself; I really need to study.

Let's learn together!
带不走的回忆 posted on 2020-3-22 10:47
You're the crawler, haha
prospect2005 posted on 2020-3-22 10:55
OP is awesome!
limwu posted on 2020-3-22 11:15
Nicely written. I'd like to learn this too, but I've never committed to it.