经典Python爬虫例子，Python多线程爬虫例子案例

huguo002 · 发表于 2020-5-15 19:35

本帖最后由 huguo002 于 2020-5-15 20:13 编辑

很久没写爬虫了，一个经典的Python爬虫例子，Python多线程爬虫例子案例，目标网站结构比较简单，适合练手使用，采用了经典的生产者和消费者模式，同时结合python类和装饰器的使用，应该能够让你获益不少。
目标网站.png

几个关键点：
1.python多线程生产者与消费者模式官方文档：17.1. threading — 基于线程的并行https://docs.python.org/zh-cn/3.6/library/threading.html
两个案例参考：用Python多线程实现生产者消费者模式https://segmentfault.com/a/1190000008909344
python-多线程3-生产者消费者https://www.cnblogs.com/R-bear/p/7031722.html

2.@property 装饰器既要保护类的封装特性，又要让开发者可以使用“对象.属性”的方式操作操作类属性，除了使用 property() 函数，Python 还提供了 @property 装饰器。
通过 @property 装饰器，可以直接通过方法名来访问方法，不需要在方法名后添加一对“（）”小括号。
参考：http://c.biancheng.net/view/4561.htmlhttps://www.liaoxuefeng.com/wiki/897692888725344/923030547069856

3.@staticmethod 静态方法@staticmethod 静态方法只是名义上归属类管理，但是不能使用类变量和实例变量，是类的工具包放在函数前（该函数不传入self或者cls），所以不能访问类属性和实例属性。
参考：Python进阶-----静态方法（@staticmethod）https://www.cnblogs.com/Meanwey/p/9788713.html
Python staticmethod() 函数https://www.runoob.com/python/python-func-staticmethod.html

4.Queue 队列queue 模块实现多生产者，多消费者队列。当信息必须安全的在多线程之间交换时，它在线程编程中是特别有用的。
此模块中的 Queue 类实现了所有锁定需求的语义。Queue.put(item, block=True, timeout=None) 将 item 放入队列。如果可选参数 block 是 true 并且 timeout 是 None (默认)，则在必要时阻塞至有空闲插槽可用。如果 timeout 是个正数，将最多阻塞 timeout 秒，如果在这段时间没有可用的空闲插槽，将引发 Full 异常。
反之 (block 是 false)，如果空闲插槽立即可用，则把 item 放入队列，否则引发 Full 异常 ( 在这种情况下，timeout 将被忽略)。
参考：17.7. queue — 一个同步的队列类https://docs.python.org/zh-cn/3.6/library/queue.html 还是推荐和尝试去阅读官方文档，慢慢理解和实践！

运行三个报错1.Queue 队列，只能接收一个值
报错.png

2.目录文件名未格式处理，存储路径错误
   路径报错.png

3.timeout报错可能是图片路径存在问题，待查证！
   timeout报错.png

运行效果

采集效果

附源码：

[Python] 纯文本查看 复制代码

001

002

003

004

005

006

007

008

009

010

011

012

013

014

015

016

017

018

019

020

021

022

023

024

025

026

027

028

029

030

031

032

033

034

035

036

037

038

039

040

041

042

043

044

045

046

047

048

049

050

051

052

053

054

055

056

057

058

059

060

061

062

063

064

065

066

067

068

069

070

071

072

073

074

075

076

077

078

079

080

081

082

083

084

085

086

087

088

089

090

091

092

093

094

095

096

097

098

099

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

#BillWang 博闻 采集
#微信：huguo00289
# -*- coding: UTF-8 -*-
import requests
import os,random,re
from lxml import etree
import threading
from queue import Queue
 
class Httprequest(object):
    ua_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36Chrome 17.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0Firefox 4.0.1',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    ]
    @property  #把方法变成属性的装饰器
    def random_headers(self):
        return {
            'User-Agent': random.choice(self.ua_list)
        }
 
#生产者模式
class Procuder(threading.Thread,Httprequest):
    def __init__(self,page_queue,img_queue,*args,**kwargs):
        super(Procuder,self).__init__(*args,**kwargs)
        self.url = "http://www.billwang.net/html/blogs/"
        self.page_queue=page_queue
        self.img_queue=img_queue
 
 
    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.get_list(url)
 
 
    def get_list(self, url):
        print(f">>> 正在爬取列表页 {url}")
        html = requests.get(url, headers=self.random_headers, timeout=5).content.decode('utf-8')
        req = etree.HTML(html)
        hrefs = req.xpath('//div[@class="txtbox"]/a/@href')
        for href in hrefs:
            href = f'{self.url.split("/html")[0]}{href}'
            self.get_content(href)
 
    def get_content(self, url):
        print(f">>> 正在爬取详情页 {url}")
        html = requests.get(url, headers=self.random_headers, timeout=5).content.decode('utf-8')
        req = etree.HTML(html)
        title= req.xpath('//div[@class="detail-con"]/h1[@class="title"]/text()')[0]
        h1 =self.validate_title(title)
        content = req.xpath('//div[@class="content"]//text()')
        content = self.deal_content(content)
        content_req = req.xpath('//div[@class="content"]')[0]
        imgs = content_req.xpath('*//img/@src')
        data=(h1, content, imgs)
        print(data)
        self.img_queue.put(data)
 
    @staticmethod
    def validate_title(title):
        pattern = r"[\/\\\:\*\?\"\<\>\|]"
        new_title = re.sub(pattern, "_", title)  # 替换为下划线
        return new_title
 
    @staticmethod  # @staticmethod 静态方法只是名义上归属类管理，但是不能使用类变量和实例变量，是类的工具包放在函数前（该函数不传入self或者cls），所以不能访问类属性和实例属性
    def deal_content(content):
        content.remove('\n        ')
        content.remove('        ')
        content = ' '.join(content)
        content = content.replace('免责声明：本站目的在于分享更多信息，不代表本站的观点和立场，版权归原作者所有。若有侵权或异议请联系我们删除。', '')
        return content
 
 
 
#消费者模式
class Consumer(threading.Thread,Httprequest):
    def __init__(self,page_queue,img_queue,*args,**kwargs):
        super(Consumer,self).__init__(*args,**kwargs)
        self.page_queue=page_queue
        self.img_queue=img_queue
        self.path = f'billw/'
 
    def save_content(self,h1,content,path):
        os.makedirs(self.path, exist_ok=True)  # 创建文件夹
        print(f">>> 开始保存 {h1}文本内容")
        text = '%s%s%s' % (h1, '\n', content)
        with open(f'{path}{h1}.txt', 'w', encoding='utf-8') as f:
            f.write(text)
        print(">>> 保存成功！")
 
    def save_imgs(self,imgs,path):
        i = 1
        for img in imgs:
            img_url = img
            img_name = f'{i}{os.path.splitext(img)[-1]}'
            img_path = f'{path}{img_name}'
            self.save_img(img_url, img_name, img_path)
            i = i + 1
 
    def save_img(self, img_url, img_name, img_path):
        print(f">>> 开始保存 {img_name} 图片")
        r = requests.get(img_url, headers=self.random_headers,timeout=5)
        with open(img_path, 'wb') as f:
            f.write(r.content)
        print(f">>> 保存 {img_name} 图片成功")
 
 
 
    def run(self):
        while True:
            if self.page_queue.empty() and self.img_queue.empty():
                break
            data=self.img_queue.get()
            h1 = data[0]
            content = data[1]
            imgs = data[2]
            path = f'billw/{h1}/'
            os.makedirs(path, exist_ok=True)  # 创建文件夹
            self.save_content(h1, content, path)
            self.save_imgs(imgs, path)
 
 
def main():
    page_queue=Queue(100)
    img_queue=Queue(10000)
    for i in range(1, 21):
        url = "http://www.billwang.net/html/blogs/%d/" % i
        print(f'>>> 正在爬取 第{i}页 列表页，链接：{url} ...')
        page_queue.put(url)
 
    for x in range(2):
        t=Procuder(page_queue,img_queue)
        t.start()
 
    for x in range(8):
        t=Consumer(page_queue,img_queue)
        t.start()
 
 
 
if __name__=="__main__":
    main()

处女-大龙猫 · 发表于 2020-5-16 09:08

huguo002 发表于 2020-5-16 09:06
淘宝是怎么搞定的？

我用的selenium写的，最初用reques库模拟硬刚，跟着大牛学习。不过后来淘宝更新了，二月写的代码不能爬了，现在没办法用了selenium。

斜杠朱先生 · 发表于 2020-5-16 11:30

处女-大龙猫发表于 2020-5-15 21:04
最近正在爬淘宝, 不过已经写好了, 这个帖子收藏了, 等忙完就看看, 关于多线程我花不太会写, 我爬淘宝就写了 ...

爬取淘宝哪方面的信息呀？能交流一下不。

fanvalen · 发表于 2020-5-15 19:53

aaaaaa 杀鬼图片加载不出啦只有代码能看

hmily65 · 发表于 2020-5-15 20:05

看不到图片

huguo002 · 发表于 2020-5-15 20:13

fanvalen 发表于 2020-5-15 19:53
aaaaaa 杀鬼图片加载不出啦只有代码能看

已重新上图！

huguo002 · 发表于 2020-5-15 20:14

hmily65 发表于 2020-5-15 20:05
看不到图片

已重新上图！

hmily65 · 发表于 2020-5-15 20:16

运行效果的动态图是怎么弄的？

hq8205 · 发表于 2020-5-15 20:18

不错，思路清晰，图文并茂，说明详细，受教了

fanvalen · 发表于 2020-5-15 20:52

hmily65 发表于 2020-5-15 20:16
运行效果的动态图是怎么弄的？

gif录制的

处女-大龙猫 · 发表于 2020-5-15 21:04

最近正在爬淘宝, 不过已经写好了, 这个帖子收藏了, 等忙完就看看, 关于多线程我花不太会写, 我爬淘宝就写了两个线程, 关于os的线程这一部分真的有点费劲

bdcpc · 发表于 2020-5-16 08:58

点赞一下〜最近想看看多线程怎么操作的

帐号		自动登录	找回密码
密码			注册[Register]

[Python 转载] 经典Python爬虫例子，Python多线程爬虫例子案例

免费评分

本帖被以下淘专辑推荐: