吾爱破解 - 52pojie.cn

[Python · Original] A small trick to get around anti-scraping (IP bans)

smile7788 posted on 2023-7-27 09:44
In my last post I shared a small crawler. Some readers reported that after adding multithreading, crawling too fast triggers the site's anti-scraping (in practice, an IP ban).

Here is a small trick to make an IP ban less likely: keep a pool of User-Agent strings and pick one at random for each request. The code is below.

1. Step one: collect User-Agent strings and save them to a local file. I use a site that lists a huge number of agents: http://www.useragentstring.com/

[Python]
'''
    Collect browser User-Agent strings so later requests can rotate
    them and reduce the chance of an IP ban.
'''

import random

import requests
import parsel
import urllib3

# verify=False below skips certificate checks; silence the warning it triggers.
urllib3.disable_warnings()

# Seed pool of request headers; one is picked at random per request.
# http://www.useragentstring.com/ -- a site that lists User-Agent strings.
list_headers = [
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2820.59 Safari/537.36'}
]

# page that lists Chrome User-Agent strings
g_url = 'https://www.useragentstring.com/pages/Chrome/'


def get_url_html(url: str) -> str:
    # Send the request with a randomly chosen User-Agent so successive
    # requests look like different browsers.
    response = requests.get(url, headers=random.choice(list_headers), verify=False)
    # Decode using the encoding detected from the page content.
    response.encoding = response.apparent_encoding
    return response.text


def main():
    ls_info = []

    first_html = get_url_html(g_url)
    selector = parsel.Selector(first_html)
    # Each <ul> under #liste holds a column of User-Agent links.
    for ul in selector.xpath('//*[@id="liste"]/ul'):
        ls_info += ul.css('li > a::text').getall()

    # One User-Agent per line; the 'common' directory must already exist.
    with open('common/agents.txt', 'w', encoding='utf-8') as f:
        for ls in ls_info:
            f.write(ls + '\n')


if __name__ == '__main__':
    main()
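Once the file is written, it is worth a quick sanity check before relying on it: the listing page can contain blank lines or repeated entries. A minimal loader that drops both (the path and helper name are my own, not from the post):

```python
def load_unique_agents(path: str) -> list:
    # Read one User-Agent per line, skipping blanks and duplicates
    # while preserving the original order.
    seen = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            ua = line.strip()
            if ua and ua not in seen:
                seen.append(ua)
    return seen
```

Feeding the deduplicated list into the rotation below works the same way as the raw file.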
        


    


2. Step two: have your own requests pick a random agent per call. (As you may have noticed, the code above already does exactly that.)

[Python]
import random

import requests

# A single fallback header; the real pool is loaded from agents.txt below.
# http://www.useragentstring.com/ -- grab plenty of agents there for rotation.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}


def get_agents(filename: str) -> list:
    '''
        Load one User-Agent per line and wrap each in a headers dict.
    '''
    list_info = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            list_info.append({'User-Agent': line.rstrip('\n')})
    return list_info


# Load the (large) User-Agent pool collected in step one.
l_headers = get_agents('common/agents.txt')


def get_url_html(url: str) -> str:
    html = ''
    # Pick one User-Agent at random for this request.
    hs = random.choice(l_headers)
    try:
        # Mimic a browser; rotating the User-Agent makes an IP ban less likely.
        response = requests.get(url, headers=hs)

        # Other options:
        # response = requests.get(url, headers=headers)                # verify the certificate (default)
        # response = requests.get(url, headers=headers, verify=False)  # skip certificate verification
        # response = requests.get(url, headers=headers, verify='C:/Users/Administrator/Desktop/gitee_person/s2.pem')  # trusted CA bundle

        # Decode using the encoding detected from the page content.
        response.encoding = response.apparent_encoding
        html = response.text
    except Exception:
        # On failure, report which URL and header were used and return ''.
        print('get_url_html error:', url, hs)
    return html


# unit test
if __name__ == '__main__':
    for h in get_agents('common/agents.txt'):
        print(h)

It's a bit messy, but it definitely runs. The core is just this:
[Python]
# pick one User-Agent at random for this request
hs = random.choice(l_headers)

# send the request with it; each call looks like a different browser,
# which makes an IP ban less likely
response = requests.get(url, headers=hs)
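The selection pattern is easy to test on its own, without any network access. A minimal sketch with a hypothetical in-memory pool (in the post the pool is loaded from common/agents.txt):

```python
import random

# Hypothetical three-entry pool standing in for the file-backed one.
pool = [
    {'User-Agent': 'UA-1'},
    {'User-Agent': 'UA-2'},
    {'User-Agent': 'UA-3'},
]

def pick_headers(headers_pool):
    # Same effect as headers_pool[random.randint(0, len(headers_pool) - 1)],
    # but with no index bookkeeping and no off-by-one risk.
    return random.choice(headers_pool)
```

Whatever `pick_headers` returns can be passed straight to `requests.get(url, headers=...)`.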


3. Call this page-fetching helper from your own code:

[Python]
# worker: fetch and parse one chapter
def thread_spider(chapter_info: dict):
    name = chapter_info['name']
    url_c = chapter_info['url']
    # `url` and `parse_get_one_chapter` come from the crawler in the
    # previous post; everything before '/book' is the site root.
    full_url = url[0:url.index('/book')] + url_c
    context = get_url_html(full_url)
    rc = parse_get_one_chapter(context)

    chapter_info['context'] = rc
    return chapter_info
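To show how the pieces fit together, here is a self-contained sketch that runs `thread_spider`-style workers in a pool, with a short random delay per request so UA rotation isn't undone by sheer request rate. `fetch` is a stand-in for `get_url_html`, and the chapter data is made up for illustration:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for get_url_html; a real run would pick a random
    # User-Agent here and call requests.get.
    time.sleep(random.uniform(0.0, 0.01))  # small jitter between requests
    return f'<html>{url}</html>'

def spider_one(chapter_info: dict) -> dict:
    # Mirror of thread_spider: build the full URL, fetch, store the result.
    full_url = chapter_info['base'] + chapter_info['url']
    chapter_info['context'] = fetch(full_url)
    return chapter_info

# Made-up chapter list; 'base' plays the role of the site root in the post.
chapters = [{'name': f'ch{i}', 'base': 'https://example.com', 'url': f'/book/{i}'}
            for i in range(5)]

# pool.map preserves input order, so results line up with chapters.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(spider_one, chapters))
```

In a real run, replace `fetch` with `get_url_html` from step two and tune the sleep range; several replies below note that rotating UAs alone does not stop per-IP rate limiting, so pacing the requests matters as much as the header pool.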



aakgbb posted on 2023-7-27 23:01
Hehe, thanks for sharing. I've tried this approach before and it works well, but some sites will still throttle you if the same IP reads too fast, even with rotated UAs.
BLUE7777777 posted on 2023-7-27 18:18
Most sites will actually still ban the IP. Someone in my office used this method to scrape a site and got our whole department blocked from it.
sunmoon1 posted on 2023-7-27 14:40
BTFKM posted on 2023-7-27 16:03
See https://blog.csdn.net/qq_44921056/article/details/119510170
randyho posted on 2023-7-27 17:35
Thanks for sharing, I'll give it a try right away.
CrackLife posted on 2023-7-27 17:53
Thanks, OP. Could you say how reliable this is as a way around IP bans?
poshui1968 posted on 2023-7-27 19:04
Thanks for sharing, I'll give it a try right away.
py学徒 posted on 2023-7-27 20:38, quoting:
BTFKM posted on 2023-7-27 16:03
See https://blog.csdn.net/qq_44921056/article/details/119510170

Really good stuff!
poptop posted on 2023-7-28 09:02
Thanks for sharing, I'll give it a try right away.