Python proxy-scraping script

LSA posted on 2017-4-4 15:25

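I recently wrote a Python script that scrapes proxies, borrowing from other people's code. Having a large pool of proxies makes things much easier. This is v1.0. It is multithreaded (one thread per page of proxies), and it was also a chance to get familiar with queues and bs4; bs4 really is powerful and far more convenient. Many parts are still rough, and I will keep improving the script when I have time.
What it does: scrapes proxies from www.xicidaili.com, checks each one against 1212.ip138.com/ic.asp, and writes the working proxies to useful_proxies.txt.
The code is short, so I didn't package it; just copy it if you need it. Bug reports and suggestions are welcome!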

Source code:
# coding: utf-8

import requests
from bs4 import BeautifulSoup as bs
import re
import queue
import threading
import time
import optparse

url = 'http://www.xicidaili.com/nn/'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko'}


class Proxy_collection(threading.Thread):  # subclass Thread for multithreading

    def __init__(self, que):
        threading.Thread.__init__(self)  # reuse the parent Thread's __init__()
        self._que = que

    def run(self):
        while True:
            try:
                # get_nowait() avoids blocking forever if another thread took the last URL
                url = self._que.get_nowait()
            except queue.Empty:
                break
            r = requests.get(url, headers=headers, timeout=5)
            soup = bs(r.content, 'lxml', from_encoding='utf-8')
            # the pattern matches any class value, so every <tr> with a class attribute passes
            bqs = soup.find_all(name='tr', attrs={'class': re.compile(r'|[^odd]')})
            for bq in bqs:
                us = bq.find_all(name='td')
                try:
                    # pass protocol, ip and port on for checking; the column indices
                    # 5, 1, 2 are assumed, the subscripts are missing from the archived post
                    self.proxies_confirm(str(us[5].string), str(us[1].string), str(us[2].string))
                except Exception:
                    pass

    def proxies_confirm(self, type_self, ip, port):
        ip_dic = {}
        # key by protocol, as requests expects (the key is assumed, it is missing from the archived post)
        ip_dic[type_self.lower()] = ip + ':' + port

        r = requests.get('http://1212.ip138.com/ic.asp', headers=headers, proxies=ip_dic, timeout=5)
        result = re.findall(r'\d+\.\d+\.\d+\.\d+', r.text)
        result_ip = ''.join(result)  # join the matches into one string

        if ip == result_ip:
            print(type_self + '---' + ip + ':' + port + ' is useful!!!\n')
            with open('useful_proxies.txt', 'a') as f:
                f.write(type_self.lower() + '---' + ip + ':' + port + '\n')


if __name__ == '__main__':
    thread = []

    que = queue.Queue()

    parser = optparse.OptionParser('usage %prog ' + '-p <page num>')

    parser.add_option('-p', dest='pagenum', type='int',
                      help='specify page nums,default 5', default=5)

    (options, args) = parser.parse_args()
    pagenum = options.pagenum

    for i in range(1, pagenum + 1):
        que.put('http://www.xicidaili.com/nn/' + str(i))

    for i in range(pagenum):
        thread.append(Proxy_collection(que))
    start = time.time()
    for i in thread:
        i.start()
    for i in thread:
        i.join()

    end = time.time()
    print(end - start)
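
The queue-plus-threads structure of the script can be tried offline. What follows is a minimal sketch, not part of the original script: `fake_fetch` is a hypothetical stand-in for the `requests.get` call, and `worker` and `crawl` are names chosen here for illustration. It also assumes a fixed number of workers rather than one thread per page.

```python
import queue
import threading


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url); returns a fake page body.
    return 'page body for ' + url


def worker(que, results, lock):
    # Take URLs until the queue is empty, then let the thread finish.
    while True:
        try:
            url = que.get_nowait()
        except queue.Empty:
            return
        body = fake_fetch(url)
        with lock:  # protect the shared results list
            results.append((url, body))


def crawl(urls, num_workers=3):
    # Fill the queue first, then start the workers and wait for all of them.
    que = queue.Queue()
    for u in urls:
        que.put(u)
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(que, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    pages = ['http://www.xicidaili.com/nn/' + str(i) for i in range(1, 6)]
    for url, body in sorted(crawl(pages)):
        print(url, len(body))
```

Because every URL is queued before any worker starts, an empty queue here really does mean there is no work left, so `get_nowait()` is enough and no thread can hang on a blocking `get()`.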

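LSA posted on 2017-4-16 17:36

soloyuyang posted on 2017-4-11 19:19
666, tested and it works. I'm new to crawlers. What is the 1212 site for in your verification step? Is it to check whether the IP that 1212 detects is the scraped IP? If so ...

The 1212 site is used to verify that a scraped proxy works: the script sends a request to the 1212 site through the scraped proxy IP (proxies=ip_dic), and if the IP the page displays is the proxy's IP, the proxy is usable.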
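
The check described here can be sketched without touching the network. This is only an illustration under assumptions: `build_proxies` and `proxy_is_useful` are hypothetical helper names, the sample page text is made up, and the `http://` prefix on the proxy address is an added assumption rather than something the original script does.

```python
import re

IP_PATTERN = re.compile(r'\d+\.\d+\.\d+\.\d+')


def build_proxies(scheme, ip, port):
    # requests expects a dict mapping URL scheme to proxy address.
    return {scheme.lower(): 'http://' + ip + ':' + port}


def proxy_is_useful(proxy_ip, response_text):
    # The proxy counts as usable when the IP echoed by the page
    # is exactly the proxy's own IP.
    found = IP_PATTERN.findall(response_text)
    return ''.join(found) == proxy_ip


if __name__ == '__main__':
    sample_page = 'Your IP is: [1.2.3.4]'  # made-up response body
    print(build_proxies('HTTP', '1.2.3.4', '8080'))
    print(proxy_is_useful('1.2.3.4', sample_page))
```

In real use the dict from `build_proxies` would be passed as `proxies=` to `requests.get`, and the response text handed to `proxy_is_useful`.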

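soloyuyang posted on 2017-4-11 19:19

666, tested and it works. I'm new to crawlers. What is the 1212 site for in your verification step? Is it to check whether the IP that 1212 detects is the scraped IP, and print it if so? But how do you decide whether a proxy address is usable? Do you substitute the scraped IP into headers?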

c文字 posted on 2017-4-4 15:33

What can you do with them once they're scraped? A humble question from a clueless newbie.

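2205 posted on 2017-4-4 15:48

c文字 posted on 2017-4-4 15:33
What can you do with them once they're scraped? A humble question from a clueless newbie.

Some programmers use them when writing automated registration tools, crawler code, and so on.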

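LSA posted on 2017-4-4 15:48

c文字 posted on 2017-4-4 15:33
What can you do with them once they're scraped? A humble question from a clueless newbie.

1. Hide your own IP
2. Some sites have anti-crawler measures and will quickly ban you if you crawl from a single IP, so switching IPs lets you keep crawling without interruption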
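
Switching IPs could look roughly like the sketch below. It assumes the `type---ip:port` line format that the script writes to useful_proxies.txt; `parse_proxy_line` and `proxy_cycle` are hypothetical names, the sample addresses are made up, and the `http://` prefix is an added assumption.

```python
import itertools


def parse_proxy_line(line):
    # 'http---1.2.3.4:8080' -> ('http', '1.2.3.4:8080')
    scheme, _, address = line.strip().partition('---')
    return scheme, address


def proxy_cycle(lines):
    # Endless iterator over requests-style proxies dicts; blank lines are skipped.
    entries = [parse_proxy_line(l) for l in lines if l.strip()]
    dicts = [{scheme: 'http://' + address} for scheme, address in entries]
    return itertools.cycle(dicts)


if __name__ == '__main__':
    sample = ['http---1.2.3.4:8080\n', 'https---5.6.7.8:3128\n']  # made-up addresses
    rotation = proxy_cycle(sample)
    for _ in range(3):
        # each request would pass the next dict as proxies=...
        print(next(rotation))
```

With a real file, the lines would come from `open('useful_proxies.txt')`. If the file is empty the iterator has nothing to yield, so the caller would need to handle that case.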

谁折南枝傍小丛 posted on 2017-4-4 16:10

Thanks for sharing.

裕龙境 posted on 2017-4-4 16:10

I wouldn't know how to use it anyway, but here's my support.

python193 posted on 2017-4-4 17:18

Thanks a lot!

c文字 posted on 2017-4-5 13:41

Thanks to the experts for the answers.

梦醒方如初 posted on 2017-4-5 16:41

Thanks for sharing!

zht1023 posted on 2017-4-5 22:27

I started learning Python a while ago, so I'm taking the code with me.
Thanks!


吾爱破解 - 52pojie.cn's Archiver
