A Baidu URL Collection Tool in Python 2.X
Rough idea: use urllib2 to fetch the Baidu result pages, regex-match the result links, then save them to a file.

#!/usr/bin/env python
# coding=utf8
import urllib
import urllib2
import re
import sys
import os

reload(sys)
sys.setdefaultencoding('utf8')

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}

text = raw_input("Search Content:\r\n")
text = text.decode('gbk', 'replace')                # Windows console input is GBK
text = urllib.quote(text.encode('utf-8', 'replace'))
ys = int(raw_input("Search Number of pages:\r\n"))

# Baidu wraps every search result in a redirect link of the form
# http://www.baidu.com/link?url=<token>
zz = r"http://www\.baidu\.com/link\?url=[\w-]+"
by = re.compile(zz)

f = open('caiji.txt', 'w')
for i in range(ys):
    url = "https://www.baidu.com/s?wd=" + text + "&pn=" + str(i * 10)
    print url
    req = urllib2.Request(url, headers=headers)
    web = urllib2.urlopen(req)
    result = by.findall(web.read())
    web.close()
    qcf = {}.fromkeys(result).keys()                # de-duplicate the matches
    print "URLs collected on this page: " + str(len(qcf))
    for link in qcf:
        req = urllib2.Request(link, headers=headers)
        try:
            u = urllib2.urlopen(req)
        except urllib2.URLError, e:
            print e
            continue                                # skip links that fail to open
        redirectUrl = u.geturl()                    # final URL after the redirect
        f.write(redirectUrl)
        f.write("\r\n")
f.close()
print "Collected URLs saved to " + os.path.join(os.getcwd(), 'caiji.txt')
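A note on the regex: the original pattern `http://www\.baidu\.com/link\?url=+` only matches the literal prefix (a trailing `=+` means "one or more = signs"), so findall would never capture the token after `url=`. The corrected pattern above grabs the whole redirect link. A quick sanity check on a made-up fragment of a result page:

# Sanity check of the link regex on a made-up HTML fragment.
import re
sample = '<a href="http://www.baidu.com/link?url=AbC-123_xyz" target="_blank">'
print re.findall(r"http://www\.baidu\.com/link\?url=[\w-]+", sample)
# prints: ['http://www.baidu.com/link?url=AbC-123_xyz']

Since urllib2, raw_input, and reload(sys) are Python 2 only, here is a rough sketch of the same approach on Python 3, assuming Baidu still serves the same redirect links; the function and variable names here are my own, not from the original script:

#!/usr/bin/env python3
# Rough Python 3 sketch of the same idea (not the original script).
import re
import urllib.error
import urllib.parse
import urllib.request

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
LINK_RE = re.compile(r"http://www\.baidu\.com/link\?url=[\w-]+")

def collect(keyword, pages, outfile='caiji.txt'):
    quoted = urllib.parse.quote(keyword)
    with open(outfile, 'w') as f:
        for page in range(pages):
            url = "https://www.baidu.com/s?wd=%s&pn=%d" % (quoted, page * 10)
            req = urllib.request.Request(url, headers=HEADERS)
            html = urllib.request.urlopen(req).read().decode('utf-8', 'replace')
            for link in set(LINK_RE.findall(html)):    # de-duplicate matches
                try:
                    resp = urllib.request.urlopen(
                        urllib.request.Request(link, headers=HEADERS))
                except urllib.error.URLError as e:
                    print(e)
                    continue
                f.write(resp.geturl() + "\n")          # final URL after redirect

if __name__ == '__main__':
    collect(input("Search Content:\n"), int(input("Number of pages:\n")))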
http://ttt.sssie.com/content/uploadfile/201608/86cf1471270756.jpg
Download address: https://yunpan.cn/c6aK9j92n3f5f (access password: cb70)
Good stuff 0.0
Thanks for sharing
OP is really awesome
Thanks for sharing
Impressive work, though I didn't understand it {:1_921:}
Thanks, I've been wanting to learn this
What is this used for???