突突兔
Posted on 2016-8-15 22:30
Rough idea: use urllib2 to fetch the Baidu result pages, regex-match the redirect links, resolve each one, and save the real URLs to a file.

#!/usr/bin/env python
# coding=utf8
# Python 2 script: scrape Baidu search results and save the resolved URLs.
import urllib
import urllib2
import re
import sys
import os

reload(sys)
sys.setdefaultencoding('utf8')

# Pretend to be a desktop Chrome browser so Baidu serves the normal result page.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}

text = raw_input("Search content:\r\n")
text = text.decode('gbk', 'replace')          # console input is GBK on Chinese Windows
text = urllib.quote(text.encode('utf-8', 'replace'))
ys = int(raw_input("Number of pages to search:\r\n"))

# Baidu wraps every result in a redirect link of this form.
zz = r"http://www\.baidu\.com/link\?url=[a-zA-Z0-9_-]+"
by = re.compile(zz)

f = open('caiji.txt', 'w')
for i in range(ys):
    url = "https://www.baidu.com/s?wd=" + text + "&pn=" + str(i * 10)
    print url
    req = urllib2.Request(url, headers=headers)
    web = urllib2.urlopen(req)
    result = by.findall(web.read())
    web.close()
    qcf = {}.fromkeys(result).keys()          # deduplicate the matched links
    print "Collected on this page: " + str(len(qcf)) + " URLs"
    for link in qcf:
        req = urllib2.Request(link, headers=headers)
        try:
            u = urllib2.urlopen(req)
        except urllib2.URLError, e:
            print e
            continue                          # skip links that fail to resolve
        f.write(u.geturl())                    # geturl() gives the final URL after the redirect
        f.write("\r\n")
f.close()
print "Collected URLs saved to " + os.path.join(os.getcwd(), 'caiji.txt')
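Baidu pages its results with the pn parameter in steps of 10, which is what the pn=i*10 part of the script builds. A small Python 3 sketch of just that URL construction (urllib.parse.quote replaces Python 2's urllib.quote; the query string here is only an example):

```python
from urllib.parse import quote

def search_urls(query, pages):
    """Build one Baidu search URL per result page for the given query."""
    wd = quote(query.encode('utf-8'))  # percent-encode the UTF-8 query
    # pn is the result offset: page 0 -> pn=0, page 1 -> pn=10, ...
    return ["https://www.baidu.com/s?wd=%s&pn=%d" % (wd, i * 10)
            for i in range(pages)]

for u in search_urls("python 教程", 3):
    print(u)
```

Non-ASCII characters end up percent-encoded (教程 becomes %E6%95%99%E7%A8%8B), so the URLs are safe to pass to urlopen.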
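The core extraction step, regex-matching the redirect links and deduplicating them, can be exercised without any network access. A Python 3 sketch using the same pattern as the script (the HTML snippet is a made-up placeholder, not real Baidu output):

```python
import re

# Same pattern as the script above: Baidu's redirect-link format.
ZZ = r"http://www\.baidu\.com/link\?url=[a-zA-Z0-9_-]+"

def extract_links(html):
    """Return the unique redirect links found in html, in first-seen order."""
    found = re.findall(ZZ, html)
    # dict.fromkeys preserves insertion order (Python 3.7+), so this
    # deduplicates without reshuffling the links.
    return list(dict.fromkeys(found))

# Hypothetical sample page: two distinct results, one of them repeated.
sample = (
    '<a href="http://www.baidu.com/link?url=abc123">A</a>'
    '<a href="http://www.baidu.com/link?url=xyz-9">B</a>'
    '<a href="http://www.baidu.com/link?url=abc123">A again</a>'
)
print(extract_links(sample))
```

This mirrors the script's `{}.fromkeys(result).keys()` trick, which served the same dedup purpose in Python 2.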
Download: https://yunpan.cn/c6aK9j92n3f5f (access password: cb70)