First of all, respect to the other crawler experts on this forum. Meizitu has been scraped here before (one post used BeautifulSoup, another was multi-threaded), but I promise I didn't write mine by looking at theirs; this is a simple single-threaded scraper I wrote on my own. Some forum friends may not have a Python environment, so I also packaged it into an exe. Experts, please go easy on me; fellow beginners, let's learn together.
exe download: https://www.lanzouj.com/i57m9wf
I'm not sure whether it will run for everyone. It works on my machine, but I haven't tested it on a computer without Python installed, so please give it a try; hopefully it gets more people interested in learning to scrape.
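(For anyone who wants to build an exe like this themselves: a one-file PyInstaller build is usually enough, e.g. "pip install pyinstaller" and then "pyinstaller -F spider.py". The script name here is just a placeholder, and the finished exe ends up in the dist folder.)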
[Python]
import os

import requests
from lxml import etree


class Spider(object):
    def headers(self):
        # Request headers: mzitu checks User-Agent and Referer, otherwise it refuses the request.
        head = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
            'Referer': 'https://www.mzitu.com/tag/youhuo/'
        }
        self.first_request(head)

    def first_request(self, head):
        # Fetch the home page and pull out every album title and album link.
        url = 'http://www.mzitu.com'
        response = requests.get(url, headers=head)
        html = etree.HTML(response.content.decode('utf-8'))
        Bigtit_list = html.xpath('//ul[@id="pins"]/li/a/img/@alt')   # album titles
        Bigsrc_list = html.xpath('//ul[@id="pins"]/li/a/@href')      # album URLs
        for Bigtit, Bigsrc in zip(Bigtit_list, Bigsrc_list):
            # One folder per album, named after the album title.
            if not os.path.exists(Bigtit):
                os.mkdir(Bigtit)
            print(Bigsrc)
            self.second_request(Bigtit, Bigsrc, head)

    def second_request(self, Bigtit, Bigsrc, head):
        # Walk the first 14 pages of the album; each page holds one image.
        for i in range(1, 15):
            response = requests.get(Bigsrc + '/' + str(i), headers=head)
            html = etree.HTML(response.content.decode())
            img_name = html.xpath('//div[@class="main-image"]/p/a/img/@alt')
            img_link = html.xpath('//div[@class="main-image"]/p/a/img/@src')
            for name, link in zip(img_name, img_link):
                try:
                    rst = requests.get(link, headers=head)
                    img = rst.content
                    print(link)
                    file_name = os.path.join(Bigtit, link.split('/')[-1])
                    print('Downloading image:', name)
                    with open(file_name, 'wb') as f:
                        f.write(img)
                except Exception as err:
                    print(err)


spyder = Spider()
spyder.headers()
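Since a multi-threaded post was mentioned at the top, here is a rough sketch of how the album downloads above could be handed to a thread pool with concurrent.futures. This is not part of my script: download_album is just a stand-in for Spider.second_request, and the XPath and headers are copied from the code above.
[Python]
from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import etree

HEAD = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Referer': 'https://www.mzitu.com/',
}


def download_album(args):
    # Stand-in for the per-album work (Spider.second_request in the script above).
    bigtit, bigsrc = args
    print('would download album:', bigtit, bigsrc)


def main():
    response = requests.get('http://www.mzitu.com', headers=HEAD)
    html = etree.HTML(response.content.decode('utf-8'))
    titles = html.xpath('//ul[@id="pins"]/li/a/img/@alt')
    links = html.xpath('//ul[@id="pins"]/li/a/@href')
    # Albums are independent of each other, so a small thread pool can fetch several at once.
    with ThreadPoolExecutor(max_workers=5) as pool:
        pool.map(download_album, zip(titles, links))


if __name__ == '__main__':
    main()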
The source below is for bizhi88 (壁纸88). Yesterday a forum friend asked whether I could try scraping it, so I gave it a shot. I can only grab the images the site serves to ordinary visitors; for the original-quality files from the backend, I did capture the download address with a packet sniffer, but I'm still too much of a beginner to figure out where the signature code in that URL comes from. Sorry about that.
[Python]
import re

import requests
from lxml import etree


class Spyder(object):
    def first_url(self, page):
        # Loop over the listing pages of category 122 and pull the title / detail-page link pairs.
        for i in range(1, page + 1):
            url = 'http://www.bizhi88.com/s/122/' + str(i) + '.html'
            response = requests.get(url)
            html = response.content.decode()
            mid_tit_list = re.compile('<a class="title" href=".*?" target="_blank" title=".*?">(.*?)</a>').findall(html)
            mid_url_list = re.compile('<a class="title" href="(.*?)" target="_blank" title=".*?">.*?</a>').findall(html)
            for mid_tit, mid_url in zip(mid_tit_list, mid_url_list):
                self.get_url(mid_tit, mid_url)

    def get_url(self, mid_tit, mid_url):
        # Open the detail page and grab the (preview-quality) image address.
        url = 'http://www.bizhi88.com' + mid_url
        response = requests.get(url)
        html = etree.HTML(response.content.decode())
        new_url = html.xpath('//div[@class="layout wp-con"]/div/img/@src')
        self.data_save(new_url, mid_tit)

    def data_save(self, new_url, mid_tit):
        # Download the image and save it under the wallpaper's title.
        response = requests.get(new_url[0])
        data = response.content
        print('Downloading image:', mid_tit)
        with open(mid_tit + '.jpg', 'wb') as f:
            f.write(data)


spyder = Spyder()
page = int(input('Enter the number of pages to download: '))
spyder.first_url(page)
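About the original-quality problem mentioned above: I still don't know where the signature comes from, but while a captured link is still valid you can at least replay it directly with requests. The URL and the auth_key parameter below are made up just to show the idea; paste in exactly what your own packet capture shows.
[Python]
import requests

# Hypothetical example only: replace with the exact address (signature included) from your capture.
captured_url = 'https://example.com/original/123456.jpg?auth_key=xxxxxxxx'

head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Referer': 'http://www.bizhi88.com/',  # some image servers reject requests without a Referer
}

resp = requests.get(captured_url, headers=head)
with open('original.jpg', 'wb') as f:
    f.write(resp.content)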