Python beginner's crawler for campus-belle photos (a bit short on space, so just read the source)

倾情 posted on 2020-2-22 10:37

```
import string
import urllib.parse

import requests
from lxml import etree

url = "http://www.xiaohuar.com/hua/"
proxies = {
    # 'http': 'http://183.196.170.247:9000/',
    # "http": "111.29.3.190:80"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"
}

rep = requests.get(url, headers=headers, proxies=proxies)
# GBK is a superset of GB2312, so it also decodes characters GB2312 lacks
html_content = rep.content.decode(encoding='gbk')
# print(html_content)
dom = etree.HTML(html_content)
# xh_photos_href = dom.xpath('//div[@id="list_img"]//div[@class="img"]/a/@href')
# http://www.xiaohuar.com/
xh_photos_src = dom.xpath('//div[@id="list_img"]//div[@class="img"]/a/img/@src')
xh_photos_info = dom.xpath('//div[@id="list_img"]//a/img/@alt')
# Links in the pagination bar (the next-page link is read from position -2)
xh_photos_href = dom.xpath('//body//div[@id="list_img"]//div[@id="page"]/div/a/@href')
# Link texts in the pagination bar, used to detect the last page
xh_photos_a_text = dom.xpath('//body//div[@id="list_img"]//div[@id="page"]/div/a/text()')
print(xh_photos_a_text[-2])
page = 1
# Keep going while the pagination bar still offers a '下一页' (next page) link
while xh_photos_a_text[-2] == '下一页':
    for src, alt_info in zip(xh_photos_src, xh_photos_info):
        # File name = alt text + the extension taken from the image URL
        img_name = alt_info + '.' + src.split('.')[-1]
        print(img_name)
        # urljoin handles both relative and absolute src values
        img_url = urllib.parse.urljoin('http://www.xiaohuar.com/', src)
        # Percent-encode any non-ASCII characters in the URL
        img_url = urllib.parse.quote(img_url, safe=string.printable)
        rep = requests.get(img_url, headers=headers)
        with open(r'I:\Pchong\pc_image\flowers\\' + img_name.replace("/", "_"), 'wb') as f:
            f.write(rep.content)
        print('Image saved!')
    print('------- Page ' + str(page) + ' finished ---------')
    page += 1

    print("Next page URL:", xh_photos_href[-2])
    print("Pagination links:", xh_photos_href)
    rep = requests.get(xh_photos_href[-2], headers=headers, proxies=proxies)
    html_content = rep.content.decode(encoding='gbk')
    # print(html_content)
    dom = etree.HTML(html_content)
    # xh_photos_href = dom.xpath('//div[@id="list_img"]//div[@class="img"]/a/@href')
    # http://www.xiaohuar.com/
    xh_photos_src = dom.xpath('//div[@id="list_img"]//div[@class="img"]/a/img/@src')
    xh_photos_info = dom.xpath('//div[@id="list_img"]//a/img/@alt')
    # Links in the pagination bar
    xh_photos_href = dom.xpath('//body//div[@id="list_img"]//div[@id="page"]/div/a/@href')
    # Link texts in the pagination bar, used to detect the last page
    xh_photos_a_text = dom.xpath('//body//div[@id="list_img"]//div[@id="page"]/div/a/text()')

```
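
The URL and file-name handling in the script above could be pulled into one small helper, which makes it easy to check without touching the network. This is only a sketch: `build_image_target` and `BASE_URL` are names made up for illustration, the `.jpg` fallback is an assumption, and the set of characters replaced is the one Windows forbids in file names.

```python
import posixpath
import re
from urllib.parse import quote, urljoin

BASE_URL = "http://www.xiaohuar.com/"  # site root used in the script above


def build_image_target(src, alt, base_url=BASE_URL):
    """Return (absolute_url, file_name) for one <img> src/alt pair (hypothetical helper)."""
    # urljoin copes with both relative ("/d/file/a.jpg") and absolute src values.
    absolute = urljoin(base_url, src)
    # Percent-encode spaces and non-ASCII characters, leaving URL punctuation alone.
    absolute = quote(absolute, safe=":/?&=%#+,;@~")
    # Take the extension from the URL path; fall back to .jpg (an assumption).
    ext = posixpath.splitext(src.split("?")[0])[1] or ".jpg"
    # Replace characters that Windows does not allow in file names.
    safe_alt = re.sub(r'[\\/:*?"<>|]', "_", alt).strip() or "unnamed"
    return absolute, safe_alt + ext
```

Inside the download loop this would be used as `img_url, img_name = build_image_target(src, alt_info)`.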

renakeji posted on 2020-2-22 10:59

Why can't I open your target site?

倾情 posted on 2020-2-22 11:46

renakeji posted on 2020-2-22 10:59
Why can't I open your target site?

I checked, and the site is down.

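倾情 posted on 2020-2-22 11:38

longbow2 posted on 2020-2-22 11:12
Thanks, OP. There is a lot to learn from reading the source, but the URL won't open in a browser?

I scraped nearly 2,000 images, and the site's server seems to have given out. Probably a lot of people were testing against it, the server couldn't take the load, and it went down. The point is to learn the core idea. Next time I'll post a site that everyone can actually test against.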
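
If the server really did go down under load, pausing between requests would be the considerate thing to do. Here is a rough sketch of that idea, not what the script above does: `fetch_politely` is a made-up name, and the `delay` and `retries` defaults are arbitrary assumptions.

```python
import time


def fetch_politely(fetch, urls, delay=1.0, retries=2, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests (hypothetical helper).

    Returns a dict mapping each URL to its result, or to None if every attempt failed.
    """
    results = {}
    for url in urls:
        results[url] = None
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                # Wait a little longer after each failed attempt.
                sleep(delay * (attempt + 1))
        # Pause before the next URL so the server is not hammered.
        sleep(delay)
    return results
```

With `requests` it could be called as `fetch_politely(lambda u: requests.get(u, headers=headers, timeout=10).content, image_urls)`, where `image_urls` is whatever list of image addresses you have collected.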

ll996075dd posted on 2020-2-22 10:39

Pretty good. It would be even better if it were a bit shorter.

jamesore posted on 2020-2-22 10:42

Nice, learned something.

L244913RZXX posted on 2020-2-22 10:46

Here to learn.

没i那么简单 posted on 2020-2-22 10:47

OP, you're so cheeky.

hshcompass posted on 2020-2-22 11:00

Thanks for the analysis.
I want to learn this but can't find a way in.

wenweiqun posted on 2020-2-22 11:11

Really nice!

longbow2 posted on 2020-2-22 11:12

Thanks, OP. There is a lot to learn from reading the source, but the URL won't open in a browser?

cry323 posted on 2020-2-22 11:15

Thanks for sharing, I'll study it.


吾爱破解 - 52pojie.cn's Archiver
