This post was last edited by hj170520 on 2020-6-15 00:16
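I'm scraping photos from Renren (人人网). To keep them sorted by album, I want each album's photos saved into a folder named after that album.
During the crawl, some photos land in their folder and some do not. Is something wrong with their encoding?
Chinese characters have to be transcoded while scraping, and it looks like some of them were not converted successfully. How can this be solved?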
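On the transcoding question, here is a minimal sketch, assuming the album names arrive as the body of a JSON string literal with `\uXXXX` escapes (which the `unicode-escape` call in the code further down suggests, but which is not confirmed here). `decode_json_string` is a hypothetical helper name, and the sample album name is made up for illustration.

```python
import json


def decode_json_string(raw):
    """Decode the text found between the quotes of a JSON string literal."""
    # json.loads resolves \uXXXX escapes and \/ in one step and leaves
    # characters that are already decoded untouched.
    return json.loads('"%s"' % raw)


escaped = r'\u6211\u7684\u76f8\u518c'   # made-up sample: an escaped album name
plain = '我的相册'                        # the same name, already decoded

print(decode_json_string(escaped) == plain)   # True
print(decode_json_string(plain) == plain)     # True

# The encode/decode round trip only works for escaped input; text that is
# already decoded gets its UTF-8 bytes reinterpreted and comes out mangled.
mangled = plain.encode('utf-8').decode('unicode-escape')
print(mangled == plain)                       # False
```

If every folder name looks right on disk, the escapes are probably being decoded fine already, and the misplaced files have a different cause.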
This is the core code. I'm leaving the cookies out.
[Python]
import requests
import os
import re


class download():
    def __init__(self):
        # '我的账户对应ID' in the URLs is a placeholder for the account ID.
        self.url = 'http://photo.renren.com/photo/我的账户对应ID/albumlist/v7'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                                      'Chrome/69.0.3497.100 Safari/537.36'}
        self.cookies =  # login cookies left out on purpose
        self.albumName = []
        self.albumId = []
        self.album = []
        self.albumurl = []
        self.albumlist = {}
        self.photoid = []
        self.photourl = ''

    def album_information(self):
        # Fetch the album list page and extract album names and IDs with regexes.
        req = requests.get(self.url, headers=self.headers, cookies=self.cookies)
        req.encoding = req.apparent_encoding
        self.albumName = re.findall(r'\"albumName\":\"(.*?)\"', req.text)
        self.albumId = re.findall(r'\"albumId\":\"(.*?)\"', req.text)
        for i in range(len(self.albumName)):
            # Turn \uXXXX escapes in the album name into real characters.
            self.album.append(str(self.albumName[i]).encode('utf-8').decode('unicode-escape', "ignore"))
            self.albumurl.append('http://photo.renren.com/photo/我的账户对应ID/album-%s/v7' % self.albumId[i])
        for ii in range(len(self.album)):
            self.albumlist[self.album[ii]] = self.albumurl[ii]
        return self.album, self.albumurl, self.albumlist

    def album_creat(self):
        # Create one folder per album in the current directory.
        for i in range(len(self.album)):
            if not os.path.exists('./%s' % self.album[i]):
                os.makedirs('./%s' % self.album[i])

    def photo_download(self):
        for i in range(len(self.albumlist)):
            self.albumdir = list(self.albumlist.keys())[i]
            req = requests.get(self.albumurl[i], headers=self.headers, cookies=self.cookies)
            req.encoding = req.apparent_encoding
            # Use the first photo ID on the album page to open a photo page.
            self.photoid = re.findall(r'\"photoId\":\"(.*?)\"', req.text)[0]
            self.photourl = 'http://photo.renren.com/photo/我的账户对应ID/photo-%s/v7' % self.photoid
            req = requests.get(self.photourl, headers=self.headers, cookies=self.cookies)
            self.largeurl = re.findall(r'\"largeurl\":\"(.*?)\"', req.text)
            for ii in range(len(self.largeurl)):
                # URLs containing escaped slashes (\/) are unescaped before the request.
                if re.search(r'\\/', self.largeurl[ii]):
                    self.largeurl_new = self.largeurl[ii].replace('\\/', '/')
                    req = requests.get(self.largeurl_new, headers=self.headers, cookies=self.cookies)
                    with open('./%s' % self.albumdir + str(ii) + '.jpg', 'wb') as f:
                        f.write(req.content)
                else:
                    req = requests.get(self.largeurl[ii], headers=self.headers, cookies=self.cookies)
                    with open('./%s/' % self.albumdir + str(ii) + '.jpg', 'wb') as f:
                        f.write(req.content)


if __name__ == '__main__':
    d = download()
    d.album_information()
    d.album_creat()
    d.photo_download()
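One thing that may be worth checking, although nothing here confirms it actually happens: `albumlist` is a dict keyed by album name, while `photo_download` indexes `albumurl` by position. If two albums share a name, the dict ends up shorter than the list and the folder no longer lines up with the URL. A rough sketch of keeping names and URLs paired follows; `pair_albums` and the `_1`, `_2` suffix convention are hypothetical, not part of the code above.

```python
def pair_albums(names, urls):
    """Pair each album name with its URL, keeping repeated names distinct."""
    pairs = []
    seen = {}
    for name, url in zip(names, urls):
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Hypothetical convention: repeated names get a _1, _2, ... suffix.
        folder = name if count == 0 else '%s_%d' % (name, count)
        pairs.append((folder, url))
    return pairs


# Made-up sample data standing in for the scraped names and URLs.
names = ['trip', 'family', 'trip']
urls = ['u0', 'u1', 'u2']
for folder, url in pair_albums(names, urls):
    print(folder, url)
```

Looping over the pairs directly (`for folder, url in pairs`) removes the need to index two separate containers by position.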
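Could someone experienced take a look and help me diagnose this?
A screenshot shows it more clearly:
Some photos are inside their folder and are named 0, 1, 2, 3, but others end up outside the folder, and instead of plain 0, 1, 2, 3 their names have the folder name stuck in front of the number.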
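This symptom matches the two `open()` calls in `photo_download`. The branch for URLs containing `\/` builds the path from `'./%s'` with no trailing slash, so the index is glued straight onto the album name and the file is written next to the folder. The `else` branch uses `'./%s/'` and writes inside it. That points to path construction rather than encoding. Below is a minimal sketch of one way to avoid the difference, using `os.path.join`; `normalize_url` and `photo_path` are hypothetical helper names, and the album name and `example.com` URL are made up.

```python
import os


def normalize_url(url):
    """Replace JSON-escaped slashes with plain ones; other URLs pass through."""
    return url.replace('\\/', '/')


def photo_path(album_dir, index):
    """Build <album_dir>/<index>.jpg; os.path.join supplies the separator."""
    return os.path.join('.', album_dir, '%d.jpg' % index)


album = 'holiday'
# What the two branches of the posted code evaluate to:
print('./%s' % album + str(0) + '.jpg')    # ./holiday0.jpg  (next to the folder)
print('./%s/' % album + str(0) + '.jpg')   # ./holiday/0.jpg (inside the folder)

# With a single path helper, both kinds of URL end up in the same place.
print(photo_path(album, 0) == os.path.join('.', 'holiday', '0.jpg'))   # True
print(normalize_url('http:\\/\\/example.com\\/a.jpg'))   # http://example.com/a.jpg
```

Since `normalize_url` leaves URLs without `\/` unchanged, the `if`/`else` in the download loop could collapse into one code path that always calls both helpers.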