吾爱破解 - 52pojie.cn


[Python, reposted] Scraping girl photos with Python 3

落木依旧 posted on 2018-10-14 15:14
Last edited by 落木依旧 on 2018-10-14 15:20

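The source code was provided by @qq58452077: https://www.52pojie.cn/forum.php ... ypeid%26typeid%3D29
I made some changes to the Python 2 original so that it runs under Python 3. Rename the attachment's extension to .py and run it, then enter a number like the one circled in red in the screenshot below. Each number corresponds to one album.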
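import urllib.request
import urllib.error
import urllib.parse
import lxml.html  # pip install lxml cssselect (.cssselect() needs the cssselect package)
import time
import os
import re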

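def serchIndex(name):
    # Search the site by model name; quote() keeps non-ASCII names valid in the URL
    url='https://www.nvshens.com/girl/search.aspx?name='+urllib.parse.quote(name)
    print(url)
    html = urllib.request.urlopen(url).read().decode('UTF-8')
    return html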

def selectOne(html):
    # Take the first search result and fetch that model's album index page
    tree = lxml.html.fromstring(html)
    one = tree.cssselect('#DataList1 > tr > td:nth-child(1) > li > div > a')[0]
    href = one.get('href')
    url = 'https://www.nvshens.com'+href+'album/'
    print(url)
    html = urllib.request.urlopen(url).read().decode('UTF-8')
    return html

def findPageTotal(html):
    # Collect the unique gallery links listed on an album index page
    tree = lxml.html.fromstring(html)
    lis = tree.cssselect('#photo_list > ul > li')
    hrefs = []
    for li in lis:
        url = li.cssselect('div.igalleryli_div > a')
        href = url[0].get('href')
        hrefs.append(href)
    findimage_urls = set(hrefs)
    print(findimage_urls)
    return findimage_urls

def dowmloadImage(image_url,filename):
    # Retry up to 3 times with a short pause between attempts
    for i in range(3):
        try:
            req = urllib.request.Request(image_url)
            req.add_header('User-Agent','chrome 4{}'.format(i))
            image_data = urllib.request.urlopen(req).read()
        except urllib.error.URLError as e:
            # HTTPError is a subclass of URLError, so this catches both
            print(image_url, e)
            time.sleep(0.1)
            continue
        # Write the image and close the file handle right away
        with open(filename,'wb') as f:
            f.write(image_data)
        break

def mkdirByGallery(path):
    # Strip leading and trailing whitespace
    path = path.strip()
    path = 'E:\\py\\photo\\'+path
    # os.mkdir(path) fails when the parent directory is missing,
    # while os.makedirs(path) creates the missing parents as well.
    isExists = os.path.exists(path)
    if not isExists:
        os.makedirs(path)
    return path

# Mode 1 (disabled): search by model name and download every gallery found.
# The test is false when the file is run as a script; change != to == to enable it.
if __name__ != '__main__':
    name = str(input("name:"))
    html = serchIndex(name)
    html = selectOne(html)
    pages = findPageTotal(html)
    img_id = 1
    for page in pages:
        # The gallery id is the first run of digits in the link
        path = re.search(r'[0-9]+',page).group()
        path = mkdirByGallery(path)
        for i in range(1,31):
            url='https://www.nvshens.com'+page+str(i)+'.html'
            html = urllib.request.urlopen(url).read().decode('UTF-8')
            tree = lxml.html.fromstring(html)
            title = tree.cssselect('head > title')[0].text
            # The site's "page not found" title marks the end of the gallery
            if title.find(u"该页面未找到")!= -1:
                break
            imgs = tree.cssselect('#hgallery > img')
            srcs = []
            for img in imgs:
                src = img.get('src')
                srcs.append(src)
            image_urls = set(srcs)
            image_id = 0
            for image_url in image_urls:
                dowmloadImage(image_url,path+'\\'+'2018-{}-{}-{}.jpg'.format(img_id,i,image_id))
                image_id += 1
        img_id += 1

# Mode 2 (active): download one gallery by its numeric id.
if __name__ == '__main__':
    page = str(input("pageid:"))
    path = mkdirByGallery(page)
    # A gallery is split over numbered pages; try pages 1 to 30
    for i in range(1,31):
        url = 'https://www.nvshens.com/g/' + page+'/' + str(i) + '.html'
        print(url)
        html = urllib.request.urlopen(url).read().decode('UTF-8')
        tree = lxml.html.fromstring(html)
        title = tree.cssselect('head > title')[0].text
        # The site's "page not found" title marks the end of the gallery
        if title.find(u"该页面未找到") != -1:
            break
        imgs = tree.cssselect('#hgallery > img')
        srcs = []
        for img in imgs:
            src = img.get('src')
            srcs.append(src)
        image_urls = set(srcs)
        image_id = 0
        for image_url in image_urls:
            dowmloadImage(image_url, path+'\\'+'2018-{}-{}.jpg'.format(i,image_id))
            image_id += 1

# Mode 3 (disabled): list the gallery links on one category page.
# The test is false when the file is run as a script; change != to == to enable it.
if __name__ != '__main__':
    url = 'https://www.nvshens.com/gallery/meitui/'
    print(url)
    html = urllib.request.urlopen(url).read().decode('UTF-8')
    tree = lxml.html.fromstring(html)
    lis = tree.cssselect('#listdiv > ul > li')
    hrefs = []
    for li in lis:
        url = li.cssselect('div.galleryli_div > a')
        href = url[0].get('href')
        hrefs.append(href)
    findimage_urls = set(hrefs)
    print(findimage_urls)
    print(len(findimage_urls))
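
The retry loop in dowmloadImage can also be written as a small standalone helper with a fixed number of attempts and a growing pause between them. This is only a sketch: fetch_with_retry, attempts, base_delay and opener are names made up for illustration, and the User-Agent string and timeout are arbitrary choices, not something the original script uses.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, base_delay=0.5,
                     opener=urllib.request.urlopen):
    """Return the response body for url, or None if every attempt fails.

    All parameter names here are illustrative. opener is injectable so the
    function can be exercised without touching the network.
    """
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    for attempt in range(attempts):
        try:
            with opener(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError:
            # HTTPError is a subclass of URLError, so both are caught here.
            if attempt < attempts - 1:
                # Wait base_delay, 2*base_delay, 4*base_delay, ...
                time.sleep(base_delay * (2 ** attempt))
    return None
```

A caller would write the returned bytes to disk only when the result is not None, so a failed download no longer leaves an empty or missing file unnoticed.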

[Screenshots: 捕获2.PNG, 捕获.PNG]

Attachment: paqumeizitupian.txt (4.31 KB, downloaded 297 times, costs 1 CB 吾爱币 to download)


liujiajia posted on 2018-12-5 19:44
D:\python>python tupian1.py
pageid:27742
https://www.nvshens.com/g/27742/1.html
Traceback (most recent call last):
  File "D:\Program Files (x86)\python\lib\site-packages\lxml\cssselect.py", line 13, in <module>
    import cssselect as external_cssselect
ModuleNotFoundError: No module named 'cssselect'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tupian1.py", line 106, in <module>
    title = tree.cssselect('head > title')[0].text
  File "D:\Program Files (x86)\python\lib\site-packages\lxml\html\__init__.py", line 432, in cssselect
    from lxml.cssselect import CSSSelector
  File "D:\Program Files (x86)\python\lib\site-packages\lxml\cssselect.py", line 16, in <module>
    'cssselect does not seem to be installed. '
ImportError: cssselect does not seem to be installed. See http://packages.python.org/cssselect/
Why does this happen?
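
The traceback says that lxml's cssselect() method needs the separate cssselect package, which is not installed here; `pip install cssselect` is the usual fix. As a rough sketch, the script's third-party dependencies can be checked before running it. missing_modules is a made-up helper name, and the package list is taken from the imports and the traceback in this thread.

```python
import importlib.util


def missing_modules(names):
    """Return the names from the list that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == '__main__':
    # The script imports lxml.html and calls .cssselect(), so it needs both.
    missing = missing_modules(['lxml', 'cssselect'])
    if missing:
        print('Missing packages, try: pip install ' + ' '.join(missing))
    else:
        print('All dependencies found')
```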
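ZhXT posted on 2019-1-11 18:19
OP, I added another piece so that it creates a range.txt under E:/py/photo/, takes two numbers, writes every number between them into the file, and then reads range.txt back so that it can scrape in batches. But it stops as soon as it hits an error. Is there a way to make the program skip the URL it is currently scraping when an error occurs and move on to the next one?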


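# Write every page id in [startid, endid) to range.txt, one per line
startid=int(input("pageStartid:"))
endid=int(input("pageEndid:"))
with open("E:/py/photo/range.txt", "w") as fFile:
    for i in range(startid,endid):
        rePage=str(i)
        fFile.write(rePage+"\n")


# Reopen the file for reading; a closed handle cannot be read
with open("E:/py/photo/range.txt") as fFile:
    pages = [line.strip() for line in fFile]

for page in pages:
    path = mkdirByGallery(page)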
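
On the question of skipping a failing URL: one common approach is to wrap the work for each page id in try/except, record the failure, and continue with the next id. A minimal sketch, assuming the per-gallery download code has been moved into a function; scrape_all and scrape_one are hypothetical names, not part of the original script.

```python
import traceback


def scrape_all(page_ids, scrape_one):
    """Call scrape_one(page_id) for each id and keep going after errors.

    scrape_one is whatever function downloads a single gallery (hypothetical
    here). Returns the list of ids that raised an exception.
    """
    failed = []
    for page_id in page_ids:
        try:
            scrape_one(page_id)
        except Exception:
            # Show what went wrong for this id, then move on to the next one.
            traceback.print_exc()
            failed.append(page_id)
    return failed
```

With the ids read from range.txt this would be called as `failed = scrape_all(pages, scrape_one)`, and the returned list can be retried or written to a log afterwards.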
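大象无形 posted on 2018-10-14 15:31
凌乱的思绪 posted on 2018-10-14 15:35
Nice source code, thanks.
Sofon posted on 2018-10-14 15:39
This is nice! Downloading it to try out.
hnrzxx posted on 2018-10-14 15:39
Nice source code.
dongxin posted on 2018-10-14 15:59
Thanks to the OP for sharing.
mkinnf posted on 2018-10-14 16:23
Oh, this is a good one.
瑾年丶 posted on 2018-10-14 16:34
OP, how do I use this?
辛苦了 posted on 2018-10-14 16:51
With numbers this big, how much 营养快线 (Nutri-Express) would you have to drink?
pob777 posted on 2018-10-14 16:54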
Traceback (most recent call last):
  File "C:\Users\冬季创意者\Downloads\paqumeizitupian.py", line 13, in <module>
    import lxml.html
ModuleNotFoundError: No module named 'lxml'