I've recently been learning web scraping, but my tutorial is from 2017 and the site has changed so much since then that I've had to figure things out on my own, and I'm stuck. Could someone help me fix the code below so I can study it?
The images it downloads are not the high-resolution versions (a rough idea for getting the full-size URLs is sketched after the code).
Scraping street-snap ("街拍") images from Toutiao:

#coding:utf-8
import requests
import time
from urllib.parse import urlencode
import os
from hashlib import md5
from multiprocessing.pool import Pool
def get_page(offset):
    # Request one page of search results (20 items per page) from the search API.
    timestamp = int(time.time() * 1000)
    params = {
        'aid': 24,
        'app_name': 'web_search',
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': 20,
        'en_qc': 1,
        'cur_tab': 1,
        'from': 'search_tab',
        'timestamp': timestamp
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0',
        'Cookie': 'tt_webid=6698224417073694219; UM_distinctid=16b1c89b4891dc-07c5eab40102ac-4c312d7d-100200-16b1c89b48a400; csrftoken=f2f895fa787b16b95088a70f7de231c9; __tasessionId=v9ffxsrfd1559551950358; CNZZDATA1259612802=1433961361-1559551484-https%253A%252F%252Flanding.toutiao.com%252F%7C1559551484; s_v_web_id=72ad7620bc7eee5c573a2fe6966db22e'
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)
    print(url)  # debug: show the exact request URL
    try:
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            return r.json()
        return None
    except requests.ConnectionError:
        return None
def get_images(json):
    # Walk the JSON result and yield one dict per image (image URL + post title).
    if json and json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            if title is None:
                continue
            images = item.get('image_list')
            if not images:
                continue
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
def save_image(item):
    # Download one image into a folder named after the post title; the MD5 of the
    # content is used as the file name so duplicates are skipped.
    title = item.get('title')
    try:
        os.makedirs(title, exist_ok=True)
    except OSError as e:
        print(e)
        return
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(title, md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to save image')
def main(offset):
    print(offset)
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)
        time.sleep(1)
# Pages to fetch: offsets 0, 20, 40, ... (10 pages, same as the original range(10)).
GROUP_START = 0
GROUP_END = 9
if __name__ == "__main__":
    pool = Pool()
    pool.map(main, [i * 20 for i in range(GROUP_START, GROUP_END + 1)])
    pool.close()
    pool.join()
    print("OK")