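知音漫客 (zymk.cn) scraper

天空宫阙 · posted 2019-10-23 23:49

Last edited by 天空宫阙 on 2019-10-23 23:56

1. I found that 知音漫客 is a good site to practice scraping on. The source of the h5 (mobile) pages contains the real addresses of the comic images, but I chose the PC site for this exercise. In the PC page source, the location of the comic images on the server is lightly encrypted. The core step is decrypting chapter_addr to get this.imgpath (the location of the images on the server). In practice the location of a comic on the server is fairly fixed and there are only a few possible combinations, but I still worked out the decryption of this.imgpath through packet capture. Put simply, it is like a substitution cipher where each letter is replaced by a neighboring one, except that here the shift is applied to the Unicode code point and the offset is not necessarily one.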

The JavaScript obtained after decrypting this.decode:

[JavaScript]

!__cr.imgpath=__cr.imgpath.replace(/./g,function(a){return String.fromCharCode(a.charCodeAt(0)-__cr.chapter_id%10)})!

This can be reproduced in Python as follows:

[Python]

def decode(raw, chapter_id):
    # Shift each character's Unicode code point by the last digit of chapter_id.
    # Decryption subtracts the offset; encryption adds it.
    # !__cr.imgpath=__cr.imgpath.replace(/./g,function(a){return String.fromCharCode(a.charCodeAt(0)-__cr.chapter_id%10)})!
    result = ''
    for i in raw:
        result += chr(ord(i)-int(chapter_id) % 10)
    return result

How is the JavaScript snippet above obtained from this.decode? It goes through a similar operation.
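To make the shift concrete, here is a small round-trip sketch. The encode() function, the sample path, and the chapter id are made up for illustration; only the decoding direction is confirmed by the JavaScript above, and the encoding step is assumed to be its exact inverse.

```python
def shift(text, offset):
    """Shift every character's code point by `offset` (negative to decode)."""
    return ''.join(chr(ord(ch) + offset) for ch in text)


def encode(path, chapter_id):
    # Hypothetical inverse of decode(): assumed to add the last digit
    # of chapter_id to each code point.
    return shift(path, int(chapter_id) % 10)


def decode(raw, chapter_id):
    # Same operation as the decode() shown above: subtract chapter_id % 10.
    return shift(raw, -(int(chapter_id) % 10))


sample_path = 'a/b/chapter1/'   # made-up path, not taken from the site
chapter_id = 12345              # made-up id; its last digit (5) is the offset
encoded = encode(sample_path, chapter_id)
print(encoded)
print(decode(encoded, chapter_id))
```

Because the offset is chapter_id % 10, a chapter whose id ends in 0 would leave the path unchanged.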

2. The complete source code:

[Python]

import requests
from bs4 import BeautifulSoup
import json
import time
import os
import re
from tqdm import tqdm

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}

def get_index(index_url):
    chapterslist = {}
    response = requests.get(index_url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    chapterList = soup.select('#chapterList')[0]
    chapters = chapterList.select('a')
    for chapter in chapters:
        chapterslist[chapter['title']] = chapter['href']
    return chapterslist


def quote_keys_for_json(json_str):
    """Add double quotes around every unquoted key in a JSON-like string.
    Note: to parse loose JSON in general, see https://github.com/dmeranda/demjson; it is slower than the standard library."""
    quote_pat = re.compile(r'".*?"')
    a = quote_pat.findall(json_str)
    json_str = quote_pat.sub('@', json_str)
    key_pat = re.compile(r'(\w+):')
    json_str = key_pat.sub(r'"\1":', json_str)
    assert json_str.count('@') == len(a)
    count = -1

    def put_back_values(match):
        nonlocal count
        count += 1
        return a[count]
    json_str = re.sub('@', put_back_values, json_str)
    return json_str

def decode(raw, chapter_id):
    # Shift each character's Unicode code point by the last digit of chapter_id.
    # Decryption subtracts the offset; encryption adds it.
    # !__cr.imgpath=__cr.imgpath.replace(/./g,function(a){return String.fromCharCode(a.charCodeAt(0)-__cr.chapter_id%10)})!
    result = ''
    for i in raw:
        result += chr(ord(i)-int(chapter_id) % 10)
    return result

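def get_info(index_url, num, index_dict):
    base = index_url
    tail = index_dict[f'{str(num)}话']
    detail_url = base + tail
    response = requests.get(detail_url, headers=headers)
    raw_address = BeautifulSoup(response.text, 'lxml').select('#content > div.comiclist > script')[0].string
    # Raw string with escaped metacharacters so the pattern matches the literal "__cr.init({...})"
    address = re.search(r'__cr\.init\((\{.*?\})\)', raw_address, re.S)
    if not address:
        # Fail with a clear message instead of returning an unbound variable
        raise ValueError(f'__cr.init(...) not found in {detail_url}')
    # The argument looks like a Python dict but its keys are unquoted, so convert it with quote_keys_for_json()
    # Source of quote_keys_for_json(): https://segmentfault.com/q/1010000006090535?_ea=1009953
    info = json.loads(quote_keys_for_json(address.group(1)))
    return info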

def get_certain_chapter_links(index_url,chapter,index_dict):
    certain_chapter_links = []
    info = get_info(index_url, chapter, index_dict)
    image_path = decode(info['chapter_addr'], info['chapter_id'])
    certain_chapter_total = int(info['end_var'])
    for num in range(1,certain_chapter_total+1):
        # Core concatenation in the site's JS: "//" + i + "/comic/" + this.imgpath + a
        image_address = 'http://mhpic.' + info['domain'] + '/comic/' + image_path + str(num) + '.jpg' +info['comic_definition']['middle']
        # image_address = 'http://mhpic.' + info['domain'] + '/comic/' + image_path + str(num) + '.jpg' +info['comic_definition']['high']
        certain_chapter_links.append(image_address)
    return certain_chapter_total,certain_chapter_links

def downloadFILE(url,name):
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
    }
    resp = requests.get(url=url,stream=True,headers=headers)
    # Surface HTTP errors instead of saving an error page as a .jpg
    resp.raise_for_status()
    # Content-Length may be absent; fall back to 0 and let tqdm run without a total
    content_size = int(resp.headers.get('Content-Length', 0)) // 1024
    with open(name, "wb") as f:
        print("Pkg total size is:",content_size,'k,start...')
        for data in tqdm(iterable=resp.iter_content(1024),total=content_size or None,unit='k',desc=name):
            f.write(data)
        print(name , "download finished!")

if __name__ == "__main__":
    # Comic index page: https://www.zymk.cn/1/
    index_url = 'https://www.zymk.cn/1/'
    index_dict = get_index(index_url)
    # Download directory: zyresult
    if not os.path.exists('zyresult'):
        os.mkdir('zyresult')
    # Chapters to scrape: 1 through 802
    for chapter in range(1,803):
        try:
            total,certain_chapter_links = get_certain_chapter_links(index_url,chapter,index_dict)
            for i in range(0,total):
                temp = f'{str(chapter).zfill(3)}话{str(int(i)+1).zfill(2)}.jpg'
                name = os.path.join('zyresult',temp)
                url = certain_chapter_links[i]
                downloadFILE(url,name)
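        except Exception as e:
            # Log the failing chapter and move on to the next one
            error = f'error at {chapter} ep'
            detail = str(e)
            print(error+'\n'+detail+'\n')
            # The with block closes the file, so no explicit close() is needed
            with open('log.txt','a',encoding='utf-8') as f:
                f.write(error+'\n'+detail+'\n')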
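The parsing steps in the script (extract the __cr.init argument, quote its keys, decode chapter_addr, assemble the image URLs) can be tried offline without touching the site. The sketch below is only an illustration: the payload, domain, path, and definition suffixes are made up, build_links() is a hypothetical helper name, and the real page may contain fields or formats not covered here.

```python
import json
import re


def quote_keys_for_json(json_str):
    # Same idea as in the script: hide quoted values, quote bare keys, restore values.
    quoted = re.findall(r'".*?"', json_str)
    json_str = re.sub(r'".*?"', '@', json_str)
    json_str = re.sub(r'(\w+):', r'"\1":', json_str)
    values = iter(quoted)
    return re.sub('@', lambda m: next(values), json_str)


def decode(raw, chapter_id):
    # Subtract the last digit of chapter_id from every code point.
    return ''.join(chr(ord(ch) - int(chapter_id) % 10) for ch in raw)


def build_links(script_text):
    # Hypothetical helper combining the steps of get_info() and
    # get_certain_chapter_links() on an already-downloaded script body.
    match = re.search(r'__cr\.init\((\{.*?\})\)', script_text, re.S)
    if match is None:
        raise ValueError('__cr.init(...) not found')
    info = json.loads(quote_keys_for_json(match.group(1)))
    image_path = decode(info['chapter_addr'], info['chapter_id'])
    suffix = info['comic_definition']['middle']
    return [
        f"http://mhpic.{info['domain']}/comic/{image_path}{n}.jpg{suffix}"
        for n in range(1, int(info['end_var']) + 1)
    ]


# Made-up payload: the path is shifted by 5 because the fake chapter_id ends in 5.
encoded_path = ''.join(chr(ord(c) + 5) for c in 'a/b/chapter1/')
sample_script = (
    '__cr.init({chapter_id:12345,chapter_addr:"' + encoded_path + '",'
    'end_var:3,domain:"example.com",'
    'comic_definition:{middle:"-mid",high:"-high"}})'
)

for link in build_links(sample_script):
    print(link)
```

Note that quote_keys_for_json() relies on '@' not appearing outside quoted values and on no quoted value containing an escaped double quote; that holds for this sample but is an assumption about the real payload.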

3. Final result

I downloaded the entire 斗破苍穹 comic without running into any errors.

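天空宫阙 · posted 2019-10-24 00:11

1983 posted 2019-10-24 00:02
Yeah, good for practice; the key is being able to understand it...

My thinking was not very clear when I wrote this post, so I only covered what I consider the core part. It was also getting late, so I will leave it at that. The logic of the code itself is sound.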

天空宫阙 · posted 2019-11-3 12:53

雷晨 posted 2019-11-3 10:55
Hi OP, could you write a tutorial on scraping http://mzsock.com/? Thanks.

The image links are right there in the page source, so requests plus BeautifulSoup is enough.
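As a rough illustration of that idea, the sketch below pulls image links out of an HTML string. The reply suggests requests plus BeautifulSoup; to keep the example dependency-free and runnable offline, it swaps in the standard library's html.parser and uses a made-up HTML snippet, so it says nothing about the actual markup of that site.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def extract_image_links(html):
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


# Made-up page source; a real page would come from requests.get(url).text
sample_html = (
    '<div><img src="https://example.com/a.jpg"><p>text</p>'
    '<img src="https://example.com/b.jpg" alt="b"></div>'
)
print(extract_image_links(sample_html))
```

With BeautifulSoup the same step would be a loop over soup.select('img'), as get_index() does for the chapter links.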

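1983 · posted 2019-10-24 00:02

Yeah, good for practice; the key is being able to understand it...

caowang32700484 · posted 2019-10-24 00:12

Hahaha, I don't understand it, but have a coin anyway.

andobear · posted 2019-10-24 00:19

Analysis and discussion are welcome.

o651560441 · posted 2019-10-24 07:24

Thanks for sharing the idea, OP. I can learn from it.

shangpengpeng · posted 2019-10-24 08:35

Nice, keep it up.

asspoo · posted 2019-10-24 09:02

Hoping for a finished tool that can be used directly.

zhangbice · posted 2019-10-24 09:03

Very nice, thanks for sharing.

cwl · posted 2019-10-24 09:06

Thanks for sharing, the logic is very clear.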
