吾爱破解 - 52pojie.cn

[Python original] dd Cloud Reading (dd云书房) e-book downloader

木子汐 posted on 2023-7-18 15:54
This post was last edited by 木子汐 on 2023-7-18 16:33

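epubID is the e-book's ID; capture your own token and substitute it in. Create an img folder under C:\Users\Lenovo\Desktop\ to hold the book's images, or change img_path to wherever you want the images stored. When the script finishes, it writes a Word document named epubID + book.docx to C:\Users\Lenovo\Desktop\.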
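The paths above are hard-coded for one machine. As a minimal sketch (not part of the original script), all three locations could be derived from one base directory, with the img folder created automatically. The helper name `build_paths` and its dictionary keys are hypothetical; the file names follow the ones the script uses.

```python
from pathlib import Path


def build_paths(base_dir, epub_id):
    """Return the cache file, image folder and output document under base_dir.

    Hypothetical helper for illustration; the original script hard-codes these paths.
    """
    base = Path(base_dir)
    img_dir = base / "img"
    # Create the image folder up front so it does not have to be made by hand.
    img_dir.mkdir(parents=True, exist_ok=True)
    return {
        "cache": base / f"{epub_id}book_base.txt",
        "img_dir": img_dir,
        "docx": base / f"{epub_id}book.docx",
    }
```

Calling build_paths(r'C:\Users\Lenovo\Desktop', epubID) would reproduce the locations described above.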




[Python]
import json
import os
import random
import re

import requests
import time
from docx import Document
from docx.shared import Inches

doc = Document()


def sort_by_bottom_left(item):
    # Reading order: larger `bottom` (higher on the page) first, then left to right.
    return -int(item['bottom_value']), int(item['left_value'])


book = []
book_base = []
snippet_base = []
# Book: 医院医疗质量标准化建设与管理 (re-converted)
# epubID = '1901295227'
# Book: 现代医院信息化建设策略与实践 (e-book)
epubID = '1901218068'
token = 'pc_1876eb17825a8a3ee0de0563c51d00645dd7230f5c4a089663b89a**********'
url = "https://e.dangdang.com/media/api.go?action=getPcMediaInfo&epubID=" + epubID + "&token=" + token + "&wordSize=2&style=2"
response = requests.request("GET", url, timeout=5)
print(response.text)
data = json.loads(response.text)['data']
mediaPageInfo = data['mediaPageInfo']
mediaPageInfo_sort = {}
for k, v in mediaPageInfo.items():
    k = k.replace('pagenum', '')
    mediaPageInfo_sort[k] = v
mediaPageInfo_sort = {k: v for k, v in sorted(mediaPageInfo_sort.items(), key=lambda x: int(x[0]))}
text = ''
for k, v in mediaPageInfo_sort.items():
    book_txt_base = []
    book_txt = []
    book_img_base = []
    book_img = []
    url = "https://e.dangdang.com/media/api.go?action=getPcChapterInfo&epubID=" + epubID + "&token=" + token + \
          "&chapterID=" + str(v['chapterID']) + "&pageIndex=" + str(v['pageIndex']) + \
          "&locationIndex=" + k + "&wordSize=2&style=2"
    print('chapterID=', v['chapterID'], 'pageIndex=', v['pageIndex'], 'locationIndex=', k)
    time.sleep(random.uniform(0, 2))
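    # Retry until the request succeeds. Only network errors are retried, so Ctrl+C still stops the script.
    while True:
        try:
            response = requests.request("GET", url, timeout=5)
            break
        except requests.RequestException as e:
            print(e)
            time.sleep(random.uniform(0, 2))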
    data = json.loads(response.text)
    if data['status']['code'] != 0:
        continue
    snippet_data = json.loads(data['data']['chapterInfo'])['snippet']
    snippet_split = snippet_data.split('\n')
    div_style = ''
    for s in snippet_split:
        if s.startswith('<div style') or s.startswith('</div><div style='):
            div_style = s
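        elif s.startswith('<img src'):
            if div_style == '':
                # No positioned <div> preceded this image, so its coordinates are unknown.
                print('error: image without a preceding <div style=...>:', s)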
            pattern = r'left:\s*(\d+)px;\s*top:\s*(\d+)px;\s*width:\s*\d+px;\s*height:\s*\d+px;'
            match = re.findall(pattern, div_style)
            if match:
                left_value = match[0][0]
                top_value = match[0][1]
                pattern = r'src="([^"]+)".*?width:\s*(\d+)px;\s*height:\s*(\d+)px;'
                match = re.findall(pattern, s)
                if match:
                    image_url = match[0][0]
                    width = match[0][1]
                    height = match[0][2]
                    div_style = ''
                    book_img_base.append(
                        {'image_url': image_url, 'left_value': left_value, 'bottom_value': (880 - int(top_value)),
                         'top_value': top_value, 'width': width, 'height': height, 'text_content': image_url + '\n'})
        elif s.startswith('<span class'):
            pattern = r'<span\s+class="([^"]*)"\s+style="left:([^"]*)px;\s+bottom:([^"]*)px;\s*">([^<]*)<\/span>'
            match = re.match(pattern, s)
            if match:
                class_name = match.group(1)
                left_value = match.group(2)
                bottom_value = match.group(3)
                text_content = match.group(4)
                book_txt_base.append({'class_name': class_name, 'left_value': left_value, 'bottom_value': bottom_value,
                                      'text_content': text_content})
    book_txt_base += book_img_base
    book_txt_base = sorted(book_txt_base, key=sort_by_bottom_left)
    text = '\n********************'
    bottom_value = '0'
    for s in book_txt_base:
        if bottom_value != s['bottom_value']:
            bottom_value = s['bottom_value']
            text += '\n' + s['text_content']
        else:
            text += s['text_content']
    print(text)
    book_base.append({'locationIndex': k, 'book_txt_base': book_txt_base, 'text': text})
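# Cache the parsed pages as JSON. Mode "w" (not "a") keeps the file a single valid JSON document across reruns,
# and explicit UTF-8 avoids encoding errors with the Windows default code page.
with open(r'C:\Users\Lenovo\Desktop\\' + epubID + 'book_base.txt', "w", encoding='utf-8') as file:
    file.write(json.dumps(book_base, ensure_ascii=False))
if os.path.exists(r'C:\Users\Lenovo\Desktop\\' + epubID + 'book_base.txt'):
    with open(r'C:\Users\Lenovo\Desktop\\' + epubID + 'book_base.txt', "r", encoding='utf-8') as file: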
        data = file.read()
        data = json.loads(data)
        for d in data:
            if len(d['book_txt_base']) > 0:
                img_size = {}
                for b in d['book_txt_base']:
                    if 'image_url' in b:
                        img_name = b['image_url'].split('/')[-1]
                        img_size[img_name] = {'width': (int(b['width']) * 9.16 / 880),
                                              'height': (int(b['height']) * 9.16 / 880)}
                text = d['text'].replace('\n********************\n', '')
                txt = ''
                for t in text.split('\n'):
                    if t.startswith('http'):
                        if txt != '':
                            doc.add_paragraph(txt)
                            txt = ''
                        img_name = t.split('/')[-1]
                        img_path = r'C:\Users\Lenovo\Desktop\img\\' + img_name
                        if not os.path.exists(img_path):
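                            # Retry network errors only; a missing img folder now fails loudly instead of looping forever.
                            while True:
                                try:
                                    response = requests.request("GET", t, timeout=5)
                                    with open(img_path, 'wb') as f:
                                        f.write(response.content)
                                    break
                                except requests.RequestException as e:
                                    print(e)
                                    time.sleep(random.uniform(0, 2))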

                        doc.add_picture(img_path, width=Inches(img_size[img_name]['width']),
                                        height=Inches(img_size[img_name]['height']))
                    else:
                        txt += t + '\n'
                if txt != '':
                    doc.add_paragraph(txt)
                    txt = ''
            doc.add_page_break()
        doc.save(r'C:\Users\Lenovo\Desktop\\' + epubID + 'book.docx')
print('ok')
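
The core of the script above is rebuilding reading order from absolutely positioned span elements: each fragment carries left and bottom pixel offsets, fragments are sorted top to bottom and then left to right, and fragments sharing a bottom value are joined into one line. A minimal self-contained sketch of just that step follows. The function name `spans_to_lines` is hypothetical and the sample markup is made up for illustration; the real snippet payload may differ, for example by carrying non-integer offsets.

```python
import re

# Same shape as the pattern the script uses for text fragments.
SPAN_RE = re.compile(
    r'<span\s+class="([^"]*)"\s+style="left:([^"]*)px;\s+bottom:([^"]*)px;\s*">([^<]*)</span>'
)


def spans_to_lines(snippet):
    """Group positioned <span> fragments into text lines in reading order."""
    items = []
    for line in snippet.split('\n'):
        m = SPAN_RE.match(line)
        if m:
            items.append({'left': int(m.group(2)), 'bottom': int(m.group(3)), 'text': m.group(4)})
    # Larger `bottom` means higher on the page, so sort it descending, then left to right.
    items.sort(key=lambda it: (-it['bottom'], it['left']))
    lines = []
    last_bottom = None
    for it in items:
        if it['bottom'] != last_bottom:
            # A new bottom offset starts a new line of text.
            lines.append(it['text'])
            last_bottom = it['bottom']
        else:
            lines[-1] += it['text']
    return lines


# Made-up sample markup, for illustration only.
sample = '\n'.join([
    '<span class="t" style="left:120px; bottom:700px; ">world</span>',
    '<span class="t" style="left:40px; bottom:700px; ">hello </span>',
    '<span class="t" style="left:40px; bottom:680px; ">second line</span>',
])
print(spans_to_lines(sample))
```

Converting the offsets to int once, while parsing, keeps the later comparison of bottom values between items consistent.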


sdieedu posted on 2023-7-18 19:55
This post was last edited by sdieedu on 2023-7-18 20:00

File ~\DDebook.py:15 in <module>
    from docx import Document

  File C:\ProgramData\Anaconda3\lib\site-packages\docx.py:30 in <module>
    from exceptions import PendingDeprecationWarning

ModuleNotFoundError: No module named 'exceptions'


Solved: uninstall it and reinstall.

pip install python-docx
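
The traceback above looks like the one produced by the legacy docx package on PyPI, which was written for Python 2 and installs a module with the same name as python-docx. A small sketch, assuming Python 3.8 or later and a hypothetical helper name `installed_version`, for checking which of the two distributions is present before running the script:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report both candidates; only python-docx should be installed for the script above.
for name in ("docx", "python-docx"):
    print(name, "->", installed_version(name))
```

If the first line reports a version, uninstalling that package and then installing python-docx matches the fix described above.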
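大白baymax posted on 2023-7-18 16:27
OP | 木子汐 posted on 2023-7-18 16:29
大白baymax posted on 2023-7-18 16:30

There is a button with left and right angle brackets that inserts Python code with proper formatting.
OP | 木子汐 posted on 2023-7-18 16:34
大白baymax posted on 2023-7-18 16:30
There is a button with left and right angle brackets that inserts Python code with proper formatting.

Edited already.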
djxding posted on 2023-7-18 19:08
This post was last edited by djxding on 2023-7-18 21:24

A question for the OP:
     Which modules have to be installed in Python first?

Already solved.
风雨骑行 posted on 2023-7-18 19:55
Books you have not bought cannot be downloaded; only the free preview can be.
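sdieedu posted on 2023-7-18 20:02
sdieedu posted on 2023-7-18 19:55
File ~\DDebook.py:15 in
    from docx import Document

One more question: full access to the book is required before it can be downloaded, right?

Suggestion for the OP: do one for JD Reading as well.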
long8586 posted on 2023-7-18 22:31

Please package it up, thanks.