Downloading Yunzhan (云展网) webp images with Python and merging them into a PDF

简华沙 posted on 2024-3-9 12:40

Last edited by 简华沙 on 2024-3-9 18:38

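A little earlier I saw [a thread asking for help downloading images from Yunzhan and merging them into a PDF](https://www.52pojie.cn/thread-1898430-1-1.html). According to its original poster, none of the other downloaders on the forum could do what he needed, so I tried to solve the problem.
After a quick search I found a Reddit thread that is likewise [a discussion about downloading the images and converting them to PDF](https://www.reddit.com/r/lepin/comments/17tjyw0/how_to_download_pdf_from_yunzhan365/). The author there had already written [a script](https://gist.github.com/DMHYT/3755eec84f4384e575611d2ccf568b2f) that runs, so I used his code directly to try to solve the problem.
Running it revealed two problems:
- Too few page flips. The original code computes flips = floor((NUM_PAGES - 3) / 2) and counts pages from 5 + 2 * i, so the last page, 290, was never reached (it flipped to 288 and 289 and then stopped). The book in the help thread opens on page 1 by default, and the flip count has to be rounded up.
- Duplicate webp images. A page flip first sends a request to /files/large/4cc5e910949d0565e4bf092f36003b4f.webp?x-oss-process=image/resize,h_731,w_517 and, some time later, a second request to /files/large/4cc5e910949d0565e4bf092f36003b4f.webp?x-oss-process=image/resize,h_731,w_517&hyztg=1. After the original code splits off the query string, the same image is therefore output twice. Deduplicating while collecting the URLs fixes this.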
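
The flip-count arithmetic from the first point can be checked in isolation. This is a minimal sketch, assuming the reader opens on page 1 and each flip reveals the next two pages; the helper names `flips_original`, `flips_fixed`, and `last_page_shown` are made up for illustration and do not appear in either script.

```python
from math import ceil, floor


def flips_original(num_pages: int) -> int:
    """Flip count used by the original script."""
    return floor((num_pages - 3) / 2)


def flips_fixed(num_pages: int) -> int:
    """Flip count used after the fix: start from page 1 and round up."""
    return ceil((num_pages - 1) / 2)


def last_page_shown(flips: int) -> int:
    """Highest page number on screen after `flips` flips.

    Assumption: page 1 is shown alone first, then every flip shows two more pages.
    """
    return 1 + 2 * flips


if __name__ == "__main__":
    pages = 290
    print("original flips:", flips_original(pages))
    print("fixed flips:", flips_fixed(pages))
    print("last page shown after the fixed count:", last_page_shown(flips_fixed(pages)))
```

For the 290-page book this gives 143 flips with the original formula and 145 with the fixed one. Under the stated assumption, 144 flips would stop at page 289, which is why rounding up matters.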

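Because of these two problems, the 290-page book came out as a 578-page PDF (page 290 was never flipped to, so only 289 pages were fetched, and the duplication doubled that). Below is the code after a simple fix, which is also the code I used to solve the original help thread.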

# Fetches yunzhan365.com book contents and saves it to PDF.
# Really slow but I just wanted to make this work in any way.
# Third-party modules: requests, selenium, pillow, beautifulsoup4
# Usage: python yunzhan.py <needed yunzhan book url>

from io import BytesIO
from json import dumps, loads
from math import ceil
import requests
from sys import argv
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from time import sleep, time
from PIL import Image

if __name__ == "__main__":

    # Take the book URL from the command line, or prompt for it.
    LINK = argv[1] if len(argv) > 1 else input("Link: ")

    # A "basic" info page only links to the reader, so follow that link.
    if "yunzhan365.com/basic/" in LINK:
        print("Fixing the URL...")
        soup = BeautifulSoup(requests.get(LINK).text, "html.parser")
        book_info = soup.find("div", {"class": "book-info"})
        title = book_info.find("h1", {"class": "title"})
        LINK = title.find("a").get("href")
        print("Fixed to " + LINK)

    # Ask Chrome to keep a performance log, which records network events.
    desired_capabilities = DesiredCapabilities.CHROME
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}

    options = webdriver.ChromeOptions()
    # options.add_argument('headless')
    options.add_argument("--ignore-certificate-errors")
    options.add_argument("--log-level=3")

    driver = webdriver.Chrome(options=options)
    driver.get(LINK)
    sleep(5)  # give the reader time to load

    NUM_PAGES = driver.execute_script("return originTotalPageCount;")
    print("Number of pages: " + str(NUM_PAGES))

    # The book opens on page 1, so round the flip count up to reach the last page.
    flips = ceil((NUM_PAGES - 1) / 2)
    print("Flips: " + str(flips))

    if flips > 0:
        for i in range(flips):
            print("\rFetching pages " + str(1 + 2 * i) + "&" + str(2 + 2 * i) + "/" + str(NUM_PAGES) + "...", end="")
            driver.execute_script("nextPageFun(\"mouse wheel flip\")")
            sleep(0.5)

    # Dump the network-related log entries to a JSON file.
    print("\nWriting the network log...")
    logs = driver.get_log("performance")
    with open("network_log.json", "w", encoding="utf-8") as f:
        f.write("[")
        for log in logs:
            network_log = loads(log["message"])["message"]
            if ("Network.response" in network_log["method"]
                    or "Network.request" in network_log["method"]
                    or "Network.webSocket" in network_log["method"]):
                f.write(dumps(network_log) + ",")
        # The empty object absorbs the trailing comma so the file stays valid JSON.
        f.write("{}]")
    driver.quit()
    json_file_path = "network_log.json"
    with open(json_file_path, "r", encoding="utf-8") as f:
        logs = loads(f.read())

    print("Sorting the pages...")
    page_links = []
    for log in logs:
        try:
            url = log["params"]["request"]["url"]
            if "files/large/" in url:
                # Drop the query string, then skip URLs that were already collected.
                webp = url.split('?')[0]
                if webp not in page_links:
                    page_links.append(webp)
        except Exception:
            # Entries without a request URL are not page images.
            pass
    print("Page count: " + str(len(page_links)))

    # Part of the page sorting: swap the two links in each pair (indices 3 and 4, 5 and 6, ...).
    if flips > 0:
        for i in range(flips):
            p1 = 3 + 2 * i
            p2 = 4 + 2 * i
            if p2 < len(page_links):
                page_links[p1], page_links[p2] = page_links[p2], page_links[p1]

    # Download every page image and convert it to RGB for the PDF writer.
    images = []
    for page_index in range(len(page_links)):
        print("\rLoading pages " + str(page_index + 1) + "/" + str(NUM_PAGES) + "...", end="")
        images.append(Image.open(BytesIO(requests.get(page_links[page_index]).content)).convert("RGB"))
    print("\nImage count: " + str(len(images)))

    # The first image starts the PDF; the remaining images are appended as further pages.
    print("Saving to PDF...")
    images[0].save("result-" + str(round(time() * 1000)) + ".pdf", save_all=True, append_images=images[1:])
    print("Done!")
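
The URL collection and deduplication step can also be tried without a browser. The following is a rough sketch, not the script's exact code: it assumes each performance log entry is a dict whose "message" field is a JSON string shaped like the entries the script parses. The function name `extract_page_urls` and the host `book.example.com` are hypothetical.

```python
import json


def extract_page_urls(entries):
    """Collect unique page image URLs, in first-seen order, from log entries."""
    seen = set()
    page_urls = []
    for entry in entries:
        event = json.loads(entry["message"])["message"]
        # Only request events carry params.request.url; others yield "".
        url = event.get("params", {}).get("request", {}).get("url", "")
        if "files/large/" not in url:
            continue
        base = url.split("?", 1)[0]  # drop the resize / hyztg query string
        if base not in seen:
            seen.add(base)
            page_urls.append(base)
    return page_urls


def make_entry(method, url):
    """Build a fake log entry for the demo (hypothetical helper)."""
    event = {"method": method, "params": {"request": {"url": url}}}
    return {"message": json.dumps({"message": event})}


if __name__ == "__main__":
    # Hypothetical host; only the path and query string come from the post.
    page = "https://book.example.com/files/large/4cc5e910949d0565e4bf092f36003b4f.webp"
    query = "?x-oss-process=image/resize,h_731,w_517"
    entries = [
        make_entry("Network.requestWillBeSent", page + query),
        make_entry("Network.requestWillBeSent", page + query + "&hyztg=1"),
    ]
    print(extract_page_urls(entries))
```

Keeping a set next to the list makes the membership check cheap while the list preserves the order in which pages were first requested.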

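xiaoshan208 posted on 2024-3-9 13:59

As a beginner I can't follow this. Could some expert package it up into a ready-made tool?

wangpj520 posted on 2024-3-9 14:01

Thanks for sharing, I'll try it and see how it works.

牧尘主宰 posted on 2024-3-9 15:06

To use it, save the code as a .py file and run it from the command line (python 1.py url). I got an error when I ran it, so I fed the code to Tongyi Qianwen (通义千问). It analyzed the code and output a revised version, and when I ran that it worked.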
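
The command-line usage above depends on the script reading the URL from the first argument. A minimal sketch of that lookup, with the hypothetical helper `resolve_link` and an injectable `ask` callable so it can be tried without typing:

```python
import sys


def resolve_link(argv, ask=input):
    """Return argv[1] when a URL was passed, otherwise ask for one."""
    return argv[1] if len(argv) > 1 else ask("Link: ")


if __name__ == "__main__":
    print("Using link:", resolve_link(sys.argv))
```

Saved as 1.py and started with `python 1.py url`, this picks up the URL from the command line. Started without an argument, it falls back to the prompt.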

zlicqh posted on 2024-3-9 15:16

Thanks for sharing.

15553590982 posted on 2024-3-9 15:22

Freeloader here... I struggle to do this kind of thing myself.

eric66 posted on 2024-3-9 15:28

Thanks, and thank you to the original poster.

IT大小白 posted on 2024-3-9 15:30

Last edited by IT大小白 on 2024-3-9 16:35

LINK is the link you want to download.
Original: images.append(Image.open(BytesIO(requests.get(page_links).content)).convert("RGB"))
Change to: images.append(Image.open(BytesIO(requests.get(page_links[page_index]).content)).convert("RGB"))
With that change the download works.
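
The change above passes one URL at a time instead of the whole list. Iterating over the list directly avoids the index altogether. This is only a sketch: `fetch_pages` is a made-up name, and `fetch` is a stand-in for something like `lambda url: requests.get(url).content`.

```python
def fetch_pages(page_links, fetch):
    """Call `fetch` once per link and keep the results in page order."""
    return [fetch(link) for link in page_links]


if __name__ == "__main__":
    # A fake fetcher so the sketch runs without network access.
    links = ["page-1.webp", "page-2.webp"]
    print(fetch_pages(links, lambda url: "bytes of " + url))
```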

wangyuyan posted on 2024-3-9 15:54

Thanks for sharing, this is great.

ruancc posted on 2024-3-9 16:05

Thanks for sharing, I'll try it and see how it works.


吾爱破解 - 52pojie.cn's Archiver
