吾爱破解 - 52pojie.cn

[Python Original] HN Daily PDF download and automatic merge

fengcong1980 posted on 2024-4-16 19:36
This post was last edited by fengcong1980 on 2024-4-16 19:51

Input date format: 2024-04/16
AI
https://hnrb.voc.com.cn/hnrb_epaper/html/2024-04/16/node_201.htm
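
The date you type maps directly onto the path of the e-paper URL above. A minimal sketch of that mapping, assuming the `YYYY-MM/DD` path layout and the `node_201.htm` index page seen in the example URL hold for other dates as well; `build_node_url` is a hypothetical helper name and is not part of the script in this post:

```python
from datetime import date

# Base path taken from the example URL in the post.
BASE = "https://hnrb.voc.com.cn/hnrb_epaper/html"


def build_node_url(day: date) -> str:
    """Build the index-page URL for one issue (hypothetical helper).

    Assumes every issue follows the YYYY-MM/DD layout and that
    node_201.htm is the index page, as in the example URL.
    """
    # date objects accept strftime codes in format specs, so the month
    # and day are zero-padded automatically.
    return f"{BASE}/{day:%Y-%m/%d}/node_201.htm"


if __name__ == "__main__":
    print(build_node_url(date(2024, 4, 16)))
```

Building the URL from a `date` object rather than from raw text means an impossible date is rejected when the object is created, before any request is sent.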

Link: https://pan.baidu.com/s/1D_7Jh2FI1uM0KfeIH8qb4Q?pwd=52pj
Extraction code: 52pj
[Python]
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from PyPDF2 import PdfMerger, PdfReader
from datetime import datetime

def fetch_pdf_links(url):
    # Time out after 10 seconds so a stalled connection does not hang the script
    response = requests.get(url, timeout=10)

    if response.status_code != 200:
        print(f"Request failed, status code: {response.status_code}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect every <a href> ending in .pdf, resolved against the page URL
    pdf_links = [urljoin(url, link['href']) for link in soup.find_all('a', href=True) if link['href'].endswith('.pdf')]
    return pdf_links

def save_and_count_pages(pdf_url, target_folder):
    # Strip characters that are not allowed in Windows file names
    safe_filename = os.path.basename(pdf_url).replace('?', '').replace(':', '').replace('*', '')
    target_path = os.path.join(target_folder, safe_filename)

    if os.path.exists(target_path):
        # Reuse the existing file so that a rerun still includes it in the merge
        print(f"File already exists: {target_path}")
    else:
        pdf_response = requests.get(pdf_url, timeout=30)
        if pdf_response.status_code != 200:
            print(f"PDF download failed, status code: {pdf_response.status_code}")
            return None
        with open(target_path, 'wb') as f:
            f.write(pdf_response.content)

    with open(target_path, 'rb') as pdf_file:
        num_pages = len(PdfReader(pdf_file).pages)
        print(f"PDF file '{safe_filename}' contains {num_pages} pages")

    return (target_path, num_pages)


def merge_downloaded_pdfs(pdf_files_info, target_folder, output_filename):
    merger = PdfMerger()

    # Pass paths instead of open file objects so no file handles are leaked
    for file_path, _ in pdf_files_info:
        merger.append(file_path)

    merged_pdf_path = os.path.join(target_folder, output_filename)

    with open(merged_pdf_path, 'wb') as output_stream:
        merger.write(output_stream)
    merger.close()

    print(f"Merged all PDF files in order into: {merged_pdf_path}")


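def format_and_validate_date(input_date):
    try:
        year, month_day = input_date.strip().split('-')
        month, day = month_day.split('/')
        formatted_date = f"{year}-{month.zfill(2)}/{day.zfill(2)}"
        # strptime raises ValueError for impossible dates such as 2024-13/40
        datetime.strptime(formatted_date, '%Y-%m/%d')
        return formatted_date
    except ValueError:
        print("Invalid date format; please enter the date as 'YYYY-MM/DD'")
        return None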


def main():
    input_date = input("Enter a date (format YYYY-MM/DD): ")
    formatted_date = format_and_validate_date(input_date)
    if formatted_date is not None:
        base_url = f"https://hnrb.voc.com.cn/hnrb_epaper/html/{formatted_date}/node_201.htm"
        target_folder = os.path.join(os.path.expanduser("~"), 'Desktop', 'HNRB')
        os.makedirs(target_folder, exist_ok=True)

        pdf_links = fetch_pdf_links(base_url)
        downloaded_pdfs_info = [save_and_count_pages(link, target_folder) for link in pdf_links]

        # Drop the entries for failed downloads
        downloaded_pdfs_info = [info for info in downloaded_pdfs_info if info is not None]
        if not downloaded_pdfs_info:
            print("No PDF files to merge")
            return

        now = datetime.now()
        output_filename = f"HNRB_{now.strftime('%Y%m%d%H%M%S')}.pdf"

        # merge_downloaded_pdfs prints the path of the merged file itself
        merge_downloaded_pdfs(downloaded_pdfs_info, target_folder, output_filename)


if __name__ == "__main__":
    main()
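
The link-collection step in `fetch_pdf_links` can be tried offline, without touching the site. The following is only a sketch of the same filter (keep `href` values ending in `.pdf`, resolve them against the page URL) using the standard library's `html.parser` in place of BeautifulSoup; `PdfLinkParser` and the sample markup are made up for illustration, and the real page layout may differ:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="...pdf"> links (hypothetical helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Same filter as the script: keep only hrefs ending in .pdf.
        if href and href.endswith(".pdf"):
            self.links.append(urljoin(self.page_url, href))


# Made-up markup for illustration only, not copied from the real page.
SAMPLE_HTML = """
<a href="node_202.htm">Page 2</a>
<a href="pdf/sample01.pdf">PDF</a>
<a href="/files/sample02.pdf">PDF</a>
"""

PAGE_URL = "https://hnrb.voc.com.cn/hnrb_epaper/html/2024-04/16/node_201.htm"

parser = PdfLinkParser(PAGE_URL)
parser.feed(SAMPLE_HTML)
for link in parser.links:
    print(link)
```

The relative link is resolved against the directory of the page, while the link starting with `/` is resolved against the site root, which is why `urljoin` is used instead of plain string concatenation.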

uxin posted on 2024-4-17 07:15
Thanks for sharing