吾爱破解 - 52pojie.cn

[Help] pdfkit HTML-to-PDF conversion: can a regular expression solve this?

wanpojie posted on 2021-9-9 01:48
Last edited by wanpojie on 2021-9-9 01:52

Target URL: http://www.zjzj.net/news/quota/16

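The code I have written so far is as follows:
[Python]
import requests  # HTTP requests; third-party module: pip install requests
import parsel  # HTML parsing; third-party module: pip install parsel
import os  # file and directory operations
import re  # regular expressions
import pdfkit  # third-party module: pip install pdfkit
import time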

html_str = """
<!DOCTYPE HTML>
<html lang="en">
<head>
    <title>Document</title><meta charset="utf-8">
</head>
<body>
{article}
</body >
</html >
"""

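# Create the output folders

filename = 'pdf\\'  # output folder for the PDF files
filename_1 = 'html\\'  # output folder for the intermediate HTML files
if not os.path.exists(filename):  # if the folder does not exist yet
    os.mkdir(filename)  # create it

if not os.path.exists(filename_1):  # if the folder does not exist yet
    os.mkdir(filename_1)  # create it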

# Replace characters that are not allowed in file names
def changetitle(name):
    mode = re.compile(r'[\\/:*?"<>|]')
    new_name = re.sub(mode, '_', name)
    return new_name
# Send the requests


headers={
        "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding":"gzip, deflate",
        "Accept-Language":"zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
        "Connection":"keep-alive",
        "Cookie":"JSESSIONID=01FDF84ADF27D5E94ABF62FEC47B2805; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1",
        "Host":"www.zjzj.net",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
}

for num in range(1,3):
    url2 = f"http://www.zjzj.net/news/quota/16?pageIndex={num}"
    res = requests.get(url=url2, headers=headers)
    selector = parsel.Selector(res.text)
    cont = selector.css('.lists').get()
    titles = selector.css('.lists a::attr(title)').getall()
    urls = selector.css('.lists a::attr(href)').getall()
    for (title,n) in zip(titles,urls):
        URL_base = "http://www.zjzj.net/"
        url3 = URL_base+n
        print(title, url3)
        response = requests.get(url=url3, headers=headers)
        selector = parsel.Selector(response.text)
        new_title = changetitle(title)
        print("Fetched the page for", new_title)
        content_views = selector.css('.info').get()
        print("Extracted the content of", new_title)
        html_content = html_str.format(article=content_views)
        print("Wrapped the content of", new_title, "in the HTML template")
        print(html_content)
        html_path = filename_1 + new_title + '.html'
        pdf_path = filename + new_title + '.pdf'
        with open(html_path, mode='w', encoding='utf-8') as f:
            f.write(html_content)
        config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
        pdfkit.from_file(html_path, pdf_path, configuration=config)
        print(title, "saved, pausing for 1 second")
        time.sleep(1)



image.png


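It fails with "Exit with code 1 due to network error: ProtocolUnknownError". I suspect the HTML-to-PDF rendering fails on the part marked with the red box in the image above, but I do not know how to fix it. Could a regular expression strip out the red-boxed part?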
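A possible direction, sketched under assumptions: the sample HTML posted further down in this thread uses root-relative image paths such as src="/uploads/...". When the page is saved to a local .html file, those paths no longer point at the website, which may be what triggers the failure. The sketch below shows two regex options: rewriting root-relative src/href values to absolute URLs, or removing the img tags entirely. The names BASE_URL, absolutize_links and strip_images are hypothetical, and it is only a guess that the red-boxed part corresponds to these tags.

```python
import re

# Assumption: the site root taken from the URLs in the post; adjust as needed.
BASE_URL = "http://www.zjzj.net"

# Matches src="/..." or href="/..." (root-relative),
# but not protocol-relative values such as src="//host/...".
ROOT_RELATIVE = re.compile(r'''\b(src|href)=(["'])/(?!/)''')

# Matches a whole <img ...> tag.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)


def absolutize_links(html, base_url=BASE_URL):
    """Prefix root-relative src/href values with base_url (hypothetical helper)."""
    return ROOT_RELATIVE.sub(
        lambda m: f'{m.group(1)}={m.group(2)}{base_url}/', html
    )


def strip_images(html):
    """Remove every <img> tag from the HTML (hypothetical helper)."""
    return IMG_TAG.sub('', html)


sample = '<div><img src="/uploads/a.jpg" alt=""><a href="/news/2/10">x</a></div>'
print(absolutize_links(sample))
print(strip_images(sample))
```

Either function could be applied to content_views before it is inserted into html_str. Rewriting the paths keeps the scanned images in the PDF, while stripping them only avoids the load attempt. Regular expressions are fragile on arbitrary HTML, so this is meant for the simple markup shown in this thread.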



Is there a way to skip the current iteration of the for loop when the error occurs, move on to the next one, and print a message?
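Skipping a failed iteration can be sketched with try/except and continue. The traceback in this thread shows pdfkit raising IOError, which is the same class as OSError in Python 3, so that is the exception caught here. The names convert_all and fake_convert are hypothetical; fake_convert only stands in for the real pdfkit.from_file call so that the example runs without wkhtmltopdf installed.

```python
def convert_all(jobs, convert):
    """Call convert(html_path, pdf_path) for each job, skipping failures.

    Returns the titles of the jobs that failed.
    """
    failed = []
    for title, html_path, pdf_path in jobs:
        try:
            convert(html_path, pdf_path)
        except OSError as err:
            # Report the failure and move on to the next item.
            print(f"Skipping {title}: {err}")
            failed.append(title)
            continue
        print(f"{title} saved")
    return failed


def fake_convert(html_path, pdf_path):
    """Stand-in converter that fails for paths containing 'bad'."""
    if "bad" in html_path:
        raise OSError("wkhtmltopdf reported an error")


jobs = [
    ("a", "html/a.html", "pdf/a.pdf"),
    ("b", "html/bad.html", "pdf/bad.pdf"),
    ("c", "html/c.html", "pdf/c.pdf"),
]
failed = convert_all(jobs, fake_convert)
print("Failed:", failed)
```

In the original loop, the same pattern would wrap the pdfkit.from_file(html_path, pdf_path, configuration=config) call. Note that the wkhtmltopdf log in this thread prints "Done" before the non-zero exit code, so a PDF may have been written even when the exception is raised; it is worth checking the output file before discarding the item.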



叫我小王叔叔 posted on 2021-9-9 09:04
Quite prolific!
OP | wanpojie posted on 2021-9-9 10:37
rsnodame posted on 2021-9-9 19:26
OP | wanpojie posted on 2021-9-9 21:49
rsnodame posted on 2021-9-9 19:26
For HTML to PDF, why not try wkhtmltox?

[HTML]
<!DOCTYPE HTML>
<html lang="en">
<head>
    <title>Document</title><meta charset="utf-8">
</head>
<body>
<div class="cont">
        <div class="info">
          <div class="tit">
            <h1>关于公布“百年峥嵘 筑梦浙江”红色主题征集活动评选结果的通知</h1>
            <h2>-- www.zjzj.net    2021-08-05 --</h2>
            	<span>本文件出处:
            			[<a href="/news/2/10">新闻中心-政策文件</a>]、
            			[<a href="/news/2/53">新闻中心-党史学习教育</a>]
            	</span>
          </div>
          <div class="container" id="ContentTextb">
			<p></p><div style="text-align: center;">浙建价协〔2021〕20号</div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516591550.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516590474.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516585360.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516584223.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516583011.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516581758.jpg" width="750" height="1000" alt=""><br></div>
          </div>
         
          
          
        </div>
      </div>
</body >
</html >



Strangely, this particular HTML cannot be converted to PDF, and wkhtmltopdf is exactly what I am using.
The error is as follows:





Traceback (most recent call last):
  File "D:/OneDrive/python/zhejiang/main.py", line 165, in <module>
    pdfkit.from_file(html_path, pdf_path, configuration=config)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
    raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Blocked access to file                                   
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Counting pages (2/6)                                               
Resolving links (4/6)                                                      
Loading headers and footers (5/6)                                          
Printing pages (6/6)
Done                                                                     
Exit with code 1 due to network error: ProtocolUnknownError
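The "Blocked access to file" warnings suggest that wkhtmltopdf refused to read local files: the root-relative image paths in the saved HTML resolve to local file paths, and recent wkhtmltopdf builds block local file access by default. A minimal sketch of passing options through pdfkit follows, assuming a wkhtmltopdf build that knows the --enable-local-file-access flag (older builds may reject it as an unknown argument). The helper names build_pdfkit_options and html_file_to_pdf are hypothetical.

```python
def build_pdfkit_options(allow_local_files=True, ignore_load_errors=True):
    """Build an options dict for pdfkit (hypothetical helper).

    Keys are wkhtmltopdf long flags without the leading dashes;
    a value of None marks a flag that takes no argument.
    """
    options = {"encoding": "UTF-8"}
    if allow_local_files:
        # Assumption: the installed wkhtmltopdf supports this flag.
        options["enable-local-file-access"] = None
    if ignore_load_errors:
        # Keep rendering even if a page or an image fails to load.
        options["load-error-handling"] = "ignore"
        options["load-media-error-handling"] = "ignore"
    return options


def html_file_to_pdf(html_path, pdf_path, wkhtmltopdf_path):
    """Convert one HTML file to PDF using the options above."""
    # Imported here so the options helper can be tried without pdfkit installed.
    import pdfkit  # third-party: pip install pdfkit

    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    pdfkit.from_file(
        html_path,
        pdf_path,
        configuration=config,
        options=build_pdfkit_options(),
    )


print(build_pdfkit_options())
```

Allowing local file access does not make the images appear, because files such as /uploads/... do not exist on the local disk; it only changes how the failed loads are handled. To actually keep the images, the paths would need to point at the website, for example by turning them into absolute URLs before the HTML is saved.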





