pdfkit HTML to PDF: can this be solved with a regular expression?

wanpojie · posted 2021-9-9 01:48

Last edited by wanpojie on 2021-9-9 01:52

Target URL: http://www.zjzj.net/news/quota/16

Here is the code I have so far:

[Python]

import requests  # HTTP requests; third-party module: pip install requests
import parsel  # HTML parsing; third-party module: pip install parsel
import os  # file and directory operations
import re  # regular expressions
import pdfkit # pip install pdfkit
import time

html_str = """
<!DOCTYPE HTML>
<html lang="en">
<head>
    <title>Document</title><meta charset="utf-8">
</head>
<body>
{article}
</body >
</html >
"""

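# Create the output folders

filename = 'pdf\\'  # folder for the generated PDF files
filename_1 = 'html\\'  # folder for the intermediate HTML files
if not os.path.exists(filename):  # if the folder does not exist yet
    os.mkdir(filename)  # create it

if not os.path.exists(filename_1):  # if the folder does not exist yet
    os.mkdir(filename_1)  # create it

# Replace characters that are not allowed in file names
def changetitle(name):
    mode = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(mode, '_', name)
    return new_name
# Send the requests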


headers={
        "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding":"gzip, deflate",
        "Accept-Language":"zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
        "Connection":"keep-alive",
        "Cookie":"JSESSIONID=01FDF84ADF27D5E94ABF62FEC47B2805; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1",
        "Host":"www.zjzj.net",
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
}

for num in range(1,3):
    url2 = f"http://www.zjzj.net/news/quota/16?pageIndex={num}"
    res = requests.get(url=url2, headers=headers)
    selector = parsel.Selector(res.text)
    cont = selector.css('.lists').get()
    titles = selector.css('.lists a::attr(title)').getall()
    urls = selector.css('.lists a::attr(href)').getall()
    for (title,n) in zip(titles,urls):
        URL_base = "http://www.zjzj.net/"
        url3 = URL_base+n
        print(title, url3)
        response = requests.get(url=url3, headers=headers)
        selector = parsel.Selector(response.text)
        new_title = changetitle(title)
        print("Fetched the page for", new_title)
        content_views = selector.css('.info').get()
        print("Fetched the content for", new_title)
        html_content = html_str.format(article=content_views)
        print("Wrapped the content for", new_title, "in the HTML template")
        print(html_content)
        html_path = filename_1 + new_title + '.html'
        pdf_path = filename + new_title + '.pdf'
        with open(html_path, mode='w', encoding='utf-8') as f:
            f.write(html_content)
        config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
        pdfkit.from_file(html_path, pdf_path, configuration=config)
        print(title, "saved; pausing for 1 second")
        time.sleep(1)

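The script fails with "Exit with code 1 due to network error: ProtocolUnknownError". I suspect the HTML-to-PDF rendering breaks on the part boxed in red in the screenshot above, but I don't know how to fix it. Should I strip that part out with a regular expression?

Also, is there a way to skip the current iteration of the for loop when the error occurs, print a notice, and move on to the next one?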
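On the second question, one possible approach is to wrap the conversion in `try`/`except` and `continue`. The sketch below is only an illustration: `convert_all`, `fake_convert`, and `jobs` are hypothetical names, and the stand-in converter takes the place of `pdfkit.from_file` so the example runs without wkhtmltopdf installed. It assumes the failure surfaces as an `OSError`, which is what the traceback later in this thread shows pdfkit raising.

```python
def convert_all(jobs, convert):
    """Call convert(html_path, pdf_path) for each job.

    A job that raises OSError is reported and skipped, and the loop
    moves on to the next one. Returns the titles that failed.
    """
    failed = []
    for title, html_path, pdf_path in jobs:
        try:
            convert(html_path, pdf_path)
        except OSError as err:
            # Show only the last line of the (often long) wkhtmltopdf output.
            last_line = str(err).strip().splitlines()[-1]
            print(f"Skipped {title}: {last_line}")
            failed.append(title)
            continue
        print(f"Saved {title}")
    return failed


def fake_convert(html_path, pdf_path):
    # Hypothetical stand-in for pdfkit.from_file, used only for this demo.
    if "broken" in html_path:
        raise OSError(
            "wkhtmltopdf reported an error:\n"
            "Exit with code 1 due to network error: ProtocolUnknownError"
        )


jobs = [
    ("first", "html/first.html", "pdf/first.pdf"),
    ("broken", "html/broken.html", "pdf/broken.pdf"),
    ("last", "html/last.html", "pdf/last.pdf"),
]
failed = convert_all(jobs, fake_convert)
print("Failed titles:", failed)
```

In the original loop, the same pattern would go around the `pdfkit.from_file(html_path, pdf_path, configuration=config)` call. Collecting the failed titles makes it easy to retry or inspect them afterwards.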

叫我小王叔叔 · posted 2021-9-9 09:04

Quite prolific!

wanpojie · posted 2021-9-9 10:37

叫我小王叔叔 wrote on 2021-9-9 09:04:
Quite prolific!

They all follow the same pattern.

rsnodame · posted 2021-9-9 19:26

For HTML to PDF, why not try wkhtmltox?

wanpojie · posted 2021-9-9 21:49

rsnodame wrote on 2021-9-9 19:26:
For HTML to PDF, why not try wkhtmltox?

[HTML]

<!DOCTYPE HTML>
<html lang="en">
<head>
    <title>Document</title><meta charset="utf-8">
</head>
<body>
<div class="cont">
        <div class="info">
          <div class="tit">
            <h1>关于公布“百年峥嵘 筑梦浙江”红色主题征集活动评选结果的通知</h1>
            <h2>-- www.zjzj.net    2021-08-05 --</h2>
            	<span>本文件出处：
            			[<a href="/news/2/10">新闻中心-政策文件</a>]、
            			[<a href="/news/2/53">新闻中心-党史学习教育</a>]
            	</span>
          </div>
          <div class="container" id="ContentTextb">
			<p></p><div style="text-align: center;">浙建价协〔2021〕20号</div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516591550.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516590474.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516585360.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516584223.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516583011.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516581758.jpg" width="750" height="1000" alt=""><br></div>
          </div>
         
          
          
        </div>
      </div>
</body >
</html >

Strangely, this particular piece of HTML cannot be converted to PDF, and wkhtmltopdf is exactly what I am using.
The error is as follows:

Traceback (most recent call last):
  File "D:/OneDrive/python/zhejiang/main.py", line 165, in <module>
    pdfkit.from_file(html_path, pdf_path, configuration=config)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
    raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Exit with code 1 due to network error: ProtocolUnknownError
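
A likely reading of this log: the six "Blocked access to file" warnings line up with the six `<img>` tags, whose `src` values are site-relative (`/uploads/...`). When the HTML is loaded from a local file, those paths appear to resolve to local `file://` locations instead of the website, and wkhtmltopdf refuses to load them. Rather than deleting the images with a regular expression, one option is to rewrite relative `src` and `href` values into absolute URLs before saving the HTML. The sketch below assumes `http://www.zjzj.net/` is the correct base; `absolutize`, `ATTR_RE`, and `BASE_URL` are hypothetical names introduced for illustration.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.zjzj.net/"  # assumed base of the scraped site

# Matches src="..." or href="..." with either quote style.
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:src|href))\s*=\s*(?P<quote>["'])(?P<url>.*?)(?P=quote)''',
    re.IGNORECASE,
)


def absolutize(html, base_url=BASE_URL):
    """Rewrite relative src/href values into absolute URLs.

    URLs that are already absolute are left unchanged by urljoin.
    """
    def repl(match):
        url = urljoin(base_url, match.group("url"))
        quote = match.group("quote")
        return f'{match.group("attr")}={quote}{url}{quote}'

    return ATTR_RE.sub(repl, html)


sample = (
    '<img src="/uploads/batchProduct/202108/0516591550.jpg" width="750">'
    '<a href="/news/2/10">link</a>'
)
fixed = absolutize(sample)
print(fixed)
```

In the script from the first post, this would be applied as `html_content = absolutize(html_content)` just before the HTML is written to disk. Whether that removes the error completely depends on the wkhtmltopdf version and on the images being reachable over the network, so it is worth testing on a single page first.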
