pdfkit HTML to PDF: can a regular expression fix this?

wanpojie posted on 2021-9-9 01:48

Last edited by wanpojie on 2021-9-9 01:52

Target URL: http://www.zjzj.net/news/quota/16

Here is the code I have written so far:

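import requests  # HTTP requests, third-party module: pip install requests
import parsel  # HTML parsing, third-party module: pip install parsel
import os  # file and directory operations
import re  # regular expressions
import pdfkit  # wkhtmltopdf wrapper: pip install pdfkit
import time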

html_str = """
<!DOCTYPE HTML>
<html lang="en">
<head>
<title>Document</title><meta charset="utf-8">
</head>
<body>
{article}
</body >
</html >
"""

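# Create the output folders

filename = 'pdf\\'  # folder for the generated PDF files
filename_1 = 'html\\'  # folder for the intermediate HTML files
if not os.path.exists(filename):  # if the folder does not exist yet
    os.mkdir(filename)  # create it

if not os.path.exists(filename_1):  # if the folder does not exist yet
    os.mkdir(filename_1)  # create it

# Replace characters that are not allowed in Windows file names
def changetitle(name):
    mode = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(mode, '_', name)
    return new_name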
# Send the requests

headers={
   "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
   "Accept-Encoding":"gzip, deflate",
   "Accept-Language":"zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
   "Connection":"keep-alive",
   "Cookie":"JSESSIONID=01FDF84ADF27D5E94ABF62FEC47B2805; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1",
   "Host":"www.zjzj.net",
   "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
}

# Build the wkhtmltopdf configuration once, outside the loops
config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')

for num in range(1, 3):
    url2 = f"http://www.zjzj.net/news/quota/16?pageIndex={num}"
    res = requests.get(url=url2, headers=headers)
    selector = parsel.Selector(res.text)
    titles = selector.css('.lists a::attr(title)').getall()
    urls = selector.css('.lists a::attr(href)').getall()
    for (title, n) in zip(titles, urls):
        URL_base = "http://www.zjzj.net/"
        url3 = URL_base + n.lstrip('/')  # avoid a double slash when href starts with "/"
        print(title, url3)
        response = requests.get(url=url3, headers=headers)
        article_selector = parsel.Selector(response.text)
        new_title = changetitle(title)
        print("Fetched the page for", new_title)
        content_views = article_selector.css('.info').get()
        print("Extracted the content of", new_title)
        html_content = html_str.format(article=content_views)
        print("Wrapped the content of", new_title, "in the HTML template")
        print(html_content)
        html_path = filename_1 + new_title + '.html'
        pdf_path = filename + new_title + '.pdf'
        with open(html_path, mode='w', encoding='utf-8') as f:
            f.write(html_content)
        pdfkit.from_file(html_path, pdf_path, configuration=config)
        print(title, "saved; pausing for 1 second")
        time.sleep(1)

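It fails with "Exit with code 1 due to network error: ProtocolUnknownError". I think the HTML-to-PDF rendering goes wrong on the part marked with a red box in the screenshot above, but I don't know how to fix it. Should I strip the red-box part with a regular expression?

Is there a way to skip the current iteration of the for loop when the error occurs, move on to the next one, and print a notice?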
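
On the second question, a minimal sketch of skipping a failed conversion with try/except and continue. The traceback later in the thread shows pdfkit raising IOError, which is an alias of OSError in Python 3, so that is what gets caught here. The names convert_all and fake_convert are made up for illustration; fake_convert only stands in for the real pdfkit.from_file call so the example runs without wkhtmltopdf installed.

```python
def convert_all(jobs, convert):
    """Call convert(html_path, pdf_path) for every job and skip the ones that fail."""
    failed = []
    for title, html_path, pdf_path in jobs:
        try:
            convert(html_path, pdf_path)
        except OSError as err:
            # pdfkit reports wkhtmltopdf failures as IOError (an alias of OSError)
            print(f"Skipped {title}: {err}")
            failed.append(title)
            continue  # go straight to the next item
        print(f"Saved {title}")
    return failed


# Stand-in converter for demonstration only. In the real script this would be
# something like: lambda h, p: pdfkit.from_file(h, p, configuration=config)
def fake_convert(html_path, pdf_path):
    if "bad" in html_path:
        raise OSError("wkhtmltopdf reported an error")


failed_titles = convert_all(
    [
        ("a", "html/a.html", "pdf/a.pdf"),
        ("b", "html/bad.html", "pdf/bad.pdf"),
        ("c", "html/c.html", "pdf/c.pdf"),
    ],
    fake_convert,
)
```

Returning the list of failed titles makes it easy to retry or inspect them afterwards. Note that wkhtmltopdf can print "Done" and still exit with code 1, so a PDF file may already exist for a skipped item.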

叫我小王叔叔 posted on 2021-9-9 09:04

Quite prolific, aren't you!

wanpojie posted on 2021-9-9 10:37

叫我小王叔叔 posted on 2021-9-9 09:04
Quite prolific, aren't you!

They all follow the same pattern.

rsnodame posted on 2021-9-9 19:26

HTML to PDF: why not try wkhtmltox?

wanpojie posted on 2021-9-9 21:49

rsnodame posted on 2021-9-9 19:26
HTML to PDF: why not try wkhtmltox?

<!DOCTYPE HTML>
<html lang="en">
<head>
<title>Document</title><meta charset="utf-8">
</head>
<body>
<div class="cont">
   <div class="info">
      <div class="tit">
         <h1>关于公布“百年峥嵘筑梦浙江”红色主题征集活动评选结果的通知</h1>
         <h2>-- www.zjzj.net 2021-08-05 --</h2>
            <span>本文件出处：
            [<a href="/news/2/10">新闻中心-政策文件</a>]、
            [<a href="/news/2/53">新闻中心-党史学习教育</a>]
            </span>
      </div>
      <div class="container" id="ContentTextb">
<p></p><div style="text-align: center;">浙建价协〔2021〕20号</div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516591550.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516590474.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516585360.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516584223.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516583011.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516581758.jpg" width="750" height="1000" alt=""><br></div>
      </div>



   </div>
   </div>
</body >
</html >

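Strangely, this particular piece of HTML cannot be converted to PDF, and wkhtmltopdf is exactly what I am already using.
The error is as follows: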

Traceback (most recent call last):
  File "D:/OneDrive/python/zhejiang/main.py", line 165, in <module>
    pdfkit.from_file(html_path, pdf_path, configuration=config)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
    raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Exit with code 1 due to network error: ProtocolUnknownError
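
A possible explanation for the log above, offered as a guess rather than a confirmed diagnosis: the six "Blocked access to file" warnings match the six <img> tags whose src starts with /uploads/. Because the HTML is loaded from a local file, those root-relative paths resolve to local file paths instead of http://www.zjzj.net/..., and wkhtmltopdf 0.12.6 blocks local file access by default. Rather than deleting the images with a regular expression, a regex can rewrite them to absolute URLs before the HTML is saved. The sketch below assumes the site root is http://www.zjzj.net/; absolutize, BASE_URL and ATTR_PATTERN are illustrative names, not part of pdfkit.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.zjzj.net/"  # assumed site root, taken from the target URL

# Matches src="..." or href="..." and captures the attribute name, the quote
# character and the attribute value.
ATTR_PATTERN = re.compile(r'\b(src|href)=(["\'])(.*?)\2', re.IGNORECASE)


def absolutize(html, base_url=BASE_URL):
    """Rewrite relative src/href values in html to absolute URLs."""
    def repl(match):
        attr, quote, value = match.groups()
        # urljoin leaves values that are already absolute URLs unchanged
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"
    return ATTR_PATTERN.sub(repl, html)


sample = (
    '<img src="/uploads/batchProduct/202108/0516591550.jpg" width="750">'
    '<a href="/news/2/10">link</a>'
)
fixed = absolutize(sample)
print(fixed)
```

In the script from the first post this would be applied to html_content before it is written to html_path. wkhtmltopdf also has an --enable-local-file-access flag, which pdfkit accepts as options={'enable-local-file-access': None}, but on its own that would probably leave the images missing, since the files do not exist locally.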


吾爱破解 - 52pojie.cn's Archiver