pdfkit HTML to PDF: can this be solved with a regular expression?
Last edited by wanpojie on 2021-9-9 01:52
Target URL: http://www.zjzj.net/news/quota/16
The code I have written so far is as follows:
import requests  # HTTP requests; third-party module: pip install requests
import parsel  # HTML parsing; third-party module: pip install parsel
import os  # file operations
import re  # regular expressions
import pdfkit  # pip install pdfkit
import time
html_str = """
<!DOCTYPE HTML>
<html lang="en">
<head>
<title>Document</title><meta charset="utf-8">
</head>
<body>
{article}
</body >
</html >
"""
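# Create the output folders
filename = 'pdf\\'  # folder for the generated PDF files
filename_1 = 'html\\'  # folder for the intermediate HTML files
if not os.path.exists(filename):  # if the folder does not exist yet
    os.mkdir(filename)  # create it
if not os.path.exists(filename_1):  # if the folder does not exist yet
    os.mkdir(filename_1)  # create it
# Replace characters that are not allowed in file names
def changetitle(name):
    mode = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(mode, '_', name)
    return new_name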
# Send the requests
headers={
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding":"gzip, deflate",
"Accept-Language":"zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
"Connection":"keep-alive",
"Cookie":"JSESSIONID=01FDF84ADF27D5E94ABF62FEC47B2805; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1",
"Host":"www.zjzj.net",
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
}
for num in range(1, 3):
    url2 = f"http://www.zjzj.net/news/quota/16?pageIndex={num}"
    res = requests.get(url=url2, headers=headers)
    selector = parsel.Selector(res.text)
    cont = selector.css('.lists').get()
    titles = selector.css('.lists a::attr(title)').getall()
    urls = selector.css('.lists a::attr(href)').getall()
    for (title, n) in zip(titles, urls):
        URL_base = "http://www.zjzj.net/"
        url3 = URL_base + n
        print(title, url3)
        response = requests.get(url=url3, headers=headers)
        selector = parsel.Selector(response.text)
        new_title = changetitle(title)
        print("Fetched the page for", new_title)
        content_views = selector.css('.info').get()
        print("Extracted the content of", new_title)
        html_content = html_str.format(article=content_views)
        print("Wrapped the content of", new_title, "in the HTML template")
        print(html_content)
        html_path = filename_1 + new_title + '.html'
        pdf_path = filename + new_title + '.pdf'
        with open(html_path, mode='w', encoding='utf-8') as f:
            f.write(html_content)
        config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
        pdfkit.from_file(html_path, pdf_path, configuration=config)
        print(title, "saved; resting for 1 second")
        time.sleep(1)
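The error is: Exit with code 1 due to network error: ProtocolUnknownError. It looks like the HTML-to-PDF rendering fails on the part marked with the red box in the screenshot above, and I don't know how to fix it. Should I strip the red-boxed part with a regular expression?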
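One possible alternative to deleting the markup is to rewrite it. The sample page posted later in this thread uses root-relative image paths such as /uploads/..., which no longer point at the website once the HTML is saved to a local file. The sketch below is only an illustration, not a confirmed fix: it uses a regular expression together with urljoin to turn relative src/href values into absolute URLs. BASE_URL and the helper name absolutize_links are assumptions introduced for this example.

```python
import re
from urllib.parse import urljoin

# Assumption: the site root that relative links should resolve against.
BASE_URL = "http://www.zjzj.net/"

# Matches src="..." or href="..." and captures attribute name, quote and value.
ATTR_PATTERN = re.compile(r'''\b(src|href)=(["'])(.*?)\2''', re.IGNORECASE)


def absolutize_links(html, base_url=BASE_URL):
    """Rewrite relative src/href values so they point at base_url."""
    def repl(match):
        attr, quote, value = match.group(1), match.group(2), match.group(3)
        if not value:
            return match.group(0)  # leave empty attributes untouched
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"
    return ATTR_PATTERN.sub(repl, html)


sample = '<img src="/uploads/batchProduct/202108/0516591550.jpg" width="750">'
fixed = absolutize_links(sample)
print(fixed)
```

In the script above, this would be applied to content_views before it is inserted into html_str. A regular expression is enough for this simple markup; for messier pages an HTML parser would be the safer choice.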
Is there a way to skip the current iteration of the for loop when the error occurs, move on to the next one, and print a notice?
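Skipping a failed item is usually done with try/except plus continue. The traceback in this thread shows pdfkit raising IOError, which is an alias of OSError in Python 3, so that is the exception caught here. The sketch below is a minimal illustration: convert_all and fake_convert are hypothetical names, and fake_convert only stands in for pdfkit.from_file so the example runs without wkhtmltopdf installed.

```python
def convert_all(jobs, convert):
    """Call convert(html_path, pdf_path) for every job, skipping failures.

    jobs: iterable of (title, html_path, pdf_path) tuples.
    convert: any callable taking (html_path, pdf_path).
    Returns the list of titles whose conversion failed.
    """
    failed = []
    for title, html_path, pdf_path in jobs:
        try:
            convert(html_path, pdf_path)
        except OSError as exc:  # IOError is an alias of OSError in Python 3
            print(f"Skipping {title}: {exc}")
            failed.append(title)
            continue  # go straight to the next item
        print(f"Saved {title}")
    return failed


# Stand-in converter (hypothetical) that fails for one particular path.
def fake_convert(html_path, pdf_path):
    if "bad" in html_path:
        raise OSError("wkhtmltopdf reported an error")


jobs = [("a", "html/a.html", "pdf/a.pdf"), ("b", "html/bad.html", "pdf/b.pdf")]
failed_titles = convert_all(jobs, fake_convert)
print("Failed:", failed_titles)
```

In the original loop the same effect can be had by wrapping the pdfkit.from_file(...) call in try/except OSError and calling continue in the except branch.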
You are quite prolific!
Quoting 叫我小王叔叔 (posted 2021-9-9 09:04): You are quite prolific!
It is all the same routine.
HTML to PDF: why not try wkhtmltox?
Quoting rsnodame (posted 2021-9-9 19:26): HTML to PDF: why not try wkhtmltox?
<!DOCTYPE HTML>
<html lang="en">
<head>
<title>Document</title><meta charset="utf-8">
</head>
<body>
<div class="cont">
<div class="info">
<div class="tit">
<h1>关于公布“百年峥嵘 筑梦浙江”红色主题征集活动评选结果的通知</h1>
<h2>-- www.zjzj.net 2021-08-05 --</h2>
<span>本文件出处:
[<a href="/news/2/10">新闻中心-政策文件</a>]、
[<a href="/news/2/53">新闻中心-党史学习教育</a>]
</span>
</div>
<div class="container" id="ContentTextb">
<p></p><div style="text-align: center;">浙建价协〔2021〕20号</div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516591550.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516590474.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516585360.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516584223.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516583011.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516581758.jpg" width="750" height="1000" alt=""><br></div>
</div>
</div>
</div>
</body >
</html >
Strangely, this particular piece of HTML cannot be converted to PDF, and wkhtmltopdf is exactly what I am using.
The error is as follows:
Traceback (most recent call last):
  File "D:/OneDrive/python/zhejiang/main.py", line 165, in <module>
    pdfkit.from_file(html_path, pdf_path, configuration=config)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
  File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
    raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Exit with code 1 due to network error: ProtocolUnknownError
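A note on reading this log: the six "Blocked access to file" warnings line up with the six img tags in the sample HTML, whose src values start with /uploads/. Loaded from a local file, those paths resolve to local file locations, and recent wkhtmltopdf builds appear to block local file access by default. One thing that may be worth trying, as an untested sketch, is passing wkhtmltopdf flags through pdfkit's options argument. The helper names below are hypothetical, and these flags only relax or ignore the failing loads; the images would still be missing unless their URLs are made absolute.

```python
def build_wkhtmltopdf_options():
    """Options in the form pdfkit expects: wkhtmltopdf flag names without '--'."""
    return {
        "enable-local-file-access": None,       # flag that takes no value
        "load-error-handling": "ignore",        # do not abort on a failed page load
        "load-media-error-handling": "ignore",  # do not abort on a failed image load
        "encoding": "UTF-8",
    }


def html_file_to_pdf(html_path, pdf_path, wkhtmltopdf_path):
    """Convert one HTML file, passing the options above to wkhtmltopdf."""
    import pdfkit  # imported here so the options helper works without pdfkit

    config = pdfkit.configuration(wkhtmltopdf=wkhtmltopdf_path)
    pdfkit.from_file(
        html_path,
        pdf_path,
        configuration=config,
        options=build_wkhtmltopdf_options(),
    )


options = build_wkhtmltopdf_options()
print(options)
```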