wanpojie posted on 2021-9-9 01:48

pdfkit HTML to PDF: can a regular expression fix this?

This post was last edited by wanpojie on 2021-9-9 01:52

Target URL: http://www.zjzj.net/news/quota/16

Here is the code I have so far:

import requests  # HTTP requests; third-party module: pip install requests
import parsel  # HTML parsing; third-party module: pip install parsel
import os  # file-system operations
import re  # regular expressions
import pdfkit  # pip install pdfkit
import time

html_str = """
<!DOCTYPE HTML>
<html lang="en">
<head>
    <title>Document</title><meta charset="utf-8">
</head>
<body>
{article}
</body >
</html >
"""

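# Create the output folders

filename = 'pdf\\'  # folder for the generated PDF files
filename_1 = 'html\\'  # folder for the intermediate HTML files
if not os.path.exists(filename):  # if the folder does not exist yet
    os.mkdir(filename)  # create it

if not os.path.exists(filename_1):  # if the folder does not exist yet
    os.mkdir(filename_1)  # create it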

# Replace characters that are not allowed in file names
def changetitle(name):
    mode = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(mode, '_', name)
    return new_name
# Send the requests


headers={
      "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
      "Accept-Encoding":"gzip, deflate",
      "Accept-Language":"zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
      "Connection":"keep-alive",
      "Cookie":"JSESSIONID=01FDF84ADF27D5E94ABF62FEC47B2805; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1",
      "Host":"www.zjzj.net",
      "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36",
}

for num in range(1, 3):
    # Fetch one page of the article list
    url2 = f"http://www.zjzj.net/news/quota/16?pageIndex={num}"
    res = requests.get(url=url2, headers=headers)
    selector = parsel.Selector(res.text)
    cont = selector.css('.lists').get()
    titles = selector.css('.lists a::attr(title)').getall()
    urls = selector.css('.lists a::attr(href)').getall()
    for (title, n) in zip(titles, urls):
        URL_base = "http://www.zjzj.net/"
        url3 = URL_base + n
        print(title, url3)
        # Fetch the article page and extract its main content
        response = requests.get(url=url3, headers=headers)
        selector = parsel.Selector(response.text)
        new_title = changetitle(title)
        print("Fetched the page for", new_title)
        content_views = selector.css('.info').get()
        print("Extracted the content of", new_title)
        html_content = html_str.format(article=content_views)
        print("Wrapped the content of", new_title, "in the HTML template")
        print(html_content)
        html_path = filename_1 + new_title + '.html'
        pdf_path = filename + new_title + '.pdf'
        with open(html_path, mode='w', encoding='utf-8') as f:
            f.write(html_content)
        # Convert the saved HTML file to PDF with wkhtmltopdf
        config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
        pdfkit.from_file(html_path, pdf_path, configuration=config)
        print(title, "saved; pausing for 1 second")
        time.sleep(1)






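It fails with "Exit with code 1 due to network error: ProtocolUnknownError". The error seems to come from rendering the part marked with a red box in the screenshot above during the HTML-to-PDF conversion, and I don't know how to fix it. Should I strip the red-boxed part with a regular expression?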



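Is there a way to skip the current iteration of the for loop when the error occurs, print a message, and move on to the next iteration?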
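One way to skip a failing item is to wrap the conversion call in try/except and `continue`. The traceback in this thread shows pdfkit raising `IOError` (an alias of `OSError` in Python 3), so that is what the sketch below catches. `convert_all` and its `convert` parameter are hypothetical names introduced only for this example; `convert` stands in for the real pdfkit call so the loop can be tried without wkhtmltopdf installed.

```python
def convert_all(jobs, convert):
    """Run convert(html_path, pdf_path) for every job, skipping failures.

    jobs    -- iterable of (title, html_path, pdf_path) tuples
    convert -- hypothetical stand-in for the pdfkit call
    Returns the list of titles that could not be converted.
    """
    failed = []
    for title, html_path, pdf_path in jobs:
        try:
            convert(html_path, pdf_path)
        except OSError as exc:
            # pdfkit raises IOError/OSError when wkhtmltopdf reports an error.
            message = str(exc)
            first_line = message.splitlines()[0] if message else "unknown error"
            print(f"Skipping {title!r}: {first_line}")
            failed.append(title)
            continue  # move on to the next job
        print(f"Saved {title!r} to {pdf_path}")
    return failed
```

In the script above, `convert` could be something like `lambda src, dst: pdfkit.from_file(src, dst, configuration=config)`. The same try/except/continue can also be written directly inside the existing inner `for` loop.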


叫我小王叔叔 posted on 2021-9-9 09:04

You're quite prolific!

wanpojie posted on 2021-9-9 10:37

叫我小王叔叔 posted on 2021-9-9 09:04
You're quite prolific!

They all follow the same pattern.

rsnodame posted on 2021-9-9 19:26

HTML to PDF: why not try wkhtmltox?

wanpojie posted on 2021-9-9 21:49

rsnodame posted on 2021-9-9 19:26
HTML to PDF: why not try wkhtmltox?

<!DOCTYPE HTML>
<html lang="en">
<head>
    <title>Document</title><meta charset="utf-8">
</head>
<body>
<div class="cont">
      <div class="info">
          <div class="tit">
            <h1>关于公布“百年峥嵘 筑梦浙江”红色主题征集活动评选结果的通知</h1>
            <h2>-- www.zjzj.net    2021-08-05 --</h2>
                    <span>本文件出处:
                                    [<a href="/news/2/10">新闻中心-政策文件</a>]、
                                    [<a href="/news/2/53">新闻中心-党史学习教育</a>]
                    </span>
          </div>
          <div class="container" id="ContentTextb">
                        <p></p><div style="text-align: center;">浙建价协〔2021〕20号</div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516591550.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516590474.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516585360.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516584223.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516583011.jpg" width="750" height="1000" alt=""><br></div><div style="text-align: center;"><img src="/uploads/batchProduct/202108/0516581758.jpg" width="750" height="1000" alt=""><br></div>
          </div>
         
         
         
      </div>
      </div>
</body >
</html >


Strangely, this particular piece of HTML cannot be converted to PDF, and wkhtmltopdf is exactly what I am using.
The error is as follows:





Traceback (most recent call last):
File "D:/OneDrive/python/zhejiang/main.py", line 165, in <module>
    pdfkit.from_file(html_path, pdf_path, configuration=config)
File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\api.py", line 49, in from_file
    return r.to_pdf(output_path)
File "D:\OneDrive\python\zhoushan\venv\lib\site-packages\pdfkit\pdfkit.py", line 156, in to_pdf
    raise IOError('wkhtmltopdf reported an error:\n' + stderr)
OSError: wkhtmltopdf reported an error:
Loading pages (1/6)
Warning: Blocked access to file                                 
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Warning: Blocked access to file
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Error: Failed to load about:blank, with network status code 301 and http status code 0 - Protocol "about" is unknown
Counting pages (2/6)                                             
Resolving links (4/6)                                                      
Loading headers and footers (5/6)                                          
Printing pages (6/6)
Done                                                                     
Exit with code 1 due to network error: ProtocolUnknownError
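
A possible reading of this log, not confirmed anywhere in the thread: the six "Blocked access to file" warnings match the six `<img>` tags in the HTML above, whose `src` values start with `/uploads/`. When the page is loaded from a local .html file, such root-relative paths resolve to the local disk instead of www.zjzj.net, and as far as I know recent wkhtmltopdf builds block local file access by default. Under that assumption, rewriting `src` and `href` to absolute URLs before saving the HTML may avoid the error. This is a minimal sketch: `absolutize_urls` and `BASE_URL` are names made up for this example, and the regular expression only handles quoted attribute values.

```python
import re
from urllib.parse import urljoin

# Assumption: the scraped pages all come from this site.
BASE_URL = "http://www.zjzj.net/"

# Matches src="..." or href='...' and captures the prefix, quote and value.
_URL_ATTR = re.compile(r'''(\b(?:src|href)\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE)


def absolutize_urls(html, base_url=BASE_URL):
    """Return html with quoted src/href values resolved against base_url."""
    def _replace(match):
        prefix, quote, value = match.group(1), match.group(2), match.group(3)
        # urljoin leaves already-absolute URLs unchanged.
        return f"{prefix}{quote}{urljoin(base_url, value)}{quote}"

    return _URL_ATTR.sub(_replace, html)
```

In the script from the first post, this could be applied as `html_content = absolutize_urls(html_str.format(article=content_views))` just before the HTML file is written.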




