吾爱破解 - 52pojie.cn

[Python, reposted] Downloading highway engineering construction-method PDFs from the China Highway Construction Association

封心棒棒糖 posted on 2022-5-12 13:24
Last edited by 封心棒棒糖 on 2022-5-13 15:39

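1. The problem with special characters in file names has been fixed.
2. Usage is simple, so I won't go into detail; the important parameters are explained in the code.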
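The file-name fix from item 1 can be illustrated in isolation. This is a minimal sketch that mirrors the substitution used in the script below; the function name `safe_filename` is hypothetical and not part of the original post.

```python
import re

# Characters that Windows does not allow in file names.
_INVALID = re.compile(r'[\\/:*?"<>|]')


def safe_filename(title):
    """Replace invalid file-name characters with "_" and drop spaces."""
    return _INVALID.sub("_", title).replace(" ", "")


print(safe_filename('Bridge deck: paving / curing?'))
```

Dropping spaces follows what the script does; it is not required by the file system.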
[Python]
import re

from bs4 import BeautifulSoup

import requests

headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/86.0.4240.75 Safari/537.36"
}


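def main(url):
    """Return a list of {"title", "url"} dicts for one listing page ([] on failure)."""
    try:
        # timeout added so a stalled connection cannot hang forever; 30 s is an arbitrary choice
        resp = requests.get(url=url, headers=headers, timeout=30)
        if resp.status_code == 200:
            resp.encoding = 'gb2312'
            soup = BeautifulSoup(resp.text, 'html.parser')
            info_list = []
            for item in soup.find_all('tr'):
                tds = item.find_all('td')
                if len(tds) < 5:
                    continue  # skip header rows and rows without the expected cells
                # Replace characters that are invalid in Windows file names, then drop spaces
                title = re.sub(r'[\\/:*?"<>|]', "_", tds[1].text).replace(" ", "")
                info_list.append({"title": title,
                                  "url": 'https://gfgl.chhca.org.cn/' + tds[4].find_all('a')[0]['href']})
            return info_list
    except Exception as e:
        print(f'Failed to fetch the list: {e}')
    return []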


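def download(info_list):
    for info in info_list or []:
        title = info['title']
        url = info['url']
        try:
            # timeout added so a stalled connection cannot hang forever; 30 s is an arbitrary choice
            resp = requests.get(url=url, headers=headers, timeout=30)
            if resp.status_code != 200:
                print(f'{title}: download failed: HTTP {resp.status_code}')
                continue
            # The viewer page carries the PDF path in a <frame src="..."> attribute
            pdf_url = re.findall('<frame src="(.*?)">', resp.text)[0]
            pdf_url = ('https://gfgl.chhca.org.cn' + pdf_url).replace('pdf/generic/web/viewer.html?file=/', '')
            resp_pdf = requests.get(url=pdf_url, headers=headers, timeout=30)
            if resp_pdf.status_code == 200:
                with open(f"{title}.pdf", 'wb') as f:
                    f.write(resp_pdf.content)
                print(f'{title}: download complete')
            else:
                print(f'{title}: download failed: HTTP {resp_pdf.status_code}')
        except Exception as e:
            print(f'{title}: download failed: {e}')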


if __name__ == '__main__':
    # range(1, 2) downloads page 1 only.
    # range(1, n) downloads pages 1 through n-1.
    # In range(start, stop), start is the first page and stop is one past the last page.
    for num in range(1, 2):
        print(f'Downloading page {num}...')
        info_list = main(f'https://gfgl.chhca.org.cn/?page={num}&jsfl=&gffl=&gjz=&sbnf=')
        download(info_list)
        print(f'Finished page {num}.')
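
To make the page-range convention concrete, here is a small sketch that builds the listing URLs for an inclusive range of pages. The helper name `page_urls` is hypothetical; the URL pattern is the one used in the script above.

```python
BASE = 'https://gfgl.chhca.org.cn/?page={num}&jsfl=&gffl=&gjz=&sbnf='


def page_urls(first, last):
    """Return the listing URLs for pages first..last, both inclusive."""
    # range() excludes its stop value, hence last + 1.
    return [BASE.format(num=num) for num in range(first, last + 1)]


for url in page_urls(1, 3):
    print(url)
```

Taking an inclusive last page avoids the off-by-one confusion of passing the stop value directly.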


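lsyh1688 posted on 2022-5-12 22:43
Last edited by lsyh1688 on 2022-5-12 22:45
封心棒棒糖 posted on 2022-5-12 20:50
At the bottom of the code, change the 2 in range(1, 2) to one more than the number of pages you want. By default only page 1 is downloaded; 3 downloads the first two pages, and n downloads the first n-1 pages.
How can I download everything?

It did not finish. It stopped with an error after 426 files. The error output is below; could you take a look?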
Traceback (most recent call last):
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 444, in wrap_socket
    cnx.do_handshake()
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1907, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\OpenSSL\SSL.py", line 1631, in _raise_ssl_error
    raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (10060, 'WSAETIMEDOUT')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 849, in _validate_conn
    conn.connect()
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\connection.py", line 356, in connect
    ssl_context=context)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\util\ssl_.py", line 359, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py", line 450, in wrap_socket
    raise ssl.SSLError('bad handshake: %r' % e)
ssl.SSLError: ("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')",)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='gfgl.chhca.org.cn', port=443): Max retries exceeded with url: /pdf/index.asp?gfid=9079 (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Pythondata/00 book.py", line 53, in <module>
    download(info_list)
  File "E:/Pythondata/00 book.py", line 35, in download
    resp = requests.get(url=url, headers=headers)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "E:\Pythonsoft1\Anaconda3\lib\site-packages\requests\adapters.py", line 511, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='gfgl.chhca.org.cn', port=443): Max retries exceeded with url: /pdf/index.asp?gfid=9079 (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

Process finished with exit code 1
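The traceback ends in a handshake timeout (`WSAETIMEDOUT`), which may be transient. One common way to cope, sketched here under that assumption, is to retry the failing call a few times with a growing delay. `with_retries` is a hypothetical helper, not part of the original script, and the attempt count and delays are arbitrary.

```python
import time


def with_retries(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds or attempts run out.

    Waits delay seconds after the first failure and multiplies the wait by
    backoff after each further failure. Re-raises the last exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

In the script this could wrap each request, for example `with_retries(lambda: requests.get(url, headers=headers, timeout=30))`.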
ymhld posted on 2022-5-13 09:04
            print("Downloading file", title)
            try:
                resp_pdf = requests.get(url=pdf_url, headers=headers)
                if resp_pdf.status_code == 200:
                    with open(f"{title}.pdf", 'wb') as f:
                        f.write(resp_pdf.content)
            except Exception as e:
                continue
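The snippet above skips a file when its request raises. Below is a minimal sketch of the same idea that also records which entries failed, so they can be retried afterwards. `download_all` and the `fetch` callable are hypothetical stand-ins for the script's request-and-save step.

```python
def download_all(info_list, fetch):
    """Try fetch(info) for every entry; return the entries that failed."""
    failed = []
    for info in info_list:
        try:
            fetch(info)
        except Exception as e:
            print(f"{info['title']}: download failed: {e}")
            failed.append(info)  # keep going with the next entry
    return failed
```

A second pass such as `download_all(failed, fetch)` can then retry only the entries that failed.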
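spp_wall posted on 2022-5-12 17:15
小赵有号了 posted on 2022-5-12 17:48
What is this? Could you add some explanation? Thanks.
wyq20110208 posted on 2022-5-12 18:22
Could you explain how to use it?
taxuewuhen posted on 2022-5-12 19:05
Thanks for sharing.
sp0770 posted on 2022-5-12 19:13
How do I use this?
ZDRAGON posted on 2022-5-12 19:28
Last edited by ZDRAGON on 2022-5-12 19:32

Just run it with Python. It is already downloading for me.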
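duzix posted on 2022-5-12 19:31
Fancy. Even downloading things is done with code now.
lsyh1688 posted on 2022-5-12 19:44
Thanks for sharing, I downloaded 30 files!

As a programming beginner I'm curious: are there only these 30?
lsyh1688 posted on 2022-5-12 19:58
I found the China Highway Construction Association website. It lists close to 3,000 construction-method entries.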
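
If the first page yields 30 files and the site lists close to 3,000 entries, the number of listing pages can be estimated as follows. Both figures come from the posts above; that every page holds 30 entries is an assumption, and the exact entry count is not given.

```python
import math

per_page = 30   # files obtained from page 1, assumed to be the page size
total = 3000    # "close to 3,000" entries, used as an approximate upper figure


def page_count(total, per_page):
    """Number of pages needed to list total entries, per_page at a time."""
    return math.ceil(total / per_page)


pages = page_count(total, per_page)
print(pages)  # 100
```

With these numbers, `range(1, pages + 1)` in the script's final loop would cover every page.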