Batch Downloading Videos from WeChat Official Accounts

三滑稽甲苯 posted on 2024-8-4 22:10

[ Last edited by 三滑稽甲苯 on 2024-8-4 22:41 ]

# Batch Downloading Videos from WeChat Official Accounts

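Example URL (Base64-encoded): `aHR0cHM6Ly9tcC53ZWl4aW4ucXEuY29tL21wL2FwcG1zZ2FsYnVtP2FjdGlvbj1nZXRhbGJ1bSZhbGJ1bV9pZD0xNjQwODY5NjU4MTU1MDczNTQxI3dlY2hhdF9yZWRpcmVjdA==`
**Note: the forum decodes some characters that should have been left alone, so parts of the code that deal with decoding `&amp;` may be displayed incorrectly. Please refer to the source code linked at the end of this post.**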
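
The example URL is given in Base64 form. A minimal sketch for decoding it with the standard library follows; the variable names are only illustrative.

```python
import base64

# The Base64 string from the example above, copied verbatim
ENCODED = (
    "aHR0cHM6Ly9tcC53ZWl4aW4ucXEuY29tL21wL2FwcG1zZ2FsYnVtP2FjdGlvbj1nZXRhbGJ1bSZh"
    "bGJ1bV9pZD0xNjQwODY5NjU4MTU1MDczNTQxI3dlY2hhdF9yZWRpcmVjdA=="
)

# b64decode returns bytes; the URL itself is plain ASCII text
album_url = base64.b64decode(ENCODED).decode("utf-8")
print(album_url)
```

The decoded string is the album URL that the functions in this post expect as input.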

## Fetching the album page HTML

This step is fairly simple: copy the usual request headers into the code, then fetch the page with `requests`:

```python3
from requests import RequestException, Session

HEADERS = {
    # ...
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36 Edg/127.0.0.0",
}
x = Session()
x.headers.update(HEADERS)


def fetch_article_urls(album_url: str) -> list:
    """Fetch the list of article urls from the `album_url`."""
    assert album_url.startswith(
        "https://<domain>/mp/appmsgalbum?"
    ), "Invalid album URL"
    r = x.get(album_url)
    album_info = extract_album_info(r.text)  # TODO
    article_urls = extract_article_urls(album_info)  # TODO
    return article_urls
```

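Note that the code sends requests through `Session.get` rather than calling `requests.get` directly. The main reason is to keep cookies across the session, which helps avoid some anti-scraping measures. A `Session` also keeps connections alive, which speeds up subsequent requests.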

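## Extracting article links from the album

First open the album URL and click one of the links in it. Copy a distinctive keyword from that link (for example `__biz=Mzg5ODU0MjM2NA`), then go back to the album page, open the Sources panel, and search the page source for it:
![Search result in the page source](https://attach.52pojie.cn/forum/202408/04/221026yo0irqsmwd0qv1oo.jpg)
As you can see, the data we need turns up right away. Next, let GPT write a regular expression that extracts the contents of the `videoList` variable: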

```regex
var\s+videoList\s+=\s*(\[\s*\{.*?\}\s*\]);
```

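The slightly annoying part is parsing the matched string into a `list`. If it were standard JSON, `json.loads` would handle it easily, but its keys are not wrapped in double quotes, so `json.loads` cannot parse it. After some searching, I found that [`json5`](https://github.com/dpranke/pyjson5) fits the need well: it allows comments in the data, unquoted object keys, a trailing comma after the last array element, and so on. One more detail needs attention: each item's `pos_num` ends with a ` * 1`, which has to be removed. With that, we can write a function that extracts the data from the album HTML: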

```python3
from re import DOTALL, search, sub

from json5 import loads  # Third-party package: pip install json5


def extract_album_info(html: str) -> list:
    """Extract the album information from the HTML."""
    # Extract the `videoList` array string
    match = search(r"var\s+videoList\s+=\s*(\[\s*\{.*?\}\s*\]);", html, DOTALL)
    if not match:
        return []
    json_str = match.group(1)
    # Remove the ' * 1' that follows each `pos_num` value
    json_str = sub(r"\s+\*\s+1", "", json_str)
    data = loads(json_str)
    return data
```
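
To see why plain `json.loads` is not enough here, the following minimal sketch runs the same regular expression and ` * 1` cleanup on a made-up snippet. The sample HTML and its `title` field are assumptions for illustration; only `videoList` and `pos_num` come from the page described above.

```python
import json
import re

# Made-up sample that mimics the shape described above (not real page data)
SAMPLE_HTML = """
<script>
var videoList = [
  {
    title: 'Episode 1',
    pos_num: '1' * 1
  }
];
</script>
"""

match = re.search(
    r"var\s+videoList\s+=\s*(\[\s*\{.*?\}\s*\]);", SAMPLE_HTML, re.DOTALL
)
raw = match.group(1)

# Strip the JavaScript arithmetic suffix so only the literal value remains
cleaned = re.sub(r"\s+\*\s+1", "", raw)
print(cleaned)

# Keys are unquoted and strings use single quotes, so strict JSON parsing fails
try:
    json.loads(cleaned)
    strict_json_ok = True
except json.JSONDecodeError:
    strict_json_ok = False
print("json.loads succeeded:", strict_json_ok)
```

The cleaned string is still a JavaScript-style literal rather than strict JSON, which is the gap a JSON5 parser fills.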

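Next, we pull out the part of the data we care about: the URL of each article. This is relatively simple. The only thing to watch is decoding `&amp;` back to `&`, so the code is omitted.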
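
Since the code is omitted, here is a minimal sketch of what `extract_article_urls` might look like. It assumes each album item is a dict that stores the article link under a `url` key. That key name is a guess, not something the post confirms, so check it against the real data. `html.unescape` from the standard library decodes `&amp;` and other HTML entities.

```python
from html import unescape


def extract_article_urls(album_info: list) -> list:
    """Collect the decoded article URLs from the parsed album data."""
    urls = []
    for item in album_info:
        raw_url = item.get("url")  # Assumed key name; adjust to the real data
        if raw_url:
            urls.append(unescape(raw_url))  # Turns '&amp;' back into '&'
    return urls


# Made-up sample items for illustration
sample = [
    {"title": "Episode 1", "url": "https://example.com/s?__biz=abc&amp;idx=1"},
    {"title": "No link"},
]
print(extract_article_urls(sample))
```

Items without a link are skipped rather than raising, which keeps one malformed entry from aborting the whole batch.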

## Fetching the article page HTML

This step is essentially the same as fetching the album page HTML, so it is not repeated here.

```python3
def download_single(article_url: str, filename: str):
    """Download a single video from the given `article_url`."""
    assert article_url.startswith("https://<domain>/s?"), "Invalid article URL"
    print(f"Extracting video from {article_url}...")
    r = x.get(article_url)
    # The page shows "环境异常" ("abnormal environment") when the request is blocked
    if "环境异常" in r.text:
        print("Aborted due to detected environment exception.")
        return
    data = extract_video_info(r.text)  # TODO
    if not data:
        print("No video found.")
        return
    item = best_quality(data)  # TODO
    item["url"] = item["url"].replace("&amp;", "&")  # Decode the HTML entity
    download_video(item["url"], f"{filename}.mp4")  # TODO
```

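## Extracting the best-quality video link from the article

This step is essentially the same as extracting article links from the album. Only the regular expression, the decoding, and the special-case handling differ, so it is not explained again here.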

```python3
from re import DOTALL, search, sub

from json5 import loads  # Third-party package: pip install json5


def extract_video_info(html: str) -> list:
    """Extract the video information from the HTML."""
    # Extract the __mpVideoTransInfo array string
    match = search(
        r"window\.__mpVideoTransInfo\s*=\s*(\[\s*\{.*?\},\s*\]);", html, DOTALL
    )
    if not match:
        return []
    json_str = match.group(1)
    # Remove '* 1 || 0'
    json_str = sub(r"\s*\*\s*1\s*\|\|\s*0", "", json_str)
    # Only keep url in '(url).replace(/^http(s?):/, location.protocol)'
    json_str = sub(
        r"\(\s*(\'http[^\)]*)\)\.replace\(\s*/\^http\(s\?\):/, location\.protocol\s*\)",
        r"\1",
        json_str,
    )
    data = loads(json_str)
    return data


def best_quality(data: list) -> dict:
    """Return the entry describing the best quality video."""
    # Consider first `video_quality_level`, then `filesize`
    # Note that they're both strings, so we should convert them to integers
    # (an empty string falls back to 0 before the conversion)
    item = max(
        data,
        key=lambda v: (int(v["video_quality_level"] or 0), int(v["filesize"] or 0)),
    )
    return item
```
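
A small usage sketch of the selection logic, run on made-up candidates (the `url` values are placeholders, not real links). It also shows why the string fields are converted to integers first: compared as strings, `"9"` would rank above `"10"`.

```python
def best_quality(data: list) -> dict:
    """Return the entry with the highest quality level, then the largest file size."""
    return max(
        data,
        key=lambda v: (int(v["video_quality_level"] or 0), int(v["filesize"] or 0)),
    )


# Made-up candidates; real entries come from the parsed page data
candidates = [
    {"video_quality_level": "9", "filesize": "900", "url": "level-9"},
    {"video_quality_level": "10", "filesize": "500", "url": "level-10-small"},
    {"video_quality_level": "10", "filesize": "800", "url": "level-10-large"},
    {"video_quality_level": "", "filesize": "", "url": "unknown"},
]

best = best_quality(candidates)
print(best["url"])

# Lexicographic comparison of the raw strings would pick the wrong level
naive_level = max(c["video_quality_level"] for c in candidates)
print(naive_level)
```

The tuple key means `filesize` only breaks ties between entries that share the same quality level.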

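## Video download with resume support

Anti-scraping checks are stricter for the video download itself, so the `Host`, `Origin`, and `Referer` request headers have to be added. In addition, since we want resumable downloads, we build the `Range` request header ourselves to fetch just the part we still need. The code for resuming and for streaming large files is adapted from elsewhere, and the source is listed at the end of this post. I'll take a shortcut here and simply paste my download function: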

```python3
from os import path
from shutil import move
from time import sleep

# VIDEO_HEADERS, RETRIES, CHUNK_SIZE and INTERVAL are module-level constants;
# see the linked source for their definitions.


def download_video(video_url: str, filename: str) -> bool:
    """Download the video given the video URL. Return True on success."""
    print(f"🔍 Downloading {filename}...", end="\r")
    tmp_file_path = filename + ".tmp"
    if not path.exists(filename) or path.exists(tmp_file_path):
        try:
            # Probe request: only the headers are needed to learn the total size
            r = x.get(video_url, headers=VIDEO_HEADERS, stream=True)
            r.raise_for_status()  # Raise an exception on a 4xx/5xx response
            total_size = int(r.headers["Content-Length"])
            r.close()
            if path.exists(tmp_file_path):
                tmp_size = path.getsize(tmp_file_path)
                print(f"Already downloaded {tmp_size} Bytes out of {total_size} Bytes ({100 * tmp_size / total_size:.2f}%)")
                if tmp_size == total_size:
                    move(tmp_file_path, filename)
                    print(f"✅ Downloaded {filename} successfully.")
                    return True
                elif tmp_size > total_size:
                    print("❌ The downloaded .tmp file is larger than the remote file. It is likely corrupted.")
                    return False
            else:
                tmp_size = 0
                print(f"File is {total_size} Bytes, downloading...")

            with open(tmp_file_path, "ab") as f:
                retries = 0
                while retries < RETRIES:
                    try:
                        # Ask only for the bytes that are still missing
                        res = x.get(video_url, headers={**VIDEO_HEADERS, "Range": f"bytes={tmp_size}-"}, stream=True)
                        res.raise_for_status()
                        for chunk in res.iter_content(chunk_size=CHUNK_SIZE):
                            tmp_size += len(chunk)
                            f.write(chunk)
                            f.flush()

                            done = int(50 * tmp_size / total_size)
                            print(f"\r[{'█' * done}{' ' * (50 - done)}] {100 * tmp_size / total_size:.0f}%", end="")
                        break
                    except RequestException:
                        retries += 1
                        print(f"\n⚠️ Retrying... ({retries}/{RETRIES})")
                        sleep(INTERVAL)
                else:
                    print(f"\n❌ Failed to download {filename} after {RETRIES} retries.")
                    return False

            if tmp_size == total_size:
                move(tmp_file_path, filename)
                print(f"\n✅ Downloaded {filename} successfully.")
                return True
            print(f"\n❌ Size mismatch for {filename}: got {tmp_size} of {total_size} Bytes.")
            return False

        except RequestException as e:
            # Log the error
            print(e)
            with open(filename + "_log.txt", "a+", encoding="UTF-8") as f:
                f.write("%s, %s\n" % (video_url, e))
            print(f"❌ Failed to download {filename}.")
            return False
    else:
        print(f"✅ Downloaded {filename} successfully.")
        return True
```
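
The post does not show `VIDEO_HEADERS` or isolate the `Range` computation, so the following is only a minimal sketch under assumptions. `build_video_headers` and `resume_range` are hypothetical helpers, and deriving `Host` from the video URL and `Origin`/`Referer` from the article URL is one plausible choice, not necessarily what the linked source does. The URLs are placeholders.

```python
import os
import tempfile
from urllib.parse import urlsplit


def build_video_headers(video_url: str, article_url: str) -> dict:
    """Build the extra headers for the video request (hypothetical helper)."""
    article = urlsplit(article_url)
    return {
        "Host": urlsplit(video_url).netloc,
        "Origin": f"{article.scheme}://{article.netloc}",
        "Referer": article_url,
    }


def resume_range(tmp_file_path: str) -> dict:
    """Return a Range header that continues after the bytes already on disk."""
    downloaded = os.path.getsize(tmp_file_path) if os.path.exists(tmp_file_path) else 0
    # 'bytes=N-' asks the server for everything from offset N to the end
    return {"Range": f"bytes={downloaded}-"}


# Simulate a partial download of 1024 bytes
with tempfile.TemporaryDirectory() as tmp_dir:
    partial = os.path.join(tmp_dir, "video.mp4.tmp")
    with open(partial, "wb") as f:
        f.write(b"\0" * 1024)
    headers = {
        **build_video_headers(
            "https://video.example.com/a.mp4", "https://example.com/s?id=1"
        ),
        **resume_range(partial),
    }
    fresh = resume_range(os.path.join(tmp_dir, "missing.tmp"))

print(headers)
print(fresh)
```

A server that honors the range replies with `206 Partial Content`. One that ignores it replies with `200` and the whole file, in which case appending the body to the `.tmp` file would corrupt it, so checking the status code before writing is a sensible extra guard.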

## Source code and references

- Source code: <https://github.com/PRO-2684/gadgets/blob/main/wechat_video/README_CN.md>
- Reference: <https://www.cnblogs.com/yanghao2008/p/16368311.html>
- [Forging request headers to bypass WeChat's anti-scraping measures](https://github.com/systemmin/wxdown/blob/b3173e19665717b835d96caa92d9aea3af6413db/internal/service/html_parallel.go#L84)

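baliao posted on 2024-8-5 08:31

Thanks for the analysis.
1. One of the links in the post seems to be wrong: is the link given for "Forging request headers to bypass WeChat's anti-scraping measures" incorrect?
2. About the parsing part: different video sources need different parsing. Take a look at this official account article:
https://mp.weixin.qq.com/s/tUAWO_o8kzpNo_2X-PDaxg
Its videos come mainly from two sources:
http://v.qq.com/x/page/
https://mp.weixin.qq.com
and each needs its own parsing method.
3. Could you also share how to scrape exam-paper or question-bank sites and write the results into Word?
I mean questions that are text rather than images, especially math: how do you write square roots, squares, fractions, and the like into Word?
For example: http://www.jyeoo.com/ (scan the QR code with any account to log in, then pick any junior or senior high school math subject). Thanks!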

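wudalang123 posted on 2024-8-5 20:01

For batch download jobs, multithreading or asynchronous programming can improve throughput considerably, because several downloads run at the same time instead of one after another. Here are some ways to implement each:

### Multithreading:

1. **Using the `threading` module**:
- Python's `threading` module can create multiple threads. You can start one thread per download task.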

```python
import threading


def download_task(video_url, filename):
    # Video download logic goes here
    pass


def main():
    threads = []
    video_urls = []  # Fill in your list of video URLs
    for url in video_urls:
        thread = threading.Thread(target=download_task, args=(url, url.split("/")[-1]))
        threads.append(thread)
        thread.start()

    # Wait for all threads to finish
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    main()
```

2. **Using `concurrent.futures.ThreadPoolExecutor`**:
- `ThreadPoolExecutor` manages a pool of threads, which makes creating and managing them more convenient.

```python
from concurrent.futures import ThreadPoolExecutor


def download_task(video_url, filename):
    # Video download logic goes here
    pass


def main():
    video_urls = []  # Fill in your list of video URLs
    with ThreadPoolExecutor(max_workers=5) as executor:
        # Consume the iterator so exceptions raised in workers surface here
        list(executor.map(lambda url: download_task(url, url.split("/")[-1]), video_urls))


if __name__ == "__main__":
    main()
```

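### Asynchronous programming:

1. **Using the `asyncio` module**:
- Python's `asyncio` module provides a framework for writing single-threaded concurrent code, using `async` and `await` to define and call asynchronous functions.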

```python
import asyncio


async def download_task(video_url, filename):
    # Asynchronous video download logic goes here
    pass


async def main():
    video_urls = []  # Fill in your list of video URLs
    tasks = [download_task(url, url.split("/")[-1]) for url in video_urls]
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```

2. **Using `aiohttp` for asynchronous HTTP requests**:
- If the download tasks involve HTTP requests, `aiohttp` is an HTTP client/server framework that supports asynchronous requests.

```python
import aiohttp
import asyncio


async def download_task(session, video_url, filename):
    async with session.get(video_url) as response:
        # Handle the asynchronous response here
        pass


async def main():
    video_urls = []  # Fill in your list of video URLs
    async with aiohttp.ClientSession() as session:
        tasks = [download_task(session, url, url.split("/")[-1]) for url in video_urls]
        await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(main())
```

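289051401 posted on 2024-8-4 22:30

Is this scraping the desktop client?

cbkxh posted on 2024-8-4 22:34

Thanks for sharing, I'll take a look.

hu11 posted on 2024-8-4 22:37

I'll study this carefully.

harry30979 posted on 2024-8-4 22:51

Thanks, I'll take my time going through it.

seetheplanet posted on 2024-8-5 00:02

Thanks for sharing!!

wan456 posted on 2024-8-5 00:13

Resumable file downloads in Python are really handy.

tutu2 posted on 2024-8-5 00:22

The downloaded video doesn't seem to be the original one. Any idea why?

W921027 posted on 2024-8-5 00:23

Thanks for sharing. Learned a lot, and it works well.

meder posted on 2024-8-5 00:35

Thanks for sharing 666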


吾爱破解 - 52pojie.cn's Archiver
