吾爱破解 - 52pojie.cn

Views: 9315 | Replies: 71

[Python Original] A music website crawler that downloads songs

sakura32 posted on 2024-1-10 19:45
Last edited by sakura32 on 2024-1-13 13:04

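Features:
Search for a song or artist to get a numbered result list, then pick a number from the list to download that song. The album cover and lyrics are embedded into the file automatically (the lyrics-embedding code is buggy and currently does not embed them correctly).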

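Usage notes:
1. Playwright must be set up first.
2. The script cannot be run directly in the Python console (the window closes immediately, cause unknown); it runs fine in PyCharm.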
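
As a rough preflight check before running main.py, the sketch below reports which of the third-party packages imported by the three scripts are not installed yet. The helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the original project. Playwright also needs its browser binaries; its own error message says to fetch them with `playwright install`.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by main.py, music_info.py and down_music.py.
REQUIRED = ['requests', 'lxml', 'tqdm', 'playwright', 'mutagen']

if __name__ == '__main__':
    missing = missing_packages(REQUIRED)
    if missing:
        print('Install these first: pip install ' + ' '.join(missing))
    else:
        # Playwright also needs a browser download: run `playwright install`.
        print('All Python packages found')
```

Only top-level package names should be passed in, since `find_spec` raises for a dotted name whose parent package is missing.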

Other notes:
The site's catalog is average, the audio quality is average, and the LRC lyrics are of fairly poor quality.

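Additional note: the site being crawled is gequbao.com. The site itself works fine when used directly, but after a few direct-link downloads it hides the link and asks you to follow its official account. The workaround: the site has a preview (listen) feature, and the URL the preview points to is the download URL. It is barely hidden and not encrypted, so you only need to capture that URL. You can do this with the network panel in the browser's Inspect / Inspect Element tools, or with a resource-sniffing browser extension.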
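
The sniffing step can be sketched as a filter over the request URLs captured from the network panel: keep the first one whose path ends in an audio extension. The function name `find_audio_link`, the extension list, and the sample URLs are hypothetical illustrations and are not taken from the site.

```python
from urllib.parse import urlparse

AUDIO_EXTENSIONS = ('.mp3', '.m4a', '.flac')  # assumed set; adjust as needed


def find_audio_link(urls):
    """Return the first URL whose path ends with an audio extension, or None."""
    for url in urls:
        # Compare against the path only, so query strings such as ?token=... are ignored.
        path = urlparse(url).path.lower()
        if path.endswith(AUDIO_EXTENSIONS):
            return url
    return None


# Made-up URLs standing in for what a network capture might list.
captured = [
    'https://example.com/static/app.js',
    'https://example.com/cover/123.jpg',
    'https://cdn.example.com/audio/123.mp3?token=abc',
]
print(find_audio_link(captured))
```

Matching on the parsed path rather than on the whole URL avoids false hits when `.mp3` only appears inside a query parameter.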

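Further note: I also wrote a crawler for another site with a larger catalog, but that site has taken down some copyrighted songs (for example, Jay Chou's).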

Screenshot:
QQ截图20240110193726.png

Source code: https://github.com/PPJUST/Music-Spider
main.py
[Python]
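# Main program
import re
import time

import requests  # imported explicitly instead of relying on the star import from down_music
from lxml import html
from tqdm import tqdm

from down_music import *
from music_info import *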

etree = html.etree
baseurl_search = r'https://www.gequbao.com/s/'
baseurl_homepage = r'https://www.gequbao.com'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"}


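def get_search_result(keyword):
    """Fetch the raw HTML of the search results page."""
    url_search = baseurl_search + keyword
    response = requests.get(url_search, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        print('Unexpected response status code:', response.status_code)
        return ''  # empty page, so get_urls() finds no links instead of failing on None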


def get_urls(html_str: str):
    """Extract the song page links with a regular expression."""
    pattern = r'<a href="(/music/\d+)" target'
    short_urls = re.findall(pattern, html_str)  # short links such as /music/402856
    urls = [baseurl_homepage + i for i in short_urls]  # build the full links
    return urls


def get_music_info(urls: list):
    """Build a dict that maps each song page URL to its info dict."""
    url_info_dict = {}  # {url: {info fields}, ...}
    for url in tqdm(urls, bar_format='{l_bar}{bar}| {n_fmt}/{total_fmt}'):
        spider = MusicInfo(url)
        info_dict = spider.get_info()
        url_info_dict[url] = info_dict
        time.sleep(0.2)

    return url_info_dict


def show_music_list(url_info_dict: dict):
    """Print the song list with index numbers."""
    for index, info_dict in enumerate(url_info_dict.values(), start=1):
        music_name = info_dict['music_name']
        print(index, music_name)


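def down_music(url, info_dict, retries=3):
    """Download a song, re-fetching the link a limited number of times if it has expired."""
    spider = DownMusic(info_dict)
    # Check whether the download succeeded; if not, fetch a fresh link and retry
    if spider.is_error():
        if retries <= 0:  # stop instead of recursing forever
            print('Download failed: no valid link could be obtained')
            return
        print('Download link has expired, trying to fetch a new one')
        re_spider = MusicInfo(url)
        re_info_dict = re_spider.get_info()
        return down_music(url, re_info_dict, retries - 1)
    else:
        print('Download complete')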


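def main():
    while True:
        keyword = input('Enter a song name/artist, then press Enter to search: ').strip()
        html_str = get_search_result(keyword)
        urls = get_urls(html_str)
        url_info_dict = get_music_info(urls)
        show_music_list(url_info_dict)

        while True:
            choice = input('Enter a song number and press Enter to download (0 returns to search): ').strip()
            # Reject non-numeric input and numbers beyond the end of the list
            if not choice.isdecimal() or int(choice) > len(url_info_dict):
                print('Invalid number')
                continue
            number = int(choice)
            if number == 0:
                break
            select_url, select_info_dict = list(url_info_dict.items())[number - 1]
            down_music(select_url, select_info_dict)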


if __name__ == '__main__':
    main()
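
To make the link-extraction step in get_urls() concrete, here is a small worked example that runs the same regular expression over a made-up HTML fragment. The fragment is hypothetical and only mimics the shape the pattern expects; the real search page may differ.

```python
import re

BASE_URL = 'https://www.gequbao.com'
PATTERN = r'<a href="(/music/\d+)" target'

# Hypothetical fragment shaped like the search results page.
sample_html = '''
<a href="/music/402856" target="_blank">Song A</a>
<a href="/about" target="_blank">About</a>
<a href="/music/1655094" target="_blank">Song B</a>
'''

short_urls = re.findall(PATTERN, sample_html)   # only /music/<digits> links match
full_urls = [BASE_URL + s for s in short_urls]  # prepend the site root
print(full_urls)
```

The `/about` link is skipped because the capture group only accepts `/music/` followed by digits.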


music_info.py
[Python]
# This module collects the song's cover, file name, download links, and related info

from playwright.sync_api import sync_playwright


class MusicInfo:
    def __init__(self, music_page: str):
        """
        :param music_page: str, URL of the song page
        """
        self._music_download_link = ''  # song download link
        self._cover_download_link = ''  # cover download link
        self._lrc_download_link = ''  # lyrics download link
        self._music_name = ''  # song name

        self._goto_page(music_page)

    def _goto_page(self, music_page: str):
        """
        :param music_page: str, URL of the song page
        """
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.on('response', self._on_response)  # inspect every network response
            page.goto(music_page)
            page.wait_for_load_state('networkidle')
            html = page.content()  # get the page source
            browser.close()

        self._get_music_name_and_lrc(html)

    def _on_response(self, response):
        state = response.status  # status code
        url = response.url  # URL
        # print(f'Status {state}: {url}')
        # Kuwo endpoint
        if 'kuwo' in url and '.mp3' in url:  # song download link
            self._music_download_link = url
        elif 'kuwo' in url and '.jpg' in url:  # cover image
            self._cover_download_link = url
        # NetEase Cloud Music endpoint
        elif 'music.126' in url and '.mp3' in url:  # song download link
            self._music_download_link = url
        elif 'music.126' in url and '.jpg' in url:  # cover image
            self._cover_download_link = url


    def _get_music_name_and_lrc(self, html: str):
        """Extract the song name and the lyrics download link from the page source."""
        html_lines = html.split('\n')
        for line in html_lines:
            # print(f'Line: {line}')
            if 'description' in line:  # extract the song name
                # <meta name="description" content="青花瓷-周杰伦.mp3免费在线下载播放,歌曲宝在线音乐搜索
                split1 = line.find('content=')
                split2 = line.find('.mp3')
                music_name = line[split1 + len('content=') + 1:split2]
                self._music_name = music_name

            elif 'btn-download-lrc' in line and 'href' in line:  # extract the lyrics link
                # <a id="btn-download-lrc" href="/download/lrc/1655094" class="btn btn-primary"
                split1 = line.find('href=')
                split2 = line.find(' class')
                short_lrc_url = line[split1 + len('href=') + 1:split2 - 1]
                lrc_download_link = 'https://www.gequbao.com' + short_lrc_url
                self._lrc_download_link = lrc_download_link

    def get_info(self):
        """Return the collected info as a dict."""
        info_dict = {
            'music_download_link': self._music_download_link,
            'cover_download_link': self._cover_download_link,
            'lrc_download_link': self._lrc_download_link,
            'music_name': self._music_name
        }

        return info_dict
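
As an alternative to slicing each line by character offsets, the same two fields could be pulled out with regular expressions. This is only a sketch: `parse_song_page` is a hypothetical helper, the sample text is copied from the two example lines in the comments of music_info.py, and the patterns assume the attribute order shown there.

```python
import re


def parse_song_page(page_html):
    """Return (song_name, lrc_path) parsed from a song page; missing parts are ''."""
    # Song name: everything between content=" and the first ".mp3".
    name_match = re.search(r'<meta name="description" content="(.*?)\.mp3', page_html)
    # Lyrics path: the href of the element with id="btn-download-lrc".
    lrc_match = re.search(r'id="btn-download-lrc" href="([^"]+)"', page_html)
    name = name_match.group(1) if name_match else ''
    lrc_path = lrc_match.group(1) if lrc_match else ''
    return name, lrc_path


# Sample lines as they appear in the comments of music_info.py.
sample = (
    '<meta name="description" content="青花瓷-周杰伦.mp3免费在线下载播放,歌曲宝在线音乐搜索\n'
    '<a id="btn-download-lrc" href="/download/lrc/1655094" class="btn btn-primary"'
)
print(parse_song_page(sample))
```

Because each pattern is anchored on the tag it expects, an unrelated line that merely contains the word `description` would not overwrite the song name.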



down_music.py
[Python]
# This module downloads the song, lyrics, and cover, then embeds the cover and lyrics into the MP3 file
import os

import requests
from mutagen.id3 import ID3, APIC, USLT


class DownMusic:
    """Download a song."""

    def __init__(self, info_dict: dict):
        self._music_download_link = info_dict['music_download_link']
        self._cover_download_link = info_dict['cover_download_link']
        self._lrc_download_link = info_dict['lrc_download_link']
        self._music_name = info_dict['music_name']

        if self._music_download_link:  # skip everything if no song link was captured
            result = self._down_music()  # song links expire; an expired link cannot be downloaded
            if result:
                self._is_error = False
                self._down_lrc()
                self._down_cover()

                self._join_music_metadata()
                self._delete_useless_file()
            else:
                self._is_error = True
        else:
            self._is_error = True

    def is_error(self):
        """Report whether the download failed."""
        return self._is_error

    def _down_music(self):
        """Download the song."""
        filename = self._music_name + '.mp3'
        result = self._download_file(self._music_download_link, filename)
        return result

    def _down_lrc(self):
        """Download the lyrics."""
        filename = self._music_name + '.lrc'
        self._download_file(self._lrc_download_link, filename)

    def _down_cover(self):
        """Download the cover."""
        filename = self._music_name + '.jpg'
        self._download_file(self._cover_download_link, filename)

    def _join_music_metadata(self):
        """Embed the cover and lyrics into the song file's ID3 tags."""
        file_music = self._music_name + '.mp3'
        file_lrc = self._music_name + '.lrc'
        file_cover = self._music_name + '.jpg'

        audio = ID3(file_music)

        # Add the cover
        with open(file_cover, 'rb') as f:
            cover = f.read()
        audio['APIC'] = APIC(
            encoding=3,  # utf-8
            mime='image/jpeg',  # image/jpeg or image/png
            type=3,  # cover image
            desc=u'Cover',
            data=cover
        )

        # Add the lyrics
        with open(file_lrc, 'r', encoding='utf-8') as f:
            lyrics = f.read()
        audio['USLT'] = USLT(
            encoding=3,  # utf-8
            lang='chi',  # lyrics language
            desc=u'Lyrics',
            text=lyrics
        )

        audio.save()

    def _delete_useless_file(self):
        """Delete the leftover lyrics and cover files after embedding."""
        file_lrc = self._music_name + '.lrc'
        file_cover = self._music_name + '.jpg'

        os.remove(file_lrc)
        os.remove(file_cover)

    @staticmethod
    def _download_file(url, filename):
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        with open(filename, 'wb') as f:
            f.write(response.content)

        # An empty file is treated as a failed download
        return os.path.getsize(filename) > 0
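
The file names above are built directly from the song name scraped off the page, so a title containing characters such as `/` or `?` would make `open()` fail on Windows. Below is a minimal sketch of a sanitizing step that could be applied to the name first. `safe_filename` and its fallback value are hypothetical and not part of the original project.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, fallback='untitled'):
    """Replace characters that are invalid in file names and trim the result."""
    # Windows also rejects names that end in a space or a dot, so strip those.
    cleaned = _ILLEGAL.sub('_', name).strip(' .')
    return cleaned or fallback


print(safe_filename('AC/DC: Back in Black?'))
print(safe_filename('青花瓷-周杰伦'))
```

Names without illegal characters pass through unchanged, so applying the helper to every song name should be harmless.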


moliol posted on 2024-1-11 10:37
Great work, thanks! But I don't know how to use it.
东坡小哥哥 posted on 2024-1-12 14:22
Enter a song name/artist, then press Enter to search: 周杰伦
  0%|          | 0/81
Traceback (most recent call last):
  File "C:\Users\BuyeaChen\Desktop\my App\Music-Spider-main\main.py", line 85, in <module>
    main()
  File "C:\Users\BuyeaChen\Desktop\my App\Music-Spider-main\main.py", line 73, in main
    url_info_dict = get_music_info(urls)
                    ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\BuyeaChen\Desktop\my App\Music-Spider-main\main.py", line 40, in get_music_info
    spider = MusicInfo(url)
             ^^^^^^^^^^^^^^
  File "C:\Users\BuyeaChen\Desktop\my App\Music-Spider-main\music_info.py", line 16, in __init__
    self._goto_page(music_page)
  File "C:\Users\BuyeaChen\Desktop\my App\Music-Spider-main\music_info.py", line 23, in _goto_page
    browser = p.chromium.launch(headless=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\BuyeaChen\AppData\Local\Programs\Python\Python312\Lib\site-packages\playwright\sync_api\_generated.py", line 14806, in launch
    self._sync(
  File "C:\Users\BuyeaChen\AppData\Local\Programs\Python\Python312\Lib\site-packages\playwright\_impl\_sync_base.py", line 115, in _sync
    return task.result()
           ^^^^^^^^^^^^^
  File "C:\Users\BuyeaChen\AppData\Local\Programs\Python\Python312\Lib\site-packages\playwright\_impl\_browser_type.py", line 95, in launch
    Browser, from_channel(await self._channel.send("launch", params))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\BuyeaChen\AppData\Local\Programs\Python\Python312\Lib\site-packages\playwright\_impl\_connection.py", line 62, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\BuyeaChen\AppData\Local\Programs\Python\Python312\Lib\site-packages\playwright\_impl\_connection.py", line 492, in wrap_api_call
    return await cb()
           ^^^^^^^^^^
  File "C:\Users\BuyeaChen\AppData\Local\Programs\Python\Python312\Lib\site-packages\playwright\_impl\_connection.py", line 100, in inner_send
    result = next(iter(done)).result()
             ^^^^^^^^^^^^^^^^^^^^^^^^^
playwright._impl._errors.Error: Executable doesn't exist at C:\Users\BuyeaChen\AppData\Local\ms-playwright\chromium-1091\chrome-win\chrome.exe
╔════════════════════════════════════════════════════════════╗
║ Looks like Playwright was just installed or updated.       ║
║ Please run the following command to download new browsers: ║
║                                                            ║
║     playwright install                                     ║
║                                                            ║
║ <3 Playwright Team                                         ║
╚════════════════════════════════════════════════════════════╝
duhe posted on 2024-1-10 19:53
bnb posted on 2024-1-10 19:58
How can I call Python from aardio for this?
I don't want to install a Python environment.
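naoxin2023 posted on 2024-1-10 20:07

Great work, thanks!
dball posted on 2024-1-10 20:07
Could you share it via a cloud drive? Thanks, OP.
flylujun posted on 2024-1-10 20:23
Comes with source code, thumbs up.
井谦 posted on 2024-1-10 20:37
Could you share it via a cloud drive? Thanks, OP.
Leonkeen posted on 2024-1-10 21:04
Comes with source code, thumbs up.
soughing posted on 2024-1-10 21:06
Could you share it via a cloud drive? Thanks, OP.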