[Python Crawler] I just wanted to listen to some DJ mixes, and ended up writing a downloader for DJ嗨嗨网 (djkk.com).
Last edited by 小涩席 on 2020-3-19 00:43. As the title says: once night falls I want to party instead of sleeping, and that calls for music. So I searched Baidu for DJ sites, only to find that every download costs money. As a die-hard freeloader, there was no way I was paying.
At that moment one line popped into my head: "Brothers, just get it done!"
I spent a bit of time hammering out the code. The topics it touches:
1. The regular-expression library `re`
2. The HTTP library `requests`
3. The parsing library `lxml`
4. File operations with `os`
5. The `time` module
6. Plus my own request-header helper, `GetRandomheader`
Paste the code into PyCharm and run it (if you don't know what PyCharm is, a quick Baidu search will tell you plenty). The full code follows:
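Before the full script below, the regex-extraction idea it relies on can be sketched in isolation. Note the HTML fragment here is made up for illustration; the real page layout on djkk.com may differ:

```python
import re

# A made-up fragment imitating the inline player config on a song page.
page = ('var cfg = {songname:"TestTrack",songtype: 1, '
        'songurl: "http://example.com/a.m4a",time: 203};')

# Capture whatever sits between the quotes after songurl: / songname:
urls = re.findall(r'songurl: "(.*?)"', page)
names = re.findall(r'songname:"(.*?)"', page)

print(urls[0])   # the captured download URL
print(names[0])  # the captured song title
```

`re.findall` returns a list of every capture-group match, which is why the script later has to pick the first element rather than using the list itself.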
```
# -*- coding: utf-8 -*-
# 'http://www.djkk.com/dance/sort/chinese_1.html'
# Author: XSX
# Python 3.8, PyCharm Community Edition 2019.3.3
# Import every module used below; run `pip install <name>` for any missing ones.
import requests
import re
import os
from lxml import etree
from GetRandomheader import Randomheader
import time


# Collect every song-page link on one listing page.
def Getsonglinks(url, headers):
    songLinks = []
    rsp = requests.get(url, headers=headers)
    rsp.encoding = rsp.apparent_encoding
    html = etree.HTML(rsp.text)
    songlinks = html.xpath('//div[@class="song layui-elip layui-col-xs12 layui-col-sm8 layui-col-md6"]/div/a/@href')
    for songlink in songlinks:
        songdata = 'http://www.djkk.com' + str(songlink)
        songLinks.append(songdata)
    print(songLinks)
    return songLinks

# Regex out the download address and song name, then create the folder and save the file.
def GetPageText(url, headers):
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding
    songurls = re.findall(r'.*songurl: "(.*)",time.*', r.text)
    songnames = re.findall(r'.*songname:"(.*)",songtype.*', r.text)
    if not songurls or not songnames:
        print('No song data found on ' + url + ', skipping.')
        return
    # findall returns lists; take the first hit rather than str()-ing the whole list.
    songurl = songurls[0]
    songname = songnames[0]
    print('Downloading >>>>> ' + songname + '\n' + '------------------------------------')
    print('Song URL: ' + songurl)
    if not os.path.exists('./DJSongs'):
        os.mkdir('./DJSongs')
    r1 = requests.get(songurl, headers=headers)
    with open('./DJSongs/' + songname + '.m4a', 'wb') as f:
        f.write(r1.content)

# main: wire the helpers above together.
if __name__ == '__main__':
    url1 = 'http://www.djkk.com/dance/sort/chinese_2.html'
    headers = Randomheader()
    for url in Getsonglinks(url1, headers):
        time.sleep(2)
        GetPageText(url, headers)
    print('All downloads finished!')
```

I tried writing something against this site too, but every time it downloads a few tracks and then just sits there. Running your code gives me the same result. Is there something special about scraping music? I'm posting my code in the hope that an expert can point out where the problem is. The program never reports an error; it simply downloads a few songs and then freezes.
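One common cause of exactly this symptom (a hedged guess, not something confirmed in the thread): `requests.get` is being called without a `timeout`, so a single stalled connection blocks forever with no error. A generic retry-with-timeout wrapper, sketched with made-up names (`retry_call`), might look like:

```python
import time

def retry_call(fn, attempts=3, delay=1.0):
    """Call fn(); on exception, wait `delay` seconds and retry, up to `attempts` tries."""
    last_exc = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:
                time.sleep(delay)
    raise last_exc

# In the crawler this would wrap each request, e.g.:
#   rsp = retry_call(lambda: requests.get(url, headers=headers, timeout=10))
# With a timeout set, a dead connection raises an exception instead of hanging,
# and the wrapper gets a chance to try again.
```

The `timeout=10` value is an arbitrary choice for illustration; the key point is that without any timeout, `requests` will wait on a silent server indefinitely.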
```
import requests, os, time, random, re
from lxml import etree

def ranheader():
    user1 = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60'
    user2 = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0'
    user3 = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2'
    #user4 = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
    user4 = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
    #user6 = 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'
    #user7 = 'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'
    #user8 = 'Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'
    list1 = [user1, user2, user3, user4]  # the forum ate this list literal; restored from the variables above
    agent = random.choice(list1)
    header = {'User-Agent': agent}
    return header

def get_urllist(url):  # fetch a page and return its decoded HTML
    req = requests.get(url, headers=ranheader()).content.decode('utf-8')
    return req

def mp3_list(req):  # pull the song-page list and the name list out with XPath
    html = etree.HTML(req)
    addr_list = html.xpath('//li/div/div/a/@href')
    mp3_name = html.xpath('//li/div/div/a/@title')
    get_mp3(addr_list, mp3_name)

def get_mp3(addlist, mp3_name):  # resolve each song's real address and write it to disk
    for x in range(0, len(addlist)):
        addr = short_url + addlist[x]  # was: short_url + addlist (concatenating the whole list)
        html_1 = get_urllist(addr)
        mp3_addr = re.compile('m4a: "(.*?)"}]').findall(html_1)
        if not mp3_addr:
            continue
        print('Saving: ' + mp3_name[x])  # was: mp3_name (printing the whole list)
        filename = requests.get(mp3_addr[0], headers=ranheader())  # was: mp3_addr (a list, not a URL)
        time.sleep(0.5)
        try:
            with open(r"f:/mp3/" + mp3_name[x] + ".mp3", 'wb') as f:
                f.write(filename.content)
            time.sleep(1)
        except:
            print("Failed to save the file!")

if __name__ == '__main__':
    url1 = 'http://www.djkk.com/dance/sort/chinese_'
    short_url = 'http://www.djkk.com'
    starpage = input('First page to fetch: ')
    endpage = input('Last page to fetch: ')
    for i in range(int(starpage), int(endpage) + 1):
        url = url1 + str(i) + ".html"
        mp3_list(get_urllist(url))
```

yulinsoft posted on 2020-3-19 09:11:
It would be even better if you also posted your GetRandomheader request-header module.

Code below; you can save it as a module and import it directly.
```
import random

def Randomheader():
    # Pick a random User-Agent string.
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    ]
    user_agent = random.choice(user_agent_list)
    headers = {'User-Agent': user_agent}
    print(headers)
    return headers
```

Mm-hm, still partying in the middle of the night.

恋上轻灵染上忧 posted on 2020-3-19 00:50:
Mm-hm, still partying in the middle of the night.

Writing code at night is the best! Hahaha

I want to watch grown-up movies. Could you whip up a downloader for me too?

A风继续吹 posted on 2020-3-19 01:07:
I want to watch grown-up movies. Could you whip up a downloader for me too?

That could be arranged, but we don't do anything against the rules here.

OP is really living it up. Good night!

Learned something new again from the big shots' posts: so that's what a "web crawler" is.

Haha, time to get the disco going!

Nice. Finally no more paying to download.

Straight into my bookmarks! :victory: