A simple crawler: scraping resources from a music website
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import re
import time
songKey = []  # song keys (used to build URLs)
songNames = []  # song titles
Author = []
url = "https://www.hifini.com/"  # URL of the site's home page
#url = "<a href=\"thread-62290.htm\"……>金池《谁不是》</a><a href=\"thread-62290.htm\">金池《谁不是》</a>"
html= requests.get(url)
strr= html.text
pat1 = r'thread-\d+\.htm'  # thread page links, e.g. thread-62290.htm
song_url = re.findall(pat1,strr)
song_url = set(song_url)
#part2 = re.findall('《*》',html)
#part0 = re.findall(r"^(<a href=\")(\d+)(\">)",url,re.M)  # regex for parsing the HTML string that contains a song
#print song_url
for i in song_url:
    # fetch each thread page and pull the player fields out of its source
    song_html = requests.get("https://www.hifini.com/"+i)
    strr2 = song_html.text
    realsong_url = re.findall(' url: \'(.*?)\',',strr2,re.S)
    songNames = re.findall(' title: \'(.*?)\',',strr2,re.S)
    PicUrl = re.findall(' pic: \'(.*?)\'',strr2,re.S)
    Author = re.findall(' author:\'(.*?)\',',strr2,re.S)
    # findall returns a list; join each one into a single string for printing
    Realsong_url = "".join(realsong_url)
    SongNames = "".join(songNames)
    Picurl = "".join(PicUrl)
    AuThor = "".join(Author)
    print(Realsong_url)
    print(SongNames)
    print(Picurl)
    print(AuThor)
    print()
    # source_data = "www.hifini.com/"+ Realsong_url
    # print("song_url = " + source_data)
    # print("song_name = " + SongNames)
    # print("pic_url = " + Picurl)
    # print("author = " + AuThor)
    # print()
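The parsing steps in the script above can also be tried without touching the network. The following is only a minimal sketch, not the original poster's code: the function names `extract_thread_links` and `extract_song_fields`, the sample strings, and the assumed layout of the player config are all hypothetical. It also keeps only the first match for each field, whereas the script joins every match together.

```python
import re

# Hypothetical sample strings, shaped like what the script's patterns expect.
# The real pages may be laid out differently.
SAMPLE_INDEX = (
    '<a href="thread-62290.htm">金池《谁不是》</a>'
    '<a href="thread-62290.htm">金池《谁不是》</a>'
    '<a href="thread-123.htm">another</a>'
)
SAMPLE_THREAD = """
    music: {
        title: 'Example Song',
        author:'Example Artist',
        url: 'get_music.php?key=abc',
        pic: 'upload/cover.jpg'
    }
"""

# "thread-", one or more digits, a literal dot, then "htm"
THREAD_LINK = re.compile(r"thread-\d+\.htm")

# Same patterns as the script, one per field
FIELD_PATTERNS = {
    "url": re.compile(r" url: '(.*?)',", re.S),
    "title": re.compile(r" title: '(.*?)',", re.S),
    "pic": re.compile(r" pic: '(.*?)'", re.S),
    "author": re.compile(r" author:'(.*?)',", re.S),
}


def extract_thread_links(index_html):
    """Return the unique thread links found in the page, sorted."""
    return sorted(set(THREAD_LINK.findall(index_html)))


def extract_song_fields(thread_html):
    """Return the first match for each field, or "" when a field is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(thread_html)
        fields[name] = match.group(1) if match else ""
    return fields


if __name__ == "__main__":
    print(extract_thread_links(SAMPLE_INDEX))
    print(extract_song_fields(SAMPLE_THREAD))
```

Keeping the regex work in plain functions means the patterns can be checked against a saved copy of a page before any request is sent.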
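papa08 posted on 2020-11-21 18:34
Thumbs up to the OP! I've always wanted to learn Python, and this kind of crawling too. Are there any learning materials you'd recommend?
Watch tutorials on bilibili and teach yourself. Don't assume Python is hard; it is just a scripting language, and a basic crawler is easy to write.
蟹老板阿 posted on 2020-11-24 20:40
This is an extremely simple scraping script, not a crawler in the strict sense. Everything here is legal and it is for learning only.
I checked and it's 64k audio quality! Completely useless! No point! A waste of enthusiasm! If you need high quality, don't bother downloading! All that effort for nothing.
Is it high quality, or only 128kb?
👍👍
Thanks for sharing!!
Could you add more comments? Beginners can't follow it.
Total beginner here, I can't understand it.
The trend going forward is that sites' anti-crawler mechanisms will keep getting more robust.
Thanks for sharing.
Thanks for sharing, I'll borrow from this.
Still learning. Taking this with me, thanks for sharing.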