This post was last edited by 我_落_泪_、情绪 on 2020-3-11 20:03
!!! This is not an FM transmitter — it only scrapes the station frequency text. We need the list for a class; please don't misunderstand, sorry !!!
Target site: http://www.radio366.com/
I've recently been taking a course on wireless scanning and needed the local FM frequencies. I found this site, but to see an actual frequency you first have to pick a province and then open each station page one by one, which is tedious. So I wrote a crawler to collect them all. The site doesn't appear to have any anti-scraping measures, so it's a good practice project for beginners.
Here is the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2020/3/11 14:49
# @Author : Ft
# @Site :
# @File : fm.py
# @Software: PyCharm
#---------------------
import re

import requests
from lxml import etree

txtName = "codingWord.txt"
f = open(txtName, "a+", encoding="utf-8")
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}

# Fetch the index page and pull the province links out of the #sheng list
url2 = "http://www.radio366.com/"
html2 = requests.get(url2, headers=headers)
html2.encoding = 'gb2312'
page2 = etree.HTML(html2.text)
shengfen = page2.xpath("//*[@id='sheng']/ul//a")

for city in shengfen:
    # The province id is the part of the href after '='
    js = re.search(r'=.*', city.attrib['href'])
    print(city.text)
    shengs = js.group()[1:]
    url = "http://www.radio366.com/xx.asp?sheng=" + shengs
    html = requests.get(url, headers=headers)
    html.encoding = 'gb2312'
    page = etree.HTML(html.text)
    # Station detail links and names for this province
    diantai = page.xpath('//body/div[@class="content"]/ul//a/@href')
    name = page.xpath('//body/div[@class="content"]/ul//a/text()')
    for i in range(len(name)):
        xpage = "http://www.radio366.com/" + diantai[i]
        html1 = requests.get(xpage, headers=headers)
        html1.encoding = 'gb2312'
        page1 = etree.HTML(html1.text)
        fm = page1.xpath('//div[@id="plbottom"]/text()')
        # The frequency may be written as "FM95.8" or "调频95.8" and may
        # sit in either of the first two text nodes; the dot is escaped
        # so only a literal decimal point matches
        dian = re.search(r'FM\d{1,3}\.\d', fm[0])
        if dian is None and len(fm) > 1:
            dian = re.search(r'FM\d{1,3}\.\d', fm[1])
            if dian is None:
                dian = re.search(r'调频\d{1,3}\.\d', fm[1])
        if dian is None:
            dian = re.search(r'调频\d{1,3}\.\d', fm[0])
        if dian is not None:
            print(name[i])
            print(dian.group())
            # Save the station name and frequency to the output file
            f.write(name[i] + " " + dian.group() + "\n")
f.close()
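The whole script hinges on the frequency regex, so it's worth checking in isolation. A quick standalone sketch (the sample strings below are made up for illustration, not taken from the site):

```python
import re

# Match "FM" or "调频" followed by a frequency like 95.8; escaping the
# dot ensures only a literal decimal point matches, not any character.
PATTERN = re.compile(r'(?:FM|调频)(\d{1,3}\.\d)')

samples = [
    "北京人民广播电台 FM97.4 兆赫",   # hypothetical "FM" form
    "交通台 调频103.9 直播",          # hypothetical "调频" form
    "网络电台(无频率)",               # no frequency at all
]
for s in samples:
    m = PATTERN.search(s)
    print(m.group(1) if m else "no FM frequency")
```

This prints `97.4`, `103.9`, and `no FM frequency` — confirming the pattern picks up both notations and fails cleanly on stations without one.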
The scraped results are submitted as an attachment.