This post was last edited by deffun on 2020-4-7 12:11
The `article` elements can be matched with a regular expression.
The OP doesn't seem entirely clear on how find, find_all, and regex matching work.
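To make the difference concrete, here is a small sketch run against made-up markup (the HTML, the ids and the example.com URL are assumptions for illustration, not the real site's page). It relies on two documented BeautifulSoup behaviours: find_all returns every match as a list-like result, while find returns only the first match or None; and a compiled regex passed as an attribute filter is applied with search().

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a blog listing page.
SAMPLE_HTML = """
<article id="post-101"><h1 class="entry-title">Game A</h1>
  <a href="magnet:?xt=urn:btih:AAAA">magnet</a></article>
<article id="post-102"><h1 class="entry-title">Game B</h1>
  <a href="https://example.com/info">details</a></article>
<article id="sidebar"><h1 class="entry-title">Not a post</h1></article>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find_all: every <article> whose id ends in "post-<digits>" (may be empty).
posts = soup.find_all('article', id=re.compile(r'post-\d+$'))
post_ids = [post['id'] for post in posts]

# find: first matching <a> inside each post, or None if there is none,
# so check for None before reading attributes from the result.
links = []
for post in posts:
    magnet = post.find('a', href=re.compile(r'^magnet'))
    links.append(magnet['href'] if magnet is not None else None)

print(post_ids)
print(links)
```

The second post has no magnet link, so its entry in links is None rather than an exception, which is the same guard the reworked scraper needs.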
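I reworked the code a bit: it now uses functions and catches exceptions, which makes it a little more Pythonic. It should be quite helpful for beginners.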
#!/usr/bin/env python3
import re

import requests
from bs4 import BeautifulSoup


def page_extractor(html: str, page: int):
    """Parse one listing page; append entries that have a magnet link to fitgirl.txt."""
    print('这是第 {} 页内容'.format(page))  # "contents of page {}"
    soup = BeautifulSoup(html, 'html.parser')
    # Each post is an <article> whose id looks like "post-12345".
    for s in soup.find_all('article', id=re.compile(r'post-\d+$')):
        # s.find('a', text='magnet') is unreliable: the link text is not always `magnet`.
        # Match on the href instead. Some articles have no magnet link at all,
        # so calling .get() directly on the result would fail; check it first.
        magnet = s.find('a', href=re.compile(r'^magnet'))
        item = {
            '标题': s.find('h1', class_='entry-title').getText(),  # title
            '链接': magnet.get('href') if magnet else None,  # link; None if there is no magnet link
        }
        print(item)
        if item['链接'] is not None:  # only save entries that have a magnet link
            with open('fitgirl.txt', 'a', encoding='utf-8') as f:
                line = '{}\t{}\n'.format(item['标题'], item['链接'])
                f.write(line)


def main():
    url = 'http://fitgirl-repacks.site/page/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
    for p in range(1, 19):
        r = requests.get(url + str(p), headers=headers)
        try:
            page_extractor(html=r.text, page=p)
        except Exception as e:
            # Save the page that failed to parse so it can be inspected later.
            err_html_path = str(p) + '.html'
            with open(err_html_path, 'w', encoding='UTF8') as f:
                f.write(r.text)
            # Message: error, see "<file>" to debug, <exception>
            print('出错,查看"{}"以调试,{}'.format(err_html_path, e))


if __name__ == '__main__':
    main()
Output (the banner line reads "contents of page N"; the keys 标题 and 链接 mean "title" and "link"):
这是第 1 页内容
{'标题': 'Upcoming repacks', '链接': None}
{'标题': 'Gaia', '链接': 'magnet:?xt=urn:btih:6C06A408A106D4D7C5759CFF9E62EC513AD8E2CF'}
{'标题': 'X4: Foundations – v3.00 + 2 DLCs', '链接': 'magnet:?xt=urn:btih:3415C5ADD0853E25490316863BD273579C60CC3F'}
{'标题': 'Russian Movies Weekend #18', '链接': None}
{'标题': 'Max Payne 3: Complete Edition – v1.0.0.216 + All DLCs', '链接': 'magnet:?xt=urn:btih:2CB17311A72FD1930CC4DBC3A3F4D684D8ADDE27'}
{'标题': 'Assassin’s Creed: Odyssey – Ultimate Edition – v1.5.3 + All DLCs', '链接': 'magnet:?xt=urn:btih:BF3A22D7D6C9BB87B604A7B77E8409C0881F372F'}
{'标题': 'Operencia: The Stolen Sun – Explorer’s Edition', '链接': 'magnet:?xt=urn:btih:B7ABB387B1D9B0856ECA4BF19C0B2466AD22AD4F'}
{'标题': 'Neverwinter Nights: Enhanced Edition – v79.8193.9 + All DLCs', '链接': 'magnet:?xt=urn:btih:6FDBED3DD9A240F0BBD9A293AA282AC083D63C29'}
{'标题': 'The Complex', '链接': 'magnet:?xt=urn:btih:6A42639D0F7D39E5FB3C62FACAC6AEAAA37284F0'}
{'标题': 'Deep Sky Derelicts: Definitive Edition – v1.5.1 + Soundtrack + ArtBook', '链接': 'magnet:?xt=urn:btih:C10DC5A8F713E83BCCAF27376CADE0B6FBB7009D'}
这是第 2 页内容
{'标题': 'Biped – v1.1', '链接': 'magnet:?xt=urn:btih:ECF961E39CE79AD6FE56F0509466363594B7739C'}
{'标题': 'Rocket League – v1.75 + 36 DLCs + Offline Unlocker', '链接': 'magnet:?xt=urn:btih:0609BBBBC670AF2B98310511F0171D8DF5F9BC95'}
{'标题': 'UNDER NIGHT IN-BIRTH Exe:Late[cl-r] + All DLCs & OST', '链接': 'magnet:?xt=urn:btih:5DDD3F9960103F0C45868115834A10D1B805F660'}
{'标题': 'One Piece: Pirate Warriors 4 + 2 DLCs + Multiplayer', '链接': 'magnet:?xt=urn:btih:B89FBD43806BCD11D0D3B7D666626B348D1CDBD7'}
{'标题': 'DiRT Rally 2.0: Game of the Year Edition – v1.13 + All DLCs', '链接': 'magnet:?xt=urn:btih:D318A1955E5B4466BB2CE609E1DBDE82BC592E39'}
{'标题': 'The Legend of Heroes: Trails of Cold Steel III – v1.05 + 57 DLCs', '链接': 'magnet:?xt=urn:btih:A29EF9243E9924B4012537193C54408B1CF08471'}
{'标题': 'CONTROL – v1.09 + DLC', '链接': 'magnet:?xt=urn:btih:81D689EED1BA96FF8413F048CB5DFD159F42B1C3'}
{'标题': 'Two Point Hospital – v1.19.49336 + 9 DLCs', '链接': 'magnet:?xt=urn:btih:72D55558D7E7C60DD6C2E92309888A1C16FB96AF'}
{'标题': 'God Eater 3 – v2.50 + All DLCs + Multiplayer', '链接': 'magnet:?xt=urn:btih:35D62FC0DB0893575DC6A1CB3979A79972819A17'}
{'标题': 'Cities: Skylines – Deluxe Edition – v1.13.0-f7 + All DLCs', '链接': 'magnet:?xt=urn:btih:EC44A717E96F6637A5EA5E446CF9E488326B2721'}
这是第 3 页内容
{'标题': 'Amid Evil – v2055 (Ancient Alphas)', '链接': 'magnet:?xt=urn:btih:7DBF99635A21890CF6B2DE4A6FB824E32923DB91'}
{'标题': 'Iron Danger – v1.00.31', '链接': 'magnet:?xt=urn:btih:3BCF2C17676AF337662968776E94CE60D77BD9B6'}
# (remaining output omitted)
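Also, the data should really be saved as CSV or in a database. Writing it straight into a txt file with no delimiter will cause trouble when you process it later, though that is a topic for another time.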
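As a rough idea of what the CSV route could look like, here is a minimal sketch using only the standard library. The function names save_items and load_items, the column names title and magnet, and the second row are all made up for illustration; the scraper above would need its own mapping from its dict keys to columns.

```python
import csv
import os
import tempfile

FIELDS = ['title', 'magnet']  # hypothetical column names


def save_items(path, items):
    """Append items to a CSV file, writing the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings; utf-8 keeps
    # characters such as the en dash in titles intact.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(items)


def load_items(path):
    """Read the CSV back as a list of dicts keyed by the header row."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))


# Demo in a temporary directory so nothing is left in the working folder.
demo_path = os.path.join(tempfile.mkdtemp(), 'fitgirl.csv')
rows = [
    {'title': 'Gaia',
     'magnet': 'magnet:?xt=urn:btih:6C06A408A106D4D7C5759CFF9E62EC513AD8E2CF'},
    # Made-up row: the comma in the title is quoted automatically by csv.
    {'title': 'Made-up title, with a comma', 'magnet': 'magnet:?xt=urn:btih:0000'},
]
save_items(demo_path, rows[:1])  # first call writes header + one row
save_items(demo_path, rows[1:])  # second call appends without a second header
loaded = load_items(demo_path)
print(loaded)
```

Because the csv module handles quoting, titles containing commas, tabs or quotes survive a round trip, which is exactly what a hand-rolled txt format tends to get wrong.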