Extracting a tag's attributes and link with a Python regular expression?

double07 posted on 2021-4-28 21:56

<tr class="sticky_normal">
<td class="rowfollow nowrap" valign="middle" style='padding: 0'><a href="?cat=432"><img src="pic/cattrans.gif" alt="网红/Blu-Ray Uncensored" title="网红/Blu-Ray Uncensored" style="background-image: url(pic/category/chd/scenetorrents/cht/uenbd.png);" /></a></td>
<td class="torrenttr" width="100%" align="left"><table class="torrentname" width="100%"><tr class="sticky_normal"><td class="torrentimg"><a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1"><img alt="torrent thumbnail" src="https://img.cqgov.org/images/2021/04/28/cbc2118b6f2466e5f954567a54c937a3.jpg"></a></td><td class="embedded"><img class="sticky" src="pic/trans.gif" alt="Sticky" title="置頂" /> <a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">重庆渝中区洪涯洞打卡点</a> <img class="pro_50pctdown" src="pic/trans.gif" alt="50%" title="50%" /> <img src="pic/trans.gif" class="label_sub" alt="中国" /> 网红景点</td><td width="80" class="embedded" style="text-align: right; " valign="middle"><a href="download.php?id=482015&https=1"><img class="download" src="pic/trans.gif" style='padding-bottom: 2px;' alt="download" title="下载图片" /></a><a id="bookmark2"href="javascript: bookmark(482015,2);" ><img class="delbookmark" src="pic/trans.gif" alt="Unbookmarked" title="收藏" /></a></td>
</tr></table></td><td class="rowfollow"><a href="details.php?id=482015&hit=1&cmtpage=1#startcomments" >4</a></td><td class="rowfollow nowrap">18时 36分</td><td class="rowfollow">19.22 GB</td><td class="rowfollow" align="center"><a href="details.php?id=482015&hit=1&dllist=1#seeders">82</a></td>
<td class="rowfollow"><a href="details.php?id=482015&hit=1&dllist=1#leechers">11</a></td>
<td class="rowfollow"><a href="viewsnatches.php?id=482015">122</a></td>
<td class="rowfollow" style="font-weight: bold">--</td><td class="rowfollow">匿名</td>

How can I use a regex to extract the <a> tag's title "重庆渝中区洪涯洞打卡点" and the link in its href?

I wrote it like this, but it extracts nothing: <a.*?title="(.*?)"href="(.*?)".*?>.*?</a>

cmy2019 posted on 2021-4-28 22:05

Your regex is missing a space, right before href.
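The missing-space point can be checked with a minimal sketch. It assumes a single anchor copied from the posted row is held in a string named html (a name chosen here for illustration), and it uses \s+ instead of a literal space so that any whitespace between the two attributes is accepted.

```python
import re

# One anchor copied from the posted HTML; note the space between title="..." and href="...".
html = ('<a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">'
        '重庆渝中区洪涯洞打卡点</a>')

# The pattern from the question: the closing quote of title must be followed directly by href.
original = r'<a.*?title="(.*?)"href="(.*?)".*?>.*?</a>'
# Same pattern, but \s+ allows the whitespace that separates the attributes.
with_space = r'<a.*?title="(.*?)"\s+href="(.*?)".*?>.*?</a>'

original_hits = re.findall(original, html)
fixed_hits = re.findall(with_space, html)

print(original_hits)  # empty: '"href=' never occurs in the input
print(fixed_hits)     # one (title, href) tuple
```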

fanvalen posted on 2021-4-28 22:08

Airey posted on 2021-4-28 22:10

Last edited by Airey on 2021-4-28 22:13

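from bs4 import BeautifulSoup  # assumes the page source is in a string named html
soup = BeautifulSoup(html, 'html.parser')
for k in soup.find_all('a'):
    print(k['href'])  # the href value of the <a> tag
    k.get('href')     # alternative: returns None instead of raising KeyError when href is missing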
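Airey's loop relies on BeautifulSoup, which is a third-party package. As a hedged alternative for cases where only the standard library is available, the sketch below swaps in html.parser.HTMLParser to collect (title, href) pairs from <a> tags. The class name AnchorCollector and the abridged sample in html are illustrative assumptions, not part of the thread.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects unique (title, href) pairs from <a> tags that carry both attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        if 'title' in attributes and 'href' in attributes:
            pair = (attributes['title'], attributes['href'])
            if pair not in self.links:  # skip the repeated anchor
                self.links.append(pair)


# Abridged from the posted row: two anchors with the same title/href, one without a title.
html = (
    '<a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">'
    '<img alt="torrent thumbnail"></a>'
    '<a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">重庆渝中区洪涯洞打卡点</a>'
    '<a href="download.php?id=482015&https=1"><img class="download" title="下载图片" /></a>'
)

collector = AnchorCollector()
collector.feed(html)
collector.close()
print(collector.links)
```

Because the parser looks at attributes per tag, the title on the <img> inside the download link is never mistaken for an anchor title.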

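fanvalen posted on 2021-4-28 22:15

First, just make sure the space is there.
Second, the .*? after href="(.*?)" stops at the first > it meets, and what follows then has to be matched by .*?</a> (that > is not immediately followed by the closing tag); because the text up to </a> is not one continuous run, the match fails.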
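To see how the lazy .*? pieces behave once the space is allowed, here is a small sketch on three anchors abridged from the posted row (held in an assumed variable html). The loose pattern is the question's regex with \s* before href; the strict pattern is an illustrative alternative that uses [^"]* so a capture can never run past the closing quote of its own attribute.

```python
import re

# Three anchors abridged from the posted row; the last one has no space before href.
html = (
    '<a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">重庆渝中区洪涯洞打卡点</a>'
    '<a href="download.php?id=482015&https=1"><img class="download" src="pic/trans.gif" '
    'alt="download" title="下载图片" /></a>'
    '<a id="bookmark2"href="javascript: bookmark(482015,2);" ><img class="delbookmark" '
    'src="pic/trans.gif" alt="Unbookmarked" title="收藏" /></a>'
)

# The question's pattern with optional whitespace before href.
loose = r'<a.*?title="(.*?)"\s*href="(.*?)".*?>.*?</a>'
# Negated character classes keep each capture inside one attribute value.
strict = r'<a\s+title="([^"]*)"\s*href="([^"]*)"'

loose_hits = re.findall(loose, html)
strict_hits = re.findall(strict, html)

for title, href in loose_hits:
    print(repr(title), repr(href))
print(strict_hits)
```

With the loose pattern, .*? is free to cross tag boundaries, so the second hit starts at the title of the download <img> and swallows markup up to the bookmark link's href. The strict pattern only accepts an <a> whose first attribute is title, immediately followed by href.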

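double07 posted on 2021-4-28 22:30

Last edited by double07 on 2021-4-28 22:33

I've only just started learning and am still feeling my way. With the current expression, what gets extracted includes a chunk of markup, and there are duplicate values. The end goal is to extract just the two fields "重庆渝中区洪涯洞打卡点" and details.php?id=482015&hit=1, without duplicates. How can this be done with a regex?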
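One possible way to reach that goal, sketched under assumptions: the pattern below requires title to be the first attribute of the <a> tag with href right after it (as in the posted row), and duplicates are dropped with dict.fromkeys, which keeps first-seen order. The names ANCHOR and extract_unique are made up for this illustration, and the sample is abridged from the post.

```python
import re

# Assumes title comes first in the tag and href follows it, as in the posted HTML.
ANCHOR = re.compile(r'<a\s+title="([^"]*)"\s*href="([^"]*)"')


def extract_unique(html):
    """Return (title, href) pairs in order of first appearance, without repeats."""
    pairs = ANCHOR.findall(html)
    # dict keys are unique and preserve insertion order, so this removes duplicates.
    return list(dict.fromkeys(pairs))


# Abridged from the posted row: the same anchor appears twice (thumbnail and text link).
html = (
    '<td class="torrentimg"><a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">'
    '<img alt="torrent thumbnail"></a></td>'
    '<td class="embedded"><a title="重庆渝中区洪涯洞打卡点" href="details.php?id=482015&hit=1">'
    '重庆渝中区洪涯洞打卡点</a></td>'
    '<a href="download.php?id=482015&https=1"><img class="download" title="下载图片" /></a>'
)

result = extract_unique(html)
for title, href in result:
    print(title, href)
```

If the attribute order varies from page to page, a regex like this becomes fragile and an HTML parser is usually the safer choice.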

vista_info posted on 2021-4-29 02:06

Have you considered XPath?

double07 posted on 2021-4-29 08:46

kai-memory posted on 2021-4-29 02:06
Have you considered XPath?

XPath isn't an option here; regex is what I've just learned :lol

百千三昧 posted on 2021-4-29 13:02

百千三昧 posted on 2021-4-29 13:13


吾爱破解 - 52pojie.cn's Archiver
