Crawler, Part 2: Searching the Forum for Software

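yx_robert posted on 2018-11-16 02:01

This post was last edited by yx_robert on 2018-11-16 02:08

Just for practice,
and to solve a practical problem along the way.

I'd like help with XPath:
I'm still not comfortable with it.
Some parts are written in an ugly way,
so pointers from the experts are welcome.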

#! /usr/bin/env python
# -*- coding: UTF-8 -*-

from lxml import etree
import requests
import sys

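# Python 2 only: make 'gbk' the default codec for implicit str/unicode conversion.
reload(sys)
sys.setdefaultencoding('gbk')

def gbk_2_utf(_str):
    # Re-encode a GBK byte string as UTF-8.
    return _str.decode('gbk').encode('UTF-8')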

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    'Referer': 'https://www.52pojie.cn/forum-16-1.html'
}

save_file = u'搜索结果.txt'

# url_soft = 'https://www.52pojie.cn/forum-16-1.html'
# url_code = 'https://www.52pojie.cn/forum-24-1.html'
# url_movie = 'https://www.52pojie.cn/forum-56-1.html'

# Windows
# https://www.52pojie.cn/forum.php ... r=typeid&typeid=231 #windows
# https://www.52pojie.cn/forum.php ... d&typeid=231&page=1
# https://www.52pojie.cn/forum.php ... lter=typeid&page=13

# Utility software
# https://www.52pojie.cn/forum.php ... r=typeid&typeid=289
# https://www.52pojie.cn/forum.php ... d&typeid=289&page=2

main_web = 'https://www.52pojie.cn/'
url = 'https://www.52pojie.cn/forum.php?mod=forumdisplay&fid=16&typeid=231&filter=typeid&typeid=231&page=%d'
max_pag = 50
filter_str = 'amp;'
tar_str = u'百度'
# tar_str = ''

def main():
    # The with-statement closes the file on exit, so no explicit f.close() is needed.
    with open(save_file, 'w') as f:
        for i in range(1, max_pag + 1):
            cur_url = url % i
            req = requests.get(cur_url, headers=headers)
            req.encoding = 'gbk'
            # print req.text
            root = etree.HTML(req.text)
            # res = root.xpath('//*[@href="javascript:;"]/@class')
            # result1 = html.xpath('//li/a/text()')
            name_list = root.xpath(
                '//a/text()')
            url_list = root.xpath(
                '//a/@href')

            # Pair names with links by index only when the two lists line up.
            if len(name_list) == len(url_list):
                for idx in range(0, len(url_list)):
                    # An empty tar_str means "keep every entry".
                    if tar_str == '' or name_list[idx].find(tar_str) != -1:
                        f.write(name_list[idx] + '\n')
                        f.write(main_web + url_list[idx] + '\n\n\n')
            # break

if __name__ == "__main__":
    main()
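
On the XPath question: the script builds one list from '//a/text()' and another from '//a/@href', then pairs them by index, so a single link without text (an image link, say) makes the lengths differ and the page is skipped. One possible way around this, sketched below in Python 3 rather than the Python 2 of the script, is to select the <a> elements themselves and read the text and href from the same node. The function name extract_links, the SAMPLE markup, and the thread-N-1-1.html hrefs are made up for illustration; only etree.HTML, xpath, itertext, get and urljoin are real library calls.

```python
from urllib.parse import urljoin

from lxml import etree

MAIN_WEB = 'https://www.52pojie.cn/'


def extract_links(html_text, keyword=''):
    """Return (title, absolute_url) pairs for every <a> that has an href.

    Hypothetical helper: an empty keyword keeps every titled link.
    """
    root = etree.HTML(html_text)
    results = []
    # Select the <a> elements, then read text and href from the same node,
    # so a link without text can never shift the pairing.
    for a in root.xpath('//a[@href]'):
        title = ''.join(a.itertext()).strip()
        if not title:
            continue  # e.g. a link that only wraps an image
        if keyword and keyword not in title:
            continue
        # urljoin handles both relative and already-absolute hrefs.
        results.append((title, urljoin(MAIN_WEB, a.get('href'))))
    return results


# Made-up markup standing in for one forum listing page.
SAMPLE = '''
<html><body>
<a href="thread-1-1-1.html">百度网盘工具</a>
<a href="thread-2-1-1.html"><img src="x.png"/></a>
<a href="thread-3-1-1.html">Other tool</a>
</body></html>
'''

for title, link in extract_links(SAMPLE, keyword='百度'):
    print(title)
    print(link)
```

In the real script the '//a[@href]' expression would probably need narrowing to the thread-title links of the listing table; that selector depends on the page markup and is not guessed at here.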

yx_robert posted on 2019-11-3 21:40

雷晨 posted on 2019-11-3 19:06
Hello OP, could you write a tutorial on crawling http://mzsock.com/? Thanks.

........Abetting minors is against the law.. I wouldn't dare.

拉风丶 posted on 2018-11-16 02:24

I don't understand it, but I'll show my support anyway!!

alairlee posted on 2018-11-16 02:42

I'm learning Python. Thanks for sharing.

cube posted on 2018-11-16 03:40

XPath all the way

kzx136 posted on 2018-11-16 07:27

Still learning. Thanks for sharing.

xiaobaibaibai posted on 2018-11-16 08:26

Just what I needed to study. Thanks for sharing~

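空心烂木头 posted on 2018-11-16 08:37

I'd like to learn too, but I'm starting from zero. For now I'm fairly handy with off-the-shelf scraping tools, and they cover everything I need.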

Cameron·陽 posted on 2018-11-16 08:38

I can't follow it, but I'll show my support anyway!!

wang_qianxu posted on 2018-11-16 08:44

This looks very advanced. You have my support.

1sina posted on 2018-11-16 08:48

I don't understand it, but I'll show my support anyway!!

吾爱破解 - 52pojie.cn's Archiver