Writing a program to crawl Anquanke (安全客) articles

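xudongtiankong posted on 2017-5-5 14:02

I crawled the articles from Anquanke (安全客) and saved them as PDFs. The code is below. Multithreading could be added to speed up the crawl, but I did not bother.
The crawled articles and the code follow~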
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import time
import urllib2

import pdfkit


def main():
    # Build the candidate article URLs (IDs 0-1999; adjust the range as needed).
    url = "http://bobao.360.cn/learning/detail/%s.html"
    urllist = [url % i for i in range(0, 2000)]

    # Make sure the output directory exists before writing PDFs into it.
    if not os.path.isdir('./output'):
        os.makedirs('./output')

    count = 0
    for urlname in urllist:
        try:
            response = urllib2.urlopen(urlname)
            result = response.read()
            # Skip IDs that return an empty page.
            if result.strip() == '':
                continue
            pdfkit.from_url(urlname, './output/%s.pdf' % count)
            time.sleep(0.1)
        except Exception as e:
            # Report the failure and move on to the next URL.
            print str(e)
        count = count + 1


if __name__ == '__main__':
    main()

Attachment:
Link: https://pan.baidu.com/s/1slhPGXv Password: hsef
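
The post mentions that multithreading could speed up the crawl. A minimal sketch of that idea follows, written for Python 3 rather than the Python 2 of the original script. The names `convert_all` and `to_pdf` and the default of 4 workers are assumptions made for illustration, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_all(urls, convert, workers=4):
    """Run convert(index, url) for every URL on a small thread pool.

    Returns a list of (url, error) pairs for the conversions that raised.
    """
    failures = []

    def job(item):
        index, url = item
        try:
            convert(index, url)
            return None
        except Exception as exc:  # keep going when one page fails
            return (url, exc)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        for outcome in pool.map(job, enumerate(urls)):
            if outcome is not None:
                failures.append(outcome)
    return failures


def to_pdf(index, url):
    # Hypothetical converter: imported lazily so convert_all can be
    # exercised without pdfkit installed. Assumes ./output already exists.
    import pdfkit
    pdfkit.from_url(url, './output/%s.pdf' % index)
```

With the URL list from the script above, a call would look like `convert_all(urllist, to_pdf, workers=8)`. A small pool is probably the polite choice here, since every worker hits the same site.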

Pythoner posted on 2017-5-5 14:32

url = "http://bobao.360.cn/learning/detail/%s.html"
for i in range(0, 2000):
    urllist.append(url % i)

Only 2000 of them?
What about http://bobao.360.cn/learning/detail/3815.html?

And only the learning section?

What about http://bobao.360.cn/news/detail/4148.html?
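
One way to cover both points (IDs above 2000 and the news section) is to generate the URLs from a list of sections plus an ID range. This is only a sketch: the helper name `candidate_urls` is made up, and the upper bound in the usage note is a guess that would need adjusting to whatever IDs actually exist.

```python
def candidate_urls(sections, first_id, last_id,
                   base="http://bobao.360.cn/%s/detail/%d.html"):
    """Yield one URL per (section, id) pair; both ID bounds are inclusive."""
    for section in sections:
        for article_id in range(first_id, last_id + 1):
            yield base % (section, article_id)
```

For example, `candidate_urls(["learning", "news"], 0, 4200)` would walk both sections. IDs that do not exist still have to be skipped by the fetch step, as the original script does for empty pages.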

xudongtiankong posted on 2017-5-5 14:34

Pythoner posted on 2017-5-5 14:32
url = "http://bobao.360.cn/learning/detail/%s.html"
for i in range(0, 2000):
...

Adjust that to fit your own needs; what I posted is only a sample.

history850 posted on 2017-5-5 14:28

Registered for 6 years with only 22 points. That is some deep lurking.

xudongtiankong posted on 2017-5-5 14:35

history850 posted on 2017-5-5 14:28
Registered for 6 years with only 22 points. That is some deep lurking.

Professional lurker here~~

九八。 posted on 2017-5-5 14:40

Here to take a look at the analysis.

YAIBA2 posted on 2017-5-5 14:41

Learned something. pdfkit is really powerful: give it a link and it saves the page directly.

mayiwan posted on 2017-5-5 14:41

There is no lowest-effort post, only lower-effort ones....

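Pythoner posted on 2017-5-5 14:46

xudongtiankong posted on 2017-5-5 14:34
Adjust that to fit your own needs; what I posted is only a sample.

(⊙o⊙)… OK. With this approach, any web page can be saved as a PDF, not just Anquanke. Thanks.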
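
To make the "any web page" point concrete, here is a rough Python 3 sketch that names each PDF after its URL instead of a counter, so it is not tied to one site. `pdf_name` and `save_pages` are hypothetical helpers. The only pdfkit call assumed is `pdfkit.from_url(url, output_path)`, the same one the original script uses, and pdfkit needs the wkhtmltopdf binary installed.

```python
import os
import re
from urllib.parse import urlparse


def pdf_name(url):
    """Derive a file-system-safe PDF name from the URL's host and path."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    # Replace every run of characters other than letters, digits,
    # dots and dashes with a single underscore.
    safe = re.sub(r'[^A-Za-z0-9.-]+', '_', raw).strip('_')
    return (safe or 'page') + '.pdf'


def save_pages(urls, out_dir, render=None):
    """Render each URL into out_dir; render defaults to pdfkit.from_url."""
    if render is None:
        import pdfkit  # imported lazily; requires wkhtmltopdf
        render = pdfkit.from_url
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(out_dir, pdf_name(url))
        render(url, target)
        saved.append(target)
    return saved
```

The `render` parameter exists mainly so the loop can be tried with a stand-in function before pointing it at real pages.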

wuai920981023 posted on 2017-5-5 14:52

Impressive. This lurker salutes you all.


吾爱破解 - 52pojie.cn's Archiver
