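A beginner learning Python web scraping shares some code: a pseudo full-site crawl of Lianjia.

来两碗米饭 posted on 2019-8-25 14:02

This post was last edited by 3651535042 on 2019-8-26 20:43

If you have free rating points, please give this post a rating, thanks. I'm saving up 吾爱币 (52pojie coins) to trade for tutorials!

I have been learning Python web scraping for one week (I already had some Python basics from a half-year course in the first semester of my freshman year), so the code is not very good. If any experts spot mistakes, I would appreciate having them pointed out.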

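This time I crawled the second-hand housing listings across the whole Lianjia site (sort of). I can only retrieve 3,000 records per city, and a city certainly has more listings than that, so let's call it a pseudo full-site crawl.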
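
The 3,000-record ceiling lines up with the page range the code in this post crawls: pages 1 through 100 for each city. A minimal sketch of that arithmetic, assuming 30 listings per page (a figure inferred from 3,000 / 100, not stated in the post); `build_page_urls` is a hypothetical helper and the Beijing URL is only an example city link:

```python
def build_page_urls(city_url, first_page=1, last_page=100):
    """Return the ershoufang listing URLs for one city (hypothetical helper)."""
    return [
        '{}ershoufang/pg{}/'.format(city_url, page)
        for page in range(first_page, last_page + 1)
    ]

# Assumption: each result page shows 30 listings
LISTINGS_PER_PAGE = 30

urls = build_page_urls('https://bj.lianjia.com/')
max_records = len(urls) * LISTINGS_PER_PAGE
print(len(urls), max_records)
```

If the site stops paginating at page 100, no amount of threading gets past this limit; more records would need narrower queries per city.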

Libraries used: requests, lxml (XPath, for extracting elements), pandas (for saving data), and threading (multithreading).

import requests
from lxml import etree
import pandas as pd
from requests.exceptions import ConnectionError
from threading import Thread

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
}
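# Fetch a URL and return the parsed HTML tree, or None on failure
def get_url(url):
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.encoding = 'utf8'
        # Only parse the body when the request succeeded
        if r.status_code == 200:
            return etree.HTML(r.text)
        print('Unexpected status code:', r.status_code, url)
    except requests.RequestException as e:
        print('Fetch error:', e)
    return None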

def xpath_html(html):
    # Extract the fields we need from one listing page
    house_type = []
    big = []
    direction = []
    finish = []
    follow = []
    money = []
    name = html.xpath('.//a[@class="title"]//text()')
    district = html.xpath('//div[@class="houseInfo"]/a/text()')
    house_info = html.xpath('//div[@class="houseInfo"]/text()')
    sites = html.xpath('//div[@class="positionInfo"]/text()')
    site = html.xpath('//div[@class="positionInfo"]/a/text()')
    moneys = html.xpath('//div[@class="totalPrice"]//text()')
    unitPrice = html.xpath('//div[@class="priceInfo"]//div//span//text()')
    followInfo = html.xpath('//div[@class="followInfo"]//text()')
    # xpath() returns an empty list when nothing matches, so test the result
    crumbs_text = html.xpath('.//div[@class="crumbs fl"]//h1//a//text()')
    crumbs = crumbs_text[0].strip() if crumbs_text else 'null'

    for i in house_info:
        # Each entry is one combined string in the form a | b | c | d;
        # split it and pick out the individual fields.
        # Assumption: the fields appear in the order layout, area,
        # orientation, decoration. Missing fields are filled with 'null'.
        parts = [p.strip() for p in i.split('|') if p.strip()]
        parts += ['null'] * (4 - len(parts))
        house_type.append(parts[0])
        big.append(parts[1])
        direction.append(parts[2])
        finish.append(parts[3])
    for a in sites:
        follow.append(a.replace('-', ''))

    # The totalPrice text nodes are consumed in pairs (presumably the number and its unit)
    for b in range(0, len(moneys) - 1, 2):
        money.append(moneys[b] + moneys[b + 1])
    try:
        tp = pd.DataFrame({
            'name': name,
            'district': district,
            'type': house_type,
            'big': big,
            'direction': direction,
            'finish': finish,
            'follow': follow,
            'money': money,
            'site': site,
            'unitPrice': unitPrice,
            'followInfo': followInfo
        })
    except ValueError:
        # Raised when the lists differ in length, i.e. some listings have missing values.
        # I haven't found a real solution yet, but I don't want the crawl to stop,
        # so the page is skipped for now.
        print('Save failed')
        return

    try:
        tp.to_csv('D://爬虫爬的玩意//%s.csv' % crumbs, mode='a', encoding='utf8', index=False, header=False)
    except OSError:
        print('Save failed')

def main(html_l, start_url, end_url):
    # Collect the link for every city
    qgg = html_l.xpath('//div[@class="city_province"]/ul//li/a//@href')
    for index in qgg:
        for i in range(start_url, end_url):
            # Append the pagination suffix to page through each city
            url = index + 'ershoufang/pg{}/'.format(i)
            print('Page %s' % i)
            data = get_url(url)
            if data is None:
                # Nothing came back, e.g. the city has no ershoufang page
                print('Failed')
                continue
            xpath_html(html=data)

if __name__ == '__main__':
    # City selection page
    url_l = 'https://www.lianjia.com/city/'
    dete = get_url(url=url_l)
    thad = []
    # Each thread crawls a different range of pages for every city
    t1 = Thread(target=main, args=(dete, 1, 20))
    t2 = Thread(target=main, args=(dete, 20, 40))
    t3 = Thread(target=main, args=(dete, 40, 60))
    t4 = Thread(target=main, args=(dete, 60, 80))
    t5 = Thread(target=main, args=(dete, 80, 101))
    thad += [t1, t2, t3, t4, t5]
    for i in thad:
        i.start()
    for i in thad:
        i.join()
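
The five hand-written threads split pages 1 to 100 into fixed ranges. A minimal sketch of the same split using the standard library's `concurrent.futures.ThreadPoolExecutor` instead of `threading.Thread`; `split_pages` and `crawl_range` are hypothetical names, and `crawl_range` only records which pages it would visit so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor

def split_pages(first_page, last_page, chunks):
    """Split first_page..last_page (inclusive) into contiguous (start, stop) ranges, stop exclusive."""
    total = last_page - first_page + 1
    size, extra = divmod(total, chunks)
    ranges = []
    start = first_page
    for n in range(chunks):
        # The first `extra` chunks take one additional page each
        stop = start + size + (1 if n < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

def crawl_range(page_range):
    """Stand-in for main(): return the page numbers this worker would crawl."""
    start, stop = page_range
    return list(range(start, stop))

ranges = split_pages(1, 100, 5)
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(crawl_range, ranges))
visited = [page for chunk in results for page in chunk]
print(ranges)
```

The pool starts and joins the workers itself, so changing the number of workers is a one-argument change rather than another `Thread(...)` line.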

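Some cities have no second-hand housing (ershoufang) URL, so I added exception handling and replace everything with null in that case.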

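I added this much exception handling because there are too many bugs that I don't know how to fix, so I'll leave it like this for now and revise it once I've learned more.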
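
One possible way around the missing-value problem mentioned in the code comments is to loop over each listing node and query its fields relative to that node, so an absent field becomes 'null' for that row instead of making one list shorter than the others. A minimal sketch, assuming each listing sits in its own container element; the sample HTML, the `listing` class name, and the `first_text` helper are made up for illustration and are not taken from the Lianjia page:

```python
from lxml import etree

# Made-up HTML: the second listing has no price
SAMPLE_HTML = """
<html><body><ul>
  <li class="listing">
    <a class="title">Listing A</a>
    <div class="totalPrice"><span>300</span>万</div>
  </li>
  <li class="listing">
    <a class="title">Listing B</a>
  </li>
</ul></body></html>
"""

def first_text(node, path, default='null'):
    """Return the first stripped text match of a relative XPath, or a default."""
    found = node.xpath(path)
    return found[0].strip() if found else default

tree = etree.HTML(SAMPLE_HTML)
rows = []
for item in tree.xpath('//li[@class="listing"]'):
    # Paths start with './/' so they only search inside this listing
    rows.append({
        'name': first_text(item, './/a[@class="title"]/text()'),
        'money': first_text(item, './/div[@class="totalPrice"]/span/text()'),
    })
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, and every column then has the same length.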

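netCheney posted on 2019-8-26 21:22

This post was last edited by netCheney on 2019-8-26 21:28

3651535042 posted on 2019-8-25 15:56
Java, Python, web, Hadoop, MySQL: I've studied all of them, just not C
Universities are pretty impressive these days, haha. Not bad at all; that's much more interesting than the advanced C programming and the introduction to algorithms and basic structures we had back then. Do you still study advanced mathematics? For us it was a required course, and the exams gave me a headache... Still, a programming world without C is incomplete. Isn't Python's metaprogramming essentially still C? All languages are built on C's underlying architecture, CPython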

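来两碗米饭 posted on 2019-8-25 16:01

896749057 posted on 2019-8-25 15:55
Bro, I'd like to ask about scraping: say there is www.xxx.com/ddd/ddd
is it possible to enter www.xxx.com and have everything on that site ...

Anything you can find in the site's page source can be scraped. Which site do you want to crawl? Send it to me and I'll give it a try.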
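
As a rough illustration of that answer: links that appear in a page's source can be collected and then visited in turn. A minimal sketch using only the standard library (`html.parser` and `urllib.parse`) rather than lxml; `LinkCollector`, `same_site_links`, and the sample HTML are hypothetical, and www.xxx.com is the placeholder from the question:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href' and value:
                    self.hrefs.append(value)

def same_site_links(page_url, html_text):
    """Return absolute links that stay on page_url's host, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html_text)
    host = urlparse(page_url).netloc
    seen = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.append(absolute)
    return seen

# Made-up page source with two internal links, one external link, one duplicate
SAMPLE = (
    '<a href="/ddd/ddd">one</a>'
    '<a href="ddd/eee">two</a>'
    '<a href="http://other.example/x">out</a>'
    '<a href="/ddd/ddd">dup</a>'
)
links = same_site_links('http://www.xxx.com/', SAMPLE)
print(links)
```

Repeating this on each newly found link, while keeping track of visited URLs, walks the site. Links that only appear after JavaScript runs are not in the page source and would not be found this way.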

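skoa posted on 2019-8-25 14:43

I'm learning web scraping right now, so I'll study this.

shu_zzf posted on 2019-8-25 15:16

812290870 posted on 2019-8-25 15:20

As soon as I see English I can't take anything in~

netCheney posted on 2019-8-25 15:54

Universities teach Python now? Isn't it all Java and C?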

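896749057 posted on 2019-8-25 15:55

Bro, I'd like to ask about scraping: say there is www.xxx.com/ddd/ddd
is it possible to enter www.xxx.com and have every ddd/ddd/ddd on that site crawled?

来两碗米饭 posted on 2019-8-25 15:56

netCheney posted on 2019-8-25 15:54
Universities teach Python now? Isn't it all Java and C?

Java, Python, web, Hadoop, MySQL: I've studied all of them, just not C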

qq38455 posted on 2019-8-25 17:43

Is web scraping used much at work?

zfwl_666 posted on 2019-8-25 17:49

Thanks for sharing!


吾爱破解 - 52pojie.cn's Archiver
