To find a suitable place to rent, I scraped 3,000+ listings from Lianjia (链家)

18732970707 posted on 2019-11-20 09:07

Last edited by 18732970707 on 2019-11-20 09:10

Beijing is so big that there must be a home in it that suits me.

I. Choose the target site

Lianjia (链家): https://bj.lianjia.com/
https://img.hacpai.com/file/2019/11/image-3361319e.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
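Click [租房] (Rent) to open the rental home page:
https://img.hacpai.com/file/2019/11/image-fd321ba6.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
This is the home page to scrape.

II. Scrape a single page first

1. Analyze the page

Right-click the link of any listing and choose [检查] (Inspect), as shown: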
https://img.hacpai.com/file/2019/11/image-afe6a729.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
This opens the browser's developer tools, where the link inside the a tag is visible:
https://img.hacpai.com/file/2019/11/image-79e0b158.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
XPath can extract this link. The link is only the path portion of the real URL, however, so it has to be appended to the site's domain to form the full URL:
https://img.hacpai.com/file/2019/11/image-3e1b985c.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
This step appears in the source code later on.
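
As a minimal sketch of that step: the post joins the two halves with plain string concatenation, while the standard library's urljoin does the same job and also copes with links that are already absolute. The helper name to_full_url and the sample href are made up for illustration; real hrefs come from the a tag on the list page.

```python
from urllib.parse import urljoin

BASE_URL = "https://bj.lianjia.com"  # site root taken from the post


def to_full_url(href):
    """Turn a relative listing href into an absolute URL."""
    return urljoin(BASE_URL, href)


# Hypothetical href, invented for this example.
print(to_full_url("/zufang/EXAMPLE123.html"))
```
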
For now, only the fields marked in the screenshot below are scraped:
https://img.hacpai.com/file/2019/11/image-89f4689b.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
They are the title, release time, price, room layout, and area.

III. Scrape all pages

1. Analyze the page URLs
https://img.hacpai.com/file/2019/11/image-92c1e257.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
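Click 租房 (Rent) and note the page it leads to: https://bj.lianjia.com/zufang/
Yes, this is the home page to scrape:
https://img.hacpai.com/file/2019/11/image-2901ad17.png?imageView2/2/w/1280/format/jpg/interlace/1/q/100
Scroll to the bottom and click the next page or any other page number: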
Page 1: https://bj.lianjia.com/zufang/pg1/#contentList
Page 2: https://bj.lianjia.com/zufang/pg2/#contentList
Page 3: https://bj.lianjia.com/zufang/pg3/#contentList
...
Page 100: https://bj.lianjia.com/zufang/pg100/#contentList

Comparing the URLs shows the pattern: only the number after pg changes from page to page, and it equals the page number.
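
The pattern can be sketched as a small helper that builds every list-page URL up front. The function name build_page_urls and its parameters are assumptions for illustration; the URL template and the 1 to 100 range are the ones observed above.

```python
def build_page_urls(first_page=1, last_page=100):
    """Return the list-page URLs for pages first_page..last_page inclusive."""
    template = "https://bj.lianjia.com/zufang/pg{}/#contentList"
    return [template.format(n) for n in range(first_page, last_page + 1)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```

Note that range excludes its upper bound, hence the last_page + 1.
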
Concatenating that number into the URL inside a loop is therefore enough to scrape every page.

IV. Source code

Development tool: PyCharm
Python version: 3.7.2

import requests
from lxml import etree


# Reusable helper: takes a URL and returns the response body as text.
def get_one_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    print("Request status code != 200; check the URL.")
    return None


# Loop over pages 1 to 100 (range stops before 101).
for number in range(1, 101):
    # Build the URL of each list page by string concatenation.
    initialize_url = "https://bj.lianjia.com/zufang/pg" + str(number) + "/#contentList"

    html_result = get_one_page(initialize_url)   # fetch the list page as text
    if html_result is None:                      # skip pages that failed to load
        continue
    html_xpath = etree.HTML(html_result)         # parse into a tree that supports XPath
    # Extract the listing links on this page (30 listings per page).
    page_url = html_xpath.xpath("//div[@class='content w1150']/div[@class='content__article']//div[@class='content__list']/div/a[@class='content__list--item--aside']/@href")

    for i in page_url:                           # visit each listing link
        true_url = "https://bj.lianjia.com" + i  # full URL of the listing detail page
        true_html = get_one_page(true_url)       # fetch the detail page as text
        if true_html is None:
            continue
        true_xpath = etree.HTML(true_html)       # parse into a tree that supports XPath

        # Title of the listing detail page.
        title = true_xpath.xpath("//div[@class='content clear w1150']/p[@class='content__title']//text()")
        # Release time: join the text nodes, then split on the full-width colon.
        release_date = true_xpath.xpath("//div[@class='content clear w1150']//div[@class='content__subtitle']//text()")
        release_date_result = "".join(release_date).strip().split("：")
        # Price.
        price = true_xpath.xpath("//div[@class='content clear w1150']//p[@class='content__aside--title']/span//text()")
        # Room layout.
        house_type = true_xpath.xpath("//div[@class='content clear w1150']//ul[@class='content__aside__list']//span//text()")
        # Area. NOTE: the post uses the same XPath as for house_type, so both
        # variables hold the same list; layout and area must be picked out of it.
        acreage = true_xpath.xpath("//div[@class='content clear w1150']//ul[@class='content__aside__list']//span//text()")

        line = str(title) + " --- " + str(release_date_result) + " --- " + str(price) + " --- " + str(house_type) + " --- " + str(acreage)
        print(line)

        # Append the record to a text file (UTF-8 so Chinese text is written safely).
        with open(r"E:\admin.txt", 'a', encoding='utf-8') as f:
            f.write(line + "\n")

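Finally, each scraped record is printed and written to the text file as it is collected. The records could equally be written to a JSON file or stored in a database.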
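
As a rough sketch of the JSON option, one possibility is to append each listing as one JSON object per line. The helper names and the sample record are made up for illustration; the field names follow the variables in the source code, and the example writes to a temporary directory rather than to E:\admin.txt.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one listing to `path` as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Load every JSON line back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up sample record; real values would come from the XPath results.
sample = {
    "title": "示例房源",
    "release_date": "2019-11-18",
    "price": "5000",
    "house_type": "2室1厅",
    "acreage": "60㎡",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "listings.jsonl")
    append_record(out_path, sample)
    append_record(out_path, sample)
    loaded = read_records(out_path)

print(len(loaded))
```

Appending line by line matches the post's approach of writing while scraping, so an interrupted run still leaves the earlier records readable.
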
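Oh, and one last thing: I forgot to mention forward proxies.

A forward proxy is only needed for fear of getting banned. You can set one up yourself or buy one online. That said, this data is not intrusive and was collected purely to make it easier for me to browse rental listings and find a suitable place. If it infringes on anything, please contact me and I will delete the post.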
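
The post does not show how the proxy was wired in. A minimal sketch, assuming the same requests library as the source code, is to pass a proxies mapping to requests.get. The helper names and the proxy address below are hypothetical.

```python
def build_proxies(proxy_url):
    """Return the mapping that requests accepts as its `proxies` argument."""
    return {"http": proxy_url, "https": proxy_url}


def get_via_proxy(url, proxy_url, timeout=10):
    """Fetch `url` through the forward proxy at `proxy_url`."""
    # Imported here so build_proxies() works even where requests is not installed.
    import requests

    return requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)


# Hypothetical local proxy address, used only for illustration.
PROXY_URL = "http://127.0.0.1:8888"
print(build_proxies(PROXY_URL))
```

In the scraper, get_one_page could call requests.get with this proxies mapping in addition to its headers.
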

特约贵宾 posted on 2019-11-20 10:29

8814202 posted on 2019-11-20 09:26
Why not use Q房网?
-------- Regards from a Q房网 employee

I had a look at that site, and the listings really aren't great.

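川哥 posted on 2019-11-20 09:39

morgen1210 posted on 2019-11-20 09:37
Most of these are fake listings posted by agents. When you go to see the place, the price is no longer the same. Traps everywhere.

Lianjia really doesn't have fake listings. At the very least, they wouldn't dare post them online.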

爱生活爱拉芳 posted on 2019-11-20 09:17

Buddy, you posted this in the wrong section, didn't you?

燃烧我的卡路里 posted on 2019-11-20 09:20

Did you find a suitable place in the end?

8814202 posted on 2019-11-20 09:26

Last edited by 8814202 on 2019-11-20 09:28

Why not use Q房网?
-------- Regards from a Q房网 employee

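morgen1210 posted on 2019-11-20 09:37

Most of these are fake listings posted by agents. When you go to see the place, the price is no longer the same. Traps everywhere.

liuhongyan posted on 2019-11-20 09:37

For renting I'd still go with Ziroom (自如). It costs a bit more, but it's comfortable.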

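lzqlaj posted on 2019-11-20 09:41

morgen1210 posted on 2019-11-20 09:37
Most of these are fake listings posted by agents. When you go to see the place, the price is no longer the same. Traps everywhere.

That's exactly how it is.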

coolcalf posted on 2019-11-20 09:43

Python really does keep crawler code short and gets results quickly.

溜溜球 posted on 2019-11-20 09:50

Nice, nice.


吾爱破解 - 52pojie.cn's Archiver
