A beginner's first web crawler in Python: I don't know how to save the scraped content, please advise

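dsct3003 posted on 2020-6-2 08:44

This post was last edited by dsct3003 on 2020-6-2 22:08

This is the first web crawler I have written while teaching myself Python; it scrapes real-estate listings. The fields I want are printed correctly, but I can't get them saved to a file. Mainly, I don't know how to write a dict to a text file, so I'm hoping someone on the forum can help. Since this is my first attempt, the code structure is also rather messy, so please bear with me.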

# coding: utf-8
import os

import requests
from bs4 import BeautifulSoup

Headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

# Create a folder for the saved Lianjia listings if it does not exist yet
os.makedirs('./链家房产信息', exist_ok=True)

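def get_pega_urls(url):
    """Scrape every listing linked from one result page and return the records."""
    respans = requests.get(url=url, headers=Headers)
    respans.encoding = respans.apparent_encoding
    soup = BeautifulSoup(respans.text, 'lxml')
    # Each <a class="noresultRecommend"> tag holds the URL of one listing's detail page
    urls = soup.find_all('a', {'class': 'noresultRecommend'})
    records = []
    for i in urls:
        pega_urls = i['href']
        yemian = pega_info(pega_urls)
        print(yemian)
        records.append(yemian)
    return records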

def pega_info(url):
    """Scrape one listing's detail page into a dict."""
    yemian = {}
    respan = requests.get(url, headers=Headers)
    respan.encoding = respan.apparent_encoding
    soup = BeautifulSoup(respan.text, 'lxml')

    # select() returns a list of tags, so chaining .select().get_text() on it fails;
    # select_one() returns the first matching tag instead.
    xqmz = soup.select_one('div.communityName a').get_text(strip=True)
    yemian['小区名称: '] = xqmz

    # Assumption: qian1 and qian2 were the first two <span> tags inside div.price
    price = soup.select_one('div.price')
    qian = ''.join(span.get_text(strip=True) for span in price.select('span')[:2])
    yemian['价格: '] = qian

    # As posted, every field read the same <li> (the list indexes are missing).
    # Assumption: the <li> items appear in the order the fields were assigned.
    labels = ['房屋户型: ', '所在楼层: ', '建筑面积: ', '户型结构: ',
              '套内面积: ', '建筑类型: ', '房屋朝向: ', '建筑结构: ',
              '装修情况: ', '梯户比例: ', '供暖方式: ', '配备电梯: ']
    jbxx = soup.select_one('div.m-content')
    for label, li in zip(labels, jbxx.select('li')):
        yemian[label] = li.get_text(strip=True)

    return yemian

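def spider():
    base_url = 'https://zz.lianjia.com/ershoufang/pg{}/'
    for i in range(1, 8):  # range(1, 8) covers pages 1 to 7
        url = base_url.format(i)
        print('=*' * 30)
        print('开始爬取第' + str(i) + '页')  # Page marker, printed before the page is crawled
        pega_urls = get_pega_urls(url)  # Result for this page (not saved anywhere yet)


if __name__ == '__main__':
    spider()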

no-problem posted on 2020-6-2 09:41

Huh? Just write the values you scraped straight into a text file.
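
A minimal sketch of what writing the values as text could look like, assuming each listing is a dict shaped like `yemian` from `pega_info`. The helper name `save_as_text`, the sample record, and the output path are made up for illustration; the key point is that each key and value is formatted as a string before `write()` is called.

```python
import os
import tempfile


def save_as_text(records, path):
    """Append each record to a UTF-8 text file, one key/value pair per line."""
    with open(path, 'a', encoding='utf-8') as f:
        for record in records:
            for key, value in record.items():
                # write() only accepts str, so format the pair explicitly
                f.write(f'{key}{value}\n')
            f.write('\n')  # blank line between listings


# Hypothetical sample data standing in for the dicts returned by pega_info()
sample = [{'小区名称: ': '示例小区', '价格: ': '100万'}]
out_path = os.path.join(tempfile.mkdtemp(), 'listings.txt')
save_as_text(sample, out_path)

with open(out_path, encoding='utf-8') as f:
    text = f.read()
print(text)
```

Opening the file in append mode (`'a'`) means repeated calls, for example one per crawled page, add to the same file instead of overwriting it.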

隋戈子 posted on 2020-6-2 09:41

You could tidy up the formatting of your post first.
https://img.vim-cn.com/ce/907fe671d590d55c24330501653d535eac4588.png

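顺其自然1231 posted on 2020-6-2 09:41

I'd suggest looking at the numpy and pandas modules. pandas is built on top of numpy and makes reading, processing, and saving files quick and convenient.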

Zeaf posted on 2020-6-2 09:43

I'd suggest using the pandas library and outputting a table.
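
A rough sketch of the pandas route, assuming pandas is installed and the scraped listings have been collected into a list of dicts. The sample records and the file name are invented for illustration; only `pd.DataFrame`, `DataFrame.to_csv`, and `pd.read_csv` are real pandas calls.

```python
import os
import tempfile

import pandas as pd

# Hypothetical records standing in for the dicts returned by pega_info()
records = [
    {'小区名称: ': '示例小区A', '价格: ': '100万'},
    {'小区名称: ': '示例小区B', '价格: ': '120万'},
]

# A list of dicts becomes a table: one row per dict, one column per key
df = pd.DataFrame(records)

out_path = os.path.join(tempfile.mkdtemp(), 'listings.csv')
# utf-8-sig writes a BOM, which helps Excel detect the encoding of Chinese text
df.to_csv(out_path, index=False, encoding='utf-8-sig')

# Read the file back to confirm the round trip
loaded = pd.read_csv(out_path, encoding='utf-8-sig')
print(loaded)
```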

dsct3003 posted on 2020-6-2 09:50

no-problem posted on 2020-6-2 09:41
Huh? Just write the values you scraped straight into a text file.

I tried that, and it doesn't work!
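
One possible reason, and this is only a guess since the failing attempt isn't shown: `f.write()` accepts only `str`, so passing the dict itself raises `TypeError`. A minimal sketch of one workaround is to serialize each dict with `json.dumps` and write one JSON object per line. The sample record and the file name are invented for illustration.

```python
import json
import os
import tempfile

# Hypothetical record standing in for one dict returned by pega_info()
record = {'小区名称: ': '示例小区', '价格: ': '100万'}
out_path = os.path.join(tempfile.mkdtemp(), 'listings.jsonl')

with open(out_path, 'w', encoding='utf-8') as f:
    try:
        f.write(record)  # passing a dict directly is rejected
    except TypeError as exc:
        error = str(exc)
    # ensure_ascii=False keeps the Chinese text readable in the file
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the file back: each line parses into the original dict
with open(out_path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f]
print(error)
print(loaded)
```

One JSON object per line keeps the file easy to append to while crawling and easy to parse again later.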

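dsct3003 posted on 2020-6-2 09:52

隋戈子 posted on 2020-6-2 09:41
You could tidy up the formatting of your post first.

This is my first post and I have no experience with it. I'll be more careful next time.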

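知心 posted on 2020-6-2 09:56

dsct3003 posted on 2020-6-2 09:52
This is my first post and I have no experience with it. I'll be more careful next time.

I'd suggest the OP edit and reformat the post; otherwise the experts may not come to help. When you ask others for help, you shouldn't create extra work for them.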

chinaqin posted on 2020-6-2 09:57

Just use CSV, it's so much simpler.
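
A minimal sketch of the CSV approach using only the standard library's `csv.DictWriter`, assuming every listing dict has the same keys. The helper name `save_as_csv`, the sample rows, and the output path are hypothetical.

```python
import csv
import os
import tempfile


def save_as_csv(records, path):
    """Write a list of dicts to a CSV file, using the first dict's keys as the header."""
    fieldnames = list(records[0].keys())
    # newline='' is what the csv module documentation recommends when opening the file
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


# Hypothetical sample data standing in for the dicts returned by pega_info()
sample = [
    {'房屋户型: ': '3室2厅', '建筑面积: ': '120㎡'},
    {'房屋户型: ': '2室1厅', '建筑面积: ': '89㎡'},
]
out_path = os.path.join(tempfile.mkdtemp(), 'listings.csv')
save_as_csv(sample, out_path)

# Read the file back to check what was written
with open(out_path, newline='', encoding='utf-8-sig') as f:
    rows = list(csv.DictReader(f))
print(rows)
```

In the crawler, this could be called with a list of the `yemian` dicts collected while crawling, and the path could point inside the `./链家房产信息` folder.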

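知心 posted on 2020-6-2 09:57

You can search Baidu for how to write to a spreadsheet in Python, then put the scraped content into it. Organize the scraped content as a list of tuples or dicts, then loop over it to write it out.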

吾爱破解 - 52pojie.cn's Archiver
