A newbie's experience scraping school names and contact details
After I finished the tender-information crawler last time, my boss now wants the contact numbers and addresses of every school in a certain city in Shandong (no idea what for). A quick search showed there are far too many to copy by hand, so I found a website and wrote a crawler. The structure is simple: use lxml (XPath) to pull out the fields and save them to a CSV. I did hit a problem along the way: at first I used zip to combine the fields (they have to line up as CSV rows, after all), but the result felt off after the crawl, and when I compared it against the website the counts didn't match. The first attempt is below.
import requests
from lxml import etree
import csv
import time

def xuexiao(url):
    headers = {
        'cookie': 'UM_distinctid=17356d24bdd81f-051dfa8e743e7a-3e3e5e0e-1fa400-17356d24bde729; CNZZDATA1277745904=410363986-1594885714-%7C1594885714; __gads=ID=f0a866948e66ec5f:T=1594889512:S=ALNI_MaAkOE8thLRl4YlAsBqXH9vt-esng',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }
    time.sleep(1)  # be polite: pause a second before each request
    r = requests.get(url, headers=headers).text
    html = etree.HTML(r)
    # mingcheng = school name, dianhua = phone, dizhi = address, youbian = postal code
    mingcheng = html.xpath('/html/body/div/div/div/div/div/h4/a/text()')
    dianhua = html.xpath('/html/body/div/div/div/div/div/text()')
    dizhi = html.xpath('/html/body/div/div/div/div/div/text()')
    youbian = html.xpath('/html/body/div/div/div/div/div/text()')
    # pair the four lists up by position so each tuple is one school's row
    xinxi = zip(mingcheng, dianhua, dizhi, youbian)
    with open('学校.csv', 'a+', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(xinxi)

if __name__ == '__main__':
    for i in range(1, 16):
        if i == 1:
            url = 'https://www.ruyile.com/xuexiao/?a=226&t=3'
            xuexiao(url)
        if i != 1:
            url = 'https://www.ruyile.com/xuexiao/?a=226&t=3&p=' + str(i)
            xuexiao(url)
Looking at the site more carefully, some schools don't list a postal code, so that XPath comes back as an empty list and zip goes wrong. Then I took a bathroom break, spent about three minutes thinking it over, and reworked it to loop over each school's block instead. Surprisingly, it worked. A quick demo of the zip pitfall is just below, then the working code.
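To make the zip pitfall concrete, here is a tiny standalone demo (the school names and postal codes are made up, not data scraped from the site): zip pairs items strictly by position, so one missing postal code shifts every later pairing and the leftover names are silently dropped.

# made-up data: suppose the second school has no postal code on the page
mingcheng = ['学校A', '学校B', '学校C']   # three school names
youbian = ['277100', '277300']            # but only two postal codes were scraped

# zip pairs by position: 学校B is wrongly matched with 学校C's postal code,
# and 学校C falls off the end of the output entirely
print(list(zip(mingcheng, youbian)))
# [('学校A', '277100'), ('学校B', '277300')]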
import requests
from lxml import etree
import csv
import time

def xuexiao(url):
    headers = {
        'cookie': 'UM_distinctid=17356d24bdd81f-051dfa8e743e7a-3e3e5e0e-1fa400-17356d24bde729; CNZZDATA1277745904=410363986-1594885714-%7C1594885714; __gads=ID=f0a866948e66ec5f:T=1594889512:S=ALNI_MaAkOE8thLRl4YlAsBqXH9vt-esng',
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    }
    time.sleep(1)  # pause a second before each request
    r = requests.get(url, headers=headers).text
    html = etree.HTML(r)
    # walk the school blocks one at a time (div[3] through div[21] on each page)
    for i in range(3, 22):
        mingcheng = html.xpath('/html/body/div/div/div/div/div[' + str(i) + ']/h4/a/text()')
        dianhua = html.xpath('/html/body/div/div/div/div/div[' + str(i) + ']/text()')
        dizhi = html.xpath('/html/body/div/div/div/div/div[' + str(i) + ']/text()')
        youbian = html.xpath('/html/body/div/div/div/div/div[' + str(i) + ']/text()')
        # concatenate whatever fields this one school actually has into a single flat row
        xinxi = mingcheng + dianhua + dizhi + youbian
        if xinxi != []:
            with open('学校.csv', 'a+', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(xinxi)  # write this school as one CSV row

if __name__ == '__main__':
    for i in range(1, 16):
        if i == 1:
            url = 'https://www.ruyile.com/xuexiao/?a=226&t=3'
            xuexiao(url)
        if i != 1:
            url = 'https://www.ruyile.com/xuexiao/?a=226&t=3&p=' + str(i)
            xuexiao(url)
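One small refinement I would consider on top of this (my own suggestion, not something from the script above): open the CSV once per page and reuse the writer, instead of reopening the file for every single school. A rough standalone sketch with a hypothetical helper:

import csv

def save_rows(rows, path='学校.csv'):
    # hypothetical helper: append all non-empty rows with a single open/close,
    # rather than opening 学校.csv once per school inside the XPath loop
    with open(path, 'a+', newline='') as f:
        writer = csv.writer(f)
        for row in rows:
            if row:  # skip empty extractions, same idea as `if xinxi != []`
                writer.writerow(row)

# usage idea: collect one row per school block for the whole page, then
# save_rows(rows) once at the end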
音羽白离 posted on 2020-7-17 11:04:
As a newbie I'd like to know why the CSV your code saves doesn't come out as garbled Chinese, while mine is always garbled when I save with csv.
Blind guess: it's the encoding of the site you're scraping.
Haha, my first crawl was a ranking of Chinese universities by computer-science strength. Now that I've settled into a job I find I've forgotten most of it, rough going. Still, for a newbie, getting it working the first time really does feel like an achievement.
I can't read it, but it looks impressive.
I'll study this.
I recognise every one of these letters, but I can't make out what they say ;www
Thanks for sharing, I'll give it a run.
A buddy from Zaozhuang?
I can tell! The bathroom deserves most of the credit here.
I can use this, I'll try it out.
Programmers should be encouraged to take a bathroom trip whenever they're stuck :lol
A question: how did you learn all this, OP? I'm very interested too, hoping for a reply.
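Following up on the mojibake question quoted above, a minimal sketch of the usual fix, assuming the garbling happens either when decoding the page or when Excel opens the CSV (both are assumptions, not verified against that poster's code; the user-agent and example row here are placeholders):

import csv
import requests

url = 'https://www.ruyile.com/xuexiao/?a=226&t=3'
r = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
# let requests detect the page's real encoding from its content,
# rather than trusting the response headers blindly
r.encoding = r.apparent_encoding

# 'utf-8-sig' writes a BOM so Excel on Windows recognises the file as UTF-8
# and the Chinese text is not garbled when the CSV is opened there
with open('学校.csv', 'a+', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['示例学校', '0632-1234567'])  # made-up row, just to show the write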