I've been learning Python for a little over a month, so it's time to put it to the test

全村儿人希望 posted on 2018-11-9 00:12

A Taobao product search crawler written with Scrapy. It's not well written, so please go easy on me.

# -*- coding: utf-8 -*-

import csv
import json

import scrapy
from Taobao.items import TaobaoItem
# URL-encode the search keyword
from urllib.parse import quote


class TaobaoSpider(scrapy.Spider):
    name = 'taobao'
    # allowed_domains = ['taobao.com']
    page = input('Enter the number of pages to crawl: ')
    Quote = input('Enter the product name to search for: ')
    # Search endpoint; the two {} placeholders are the page number and the URL-encoded keyword
    search_url = 'https://ai.taobao.com/search/getItem.htm?_tb_token_=e3d450b1e33e&__ajax__=1&pid=mm_33793785_3431230_471812702&unid=&clk1=&page={}&pageSize=60&pvid=200_11.224.194.119_358_1541678031255&squareFlag=&sourceId=search&ppathName=&supportCod=&city=&ppath=&dc12=&pageNav=false&itemAssurance=&fcatName=&price=&cat=&from=&tmall=&key={}&fcat=&ppage=0&debug=false&maxPageSize=200&sort=&exchange7=&custAssurance=&postFree=&npx=50&location='

    def start_requests(self):
        # Write the CSV header once, before any page is parsed
        with open('{}.csv'.format(self.Quote), 'w', encoding='utf-8', newline='') as f:
            csv.writer(f).writerow(['name', 'price', 'shop'])

        # Request every page from 1 up to the requested page count, inclusive
        for n in range(1, int(self.page) + 1):
            url = self.search_url.format(n, quote(self.Quote))
            yield scrapy.Request(url, self.parse, meta={'page': n})

    def parse(self, response):
        js = json.loads(response.body)['result']['auction']

        # Append, so each page adds rows instead of overwriting the file
        with open('{}.csv'.format(self.Quote), 'a', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            for text in js:
                # csv.writer quotes any field that contains a comma
                writer.writerow([text['description'], text['realPrice'], text['nick']])

                item = TaobaoItem()
                item['origPicUrl'] = 'https:' + text['origPicUrl']
                yield item

        print('=' * 40 + 'Page {} downloaded'.format(response.meta['page']) + '=' * 40)
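The paging and keyword encoding in the spider can also be tried on their own, outside Scrapy. Below is a minimal sketch: the helper names `build_search_url` and `build_page_urls` are made up for illustration, and only a small subset of the query parameters from the URL above is kept. Whether the endpoint accepts such a reduced set is an assumption and has not been checked.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the spider; the parameter subset below is an assumption
BASE_URL = 'https://ai.taobao.com/search/getItem.htm'


def build_search_url(keyword, page, page_size=60):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        '__ajax__': 1,
        'page': page,
        'pageSize': page_size,
        'key': keyword,
    }
    # urlencode percent-encodes non-ASCII values as UTF-8 by default;
    # quote_via=quote encodes spaces as %20 instead of '+'
    return BASE_URL + '?' + urlencode(params, quote_via=quote)


def build_page_urls(keyword, page_count):
    """Return URLs for pages 1..page_count, inclusive (hypothetical helper)."""
    return [build_search_url(keyword, n) for n in range(1, page_count + 1)]


if __name__ == '__main__':
    for url in build_page_urls('python编程', 3):
        print(url)
```

Building every URL from the same template keeps the keyword and the page number in sync, so no page ends up with a hard-coded search term.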

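The product names in the JSON source are often a mix of English and Chinese, and I don't know how to extract just the Chinese. Could anyone give me a pointer or two?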

squall007 posted on 2018-11-9 08:20

Last edited by squall007 on 2018-11-9 09:00

That field is the product description, not the product name. Use a regex substitution to strip the tags:

import re
desc='低至6895起/苹果x/送壳膜Apple/苹果 <span class=H>iPhone</span> X 全网通4G手机国行正品10苹果x <span class=H>iphone</span>x 3/6/12期分期'
desc=re.sub(r'<.+?>','',desc)
print(desc)
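If only the Chinese characters are wanted, as the original question asked, the same idea can go one step further. This is a rough sketch rather than a definitive answer: the function names `strip_tags` and `chinese_only` are made up here, and "Chinese" is assumed to mean the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), which excludes digits, Latin letters, and punctuation.

```python
import re

# Matches an HTML-like tag such as <span class=H> or </span>
TAG_RE = re.compile(r'<.+?>')
# Matches runs of characters in the basic CJK Unified Ideographs block
CJK_RE = re.compile(r'[\u4e00-\u9fff]+')


def strip_tags(text):
    """Remove tags and keep everything else."""
    return TAG_RE.sub('', text)


def chinese_only(text):
    """Remove tags, then keep only the Chinese characters."""
    return ''.join(CJK_RE.findall(strip_tags(text)))


if __name__ == '__main__':
    desc = '苹果 <span class=H>iPhone</span> X 全网通4G手机'
    print(strip_tags(desc))    # 苹果 iPhone X 全网通4G手机
    print(chinese_only(desc))  # 苹果全网通手机
```

Note that keeping only Chinese characters also drops model names and numbers such as "iPhone X" or "4G", so stripping just the tags is probably the more useful result for a product listing.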

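全村儿人希望 posted on 2018-11-17 07:42

lupeng-1985 posted on 2018-11-16 14:27
You picked it up really fast in one month. Any advice? I've been learning for almost a month too, but I work during the day and only study a bit over an hour in the evening, so far I can only do the simplest ...

I'm also learning while holding down a job. My work is fairly light, so I follow along with videos at the office and practice on my own computer after I get home.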

fx9156 posted on 2018-11-9 00:17

Did you buy a book or watch videos?

威风的黑龙 posted on 2018-11-9 00:30

Is the extracted data actually useful for anything?

全村儿人希望 posted on 2018-11-9 00:30

fx9156 posted on 2018-11-9 00:17
Did you buy a book or watch videos?

I taught myself from e-books and videos.

全村儿人希望 posted on 2018-11-9 00:31

威风的黑龙 posted on 2018-11-9 00:30
Is the extracted data actually useful for anything?

Not much, really.

coradong1985 posted on 2018-11-9 00:51

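虚无空幻 posted on 2018-11-9 01:29

As for Python, I think it's better used for plotting data. For scraping, C++ with curl is enough for me personally. Scraping always seems to come down to the same few lines of code, with nothing much different....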

tfrist posted on 2018-11-9 01:56

Here's some encouragement.

差池 posted on 2018-11-9 05:03

Regular expressions.

Fade posted on 2018-11-9 05:11

Let's both keep at it.


吾爱破解 - 52pojie.cn's Archiver
