How do I use a regex in Python to extract a date written in Chinese numerals?
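The link returns JSON: https://sf-item.taobao.com/json/get_notice_attach.htm?project_id=2958865. How can I use a regex to grab the date (year, month, day) on the last line of the content behind that link and convert it to an ordinary numeric date?

It is best to look at the HTML first and check whether the date sits in a tag, or carries a class, that makes it easy to extract. Once it is extracted, the conversion is simple:

CN_NUM = {'〇' : 0, '一' : 1, '二' : 2, '三' : 3, '四' : 4, '五' : 5, '六' : 6, '七' : 7, '八' : 8, '九' : 9, '零' : 0,'壹' : 1, '贰' : 2, '叁' : 3, '肆' : 4, '伍' : 5, '陆' : 6, '柒' : 7, '捌' : 8, '玖' : 9, '貮' : 2, '两' : 2}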
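I don't think a regex is needed here. Strip the blank lines at the end, then convert the last line character by character using the dictionary.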
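The no-regex idea above could be sketched roughly as follows. The function names `cn_to_int` and `last_line_date` and the sample text are made up for illustration, and the sketch assumes the last non-blank line has the form `<year>年<month>月<day>日`. Note that a plain character-by-character dictionary replacement is not quite enough, because 十 (ten) is positional: 十五 is 15, not "5", so it gets its own branch.

```python
# Hypothetical helper names; CN_NUM is a trimmed copy of the dictionary from the thread.
CN_NUM = {'〇': 0, '零': 0, '一': 1, '二': 2, '两': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}


def cn_to_int(text):
    """Convert a digit string such as 二〇二一, or a numeral below 100 such as 三十一."""
    if '十' in text:
        # 十 -> 10, 十五 -> 15, 二十 -> 20, 三十一 -> 31
        tens, _, ones = text.partition('十')
        return (CN_NUM[tens] if tens else 1) * 10 + (CN_NUM[ones] if ones else 0)
    # No 十: read the characters as plain digits.
    return int(''.join(str(CN_NUM[ch]) for ch in text))


def last_line_date(text):
    """Return (year, month, day) from the last non-blank line of text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    last = lines[-1]
    year, _, rest = last.partition('年')
    month, _, rest = rest.partition('月')
    day, _, _ = rest.partition('日')
    return cn_to_int(year), cn_to_int(month), cn_to_int(day)


# Made-up sample text, not the real page content.
sample = "拍卖公告\n\n某某人民法院\n二〇二一年七月十五日\n\n"
print(last_line_date(sample))
```

This only works if the date really is the whole last line; if other text shares that line, a regex is the safer choice.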
# -*- coding: utf-8 -*-
# author: Leemamas
# created: 2021/7/15 8:30
import re

import requests
from bs4 import BeautifulSoup

DIGITS = '〇一二三四五六七八九'


def crawl_data(url):
    """Fetch the page and return it parsed, or None if the request fails."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.64'
    }
    try:
        response = requests.get(url, headers=headers)
        return BeautifulSoup(response.text, 'lxml')
    except Exception as e:
        print(e)
        return None


def change(key):
    """Convert a Chinese-numeral string (below 100, or a digit run) to Arabic digits."""
    key = key.replace('零', '〇')
    if '十' in key:
        # 十 is positional: 十 -> 10, 十五 -> 15, 二十 -> 20, 三十一 -> 31
        tens, _, ones = key.partition('十')
        tens = DIGITS.index(tens) if tens else 1
        ones = DIGITS.index(ones) if ones else 0
        return str(tens * 10 + ones)
    # No 十: map digit by digit, e.g. 二〇二一 -> 2021
    return ''.join(str(DIGITS.index(c)) for c in key if c in DIGITS)


if __name__ == '__main__':
    url = 'https://sf-item.taobao.com/json/get_notice_attach.htm?project_id=2958865'
    html = crawl_data(url)
    all_p = html.find_all('p') if html else []
    # Capture year, month and day directly instead of slicing by position.
    pattern = r'([〇零一二三四五六七八九]{4})年([一二三四五六七八九十]{1,2})月([一二三四五六七八九十]{1,3})日'
    for p in all_p:
        data = p.text
        # re.search, not re.match: the date need not start the paragraph.
        match = re.search(pattern, data)
        if match:
            year, month, day = match.groups()
            print(data)
            print(change(year), change(month), change(day))
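Take an ordinary date-matching regex and replace the digit classes in its groups with the corresponding Chinese numerals, for example \d becomes [〇一二三四五六七八九]. All I know is that in Notepad++ the regex for finding Chinese characters is [一-龥!-~].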
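A minimal sketch of that substitution, with `\d` swapped for a Chinese-numeral character class. The names `CN_DIGIT`, `DATE_RE` and `find_cn_dates` are made up for illustration, and the month and day groups additionally allow 十, since months and days above nine are normally written with it (十二, 三十一). The sketch only extracts the three parts; converting them to numbers is a separate step.

```python
import re

# Hypothetical names; the character class stands in for \d.
CN_DIGIT = '〇零一二三四五六七八九'
DATE_RE = re.compile(
    rf'([{CN_DIGIT}]{{4}})年'                              # four-digit year, e.g. 二〇二一
    r'(十[一二]?|[一二三四五六七八九])月'                    # month: 一 .. 十二
    r'([二三]?十[一二三四五六七八九]?|[一二三四五六七八九])日'  # day: 一 .. 三十一
)


def find_cn_dates(text):
    """Return a (year, month, day) tuple of Chinese-numeral strings for each date found."""
    return [m.groups() for m in DATE_RE.finditer(text)]


# Made-up sample strings, not the real page content.
print(find_cn_dates('落款:二〇二一年七月十五日'))
print(find_cn_dates('二〇二〇年十二月三十一日'))
```

The day group is loose rather than strict (it would also accept 三十九), which is usually fine for extraction; validate the converted numbers afterwards if it matters.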