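Why does the same code run on Windows 11 but fail on Linux?

1e3e · posted on 2023-6-14 17:02

new.xlsx contains a single column "id" with the values 1, 0, 3
old.xlsx contains a single column "id" with the values 1, 2, 3
The Windows 11 code is: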
import pandas as pd
from pathlib import Path

#define parameters
#path to files https://gist.github.com/VankatPe ... 42b6029c3b92f20862a
path_old=Path(r'D:\Program Files\Pycharm_professional_2019.3.3_Portable\bin\Pycharm\config\scratches\old.xlsx')
path_new=Path(r'D:\Program Files\Pycharm_professional_2019.3.3_Portable\bin\Pycharm\config\scratches\new.xlsx')
#list of key column(s)
key=['id']
#sheets to read in
sheet='Sheet1'

# Read in the two excel files and fill NA
old = pd.read_excel(path_old).fillna(0)
new = pd.read_excel(path_new).fillna(0)
#set index
old=old.set_index(key)
new=new.set_index(key)

#identify dropped rows and added (new) rows
dropped_rows = set(old.index) - set(new.index)
added_rows = set(new.index) - set(old.index)

#combine data
df_all_changes = pd.concat([old, new], axis='columns', keys=['old','new'], join='inner')

#prepare function for comparing old values and new values
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

#swap column indexes
df_all_changes = df_all_changes.swaplevel(axis='columns')[new.columns[0:]]

#apply the report_diff function
df_changed = df_all_changes.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

#create a list of text columns (int columns do not have '{} ---> {}')
df_changed_text_columns = df_changed.select_dtypes(include='object')

#create 3 datasets:
#diff - contains the differences
#dropped - contains the dropped rows
#added - contains the added rows
diff = df_changed_text_columns[df_changed_text_columns.apply(lambda x: x.str.contains("--->") == True, axis=1)]
dropped = old.loc[dropped_rows]
added = new.loc[added_rows]

#create a name for the output excel file
fname =  '{} vs {}.xlsx'.format(path_old.stem, path_new.stem)

#write dataframe to excel
writer=pd.ExcelWriter(fname, engine='xlsxwriter')
diff.to_excel(writer, sheet_name='diff', index=True)
dropped.to_excel(writer, sheet_name='dropped', index=True)
added.to_excel(writer, sheet_name='added', index=True)

#get xlsxwriter objects
workbook = writer.book
worksheet = writer.sheets['diff']
worksheet.hide_gridlines(2)
worksheet.set_default_row(15)

#get number of rows of the df diff
row_count_str=str(len(diff.index)+1)

#define and apply formats
highligt_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color':'#B1B3B3'})
worksheet.conditional_format('A1:ZZ'+row_count_str, {'type':'text', 'criteria':'containing', 'value':'--->',
                        'format':highligt_fmt})

#save the output
writer.save()
print ('\nDone.\n')

The Linux code is:
import pandas as pd
from pathlib import Path

#define parameters
#path to files https://gist.github.com/VankatPe ... 42b6029c3b92f20862a
path_old=Path(r'/home/asd/Downloads/test/old.xlsx')
path_new=Path(r'/home/asd/Downloads/test/new.xlsx')
#list of key column(s)
key=['id']
#sheets to read in
sheet='Sheet1'

# Read in the two excel files and fill NA
old = pd.read_excel(path_old).fillna(0)
new = pd.read_excel(path_new).fillna(0)
#set index
old=old.set_index(key)
new=new.set_index(key)

#identify dropped rows and added (new) rows
dropped_rows = set(old.index) - set(new.index)
added_rows = set(new.index) - set(old.index)

#combine data
df_all_changes = pd.concat([old, new], axis='columns', keys=['old','new'], join='inner')

#prepare function for comparing old values and new values
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

#swap column indexes
df_all_changes = df_all_changes.swaplevel(axis='columns')[new.columns[0:]]

#apply the report_diff function
df_changed = df_all_changes.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

#create a list of text columns (int columns do not have '{} ---> {}')
df_changed_text_columns = df_changed.select_dtypes(include='object')

#create 3 datasets:
#diff - contains the differences
#dropped - contains the dropped rows
#added - contains the added rows
diff = df_changed_text_columns[df_changed_text_columns.apply(lambda x: x.str.contains("--->") == True, axis=1)]
dropped = old.loc[dropped_rows]
added = new.loc[added_rows]

#create a name for the output excel file
fname =  '{} vs {}.xlsx'.format(path_old.stem, path_new.stem)

#write dataframe to excel
writer=pd.ExcelWriter(fname, engine='xlsxwriter')
diff.to_excel(writer, sheet_name='diff', index=True)
dropped.to_excel(writer, sheet_name='dropped', index=True)
added.to_excel(writer, sheet_name='added', index=True)

#get xlsxwriter objects
workbook = writer.book
worksheet = writer.sheets['diff']
worksheet.hide_gridlines(2)
worksheet.set_default_row(15)

#get number of rows of the df diff
row_count_str=str(len(diff.index)+1)

#define and apply formats
highligt_fmt = workbook.add_format({'font_color': '#FF0000', 'bg_color':'#B1B3B3'})
worksheet.conditional_format('A1:ZZ'+row_count_str, {'type':'text', 'criteria':'containing', 'value':'--->',
                        'format':highligt_fmt})

#save the output
writer.save()
print ('\nDone.\n')

Here is the key part: the error on Linux
Traceback (most recent call last):
  File "/home/asd/.config/JetBrains/PyCharmCE2020.1/scratches/scratch_27.py", line 25, in <module>
    df_all_changes = pd.concat([old, new], axis='columns', keys=['old','new'], join='inner')
  File "/home/asd/archiconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/asd/archiconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()
  File "/home/asd/archiconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 528, in get_result
    indexers[ax] = obj_labels.get_indexer(new_labels)
  File "/home/asd/archiconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3442, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

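nob · posted on 2023-6-14 17:03

The initial df1 has three columns, 'year', 'add', and 'tel_num'; df2 has two columns, 'group' and 'key_word'.
df1['add'] selects just the 'add' column of df1, giving the series ['北京西路2号','上海西路3号','广州西路4号','深圳西路5号',...,'宁夏新华街号'].
apply then runs match_group on each element of this series; the details follow.

First, the 'key_word' column of df2 is the series ['西藏','内蒙古','广州西路4号','广州','北京','上海','南京'].
match_group(x) finds the first entry of df2['key_word'] that occurs as a substring of the string x,
for example match_group('北京西路2号')='北京', match_group('广州西路4号')='广州西路4号', match_group('广州东路4号')='广州'.
If nothing matches it returns '未收录' ("not listed"), for example match_group('宁夏新华街号')='未收录'.

After match_group has processed each element, the series df1['add'] becomes ['北京','上海','广州西路4号','未收录',...,'未收录'], and this series is added to df1 as the 'group' column.
Ignore the commented-out statements; the last step writes df1 and df2 to files.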
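
The matching described above can be sketched as follows. This is only a minimal illustration, assuming match_group does a plain first-match substring search; the original function body is not shown in this thread, and the sample frames contain only the columns and values mentioned above.

```python
import pandas as pd

# Sample data taken from the description; the real df1/df2 have more columns and rows.
df1 = pd.DataFrame({"add": ["北京西路2号", "上海西路3号", "广州西路4号", "深圳西路5号", "宁夏新华街号"]})
df2 = pd.DataFrame({"key_word": ["西藏", "内蒙古", "广州西路4号", "广州", "北京", "上海", "南京"]})


def match_group(x):
    """Return the first entry of df2['key_word'] that is a substring of x."""
    for key_word in df2["key_word"]:
        if key_word in x:
            return key_word
    return "未收录"  # no key word matched


# Apply the matcher to every address and store the result as a new column.
df1["group"] = df1["add"].apply(match_group)
print(df1)
```

The order of df2['key_word'] matters here: '广州西路4号' is listed before '广州', so the longer entry wins for that address.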

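ma15803216102 · posted on 2023-6-14 17:10

ChatGPT:
Based on the error message, the problem appears to be caused by non-unique index values during reindexing. Check whether the old and new DataFrames use the same column names; if they do, rename the columns in one of them to ensure uniqueness. You can also try deduplicating the index column of both DataFrames so that every index value is unique, for example with df.drop_duplicates().

Specifically, you can pass drop_duplicates=True to set_index() to deduplicate the index column, as shown below: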

#set index and drop duplicates
old=old.set_index(key, drop_duplicates=True)
new=new.set_index(key, drop_duplicates=True)

winsphinx · posted on 2023-6-14 17:14

Check whether the pandas versions on Linux and on Windows are the same

pip show pandas
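
The same comparison can also be made from inside Python, which has the side benefit of showing which interpreter is actually running the script. A small sketch; run it on both machines and compare the output:

```python
import platform
import sys

import pandas as pd

# Collect the details that most often differ between two machines.
env_info = {
    "platform": platform.system(),
    "python": platform.python_version(),
    "executable": sys.executable,
    "pandas": pd.__version__,
}
for name, value in env_info.items():
    print(f"{name}: {value}")
```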

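1e3e · posted on 2023-6-14 17:20

This post was last edited by 1e3e on 2023-6-15 16:17

ma15803216102 posted on 2023-6-14 17:10
ChatGPT:
Based on the error message, the problem appears to be caused by non-unique index values during reindexing. Check whether the old and new ...

Changing the paths does not help either. Changing path_old=Path(r'/home/asd/Downloads/test/old.xlsx') to path_old=Path(r'old.xlsx') and path_new=Path(r'/home/asd/Downloads/test/new.xlsx') to path_new=Path(r'new.xlsx') still raises an error.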

1e3e · posted on 2023-6-14 17:21

winsphinx posted on 2023-6-14 17:14
Check whether the pandas versions on Linux and on Windows are the same

pip show pandas

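It seems to be a path problem. After I changed the path, the result is correct: changing path_new=Path(r'/home/asd/Downloads/test/new.xlsx') to path_new=Path(r'new.xlsx') gives the right output. The annoying part is that the Python script now has to sit in the same folder as old.xlsx.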
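
A relative path such as Path(r'new.xlsx') is resolved against the current working directory, so the relative and the absolute spelling may point at two different files. Below is a minimal sketch for checking which file each path refers to. The describe_path helper is hypothetical, and the folder is the one quoted in this post.

```python
from pathlib import Path


def describe_path(path):
    """Return the absolute location of path and whether a file exists there."""
    resolved = Path(path).resolve()
    return resolved, resolved.is_file()


print("working directory:", Path.cwd())
for candidate in (Path("new.xlsx"), Path("/home/asd/Downloads/test/new.xlsx")):
    resolved, exists = describe_path(candidate)
    print(f"{candidate} -> {resolved} (exists: {exists})")
```

If the two resolved paths differ, the script has been reading two different workbooks, and the copy under Downloads may be the one whose id values are not unique. To stop depending on the working directory, the files could be addressed relative to the script itself, for example with Path(__file__).resolve().parent / 'new.xlsx'.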

1e3e · posted on 2023-6-14 17:27

ma15803216102 posted on 2023-6-14 17:10
ChatGPT:
Based on the error message, the problem appears to be caused by non-unique index values during reindexing. Check whether the old and new ...

e/asd/archiconda3/lib/python3.7/site-packages/openpyxl/worksheet/header_footer.py:48: UserWarning: Cannot parse header or footer so it will be ignored
  warn("""Cannot parse header or footer so it will be ignored""")
Traceback (most recent call last):
  File "/home/asd/Downloads/test/定稿的代码.py", line 2694, in <module>
    old=old.set_index(key, drop_duplicates=True)
  File "/home/asd/archiconda3/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
TypeError: set_index() got an unexpected keyword argument 'drop_duplicates'
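
This TypeError is expected, because DataFrame.set_index has no drop_duplicates parameter. One way to get the intended effect is to call DataFrame.drop_duplicates(subset=key) before setting the index. The sketch below uses a made-up frame, and keeping the first row for each repeated id is an assumption that may not suit the real data.

```python
import pandas as pd

key = ['id']

# Made-up stand-in for a workbook that contains a repeated id.
old = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'c', 'd']})

# Keep the first row for each id, then use id as the index.
old = old.drop_duplicates(subset=key, keep='first').set_index(key)

print(old.index.is_unique)
print(old)
```

Dropping rows silently changes what the comparison reports, so it is worth looking at the duplicated rows first to decide which ones should survive.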

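nob · posted on 2023-6-14 18:34

Since changing the path works, I suggest investigating the path first.
If the path is definitely written correctly, read_excel may be the problem; check whether new and old actually contain the data you expect.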
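
One way to do that check is to print a short summary right after read_excel. A sketch, with in-memory frames standing in for the two workbooks (their values follow the first post); summarize is a hypothetical helper:

```python
import pandas as pd


def summarize(df, name):
    """Print and return the basic facts about a freshly loaded DataFrame."""
    info = {
        "name": name,
        "shape": df.shape,
        "columns": list(df.columns),
        "empty": df.empty,
    }
    print(info)
    print(df.head())
    return info


# Stand-ins for pd.read_excel(path_old) and pd.read_excel(path_new).
old = pd.DataFrame({"id": [1, 2, 3]})
new = pd.DataFrame({"id": [1, 0, 3]})

old_info = summarize(old, "old")
new_info = summarize(new, "new")
```

An unexpected shape or column list at this point would suggest that a different file, or a different sheet, was read than the one intended.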

一闪一闪233 · posted on 2023-6-14 18:36

You can check whether the 'id' column of either DataFrame contains repeated values, using the pandas duplicated function:

print(old.index.duplicated().any())
print(new.index.duplicated().any())

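These two lines print True or False, showing whether the index of each DataFrame contains repeated values. If it does, you need to decide how to handle the duplicates before the rest of your code can run. Possible approaches include removing the duplicated rows or creating a new, unique index.

Also, if this code runs on Windows 11 but not on Linux, the pandas versions on the two systems may differ, or the data may be handled in subtly different ways. Either way, keeping the index unique is good practice.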
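
To see which ids are repeated, and not only whether any are, the duplicated rows can be listed. A sketch on a made-up frame, since the real workbook contents are not shown in this thread:

```python
import pandas as pd

# Made-up stand-in for a workbook in which id 2 appears twice.
new = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "c", "d"]}).set_index("id")

# keep=False marks every occurrence of a repeated index value, not just the later ones.
duplicate_mask = new.index.duplicated(keep=False)
duplicate_rows = new[duplicate_mask]

print(new.index.is_unique)
print(duplicate_rows)
```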

lnshijia · posted on 2023-6-15 08:17

The error may be caused by the path
