Exploding multiple list columns of unequal length in pandas

Problem description:

I have a CSV file that may contain up to 300 columns. About 50 of them are list parameters (space-separated values), and the lists can be of uneven size, including having no values at all.

Time                     COL1                      COL2
2020-03-13 10:43:00.500  0.0 10.0 20.0 30.0 40.0   50.0 60.0 70.0 80.0 90.0
2020-03-13 10:43:00.900  10.0 20.0 30.0 70.0       10.0 20.0
2020-03-13 10:44:00.100  10.0 20.0 30.0 70.0

I want to:
1. Explode the list columns into rows, based on the highest frequency in each row. For example, the max frequency (number of space-separated values) is 5 in the 1st row and 4 in the 2nd row.

Time                  COL1      COL2
2020-03-13 10:43:00.500   0.0    50.0
2020-03-13 10:43:00.580   10.0   60.0
2020-03-13 10:43:00.660   20.0   70.0
2020-03-13 10:43:00.740   30.0   80.0
2020-03-13 10:43:00.820   40.0   90.0    
2020-03-13 10:43:00.900   10.0   10.0 
2020-03-13 10:43:00.1150  20.0   20.0 
2020-03-13 10:43:00.1400  30.0 
2020-03-13 10:43:00.1650  70.0 
2020-03-13 10:44:00.100   10.0 
2020-03-13 10:44:00.350   20.0 
2020-03-13 10:44:00.600   30.0
2020-03-13 10:44:00.850   70.0

2. The Time field of each new row should fall between the times of two subsequent original rows, something like this:

import datetime

curr_row_time = data_frame['Time'][ind1]
next_row_time = data_frame['Time'][ind1 + 1]
timestamp1 = datetime.datetime.strptime(str(curr_row_time), '%Y-%m-%d %H:%M:%S.%f')
timestamp2 = datetime.datetime.strptime(str(next_row_time), '%Y-%m-%d %H:%M:%S.%f')
time_delta = (timestamp2 - timestamp1).total_seconds() * 1000  # milliseconds
time_step_increment = time_delta / max_frequency  # max_frequency should be the max for the row

For the last row, the max frequency can be taken either as the average or as the last used max frequency. I would appreciate any suggestion for point 1. Point 2 is second priority, but it would be good to have that as well.
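The step calculation above can also be sketched with pandas timestamps, which avoids the round trip through strptime. This is only an illustration: the two timestamps and max_frequency = 5 are taken from the first sample row, and the variable names are made up for the example.

```python
import pandas as pd

# Sample values taken from the first two rows of the example data
times = pd.to_datetime(["2020-03-13 10:43:00.500", "2020-03-13 10:43:00.900"])
max_frequency = 5  # longest list in the first row

# Step between generated rows, kept as a Timedelta instead of milliseconds
time_step = (times[1] - times[0]) / max_frequency

# Timestamps for the rows generated from the first original row
new_times = [times[0] + i * time_step for i in range(max_frequency)]
print(new_times)
```

Subtracting two Timestamp objects gives a Timedelta, so the division and the multiplication by the row position work without converting to strings or milliseconds.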

For point 1, I tried the solution from the linked question "Pandas Explode on Multiple columns", after converting the space-separated strings to lists (10.0 20.0 30.0 70.0 ----> [10.0, 20.0, 30.0, 70.0]). It fails with:

 File "<stdin>", line 3, in <listcomp>
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: zero-dimensional arrays cannot be concatenated

The solution in the linked question "Split (explode) pandas dataframe string entry to separate rows" works for list columns of equal size.
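One possible way around the unequal sizes, sketched here as an assumption rather than as the method from either linked question, is to pad every list in a row with NaN up to the longest list in that row and then call DataFrame.explode with a list of columns (supported since pandas 1.3, and only when the lists in a row have matching lengths). The helper to_floats and the sample frame are hypothetical.

```python
import pandas as pd

# Sample frame mirroring the data in the question
df = pd.DataFrame({
    "Time": pd.to_datetime(["2020-03-13 10:43:00.500",
                            "2020-03-13 10:43:00.900",
                            "2020-03-13 10:44:00.100"]),
    "COL1": ["0.0 10.0 20.0 30.0 40.0", "10.0 20.0 30.0 70.0", "10.0 20.0 30.0 70.0"],
    "COL2": ["50.0 60.0 70.0 80.0 90.0", "10.0 20.0", None],
})
list_cols = ["COL1", "COL2"]


def to_floats(cell):
    # Hypothetical helper: missing or non-string cells become an empty list
    return [float(v) for v in cell.split()] if isinstance(cell, str) else []


lists = df[list_cols].apply(lambda s: s.map(to_floats))

# Longest list in each row (the "max frequency" of the row)
row_len = lists.apply(lambda s: s.map(len)).max(axis=1)

# Pad every list with NaN up to the longest list of its row
for col in list_cols:
    padded = [vals + [float("nan")] * (n - len(vals))
              for vals, n in zip(lists[col], row_len)]
    lists[col] = pd.Series(padded, index=lists.index)

# Multi-column explode needs equal lengths per row, which the padding guarantees
exploded = df[["Time"]].join(lists).explode(list_cols, ignore_index=True)
print(exploded)
```

After this step the Time column still repeats the original timestamp for every exploded row, so the interpolation from point 2 has to be applied separately.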

First, you can create the exploded columns the way you want with concat, str.split and stack. Use reset_index and join to bring back the associated 'Time' column. Then you need to change the values in the 'Time' column to create the interpolation. I'm not sure whether a datetime column can be interpolated directly, but you can change the type to int64, mask the values that are the same as in the previous row (using shift), and interpolate. Like this:

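import pandas as pd

# df['Time'] is assumed to already be a datetime64 column
l_col = ['COL1', 'COL2']

# split each column on spaces, stack the pieces into rows and re-attach Time
df_f = pd.concat([df[col].str.split(' ', expand=True) for col in l_col],
                 axis=1, keys=l_col)\
         .stack()\
         .reset_index(level=1, drop=True)\
         .join(df[['Time']])\
         .reset_index(drop=True)

# blank out repeated timestamps, then interpolate linearly between the originals
df_f['Time'] = pd.to_datetime(df_f['Time'].astype('int64')
                                          .mask(df_f.Time.eq(df_f.Time.shift()))
                                          .interpolate(method='linear'))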

print (df_f)
    COL1  COL2                    Time
0    0.0  50.0 2020-03-13 10:43:00.500
1   10.0  60.0 2020-03-13 10:43:00.580
2   20.0  70.0 2020-03-13 10:43:00.660
3   30.0  80.0 2020-03-13 10:43:00.740
4   40.0  90.0 2020-03-13 10:43:00.820
5   10.0  10.0 2020-03-13 10:43:00.900
6   20.0  20.0 2020-03-13 10:43:15.700
7   30.0  None 2020-03-13 10:43:30.500
8   70.0  None 2020-03-13 10:43:45.300
9   10.0       2020-03-13 10:44:00.100
10  20.0  None 2020-03-13 10:44:00.100
11  30.0  None 2020-03-13 10:44:00.100
12  70.0  None 2020-03-13 10:44:00.100

I'm not sure what you want for the missing values in COL2, for example, so you may need some fillna to work this out.
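In the output above, the rows coming from the last original row all keep the same timestamp, because there is no later row to interpolate towards. The question suggests reusing the last step in that case. A rough sketch of that idea follows, assuming the original timestamps are unique and that df_f['Time'] still holds the repeated original timestamp for each exploded row (the sample frame and the names groups and step are made up for the illustration):

```python
import pandas as pd

# Hypothetical exploded frame: 'Time' repeats the original row's timestamp
df_f = pd.DataFrame({
    "Time": pd.to_datetime(["2020-03-13 10:43:00.500"] * 5
                           + ["2020-03-13 10:43:00.900"] * 4
                           + ["2020-03-13 10:44:00.100"] * 4),
})

# One line per original timestamp with the number of exploded rows it produced
groups = df_f.groupby("Time").size().rename("n").reset_index()

# Step = gap to the next original timestamp divided by the row count;
# the last group has no next timestamp, so it reuses the previous step
groups["step"] = (groups["Time"].shift(-1) - groups["Time"]) / groups["n"]
groups["step"] = groups["step"].ffill()

# Offset each exploded row by its position within the group times the step
step = df_f["Time"].map(groups.set_index("Time")["step"])
df_f["Time"] = df_f["Time"] + df_f.groupby("Time").cumcount() * step
print(df_f)
```

With the sample data this gives 80 ms steps for the first group and 14.8 s steps for the second and third groups, so every exploded row ends up with a distinct timestamp.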