Explode multiple columns with list columns of different sizes in pandas
I have the following situation: I may get 300 columns in a CSV file, and some of them (about 50 columns) are list parameters. The lists can be of uneven size, including no values at all.
Time                     COL1                      COL2
2020-03-13 10:43:00.500  0.0 10.0 20.0 30.0 40.0   50.0 60.0 70.0 80.0 90.0
2020-03-13 10:43:00.900  10.0 20.0 30.0 70.0       10.0 20.0
2020-03-13 10:44:00.100  10.0 20.0 30.0 70.0
I want to:
1. Explode the columns into rows based on the highest frequency in each row. For example, the max frequency (number of space-separated values) in the 1st row is 5, and in the 2nd row it is 4.
Time                      COL1  COL2
2020-03-13 10:43:00.500   0.0   50.0
2020-03-13 10:43:00.580   10.0  60.0
2020-03-13 10:43:00.660   20.0  70.0
2020-03-13 10:43:00.740   30.0  80.0
2020-03-13 10:43:00.820   40.0  90.0
2020-03-13 10:43:00.900   10.0  10.0
2020-03-13 10:43:00.1150  20.0  20.0
2020-03-13 10:43:00.1400  30.0
2020-03-13 10:43:00.1650  80.0
2020-03-13 10:44:00.100   10.0
2020-03-13 10:44:00.350   20.0
2020-03-13 10:44:00.600   30.0
2020-03-13 10:44:00.850   70.0
2. The Time field of the new rows should fall between the times of two subsequent original rows. Something like this:
import datetime

curr_row_time = data_frame['Time'][ind1]
next_row_time = data_frame['Time'][ind1 + 1]
timestamp1 = datetime.datetime.strptime(str(curr_row_time),
                                        '%Y-%m-%d %H:%M:%S.%f')
timestamp2 = datetime.datetime.strptime(str(next_row_time),
                                        '%Y-%m-%d %H:%M:%S.%f')
# Gap between the two rows in milliseconds
time_delta = (timestamp2 - timestamp1).total_seconds() * 1000
# max_frequency should be the largest number of values in any list column of the row
time_step_increment = time_delta / max_frequency
For the last row, the max frequency can be taken as either the average or the last one used.
I would appreciate any suggestions for point 1. Point 2 is second priority, but it would be good to have as well.
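The step computation above can also be written without an explicit loop. The following is only a minimal sketch, not the asker's code: the sample frame and the name list_cols are made up to mirror the data shown earlier, and reusing the previous step for the last row is just one of the two options mentioned.

```python
import pandas as pd

# Hypothetical sample mirroring the layout shown in the question.
data_frame = pd.DataFrame({
    "Time": pd.to_datetime([
        "2020-03-13 10:43:00.500",
        "2020-03-13 10:43:00.900",
        "2020-03-13 10:44:00.100",
    ]),
    "COL1": ["0.0 10.0 20.0 30.0 40.0", "10.0 20.0 30.0 70.0", "10.0 20.0 30.0 70.0"],
    "COL2": ["50.0 60.0 70.0 80.0 90.0", "10.0 20.0", ""],
})
list_cols = ["COL1", "COL2"]  # assumed names of the list columns

# Number of space-separated values per cell; empty or missing cells count as 0.
counts = data_frame[list_cols].apply(lambda s: s.fillna("").str.split().map(len))
max_frequency = counts.max(axis=1)

# Gap to the next row in milliseconds (NaN for the last row, which has no successor).
time_delta_ms = data_frame["Time"].diff().shift(-1).dt.total_seconds() * 1000

# Step per exploded row; the last row reuses the previous step (assumption).
time_step_increment = (time_delta_ms / max_frequency).ffill()
print(pd.DataFrame({"max_frequency": max_frequency, "step_ms": time_step_increment}))
```

With the sample data the per-row maximum comes out as 5, 4 and 4 values.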
For point 1, I tried the solution from the link below, after converting each space-separated string to a list (10.0 20.0 30.0 70.0 ----> [10.0, 20.0, 30.0, 70.0]): Pandas Explode on Multiple columns
It failed with:
File "<stdin>", line 3, in <listcomp>
File "<__array_function__ internals>", line 6, in concatenate
ValueError: zero-dimensional arrays cannot be concatenated
The link below works for list columns of equal size: Split (explode) pandas dataframe string entry to separate rows
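Since equal-size approaches work, one possible workaround is to pad the lists in each row to the same length before exploding. This is a sketch under assumptions, not the method from either linked answer: the sample frame and list_cols are hypothetical, and passing a list of columns to DataFrame.explode requires pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data.
df = pd.DataFrame({
    "Time": ["2020-03-13 10:43:00.500", "2020-03-13 10:43:00.900", "2020-03-13 10:44:00.100"],
    "COL1": ["0.0 10.0 20.0 30.0 40.0", "10.0 20.0 30.0 70.0", "10.0 20.0 30.0 70.0"],
    "COL2": ["50.0 60.0 70.0 80.0 90.0", "10.0 20.0", ""],
})
list_cols = ["COL1", "COL2"]  # assumed names of the list columns

# Split each cell on whitespace; empty or missing cells become empty lists.
lists = df[list_cols].apply(lambda s: s.fillna("").str.split())

# Length of the longest list in each row.
max_len = lists.apply(lambda s: s.map(len)).max(axis=1)

# Pad every list with None up to the longest list of its row.
for col in list_cols:
    df[col] = [vals + [None] * (n - len(vals)) for vals, n in zip(lists[col], max_len)]

# All lists in a row now have equal length, so a multi-column explode is allowed.
exploded = df.explode(list_cols, ignore_index=True)
exploded[list_cols] = exploded[list_cols].apply(pd.to_numeric)
print(exploded)
```

The Time value is simply repeated for each exploded row here; spreading it between rows is a separate step.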
First, you can create the exploded columns the way you want with concat, str.split and stack. Use reset_index and join to attach the associated 'Time' column. Then you need to change the values in the 'Time' column to create the interpolation. I'm not sure whether a datetime column can be interpolated directly, but you can change the type to int64, mask the values that are the same as in the previous row using shift, and then interpolate. Like this:
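import pandas as pd

# Assumes df['Time'] is already a datetime64 column (e.g. via pd.to_datetime).
l_col = ['COL1', 'COL2']
# Split each list column into one column per value, then stack the values into rows.
df_f = pd.concat([df[col].str.split(' ', expand=True) for col in l_col],
                 axis=1, keys=l_col)\
         .stack()\
         .reset_index(level=1, drop=True)\
         .join(df[['Time']])\
         .reset_index(drop=True)
# Keep only the first timestamp of each original row, then interpolate the rest linearly.
df_f['Time'] = pd.to_datetime(df_f['Time'].astype('int64')
                                          .mask(df_f.Time.eq(df_f.Time.shift()))
                                          .interpolate(method='linear'))
print(df_f)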
    COL1  COL2                    Time
0    0.0  50.0 2020-03-13 10:43:00.500
1   10.0  60.0 2020-03-13 10:43:00.580
2   20.0  70.0 2020-03-13 10:43:00.660
3   30.0  80.0 2020-03-13 10:43:00.740
4   40.0  90.0 2020-03-13 10:43:00.820
5   10.0  10.0 2020-03-13 10:43:00.900
6   20.0  20.0 2020-03-13 10:43:15.700
7   30.0  None 2020-03-13 10:43:30.500
8   70.0  None 2020-03-13 10:43:45.300
9   10.0       2020-03-13 10:44:00.100
10  20.0  None 2020-03-13 10:44:00.100
11  30.0  None 2020-03-13 10:44:00.100
12  70.0  None 2020-03-13 10:44:00.100
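In the output above, the last four rows share one timestamp because there is no later row to interpolate towards. A possible way around that, sketched here under the assumption that the last group should reuse the previous group's step (one of the options named in the question), is to compute a step per original row and add it by position. The series t is a hypothetical stand-in for the joined 'Time' column before interpolation.

```python
import pandas as pd

# Hypothetical exploded 'Time' column: each original time repeated once per exploded row.
t = pd.to_datetime(pd.Series(
    ["2020-03-13 10:43:00.500"] * 5
    + ["2020-03-13 10:43:00.900"] * 4
    + ["2020-03-13 10:44:00.100"] * 4
))

# Id of the original row; assumes consecutive original rows have distinct times.
group = t.ne(t.shift()).cumsum()
# Position of each exploded row within its original row (0, 1, 2, ...).
pos = t.groupby(group).cumcount()

first = t.groupby(group).first()   # start time of each original row
size = t.groupby(group).size()     # number of exploded rows per original row

# Gap to the next original row divided by the row count; last group reuses the previous step.
step = (first.diff().shift(-1) / size).ffill()

new_time = t + group.map(step) * pos
print(new_time)
```

Taking the average step instead would only change how the missing last value of step is filled.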
I'm not sure what you want for the missing values (in COL2, for example), so you may need some fillna to work this out.
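As a rough illustration of that last step: fillna only touches NaN/None, so an empty string like the one in row 9 would be left alone unless the columns are converted to numbers first. The frame below is a hypothetical slice of the exploded result, and filling with 0.0 is only a placeholder choice.

```python
import pandas as pd

# Hypothetical slice of the exploded frame: strings, None, and one empty string.
df_f = pd.DataFrame({
    "COL1": ["30.0", "70.0", "10.0", "20.0"],
    "COL2": [None, None, "", None],
})

# Convert to floats; None and '' both end up as NaN with errors='coerce'.
num = df_f.apply(pd.to_numeric, errors="coerce")

# One option: fill the gaps with a constant (0.0 is an arbitrary placeholder).
filled = num.fillna(0.0)
print(filled)
```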