pandas: counting event occurrences in the last n days, by group

Question:

I have a table of events keyed by id. How can I count, for each row, how many times each event type has occurred in the n days before that row?

For example, given a list of events like this:

import pandas as pd

df = pd.DataFrame([{'id': 1, 'event_day': '2016-01-01', 'event_type': 'type1'},
                   {'id': 1, 'event_day': '2016-01-02', 'event_type': 'type1'},
                   {'id': 2, 'event_day': '2016-02-01', 'event_type': 'type2'},
                   {'id': 2, 'event_day': '2016-02-15', 'event_type': 'type3'},
                   {'id': 3, 'event_day': '2016-01-06', 'event_type': 'type3'},
                   {'id': 3, 'event_day': '2016-03-11', 'event_type': 'type3'}])
# Parse the dates so they can be compared and shifted by a time delta
df['event_day'] = pd.to_datetime(df['event_day'])
df = df.sort_values(['id', 'event_day'])

Or, as a table:

   event_day event_type  id
0 2016-01-01      type1   1
1 2016-01-02      type1   1
2 2016-02-01      type2   2
3 2016-02-15      type3   2
4 2016-01-06      type3   3
5 2016-03-11      type3   3

For each id, I want to count the number of times each event_type has occurred before the current row within the last n days. For example, row 3 has id=2: how many times have event types 1, 2, and 3 occurred for id 2 in the last n days, up to (but not including) that row?

The desired output looks like this:

    event_day   event_type  event_type1_in_last_30days  event_type2_in_last_30days  event_type3_in_last_30days  id
0   2016-01-01  type1       0                           0                           0                           1
1   2016-01-02  type1       1                           0                           0                           1
2   2016-02-01  type2       0                           0                           0                           2
3   2016-02-15  type3       0                           1                           0                           2
4   2016-01-06  type3       0                           0                           0                           3
5   2016-03-11  type3       0                           0                           0                           3

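Answer:

res = ((((df['event_day'].values >= df['event_day'].values[:, None] - pd.to_timedelta('30 days'))
        & (df['event_day'].values < df['event_day'].values[:, None]))
        & (df['id'].values == df['id'].values[:, None]))
        # dtype=float keeps the indicators numeric so the product yields counts
        # (newer pandas versions return boolean indicators by default)
        .dot(pd.get_dummies(df['event_type'], dtype=float)))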
res
Out: 
array([[ 0.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])


The first part generates a matrix like this:

(df['event_day'].values >= df['event_day'].values[:, None] - pd.to_timedelta('30 days'))
Out: 
array([[ True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True,  True],
       [False, False,  True,  True, False,  True],
       [ True,  True,  True,  True,  True,  True],
       [False, False, False,  True, False,  True]], dtype=bool)

It's a 6x6 matrix in which each row is compared against every other row: entry [i, j] is True when row j is dated no earlier than 30 days before row i. It uses NumPy's broadcasting for the pairwise comparison (.values[:, None] adds another axis). To complete the window, we also need to check that the other row occurs strictly before the current row:

(((df['event_day'].values >= df['event_day'].values[:, None] - pd.to_timedelta('30 days')) 
   & (df['event_day'].values < df['event_day'].values[:, None])))
Out: 
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [False,  True, False, False,  True, False],
       [False, False,  True, False, False, False],
       [ True,  True, False, False, False, False],
       [False, False, False,  True, False, False]], dtype=bool)
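
The broadcasting step can be looked at in isolation. The following is a minimal sketch on made-up dates (the array `days` and the names `in_window`, `earlier`, and `pairwise` are illustrative and not part of the original answer); it builds the same kind of pairwise window matrix with plain NumPy:

```python
import numpy as np

# Hypothetical toy data: three event dates, not taken from the example above.
days = np.array(['2016-01-01', '2016-01-10', '2016-03-01'], dtype='datetime64[D]')
window = np.timedelta64(30, 'D')

# days[:, None] has shape (3, 1); comparing it with days (shape (3,))
# broadcasts to a (3, 3) matrix where entry [i, j] compares row j against row i.
in_window = days >= days[:, None] - window   # row j is dated on or after (day_i - 30 days)
earlier = days < days[:, None]               # row j is strictly before row i
pairwise = in_window & earlier

print(pairwise)
```

Only entry [1, 0] should come out True here: 2016-01-01 falls within the 30 days before 2016-01-10, while 2016-03-01 has no earlier date that close to it.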

The other condition concerns the ids. Using a similar approach, you can construct a pairwise comparison matrix that shows where the ids match:

(df['id'].values == df['id'].values[:, None])
Out: 
array([[ True,  True, False, False, False, False],
       [ True,  True, False, False, False, False],
       [False, False,  True,  True, False, False],
       [False, False,  True,  True, False, False],
       [False, False, False, False,  True,  True],
       [False, False, False, False,  True,  True]], dtype=bool)

Combined, the three conditions become:

(((df['event_day'].values >= df['event_day'].values[:, None] - pd.to_timedelta('30 days')) 
    & (df['event_day'].values < df['event_day'].values[:, None]))
    & (df['id'].values == df['id'].values[:, None]))
Out: 
array([[False, False, False, False, False, False],
       [ True, False, False, False, False, False],
       [False, False, False, False, False, False],
       [False, False,  True, False, False, False],
       [False, False, False, False, False, False],
       [False, False, False, False, False, False]], dtype=bool)

Lastly, you want the counts broken down by type, so you can use get_dummies:

pd.get_dummies(df['event_type'], dtype=float)
Out: 
   type1  type2  type3
0    1.0    0.0    0.0
1    1.0    0.0    0.0
2    0.0    1.0    0.0
3    0.0    0.0    1.0
4    0.0    0.0    1.0
5    0.0    0.0    1.0
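
To see why an indicator matrix like this is useful, here is a small standalone sketch on made-up data (the series `types` and the array `mask` are hypothetical). It assumes a boolean matrix whose entry [i, j] says whether event j should be counted for row i, and shows that a matrix product with the indicators adds up the selected events per type. The integer dtype is requested explicitly because a product of two boolean arrays would only report whether any match exists, not how many.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: four events of two types.
types = pd.Series(['a', 'b', 'a', 'a'])
# dtype=int is requested explicitly so the product below produces counts.
indicators = pd.get_dummies(types, dtype=int)

# Hypothetical mask: mask[i, j] is True when event j should be counted for row i.
mask = np.array([[False, False, False, False],
                 [True,  False, False, False],
                 [True,  True,  False, False],
                 [True,  True,  True,  False]])

# Entry [i, k] sums the indicator of type k over the events selected for row i.
counts = mask.astype(int) @ indicators.to_numpy()

print(counts)
```

The last row selects events 0, 1, and 2, which are two events of type 'a' and one of type 'b', so its counts are 2 and 1.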

If you multiply the combined matrix by this one, you get, for each type, the number of rows satisfying the conditions. You can pass the resulting array to the DataFrame constructor and concat it with the original frame:

pd.concat([df, pd.DataFrame(res, columns = ['e1', 'e2', 'e3'])], axis=1)
Out: 
   event_day event_type  id   e1   e2   e3
0 2016-01-01      type1   1  0.0  0.0  0.0
1 2016-01-02      type1   1  1.0  0.0  0.0
2 2016-02-01      type2   2  0.0  0.0  0.0
3 2016-02-15      type3   2  0.0  1.0  0.0
4 2016-01-06      type3   3  0.0  0.0  0.0
5 2016-03-11      type3   3  0.0  0.0  0.0
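
The steps above can be collected into one reusable function. This is a sketch rather than a definitive implementation: the function name `count_prior_events` and the parameter `n_days` are assumptions introduced here, and the output column names merely imitate the pattern shown in the desired output of the question.

```python
import numpy as np
import pandas as pd


def count_prior_events(df, n_days=30):
    """Count, per row, earlier events of each type for the same id within n_days."""
    days = df['event_day'].to_numpy()
    ids = df['id'].to_numpy()
    window = np.timedelta64(n_days, 'D')

    # mask[i, j] is True when row j has the same id as row i and falls
    # in the half-open window [day_i - n_days, day_i).
    mask = ((days >= days[:, None] - window)
            & (days < days[:, None])
            & (ids == ids[:, None]))

    # Integer indicators so the matrix product yields counts, not booleans.
    dummies = pd.get_dummies(df['event_type'], dtype=int)
    counts = mask.astype(int) @ dummies.to_numpy()

    # Column names follow the pattern of the desired output in the question.
    columns = [f'event_{t}_in_last_{n_days}days' for t in dummies.columns]
    # Reusing df.index keeps the join aligned even if df is not indexed 0..n-1.
    return df.join(pd.DataFrame(counts, index=df.index, columns=columns))


# Usage with the sample data from the question.
df = pd.DataFrame([{'id': 1, 'event_day': '2016-01-01', 'event_type': 'type1'},
                   {'id': 1, 'event_day': '2016-01-02', 'event_type': 'type1'},
                   {'id': 2, 'event_day': '2016-02-01', 'event_type': 'type2'},
                   {'id': 2, 'event_day': '2016-02-15', 'event_type': 'type3'},
                   {'id': 3, 'event_day': '2016-01-06', 'event_type': 'type3'},
                   {'id': 3, 'event_day': '2016-03-11', 'event_type': 'type3'}])
df['event_day'] = pd.to_datetime(df['event_day'])
df = df.sort_values(['id', 'event_day'])

result = count_prior_events(df, n_days=30)
print(result)
```

Note that the pairwise matrices have one entry per pair of rows, so memory use grows with the square of the number of rows. That is fine for small tables, but for large ones it may be worth applying the same logic per id group instead.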