对Pandas中的行和列MultiIndex使用布尔索引

问题描述:

问题以粗体结尾.但首先,让我们设置一些数据:

Questions are at the end, in bold. But first, let's set up some data:

import numpy as np
import pandas as pd
from itertools import product

np.random.seed(1)

team_names = ['Yankees', 'Mets', 'Dodgers']
jersey_numbers = [35, 71, 84]
game_numbers = [1, 2]
observer_names = ['Bill', 'John', 'Ralph']
observation_types = ['Speed', 'Strength']

row_indices = list(product(team_names, jersey_numbers, game_numbers, observer_names, observation_types))
observation_values = np.random.randn(len(row_indices))

tns, jns, gns, ons, ots = zip(*row_indices)

data = pd.DataFrame({'team': tns, 'jersey': jns, 'game': gns, 'observer': ons, 'obstype': ots, 'value': observation_values})

data = data.set_index(['team', 'jersey', 'game', 'observer', 'obstype'])
data = data.unstack(['observer', 'obstype'])
data.columns = data.columns.droplevel(0)

这给出:

我想提取此DataFrame的一个子集用于后续分析.假设我想切出jersey编号为71的行.我真的不喜欢使用xs进行此操作的想法.通过xs进行横截面时,将丢失所选的列.如果我运行:

I want to pluck out a subset of this DataFrame for subsequent analysis. Say I wanted to slice out the rows where the jersey number is 71. I don't really like the idea of using xs to do this. When you do a cross section via xs you lose the column you selected on. If I run:

data.xs(71, axis=0, level='jersey')

然后我返回正确的行,但是我丢失了jersey列.

then I get back the right rows, but I lose the jersey column.

此外,对于我想要jersey列中一些不同值的情况,xs似乎不是一个很好的解决方案.我认为找到一种更好的解决方案

Also, xs doesn't seem like a great solution for the case where I want a few different values from the jersey column. I think a much nicer solution is the one found here:

data[[j in [71, 84] for t, j, g in data.index]]

您甚至可以过滤球衣和球队的组合:

You could even filter on a combination of jerseys and teams:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]

好!

问题是:我该如何做一些类似的事情来选择列的子集.例如,说我只想要代表Ralph数据的列.不使用xs怎么办?或者,如果我只想要带有observer in ['John', 'Ralph']的列怎么办?再次,我真的更喜欢一个解决方案,该解决方案将行和列索引的所有级别保留在结果中……就像上面的布尔索引示例一样.

So the question: how can I do something similar for selecting a subset of columns. For example, say I want only the columns representing data from Ralph. How can I do that without using xs? Or what if I wanted only the columns with observer in ['John', 'Ralph']? Again, I'd really prefer a solution that keeps all the levels of the row and column indices in the result...just like the boolean indexing examples above.

我可以做我想做的事情,甚至可以合并行和列索引中的选择.但是我发现的唯一解决方案涉及一些真正的体操:

I can do what I want, and even combine selections from both the row and column indices. But the only solution I've found involves some real gymnastics:

data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

第二个问题是:是否有一种更紧凑的方式来完成我刚才所做的事情?

这里是一种使用内置感觉语法的方法.但它仍然像笨拙一样笨拙:

Here is one approach that uses slightly more built-in-feeling syntax. But it's still clunky as hell:

data.loc[
    (data.index.get_level_values('jersey').isin([71, 84])
     & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
    data.columns.get_level_values('observer').isin(['John', 'Ralph'])
]

所以比较:

def hackedsyntax():
    return data[[j in [71, 84] and t in ['Dodgers', 'Mets'] for t, j, g in data.index]]\
    .T[[obs in ['John', 'Ralph'] for obs, obstype in data.columns]].T

def uglybuiltinsyntax():
    return data.loc[
        (data.index.get_level_values('jersey').isin([71, 84])
         & data.index.get_level_values('team').isin(['Dodgers', 'Mets'])), 
        data.columns.get_level_values('observer').isin(['John', 'Ralph'])
    ]

%timeit hackedsyntax()
%timeit uglybuiltinsyntax()

hackedsyntax() - uglybuiltinsyntax()

结果:

1000 loops, best of 3: 395 µs per loop
1000 loops, best of 3: 409 µs per loop

仍然希望有一种更清洁或更规范的方法来做到这一点.

Still hopeful there's a cleaner or more canonical way to do this.