Why is iterating through a large Django QuerySet consuming massive amounts of memory?

Problem description:

The table in question contains roughly ten million rows.

for event in Event.objects.all():
    print(event)

This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.

I also tried Event.objects.iterator() which behaved the same way.

I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which would mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).

What am I misunderstanding?

(I don't know whether it's relevant, but I'm using PostgreSQL.)

Nate C was close, but not quite.

From the documentation:


You can evaluate a QuerySet in the following ways:


  • Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:

for e in Entry.objects.all():
    print(e.headline)


So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
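A quick way to see where the time and memory go, assuming the Event model from the question (a sketch, not output from the original poster's setup):

qs = Event.objects.all()   # no query yet -- QuerySets are lazy
it = iter(qs)              # first iteration: Django runs the query, fetches
                           # every row, and builds an Event instance for each
                           # one before yielding anything
event = next(it)           # the long pause happened on the line above; by this
                           # point all ten million objects are sitting in memory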

From my reading of the docs, iterator() does nothing more than bypass QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one thing, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
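To make the caching point concrete, here is a small sketch of the difference (names as in the question):

qs = Event.objects.all()
for event in qs:           # one query; the results are cached on qs
    pass
for event in qs:           # no second query -- this loop reuses the cache
    pass

for event in Event.objects.iterator():
    pass                   # no result cache is kept, so already-consumed
                           # objects can be garbage-collected; the driver may
                           # still pull the whole result set from the database
                           # up front, which is why it behaved the same here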

Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes; a minimal sketch of the same idea follows the list:

  • Memory Efficient Django QuerySet iterator
  • batch querysets
  • QuerySet Foreach
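
For illustration, here is a minimal sketch of the approach those snippets take: walk the table in primary-key order, one fixed-size batch per query, so only one batch of Event objects is in memory at a time. (batched_queryset and batch_size are names made up here, not part of Django, and the sketch assumes an integer primary key.)

def batched_queryset(queryset, batch_size=1000):
    """Yield every row of queryset, issuing one query per batch_size rows."""
    last_pk = 0
    queryset = queryset.order_by('pk')
    while True:
        # Fetch the next slice strictly after the last primary key we saw.
        batch = list(queryset.filter(pk__gt=last_pk)[:batch_size])
        if not batch:
            break
        for obj in batch:
            yield obj
        last_pk = batch[-1].pk

for event in batched_queryset(Event.objects.all(), batch_size=5000):
    print(event)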