python - Inequality filters on a date and a number

Question

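I am trying to query a Google App Engine datastore [Python] whose entities have item_name, manufacturing_date and number_of_items_shipped. The datastore holds up to 1 million records, and the number keeps growing.

Scenario: get all item_names that have shipped more than x_items [user input] and were manufactured after some_date [user input]. Basically, a kind of inventory check.

In effect, that is two inequalities on two different properties. The datastore only allows inequality filters on a single property per query, so GAE's query restrictions do not let me do this.

I searched SO for this problem but have had no luck so far. Have you run into this issue? If so, were you able to solve it? Please let me know.

Also, in the Google I/O 2010 talk Next Gen Queries, Alfred Fuller said that this restriction would be removed soon. More than 8 months have passed and, unfortunately, the restriction is still in place.

If you have managed to work around this restriction, I would appreciate it if you could post an answer.

Thanks a lot.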

score 1 · Accepted Answer

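Building on Sudhir's answer, I would assign each record to a manufacture-date "bucket", sized according to the granularity you care about. If your manufacture dates span more than a few years, use monthly buckets, for example; if the range is only the last year, use weekly ones.

Then, when you want to find records with more than n sales and a manufacture date within a certain range, run one query per bucket in that range and post-filter out the items you are not interested in.

For example (completely untested):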

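import datetime

BUCKET_SIZE_DAYS = 10

# Method on the model class: store the bucket number alongside the date.
def put(self):
    self.manufacture_bucket = self.manufacture_date.toordinal() // BUCKET_SIZE_DAYS
    # Caution: super(self.__class__, self) recurses forever if this model is
    # ever subclassed; name the class explicitly in real code.
    super(self.__class__, self).put()

# Intended as a method on the query object (see the usage further down):
# one equality query per bucket, then a post-filter on the exact date.
def filter_date_after(self, date_start):
    first_bucket = date_start.toordinal() // BUCKET_SIZE_DAYS
    last_bucket = datetime.datetime.today().toordinal() // BUCKET_SIZE_DAYS

    for this_bucket in range(first_bucket, last_bucket + 1):
        for found in self.filter("manufacture_bucket =", this_bucket):
            # The first bucket can also hold records made before date_start.
            if found.manufacture_date >= date_start:
                yield found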

You should then be able to use it like this:

widgets.filter("sold >", 7).filter_date_after(datetime.datetime(2010,11,21))
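To see why the post-filter is needed, here is a minimal in-memory sketch of the same idea. A plain dict stands in for the equality-filterable manufacture_bucket property; the Record tuple, the helper names and the sample data are hypothetical and exist only for illustration, and no App Engine API is involved.

```python
import datetime
from collections import defaultdict, namedtuple

BUCKET_SIZE_DAYS = 10

# Hypothetical stand-in for a datastore entity.
Record = namedtuple("Record", "item_name manufacture_date sold")

def bucket_of(date):
    """Map a date to its bucket number."""
    return date.toordinal() // BUCKET_SIZE_DAYS

def build_index(records):
    """Group records by bucket, mimicking an equality-filterable property."""
    index = defaultdict(list)
    for rec in records:
        index[bucket_of(rec.manufacture_date)].append(rec)
    return index

def made_on_or_after(index, date_start, today, min_sold):
    """One 'equality query' per bucket, then post-filter.

    In the datastore, the sold comparison would be the one allowed
    inequality filter; here both checks run in plain Python.
    """
    for bucket in range(bucket_of(date_start), bucket_of(today) + 1):
        for rec in index.get(bucket, []):
            if rec.manufacture_date >= date_start and rec.sold > min_sold:
                yield rec

records = [
    Record("early", datetime.date(2010, 11, 20), 9),  # same bucket as start
    Record("ok", datetime.date(2010, 11, 25), 9),
    Record("slow", datetime.date(2010, 12, 1), 3),
]
index = build_index(records)
start = datetime.date(2010, 11, 21)
today = datetime.date(2010, 12, 31)
hits = [r.item_name for r in made_on_or_after(index, start, today, 7)]
```

The "early" record falls in the same 10-day bucket as the start date, so the bucket query alone would return it; only the date comparison afterwards drops it.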

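Left as an exercise for the reader:

Making it play nicely with other filters added afterwards
Multiple bucket sizes, so that you only ever need to query about ln(days in date range) buckets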
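One possible way to approach the second exercise, sketched under assumptions that are not in the answer itself: store a bucket number for several power-of-two bucket sizes on each record, then tile the requested range greedily with the largest aligned bucket that fits. BUCKET_SIZES and cover are hypothetical names, and the code only computes which buckets to query; it does not touch the datastore.

```python
import datetime

# Hypothetical bucket sizes in days. Each size would be stored as its own
# property on the entity when it is saved.
BUCKET_SIZES = [2 ** k for k in range(12)]  # 1, 2, 4, ..., 2048 days

def cover(first_day, last_day):
    """Return (size, bucket_number) pairs that exactly tile the inclusive
    range of date ordinals [first_day, last_day] with aligned buckets."""
    pairs = []
    day = first_day
    while day <= last_day:
        # Largest bucket that starts exactly at `day` and ends inside the range.
        # Size 1 always qualifies, so max() never sees an empty sequence.
        size = max(s for s in BUCKET_SIZES
                   if day % s == 0 and day + s - 1 <= last_day)
        pairs.append((size, day // size))
        day += size
    return pairs

first = datetime.date(2010, 11, 21).toordinal()
last = datetime.date(2011, 7, 1).toordinal()
buckets = cover(first, last)
```

Each returned pair would become one equality query on the property for that bucket size. Because the buckets tile the range exactly, no date post-filter is needed, and the number of queries grows roughly with the logarithm of the range length rather than linearly.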

score 0 · Accepted Answer

Unfortunately, you can't circumvent this restriction, but I can help you model the data in a slightly different way.

First off, Bigtable is suited to very fast reads off large databases, the kind you do when you have a million people hitting your app at the same time. What you're trying to do here is a report on historical data. While I would recommend moving the reporting to an RDBMS, there is a way you can do it on Bigtable.

First, override the put() method on your item model to split the date into its components before saving it. Something like:

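def put(self):
    # Store each date component as its own equality-filterable property.
    self.manufacture_day = self.manufacture_date.day
    self.manufacture_month = self.manufacture_date.month
    self.manufacture_year = self.manufacture_date.year
    # Caution: super(self.__class__, self) recurses forever if this model is
    # ever subclassed; name the class explicitly in real code.
    super(self.__class__, self).put()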

You can do this to any level of granularity you want, even hours, minutes, seconds, whatever.

You can apply this retroactively to your database by just loading and saving your item entities. The mapper is very convenient for this.

Then change your query to use the inequality only on the item count, and select the days / months / years you want using normal equalities. You can do ranges either by firing multiple queries or by using the IN clause (which does the same thing anyway).
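As a rough sketch of the "multiple queries" route, the helper below lists the (year, month) pairs that a date range touches. months_between is a hypothetical name, not part of any App Engine API; each pair it yields would become one query with equality filters on manufacture_year and manufacture_month plus the single inequality on the item count.

```python
import datetime

def months_between(start, end):
    """Yield (year, month) for every calendar month touching [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1

pairs = list(months_between(datetime.date(2010, 11, 21),
                            datetime.date(2011, 2, 3)))
# pairs == [(2010, 11), (2010, 12), (2011, 1), (2011, 2)]
```

The first and last months are only partly inside the range, so their results still need a check on manufacture_day, either as extra equality queries per day or as a post-filter in Python.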

This does seem contrived and tough to do, but keep in mind that your reports will run almost instantaneously if you do this, even when millions of people try to run them at the same time. You might not need this kind of scale, but well... that's what you get :D
