sql - How do I efficiently do the intersection of joins in SQL?

Question

I have three tables: books, tags, and taggings (a books-xref-tags table):

books
id | title |      author     
 1 | Blink | Malcolm Gladwell
 2 |  1984 |    George Orwell

taggings
book_id | tag_id
      1 |      1
      1 |      2
      2 |      1
      2 |      3

tags
id | name
 1 | interesting
 2 |  nonfiction
 3 |     fiction

I want to find all the books that are tagged with both "interesting" and "fiction". The best I have come up with is

select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tags.id
   and tags.name = 'interesting'
intersect
select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tags.id
   and tags.name = 'fiction'

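That seems to work, but I don't know how it will scale, either in the number of rows or in the number of tags. That is, what happens as I add hundreds of books, hundreds of tags, and thousands of taggings? What happens when the search becomes "interesting" and "fiction" and "aquatic" and "stonemasonry"?

If there is no better way of doing the query directly in SQL, I have another approach in mind:

Select all the books with the first tag, along with all of those books' tags
Remove from the list any book that does not have all of the tags queried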
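The two steps above can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database, not something from the original post; the helper names make_demo_db and books_with_all_tags_client_side are made up for the example, and the data is the sample data from the tables above.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory copy of the example tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
        """
    )
    return conn


def books_with_all_tags_client_side(conn, wanted):
    """Hypothetical helper: step 1 runs in SQL, step 2 runs in Python."""
    wanted = list(wanted)
    if not wanted:
        return []
    # Step 1: every book carrying the first tag, plus all tags of those books.
    rows = conn.execute(
        """
        SELECT b.id, b.title, every_tag.name
        FROM books b
        JOIN taggings first_tg ON first_tg.book_id = b.id
        JOIN tags first_tag ON first_tag.id = first_tg.tag_id
        JOIN taggings every_tg ON every_tg.book_id = b.id
        JOIN tags every_tag ON every_tag.id = every_tg.tag_id
        WHERE first_tag.name = ?
        """,
        (wanted[0],),
    ).fetchall()
    tags_by_book = {}
    for book_id, title, tag_name in rows:
        tags_by_book.setdefault((book_id, title), set()).add(tag_name)
    # Step 2: drop every book that lacks any of the requested tags.
    required = set(wanted)
    return sorted(book for book, names in tags_by_book.items() if required <= names)


print(books_with_all_tags_client_side(make_demo_db(), ["interesting", "fiction"]))
# -> [(2, '1984')]
```

The trade-off is that step 1 may pull back far more rows than the final answer needs when the first tag is a common one, so picking the rarest tag as the "first" one would probably help.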

score 3 · Accepted Answer

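If you want to keep the option of using more than two tags, this answer to a similar question about tags may be interesting for you.

It uses MySQL syntax (I'm not sure what you use), but it is quite simple and you should be able to use it with other databases.

For you it would look like this (using MySQL syntax):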

SELECT books.id, books.title, books.author
FROM books
INNER JOIN taggings ON ( taggings.book_id = books.id )
INNER JOIN tags ON ( tags.id = taggings.tag_id )
WHERE tags.name IN ( @tag1, @tag2, @tag3 )
GROUP BY books.id, books.title, books.author
HAVING COUNT(*) = @number_of_tags

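From my other post:

If you have 3 tags like in your example, then number_of_tags would have to be 3, and the join would result in 3 matching rows per id.

You can either create that query dynamically, or define it with, say, 10 tags and initialize the unused ones with a value that won't occur in tags.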
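Building the query dynamically could look something like the sketch below. It is not from the original answer: it uses Python with an in-memory SQLite database, the helper names make_demo_db and find_books_with_all_tags are invented for the example, and it swaps COUNT(*) for COUNT(DISTINCT tags.id) as a guard in case taggings ever contains the same (book_id, tag_id) pair twice.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory copy of the example tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
        """
    )
    return conn


def find_books_with_all_tags(conn, tag_names):
    """Hypothetical helper: one IN / GROUP BY / HAVING query for any number of tags."""
    names = list(dict.fromkeys(tag_names))  # drop duplicate names, keep order
    if not names:
        return []
    # One "?" placeholder per tag, so the values are bound, never concatenated.
    placeholders = ", ".join("?" for _ in names)
    sql = f"""
        SELECT books.id, books.title, books.author
        FROM books
        INNER JOIN taggings ON taggings.book_id = books.id
        INNER JOIN tags ON tags.id = taggings.tag_id
        WHERE tags.name IN ({placeholders})
        GROUP BY books.id, books.title, books.author
        HAVING COUNT(DISTINCT tags.id) = ?
        ORDER BY books.id
    """
    # The last parameter plays the role of @number_of_tags.
    return conn.execute(sql, names + [len(names)]).fetchall()


print(find_books_with_all_tags(make_demo_db(), ["interesting", "fiction"]))
# -> [(2, '1984', 'George Orwell')]
```

Because the placeholder list is generated from the input, the same function covers two tags or twenty without padding the query with dummy values.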

score 1 · Accepted Answer

This is a somewhat more "old school" dialect of SQL, but the syntax is more compact, and these are still inner joins.

select * from books, taggings tg1, tags t1, taggings tg2, tags t2 
 where tg1.book_id = books.id
   and tg1.tag_id  = t1.id
   and t1.name = 'interesting'
   and tg2.book_id = books.id
   and tg2.tag_id  = t2.id
   and t2.name = 'fiction'

Edit: Wow, that's a lot of hate from stackers for joining too much in a single query. You can optimize further by using exists subqueries:

select * from books
 where exists (select * from taggings, tags
                where tags.name = 'fiction'
                  and taggings.tag_id = tags.id
                  and taggings.book_id = books.id)
   and exists (select * from taggings, tags
                where tags.name = 'interesting'
                  and taggings.tag_id = tags.id
                  and taggings.book_id = books.id)
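For more than two tags, the exists version can be generated with one subquery per tag. The following is only a rough sketch in Python with an in-memory SQLite database; the helper names make_demo_db and find_books_with_exists are made up for the example and are not part of the original answer.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory copy of the example tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
        """
    )
    return conn


def find_books_with_exists(conn, tag_names):
    """Hypothetical helper: AND together one EXISTS subquery per requested tag."""
    names = list(tag_names)
    if not names:
        return []
    exists_clause = (
        "EXISTS (SELECT 1 FROM taggings JOIN tags ON tags.id = taggings.tag_id "
        "WHERE tags.name = ? AND taggings.book_id = books.id)"
    )
    sql = (
        "SELECT books.id, books.title FROM books WHERE "
        + " AND ".join([exists_clause] * len(names))
        + " ORDER BY books.id"
    )
    # Each "?" is bound to one tag name, in order.
    return conn.execute(sql, names).fetchall()


print(find_books_with_exists(make_demo_db(), ["interesting", "fiction"]))
# -> [(2, '1984')]
```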

score 1 · Accepted Answer

Even without a proper benchmark, I would recommend ALL instead of intersect, since mysql actually knows how to join this much better.

select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tags.id
   -- shorthand as posted: standard SQL's ALL needs a comparison operator and a subquery
   and tags.name ALL('interesting', 'fiction');

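As for scaling, with millions of books and a low-cardinality tags table, you would eventually end up moving the tag ids into code/memory so that you can use taggings.tag_id ALL (3, 7, 105) or something like it. The last join to fetch the tags table will not use an index unless you get above something like 1k tags, so you would be doing a table scan every time.

In my experience, intersects and joins are big evils for performance. Most of the time, joins are the problem we typically run into: the fewer joins you have, the faster the query gets.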
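One way to read the "tag ids in code/memory" idea is sketched below, assuming the tag names are resolved to ids once in the application and the lookup then touches only taggings. This is an illustration in Python with an in-memory SQLite database, not the answer's own code: the helper names are invented, and it uses IN with GROUP BY / HAVING in place of the ALL shorthand.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory copy of the example tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
        """
    )
    return conn


def load_tag_ids(conn):
    """Hypothetical cache: read the small tags table once into a dict."""
    return {name: tag_id for tag_id, name in conn.execute("SELECT id, name FROM tags")}


def book_ids_with_all_tag_ids(conn, tag_ids):
    """Hypothetical helper: find matching books without joining to tags at all."""
    ids = sorted(set(tag_ids))
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = f"""
        SELECT book_id
        FROM taggings
        WHERE tag_id IN ({placeholders})
        GROUP BY book_id
        HAVING COUNT(DISTINCT tag_id) = ?
        ORDER BY book_id
    """
    return [row[0] for row in conn.execute(sql, ids + [len(ids)])]


conn = make_demo_db()
tag_ids = load_tag_ids(conn)
print(book_ids_with_all_tag_ids(conn, [tag_ids["interesting"], tag_ids["fiction"]]))
# -> [2]
```

The book rows themselves can then be fetched by primary key, which keeps the per-search query down to a single table.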

score 1 · Accepted Answer

with
  tt as
  (
      select id
      from tags
      where name in ('interesting', 'fiction')
  ),
  mm as
  (
      select book_id
      from taggings join tt on taggings.tag_id = tt.id
      group by taggings.book_id having count(*) = 2
  )
select books.*
from books join mm on books.id = mm.book_id

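Judging by EXPLAIN PLAN, this variation seems to produce a better execution plan (at least in Oracle) than Peter Lang's solution, for the following reasons:

The join between tags and taggings is done table-to-index rather than table-to-table. I don't know whether this actually affects query performance on large datasets.
The plan groups and counts the dataset before performing the final join with books. This will almost certainly affect performance on large datasets.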

score 0 · Accepted Answer

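Which database? That will change the answer a bit. For example, this works on SQL Server and should be fast because it does not have to go to the tags table twice, but it will fail on mysql versions before 8.0, which don't do CTEs: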

WITH taggingNames
AS
(
    SELECT tags.name, tags.id AS tag_id, taggings.book_id
    FROM tags
    INNER JOIN taggings ON tags.id = taggings.tag_id
)
SELECT b.*
FROM books b
INNER JOIN (
  SELECT t1.book_id
   FROM taggingNames t1
   INNER JOIN taggingNames t2 ON t2.book_id = t1.book_id AND t2.name = 'fiction'
   WHERE t1.name = 'interesting'
   GROUP BY t1.book_id
 ) ids ON b.id = ids.book_id

Having looked at it, I think I like Peter Lang's answer too.
