I would really appreciate it if someone could validate my SQL query.

For the following dataset:

MD5      UserPK      CategoryPK    
ADCDE    1           7  
ADCDE    1           4  
ADCDE    1           7  
dffrf    1           7  
dffrf    2           7  
dffrf    2           6 
dffrf    1           1 

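I want to select the MD5 and CategoryPK for which there are two or more rows with the same MD5 value, the same CategoryPK, and two or more different UserPK values.

In other words, I want the MD5 and CategoryPK of every record where two or more different users (UserPK) have assigned the same category (CategoryPK) to the same file (MD5). I am not interested in records where the same user has assigned a category more than once (unless a different user has also assigned that category to the file).

So from the data above, I would expect the following to be returned: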

md5    CategoryPK
dffrf  7

The query I have written is:

SELECT md5, 
       count(md5), 
       count(distinct categorypk) as cntcat, 
       count(distinct userpk) as cntpk
FROM Hash
       group by md5 having count(md5) > 1 
                           and cntpk > 1
                           and cntcat = 1;

It seems to work, but before I start using it in anger, I would appreciate a second opinion in case I have missed something or there is a better way to do it.

Thanks

2 Answers

I don't think your code will give you what you're after; what happens when a file has been assigned more than one category by multiple users, with some categories overlapping? Then cntcat != 1, so your HAVING clause will fail to match even though the file has indeed been categorised the same way by multiple users.
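The question's own sample data already shows the effect. The following is a minimal sketch, assuming SQLite through Python's `sqlite3` module rather than whatever database the question actually runs on; the aggregates are spelled out in HAVING instead of using the aliases, which is an assumption made for portability.

```python
import sqlite3

# The question's sample rows: (MD5, UserPK, CategoryPK).
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)

# Per-file counts as the question's query computes them, before HAVING.
counts = conn.execute("""
    SELECT MD5,
           COUNT(MD5),
           COUNT(DISTINCT CategoryPK),
           COUNT(DISTINCT UserPK)
    FROM Hash
    GROUP BY MD5
    ORDER BY MD5
""").fetchall()

# The question's filter, with aggregates written out instead of aliases.
matches = conn.execute("""
    SELECT MD5
    FROM Hash
    GROUP BY MD5
    HAVING COUNT(MD5) > 1
       AND COUNT(DISTINCT UserPK) > 1
       AND COUNT(DISTINCT CategoryPK) = 1
""").fetchall()

print(counts)   # [('ADCDE', 3, 2, 1), ('dffrf', 4, 3, 2)]
print(matches)  # []
```

File dffrf carries three distinct categories, so the "exactly one category" condition rejects it even though users 1 and 2 both assigned category 7.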

I would instead use a self-join:

SELECT   a.MD5, a.CategoryPK
FROM     Hash a
  JOIN   Hash b
      ON a.MD5 = b.MD5
     AND a.UserPK <> b.UserPK
     AND a.CategoryPK = b.CategoryPK
GROUP BY a.MD5, a.CategoryPK
HAVING   COUNT(DISTINCT a.UserPK) >= 2  -- "two or more" users; use > 2 only if you really need three or more
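The threshold in HAVING matters here, because the question's expected row (dffrf, 7) is shared by only two distinct users. As a rough check, assuming SQLite via Python's `sqlite3` module (not necessarily the asker's database) and a hypothetical helper named `shared_categories`, the self-join can be run against the sample data with different minimum user counts:

```python
import sqlite3

# The question's sample rows: (MD5, UserPK, CategoryPK).
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)


def shared_categories(min_users):
    """Hypothetical helper: (MD5, CategoryPK, user count) for pairs
    assigned by at least `min_users` distinct users."""
    return conn.execute("""
        SELECT   a.MD5, a.CategoryPK, COUNT(DISTINCT a.UserPK)
        FROM     Hash a
          JOIN   Hash b
              ON a.MD5 = b.MD5
             AND a.UserPK <> b.UserPK
             AND a.CategoryPK = b.CategoryPK
        GROUP BY a.MD5, a.CategoryPK
        HAVING   COUNT(DISTINCT a.UserPK) >= ?
    """, (min_users,)).fetchall()


two_or_more = shared_categories(2)    # same as ">= 2"
three_or_more = shared_categories(3)  # same as "> 2"

print(two_or_more)    # [('dffrf', 7, 2)]
print(three_or_more)  # []
```

Because the join already requires a different user on the other side, every surviving group has at least two distinct users, so a threshold of two does not filter out anything further.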
answered 2012-05-18T09:37:47.593

I can't see any problems with what you have written, except that your select list omits the category, which appears to be part of your criteria. I think you could simplify it slightly and return the category as well:

SELECT  MD5, CategoryPK
FROM    Hash
GROUP BY MD5, CategoryPK
HAVING MIN(UserPK) <> MAX(UserPK)

Alternatively, you could solve this with a join. You may need to run a few tests and use EXPLAIN, but joins sometimes perform better than GROUP BY, so it is worth trying to see whether there is any significant difference.

SELECT  DISTINCT T1.MD5, T1.CategoryPK
FROM    Hash T1
        INNER JOIN Hash T2
            ON T1.MD5 = T2.MD5
            AND T1.CategoryPK = T2.CategoryPK
            AND T1.UserPK < T2.UserPK
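Both approaches in this answer can be compared on the question's data. This is only a sketch, assuming SQLite via Python's `sqlite3` module and one extra hypothetical row (a third user assigning category 7 to dffrf) that is not in the question, added to show why the join version needs DISTINCT:

```python
import sqlite3

# The question's sample rows, plus one hypothetical row (user 3) added
# purely for illustration; it is not part of the original data.
ROWS = [
    ("ADCDE", 1, 7), ("ADCDE", 1, 4), ("ADCDE", 1, 7),
    ("dffrf", 1, 7), ("dffrf", 2, 7), ("dffrf", 2, 6), ("dffrf", 1, 1),
    ("dffrf", 3, 7),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Hash (MD5 TEXT, UserPK INTEGER, CategoryPK INTEGER)")
conn.executemany("INSERT INTO Hash VALUES (?, ?, ?)", ROWS)

# GROUP BY version: a group has two different users when MIN <> MAX.
grouped = conn.execute("""
    SELECT  MD5, CategoryPK
    FROM    Hash
    GROUP BY MD5, CategoryPK
    HAVING MIN(UserPK) <> MAX(UserPK)
""").fetchall()

# Join version: pair each row with a row from a higher-numbered user.
JOIN_BODY = """
    FROM    Hash T1
            INNER JOIN Hash T2
                ON T1.MD5 = T2.MD5
                AND T1.CategoryPK = T2.CategoryPK
                AND T1.UserPK < T2.UserPK
"""
joined = conn.execute(
    "SELECT DISTINCT T1.MD5, T1.CategoryPK" + JOIN_BODY
).fetchall()

# Without DISTINCT, each qualifying pair of users yields its own row.
pair_rows = conn.execute("SELECT COUNT(*)" + JOIN_BODY).fetchone()[0]

print(grouped)    # [('dffrf', 7)]
print(joined)     # [('dffrf', 7)]
print(pair_rows)  # 3
```

With three users on the same file and category, the join produces one row per user pair (1-2, 1-3, 2-3), which DISTINCT collapses back to a single result.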
answered 2012-05-18T09:37:22.980