sql - SQL to calculate the h-score (h-index)

Question

A scientist has index h if h of his or her Np papers have at least h citations each, and the other (Np − h) papers have no more than h citations each.

Suppose we have SCIENTISTS, PAPERS and CITATIONS tables, with a 1-n relationship between SCIENTISTS and PAPERS and a 1-n relationship between PAPERS and CITATIONS. How can I write a SQL statement that computes the h-score for each scientist in the SCIENTISTS table?

To show some research effort, here is SQL that computes the number of citations for each paper:

SELECT COUNT(CITATIONS.id) AS citations_count
FROM PAPERS
LEFT OUTER JOIN CITATIONS ON (PAPERS.id = CITATIONS.paper_id)
GROUP BY PAPERS.id
ORDER BY citations_count DESC;

score 4 · Accepted Answer

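What the h-value does is count citations in two ways. Say that a scientist has the following citation counts:

10
 8
 5
 5
 2
 1

Now look, for each count, at its rank in descending order (roughly, how many papers have that many or more citations), and at the difference between the two: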

citations   rank   difference
10          1       9
 8          2       6
 5          3       2
 5          3       2
 2          5      -3
 1          6      -5

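The number you want is where this difference would be 0. In this case, that number is 4: four papers (10, 8, 5 and 5) each have at least 4 citations, but there are not five papers with at least 5.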

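The fact that the number is 4 makes this hard, because 4 does not appear in the original data. The calculation therefore needs a generated table of numbers to test against.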
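To see why a numbers table helps, here is a small self-contained sketch over the six sample counts. It uses inline VALUES lists (SQL Server or PostgreSQL syntax) instead of real tables; the names counts, numbers and papers_with_at_least_n are made up for the illustration.

```sql
-- Hypothetical inline data: the six sample citation counts and the candidates 1..6.
with counts as (
      select c
      from (values (10), (8), (5), (5), (2), (1)) as v(c)
     ),
     numbers as (
      select n
      from (values (1), (2), (3), (4), (5), (6)) as t(n)
     )
-- For each candidate n, count the papers that have at least n citations.
select nums.n,
       (select count(*)
        from counts
        where c >= nums.n
       ) as papers_with_at_least_n
from numbers nums
order by nums.n;
```

For this data the counts for n = 1..6 are 6, 5, 4, 4, 4, 2, so 4 is the largest n that still has at least n papers.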

The following does this, using SQL Server syntax to generate a table with 100 numbers:

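with numbers as (
      -- recursive CTE producing 1..100; going much higher needs OPTION (MAXRECURSION ...),
      -- because SQL Server stops at 100 recursion levels by default
      select 1 as n
      union all
      select n+1
      from numbers
      where n < 100
     ),
     numcitations as (
      -- citations per paper; papers without citations count as 0
      SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count
      FROM PAPERS p LEFT OUTER JOIN
           CITATIONS c
           ON p.id = c.paper_id
      GROUP BY p.scientistid, p.id
     ),
     hcalc as (
      -- hval = number of the scientist's papers with at least n citations
      select s.scientistid, numbers.n,
             (select count(*)
              from numcitations nc
              where nc.scientistid = s.scientistid and
                    nc.citations_count >= numbers.n
             ) as hval
      from numbers cross join
           (select scientistid from SCIENTISTS) s  -- assumes SCIENTISTS exposes its key as scientistid
     )
-- the h-score is the largest n with at least n such papers; testing hval = n alone
-- misses ties (counts 2, 2, 2 give hval = 3 at n = 2). Scientists with h-score 0 produce no row.
select scientistid, max(n) as hscore
from hcalc
where hval >= n
group by scientistid;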

Edit:

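There is a way to do this without a numbers table. The h-score is the number of a scientist's papers whose citation count is greater than or equal to the paper's rank among that scientist's papers, ordered by citations descending. That is much easier to calculate: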

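select scientistid, count(*) as hscore
from (SELECT p.scientistid, p.id, COUNT(c.id) AS citations_count,
             -- rank within the scientist (not within each paper); row_number() rather than
             -- rank() so tied papers get distinct positions (2, 2, 2 must give 2, not 3)
             row_number() over (partition by p.scientistid order by count(c.id) desc) as ranking
      FROM PAPERS p LEFT OUTER JOIN
           CITATIONS c
           ON p.id = c.paper_id
      GROUP BY p.scientistid, p.id
     ) t
where ranking <= citations_count
group by scientistid;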
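
Ties are the case to watch with this approach. The sketch below runs on inline sample data rather than the real tables and compares row_number() with rank() for two hypothetical scientists: one with the counts 10, 8, 5, 5, 2, 1 and one with 2, 2, 2. The CTE and column names are made up for the illustration (SQL Server or PostgreSQL syntax).

```sql
-- Hypothetical inline data standing in for the per-paper citation counts.
with numcitations as (
      select scientistid, citations_count
      from (values (1, 10), (1, 8), (1, 5), (1, 5), (1, 2), (1, 1),
                   (2, 2), (2, 2), (2, 2)) as v(scientistid, citations_count)
     ),
     ranked as (
      select scientistid, citations_count,
             -- row_number() gives tied papers distinct positions, rank() gives them the same one
             row_number() over (partition by scientistid order by citations_count desc) as rn,
             rank()       over (partition by scientistid order by citations_count desc) as rnk
      from numcitations
     )
-- Count the papers whose position does not exceed their citation count, once per ranking function.
select scientistid,
       sum(case when rn  <= citations_count then 1 else 0 end) as h_by_row_number,
       sum(case when rnk <= citations_count then 1 else 0 end) as h_by_rank
from ranked
group by scientistid
order by scientistid;
```

Scientist 1 gets 4 either way, but for scientist 2 row_number() gives 2 while rank() gives 3. By the definition, three papers with 2 citations each have an h-score of 2.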

score 0 · Accepted Answer

This is an MS SQL solution.

/*
The h-index is calculated by counting the number of publications for which an author has been 
cited by other authors at least that same number of times.  For instance, an h-index of 17 means 
that the scientist has published at least 17 papers that have each been cited at least 17 times.  
If the scientist's 18th most cited publication was cited only 10 times, the h-index would remain at 17.  
If the scientist's 18th most cited publication was cited 18 or more times, the h-index would rise to 18.
*/
declare @num_pubs int = 4;
declare @sited_seed int = 10;
declare @publication table
    (
     scientist_id int,
     publication_title int,
     publication_cited int
    );

with numbers as (
      select 1 as n
      union all
      select n + 1
      from numbers
      where n < @num_pubs
     )

insert into @publication select 1 as scientist_id, n as publication_title, ABS(Checksum(NewID()) % @sited_seed) as publication_cited from numbers

select * from @publication

-- data sample for scientist#1
-- scientist_id     publication      cited
-- 1                pub 1            2
-- 1                pub 2            0
-- 1                pub 3            1
-- 1                pub 4            3
select scientist_id, max(pub_row_number) as h_index
from (
        select p.scientist_id, pub_row_number --, count(p.publication_cited)
        from @publication as p
            inner join  
            (
                select scientist_id, publication_title, publication_cited, 
                        ROW_NUMBER() OVER (PARTITION BY scientist_id ORDER BY publication_cited) as pub_row_number from @publication
                -- number each scientist's publications 1..N; these row numbers are the candidate h values tested against the citation counts
                -- scientist_id     publication      cited  pub_row_number
                -- 1                pub 2            0      1
                -- 1                pub 3            1      2
                -- 1                pub 1            2      3
                -- 1                pub 4            3      4
            ) as c
        on pub_row_number <= p.publication_cited and p.scientist_id = c.scientist_id
                -- the join condition {pub_row_number <= p.publication_cited} solves two problems
                -- 1. it removes all publications whose citation count is zero
                -- 2. for every candidate value (pub_row_number) it collects the publications cited at least that many times,
                --    so that they can be counted in the next step
                -- scientist_id  pub_row_number       publication_cited
                -- 1             1         >          0  >> filtered out
                -- 1             1         <=         1
                -- 1             1         <=         2
                -- 1             1         <=         3
                -- 1             2         <=         2
                -- 1             2         <=         3
                -- 1             3         <=         3
        group by p.scientist_id, pub_row_number --, p.publication_cited
        having pub_row_number <= count(p.publication_cited)
                -- {pub_row_number <= count(p.publication_cited)} keeps only the candidates backed by enough publications
                -- there are 3 publications cited at least once, 2 cited at least 2 times, and 1 cited at least 3 times
                -- scientist_id  pub_row_number       count(publication_cited)
                -- 1             1         <=         3
                -- 1             2         <=         2 
                -- 1             3         >          1  >> filtered out by the having clause
    ) as final 
                -- finally, max(pub_row_number) pulls out the answer
                -- scientist_id  h_index   
                -- 1             2           
group by scientist_id

UNION 
-- include scientists without publications (assumes a separate scientist table, which this script does not declare)
select scientist_id, 0 from scientist where scientist_id not in (select scientist_id from @publication) 

UNION 
-- include scientists with publications but without citations
select scientist_id, 0 from @publication group by scientist_id having sum(publication_cited) = 0
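
The two extra UNION branches can also be avoided. As a rough sketch, and not the answerer's method: number each scientist's publications by citations in descending order, then LEFT JOIN the qualifying rows onto the scientist list, so scientists with nothing qualifying come out as 0. The scientist and publication CTEs below are inline stand-ins with made-up rows (scientist 1 reuses the four sample publications from the comments), and ranked and rn are illustrative names.

```sql
-- Hypothetical inline data: scientist 3 has no publications, scientist 2 has no citations.
with scientist as (
      select scientist_id
      from (values (1), (2), (3)) as v(scientist_id)
     ),
     publication as (
      select scientist_id, publication_cited
      from (values (1, 2), (1, 0), (1, 1), (1, 3),
                   (2, 0), (2, 0)) as v(scientist_id, publication_cited)
     ),
     ranked as (
      -- position of each publication within its scientist, most cited first
      select scientist_id, publication_cited,
             row_number() over (partition by scientist_id order by publication_cited desc) as rn
      from publication
     )
-- count(r.rn) ignores the NULLs produced by the left join, so unmatched scientists get 0
select s.scientist_id,
       count(r.rn) as h_index
from scientist s
     left join ranked r
     on r.scientist_id = s.scientist_id and r.rn <= r.publication_cited
group by s.scientist_id
order by s.scientist_id;
```

With this data the result is 2 for scientist 1 and 0 for scientists 2 and 3.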
