sql - Use google bigquery to build histogram graph

Question

How can I write a query that makes rendering a histogram easier?

For example, say we have 100 million people with ages and we want to draw a histogram with buckets for ages 0-10, 11-20, 21-30, etc. What does the query look like?

Has anyone done this? Did you try connecting the query result to a Google spreadsheet to draw the histogram?

score 18 · Accepted Answer

You could also use the quantiles aggregation operator to get a quick look at the distribution of ages.

SELECT
  quantiles(age, 10)
FROM mytable

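Each row of the result is the age at an evenly spaced position in the sorted list of ages. Note that in legacy SQL the bucket count includes the minimum and maximum, so quantiles(age, 10) returns the minimum, eight interior points spaced one ninth apart, and the maximum. Use quantiles(age, 11) to get the ages 1/10, 2/10, 3/10, etc. of the way through the list.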
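
As a rough local illustration of what those rows mean, here is a minimal Python sketch that picks evenly spaced values out of a sorted list. It is exact rather than approximate and it is not BigQuery code; `quantile_boundaries` and the sample ages are hypothetical, chosen only for the example.

```python
def quantile_boundaries(values, parts):
    """Return parts + 1 values: the minimum, the interior cut points, the maximum."""
    ordered = sorted(values)
    if not ordered:
        return []
    last = len(ordered) - 1
    # Position i / parts of the way through the sorted list, for i = 0..parts.
    return [ordered[(i * last) // parts] for i in range(parts + 1)]


# Hypothetical sample: one person for each age from 0 to 100.
ages = list(range(0, 101))
boundaries = quantile_boundaries(ages, 10)
print(boundaries)
```

With evenly spread ages the boundaries are evenly spaced too; on skewed data they bunch up where most of the rows are, which is what makes them a quick look at the distribution.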

score 12 · Accepted Answer

See the 2019 update, with #standardSQL --Fh

The subquery idea works, as does "CASE WHEN" and then doing a group by:

SELECT COUNT(field1), bucket 
FROM (
    SELECT field1, CASE WHEN age >=  0 AND age < 10 THEN 1
                        WHEN age >= 10 AND age < 20 THEN 2
                        WHEN age >= 20 AND age < 30 THEN 3
                        ...
                        ELSE -1 END as bucket
    FROM table1) 
GROUP BY bucket

Alternatively, if the buckets are regular, you could just divide and cast to an integer:

SELECT COUNT(field1), bucket 
FROM (
    SELECT field1, INTEGER(age / 10) as bucket FROM table1)
GROUP BY bucket
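
The divide-and-cast idea can be checked locally with a small Python sketch. It assumes non-negative values and a fixed bucket width; `histogram_by_width` and the sample ages are made up for illustration. Note that dividing by 10 produces buckets 0-9, 10-19, 20-29 and so on, rather than 0-10, 11-20.

```python
from collections import Counter


def histogram_by_width(values, width):
    """Map each value to floor(value / width) and count rows per bucket."""
    counts = Counter(int(v // width) for v in values)
    # Sort by bucket index so the result reads like ORDER BY bucket.
    return dict(sorted(counts.items()))


# Hypothetical sample ages.
ages = [3, 9, 10, 19, 25, 29, 30]
result = histogram_by_width(ages, 10)
print(result)
```

Bucket index 0 covers ages 0-9, index 1 covers 10-19, and so on; multiply the index by the width to recover the lower bound for a chart label.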

score 8 · Accepted Answer

With #standardSQL and an auxiliary stats query, we can define the range the histogram should look into.

Here is an example for the flight time between SFO and JFK, with 10 buckets:

WITH data AS ( 
    SELECT *, ActualElapsedTime datapoint
    FROM `fh-bigquery.flights.ontime_201903`
    WHERE FlightDate_year = "2018-01-01" 
    AND Origin = 'SFO' AND Dest = 'JFK'
)
, stats AS (
  SELECT min+step*i min, min+step*(i+1) max
  FROM (
    SELECT max-min diff, min, max, (max-min)/10 step, GENERATE_ARRAY(0, 10, 1) i
    FROM (
      SELECT MIN(datapoint) min, MAX(datapoint) max
      FROM data
    )
  ), UNNEST(i) i
)

SELECT COUNT(*) count, (min+max)/2 avg
FROM data 
JOIN stats
ON data.datapoint >= stats.min AND data.datapoint<stats.max
GROUP BY avg
ORDER BY avg

If you need round numbers, see: https://stackoverflow.com/a/60159876/132438
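
The bucket arithmetic in the query above can be sketched in plain Python to see how the boundaries and midpoints come out. This is only an illustration under stated assumptions: `equal_width_histogram` and the sample times are hypothetical, and the handling of the case where all values are equal is a choice made for this sketch, not something taken from the query.

```python
def equal_width_histogram(values, buckets=10):
    """Count values in equal-width buckets between min and max.

    Returns (bucket_midpoint, count) pairs for non-empty buckets only,
    similar to the inner JOIN plus GROUP BY in the query.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case (an assumption of this sketch): one bucket holds everything.
        return [(float(lo), len(values))]
    step = (hi - lo) / buckets
    # buckets + 1 slots: the extra one catches values equal to the maximum,
    # mirroring GENERATE_ARRAY(0, 10, 1), which yields 11 offsets.
    counts = [0] * (buckets + 1)
    for v in values:
        index = min(int((v - lo) / step), buckets)
        counts[index] += 1
    # Midpoint of bucket i is lo + step * (i + 0.5), i.e. (min + max) / 2.
    return [(lo + step * (i + 0.5), c) for i, c in enumerate(counts) if c]


# Hypothetical elapsed times in minutes.
times = [300, 310, 320, 400]
histogram = equal_width_histogram(times, buckets=10)
print(histogram)
```

The last pair shows why the offsets run from 0 to 10 inclusive: a value equal to the maximum fails the `< max` test of the tenth bucket, so it needs one more bucket starting at the maximum.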

score 1 · Accepted Answer

Write a subquery like this:

(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0)

Then you can combine several of them (in legacy SQL, a comma between subqueries acts as UNION ALL):

SELECT * FROM
(SELECT '1' AS agegroup, count(*) FROM people WHERE AGE <= 10 AND AGE >= 0),
(SELECT '2' AS agegroup, count(*) FROM people WHERE AGE <= 20 AND AGE > 10),
(SELECT '3' AS agegroup, count(*) FROM people WHERE AGE <= 120 AND AGE > 20)

The result will look like this:

Row  agegroup  count
1    1         somenumber
2    2         somenumber
3    3         another number

Hope this helps. Of course, for the agegroup value you can write a label such as '0 to 10' instead.

score 1 · Accepted Answer

There is now the APPROX_QUANTILES aggregation function in standard SQL.

SELECT
    APPROX_QUANTILES(column, number_of_bins)
...

score 0 · Accepted Answer

You are looking for a single vector of information. I would normally write the query like this:

select
  count(*) as num,
  integer( age / 10 ) as age_group
from mytable
group by age_group

Arbitrary groups would need a big CASE statement. That is simple, but much longer. My example should be fine if every bucket covers N years.

score 0 · Accepted Answer

Using a cross join to get your min and max values (not that expensive on a single tuple), you can get a normalized bucket list for any given bucket count:

select
  min(data.VAL) as min,
  max(data.VAL) as max,
  count(data.VAL) as num,
  integer((data.VAL-value.min)/(value.max-value.min)*8) as bucket
from [table] data
CROSS JOIN (SELECT MAX(VAL) as max, MIN(VAL) as min from [table]) value
GROUP BY bucket
ORDER BY bucket

In this example we get 8 buckets (indexes 0 through 7), plus index 8 for rows equal to the maximum and one more row for NULL VAL.
