I'm trying to get a handle on generating random data in Postgres, and find I'm misunderstanding something about LATERAL JOIN
. Building on some help I got earlier, I've got some code that's trying to:
-- Generate a series of numbers
-- Generate timestamps on the hour matching the number sequence
-- Generate a random(ish) score in a roughly normal distribution for each row
-- Pick an ID randomly from a table in the database.
This last bit is not working. When I run the script shown below, I get a random value for the facility_id
, but every row has the same random value. I'd like the random ID assigned on each row, not once globally for the whole run. In procedural thinking, the facility_id
is being assigned before the loop, and I want it assigned in the loop. I thought that LATERAL JOIN
would help me out here, but
WITH facilities_count AS
(SELECT count(*) from facility)
SELECT hour_number AS id, -- Get numbers in sequence.
'2019-01-01 00:00'::timestamp + interval '1 HOUR' * hour_number AS stamp, -- Get hours in sequence
ABS(TRUNC(normal_rand(1, 0, 1) * 100)) AS score, -- Create a random score in a ~normal distribution.
random_facility.id
FROM (SELECT * FROM generate_series(1,8760,1) AS hour_number) generated_numbers
LEFT JOIN LATERAL
(SELECT id
FROM facility
OFFSET floor(random() * (select count from facilities_count))
LIMIT 1) random_facility
ON true;
I thought that a subquery might work, but I also get a single value across all rows for the facility_id
with this code:
WITH facilities_counter AS
(SELECT count(*) from facility)
SELECT hour_number AS id, -- Get numbers in sequence.
'2019-01-01 00:00'::timestamp + interval '1 HOUR' * hour_number AS stamp, -- Get hours in sequence
ABS(TRUNC(normal_rand(1, 0, 1) * 100)) AS score, -- Create a random score in a ~normal distribution.
(SELECT id FROM facility OFFSET floor(random() * (select count from facilities_counter)) LIMIT 1)
FROM (SELECT * FROM generate_series(1,8760,1) AS hour_number) generated_numbers;
I haven't listed the facility
table definition, but the only field that matters above is id
, so any table would work the same way.
In case it makes any difference to the answer, once I can figure out the solution to this problem, I'd like to then use the random facility_id
on each row as an input to select something else out of another table.
Thanks for any help. I'm working on this not just to get the solution, but to try and get a better mental model of how the various tools work. I'm (obviously) not at the point where I can read code like the above and predict in my head how it will behave. Getting to that understanding is kind of fundamental to figuring out how to solve problems on my own. Put another way, I'm not only trying to solve this problem, I'm trying to reduce my mental gaps generally.
Premature Optimization
I'm editing my question as I need to post more code.
Ha! Glad to see I'm not the only one accused of having premature optimization problems ;-) That's something that can be hurled in anger in a design or review session, to be sure.
I think what you're showing is that the id
is used in an expression, which makes its outcome indeterminate. And here's where I live and relearn that I I should post my table structure. Our id
fields are UUIDs. I've attempted to hack together something that would use the UUID in an expression, but it does not change the behavior.
where left(id::text,1) <> 'X' -- prevent premature optimization
The premature optimization you describe, and the workaround to defeating it, are not behaviors I could predict. Because gaps. I've tried running my code with the left
expression through explain (costs off)
, and I get this kind of output:
CTE facilities_counter
-> Aggregate
-> Seq Scan on facility
InitPlan 3 (returns $2)
-> Limit
InitPlan 2 (returns $1)
-> CTE Scan on facilities_counter
-> Seq Scan on facility facility_1
Filter: ("left"((id)::text, 1) <> 'X'::text)
-> ProjectSet
-> Function Scan on generate_series hour_number```
I'm unclear from the output how I would distinguish the behavior you describe, which is still present here (?)