I'm looking to optimize, or reduce the number of steps in, the workflow below.
I have a Hive table named, say, logs, and I apply some custom UDFs to it to obtain transformed logs.
I create the transformed logs as a table with something like:
CREATE TABLE transform_logs
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
AS
SELECT nonsafehash(visitorid), nonsafehash(url), action FROM logs
I then run
./bin/hadoop dfs -cat /user/hive/warehouse/transform_logs/\* > transform_logs.csv
only to follow it with
./bin/hadoop dfs -put transform_logs.csv /some/other/path
Are my last two steps equivalent to simply doing a 'mv' within HDFS?
My end goal is to have a single CSV file under /some/other/path.
It seems like I should not have to round-trip the data through the local filesystem to achieve this.
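To sanity-check my own reasoning, here is a local simulation of my two steps next to the single pipe I suspect is equivalent. All the /tmp paths and file names below are made up for illustration, and the `-put -` read-from-stdin idea is an assumption I have not tried on my cluster:

```shell
# Stand-in for the Hive warehouse output: one file per reducer (hypothetical paths).
mkdir -p /tmp/warehouse/transform_logs /tmp/other/path
printf '1,a,click\n' > /tmp/warehouse/transform_logs/000000_0
printf '2,b,view\n'  > /tmp/warehouse/transform_logs/000001_0

# My current two-step approach: merge to a local file, then copy it back.
cat /tmp/warehouse/transform_logs/* > /tmp/transform_logs.csv
cp /tmp/transform_logs.csv /tmp/other/path/transform_logs.csv

# Candidate one-step approach: stream the merge straight to the target,
# skipping the intermediate local file. On HDFS this would presumably be:
#   ./bin/hadoop dfs -cat /user/hive/warehouse/transform_logs/\* | \
#       ./bin/hadoop dfs -put - /some/other/path/transform_logs.csv
cat /tmp/warehouse/transform_logs/* > /tmp/other/path/one_step.csv

# Confirm both routes produce the same single CSV.
cmp /tmp/other/path/transform_logs.csv /tmp/other/path/one_step.csv \
    && echo "same contents"
```

The local `cat`/`cp` lines only mirror the shape of the HDFS commands; the real question is whether the piped HDFS variant in the comment behaves the same way.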