Disclaimer: I know little about GHC's compilation pipeline, but I hope to learn more about it with this post, for example whether comparing imperative vs. functional code is even relevant when talking about compilation.
As you know, loop unrolling reduces the number of iterations of a loop by duplicating the code inside it. This improves performance since it reduces the number of jumps (and the penalties associated with them) and, AFAIR, creates bigger basic blocks, leaving room for better register renaming.
I was wondering: could there be an equivalent to loop unrolling for functional programming? Could we 'unroll' a function, i.e. open up/expand its definition, to reduce the number of calls to that function and/or create bigger chunks of code, leaving room for more code-rewrite optimizations (like register renaming, or some FP equivalent)?
Something that would 'unroll' or 'expand' a function's definition, for example by partially evaluating it (maybe mixed with some other tactic), in order to trade space for time.
An example of what I had in mind:
map1 _ [] = []
map1 f (x:xs) = (f x): map1 f xs
Would unroll to
map2 _ [] = []
map2 f (x:x1:xs) = (f x):(f x1):map2 f xs
map2 f (x:xs) = (f x): map2 f xs
Once more:
map4 _ [] = []
map4 f (x:x1:x2:x3:xs) = (f x):(f x1):(f x2):(f x3):map4 f xs
map4 f (x:x1:x2:xs) = (f x):(f x1):(f x2):map4 f xs
map4 f (x:x1:xs) = (f x):(f x1):map4 f xs
map4 f (x:xs) = (f x): map4 f xs
Two things are at play here: the extra cases of map4 (and the corresponding pattern matches on the list) could degrade performance, while the reduced number of calls to map4 could improve it. Maybe this could also shave off some of the constant overhead introduced by lazy evaluation?
Well, that doesn't seem too hard to write a test for, so after putting criterion to work (a rough sketch of the harness follows the numbers below), this is what I got:
Problem size    map         map2        map4
5*10^6          105.4 ms    93.34 ms    89.79 ms
1*10^7          216.3 ms    186.8 ms    180.1 ms
5*10^7          1050 ms     913.7 ms    899.8 ms
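In case anyone wants to reproduce this, a minimal criterion harness along these lines would do (a sketch only: the (+1) workload over a list of Ints is just an example, and map1/map2/map4 are the definitions above):

import Criterion.Main

-- Sketch: assumes map1, map2 and map4 from above are in scope.
-- nf forces the whole result list, so laziness doesn't hide the work.
main :: IO ()
main = defaultMain
  [ bgroup (show n)
      [ bench "map1" $ nf (map1 (+ 1)) xs
      , bench "map2" $ nf (map2 (+ 1)) xs
      , bench "map4" $ nf (map4 (+ 1)) xs
      ]
  | n <- [5 * 10 ^ 6, 1 * 10 ^ 7, 5 * 10 ^ 7 :: Int]
  , let xs = [1 .. n]
  ]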
Well, it seems that unrolling had some effect^1! Across the three sizes, map4 comes out roughly 15% faster than map.
Time for the questions then:
- Has this been discussed before? Is something like this already implemented?
- Is it really the reduced number of calls to map4 that improves the speed?
- Can this be automated? (A sketch with a rewrite rule follows this list.)
- Could I evaluate in chunks? I.e., if (f x) is fully evaluated, also fully evaluate everything up to (f x3). (A sketch of that follows this list too.)
- Are there other forms in which this sort of unrolling could come into play?
- How much inflation in function size could this lead to?
- Are there any shortcomings that make this a bad idea?
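On the automation question: I don't know of a GHC flag that does this unrolling, but GHC's rewrite rules could at least automate substituting a hand-unrolled version at call sites. A sketch (untested; whether the rule actually fires depends on inlining phases and on the Prelude's own fusion rules for map):

{-# RULES
"map/unroll4"  forall f xs.  map f xs = map4 f xs
  #-}

-- Note: this doesn't generate map4, it's still written by hand. Generating
-- the unrolled definition itself would need something like Template Haskell
-- or a compiler plugin.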
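On the 'evaluate by chunks' question, one way I can imagine tying a chunk together is to make the first element's thunk force the other three with seq. A sketch (map4' is a hypothetical name; seq only forces to WHNF, so 'fully evaluated' would really need something like deepseq, and the strictness change means a whole chunk gets evaluated even if the consumer only looks at its first element):

-- hypothetical variant of map4: forcing the first element of a chunk
-- also forces the other three elements (to WHNF)
map4' :: (a -> b) -> [a] -> [b]
map4' f (x0:x1:x2:x3:xs) =
  let y1 = f x1
      y2 = f x2
      y3 = f x3
      y0 = y1 `seq` y2 `seq` y3 `seq` f x0
  in  y0 : y1 : y2 : y3 : map4' f xs
map4' f (x:xs) = f x : map4' f xs
map4' _ []     = []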
1: I've also unrolled fib, since this sort of optimization would apply there too, but the performance gain there is mostly cheating around a (very) bad algorithm.
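For completeness, this is roughly what I mean by unrolling fib: the naive definition with one level of recursion expanded inline (my own hand expansion, just to illustrate the footnote):

fib :: Int -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- one level expanded inline: fib (n - 1) = fib (n - 2) + fib (n - 3)
fib2 :: Int -> Integer
fib2 0 = 0
fib2 1 = 1
fib2 2 = 1
fib2 n = (fib2 (n - 2) + fib2 (n - 3)) + fib2 (n - 2)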