First, the easy part. A 100-element array is so small that it fits comfortably in a single core's cache. Moreover, clearing an array amounts to filling a block of memory with zeros, something the CPU can do with a single block-fill instruction, so it is about as fast as an operation can get.
In fact, with SIMD (SSE) instructions and modern memory controllers, the hardware can probably zero that whole block with a handful of wide stores issued from a single instruction.
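To make that concrete, here is a minimal sketch (a plain console program; the 100-element size comes from your example). `Array.Clear` is the idiomatic way to zero a managed array, and the runtime implements it as an optimized block fill rather than an element-by-element loop:

```csharp
using System;

class ClearDemo
{
    static void Main()
    {
        var data = new int[100];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        // Array.Clear is implemented as an optimized block fill
        // (memset-style), so the whole 400-byte array is zeroed
        // with a few wide stores instead of 100 individual writes.
        Array.Clear(data, 0, data.Length);

        Console.WriteLine(data[0] + data[99]); // prints 0: every element is zero
    }
}
```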
On the other hand, Parallel.For introduces overhead: it has to partition the data, create tasks to work on the partitions, and collect the results. Underneath it, the runtime has to move data between cores' caches, handle memory synchronization and so on. For your example, this overhead can be significantly larger than the time actually needed to zero the memory locations.
In fact, for small sizes it is quite possible that the bulk of the overhead is memory synchronization, as the cores fight over the same memory. Cache coherency works at cache-line granularity: a typical 64-byte line holds 32 16-bit ints, so when adjacent chunks of such a tiny array are written by different cores, the lines ping-pong between them (false sharing), and a whole 4K page holds only 2K 16-bit ints to begin with.
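You can see the overhead directly with a rough timing sketch (the actual tick counts depend entirely on your machine, so no expected numbers are shown; this is not a rigorous benchmark, just an illustration):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class OverheadDemo
{
    static void Main()
    {
        var data = new int[100];

        // Sequential clear: a single optimized block fill.
        var sw = Stopwatch.StartNew();
        Array.Clear(data, 0, data.Length);
        sw.Stop();
        Console.WriteLine($"sequential ticks: {sw.ElapsedTicks}");

        // Parallel clear: pays for partitioning, task creation
        // and synchronization before any element is touched.
        sw.Restart();
        Parallel.For(0, data.Length, i => data[i] = 0);
        sw.Stop();
        Console.WriteLine($"parallel ticks:   {sw.ElapsedTicks}");
    }
}
```

On a tiny array the parallel version typically loses by a wide margin, for exactly the reasons described above.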
As for how PLINQ schedules tasks: there are several different partitioning schemes, chosen depending on the operators you use. Check Partitioning in LINQ for a nice intro. In any case, the partitioner will try to determine whether there is any benefit to be gained from partitioning, and may decide not to partition the data at all.
In your case, the partitioner will probably use range partitioning. Your payload costs only a few CPU cycles per element, so all you are measuring is the overhead of partitioning, creating tasks, managing synchronization and collecting the results.
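If you want to control this yourself, `Partitioner.Create` exposes range partitioning explicitly: each task gets a contiguous index range and runs a tight inner loop over it, instead of paying per-element delegate overhead. A sketch (the `Math.Sqrt` payload is just a stand-in for per-element work heavier than `= 0`):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangeDemo
{
    static void Main()
    {
        var data = new double[1_000_000];

        // Split the index space into contiguous ranges, one per chunk.
        var ranges = Partitioner.Create(0, data.Length);

        Parallel.ForEach(ranges, range =>
        {
            // Each task iterates its own [Item1, Item2) slice.
            for (int i = range.Item1; i < range.Item2; i++)
                data[i] = Math.Sqrt(i);
        });

        Console.WriteLine(data[999_999] > 999); // prints True: sqrt(999999) is about 1000
    }
}
```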
A better benchmark would be to run some aggregations over a large array, e.g. counts, sums and averages, so each element contributes real work.
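For example, a PLINQ aggregation like this gives the partitioner enough per-element work to amortize its overhead (the `Math.Sqrt` inside the average is there only to make each element cost something):

```csharp
using System;
using System.Linq;

class AggregateDemo
{
    static void Main()
    {
        // A large input where parallelism has a chance to pay off.
        var data = Enumerable.Range(1, 10_000_000).ToArray();

        // Aggregations: each element contributes real CPU work,
        // and PLINQ combines per-partition results at the end.
        double avg  = data.AsParallel().Average(x => Math.Sqrt(x));
        int    count = data.AsParallel().Count(x => x % 3 == 0);

        Console.WriteLine(count); // prints 3333333: multiples of 3 in 1..10,000,000
    }
}
```

Timing this against the sequential `data.Average(...)` / `data.Count(...)` versions on a multi-core machine should finally show parallelism winning.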