I have a simple two-node cluster (master on one node, workers on both). I tried using:
python disco/util/distrfiles.py bigtxt /etc/nodes > bigtxt.chunks
to distribute the files, which worked fine.
I expected this to mean that each process would spawn and operate only on its local data, but at times they seem to try to access data on the other machine.
Instead, I copied the entire data directory. Everything worked fine until the reduce phase, where I received this error:
CommError: Unable to access resource (http://host:8989/host/8b/sup@4f6:d2f6:34b3b/map-index.txt):
It seems the file is expected to be fetched directly over HTTP, but I don't think that is working correctly. Are files supposed to be passed back and forth between nodes over HTTP? Do I need a distributed filesystem for multi-node MapReduce?
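
To narrow this down, the URL from the error can be fetched by hand from each node to see whether it is reachable at all. Below is a minimal sketch of such a check. It assumes Python 3 and uses only the standard library; `probe` is a hypothetical helper name, not part of Disco, and the URL to test is whatever appears in the CommError message.

```python
import sys
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return (reachable, detail) for a single HTTP GET of `url`.

    Hypothetical helper for debugging; not part of Disco.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, "HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 404).
        return False, "HTTP %d" % exc.code
    except (urllib.error.URLError, OSError) as exc:
        # No answer at all: DNS failure, connection refused, timeout, ...
        return False, "no response: %s" % exc


if __name__ == "__main__":
    # Pass one or more URLs on the command line, e.g. the one from the error.
    for url in sys.argv[1:]:
        ok, detail = probe(url)
        print("%s -> %s (%s)" % (url, "ok" if ok else "FAILED", detail))
```

Running this on both machines with the same URL separates two cases: an error status means the server on that port answered but could not serve the path, while "no response" would suggest a hostname, port, or firewall problem between the nodes.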