I am executing a subprocess using Popen and feeding it input as follows (using Python 2.7.4):
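import os
from subprocess import Popen, PIPE

# Copy the parent environment and force a UTF-8 locale for the child process
env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
args = ['chasen', '-i u', '-F"%m "']
process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
# string holds the input text, which includes Japanese characters
out, err = process.communicate(input=string)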
Adding the LC_ALL entry to the environment the subprocess runs with is necessary because the input string includes Japanese characters, and when the script is not executed from the command line (in my case it is called by Apache), Python cannot guess the encoding.
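For reference, here is a minimal, self-contained sketch of the same pattern that runs without chasen installed. The helper name run_with_utf8 and the echo_cmd stand-in child process are hypothetical and exist only for illustration. The sketch also assumes the input is a unicode string that is explicitly encoded to UTF-8 before being written to the pipe, which may or may not match what my real script does.

```python
# -*- coding: utf-8 -*-
import os
import sys
from subprocess import Popen, PIPE


def run_with_utf8(args, text, locale='en_US.UTF-8'):
    """Run args, send text as UTF-8 bytes on stdin, return decoded (out, err).

    Hypothetical helper for illustration only.
    """
    # Copy the parent environment and override only the locale.
    env = dict(os.environ)
    env['LC_ALL'] = locale
    process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
    # Assumption: text is unicode, so encode it explicitly before piping.
    out, err = process.communicate(input=text.encode('utf-8'))
    return out.decode('utf-8'), err.decode('utf-8', 'replace')


# Stand-in for chasen (an assumption): a child that copies stdin bytes to stdout.
echo_cmd = [
    sys.executable, '-c',
    'import sys; '
    'i = getattr(sys.stdin, "buffer", sys.stdin); '
    'o = getattr(sys.stdout, "buffer", sys.stdout); '
    'o.write(i.read())',
]

out, err = run_with_utf8(echo_cmd, u'日本語のテキスト。')
```

Swapping echo_cmd for the chasen argument list should reproduce the setup in question, with the encoding step made explicit so it can be ruled in or out as the cause.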
This setup has worked fine for me with other commands. However, now that I'm using chasen (a Japanese tokenizer), whenever I send it unicode characters the subprocess does not return; it just sits there while the Python script chews up memory. This seems like an encoding problem, but I thought I would have sorted that out by specifying the encoding with the LC_ALL environment variable.
Edit: Extra weirdness as follows... I don't get this problem when executing the Python script from the command line, with the notable exception of the '。' character. For some reason this character also causes the same strange behaviour from chasen.