I am executing a subprocess using Popen and feeding it input as follows (using Python 2.7.4):

import os
from subprocess import Popen, PIPE

env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
args = ['chasen', '-i u', '-F"%m "']
process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
out, err = process.communicate(input=string)

Setting LC_ALL in the subprocess's environment is necessary because the input string includes Japanese characters, and when the script is not run from the command line (in my case it is called by Apache), Python cannot guess the encoding.

This setup has worked fine for me with other commands. However, now that I'm using chasen (a Japanese tokenizer), whenever I send it unicode characters the subprocess does not return; it just sits there while the Python script chews up memory. This looks like an encoding problem, but I thought I had sorted that out by specifying the encoding with the LC_ALL environment variable.

Edit: Extra weirdness as follows... I don't get this problem when executing the Python script from the command line, with the notable exception of the '。' character. For some reason that character triggers the same hang from chasen even there.

1 Answer

This is a bug in chasen. When it is run through Python, you can see it issuing the following syscalls:

write(1, "\n", 1)                       = 1
read(0, "", 4096)                       = 0
write(1, "\n", 1)                       = 1
read(0, "", 4096)                       = 0

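In other words, it does not handle EOF correctly: read() returns 0 bytes, yet chasen keeps looping and writing newlines. To work around this, append a newline ('\n') to the Python string, as follows: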

# coding: utf-8
import os
from subprocess import Popen, PIPE

string = u"悪妻は百年の不作。"

env = dict(os.environ)
env['LC_ALL'] = 'en_US.UTF-8'
args = ['chasen', '-i u', '-F"%m "']
process = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE, env=env)
out, err = process.communicate(input=(string + u'\n').encode('utf-8'))

print(out)
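
If the same workaround is needed in more than one place, it can be wrapped in a small helper. The following is only a sketch: the names `to_terminated_bytes` and `run_with_input` are hypothetical and not part of chasen or the standard library, and it assumes the hang is avoided whenever the input ends with a newline, as observed above.

```python
# coding: utf-8
import os
from subprocess import Popen, PIPE


def to_terminated_bytes(text, encoding='utf-8'):
    """Encode text and make sure the result ends with a newline.

    Hypothetical helper: adds the trailing newline only when it is
    missing, so already-terminated input is left unchanged.
    """
    data = text.encode(encoding)
    if not data.endswith(b'\n'):
        data += b'\n'
    return data


def run_with_input(args, text, locale='en_US.UTF-8'):
    """Run args with LC_ALL set, feed it newline-terminated input,
    and return (stdout, stderr) as bytes.

    Hypothetical helper mirroring the Popen call used above.
    """
    env = dict(os.environ)
    env['LC_ALL'] = locale
    process = Popen(args, stdin=PIPE, stdout=PIPE, stderr=PIPE, env=env)
    return process.communicate(input=to_terminated_bytes(text))


# Example call (requires chasen to be installed):
# out, err = run_with_input(['chasen', '-i u', '-F"%m "'], u"悪妻は百年の不作。")
```

Encoding to bytes before calling communicate() also means the script no longer depends on Python guessing an encoding, which matters when it runs under Apache rather than a terminal.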
answered 2013-10-11T23:23:39.867