What is the most efficient way to convert text to the proper case according to its punctuation and to fix its formatting (whitespace and so on)?
the qUiCk BROWN fox:: jumped. over , the lazy dog.
Desired result:
The quick brown fox: jumped. Over, the lazy dog.
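You tagged the question "regex", but I would not recommend using regular expressions to solve this. It is best handled with a simple state machine.
Here is a simple state machine that handles your example. If you try it on other text, you will probably find cases it does not handle; I hope the design is clear enough that you can adapt it to your needs.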
import string

s = "the qUiCk BROWN fox:: jumped. over , the lazy dog."
s_correct = "The quick brown fox: jumped. Over, the lazy dog."

def chars_from_lines(lines):
    """Yield every character from an iterable of lines (e.g. an open file)."""
    for line in lines:
        for ch in line:
            yield ch

# The three states of the machine.
start, in_sentence, saw_space = range(3)

punct = set(string.punctuation)
# '.' and '-' are allowed to repeat (e.g. "..." or "--").
punct_non_repeat = punct - set(['.', '-'])
end_sentence_chars = set(['.', '!', '?'])

def edit_sentences(seq):
    state = start
    ch_punct_last = None

    for ch in seq:
        ch = ch.lower()

        if ch == ch_punct_last:
            # Don't pass repeated punctuation.
            continue
        elif ch in punct_non_repeat:
            ch_punct_last = ch
        else:
            # Not punctuation to worry about, so forget the last.
            ch_punct_last = None

        if state == start and ch.isspace():
            continue
        elif state == start:
            # First non-space character of a sentence: capitalize it.
            state = in_sentence
            yield ch.upper()
        elif state == in_sentence and ch in end_sentence_chars:
            state = start
            yield ch
            yield ' '
        elif state == in_sentence and not ch.isspace():
            yield ch
        elif state == in_sentence and ch.isspace():
            state = saw_space
            continue
        elif state == saw_space and ch.isspace():
            # stay in state saw_space
            continue
        elif state == saw_space and ch in punct:
            # stay in state saw_space
            yield ch
        elif state == saw_space and ch.isalnum():
            state = in_sentence
            yield ' '
            yield ch

#with open("input.txt") as f:
#    s_result = ''.join(edit_sentences(chars_from_lines(f))).rstrip()

# A space is emitted after every sentence terminator, so strip the trailing one.
s_result = ''.join(edit_sentences(s)).rstrip()

print(s_result)
print(s_correct)
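The repeated-punctuation filter at the top of the loop can also be looked at in isolation. The following is only a minimal sketch, assuming the same rule as above (any punctuation character other than '.' and '-' is dropped when it immediately repeats); drop_repeated_punct is a hypothetical helper name, not part of the code above.

```python
import string

# Same rule as above: '.' and '-' may repeat (e.g. "..." or "--").
PUNCT_NON_REPEAT = set(string.punctuation) - {'.', '-'}

def drop_repeated_punct(seq):
    """Yield characters from seq, skipping immediate repeats of punctuation."""
    last = None
    for ch in seq:
        if ch == last:
            # Same punctuation character as the previous one: skip it.
            continue
        # Remember the character only if it is punctuation we collapse.
        last = ch if ch in PUNCT_NON_REPEAT else None
        yield ch

print(''.join(drop_repeated_punct("fox:: jumped!! wait... a--b")))
```

Because it is a generator over characters, it could be chained in front of a simpler state machine rather than folded into the main loop, which may make each piece easier to test.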
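Suppose the variable line holds your input string. The following should do something fairly close to what you want. Note that newlines (and any other whitespace) are converted to a single space.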
import string  # used to check if a character is a letter

# assume we start with a letter and not, for instance, a quotation mark
# (string.ascii_letters replaces Python 2's string.letters)
assert line[0] in string.ascii_letters
line = line.capitalize()  # uppercases the first character, lowercases the rest

duplPunct = []  # list of indices of duplicate punctuation
prev = line[0]
for i in range(1, len(line)):
    if line[i] == prev and prev not in string.ascii_letters:
        duplPunct.append(i)
    prev = line[i]
while len(duplPunct):
    i = duplPunct.pop()  # returns last index needing deletion
    line = line[:i] + line[i+1:]

words = line.split()  # removes all whitespace

floatingchar = []  # list of indices of words containing only a single invalid character
for i in range(1, len(words)):
    word = words[i]
    if len(word) == 1 and word not in 'ai':
        # assume single-character 'words' should be part of previous word
        floatingchar.append(i)
while len(floatingchar):
    i = floatingchar.pop()
    words[i-1] = words[i-1] + words[i]
    del words[i]

needCaps = []  # list of indices of words requiring capitalization
for i in range(len(words) - 1):
    if words[i][-1] in '.!?':
        needCaps.append(i + 1)
while len(needCaps):
    i = needCaps.pop()
    words[i] = words[i].capitalize()

line = ' '.join(words)
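The note about newlines follows from how split() and join() are combined. A small sketch of just that step, with a made-up sample string:

```python
# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and discards empty pieces.
text = "the quick\n\nbrown   fox\tjumped."
words = text.split()

# Joining with a single space therefore collapses every run to one space.
collapsed = ' '.join(words)
print(collapsed)
```

If line breaks need to be preserved, one option would be to process the text one line at a time instead of as a single string.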