ruby - How to detect the difference between ' used in contractions and in quotations

Question

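I'm trying to parse a block of text and need a way to detect the difference between apostrophes in different contexts: possession and abbreviation in one group, quotation in the other.

e.g.

"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]

but

"He said 'hello there'" -> ["He", "said", "'hello there'"]

Detecting whitespace on either side won't help, because things like " 'ello " and " cars' " would be parsed as one end of a quotation, the same as a matching pair of apostrophes. I get the feeling there's no way to do it short of some outrageously complicated NLP solution, and that I'll just have to ignore any apostrophe that doesn't occur mid-word, which would be unfortunate.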

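EDIT:

Since writing this I've realized it is impossible. Any regex-based parser would have to parse:

'ello, my mates' dog

in two different ways, and could only choose between them by understanding the rest of the sentence. I think I'm going with the inelegant solution of ignoring the least likely case and hoping it is rare enough to cause only occasional anomalies.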

score 0 · Accepted Answer

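Some rules to consider:

A quote starts with an apostrophe that has whitespace or nothing before it.
A quote ends with an apostrophe that has punctuation or whitespace after it.
Some words can look like the end of a quote (e.g. peoples').
A quote-delimiting apostrophe never has letters directly both before and after it.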
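
The rules above can be sketched in Ruby as a per-apostrophe classifier. This is only an illustration: `classify_apostrophes` and the label names are made up for this sketch, and treating end-of-string like trailing whitespace is an added assumption. It labels candidates; it does not resolve the ambiguous cases such as peoples'.

```ruby
# Hypothetical helper: returns [index, label] for every apostrophe in text.
def classify_apostrophes(text)
  labels = []
  text.each_char.with_index do |ch, i|
    next unless ch == "'"

    before = i.zero? ? nil : text[i - 1]
    after  = text[i + 1] # nil at end of string
    letter_before = !before.nil? && before.match?(/[[:alpha:]]/)
    letter_after  = !after.nil? && after.match?(/[[:alpha:]]/)

    label =
      if letter_before && letter_after
        :inside_word            # rule 4: never a quote delimiter
      elsif before.nil? || before.match?(/\s/)
        :possible_opening_quote # rule 1 (but also 'ello, 'tis)
      elsif after.nil? || after.match?(/[\s[:punct:]]/)
        :possible_closing_quote # rule 2 (but also peoples', rule 3)
      else
        :unknown
      end
    labels << [i, label]
  end
  labels
end

p classify_apostrophes("I'm the cars' owner")
p classify_apostrophes("He said 'hello there'")
```

Note that the possessive in cars' gets the same label as a real closing quote, which is exactly the ambiguity the question describes.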

score 0 · Accepted Answer

I'd use a very simple two-pass process.

In pass 1 of 2, start with this regular expression to break the text into alternating segments of word and non-word characters:

/(\w+)|(\W+)/gi

Store the matches in a list like the following (I'm using AS3-style pseudocode, since I don't use Ruby):

class MatchedWord
{
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;
    function MatchedWord( text:String, charIndex:int, isWord:Boolean )
    {
        this.text = text; this.charIndex = charIndex; this.isWord = isWord;
    }
}
var match:Object;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;
//lastIndex is where exec starts looking for matches; it is updated to the end of the last match each time exec is called
words_regex.lastIndex = 0;
while ((match = words_regex.exec( original_text )) != null)
{
    //match[0] is the entire match and match[1] is the first parenthetical group
    //(if it's null, then it's not a word and match[2] would be non-null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) );
}
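
Since the question is about Ruby, here is a rough Ruby counterpart of pass 1. It is a sketch rather than a translation of the answer's code: the `MatchedWord` struct fields mirror the AS3 class, and `split_words` is a name invented for this example.

```ruby
# Ruby stand-in for the AS3 MatchedWord class (field names adapted to Ruby style).
MatchedWord = Struct.new(:text, :char_index, :is_word, :is_contraction)

# Hypothetical helper: splits text into alternating word / non-word segments.
def split_words(original_text)
  matched_words = []
  # String#scan already finds every match, so no "g" flag is needed.
  original_text.scan(/(\w+)|(\W+)/) do
    m = Regexp.last_match
    # m[1] is set only when the word group matched.
    matched_words << MatchedWord.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  matched_words
end

split_words("I'd go").each { |w| p w.to_a }
```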

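In pass 2 of 2, iterate over the list of matches to find contractions. For each (trimmed, non-word) match, check whether it ENDS with an apostrophe. If it does, check whether the next adjacent (word) match is one of the 8 common contraction endings. Of all the two-part contractions I could think of, there are only 8 common endings: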

d
l
ll
m
re
s
t
ve

Once you've identified such a pair of matches, (non-word)="'" and (word)="d", include the preceding adjacent (word) match and concatenate the three matches to get the contraction.
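
A minimal Ruby sketch of this basic merge, limited to the 8 endings. `ENDINGS` and `merge_simple_contractions` are illustrative names, and the sketch assumes the non-word segment is exactly a single apostrophe (stricter than "ends with an apostrophe"), so that a preceding word can be attached safely.

```ruby
ENDINGS = %w[d l ll m re s t ve].freeze

# Hypothetical helper: merges word + "'" + ending into a single token.
def merge_simple_contractions(text)
  tokens = text.scan(/\w+|\W+/)
  merged = []
  i = 0
  while i < tokens.length
    tok = tokens[i]
    nxt = tokens[i + 1]
    # The previous token must end in a word character (a word or an
    # already merged contraction) for the three parts to be joined.
    if tok == "'" && nxt && ENDINGS.include?(nxt) && merged.last&.match?(/\w\z/)
      merged[-1] = merged.last + tok + nxt
      i += 2
    else
      merged << tok
      i += 1
    end
  end
  merged
end

p merge_simple_contractions("I'd say they've left")
p merge_simple_contractions("He said 'hello'")
```

The quoted 'hello' is left alone because its opening apostrophe arrives in the same non-word segment as the preceding space, and its closing apostrophe is not followed by one of the endings.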

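One change you'll need to make to the process described so far is to expand the list of contraction endings to include contractions that start with an apostrophe, such as "'twas" and "'tis". For those, simply don't concatenate the preceding adjacent (word) match, and look a little more closely at the apostrophe match to see whether it includes other non-word characters before the apostrophe (that's why it matters that it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, merge it with the next match; if it only ENDS with an apostrophe, strip the apostrophe off and merge it with the next match. Likewise, conditions that include the preceding match should first check that the (trimmed, non-word) match EQUALS an apostrophe, not just that it ends with one.

Another change you may need is to expand the list of 8 endings to include endings that are whole words, as in "g'day" and "g'night". Again, this is a simple change involving a conditional check of the preceding (word) match: if it is "g", include it.

That process should capture the great majority of contractions, and it is flexible enough to accommodate any new ones you can think of.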

The data structure would look like this:

Condition(Ending, PreCondition)

where PreCondition is one of

"*", "!", or "<exact string>"

The final list of conditions would look like this:

new Condition("d","*") //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
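
For Ruby readers, the same table could be expressed as a `Struct` plus a lookup. This is a sketch under assumptions: `CONDITIONS` and `find_condition` are names chosen for illustration, and the lookup only decides whether a condition applies given the ending and the preceding word (or `nil` when there is none).

```ruby
Condition = Struct.new(:ending, :pre_condition)

CONDITIONS = [
  %w[d *], %w[l *], %w[ll *], %w[m *],
  %w[re *], %w[s *], %w[t *], %w[ve *],
  %w[twas !], %w[tis !],
  %w[day g], %w[night g]
].map { |ending, pre| Condition.new(ending, pre) }.freeze

# Hypothetical helper: returns the applicable Condition, or nil.
def find_condition(ending, previous_word)
  CONDITIONS.find do |c|
    next false unless c.ending == ending

    case c.pre_condition
    when "*" then !previous_word.nil?           # any preceding word is included
    when "!" then true                          # preceding word is ignored
    else previous_word == c.pre_condition       # preceding word must match exactly
    end
  end
end

p find_condition("day", "g")
p find_condition("day", "to")
```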

Just processing those conditions as I've explained should cover all of these 86 contractions (and more):

'tis 'twas aren't can't could've couldn't didn't don't hadn't haven't he'd he'll he's how'd how's [...]

By the way, don't forget about slang contractions that don't use an apostrophe, such as "gotta" > "got to" and "gonna" > "going to".
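
Those apostrophe-free forms need a plain lookup rather than the apostrophe logic. A small Ruby sketch, where `SLANG` and `expand_slang` are illustrative names and the table holds only the two examples mentioned here:

```ruby
SLANG = { "gotta" => "got to", "gonna" => "going to" }.freeze

# Hypothetical helper: expands whole-word, lowercase slang forms only.
def expand_slang(text)
  text.gsub(/\b(?:#{SLANG.keys.join('|')})\b/) { |word| SLANG[word] }
end

puts expand_slang("I gotta go, we're gonna win")
```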

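Here is the final AS3 code. Overall, the code to parse the text into word and non-word groups and to identify and merge contractions is under 50 lines. Simple. You could also add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below whenever a contraction is identified.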

//Automatically merge known contractions
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"],
    ["ll","*"],
    ["m","*"],
    ["re","*"],
    ["s","*"],
    ["t","*"],
    ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];
for (var i:int = 0; i < matched_words.length - 1; i++) //not a typo, intentionally stopping at next to last index to avoid a condition check in the loop
{
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
    {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
        for each (var condition:Array in conditions)
        {
            if (StringUtils.trim( m_next.text ) == condition[0])
            {
                var pre_condition:String = condition[1];
                switch (pre_condition)
                {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (apostrophe_text == "'")
                        {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        }
                        else
                        {   //strip apostrophe off end and merge with next item, nothing needs deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
                        if (m_prev != null && m_prev.text == pre_condition)
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                        }
                        break;
                }
            }
        }
    }
}
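
A compact Ruby adaptation of the whole two-pass process might look like the sketch below. It is not a line-by-line port: `Token`, `RULES` and `merge_contractions` are invented names, the apostrophe must sit directly before the ending (no trimming), and the index is deliberately not advanced after a merge so that stacked forms such as "he'd've" are also joined.

```ruby
Token = Struct.new(:text, :is_word, :is_contraction)

# ending => precondition ("*" include prior word, "!" ignore it, else exact prior word)
RULES = {
  "d" => "*", "l" => "*", "ll" => "*", "m" => "*",
  "re" => "*", "s" => "*", "t" => "*", "ve" => "*",
  "twas" => "!", "tis" => "!",
  "day" => "g", "night" => "g"
}.freeze

def merge_contractions(text)
  # Pass 1: alternating word / non-word segments.
  tokens = text.scan(/\w+|\W+/).map { |t| Token.new(t, t.match?(/\A\w/), false) }
  # Pass 2: merge known contractions in place.
  i = 0
  while i < tokens.length - 1
    cur = tokens[i]
    nxt = tokens[i + 1]
    pre = !cur.is_word && cur.text.end_with?("'") ? RULES[nxt.text] : nil
    prev = i.positive? ? tokens[i - 1] : nil
    if pre == "!"
      # Leading-apostrophe contraction: move the apostrophe onto the next word.
      cur.text = cur.text[0...-1]
      nxt.text = "'" + nxt.text
      nxt.is_contraction = true
      tokens.delete_at(i) if cur.text.empty?
      i += 1
    elsif pre && prev && cur.text == "'" && (pre == "*" || prev.text == pre)
      # Separator EQUALS an apostrophe: fold it and the ending into the prior word.
      prev.text += cur.text + nxt.text
      prev.is_contraction = true
      tokens.slice!(i, 2) # tokens[i] is now the segment after the contraction
    else
      i += 1
    end
  end
  tokens
end

p merge_contractions("I'd say 'tis g'day").map(&:text)
```

As with the AS3 version, a quotation that happens to look like one of the listed endings would still be merged; the rule table only covers the common cases.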

score 0 · Accepted Answer

Hmm, I don't think this is going to be easy. Here's a regex that only copes with contractions like "I'm" and "I've":

>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"

With a bit more fiddling you could probably exclude other common contractions as well, but even as it stands it may be better than nothing.
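
The transcript above does not show how s1 and s2 were defined. The runnable sketch below assumes they are the two example sentences from the question; `QUOTE_RE` is just a name given to the same pattern.

```ruby
# Same pattern as in the transcript: an apostrophe not preceded by "I",
# then non-apostrophe characters, then a closing apostrophe.
QUOTE_RE = /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/

s1 = "I'm the cars' owner"   # assumed value: contraction and possessive only
s2 = "He said 'hello there'" # assumed value: contains a real quotation

p QUOTE_RE.match(s1) # no quotation found
m = QUOTE_RE.match(s2)
p m[1]               # the quoted span, including both apostrophes
```

The lookbehind is what skips the apostrophe in "I'm"; a contraction on any other word (for example "He's") would still be taken as an opening quote if another apostrophe follows it.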
