iphone - Regex pattern and/or NSRegularExpression a bit too slow searching over a very large file. Can it be optimized?

Question

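In an iOS framework, I am searching this 3.2 MB file for pronunciations: https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/model/lm/en_US/cmu07a.dic

I am using NSRegularExpression to search for an arbitrary set of words given as an NSArray. The search runs over the contents of the large file, held as an NSString. I need to match any word that is bracketed by a newline and a tab character, and then grab the whole line. For example, if the NSArray contains the word "monday", I want to match this line in the dictionary file: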

monday  M AH N D IY

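This line starts after a newline; the string "monday" is followed by a tab character, and then the pronunciation follows. The regex has to match the entire line, because the whole line is the final output. I also need to find alternate pronunciations of the words, which are listed like this: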

monday(2)   M AH N D EY

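Alternate pronunciations always start at (2) and can go as high as (5). So I also search for repetitions of the word followed by parentheses containing a single digit, again bracketed by a newline and a tab character.

I have an NSRegularExpression method that works 100%, as follows: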

NSArray *array = [NSArray arrayWithObjects:@"friday",@"monday",@"saturday",@"sunday", @"thursday",@"tuesday",@"wednesday",nil]; // This array could contain any arbitrary words but they will always be in alphabetical order by the time they get here.

// Use this string to build up the pattern.
NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^("]; 

int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // After the first iteration we need an OR operator first.
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
     }
    [mutablePatternString appendString:[NSString stringWithFormat:@"(%@(\\(.\\)|))",word]];
}

[mutablePatternString appendString:@")\\t.*$"];

// This results in this regex pattern:

// ^((change(\(.\)|))|(friday(\(.\)|))|(monday(\(.\)|))|(saturday(\(.\)|))|(sunday(\(.\)|))|(thursday(\(.\)|))|(tuesday(\(.\)|))|(wednesday(\(.\)|)))\t.*$

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                     options:NSRegularExpressionAnchorsMatchLines
                                                                                       error:nil];
int rangeLocation = 0;
int rangeLength = [string length];
NSMutableArray * matches = [NSMutableArray array];
[regularExpression enumerateMatchesInString:string
                                     options:0
                                       range:NSMakeRange(rangeLocation, rangeLength)
                                  usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                      [matches addObject:[string substringWithRange:result.range]];
                                  }];

[mutablePatternString release];

// matches array is returned to the caller.

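My problem is that, given the large text file, this is not really fast enough on the iPhone. Eight words take 1.3 seconds on an iPhone 4, which is too long for the application. These are the known factors:

• The 3.2 MB text file lists the words to match in alphabetical order.

• The array of arbitrary words to look up is always in alphabetical order by the time it reaches this method.

• Alternate pronunciations start at (2) in the parentheses after the word, not at (1).

• If there is no (2), there will be no (3), (4) or higher.

• Having even one alternate pronunciation is rare, occurring on average once in 8 words. Further alternate pronunciations are rarer still.

Can this method be optimized, either by improving the regex or some aspect of the Objective-C? I am assuming that NSRegularExpression is already optimized well enough that it is not worth trying a different Objective-C library or doing this in C, but if I am wrong about that, please let me know. Otherwise, I would be very grateful for any suggestions on improving the performance. I hope to generalize this to any pronunciation file, so I am trying to stay away from solutions such as calculating the alphabetical ranges ahead of time to run more constrained searches.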

**** EDIT ****

Here are the iPhone 4 timings for all of the search-related answers given by August 16th, 2012:

dasblinkenlight's create-NSDictionary approach at https://stackoverflow.com/a/11958852/119717 : 5.259676 seconds

Ωmega's fastest regex at https://stackoverflow.com/a/11957535/119717 : 0.609593 seconds

dasblinkenlight's multiple-NSRegularExpression approach at https://stackoverflow.com/a/11969602/119717 : 1.255130 seconds

My first hybrid approach at https://stackoverflow.com/a/11970549/119717 : 0.372215 seconds

My second hybrid approach at https://stackoverflow.com/a/11970549/119717 : 0.337549 seconds

So far, the second version of my own answer is the fastest. I cannot mark any single answer as the best, because every search-related answer informed the approach I took in my version. They were all very helpful, and my answer is simply built on the others. I learned a lot, and my method ended up taking a quarter of the original time, so this was enormously helpful.

score 4 · Accepted Answer

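Since you are putting the entire file into memory anyway, you might as well represent it as a structure that is easy to search:

• Create a mutable NSDictionary, words, with NSString keys and NSMutableArray values
• Read the file into memory
• Go through the string representing the file line by line
• For each line, separate out the word part by searching for a '(' or a '\t' character
• Get the substring for the word (from index zero to the index of the '(' or '\t', minus one); this is your key
• Check whether words contains your key; if it does not, add a new NSMutableArray for it
• Add line to the NSMutableArray that you found or created for that key
• Once you are finished, throw away the original string representing the file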

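With this structure in hand, you should be able to run your searches in a time that no regex engine can match, because you have replaced a full-text scan, which is linear, with a hash lookup, which is constant-time.

** EDIT: ** I checked the relative speed of this solution against the regex; it is about 60 times faster on the simulator. This is not at all surprising, because the odds are stacked heavily against the regex-based solution.

Reading the file: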

NSBundle *bdl = [NSBundle bundleWithIdentifier:@"com.poof-poof.TestAnim"];
NSString *path = [NSString stringWithFormat:@"%@/words_pron.dic", [bdl bundlePath]];
data = [NSString stringWithContentsOfFile:path encoding:NSUTF8StringEncoding error:nil];
NSMutableDictionary *tmp = [NSMutableDictionary dictionary];
NSUInteger pos = 0;
NSMutableCharacterSet *terminator = [NSMutableCharacterSet characterSetWithCharactersInString:@"\t("];
while (pos < data.length) { // '<' rather than '!=': pos overshoots the end by one when the file has no trailing newline.
    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
        rangeOfCharacterFromSet:[NSCharacterSet newlineCharacterSet]
        options:NSLiteralSearch
        range:remaining
    ];
    if (next.location != NSNotFound) {
        next.length = next.location - pos;
        next.location = pos;
    } else {
        next = remaining;
    }
    pos += (next.length+1);
    NSString *line = [data substringWithRange:next];
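    NSRange keyRange = [line rangeOfCharacterFromSet:terminator];
    if (keyRange.location == NSNotFound) {
        continue; // Skip blank or malformed lines that contain neither a tab nor a '('.
    }
    keyRange.length = keyRange.location; // The key is everything before the first tab or '('.
    keyRange.location = 0;
    NSString *key = [line substringWithRange:keyRange];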
    NSMutableArray *array = [tmp objectForKey:key];
    if (!array) {
        array = [NSMutableArray array];
        [tmp setObject:array forKey:key];
    }
    [array addObject:line];
}
dict = tmp; // dict is your NSMutableDictionary ivar

Searching:

NSArray *keys = [NSArray arrayWithObjects:@"sunday", @"monday", @"tuesday", @"wednesday", @"thursday", @"friday", @"saturday", nil];
NSMutableArray *all = [NSMutableArray array];
NSLog(@"Starting...");
for (NSString *key in keys) {
    for (NSString *s in [dict objectForKey:key]) {
        [all addObject:s];
    }
}
NSLog(@"Done! %lu", (unsigned long)all.count); // Cast because NSUInteger differs in size between 32-bit and 64-bit builds.

score 4 · Accepted Answer

Try this:

^(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

And also this (using a positive lookahead with the list of possible first letters):

^(?=[cmtwfs])(?:change|monday|tuesday|wednesday|thursday|friday|saturday|sunday)(?:\([2-5]\))?\t.*$

And finally, a version with some optimization (words that share a first letter are grouped under a common prefix):

^(?=[cmtwfs])(?:change|monday|t(?:uesday|hursday)|wednesday|friday|s(?:aturday|unday))(?:\([2-5]\))?\t.*$
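For readers who want to see exactly what these patterns accept on a single line, here is a minimal sketch of the same test written by hand. It is plain C (which also compiles inside an Objective-C file), not part of the answer above, and the function name line_matches_word is hypothetical. It assumes the line pointer refers to the start of a line in a NUL-terminated buffer.

```c
#include <string.h>

/* Hand-written equivalent of  ^word(?:\([2-5]\))?\t  for one word.
   Returns 1 when `line` starts with `word`, optionally followed by
   "(2)" through "(5)", and then a tab character; returns 0 otherwise. */
int line_matches_word(const char *line, const char *word) {
    size_t word_len = strlen(word);
    if (strncmp(line, word, word_len) != 0) {
        return 0; /* the line does not start with the word */
    }
    const char *rest = line + word_len;
    /* Optional alternate-pronunciation marker. The && chain stops at the
       first mismatch, so it never reads past the string terminator. */
    if (rest[0] == '(' && rest[1] >= '2' && rest[1] <= '5' && rest[2] == ')') {
        rest += 3;
    }
    return rest[0] == '\t';
}
```

The tab check at the end is what keeps "monday" from matching a line for "mondays", which is the same job the \t does in the patterns above.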

score 2 · Accepted Answer

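Here is my hybrid approach, combining dasblinkenlight's and Ωmega's answers, which I thought I should add as an answer at this point. It uses dasblinkenlight's method of searching forward through the string and, when there is a hit, runs the full regex over a small range only. That way it exploits the fact that the dictionary and the words to look up are both in alphabetical order, and it also benefits from the optimized regex. I wish I had two best-answer checkmarks to give out! This gives the correct results and takes about half the time of the pure regex approach on the Simulator (I still have to test on the device to see how the timings compare on the iPhone 4, which is the reference device).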

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableString *mutablePatternString = [[NSMutableString alloc]initWithString:@"^(?:"];
int firstRound = 0;
for(NSString *word in array) {
    if(firstRound == 0) { // this is the first round

        firstRound++;
    } else { // this is all later rounds
        [mutablePatternString appendString:[NSString stringWithFormat:@"|"]];
    }
    [mutablePatternString appendString:[NSString stringWithFormat:@"%@",word]];
}

[mutablePatternString appendString:@")(?:\\([2-5]\\))?\t.*$"];

// This creates a string that reads "^(?:change|friday|model|monday|quidnunc|saturday|sunday|thursday|tuesday|wednesday)(?:\([2-5]\))?\t.*$"

// We don't want to instantiate the NSRegularExpression in the loop so let's use a pattern that matches everything we're interested in.

NSRegularExpression * regularExpression = [NSRegularExpression regularExpressionWithPattern:mutablePatternString
                                                                                    options:NSRegularExpressionAnchorsMatchLines
                                                                                      error:nil];
NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {

        // If we find the first pronunciation, run the whole regex on a range of {position, 500} only.

        int rangeLocation = next.location;
        int searchPadding = 500;
        int rangeLength = searchPadding;

        if(data.length - next.location < searchPadding) { // Only use 500 if there is 500 more length in the data.
            rangeLength = data.length - next.location;
        } 

        [regularExpression enumerateMatchesInString:data 
                                            options:0
                                              range:NSMakeRange(rangeLocation, rangeLength)
                                         usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                                             [matches addObject:[data substringWithRange:result.range]];
                                         }]; // Grab all the hits at once.

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutablePatternString release];
[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];

// return matches to caller

EDIT: Here is another version that uses no regex at all and shaves a little more time off the method:

NSMutableArray *mutableArrayOfWordsToMatch = [[NSMutableArray alloc] initWithArray:array];
NSMutableArray *mutableArrayOfUnfoundWords = [[NSMutableArray alloc] init]; // I also need to know the unfound words.

NSUInteger pos = 0;

NSMutableArray * matches = [NSMutableArray array];

while (pos != data.length) {

    if([mutableArrayOfWordsToMatch count] <= 0) { // If we're at the top of the loop without any more words, stop.
        break;
    }  

    NSRange remaining = NSMakeRange(pos, data.length-pos);
    NSRange next = [data
                    rangeOfString:[NSString stringWithFormat:@"\n%@\t",[mutableArrayOfWordsToMatch objectAtIndex:0]]
                    options:NSLiteralSearch
                    range:remaining
                    ]; // Just search for the first pronunciation.
    if (next.location != NSNotFound) {
        NSRange lineRange = [data lineRangeForRange:NSMakeRange(next.location+1, next.length)];
        [matches addObject:[data substringWithRange:NSMakeRange(lineRange.location, lineRange.length-1)]]; // Grab the whole line of the hit.
        int rangeLocation = next.location;
        int rangeLength = 750;

        if(data.length - next.location < rangeLength) { // Only use the searchPadding if there is that much room left in the string.
            rangeLength = data.length - next.location;
        } 
        rangeLength = rangeLength/5;
        int newlocation = rangeLocation;

        for(int i = 2;i < 6; i++) { // We really only need to do this from 2-5.
            NSRange morematches = [data
                            rangeOfString:[NSString stringWithFormat:@"\n%@(%d",[mutableArrayOfWordsToMatch objectAtIndex:0],i]
                            options:NSLiteralSearch
                            range:NSMakeRange(newlocation, rangeLength)
                            ];
            if(morematches.location != NSNotFound) {
                NSRange moreMatchesLineRange = [data lineRangeForRange:NSMakeRange(morematches.location+1, morematches.length)]; // Plus one because I don't actually want the line break at the beginning.
                 [matches addObject:[data substringWithRange:NSMakeRange(moreMatchesLineRange.location, moreMatchesLineRange.length-1)]]; // Minus one because I don't actually want the line break at the end.
                newlocation = morematches.location;

            } else {
                break;   
            }
        }

        next.length = next.location - pos;
        next.location = pos;
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove the word.
        pos += (next.length+1);
    } else { // No hits.
        [mutableArrayOfUnfoundWords addObject:[mutableArrayOfWordsToMatch objectAtIndex:0]]; // Add to unfound words.
        [mutableArrayOfWordsToMatch removeObjectAtIndex:0]; // Remove from the word list.
    }
}    

[mutableArrayOfUnfoundWords release];
[mutableArrayOfWordsToMatch release];
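Since the question asks whether dropping down to C is worthwhile, the forward-scanning idea can be sketched in plain C as follows. This is an illustration under assumptions, not a measured replacement: the function name find_sorted_words and the 128-byte needle buffer are hypothetical, the data is assumed to be one NUL-terminated buffer, and, like the code above, it only finds a word whose line is preceded by a newline. It locates the first pronunciation only; alternates would be checked on the following lines.

```c
#include <stdio.h>
#include <string.h>

/* For each word (sorted alphabetically, like the dictionary), search
   forward for "\nword\t" starting from the previous hit. out[i] receives
   a pointer to the start of the matching line, or NULL if the word is
   missing. Returns the number of words found. */
size_t find_sorted_words(const char *data, const char *const *words,
                         size_t word_count, const char **out) {
    const char *pos = data;
    size_t found = 0;
    char needle[128];

    for (size_t i = 0; i < word_count; i++) {
        out[i] = NULL;
        int n = snprintf(needle, sizeof needle, "\n%s\t", words[i]);
        if (n < 0 || (size_t)n >= sizeof needle) {
            continue; /* word too long for the buffer: treat as not found */
        }
        const char *hit = strstr(pos, needle);
        if (hit != NULL) {
            out[i] = hit + 1; /* skip the leading newline */
            pos = hit + 1;    /* later words can only appear further on */
            found++;
        }
    }
    return found;
}
```

As in the Objective-C versions, a word that is not in the file costs a scan from the current position to the end of the buffer, so a list with many missing words benefits less from the sorted order.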

score 1 · Accepted Answer

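Looking at the dictionary file you gave, I would say a reasonable strategy is to read in the data and put it into some kind of persistent data store.

Read through the file and create an object for each unique word, holding n pronunciation strings (where n is the number of unique pronunciations). The dictionary is already in alphabetical order, so if you parse it in the order you read it, you end up with an alphabetical list.

Then you can run a binary search on the data. Even with a huge number of objects, a binary search will find what you are looking for very quickly (assuming alphabetical order).

You could probably even keep the whole thing in memory if you need lightning-fast performance.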
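A minimal sketch of that binary search, in plain C, might look like the following. It is an illustration, not code from this answer: the names key_length, compare_word_to_line and find_first_line are hypothetical, and it assumes you have already built an array of pointers to the start of each line, with the lines sorted by their word part in byte order. Whether a given dictionary file is sorted that way, especially around apostrophes and other punctuation, would need checking.

```c
#include <stddef.h>
#include <string.h>

/* Length of the word part of a line: everything before the first tab,
   '(' or newline. */
static size_t key_length(const char *line) {
    return strcspn(line, "\t(\n");
}

/* Three-way comparison of a search word with the word part of a line. */
static int compare_word_to_line(const char *word, const char *line) {
    size_t word_len = strlen(word);
    size_t line_len = key_length(line);
    size_t shared = word_len < line_len ? word_len : line_len;
    int order = memcmp(word, line, shared);
    if (order != 0) return order;
    if (word_len == line_len) return 0;
    return word_len < line_len ? -1 : 1;
}

/* Lower-bound binary search: returns the index of the first line whose
   word part equals `word`, or -1 if there is none. */
long find_first_line(const char *const *lines, size_t count, const char *word) {
    size_t low = 0, high = count;
    while (low < high) {
        size_t mid = low + (high - low) / 2;
        if (compare_word_to_line(word, lines[mid]) > 0) {
            low = mid + 1; /* the word sorts after this line */
        } else {
            high = mid;    /* this line or an earlier one may match */
        }
    }
    if (low < count && compare_word_to_line(word, lines[low]) == 0) {
        return (long)low;
    }
    return -1;
}
```

Because the search returns the first matching line, alternate pronunciations such as monday(2) can be collected by walking forward from the returned index while compare_word_to_line still returns 0.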
