regex - Detect and remove signatures from forum text messages using R

Question

I have a collection of text messages that were scraped from a forum into a data frame. Here is a reproducible example:

example.df <- data.frame(author=c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy"),
                         message=c("Hello World! Mikey Mouse", 
                                   "Quack Quack! Donald Duck", 
                                   "I was born in 1928. Mikey Mouse", 
                                   "Quack Quack! Daisy Duck", 
                                   "The quick fox jump over Minnie Mouse", 
                                   "Quack Quack! Daisy Duck"))

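My idea is this: for every author who wrote more than one message, find the longest common suffix shared by all of that author's messages. For all other cases, find a regex approach that degrades gracefully.

I found the Bioconductor package RLibstree, which looks promising thanks to its function getLongestCommonSubstring, but I don't know how to apply the function to all messages grouped by author.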

score 0 · Accepted Answer

Here is an implementation that uses no additional libraries.

example.df <- data.frame(author=c("Mikey", "Donald", "Mikey",
                                  "Daisy", "Minnie", "Daisy"),
                         message=c("Hello World! Mikey Mouse", 
                                   "Quack Quack! Donald Duck", 
                                   "I was born in 1928. Mikey Mouse", 
                                   "Quack Quack! Daisy Duck", 
                                   "The quick fox jump over Minnie Mouse", 
                                   "Quack Quack! Daisy Duck"))

signlen = function(am)  # determine signature length of an author's messages
{
    if (length(am) <= 1) return(0)  # return if not more than 1 message

    # turn the messages into reversed vectors of single characters
    # in order to conveniently access the suffixes from index 1 on
    am = lapply(strsplit(as.character(am), ''), rev)
    # find the longest common suffix in the messages
    longest_common = .Machine$integer.max
    for (m in 2:length(am))
    {
        i = 1
        max_length = min(length(am[[m]]), length(am[[m-1]]), longest_common)
        while (i <= max_length && am[[m]][i] == am[[m-1]][i]) i = i+1
        longest_common = i-1
        if (longest_common == 0) return(0)  # shortcut: need not look further
    }
    return(longest_common)
}

# determine signature length of every author's messages
signature_length = tapply(example.df$message, example.df$author, signlen)
#> signature_length
# Daisy Donald  Mikey Minnie 
#    23      0     12      0 

# determine resulting length "to" of messages with signatures removed
to = nchar(as.character(example.df$message))-signature_length[example.df$author]
#> to
# Mikey Donald  Mikey  Daisy Minnie  Daisy 
#    12     24     19      0     36      0 

# remove the signatures by replacing messages with resulting substring
example.df$message = substr(example.df$message, 1, to)
#> example.df
#  author                              message
#1  Mikey                         Hello World!
#2 Donald             Quack Quack! Donald Duck
#3  Mikey                  I was born in 1928.
#4  Daisy                                     
#5 Minnie The quick fox jump over Minnie Mouse
#6  Daisy
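
The approach above leaves authors with a single message (Donald and Minnie here) untouched, since there is nothing to compare against. As a rough fallback for that case, one could cut each message at the first literal occurrence of the author's name. This is only a sketch under a strong assumption that is not part of the answer above: that a signature begins with the author's forum name. The helper name strip_from_author_name is hypothetical, and it uses fixed-string matching rather than a regular expression, so names containing regex metacharacters need no escaping.

```r
# Hypothetical fallback (assumption: a signature starts with the author's name).
strip_from_author_name <- function(message, author) {
  message <- as.character(message)
  author  <- as.character(author)
  # position of the first literal occurrence of the name in each message, -1 if absent
  pos <- mapply(function(a, m) as.integer(regexpr(a, m, fixed = TRUE)[1]),
                author, message, USE.NAMES = FALSE)
  # cut only when the name appears after the first character,
  # so a message is never reduced to an empty string by this step
  cut <- ifelse(pos > 1, substr(message, 1, pos - 1), message)
  trimws(cut, which = "right")  # drop the whitespace left before the signature
}

messages <- c("Hello World! Mikey Mouse",
              "Quack Quack! Donald Duck",
              "The quick fox jump over Minnie Mouse")
authors  <- c("Mikey", "Donald", "Minnie")
cleaned  <- strip_from_author_name(messages, authors)
```

A message that merely mentions the author's name mid-sentence would be truncated as well, so this is probably best applied only where the suffix comparison found nothing.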

score 0 · Accepted Answer

I don't know how to apply the function to all messages grouped by author.

Perhaps tapply is what you are looking for.

> tapply(as.character(example.df$message), example.df$author, function(x) x)
$Daisy
[1] "Quack Quack! Daisy Duck" "Quack Quack! Daisy Duck"

$Donald
[1] "Quack Quack! Donald Duck"

$Mikey
[1] "Hello World! Mikey Mouse"        "I was born in 1928. Mikey Mouse"

$Minnie
[1] "The quick fox jump over Minnie Mouse"

Of course, you can use your own function in place of function(x) x.
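
For instance, a function that returns the longest common suffix of one author's messages could be plugged in directly. The following is a minimal sketch: common_suffix is a hypothetical helper name, not part of base R or any package, and the vectors simply repeat the data from the question.

```r
# Hypothetical helper: longest common suffix of a character vector.
common_suffix <- function(x) {
  x <- as.character(x)
  if (length(x) < 2) return("")          # nothing to compare against
  # split each message into characters and reverse them,
  # so that suffixes start at index 1
  rev_chars <- lapply(strsplit(x, ""), rev)
  n <- min(lengths(rev_chars))           # suffix cannot exceed the shortest message
  if (n == 0) return("")
  # one column per message, one row per character position counted from the end
  m <- do.call(cbind, lapply(rev_chars, function(ch) ch[seq_len(n)]))
  same <- apply(m, 1, function(r) all(r == r[1]))
  # number of leading rows on which all messages agree
  k <- if (all(same)) n else which(!same)[1] - 1
  substring(x[1], nchar(x[1]) - k + 1)
}

authors  <- c("Mikey", "Donald", "Mikey", "Daisy", "Minnie", "Daisy")
messages <- c("Hello World! Mikey Mouse",
              "Quack Quack! Donald Duck",
              "I was born in 1928. Mikey Mouse",
              "Quack Quack! Daisy Duck",
              "The quick fox jump over Minnie Mouse",
              "Quack Quack! Daisy Duck")

# one suffix per author; authors with a single message get ""
suffixes <- tapply(messages, authors, common_suffix)
```

The per-author result can then be looked up by name, e.g. suffixes[authors], to align it with the rows of the data frame.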
