r - Restructuring data using R

Question

I have a dataset (dat) that looks like this:

 Person       IPaddress
36598035    222.999.22.99
36598035    222.999.22.99
36598035    222.999.22.99
36598035    222.999.22.99
36598035    222.999.22.99
36598035    444.666.44.66
37811171    111.88.111.88
37811171    111.88.111.88
37811171    111.88.111.88
37811171    111.88.111.88
37811171    111.88.111.88

It reflects instances of individuals logging in to a website over a particular period of time. I need the data to look like this:

Person        IPaddress      Number of Logins
36598035    222.999.22.99           6
37811171    111.88.111.88           5

So instead of multiple entries for the same person, there is only one row per person, with a count of the number of times they logged in.

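You will also notice in my example that person 36598035 logged in from more than one IP address. When this happens, I want the IP address in the final dataset to be the modal IP address, that is, the one the person logged in from most often.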

score 5 · Accepted Answer

Here is one approach.

library(dplyr)

mydf %>%
    group_by(Person, IPaddress) %>% # For each combination of person and IP address
    summarize(freq = n()) %>% # Count the log-ins for that combination
    arrange(Person, desc(freq)) %>% # Put each person's most frequent IP address first
    group_by(Person) %>% # For each person
    mutate(total = sum(freq)) %>% # Get the total number of log-ins
    select(-freq) %>% # Drop the per-IP count
    do(head(.,1)) # Take the first row for each person

#    Person     IPaddress total
#1 36598035 222.999.22.99     6
#2 37811171 111.88.111.88     5

Update

dplyr 0.3 is out, so you can also do the following, which is one line shorter thanks to count. I also used slice, as @aosmith recommended.

mydf %>%
    count(Person, IPaddress) %>%
    arrange(Person, desc(n)) %>%
    group_by(Person) %>%
    mutate(total = sum(n)) %>%
    select(-n) %>%
    slice(1)
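
To try either pipeline, the question's sample data first has to exist as a data frame. What follows is a minimal sketch, assuming the name mydf used in this answer (the question calls it dat) and assuming Person is numeric and IPaddress is character. The table() call at the end is only a sanity check of the expected totals, not part of the approach above.

```r
# Rebuild the question's sample data under the name used in this answer.
# Column types are an assumption; the question does not state them.
mydf <- data.frame(
  Person    = c(rep(36598035, 6), rep(37811171, 5)),
  IPaddress = c(rep("222.999.22.99", 5), "444.666.44.66",
                rep("111.88.111.88", 5)),
  stringsAsFactors = FALSE
)

# Sanity check: log-ins per person, which should agree with the `total` column.
table(mydf$Person)
```

With mydf defined this way, both pipelines above should be runnable as written once dplyr is loaded.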

score 4 · Accepted Answer

You can use data.table for a concise solution.

library(data.table)
setDT(dat)
dat[, list(IPaddress=names(which.max(table(IPaddress))),
           Logins=.N), 
    by=Person]
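
The mode in the code above comes from the idiom names(which.max(table(x))). As a rough illustration of what it does on its own, here is a base R sketch; mode_of is a hypothetical helper name introduced only for this example and is not part of data.table or of the answer.

```r
# Hypothetical helper (name invented for illustration): most frequent value of a vector.
mode_of <- function(x) {
  counts <- table(x)        # frequency of each distinct value, in sorted order
  names(which.max(counts))  # label of the first maximum, returned as a character string
}

# One person's log-ins from the question: five from one address, one from another.
ips <- c(rep("222.999.22.99", 5), "444.666.44.66")
mode_of(ips)
```

Because which.max returns the first maximum, a tie between two addresses would resolve to whichever comes first in table()'s sorted order, which may or may not be the tie-breaking rule you want.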

score 1 · Accepted Answer

Try:

ddf
     Person     IPaddress
1  36598035 222.999.22.99
2  36598035 222.999.22.99
3  36598035 222.999.22.99
4  36598035 222.999.22.99
5  36598035 222.999.22.99
6  36598035 444.666.44.66
7  37811171 111.88.111.88
8  37811171 111.88.111.88
9  37811171 111.88.111.88
10 37811171 111.88.111.88
11 37811171 111.88.111.88

dd1 = data.table(with(ddf, table(Person, IPaddress)))[rev(order(N))][!duplicated(Person)]
dd1
     Person     IPaddress N
1: 36598035 222.999.22.99 5
2: 37811171 111.88.111.88 5

dd1$all_login_count = data.table(with(ddf, table(Person)))$V1
dd1
     Person     IPaddress N all_login_count
1: 36598035 222.999.22.99 5               6
2: 37811171 111.88.111.88 5               5
