python - Aggregate text keys-values python defaultdict

Question

First of all, I would like to point out that I am a python newbie and I am totally inexperienced at coding, so please be patient. I've already searched for an answer to my problem but with no success. I have a bunch of lines in text with names and teams in this format:

Team (year)|Surname1, Name1

e.g.

Yankees (1993)|Abbot, Jim
Yankees (1994)|Abbot, Jim
Yankees (1993)|Assenmacher, Paul
Yankees (2000)|Buddies, Mike
Yankees (2000)|Canseco, Jose

and so on for several years and several teams. I would like to aggregate names of players according to team (year) combination deleting any duplicated names (it may happen that in the original database there is some redundant information). In the example, my output should be:

Yankees (1993)|Abbot, Jim, Assenmacher, Paul
Yankees (1994)|Abbot, Jim
Yankees (2000)|Buddies, Mike, Canseco, Jose

I've written this code so far:

file_in = open('filein.txt')
file_out = open('fileout.txt', 'w+')

from collections import defaultdict
teams = defaultdict(set)

for line in file_in:
    items = [line.split('|')]
    team = items[0]
    name = items[1]
    teams[team].add(name)

I end up with a big dictionary made up of keys (the team name and year) and sets of values. But I don't know exactly how to go on from there to produce the aggregated output.

I would also like to be able to compare my final sets of values (e.g. how many players do the Yankees teams of 1993 and 1994 have in common?). How can I do this?

Any help is appreciated

score 0 · Accepted Answer

First, this line:

items = [line.split('|')]

should have been:

items = line.split('|')

Otherwise you were creating a list that contains a list.
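To make the difference concrete, here is a small illustration on one sample line from the question. The variable name `wrapped` is only for this example and is not part of the original code.

```python
line = 'Yankees (1993)|Abbot, Jim\n'

# Wrapping the result in [...] yields a list that contains one list
wrapped = [line.split('|')]
# wrapped has a single element, so wrapped[1] would raise IndexError

# Without the extra brackets, each field is directly indexable
items = line.split('|')
team = items[0]   # the 'Team (year)' part
name = items[1]   # the player name, still with its trailing newline
```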

Second, I changed this:

teams[team].add(name)

to this:

teams[team].add(name.strip())

Otherwise the player names will include extra newlines and spaces.
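A short sketch of why this matters for de-duplication, using made-up variants of one name from the question (the list `raw_names` is purely illustrative):

```python
raw_names = ['Abbot, Jim\n', 'Abbot, Jim', ' Abbot, Jim \n']

# Without strip(), the set treats these as three different players
unstripped = set(raw_names)

# strip() removes leading and trailing whitespace, so the duplicates collapse
stripped = {name.strip() for name in raw_names}
```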

The complete modified reading code:

for line in file_in:
    items = line.split('|')
    team = items[0]
    name = items[1]
    teams[team].add(name.strip())

To print the dictionary afterwards:

>>> for team, players in teams.items():
...    print('{}|{}'.format(team, '|'.join(players)))
... 
Yankees (1994)|Abbot, Jim
Yankees (1993)|Assenmacher, Paul|Abbot, Jim
Yankees (2000)|Canseco, Jose|Buddies, Mike
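Sets are unordered, so the order of teams and players above can vary between runs. If you want exactly the layout from the question (players joined with ", " and everything in a stable order), one possible sketch is below. The function names `aggregate` and `format_teams` and the in-memory `sample` list are assumptions made for this example; they are not from the original post.

```python
from collections import defaultdict

def aggregate(lines):
    """Map each 'Team (year)' key to the set of player names seen for it."""
    teams = defaultdict(set)
    for line in lines:
        if '|' not in line:
            continue  # skip blank or malformed lines
        # split only on the first '|' so the name part stays intact
        team, name = line.split('|', 1)
        teams[team.strip()].add(name.strip())
    return teams

def format_teams(teams):
    """Return one 'Team (year)|name1, name2' string per team, sorted."""
    return ['{}|{}'.format(team, ', '.join(sorted(players)))
            for team, players in sorted(teams.items())]

sample = [
    'Yankees (1993)|Abbot, Jim\n',
    'Yankees (1994)|Abbot, Jim\n',
    'Yankees (1993)|Assenmacher, Paul\n',
    'Yankees (1993)|Abbot, Jim\n',  # duplicate, dropped by the set
    'Yankees (2000)|Buddies, Mike\n',
    'Yankees (2000)|Canseco, Jose\n',
]

output_lines = format_teams(aggregate(sample))
for out in output_lines:
    print(out)
```

To work with the files from the question, you could pass the open `filein.txt` file object to `aggregate` instead of `sample` and write `'\n'.join(output_lines)` to `fileout.txt`.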

score 0 · Accepted Answer

This solution is not optimal, but it works the way you want.

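# w is assumed to hold the whole input text as one string
teams = {}
for line in w.splitlines():
    if '|' not in line:
        continue  # skip blank or malformed lines
    items = line.split('|')
    team = items[0]
    # keep "Surname, Name" as one entry; splitting on ',' would break it in two
    names = [items[1].strip()]
    if team in teams:
       teams[team].extend(names)
    else:
       teams[team] = names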

From there you can use:

for team, names in teams.items():
    print(team, len(set(names)))
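The question also asks how to find the players two teams have in common. A minimal sketch using set intersection follows; the hand-built `teams` dictionary stands in for the result of the aggregation step and contains only sample data from the question.

```python
# Example data in the shape produced above: team -> list of player names
teams = {
    'Yankees (1993)': ['Abbot, Jim', 'Assenmacher, Paul', 'Abbot, Jim'],
    'Yankees (1994)': ['Abbot, Jim'],
}

# Convert to sets so duplicates disappear and set operators are available
roster_1993 = set(teams['Yankees (1993)'])
roster_1994 = set(teams['Yankees (1994)'])

common = roster_1993 & roster_1994      # players on both rosters
only_1993 = roster_1993 - roster_1994   # players who appear only in 1993

print(len(common), sorted(common))
```

If the values are already sets, as with `defaultdict(set)`, the `set(...)` conversion is unnecessary and the `&` and `-` operators can be applied directly.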

score 0 · Accepted Answer

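In this case you should get familiar with Map-Reduce. Read up on it a little; it will help. I'm sure there is some code for this around here and I'm trying to find it. In the meantime, this is a good place to start: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/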
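For a rough idea of what the map-reduce pattern looks like on this problem, here is a plain-Python sketch. It is not Hadoop code and is not taken from the linked tutorial; the names `mapper` and `reducer` are just illustrative, and sorting stands in for the shuffle step that a real framework would perform.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: emit one (team, player) pair per valid input line."""
    for line in lines:
        if '|' in line:
            team, name = line.split('|', 1)
            yield team.strip(), name.strip()

def reducer(pairs):
    """Reduce step: collect the unique players for each team."""
    # groupby only groups adjacent items, so the pairs are sorted first
    for team, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield team, sorted({name for _, name in group})

sample = [
    'Yankees (1993)|Abbot, Jim\n',
    'Yankees (1994)|Abbot, Jim\n',
    'Yankees (1993)|Assenmacher, Paul\n',
    'Yankees (2000)|Buddies, Mike\n',
    'Yankees (2000)|Canseco, Jose\n',
]

result = dict(reducer(mapper(sample)))
```

For a file that fits in memory, the `defaultdict(set)` approach is simpler; the map-reduce shape mainly pays off when the data is spread over many machines.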
