You've got multiple questions in your code, so let's answer them one by one.
uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?
For one thing, it might be conceptually simpler (although just as verbose) to split()
the strings independently, instead of joining them together then splitting the result.
uniq = list(set(chain(*map(str.split, (s1, s2, s3)))))
Beyond that: it looks like you're always using the word lists, not the actual sentences, so you're splitting in multiple places. Why not just split them all at once, up at the top?
Meanwhile, instead of explicitly passing around `s1`, `s2`, and `s3`, why not stick them in a collection? You can stick the results in a collection as well.
So:
sentences = (s1, s2, s3)
wordlists = [sentence.split() for sentence in sentences]
uniq = list(set(chain.from_iterable(wordlists)))
# ...
vectors = [vectorize(sentence, dictionary) for sentence in sentences]
for vector in vectors:
    print(vector)
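As a quick check that splitting each sentence on its own yields the same vocabulary as joining and re-splitting, here is a small sketch. The three sample sentences are made up for illustration and are not from the question.

```python
from itertools import chain

# Hypothetical sample sentences, standing in for s1, s2, s3.
s1 = "the cat sat on the mat"
s2 = "the dog sat on the log"
s3 = "cats and dogs"

sentences = (s1, s2, s3)
wordlists = [sentence.split() for sentence in sentences]

# Join-then-split (the original approach) vs. split-then-flatten.
joined = set(" ".join(sentences).split())
flattened = set(chain.from_iterable(wordlists))

# sorted() only makes the output reproducible; set order is arbitrary.
uniq = sorted(flattened)
print(joined == flattened)
print(uniq)
```

Sorting is an addition here; if you don't care which index each word gets, `list(set(...))` is fine.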
dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
    dictionary[i] = uniq[i]
You could do it as `dict()` on a list comprehension, but a dict comprehension is even simpler. And, while you're at it, use `enumerate` instead of the `for i in range(len(uniq))` bit.
dictionary = {idx: word for (idx, word) in enumerate(uniq)}
That replaces the whole `# ...` part in the above.
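For comparison, here are the loop, the `dict()` call, and the dict comprehension side by side, as a sketch on a made-up word list. Since `enumerate` already yields `(index, word)` pairs, passing it straight to `dict()` gives the same result too.

```python
# Hypothetical word list, standing in for uniq.
uniq = ["cat", "dog", "mat"]

# The original loop.
by_loop = {}
for i in range(len(uniq)):
    by_loop[i] = uniq[i]

# dict() on a list comprehension of (key, value) pairs.
by_dict_call = dict([(idx, word) for idx, word in enumerate(uniq)])

# A dict comprehension.
by_comprehension = {idx: word for idx, word in enumerate(uniq)}

# enumerate() already produces the pairs, so dict() can take it directly.
by_enumerate = dict(enumerate(uniq))

print(by_loop == by_dict_call == by_comprehension == by_enumerate)
```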
Meanwhile, if you want a reverse dictionary lookup, this is not the way to do it, since it sorts and scans every item on each call:
def getKey(dic, value):
    return [k for k,v in sorted(dic.items()) if v == value]
Instead, create an inverse dictionary, mapping values to lists of keys.
from collections import defaultdict

def invert_dict(dic):
    # Map each value to the list of keys that hold it.
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d
Then, instead of your `getKey` function, just do a normal lookup in the inverted dict.
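A minimal sketch of that lookup, using a made-up index-to-word dictionary in place of yours:

```python
from collections import defaultdict

def invert_dict(dic):
    """Map each value of dic to the list of keys that hold it."""
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

# Hypothetical index -> word dictionary.
dictionary = {0: "cat", 1: "dog", 2: "mat"}
inverse = invert_dict(dictionary)

# Equivalent of getKey(dictionary, "dog")[0], without scanning every item.
dic_pos = inverse["dog"][0]
print(dic_pos)
```

One thing to watch: because the result is a `defaultdict`, indexing it with a missing word silently adds an empty list. Use `inverse.get(word)` if you'd rather not grow the dict on misses.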
If you need to alternate modifications and lookups, you probably want some kind of bidirectional dictionary that manages its own inverse dictionary as it goes along. There are a bunch of recipes for such a thing on ActiveState, and there may be some modules on PyPI, but it's not that hard to build yourself. And at any rate, you don't seem to need that here.
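To give a rough idea of what "build it yourself" might look like, here is a bare-bones sketch. The `BiDict` class and its `key_for` method are hypothetical names invented for this example, not taken from any recipe or module, and it assumes a one-to-one mapping (each value belongs to at most one key).

```python
class BiDict(object):
    """Minimal one-to-one bidirectional mapping (illustrative only)."""

    def __init__(self):
        self.forward = {}   # key -> value
        self.backward = {}  # value -> key

    def __setitem__(self, key, value):
        # Drop stale pairs so both directions stay consistent.
        if key in self.forward:
            del self.backward[self.forward[key]]
        if value in self.backward:
            del self.forward[self.backward[value]]
        self.forward[key] = value
        self.backward[value] = key

    def __getitem__(self, key):
        return self.forward[key]

    def key_for(self, value):
        return self.backward[value]

words = BiDict()
words[0] = "cat"
words[1] = "dog"
words[1] = "mat"  # rebinding key 1 also removes "dog" from the inverse
print(words[1])
print(words.key_for("mat"))
```

A real implementation would also want deletion, iteration, and so on; this only shows how the two dicts are kept in step on assignment.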
Finally, there's your `vectorize` function.
The first thing to do is to take a word list instead of a sentence to split, as mentioned above.
And there's no reason to re-split the sentence after `lower`; just use a map or generator expression on the word list.
In fact, I'm not sure why you're doing `lower` here, when your dictionary is built out of the original-case versions. I'm guessing that's a bug, and you wanted to do `lower` when building the dictionary as well. That's one of the advantages of making the word lists in advance in a single, easy-to-find place: you just need to change that one line:
wordlists = [sentence.lower().split() for sentence in sentences]
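To see why the mismatch matters, here is a tiny sketch with two made-up sentences that differ only in capitalization. Without `lower`, `The` and `the` end up as separate dictionary entries, and a lowercased word looked up later may not be in the dictionary at all.

```python
# Hypothetical sentences that differ only in capitalization.
sentences = ("The cat", "the dog")

original_case = [sentence.split() for sentence in sentences]
lowercased = [sentence.lower().split() for sentence in sentences]

print(original_case)
print(lowercased)
```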
Now you're already a bit simpler:
def vectorize(wordlist, dictionary):
    vector = []
    for word in wordlist:
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        vector.append((dic_pos,word_count))
    return vector
Meanwhile, you may recognize that the `vector = []`… `for word in wordlist`… `vector.append` pattern is exactly what a list comprehension is for. But how do you turn three lines of code into a list comprehension? Easy: refactor it into a function. So:
def vectorize(wordlist, dictionary):
    def vectorize_word(word):
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        return (dic_pos,word_count)
    return [vectorize_word(word) for word in wordlist]
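Putting the pieces so far together, one way the whole thing might look is sketched below. The sample sentences are made up, and `word_to_index` is a hypothetical name for the inverted dictionary; since every word in `uniq` is unique, a plain word-to-index dict is enough here instead of lists of keys. I've also sorted `uniq` purely so the output is reproducible.

```python
from itertools import chain

def vectorize(wordlist, word_to_index):
    def vectorize_word(word):
        word_count = wordlist.count(word)
        dic_pos = word_to_index[word]  # direct lookup replaces getKey
        return (dic_pos, word_count)
    return [vectorize_word(word) for word in wordlist]

# Hypothetical sample sentences.
sentences = ("the cat sat", "The cat saw the dog")
wordlists = [sentence.lower().split() for sentence in sentences]
uniq = sorted(set(chain.from_iterable(wordlists)))
dictionary = {idx: word for idx, word in enumerate(uniq)}
word_to_index = {word: idx for idx, word in dictionary.items()}

vectors = [vectorize(wordlist, word_to_index) for wordlist in wordlists]
for vector in vectors:
    print(vector)
```

Note that, just like the original, this emits one pair per word occurrence, so a repeated word produces a repeated pair.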