CS224N Assignment 1

Part 1
Question 1.1: Implement distinct_words [code] (2 points)
Write a method to work out the distinct words (word types) that occur in the corpus.
You can use for loops to process the input corpus (a list of list of strings), but try using Python list comprehensions (which are generally faster). In particular, a nested list comprehension may be useful for flattening a list of lists. If you're not familiar with Python list comprehensions in general, here's more information.
Your returned corpus_words should be sorted. You can use python’s sorted function for this.
You may find it useful to use Python sets to remove duplicate words.
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1

    # ------------------
    # Write your implementation here.
    # Flatten the list of documents into one list of words, de-duplicate with a set, then sort.
    corpus_words = [word for document in corpus for word in document]
    corpus_words = sorted(set(corpus_words))
    n_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, n_corpus_words
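As a quick sanity check (a hypothetical two-document mini-corpus, not the notebook's test case), the function should return a sorted, de-duplicated vocabulary:

test_corpus = [["<START>", "all", "that", "glitters", "<END>"],
               ["<START>", "all", "is", "not", "gold", "<END>"]]
words, n_words = distinct_words(test_corpus)
print(words)    # ['<END>', '<START>', 'all', 'glitters', 'gold', 'is', 'not', 'that']
print(n_words)  # 8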
Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)
Write a method that constructs a co-occurrence matrix for a certain window-size $n$ (with a default of 4), considering words $n$ before and $n$ after the word in the center of the window. Here, we start to use numpy (np) to represent vectors, matrices, and tensors. If you’re not familiar with NumPy, there’s a NumPy tutorial in the second half of this cs231n Python NumPy tutorial.
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.
        For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
        "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}

    # ------------------
    # Write your implementation here.
    M = np.zeros((n_words, n_words))
    word2ind = {word: index for index, word in enumerate(words)}
    for sentence in corpus:
        for center, word in enumerate(sentence):
            # Count every word within window_size positions of the center word,
            # skipping positions that fall outside the sentence and the center itself
            # (clamping out-of-range positions to the sentence edges would double-count edge words).
            for i in range(-window_size, window_size + 1):
                idx = center + i
                if idx < 0 or idx >= len(sentence) or idx == center:
                    continue
                M[word2ind[word], word2ind[sentence[idx]]] += 1
    # ------------------

    return M, word2ind
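To check the window logic against the example in the docstring (a small sketch, assuming numpy has already been imported as np), the row for "All" should contain a count of 1 for each of "<START>", "that", "glitters", "is", and "not", and 0 for "gold":

doc = ["<START>", "All", "that", "glitters", "is", "not", "gold", "<END>"]
M, word2ind = compute_co_occurrence_matrix([doc], window_size=4)
row = M[word2ind["All"]]
for w in ["<START>", "that", "glitters", "is", "not", "gold"]:
    print(w, row[word2ind[w]])   # 1.0 for the first five, 0.0 for "gold"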
Question 1.3: Implement reduce_to_k_dim [code] (1 point)
Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.
Note: All of numpy, scipy, and scikit-learn (sklearn) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use sklearn.decomposition.TruncatedSVD.
How to use TruncatedSVD from sklearn:
from sklearn.decomposition import TruncatedSVD
import numpy as np

# An example matrix (e.g., a small co-occurrence matrix)
A = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1]
])

# Keep the top 2 singular values / components
svd = TruncatedSVD(n_components=2)
A_reduced = svd.fit_transform(A)

print("UΣ (reduced result):")
print(A_reduced)
print("\nSingular values Σ:")
print(svd.singular_values_)
print("\nRight singular vectors V^T:")
print(svd.components_)
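Putting this together, a possible sketch of reduce_to_k_dim (treat this as an illustration rather than the reference solution; the n_iters value is just a typical choice for the randomized solver):

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using TruncatedSVD from scikit-learn.
        Params:
            M (numpy matrix of shape (number of unique words, number of unique words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimensionality reduction
        Return:
            M_reduced (numpy matrix of shape (number of unique words, k)): matrix of k-dimensional word embeddings
    """
    n_iters = 10  # iterations for the randomized SVD solver (assumed value)
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)  # each row is a k-dimensional embedding (rows of U * Sigma)
    print("Done.")
    return M_reduced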
Question 1.4: Implement plot_embeddings [code] (1 point)
Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (plt).
For this example, you may find it useful to adapt this code. In the future, a good way to make a plot is to look at the Matplotlib gallery, find a plot that looks somewhat like what you want, and adapt the code they give.
def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # -------- first version ----------
    for word in words:
        idx = word2ind[word]
        x = M_reduced[idx][0]
        y = M_reduced[idx][1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()

    # -------- cleaner version: index the coordinate columns once ----------
    x_coords = M_reduced[:, 0]
    y_coords = M_reduced[:, 1]
    for word in words:
        x = x_coords[word2ind[word]]
        y = y_coords[word2ind[word]]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()
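Once M_reduced and word2ind have been computed earlier in the notebook, the function can be called on any subset of the vocabulary (the word list below is a placeholder, not a prescribed choice):

words_to_plot = list(word2ind.keys())[:10]  # or any hand-picked subset of the vocabulary
plot_embeddings(M_reduced, word2ind, words_to_plot)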
Part 2
Question 2.1: GloVe Plot Analysis [written] (3 points)
Run the cell below to plot the 2D GloVe embeddings for ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge'].
Cosine Similarity
Now that we have word vectors, we need a way to quantify the similarity between individual words, according to these vectors. One such metric is cosine-similarity. We will be using this to find words that are “close” and “far” from one another.
We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective, L1 and L2 distances help quantify the amount of space "we must travel" to get between these two points. Another approach is to examine the angle between two vectors. From trigonometry we know that $p \cdot q = \|p\|\,\|q\|\cos(\Theta)$.
Instead of computing the actual angle, we can leave the similarity in terms of $similarity = \cos(\Theta)$. Formally, the Cosine Similarity $s$ between two vectors $p$ and $q$ is defined as:
$$s = \frac{p \cdot q}{\|p\|\,\|q\|}, \text{ where } s \in [-1, 1]$$
In Python, cosine similarity can be computed as follows:
similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
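As a quick sanity check on this formula (toy vectors, not word embeddings): orthogonal vectors should give a similarity of 0 and identical vectors a similarity of 1.

import numpy as np
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))  # 0.0 (orthogonal)
print(np.dot(p, p) / (np.linalg.norm(p) * np.linalg.norm(p)))  # 1.0 (identical)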
Question 2.3: Synonyms & Antonyms (2 points) [code + written]
When considering Cosine Similarity, it’s often more convenient to think of Cosine Distance, which is simply 1 - Cosine Similarity.
Find three words $(w_1,w_2,w_3)$ where $w_1$ and $w_2$ are synonyms and $w_1$ and $w_3$ are antonyms, but Cosine Distance $(w_1,w_3) <$ Cosine Distance $(w_1,w_2)$.
As an example, $w_1$=“happy” is closer to $w_3$=“sad” than to $w_2$=“cheerful”. Please find a different example that satisfies the above. Once you have found your example, please give a possible explanation for why this counter-intuitive result may have happened.
You should use the wv_from_bin.distance(w1, w2) function here to compute the cosine distance between two words. Please see the GenSim documentation for further assistance.
A possible explanation for why a synonym ends up at a larger cosine distance than an antonym is that, in the training corpus, the two antonyms co-occur very frequently (antonyms are often used in the same contrastive contexts), so their distributional vectors end up close together.
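One way to check a candidate triple (assuming wv_from_bin is the GenSim KeyedVectors object loaded earlier in the notebook; the words below are just the "happy"/"cheerful"/"sad" illustration from the prompt, so substitute your own):

w1, w2, w3 = "happy", "cheerful", "sad"   # (w1, w2) synonyms, (w1, w3) antonyms
d_syn = wv_from_bin.distance(w1, w2)
d_ant = wv_from_bin.distance(w1, w3)
print(d_syn, d_ant)   # the example counts if d_ant < d_syn, i.e. the antonym is closer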

