Unsupervised Learning to Understand Text
Document Clustering
Table 3.2: Factors that determine the quality of the results
Selecting the Number of Clusters
The following code uses the TSNEVisualizer tool from the yellowbrick library to project and visualize the vectorized documents within a two-dimensional space:
Table 3.3: Dimensionality reduction techniques
One of the key features of t-SNE is that it tries to preserve the local structure of the data as much as possible, so that points that are close to each other in the high-dimensional space remain close in the low-dimensional projection.
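As a sketch of this projection step, the snippet below applies scikit-learn's TSNE directly to a tiny hand-made corpus (the documents and parameter values are assumptions, not taken from the unit); yellowbrick's TSNEVisualizer wraps this same estimator and adds plotting on top:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# hypothetical toy corpus: two rough topics
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares amid losses",
]

# vectorize with TF-IDF, then project to 2 dimensions with t-SNE
X = TfidfVectorizer().fit_transform(docs)
coords = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(X.toarray())
print(coords.shape)  # (4, 2): one 2-D point per document
```

Note that t-SNE's perplexity must be smaller than the number of samples, which is why such a small value is used on this toy corpus.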
Agglomerative Clustering (AC)
The linkage() Function
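As a minimal sketch (the toy points below are assumptions, not data from the unit), SciPy's linkage() builds the merge tree that a dendrogram draws, and fcluster() cuts that tree into flat cluster labels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical 2-D points forming two well-separated groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Ward linkage: merges the pair of clusters that least increases variance
Z = linkage(points, method="ward")

# cut the merge tree into exactly 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # first three points share one label, last three the other
```

The same linkage matrix Z can be passed to scipy.cluster.hierarchy.dendrogram to plot the merge tree.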
The following code uses the ground-truth labels and three evaluation metrics to assess the quality of the clustering:
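The exact three measures are not shown here; a common trio in scikit-learn for comparing cluster assignments against ground-truth labels is homogeneity, completeness, and V-measure, sketched below with hypothetical label arrays:

```python
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

y_true = [0, 0, 1, 1, 2, 2]   # ground-truth class labels (hypothetical)
y_pred = [1, 1, 0, 0, 2, 2]   # cluster ids; numbering need not match the classes

h = homogeneity_score(y_true, y_pred)   # each cluster contains a single class
c = completeness_score(y_true, y_pred)  # each class ends up in a single cluster
v = v_measure_score(y_true, y_pred)     # harmonic mean of the two
print(h, c, v)  # → 1.0 1.0 1.0: a perfect clustering up to label permutation
```

All three scores ignore the actual label values and only measure how the grouping structure agrees, which is exactly what is needed when cluster ids are arbitrary.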
Word Vectorization with Neural Networks
Word2Vec
The first 10 dimensions of the numeric "fox" embedding are displayed below:
The following function is then used to select a sample of representative words:
Finally, you can use t-SNE to reduce the 300-dimensional embeddings of the words in this sample to two dimensions for visualization.
Sentence Vectorization with Deep Learning
Bidirectional Encoder Representations from Transformers (BERT)
SBERT
The sentence_transformers Library
The same TSNEVisualizer tool that was used earlier in this unit to visualize the vectorized documents produced by the TF-IDF vectorizer can now be used for the embeddings produced by SBERT:
The dendrogram tool suggests the use of four clusters, each marked with a different color in Figure 3.24.
Read the sentences and tick True or False.
Give examples of applications for which dimensionality reduction can be used, and describe the techniques used in dimensionality reduction.
Describe the functionality of TF-IDF vectorization.
You are given a NumPy array 'Docs' that includes one text document in each row.
Complete the following code so that it uses Word2Vec to replace every word in a given sentence with its most similar one.