Lesson: Unsupervised Learning - Artificial Intelligence - Third Year of Secondary School
Part 1
1. Basics of Artificial Intelligence
2. Artificial Intelligence Algorithms
3. Natural Language Processing (NLP)
Part 2
4. Image Recognition
5. Optimization & Decision-making Algorithms
Lesson 2: Unsupervised Learning
Link to digital lesson: www.ien.edu.sa

Using Unsupervised Learning to Understand Text

Unsupervised learning is a type of machine learning where the model is not given any labeled training data. Instead, the model is only given a set of examples and must find patterns and relationships within the data on its own. In the context of understanding text, unsupervised learning can be used to discover latent structures and patterns within a dataset of text documents.

There are many different techniques that can be used for unsupervised learning of text data, including clustering algorithms, dimensionality reduction techniques, and generative models. Clustering algorithms can be used to group together similar documents, while dimensionality reduction techniques can be used to reduce the dimensionality of the data and identify important features. Generative models, on the other hand, can be used to learn the underlying distribution of the data and generate new text that is similar to the original dataset.

Clustering Algorithms
Clustering algorithms can group similar customers based on their behavior, demographics, or purchasing history for targeted marketing and increased customer retention.

Dimensionality Reduction Techniques
Dimensionality reduction is used in image compression to reduce the number of pixels in an image, minimizing the amount of data needed to represent the image while preserving its main features.

Generative Models
Generative models are used in anomaly detection applications, where anomalies are detected by first learning the normal patterns of the data with a generative model.

Unsupervised Learning
In unsupervised learning, the model is given large amounts of unlabeled data and has to find patterns in the unstructured data through observation and clustering.

Dimensionality Reduction
Dimensionality reduction is a technique in machine learning and data analysis to reduce the number of features (dimensions) in a dataset while retaining as much information as possible.

Figure 3.15: Unsupervised learning representation (unlabeled data → feature vectors → algorithm → model → predicted output)
One of the key advantages of using unsupervised learning is that it can be used to identify patterns and relationships that may not be immediately apparent to a human observer. This can be especially useful for understanding large datasets of unstructured text, where manual analysis may be impractical. In this unit, you will use an openly available dataset of news articles from the BBC to demonstrate some key techniques for unsupervised learning (Greene & Cunningham, 2006). The following code is used to load the dataset, which is organized into five different news folders representing articles from different news sections: business, politics, sports, technology, and entertainment. These five labels will not be used to inform any of the algorithms presented in this unit. Instead, they will only be used for visualization and validation purposes. Each news folder includes hundreds of text files, with each file containing the content of a single article. The dataset is already loaded into the Jupyter Notebook, and the code block will open the dataset and extract all the documents and required labels into two list data structures.

Cluster
A cluster is a group of similar things. In machine learning, grouping unlabeled data into homogeneous clusters is called clustering.
Figure 3.16: Representation of a cluster

BBC open dataset: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification
D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. All rights, including copyright, in the content of the original articles are owned by the BBC.

# used to list all the files and subfolders in a given folder
from os import listdir
# used for generating random numbers and shuffling lists
import random

bbc_docs = []    # holds the text of the articles
bbc_labels = []  # holds the news section for each article

for folder in listdir('bbc'):  # for each news-section folder
    for file in listdir('bbc/' + folder):  # for each text file in this folder
        # open the text file, use encoding='utf8' because articles may include non-ascii characters
        with open('bbc/' + folder + '/' + file, encoding='utf8', errors='ignore') as f:
            bbc_docs.append(f.read())  # read the text of the article and append it to the docs list
            # use the name of the folder (news section) as the label for this doc
            bbc_labels.append(folder)

# shuffle the docs and labels lists in parallel
merged = list(zip(bbc_docs, bbc_labels))  # link the two lists
random.shuffle(merged)  # shuffle them in parallel (with the same random order)
bbc_docs, bbc_labels = zip(*merged)  # separate them again into individual lists
Document Clustering

Now that the dataset has been loaded, the next step is to experiment with various unsupervised methods. Clustering is arguably the most popular type of method in this domain. Given a collection of unlabeled documents, the goal of clustering is to group documents that are similar to one another, while separating documents that are dissimilar.

Document Clustering
Document clustering is a method which groups textual documents into clusters based on their content similarity.

Table 3.2: Factors that determine the quality of the results
1. The way in which the data has been vectorized. Even though TF-IDF is an established technique in this space, this unit will also explore more sophisticated alternatives.
2. The exact definition of document-to-document similarity. For vectorized text data, the Euclidean and cosine distance measures are the most popular. The former will be used in the examples presented in this unit.
3. The selected number of clusters. Agglomerative Clustering (AC) provides an intuitive method for selecting the appropriate number of clusters for a given dataset, which is a key challenge for clustering tasks.

Selecting the Number of Clusters

Selecting the correct number of clusters is a crucial step for any clustering task. Unfortunately, the vast majority of clustering algorithms expect the practitioner to provide the correct number of clusters as part of the input. The selected number can have a significant impact on the quality and interpretability of the results. There are several approaches that can be used to select the number of clusters.
• One common approach is to use a measure of cluster "compactness". This can be done by calculating the sum of the distances between the points within each cluster, and selecting the number of clusters that minimizes this sum.
• Another approach is to use a measure of the "separation" between the clusters, such as the average distance between points in different clusters, and to select the number of clusters that maximizes this separation.
In practice, the above approaches often contradict each other by recommending different numbers. This is an especially common challenge when working with text data, whose structure is often difficult to discern.

Figure 3.17: Machine calculating the distances between points

Euclidean Distance
Euclidean distance is the straight-line distance between two points in a multidimensional space. It is calculated as the square root of the sum of the squares of the differences between the corresponding dimensions of the points. Euclidean distance is used in clustering to measure the similarity between two data points.

Cosine Distance
Cosine distance measures the cosine similarity between two data points. It calculates the cosine of the angle between two vectors representing the data points and is often used in text data clustering. The cosine similarity value is between -1 and 1, with -1 indicating the complete opposite direction and 1 indicating the same direction.
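To make these two measures concrete, here is a minimal sketch (an addition, not part of the original lesson) that computes both distances for two small toy vectors standing in for vectorized documents. Note that scipy's cosine() returns the cosine distance, which equals 1 minus the cosine similarity.

from scipy.spatial.distance import euclidean, cosine

# two toy 3-dimensional vectors standing in for vectorized documents
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 1.0, 0.0]

print('Euclidean distance:', euclidean(doc_a, doc_b))  # straight-line distance between the two points
print('Cosine distance:', cosine(doc_a, doc_b))        # 1 - cosine similarity of the two vectors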
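One widely used way to balance compactness and separation is the silhouette score. The sketch below is only an illustration on synthetic data (it is not part of the original lesson and uses KMeans rather than the clustering methods discussed here); it shows how compactness (inertia) keeps decreasing as clusters are added, while the silhouette score typically peaks near the true number of groups.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic dataset with 4 well-separated groups of points
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ measures compactness: the sum of squared distances of points to their cluster center
    # silhouette_score combines compactness and separation: higher values indicate better clustering
    print(k, round(model.inertia_, 1), round(silhouette_score(X, model.labels_), 3))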
The number of clusters in unsupervised learning determines how many groups or categories the algorithm will divide the data into. Choosing the right number of clusters is important because it affects the accuracy and interpretability of the results. If the number of clusters is too high, the groups may be too specific and not meaningful. If the number of clusters is too low, the groups may be too broad and may not capture the underlying structure of the data. It is important to strike a balance between having enough clusters to capture meaningful patterns, but not so many that the results become too complex to understand.

Hierarchical Clustering

Hierarchical clustering is a clustering algorithm for grouping data into clusters based on similarity. In hierarchical clustering, the data points are organized into a tree-like structure, where each node represents a cluster, and the parent node represents a merger of its child nodes. The following code imports specific libraries that will be used for the end-to-end hierarchical clustering:

# used for tf-idf vectorization, as seen in the previous unit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering  # used for agglomerative clustering
# used to visualize and support hierarchical clustering tasks
import scipy.cluster.hierarchy as hierarchy
# set the color palette to be used by the 'hierarchy' tool
hierarchy.set_link_color_palette(['blue', 'green', 'red', 'yellow', 'brown', 'purple', 'orange', 'pink', 'black'])
import matplotlib.pyplot as plt  # used for general visualizations

Text Vectorization

Similar to the supervised methods that were presented in the previous unit, many methods for unsupervised learning also require raw text to be vectorized into a numeric format. The following code uses the TfidfVectorizer tool (which was also used in the previous lesson) for this purpose:

vectorizer = TfidfVectorizer(min_df=10)  # apply tf-idf vectorization, ignore words that appear in fewer than 10 docs
text_tfidf = vectorizer.fit_transform(bbc_docs)  # fit and transform in one line
text_tfidf

<2225x5867 sparse matrix of type '<class 'numpy.float64'>'
	with 392379 stored elements in Compressed Sparse Row format>

As can be seen above, the document data have now been converted into the sparse numeric format that was also used in the previous lesson.
The following code uses the TSNEVisualizer tool from the yellowbrick library to project and visualize the vectorized documents within a 2-dimensional space:

%%capture
!pip install yellowbrick
from yellowbrick.text import TSNEVisualizer

Dimensionality Reduction

Dimensionality reduction can be useful in a number of applications, such as:
• Visualizing high-dimensional data: It can be difficult to visualize data in a high-dimensional space, so reducing the number of dimensions can make it easier to visualize and understand the data.
• Reducing the complexity of a model: A model with fewer dimensions may be simpler and easier to understand, and the training process is faster.
• Improving the performance of a model: Dimensionality reduction can help remove noise and redundancy from the data, which can improve the performance of a model.

Table 3.3: Dimensionality reduction techniques

Feature selection
Description: Feature selection involves selecting a subset of the original features.
Example of use: Medical datasets may have hundreds of columns per patient case. Only a few of these features can help the model diagnose correctly. Other traits are unrelated to the diagnosis and may distract the model. Feature selection discards all but the most discriminating features.

Feature transformation
Description: Feature transformation involves combining or transforming the original features to create a new set of features. The original features can be dropped, as they have become redundant.
Example of use: Consider predicting a patient's length of stay on admission: additional features can be created for the model from the existing features of the patient's medical record, for example the number of lab tests ordered during the past week or the number of visits during the past month. Another example is computing the area of a rectangle from its height and width.

Manifold learning
Description: Manifold learning techniques, such as t-SNE and UMAP (Uniform Manifold Approximation and Projection), are unsupervised learning techniques that aim to preserve the structure of the data in a lower-dimensional space.
Example of use: They can convert a high-dimensional image into a lower-dimensional space while keeping its primary characteristics and structure. Since it takes up less space, this compressed representation may be stored and sent, and the original image can be rebuilt with minimal loss of information.

t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is an unsupervised machine learning algorithm for dimensionality reduction.
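As a concrete illustration of the feature-transformation row of Table 3.3 (an addition to the lesson, not used in the rest of this unit), the following sketch uses TruncatedSVD from sklearn to combine the thousands of TF-IDF columns of text_tfidf into 100 dense features:

from sklearn.decomposition import TruncatedSVD

# feature transformation: combine the sparse TF-IDF columns into 100 dense features
svd = TruncatedSVD(n_components=100, random_state=0)
text_svd = svd.fit_transform(text_tfidf)

print(text_svd.shape)  # (number of documents, 100)
print('Share of variance preserved:', svd.explained_variance_ratio_.sum())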
One of the key features of t-SNE is that it tries to preserve the local structure of the data as much as possible, so that similar data points are near to each other in the low-dimensional representation. It does this by minimizing the divergence between two probability distributions: the distribution of the high-dimensional data and the distribution of the low-dimensional data. The vectorized BBC dataset is indeed high-dimensional, as it includes a separate dimension (column) for each of the unique words that appear in the data. The total number of dimensions can be computed as follows:

print('Number of unique words in the BBC documents vectors:', len(vectorizer.get_feature_names_out()))

Number of unique words in the BBC documents vectors: 5867

The following code can now be used to project these 5,867 dimensions into just two (the X and Y coordinates of the plot). This code will create a scatter plot, where each color represents one of the five news sections.

tsne = TSNEVisualizer(colors=['blue', 'green', 'red', 'yellow', 'brown'])
tsne.fit(text_tfidf, bbc_labels)
tsne.show();

Figure 3.18: TSNE projection of the 2225 documents (legend: business, entertainment, politics, sport, tech)

This visualization uses the original "ground-truth" label (news section) of each document to reveal the dispersion of each label across the 2D projected vectorization space. The figure reveals that, even though there are some impurities in certain pockets of the space, the five news sections are generally well-separated. Later, an improved vectorization that reduces these impurities will be described.
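A similar 2-dimensional projection can also be produced without yellowbrick, by calling sklearn's TSNE directly. The following is only an illustrative sketch (an addition to the lesson), assuming text_tfidf and bbc_labels from above:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# project the TF-IDF vectors to 2 dimensions (this can take a minute for 2225 documents)
coords = TSNE(n_components=2, init='random', random_state=0).fit_transform(text_tfidf.toarray())

# color each point according to its ground-truth news section
palette = dict(zip(['business', 'entertainment', 'politics', 'sport', 'tech'],
                   ['blue', 'green', 'red', 'yellow', 'brown']))
plt.scatter(coords[:, 0], coords[:, 1], c=[palette[label] for label in bbc_labels], s=5)
plt.show()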
Agglomerative Clustering (AC)

Agglomerative Clustering (AC), also called hierarchical clustering, is one of the most popular and effective methods in this space. It addresses the challenge of selecting the number of clusters by providing an intuitive visual method for making this choice. AC follows a bottom-up approach. It begins by computing the distance between all pairs of data points. It then selects the two closest points and merges them into a single cluster. This process is repeated until all of the data points have been merged into a single cluster or until the desired number of clusters has been reached.

Figure 3.19: Agglomerative Clustering (AC)

Linkage() Function

Python implements this agglomerative process with the linkage() function from scipy's 'hierarchy' module. Two parameters are provided to the linkage() function:
• The vectorized text data. The toarray() function is used to convert the data to its dense format, as required by this function.
• The distance metric that should be used to decide which clusters to merge next during the agglomerative process. There are many different options to choose from for a distance metric depending on the needs and preferences of the user, such as Euclidean, Manhattan, etc. For this project you will use the Ward distance metric.

The following code uses the linkage() function from the 'hierarchy' tool (imported above) to apply this process to the vectorized BBC data:

plt.figure()  # create a new empty figure
# iteratively merge points and clusters until all points belong to a single cluster
# return the linkage of the produced tree
linkage_tfidf = hierarchy.linkage(text_tfidf.toarray(), method='ward')
# visualize the linkage
hierarchy.dendrogram(linkage_tfidf)
# show the figure
plt.show()

Figure 3.20: Hierarchy dendrogram for the BBC data
Ward Distance

The example above uses the popular Ward distance metric for the second parameter. The Ward distance is based on the concept of within-cluster variance, defined as the sum of the squared distances between the points in a cluster and the cluster's centroid. In each iteration, the method evaluates every possible merge by computing the within-cluster variance before and after the merge. It then performs the merge that leads to the lowest increase in variance. Even though Ward is only one of multiple options, it has been shown to work well for text data.

Figure 3.21: Example of the Ward distance metric

The dendrogram in Figure 3.20 provides an intuitive way of selecting the number of clusters. In this example, the library suggests using 7 clusters, each highlighted with a different color. The practitioner can either adopt this suggestion or use the dendrogram to pick a different number. For instance, the blue and green pair was merged last with the cluster group of all the other colors. Therefore, choosing 6 clusters would merge purple and orange, while choosing 5 clusters would also merge blue and green.

Dendrogram
A dendrogram is a tree diagram which shows the hierarchical relationship between data. Usually, it is created as an output from hierarchical clustering.

The following code adopts the tool's suggestion and uses the AgglomerativeClustering tool from the sklearn library to cut the tree after the 7 clusters have been created:

AC_tfidf = AgglomerativeClustering(linkage='ward', n_clusters=7)  # prepare the tool, set the number of clusters
AC_tfidf.fit(text_tfidf.toarray())  # apply the tool to the vectorized BBC data
pred_tfidf = AC_tfidf.labels_  # get the cluster labels
pred_tfidf

array([6, 2, 4, ..., 6, 3, 5], dtype=int64)

Note that the original "ground-truth" label (news section) of each document has not been used at all during this process. Instead, clustering was done exclusively based on the text of each document. Having such ground-truth labels can be useful in practice, as it allows for the validation of the clustering results. The current "ground-truth" labels are the ones in the bbc_labels list.
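As a side note (an addition to the lesson), the same cut can also be obtained directly from the linkage computed earlier, using scipy's fcluster function:

from scipy.cluster.hierarchy import fcluster

# cut the agglomerative tree built by linkage() so that exactly 7 clusters remain
pred_scipy = fcluster(linkage_tfidf, t=7, criterion='maxclust')

print(pred_scipy[:10])  # note: fcluster numbers clusters from 1 to 7, while sklearn uses 0 to 6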
The following code uses the ground-truth labels and three different scoring functions from the sklearn library to evaluate the quality of the produced clustering:
• The Homogeneity score takes values between 0 and 1 and is maximized when all the points of each cluster have the same ground-truth label. Equivalently, each cluster contains only data points of a single class. A value closer to 1 means that the texts grouped in each cluster mostly belong to a single label.
• The Adjusted Rand score takes values between -0.5 and 1.0 and is maximized when all the data points with the same label are in the same cluster and all points with different labels are in different clusters. A value closer to 1 means a better one-to-one mapping of clusters to labels.
• The Completeness score also takes values between 0 and 1 and is maximized when all data points of a given class are assigned to the same cluster.

from sklearn.metrics import homogeneity_score, adjusted_rand_score, completeness_score

print('\nHomogeneity score:', homogeneity_score(bbc_labels, pred_tfidf))
print('\nAdjusted Rand score:', adjusted_rand_score(bbc_labels, pred_tfidf))
print('\nCompleteness score:', completeness_score(bbc_labels, pred_tfidf))

Homogeneity score: 0.6224333236569846
Adjusted Rand score: 0.4630492696176891
Completeness score: 0.5430590192420555

To complete the analysis, the data is re-clustered using 5 clusters, which is equal to the number of ground-truth labels:

AC_tfidf = AgglomerativeClustering(linkage='ward', n_clusters=5)
AC_tfidf.fit(text_tfidf.toarray())
pred_tfidf = AC_tfidf.labels_
print('\nHomogeneity score:', homogeneity_score(bbc_labels, pred_tfidf))
print('\nAdjusted Rand score:', adjusted_rand_score(bbc_labels, pred_tfidf))
print('\nCompleteness score:', completeness_score(bbc_labels, pred_tfidf))

Homogeneity score: 0.528836079209762
Adjusted Rand score: 0.45628412883628383
Completeness score: 0.6075627851312266

Providing the AC algorithm with the actual number of labels gives a better Completeness score, meaning the clustering is more representative. Even though the score results reveal that the combination of Agglomerative Clustering with TF-IDF vectorization produces reasonable results, the quality of the clustering can be improved. The next section will demonstrate how vectorization techniques based on neural networks can lead to superior results.
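To build intuition for these three metrics, here is a small toy sketch (an addition to the lesson) that evaluates a deliberately imperfect clustering of six documents whose ground-truth sections are known:

from sklearn.metrics import homogeneity_score, adjusted_rand_score, completeness_score

true_labels = ['sport', 'sport', 'sport', 'tech', 'tech', 'tech']
pred_clusters = [0, 0, 1, 1, 1, 1]  # one 'sport' document ends up in the mostly-'tech' cluster

print('Homogeneity:', homogeneity_score(true_labels, pred_clusters))
print('Adjusted Rand:', adjusted_rand_score(true_labels, pred_clusters))
print('Completeness:', completeness_score(true_labels, pred_clusters))

Moving the misplaced document back into the first cluster would push all three scores to 1.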
Word Vectorization with Neural Networks

TF-IDF vectorization is based on counting and normalizing the frequency of words across the documents in the dataset. Even though this can lead to good results, frequency-based techniques have a significant limitation, as they completely ignore the semantic connection between words. For example, even though the words 'trip' and 'journey' are synonyms, frequency-based vectorization would treat them as completely separate and independent features. Similarly, even though the words 'apple' and 'fruit' are semantically related (as apples are a type of fruit), this relation will also be ignored. This limitation can significantly impact downstream applications that use this type of vectorization. Consider these two sentences:
"I have a very high fever, so I have to visit a doctor."
"My body temperature has risen significantly, so I need to see a healthcare professional."
Even though these two sentences describe the exact same scenario, they do not share any informative words. Therefore, any clustering algorithm that is based on TF-IDF (or any other frequency-based) vectorization would fail to see their similarity and would likely not place them in the same cluster.

Word2Vec

This limitation can be addressed via methods that consider the semantic similarity between words. One of the most popular methods for this purpose is Word2Vec, which uses an architecture based on neural networks. Word2Vec is based on the intuition that semantically similar words will typically be surrounded by the same context words. Therefore, given that the neural network uses the hidden embedding of each word to predict its context, similar words should be mapped to similar embeddings. In practice, Word2Vec models are pre-trained on millions of documents to learn high-quality word embeddings. Such pre-trained models can then be downloaded and used in any text-based application. The following code uses the gensim library to download a popular pre-trained model that has been trained on a very large dataset from Google News:

import gensim.downloader as api

model_wv = api.load('word2vec-google-news-300')
fox_emb = model_wv['fox']
print(len(fox_emb))

300

This model maps each word to an embedding with 300 dimensions.

Stopwords
Stopwords are common words in a language often removed during the text pre-processing step in NLP tasks such as word vectorization. These words include articles, conjunctions, and prepositions and are not typically considered useful for determining the meaning or context of a text.

Embedding
An embedding represents words or tokens in a continuous vector space where semantically similar words are mapped to nearby points.
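Besides downloading a pre-trained model, gensim can also train a Word2Vec model from scratch. The following is only an illustrative sketch (an addition to the lesson), assuming the bbc_docs list from earlier; a model trained on such a small corpus will produce lower-quality embeddings than the Google News model, and the rest of the lesson keeps using the pre-trained model_wv.

from gensim.models import Word2Vec
import re

# tokenize each BBC document into a list of lowercase words
sentences = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in bbc_docs]

# train a small Word2Vec model: 100-dimensional embeddings, context window of 5 words,
# ignoring words that appear fewer than 5 times in the corpus
small_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

print(small_model.wv.most_similar(positive=['economy'], topn=5))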
The first 10 dimensions of the numeric "fox" embedding are displayed below:

fox_emb[:10]

array([-0.08203125, -0.01379395, -0.3125    , -0.04125977, -0.12988281,
       -0.10107422, -0.00164795,  0.15917969,  0.05493164,  0.12402344],
      dtype=float32)

The model can use the embeddings of the words to evaluate their similarity. Consider the following example, which compares the word 'car' with other words of decreasing similarity. Similarity values range between -1 and 1; values closer to 1 indicate more similar words.

pairs = [
    ('car', 'minivan'),
    ('car', 'bicycle'),
    ('car', 'airplane'),
    ('car', 'street'),
    ('car', 'apple'),
]
for w1, w2 in pairs:
    print(w1, w2, model_wv.similarity(w1, w2))

car minivan 0.69070363
car bicycle 0.5364484
car airplane 0.42435578
car street 0.33141237
car apple 0.12830706

The following code can be used to find the 5 most similar words to a given word:

print(model_wv.most_similar(positive=['apple'], topn=5))

[('apples', 0.720359742641449), ('pear', 0.6450697183609009), ('fruit', 0.6410146355628967), ('berry', 0.6302295327186584), ('pears', 0.613396167755127)]

Visualization can be used to further validate the embeddings of this pre-trained model. This can be achieved by:
• Selecting a sample of words from the BBC dataset.
• Using t-SNE to reduce the 300-dimensional embedding of each word to a 2-dimensional point.
• Visualizing the points as a scatter plot in 2-dimensional space.
%%capture
import nltk  # import the nltk library for nlp
import re  # import the re library for regular expressions
import numpy as np  # used for numeric computations
from collections import Counter  # used to count the frequency of elements in a given list
from sklearn.manifold import TSNE  # tool used for dimensionality reduction
# download the 'stopwords' tool from the nltk library. It includes very common words for different languages
nltk.download('stopwords')
from nltk.corpus import stopwords  # import the 'stopwords' tool
stop = set(stopwords.words('english'))  # load the set of english stopwords

The following function is then used to select a sample of representative words from the BBC dataset. Specifically, the code selects the top 50 most frequent words from each of the 5 BBC news sections, excluding stopwords (very common English words) and words that are not included in the pre-trained Word2Vec model.

Some very common and frequent English words considered stopwords include "a", "the", "is" and "are".

def get_sample(bbc_docs: list, bbc_labels: list):
    word_sample = set()  # a sample of words from the BBC dataset
    # for each BBC news section
    for label in ['business', 'entertainment', 'politics', 'sport', 'tech']:
        # get all the words in this news section, ignore stopwords
        # for each BBC doc and for each word in the BBC doc
        # keep the word if the doc belongs to the label, the word is not a stopword, and it is included in the Word2Vec model
        label_words = [word for i in range(len(bbc_docs))
                       for word in re.findall(r'\b\w\w+\b', bbc_docs[i].lower())
                       if bbc_labels[i] == label and word not in stop and word in model_wv]
        cnt = Counter(label_words)  # count the frequency of each word in this news section
        # get the top 50 most frequent words in this section
        top50 = [word for word, freq in cnt.most_common(50)]
        # add the top50 words to the word sample
        word_sample.update(top50)
    word_sample = list(word_sample)  # convert the set to a list
    return word_sample

word_sample = get_sample(bbc_docs, bbc_labels)
Finally, you can use t-SNE to reduce the 300-dimensional embeddings of the words in the sample into 2-dimensional points. The points are then visualized via a simple scatter plot (a sketch of this step is given after the figure).

Figure 3.22: Representation of the most frequent words from the BBC dataset

The plot verifies that the Word2Vec embeddings successfully capture the semantic associations between words, as indicated by intuitive word groups such as:
• economy, economic, business, financial, sales, bank, firm, firms
• internet, mobile, phones, phone, broadband, online, digital
• actor, actress, film, comedy, films, festival, band, movie
• game, team, match, players, coach, injury, club, rugby
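The code that produces this projection is not listed above. Here is a minimal sketch of how it can be done (an assumption of what the step looks like; the lesson's original code may differ), reusing np, TSNE and plt imported earlier, together with word_sample and model_wv:

# build a matrix with the 300-dimensional embedding of each sampled word
word_vectors = np.array([model_wv[word] for word in word_sample])

# reduce the 300 dimensions to 2 with t-SNE
coords = TSNE(n_components=2, init='random', random_state=0).fit_transform(word_vectors)

# scatter plot with each word written next to its point
plt.figure(figsize=(12, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), word in zip(coords, word_sample):
    plt.annotate(word, (x, y), fontsize=8)
plt.show()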
Sentence Vectorization with Deep Learning

Even though Word2Vec can be used to model individual words, clustering requires the vectorization of entire documents. One of the most popular methods for this purpose is Sentence-BERT (SBERT), which is based on deep learning methods.

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a powerful language representation model developed by Google. Pre-training and fine-tuning are the main mechanisms through which BERT applies transfer learning: the ability to retain information learned for one problem and apply it to solve another. Pre-training is done by feeding the model a massive amount of unlabeled data for multiple tasks, such as masked language prediction (random words in an input text are masked, and the task is to predict these words). For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled datasets from the downstream tasks. Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters. For example, the fine-tuned sentiment analysis model is different from the question-answering model. Interestingly, the models will have little to no architectural difference after the fine-tuning step.

SBERT

SBERT is a modified version of BERT. Similar to Word2Vec, BERT is trained to predict words based on the context of their sentence. Unlike BERT, SBERT is trained to predict whether two sentences are semantically similar. SBERT can be effectively used to create embeddings for pieces of text that are longer than a sentence, such as paragraphs, short documents, or the articles in the BBC dataset used in this unit. Even though all three models are based on neural networks, BERT and SBERT implement significantly different and more complex architectures than Word2Vec.

Sentence_transformers Library

The 'sentence_transformers' library implements the full functionality of the SBERT model. The library comes with several pre-trained SBERT models, each trained on a different dataset and with different objectives. The following code loads one of the most popular general-purpose pre-trained models and uses it to create embeddings for the documents in the BBC dataset:

%%capture
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # load the pre-trained model
text_emb = model.encode(bbc_docs)  # embed the BBC documents
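As a quick check that SBERT captures meaning beyond shared words, the following sketch (an addition to the lesson) compares the two 'fever' sentences from the beginning of this section using the cosine similarity utility of the same library; despite sharing almost no informative words, the two sentences should receive a high similarity score.

from sentence_transformers import util

s1 = 'I have a very high fever, so I have to visit a doctor.'
s2 = 'My body temperature has risen significantly, so I need to see a healthcare professional.'

emb = model.encode([s1, s2])  # embed both sentences with the pre-trained SBERT model
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity: values close to 1 indicate similar meaning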
The same TSNEVisualizer tool that was used earlier in this unit to visualize the vectorized documents produced by the TF-IDF vectorizer can now be used for the embeddings produced by SBERT:

tsne = TSNEVisualizer(colors=['blue', 'green', 'red', 'yellow', 'brown'])
tsne.fit(text_emb, bbc_labels)
tsne.show();

Figure 3.23: TSNE projection of the embeddings produced by SBERT (legend: business, entertainment, politics, sport, tech)

The figure reveals that SBERT leads to a more distinct separation of the different news sections, with fewer impurities than TF-IDF. The next step is to use the embeddings to inform the Agglomerative Clustering algorithm:

plt.figure()  # create a new figure
# iteratively merge points and clusters until all points belong to a single cluster. Return the linkage of the produced tree
linkage_emb = hierarchy.linkage(text_emb, method='ward')
hierarchy.dendrogram(linkage_emb)  # visualize the linkage
plt.show()  # show the figure

Figure 3.24: Hierarchy dendrogram for SBERT
The dendrogram tool suggests the use of 4 clusters, each marked with a different color in Figure 3.24. The following code uses this suggestion to compute the clusters and the evaluation metrics:

AC_emb = AgglomerativeClustering(linkage='ward', n_clusters=4)
AC_emb.fit(text_emb)
pred_emb = AC_emb.labels_
print('\nHomogeneity score:', homogeneity_score(bbc_labels, pred_emb))
print('\nAdjusted Rand score:', adjusted_rand_score(bbc_labels, pred_emb))
print('\nCompleteness score:', completeness_score(bbc_labels, pred_emb))

Homogeneity score: 0.6741395570357063
Adjusted Rand score: 0.6919474005627763
Completeness score: 0.7965514907905805

If the data is re-clustered using the correct number of 5 clusters, then the yellow cluster marked in the figure above would be split into two. The results are then as follows:

AC_emb = AgglomerativeClustering(linkage='ward', n_clusters=5)
AC_emb.fit(text_emb)
pred_emb = AC_emb.labels_
print('\nHomogeneity score:', homogeneity_score(bbc_labels, pred_emb))
print('\nAdjusted Rand score:', adjusted_rand_score(bbc_labels, pred_emb))
print('\nCompleteness score:', completeness_score(bbc_labels, pred_emb))

Homogeneity score: 0.7865655030556284
Adjusted Rand score: 0.8197670431956582
Completeness score: 0.7887580797775077

The results verify that using SBERT for text vectorization leads to significantly improved clustering results when compared with TF-IDF. In fact, even if the number of clusters is set to 5 (the correct value) for TF-IDF and to 4 for SBERT, SBERT still scores much higher for all three metrics. The gap becomes even larger if the number is set to 5 for both approaches. This is a testament to the potential of neural networks, whose sophisticated architecture allows them to understand the complex semantic patterns found in text data.
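To see what these scores mean in practice, the following sketch (an addition to the lesson, assuming pred_emb and bbc_labels from above) prints how many documents from each news section fall into each SBERT-based cluster:

from collections import Counter

for cluster_id in sorted(set(pred_emb)):
    # count the ground-truth news sections of the documents assigned to this cluster
    section_counts = Counter(label for label, cluster in zip(bbc_labels, pred_emb) if cluster == cluster_id)
    print('Cluster', cluster_id, ':', section_counts.most_common())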
Exercises

1. Read the sentences and tick ✓ True or False.
   1. In unsupervised learning, labeled datasets are used to train the model. (True / False)
   2. Unsupervised learning requires the vectorization of the data. (True / False)
   3. SBERT is more optimal than TF-IDF for word vectorization. (True / False)
   4. Agglomerative Clustering follows a top-down approach to cluster selection. (True / False)
   5. SBERT is trained to predict whether two sentences are semantically different. (True / False)

2. Give examples of applications of dimensionality reduction. Describe the techniques that are used in dimensionality reduction.

3. Describe the functionality of TF-IDF vectorization.
4. You are given a NumPy array 'Docs' that includes one text document in each row. You are also given an array 'labels' that includes the label for each doc in Docs. Complete the following code so that it uses a pre-trained SBERT model to compute the embeddings for all the documents in Docs and then uses the TSNEVisualizer tool to visualize the embeddings in 2-dimensional space, using a different color for each of the four possible labels.

from sentence_transformers import ____________
from ____________ import TSNEVisualizer

model = ____________('all-MiniLM-L6-v2')  # loads the pre-trained model
docs_emb = model.____________(Docs)  # embeds the docs

tsne = ____________(____________=['blue', 'green', 'red', 'yellow'])
tsne.____________
tsne.show();

5. Complete the following code so that it uses Word2Vec to replace every word in a given sentence with its most similar one.

import gensim.downloader as ____________
import re

model_wv = ____________('word2vec-google-news-300')

old_sentence = 'My name is John and I like basketball.'
new_sentence = ''

for word in re.____________(r'\b\w\w+\b', old_sentence.lower()):
    replacement = model_wv.____________(positive=[word], ____________=1)[0]
    new_sentence += ____________

new_sentence = new_sentence.strip()