Lesson: Supervised Learning - Artificial Intelligence - Third Year of Secondary School

3. Natural Language Processing (NLP)

In this unit, you will learn an end-to-end process for training a supervised and an unsupervised learning model to understand the sentiment of a given piece of text. At the end, you will learn how machine learning can be used to support applications related to Natural Language Processing (NLP).

Learning Objectives
In this unit, you will learn to:
> Define supervised learning.
> Train a supervised learning model to understand text.
> Define unsupervised learning.
> Train an unsupervised learning model to understand text.
> Create a simple chatbot.
> Generate text using Natural Language Processing (NLP) techniques.

Tools
> Jupyter Notebook

Lesson 1: Supervised Learning
Link to digital lesson: www.ien.edu.sa

Using Supervised Learning to Understand Text
Natural Language Processing (NLP) is a field of AI that focuses on enabling computers to understand, interpret, and generate human language. NLP is concerned with tasks such as text classification, sentiment analysis, machine translation, and question answering. This lesson focuses specifically on how supervised learning, one of the main types of machine learning (ML), can be used to automatically understand and make useful predictions about a text's properties.

You already learned in unit 1 that AI is an umbrella term that includes Machine Learning and Deep Learning, as illustrated in figure 3.1. AI is a broad field of computer science that focuses on creating intelligent machines, while machine learning is a subset of AI that focuses on building algorithms and models that allow machines to learn from data without being explicitly programmed.

Figure 3.1: Fields under the AI umbrella (Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence).

Deep Learning
Deep learning is a type of machine learning that uses deep neural networks to automatically learn from large amounts of data. It allows computers to recognize patterns and make decisions in a more humanlike way, by building complex models of the data.

Machine Learning
Machine learning is a subfield of AI that focuses on developing algorithms that enable computers to learn from data, rather than following explicit programming instructions. It involves training computer models to recognize patterns and make predictions based on input data, allowing the model to improve its accuracy over time. This allows machines to perform tasks such as classification, regression, clustering, and recommendation, without being explicitly programmed for each task.

Machine learning can be broadly categorized into three main types:

Supervised learning: a type of machine learning where the algorithm learns from labeled training data, with the goal of making predictions on new data not present in the training or test sets, as shown in figure 3.2. Examples:
• Image classification (e.g. recognizing objects in photos)
• Fraud detection (e.g. identifying suspicious financial transactions)
• Spam filtering (e.g. identifying unwanted email messages)

Unsupervised learning: a type of machine learning where the algorithm works with unlabeled data, trying to find patterns and relationships in the data. Examples:
• Anomaly detection (e.g. detecting unusual patterns in data)
• Clustering (e.g. grouping similar data points together)
• Dimensionality reduction (e.g. selecting the dimensions that reduce data complexity)

Reinforcement learning: a type of machine learning where an agent interacts with its environment and learns by trial and error, receiving rewards or punishments for its actions. Examples:
• Game playing (e.g. playing chess or Go)
• Robotics (e.g. teaching a robot to navigate its environment)
• Resource allocation (e.g. optimizing resource usage in a network)

Figure 3.2: Supervised learning representation (a training data set with its desired output is checked by a supervisor to train the algorithm, which then produces a predicted output for the testing data set).

Table 3.1 summarizes the advantages and disadvantages of each type of machine learning.

Table 3.1: Advantages and disadvantages of Machine Learning types

Supervised Learning
Advantages:
• Well-established and widely used.
• Easy to understand and implement.
• Can handle both linear and non-linear data.
Disadvantages:
• Requires labeled data, which can be expensive to obtain.
• Limited to the task it was trained for, and may not generalize well to new data.
• Difficult to adapt to other problems if the model is too complex.

Unsupervised Learning
Advantages:
• Does not require labeled data, making it more flexible.
• Can discover hidden patterns in data.
• Can handle high-dimensional and complex data.
Disadvantages:
• Harder to understand and interpret than supervised learning.
• Limited to exploratory analysis, and may not be suitable for decision-making tasks.
• Difficult to adapt to other problems if the model is too complex.

Reinforcement Learning
Advantages:
• Flexible, and can handle complex and dynamic environments.
• Can learn from experience and improve over time.
• Suitable for decision-making tasks, such as game playing and robotics.
Disadvantages:
• More complex than supervised or unsupervised learning.
• Challenging to design reward functions that accurately capture the desired behavior.
• May require large amounts of training data and computational resources.
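To make the distinction between the first two types concrete, the short sketch below (not part of the textbook's own code) trains a supervised classifier on four labeled toy points and then clusters the same points without labels; the toy data, DecisionTreeClassifier, and KMeans are illustrative choices, not tools used later in this lesson. Reinforcement learning is not shown, since it requires an interactive environment.

# a minimal illustrative sketch: supervised learning uses labels, unsupervised learning does not
from sklearn.tree import DecisionTreeClassifier  # a supervised classifier
from sklearn.cluster import KMeans               # an unsupervised clustering algorithm

X = [[1, 1], [1, 2], [8, 8], [9, 8]]  # four tiny data points
y = [0, 0, 1, 1]                      # labels, used only by the supervised model

clf = DecisionTreeClassifier().fit(X, y)  # learns from the labeled examples
print(clf.predict([[9, 9]]))              # predicts a label for a new, unseen point

km = KMeans(n_clusters=2, n_init=10).fit(X)  # groups the same points without any labels
print(km.labels_)                            # the cluster assignments it discovered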

Supervised Learning
Supervised Learning is a type of ML that involves the use of labeled data to train an algorithm to make predictions. The algorithm is trained on a labeled dataset and then tested on an unseen dataset. Supervised learning is commonly used in NLP for tasks such as text classification, sentiment analysis, and named entity recognition. In these tasks, the algorithm is trained on a labeled dataset where each example is labeled with the correct category or sentiment. If the labels are numeric, then the supervised learning task is referred to as "regression". If the labels are discrete, the task is referred to as "classification".

Regression
In supervised learning, you use manually curated and labeled datasets to train computer algorithms to predict new values. For instance, regression can focus on predicting the sale price of a house based on its size, location, and number of bedrooms. It can also be used to predict the demand for a product based on historical sales data and advertising expenditure. In an NLP context, regression can use the available text to predict the sentiment score of a movie review or the popularity of a social media post.

Classification
Classification, on the other hand, can be used in applications such as diagnosing a medical condition based on symptoms and test results. When it comes to understanding text, supervised learning can be used to classify or predict categories or labels based on the words and phrases within a document. For example, a supervised learning model might be trained to classify an email as spam or not spam based on the words and phrases used in the email. Another popular application is sentiment classification, which focuses on predicting whether the overall sentiment of a given document is negative or positive. This application is used as a working example in this unit, to demonstrate all the steps in the end-to-end process of building and using a supervised learning model.

In this unit you will use a dataset of movie reviews from the popular website IMDb.com. The dataset has already been split into two parts, one to be used for training the model and one to be used for testing. To load the data into a DataFrame, you will use the Pandas Python library that you have used before. Pandas is a popular library used to read and process spreadsheet-like data. The following code is used to import the library into your program and then load the two datasets:

%%capture
# capture is used to suppress the installation output
# install the pandas library, if it is missing
!pip install pandas

import pandas as pd

# load the train and testing data
imdb_train_reviews = pd.read_csv('imdb_data/imdb_train.csv')
imdb_test_reviews = pd.read_csv('imdb_data/imdb_test.csv')

imdb_train_reviews

As you can see in figure 3.3, the DataFrame dataset has two columns:
• text: the text of the review.
• label: a "0" label represents a negative review, while a "1" label represents a positive review.

Figure 3.3: Labelled training dataset (40000 rows × 2 columns).

The next step is to assign the text and label columns to separate variables from the training and testing examples in the DataFrame dataset:

# extract the text from the 'text' column for both training and testing
X_train_text = imdb_train_reviews['text']
X_test_text = imdb_test_reviews['text']

# extract the labels from the 'label' column for both training and testing
Y_train = imdb_train_reviews['label']
Y_test = imdb_test_reviews['label']

X_train_text  # training data in text format

Figure 3.4: Snapshot of the training examples (X_train_text) from the DataFrame dataset (a Series of 40000 review texts).

The X, Y notations are typically used in supervised learning to represent the input data used to make the prediction (X) and the target labels (Y).
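Before moving on, it can be useful to check how many positive and negative reviews the training set contains. The quick check below is not part of the lesson's original code, but it only uses the imdb_train_reviews DataFrame loaded above:

# count how many reviews carry each label in the training data
print(imdb_train_reviews['label'].value_counts())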

Data Preparation and Pre-Processing
Although this raw text format, as shown in figure 3.5, is intuitive to the human reader, it is unusable by supervised learning algorithms. Instead, algorithms require such documents to be converted into a numeric vector format. The vectorization process can be implemented via multiple different methods, and it has a great impact on the performance of the trained model.

Vectorization
Vectorization is the process of converting strings of words or phrases (text) to a corresponding vector of real numbers, which is used to encode properties of the text in a format that ML algorithms can understand.

Sklearn Library
The supervised model will be built with sklearn (also known as "scikit-learn"), a popular Python library for machine learning. It provides a range of tools and algorithms for tasks such as classification, regression, clustering, and dimensionality reduction. One useful tool within sklearn is the CountVectorizer, which can be used to preprocess and vectorize text data.

CountVectorizer
The CountVectorizer converts a collection of text documents into a matrix of token counts, where each row represents a document and each column represents a particular token. Tokens can be individual words, phrases or even more complex constructs that capture various patterns in the underlying text data. The entries in the matrix indicate the number of times each token appears in each document. This is also known as the "bag-of-words" (BoW) representation, as the order of the words is not preserved and only the counts of the words are retained. Even though the BoW representation is an oversimplification of human language, it can achieve very competitive results in practice.

Figure 3.5: "bag-of-words" (BoW) representation — the text "I like oranges, do you like oranges?" maps to the vector (apples: 0, do: 1, like: 2, oranges: 2, you: 1).

The following code uses the CountVectorizer tool to vectorize the IMDb training dataset:

from sklearn.feature_extraction.text import CountVectorizer

# the min_df parameter is used to ignore terms that appear in less than 10 reviews
vectorizer_v1 = CountVectorizer(min_df=10)
vectorizer_v1.fit(X_train_text)  # fit the vectorizer on the training data

# use the fitted vectorizer to vectorize the data
X_train_v1 = vectorizer_v1.transform(X_train_text)
X_train_v1

<40000x23392 sparse matrix of type '<class 'numpy.int64'>'
    with 5301561 stored elements in Compressed Sparse Row format>
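To see the bag-of-words idea from figure 3.5 on a small scale, the toy sketch below (not part of the lesson's pipeline) fits a separate CountVectorizer on a two-document corpus; the extra "I like apples" document is an assumption added so that the vocabulary matches the one shown in the figure:

# a toy bag-of-words example, separate from the IMDb vectorizer built above
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["I like apples", "I like oranges, do you like oranges?"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# note: single-character tokens such as 'i' are dropped by the default token_pattern
print(toy_vectorizer.get_feature_names_out())  # the vocabulary, in alphabetical order
print(toy_counts.toarray()[1])                 # counts for the second document, as in figure 3.5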

# expand the sparse data into a dense matrix format, where each column represents a different word
X_train_v1_dense = pd.DataFrame(X_train_v1.toarray(),
                                columns=vectorizer_v1.get_feature_names_out())
X_train_v1_dense

Figure 3.6: Vectorizing the training dataset (a dense matrix of 40000 rows × 23392 columns, with columns ordered alphabetically from '00' to 'über').

This "dense" matrix format represents the 40,000 reviews in the training data. It also has a column for each of the words that appear in at least 10 reviews (enforced via the min_df parameter). As can be seen above, this creates a total of 23,392 columns, sorted in alphanumeric order. The matrix entry in position [i,j] represents the number of times that the j-th word appears in the i-th review.

Even though this matrix could directly be used by a supervised learning algorithm, it is highly inefficient in terms of memory usage. This is due to the fact that the vast majority of the entries in this matrix are equal to 0. This happens because only a very small percentage of the 23,392 possible words will actually appear in each review. To address this inefficiency, the CountVectorizer tool stores the vectorized data in a sparse format, which only remembers the non-zero entries in each column.

The code below uses the getsizeof() function, which returns the size of a Python object in bytes, to demonstrate the memory savings of the sparse format for the IMDb data:

from sys import getsizeof

print('\nMegaBytes of RAM memory used by the raw text format:',
      getsizeof(X_train_text)/1000000)

print('\nMegaBytes of RAM memory used by the dense matrix format:',
      getsizeof(X_train_v1_dense)/1000000)

print('\nMegaBytes of RAM memory used by the sparse format:',
      getsizeof(X_train_v1)/1000000)

MegaBytes of RAM memory used by the raw text format: 54.864133
MegaBytes of RAM memory used by the dense matrix format: 7485.440144
MegaBytes of RAM memory used by the sparse format: 4.8e-05
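The claim that most entries are zero can be checked directly. The short sketch below is not part of the original code; it uses the nnz attribute of the sparse matrix X_train_v1 created above. With the numbers reported earlier (5,301,561 stored elements out of 40,000 × 23,392 cells), the density comes to roughly 0.6%:

# measure how sparse the vectorized training data really is
n_rows, n_cols = X_train_v1.shape
density = X_train_v1.nnz / (n_rows * n_cols)  # fraction of non-zero entries
print(f"Non-zero entries: {X_train_v1.nnz}, density: {density:.2%}")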

As expected, the sparse format requires far less memory, more specifically 0.000048 megabytes, while the dense matrix occupies about 7 gigabytes. The dense matrix will not be used again and can thus be deleted to free up this significant amount of memory:

# delete the dense matrix
del X_train_v1_dense

Building a Prediction Pipeline
Now that the training data has been vectorized, the next step is to build a first prediction pipeline. To do this, you will use a type of classifier called a Naive Bayes classifier. The Naive Bayes classifier uses the probabilities of certain words or phrases occurring in a document to predict the likelihood of the document belonging to a certain class. The "naive" part of the name comes from the assumption that the presence of a particular word in a document is independent of the presence of any other word. This is a strong assumption, but it allows the algorithm to be trained very quickly and effectively.

Classifier
In ML, a classifier is a model that is used to distinguish data points into different categories or classes. The goal of a classifier is to learn from labeled training data, and then make predictions about the class label for new data.

The following code uses the implementation of the Naive Bayes Classifier (MultinomialNB) from the sklearn library to train a supervised learning model on the vectorized IMDb training data:

from sklearn.naive_bayes import MultinomialNB

model_v1 = MultinomialNB()  # a Naive Bayes Classifier
model_v1.fit(X_train_v1, Y_train)  # fit the classifier on the vectorized training data

from sklearn.pipeline import make_pipeline

# create a prediction pipeline: first vectorize using vectorizer_v1, then use model_v1 to predict
prediction_pipeline_v1 = make_pipeline(vectorizer_v1, model_v1)

For example, this code will produce a result array with the first element being "1" for the positive review and the second being "0" for the negative review:

prediction_pipeline_v1.predict([
    'One of the best movies of the year. Excellent cast and very interesting plot.',
    'I was very disappointed with this film. I lost all interest after 30 minutes'])

array([1, 0], dtype=int64)

The pipeline correctly predicts a positive label for the first review and a negative label for the second review. The built-in function predict_proba() can be used to obtain the probabilities that the pipeline assigns to each of the two possible labels. The first element is the probability that "0" will be assigned and the second element is the probability that "1" will be assigned:

prediction_pipeline_v1.predict_proba([
    'One of the best movies of the year. Excellent cast and very interesting plot.',
    'I was very disappointed with this film. I lost all interest after 30 minutes'])

array([[0.08310769, 0.91689231],
       [0.83173475, 0.16826525]])

The model is 8.3% certain the first review is negative and 91.7% certain it is positive. Likewise, it is 83.2% certain the second review is negative and 16.8% certain it is positive.

Figure 3.7: Pie charts showing the prediction probabilities for the two reviews.

The next step is to test the accuracy of this new pipeline on the reviews in the IMDb testing set. The output is an array of all the predicted labels for the reviews given in the test data:

# use the pipeline to predict the labels of the testing data
predictions_v1 = prediction_pipeline_v1.predict(X_test_text)  # vectorize the text data, then predict
predictions_v1

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

Python provides multiple tools to analyze and visualize the results of classification pipelines. Examples include the accuracy_score() function from sklearn and the "confusion matrix" visualization from the scikit-plot library. There are also other evaluation metrics such as precision, recall, specificity, sensitivity, and F1 score which, depending on the case, can be computed from the confusion matrix. The following output shows the accuracy achieved on the test set:

from sklearn.metrics import accuracy_score
accuracy_score(Y_test, predictions_v1)  # get the achieved accuracy

0.8468
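The additional metrics mentioned above (precision, recall, and F1 score) can also be computed directly with sklearn. The sketch below is not part of the lesson's code, but it only relies on the Y_test labels and predictions_v1 produced above; the exact values are not reported in the lesson:

# summarize precision, recall and F1 score for both classes
from sklearn.metrics import classification_report
print(classification_report(Y_test, predictions_v1, target_names=['neg', 'pos']))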

%%capture
!pip install scikit-plot  # install the scikit-plot library, if it is missing

import scikitplot  # import the library

class_names = ['neg', 'pos']  # pick intuitive names for the 0 and 1 labels

# plot the confusion matrix
scikitplot.metrics.plot_confusion_matrix(
    [class_names[i] for i in Y_test],          # actual labels
    [class_names[i] for i in predictions_v1],  # predicted labels
    title="Confusion Matrix",  # title to use
    cmap="Purples",            # color palette to use
    figsize=(5,5)              # figure size
);

The confusion matrix contains the counts of actual vs. predicted classifications. In a binary classification task (i.e. a problem with two labels, such as the IMDb task), the confusion matrix will have four cells:
• True Negatives (upper left): the number of times the classifier correctly predicted the negative class.
• False Positives (upper right): the number of times the classifier incorrectly predicted the positive class for a review that was actually negative.
• False Negatives (lower left): the number of times the classifier incorrectly predicted the negative class for a review that was actually positive.
• True Positives (lower right): the number of times the classifier correctly predicted the positive class.

Figure 3.8: Confusion matrix results of the Naive Bayes classifier on the testing data using the IMDb dataset (true negatives: 2164, false positives: 331, false negatives: 435, true positives: 2070).

The results reveal that even though this first pipeline achieves a competitive accuracy of 84.68%, it still misclassifies hundreds of reviews. There are 331 incorrect predictions in the upper right cell and 435 incorrect predictions in the lower left cell, a total of 766 incorrect predictions. The first step toward improving performance is to study the behavior of the prediction pipeline, in order to reveal how it processes and understands text.

Accuracy
Accuracy is the ratio of correct predictions to the total number of predictions.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
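Plugging the four counts shown in figure 3.8 into the accuracy formula above reproduces the reported 84.68%; precision and recall for the positive class follow the same pattern. This small check is an addition to the lesson's code:

# read the four cells off the confusion matrix in figure 3.8
tn, fp, fn, tp = 2164, 331, 435, 2070

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 4234 / 5000 = 0.8468
precision = tp / (tp + fp)                  # share of predicted positives that were correct
recall = tp / (tp + fn)                     # share of actual positives that were found
print(accuracy, precision, recall)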

Explaining Black-Box Predictors
The Naive Bayes Classifier uses simple mathematical formulas to combine the probabilities of thousands of words and deliver its predictions. Despite its simplicity, it is still unable to deliver an intuitive, user-friendly explanation of exactly how it predicts a positive or negative label for a specific piece of text. Compare that to decision tree classifiers, which are more intuitive, as they represent the learned decision rules in a tree-like structure, making it easier for people to understand how the classifier arrived at its predictions. The tree structure also allows for a visual representation of the decisions being made at each branch, which can be useful in understanding the relationships between input features and the target variable.

The lack of explainability is an even bigger challenge for more complex algorithms, such as those based on ensembles (combinations of multiple algorithms) or neural networks. Without explainability, supervised learning algorithms are reduced to black-box predictors: even though they understand the text well enough to predict its label, they are unable to communicate how they make their decisions. A significant amount of research has been devoted to addressing this challenge by designing explainability methods that can interpret black-box models. One of the most popular methods is LIME (Local Interpretable Model-Agnostic Explanations).

LIME (Local Interpretable Model-Agnostic Explanations)
LIME is a method for explaining the predictions made by black-box models. It does this by looking at one data point at a time and making small changes to it to see how it affects the model's prediction. LIME then uses this information to train a simple and understandable model, such as a linear regression, to explain the prediction. For text data, LIME identifies the words or phrases that have the biggest impact on the prediction.

A Python implementation is shown below:

%%capture
!pip install lime  # install the lime library, if it is missing

from lime.lime_text import LimeTextExplainer

# create a local explainer for explaining individual predictions
explainer_v1 = LimeTextExplainer(class_names=class_names)

# an example of an obviously negative review
easy_example = 'This movie was horrible. The actors were terrible and the plot was very boring.'

# use the prediction pipeline to get the prediction probabilities for this example
print(prediction_pipeline_v1.predict_proba([easy_example]))

[[0.99874831 0.00125169]]

As expected, the predictor delivers a very confident negative prediction for this easy example.

# explain the prediction for this example
# focus the explainer on the 10 most influential features
exp = explainer_v1.explain_instance(easy_example.lower(),
                                    prediction_pipeline_v1.predict_proba,
                                    num_features=10)

# print the words with the strongest influence on the prediction
exp.as_list()

[('terrible', -0.07046118794796816),
 ('horrible', -0.06841672591649835),
 ('boring', -0.05909016205135171),
 ('plot', -0.024063095577996376),
 ('was', -0.014436071624747861),
 ('movie', -0.011956911011210977),
 ('actors', -0.011682594571408675),
 ('this', -0.009712387273986628),
 ('very', 0.008956707731803237),
 ('were', -0.008897098392433257)]

The score of each word represents a coefficient in the simple linear regression model that was used to deliver the explanation. A more visual representation can be obtained as follows:

# visualize the impact of the most influential words
fig = exp.as_pyplot_figure()

Figure 3.9: The words with the highest influence on the prediction (a bar chart of the coefficients listed above, titled "Local explanation for class pos").

A negative coefficient increases the probability of the negative class, while a positive coefficient decreases it. For instance, the words 'horrible', 'terrible', and 'boring' had the strongest impact on the model's decision to predict a negative label. The word 'very' slightly pushed the model in a different (positive) direction, but it was not nearly enough to change the decision.

To a human observer, it might look strange that sentiment-free words such as 'plot' or 'was' seem to have relatively high coefficients. However, it is important to remember that machine learning does not always follow human common sense. These high coefficients may indeed reveal flaws in the algorithm's logic and could be responsible for some of the predictor's mistakes. Alternatively, the coefficients may be indicative of latent but informative predictive patterns. For instance, it may indeed be the case that human reviewers are more likely to use the word 'plot' or use the past tense ('was') when speaking in a negative context.

The LIME Python library can also visualize the explanations in other ways. For example:

exp.show_in_notebook()

Figure 3.10: Other visual representations — the prediction probabilities (neg: 1.00, pos: 0.00), the per-word coefficients, and the review text with the most influential words highlighted.

The review used in the previous example was obviously negative and easy to predict. Consider the following more challenging review, taken from the testing set of the IMDb data, which can confuse the algorithm:

# an example of a positive review that is misclassified as negative by prediction_pipeline_v1
mistake_example = X_test_text[4600]
mistake_example

"I personally thought the movie was pretty good, very good acting by Tadanobu Asano of Ichi the Killer fame. I really can't say much about the story, but there were parts that confused me a little too much, and overall I thought the movie was just too lengthy. Other than that however, the movie contained superb acting great fighting and a lot of the locations were beautifully shot, great effects, and a lot of sword play. Another solid effort by Tadanobu Asano in my opinion. Well I really can't say anymore about the movie, but if you're only outlook on Asian cinema is Crouching Tiger Hidden Dragon or House of Flying Daggers, I would suggest you trying to rent it, but if you're a die-hard Asian cinema fan I would say this has to be in your collection very good Japanese film."

# get the correct label for this example
print('Correct Label:', class_names[Y_test[4600]])

# get the prediction probabilities for this example
print('Prediction Probabilities for neg, pos:',
      prediction_pipeline_v1.predict_proba([mistake_example]))

Correct Label: pos
Prediction Probabilities for neg, pos: [[0.8367931 0.1632069]]

Even though this is clearly a positive review, the pipeline reported a very confident negative prediction with a probability of 83%. The explainer can now be used to provide insight into why the predictor made this erroneous decision:

# explain the prediction for this example
exp = explainer_v1.explain_instance(mistake_example,
                                    prediction_pipeline_v1.predict_proba,
                                    num_features=10)

# visualize the explanation
fig = exp.as_pyplot_figure()

Figure 3.11: Words that influenced the erroneous decision (the ten most influential words, including 'Asano', 'Asian', 'acting', 'movie', 'beautifully', 'superb', 'great', 'outlook', 'solid', and 'Ichi').

Even though the predictor correctly captures the positive influence of certain words such as 'beautifully', 'great', and 'superb', it ultimately makes a negative decision based on multiple words that seem to have no obvious negative sentiment (e.g. 'Asano', 'Asian', 'movie', 'acting'). This demonstrates significant flaws in the logic that the predictor utilizes to classify the vocabulary in the text of the given reviews. The next section demonstrates how improving this logic can significantly boost the predictor's performance.

Improving Text Vectorization
The first version of the prediction pipeline used the CountVectorizer tool to simply count the number of times that each word appears in each review. This approach ignores two important facts about human language:
• The meaning and importance of a word can change based on the words that surround it.
• The frequency of a word within a document is not always an accurate representation of its importance. For instance, even though two occurrences of the word 'great' may be a strong positive indicator in a document with 100 words, it is far less important in a larger document with 1000 words.

This section will demonstrate how text vectorization can be improved to take these two facts into account. The following code imports three different Python libraries that will be used to achieve this:
• nltk and gensim: two popular libraries used for various Natural Language Processing (NLP) tasks.
• re: a library used to search and process text using regular expressions.

Regular Expression
A regular expression is a pattern of text used for matching and manipulating strings. It provides a concise and flexible way to specify text patterns and is widely used in text processing and data analysis.

%%capture
!pip install nltk  # install nltk
!pip install gensim  # install gensim

import nltk  # import nltk
nltk.download('punkt')  # install nltk's tokenization tool, used to split a text into sentences

import re  # import re

# import tools from the gensim library
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

Detecting Phrases
The following function can be used to split a given document into a list of tokenized sentences, where each tokenized sentence is represented as a list of words:

Tokenization
Tokenization is the process of breaking up textual data into pieces such as words, sentences, symbols and other elements called tokens.

# convert a given doc to a list of tokenized sentences
def tokenize_doc(doc: str):
    return [re.findall(r'\b\w+\b', sent.lower())
            for sent in nltk.sent_tokenize(doc)]

The sent_tokenize() function from the nltk library splits the document into a list of sentences. Each sentence is then lowercased and fed to the findall() function of the re library, which locates occurrences of the '\b\w+\b' regular expression. In this expression:

• \w matches all alphanumeric characters (a-z, A-Z, 0-9) and the underscore character. \w+ is used to capture "one or more" \w characters. So, in the string "hello, 123 world!", the pattern \w+ would match the tokens "hello", "123", and "world".
• \b represents the boundary between a \w character and a non-\w character, as well as the start or end of the given string. For example, the pattern \bcat\b would match the word "cat" in the string "The cat is cute", but it would not match the "cat" in the string "The category is pets".

Let's see an example of tokenization, using the tokenize_doc() function on the string provided in the raw_text variable:

raw_text = 'The movie was too long. I fell asleep after the first 2 hours.'
tokenized_sentences = tokenize_doc(raw_text)
tokenized_sentences

[['the', 'movie', 'was', 'too', 'long'],
 ['i', 'fell', 'asleep', 'after', 'the', 'first', '2', 'hours']]

The tokenize_doc() function can now be combined with the Phrases tool from the gensim library to create a phrase model, a model that can identify multi-word phrases in a given sentence. The following code utilizes the IMDb training data (X_train_text) to build such a model:

sentences = []  # list of all the tokenized sentences across all the docs in this dataset
for doc in X_train_text:  # for each doc in this dataset
    sentences += tokenize_doc(doc)  # get the list of tokenized sentences in this doc

# build a phrase model on the given data
imdb_phrase_model = Phrases(sentences,                                # (1)
                            connector_words=ENGLISH_CONNECTOR_WORDS,  # (2)
                            scoring='npmi',                           # (3)
                            threshold=0.25).freeze()                  # (4)

As shown above, the Phrases() function accepts four parameters:
(1) The list of tokenized sentences from the given document collection.
(2) A list of common English connector words that appear frequently in phrases (e.g. 'of', 'the'). These words do not carry a positive or negative value of their own, but can add sentiment depending on the context, so they are treated differently.
(3) A scoring function used to determine if a sequence of words should be included in the same phrase. The code above uses the popular Normalized Pointwise Mutual Information (NPMI) measure for this purpose. NPMI is based on the co-occurrence frequency of the words in a candidate phrase and takes a value between -1 (complete independence) and +1 (complete co-occurrence).
(4) A threshold for the scoring function. Phrases with a lower score are ignored. In practice, this threshold can be tuned to identify the value that yields the best results for a downstream application (e.g. predictive modeling).

The freeze() suffix converts the phrase model into an unchangeable ("frozen") but much faster format.

When applied to the two tokenized sentence examples shown above, this phrase model produces the following results:

imdb_phrase_model[tokenized_sentences[0]]

['the', 'movie', 'was', 'too_long']

imdb_phrase_model[tokenized_sentences[1]]

['i', 'fell_asleep', 'after', 'the', 'first', '2_hours']

The phrase model identifies three phrases: 'too_long', 'fell_asleep', and '2_hours'. All three carry more information than their individual words. For example, 'too_long' clearly carries a negative sentiment, even though the words 'too' or 'long' by themselves do not. Similarly, even though seeing the word 'asleep' in a movie review is likely negative evidence, the phrase 'fell_asleep' delivers a much clearer message. Finally, '2_hours' captures a much more specific context than the words '2' and 'hours'.

Figure 3.12: How tokenized phrases carry sentiment and context that the individual words lack ('too' + 'long' → 'too_long' adds negative sentiment, 'fell' + 'asleep' → 'fell_asleep' is clearly negative, '2' + 'hours' → '2_hours' captures a specific context).

The following function uses this phrase-detection capability to annotate phrases in a given document:

def annotate_phrases(doc: str, phrase_model):
    sentences = tokenize_doc(doc)  # split the document into tokenized sentences
    tokens = []  # list of all the words and phrases found in the doc
    for sentence in sentences:  # for each sentence
        # use the phrase model to get tokens and append them to the list
        tokens += phrase_model[sentence]
    return ' '.join(tokens)  # join all the tokens together to create a new annotated document

The following code uses the annotate_phrases() function to annotate both the training and testing reviews from the IMDb dataset:

# annotate all the train and test reviews
X_train_text_annotated = [annotate_phrases(doc, imdb_phrase_model) for doc in X_train_text]
X_test_text_annotated = [annotate_phrases(text, imdb_phrase_model) for text in X_test_text]
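As a quick illustrative check (not in the original code), annotate_phrases() can also be applied to the raw_text example from earlier; based on the phrase-model output shown above, it should join the detected phrases with underscores:

# annotate the earlier toy example with the trained phrase model
print(annotate_phrases(raw_text, imdb_phrase_model))
# expected, based on the results shown above:
# the movie was too_long i fell_asleep after the first 2_hours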

# an example of an annotated document from the imdb training data
X_train_text_annotated[0]

'i grew up b 1965 watching and loving the thunderbirds all my_mates at school watched we played thunderbirds before school during lunch and after school we all wanted to be virgil or scott no_one wanted to be alan counting down from 5 became an art form i took my children to see the movie hoping they would get_a_glimpse of what i loved as a child how bitterly disappointing the only high_point was the snappy theme_tune not that it could compare with the original score of the thunderbirds thankfully early saturday mornings one television_channel still plays reruns of the series gerry_anderson and his wife created jonatha frakes should hand in his directors chair his version was completely hopeless a waste of film utter_rubbish a cgi remake may be acceptable but replacing marionettes with homo_sapiens subsp sapiens was a huge error of judgment'

Using TF-IDF for Text Vectorization
The frequency of a word within a document is not always an accurate representation of its importance. A better way to represent frequency is the popular TF-IDF measure. TF-IDF, which stands for "Term Frequency - Inverse Document Frequency", uses a simple mathematical formula to determine the importance of tokens (i.e. words or phrases) in a document based on two factors:
• the term frequency of the token in the document, measured as the number of times the token appears in the document divided by the total number of tokens in the document.
• the token's inverse document frequency, computed by dividing the total number of documents in the dataset by the number of documents that contain the token.

The first factor avoids the overestimation of the importance of terms that appear in longer documents. The second factor penalizes terms that appear in too many documents, which helps to adjust for the fact that some words are more common than others.

Term Frequency - Inverse Document Frequency (TF-IDF)
TF-IDF is a statistical method used to determine the importance of tokens in a document:

TF = (number of times the term appears in the document) / (number of words in the document)
IDF = (number of documents in the dataset) / (number of documents containing the term)
TF-IDF value = TF * IDF

Figure 3.13: Words and terms in a document corpus.

TfidfVectorizer Tool
The sklearn library provides a tool that supports this type of TF-IDF vectorization. The TfidfVectorizer tool can be used to vectorize the annotated text:

from sklearn.feature_extraction.text import TfidfVectorizer

# train a TF-IDF model with the IMDb training dataset
vectorizer_tf = TfidfVectorizer(min_df=10)
vectorizer_tf.fit(X_train_text_annotated)
X_train_tf = vectorizer_tf.transform(X_train_text_annotated)
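As a worked illustration of the formula above, consider a hypothetical token that appears 2 times in a 100-word review and occurs in 500 of the 40,000 training reviews; the numbers are made up for the example, and sklearn's TfidfVectorizer actually uses a smoothed, logarithmic variant of IDF rather than this plain ratio:

# a hand-computed TF-IDF value for a hypothetical token
tf = 2 / 100        # term frequency: occurrences divided by document length
idf = 40000 / 500   # inverse document frequency: total documents divided by documents containing the token
print(tf * idf)     # 1.6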

This new vectorizer can now be combined with the same Naive Bayes Classifier to build a new predictive pipeline and apply it to the IMDb testing data:

# train a new Naive Bayes Classifier on the newly vectorized data
model_tf = MultinomialNB()
model_tf.fit(X_train_tf, Y_train)

# create a new prediction pipeline
prediction_pipeline_tf = make_pipeline(vectorizer_tf, model_tf)

# get predictions using the new pipeline
predictions_tf = prediction_pipeline_tf.predict(X_test_text_annotated)

# print the achieved accuracy
accuracy_score(Y_test, predictions_tf)

0.8858

This new pipeline achieves an accuracy of 88.58%, a significant improvement over the 84.68% reported by the previous one. This improved pipeline can now be used to revisit the test example that was misclassified by the first pipeline:

# get the review example that confused the previous algorithm
mistake_example_annotated = X_test_text_annotated[4600]
print('\nReview: ', mistake_example_annotated)

# get the correct label for this example
print('\nCorrect Label:', class_names[Y_test[4600]])

# get the prediction probabilities for this example
print('\nPrediction Probabilities for neg, pos:',
      prediction_pipeline_tf.predict_proba([mistake_example_annotated]))

Review: i personally thought the movie was pretty good very good acting by tadanobu_asano of ichi_the_killer fame i really can_t say much about the story but there were parts that confused me a little_too much and overall i thought the movie was just too lengthy other_than that however the movie contained superb_acting great fighting and a lot of the locations were beautifully_shot great effects and a lot of sword play another solid effort by tadanobu_asano in my opinion well i really can_t say anymore about the movie but if you re only outlook on asian_cinema is crouching_tiger hidden_dragon or house of flying_daggers i_would suggest you trying to rent_it but if you re a die_hard asian_cinema fan i would say this has to be in your_collection very good japanese film

Correct Label: pos

Prediction Probabilities for neg, pos: [[0.32116538 0.67883462]]

The new pipeline confidently predicts the correct positive label for this review. The following code uses the LIME explainer to explain the logic behind this prediction:

# create an explainer
explainer_tf = LimeTextExplainer(class_names=class_names)

# explain the prediction of the second pipeline for this example
exp = explainer_tf.explain_instance(mistake_example_annotated,
                                    prediction_pipeline_tf.predict_proba,
                                    num_features=10)

# visualize the results
fig = exp.as_pyplot_figure()

Figure 3.14: Word influence for the TF-IDF and Naive Bayes Classifier combination (the most influential tokens include 'superb_acting', 'beautifully_shot', 'can_t', 'very good', 'die_hard', 'your_collection', 'other_than', 'solid', 'outlook', and 'great').

The results verify that the new pipeline follows a significantly more intelligent logic. It correctly identifies the positive sentiment of phrases like 'superb_acting', 'beautifully_shot' and 'very good'. It is also not misguided by the words that erroneously drove the first pipeline toward a negative prediction.

The performance of the predictive pipeline can be further improved in multiple ways, such as replacing the Naive Bayes classifier with more sophisticated methods and tuning the parameters of these methods to maximize their potential. Another option would be to experiment with alternative vectorization techniques that are not based on token frequency, such as the word and document embeddings that will be explored in the following lesson.
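As an example of the first suggestion, the sketch below swaps the Naive Bayes classifier for a logistic regression model while reusing the TF-IDF vectorizer and annotated data built in this lesson. This is an illustrative sketch rather than part of the lesson, and the resulting accuracy is not reported here:

# reuse vectorizer_tf and the annotated data, but train a different classifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

model_lr = LogisticRegression(max_iter=1000)  # a commonly used linear classifier
model_lr.fit(X_train_tf, Y_train)             # fit on the TF-IDF features built earlier

# combine the existing vectorizer with the new model into a prediction pipeline
prediction_pipeline_lr = make_pipeline(vectorizer_tf, model_lr)
predictions_lr = prediction_pipeline_lr.predict(X_test_text_annotated)
print(accuracy_score(Y_test, predictions_lr))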

Exercises

1. Read the sentences and tick ✓ True or False.
   1. In supervised learning, you use labeled datasets to train the model. (True / False)
   2. Vectorization is a technique for converting data from numeric vector format to raw data. (True / False)
   3. The sparse format requires far less memory than the dense matrix. (True / False)
   4. The Naive Bayes Classifier algorithm is used to build a prediction pipeline. (True / False)
   5. The frequency of a word within a document is the only accurate representation of its importance. (True / False)

2. Explain the reason the dense matrix format requires more space in memory than the sparse format.

3. Analyze how the two mathematical factors in TF-IDF are utilized to evaluate the importance of a word in a document.

4. You are given a NumPy array, X_train_text, that includes one document in each row. You are also given a second array, Y_train, that includes the labels for the documents in X_train_text. Complete the following code so that it uses TF-IDF to vectorize the data, trains a MultinomialNB classification model on the vectorized version, and then combines the vectorizer and classification model into a single prediction pipeline.

from ________.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import ________

vectorizer = ________(min_df=10)
vectorizer.fit(________)  # fits the vectorizer on the training data (X_train_text)

X_train = vectorizer.________(________)  # uses the fitted vectorizer to vectorize the data

model_MNB = MultinomialNB()  # a Naive Bayes Classifier
model_MNB.fit(X_train, ________)  # fits the classifier on the vectorized training data

prediction_pipeline = make_pipeline(________, ________)

5. Complete the following code so that it builds a LimeTextExplainer for the prediction pipeline that you built in the previous exercise and uses the explainer to explain the prediction for a specific text example.

from ________ import LimeTextExplainer

text_example = "I really enjoyed this movie, the actors were excellent"
class_names = ['neg', 'pos']

# creates a local explainer for explaining individual predictions
explainer = ________(class_names=class_names)

# explains the prediction for this example
exp = explainer.________(text_example.lower(),
                         prediction_pipeline.________,
                         ________=10)  # focuses the explainer on the 10 most influential features

print(exp.________)  # prints the words with the highest influence on the prediction
