Lesson: Supervised Learning for Image Analysis - Artificial Intelligence - Third Secondary Grade

4. Image Recognition

In this unit, you will learn about supervised and unsupervised learning for image recognition by creating and training a model to classify or cluster images of different animal heads, as an example. You will also learn about image generation and how to alter images or complete their missing content while maintaining realism.

Ministry of Education, 2024-1446

Learning Objectives
In this unit, you will learn to:
> Preprocess images and extract their features.
> Train a supervised learning model to classify images.
> Define the structure of a neural network.
> Train an unsupervised learning model to cluster images.
> Generate images based on a text prompt.
> Realistically complete missing parts of an image.

Tools
> Jupyter Notebook
> Google Colab

Lesson 1 Supervised Learning for Image Analysis


Link to digital lesson: www.ien.edu.sa

Supervised Learning for Computer Vision

Computer vision is a subfield of Artificial Intelligence that focuses on teaching computers how to interpret and understand the visual world. It involves using digital images and videos to train machines to recognize and analyze visual information, such as objects, people, and scenery. The ultimate goal of computer vision is to enable machines to "see" the world as humans do and use this information to make decisions or take actions.

Computer vision has a wide range of applications, such as:
• Medical Imaging: Computer vision can help doctors and healthcare professionals diagnose diseases by analyzing medical images, such as X-rays, MRIs, and CT scans.
• Autonomous Vehicles: Self-driving cars and drones use computer vision to recognize traffic signals, road patterns, pedestrians, and obstacles on the road and in the air, enabling them to navigate safely and efficiently.
• Quality Control and Inspection: Computer vision is used to inspect products and identify defects in manufacturing processes. This is used in various industries, including automotive, electronics, and textiles.
• Robotics: Computer vision is used to help robots navigate and interact with their environment, including recognizing and manipulating objects.

Supervised and unsupervised learning are two main types of machine learning that are commonly used in computer vision applications. Both approaches involve training algorithms on large datasets of images or videos to enable machines to recognize and interpret visual information. Supervised learning and unsupervised learning were introduced in unit 3, lessons 1 and 2, and were both applied in NLP and NLG. In this lesson, they will be applied to image analysis.

Unsupervised learning involves training algorithms on unlabeled datasets, where no explicit labels or categories are provided. The algorithm then learns to identify similar patterns in the data without any prior knowledge of the labels. For example, an unsupervised learning algorithm might be used to group similar images together based on common features, such as color, texture, or shape. Unsupervised learning will be detailed in lesson 2.

[Figure 4.1: Image classification with computer vision. A raw image is passed to a machine learning classification model, which produces a labeled output: 98% Arabian leopard, 1% apple, 1% car.]


In contrast, supervised learning involves training algorithms on labeled datasets, where each image or video is assigned a specific label or category. The algorithm then learns to recognize patterns and features that are associated with each label, allowing it to accurately classify new images or videos. For example, a supervised learning algorithm might be trained to recognize different breeds of cats based on labeled images of each breed (e.g., see figure 4.1). Supervised learning is the focus of this lesson.

The process of supervised learning typically involves four key steps: data collection, labeling, training, and testing. During data collection and labeling, images or videos are collected and organized into a dataset. Then, each image or video is labeled with a corresponding class or category, such as "eagle" or "cat". During the training phase, the machine learning algorithm uses this labeled dataset to "learn" the patterns and features that are associated with each class or category. As more training data is presented to the algorithm, it becomes more accurate at recognizing the different classes in the dataset and improves its performance.

Once the model has been trained, it is tested on a separate set of images or videos to evaluate its performance. The testing set is different from the training set to ensure that the model is able to generalize to new data. For example, the data for a cat has properties such as weight, color, and breed. The accuracy of the model is then evaluated based on how well it performs on the testing set.

The above process is very similar to the one followed for supervised learning tasks on different types of data, such as text. However, visual data is generally considered harder to handle than text for multiple reasons, as described in Table 4.1.

Table 4.1: Challenges of visual data classification

Visual data is high-dimensional: Images contain a large amount of data, which makes them more difficult to process and analyze than textual data. While the basic elements of a text document are words, the elements of an image are pixels. As you will see in this chapter, even a small image can consist of thousands of pixels.

Visual data is noisy and very diverse: Images can be affected by noise, lighting, blurring, and other factors that make it difficult to accurately classify them. In addition, there is a wide variety of visual data, with many different objects, scenes, and contexts that can be difficult to accurately classify.

Visual data does not follow a strict structure: While text tends to follow specific rules for syntax and grammar, visual data does not have such constraints. This makes it harder and more computationally expensive to analyze.

As a result of these complexities, the effective classification of visual data requires specialized techniques. This unit covers techniques that utilize the geometric and color properties of images, as well as more advanced machine learning techniques based on neural networks.

Specifically, this first lesson demonstrates how Python can be used for:
• Loading a dataset of labeled images.
• Converting the images to a numeric format that can be used by computer vision algorithms.
• Splitting the numeric data into training and testing datasets.


• Analyzing the data to extract informative patterns and features.
• Using the transformed data to train classification models that can be used to predict the labels of new images.

The dataset you will be using includes 1,730 face images for 16 different types of animals, making it ideal for supervised learning and for demonstrating the aforementioned techniques.

Loading and Preprocessing Images

The following code imports a set of libraries that are used to load the images from the LHI-Animal-Faces dataset and convert them to a numeric format.

    %%capture
    import matplotlib.pyplot as plt  # used for visualization
    from os import listdir  # used to list the contents of a directory
    !pip install scikit-image  # used for image manipulation
    from skimage.io import imread  # used to read a raw image file (e.g. png or jpg)
    from skimage.transform import resize  # used to resize images
    # used to convert an image to the "unsigned byte" format
    from skimage import img_as_ubyte

Supervised learning algorithms require all the images in the dataset to have the same dimensions. Therefore, the following code reads the images from their input_folder and resizes each of them to the same (width x height) dimensions.
    def resize_images(input_folder: str, width: int, height: int):
        labels = []  # a list with the label for each image
        resized_images = []  # a list of resized images in np array format
        filenames = []  # a list of the original image file names
        for subfolder in listdir(input_folder):  # for each subfolder
            print(subfolder)
            path = input_folder + '/' + subfolder
            for file in listdir(path):  # for each image file in this subfolder
                image = imread(path + '/' + file)  # reads the image
                # resizes the image; resize() expects (rows, columns),
                # which makes no difference here because width == height
                resized = img_as_ubyte(resize(image, (width, height)))
                labels.append(subfolder[:-4])  # uses subfolder name without "Head" suffix
                resized_images.append(resized)  # stores the resized image
                filenames.append(file)  # stores the filename of this image
        return resized_images, labels, filenames


    # retrieves the images with their labels and resizes them to 100 x 100
    resized_images, labels, filenames = resize_images("Animal Face/Image", width=100, height=100)

    BearHead
    CatHead
    ChickenHead
    CowHead
    DeerHead
    DuckHead
    EagleHead
    ElephantHead
    LionHead
    MonkeyHead
    Natural
    PandaHead
    PigeonHead
    RabbitHead
    SheepHead
    TigerHead
    WolfHead

These are the names of the folders. Without the "Head" suffix, they serve as the labels for the images contained in them.

The imread() function creates an "RGB" format of the image. This format is widely used because it allows for the representation of a wide range of colors. In the RGB color system, the letters R, G, and B mean that the format contains three major color components, namely red (R = Red), green (G = Green), and blue (B = Blue). Each pixel is represented by three 8-bit channels (one for red, one for green, and one for blue), each of which can take a value between 0 and 255. This 0-255 format is also known as the "unsigned byte" format. The combination of these three channels allows for the representation of a wide range of colors in the pixel. For example, a pixel with the value (255, 0, 0) would be fully red, a pixel with the value (0, 255, 0) would be fully green, and a pixel with the value (0, 0, 255) would be fully blue. A pixel with the value (255, 255, 255) would be white, and a pixel with the value (0, 0, 0) would be black.

In the RGB system, pixel values are arranged in a two-dimensional grid, with rows and columns corresponding to the y and x coordinates of the pixels in the image. The resulting grid is referred to as the "image matrix."

For example, consider the image in figure 4.2 and the associated code below:

    # reads an image file, stores it in a variable and
    # shows it to the user in a window
    image = imread('Animal Face/Image/LionHead/lioni78.jpg')
    plt.imshow(image)
    image.shape

    (169, 169, 3)

[Figure 4.2: Original lion head image]

Printing the image shape reveals a 169x169 matrix, for a total of 28,561 pixels. The "3" in the third position of the shape represents the 3 channels (Red/Green/Blue) of the RGB system. For example, the following code would print the RGB value of the first pixel of this image:

    # the pixel at the first column of the first row
    print(image[0][0])

    [102 68 66]
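To make the image matrix layout concrete without needing an image file, the following minimal sketch builds a tiny 2x2 RGB image by hand. The array name tiny and its pixel values are invented for illustration; only NumPy is assumed.

```python
import numpy as np

# A hypothetical 2x2 RGB image in the unsigned byte (0-255) format.
# Axis 0 = rows, axis 1 = columns, axis 2 = the (R, G, B) channels.
tiny = np.array(
    [[[255, 0, 0], [0, 255, 0]],        # row 0: red, green
     [[0, 0, 255], [255, 255, 255]]],   # row 1: blue, white
    dtype=np.uint8,
)

print(tiny.shape)   # (rows, columns, channels)
print(tiny[0][0])   # the pixel at the first column of the first row
print(tiny[1, 0])   # the same kind of lookup with NumPy's tuple indexing

n_pixels = tiny.shape[0] * tiny.shape[1]   # number of pixels in the image
n_values = tiny.size                       # number of stored values (pixels x channels)
red_channel = tiny[:, :, 0]                # the 2x2 grid of red intensities only
```

Slicing out a single channel, as in the last line, is a convenient way to inspect how much red, green, or blue each pixel contains.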


Resizing has the effect of converting RGB images to a float-based format:

    resized = resize(image, (100, 100))
    print(resized.shape)
    print(resized[0][0])

    (100, 100, 3)
    [0.40857161 0.27523827 0.26739514]

Although the image has now indeed been resized to a 100x100 matrix, the 3 RGB values of each pixel have been normalized to a value between 0 and 1. The image can be transformed back to the original unsigned byte format via the following code:

    resized = img_as_ubyte(resized)
    print(resized.shape)
    print(resized[0][0])
    print(image[0][0])

    (100, 100, 3)
    [104 70 68]
    [102 68 66]

The RGB values of the resized pixel are slightly different from those in the original image, which is a common effect of resizing. Displaying the resized image also reveals that it is slightly less clear, as shown in figure 4.3. Again, this is a result of compressing the 169x169 matrix to a 100x100 format.

    # displays the resized image
    plt.imshow(resized);

[Figure 4.3: Resized lion head image]

Before proceeding with the training of supervised learning algorithms, it is good practice to check if any of the images in the dataset violates the (100,100,3) format:

    violations = [index for index in range(len(resized_images))
                  if resized_images[index].shape != (100, 100, 3)]
    violations

    [455, 1587]

The code reveals two such images. This is unexpected, given that the resize_images() function was applied to all images in the dataset. The following code snippets print the two images, along with their dimensions and file names:


    pos1 = violations[0]
    pos2 = violations[1]

    print(filenames[pos1])
    print(resized_images[pos1].shape)
    plt.imshow(resized_images[pos1])
    plt.title(labels[pos1])

    cow1.gif
    (100, 100, 4)

[Figure 4.4: RGBA image (titled "Cow")]

    print(filenames[pos2]);
    print(resized_images[pos2].shape);
    plt.imshow(resized_images[pos2]);
    plt.title(labels[pos2]);

    tiger0000000168.jpg
    (100, 100)

[Figure 4.5: Image with misleading yellow/blue color map (titled "Tiger")]

The first image has a shape of (100, 100, 4). The "4" reveals that the image has an "RGBA" rather than RGB format. This is an extended format that contains a fourth channel, called the "Alpha" channel, which represents the transparency of each pixel. For example:

    # prints the first pixel of the RGBA image
    # a value of 255 reveals that the pixel is
    # not transparent at all
    resized_images[pos1][0][0]

    array([135, 150, 84, 255], dtype=uint8)

The second image has a shape of (100, 100). The lack of the third dimension reveals that the image has a grayscale rather than RGB format. The misleading yellow/blue coloring shown in figure 4.5 is due to a color map that imshow applies by default to grayscale images. It can be switched off as follows:

    plt.imshow(resized_images[pos2], cmap='gray')

[Figure 4.6: Grayscale image]


Grayscale images have only one channel (rather than the 3 RGB channels). Each pixel value is a single number ranging from 0 to 255. The pixel value 0 represents black and the pixel value 255 represents white. For example:

    resized_images[pos2][0][0]

    100

As an additional data quality check, the following code counts the frequency of each animal label in the dataset:

    # used to count the frequency of each element in a list
    from collections import Counter
    label_cnt = Counter(labels)
    label_cnt

    Counter({'Bear': 101,
             'Cat': 160,
             'Chicken': 100,
             'Cow': 104,
             'Deer': 103,
             'Duck': 103,
             'Eagle': 101,
             'Elephant': 100,
             'Lion': 102,
             'Monkey': 100,
             'Nat': 8,
             'Panda': 119,
             'Pigeon': 115,
             'Rabbit': 100,
             'Sheep': 100,
             'Tiger': 114,
             'Wolf': 100})

The outlier in the data can be seen clearly here: the "Nat" (Nature) category has only 8 elements, far fewer than any other category. The dataset contains images of both animals and nature in order to showcase outlier data. A quick inspection reveals that "Nat" is an outlier category with images of natural landscapes without any animal faces.

The following code removes the RGBA and grayscale images, as well as all the images from the "Nat" category, from the resized_images, labels, and filenames lists:

    N = len(labels)
    resized_images = [resized_images[i] for i in range(N)
                      if i not in violations and labels[i] != "Nat"]
    filenames = [filenames[i] for i in range(N)
                 if i not in violations and labels[i] != "Nat"]
    labels = [labels[i] for i in range(N)
              if i not in violations and labels[i] != "Nat"]
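The two quality checks described above (image format and category size) can be tried on a toy example before touching the real dataset. The following is a minimal sketch under stated assumptions: the lists toy_images and toy_labels and the helper describe_format() are hypothetical stand-ins, not part of the lesson's dataset or of any library.

```python
from collections import Counter
import numpy as np

# Hypothetical stand-ins for the real lists: four tiny "images" with labels.
toy_images = [
    np.zeros((4, 4, 3), dtype=np.uint8),  # RGB
    np.zeros((4, 4, 4), dtype=np.uint8),  # RGBA (extra alpha channel)
    np.zeros((4, 4), dtype=np.uint8),     # grayscale (no channel axis)
    np.zeros((4, 4, 3), dtype=np.uint8),  # RGB, but from the outlier category
]
toy_labels = ["Cat", "Cow", "Tiger", "Nat"]

def describe_format(img):
    """Names the color format implied by the shape of the array."""
    if img.ndim == 2:
        return "grayscale"
    return {3: "RGB", 4: "RGBA"}.get(img.shape[2], "unknown")

formats = [describe_format(img) for img in toy_images]
format_counts = Counter(formats)

# Keeps only the RGB images whose label is not the outlier category.
keep = [i for i in range(len(toy_labels))
        if formats[i] == "RGB" and toy_labels[i] != "Nat"]
clean_images = [toy_images[i] for i in keep]
clean_labels = [toy_labels[i] for i in keep]
print(format_counts, clean_labels)
```

Computing the list of indices to keep once, and then reusing it for every list, avoids repeating the same condition and guarantees that the lists stay aligned.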


The next step is to convert the resized_images and labels lists to numpy arrays, which is the format expected by many computer vision algorithms. The following code also uses the (X, y) names that are typically used to represent data and labels, respectively, in supervised learning tasks:

    import numpy as np
    X = np.array(resized_images)
    y = np.array(labels)
    X.shape

    (1720, 100, 100, 3)

The shape of the final X dataset reveals that it includes 1,720 images, all in the RGB format (3 channels) and all with the same 100x100 dimensions (10,000 pixels).

Finally, the train_test_split() function from the sklearn library can be used to split the dataset into training and testing sets:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.20,   # uses 20% of the data for testing
        shuffle=True,     # to randomly shuffle the data
        random_state=42,  # to ensure that data is always shuffled in the same way
    )

Given that the animal folders were loaded one at a time, the images from each folder are packed together in the above lists. This can be misleading for many algorithms, especially in the computer vision domain. Setting shuffle=True in the code above solves this issue. In general, it is recommended to randomly shuffle the data before proceeding with analysis.

Prediction without Feature Engineering

Although the steps followed in the previous section have indeed converted the data into a numeric format, the data is not yet in the standard one-dimensional format that is required by many machine learning algorithms. For instance, unit 3 described how each document had to be converted to a one-dimensional numeric vector before the data could be used for training and testing machine learning models. Instead, each data point in the dataset has a 3-dimensional format:

    X_train[0].shape

    (100, 100, 3)
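To see what it means to turn such a 3-dimensional array into a one-dimensional vector, here is a minimal sketch on a tiny 2x2 RGB array. The names tiny and batch and all values are invented for illustration; only NumPy is assumed.

```python
import numpy as np

# A hypothetical 2x2 RGB image whose values are simply 0..11.
tiny = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# Unrolls the array row by row, pixel by pixel, channel by channel.
flat = tiny.flatten()
print(tiny.shape, "->", flat.shape)
print(flat)

# The same idea applied to a whole batch of images at once.
batch = np.stack([tiny, tiny + 100])         # shape (2, 2, 2, 3)
batch_flat = batch.reshape(len(batch), -1)   # one row of 12 values per image

# Flattening only rearranges the values, so it can be undone.
restored = flat.reshape(tiny.shape)
```

Reshaping the whole batch with reshape(len(batch), -1) gives the same rows as flattening each image separately, which can be convenient for large datasets.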


The following code can be used to "flatten" each image into a one-dimensional vector. Each image is now represented as a flat numeric vector of 100 x 100 x 3 = 30,000 values:

    X_train_flat = np.array([img.flatten() for img in X_train])
    X_test_flat = np.array([img.flatten() for img in X_test])
    X_train_flat[0].shape

    (30000,)

This flat format can now be used with any standard classification algorithm, without any additional effort to engineer additional predictive features. An example of feature engineering for image data will be explored in the following section. The following code uses the Naive Bayes (NB) classifier that was also used to classify text data in unit 3:

    from sklearn.naive_bayes import MultinomialNB  # imports the Naive Bayes Classifier
    model_MNB = MultinomialNB()
    model_MNB.fit(X_train_flat, y_train)  # fits the model on the flat training data

    MultinomialNB()

    from sklearn.metrics import accuracy_score  # used to measure the accuracy
    pred = model_MNB.predict(X_test_flat)  # gets the predictions for the flat test set
    accuracy_score(y_test, pred)

    0.36046511627906974

The following code prints the confusion matrix of the results, to provide additional insight:

    %%capture
    !pip install scikit-plot

    import scikitplot
    scikitplot.metrics.plot_confusion_matrix(
        y_test,  # actual labels
        pred,    # predicted labels
        title="Confusion Matrix",
        cmap="Purples",
        figsize=(10, 10),
        x_tick_rotation=90,
        normalize=True  # to print percentages
    )


The normalized values help to view the elements as percentages.

[Figure 4.7: Confusion matrix of MultinomialNB algorithm performance. Normalized values; rows show the true label and columns show the predicted label for each of the 16 animal classes.]

The MultinomialNB algorithm achieves an accuracy of around 36%. While this might seem low, it has to be considered in the context of the fact that the dataset includes 16 different labels. This means that, assuming a relatively balanced dataset where each label covers 1/16 of the data, a random classifier that randomly assigns a label to each testing point would achieve an accuracy of around 6%. Therefore, a 36% accuracy is nearly 6 times better than a random guess!

Still, as shown in the following sections, this accuracy can be improved significantly. The confusion matrix also verifies that there is room for improvement. Indeed, the Naive Bayes model often mistakes pigeons for eagles (19% of the pigeon test images) or wolves for cats (12% of the wolf test images).

MultinomialNB algorithm: MultinomialNB is a machine learning algorithm used for classifying text or other data into different categories. It is based on the Naive Bayes algorithm, which is a simple and efficient method for solving classification problems.

The easiest way to try to improve the results is to leave the data as it is and experiment with different classifiers. One model which has been shown to work well with vectorized image data is the SGDClassifier from the sklearn library. During training, the SGDClassifier adjusts the weights of the model based on the training data. The goal is to find the set of weights that minimizes a "loss" function, which measures the difference between the predicted labels and the true labels in the training data.

SGDClassifier algorithm: The SGDClassifier is a machine learning algorithm used to classify data into different categories or groups. It is based on a technique called Stochastic Gradient Descent (SGD), which is an efficient method for optimizing and training various types of models, including classifiers.

The following code uses the SGDClassifier to train a model on the flat dataset:


    from sklearn.linear_model import SGDClassifier
    model_sgd = SGDClassifier()
    model_sgd.fit(X_train_flat, y_train)
    pred = model_sgd.predict(X_test_flat)
    accuracy_score(y_test, pred)

    0.46511627906976744

The SGDClassifier achieves a significantly higher accuracy of over 46%, despite the fact that it was trained on the exact same data as the MultinomialNB classifier. Because the SGDClassifier shuffles the training data randomly, the exact accuracy may differ slightly between runs. This demonstrates the potential benefits of experimenting with various classification algorithms to find the one that best fits each particular dataset. In that effort, it is also important to understand the strengths and weaknesses of each algorithm. For example, the SGDClassifier is known to perform better when the input data is scaled and the features are standardized. This is why you will be using standard scaling in your model.

Standard scaling: A preprocessing technique used in machine learning to scale the features of a dataset so that they have zero mean and unit variance.

The following code uses the StandardScaler tool from the sklearn library to scale the data:

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_flat_scaled = scaler.fit_transform(X_train_flat)
    # the scaler is fitted on the training data only and is then reused for the test data
    X_test_flat_scaled = scaler.transform(X_test_flat)
    print(X_train_flat[0])  # the values of the first image pre-scaling
    print(X_train_flat_scaled[0])  # the values of the first image post-scaling

    [144 142 151 ... 76 75 80]
    [0.33463473 0.27468959 0.61190285 ... -0.65170221 -0.62004162 -0.26774175]

A new model can now be trained and tested using the scaled datasets:

    model_sgd = SGDClassifier()
    model_sgd.fit(X_train_flat_scaled, y_train)
    pred = model_sgd.predict(X_test_flat_scaled)
    accuracy_score(y_test, pred)

    0.4906976744186046

The results indeed demonstrate an improvement after scaling. It is likely that further improvement can be achieved by experimenting with other algorithms and tuning their parameters to better fit the dataset.
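The idea behind standard scaling can be reproduced by hand with NumPy. The following minimal sketch uses invented toy data (the arrays train and test are hypothetical) and computes the same kind of per-feature statistics that a standard scaler stores when it is fitted; it is an illustration of the technique, not the library's code.

```python
import numpy as np

# Hypothetical toy data: 4 training samples and 2 test samples, 2 features each.
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0], [6.0, 40.0]])
test = np.array([[3.0, 25.0], [8.0, 50.0]])

# "Fitting" a standard scaler amounts to storing per-feature statistics.
mean = train.mean(axis=0)
std = train.std(axis=0)   # population standard deviation (ddof=0)

# "Transforming" applies z = (x - mean) / std with the stored statistics.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # the test set reuses the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for every feature
print(train_scaled.std(axis=0))   # approximately 1 for every feature
print(test_scaled)
```

Reusing the training statistics for the test set is the usual practice, because the test data is meant to play no role in fitting any part of the model, including its preprocessing.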


Prediction with Feature Selection

While the previous section focused on training models by simply flattening the data, this section will describe how the original data can be transformed to engineer smart features that capture key properties of the image data. Specifically, the section demonstrates a popular technique called the Histogram of Oriented Gradients (HOG).

Histogram of Oriented Gradients (HOG): HOG divides an image into small sections and analyzes the distribution of intensity changes within each section, in order to identify and understand the shape of an object in the image.

The first step towards engineering HOG features is to convert the RGB images to grayscale. This can be done with the rgb2gray() function from the scikit-image library:

    from skimage.color import rgb2gray  # used to convert a multi-color (rgb) image to grayscale
    # converts the training data
    X_train_gray = np.array([rgb2gray(img) for img in X_train])
    # converts the testing data
    X_test_gray = np.array([rgb2gray(img) for img in X_test])

    plt.imshow(X_train[0]);

[Figure 4.8: RGB image]

    plt.imshow(X_train_gray[0], cmap='gray');

[Figure 4.9: Grayscale image]

The new shape of each image is now 100x100, rather than the RGB-based 100x100x3 format:

    print(X_train_gray[0].shape)
    print(X_train[0].shape)

    (100, 100)
    (100, 100, 3)
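The idea of a "distribution of intensity changes" can be illustrated for a single small section of an image. The following is a simplified sketch with NumPy only, assuming a hypothetical 8x8 grayscale cell that contains a vertical edge. It is not the scikit-image implementation, which additionally groups cells into blocks and normalizes them.

```python
import numpy as np

# Hypothetical 8x8 grayscale cell containing a vertical edge:
# dark on the left half, bright on the right half.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0

# Intensity changes from row to row (gy) and from column to column (gx).
gy, gx = np.gradient(cell)

magnitude = np.hypot(gx, gy)                         # strength of each change
orientation = np.rad2deg(np.arctan2(gy, gx)) % 180   # unsigned direction, 0-180 degrees

# Histogram with 9 orientation bins of 20 degrees each, weighted by magnitude.
hist, bin_edges = np.histogram(
    orientation, bins=9, range=(0, 180), weights=magnitude
)
print(hist)
```

In this toy cell the intensity only changes from left to right, so all of the weight falls into the first orientation bin. A cell with a horizontal or diagonal edge would fill a different bin, which is how the histogram encodes the shape of the object.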


The next step is to create the HOG features for each image in the data. This can be achieved via the hog() function from the scikit-image library. The following code shows an example for the first image in the training dataset:

    from skimage.feature import hog
    hog_vector, hog_img = hog(
        X_train_gray[0],
        visualize=True
    )
    hog_vector.shape

    (8100,)

The hog_vector is a one-dimensional vector with 8,100 numeric values that can now be used to represent this image. A visual representation of this vector is shown using:

    plt.imshow(hog_img);

[Figure 4.10: HOG of image]

This new representation captures the boundaries of the key shapes in the image. It eliminates noise and focuses on the informative parts that can help a classifier to make a prediction. The following code applies this transformation to all images in both training and testing sets:

    X_train_hog = np.array([hog(img) for img in X_train_gray])
    X_test_hog = np.array([hog(img) for img in X_test_gray])

A new SGDClassifier can now be trained on this new representation:

    # scales the new data
    scaler = StandardScaler()
    X_train_hog_scaled = scaler.fit_transform(X_train_hog)
    # reuses the scaler fitted on the training data for the test data
    X_test_hog_scaled = scaler.transform(X_test_hog)

    # trains a new model
    model_sgd = SGDClassifier()
    model_sgd.fit(X_train_hog_scaled, y_train)

    # tests the model
    pred = model_sgd.predict(X_test_hog_scaled)
    accuracy_score(y_test, pred)

    0.7418604651162791
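The length of 8,100 reported for hog_vector can be checked with simple arithmetic. The sketch below assumes the default settings of the hog() function in scikit-image (9 orientation bins, cells of 8x8 pixels, blocks of 3x3 cells); these values are not stated in the lesson, so treat them as an assumption to verify against the installed version.

```python
# Assumed default settings of skimage.feature.hog (not stated in the lesson).
image_size = 100       # the images are 100x100 pixels
orientations = 9       # orientation bins per cell
pixels_per_cell = 8    # each cell covers 8x8 pixels
cells_per_block = 3    # each block covers 3x3 cells

cells_per_axis = image_size // pixels_per_cell            # whole cells that fit along one side
blocks_per_axis = cells_per_axis - cells_per_block + 1    # blocks slide one cell at a time
values_per_block = cells_per_block ** 2 * orientations    # cells in a block x bins per cell

hog_length = blocks_per_axis ** 2 * values_per_block
print(cells_per_axis, blocks_per_axis, values_per_block, hog_length)
```

Under these assumptions the result matches the (8100,) shape shown above, which is far smaller than the 30,000 values of the flattened RGB image.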


    scikitplot.metrics.plot_confusion_matrix(
        y_test,  # actual labels
        pred,    # predicted labels
        title="Confusion Matrix",  # title to use
        cmap="Purples",            # color palette to use
        figsize=(10, 10),          # figure size
        x_tick_rotation=90
    );

[Figure 4.11: Confusion matrix of SGDClassifier algorithm performance. Raw counts; rows show the true label and columns show the predicted label for each of the 16 animal classes.]

The new results reveal a massive improvement in accuracy, which has now jumped to over 70% and has far surpassed the accuracy achieved by the same classifier on the flat data without any feature engineering. The improvement is also apparent in the updated confusion matrix, which now includes far fewer false positives. This demonstrates the value of using computer vision techniques to engineer intelligent features that capture the various visual properties of the data.
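The link between a confusion matrix and the accuracy score can be sketched on a small example. The 3-class matrix below is invented for illustration (it is not taken from the lesson's results), and the names class_names and cm are hypothetical.

```python
import numpy as np

# Hypothetical 3-class confusion matrix of raw counts.
# Rows are true labels, columns are predicted labels.
class_names = ["Cat", "Lion", "Tiger"]
cm = np.array([
    [8, 1, 1],   # true Cat
    [2, 6, 2],   # true Lion
    [0, 3, 7],   # true Tiger
])

# The diagonal holds the correct predictions.
correct = np.trace(cm)
accuracy = correct / cm.sum()

# Dividing each row by its total gives a normalized view:
# the diagonal then shows the share of each class that was recognized.
normalized = cm / cm.sum(axis=1, keepdims=True)
per_class = dict(zip(class_names, normalized.diagonal().round(2)))

# The largest off-diagonal cell is the most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(accuracy, per_class, f"{class_names[i]} -> {class_names[j]}")
```

Reading the matrix this way shows not only how often the model is right, but also which pairs of classes it tends to mix up.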


Prediction Using Neural Networks

This section demonstrates how neural networks can be used to design classifiers that are customized for image data and can often surpass even highly effective techniques, such as the HOG process described in the previous section. The popular TensorFlow and Keras libraries are used for this purpose.

TensorFlow is a low-level library that provides a wide range of tools for machine learning and artificial intelligence. It allows users to define and manipulate numerical computations involving tensors, which are multi-dimensional arrays of data. Keras, on the other hand, is a higher-level library that provides a simpler interface for building and training models. It is built on top of TensorFlow (or other backends) and provides a set of predefined layers and models that can be easily assembled into a deep learning model. Keras is designed to be user-friendly, making it a popular choice for practitioners.

Activation functions are mathematical functions applied to the output of each neuron in a neural network. They add non-linear properties to the model, allowing the network to learn complex patterns in the data. The choice of activation function is important and can impact the network's performance. Neurons receive input, process it with weights and biases, and produce an output based on an activation function, as shown in figure 4.12. Neural networks are constructed by connecting many neurons together in layers and are trained by adjusting the weights and biases to improve performance over time.

Figure 4.12: Activation function (inputs, weight variables, bias variable, activation function, output)

The following code installs the libraries tensorflow and keras:

    %%capture
    !pip install tensorflow
    !pip install keras

In the previous unit you were introduced to artificial neurons and neural network architectures, specifically the Word2Vec model, which used a hidden layer and an output layer to predict the context words of a given word in a sentence. Next, you will use Keras to create a similar neural architecture for images. First, the labels in y_train are converted to an integer format, as required by Keras:

    # gets the set of all distinct labels
    classes = list(set(y_train))
    print(classes)
    print()

    # replaces each label with an integer (its index in the classes list)
    # for both the training and testing data
    y_train_num = np.array([classes.index(label) for label in y_train])
    y_test_num = np.array([classes.index(label) for label in y_test])
    print()

    # example:
    print(y_train[:5])      # first 5 labels
    print(y_train_num[:5])  # first 5 labels in integer format
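Because a Python set has no guaranteed order, the integer assigned to each label by list(set(y_train)) can differ from one run to the next. A minimal sketch of a reproducible alternative, assuming the labels are strings, sorts the distinct labels and uses a dictionary lookup. The toy label list and the names y_train_demo, classes_sorted, and label_to_index are illustrative and are not part of the lesson code.

```python
import numpy as np

# hypothetical toy labels standing in for y_train
y_train_demo = ["Duck", "Cat", "Duck", "Lion", "Cat"]

# sorting the distinct labels gives the same order on every run
classes_sorted = sorted(set(y_train_demo))

# maps each label to its integer index
label_to_index = {label: i for i, label in enumerate(classes_sorted)}

y_train_demo_num = np.array([label_to_index[label] for label in y_train_demo])

print(classes_sorted)     # ['Cat', 'Duck', 'Lion']
print(y_train_demo_num)   # [1 0 1 2 0]
```

The dictionary lookup also avoids scanning the whole class list for every label, which is what classes.index(label) does.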


The Sequential tool from the Keras library can now be used to build a neural network as a sequence of layers.

    from keras.models import Sequential  # used to build neural networks as sequences of layers
    # every neuron in a dense layer is connected to every neuron in the previous layer
    from keras.layers import Dense

    # builds a sequential stack of layers
    model = Sequential()

    # adds a dense hidden layer with 200 neurons and the ReLU activation function
    model.add(Dense(200, input_shape=(X_train_hog.shape[1],), activation='relu'))

    # adds a dense output layer with the softmax activation function
    model.add(Dense(len(classes), activation='softmax'))

    model.summary()

    Model: "sequential"
     Layer (type)        Output Shape        Param #
     dense (Dense)       (None, 200)         1620200
     dense_1 (Dense)     (None, 16)          3216
    Total params: 1,623,416
    Trainable params: 1,623,416
    Non-trainable params: 0

The number of neurons in the hidden layer is a design choice. The number of neurons in the output layer is dictated by the number of classes.

The model summary reveals the total number of parameters that the model has to learn by fitting on the training data. The input has 8,100 entries, which is the length of each HOG feature vector in X_train_hog. The hidden layer has 200 neurons and is a dense layer that is fully connected to the input. This creates a total of 8,100 × 200 = 1,620,000 weighted connections whose weights (parameters) have to be learned. An additional 200 "bias" parameters are added, one for each neuron in the hidden layer, which gives the 1,620,200 parameters reported for this layer.

A bias parameter is a value that is added to the input of each neuron in a neural network. It is used to shift the activation function of the neuron to the negative or positive side, allowing the network to model more complex relationships between the input data and the output labels.
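The parameter counts reported by model.summary() can be checked by hand. The sketch below uses only NumPy: it counts the weights and biases of a dense layer, and it evaluates a single ReLU neuron on toy numbers to show how the bias shifts the point at which the neuron activates. The helper names dense_param_count and relu_neuron are hypothetical and are not part of Keras.

```python
import numpy as np

def dense_param_count(n_inputs, n_units):
    """One weight per input per neuron, plus one bias per neuron."""
    return n_inputs * n_units + n_units

hidden = dense_param_count(8100, 200)   # HOG features -> hidden layer
output = dense_param_count(200, 16)     # hidden layer -> 16 classes
print(hidden, output, hidden + output)  # 1620200 3216 1623416

def relu_neuron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through ReLU."""
    return max(0.0, float(np.dot(w, x) + b))

x = np.array([1.0, 2.0])    # toy inputs
w = np.array([0.5, -0.5])   # toy weights; the weighted sum is -0.5
print(relu_neuron(x, w, b=0.0))  # 0.0 (the neuron stays inactive)
print(relu_neuron(x, w, b=1.0))  # 0.5 (the bias shifts it into the active range)
```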


Given that the output layer has 16 neurons that are fully connected to the 200 neurons of the hidden layer, this adds an additional 16 × 200 = 3,200 weighted connections. An additional 16 bias parameters are added, one for each neuron in the output layer.

The following line is used to "compile" the model:

    # compiling the model
    model.compile(loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'],
                  optimizer='adam')

The Keras method model.compile() defines the basic characteristics of a model and prepares it for training, validation, and prediction. It takes three main arguments, as illustrated in Table 4.2.

Table 4.2: The arguments of the "compile" method

• loss: This is the loss function that is used to evaluate the error in the model during training. It measures how well the model's predictions match the true labels for a given set of input data. The goal of training is to minimize the loss function, which typically involves adjusting the model's weights and biases. In this case, the loss function is 'sparse_categorical_crossentropy', which is suitable for multi-class classification tasks where the labels are integers (as in y_train_num).
• metrics: This is a list of metrics that is used to evaluate the model during training and testing. These metrics are computed using the output of the model and the true labels, and they can be used to monitor the performance of the model and identify areas where it can be improved. "Accuracy" is a common metric for classification tasks that measures the fraction of correct predictions made by the model.
• optimizer: This is the optimization algorithm that is used to adjust the model's weights and biases during training. The optimizer uses the loss function to guide the training process, adjusting the model's parameters in an effort to minimize the loss, while the metrics are reported for monitoring. In this case, the optimizer is 'adam', which is a popular algorithm for training neural networks.

Finally, the fit() method is used to train the model on the available data:

    model.fit(X_train_hog,      # training data
              y_train_num,      # labels in integer format
              batch_size=80,    # number of samples processed per batch
              epochs=40,        # number of iterations over the whole dataset
              )


    Epoch 1/40
    17/17 [==============================] - 1s 16ms/step - loss: 2.2260 - accuracy: 0.3333
    Epoch 2/40
    17/17 [==============================] - 0s 15ms/step - loss: 1.1182 - accuracy: 0.7256
    Epoch 3/40
    17/17 [==============================] - 0s 15ms/step - loss: 0.7198 - accuracy: 0.8155
    Epoch 4/40
    17/17 [==============================] - 0s 15ms/step - loss: 0.4978 - accuracy: 0.9031
    Epoch 5/40
    17/17 [==============================] - 0s 16ms/step - loss: 0.3676 - accuracy: 0.9388
    ...
    Epoch 36/40
    17/17 [==============================] - 0s 21ms/step - loss: 0.0080 - accuracy: 1.0000
    Epoch 37/40
    17/17 [==============================] - 0s 15ms/step - loss: 0.0085 - accuracy: 1.0000
    Epoch 38/40
    17/17 [==============================] - 0s 15ms/step - loss: 0.0076 - accuracy: 1.0000
    Epoch 39/40
    17/17 [==============================] - 0s 15ms/step - loss: 0.0073 - accuracy: 1.0000
    Epoch 40/40
    17/17 [==============================] - 0s 15ms/step - loss: 0.0071 - accuracy: 1.0000

The fit() method is used to train a model on a given set of input data and labels. It takes four main arguments, as illustrated in Table 4.3.

Table 4.3: The arguments of the "fit" method

• X_train_hog: This is the input data that is used to train the model. It consists of the HOG-transformed data that was also used to train the latest version of the SGDClassifier in the previous section.
• y_train_num: This contains the label for each image in integer format.
• batch_size: This is the number of samples that is processed in each batch during training. The model updates its weights and biases after each batch, and the batch size can affect the speed and stability of the training process. Larger batch sizes can lead to faster training, but they require more memory per step, while smaller batch sizes produce noisier gradient estimates.
• epochs: This is the number of times the model iterates over the entire dataset during training. An epoch consists of one pass through the entire dataset, during which the model updates its weights and biases after each batch. The number of epochs can affect the model's ability to learn and generalize to new data. It is an important hyperparameter that should be chosen carefully. In this case, the model is trained for 40 epochs.
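The 17/17 shown next to each epoch in the training log is the number of batches per epoch. As a rough sketch, assuming a hypothetical training set of 1,300 images (the actual size is not stated in this section), the number of batches and of weight updates follows from batch_size and epochs:

```python
import math

n_samples = 1300   # assumed training set size, for illustration only
batch_size = 80
epochs = 40

# the last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_samples / batch_size)

# the weights are updated once per batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)  # 17
print(total_updates)    # 680
```

Any training set of 1,281 to 1,360 samples yields 17 batches when the batch size is 80.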


The trained model can now be used to predict the labels of the images in the testing set:

    pred = model.predict(X_test_hog)
    pred[0]  # prints the predictions for the first image

    14/14 [==============================] - 0s 2ms/step
    array([4.79123509e-03, 9.79321003e-01, 8.39506648e-03, 1.97884417e-03,
           7.83501855e-06, 3.50346789e-04, 3.45465224e-07, 1.19854585e-05,
           4.41945267e-05, 4.11721296e-04, 1.27362555e-05, 9.83431892e-06,
           1.97038025e-04, 2.34744814e-03, 5.49758552e-04, 1.57057808e-03],
          dtype=float32)

While the predict() function from the sklearn library returns the most likely label as predicted by the classifier, the Keras predict() function returns the probability of every candidate label. The np.argmax() function can then be used to return the index of the highest probability:

    # index of the class with the highest predicted probability
    print(np.argmax(pred[0]))

    # name of this class
    print(classes[np.argmax(pred[0])])

    # uses axis=1 to find the index of the max value per row
    accuracy_score(y_test_num, np.argmax(pred, axis=1))

    1
    Duck
    0.7529021558872305

This simple neural network achieves an accuracy of around 75%, similar to the one reported by the SGDClassifier. However, the advantage of neural architectures comes from their versatility, which allows you to experiment with different architectures to find the one that best fits your dataset. This accuracy was achieved with a simple and shallow architecture that included just one hidden layer with 200 neurons. Adding more layers would make the network deeper, while adding more neurons per layer would make it wider. The number of layers and the number of neurons per layer are important design choices that have a considerable impact on a network's performance. However, they are not the only way to improve performance and, in some cases, using a different type of neural network architecture may be more effective.
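The conversion from predicted probabilities to labels and to an accuracy score can be sketched with NumPy alone. The class names, probability rows, and true labels below are made-up toy values, not output of the trained model.

```python
import numpy as np

classes_demo = ["Cat", "Duck", "Lion"]   # hypothetical class names

# each row holds the predicted probabilities for one image (rows sum to 1)
pred_demo = np.array([
    [0.10, 0.85, 0.05],
    [0.70, 0.20, 0.10],
    [0.30, 0.30, 0.40],
    [0.25, 0.60, 0.15],
])
y_true_demo = np.array([1, 0, 1, 1])     # true labels in integer format

# axis=1 picks the index of the largest probability in each row
y_pred_demo = np.argmax(pred_demo, axis=1)
print(y_pred_demo)                              # [1 0 2 1]
print([classes_demo[i] for i in y_pred_demo])   # ['Duck', 'Cat', 'Lion', 'Duck']

# accuracy is the fraction of predictions that match the true labels
accuracy = np.mean(y_pred_demo == y_true_demo)
print(accuracy)                                 # 0.75
```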
Prediction Using Convolutional Neural Networks

A type of architecture that is particularly well suited for image classification is the Convolutional Neural Network (CNN). As the CNN processes the input data, it continually adjusts the parameters of its convolutional filters based on the data it sees, in order to better detect the desired features. The output of each layer is then passed on to the next layer, where more complex features are detected, until the final output is produced.
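To make the idea of a filter concrete, the following sketch slides a fixed 3 × 3 vertical-edge filter over a tiny grayscale image using plain NumPy. In a real CNN the filter values are learned during training rather than written by hand, and libraries such as Keras provide optimized layers for this operation. The function name apply_filter and the toy image are hypothetical.

```python
import numpy as np

def apply_filter(image, kernel):
    """Slides the kernel over the image (no padding, stride 1).

    This is the sliding-window sum of products that convolutional
    layers compute for each filter.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# toy 5x5 image: dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# responds strongly where brightness increases from left to right
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = apply_filter(image, kernel)
print(feature_map)
```

The resulting 3 × 3 feature map is 0 where the window covers only the uniform dark region and 3 where the window covers the dark-to-bright edge.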


Convolutional Neural Network (CNN): CNNs are deep neural networks that automatically learn a hierarchy of features from raw data, like images, by applying a series of convolutional filters to the input data, which are designed to detect specific patterns or features.

The power of convolutional neural networks (CNNs) is their ability to automatically extract relevant features from images, without the need for manual feature engineering. Despite the benefits of complex neural networks like CNNs, it is important to note that:
• More complex neural architectures have more parameters that have to be learned from the data during training. This typically requires a larger training dataset, which may not be available in some cases. In such cases, creating an overly complex architecture is unlikely to be effective.
• Even though neural networks have indeed achieved impressive results in image processing and other tasks, they are not guaranteed to always deliver the best performance across problems and datasets.
• Even if a neural network architecture is the best possible solution for a specific task, it may take a lot of time, effort, and computational resources to experiment with different options until the best architecture is found.

It is therefore best practice to start with simpler (but still effective) models, such as the SGDClassifier and many others from libraries such as sklearn. Once you have built a good predictive model for the dataset and have reached the point where such models can no longer be improved, then experimenting with neural architectures is an excellent next step.

Figure 4.13: Neural network with manual feature engineering (input, manual feature extraction, learning, output)

INFORMATION: One of the key advantages of CNNs is that they are very good at learning from large amounts of data, and can often achieve high levels of accuracy on tasks such as image classification without the need for manual feature engineering, such as the HOG process.


Figure 4.14: Convolutional neural network without manual feature engineering (input, feature extraction & learning, output)

Transfer Learning

Transfer learning is a process of reusing a pre-trained neural network to solve a new task. In the context of convolutional neural networks (CNN), transfer learning involves taking a pre-trained model, which was trained on a large dataset, and adapting it to a new dataset or task. Instead of starting from scratch, transfer learning allows the use of pre-trained models which have already learned important features, such as edges, shapes, and textures from the training dataset.

Figure 4.15: Reuse of a pretrained network (load pretrained network, replace final layers, train network, predict and assess network accuracy, deploy results; the network is improved by repeating the cycle). New layers are added to learn the specific features of your data.
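The freeze-and-replace idea in figure 4.15 can be illustrated with a toy NumPy sketch. Note the assumptions: the random matrix W_base only stands in for the layers of a pretrained network (it has not actually been trained on anything), the data is synthetic, and all names are hypothetical. In practice a pretrained CNN would be loaded through a library such as Keras. The point shown is only that the base weights stay fixed while a new final layer is trained on the new task.

```python
import numpy as np

rng = np.random.default_rng(0)

# stands in for pretrained layers; in practice these weights would be
# loaded from a model trained on a large dataset
W_base = rng.normal(size=(4, 8))
W_base_before = W_base.copy()

def extract_features(X):
    """Frozen feature extractor: its weights are never updated."""
    return np.maximum(0.0, X @ W_base)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# synthetic data for the new task: 60 samples, 4 inputs, 3 classes
X = rng.normal(size=(60, 4))
y = np.argmax(X[:, :3], axis=1)
Y = np.eye(3)[y]                      # one-hot labels

# new final layer, trained from scratch on the new task
W_head = np.zeros((8, 3))
b_head = np.zeros(3)

F = extract_features(X)               # computed once, since the base is frozen
losses = []
for _ in range(200):
    P = softmax(F @ W_head + b_head)
    losses.append(-np.mean(np.log(P[np.arange(len(y)), y])))
    grad = (P - Y) / len(y)           # gradient of the cross-entropy w.r.t. the logits
    W_head -= 0.05 * F.T @ grad       # only the new layer is updated
    b_head -= 0.05 * grad.sum(axis=0)

print("first loss:", losses[0], "last loss:", losses[-1])
print("base unchanged:", np.array_equal(W_base, W_base_before))
```

Because the base is frozen, its features are computed once and reused in every training step, which is one reason transfer learning is cheaper than training a whole network from scratch.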


Exercises

1 What are the challenges of visual data classification?

2 You are given two numpy arrays, X_train and y_train. Each row in X_train has a shape of (100, 100, 3) and represents a 100 × 100 RGB image. The n-th row in y_train represents the label of the n-th image in X_train. Complete the following code so that it flattens X_train and then trains a MultinomialNB model on this dataset.

    from sklearn.naive_bayes import MultinomialNB  # imports the Naive Bayes Classifier from sklearn

    X_train_flat = np.array(____)
    model_MNB = MultinomialNB()  # new Naive Bayes model
    model_MNB.fit(____)  # fits model on the flat training data

3 Describe briefly how CNNs work and state one of their key advantages.


4 You are given two numpy arrays, X_train and y_train. Each row in X_train has a shape of (100, 100, 3) and represents a 100 × 100 RGB image. The n-th row in y_train represents the label of the n-th image in X_train. Complete the following code so that it applies the HOG transformation on this dataset and then uses the transformed data to train a MultinomialNB model:

    from skimage.color import ____  # used to convert a multi-color (rgb) image to grayscale
    from sklearn.____ import StandardScaler  # used to scale the data
    from sklearn.naive_bayes import MultinomialNB  # imports the Naive Bayes Classifier from sklearn

    X_train_gray = np.array([____(img) for img in X_train])  # converts training data
    X_train_hog = ____
    scaler = StandardScaler()
    X_train_hog_scaled = ____.fit_transform(X_train_hog)
    model_MNB = MultinomialNB()
    model_MNB.fit(X_train_hog_scaled, ____)

5 Name some disadvantages of CNNs.
