Lesson 3 Generating Text - Artificial Intelligence - Third Year Secondary

Lesson 3 Generating Text
Link to digital lesson: www.ien.edu.sa

Natural Language Generation
Natural Language Generation (NLG) is a sub-field of natural language processing (NLP) that focuses on generating human-like text using computer algorithms. The goal of NLG is to produce written or spoken language that is natural and understandable to humans, without the need for human intervention. There are several different approaches to NLG, including template-based, rule-based, and machine learning-based methods.

Natural Language Processing (NLP): a branch of AI which gives computers the ability to simulate human natural languages.
Natural Language Generation (NLG): the process of generating human-like text using AI.
Figure 3.25: NLP Venn diagram (NLP sits at the intersection of Computer Science, Linguistics, and AI; NLG is part of NLP)

Table 3.4: The impact of NLG
1. NLG could be used to automatically generate news articles, reports, or other written content, freeing up time for humans to focus on more creative or higher-level tasks.
2. It could also be used to improve the efficiency and effectiveness of customer service chatbots, enabling them to provide more natural and helpful responses to customer inquiries.
3. NLG has the potential to increase accessibility for people with disabilities or language barriers, by enabling them to communicate with machines in a way that is natural and intuitive for them.


There are four types of NLG:

Template-Based NLG
Template-based NLG involves the use of predefined templates that specify the structure and content of the generated text. These templates are filled in with specific information to generate the final text. This approach is relatively simple and can be effective at generating text for specific, well-defined tasks. On the other hand, it may struggle with more open-ended tasks or tasks that require a high degree of variability in the generated text. For example, a weather report template might look like this: "Today in [city], it is [temperature] degrees with [weather condition]."

Selection-Based NLG
Selection-based NLG involves the selection of a subset of sentences or paragraphs to create a representative summary of a much larger corpus. Even though this approach does not generate new text, it is very popular in practice. This is because, by sampling from a pool of sentences that have been written by humans, it reduces the risk of generating unpredictable or poorly formed text. For example, a selection-based weather report generator might have a database of phrases such as "It is hot outside," "The temperature is rising," and "Expect sunny skies."

Rule-Based NLG
Rule-based NLG uses a set of predefined rules to generate text. The rules might specify how to combine words and phrases to form sentences, or how to choose words based on the context in which they are being used. Rule-based systems are often used to create customer service chatbots. They can be simple to implement, but they can also be inflexible and may not produce very natural-sounding output.

Machine Learning-Based NLG
Machine learning-based NLG involves training a machine learning model on a large dataset of human-generated text. The model learns the patterns and structure of the text, and can then generate new text that is similar in style and content. This approach can be more effective for tasks that require a high degree of variability in the generated text, but it may require a larger amount of training data and computational resources.

Using Template-Based NLG
Template-based NLG is relatively simple and can be effective at generating text for specific, well-defined tasks, such as generating reports or descriptions of data. One advantage of template-based NLG is that it can be relatively easy to implement and maintain. The templates can be designed by humans, and do not require the use of complex machine learning algorithms or large amounts of training data. This makes template-based NLG a good choice for tasks where the structure and content of the generated text are well-defined and do not need to vary significantly.

NLG templates can be based on any predefined linguistic construct. One common practice is to create a template that requires words with a specific part-of-speech tag to be placed in specific slots within a sentence.

Part of Speech (POS) Tags
Part of speech tags, also known as POS tags, are labels that are assigned to words in a text to indicate their grammatical role, or part of speech, in the sentence. For example, a word may be tagged as a noun, verb, adjective, adverb, etc. Part of speech tags are used in NLP to analyze and understand the structure and meaning of a text.

Figure 3.26: Example of the POS process ("I want an early upgrade" tagged as PRON VERB DET ADJ NOUN)
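POS tags like the ones shown in Figure 3.26 can be produced automatically with standard NLP tools. The following is a minimal sketch using the nltk library (which is also used later in this lesson); the exact tags printed may vary slightly depending on the tagger version:

import nltk
nltk.download('averaged_perceptron_tagger')  # pre-trained POS tagger
nltk.download('universal_tagset')            # coarse tag set (NOUN, VERB, ADJ, ...)

tokens = "I want an early upgrade".split()   # simple whitespace tokenization
print(nltk.pos_tag(tokens, tagset='universal'))
# expected output (approximately):
# [('I', 'PRON'), ('want', 'VERB'), ('an', 'DET'), ('early', 'ADJ'), ('upgrade', 'NOUN')]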


Syntax Analysis
Syntax analysis is often used along with POS tags in template-based NLG to ensure that the templates can lead to realistic text. Syntax analysis involves identifying the parts of speech of the words in the sentence, and the relationships between them, to determine the grammatical structure of the sentence. A sentence includes different types of syntax elements. For example:
• The predicate is the part of the sentence that contains the verb. It typically expresses what is being done or what is happening.
• The subject is the part of the sentence that performs the action expressed by the verb, or that is affected by the action.
• The direct object is a noun or pronoun that refers to the person or thing that is directly affected by the action expressed by the verb.

The following code uses the wonderwords library, which follows this syntax-based approach, to provide some examples of template-based NLG:

%%capture
!pip install wonderwords  # used to generate template-based randomized sentences

from wonderwords.random_sentence import RandomSentence

# make a new generator with specific words
generator = RandomSentence(
    nouns=["lion", "rabbit", "horse", "table"],  # specify some nouns
    verbs=["eat", "run", "laugh"],               # specify some verbs
    adjectives=["angry", "small"])               # specify some adjectives

# generates a sentence with the following template: [subject (noun)] [predicate (verb)]
generator.bare_bone_sentence()

'The table runs.'

# generates a sentence with the following template:
# the [(adjective)] [subject (noun)] [predicate (verb)] [direct object (noun)]
generator.sentence()

'The small lion runs rabbit.'

The above examples show that, while template-based NLG can be used to generate sentences with a specific pre-approved structure, these sentences may not be meaningful in practice. Even though the quality of the results can be significantly improved by defining more sophisticated templates and placing more constraints on vocabulary use, this approach is not practical for generating realistic text on a large scale. Rather than manually creating predefined templates, a different approach to template-based NLG is to use the structure and vocabulary of any real sentence as a more dynamic template. The paraphrase() function adopts this approach.


Paraphrase() Function
Given a paragraph of text, the function first splits the text into individual words (tokens). Then, it tries to replace each word with another semantically similar word. Semantic similarity is evaluated via the Word2Vec model that was introduced in the previous lesson. To avoid cases where Word2Vec recommends replacements that are very similar to the original word (e.g. replacing "apple" with "apples"), the function uses the popular fuzzywuzzy library to evaluate the lexical similarity between the original word and a candidate to replace it. The function itself is shown below (fuzz denotes the fuzzywuzzy library):

def paraphrase(text: str,                   # text to be paraphrased
               stop: set,                   # set of stopwords
               model_wv,                    # Word2Vec model
               lexical_sim_ubound: float,   # upper bound on lexical similarity
               semantic_sim_lbound: float   # lower bound on semantic similarity
               ):
    words = word_tokenize(text)  # tokenizes the text to words
    new_words = []               # new words that will replace the old ones
    for word in words:           # for every word in the text
        word_l = word.lower()    # lower-case the word
        # if the word is a stopword or is not included in the Word2Vec model, do not try to replace it
        if word_l in stop or word_l not in model_wv:
            new_words.append(word)  # append the original word
        else:  # otherwise
            # get the 10 most similar words, as per the Word2Vec model
            # returned words are sorted from most to least similar to the original
            # semantic similarity is always between 0 and 1
            replacement_words = model_wv.most_similar(positive=[word_l], topn=10)
            # for each candidate replacement word
            for rword, sem_sim in replacement_words:
                # get the lexical similarity between the candidate and the original word
                # the partial_ratio function returns values between 0 and 100
                # it compares the shorter of the two words with all equal-sized substrings
                # of the original word
                lex_sim = fuzz.partial_ratio(word_l, rword)
                # if the lexical sim is less than the bound, stop and use this candidate
                if lex_sim < lexical_sim_ubound:
                    break


            # quality check: if the chosen candidate is not semantically similar enough to
            # the original, then just use the original word
            if sem_sim < semantic_sim_lbound:
                new_words.append(word)
            else:  # use the candidate
                new_words.append(rword)
    # re-join the new words into a single string and return
    return ' '.join(new_words)

The function returns a paraphrased version of the given text. The following code imports all the tools required to support the paraphrase() function, and the white box below displays the output of the paraphrase() function for the text assigned to the text variable:

%%capture
import gensim.downloader as api  # used to download and load a pre-trained Word2Vec model
model_wv = api.load('word2vec-google-news-300')

import nltk
# used to split a piece of text into words. Maintains punctuation as separate tokens
from nltk import word_tokenize

nltk.download('stopwords')  # downloads the stopwords tool of the nltk library
# used to get lists of very common words in different languages
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # gets the list of English stopwords

!pip install fuzzywuzzy[speedup]
from fuzzywuzzy import fuzz

text = 'We had dinner at this restaurant yesterday. It is very close to my house. All my friends were there, we had a great time. The location is excellent and the steaks were delicious. I will definitely return soon, highly recommended!'

# parameters: target text, stopwords, Word2Vec model, upper bound on lexical similarity, lower bound on semantic similarity
paraphrase(text, stop, model_wv, 80, 0.5)

'We had brunch at this eatery Monday. It is very close to my bungalow. All my acquaintances were there, we had a terrific day. The locale is terrific and the tenderloin were delicious. I will certainly rejoin quickly, hugely advised!'

As with any template-based approach, the results can be improved by adding more constraints to correct some of the less intuitive replacements shown above. However, the example above still demonstrates that even this simple function can produce very realistic text.
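The lexical-similarity check used inside paraphrase() can also be explored on its own. The following minimal sketch calls fuzz.partial_ratio() directly; the example words are chosen only for illustration:

# "apple" is a substring of "apples", so the lexical similarity is the maximum score of 100;
# paraphrase() would therefore reject "apples" as a replacement for "apple"
print(fuzz.partial_ratio('apple', 'apples'))

# "dinner" and "brunch" share few characters, so the score is much lower and
# "brunch" would be accepted as a candidate replacement
print(fuzz.partial_ratio('dinner', 'brunch'))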


Using Selection-Based NLG
In this section, you will see a practical approach to selecting a sample of representative sentences from a given document. The approach exemplifies the use and benefits of selection-based NLG and relies on two key building blocks:
• The Word2Vec model, which will be used to identify pairs of semantically similar words.
• The Networkx library, a popular Python library used to create and process different types of network data.

The input document that will be used in this chapter is a news article written after the final match of the FIFA World Cup 2022.

# reads the input document that we want to summarize
with open('article.txt', encoding='utf8', errors='ignore') as f:
    text = f.read()

text[:100]  # shows the first 100 characters of the article

'It was a consecration, the spiritual overtones entirely appropriate. Lionel Messi not only emulated'

First, the text is tokenized using the re library and the same regular expression that was used in the previous units:

import re  # used for regular expressions

# tokenize the document, ignore stopwords, focus only on words included in the Word2Vec model
tokenized_doc = [word for word in re.findall(r'\b\w\w+\b', text.lower())
                 if word not in stop and word in model_wv]

# get the vocabulary (set of unique words)
vocab = set(tokenized_doc)

Networkx Library
The vocabulary of the document can now be modeled as a weighted graph. Python's Networkx library provides an extensive set of tools for creating and analyzing graphs. In selection-based NLG, representing the vocabulary of a document as a weighted graph can help to capture the relationships between words and facilitate the selection of relevant phrases and sentences. In a weighted graph, each node represents a word or a concept, and the edges between nodes represent relationships between these concepts. The weights on the edges represent the strength of these relationships, allowing the NLG system to determine which concepts are most strongly related.

When generating text, the weighted graph can be used to find the most relevant phrases and sentences based on the relationships between words. For example, the system might use the graph to find the most relevant words and phrases to describe a particular entity, and then use these words to select the most appropriate sentence from its database.

Figure 3.27: Example of a Networkx weighted graph (nodes such as "restaurant", "dinner", "house", "location", "great", "delicious", and "recommended", connected by weighted edges)
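Before building a graph for the full vocabulary, the following minimal sketch constructs a tiny weighted graph by hand, loosely based on the words shown in Figure 3.27; the specific words and weights are chosen only for illustration:

import networkx as nx

G_demo = nx.Graph()  # an empty undirected graph
# add weighted edges; nodes are created automatically
G_demo.add_edge('restaurant', 'dinner', weight=3)
G_demo.add_edge('restaurant', 'delicious', weight=2)
G_demo.add_edge('house', 'location', weight=2)
G_demo.add_edge('great', 'recommended', weight=1)

print(G_demo.number_of_nodes(), 'nodes,', G_demo.number_of_edges(), 'edges')
print(G_demo['restaurant']['dinner'])  # prints the edge data: {'weight': 3}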


Build_graph() Function
The build_graph() function uses NetworkX to create a graph that includes:
• One node for each word in a given vocabulary.
• An edge between every two words. The weight on the edge is equal to the semantic similarity between the words, as computed by the Word2Vec model. (A related tool, Doc2Vec, represents whole documents as vectors and is a generalization of the Word2Vec method.)

The function returns a graph with one node for each word in the given vocabulary and a weighted edge between every pair of words.

# tool used to create combinations (e.g. pairs, triplets) of the elements in a list
from itertools import combinations
import networkx as nx  # Python library for processing graphs

def build_graph(vocab: set,  # set of unique words
                model_wv     # Word2Vec model
                ):
    # gets all possible pairs of words in the doc
    pairs = combinations(vocab, 2)
    G = nx.Graph()  # makes a new graph
    for w1, w2 in pairs:  # for every pair of words w1, w2
        sim = model_wv.similarity(w1, w2)  # gets the similarity between the two words
        G.add_edge(w1, w2, weight=sim)     # adds an edge weighted by the similarity
    return G

# creates a graph for the vocabulary of the World Cup document
G = build_graph(vocab, model_wv)

# prints the weight of the edge (semantic similarity) between the two words
G['referee']['goalkeeper']

{'weight': 0.40646762}

Given such a word-based graph, a set of words that are all semantically similar to each other can be represented as a cluster of nodes connected to each other by high-weight edges. Such node clusters are also referred to as "communities". The graph output is a simple set of vertices and a set of weighted edges. No clustering has been done yet to create the "communities". Figure 3.28 uses different colors to mark the communities in an example graph.

Figure 3.28: Communities in a graph


Louvain Algorithm
The Networkx library includes multiple algorithms for analyzing the graph and finding such communities. One of the most effective options is the Louvain algorithm, which works by iteratively moving nodes between communities until it finds the community structure that best represents the linkage of the underlying network.

Get_communities() Function
The following function uses the Louvain algorithm to find the communities in a given word-based graph. The function also computes an importance score for each community. It then returns two dictionaries:
• word_to_community, which maps each word to its community.
• community_scores, which maps each community to an importance score. The score is equal to the sum of the frequencies of all the words in the community. For example, if a community includes three words that appear 5, 8, and 6 times in the document, the community's score is equal to 19. Conceptually, the score represents the part of the document that is "covered" by the community.

from networkx.algorithms.community import louvain_communities
from collections import Counter  # used to count the frequency of elements in a list

def get_communities(G,                     # the input graph
                    tokenized_doc: list):  # the list of words in a tokenized document
    # gets the communities in the graph
    communities = louvain_communities(G, weight='weight')
    word_cnt = Counter(tokenized_doc)  # counts the frequency of each word in the doc
    word_to_community = {}  # maps each word to its community
    community_scores = {}   # maps each community to a frequency score
    for comm in communities:  # for each community
        # convert it from a set to a tuple so that it can be used as a dictionary key
        comm = tuple(comm)
        score = 0  # initialize the community score to 0
        for word in comm:  # for each word in the community
            word_to_community[word] = comm  # map the word to the community
            score += word_cnt[word]  # add the frequency of the word to the community's score
        community_scores[comm] = score  # map the community to the score
    return word_to_community, community_scores
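Before applying get_communities() to the article graph, the Louvain algorithm itself can be tried on a small toy graph. In the following minimal sketch the words, edges, and weights are made up purely for illustration, and the exact grouping may vary between runs because the algorithm is randomized:

import networkx as nx
from networkx.algorithms.community import louvain_communities

G_toy = nx.Graph()
# one tightly connected group of words
G_toy.add_edge('goal', 'penalty', weight=0.9)
G_toy.add_edge('goal', 'striker', weight=0.8)
G_toy.add_edge('penalty', 'striker', weight=0.7)
# a second tightly connected group of words
G_toy.add_edge('ticket', 'stadium', weight=0.8)
G_toy.add_edge('stadium', 'crowd', weight=0.9)
# a single weak edge connecting the two groups
G_toy.add_edge('striker', 'stadium', weight=0.1)

print(louvain_communities(G_toy, weight='weight'))
# expected result (roughly): [{'goal', 'penalty', 'striker'}, {'ticket', 'stadium', 'crowd'}]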


word_to_community, community_scores = get_communities(G, tokenized_doc)

# prints 10 words from the community of the word 'player'
word_to_community['player'][:10]

('champion', 'stretch', 'finished', 'fifth', 'playing', 'scoring', 'scorer', 'opening', 'team', 'win')

Now that all the words have been mapped to a community and each community is associated with an importance score, the next step is to use this information to evaluate the importance of each sentence in the original document. The evaluate_sentences() function is designed for this purpose.

Evaluate_sentences() Function
The function starts by splitting the document into sentences. It then computes an importance score for each sentence, based on the words that it includes. Each word inherits the importance score of the community that it belongs to. For example, consider a sentence with 5 words w1, w2, w3, w4, w5. Words w1 and w2 belong to a community with a score of 25, w3 and w4 belong to a community with a score of 30, and w5 belongs to a community with a score of 15. The total score of the sentence is then 25 + 25 + 30 + 30 + 15 = 125. The function then uses these scores to rank the sentences in descending order, from most to least important.

from nltk import sent_tokenize  # used to split a document into sentences

def evaluate_sentences(doc: str,                  # original document
                       word_to_community: dict,   # maps each word to its community
                       community_scores: dict,    # maps each community to a score
                       model_wv):                 # Word2Vec model
    # splits the text into sentences
    sentences = sent_tokenize(doc)
    scored_sentences = []  # stores (score, sentence) tuples
    for raw_sent in sentences:  # for each sentence
        # get all the words in the sentence, ignore stopwords and focus only on
        # words that are in the Word2Vec model
        sentence_words = [word for word in re.findall(r'\b\w\w+\b', raw_sent.lower())  # tokenizes
                          if word not in stop and  # ignores stopwords


                          word in model_wv]  # ignores words that are not in the Word2Vec model
        sentence_score = 0  # the score of the sentence
        for word in sentence_words:  # for each word in the sentence
            word_comm = word_to_community[word]  # get the community of this word
            # add the score of this community to the sentence score
            sentence_score += community_scores[word_comm]
        # stores this sentence and its total score
        scored_sentences.append((sentence_score, raw_sent))
    # sorts the sentences by their score, in descending order
    scored_sentences = sorted(scored_sentences, key=lambda x: x[0], reverse=True)
    return scored_sentences

scored_sentences = evaluate_sentences(text, word_to_community, community_scores, model_wv)
len(scored_sentences)

61

The original document includes a total of 61 sentences. The following code can now be used to get the top 3 most important sentences:

for i in range(3):
    print(scored_sentences[i], '\n')

(3368, 'Lionel Messi not only emulated the deity of Argentinian football, Diego Maradona, by leading the nation to World Cup glory; he finally plugged the burning gap on his CV, winning the one title that has eluded him at the fifth time of asking, surely the last time.')

(2880, 'He scored twice in 97 seconds to force extra-time; the first a penalty, the second a sublime side-on volley and there was a point towards the end of regulation time when he appeared hell-bent on making sure that the additional period would not be needed.')

(2528, 'It will go down as surely the finest World Cup final of all time, the most pulsating, one of the greatest games in history because of how Kylian Mbappé hauled France up off the canvas towards the end of normal time.')
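Since the tuples in scored_sentences are already sorted from most to least important, an extractive summary can be assembled by simply joining the top-k sentences. The following helper is a minimal sketch and is not part of the lesson's original code:

# joins the k highest-scoring sentences into a single summary string
def make_summary(scored_sentences, k=3):
    top_sentences = [sent for score, sent in scored_sentences[:k]]
    return ' '.join(top_sentences)

summary = make_summary(scored_sentences, k=3)
print(summary[:100])  # preview the first 100 characters of the summary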


print(scored_sentences[-1])  # prints the last sentence with the lowest score
print()
print(scored_sentences[30])  # prints a sentence in the middle of the scoring scale

(0, 'By then it was 2-0.')

(882, 'Di María won the opening penalty, exploding away from Ousmane Dembélé before being caught and Messi did the rest.')

The results verify that this approach can indeed successfully identify representative sentences that capture the main points of the original document, while assigning lower scores to less informative sentences. The same approach can be applied as is to generate a summary of any given document.

Using Rule-Based NLG to Create a Chatbot
In this section, you will build a course-recommendation chatbot by combining a simple knowledge base of questions and answers with the SBERT neural model. This demonstrates the transfer learning used in SBERT: the same SBERT architecture (all-MiniLM-L6-v2) that was previously used for sentiment analysis will now be applied to a different task, NLG.

1. Load the Pre-Trained SBERT Model
The first step is to load the pre-trained SBERT model:

%%capture
from sentence_transformers import SentenceTransformer, util
model_sbert = SentenceTransformer('all-MiniLM-L6-v2')

2. Create a Simple Knowledge Base
The second step is to create a simple knowledge base to capture the question-answer script that the chatbot will follow. The script includes 4 questions (Q1-Q4) and their respective answers (A1-A4). Each answer consists of a list of options, which represent the possible answers that are considered acceptable for the corresponding question. For example, the answer to question Q2 has two possible options (["Java", None] and ["Python", None]). Each option consists of two values:
• The actual text of the acceptable answer (e.g. "Java" or "Courses in Marketing").
• An ID that points to the next question that the chatbot should ask if the option is selected. If there is no follow-up question, this second value is None.

For example, if the user selects the ["Courses in Engineering", "3"] option as a response to Q1, then the next question that will be asked is Q3. This simple knowledge base can be easily extended to add more Q/A levels and make the chatbot more intelligent.


QA = {
    "Q1": "What type of courses are you interested in?",
    "A1": [["Courses in Computer Programming", "2"],
           ["Courses in Engineering", "3"],
           ["Courses in Marketing", "4"]],
    "Q2": "What type of Programming Languages are you interested in?",
    "A2": [["Java", None], ["Python", None]],
    "Q3": "What type of Engineering are you interested in?",
    "A3": [["Mechanical Engineering", None], ["Electrical Engineering", None]],
    "Q4": "What type of Marketing are you interested in?",
    "A4": [["Social Media Marketing", None], ["Search Engine Optimization", None]]
}

Chat() Function
Finally, the following chat() function is used to process the knowledge base and implement the chatbot. After asking a question, the chatbot reads the user's response.
• If the response is semantically similar to one of the acceptable answer options for this question, then that option is selected and the chatbot proceeds to the next question.
• If the response is not similar to any of the options, it asks the user to rephrase the response.

The function uses SBERT to evaluate the semantic similarity score between the response and each candidate option. An option is considered similar if this score is higher than a lower bound parameter (sim_lbound).

import numpy as np  # used for processing numeric data

def chat(QA: dict,            # the Question-Answer script of the chatbot
         model_sbert,         # a pre-trained SBERT model
         sim_lbound: float):  # lower bound on the similarity between the user's response and the closest candidate answer
    qa_id = '1'  # the QA id
    while True:  # an infinite loop, will break in specific conditions
        print('>>', QA['Q' + qa_id])  # prints the question for this qa_id
        candidates = QA['A' + qa_id]  # gets the candidate answers for this qa_id
        print(flush=True)  # used only for formatting purposes
        response = input()  # reads the user's response
        # embed the response
        response_embeddings = model_sbert.encode([response], convert_to_tensor=True)
        # embed each candidate answer. x is the text, y is the qa_id. Only embed x.
        candidate_embeddings = model_sbert.encode([x for x, y in candidates],


                                                  convert_to_tensor=True)
        # gets the similarity score for each candidate
        similarity_scores = util.cos_sim(response_embeddings, candidate_embeddings)
        # finds the index of the closest answer
        # np.argmax(L) finds the index of the highest number in a list L
        winner_index = np.argmax(similarity_scores[0])
        # if the score of the winner is less than the bound, ask again
        if similarity_scores[0][winner_index] < sim_lbound:
            print('>> Apologies, I could not understand you. Please rephrase your response.')
            continue
        # gets the winner (best candidate answer)
        winner = candidates[winner_index]
        # prints the winner's text
        print('\n>> You have selected: ', winner[0])
        print()
        qa_id = winner[1]  # gets the qa_id for this winner
        if qa_id == None:  # no more questions to ask, exit the loop
            print('>> Thank you, I just emailed you a list of courses.')
            break

Consider the following two interactions between the chatbot and a user:

Interaction 1

chat(QA, model_sbert, 0.5)

>> What type of courses are you interested in?
marketing courses

>> You have selected: Courses in Marketing

>> What type of Marketing are you interested in?
seo

>> You have selected: Search Engine Optimization
>> Thank you, I just emailed you a list of courses.

In this first interaction, the chatbot correctly understands that the user is looking for Marketing courses. It is also intelligent enough to understand that the term "seo" is semantically similar to "Search Engine Optimization", leading to the successful conclusion of the discussion.


Interaction 2

chat(QA, model_sbert, 0.5)

>> What type of courses are you interested in?
cooking classes

>> Apologies, I could not understand you. Please rephrase your response.
>> What type of courses are you interested in?
software courses

>> You have selected: Courses in Computer Programming

>> What type of Programming Languages are you interested in?
C++

>> You have selected: Java
>> Thank you, I just emailed you a list of courses.

In this second interaction, the chatbot correctly realizes that "cooking classes" is not semantically similar to any of the options in its knowledge base. It is also intelligent enough to understand that "software courses" should be mapped to the "Courses in Computer Programming" option. The final part of the interaction highlights a weakness: the chatbot matches the user's "C++" response to "Java". Although the two programming languages are indeed related (and are arguably more related than Python and C++), the appropriate response would have been to say that the chatbot does not have the knowledge to recommend C++ courses. One way to address this weakness would be to use lexical rather than semantic similarity to compare responses and options for some questions; a minimal sketch of this idea is shown after Table 3.5 below.

Using Machine Learning to Generate Realistic Text
The methods described in the previous sections use templates, rules, or selection techniques to produce text for different applications. In this section, you will explore the state-of-the-art in machine learning for NLG.

Table 3.5: Advanced machine learning techniques for NLG

Technique: Long short-term memory (LSTM) network
Description: An LSTM network is made up of several "memory cells" that are connected together. When the network is given a sequence of data, it processes each element in the sequence one at a time and, for each element, the network updates its memory cells to produce an output. LSTMs are particularly well-suited for NLG tasks because they can retain information from sequences of data (such as speech or handwriting recognition) and handle the complexity of natural language.

Technique: Transformer-based models
Description: Transformer-based models are models that can understand and generate human language. They work by using a technique called "self-attention" that helps them understand the relationships between different words in a sentence.
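As suggested above, one possible fix for the C++/Java confusion is to add a lexical check on top of the semantic one. The following minimal sketch reuses the fuzzywuzzy library from earlier in the lesson; the helper function, its name, and the threshold of 70 are illustrative assumptions, not part of the lesson's chat() function:

from fuzzywuzzy import fuzz

# hypothetical helper: accept an option only if it is also lexically close to the response
def lexical_match(response: str, option_text: str, lbound: int = 70) -> bool:
    return fuzz.partial_ratio(response.lower(), option_text.lower()) >= lbound

print(lexical_match('C++', 'Java'))      # False: no lexical overlap, so the match would be rejected
print(lexical_match('python', 'Python'))  # True: identical apart from case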


Figure 3.29: LSTM (a chain of LSTM cells mapping a sequence of inputs, e.g. "I", "am", ..., "today", to a sequence of outputs)
Figure 3.30: Transformer (an encoder-decoder model translating "I am a student" to "أنا طالب")

Transformers
Transformers are particularly well-suited for NLG tasks because they can process sequential input data efficiently. In a transformer model, the input data is first passed through an encoder, which converts the input into a continuous representation. The continuous representation is then passed through a decoder, which generates the output sequence. One of the key features of these models is the use of attention mechanisms that allow the model to focus on the important parts of a sequence while ignoring less informative parts. Transformer models have been shown to produce high-quality text for a variety of NLG tasks, including machine translation, summarization, and question answering.

GPT-2 Model
In this section, you will use GPT-2, a powerful language model developed by OpenAI, to generate text based on text prompts that are provided by the user. GPT-2 (Generative Pre-trained Transformer 2) was trained on a dataset of over 8 million web pages and has the ability to generate human-like text in a variety of languages and styles. The transformer-based architecture of GPT-2 allows it to capture long-range dependencies and generate coherent text. GPT-2 is trained with the objective of predicting the next word, given all of the previous words within the text. The model can thus be used to produce texts of arbitrary length, by continuously predicting and appending more words.

%%capture
!pip install transformers
!pip install torch

import torch  # an open-source machine learning library for neural networks, required for GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# initialize a tokenizer and a generator based on a pre-trained GPT-2 model
# the tokenizer is used to:
# - encode the text provided by the user into tokens
# - translate (decode) the output of the generator back to text
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# used to generate new tokens based on the inputted text
generator = GPT2LMHeadModel.from_pretrained('gpt2')

The following text will then be provided as a seed to GPT-2:

text = 'We had dinner at this restaurant yesterday. It is very close to my house. All my friends were there, we had a great time. The location is excellent and the steaks were delicious. I will definitely return soon, highly recommended!'


# encodes the given text into tokens
encoded_text = tokenizer.encode(text, return_tensors='pt')

# use the generator to generate more tokens
generated_tokens = generator.generate(encoded_text,
                                      max_length=200)  # max number of new tokens to generate

# decode the generated tokens to convert them to words
# skip_special_tokens=True is used to avoid special tokens such as '>' or '-' characters
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))

We had dinner at this restaurant yesterday. It is very close to my house. All my friends were there, we had a great time. The location is excellent and the steaks were delicious. I will definitely return soon, highly recommended! I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and I've been coming here for a while now and

# use the generator to generate more tokens
# do_sample=True prevents GPT-2 from just predicting the most likely word at every step
generated_tokens = generator.generate(encoded_text,
                                      max_length=200,  # max number of new tokens to generate
                                      do_sample=True)

print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))

We had dinner at this restaurant yesterday. It is very close to my house. All my friends were there, we had a great time. The location is excellent and the steaks were delicious. I will definitely return soon, highly recommended! If you just found this place helpful. If you like to watch videos or go to the pool while you're there, go for it! Good service - I'm from Colorado and love to get in and out of this place. The food was amazing! Also, we were happy to see the waitstaff with their great hands - I went for dinner. I ordered a small side salad (with garlic on top), and had a slice of tuna instead. When I was eating, I was able to get up and eat my salad while waiting for my friend to pick up the plate, so I had a great time too. Staff was welcoming and accommodating. Parking is cheap in this neighborhood, and it is in the neighborhood that it needs to


This leads to a much more diverse output, while maintaining the authenticity of the generated text. The text uses a rich vocabulary and is syntactically correct. GPT-2 allows for further customization of the output. An example is the use of the 'temperature' parameter, which allows the model to take more risks and to sometimes select lower-probability words. Higher values of this parameter lead to more diverse texts. For example:

# Generate tokens with higher diversity
generated_tokens = generator.generate(encoded_text,
                                      max_length=200,
                                      do_sample=True,
                                      temperature=2.0)

print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))

We had dinner at this restaurant yesterday. It is very close to my house. All my friends were there, we had a great time. The location is excellent and the steaks were delicious. I will definitely return soon, highly recommended! - - Worth a 5 I thought a steak at a large butcher was the end story!! We were lucky. The price was cheap!! That night though as soon as dinner was on my turn that price cut completely out. At the tail area they only have french fries or kiwifet no gravy they get a hard egg the other day too they call kawif at 3 PM it will be better this summer if I stay more late with friends. When asked it takes 2 or 3 weeks so far to cook that in this house. Once I found a place it was great. Everything I am waiting is just perfect as usual....great prices especially at one where a single bite would suffice or make more as this only runs on the regular hours

However, if the temperature is set too high, the model departs from the guidance of the original input and leads to less realistic and meaningful output:

# Too high a temperature leads to divergence in the meaning of the tokens
generated_tokens = generator.generate(encoded_text,
                                      max_length=200,
                                      do_sample=True,
                                      temperature=4.0)

print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))

We had dinner at this restaurant yesterday. It is very close to my house. All my friends were there, we had a great time. The location is excellent and the steaks were delicious. I will definitely return soon, highly recommended! It has the nicest ambagas of '98 that I like; most Mexican. And really nice steak house; amazing Mexican atmosphere to this very particular piece of house I just fell away before its due date, no surprise my 5yo one fell in right last July so it took forever at any number on it being 6 (with it taking two or sometimes 3 month), I really have found comfort/affability on many more restaurants when ordering. If you try at it they tell ya all about 2 and three places will NOT come out before they close them/curry. Also at home i would leave everything until 1 hour but sometimes wait two nights waiting for 2+ then when 2 times you leave you wait in until 6 in such that it works to
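Temperature is not the only sampling control offered by the transformers generate() method. The top_k and top_p parameters are also commonly used to limit sampling to the most likely tokens; the specific values below are illustrative choices and, since sampling is random, the output will differ on every run:

# top_k keeps only the k most likely next tokens at each step
# top_p (nucleus sampling) keeps the smallest set of tokens whose total probability exceeds p
generated_tokens = generator.generate(encoded_text,
                                      max_length=200,
                                      do_sample=True,
                                      temperature=1.0,
                                      top_k=50,
                                      top_p=0.9)

print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))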


Exercises

1. Read the sentences and tick ✓ True or False.
   1. Machine Learning-based NLG requires large amounts of training data and computational resources. (True / False)
   2. Verb could be a POS tag. (True / False)
   3. In template-based NLG, syntax analysis is used separately from POS tags. (True / False)
   4. Communities are node clusters that represent semantically different words. (True / False)
   5. The more Q/A levels are added to a chatbot's knowledge base, the smarter it gets. (True / False)

2. Compare the different approaches to Natural Language Generation (NLG).

3. State three different applications for NLG.


4. Complete the following code so that the build_graph() function accepts a given vocabulary of words and a trained Word2Vec model and returns a graph with one node for each word in the vocabulary. The graph should have an edge between two nodes if their similarity according to Word2Vec is higher than the given similarity_threshold. There should be no weights on the edges.

from _________ import combinations  # tool used to create combinations
import networkx as nx  # Python library for processing graphs

def build_graph(vocab: set,   # set of unique words
                model_wv,     # Word2Vec model
                similarity_threshold: float):
    pairs = combinations(vocab, ____)  # gets all possible pairs of words in the vocabulary
    G = nx._________()  # makes a new graph
    for w1, w2 in pairs:  # for every pair of words w1, w2
        sim = model_wv._________(w1, w2)  # gets the similarity between the two words
        if ____________________:
            G._________(w1, w2)
    return G


5. Complete the following code so that the function get_max_sim() uses a pre-trained SBERT model to compare a given sentence, my_sentence, with all the sentences in a given list of sentences, L1. The function should then return the sentence from L1 with the highest similarity score to my_sentence.

from sentence_transformers import _________, util
from _________ import combinations  # tool used to create combinations

model_sbert = _________('all-MiniLM-L6-v2')

def get_max_sim(L1, my_sentence):
    # embeds my_sentence
    my_embedding = model_sbert._________([my_sentence], convert_to_tensor=True)
    # embeds the sentences from L1
    L_embeddings = model_sbert._________(L1, convert_to_tensor=True)
    similarity_scores = _________.cos_sim(_________, _________)
    winner_index = np.argmax(similarity_scores[0])
    return _________
