Natural Language Processing (NLP)
Learning Objectives
Tools
Using Supervised Learning to Understand Text
Machine Learning
Machine learning can be broadly categorized into three main types:
Table 3.1: Advantages and disadvantages of Machine Learning types
Supervised Learning
Regression
Classification
# load the training and testing data.
Data Preparation and Pre-Processing
Sklearn Library
CountVectorizer
# expand the sparse data into a dense matrix format, where each column represents a different word.
As expected, the sparse format requires far less memory,
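The memory difference can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the toy corpus is hypothetical and is built so that documents share no words, which makes nearly every cell of the document-term matrix zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 50 short documents that share no words,
# so almost every entry of the document-term matrix is zero.
docs = [f"word{i}a word{i}b word{i}c" for i in range(50)]

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(docs)  # SciPy sparse matrix (CSR format)
X_dense = X_sparse.toarray()               # NumPy array that stores every zero

# A CSR matrix stores only the non-zero values plus two index arrays.
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)
dense_bytes = X_dense.nbytes

print("shape:", X_dense.shape)
print("sparse bytes:", sparse_bytes)
print("dense bytes:", dense_bytes)
```

The dense array reserves one cell for every (document, word) pair, whereas the sparse format pays only for the cells that are actually non-zero.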
Build a Prediction Pipeline
The pipeline correctly predicts a positive and negative label
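A pipeline of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's exact code: the six training reviews and the two new reviews are hypothetical, and `make_pipeline` is used to chain a `CountVectorizer` with a Multinomial Naive Bayes classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: 1 = positive review, 0 = negative review.
train_texts = [
    "a great and wonderful film",
    "loved it, wonderful acting",
    "great story, loved the cast",
    "a terrible and awful film",
    "hated it, awful acting",
    "terrible story, hated the cast",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then classifies the word counts.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# Raw strings go in; the vectorization step is applied automatically.
new_reviews = ["wonderful film, loved it", "awful film, hated it"]
predictions = pipeline.predict(new_reviews)
print(predictions)
```

Bundling the vectorizer and the classifier means the same preprocessing is applied at training and prediction time, which avoids a common source of mismatched features.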
The confusion matrix contains the counts of actual vs.
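As a small illustration, the sketch below builds a confusion matrix with `sklearn.metrics.confusion_matrix`. The label lists are hypothetical; in scikit-learn's convention, rows correspond to actual labels and columns to predicted labels, both in sorted label order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # labels predicted by some classifier

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For binary labels sorted as [0, 1], the flattened cells are:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
print("accuracy:", accuracy)
```

Here three negatives and three positives are classified correctly and one of each is misclassified, so the accuracy is 6/8 = 0.75.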
Explaining Black-Box Predictors
LIME (Local Interpretable Model-Agnostic Explanations)
LIME is a method for explaining the predictions made by black-box
As expected, the predictor delivers a very confident negative prediction for this easy example.
A negative coefficient increases the probability of the negative class,
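To make the role of coefficient signs concrete, the sketch below imitates the core idea behind LIME without using the `lime` library. Everything in it is an assumption for illustration: `black_box_positive_proba` and `WORD_SCORES` are a hypothetical stand-in for a real classifier, and the surrogate is fitted on every on/off combination of words with ordinary least squares, whereas LIME itself samples perturbations at random and weights them by their proximity to the original text.

```python
import itertools

import numpy as np

# Hypothetical black box: returns P(positive) for a list of words.
WORD_SCORES = {"great": 2.0, "boring": -2.0, "plot": 0.0}


def black_box_positive_proba(words):
    score = sum(WORD_SCORES.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + np.exp(-score))


words = ["great", "boring", "plot"]

# Perturb the text: each row of masks keeps (1) or removes (0) each word.
masks = np.array(list(itertools.product([0, 1], repeat=len(words))), dtype=float)
probs = np.array([
    black_box_positive_proba([w for w, keep in zip(words, mask) if keep])
    for mask in masks
])

# Fit a linear surrogate: probs ~ intercept + masks @ coefficients.
design = np.column_stack([np.ones(len(masks)), masks])
solution, *_ = np.linalg.lstsq(design, probs, rcond=None)
coefficients = dict(zip(words, solution[1:]))

for word, weight in coefficients.items():
    print(f"{word}: {weight:+.3f}")
```

In this toy setting the coefficient for "boring" comes out negative (its presence lowers the positive-class probability), the coefficient for "great" is positive, and the neutral word "plot" gets a coefficient of essentially zero.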
# get the correct labels of this example.
Improving Text Vectorization
Detecting Phrases
\w matches all alphanumeric characters (a-z, A-Z, 0-9) and the underscore character.
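A short example with Python's built-in `re` module shows how `\w` behaves; the sample strings are hypothetical. Note that for Python 3 string patterns `\w` also matches non-ASCII letters by default, and the `re.ASCII` flag restricts it to exactly the character set described above.

```python
import re

text = "It's a 5_star film!"

# \w+ matches runs of letters, digits, and underscores;
# the apostrophe, spaces, and "!" act as separators.
tokens = re.findall(r"\w+", text)
print(tokens)  # ['It', 's', 'a', '5_star', 'film']

# By default \w is Unicode-aware; re.ASCII limits it to a-z, A-Z, 0-9, and _.
unicode_tokens = re.findall(r"\w+", "café")
ascii_tokens = re.findall(r"\w+", "café", flags=re.ASCII)
print(unicode_tokens)  # ['café']
print(ascii_tokens)    # ['caf']
```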
When applied to the two tokenized sentence examples shown above, this phrase model produces the following results:
# an example of an annotated document from the IMDb training data
Using TF-IDF for Text Vectorization
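As a rough illustration of what TF-IDF measures, the sketch below computes the textbook form of the score by hand: term frequency (how often a word occurs in one document) multiplied by inverse document frequency (how rare the word is across all documents). The three tokenized documents and the helper names `tf`, `idf`, and `tf_idf` are hypothetical. Scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical tokenized documents.
docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["great", "movie", "great"],
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to term."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency; assumes term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf_idf("movie", docs[0], docs))  # 0.0, because "movie" is in every document
print(tf_idf("great", docs[2], docs))  # (2/3) * ln(3), roughly 0.73
```

A word that appears in every document carries no discriminating information and scores zero, while a word that is frequent in one document but rare elsewhere scores highly.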
This new vectorizer can now be input to the same Naive Bayes Classifier to build a new predictive pipeline and apply it to the IMDb testing data:
The new pipeline confidently predicts the correct positive label for this review. The following code uses the LIME explainer to explain the logic behind this prediction:
Read the sentences and tick True or False.
Explain why the dense matrix format requires more memory than the sparse format.
Analyze how the two mathematical factors in TF-IDF are used to measure the importance of a word in a document.
You are given a NumPy array X_train_text that includes one document in each row.
Complete the following code so that it builds a LimeTextExplainer for the prediction