Text classification, also known as text categorization, is one of the most important parts of text analysis and a core task in Natural Language Processing (NLP). Common examples include sentiment analysis, topic detection, language detection, and intent detection. In this post, we'll see how to build text classifiers with Bidirectional Encoder Representations from Transformers (BERT) and with Deep Neural Networks (DNNs). The two approaches use different architectures to classify text into categories. Let's start with a short introduction to each:
A DNN is a collection of neurons organized in a sequence of multiple layers, where neurons in one layer receive activations from the previous layer and perform a simple mathematical computation (e.g. a weighted sum of the inputs followed by a nonlinear activation). Together, the neurons of the network implement a complex nonlinear mapping from the input to the output, and the parameters of this mapping are learned with a technique called error backpropagation.
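To make this concrete, here is a minimal sketch of such a network in Keras; the layer sizes and the 512-dimensional input are illustrative assumptions, not the exact configuration used later in this post.

```python
# A minimal DNN sketch: each Dense layer computes a weighted sum of its
# inputs followed by a nonlinear activation; training uses backpropagation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(512,)),  # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),                       # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),                     # output: positive/negative
])

# Compiling with a loss and an optimizer sets up error backpropagation:
# gradients of the loss flow backward through the layers to update the weights.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```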
BERT is an open-source NLP pre-training model developed by researchers at Google in 2018. It is built on pre-training contextual representations. BERT differs from earlier models in that it is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. And since it's open-source, anyone with machine learning (ML) knowledge can use it to build an NLP model.
How BERT works
BERT makes use of the Transformer, an attention-based architecture that learns contextual relations between words in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder mechanism is needed. The detailed implementation of the Transformer is described in a paper by Google.
As opposed to directional models, which read the input text sequentially (left-to-right or right-to-left), the Transformer encoder used in BERT reads the entire input sequence at once. This allows the model to learn the context of a word from all of its surroundings, both left and right.
The purpose of this experiment is to find out which model works better on different dataset sizes for a classification task; here, the task is sentiment analysis. The models we compare are BERT and a DNN. Since BERT and the Universal Sentence Encoder are pre-trained models, they provide the embeddings of the input text that feed the classifiers.
For sentiment analysis, we will use the Large Movie Review Dataset v1.0, in which each user review is labeled as positive or negative. We will create three training sets from it, with 100, 1,000, and 10,000 examples. Each test set will be 10% of the corresponding training set, giving a small test set (10), a medium test set (100), and a large test set (1,000).
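A sketch of how the three splits could be built, assuming the reviews have already been loaded into a pandas DataFrame `df` with `text` and `label` columns (the DataFrame name, column names, and random seed are assumptions for illustration):

```python
# Build three training sets (100, 1,000, 10,000 reviews) plus a test set
# equal to 10% of each training set, stratified by sentiment label.
from sklearn.model_selection import train_test_split

splits = {}
for size in (100, 1000, 10000):
    subset = df.sample(n=size + size // 10, random_state=42)  # train + 10% test
    train_df, test_df = train_test_split(
        subset, test_size=size // 10, stratify=subset["label"], random_state=42
    )
    splits[size] = (train_df, test_df)
```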
Let’s create DNN and BERT classifiers and compare their results.
a) DNN classifier using Universal Sentence Encoder
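A sketch of this classifier, using the pre-trained Universal Sentence Encoder from TensorFlow Hub to embed each review; the module URL, hidden-layer size, and training hyperparameters are assumptions for illustration rather than the exact configuration used in the experiment.

```python
# DNN classifier on top of Universal Sentence Encoder (USE) embeddings.
import tensorflow as tf
import tensorflow_hub as hub

# USE maps each raw text string to a 512-dimensional sentence embedding.
use_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    input_shape=[], dtype=tf.string, trainable=False,
)

dnn_model = tf.keras.Sequential([
    use_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])
dnn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Example usage with one of the splits created above:
# train_df, test_df = splits[1000]
# dnn_model.fit(train_df["text"], train_df["label"], epochs=10, batch_size=16)
```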
b) BERT classifier
In this model, we will use pre-trained BERT embeddings for the classifier.
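A sketch of this classifier, pairing a pre-trained BERT encoder from TensorFlow Hub with its matching preprocessing model; the post does not specify which BERT variant was used, so the module handles and the simple classification head below are assumptions for illustration.

```python
# BERT classifier: pre-trained BERT embeddings feeding a sigmoid output layer.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the BERT preprocessor

preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=False)

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_outputs = encoder(preprocess(text_input))
pooled = encoder_outputs["pooled_output"]          # one embedding per review
output = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)

bert_model = tf.keras.Model(text_input, output)
bert_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```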
Results
After building the classifiers, we compute evaluation metrics for these models on the test sets (confusion matrix, accuracy, precision, recall, F1); a sketch of how these can be computed follows the list below.
a) Confusion matrix
b) Accuracy
c) Precision, Recall, F1 score
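A sketch of computing these metrics with scikit-learn, assuming `model` is one of the trained classifiers above and `test_df` the matching test split (both names are placeholders from the earlier sketches):

```python
# Evaluate a trained classifier on a held-out test set.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = test_df["label"].values
# Sigmoid outputs are probabilities; threshold at 0.5 to get class labels.
y_pred = (model.predict(test_df["text"]) > 0.5).astype(int).ravel()

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```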
The classifier built with BERT outperformed the DNN model. Both models performed equally in terms of accuracy on the small test set. On the medium test set (100), BERT beats the DNN by 7% accuracy, while on the large test set (1,000) it is 1% better. Precision, recall, and F1 are important metrics for model evaluation beyond accuracy, as we also have to account for the false-positive and false-negative counts in the predictions. Regarding precision, BERT is better on the medium test set, while on the small and large test sets both models give the same precision. On the small and large test sets, all three metrics (precision, recall, and F1 score) are about the same for both models, but on the medium test set the recall and F1 score of the BERT classifier are higher than those of the DNN classifier (recall = 0.14 and F1 = 0.7). Let's run one more experiment to see how well BERT performs when we train on a small dataset and test on a large test set. For this experiment, we used 400 training samples and 600 test samples, and here are the results:
BERT also performed very well compared to the DNN when little training data is available. In terms of accuracy as well as precision, recall, and F1 score, BERT fared better: it crossed 82% accuracy, while the DNN stayed at 80% or less. We can infer that BERT performs well even with a smaller dataset.
The BERT classifier is better than the DNN classifier built with the Universal Sentence Encoder. There are two major reasons behind BERT's better performance. First, BERT is pre-trained on a large volume of unlabelled text, which includes the entire Wikipedia (about 2,500 million words) and a book corpus (800 million words). Second, traditional context-free models (like word2vec or GloVe) generate a single word-embedding representation for each word in the vocabulary, which means the word "right" would have the same context-free representation in "I'll make the payment right away" and "Take a right turn". BERT, however, represents each word based on both its previous and next context, making it bidirectional.
An earlier version of this blog was published on Medium by the author.