ZEMOSO ENGINEERING STUDIO
February 14, 2020
12 min read

How to use BERT and DNN to build smarter NLP algorithms for products

Text Classifiers (BERT & DNN)

Text classification, also known as text categorization, is one of the most important parts of text analysis and a core task in Natural Language Processing (NLP). Some of the most common examples are sentiment analysis, topic detection, language detection, and intent detection. We'll look at how to use Bidirectional Encoder Representations from Transformers (BERT) and Deep Neural Network (DNN) models to build text classifiers. The two take different architectural approaches to classifying text into categories. Let's start with an introduction to each:

1. DNN

A DNN is a collection of neurons organized in a sequence of multiple layers, where neurons in one layer receive activations from the previous layer and perform a simple mathematical computation (e.g., a weighted sum of the inputs followed by a nonlinear activation). The neurons of the network jointly implement a complex nonlinear mapping from the input to the output. This mapping is learned from data using a technique called error backpropagation.

Reference multiple-layer neural network, DNN architecture
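
To make the idea concrete, here is a minimal sketch of such a feed-forward network in tf.keras (an illustrative assumption, not the original figure's code). The 512-dimensional input matches the Universal Sentence Encoder embeddings used later, and the layer sizes are arbitrary.

```python
import tensorflow as tf

# Illustrative DNN: each Dense layer computes a weighted sum of its inputs
# followed by a nonlinear activation, as described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(512,)),             # sentence embedding as input
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])

# Error backpropagation is what happens under the hood when the compiled
# model is later fit with gradient descent.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```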

2. BERT

BERT is an open-sourced NLP pre-training model developed by researchers at Google in 2018. It's built on pre-training contextual representations. BERT differs from earlier models in that it is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain-text corpus. And since it's open-sourced, anyone with machine learning (ML) knowledge can easily build an NLP model.

How BERT works

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. The detailed implementation of the Transformer is described in a paper by Google.

As opposed to directional models, which read the input text sequentially (left-to-right or right-to-left), the Transformer encoder used in BERT reads the entire input sequence at once. This characteristic allows the model to learn the context of a word based on all of its surroundings.
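
As a quick illustration of that behavior (a sketch using the Hugging Face transformers library, not the tooling from the original snippets), the encoder consumes the whole sentence at once and returns one contextual vector per token:

```python
import torch
from transformers import BertModel, BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer("The movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    outputs = bert_model(**inputs)

# One embedding per token, each conditioned on the full left and right context.
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```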

Comparison 

The purpose here is to find out which model works better on datasets of different sizes for a classification task. The classification task is sentiment analysis, and the models we will compare are BERT and a DNN. Since the BERT model and the Universal Sentence Encoder are pre-trained models, they will provide the embeddings of our text input for the classifiers.

  1. Dataset description

For sentiment analysis, we will use the Large Movie Review Dataset v1.0. In this data, each user review is labeled as positive or negative. We will create three datasets from it, with training sizes of 100, 1,000, and 10,000 reviews. Each test set will be 10% of its training dataset, giving a small test set (10), a medium test set (100), and a large test set (1,000).

  2. Build classifiers

Let’s create DNN and BERT classifiers and compare their results.

a) DNN classifier using Universal Sentence Encoder

Data loading and splitting it into train dataset and test dataset
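
In case the snippet above doesn't render, here is a rough sketch of what this step could look like: reading the aclImdb reviews from disk and carving out one of the train/test splits described earlier. The 1,000-review "medium" set with a 10% test split is used as the example; the path and sampling details are assumptions.

```python
import os
import random
from sklearn.model_selection import train_test_split

def load_reviews(split_dir):
    """Read positive and negative reviews from the aclImdb directory layout."""
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(split_dir, label_name)
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                texts.append(f.read())
            labels.append(label)
    return texts, labels

texts, labels = load_reviews("aclImdb/train")

# Sample 1,000 reviews for the "medium" experiment; hold out 10% as the test set.
random.seed(42)
sample = random.sample(list(zip(texts, labels)), 1000)
X, y = zip(*sample)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)
```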

Training part of DNN classifier using Universal Sentence Encoder
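
A hedged sketch of the training step: the pre-trained Universal Sentence Encoder from TF Hub turns each review into a 512-dimensional embedding, which is then fed to the small Keras DNN defined earlier. The module URL, epochs, and batch size are illustrative.

```python
import numpy as np
import tensorflow_hub as hub

# Pre-trained Universal Sentence Encoder; it maps each string to a 512-d vector.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

train_embeddings = embed(list(X_train)).numpy()   # shape: (n_samples, 512)
model.fit(train_embeddings, np.array(y_train), epochs=10, batch_size=32)
```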

Predictions on test set
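
And the corresponding prediction step on the held-out reviews, again a sketch built on the objects from the previous snippets:

```python
test_embeddings = embed(list(X_test)).numpy()
pred_probs = model.predict(test_embeddings)          # probabilities in [0, 1]
dnn_preds = (pred_probs > 0.5).astype(int).ravel()   # threshold at 0.5
```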

b) BERT classifier

In this model, we will use pre-trained BERT embeddings for the classifier.

BERT training code snippet
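
A hedged sketch of this step (not the original snippet): pre-trained BERT supplies a sentence embedding, here the [CLS] token vector from the Hugging Face model loaded earlier, and a simple classifier is trained on top of it. The choice of logistic regression and the batching parameters are assumptions for illustration.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def bert_embed(texts, batch_size=16):
    """Return one [CLS] embedding per text using the pre-trained BERT model."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = bert_tokenizer(list(texts[i:i + batch_size]), padding=True,
                               truncation=True, max_length=256,
                               return_tensors="pt")
        with torch.no_grad():
            out = bert_model(**batch)
        vectors.append(out.last_hidden_state[:, 0, :].numpy())  # [CLS] vectors
    return np.vstack(vectors)

bert_clf = LogisticRegression(max_iter=1000).fit(bert_embed(X_train), y_train)
bert_preds = bert_clf.predict(bert_embed(X_test))
```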

Results

After building the classifiers, we compute evaluation metrics for both models on the test sets: the confusion matrix, accuracy, precision, recall, and F1 score.
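
These metrics can be computed with scikit-learn; a short sketch, assuming the prediction arrays from the snippets above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

for name, preds in (("DNN", dnn_preds), ("BERT", bert_preds)):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, preds, average="binary")
    print(name, "accuracy:", accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```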

a) Confusion matrix

Confusion matrix for both models for all three test sets

b) Accuracy: 

Overall accuracy of both models on different-sized test sets

c) Precision, Recall, F1 score:

Precision, Recall, and F1 score of each model on all three datasets

The classifier built with BERT outperformed the DNN model. Both models performed equally in terms of accuracy on the small test set. On the medium test set (100), BERT beats the DNN by 7% accuracy, while on the large test set (1,000) it is 1% better. Precision, recall, and F1 are important metrics for model evaluation beyond accuracy, as we have to account for the false-positive and false-negative counts in the predictions. In terms of precision, BERT is better on the medium test set, while on the small and large test sets both models give the same precision. For the small and large test sets, all three metrics (precision, recall, and F1 score) are about the same, but on the medium test set the recall and F1 score of the BERT classifier are higher (recall = 0.14 and F1 = 0.7) than those of the DNN classifier. Let's run an example to find out how well BERT performs if we train on a small dataset and test on a large test set. For this experiment, we had 400 training samples and 600 test samples, and here are the results:

Accuracy, precision, recall, and F1 for both models, trained on small dataset

DNN and BERT accuracy comparison

BERT also performed very well compared to the DNN when only a small amount of training data is available. In terms of accuracy as well as precision, recall, and F1 score, BERT fared better. BERT crossed 82% accuracy, while the DNN stayed at 80% or less. We can infer that BERT performs well even with a smaller dataset.

Conclusion

The BERT classifier is better than the DNN classifier built with the Universal Sentence Encoder. There are two major reasons behind BERT's better performance. First, BERT is pre-trained on a large volume of unlabeled text, which includes the entire Wikipedia (about 2,500 million words) and a book corpus (800 million words). Second, traditional context-free models (like word2vec or GloVe) generate a single word-embedding representation for each word in the vocabulary, which means the word "right" would have the same context-free representation in "I'll make the payment right away" and "Take a right turn". BERT, however, represents the word based on both its previous and next context, making it bidirectional.
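
A quick way to see this (an illustrative check, not part of the original experiments) is to compare the BERT vectors for "right" in the two sentences: a context-free model would return the identical vector both times, while BERT's contextual vectors differ noticeably.

```python
import torch

def word_vector(sentence, word):
    """Return BERT's contextual embedding for a given word in a sentence."""
    enc = bert_tokenizer(sentence, return_tensors="pt")
    tokens = bert_tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    idx = tokens.index(word)
    with torch.no_grad():
        hidden = bert_model(**enc).last_hidden_state[0]
    return hidden[idx]

v1 = word_vector("I'll make the payment right away", "right")
v2 = word_vector("Take a right turn", "right")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0
```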

An earlier version of this blog was published on Medium by the author.
