ZEMOSO ENGINEERING STUDIO
August 1, 2022
12 min read

How to use BERT and DNN to build smarter NLP algorithms for products

Text Classifiers (BERT & DNN)

Text classification, also known as text categorization, is one of the most important parts of text analysis and an important task in Natural Language Processing (NLP). Some of the most common examples are sentiment analysis, topic detection, language detection, and intent detection. We'll look at how to use Bidirectional Encoder Representations from Transformers (BERT) and Deep Neural Network (DNN) models to build text classifiers. Each has a different architecture for classifying text into categories. Let’s start with an introduction to each:

1. DNN

A DNN is a collection of neurons organized in a sequence of multiple layers, where neurons in one layer receive activations from the previous layer and perform a simple mathematical computation (e.g. a weighted sum of the inputs followed by a nonlinear activation). The neurons of the network jointly implement a complex nonlinear mapping from the input to the output. The parameters of this mapping are learned using a technique called error backpropagation.

Reference multiple-layer neural network, DNN architecture
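To make that computation concrete, here is a tiny illustrative sketch (not part of the original post) of a single dense layer performing a weighted sum of its inputs followed by a nonlinear activation:

```python
import numpy as np

def dense_layer(x, weights, bias):
    # Weighted sum of the incoming activations ...
    z = weights @ x + bias
    # ... followed by a nonlinear activation (ReLU here)
    return np.maximum(z, 0.0)

x = np.array([0.5, -1.2, 3.0])    # activations from the previous layer
w = np.random.randn(4, 3) * 0.1   # 4 neurons, each with 3 incoming weights
b = np.zeros(4)
print(dense_layer(x, w, b))       # activations passed on to the next layer
```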

2. BERT

BERT is an open-sourced NLP pre-training model developed by researchers at Google in 2018. It’s built on pre-training contextual representations. BERT is different from earlier models: it’s the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. And since it’s open-sourced, anyone with machine learning (ML) knowledge can easily build an NLP model.

How BERT works

BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed implementation of the Transformer is described in a paper by Google.

As opposed to directional models, which read the input text sequentially (left-to-right or right-to-left), the Transformer encoder used in BERT reads the entire input sequence at once. This characteristic allows the model to learn the context of a word based on all of its surroundings.
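As a small illustration of that behavior (a sketch assuming the Hugging Face transformers package and the bert-base-uncased checkpoint, which is not necessarily what the original post used), a pre-trained BERT encoder takes a whole sentence at once and returns one context-dependent vector per token:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding per token, each conditioned on the full left and right context.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```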

Comparison 

The purpose is to find out which model works better on datasets of different sizes for the classification task. Here, the classification task is sentiment analysis. The models we are going to compare are BERT and a DNN. Since BERT and the Universal Sentence Encoder are pre-trained models, they will provide the embeddings of our text input for the classifiers.

  1. Dataset description

For sentiment analysis, we will use the Large Movie Review Dataset v1.0. In this data, each review is labeled as positive or negative. We will create three training sets from it, with 100, 1,000, and 10,000 samples. The test set will be 10% of the corresponding training set, giving a small test set (10), a medium test set (100), and a large test set (1,000).

  2. Build classifiers

Let’s create DNN and BERT classifiers and compare their results.

a) DNN classifier using Universal Sentence Encoder

Data loading and splitting it into train and test datasets
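A minimal sketch of what this data preparation might look like, assuming the tensorflow_datasets copy of the Large Movie Review Dataset ("imdb_reviews"); the original snippet may have loaded the raw files directly:

```python
import numpy as np
import tensorflow_datasets as tfds

def make_split(train_size, seed=42):
    """Create one (train, test) split; the test set is 10% of the training size."""
    test_size = train_size // 10
    ds = tfds.load("imdb_reviews", split="train", as_supervised=True)
    ds = ds.shuffle(25_000, seed=seed)
    examples = list(tfds.as_numpy(ds.take(train_size + test_size)))
    texts = np.array([t.decode("utf-8") for t, _ in examples])
    labels = np.array([l for _, l in examples])
    return (texts[:train_size], labels[:train_size],
            texts[train_size:], labels[train_size:])

# small: 100/10, medium: 1,000/100, large: 10,000/1,000
train_texts, train_labels, test_texts, test_labels = make_split(1_000)
```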

Training part of the DNN classifier using the Universal Sentence Encoder
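A minimal sketch of such a classifier, assuming TensorFlow 2.x and the Universal Sentence Encoder v4 from TF Hub; the layer sizes and hyperparameters here are illustrative, not necessarily those used in the original snippet:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Frozen 512-dimensional sentence embeddings from the Universal Sentence Encoder
use_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    input_shape=[], dtype=tf.string, trainable=False)

model = tf.keras.Sequential([
    use_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_texts / train_labels come from the data-loading sketch above
model.fit(train_texts, train_labels, epochs=10, batch_size=32, validation_split=0.1)
```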

Predictions on the test set
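A sketch of the prediction step, reusing the variable names from the snippets above:

```python
import numpy as np

probs = model.predict(test_texts)            # probability of the positive class
preds = (probs > 0.5).astype(int).ravel()    # threshold at 0.5 -> 0 (neg) / 1 (pos)
accuracy = np.mean(preds == test_labels)
print(f"Test accuracy: {accuracy:.2%}")
```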

b) BERT classifier

In this model, we will use pre-trained BERT embeddings for the classifier.

BERT training code snippet
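A minimal sketch of a comparable setup, assuming the TF Hub BERT encoder (bert_en_uncased_L-12_H-768_A-12) and its matching preprocessing model; the original snippet may have used a different checkpoint or fine-tuning configuration:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops needed by the BERT preprocessor)

preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)

text_in = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_out = encoder(preprocess(text_in))
pooled = encoder_out["pooled_output"]                 # [CLS] sentence embedding
output = tf.keras.layers.Dense(1, activation="sigmoid")(
    tf.keras.layers.Dropout(0.1)(pooled))

bert_model = tf.keras.Model(text_in, output)
bert_model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                   loss="binary_crossentropy", metrics=["accuracy"])
bert_model.fit(train_texts, train_labels, epochs=3, batch_size=16)
```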

Results

After building the classifiers, we compute evaluation metrics for these models on the test sets (confusion matrix, precision, recall, and F1 score).
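For reference, a minimal sketch of how such metrics can be computed, assuming scikit-learn and the prediction variables from the snippets above (the original code may have computed them differently):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# test_labels and preds come from the prediction sketch above
print(confusion_matrix(test_labels, preds))
print("precision:", precision_score(test_labels, preds))
print("recall:   ", recall_score(test_labels, preds))
print("F1:       ", f1_score(test_labels, preds))
```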

a) Confusion matrix

Confusion matrix for both models for all three test sets

b) Accuracy: 

Overall accuracy of both models on different-sized test sets

c) Precision, Recall, F1 score:

Precision, Recall, and F1 score of each model on all three datasets

The classifier built with BERT outperformed the DNN model. Both models performed equally well in terms of accuracy on the small test set. On the medium test set (100), BERT beat the DNN by 7% accuracy, while on the large test set (1,000) it was 1% better. Precision, recall, and F1 are important metrics for model evaluation beyond accuracy, since we have to account for the false-positive and false-negative counts in the predictions. In terms of precision, BERT is better on the medium test set, while on the small and large test sets both models give the same precision. For the large and small test sets, all three metrics (precision, recall, and F1 score) are about the same, but on the medium test set the recall and F1 score of the BERT classifier are higher (recall = 0.14 and F1 = 0.7) than those of the DNN classifier. Let’s look at an example to see how well BERT performs if we train on a small dataset and test on a large test set. For this experiment, we had 400 training samples and 600 test samples, and here are the results:

Accuracy, precision, recall, and F1 for both models, trained on a small dataset

DNN and BERT accuracy comparison

BERT also performed very well compared to the DNN in situations where we have little data for training. Across accuracy, precision, recall, and F1 score, BERT fared better: it crossed 82% accuracy, while the DNN stayed at 80% or less. We can infer that BERT performs well even with a smaller dataset.

Conclusion

The BERT classifier is better than the DNN classifier built with the Universal Sentence Encoder. There are two major reasons behind BERT’s better performance. First, BERT is pre-trained on a large volume of unlabelled text, which includes the entire Wikipedia (about 2,500 million words) and a book corpus (800 million words). Second, traditional context-free models (like word2vec or GloVe) generate a single word embedding representation for each word in the vocabulary, which means the word “right” would have the same context-free representation in “I’ll make the payment right away” and “Take a right turn”. BERT, however, represents a word based on both its previous and next context, making it bidirectional.
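As a small illustration of that difference (a sketch that is not from the original post, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint), we can compare BERT’s vectors for the word “right” in those two sentences:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    tokens = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state[0]
    idx = tokenizer.tokenize(sentence).index(word) + 1  # +1 skips the [CLS] token
    return hidden[idx]

v1 = embedding_of("i'll make the payment right away", "right")
v2 = embedding_of("take a right turn", "right")

# A context-free model would give identical vectors (similarity 1.0);
# BERT's contextual vectors for "right" differ noticeably across contexts.
print(torch.cosine_similarity(v1, v2, dim=0))
```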

An earlier version of this blog was published on Medium by the author.
