What the heck is BERT in Machine Learning?
A new transfer learning technique called BERT (short for Bidirectional Encoder Representations from Transformers) is making big waves in the NLP space. BERT excels at handling “context-heavy” language problems. For example…
Bats are found in dark places.
Cricket bats are going high-tech these days.
Earlier context-free models (like word2vec or GloVe) generated a single embedding for each word in the vocabulary, which means the word “bats” would have the same representation in “Bats are found in dark places” and “Cricket bats are going high-tech these days”. In other words, that single vector has to capture information about the mammal as well as everything to do with the game of cricket. BERT, by contrast, reads the entire sentence and represents “bats” using both the words before it and the words after it, which is what makes it bidirectional.
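Here is a minimal sketch of that difference, assuming the Hugging Face transformers and torch packages and the public bert-base-uncased checkpoint. It pulls out BERT’s vector for “bats” in each sentence; a static word2vec/GloVe embedding would make the two vectors identical, while BERT’s differ because the surrounding words differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bats_vector(sentence):
    # Run the sentence through BERT and grab the contextual embedding for "bats".
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bats")
    return outputs.last_hidden_state[0, idx]

vec_animal = bats_vector("Bats are found in dark places.")
vec_cricket = bats_vector("Cricket bats are going high-tech these days.")

# With a context-free embedding this similarity would be exactly 1.0.
similarity = torch.cosine_similarity(vec_animal, vec_cricket, dim=0)
print(f"cosine similarity between the two 'bats' vectors: {similarity:.3f}")
```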
So How Does BERT Work?
BERT takes a completely different approach to learning. At training time, BERT is given billions of sentences and asked to predict a random selection of words that have been masked out of them. After working through this corpus of text several times over, BERT develops a very good understanding of how sentences are formed, which is why it excels at dealing with homonyms like “bats.”
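You can poke at this masked-word objective directly. The sketch below assumes the Hugging Face transformers package and the public bert-base-uncased checkpoint; the example sentence is just an illustration.

```python
from transformers import pipeline

# BERT predicts the hidden word from the context on both sides of the mask.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The [MASK] flew out of the cave at dusk."):
    print(f"{prediction['token_str']:>10}  (score: {prediction['score']:.3f})")
```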
BERT is open source and is pre-trained on a large corpus of unlabelled text, including all of English Wikipedia (about 2,500 million words) and a book corpus (800 million words), and the whole pre-trained model is available to you out of the box. This makes it a great asset for building models! It means you can achieve state-of-the-art accuracy, or accuracy comparable to older algorithms, with roughly a tenth of the labelled data.
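In practice, “building on BERT” means loading the pre-trained weights and fine-tuning them on your own small labelled dataset. The sketch below assumes the Hugging Face transformers and datasets packages; the IMDB dataset, the 2,000-example slice, and the hyperparameters are illustrative placeholders, not a prescribed recipe.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A small labelled dataset is often enough because BERT is already pre-trained.
dataset = load_dataset("imdb", split="train[:2000]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

# Reuse the pre-trained encoder and add a fresh classification head on top.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-sentiment",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()
```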
Google says the change is massive: BERT affects roughly 10% of all search queries, and the company has called it its biggest step forward for search in the past five years and one of the biggest steps forward in the history of search altogether.