Understanding the Architecture
By Laksmita Widya Astuti
For text classification, BERT uses the Transformer encoder portion of the architecture. At a high level, the model can be understood as three stacked stages:
- an input layer (embeddings)
- a sequence of self-attention encoder layers
- a classification head
In many applied settings, BERT Base is used as the starting point and then adapted (fine-tuned) to match a specific domain.
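To make the overall flow concrete, here is a minimal sketch that loads a pre-trained BERT Base encoder and attaches a classification head using the Hugging Face transformers library. The multilingual checkpoint name and the three-label sentiment setup are illustrative assumptions, not part of the original description.
```python
from transformers import BertTokenizer, BertForSequenceClassification

# Load a BERT Base checkpoint and attach a fresh classification head
# (3 labels here, e.g. negative / neutral / positive -- an assumption for illustration).
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3
)

# Tokenize a sentence, run it through the encoder stack, and read the logits.
inputs = tokenizer("saya suka film ini", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, 3)
```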
1) BERT Base vs the Original Transformer Configuration
Compared to the original Transformer configuration described by Vaswani et al. (2017), BERT uses a larger encoder stack.
BERT comes in two common sizes:
- BERT Base
- BERT Large
Key differences (typical configurations):
| Configuration | BERT Base | BERT Large |
|---|---|---|
| Hidden size / embedding size | 768 | 1024 |
| Number of attention heads | 12 | 16 |
| Number of encoder layers | 12 | 24 |
These are larger than the base configuration described in the original Transformer paper (6 encoder layers, a hidden size of 512, and 8 attention heads).
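As a quick illustration, the same configurations can be written down with the BertConfig class from the Hugging Face transformers library (the parameter names below are the library's; the values come from the table above):
```python
from transformers import BertConfig

# BERT Base: 12 layers, hidden size 768, 12 attention heads
base_config = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12)

# BERT Large: 24 layers, hidden size 1024, 16 attention heads
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24, num_attention_heads=16)

print(base_config.num_hidden_layers, large_config.num_hidden_layers)  # 12 24
```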
2) Encoder Input: Token Embeddings and Padding
The Transformer encoder receives the input as a sequence of embedding vectors.
A common way to represent the input is a matrix of shape (maximum sequence length × hidden size).
For example, you might use a maximum sequence length of 128 tokens. Sentences are tokenized and then padded (or truncated) to 128 tokens so all inputs have a consistent length, giving a 128 × 768 input matrix for BERT Base.
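A short sketch of this padding step, assuming the Hugging Face tokenizer and an illustrative multilingual checkpoint:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# Pad (or truncate) every sentence to a fixed length of 128 tokens.
encoded = tokenizer(
    ["saya suka film ini", "film ini buruk sekali"],
    padding="max_length",   # pad shorter sequences with [PAD] tokens
    truncation=True,        # cut longer sequences down to max_length
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)       # torch.Size([2, 128])
print(encoded["attention_mask"].shape)  # torch.Size([2, 128])
```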
3) Mapping Tokens into the Vocabulary Embedding Matrix
After tokenization, token IDs are mapped into a vocabulary embedding lookup table.
A typical shape for this embedding matrix is (vocabulary size × embedding dimension).
For BERT Base, this is commonly 30,000 × 768, where:
- 30,000 is the (approximate) vocabulary size
- 768 is the embedding dimension for each token
4) Example: How the Lookup Works
Consider a simple example where the first token is "saya" ("I").
If "saya" corresponds to some token ID (for example, ID = 1), then the embedding vector for that token is taken from the corresponding row in the embedding matrix.
In other words, the model retrieves a 768-dimensional vector: row 1 of the 30,000 × 768 embedding matrix.
This embedding becomes part of the encoder input.
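A minimal sketch of this lookup in PyTorch, with the matrix sizes from the text and an illustrative token ID for "saya":
```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 30_000, 768
embedding = nn.Embedding(vocab_size, hidden_size)  # the 30,000 x 768 lookup table

# Assume the tokenizer mapped "saya" to ID 1 (an illustrative ID, not BERT's real one).
token_id = torch.tensor([1])
vector = embedding(token_id)  # retrieves row 1 of the embedding matrix
print(vector.shape)           # torch.Size([1, 768])
```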
5) Multi-Head Self-Attention (Conceptual Steps)
Next, BERT computes multi-head self-attention.
For BERT Base:
- hidden size = 768
- number of heads = 12
- per-head dimension = 768 / 12 = 64
Attention Mechanism
In the attention mechanism, the input embeddings are projected into three matrices:
- Q (Query)
- K (Key)
- V (Value)
A simplified attention computation:
- Compute the dot product $QK^\top$.
- Scale by $1/\sqrt{d_k}$, where $d_k = 64$, so the scale factor is $1/\sqrt{64} = 1/8$.
- Apply softmax to obtain attention weights.
- Multiply by $V$ to produce the attention output.
The attention formula can be expressed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

With a max sequence length of 128, the attention weights form a matrix of size 128 × 128 per head.
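The same computation for a single head can be sketched directly from the formula (this is the formula in code, not BERT's actual implementation; the dimensions follow the text: sequence length 128, per-head dimension 64):
```python
import math
import torch
import torch.nn.functional as F

seq_len, d_k = 128, 64
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)    # (128, 128) scaled dot products
weights = F.softmax(scores, dim=-1)  # (128, 128) attention weights
output = weights @ V                 # (128, 64) attention output for this head
print(weights.shape, output.shape)
```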
6) The [CLS] Representation and the Classification Head
This attention-and-feed-forward computation is repeated across encoder layers.
For classification tasks, the final hidden representation of the first special token, [CLS], is used as a single "summary" vector for the whole sequence.
Many BERT implementations then apply a pooler:
- a linear transformation
- followed by a tanh activation
This pooled vector becomes the input to the classification layer.
Visual Flow:
Input Tokens -> Embeddings -> Encoder Layers -> [CLS] Token -> Pooler -> Classifier
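A minimal sketch of the pooler and classifier stages (the layer sizes follow BERT Base; the three-label head and the random encoder output are illustrative assumptions):
```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 3
pooler = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())  # linear + tanh
classifier = nn.Linear(hidden_size, num_labels)

encoder_output = torch.randn(1, 128, hidden_size)  # (batch, seq_len, hidden), stand-in for the encoder
cls_hidden = encoder_output[:, 0, :]               # hidden state of the first token, [CLS]
logits = classifier(pooler(cls_hidden))            # (1, 3)
print(logits.shape)
```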
7) Probabilities, Loss, and Model Selection (Sentiment Analysis Example)
For a sentiment analysis setup, the pooled output is passed to a classifier to produce probabilities (or logits), and training uses a loss function.
A common configuration includes:
- sigmoid for producing probabilities (often used for binary or multi-label setups)
- binary cross-entropy loss (BCE / BCELoss)
The model outputs logits that are converted into probabilities and then into a predicted label, for example negative, neutral, or positive. (Sigmoid with BCE treats each label independently, as in binary or multi-label setups; for three mutually exclusive classes, softmax with categorical cross-entropy is the more typical pairing.)
During training, the loss is used to:
- validate and select the best checkpoint
- detect whether the model is overfitting, underfitting, or fitting well
Loss Function
For binary classification, the binary cross-entropy loss is:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\,\right]$$

Where:
- $y_i$ is the true label
- $\hat{y}_i$ is the predicted probability
- $N$ is the number of samples
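In PyTorch this corresponds to BCELoss applied to sigmoid probabilities (or, equivalently and more numerically stable, BCEWithLogitsLoss applied to the raw logits); the logits and labels below are made-up numbers for illustration:
```python
import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2, 2.3])  # raw classifier outputs (illustrative)
labels = torch.tensor([1.0, 0.0, 1.0])   # true labels y_i

probs = torch.sigmoid(logits)            # predicted probabilities
loss = nn.BCELoss()(probs, labels)       # -(1/N) * sum[ y log p + (1-y) log(1-p) ]

# Equivalent, but numerically more stable:
loss_logits = nn.BCEWithLogitsLoss()(logits, labels)
print(loss.item(), loss_logits.item())
```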
Summary
BERT's architecture for text classification consists of:
- Token embeddings that map vocabulary to dense vectors
- Multi-head self-attention that captures contextual relationships
- [CLS] token pooling that summarizes the sequence
- Classification head that produces final predictions
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
This is part 2 of a series exploring BERT and its applications in multilingual NLP contexts.