How BERT Works and Why It Helps Code-Mixed Text

Authors
  • Laksmita Widya Astuti

Summary

In Part 1, we covered the NLP landscape, why deep learning models can be expensive to build and tune, and why pretrained language models are often a practical shortcut. Part 2 gets more technical: how BERT works (a conceptual, white-box view), what it learns during pretraining, and why this architecture is a strong fit for code-mixed text (for example, Indonesian-English).

1) From Static Word Embeddings to Contextual Embeddings

Classic representations such as Word2Vec or GloVe produce a single, fixed vector for each word. The problem is that meaning depends on context.

Example:

  • "bank" in "bank transfer" vs "bank" in "sit on a park bench"

BERT produces contextual embeddings: the vector for a token is computed based on the surrounding tokens, so the same word can have different representations in different sentences.
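
As a small illustration, the sketch below (assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, both placeholders for whichever model you use) extracts the vector for "bank" from two different sentences; because the contexts differ, the cosine similarity comes out noticeably below 1.0.

  import torch
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  model = AutoModel.from_pretrained("bert-base-uncased")

  def bank_vector(sentence):
      # Return the contextual embedding of the token "bank" in this sentence.
      enc = tokenizer(sentence, return_tensors="pt")
      with torch.no_grad():
          hidden = model(**enc).last_hidden_state[0]             # (seq_len, hidden_size)
      tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
      return hidden[tokens.index("bank")]

  v_finance = bank_vector("I made a bank transfer this morning.")
  v_river = bank_vector("We sat on the bank of the river.")
  print(torch.cosine_similarity(v_finance, v_river, dim=0))      # < 1.0: same word, different vector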

2) Transformer - The Core Engine Behind BERT

BERT is built on the Transformer Encoder architecture.

Key components:

  • Tokenization (commonly WordPiece)
  • Embeddings (token, position, and segment embeddings; see the tokenizer sketch after this list)
  • Self-attention (each token can "look at" other tokens)
  • Feed-forward network (non-linear transformation per token)
  • Residual connections + LayerNorm (training stability)
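
To make the first two components concrete, the sketch below (assuming the Hugging Face transformers library and the bert-base-uncased vocabulary) encodes a sentence pair and prints the WordPiece tokens and segment ids; position embeddings do not appear in this output because BERT adds them internally based on each token's index.

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

  # Encode a sentence pair: [CLS] sentence A [SEP] sentence B [SEP]
  enc = tokenizer("I sent the money yesterday.", "It has not arrived yet.")

  print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))  # WordPiece tokens
  print(enc["token_type_ids"])  # 0 for sentence A, 1 for sentence B (segment embeddings)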

Self-attention intuition

When processing a token, the model computes how much each of the other tokens matters for understanding it, and then mixes their information into that token's representation according to those weights.
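
The toy sketch below uses random NumPy matrices as stand-ins for the learned projections (not BERT's actual parameters) just to show the mechanics: each token scores every token, the scores are softmax-normalized, and the output for each token is a weighted mix of all value vectors.

  import numpy as np

  rng = np.random.default_rng(0)
  X = rng.normal(size=(4, 8))                                # 4 tokens, toy embedding dim 8
  Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))   # stand-ins for learned projections

  Q, K, V = X @ Wq, X @ Wk, X @ Wv
  scores = Q @ K.T / np.sqrt(K.shape[-1])                    # relevance of token j to token i
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)             # softmax: each row sums to 1
  output = weights @ V                                       # context-mixed representation per token

  print(weights.round(2))                                    # row i: attention of token i over all tokens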

3) What Does BERT Learn During Pretraining?

BERT is trained on very large corpora using self-supervised objectives (no manual labels needed).

3.1 Masked Language Modeling (MLM)

A portion of the input tokens (about 15% in the original setup) is masked, and the model is trained to predict the missing tokens from the surrounding context.

Simple example:

I like drinking [MASK] every morning.

Why this matters: BERT learns syntactic and semantic relationships bidirectionally, not only left-to-right.
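
You can try this directly with the fill-mask pipeline from the Hugging Face transformers library (a sketch, assuming the bert-base-uncased checkpoint):

  from transformers import pipeline

  fill_mask = pipeline("fill-mask", model="bert-base-uncased")

  # The model ranks candidate tokens for the [MASK] slot using both left and right context.
  for pred in fill_mask("I like drinking [MASK] every morning."):
      print(pred["token_str"], round(pred["score"], 3))
  # Plausible fillers such as "coffee" or "tea" should appear near the top.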

3.2 Next Sentence Prediction (NSP) (historical)

In the original BERT formulation, a second task predicts whether sentence B is the real continuation of sentence A.

Note: Many modern variants drop NSP entirely (for example, RoBERTa) or replace it with a related objective (ALBERT uses sentence-order prediction), but the original intent was to help with sentence-to-sentence coherence.
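
For completeness, here is a sketch of the original NSP head via the transformers library; in this model class, label 0 means "sentence B follows sentence A" and label 1 means it does not.

  import torch
  from transformers import BertForNextSentencePrediction, BertTokenizer

  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

  enc = tokenizer("I went to a cafe this morning.", "I ordered an iced latte.", return_tensors="pt")
  with torch.no_grad():
      logits = model(**enc).logits
  print(logits.softmax(dim=-1))  # [p(B follows A), p(B is random)]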

4) Why BERT Works Well for Code-Mixed Text

Code-mixed Indonesian-English text is challenging because it often includes:

  • mixed vocabulary and grammar
  • language switching within a single sentence
  • informal spelling, abbreviations, and social media style

BERT helps because:

  • It is contextual: token meaning is shaped by surrounding context, not a fixed dictionary vector
  • Subword tokenization (WordPiece): unknown words can be decomposed into smaller pieces, as shown in the sketch after this list
  • Multilingual representations (for certain models): the embedding space can include multiple languages
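
To see the subword effect on code-mixed input, the sketch below runs the WordPiece tokenizer of mBERT (bert-base-multilingual-cased; the exact splits will vary by model) on an informal Indonesian-English sentence:

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

  # Informal, code-mixed input: rare or slangy words are split into '##'-prefixed
  # WordPiece pieces instead of collapsing into a single [UNK] token.
  text = "Besok aku mau meeting sama client, wish me luck ya"
  print(tokenizer.tokenize(text))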

However, model choice still matters:

  • mBERT: more general across languages
  • IndoBERTweet / BERTweet: often stronger for Twitter-like text, slang, and informal spelling

5) Fine-tuning - Adapting BERT to a Specific Task

After pretraining, we typically fine-tune BERT for a downstream task, such as:

  • sentiment analysis
  • intent classification
  • NER (Named Entity Recognition)
  • QA (Question Answering)

Common approach:

  1. Add a task head (for example, a classification layer)
  2. Train on labeled data
  3. Use a small learning rate (BERT can be sensitive to learning rate)
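
A minimal fine-tuning sketch with the transformers library (the checkpoint name, label set, and example sentence are placeholders, not a prescribed setup):

  import torch
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  model_name = "bert-base-multilingual-cased"            # placeholder checkpoint
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
  model.train()                                          # enable dropout for training

  # Small learning rate: BERT fine-tuning commonly uses values around 2e-5 to 5e-5.
  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

  batch = tokenizer(["gokil, filmnya bagus banget!"], padding=True, return_tensors="pt")
  labels = torch.tensor([2])                             # e.g. 0 = negative, 1 = neutral, 2 = positive
  loss = model(**batch, labels=labels).loss              # cross-entropy from the classification head
  loss.backward()
  optimizer.step()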

6) A Minimal Text Classification Pipeline (High-Level)

A typical workflow looks like this:

  1. Light text cleaning (avoid overly aggressive normalization; context matters)
  2. Tokenize (WordPiece)
  3. Feed tokens into BERT
  4. Use a pooled representation (commonly the [CLS] token embedding) as input to a classifier head
  5. Fine-tune end-to-end on labeled data

Note: Some implementations use mean pooling over all token embeddings instead of [CLS].
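
A sketch of steps 2 to 4 with the transformers library (bert-base-uncased as a placeholder encoder; the classifier head here is an untrained linear layer, shown only to make the shapes concrete):

  import torch
  from transformers import AutoModel, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder checkpoint
  encoder = AutoModel.from_pretrained("bert-base-uncased")
  classifier = torch.nn.Linear(encoder.config.hidden_size, 2)      # task head (to be fine-tuned)

  batch = tokenizer(["the movie was surprisingly good"], padding=True, return_tensors="pt")
  hidden = encoder(**batch).last_hidden_state                      # (batch, seq_len, hidden_size)

  cls_vec = hidden[:, 0]                 # [CLS] pooling
  mean_vec = hidden.mean(dim=1)          # mean pooling alternative (mask out padding in real use)
  print(classifier(cls_vec))             # logits; in fine-tuning, train encoder + head end-to-end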

Closing

BERT is not "magic". Its strength comes from the Transformer encoder plus large-scale pretraining objectives that produce high-quality language representations. For code-mixed text, model choice (mBERT vs Twitter-domain models) and data quality (labeling consistency, coverage, and balance) often determine the final performance.


If you are interested in a code-mixed dataset for English-Indonesian with sentiment labels, you can check out my dataset here: indonglish-dataset


This is part 2 of a series exploring BERT and its applications in multilingual NLP contexts.