Why Pretrained Language Models Matter
Author: Laksmita Widya Astuti
Introduction
Natural Language Processing (NLP) is fascinating because it enables computers to understand everyday language. In modern practice, many NLP solutions rely on deep learning. However, deep learning is not always the most efficient answer: the architectures can be complex, they often require large amounts of data, and they come with many hyperparameters that take time to tune.
This article sets the context for why pretrained language models such as BERT are a strong practical approach, especially when:
- labeled data is limited
- compute is limited
- you need a model that can adapt quickly to a specific task
1) Why Deep Learning NLP Is Not Always "Enough"
Deep learning can deliver high performance, but it often comes with trade-offs:
- Data: large datasets are usually required for good generalization
- Complexity: models with many hidden layers are hard to interpret in a white-box way
- Tuning cost: experimentation can be expensive because there are many hyperparameters and configurations to explore
Because of this, improving performance does not always mean "use a bigger model". Sometimes it is more effective to improve the data or to start from a model that has already learned general language patterns.
2) Two Ways to Improve Performance: Data-Centric vs Model-Centric
At a high level, an AI system is built from two main elements:
- Code (the model)
- Data
There are two common improvement strategies.
2.1 Data-Centric AI
This approach focuses on making the data clearer and more consistent so the model learns the right signal.
Typical focus areas include:
- consistent labeling
- data management (versioning, auditing)
- slicing (understanding weak segments)
- augmentation
- data cleaning and curation
In many real-world settings, the model is relatively fixed, and most gains come from improving the data.
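To make the slicing idea concrete, the sketch below computes accuracy separately for each data segment so that weak slices stand out. The slice key, the example fields, and the predict function are illustrative assumptions, not part of any specific library.

```python
# Minimal sketch of slice-based evaluation: score the same model on each
# data segment separately to find where it is weakest.
from collections import defaultdict

def accuracy_by_slice(examples, predict):
    """examples: dicts with 'text', 'label', and a slice key 'source' (assumed fields)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        slice_name = ex["source"]  # e.g. "twitter" vs. "news"
        totals[slice_name] += 1
        hits[slice_name] += int(predict(ex["text"]) == ex["label"])
    return {name: hits[name] / totals[name] for name in totals}

# The slice with the lowest accuracy is where relabeling, cleaning,
# or augmentation is likely to pay off first.
```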
2.2 Model-Centric AI
This approach focuses on experimenting with model architectures and tuning.
It can be effective, but the experimentation cost can be high, especially when data and compute are limited.
3) A Problem I Find Interesting: Code-Mixed Text (Indonesian-English)
I became interested in code-mixing, which is the use of two or more languages within a single sentence or utterance. In Indonesia, a common example is Indonesian-English mixing in social media text.
Conceptually, code-mixing refers to using linguistic units such as words, phrases, or clauses from different languages within the same sentence. It is most often observed in informal language, especially on social platforms. For example, a tweet such as "besok kita meeting lagi ya, don't forget" switches between Indonesian and English within a single utterance.
The Challenge
A major challenge for Indonesian NLP in this area is data availability. Koto et al. (2020) noted that one key reason Indonesian NLP is less represented in research is the lack of annotated datasets, language resources, and standardization.
4) Why Pretrained Models Are a Strong Candidate
To handle code-mixed text, I considered several pretrained models, such as the following (a short loading sketch appears after the list):
- IndoBERTweet
- multilingual BERT (mBERT)
- BERTweet
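As a quick comparison, the sketch below loads each candidate with the Hugging Face transformers library and tokenizes one code-mixed sentence to see how each vocabulary splits it. The Hub checkpoint IDs are my assumptions; check the model cards before relying on them.

```python
# Sketch: load each candidate checkpoint and compare tokenizations of a
# code-mixed sentence. Hub IDs below are assumptions, not verified here.
from transformers import AutoModel, AutoTokenizer

CANDIDATES = [
    "indolem/indobertweet-base-uncased",  # IndoBERTweet (assumed Hub ID)
    "bert-base-multilingual-cased",       # multilingual BERT (mBERT)
    "vinai/bertweet-base",                # BERTweet (assumed Hub ID)
]

for checkpoint in CANDIDATES:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)  # encoder only, no task head yet
    print(checkpoint, tokenizer.tokenize("besok kita meeting lagi ya, don't forget"))
```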
4.1 What Is a Pretrained Model?
A pretrained model is a deep learning model trained on very large corpora (for example, Wikipedia or Twitter text) so that it learns general language patterns.
These models can then be fine-tuned for specific downstream tasks such as the following (a minimal fine-tuning sketch appears after the list):
- sentiment analysis
- question answering
- summarization
- intent classification
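To illustrate the fine-tuning step for one of these tasks (sentiment analysis), here is a minimal sketch using the transformers Trainer. The checkpoint ID, the two-example dataset, and the label scheme are illustrative assumptions standing in for a real annotated code-mixed corpus.

```python
# Minimal fine-tuning sketch: adapt a pretrained encoder to binary sentiment
# classification on a tiny, made-up code-mixed dataset.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "indolem/indobertweet-base-uncased"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Two illustrative examples; in practice this is your annotated corpus.
data = Dataset.from_dict({
    "text": ["filmnya bagus banget, so good!", "pelayanannya slow dan mengecewakan"],
    "label": [1, 0],
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                     padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()  # afterwards, evaluate on held-out code-mixed text
```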
Key Advantages
- Lower compute cost to get started
- Faster time to a strong baseline
- No need to train from scratch
This makes pretrained models particularly valuable when you have limited resources but need production-ready results quickly.
5) What Comes Next
In the next article, I will cover:
- how BERT works conceptually (a whitebox view)
- what BERT learns during pretraining
- why the architecture is relevant for code-mixed text
References
Koto, F., Rahimi, A., Lau, J. H., & Baldwin, T. (2020). IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. Proceedings of the 28th International Conference on Computational Linguistics, 757-770.
This is part 1 of a series exploring BERT and its applications in multilingual NLP contexts.