Sailor2: Advancing Inclusive Multilingual Large Language Models for Southeast Asia


Overview of Sailor2 and Its Objectives

Sailor2 is a pioneering family of multilingual large language models (LLMs) specifically crafted for Southeast Asian (SEA) languages. Developed by a consortium of researchers led by Longxu Dou, Qian Liu, and Fan Zhou, and supported by organizations such as Sea AI Lab, SCB10X, WiseSight, Hugging Face, and others, Sailor2 stands out for its commitment to underrepresented languages in a region of remarkable linguistic diversity. The models are available in 1B, 8B, and 20B parameter scales and are released under the highly permissive Apache 2.0 license, ensuring open access for both research and commercial endeavors.

The principal objective of Sailor2 is to democratize access to powerful language technologies across Southeast Asia by supporting 13+ SEA languages, in addition to English and Chinese. The initiative targets both high-resource and endangered or low-resource languages, aiming to fill a longstanding gap in AI inclusivity, where global NLP research has focused almost exclusively on dominant world languages. The project’s ethos, “Serving the Underserved in Southeast Asia with Open LLMs,” is deeply ingrained in its transparent methodology, open-source tools, and detailed documentation, collectively termed the Sailor2 Cookbook.

Sailor2's technological ambition is not merely to provide translation or basic language support, but to bring advanced generative and reasoning capabilities to the region. It does so by combining massive data curation, sophisticated continual pre-training, careful model expansion, robust instruction/preference tuning, and domain-targeted evaluation. Results indicate that the flagship Sailor2-20B model achieves performance on par with GPT-4o across SEA languages—a significant milestone for open-source multilingual LLMs.


Key Contributors and Collaborations

The Sailor2 project is an extensive collaboration, involving not just the lead authors but a broad alliance of academic and industrial partners across continents. Chief contributors include Longxu Dou, Qian Liu (project leader), and Fan Zhou, all deeply involved in SEA language technology. Notable institutional partnerships span Sea AI Lab, SCB10X, WiseSight, Hugging Face, National University of Singapore (NUS), Nanyang Technological University (NTU), the University of Hong Kong (HKU), ABAKA AI, Peafowl.ai, Michigan State University, New York University, Umeå University, PyThaiNLP, HCMUT, City University, among others. Such global participation gives the project both technical depth and domain expertise in the languages involved.

Furthermore, the Sailor2 community ethos is evident in the open-source approach—toolkits, codebases, checkpoints, benchmark datasets, and detailed methodological cookbooks are all publicly accessible, inviting broad participation and scrutiny from the research and developer ecosystem.


Supported Southeast Asian Languages

At its core, Sailor2 supports a total of 13 major Southeast Asian languages, each representing a significant population and a complex socio-linguistic context. These languages, covered through continual pre-training on roughly 400 billion SEA-specific tokens, are listed below alongside basic metadata.

Table 1: Southeast Asian Languages Supported by Sailor2

| Language | ISO Code | Country/Region | Speakers (approx.) |
| --- | --- | --- | --- |
| Indonesian | ind | Indonesia | 268 million |
| Vietnamese | vie | Vietnam | 96 million |
| Javanese | jav | Indonesia (Java island) | 82 million |
| Thai | tha | Thailand | 70 million |
| Burmese | mya | Myanmar | 54 million |
| Sundanese | sun | Indonesia (West Java) | 42 million |
| Malay | zsm | Malaysia, Brunei, Singapore | 33 million |
| Tagalog | tgl | Philippines (Luzon) | 28 million |
| Cebuano | ceb | Philippines (Cebu, Mindanao) | 21 million |
| Khmer | khm | Cambodia | 16 million |
| Ilocano | ilo | Philippines (Northern Luzon) | 8 million |
| Lao | lao | Laos | 7 million |
| Waray | war | Philippines (Eastern Visayas) | 3 million |

Note: In some descriptions, additional languages (such as Minangkabau, Min Nan, Acehnese) or regional dialects are occasionally referenced (e.g., in dataset entries and expansion to 14–16 SEA languages for certain benchmarks), reflecting community-driven expansion.

Analytical Context:

This language selection results from a deliberate emphasis on both the broadest-reach regional lingua francas and the highly underserved, lower-resource languages. Such coverage ensures linguistic equity and augments research for languages with (historically) minimal digital presence. The robust representation of Philippine languages (Tagalog, Cebuano, Ilocano, Waray) and Indonesian languages (Javanese, Sundanese) particularly highlights Sailor2’s attention to both population coverage and regional diversity.


Data Curation Techniques

Corpus Collection and Cleansing

Sailor2’s data curation pipeline is a model of transparency and rigor. Data is primarily harvested from extensive web crawling (96 CommonCrawl snapshots from 2013 through early 2024), large-scale PDF parsing, and community-submitted data. Southeast Asian data is sourced independently and supplemented by partnerships, with special attention to indigenous media, government documents, social media, law, science, news, literature, and educational content. The total raw disk size per language is substantial, reflecting the scale of the project (e.g., Vietnamese: 1.9TB; Indonesian: 1.3TB).

Table 2: Example Raw Data Sizes per Language

| Language | Raw Data Size |
| --- | --- |
| Vietnamese | 1.9 TB |
| Indonesian | 1.3 TB |
| Thai | 242 GB |
| Malay | 44 GB |
| Burmese | 25.8 GB |
| Tagalog | 17.5 GB |
| Khmer | 6.9 GB |
| Others (Cebuano, Lao, Javanese, Waray, Sundanese, Ilocano) | 0.2–2.1 GB each |

All data undergoes a multi-stage cleaning process employing the SailCraft toolkit, an open-source pipeline specializing in multi-lingual, multi-format text processing. The data cleaning pipeline comprises:

  • Initial cleaning: Filtering out noisy, malformed, or corrupt files.
  • Near and exact deduplication: To eliminate redundant documents and boilerplate text.
  • URL deduplication: When multiple documents share a URL, the longer, more substantive one is kept, cutting nearly 50% of tokens.
  • Frequent line removal: Following Llama 3 practice, lines that appear more than five times within a bucket of 10 million documents are removed, excising up to 5% of non-contentful tokens (see the sketch after this list).
  • Heuristics-based filtering: Custom language- and domain-specific rules tune the dataset for quality.
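
The SailCraft implementation is considerably more involved, but the frequent-line filter described above can be sketched in a few lines of Python. The bucket size and threshold restate the description above; the function name and structure are illustrative only.

from collections import Counter

def remove_frequent_lines(bucket_of_documents, max_occurrences=5):
    # Count each distinct line once per document so that cross-document boilerplate
    # (headers, footers, navigation text) dominates the counts.
    line_counts = Counter()
    for doc in bucket_of_documents:
        line_counts.update(set(doc.splitlines()))

    # Keep a line only if it is rare across the bucket (or is pure whitespace).
    cleaned = []
    for doc in bucket_of_documents:
        kept = [line for line in doc.splitlines()
                if line_counts[line] <= max_occurrences or not line.strip()]
        cleaned.append("\n".join(kept))
    return cleaned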

Synthetic Data Curation

Where languages are critically low-resource or where internet-derived content is linguistically unbalanced, Sailor2 employs synthetic data generation. The NLLB-3.3B model translates high-quality English materials into SEA languages, and a FastText classifier is trained for each language on 10,000 positive and 10,000 negative samples to grade the translations. Only the top 10–20% of the synthesized data, as ranked by classifier score, is retained for model pre-training. Sources for translation include Cosmopedia, MADLAD, and UltraChat.
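
As a rough illustration of this classifier-based filtering step, the sketch below trains a per-language fastText quality classifier and keeps the top-scoring slice of translated documents. The training-file path, label names, and the 15% retention cut-off are assumptions for the example; only the 10,000-positive / 10,000-negative setup and the 10–20% retention range come from the description above.

import fasttext

# Training file (path is hypothetical): one example per line in fastText format, e.g.
#   __label__keep <high-quality reference text in the target language>
#   __label__drop <noisy or low-quality text>
# built from 10,000 positive and 10,000 negative samples for the language.
clf = fasttext.train_supervised(input="tha_quality_train.txt", epoch=10, wordNgrams=2)

def quality_score(text):
    labels, probs = clf.predict(text.replace("\n", " "))
    return probs[0] if labels[0] == "__label__keep" else 1.0 - probs[0]

# `translated_docs` would hold NLLB-translated documents for one SEA language.
translated_docs = ["...", "..."]

# Retain only the highest-scoring 10-20% (15% here) for Stage-2 pre-training.
ranked = sorted(translated_docs, key=quality_score, reverse=True)
kept = ranked[: int(0.15 * len(ranked))]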

Composite Data Mixture Optimization

RegMix, a data-mixture optimization method that fits a regression model over the outcomes of many small (1M-parameter) proxy models, is employed to balance the representation of individual languages, ensuring that low-resource languages are upsampled rather than drowned out by the more abundant sources (Indonesian, Vietnamese).
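
RegMix itself trains many 1M-parameter proxy models on different mixtures and fits a regression model to predict loss from mixture weights; the sketch below illustrates that core loop with synthetic numbers and plain linear regression (the released toolkit uses a stronger regressor), so it is a conceptual sketch rather than the official pipeline.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Each row: sampling weights over 5 languages used to train one tiny proxy model;
# proxy_losses: the validation loss each proxy run achieved (placeholder values here).
mixtures = rng.dirichlet(np.ones(5), size=64)
proxy_losses = rng.normal(loc=2.0, scale=0.1, size=64)

reg = LinearRegression().fit(mixtures, proxy_losses)

# Score a large pool of candidate mixtures and keep the one with the lowest
# predicted loss; the full-scale model is then trained on that mixture.
candidates = rng.dirichlet(np.ones(5), size=100_000)
best_mixture = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", best_mixture.round(3))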


Model Expansion Strategies

A distinctive technical innovation of Sailor2 is its approach to model scaling. Rather than starting from scratch or simply fine-tuning, Sailor2 leverages "block-expansion"—inspired by LlamaPro. The Qwen2.5 base models (0.5B, 7B, 14B parameters) are expanded to new sizes (1B, 8B, and 20B respectively) prior to continual pre-training. This approach ensures that:

  • The SEA knowledge introduced by further pre-training is stored in the new model layers, reducing "catastrophic forgetting" of English and Chinese.
  • The model capacity is sufficient to accommodate the rich linguistic features of SEA languages.
  • Downstream adaptation and checkpoint compatibility are preserved.

Such architectural expansion, paired with targeted continual pre-training, leads to significant performance enhancements on SEA language tasks without sacrificing general language ability.
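
For intuition, a LLaMA Pro-style block expansion can be sketched against a Qwen2/Llama-style decoder in Hugging Face Transformers: copied blocks are interleaved with the originals and their output projections are zero-initialized, so the expanded network starts out functionally identical to the base model. The model name, insertion interval, and attribute paths (model.model.layers, self_attn.o_proj, mlp.down_proj) are assumptions for illustration, not the exact Sailor2 expansion script.

import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", torch_dtype=torch.bfloat16)
layers = model.model.layers          # ModuleList of decoder blocks
expand_every = 4                     # insert one copied block after every 4 original blocks

expanded = []
for i, layer in enumerate(layers):
    expanded.append(layer)
    if (i + 1) % expand_every == 0:
        new_block = copy.deepcopy(layer)
        # Zero the attention and MLP output projections so the new block contributes
        # nothing to the residual stream at initialization (identity behavior).
        torch.nn.init.zeros_(new_block.self_attn.o_proj.weight)
        torch.nn.init.zeros_(new_block.mlp.down_proj.weight)
        expanded.append(new_block)

model.model.layers = torch.nn.ModuleList(expanded)
model.config.num_hidden_layers = len(expanded)   # layer_idx bookkeeping omitted for brevity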


Continual Pre-Training Methodology

Two-Stage Pre-Training Process

Stage 1: Balanced Multilingual Data (450B tokens)

  • High learning rate (1e-4), large batch size.
  • Mix optimization using RegMix to achieve a balanced token spread across major languages.
  • Languages included: Vietnamese (102B tokens), Indonesian (94B), Thai (92B), English (51B), Chinese (50B), Burmese (23.5B), Malay (21B), Tagalog (10B), Khmer (6.5B).
  • Data from English and Chinese anchors cross-lingual transfer, while substantial SEA-language content brings in language-specific nuances.

Stage 2: High-Quality SEA Data (60B tokens)

  • Lower learning rate (1e-5), smaller batch size, more epochs.
  • Focus: Quality-over-quantity, specifically targeting low-resource and under-performing languages with synthetic translation and classifier filtering.
  • Languages include: Vietnamese (10.9B HQ tokens), Indonesian (12.8B), Thai (13.9B), Burmese (2.8B), Malay (1.3B), Tagalog (2.2B), Khmer (0.9B), with high-quality smaller sets for Waray, Ilocano, Javanese, Lao, Cebuano, Sundanese.

This dual-stage process delivers models with broad, balanced, and deep proficiency across the entire SEA language set, tailored to the linguistic characteristics and available digital footprints of each language.
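
The two stages can be summarized as a pair of training configurations. The numbers restate the figures above; the field names, and any value not stated in the text (such as exact batch sizes), are illustrative placeholders.

continual_pretraining_stages = {
    "stage_1_balanced": {
        "tokens": 450e9,           # ~450B tokens, mixture balanced with RegMix
        "learning_rate": 1e-4,     # high learning rate, large batch size
        "languages": ["vie", "ind", "tha", "eng", "zho", "mya", "zsm", "tgl", "khm"],
    },
    "stage_2_high_quality": {
        "tokens": 60e9,            # ~60B high-quality and synthetic tokens
        "learning_rate": 1e-5,     # low learning rate, smaller batches, more epochs
        "focus": "low-resource and under-performing SEA languages",
    },
}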

Token Allocation Table (Balanced Data Example):

| Language | Stage 1 Tokens | Stage 2 HQ Tokens |
| --- | --- | --- |
| Vietnamese | 102B | 10.9B |
| Indonesian | 94B | 12.8B |
| Thai | 92B | 13.9B |
| Others | ... | ... |

Innovations:

  • Replay tokens: 100B tokens from earlier stages/models to preserve prior learning.
  • Language Mixture Annealing: Gradual enrichment of low-resource language presence in training batches as the pre-training progresses.
  • Up-sampling and down-sampling: Dynamic adjustment during mixture optimization in response to initial evaluation and corpus size.

This methodology has proven to reduce token degeneration and enhance model stability, especially for languages with volatile or sparse data sources.
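
One plausible reading of language mixture annealing is a gradual interpolation from a stage's initial sampling weights toward a target mixture that gives low-resource languages more weight as training progresses. The sketch below is an illustrative interpretation with made-up weights, not the published schedule.

import numpy as np

def anneal_mixture(start_weights, end_weights, progress):
    # progress: fraction of the stage completed, in [0, 1].
    w = (1.0 - progress) * np.asarray(start_weights) + progress * np.asarray(end_weights)
    return w / w.sum()   # renormalize so the weights remain a valid distribution

# Example: shift sampling mass from high-resource languages (first entries)
# toward low-resource ones (last entries) over the course of the stage.
start = [0.30, 0.28, 0.25, 0.10, 0.07]
end = [0.24, 0.22, 0.20, 0.17, 0.17]
print(anneal_mixture(start, end, progress=0.5))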


Instruction and Preference Tuning

Post-training, Sailor2 undergoes two forms of targeted tuning: supervised instruction tuning, and advanced preference alignment using human and model feedback.

Supervised Fine-Tuning (SFT):

  • Stage 1: Large-batch, single-epoch exposure to domain-diverse, medium-quality datasets (4M examples), covering English, Chinese, and 16 SEA languages.
  • Stage 2: Small-batch, multi-epoch training on high-reward, high-quality exemplars (400K examples), selected using a hybrid metric (the harmonic mean of reward score and perplexity; a sketch of this selection metric follows this list).
  • Data sources: SEA-UltraChat (4.8M multilingual examples), generated from UltraChat via careful translation and multi-turn prompt engineering, with GPT-4o serving as one of the judges.
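
One plausible formulation of the Stage-2 selection metric is sketched below: reward and perplexity are each mapped so that higher is better, then combined with a harmonic mean, which penalizes examples that are weak on either signal. The normalization choices (a reward cap and a simple 1/perplexity inversion) are assumptions, not the exact Sailor2 recipe.

def selection_score(reward, perplexity, reward_max=10.0):
    # reward: reward-model score, assumed to lie in [0, reward_max];
    # perplexity: language-model perplexity of the candidate (>= 1, lower is better).
    r = min(max(reward / reward_max, 1e-6), 1.0)   # normalize reward to (0, 1]
    p = min(1.0 / perplexity, 1.0)                 # invert perplexity so higher is better
    return 2 * r * p / (r + p)                     # harmonic mean of the two signals

# Rank candidate exemplars and keep the top slice (e.g. the 400K Stage-2 examples).
examples = [{"text": "...", "reward": 7.2, "ppl": 3.1},
            {"text": "...", "reward": 4.0, "ppl": 9.8}]
ranked = sorted(examples, key=lambda e: selection_score(e["reward"], e["ppl"]), reverse=True)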

Preference Tuning:

  • Off-policy tuning: Uses outputs from powerful external models like Llama-3-8B-Instruct, filtered for language-consistency and relevance.
  • On-policy tuning: Model-generated responses are scored via the Skywork-Reward-Gemma2-27B, with careful selection of 'chosen' vs 'rejected' responses for training.
  • Algorithmic approaches: DPO (Direct Preference Optimization) and length-regularized variants; experiments show DPO yields high training stability (a minimal DPO sketch follows this list).
  • Language-consistency Verifiers: FastText classifiers ensure response language matches prompt language (except in designated translation tasks), increasing user trust and reliability.
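
For reference, the core DPO objective compares how much more the policy prefers the chosen response over the rejected one relative to a frozen reference model. A minimal PyTorch sketch is below; sequence log-probabilities are assumed to be pre-computed sums over response tokens, and the length-regularized variant mentioned above would add a length-dependent term.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is a tensor of summed log-probabilities of a response given its
    # prompt, under the trainable policy or the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the chosen-vs-rejected margin beyond the reference's.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()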

Distinctive Features:

  • Embedding-based similarity filters prevent overfitting and maintain diversity during instruction tuning.
  • Preference alignment is two-staged—initial broad off-policy learning, then model-centric on-policy iteration—to maximize transferability and specialization.

Model Customization Features

Long-Context Training:

Sailor2 supports context windows of up to 128K tokens (via AnchorAttention), allowing it to handle lengthy inputs (legal, scientific, code, etc.) that exceed the limits of many open LLMs. This yields strong results on the RULER benchmark and facilitates applications requiring document-scale processing.

Speculative Decoding:

Integration of a one-layer GliDe draft model for speculative decoding yields 2–2.5x decoding speedups. Languages whose scripts tokenize into many tokens per word, such as Burmese, derive particularly large efficiency benefits.
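
Sailor2's GliDe draft model is wired into its own serving stack, but the general draft-then-verify pattern can be tried with Hugging Face assisted generation, using any much smaller model that shares the tokenizer as the drafter. The model pairing below is illustrative, and the actual GliDe integration differs from this sketch.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sail/Sailor2-20B-Chat")
target = AutoModelForCausalLM.from_pretrained("sail/Sailor2-20B-Chat", device_map="auto")
# A small model with the same tokenizer acts as the draft model for speculation.
draft = AutoModelForCausalLM.from_pretrained("sail/Sailor2-1B-Chat", device_map="auto")

# Burmese prompt ("Hello"); the draft model proposes several tokens at a time and the
# target model verifies them in a single pass, which is the source of the speedup.
inputs = tokenizer("မင်္ဂလာပါ", return_tensors="pt").to(target.device)
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))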

Model Pruning:

Sheared LLaMA-inspired pruning condenses larger models into compact forms (e.g., 20B down to 14B and 8B down to 3B) with minimal performance loss. Such model variants offer options for deployment in memory-constrained environments without severe accuracy compromises.

Integration and API Support:

Direct compatibility with Hugging Face Transformers (≥v4.46.3). Dedicated sample scripts and API endpoints enable both community tinkering and production deployment. A limited free API service is available via Float16.cloud, providing accessible experimentation and proof-of-concept applications for all users.

Sample Customization Code:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer; device_map="auto" places weights on available GPUs.
model = AutoModelForCausalLM.from_pretrained("sail/Sailor2-20B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("sail/Sailor2-20B")

# Indonesian prompt: "A language model is a probabilistic model"
input_message = "Model bahasa adalah model probabilistik"
model_inputs = tokenizer([input_message], return_tensors="pt").to(model.device)

# Generate a continuation of up to 64 new tokens and decode it back to text.
generated_ids = model.generate(model_inputs.input_ids, max_new_tokens=64)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

A bash example for API usage via Float16.cloud is also well-documented.


Evaluation Benchmarks and Performance

Sailor2’s performance is evaluated on an expansive suite of benchmarks, covering both standard NLP tasks and region-specific challenges.

Key Evaluation Benchmarks:

  • SailCompass: A custom few-shot base model evaluator, encompassing tasks from TydiQA, M3Exam, FLORES-200, XCOPA.
  • SEA-WildBench: Chat-focused multilingual evaluation, adapted from WildBench and constructed with translation by GPT-4o.
  • CultureBench, BLEnD, IndoCulture, Tatabahasa, VMLU, Global MMLU, Meta Thai MMLU, FLORES-200, BELEBELE, XQuAD. These cover cultural, knowledge, translation, reading comprehension, and reasoning abilities in multiple SEA languages.

Highlight Performance Results—Sailor2-20B (vs State of the Art):

| Model | SEA-WildBench (SWB) Score | Win Rate vs GPT-4o | M3Exam-Javanese (score) | Size |
| --- | --- | --- | --- | --- |
| Sailor2-20B | 0.56 | ~50% | +14.6 vs Qwen2.5-32B | 20B |
| Sailor2-8B | 0.49 | 45% | | 8B |
| Qwen2.5-72B | 0.45 | 40% | | 72B |
| Llama3.1-70B | 0.30 | 30% | | 70B |

Notably, the flagship Sailor2-20B-Chat model achieves a near 50% win rate against GPT-4o on SEA-WildBench, reflecting parity with GPT-4o, a widely used reference point for LLM quality, in local chat scenarios involving SEA languages. On the M3Exam-Javanese benchmark (testing a truly low-resource language), Sailor2 surpasses Qwen2.5-32B by +14.6 points, underscoring the payoff of its inclusive model design.

Downstream Task Scores:

| Model | BELEBELE (tha) | XCOPA (vie) | XQuAD (vie) | TydiQA (vie) | M3Exam (vie) |
| --- | --- | --- | --- | --- | --- |
| Sailor2-20B | 47.44 | 83.6 | 62.02 / 82.05 | 67.77 | 74.46 |

These results underscore Sailor2’s strong quality in both high-resource (Vietnamese, Thai, Indonesian, Malay) and low-resource SEA languages. Its smaller variants (Sailor2-8B, Sailor2-3B) are also best-in-class among open models of comparable size.

Sailor2 also demonstrates robust performance on cultural understanding, creative writing, translation, and extended reasoning tasks, all critical for real-life chat and assistance deployments across the region.


Table: Performance Metrics—Sailor2 vs GPT-4o

| Model | SEA-WildBench (SWB) Score | Win Rate vs GPT-4o |
| --- | --- | --- |
| Sailor2-20B-Chat | 0.56 | ~50% |
| Sailor2-8B-Chat | 0.49 | 45% |
| Qwen2.5-72B | 0.45 | 40% |
| Llama3.1-70B | 0.30 | 30% |

Elaborating on this, the 20B Sailor2 model demonstrates not only strong absolute accuracy on downstream tasks but also practical advantages: fewer hallucinations, better length control, and greater cultural and contextual awareness, compared with larger models such as Llama3.1-70B.


Training Infrastructure and Resources

Sailor2’s infrastructure blends robust hardware utilization with innovative engineering:

  • Pre-training libraries: Customized version of Megatron-LM, enabling Zero Bubble Pipeline Parallelism for high-throughput multi-GPU training.
  • Vocabulary management: Strategic transformer layer redistribution to accommodate large language-specific vocabularies within practical memory constraints.
  • Attention optimization: AnchorAttention for long-context support; FlashAttention for speed.
  • Speculative decoding: GliDe draft-model integration for faster inference.
  • Framework compatibility: All models, data, and tooling integrate with standard Hugging Face Transformers. The Oat framework provides scalable, reproducible post-training and alignment experiments.
  • Open infrastructure: All checkpoints (pre-annealing, base, SFT, chat) are available on Hugging Face, with clear documentation for replication or further fine-tuning.

Pre-training Datasets Examples:

  • sailor2-pretrain-data-stage1 (450B tokens)
  • sailor2-pretrain-data-stage2 (60B tokens, HQ)
  • community-dataset (collaborative curation)
  • sea-commoncrawl (web sources)
  • sea-internet, sea-pdf-text, sea-synthetic (translated and processed open data)

This open, modular infrastructure supports both regional research and global LLM experimentation, lowering the barrier to entry for multilingual AI development.


Open Source Tools and Cookbook

The Sailor2 project stands out for its openness and documentation. All major technical components are released under Apache 2.0:

  • SailCraft: End-to-end data cleaning toolkit, addressing the unique needs of diverse languages and formats.
  • RegMix: Language mixture optimizer for balanced corpus creation.
  • Megatron-Sailor2: Modifications for scaling, parallelism, and model expansion in pre-training.
  • Oat: Highly flexible post-training and alignment toolkit; supports DPO and custom loss functions.
  • SailCompass: Custom evaluation for few-shot and zero-shot regional tasks.
  • SEA-WildBench: A curated multilingual challenge suite for chat-focused evaluation.
  • Cookbook documentation: The Sailor2 Cookbook provides transparent recipes and lessons learned for every stage—from raw data to chat deployment.

The project also offers API examples and community channels for collaboration, making it a model for reproducibility in NLP research.


Licensing and Accessibility

Sailor2 is distributed under the Apache 2.0 license, one of the most permissive open-source licenses available. This ensures:

  • Freedom to use, modify, and distribute for research, education, and commercial applications.
  • No restrictive proprietary encumbrance, directly supporting AI adoption in resource-limited settings.
  • All models, codebases, datasets, tools, and documentation are available via GitHub, Hugging Face, and linked project websites.

Additionally, sample code, integration guidance, APIs, and Jupyter notebook examples support users at all stages of technical proficiency. A free, limited API is offered via Float16.cloud, democratizing access to advanced LLM capabilities for individuals and organizations in the region.


Implications for Inclusive Multilingual AI

The launch and success of Sailor2 have deep and far-reaching implications for global AI fairness and technological inclusiveness.

Key Contributions:

  • Closing the AI gap for underserved languages: By directly targeting SEA’s most spoken (and most marginalized) languages, Sailor2 sets a new bar for representation and capability outside dominant world languages.
  • Democratizing high-performance LLMs: Through truly open licensing, transparent recipes, and rich resource sharing, both local researchers and global developers gain the tools to innovate on top of state-of-the-art architectures.
  • Advancing language preservation and cultural knowledge: Sailor2 models excel not just in mechanical translation but in tasks requiring deep cultural awareness, creative reasoning, and nuanced dialog—preserving and valorizing regional knowledge.
  • Scalable techniques for future low-resource AI: The iterative recipe of model expansion, two-stage continual training, and feedback-driven alignment establishes a reusable pattern for extending LLM technology to other underrepresented language families.

These advances support inclusive technology policy, educational opportunities, digital governance, and research in linguistics, anthropology, and beyond. They also set a precedent for responsible, community-aware development in global AI.


Conclusion and Forward-Looking Perspective

Sailor2 embodies the convergence of cutting-edge AI engineering, community partnership, and open science principles. Its demonstrated performance parity with models like GPT-4o on SEA tasks, coupled with its unmatched accessibility and rich tool ecosystem, propels it to the forefront of multilingual and inclusive LLM research. As AI becomes ever more integral to public life, models like Sailor2 are indispensable not just for technical progress but for ethical, social, and cultural empowerment across the world’s most linguistically diverse regions.

As a blueprint, Sailor2 invites researchers worldwide to build their own inclusive LLMs, translating the advances made here to other low-resource languages and shaping a fairer, culturally attuned AI future.

