The Research Teams Behind Sailor2 Multilingual LLMs: Institutions, Contributors, and Collaborative Structure

Introduction

Sailor2 is a landmark in multilingual large language model (LLM) development, reaching parity with GPT-4o on Southeast Asian (SEA) languages while upholding a philosophy of linguistic inclusivity, transparency, and open-source access. Unlike most LLMs, which focus on major world languages, Sailor2 explicitly targets underrepresented SEA tongues—including Vietnamese, Thai, Javanese, Tagalog, and others—by mobilizing a diverse, multinational team across East and Southeast Asia. This research report provides a comprehensive exploration of the institutional, organizational, and philosophical underpinnings behind the Sailor2 project, charts the composition and operations of its research teams, and highlights the infrastructure, open-source platforms, evaluation tools, and future outlook that characterize its impact and ambitions.

This analysis synthesizes the most current information sourced across project blogs, official preprints, author metadata, open-source codebases, governance documentation, evaluation reports, and community tools. The report aims to satisfy the most demanding research standards with evidence-driven, richly contextualized coverage.


Project Genesis and Open-Source Ethos

Rationale and Vision

Large language models have, until recently, neglected many of the world's less-resourced languages, especially those across SEA, a region home to over 650 million people and spectacular linguistic diversity. The Sailor2 family emerges as a direct response to this gap, guided by a community-driven, open-access vision: “Serving the Underserved in Southeast Asia with Open LLMs”1.

Sailor2 aims to democratize access to advanced multilingual natural language technology and counterbalance the dominance of English and Chinese in AI, pioneering both high-performing language models (up to 20B parameters) and a standardized, transparent “cookbook” for LLM development. By rooting the effort in research, industry, and grassroots community input across the region, Sailor2 sets new standards for linguistic equity and open collaboration2.

Licensing and Open Science

Sailor2 is released under the Apache 2.0 license, removing restrictions on research and commercial use and promoting extensibility and wide adoption. All code, models, benchmarks, and tools are made publicly available through Hugging Face and GitHub, and project methodology, evaluation, and meta-knowledge are documented openly1,3.

This transparency not only enhances reproducibility and trust but also makes Sailor2’s “recipe” for multilingual LLMs accessible for further adaptation in other low-resource settings globally.
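
To illustrate this accessibility in practice, the sketch below loads a Sailor2 chat checkpoint through the Hugging Face transformers library. The repository ID shown is an assumption based on the project's naming conventions and should be verified against the Sailor2 organizations on Hugging Face.

```python
# Minimal sketch: loading a Sailor2 chat model from Hugging Face.
# The checkpoint ID below is assumed from the project's naming; check the
# sailor2 / sail organizations on Hugging Face for the exact repo name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-8B-Chat"  # assumed repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Xin chào! Bạn có thể giúp gì cho tôi?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```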


Institutional Affiliations and Collaborative Structure

A hallmark of Sailor2’s success is the exceptionally broad and multinational range of participating institutions, spanning leading research labs, universities, corporations, AI startups, and grassroots initiatives. These entities provide compute, human resources, funding, language data, evaluation expertise, and community bridges2,1.

Summary Table: Institutions, Countries, and Roles in Sailor2

Institution | Country | Role
Sea AI Lab | Singapore | Project leadership, core R&D, infrastructure
Shanghai Jiao Tong University (SJTU) | China | Core research, academic collaboration
Singapore Management University (SMU) | Singapore | Core research, academic collaboration
Hugging Face | United States/France | Platform partnership, model hosting, community tools
SCB 10X | Thailand | Corporate sponsorship, data curation, financial support
WiseSight | Thailand | Industry partnership, Thai language data, evaluation
National University of Singapore (NUS) | Singapore | Research collaboration, contributing authors
Nanyang Technological University (NTU) | Singapore | Research collaboration, contributing authors
Singapore University of Technology and Design (SUTD) | Singapore | Research collaboration, contributing authors
Peafowl.ai | Singapore | Research collaboration, AI startup partner
Float16.cloud | Indonesia | API hosting, model deployment
PyThaiNLP | Thailand | Community partnership, language advocacy
Vietnamese NLP Community | Vietnam | Community partnership, dataset curation

Each of these entities integrates into the collaborative tapestry of Sailor2, and many are represented directly by named contributors credited in the official technical paper and code repositories4,1.

Organizational and Collaborative Structure

Sailor2 employs a distributed, decentralized, and open collaboration model, coordinating via public repositories, paper authorship, Slack/Discord communities, institutional partnerships, and grassroots datasets. The core leadership remains centered at Sea AI Lab in Singapore, but decision-making, major research contributions, and authorship are explicitly shared with equal-contribution recognitions for researchers across SJTU (China), SMU (Singapore), and other key partners. This model both ensures continuity and maximizes regional and cross-disciplinary buy-in4.

The approach can be summarized as:

  • Shared leadership: Noted explicitly in equal-contributor designations and shared paper communications;
  • Wide disciplinary coverage: Linguists, AI researchers, web infrastructure engineers, country-specific NLP experts, and corporate engineers all participate;
  • Community-driven governance: Community contributions—especially data and feedback from SEA language speakers—are formally integrated in the official datasets and evaluation benchmarks;
  • Decentralized R&D: Codebases, evaluation scripts, and model weights are maintained collaboratively and released through both centralized and decentralized open-source channels.

Lead Researchers, Project Leadership, and Notable Contributors

Project Leadership

The formal project lead and main contact is Qian Liu (Sea AI Lab)2, who is also deeply involved in the technical research behind Sailor2's RegMix data mixture optimization and preference tuning innovations for LLM training. Liu’s record of community engagement, model governance, and regional outreach is central to orchestrating Sailor2’s team-based model.

Lead contributors, recognized for equal and foundational contributions, include:

  • Longxu Dou (Sea AI Lab)
  • Fan Zhou (SJTU, China)
  • Changyu Chen (SMU, Singapore)
  • Zili Wang (Independent/Sea AI Lab)
  • Ziqi Jin (SUTD, Singapore)
  • Zichen Liu, Tongyao Zhu (Sea AI Lab, NUS)
  • Cunxiao Du (Sea AI Lab)
  • Penghui Yang (NTU, Singapore)
  • Haonan Wang (NUS, Singapore)
  • Jiaheng Liu (NUS/Sea AI Lab)

Many contributors are dual-affiliated, often spanning both a university and an applied AI lab, reinforcing the bridges between fundamental research and production-scale deployment4.

Institutional Affiliations of the Author Team

Key authors hail from a spectrum of institutions located in China, Singapore, Thailand, Vietnam, the United States, Sweden, Hong Kong SAR, Denmark, Myanmar, and the UAE. This multinational composition is reflected both in the institutional acknowledgment table and in the diverse linguistic and cultural expertise embedded in the research team.

Advisor and Secondary Contributor Roles

Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin are listed as advisors; they bring deep experience in NLP, distributed AI, and evaluation, and many share affiliations with Sea AI Lab or Singaporean universities2.

Contributing institutions include newer AI startups (e.g., Peafowl.ai), established platform companies (Hugging Face), and regional language advocacy organizations (PyThaiNLP).


Corporate, Industry, and Platform Partners

SCB 10X

SCB 10X, a technology investment and venture business of SCBX Group (Siam Commercial Bank), Thailand’s oldest bank, emerges as a critical corporate partner. The firm supports Sailor2 via AI and research sponsorship, project funding, and deep engagement in data collection and Thai and Southeast Asian NLP efforts. SCB 10X’s in-house AI team has developed its own open LLMs for Thai (Typhoon and related models), signifying a mature institutional understanding of LLM needs for non-majority languages. This collaboration fosters bidirectional benefit: Sailor2 gains SEA language expertise and infrastructural support, while SCB 10X leverages LLM progress for local innovation5.

WiseSight

WiseSight, a leading Thailand-based analytics and AI company, contributes cultural context, Thai language corpora, and NLP evaluation feedback. Its participation, alongside SCB 10X, anchors Sailor2's strong grounding in key SEA markets and ties open-source research to industry adoption pathways.

Float16.cloud

Float16.cloud is credited with hosting the public API for Sailor2-20B-Chat, providing scalable cloud-based access to the largest Sailor2 models for both evaluation and user interaction. This collaboration demonstrates best practices in model deployment and democratizes hands-on experimentation for the regional developer and research communities.

Hugging Face

As the world's leading platform for open-source machine learning, Hugging Face is essential for model hosting, code sharing, comprehensive documentation, and community-driven feedback. The centrality of Sailor2 on Hugging Face (via model cards, dataset repositories, and spaces for demos) directly broadens its reach, contributing to visibility, collaboration, and standardization across the global LLM research space.


Open-Source Platforms, Community Tools, and Codebase Structure

Sailor2’s entire development pipeline, deployment, and evaluation suite are rooted in open-source platforms and tools, facilitating reproducibility, collaborative iteration, and extensibility for similar work across other languages and contexts.

Core Codebases and Community Tools

Tool | Function
SailCraft | Multi-stage data preprocessing pipeline
RegMix | Regression-based data mixture optimization
Megatron-Sailor2 | Training infrastructure for large-scale LLMs
Oat and SailCompass | Instruction fine-tuning, preference optimization, and evaluation suite

Explanation and Impact

SailCraft implements a robust multi-stage, language-agnostic data preprocessing pipeline, foundational for achieving Sailor2's high data quality and avoiding common duplication or contamination pitfalls6.
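
The sketch below illustrates the general shape of such a multi-stage pipeline: normalization, rule-based quality filtering, and exact deduplication. The specific rules and thresholds are illustrative stand-ins, not SailCraft's actual configuration.

```python
# Illustrative sketch of a multi-stage cleaning pipeline in the spirit of
# SailCraft. Stage order, rules, and thresholds here are illustrative; see
# the SailCraft repository for the actual configuration.
import hashlib
import re

def rule_filter(doc: str) -> bool:
    """Heuristic quality rules: drop too-short and highly repetitive docs."""
    words = doc.split()
    if len(words) < 50:                      # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:   # highly repetitive text
        return False
    return True

def exact_dedup(docs):
    """Drop exact duplicates via content hashing."""
    seen, out = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(doc)
    return out

def pipeline(docs):
    docs = [re.sub(r"\s+", " ", d).strip() for d in docs]  # normalization
    docs = [d for d in docs if rule_filter(d)]             # rule-based cleaning
    docs = exact_dedup(docs)                               # deduplication
    return docs
```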

RegMix introduces a regression-based, empirical approach to optimizing the language and domain data mixture, surpassing human heuristics; it was accepted at ICLR 2025 and stands as a major research contribution7.
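
The following simplified sketch conveys the core RegMix idea under strong assumptions: fit a regression from mixture weights (observed on small proxy runs) to validation loss, then search the mixture simplex for the predicted optimum. RegMix's actual regressor and search procedure differ in detail, and the data here are synthetic.

```python
# Simplified sketch of the RegMix idea: fit a regression from data-mixture
# weights (measured on small proxy runs) to validation loss, then search the
# mixture simplex for the predicted optimum. Minimal linear variant only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Each row: mixture weights over domains (sums to 1) from a small proxy run,
# paired with the validation loss that run achieved. Values are synthetic.
mixtures = rng.dirichlet(np.ones(4), size=64)   # 64 proxy runs, 4 domains
losses = 2.0 - mixtures @ np.array([0.3, 0.5, 0.1, 0.2]) + rng.normal(0, 0.01, 64)

reg = LinearRegression().fit(mixtures, losses)

# Search a large sample of candidate mixtures for the lowest predicted loss.
candidates = rng.dirichlet(np.ones(4), size=100_000)
best = candidates[np.argmin(reg.predict(candidates))]
print("predicted-best mixture:", best.round(3))
```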

Megatron-Sailor2, adapted from the Megatron family of training frameworks originally developed at NVIDIA, scales the training infrastructure for multi-billion-parameter multilingual LLMs, with customized support for continual pre-training and expansion from base models (e.g., Qwen2.5)8.

Oat and SailCompass cover the final stages of instruction fine-tuning, preference optimization, and reproducible evaluation, scaffolding Sailor2's robust, community-verifiable benchmarks and chat-level testing.

All code, benchmarks, model weights, and pre-processed datasets are hosted in mirrored organizations on GitHub (sail-sg, sailor2) and Hugging Face. This structure both simplifies onboarding for external contributors and supports transparent, granular tracking of activity and contributions across institutional lines.


Authorship, Equal Contribution, and Publication Metadata

The Sailor2 technical report, available as a preprint on arXiv (arXiv:2502.12982) and mirrored across project websites, lists over 45 authors from more than twenty institutions9,10.

Equal-contribution status is denoted for the core research quartet of Qian Liu, Longxu Dou, Fan Zhou (SJTU), and Changyu Chen (SMU), who serve as the lead authors and primary contacts. Authorship order and footnotes in the publication closely correspond to direct contributions on code, data curation, evaluation, and project communication.

Publication metadata highlights:

  • Title: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
  • Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, ... and 40+ contributors
  • Date: February 18, 2025 (arXiv), also featured on various institutional and lab homepages
  • License: Apache 2.0 (models/code), CC BY-NC-SA 4.0 (publication/paper)

Data, Methods, and Community Engagement

Data Sourcing, Language Coverage, and Data Governance

Sailor2’s competitive performance, especially on low-resource SEA languages, is driven by diligent, collaborative data collection and cleaning, grounded in both automated retrieval (web crawl, public PDF corpus) and community-driven, human-in-the-loop processes.

A significant share of the datasets comes from:

  • Community members and local institutions (Thai, Vietnamese, Indonesian, Filipino, Burmese, Khmer, Lao, etc.)
  • Translation-based augmentation: leveraging state-of-the-art models (NLLB 3.3B) for high-quality SEA data synthesis from English (see the sketch after this list)
  • Quality controls: Deduplication, classifier-based filtering, and continual partner review
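
As referenced above, a minimal sketch of NLLB-based translation augmentation using Hugging Face transformers follows. The FLORES-200 language codes (eng_Latn, tha_Thai) follow NLLB's conventions; the generation settings are illustrative.

```python
# Minimal sketch of translation-based augmentation with NLLB: translating
# English seed text into a SEA language (here, Thai). Generation settings
# are illustrative, not Sailor2's production configuration.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Large language models should serve every language community."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("tha_Thai"),  # target: Thai
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```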

Key dataset statistics include Vietnamese (1.9 TB), Indonesian (1.3 TB), and Thai (242 GB), with other languages represented according to available resources and community contributions1.

Two-Stage Training and Model Expansion

Sailor2 integrates a two-stage continual pre-training regime: initial large-scale, multi-language exposure followed by targeted high-quality token annealing for low-resource language optimization. This method, inspired by the MiniCPM framework and refined with RegMix mixture selection, is central to model performance; it is further optimized by LlamaPro-inspired model expansion, which adds specialized layers for SEA-language knowledge without catastrophic forgetting of existing cross-lingual capability1.
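
A minimal sketch of LlamaPro-style block expansion is shown below: copies of existing transformer blocks are interleaved with their residual-branch output projections zero-initialized, so the new blocks start as identity functions and only they are trained. Attribute names such as o_proj and down_proj are assumptions modeled on common decoder implementations, not Sailor2's exact code.

```python
# Sketch of LlamaPro-style block expansion: interleave copies of existing
# transformer blocks, zero their output projections so they start as identity
# (the residual path passes through unchanged), and train only the new blocks.
# Layer attribute names are illustrative; real checkpoints differ.
import copy
import torch.nn as nn

def expand_blocks(layers: nn.ModuleList, every: int) -> nn.ModuleList:
    expanded = []
    for i, layer in enumerate(layers):
        layer.requires_grad_(False)          # freeze original blocks
        expanded.append(layer)
        if (i + 1) % every == 0:
            new = copy.deepcopy(layer)
            new.requires_grad_(True)         # only new blocks are trainable
            # Zero the residual-branch output projections so the new block
            # initially contributes nothing (identity via the residual path).
            for name, module in new.named_modules():
                if isinstance(module, nn.Linear) and name.endswith(("o_proj", "down_proj")):
                    nn.init.zeros_(module.weight)
                    if module.bias is not None:
                        nn.init.zeros_(module.bias)
            expanded.append(new)
    return nn.ModuleList(expanded)
```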

Supervised fine-tuning and preference tuning adopt advanced frameworks like Oat and UltraFeedback, with benchmark alignment on diverse SEA datasets and community evaluation benchmarks (e.g., SailCompass, Sea-WildBench, FLORES-200).

Collaborative Philosophy and Community Engagement

The governance model upholds active co-development with open calls for dataset, code, evaluation, and prompt engineering contributions. Community initiative is core both to language dataset pooling and to cultural adaptation of evaluation metrics (custom tasks on cuisine/tradition, prompt calibration, and end-user accessibility via Hugging Face Spaces and Float16.cloud API).

Model checkpoints, code updates, and research papers are proactively announced in multilingual community channels, forums, and via regional academic partners, ensuring diverse feedback and stakeholder engagement. Sailor2 explicitly encourages regional practitioners, students, and research groups to “get aboard” as outlined in calls to action across the Hugging Face community, GitHub, and blog updates1,11,6.


Evaluation Tools, Benchmarks, and Collaborative Validation

A key to Sailor2's recognition as a leading open-source SEA LLM is its rigorously open and reproducible evaluation pipeline, combining academic and community cross-validation.

SailCompass: Flagship Evaluation Framework

SailCompass is a custom benchmark suite built specifically to capture SEA linguistic nuances, spanning generation, multiple-choice, and classification tasks across core regional languages. It encompasses both general-use and culturally grounded datasets, and introduces prompt calibration, MCQ optimization, and perplexity-ranking practices tailored for SEA contexts. The benchmark and tools are fully open-sourced, enabling robust, reproducible comparison by external researchers12,13.
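
The sketch below illustrates perplexity-style MCQ ranking of the kind described: each option is scored by the model's length-normalized log-likelihood as a continuation of the question, and the highest-scoring option wins. The checkpoint ID is an assumption; SailCompass's actual scoring and calibration details live in its repository.

```python
# Sketch of perplexity-based MCQ ranking: score each option by the average
# token log-likelihood of the option continuation and pick the highest.
# Checkpoint ID is an assumed example.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B"  # assumed checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

@torch.no_grad()
def option_score(question: str, option: str) -> float:
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # Log-probs of the option tokens only, averaged (length-normalized).
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.numel()), targets]
    return token_lp[-option_len:].mean().item()

question = "คำถาม: เมืองหลวงของประเทศไทยคือ?"
options = ["กรุงเทพมหานคร", "เชียงใหม่", "ภูเก็ต"]
print(max(options, key=lambda o: option_score(question, o)))
```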

Other Benchmarks and Collaborative Testbeds

  • SEA-WildBench: Local chat and conversation evaluation, including direct faceoff against GPT-4o
  • M3Exam, FLORES-200, VMLU, XCOPA, TydiQA, Tatabahasa: Genre and task coverage from code understanding to cultural knowledge
  • Assessment of instruction handling, cross-lingual transfer, and SEA-specific reasoning

Alignment with Open Feedback (UltraFeedback & SEA-UltraFeedback)

Sailor2 leverages diversified preference datasets (UltraFeedback, SEA-UltraFeedback) for fine-grained alignment and reward model training. These are derived both from scaled AI feedback and community-annotated, multi-turn SEA language dialogues, ensuring reinforcement learning and direct preference optimization tune models towards regional human conversational expectations14,15.
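
For concreteness, the following is a minimal sketch of the standard direct preference optimization (DPO) objective that such preference tuning builds on; it is a textbook formulation, not Sailor2's exact training code.

```python
# Minimal sketch of the DPO objective used in preference tuning on datasets
# like SEA-UltraFeedback. Inputs are summed log-probabilities of chosen and
# rejected responses under the policy and a frozen reference model; beta is
# the usual DPO temperature.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta: float = 0.1):
    """All arguments are per-example summed log-probs (1-D tensors)."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random log-probs:
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```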


Licensing, Funding, and Sponsorship

All Sailor2 models, code, and supporting tools are released under Apache 2.0, and contributions from diverse institutions imply a mix of:

  • Institutional and university research support (compute resources, developer hours, open access to regional datasets)
  • Corporate and AI platform sponsorship (notably SCB 10X and float16.cloud for compute, hosting, and financial support)1
  • Open-source community development for datasets, evaluation, and model alignment

No restrictive licensing or paywall barriers exist at any layer, ensuring both non-commercial and commercial actors can benefit and extend the Sailor2 work in their respective contexts.


Collaborative Governance, Philosophy, and Methodological Innovations

The collaborative philosophy of Sailor2 is articulated in key governance and design features:

  • Equity in contribution and recognition: Equal contribution status among leading researchers, distributed responsibilities, transparent paper authorship, and repository commit attributions.
  • Community-first openness: Periodic community feedback cycles, open evaluation challenges, and integration of suggestions into successive model and data releases.
  • Iterative technical cookbook: The documentation and open publication of continuous improvements (e.g., switching to RegMix from fixed heuristics, model expansion lessons, block scaling strategies, reward model selection) create a living manual for future multilingual LLM projects targeting similar under-resourced settings.
  • Platform decentralization: Redundant hosting, open code mirroring, and decentralized communication channels mitigate risk of central failure and encourage experimentation outside the principal institutions.

Future Directions, Expansion, and Roadmap

The Sailor2 project’s stated roadmap and informal community goals include:

  • Synthetic Data Curation for Ultra-Low Resource Languages: Expansion into additional SEA languages (e.g., Acehnese, Minangkabau) and dialects using NLLB-derived translations and regional classifier-guided corpus selection.
  • Tokenizer-Free Models and Open Vocabulary Learning: Experimentation with pixel-level or byte-level tokenization, inspired by recent advances in tokenizer-free and open-vocabulary transformer models.
  • Efficient Continual Pre-training: Innovations around model plasticity, selective parameter updating, and regular over-training evaluation to minimize compute needs while maximizing language generalization.
  • Broader Human Evaluation and Benchmarking: Expansion and localization of evaluation benchmarks—including cultural, code, and instruction-following tasks—strengthening human-in-the-loop alignment and inclusivity.
  • Deepening Regional Collaboration and Academic Partnership: Explicit invitations to SEA and global institutions for joint data pooling, benchmark translation, and evidence-based model expansion.

With the foundation of robust open-source infrastructure, inclusive governance, and sustained engagement by top regional researchers, Sailor2 is positioned to be both a trailblazer for SEA language AI and a replicable template for other multilingual LLM undertakings worldwide.


Conclusion

The Sailor2 multilingual LLM project demonstrates what becomes possible when excellence in open science, cross-border collaboration, and regional engagement coalesce—delivering parity with cutting-edge commercial language models while centering local voices, data, and languages often ignored by mainstream AI development. Its breadth of institutional affiliations, diversity of contributors, innovative open-source practices, and commitment to community feedback combine to set powerful standards for the next generation of fair, accessible, and high-performing multilingual language AI.

By charting the organizational architecture, governance, major contributors, methodology, and community scaffolding behind Sailor2, this report offers an in-depth reference on the model’s historical significance and ongoing transformative potential for Southeast Asia’s information ecosystem and beyond.


Institutional Summary Table

Institution/Group | Role | Country
Sea AI Lab | Project leadership, core research, infrastructure | Singapore
Shanghai Jiao Tong University (SJTU) | Academic research, co-authorship | China
Singapore Management University (SMU) | Academic research, co-authorship | Singapore
SCB 10X | Corporate sponsor, data partner | Thailand
WiseSight | Data and evaluation partner | Thailand
Hugging Face | Platform partner, model/dataset hosting | USA/France
Float16.cloud | API hosting and deployment | Indonesia
National University of Singapore (NUS) | Academic contributor | Singapore
Nanyang Technological University (NTU) | Academic contributor | Singapore
Singapore University of Technology and Design (SUTD) | Academic contributor | Singapore
Peafowl.ai | Industry/startup partner | Singapore
PyThaiNLP | Community partner, language advocacy | Thailand
Other contributing academic institutions (China, US, Sweden, Denmark, etc.) | Research collaboration, co-authorship | Various

All of these partners contributed research, funding, datasets, benchmarks, technical support, or direct authorship, reflecting Sailor2's genuinely collective, regional, and open character.


References

  1. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
  2. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
  3. sailor2 (Sailor2) - Hugging Face.
  4. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
  5. scb10x (SCB 10X) - Hugging Face.
  6. GitHub Pages - Sailor.
  7. RegMix: Data Mixture as Regression for Language Model Pre-training.
  8. GitHub - sail-sg/Megatron-Sailor2: Megatron for Sailor2/Qwen2.5.
  9. arXiv:2502.12982 - Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
  10. Paper page - Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
  11. @SivilTaram on Hugging Face: "Introducing Sailor-14B Model and Sailor2 ...".
  12. SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages.
  13. SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages (arXiv).
  14. PowerPoint presentation (untitled slide deck).
  15. UltraFeedback: Boosting Language Models with High-quality Feedback.