The Research Teams Behind Sailor2 Multilingual LLMs: Institutions, Contributors, and Collaborative Structure
Introduction
Sailor2 is a landmark in multilingual large language model (LLM) development, reaching parity with GPT-4o on Southeast Asian (SEA) languages while upholding a philosophy of linguistic inclusivity, transparency, and open-source access. Unlike most LLMs, which focus on major world languages, Sailor2 explicitly targets underrepresented SEA languages, including Vietnamese, Thai, Javanese, and Tagalog, by mobilizing a diverse, multinational team across East and Southeast Asia. This research report explores the institutional, organizational, and philosophical underpinnings of the Sailor2 project, charts the composition and operation of its research teams, and highlights the infrastructure, open-source platforms, evaluation tools, and future outlook that characterize its impact and ambitions.
This analysis synthesizes the most current information sourced across project blogs, official preprints, author metadata, open-source codebases, governance documentation, evaluation reports, and community tools. The report aims to satisfy the most demanding research standards with evidence-driven, richly contextualized coverage.
Project Genesis and Open-Source Ethos
Rationale and Vision
Large language models have, until recently, neglected many of the world's lower-resourced languages, especially those across SEA, a region home to over 650 million people and extraordinary linguistic diversity. The Sailor2 family emerges as a direct response to this gap, guided by a community-driven, open-access vision: "Serving the Underserved in Southeast Asia with Open LLMs" [1].
Sailor2 aims to democratize access to advanced multilingual natural language technology and counterbalance the dominance of English and Chinese in AI, pioneering both high-performing language models (up to 20B parameters) and a standardized, transparent "cookbook" for LLM development. By rooting the effort in research, industry, and grassroots community input across the region, Sailor2 sets new standards for linguistic equity and open collaboration [2].
Licensing and Open Science
Sailor2 is released under the Apache 2.0 license, removing restrictions on research and commercial use and promoting extensibility and wide adoption. All code, models, benchmarks, and tools are made publicly available through Hugging Face and GitHub, and project methodology, evaluation, and meta-knowledge are documented openly [1,3].
This transparency not only enhances reproducibility and trust but also makes Sailor2’s “recipe” for multilingual LLMs accessible for further adaptation in other low-resource settings globally.
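To illustrate how low the barrier to entry is, the following minimal sketch loads a Sailor2 chat model directly from Hugging Face with the standard transformers API. The repository id shown ("sail/Sailor2-8B-Chat") is an assumption for illustration and should be verified against the model cards in the sail / sailor2 organizations.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; check the Sailor2 collection on Hugging Face.
model_id = "sail/Sailor2-8B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package; drop it for CPU-only use.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Chào bạn! Hãy giới thiệu về Việt Nam."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```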
Institutional Affiliations and Collaborative Structure
A hallmark of Sailor2's success is the exceptionally broad and multinational range of participating institutions, spanning leading research labs, universities, corporations, AI startups, and grassroots initiatives. These entities provide compute, human resources, funding, language data, evaluation expertise, and community bridges [1,2].
Summary Table: Institutions, Countries, and Roles in Sailor2
Institution | Country | Role |
---|---|---|
Sea AI Lab | Singapore | Project leadership, core R&D, infrastructure |
Shanghai Jiao Tong University (SJTU) | China | Core research, academic collaboration |
Singapore Management University (SMU) | Singapore | Core research, academic collaboration |
Hugging Face | United States/France | Platform partnership, model hosting, community tools |
SCB 10X | Thailand | Corporate sponsorship, data curation, financial support |
WiseSight | Thailand | Industry partnership, Thai language data, evaluation |
National University of Singapore (NUS) | Singapore | Research collaboration, contributing authors |
Nanyang Technological University (NTU) | Singapore | Research collaboration, contributing authors |
Singapore University of Technology and Design (SUTD) | Singapore | Research collaboration, contributing authors |
Peafowl.ai | Singapore | Research collaboration, AI startup partner |
Float16.cloud | Indonesia | API hosting, model deployment |
PyThaiNLP | Thailand | Community partnership, language advocacy |
Vietnamese NLP Community | Vietnam | Community partnership, dataset curation |
Each of these entities is woven into the collaborative fabric of Sailor2, and many are represented directly by named contributors credited in the official technical paper and code repositories [1,4].
Organizational and Collaborative Structure
Sailor2 employs a distributed, decentralized, and open collaboration model, coordinating via public repositories, paper authorship, Slack/Discord communities, institutional partnerships, and grassroots datasets. The core leadership remains centered at Sea AI Lab in Singapore, but decision-making, major research contributions, and authorship are explicitly shared with equal-contribution recognitions for researchers across SJTU (China), SMU (Singapore), and other key partners. This model both ensures continuity and maximizes regional and cross-disciplinary buy-in [4].
The approach can be summarized as:
- Shared leadership: Noted explicitly in equal-contributor designations and shared paper communications;
- Wide disciplinary coverage: Linguists, AI researchers, web infrastructure engineers, country-specific NLP experts, and corporate engineers all participate;
- Community-driven governance: Community contributions—especially data and feedback from SEA language speakers—are formally integrated in the official datasets and evaluation benchmarks;
- Decentralized R&D: Codebases, evaluation scripts, and model weights are maintained collaboratively and released through both centralized and decentralized open-source channels.
Lead Researchers, Project Leadership, and Notable Contributors
Project Leadership
The formal project lead and main contact is Qian Liu (Sea AI Lab) [2], who is also deeply involved in the technical research behind Sailor2's RegMix data mixture optimization and preference-tuning innovations for LLM training. Liu's record of community engagement, model governance, and regional outreach is central to orchestrating Sailor2's team-based model.
Lead contributors, recognized for equal and foundational contributions, include:
- Longxu Dou (Sea AI Lab)
- Fan Zhou (SJTU, China)
- Changyu Chen (SMU, Singapore)
- Zili Wang (Independent/Sea AI Lab)
- Ziqi Jin (SUTD, Singapore)
- Zichen Liu, Tongyao Zhu (Sea AI Lab, NUS)
- Cunxiao Du (Sea AI Lab)
- Penghui Yang (NTU, Singapore)
- Haonan Wang (NUS, Singapore)
- Jiaheng Liu (NUS/Sea AI Lab)
Many contributors are dual-affiliated, often spanning both a university and an applied AI lab, reinforcing the bridges between fundamental research and production-scale deployment [4].
Institutional Affiliations of the Author Team
Key authors hail from a spectrum of institutions located in China, Singapore, Thailand, Vietnam, the United States, Sweden, Hong Kong SAR, Denmark, Myanmar, and the UAE. This multinational composition is reflected both in the institutional acknowledgment table and in the diverse linguistic and cultural expertise embedded in the research team.
Advisor and Secondary Contributor Roles
Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, and Min Lin are listed as advisors, with deep experience in NLP, distributed AI, and evaluation, many sharing affiliations with Sea AI Lab or regional Singaporean universities [2].
Contributing institutions include newer AI startups (e.g., Peafowl.ai), established platform companies (Hugging Face), and regional language advocacy organizations (PyThaiNLP).
Corporate, Industry, and Platform Partners
SCB 10X
SCB 10X, the technology investment and venture arm of SCBX Group, the holding company of Siam Commercial Bank (Thailand's oldest bank), is a critical corporate partner. The firm supports Sailor2 through AI research sponsorship, project funding, and deep engagement in data collection for Thai and broader Southeast Asian NLP. SCB 10X's in-house AI team has developed its own open LLMs for Thai (Typhoon and related models), signifying a mature institutional understanding of LLM needs for non-majority languages. The collaboration is mutually beneficial: Sailor2 gains SEA language expertise and infrastructural support, while SCB 10X leverages LLM progress for local innovation [5].
WiseSight
A leading Thailand-based analytics and AI company, WiseSight’s contributions include cultural context, Thai language corpora, and NLP evaluation feedback. Their participation, alongside SCB 10X, anchors Sailor2’s strong grounding in key SEA markets and ties open-source research to industry adoption pathways.
Float16.cloud
Float16.cloud is credited with hosting the public API for Sailor2-20B-Chat, providing scalable cloud-based access to the largest Sailor2 models for both evaluation and user interaction. This collaboration demonstrates best practices in model deployment and democratizes hands-on experimentation for the regional developer and research communities.
Hugging Face
As the world's leading platform for open-source machine learning, Hugging Face is essential for model hosting, code sharing, comprehensive documentation, and community-driven feedback. The centrality of Sailor2 on Hugging Face (via model cards, dataset repositories, and Spaces for demos) directly broadens its reach, contributing to visibility, collaboration, and standardization across the global LLM research space.
Open-Source Platforms, Community Tools, and Codebase Structure
Sailor2’s entire development pipeline, deployment, and evaluation suite are rooted in open-source platforms and tools, facilitating reproducibility, collaborative iteration, and extensibility for similar work across other languages and contexts.
Core Codebases and Community Tools
Tool | Function |
---|---|
SailCraft | Multi-stage data preprocessing pipeline |
RegMix | Regression-based data mixture optimization |
Megatron-Sailor2 | Training infrastructure for large-scale LLMs |
Oat and SailCompass | Instruction fine-tuning, preference optimization, and evaluation suite |
Explanation and Impact
SailCraft implements a robust multi-stage, language-agnostic data preprocessing pipeline, foundational for achieving Sailor2's high data quality and avoiding common duplication or contamination pitfalls [6].
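As a flavor of what such a pipeline involves, the sketch below shows a single rule-based pass (exact-hash deduplication plus simple length and noise filters). It is an illustrative stand-in, not the actual SailCraft code; the function name and thresholds are arbitrary, and SailCraft itself chains several stronger stages (URL filtering, near-deduplication, classifier-based quality filters).

```python
import hashlib
import re

def exact_dedup_and_filter(docs, min_chars=200, max_non_text_ratio=0.3):
    """Illustrative single cleaning pass, not the real SailCraft pipeline."""
    seen = set()
    for doc in docs:
        text = doc["text"].strip()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if len(text) < min_chars:
            continue  # too short to be a useful training document
        non_text = len(re.findall(r"[^\w\s]", text, flags=re.UNICODE))
        if non_text / max(len(text), 1) > max_non_text_ratio:
            continue  # dominated by markup/symbols rather than prose
        yield doc
```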
RegMix introduces a regression-based, empirical approach to optimizing the language and domain data mixture, surpassing human heuristics; it is recognized as a major research contribution in its own right (accepted at ICLR 2025) [7].
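The core mechanism can be sketched as follows: train many small proxy models on randomly sampled mixture weights, fit a regressor from mixture weights to validation loss, and select the mixture with the best predicted loss. The snippet below uses scikit-learn's LinearRegression purely for illustration (the RegMix paper reports stronger regressors such as gradient-boosted trees), and the proxy-run data is assumed to be supplied by the caller.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pick_mixture(mixtures, losses, n_candidates=100_000, seed=0):
    """Regression-style mixture selection in the spirit of RegMix.

    mixtures: (n_runs, n_domains) weights used for small proxy-model runs
    losses:   (n_runs,) validation losses observed for those runs
    Returns the candidate mixture with the lowest *predicted* loss.
    """
    reg = LinearRegression().fit(mixtures, losses)

    # Sample many candidate mixtures on the simplex and rank them by
    # predicted loss instead of training a large model for each one.
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(mixtures.shape[1]), size=n_candidates)
    predicted = reg.predict(candidates)
    return candidates[np.argmin(predicted)]
```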
Megatron-Sailor2, adapted from the Megatron training framework, scales the training infrastructure for multi-billion-parameter multilingual LLMs, with customized support for continual pre-training and expansion from base models (e.g., Qwen2.5) [8].
Oat and SailCompass cover the final stages of instruction fine-tuning, preference optimization, and reproducible evaluation, scaffolding Sailor2's robust, community-verifiable benchmarks and chat-level testing.
All code, benchmarks, model weights, and pre-processed datasets are hosted in mirrored organizations on GitHub (sail-sg, sailor2) and Hugging Face. This structure both simplifies onboarding for external contributors and supports transparent, granular tracking of activity and contributions across institutional lines.
Data, Methods, and Community Engagement
Data Sourcing, Language Coverage, and Data Governance
Sailor2’s competitive performance, especially on low-resource SEA languages, is driven by diligent, collaborative data collection and cleaning, grounded in both automated retrieval (web crawl, public PDF corpus) and community-driven, human-in-the-loop processes.
A significant share of the datasets comes from:
- Community members and local institutions (Thai, Vietnamese, Indonesian, Filipino, Burmese, Khmer, Lao, etc.)
- Translation-based augmentation: leveraging state-of-the-art models (NLLB 3.3B) for high-quality SEA data synthesis from English (see the sketch after this list)
- Quality controls: Deduplication, classifier-based filtering, and continual partner review
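For the translation-based augmentation item above, a minimal sketch using the public NLLB checkpoint on Hugging Face might look like the following. The checkpoint id ("facebook/nllb-200-3.3B") and FLORES-200 language codes are standard, but the batching, prompting, and post-hoc quality filters actually used by the Sailor2 team are not reproduced here.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Public NLLB 3.3B checkpoint on Hugging Face; source language is English.
nllb_id = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(nllb_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(nllb_id)

def translate_to_sea(english_text, target_lang="tha_Thai", max_new_tokens=512):
    """Translate an English document into a SEA language (FLORES-200 code)."""
    inputs = tokenizer(english_text, return_tensors="pt", truncation=True)
    generated = model.generate(
        **inputs,
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```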
Key dataset statistics include Vietnamese (1.9 TB), Indonesian (1.3 TB), and Thai (242 GB), with other languages represented according to available resources and community contributions [1].
Two-Stage Training and Model Expansion
Sailor2 integrates a two-stage continual pre-training regime: initial large-scale, multi-language exposure followed by targeted annealing on high-quality tokens to optimize low-resource languages. This method, inspired by the MiniCPM framework and refined with RegMix mixture selection, is central to model performance. It is further complemented by model expansion inspired by LlamaPro, which adds specialized layers for SEA-language knowledge without catastrophic forgetting of the base model's cross-lingual capability [1].
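A hedged sketch of the LlamaPro-style expansion step is shown below, assuming the Llama/Qwen2 decoder layout in transformers (layers under `model.model.layers`, each with `self_attn.o_proj` and `mlp.down_proj` projections). The inserted copies are zero-initialized so the expanded model starts out functionally identical to the base model; this illustrates the idea only, is not the Sailor2 training code, and ignores details such as KV-cache layer indices.

```python
import copy
import torch
from torch import nn

def expand_with_identity_blocks(model, every_k=4):
    """Insert a zero-initialized copy after every k-th decoder layer.

    Zeroing o_proj / down_proj makes each new block a residual no-op at
    initialization, so continual pre-training can specialize the new
    blocks on SEA data without disturbing the base model's behavior.
    (Attribute names follow the Llama/Qwen2 layout in transformers.)
    """
    old_layers = model.model.layers
    new_layers = nn.ModuleList()
    for i, layer in enumerate(old_layers):
        new_layers.append(layer)
        if (i + 1) % every_k == 0:
            block = copy.deepcopy(layer)
            with torch.no_grad():
                block.self_attn.o_proj.weight.zero_()
                block.mlp.down_proj.weight.zero_()
            new_layers.append(block)
    model.model.layers = new_layers
    model.config.num_hidden_layers = len(new_layers)
    return model
```

In practice only the newly inserted blocks would be unfrozen during the SEA-focused continual pre-training stage, while the original layers retain the base model's general capability.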
Supervised fine-tuning and preference tuning adopt frameworks such as Oat and preference data in the style of UltraFeedback, with alignment assessed on diverse SEA datasets and community evaluation benchmarks (e.g., SailCompass, SEA-WildBench, FLORES-200).
Collaborative Philosophy and Community Engagement
The governance model upholds active co-development with open calls for dataset, code, evaluation, and prompt engineering contributions. Community initiative is core both to language dataset pooling and to cultural adaptation of evaluation metrics (custom tasks on cuisine/tradition, prompt calibration, and end-user accessibility via Hugging Face Spaces and Float16.cloud API).
Model checkpoints, code updates, and research papers are proactively announced in multilingual community channels, forums, and via regional academic partners, ensuring diverse feedback and stakeholder engagement. Sailor2 explicitly encourages regional practitioners, students, and research groups to "get aboard" as outlined in calls to action across the Hugging Face community, GitHub, and blog updates [1,6,11].
Evaluation Tools, Benchmarks, and Collaborative Validation
A key to Sailor2's recognition as a leading open-source SEA LLM is its rigorously open and reproducible evaluation pipeline, combining academic and community cross-validation.
SailCompass: Flagship Evaluation Framework
SailCompass is a custom benchmark suite built specifically to capture SEA linguistic nuances, spanning generation, multiple-choice, and classification tasks across core regional languages. It encompasses both general-use and culturally grounded datasets, and introduces prompt calibration, MCQ optimization, and perplexity-ranking practices tailored for SEA contexts. The benchmark and tools are fully open-sourced, enabling robust, reproducible comparison by external researchers [12,13].
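The perplexity-ranking practice for multiple-choice tasks can be illustrated with the short sketch below, which scores each option by its length-normalized log-likelihood under the model and picks the best-scoring one. This is a generic reconstruction of the technique, not SailCompass's actual scoring code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_options_by_loglik(model, tokenizer, question, options):
    """Pick the MCQ option with the highest length-normalized log-likelihood."""
    scores = []
    for option in options:
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
        logits = model(full_ids).logits  # (1, seq_len, vocab)
        # Log-probability of each token given its preceding context.
        log_probs = F.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Score only the option tokens, normalized by option length.
        n_option = full_ids.shape[1] - prompt_ids.shape[1]
        scores.append(token_ll[0, -n_option:].mean().item())
    return options[int(torch.tensor(scores).argmax())]
```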
Other Benchmarks and Collaborative Testbeds
- SEA-WildBench: Local chat and conversation evaluation, including direct face-offs against GPT-4o
- M3Exam, FLORES-200, VMLU, XCOPA, TyDiQA, Tatabahasa: Task coverage spanning exam-style knowledge, machine translation, commonsense reasoning, question answering, grammar, and cultural knowledge
- Assessment of instruction handling, cross-lingual transfer, and SEA-specific reasoning
Alignment with Open Feedback (UltraFeedback & SEA-UltraFeedback)
Sailor2 leverages diversified preference datasets (UltraFeedback, SEA-UltraFeedback) for fine-grained alignment and reward model training. These are derived both from scaled AI feedback and community-annotated, multi-turn SEA language dialogues, ensuring reinforcement learning and direct preference optimization tune models towards regional human conversational expectations [14,15].
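At its core, direct preference optimization on such pairs reduces to a simple pairwise loss over policy and reference log-probabilities, sketched below. Sailor2's production tuning runs through frameworks such as Oat rather than this toy function; the sketch only shows the loss shape.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise DPO loss over sequence log-probabilities (one pair per row).

    Each argument is a tensor of summed log-probs for the chosen or rejected
    response under either the policy or the frozen reference model.
    """
    # Implicit reward margins of the policy relative to the reference.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the probability that the chosen response beats the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```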
Licensing, Funding, and Sponsorship
All Sailor2 models, code, and supporting tools are released under Apache 2.0, and contributions from diverse institutions imply a mix of:
- Institutional and university research support (compute resources, developer hours, open access to regional datasets)
- Corporate and AI platform sponsorship (notably SCB 10X and Float16.cloud for compute, hosting, and financial support) [1]
- Open-source community development for datasets, evaluation, and model alignment
No restrictive licensing or paywall barriers exist at any layer, ensuring both non-commercial and commercial actors can benefit and extend the Sailor2 work in their respective contexts.
Collaborative Governance, Philosophy, and Methodological Innovations
The collaborative philosophy of Sailor2 is articulated in key governance and design features:
- Equity in contribution and recognition: Equal contribution status among leading researchers, distributed responsibilities, transparent paper authorship, and repository commit attributions.
- Community-first openness: Periodic community feedback cycles, open evaluation challenges, and integration of suggestions into successive model and data releases.
- Iterative technical cookbook: The documentation and open publication of continuous improvements (e.g., switching to RegMix from fixed heuristics, model expansion lessons, block scaling strategies, reward model selection) create a living manual for future multilingual LLM projects targeting similar under-resourced settings.
- Platform decentralization: Redundant hosting, open code mirroring, and decentralized communication channels mitigate risk of central failure and encourage experimentation outside the principal institutions.
Future Directions, Expansion, and Roadmap
The Sailor2 project’s stated roadmap and informal community goals include:
- Synthetic Data Curation for Ultra-Low Resource Languages: Expansion into additional SEA languages (e.g., Acehnese, Minangkabau) and dialects using NLLB-derived translations and regional classifier-guided corpus selection.
- Tokenizer-Free Models and Open Vocabulary Learning: Experimentation with pixel-level or byte-level tokenization, inspired by recent advances in tokenizer-free and open-vocabulary transformer models.
- Efficient Continual Pre-training: Innovations around model plasticity, selective parameter updating, and regular over-training evaluation to minimize compute needs while maximizing language generalization.
- Broader Human Evaluation and Benchmarking: Expansion and localization of evaluation benchmarks—including cultural, code, and instruction-following tasks—strengthening human-in-the-loop alignment and inclusivity.
- Deepening Regional Collaboration and Academic Partnership: Explicit invitations to SEA and global institutions for joint data pooling, benchmark translation, and evidence-based model expansion.
With the foundation of robust open-source infrastructure, inclusive governance, and sustained engagement by top regional researchers, Sailor2 is positioned to be both a trailblazer for SEA language AI and a replicable template for other multilingual LLM undertakings worldwide.
Conclusion
The Sailor2 multilingual LLM project demonstrates what becomes possible when excellence in open science, cross-border collaboration, and regional engagement coalesce—delivering parity with cutting-edge commercial language models while centering local voices, data, and languages often ignored by mainstream AI development. Its breadth of institutional affiliations, diversity of contributors, innovative open-source practices, and commitment to community feedback combine to set powerful standards for the next generation of fair, accessible, and high-performing multilingual language AI.
By charting the organizational architecture, governance, major contributors, methodology, and community scaffolding behind Sailor2, this report offers an in-depth reference on the model’s historical significance and ongoing transformative potential for Southeast Asia’s information ecosystem and beyond.
Institutional Summary Table
Institution/Group | Role | Country |
---|---|---|
Sea AI Lab | Project leadership, core research, infrastructure | Singapore |
Shanghai Jiao Tong University (SJTU) | Academic research, co-authorship | China |
Singapore Management University (SMU) | Academic research, co-authorship | Singapore |
SCB 10X | Corporate sponsor, data partner | Thailand |
WiseSight | Data and evaluation partner | Thailand |
Hugging Face | Platform partner, model/dataset hosting | USA/France |
Float16.cloud | API hosting and deployment | Indonesia |
National University of Singapore (NUS) | Academic contributor | Singapore |
Nanyang Technological University (NTU) | Academic contributor | Singapore |
Singapore University of Technology and Design (SUTD) | Academic contributor | Singapore |
Peafowl.ai | Industry/startup partner | Singapore |
PyThaiNLP | Community partner, language advocacy | Thailand |
Other contributing academic institutions (China, US, Sweden, Denmark, etc.) | Research collaboration, co-authorship | Various |
All these partners contributed either research, funding, datasets, benchmarks, technical support, or direct authorship, reflecting Sailor2’s genuinely collective, regional, and open thrust.
References
1. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
2. Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs.
3. sailor2 (Sailor2) - Hugging Face.
4. Sailor2: Sailing in South-East Asia with Inclusive.
5. scb10x (SCB 10X) - Hugging Face.
6. GitHub Pages - Sailor.
7. RegMix: Data Mixture as Regression for Language Model Pre-training.
8. GitHub - sail-sg/Megatron-Sailor2: Megatron for Sailor2/Qwen2.5.
9. [.
10. Paper page - Sailor2: Sailing in South-East Asia with Inclusive ....
11. @SivilTaram on Hugging Face: "Introducing Sailor-14B Model and Sailor2 ....
12. SailCompass: Towards Reproducible and Robust Evaluation for Southeast ....
13. [.
14. PowerPoint presentation.
15. UltraFeedback: Boosting Language Models with High-quality Feedback.