Toaripi SLM

A small language model for Toaripi (East Elema; ISO 639‑3: tqo) that generates educational content for primary learners — online and offline.

About

The Toaripi SLM is a research‑led, community‑centred effort to create a lightweight AI language model for Toaripi that can generate original educational content for classrooms and self‑study.

The initiative draws on parallel English–Toaripi Bible text to bootstrap vocabulary and grammar, then uses careful prompting to produce educational materials and language practice content.

At a glance

  • Why: Support Toaripi literacy by bootstrapping classroom materials and language practice content.
  • How: Fine‑tune a compact open model on aligned English↔Toaripi Bible verses to learn vocabulary/structure.
  • Where: Online (simple web UI/API) and offline via quantised weights on CPU‑only devices.

Goals & Vision

Goals

  1. Build a Toaripi‑capable small model (≈1–7B params) by fine‑tuning an open base model with aligned English↔Toaripi data.
  2. Generate original learning materials (vocabulary, dialogues, comprehension exercises, Q&A) fit for primary learners.
  3. Ensure accessibility with online and offline options using quantisation and efficient runtimes.
  4. Invite open collaboration from Toaripi speakers, educators, linguists and developers.

Non‑goals (for clarity)

  • This is not a theological tool; scripture is used purely as bilingual training data.
  • This is not a general‑purpose chatbot; the scope is educational content generation.

How It Works

1. Training with parallel text

We use aligned English↔Toaripi Bible verses to teach the model Toaripi vocabulary and structure. This parallel corpus provides thousands of clean sentence pairs.
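As an illustration, the sketch below shows what this step could look like in Python: aligned verse pairs are serialised as JSONL records, then a compact open base model is fine‑tuned with LoRA adapters to keep training feasible on modest hardware. The base checkpoint, file names and prompt template are placeholders, not the project's fixed choices.

```python
import json

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# 1. Serialise aligned verse pairs as JSONL records. The prompt template
#    and field names are illustrative; real Toaripi text comes from the
#    aligned Bible corpus.
pairs = [("In the beginning ...", "<aligned Toaripi verse>")]
with open("train.jsonl", "w", encoding="utf-8") as f:
    for en, tqo in pairs:
        f.write(json.dumps({"text": f"English: {en}\nToaripi: {tqo}"},
                           ensure_ascii=False) + "\n")

# 2. Fine-tune a compact open base model with low-rank adapters (LoRA).
BASE = "mistralai/Mistral-7B-v0.1"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = get_peft_model(AutoModelForCausalLM.from_pretrained(BASE),
                       LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

dataset = load_dataset("json", data_files="train.jsonl")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="toaripi-slm",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```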

2. Generating educational content

After fine‑tuning, we prompt the model to produce educational outputs: vocabulary lists, comprehension questions, dialogues, and grammar exercises suitable for primary learners.
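A prompting sketch, assuming the fine‑tuned checkpoint loads through the Hugging Face transformers pipeline (the model path and prompt are placeholders):

```python
from transformers import pipeline

# Placeholder path to the fine-tuned Toaripi checkpoint.
generator = pipeline("text-generation", model="toaripi-slm")

prompt = ("Write five simple Toaripi vocabulary words about fishing, "
          "each with its English meaning, for a primary school lesson.")

output = generator(prompt, max_new_tokens=200, do_sample=True,
                   temperature=0.7)
print(output[0]["generated_text"])
```

The same pattern covers the other content types: swapping the prompt yields comprehension questions, dialogues, or grammar exercises.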

3. Running online and offline

Connected users get a lightweight web UI and API; remote schools get quantised model weights that run on CPU‑only devices such as a Raspberry Pi.
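One plausible offline setup, assuming the fine‑tuned weights have been converted to a quantised GGUF file and served through the llama-cpp-python bindings (the file name and settings are placeholders):

```python
from llama_cpp import Llama

# Quantised GGUF weights run entirely on CPU, which is what makes
# devices like a Raspberry Pi viable. The model file is a placeholder.
llm = Llama(model_path="toaripi-slm-q4_k_m.gguf", n_ctx=2048, n_threads=4)

response = llm(
    "Write a short Toaripi dialogue between two children about the weather.",
    max_tokens=150,
)
print(response["choices"][0]["text"])
```

A 4‑bit quantisation like the one assumed here shrinks a small model to a fraction of its full‑precision size, trading a little accuracy for the ability to run without a GPU or an internet connection.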

Why Low-Resource Languages Matter

The Language Gap in AI

While AI language models excel in major languages like English and Mandarin, over 7,000 languages worldwide remain largely underrepresented in AI systems. This creates a digital divide where speakers of low-resource languages cannot benefit from modern language technologies.

Languages like Toaripi, with limited digital text and few speakers, face the risk of being left behind in our increasingly AI-driven world.

Breaking Down Barriers

Developing small language models for low-resource languages helps:

  • Preserve and revitalize endangered languages
  • Create educational materials for native speakers
  • Bridge the digital divide for underrepresented communities
  • Enable cultural knowledge transfer to future generations

Research Foundation

Stanford HAI's research emphasizes that "minding the language gap" is crucial for equitable AI development. Small language models offer a practical path forward for low-resource languages when training data is limited.

Read the Stanford HAI White Paper

Join the Community

We invite educators, Toaripi speakers, linguists and developers to collaborate on this community-led initiative.

Contribute on GitHub