Core Specifications
- Architecture: FastText-based embeddings
- Dimensions: 300-dimensional vector space
- Training Data: Clean Sinhala corpus optimized for linguistic accuracy
- Vocabulary: ~500k words, with comprehensive coverage of the Sinhala lexicon
Key Capabilities
Word Embeddings
Generate dense vector representations for individual Sinhala words
Sentence Embeddings
Create contextual embeddings for full Sinhala sentences
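A common way to build a sentence vector from FastText-style word vectors is to average the vectors of the sentence's tokens. The sketch below illustrates that idea only. The `word_vectors` table, the whitespace tokenizer, and the 4-dimensional toy values are assumptions for illustration; the pooling strategy Embedding_Siyabasa actually uses is not documented here, and the real model outputs 300 dimensions.

```python
import numpy as np

# Toy stand-in for a word-vector lookup (hypothetical values).
word_vectors = {
    "මම": np.array([1.0, 0.0, 2.0, 0.0]),
    "පොත": np.array([0.0, 2.0, 0.0, 2.0]),
    "කියවමි": np.array([2.0, 1.0, 1.0, 1.0]),
}

def sentence_embedding(sentence, vectors, dim=4):
    """Average the vectors of the known whitespace-separated tokens."""
    tokens = sentence.split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to a zero vector of the right size.
        return np.zeros(dim)
    return np.mean(known, axis=0)

embedding = sentence_embedding("මම පොත කියවමි", word_vectors)
```

Averaging ignores word order, so two sentences with the same words in a different order get the same vector under this scheme.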
Semantic Search
Find semantically similar words and phrases
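Semantic search over embeddings usually reduces to ranking candidates by cosine similarity to a query vector. The following is a minimal sketch of that ranking step, assuming the vectors have already been obtained from the model; `top_k_similar` and the 3-dimensional toy vectors are hypothetical names and values, not part of the published API.

```python
import numpy as np

def top_k_similar(query_vec, candidates, k=2):
    """Return the k (label, score) pairs with the highest cosine similarity.

    Assumes every vector is non-zero.
    """
    labels = list(candidates)
    matrix = np.stack([candidates[label] for label in labels])
    # Normalize rows so that a dot product equals cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query_vec / np.linalg.norm(query_vec)
    scores = matrix @ query
    order = np.argsort(-scores)[:k]
    return [(labels[i], float(scores[i])) for i in order]

# Hypothetical vectors for three Sinhala words.
candidates = {
    "ගස": np.array([1.0, 0.0, 0.0]),
    "වෘක්ෂය": np.array([0.9, 0.1, 0.0]),
    "මුහුද": np.array([0.0, 0.0, 1.0]),
}
results = top_k_similar(np.array([1.0, 0.0, 0.0]), candidates, k=2)
```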
Similarity Analysis
Compute cosine similarity between text elements
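Cosine similarity between two vectors is their dot product divided by the product of their lengths, giving a value between -1 and 1. A minimal sketch follows; the function name and the choice to return 0.0 for a zero vector are illustrative assumptions, not the model's documented behavior.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Cosine is undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return float(np.dot(a, b) / denom)
```

With 300-dimensional embeddings the call is the same: pass the two vectors returned for the words or sentences being compared.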
Document Processing
Run semantic search over documents (.txt, .csv, .tsv)
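Searching a delimited file can be sketched as: read each row, embed its text column, and rank rows by similarity to the query. In the example below, `embed` is a placeholder (hashed character counts) standing in for a call to the real model, and `search_rows`, its parameters, and the in-memory sample file are all hypothetical. The placeholder captures no meaning; it only keeps the example runnable.

```python
import csv
import io
import numpy as np

def embed(text, dim=16):
    """Placeholder embedding (hypothetical): hashed character counts.

    Replace with a call to the actual embedding model.
    """
    vec = np.zeros(dim)
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def search_rows(handle, query, delimiter="\t", text_column=0, k=1):
    """Rank rows of a delimited file by cosine similarity to the query."""
    q = embed(query)
    scored = []
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        v = embed(row[text_column])
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        score = float(q @ v / denom) if denom else 0.0
        scored.append((score, row[text_column]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# In-memory stand-in for a .tsv file with a text column and an id column.
sample_tsv = io.StringIO("මම පොත කියවමි\t1\nඅද වැස්ස\t2\n")
best = search_rows(sample_tsv, "මම පොත කියවමි", k=1)
```

For a .csv file, pass `delimiter=","`; for a plain .txt file, each line can be treated as a single-column row.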
Language-Specific Optimization
- Trained exclusively on a high-quality Sinhala text corpus; the dataset is available on Hugging Face as CleanSinhalaTextCorpus
- Optimized for Sinhala’s grammatical structures
- Handles Sinhala compound words and morphology effectively
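FastText-style models handle morphology by representing each word through its character n-grams, so inflected or compound forms that share substrings also share part of their representation. The sketch below extracts n-grams in the usual FastText manner: the word is wrapped in `<` and `>` boundary markers and n-grams of length 3 to 6 (the FastText library defaults) are taken. The function itself is illustrative, and the n-gram range actually used to train this model is not stated in this document.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# Python strings are indexed by Unicode code point, so a Sinhala vowel
# sign counts as its own character here.
grams = char_ngrams("පොත", min_n=3, max_n=4)
```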
Model Details
| Property | Description |
|---|---|
| Model | Embedding_Siyabasa |
| API | UgannA_SiyabasaV2 |
| Supported input data type | Text |
| Supported output data type | Text embeddings |
| Input token limit | 1000 |
| Output dimension size | 300 |
| Model version | V_2.0 |
| API version | V_1.0 |
| Latest update | August 2025 |
| Language | Sinhala only |