
Core Specifications

  • Architecture: FastText-based embeddings
  • Dimensions: 300-dimensional vector space
  • Training Data: Clean Sinhala corpus optimized for linguistic accuracy
  • Vocabulary: ~500k entries, giving comprehensive coverage of the Sinhala lexicon

Key Capabilities

Word Embeddings

Generate dense vector representations for individual Sinhala words

Sentence Embeddings

Create contextual embeddings for full Sinhala sentences
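The page does not say how sentence vectors are derived. A common approach with FastText-style word vectors is to average the vectors of the words in the sentence; the sketch below illustrates that idea only. The 4-dimensional vectors, the `TOY_VECTORS` table, and the `sentence_vector` helper are all hypothetical and are not part of the API.

```python
from typing import Dict, List

DIM = 4  # toy dimension for illustration; the model itself outputs 300

# Hypothetical word vectors; real values would come from the model.
TOY_VECTORS: Dict[str, List[float]] = {
    "මම": [1.0, 0.0, 2.0, 0.0],
    "පාසලට": [0.0, 2.0, 0.0, 2.0],
    "යමි": [2.0, 1.0, 1.0, 1.0],
}


def sentence_vector(sentence: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the whitespace-separated words that are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(column) / len(known) for column in zip(*known)]


# Build the sentence from the table keys so every word is guaranteed to match.
sentence = " ".join(TOY_VECTORS)
vec = sentence_vector(sentence, TOY_VECTORS, DIM)
print(vec)  # [1.0, 1.0, 1.0, 1.0]
```

Averaging ignores word order, so two sentences with the same words in a different order receive the same vector under this scheme.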

Semantic Search

Find semantically similar words and phrases
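As a rough illustration of what semantic search over embeddings involves, the sketch below normalizes every stored vector once, so that ranking reduces to a dot product against the normalized query. The 2-dimensional vectors and the helper names (`normalize`, `build_index`, `search`) are assumptions made for this example; the service's actual search implementation is not documented here.

```python
import math
from typing import Dict, List, Tuple


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def build_index(vectors: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Pre-normalize all stored vectors so each query needs only dot products."""
    return {key: normalize(vec) for key, vec in vectors.items()}


def search(query: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k entries with the highest cosine similarity to the query."""
    q = normalize(query)
    scored = [(key, sum(x * y for x, y in zip(q, vec))) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-picked toy vectors, not real model output.
TOY_INDEX = build_index({
    "ගස": [2.0, 0.0],
    "මල": [0.8, 0.6],
    "බස්": [0.0, 3.0],
})

top = search([1.0, 0.0], TOY_INDEX, k=2)
print([word for word, _ in top])  # the two closest entries, best first
```

A linear scan like this is fine for small collections; for a vocabulary of several hundred thousand entries an approximate nearest-neighbour index would usually be preferred.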

Similarity Analysis

Compute cosine similarity between text elements
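Cosine similarity is the dot product of two vectors divided by the product of their lengths, which gives a value between -1 and 1. A minimal sketch with hand-picked 2-dimensional vectors follows; the function name `cosine_similarity` is chosen for this example and is not taken from the API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Worked example: dot = 3*4 + 4*3 = 24, both lengths are 5, so 24 / 25 = 0.96.
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 0.96
```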

Document Processing

Semantic search over documents (.txt, .csv, .tsv)

Language-Specific Optimization

  • Trained exclusively on a high-quality Sinhala text corpus. The dataset is available on Hugging Face: CleanSinhalaTextCorpus
  • Optimized for Sinhala’s grammatical structures
  • Handles Sinhala compound words and morphology effectively
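FastText-style models typically cope with compounds and inflected forms by representing a word as the sum of its character n-gram vectors, with `<` and `>` marking the word boundaries. The sketch below shows only the n-gram extraction step. The 3 to 6 range is the usual FastText default and is an assumption here, since the settings used for this model are not stated.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Python slices by Unicode code point, so Sinhala vowel signs and other
# combining marks count as separate characters in this sketch.
print(char_ngrams("ගස", 3, 3))
```

Because related word forms share many n-grams, they end up with nearby vectors, and an unseen word can still be given a vector from the n-grams it contains.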

Model Details

Property               Description
Model                  Embedding_Siyabasa API
                       UgannA_SiyabasaV2
Supported data types   Input: Text
                       Output: Text embeddings
Token limits           Input token limit: 1000
                       Output dimension size: 300
Version                Model: V_2.0
                       API: V_1.0
Latest update          August 2025
Language               Sinhala only
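
Text longer than the 1000-token input limit has to be split before it is embedded. The sketch below shows one simple way to do that. It treats whitespace-separated words as tokens, which is an assumption: the page does not say how the API counts tokens, so the real limit may be reached sooner. The `chunk_text` helper is hypothetical.

```python
from typing import List

INPUT_TOKEN_LIMIT = 1000  # input token limit listed in the Model Details table


def chunk_text(text: str, limit: int = INPUT_TOKEN_LIMIT) -> List[str]:
    """Split text into pieces of at most `limit` whitespace-separated tokens."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    tokens = text.split()
    return [" ".join(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]


# 2500 repeated words stand in for a long document.
chunks = chunk_text("වචන " * 2500)
print([len(chunk.split()) for chunk in chunks])  # [1000, 1000, 500]
```

Each chunk can then be embedded separately, yielding one 300-dimensional vector per chunk.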