Amazigh language technology has received a new research resource with the release of DATASHI, a parallel English-Tashlhiyt corpus designed for translation, orthography normalization and low-resource language processing.
The paper, submitted to arXiv in March 2026 by Nasser-Eddine Monir and Zakaria Baou, describes DATASHI as a 5,000-sentence dataset for Tashlhiyt, one of Morocco’s major Amazigh languages. The authors say the corpus is meant to address a persistent gap in computational resources for Amazigh languages, especially for tasks that require paired examples across languages and writing practices.
One of DATASHI’s most useful features is its attention to orthographic variation. The paper says the dataset includes a 1,500-sentence subset that pairs expert-standardized forms with non-standard user-generated versions. That matters because many Amazigh languages are written in more than one script and with differing spelling habits, and even Latin-based writing can vary widely from speaker to speaker. For language tools, this variation can make search, translation and automated correction difficult.
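To make the idea of paired standard and non-standard forms concrete, here is a minimal sketch of how the gap between a user's spelling and a standardized spelling might be measured. It is not taken from the paper: the function names are hypothetical, the metric (character-level edit distance) is a common choice assumed here for illustration, and the word pair is an illustrative example rather than an entry from DATASHI.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(user_form: str, standard_form: str) -> float:
    """Edit distance divided by the length of the standardized form."""
    # NFC normalization so composed and decomposed diacritics compare equal.
    user = unicodedata.normalize("NFC", user_form)
    std = unicodedata.normalize("NFC", standard_form)
    if not std:
        return 0.0 if not user else 1.0
    return edit_distance(user, std) / len(std)


# Illustrative pair (an assumption, not DATASHI data): a spelling that
# drops a doubled consonant versus a spelling that keeps it.
print(char_error_rate("tanmirt", "tanmmirt"))
```

A score of zero would mean the user's spelling already matches the standardized form; higher scores indicate more characters that a normalization system would need to change.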
The corpus is intended to support text-based tasks such as tokenization, translation and normalization. It may also serve as a foundation for read-speech data collection and multimodal work, which could be important for future Amazigh voice tools. In practical terms, that means DATASHI could help researchers build systems that recognize Tashlhiyt more consistently, handle spelling variation better and connect written data to spoken language resources.
The authors tested several large language models on the dataset and found that performance improved when the models were shown a few examples before being asked to respond, an approach commonly called few-shot prompting. Their evaluation also looked closely at difficult linguistic features, including geminates, emphatics, uvulars and pharyngeals. These features are central to many Amazigh languages but are often poorly handled by general-purpose tools trained mainly on high-resource languages.
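For readers unfamiliar with what giving a model "a few examples" looks like in practice, the sketch below assembles a simple few-shot translation prompt. It is an assumption-laden illustration, not the authors' evaluation setup: the function name, the prompt wording and the example pairs are all hypothetical, and no particular model or API is implied.

```python
def build_few_shot_prompt(examples, source_sentence,
                          task="Translate English to Tashlhiyt."):
    """Build a plain-text prompt: instruction, worked examples, then the query.

    `examples` is a list of (english, tashlhiyt) pairs. The layout is a
    hypothetical one chosen for illustration.
    """
    lines = [task, ""]
    for english, tashlhiyt in examples:
        lines.append(f"English: {english}")
        lines.append(f"Tashlhiyt: {tashlhiyt}")
        lines.append("")
    # The final entry is left open for the model to complete.
    lines.append(f"English: {source_sentence}")
    lines.append("Tashlhiyt:")
    return "\n".join(lines)


# Illustrative pairs (assumptions, not drawn from DATASHI).
demo_examples = [("Hello", "azul"), ("Thank you", "tanmmirt")]
prompt = build_few_shot_prompt(demo_examples, "Good morning")
print(prompt)
```

Passing an empty list of examples produces the zero-shot version of the same prompt, which is the natural baseline for judging whether the added examples help.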
For Amazigh communities, the significance goes beyond technical research. Digital tools increasingly shape how languages are used in education, media, public administration and cultural preservation. If Tashlhiyt and other Amazigh languages remain underrepresented in datasets, they risk being poorly served by translation systems, search engines, voice assistants and classroom technologies.
DATASHI is not a complete solution, but it is a useful building block. It points toward a future where Amazigh language technology is developed through resources that reflect the language’s own structure, variation and community needs.

