Clibrain, a Madrid-based AI startup, has joined the race to create generative AI models optimized for Spanish speakers. The company has released Lince Zero, a Spanish instruction-tuned LLM, which has been trained on a dedicated corpus of Spanish language data. Lince Zero is a 7 billion parameter taster of a more powerful foundational model (40 billion parameters) that the company has in the pipeline, which will simply be called Lince.
According to Clibrain, Spanish is one of the most spoken languages globally, boasting considerable variety in terms of dialects and variants. The company argues that this linguistic diversity makes it challenging for mainstream models to perform adequately on the Spanish language. Clibrain aims to address this gap by developing models that can parse and understand more Spanish linguistic nuance than the average LLM.
Clibrain’s LLM, Lince, is based on existing open source technologies. However, the company says it is not just reusing existing architectures, and it touts its own senior engineering talent in AI. The startup was founded only in April 2023 and has a multidisciplinary team of close to 30 staff, with an R&D lab focused on generative AI at its core.
Clibrain’s co-founder and CEO, Elena Gonzalez-Blanco, brings an educational background in linguistics research and poetry to the startup, combined with a career focus on AI. She credits her years of linguistics research with a key contribution to the project: enabling Clibrain to source unique training data to feed its model-making ambitions.
“We have a corpus [of training data] which is unique,” she says. “I am a linguist; I have, let’s say, 15 years of research in terms of history of language, Spanish language… a lot of contacts that have not been used for training yet. So we have a unique corpus [as a differentiator].”
Clibrain’s debut model, Lince Zero, is being released under an open source license. The LLM is largely based on existing open source technologies, so the company cannot yet boast its own foundational model. However, it says that is coming soon.
The release of Lince Zero is the first step on Clibrain’s ambitious roadmap. As you can tell from the parameter numbers, these LLMs are far from contending to be the biggest models on the block. But, as Gonzalez-Blanco argues, Clibrain’s conviction is that model size, per se, won’t be the killer feature when it comes to generating a performance advantage around enhanced understanding of Spanish. Rather, quality attention to linguistic detail will count, and the company hopes this will give it an edge in Spanish markets.
Clibrain’s Lince is far from the first conversational AI model to focus on Spanish. The Barcelona Supercomputing Center’s MarIA project, which launched back in 2021, claimed to be the first “massive” AI system in the Spanish language. Still, Clibrain argues it has surpassed MarIA and pulled together the most technologically “advanced” model focused on the Spanish speaking market to date.
There are a number of non-English language-optimized LLMs out there now, such as Baidu’s Chinese language model, Ernie, or an LLM family that is being tuned for German. South Korean tech giant Naver is also working on generative AI models trained on Korean.
However, Clibrain contends that its full focus on the Spanish language will enable its forthcoming foundational model, plus a series of domain-trained models it plans to develop atop the big one, to parse and understand more Spanish linguistic nuance than the average LLM.
Clibrain says Lince Zero’s performance is equivalent to GPT-3, whereas it puts MarIA’s performance on a par with GPT-2. Although benchmarking the linguistic performance of LLMs is a cutting-edge business in and of itself, Clibrain is encouraging Spanish speakers to check out what it has built and start generating feedback.
Clibrain’s co-founders have been bootstrapping development so far, using funds gleaned from previous startup exits. The company does not yet have a hefty investor roster or a deep funding war chest. Gonzalez-Blanco says they wanted to focus on developing core models and getting their first products to market, rather than on external fundraising. Still, the company may look to raise a bigger round of investment than the founders were able to plough in themselves as they continue to progress with the Lince product roadmap.
First reported on TechCrunch
Frequently Asked Questions
Q: What is Clibrain and what is its goal?
A: Clibrain is a Madrid-based AI startup focused on creating generative AI models optimized for Spanish speakers. The company aims to develop models that can parse and understand Spanish linguistic nuance better than existing language models.
Q: What is Lince Zero?
A: Lince Zero is Clibrain’s debut model release. It is a Spanish instruction-tuned large language model (LLM) trained on a dedicated corpus of Spanish language data. Lince Zero is a 7 billion parameter model and serves as a preview of Clibrain’s more powerful foundational model, Lince, which is planned to have 40 billion parameters and is currently in development.
Q: What makes Clibrain’s approach unique?
A: Clibrain differentiates itself by leveraging its unique corpus of training data, sourced through the linguistics research background of its co-founder and CEO, Elena Gonzalez-Blanco. The company combines existing open source technologies with its own senior engineering talent in AI to develop its models.
Q: How does Clibrain’s LLM compare to other conversational AI models in the Spanish language?
A: Clibrain contends that its focus on the Spanish language enables its models to outperform other existing models, including the Barcelona Supercomputing Center’s MarIA project. Clibrain claims to have the most technologically advanced model for the Spanish-speaking market.
Q: What are Clibrain’s plans for the future?
A: The release of Lince Zero is the first step in Clibrain’s roadmap. The company plans to develop its foundational model, Lince, and a series of domain-trained models. They aim to provide enhanced understanding of Spanish through quality attention to linguistic detail.
Q: How does Lince Zero’s performance compare to other models?
A: Clibrain states that Lince Zero’s performance is equivalent to OpenAI’s GPT-3 model, while suggesting that MarIA’s performance is equivalent to GPT-2. However, benchmarking linguistic performance of language models is an ongoing process.