
Swecha to build TELUGU LLM corpus & culture portal


Hyderabad: Swecha, a non-profit organization dedicated to promoting the Free Software and Free Knowledge movements, has announced a massive internship program on Artificial Intelligence (AI), the ‘SUMMER OF AI’, for over a lakh engineering students this summer. The program aims to equip students with AI skills and make them job-ready, while helping Swecha develop a Telugu language-centric LLM. The initiative is being undertaken by Swecha in collaboration with IIIT Hyderabad; Ozonetel, a leading provider of cloud communication solutions; Meta; and TASK. Chaitanya Chokkareddy, CTO, OZONETEL Communications Pvt Ltd.; Y Kiran Chandra, Founder, Swecha; and Praveen Chandra, Secretary, Swecha, briefed the media about the project at a press conference at Swecha today.

This initiative is significant because Indian-language and India-centric LLMs (Large Language Models) are virtually non-existent. India, with its rich culture and a population that constitutes one-sixth of the world's, would greatly benefit from having its own LLMs. Most Indian languages are considered low-resource languages, which makes it challenging to develop LLMs for them. A significant amount of foundational knowledge needs to be compiled and digitized to create the necessary digital data for these languages.

Today, AI is transforming the knowledge landscape by introducing new job functions such as dataset compilation, data cleaning, data labelling, and dataset management. These roles are essential for building and refining basic models and for ensuring the accuracy and effectiveness of AI applications.

Swecha aims to capitalize on the vast talent pool of engineering students graduating in India and ready to enter the industry, by training them in AI. This presents an opportunity to create a large pool of trained AI engineers, extending well beyond the small group of researchers and developers specialized in deep models.

This project, SUMMER OF AI, attempts to combine the two objectives. It is a very large-scale internship program in which first- and second-year engineering students are trained in the basics of AI and then engaged in very large-scale data collection through interviews. The project aims to interview people in villages and towns and to collect information and knowledge on folklore, local skills, and local information, including Telugu folk tales, songs, food, the history of local places, and more.

The approach of the project is to collect speech, transcribe it, and create a dataset that serves both as a speech dataset and as the basis for a base LLM. In addition, the team is working with a few large libraries and the Telugu Academy to ingest a large number of books. The work will be carried out through month-long internships for 100k interns; the first batch has started with 10k interns. Tools are being built to help with data collection at this scale. At the end of the collection, backend tools will be needed to create the dataset and to publish the information on a Teluguwiki-like portal. On successful completion of this project, a similar approach will be adopted to collect data for other languages and regions.
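As a rough illustration of the collect, transcribe, and build-a-dataset flow described above, the following Python sketch turns transcribed interviews into two outputs: speech examples (audio paired with text) and plain text lines for a language-model corpus. This is only a minimal sketch under stated assumptions. The record fields, function names, and JSON layout are hypothetical and are not taken from Swecha's actual tools, which the announcement does not describe.

```python
import json
from dataclasses import dataclass


@dataclass
class InterviewRecord:
    """One collected interview (hypothetical schema, for illustration only)."""
    audio_path: str   # where the recorded speech is stored
    transcript: str   # Telugu transcription of the recording
    topic: str        # e.g. "folk tale", "song", "food"
    place: str        # village or town where it was collected


def normalize(text: str) -> str:
    """Collapse runs of whitespace so transcripts are stored consistently."""
    return " ".join(text.split())


def build_datasets(records):
    """Return (speech_lines, corpus_lines) built from the same interviews.

    speech_lines: JSON strings pairing an audio file with its transcript.
    corpus_lines: plain text lines suitable for a text corpus.
    Records with an empty transcript are skipped.
    """
    speech_lines, corpus_lines = [], []
    for rec in records:
        text = normalize(rec.transcript)
        if not text:
            continue  # nothing transcribed yet, so skip this record
        example = {"audio": rec.audio_path, "text": text, "topic": rec.topic}
        # ensure_ascii=False keeps Telugu characters readable in the output
        speech_lines.append(json.dumps(example, ensure_ascii=False))
        corpus_lines.append(text)
    return speech_lines, corpus_lines
```

For example, passing a list containing one interview with a transcript and one with an empty transcript yields a single speech example and a single corpus line. Keeping both outputs derived from the same records is one simple way to let a single collection effort serve speech and text models alike.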

The SUMMER OF AI project has the potential to greatly enrich the Telugu language by preserving its culture through the documentation of oral traditions, folk knowledge, and personal narratives. The initiative will develop a comprehensive corpus that serves as a foundational resource for training and refining Telugu language models, ensuring more accurate and contextually appropriate language processing in digital environments. Ultimately, it empowers the Telugu community and supports language revitalization. To register for the internship:
