Meta Platforms, the parent company of Facebook, has introduced an AI model named SeamlessM4T that can translate and transcribe speech in multiple languages, enabling real-time communication across language barriers. The model can facilitate translations between text and speech in almost 100 languages and offers complete speech-to-speech translation for 35 languages.
Meta's CEO, Mark Zuckerberg, envisions these tools aiding global interactions within the metaverse, the network of interconnected virtual worlds on which the company has focused its long-term strategy.
The SeamlessM4T model is publicly accessible for non-commercial use. This continues Meta's pattern this year of releasing AI models at no cost, such as the Llama language model, to foster an open AI ecosystem.
Despite these advancements, Meta faces legal challenges similar to others in the industry over the training data used for its models: authors have filed copyright infringement lawsuits accusing Meta and OpenAI of using their books as training data without authorization. For the SeamlessM4T model, Meta says the audio training data was collected from a vast repository of crawled web audio, while the text data was drawn from datasets built from sources such as Wikipedia and associated websites.
Why does it matter?
Offering translation and transcription across numerous languages addresses a fundamental barrier to global connectivity. However, AI models still struggle with nuance, cultural context, and accuracy in translation. In addition, large language models are trained on vast amounts of written text and multimedia data obtained from the internet, including books, articles, images, and web pages, which raises legal and ethical concerns. The recent takedown of Books3, a dataset of 196,000 copyrighted books included in the Pile and referenced by Meta to train its in-house models, came in response to a DMCA request by the Danish anti-piracy group Rights Alliance. The incident prompts reflection on the complexities of incorporating copyrighted material into AI training.