AI & Computing News

Microsoft Launches New US-Developed MAI Models To Rival AI Competitors

Microsoft recently unveiled three in-house foundation AI models, now live in Microsoft Foundry, that beat OpenAI's Whisper on transcription accuracy across 25 languages and Google's Gemini 3.1 Flash Lite on 22 of them, while undercutting both on price.

Key Takeaways

  • MAI-Transcribe-1 posts a 3.8% average Word Error Rate on FLEURS, beating Whisper on all 25 languages.
  • MAI-Voice-1 generates 60 seconds of audio in under one second on a single GPU.
  • MAI-Image-2 debuted at third place on the Arena.ai image model leaderboard.
  • All three models are live now in Microsoft Foundry and MAI Playground, priced below major cloud competitors.

On April 2, 2026, Microsoft AI CEO Mustafa Suleyman unveiled three in-house foundational AI models, MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, now in public preview via Microsoft Foundry and the US‑only MAI Playground.

According to the Microsoft AI official announcement, these are the same models already powering Microsoft products like Copilot, Bing, PowerPoint, and Azure Speech, now available to developers for the first time.

The launch is the first major output from Microsoft’s MAI Superintelligence team, formed in November 2025 to pursue artificial intelligence self-sufficiency.

How MAI-Transcribe-1 Beats Whisper and Gemini

The headline model, MAI-Transcribe-1, is a speech-to-text system that achieves the lowest average Word Error Rate (WER) on the FLEURS benchmark, the industry standard for multilingual speech recognition, averaging 3.8% across the top 25 most-used languages.
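For reference, Word Error Rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch of the metric (not Microsoft's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") over six reference words
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.1667
```

A 3.8% average WER means roughly one word in 26 is wrong across the benchmark's reference transcripts.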

As VentureBeat reported, benchmarks show it outperforms OpenAI's Whisper-large-v3 on all 25 languages and beats Google's Gemini 3.1 Flash Lite on 22 of them, a direct competitive claim against two major enterprise transcription systems.

Microsoft confirmed that batch transcription is 2.5 times faster than its existing Azure Fast transcription service. The model handles challenging real-world audio, including background noise, low-quality recordings, and overlapping speech. 

It is currently being tested in Copilot's Voice mode and Microsoft Teams for live conversation transcription. That high-speed processing also feeds the newly unveiled Copilot Cowork, letting the agentic system trigger multi-step workflows directly from live audio.

MAI-Transcribe-1 is priced at $0.36 per hour, which Microsoft positions as the best price-performance of any large cloud provider for transcription.

MAI-Voice-1 and the Custom Voice Capability

Alongside MAI-Transcribe-1, Microsoft launched MAI-Voice-1, a text-to-speech engine that generates 60 seconds of natural-sounding audio in under one second on a single GPU. 

The model preserves speaker identity in long-form content and supports custom voice creation through Microsoft Foundry using just a few seconds of audio, following Microsoft’s responsible AI policies. Pricing starts at $22 per one million characters.

VentureBeat noted that MAI-Voice-1's ability to clone voices from seconds of audio and produce speech at 60x real-time positions it as a direct competitor to ElevenLabs, Resemble AI, and other voice AI startups.

Any Foundry developer can now access voice generation through the same API they already use for GPT-4 and Claude.
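Microsoft has not published the request format in the material above, so the endpoint path, deployment name, and field names below are illustrative assumptions only; a sketch of what assembling a Foundry-style speech request might look like:

```python
import json

# Hypothetical request builder for a Foundry-hosted MAI-Voice-1 deployment.
# The URL path, model name, and JSON fields are assumptions for illustration --
# consult the Microsoft Foundry documentation for the real API shape.
FOUNDRY_ENDPOINT = "https://example-resource.services.ai.azure.com"  # placeholder

def build_tts_request(text: str, voice: str = "default") -> dict:
    """Assemble the URL, headers, and JSON body for a text-to-speech call."""
    return {
        "url": f"{FOUNDRY_ENDPOINT}/models/mai-voice-1/audio/speech",  # assumed path
        "headers": {
            "Authorization": "Bearer <API_KEY>",  # placeholder credential
            "Content-Type": "application/json",
        },
        "body": json.dumps({"input": text, "voice": voice}),
    }

req = build_tts_request("Hello from MAI-Voice-1.")
print(req["url"])
```

The point of the announcement is that this request goes to the same Foundry surface developers already use for other hosted models, with only the model identifier changing.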

MAI-Image-2 and the Arena.ai Leaderboard Debut

MAI-Image-2, Microsoft’s most advanced text-to-image model, first appeared on the MAI Playground on March 19 before its formal Foundry release on April 2. 

National Today reported that the model debuted third on the Arena.ai leaderboard for image model families and delivers at least twice the generation speed in Foundry and Copilot compared to its predecessor, with no loss in output quality. 

Microsoft is rolling out MAI-Image-2 across Bing and PowerPoint, pricing it at $5 per one million tokens for text input and $33 per one million tokens for image output. 
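Taking the quoted prices at face value (the helper functions and example volumes are mine, for illustration), a quick back-of-the-envelope cost check across the three models:

```python
# Per-unit prices quoted in the announcement coverage above.
TRANSCRIBE_PER_HOUR = 0.36            # MAI-Transcribe-1, per hour of audio
VOICE_PER_MILLION_CHARS = 22.0        # MAI-Voice-1, per 1M characters
IMAGE_IN_PER_MILLION_TOKENS = 5.0     # MAI-Image-2, text input
IMAGE_OUT_PER_MILLION_TOKENS = 33.0   # MAI-Image-2, image output

def transcription_cost(hours: float) -> float:
    return hours * TRANSCRIBE_PER_HOUR

def voice_cost(characters: int) -> float:
    return characters / 1_000_000 * VOICE_PER_MILLION_CHARS

def image_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000 * IMAGE_IN_PER_MILLION_TOKENS
            + output_tokens / 1_000_000 * IMAGE_OUT_PER_MILLION_TOKENS)

print(transcription_cost(100))      # 100 hours of audio -> $36.00
print(voice_cost(500_000))          # 500k characters of speech -> $11.00
print(image_cost(10_000, 100_000))  # $0.05 input + $3.30 output -> $3.35
```

At $0.36 per hour, transcribing a full year of one person's 40-hour work weeks (about 2,000 hours) would cost roughly $720.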

WPP, the advertising holding company, is confirmed among the first enterprise partners building with MAI-Image-2 at scale.

What the Renegotiated OpenAI Deal Makes Possible

The organizational context behind the launch is as notable as the benchmarks.

Suleyman’s MAI Superintelligence team, which built the audio model with around ten engineers, could pursue this work only after Microsoft renegotiated its partnership with OpenAI in late 2025. 

The original agreement barred Microsoft from independently developing artificial general intelligence, even as the company earned $7.6 billion from its OpenAI investment last quarter.

The revised terms now allow Microsoft to create competing models while retaining licensing rights to OpenAI’s models through 2032.

MAI models are currently available through Microsoft Foundry and the MAI Playground, with Playground access limited to the US for now and broader rollout expected after public preview.

Source: Microsoft AI announcement, “Today we’re announcing 3 new world class MAI models”

Fawad Malik

Fawad Malik is a digital marketing professional with over 15 years of industry experience, specializing in SEO, SaaS, AI, content strategy, and online branding. He is the Founder and CEO of WebTech Solutions, a leading digital marketing agency committed to helping businesses grow through innovative digital strategies. Fawad shares insights on the latest trends, tools, guides and best practices in digital marketing to help marketers and online entrepreneurs worldwide. He tends to share the latest tech news, trends, and updates with the community built around NogenTech.
