Argile Labs

The Accessibility Gap

Government websites serve everyone, including people with visual impairments, cognitive disabilities, or learning differences who struggle with dense, text-heavy pages. Accessibility guidance points the same way: real-time visual feedback synced to audio is meaningfully better for users with learning or cognitive difficulties (WCAG 2.1).

Seems straightforward until you factor in the environment. Public sector organizations don’t get to pick their stack based on what’s convenient. Security reviews, procurement cycles, and compliance mandates mean solutions must build on approved infrastructure. In this case, that meant Microsoft Azure.

Why Off-the-Shelf Didn’t Cut It

Most text-to-speech tools land in one of two buckets: consumer apps that ignore enterprise security, or enterprise tools with voice quality that sounds like a 2010 GPS.

The ask was simple to state: consumer-grade user experience and voices, inside the existing Azure environment. Azure Cognitive Services provided the text-to-speech capability: neural voices that sound natural, plus word-level timing data for synchronized highlighting.

The standard way to integrate something like this is to route everything through your own servers. Browser hits your server, your server hits Azure, response bounces back the same way. It works, but it adds latency. For real-time audio streaming, users notice. For an accessibility feature specifically, that lag hits hardest for the people who need it most.

We needed a different approach.

What We Actually Built

The key insight was to separate authentication from streaming. Instead of proxying all audio traffic, we use Azure Functions to generate short-lived authorization tokens. The browser gets a token, then streams audio directly from Azure’s Speech service. The backend only handles auth, not audio. The result: lower latency, keys that stay protected, and a better experience all around.
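The token exchange can be sketched roughly as follows. This is an illustrative outline, not the project's code: `buildTokenRequest`, `isTokenFresh`, `createTokenProvider`, and the injected `HttpPost` helper are hypothetical names, and the one-minute refresh margin is an assumption. The endpoint path, the `Ocp-Apim-Subscription-Key` header, and the ten-minute token lifetime follow Azure's documented Speech token service.

```typescript
// Hypothetical sketch of a server-side token provider (e.g. inside an Azure Function).

interface TokenRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
}

interface CachedToken {
  value: string;
  expiresAtMs: number;
}

// Speech authorization tokens are documented as valid for 10 minutes.
const TOKEN_LIFETIME_MS = 10 * 60 * 1000;
// Assumption: refresh one minute early so a token never expires mid-stream.
const REFRESH_MARGIN_MS = 60 * 1000;

// Build the request for the regional token endpoint. The subscription key
// appears only here, on the server; the browser never sees it.
function buildTokenRequest(region: string, subscriptionKey: string): TokenRequest {
  return {
    url: `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    method: "POST",
    headers: { "Ocp-Apim-Subscription-Key": subscriptionKey },
  };
}

// True while the cached token is still safely inside its lifetime.
function isTokenFresh(token: CachedToken, nowMs: number): boolean {
  return nowMs < token.expiresAtMs - REFRESH_MARGIN_MS;
}

// Injected HTTP helper (hypothetical) so the sketch does not depend on a
// particular runtime or HTTP client.
type HttpPost = (req: TokenRequest) => Promise<{ ok: boolean; status: number; body: string }>;

// Returns a function that hands out a cached token, or fetches a new one.
function createTokenProvider(region: string, subscriptionKey: string, post: HttpPost) {
  let cached: CachedToken | null = null;
  return async function getToken(nowMs: number): Promise<string> {
    const current = cached;
    if (current !== null && isTokenFresh(current, nowMs)) {
      return current.value;
    }
    const res = await post(buildTokenRequest(region, subscriptionKey));
    if (!res.ok) {
      throw new Error(`token request failed with status ${res.status}`);
    }
    cached = { value: res.body, expiresAtMs: nowMs + TOKEN_LIFETIME_MS };
    return res.body;
  };
}
```

On the browser side, the returned token would typically be handed to the Speech SDK (for example via `SpeechConfig.fromAuthorizationToken(token, region)`), so audio flows directly between the browser and the Speech service.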

The synchronized highlighting took more work. Azure gives you precise timing data for each word, but web pages are messy. Markup, formatting, special characters, and nested elements mean the timing data doesn’t drop cleanly onto what’s on screen.

We built a normalization layer to bridge that gap. It walks the page content, builds a clean text representation to send to the Speech service, then maps the timing data back to the right DOM elements. The result is words lighting up exactly when they’re spoken, across paragraphs, headings, and formatted content.
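A minimal sketch of such a normalization layer, under simplifying assumptions: text nodes are represented as plain strings in document order (in a browser they might come from a `TreeWalker`), whitespace runs collapse to a single space, and each character of the clean text remembers where it came from. The names `normalize` and `locateWord` are hypothetical; the production layer is not shown in this article and likely handles more cases, such as block boundaries and special characters.

```typescript
// Where a character of the normalized text came from.
interface SourcePos {
  node: number;   // index of the originating text node
  offset: number; // character offset inside that node
}

interface Normalized {
  text: string;     // clean text to send to the speech service
  map: SourcePos[]; // map[i] is the source of text[i]
}

// A contiguous slice of one text node (end is exclusive).
interface NodeRange {
  node: number;
  start: number;
  end: number;
}

// Collapse whitespace runs to one space, drop leading and trailing whitespace,
// and record the source position of every emitted character.
// Limitation: adjacent nodes with no whitespace between them are joined.
function normalize(nodes: string[]): Normalized {
  let text = "";
  const map: SourcePos[] = [];
  let pendingSpace: SourcePos | null = null;
  for (let node = 0; node < nodes.length; node++) {
    const content = nodes[node] ?? "";
    for (let offset = 0; offset < content.length; offset++) {
      const ch = content.charAt(offset);
      if (/\s/.test(ch)) {
        // Remember only the first whitespace of a run, and none at the start.
        if (text.length > 0 && pendingSpace === null) {
          pendingSpace = { node, offset };
        }
        continue;
      }
      if (pendingSpace !== null) {
        text += " ";
        map.push(pendingSpace);
        pendingSpace = null;
      }
      text += ch;
      map.push({ node, offset });
    }
  }
  return { text, map };
}

// Map a word boundary (offset and length in the normalized text) back to
// ranges in the original nodes. A word split by markup yields several ranges.
function locateWord(n: Normalized, textOffset: number, wordLength: number): NodeRange[] {
  const ranges: NodeRange[] = [];
  const stop = Math.min(textOffset + wordLength, n.map.length);
  for (let i = Math.max(textOffset, 0); i < stop; i++) {
    const pos = n.map[i];
    if (pos === undefined) continue;
    const last = ranges[ranges.length - 1];
    if (last !== undefined && last.node === pos.node && last.end === pos.offset) {
      last.end = pos.offset + 1; // extend the current range
    } else {
      ranges.push({ node: pos.node, start: pos.offset, end: pos.offset + 1 });
    }
  }
  return ranges;
}

// Example: "online" is split across two nodes by inline markup.
const sample = normalize(["  Pay your ", "council", " tax\n  on", "line."]);
const onlineRanges = locateWord(sample, 21, 6);
```

In a browser, each returned range could be turned into a DOM `Range` over the matching text node and wrapped or styled to produce the highlight.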

Accessibility requirements shaped the rest too: keyboard navigation, pause and speed controls, and highlighting that holds up in high-contrast modes all had to be designed in. All of it sits natively inside the existing Azure stack, meeting the security and data residency requirements that were non-negotiable from the start.
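One way to keep highlighting correct through pause and speed changes is to derive the active word from the audio clock rather than from wall-clock timers. The sketch below assumes word timings have been collected up front and sorted by start time; `WordTiming`, `ticksToMs`, and `activeWordIndex` are hypothetical names, and this is an illustration of the idea rather than the project's implementation. The tick conversion reflects the Speech SDK reporting audio offsets in 100-nanosecond units.

```typescript
// One entry per spoken word, sorted by startMs.
interface WordTiming {
  startMs: number;    // when the word starts in the audio
  textOffset: number; // offset of the word in the normalized text
  wordLength: number; // length of the word in characters
}

// Audio offsets arrive in 100 ns ticks; 10,000 ticks make one millisecond.
const TICKS_PER_MS = 10000;

function ticksToMs(ticks: number): number {
  return ticks / TICKS_PER_MS;
}

// Binary search for the last word whose start time is <= currentMs.
// Returns -1 before the first word has started.
function activeWordIndex(words: WordTiming[], currentMs: number): number {
  let lo = 0;
  let hi = words.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const word = words[mid];
    if (word !== undefined && word.startMs <= currentMs) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Example timings for three words.
const timings: WordTiming[] = [
  { startMs: ticksToMs(500000), textOffset: 0, wordLength: 3 },
  { startMs: 400, textOffset: 4, wordLength: 4 },
  { startMs: 900, textOffset: 9, wordLength: 7 },
];
```

If the lookup is driven by an audio element's `currentTime` (in seconds, so multiplied by 1000), pausing freezes the highlight and a changed playback rate moves it faster or slower without extra bookkeeping.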

How It Turned Out

The final product was a custom integration. No third-party tools bolted on, no compromises on the security side, ready for a public agency site with thousands of daily visitors.

By breaking the project down requirement by requirement, we delivered a solution that fit within every constraint. It integrates into the existing enterprise environment and meets the project’s security, procurement, and data residency needs.

Making Government Websites Speak with AI
