Who Said What When? The Challenge of Diarization in Financial Compliance

Nigel Cannings April 17, 2025

Financial trading floor communications are highly regulated, and firms face unique challenges in ensuring every communication is captured, monitored, and attributed to the correct individual.

Fortunately, there have been huge advances in speech recognition technology on the financial trading floor over the last few years. Transformer networks, using similar underlying technologies to ChatGPT and DALL-E (text-to-image AI models), have increased accuracy almost to human levels in some scenarios.

The Verint® Communications Analytics™ team has been at the forefront of graphics processing unit (GPU)-accelerated speech recognition since 2014, helping shape today’s speech technology landscape.

However, one area of speech transcription processing has stubbornly failed to keep up with the accuracy of automatic speech recognition. That is diarization, or put simply, who said what and when.

What Is a Diarization System?

A diarization system is designed to automatically identify and segment different speakers in an audio recording according to speaker identity. Imagine listening to a conversation and trying to determine who is speaking at any given moment. Diarization systems do this by analyzing speech patterns and vocal characteristics to assign speaker labels to speech segments.
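To make the output concrete: a diarization system ultimately produces a list of time segments, each tagged with a speaker label. The minimal sketch below (illustrative only; the segment data and `merge_turns` helper are hypothetical, not from any real system) shows how contiguous same-speaker segments are merged into the speaker turns a reader would see in a transcript.

```python
def merge_turns(segments):
    """Merge consecutive segments with the same speaker label into turns.

    segments: list of (start_sec, end_sec, speaker) tuples, sorted by start time.
    """
    turns = []
    for start, end, spk in segments:
        if turns and turns[-1][2] == spk and start <= turns[-1][1]:
            # Same speaker continues: extend the current turn.
            prev = turns[-1]
            turns[-1] = (prev[0], max(prev[1], end), spk)
        else:
            # Speaker change (or a gap): start a new turn.
            turns.append((start, end, spk))
    return turns

# Toy diarization output for a short exchange.
segments = [
    (0.0, 1.5, "Speaker 1"),
    (1.5, 3.0, "Speaker 1"),
    (3.0, 4.2, "Speaker 2"),
    (4.2, 5.0, "Speaker 1"),
]
print(merge_turns(segments))
```

The merged turns are what downstream consumers (transcripts, analytics, LLM prompts) actually attach text to.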

Why Diarization Is Important in Financial Compliance

This technology is essential for applications such as speech transcription services, meeting analytics, and speaker identity management. Mono audio files (such as recordings of Microsoft Teams meetings and trade floor conversations) don’t indicate when the speaker changes or which speaker is talking at any given time.

On the trade floor, metadata often won’t tell us who is on a call at a particular time, and that makes trades hard to reconcile.

Speaker diarization is the first step toward working out when a trade was made and who made it.

A transcript is very hard to read without the speaker turns shown. And as we turn to Large Language Models (LLMs) to help us summarize and interpret what is in an interaction, being able to attribute a particular portion of text to a particular speaker is essential.

Key Metric: Diarization Error Rate (DER)

The primary metric for evaluating diarization systems is the Diarization Error Rate (DER), which represents the percentage of speaker time that is incorrectly classified. A lower DER indicates better performance, meaning the system can more accurately distinguish between speakers in an audio recording.
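To illustrate the idea behind DER, the sketch below scores a hypothesis against a reference by sampling the timeline at a fine step and counting the fraction of reference speech time where the two disagree. This is a simplification: a production scorer (e.g. the NIST-style tooling) also maps hypothesis labels to reference labels optimally and applies a forgiveness collar around boundaries, neither of which is done here; the example segments are invented.

```python
def der(reference, hypothesis, step=0.01):
    """Frame-based Diarization Error Rate sketch (percentage).

    reference / hypothesis: lists of (start_sec, end_sec, speaker).
    Counts frames with missed speech, false alarms, or speaker confusion,
    divided by total reference speech frames. Assumes hypothesis speaker
    labels are already aligned with reference labels (a real scorer maps
    them optimally first).
    """
    def speaker_at(segments, t):
        for start, end, spk in segments:
            if start <= t < end:
                return spk
        return None  # silence

    end_time = max(end for _, end, _ in reference + hypothesis)
    ref_frames = 0
    error_frames = 0
    t = 0.0
    while t < end_time:
        ref = speaker_at(reference, t)
        hyp = speaker_at(hypothesis, t)
        if ref is not None:
            ref_frames += 1
        if ref != hyp:
            error_frames += 1
        t += step
    return 100.0 * error_frames / ref_frames

# Hypothesis puts the speaker change 1 second late over 10 seconds of speech.
ref = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hyp = [(0.0, 6.0, "A"), (6.0, 10.0, "B")]
print(f"DER: {der(ref, hyp):.1f}%")
```

One second of speaker confusion over ten seconds of reference speech yields a DER of roughly 10%.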

The Speaker Diarization Benchmarking Study

The Verint Communications Analytics team has been refining its diarization algorithm over the past few years to address the real challenges these systems face, the biggest of which are speaker overlap and noisy environments, both particularly prevalent in trading floor conditions.

To assess the effectiveness of Communications Analytics’ latest, proprietary Greedy Clustering Version 3 (GCV3) algorithm, the team conducted an extensive benchmarking exercise, comparing its performance against:

  • Pyannote (a leading open-source diarization framework, also used in WhisperX)
  • Commercial diarization APIs from four leading providers.

Competitive Analysis: Verint GCV3 vs. Pyannote

The GCV3 algorithm demonstrated a significant improvement over Pyannote, outperforming it by an average of 8.21 percentage points of DER across multiple datasets. While Pyannote performed well on the VoxConverse dataset, Verint’s GCV3 delivered superior results across all other test scenarios.

It is worth noting that Pyannote has been extensively trained on the VoxConverse development dataset, so it would be expected to perform particularly well on it; GCV3 does not use that dataset in its training at all.

The team also used two generated datasets: one drawn from years of daily development standup calls held over Zoom, and one from a wide selection of CC BY (Creative Commons Attribution) licensed YouTube videos.

The Zoom calls most closely mimicked many of the financial trade floor use cases we find in production, with up to 20 speakers on different headsets and microphone setups (many far-field), and a wide variety of accents, including some non-English speakers.

| Dataset | Language | Verint (GCV3) DER (%) | Pyannote DER (%) | Improvement (pts) |
| --- | --- | --- | --- | --- |
| DIHARD3 (full) | English | 10.28 | 21.70 | 11.42 |
| AISHELL-4 (Dev + Eval) | Mandarin | 32.90 | 40.24 | 7.34 |
| VoxConverse | English | 19.65 | 15.94 | -3.71 |
| RTVE | Spanish | 35.59 | 36.12 | 0.53 |
| Internal Standup Meeting | English | 10.31 | 32.20 | 21.89 |
| Custom YouTube Dataset | US English | 4.24 | 16.05 | 11.81 |

Competitive Analysis: Verint GCV3 vs. Commercial APIs

In addition to Pyannote, the team evaluated GCV3 against major commercial diarization APIs. The tests used the most challenging, manually labelled internal meeting data and YouTube audio. To ensure a fair comparison, the JSON (JavaScript Object Notation) results from four major providers were converted into RTTM (Rich Transcription Time Marked) files using the speaker labels and timestamps each provider returned.
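The conversion step can be sketched as follows. The JSON payload shape here (a "segments" list with "start", "end", and "speaker" fields) is a hypothetical stand-in, since each provider uses its own schema; the RTTM SPEAKER line layout, however, is the standard one.

```python
import json

def json_to_rttm(json_text, file_id):
    """Convert a provider-style diarization JSON payload into RTTM lines.

    RTTM SPEAKER lines have the form:
    SPEAKER <file> <chan> <start> <duration> <NA> <NA> <speaker> <NA> <NA>
    Times are in seconds; duration is derived from end - start.
    """
    payload = json.loads(json_text)
    lines = []
    for seg in payload["segments"]:
        start = float(seg["start"])
        duration = float(seg["end"]) - start
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {seg['speaker']} <NA> <NA>"
        )
    return "\n".join(lines)

# Toy payload mimicking an API response.
sample = (
    '{"segments": ['
    '{"start": 0.0, "end": 2.4, "speaker": "spk_0"},'
    '{"start": 2.4, "end": 5.1, "speaker": "spk_1"}]}'
)
print(json_to_rttm(sample, "call_001"))
```

Once every system’s output is in RTTM, a single scorer can compute DER for all of them on equal terms.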

Verint’s GCV3 achieved less than half the average DER of the closest competitor and significantly surpassed the others.

| System | Average DER (%) |
| --- | --- |
| Verint (GCV3) | 7.28 |
| Company 1 | 15.59 |
| Company 2 | 40.12 |
| Company 3 | 49.64 |
| Company 4 | 61.83 |

This goes to show that determining speaker changes and then speaker identity (in this case Speaker 1, Speaker 2, etc.) is a challenge for even the top names in the industry.

Built around a “traditional” x-vector extraction approach, our proprietary GCV3 algorithm employs a sophisticated, multi-stage pipeline to achieve high diarization accuracy, with a particular focus on handling edge-case speaker segments, detecting speaker turn changes, and refining x-vector segmentation.
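To give a feel for the general family of techniques (this is a textbook-style sketch, not Verint’s GCV3, whose details are proprietary), greedy clustering of x-vectors can work like this: each segment embedding joins the existing speaker whose running centroid it is most similar to, provided the similarity clears a threshold; otherwise it starts a new speaker. The 2-D “x-vectors” and the 0.6 threshold below are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(xvectors, threshold=0.6):
    """Greedily assign each segment x-vector a speaker index (0, 1, ...)."""
    centroids = []  # one running-mean centroid per discovered speaker
    counts = []
    labels = []
    for vec in xvectors:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(vec, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No existing speaker is similar enough: open a new one.
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Fold the vector into the matched speaker's running mean.
            n = counts[best]
            centroids[best] = [(c * n + v) / (n + 1)
                               for c, v in zip(centroids[best], vec)]
            counts[best] += 1
            labels.append(best)
    return labels

# Toy 2-D "x-vectors": two well-separated voices, interleaved.
embs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.95, 0.05), (0.2, 0.9)]
print(greedy_cluster(embs))  # → [0, 0, 1, 0, 1]
```

Real systems operate on high-dimensional x-vectors from a trained neural extractor, and much of the engineering effort goes into the edge cases the paragraph above mentions: very short segments, overlapping speech, and rapid turn changes.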

Looking Ahead

The results of our benchmarking study highlight Verint’s commitment to delivering state-of-the-art audio processing and speech transcription solutions and set a new standard in diarization accuracy.

As we continue to refine and improve our technology, Verint remains dedicated to pushing the boundaries of speech technology as a critical component of best-in-class AI, delivering solutions for the most demanding audio processing applications.

Find out more by watching our webinar: Finding the Gaps in Voice Communications – A New Approach to Financial Compliance.

Or, read the eBook: https://www.verint.com/wp-content/uploads/2024/10/8-common-voice-challenges-financial-compliance-ebook.pdf.