VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

Abstract

As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another—a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM evaluations test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most models perform near random chance, about 50% accuracy on binary privacy decisions. Our analysis shows these failures stem from a specific inability to handle conversational context, not a general failure to converse. We also demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve the model's privacy-preserving capabilities while achieving fair robustness. To support future work, we are releasing the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to help the development of safer and more context-aware SLMs.

VoxPrivacy-Task

VoxPrivacy Task Overview
Tier 1: Direct Command Secrecy

This task tests the model's obedience to an explicit command (e.g., "Do not share this with anyone."). The model is expected to uphold this command absolutely, refusing to disclose the information to any subsequent querier, regardless of their identity.

English

中文

Tier 2: Speaker-Verified Secrecy

This task introduces speaker verification as a condition for disclosure. Given a more nuanced instruction like "Let's keep this between us", the model must leverage the querier's voice as a biometric key, granting access exclusively to the original speaker while denying all others.

English

中文

Tier 3: Proactive Privacy Protection

The most challenging task evaluates a model's ability to proactively protect user privacy, acting without any explicit instruction. It requires the model to use common-sense understanding to recognize when an utterance is inherently private (e.g., "I'm worried about my upcoming medical results.") based on its content. The model must then automatically enforce a speaker-conditioned access policy, disclosing the information only to the verified owner.

English

中文

VoxPrivacy-Data

VoxPrivacy Data Pipeline

Overview of the VoxPrivacy benchmark construction and evaluation pipeline

7,107
Total Utterances
Across all tasks and categories
32.86
Hours of Audio
Bilingual dataset (EN/ZH)
400
Unique Speakers
200 EN + 200 ZH (1:1 gender)
Key Dataset Features
Quality Assurance
  • Audio quality measured by DNSMOS
  • Speech intelligibility via Whisper WER
  • Human verification for coherence
Diversity & Balance
  • 8 secret categories covering real-world scenarios
  • Balanced 1:1 English-Chinese ratio
  • Equal gender distribution (1:1 ratio)
Multi-LLM Generation
  • Deepseek, Gemini, and ChatGPT collaboration
  • Mitigates single-model bias
  • Ensures linguistic diversity
High-Fidelity Audio
  • CosyVoice2 TTS for natural speech
  • Distinct speaker characteristics
  • Quality thresholds enforced

VoxPrivacy-Eval

Critical Gap

Most open-source SLMs perform at ~50% accuracy on privacy tasks

⚠️ No better than random guessing

Language Gap

Significant EN vs ZH performance disparity

🌐 Multilingual privacy remains challenging

Fine-tuning Approach

Our fine-tuned model achieves 80%+ accuracy across tasks

📈 Shows promising improvements over baseline models

Performance Across Privacy Tiers
Tier 1: Direct Command

"Do not share this with anyone"

Closed-source ~82%
Open-source ~35%
Ours (Fine-tuned) ~85%
Tier 2: Speaker-Verified

"Keep this between us"

Closed-source ~72%
Open-source ~49%
Ours (Fine-tuned) ~82%
Tier 3: Proactive Protection

"I'm worried about my medical results"

Closed-source ~65%
Open-source ~49%
Ours (Fine-tuned) ~78%
Closed-source (Gemini)
Open-source (Average)
Our Fine-tuned Model
Key Insights

Speaker-aware reasoning is an advanced capability that most open-source SLMs fundamentally lack.

The shift from following commands (Tier 2) to making social judgments (Tier 3) is a critical failure point.

Fine-tuning shows promising results as a potential approach, with substantial improvements over baseline open-source models.