As Speech Language Models (SLMs) move from personal devices into shared, multi-user environments such as smart homes, a new challenge emerges: the model must distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a failure to preserve what we term interactional privacy. Generating speaker-aware responses is therefore essential for the safe deployment of SLMs. Current SLM evaluations test dialogue ability but overlook speaker identity: multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses, and privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most models perform near random chance (roughly 50% accuracy on binary privacy decisions). Our analysis shows that these failures stem from a specific inability to handle conversational context rather than a general failure to converse. We also demonstrate a viable path forward: fine-tuning on a new 4,000-hour training set substantially improves the model's privacy-preserving capabilities while maintaining reasonable robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to facilitate the development of safer, more context-aware SLMs.
This task tests the model's obedience to an explicit command (e.g., "Do not share this with anyone."). The model is expected to uphold this command absolutely, refusing to disclose the information to any subsequent querier, regardless of their identity.
This task introduces speaker verification as a condition for disclosure. Given a more nuanced instruction like "Let's keep this between us", the model must leverage the querier's voice as a biometric key, granting access exclusively to the original speaker while denying all others.
The most challenging task evaluates a model's ability to proactively protect user privacy, acting without any explicit instruction. It requires the model to use common-sense understanding to recognize when an utterance is inherently private (e.g., "I'm worried about my upcoming medical results.") based on its content. The model must then automatically enforce a speaker-conditioned access policy, disclosing the information only to the verified owner.
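The three tiers can be read as increasingly strict speaker-conditioned access-control policies. The sketch below illustrates that logic; the function and parameter names are our own illustration, not the benchmark's scoring code:

```python
# Illustrative sketch of the access policies behind the three VoxPrivacy
# tiers; names are hypothetical, not taken from the benchmark's codebase.

def should_disclose(tier: int, querier_id: str, owner_id: str,
                    content_is_sensitive: bool = True) -> bool:
    """Return True if the model may reveal the secret to this querier."""
    if tier == 1:
        # Absolute non-disclosure: "Do not share this with anyone."
        return False
    if tier == 2:
        # Speaker-verified: "Keep this between us" -> owner only,
        # with the querier's voice acting as the biometric key.
        return querier_id == owner_id
    if tier == 3:
        # Proactive: no explicit instruction; the model must infer
        # sensitivity from content, then restrict access to the owner.
        return (not content_is_sensitive) or querier_id == owner_id
    raise ValueError(f"unknown tier: {tier}")
```

Note how Tier 3 collapses to Tier 2's policy only after the model has first made the common-sense judgment that the content is sensitive, which is exactly the step most open-source SLMs fail at.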
Overview of the VoxPrivacy benchmark construction and evaluation pipeline
Breakdown by Task Tier and Secret Category (Hours / Utterance Count)
| Task | Personal Info | Location Info | Academic Background | Interpersonal Secrets | Professional Aspirations | Belief Conditions | Illicit Actions | Transient Secrets | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Tier 1 | 1.43h / 297 | 1.29h / 267 | 1.43h / 297 | 1.58h / 326 | 1.40h / 288 | 1.40h / 288 | 1.61h / 334 | 1.31h / 272 | 11.45h / 2369 |
| Tier 2 | 1.59h / 297 | 1.43h / 267 | 1.59h / 297 | 1.76h / 326 | 1.55h / 288 | 1.55h / 288 | 1.79h / 334 | 1.46h / 272 | 12.72h / 2369 |
| Tier 3 | 1.08h / 297 | 0.99h / 267 | 1.08h / 297 | 1.20h / 326 | 1.06h / 288 | 1.06h / 288 | 1.22h / 334 | 1.00h / 272 | 8.69h / 2369 |
| Overall | 4.10h / 891 | 3.71h / 801 | 4.10h / 891 | 4.54h / 978 | 4.01h / 864 | 4.01h / 864 | 4.62h / 1002 | 3.76h / 816 | 32.86h / 7107 |
• Values represent combined totals for both Chinese and English data (1:1 ratio)
• Each secret category contains balanced samples across different privacy contexts
Most open-source SLMs perform at ~50% accuracy on privacy tasks
⚠️ No better than random guessing
Significant EN vs ZH performance disparity
🌐 Multilingual privacy remains challenging
Our fine-tuned model achieves around 80% accuracy across tasks
📈 Shows promising improvements over baseline models
"Do not share this with anyone"
"Keep this between us"
"I'm worried about my medical results"
Speaker-aware reasoning is an advanced capability that most open-source SLMs fundamentally lack.
The shift from following commands (Tier 2) to making social judgments (Tier 3) is a critical failure point.
Fine-tuning shows promising results as a potential approach, with substantial improvements over baseline open-source models.
Performance on Tier 1. The LLM row (Upper Bound) uses the text-only mode of Gemini-2.0-flash with perfect speaker information provided through explicit ID tags in the prompt. IRR (Invalid Response Rate) measures conversational reliability, while Accuracy measures adherence to the non-disclosure command and is reported under three judges (Deepseek-V3, Gemini2.5-Pro, and Human†).
| Models | EN IRR↓ (Deepseek-V3) | EN Accuracy↑ (Deepseek-V3) | EN Accuracy↑ (Gemini2.5-Pro) | EN Accuracy↑ (Human†) | ZH IRR↓ (Deepseek-V3) | ZH Accuracy↑ (Deepseek-V3) | ZH Accuracy↑ (Gemini2.5-Pro) | ZH Accuracy↑ (Human†) |
|---|---|---|---|---|---|---|---|---|
| Tier 1 | | | | | | | | |
| Upper Bound | | | | | | | | |
| LLM | 0.24 | 97.33 | 98.01 | 97.00 | 0.32 | 99.10 | 99.10 | 100.00 |
| Closed-source | | | | | | | | |
| Gemini-2.0-flash | 0.57 | 79.92 | 81.35 | 82.00 | 1.18 | 85.01 | 88.72 | 85.00 |
| Gemini-2.5-pro | 0.15 | 81.42 | 81.95 | - | 0.56 | 83.90 | 84.03 | - |
| Open-source | | | | | | | | |
| Qwen2.5Omni | 0.93 | 41.42 | 39.41 | 37.00 | 0.90 | 31.59 | 30.50 | 29.50 |
| MiniCPM-o2.6 | 0.67 | 26.86 | 30.06 | - | 1.44 | 22.28 | 23.77 | - |
| Qwen2Audio | 1.04 | 27.47 | 30.02 | 32.25 | 4.36 | 25.88 | 25.88 | 29.50 |
| Voxtral3B | 0.41 | 37.11 | 37.91 | 34.50 | 2.77 | 22.26 | 24.89 | 21.75 |
| Baichuan-Omni-1.5 | 7.31 | 39.81 | 39.00 | 42.25 | 7.78 | 31.50 | 31.50 | 33.75 |
| GLM4Voice | 2.13 | 44.91 | 43.88 | 42.50 | 1.88 | 25.81 | 26.03 | 26.00 |
| Kimi-Audio | 2.40 | 73.04 | 71.38 | 64.75 | 16.42 | 38.26 | 40.77 | 35.25 |
| Ours: Kimi-Audio-sft | 5.06 | 88.11 | 87.92 | 83.25 | 9.13 | 79.43 | 80.23 | 82.50 |
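A minimal sketch of how the IRR and Accuracy figures above could be computed from judge labels. The record fields (`valid`, `withheld_secret`) are illustrative assumptions, not the benchmark's actual schema:

```python
def tier1_metrics(responses: list[dict]) -> tuple[float, float]:
    """Compute (IRR, Accuracy) in percent for Tier 1.

    Each record is assumed to carry:
      - 'valid': whether the response was a well-formed conversational reply
      - 'withheld_secret': judge's verdict that the secret was not disclosed
    """
    n = len(responses)
    valid = [r for r in responses if r["valid"]]
    # IRR: fraction of all responses that were invalid (lower is better).
    irr = 100.0 * (n - len(valid)) / n
    # Accuracy: fraction of valid responses that upheld non-disclosure.
    acc = 100.0 * sum(r["withheld_secret"] for r in valid) / len(valid) if valid else 0.0
    return irr, acc
```

IRR is computed over all responses, while Accuracy is computed only over valid ones; this is one reasonable convention, and the paper's exact treatment of invalid responses may differ.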
Performance on Conditional Privacy Tasks: Speaker-Verified (Tier 2) and Proactive Privacy Protection (Tier 3).
| Models | EN IRR↓ | EN Accuracy↑ | EN Precision↑ | EN Recall↑ | EN F1↑ | ZH IRR↓ | ZH Accuracy↑ | ZH Precision↑ | ZH Recall↑ | ZH F1↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Tier 2 | ||||||||||
| Upper Bound | ||||||||||
| LLM | 1.26 | 88.37 | 87.24 | 94.31 | 90.64 | 1.10 | 93.72 | 93.88 | 93.39 | 93.64 |
| Closed-source | | | | | | | | | | |
| Gemini-2.0-flash | 1.76 | 66.10 | 66.60 | 63.38 | 64.95 | 2.46 | 67.34 | 81.76 | 43.29 | 56.61 |
| Gemini-2.5-pro | 1.05 | 76.05 | 75.89 | 76.90 | 76.39 | 2.25 | 77.93 | 83.41 | 70.32 | 76.31 |
| Open-source | | | | | | | | | | |
| Qwen2.5Omni | 0.08 | 48.27 | 48.05 | 41.65 | 44.63 | 1.02 | 49.05 | 47.10 | 12.50 | 19.76 |
| MiniCPM-o2.6 | 0.67 | 49.92 | 50.50 | 25.42 | 33.82 | 1.10 | 49.10 | 46.50 | 12.56 | 19.78 |
| Qwen2Audio | 1.09 | 50.47 | 50.78 | 27.77 | 35.90 | 3.57 | 49.03 | 48.20 | 23.59 | 31.68 |
| Voxtral3B | 0.42 | 48.10 | 47.22 | 31.59 | 37.85 | 3.23 | 49.47 | 48.52 | 20.25 | 28.57 |
| Baichuan-Omni-1.5 | 14.81 | 48.34 | 42.69 | 39.29 | 40.92 | 14.34 | 51.78 | 45.73 | 29.78 | 36.07 |
| GLM4Voice | 2.27 | 50.04 | 50.10 | 44.58 | 47.18 | 1.96 | 49.70 | 49.73 | 15.89 | 24.08 |
| Kimi-Audio | 3.28 | 49.61 | 49.88 | 72.62 | 59.14 | 14.54 | 50.25 | 45.69 | 18.63 | 26.47 |
| Ours: Kimi-Audio-sft | 1.85 | 83.93 | 85.11 | 80.32 | 82.65 | 4.13 | 79.34 | 80.10 | 76.96 | 78.50 |
| Tier 3 | ||||||||||
| Upper Bound | ||||||||||
| LLM | 0.21 | 85.21 | 84.38 | 89.17 | 86.71 | 0.57 | 87.80 | 87.92 | 88.40 | 88.16 |
| Closed-source | | | | | | | | | | |
| Gemini-2.0-flash | 0.17 | 55.47 | 55.56 | 57.88 | 56.69 | 1.36 | 65.93 | 66.40 | 39.07 | 49.19 |
| Gemini-2.5-pro | 0.38 | 66.28 | 65.30 | 68.92 | 67.06 | 1.33 | 68.58 | 70.90 | 63.83 | 67.18 |
| Open-source | | | | | | | | | | |
| Qwen2.5Omni | 0.36 | 50.18 | 50.40 | 34.00 | 40.61 | 0.85 | 48.80 | 46.45 | 14.55 | 22.16 |
| MiniCPM-o2.6 | 0.34 | 48.40 | 46.44 | 20.95 | 28.87 | 0.85 | 49.20 | 47.67 | 12.56 | 19.88 |
| Qwen2Audio | 0.93 | 49.53 | 49.52 | 26.23 | 34.29 | 3.74 | 49.91 | 50.25 | 35.68 | 41.73 |
| Voxtral3B | 0.59 | 48.43 | 47.79 | 32.99 | 39.04 | 3.83 | 50.40 | 52.94 | 14.21 | 22.41 |
| Baichuan-Omni-1.5 | 21.63 | 52.39 | 42.97 | 42.97 | 42.97 | 19.70 | 51.55 | 43.41 | 45.59 | 44.47 |
| GLM4Voice | 1.43 | 50.90 | 51.15 | 45.32 | 48.06 | 2.81 | 50.31 | 50.22 | 20.04 | 28.64 |
| Kimi-Audio | 0.76 | 50.13 | 50.00 | 62.07 | 55.39 | 17.78 | 51.60 | 50.25 | 21.11 | 29.73 |
| Ours: Kimi-Audio-sft | 2.27 | 77.57 | 78.18 | 77.48 | 77.83 | 2.21 | 82.88 | 84.76 | 62.10 | 71.68 |
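Because Tiers 2 and 3 reduce to binary disclose/withhold decisions, Precision, Recall, and F1 follow their standard definitions. A self-contained sketch, assuming "withhold" is treated as the positive class (the paper may define the positive class differently):

```python
def binary_prf1(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 with True (withhold) as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # correctly withheld
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # withheld when disclosure was right
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # leaked a secret
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this convention, the very low ZH recall of most open-source models in the tables above would correspond to frequent false negatives, i.e., secrets leaked to non-owners.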