As Speech Language Models (SLMs) move from personal devices into shared, multi-user environments such as smart homes, a new challenge emerges: the model must distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a failure to preserve what we term interactional privacy. Generating speaker-aware responses is therefore essential for the safe deployment of SLMs. Current SLM evaluations test dialogue ability but overlook speaker identity: multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses, and privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextually sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most models perform near random chance (roughly 50% accuracy on binary privacy decisions). Our analysis shows that these failures stem from a specific inability to handle conversational context rather than a general failure to converse. We also demonstrate a viable path forward: fine-tuning on a new 4,000-hour training set substantially improves the model's privacy-preserving capabilities while maintaining reasonable robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to facilitate the development of safer, more context-aware SLMs.
This task tests the model's obedience to an explicit command (e.g., "Do not share this with anyone."). The model is expected to uphold this command absolutely, refusing to disclose the information to any subsequent querier, regardless of their identity.
This task introduces speaker verification as a condition for disclosure. Given a more nuanced instruction like "Let's keep this between us", the model must leverage the querier's voice as a biometric key, granting access exclusively to the original speaker while denying all others.
The most challenging task evaluates a model's ability to proactively protect user privacy, acting without any explicit instruction. It requires the model to use common-sense understanding to recognize when an utterance is inherently private (e.g., "I'm worried about my upcoming medical results.") based on its content. The model must then automatically enforce a speaker-conditioned access policy, disclosing the information only to the verified owner.
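The three tiers can be read as increasingly strict speaker-conditioned access-control policies. The sketch below illustrates that logic; the function and parameter names are our own illustration, not the benchmark's scoring code:

```python
# Illustrative sketch of the access policies behind the three VoxPrivacy
# tiers; names are hypothetical, not taken from the benchmark's codebase.

def should_disclose(tier: int, querier_id: str, owner_id: str,
                    content_is_sensitive: bool = True) -> bool:
    """Return True if the model may reveal the secret to this querier."""
    if tier == 1:
        # Absolute non-disclosure: "Do not share this with anyone."
        return False
    if tier == 2:
        # Speaker-verified: "Keep this between us" -> owner only,
        # with the querier's voice acting as the biometric key.
        return querier_id == owner_id
    if tier == 3:
        # Proactive: no explicit instruction; the model must infer
        # sensitivity from content, then restrict access to the owner.
        return (not content_is_sensitive) or querier_id == owner_id
    raise ValueError(f"unknown tier: {tier}")
```

Note how Tier 3 collapses to Tier 2's policy only after the model has first made the common-sense judgment that the content is sensitive, which is exactly the step most open-source SLMs fail at.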
Overview of the VoxPrivacy benchmark construction and evaluation pipeline
Breakdown by Task Tier and Secret Category (Hours / Utterance Count)
| Task | Personal Info | Location Info | Academic Background | Interpersonal Secrets | Professional Aspirations | Belief Conditions | Illicit Actions | Transient Secrets | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Tier 1 | 1.43h / 297 | 1.29h / 267 | 1.43h / 297 | 1.58h / 326 | 1.40h / 288 | 1.40h / 288 | 1.61h / 334 | 1.31h / 272 | 11.45h / 2369 |
| Tier 2 | 1.59h / 297 | 1.43h / 267 | 1.59h / 297 | 1.76h / 326 | 1.55h / 288 | 1.55h / 288 | 1.79h / 334 | 1.46h / 272 | 12.72h / 2369 |
| Tier 3 | 1.08h / 297 | 0.99h / 267 | 1.08h / 297 | 1.20h / 326 | 1.06h / 288 | 1.06h / 288 | 1.22h / 334 | 1.00h / 272 | 8.69h / 2369 |
| Overall | 4.10h / 891 | 3.71h / 801 | 4.10h / 891 | 4.54h / 978 | 4.01h / 864 | 4.01h / 864 | 4.62h / 1002 | 3.76h / 816 | 32.86h / 7107 |
• Values represent combined totals for both Chinese and English data (1:1 ratio)
• Each secret category contains balanced samples across different privacy contexts
Most open-source SLMs perform at ~50% accuracy on privacy tasks
⚠️ No better than random guessing
Significant EN vs ZH performance disparity
🌐 Multilingual privacy remains challenging
Our fine-tuned model achieves around 80% accuracy across tasks
📈 Shows promising improvements over baseline models
"Do not share this with anyone"
"Keep this between us"
"I'm worried about my medical results"
Speaker-aware reasoning is an advanced capability that most open-source SLMs fundamentally lack.
The shift from following commands (Tier 2) to making social judgments (Tier 3) is a critical failure point.
Fine-tuning shows promising results as a potential approach, with substantial improvements over baseline open-source models.
Performance on Tier 1. The LLM row (Upper Bound) uses the text-only mode of Gemini-2.0-flash with perfect speaker information provided through explicit ID tags in the prompt. IRR (Invalid Response Rate) measures conversational reliability, while Accuracy measures adherence to the non-disclosure command and is reported under three judges (Deepseek-V3, Gemini2.5-Pro, and Human†).
| Models | EN IRR↓ (Deepseek-V3) | EN Accuracy↑ (Deepseek-V3) | EN Accuracy↑ (Gemini2.5-Pro) | EN Accuracy↑ (Human†) | ZH IRR↓ (Deepseek-V3) | ZH Accuracy↑ (Deepseek-V3) | ZH Accuracy↑ (Gemini2.5-Pro) | ZH Accuracy↑ (Human†) |
|---|---|---|---|---|---|---|---|---|
| Tier 1 | | | | | | | | |
| Upper Bound | | | | | | | | |
| LLM | 0.24 | 97.33 | 98.01 | 97.00 | 0.32 | 99.10 | 99.10 | 100.00 |
| Closed-source | | | | | | | | |
| Gemini-2.0-flash | 0.57 | 79.92 | 81.35 | 82.00 | 1.18 | 85.01 | 88.72 | 85.00 |
| Gemini-2.5-pro | 0.15 | 81.42 | 81.95 | - | 0.56 | 83.90 | 84.03 | - |
| Open-source | | | | | | | | |
| Qwen2.5Omni | 0.93 | 41.42 | 39.41 | 37.00 | 0.90 | 31.59 | 30.50 | 29.50 |
| MiniCPM-o2.6 | 0.67 | 26.86 | 30.06 | - | 1.44 | 22.28 | 23.77 | - |
| Qwen2Audio | 1.04 | 27.47 | 30.02 | 32.25 | 4.36 | 25.88 | 25.88 | 29.50 |
| Voxtral3B | 0.41 | 37.11 | 37.91 | 34.50 | 2.77 | 22.26 | 24.89 | 21.75 |
| Baichuan-Omni-1.5 | 7.31 | 39.81 | 39.00 | 42.25 | 7.78 | 31.50 | 31.50 | 33.75 |
| GLM4Voice | 2.13 | 44.91 | 43.88 | 42.50 | 1.88 | 25.81 | 26.03 | 26.00 |
| Kimi-Audio | 2.40 | 73.04 | 71.38 | 64.75 | 16.42 | 38.26 | 40.77 | 35.25 |
| Ours: Kimi-Audio-sft | 5.06 | 88.11 | 87.92 | 83.25 | 9.13 | 79.43 | 80.23 | 82.50 |
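A minimal sketch of how the IRR and Accuracy figures above could be computed from judge labels. The record fields (`valid`, `withheld_secret`) are illustrative assumptions, not the benchmark's actual schema:

```python
def tier1_metrics(responses: list[dict]) -> tuple[float, float]:
    """Compute (IRR, Accuracy) in percent for Tier 1.

    Each record is assumed to carry:
      - 'valid': whether the response was a well-formed conversational reply
      - 'withheld_secret': judge's verdict that the secret was not disclosed
    """
    n = len(responses)
    valid = [r for r in responses if r["valid"]]
    # IRR: fraction of all responses that were invalid (lower is better).
    irr = 100.0 * (n - len(valid)) / n
    # Accuracy: fraction of valid responses that upheld non-disclosure.
    acc = 100.0 * sum(r["withheld_secret"] for r in valid) / len(valid) if valid else 0.0
    return irr, acc
```

IRR is computed over all responses, while Accuracy is computed only over valid ones; this is one reasonable convention, and the paper's exact treatment of invalid responses may differ.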
Performance on Conditional Privacy Tasks: Speaker-Verified (Tier 2) and Proactive Privacy Protection (Tier 3).
| Models | EN IRR↓ | EN Accuracy↑ | EN Precision↑ | EN Recall↑ | EN F1↑ | ZH IRR↓ | ZH Accuracy↑ | ZH Precision↑ | ZH Recall↑ | ZH F1↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Tier 2 | ||||||||||
| Upper Bound | ||||||||||
| LLM | 1.26 | 88.37 | 87.24 | 94.31 | 90.64 | 1.10 | 93.72 | 93.88 | 93.39 | 93.64 |
| Closed-source | | | | | | | | | | |
| Gemini-2.0-flash | 1.76 | 66.10 | 66.60 | 63.38 | 64.95 | 2.46 | 67.34 | 81.76 | 43.29 | 56.61 |
| Gemini-2.5-pro | 1.05 | 76.05 | 75.89 | 76.90 | 76.39 | 2.25 | 77.93 | 83.41 | 70.32 | 76.31 |
| Open-source | | | | | | | | | | |
| Qwen2.5Omni | 0.08 | 48.27 | 48.05 | 41.65 | 44.63 | 1.02 | 49.05 | 47.10 | 12.50 | 19.76 |
| MiniCPM-o2.6 | 0.67 | 49.92 | 50.50 | 25.42 | 33.82 | 1.10 | 49.10 | 46.50 | 12.56 | 19.78 |
| Qwen2Audio | 1.09 | 50.47 | 50.78 | 27.77 | 35.90 | 3.57 | 49.03 | 48.20 | 23.59 | 31.68 |
| Voxtral3B | 0.42 | 48.10 | 47.22 | 31.59 | 37.85 | 3.23 | 49.47 | 48.52 | 20.25 | 28.57 |
| Baichuan-Omni-1.5 | 14.81 | 48.34 | 42.69 | 39.29 | 40.92 | 14.34 | 51.78 | 45.73 | 29.78 | 36.07 |
| GLM4Voice | 2.27 | 50.04 | 50.10 | 44.58 | 47.18 | 1.96 | 49.70 | 49.73 | 15.89 | 24.08 |
| Kimi-Audio | 3.28 | 49.61 | 49.88 | 72.62 | 59.14 | 14.54 | 50.25 | 45.69 | 18.63 | 26.47 |
| Ours: Kimi-Audio-sft | 1.85 | 83.93 | 85.11 | 80.32 | 82.65 | 4.13 | 79.34 | 80.10 | 76.96 | 78.50 |
| Tier 3 | ||||||||||
| Upper Bound | ||||||||||
| LLM | 0.21 | 85.21 | 84.38 | 89.17 | 86.71 | 0.57 | 87.80 | 87.92 | 88.40 | 88.16 |
| Closed-source | | | | | | | | | | |
| Gemini-2.0-flash | 0.17 | 55.47 | 55.56 | 57.88 | 56.69 | 1.36 | 65.93 | 66.40 | 39.07 | 49.19 |
| Gemini-2.5-pro | 0.38 | 66.28 | 65.30 | 68.92 | 67.06 | 1.33 | 68.58 | 70.90 | 63.83 | 67.18 |
| Open-source | | | | | | | | | | |
| Qwen2.5Omni | 0.36 | 50.18 | 50.40 | 34.00 | 40.61 | 0.85 | 48.80 | 46.45 | 14.55 | 22.16 |
| MiniCPM-o2.6 | 0.34 | 48.40 | 46.44 | 20.95 | 28.87 | 0.85 | 49.20 | 47.67 | 12.56 | 19.88 |
| Qwen2Audio | 0.93 | 49.53 | 49.52 | 26.23 | 34.29 | 3.74 | 49.91 | 50.25 | 35.68 | 41.73 |
| Voxtral3B | 0.59 | 48.43 | 47.79 | 32.99 | 39.04 | 3.83 | 50.40 | 52.94 | 14.21 | 22.41 |
| Baichuan-Omni-1.5 | 21.63 | 52.39 | 42.97 | 42.97 | 42.97 | 19.70 | 51.55 | 43.41 | 45.59 | 44.47 |
| GLM4Voice | 1.43 | 50.90 | 51.15 | 45.32 | 48.06 | 2.81 | 50.31 | 50.22 | 20.04 | 28.64 |
| Kimi-Audio | 0.76 | 50.13 | 50.00 | 62.07 | 55.39 | 17.78 | 51.60 | 50.25 | 21.11 | 29.73 |
| Ours: Kimi-Audio-sft | 2.27 | 77.57 | 78.18 | 77.48 | 77.83 | 2.21 | 82.88 | 84.76 | 62.10 | 71.68 |
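Because Tiers 2 and 3 reduce to binary disclose/withhold decisions, Precision, Recall, and F1 follow their standard definitions. A self-contained sketch, assuming "withhold" is treated as the positive class (the paper may define the positive class differently):

```python
def binary_prf1(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 with True (withhold) as the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # correctly withheld
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # withheld when disclosure was right
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # leaked a secret
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Under this convention, the very low ZH recall of most open-source models in the tables above would correspond to frequent false negatives, i.e., secrets leaked to non-owners.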