See What I See, Know What I Think

See What I See, Know What I Think:
Dense Latent Communication Across Heterogeneous Agents

University of Michigan · NVIDIA · University of Pennsylvania · University of Colorado Boulder · Michigan State University
Preprint 2026

Abstract

Multi-agent systems communicate mostly through text, paying a lossy and expensive decode and re-encode cost. KV-cache communication is a promising alternative, yet most prior work is homogeneous, using duplicate copies of the same model, and avoids the central challenge of cross-model latent alignment; existing heterogeneous methods are also restrictive, typically assuming shared input and using transferred caches mainly for steering. We study a more fundamental question: can heterogeneous agents be aligned well enough to perform real "mind reading" and transfer both what one agent sees and how it thinks? Our information-structure analysis reveals a duality: context-aware transfer is driven by sparse reasoning signals, while context-unaware transfer, where the receiver sees no input, requires dense contextual knowledge preservation. Motivated by this, we propose dense alignment for heterogeneous KV-cache communication via a lightweight cross-model cache transformation and two-phase training: reconstruction followed by generation. Across all six directions of {Qwen3-4B, 8B, 14B} and six in-domain and out-of-domain benchmarks, our method outperforms prior heterogeneous baselines, matches or exceeds text communication in context-aware settings at roughly 2 to 3× lower compute, and remains effective in context-unaware transfer where prior methods collapse.

We run post-hoc compressed sensing in homogeneous self-communication (Qwen3-4B to Qwen3-4B), recover head importance with Lasso, aggregate to KV-group scores, and sweep top-K KV retention under both communication regimes.

The compressed-sensing estimator identifies which sender heads contribute most to communication quality. We solve a sparse linear inverse problem over random ablation masks:

\[ \hat{\alpha} = \arg\min_{\alpha} \frac{1}{2M}\|\tilde{y}-\Phi\alpha\|_2^2 + \lambda \|\alpha\|_1, \quad \tilde{y}=y-y_0. \]
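The estimator above can be sketched on synthetic data. This is a minimal illustration, not the authors' code. It assumes binary masks where 1 means a head is ablated, a baseline score `y0` from the unablated run, a plain ISTA solver, and KV-group aggregation by summing absolute head scores. The function names, the mask convention, and the aggregation rule are all hypothetical choices made for the example.

```python
import numpy as np

def lasso_ista(Phi, y_tilde, lam, n_iter=2000):
    """Minimise (1/(2M)) * ||y_tilde - Phi a||_2^2 + lam * ||a||_1 with ISTA."""
    M, N = Phi.shape
    # Step size 1/L, where L = sigma_max(Phi)^2 / M bounds the gradient's Lipschitz constant.
    step = M / (np.linalg.norm(Phi, 2) ** 2)
    alpha = np.zeros(N)
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ alpha - y_tilde) / M
        z = alpha - step * grad
        # Soft-thresholding is the proximal operator of the L1 penalty.
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)
    return alpha

def kv_group_scores(alpha, heads_per_group):
    """Aggregate head scores to KV groups (assumed rule: sum of |alpha| per group)."""
    return np.abs(alpha).reshape(-1, heads_per_group).sum(axis=1)

def top_k_groups(scores, k):
    """Indices of the k highest-scoring KV groups."""
    return np.argsort(-scores)[:k]

# Synthetic demo: 32 heads, 200 random ablation masks, 3 truly important heads.
rng = np.random.default_rng(0)
M, N = 200, 32
Phi = rng.integers(0, 2, size=(M, N)).astype(float)  # assumed: 1 = head ablated
alpha_true = np.zeros(N)
alpha_true[[3, 10, 25]] = [-5.0, -3.0, -2.0]          # ablating these heads hurts the score
y0 = 80.0                                             # hypothetical unablated baseline score
y = y0 + Phi @ alpha_true + 0.1 * rng.standard_normal(M)

alpha_hat = lasso_ista(Phi, y - y0, lam=0.05)
scores = kv_group_scores(alpha_hat, heads_per_group=4)
kept = top_k_groups(scores, k=3)
```

In this toy setup the three planted heads fall in KV groups 0, 2, and 6, so a top-3 retention sweep would keep exactly those groups. The real measurement `y` would come from running the receiver under each mask, which is not modelled here.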

Method: position disentanglement, layer/group transformation, and structured gating.

Training uses two phases: Phase I reconstruction aligns sender caches into receiver-native cache geometry for dense information preservation; Phase II generation jointly optimizes context-aware and context-unaware decoding so the aligned cache becomes directly actionable.

\[ \mathcal{L}_{\mathrm{rec}} = \sum_{l,g} \Big( \|\widetilde{K}_R^{(l,g)}-K_R^{(l,g)}\|_2^2 + \|\widetilde{V}_R^{(l,g)}-V_R^{(l,g)}\|_2^2 \Big). \]

\[ \mathcal{L}_{\mathrm{gen}} = -\sum_t \log p_{\mathcal{A}_R} \big( y_t \mid y_{<t}, \widetilde{\mathcal{C}}_R(X), X_R \big), \quad X_R \in \{X,\emptyset\}. \]
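The two objectives can be written out numerically as follows. This is a sketch under stated assumptions, not the training code: caches are stacked into arrays of shape (layers, KV groups, tokens, head dim), the squared norm is taken entrywise, and the receiver's logits are supplied directly rather than produced by a model. Weighting the context-aware and context-unaware terms equally in Phase II is also an assumption.

```python
import numpy as np

def reconstruction_loss(K_tilde, K_R, V_tilde, V_R):
    """Phase I: squared error between transformed and receiver-native caches,
    summed over every layer l and KV group g (and all entries within them)."""
    return float(np.sum((K_tilde - K_R) ** 2) + np.sum((V_tilde - V_R) ** 2))

def generation_loss(logits, targets):
    """Phase II: negative log-likelihood -sum_t log p(y_t | ...) from per-step logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].sum())

# Toy caches: 2 layers, 4 KV groups, 5 tokens, head dimension 8.
rng = np.random.default_rng(0)
K_R = rng.standard_normal((2, 4, 5, 8))
V_R = rng.standard_normal((2, 4, 5, 8))
# Keys off by 0.1 everywhere, values exact: 320 entries * 0.01 = 3.2.
l_rec = reconstruction_loss(K_R + 0.1, K_R, V_R, V_R)

# Toy logits for 3 target tokens over a 10-token vocabulary.
targets = np.array([1, 2, 3])
logits_aware = np.zeros((3, 10))    # stands in for decoding with X_R = X
logits_unaware = np.zeros((3, 10))  # stands in for decoding with X_R = empty
l_gen = generation_loss(logits_aware, targets) + generation_loss(logits_unaware, targets)
```

With uniform logits each token contributes log(10) to the loss, so the joint Phase II value here is 6 log(10).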

4. Dense Alignment Results

Context-aware communication (In-domain + Out-of-domain + TFLOPs)

Pair	Method	GSM8K	MATH-500	ARC-C	MMLU-Redux	MedQA	OpenBookQA	TFLOPs
4B→8B	Receiver-only	81.10	49.20	91.00	72.10	53.00	91.20	19.85
	T2T	88.10	76.00	91.74	80.75	67.40	90.40	37.73
	C2C	77.86	44.20	86.09	75.87	56.87	85.60	4.50
	Ours	92.95	82.00	93.69	78.52	67.24	91.20	12.56
4B→14B	Receiver-only	83.70	46.40	92.60	71.80	64.70	91.80	31.16
	T2T	92.34	77.80	92.00	82.87	71.17	92.00	56.24
	C2C	82.34	44.20	92.43	72.76	63.00	87.00	6.64
	Ours	93.86	86.00	94.20	78.57	71.96	93.60	21.54
8B→4B	Receiver-only	82.40	44.20	89.20	65.00	47.70	88.00	9.18
	T2T	89.39	63.20	90.96	79.39	66.46	87.00	33.91
	C2C	72.48	37.40	86.78	70.19	55.07	77.20	3.43
	Ours	91.81	83.40	93.00	77.86	66.30	89.60	7.95
8B→14B	Receiver-only	83.70	46.40	92.60	71.80	64.70	91.80	31.16
	T2T	93.93	75.00	91.74	84.13	72.66	92.80	67.08
	C2C	82.26	43.20	92.43	73.81	64.65	87.00	7.42
	Ours	94.09	85.00	94.37	77.38	70.46	93.40	21.79
14B→4B	Receiver-only	82.40	44.20	89.20	65.00	47.70	88.00	9.18
	T2T	90.60	60.20	89.83	80.61	69.91	85.00	43.48
	C2C	70.58	36.60	85.83	69.73	53.42	76.40	5.01
	Ours	91.13	82.60	91.89	77.66	63.00	88.80	10.18
14B→8B	Receiver-only	81.10	49.20	91.00	72.10	53.00	91.20	19.85
	T2T	91.58	73.00	92.35	83.43	72.90	90.20	55.50
	C2C	76.35	43.00	89.39	75.99	62.69	85.60	7.83
	Ours	92.95	81.40	93.77	78.04	70.38	92.60	15.64
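The compute comparison against text communication can be read directly off the TFLOPs column. The short script below divides the T2T figure by the figure for our method in each direction. It takes the table values at face value and makes no assumption about how the TFLOPs were measured.

```python
# TFLOPs copied from the context-aware table above.
t2t_tflops = {
    "4B->8B": 37.73, "4B->14B": 56.24, "8B->4B": 33.91,
    "8B->14B": 67.08, "14B->4B": 43.48, "14B->8B": 55.50,
}
ours_tflops = {
    "4B->8B": 12.56, "4B->14B": 21.54, "8B->4B": 7.95,
    "8B->14B": 21.79, "14B->4B": 10.18, "14B->8B": 15.64,
}

# Ratio > 1 means text communication costs more compute than the aligned cache.
ratios = {pair: t2t_tflops[pair] / ours_tflops[pair] for pair in t2t_tflops}

for pair, ratio in ratios.items():
    print(f"{pair}: T2T uses {ratio:.2f}x the TFLOPs of Ours")
```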

Context-unaware communication (In-domain + Out-of-domain + TFLOPs)

Pair	Method	GSM8K	MATH-500	ARC-C	MMLU-Redux	MedQA	OpenBookQA	TFLOPs
4B→8B	T2T-ctx-unaware	51.63	74.40	19.48	21.00	18.85	24.00	37.86
	C2C-ctx-unaware	1.90	3.00	22.52	21.91	8.96	27.00	14.94
	Ours-ctx-unaware	91.43	78.80	91.38	74.59	61.82	88.80	9.42
4B→14B	T2T-ctx-unaware	56.79	75.20	23.39	23.06	27.65	27.20	60.83
	C2C-ctx-unaware	0.00	0.00	10.70	10.26	12.80	16.40	20.05
	Ours-ctx-unaware	82.26	70.60	86.86	62.86	53.26	82.40	17.15
8B→4B	T2T-ctx-unaware	27.98	69.40	23.74	25.18	22.94	23.20	32.75
	C2C-ctx-unaware	0.38	2.40	10.35	8.20	7.23	10.00	14.76
	Ours-ctx-unaware	91.36	81.60	93.60	77.06	64.57	90.00	6.56
8B→14B	T2T-ctx-unaware	30.86	70.40	22.96	22.80	27.81	27.40	72.06
	C2C-ctx-unaware	0.00	0.00	7.39	7.14	6.36	7.60	9.84
	Ours-ctx-unaware	81.58	65.00	88.48	54.92	57.74	83.80	15.85
14B→4B	T2T-ctx-unaware	18.57	66.60	20.61	24.52	21.13	20.20	41.95
	C2C-ctx-unaware	0.68	0.40	4.96	6.11	0.63	6.60	17.93
	Ours-ctx-unaware	82.49	64.00	88.48	67.15	59.54	85.40	8.90
14B→8B	T2T-ctx-unaware	18.80	66.40	22.87	21.41	25.22	26.20	53.80
	C2C-ctx-unaware	0.00	0.20	2.78	5.34	5.18	2.20	24.77
	Ours-ctx-unaware	84.15	68.40	88.99	67.84	54.20	88.00	13.28

BibTeX

@misc{chen2026denselatentcommunication,
  title={See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents},
  author={Siyi Chen and Xiaoyan Zhang and Meng Wu and Jonathan Tremblay and Valts Blukis and Stan Birchfield and Rene Vidal and Alvaro Velasquez and Sijia Liu and Qing Qu},
  year={2026},
  note={Preprint}
}
