- Non-MoE Capacity Controls (Table 4): This crucial ablation shows that simply adding parameters is not the solution. Methods like Wide-FFN (making the model wider) or Append/Prepend-blocks (making it deeper) provide minimal or even negative gains; IDIOMoE's structured separation is the key to its success. A sketch of these capacity-control baselines follows the table.
Table 4: Non-MoE capacity controls on Amazon-Beauty and Industrial datasets; values are relative changes over the Item-LLM baseline. (Manual Transcription)

| Method | Amazon-Beauty NDCG@10 ∆(%) | Amazon-Beauty HR@10 ∆(%) | Industrial NDCG@10 ∆(%) | Industrial HR@10 ∆(%) |
|---|---|---|---|---|
| Item-LLM (baseline) | – | – | – | – |
| LoRA-LLM | +21.5% | +7.9% | -79.1% | -76.3% |
| Wide-FFN | +27.0% | +24.9% | +3.8% | +1.3% |
| Append-blocks | -87.8% | -90.3% | -5.5% | -5.3% |
| Prepend-blocks | -97.2% | -95.9% | -15.3% | -16.2% |
| MoA | +48.3% | +46.2% | +20.9% | +27.1% |
| MoT | +49.3% | +51.1% | +22.5% | +24.8% |
| IDIOMoE | +48.1% | +49.6% | **+24.1%** | **+28.9%** |
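The capacity-control baselines can be pictured as follows. This is a minimal PyTorch sketch under assumed settings: the widening factor, the number of extra blocks, and the function names are illustrative, not the paper's exact configurations.

```python
import torch.nn as nn


def wide_ffn(d_model: int, widen: float = 1.5) -> nn.Module:
    """Wide-FFN control: keep depth fixed, grow the feed-forward hidden size."""
    d_ff = int(4 * d_model * widen)
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))


def extra_blocks(d_model: int, n_heads: int, n_extra: int = 2) -> nn.ModuleList:
    """Append-/Prepend-blocks control: add whole transformer layers,
    placed after (append) or before (prepend) the existing backbone."""
    return nn.ModuleList(
        nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        for _ in range(n_extra)
    )
```

Neither control changes how item and text tokens share parameters, which is consistent with raw capacity alone failing to match IDIOMoE in Table 4.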
- Item Expert Capacity (Table 5): The optimal size of the item expert depends on the dataset's complexity. For smaller datasets, a "shrunken" expert works well, saving parameters. For the large industrial dataset, a full-capacity expert is needed, highlighting the method's flexibility.
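One way to expose this capacity knob is to scale the item expert's hidden width. The sketch below assumes a GELU feed-forward expert and a `capacity` multiplier; both are illustrative choices, not the paper's exact design.

```python
import torch.nn as nn


class ItemExpertFFN(nn.Module):
    """Item-expert feed-forward block whose hidden width is a fraction of the text FFN's."""

    def __init__(self, d_model: int, d_ff: int, capacity: float = 0.5):
        super().__init__()
        d_hidden = max(1, int(d_ff * capacity))  # capacity < 1.0 gives a "shrunken" expert
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)
```

In these terms, Table 5 roughly corresponds to a small `capacity` sufficing for Amazon-Beauty-scale data, while the industrial dataset calls for `capacity = 1.0`.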
- MoE Layer Placement (Table 6): Placing the MoE modules in the last 8 layers of the network yields the best performance. This suggests that the disentanglement is most critical at deeper layers, where abstract, task-specific representations for ranking are formed.
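As a rough illustration of the placement choice, the sketch below swaps in a two-expert feed-forward module only in the last N layers. The `DualExpertFFN` stand-in, the `layer.ffn` attribute, and the masking scheme are assumptions about the backbone, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DualExpertFFN(nn.Module):
    """Stand-in for a per-layer expert pair: the pretrained text FFN plus a new item FFN."""

    def __init__(self, text_ffn: nn.Module, d_model: int, d_ff: int):
        super().__init__()
        self.text_ffn = text_ffn  # shared text expert kept from the backbone
        self.item_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, item_mask):
        # item_mask: (batch, seq) bool; item tokens use the item expert,
        # all other tokens keep the original text FFN.
        mask = item_mask.unsqueeze(-1)
        return torch.where(mask, self.item_ffn(x), self.text_ffn(x))


def place_experts(layers: nn.ModuleList, d_model: int, d_ff: int, last_n: int = 8):
    """Replace the FFN only in the final `last_n` layers; earlier layers keep
    the original shared feed-forward network."""
    for layer in layers[max(0, len(layers) - last_n):]:
        layer.ffn = DualExpertFFN(layer.ffn, d_model, d_ff)
    return layers
```

Restricting the split to the deep layers also keeps the parameter overhead modest, since the earlier layers remain fully shared.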
- Static vs. Dynamic Routing (Table 7): A simple static routing rule based on token type dramatically outperforms a learned, dynamic router. This is a key finding, indicating that for this problem, forcing specialization is more effective than allowing the model to mix signals, which leads back to entanglement.
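The routing contrast can be made concrete with a small sketch: static routing dispatches each token by its known type, while a learned gate mixes both experts per token. The gate's exact form here (a per-token softmax over two experts) is an assumption about the dynamic baseline, not the paper's definition.

```python
import torch
import torch.nn as nn


class StaticRouter(nn.Module):
    """Static rule: item tokens -> item expert, text tokens -> text expert."""

    def forward(self, x, item_mask, text_expert, item_expert):
        mask = item_mask.unsqueeze(-1)  # (batch, seq, 1) bool
        return torch.where(mask, item_expert(x), text_expert(x))


class LearnedRouter(nn.Module):
    """Dynamic alternative: a per-token softmax gate blends both experts."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 2)

    def forward(self, x, item_mask, text_expert, item_expert):
        w = self.gate(x).softmax(dim=-1)  # (batch, seq, 2) mixing weights
        return w[..., :1] * text_expert(x) + w[..., 1:] * item_expert(x)
```

The static rule keeps each token's computation inside a single expert, whereas the learned gate can blend the two streams and re-entangle text and item signals, which matches the Table 7 finding.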