Compitum RouterBench Evaluation Summary

This report compares Compitum against baseline routers (each fixed single model, plus an oracle assignment) on a bounded evaluation set. Regret is reported as each router's accuracy_mean gap to the oracle; lower regret indicates routing closer to the oracle assignment.

Metrics

accuracy_mean and cost_mean for Compitum, each single-model baseline, and the oracle (a sketch of how such means can be aggregated from per-prompt results follows the list).

  • compitum: accuracy_mean 0.5022, cost_mean 1.0159
  • WizardLM/WizardLM-13B-V1.2: accuracy_mean 0.4423, cost_mean 0.0902
  • claude-instant-v1: accuracy_mean 0.3923, cost_mean 0.2571
  • claude-v1: accuracy_mean 0.4612, cost_mean 2.4918
  • claude-v2: accuracy_mean 0.5111, cost_mean 2.6118
  • gpt-3.5-turbo-1106: accuracy_mean 0.5842, cost_mean 0.3007
  • gpt-4-1106-preview: accuracy_mean 0.7091, cost_mean 3.3715
  • meta/code-llama-instruct-34b-chat: accuracy_mean 0.4275, cost_mean 0.2264
  • meta/llama-2-70b-chat: accuracy_mean 0.4853, cost_mean 0.2640
  • mistralai/mistral-7b-chat: accuracy_mean 0.4401, cost_mean 0.0589
  • mistralai/mixtral-8x7b-chat: accuracy_mean 0.5768, cost_mean 0.1757
  • oracle: accuracy_mean 0.8704, cost_mean 0.2064
  • zero-one-ai/Yi-34B-Chat: accuracy_mean 0.5883, cost_mean 0.2368
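
For reference, the sketch below shows one way such per-router means could be aggregated from a per-prompt results file. The column names (router, correct, cost) are illustrative assumptions, not the actual RouterBench schema.

```python
# Minimal aggregation sketch. Column names ("router", "correct", "cost")
# are assumptions for illustration, not the actual RouterBench schema.
import csv
from collections import defaultdict

def summarize(path: str) -> dict:
    """Compute accuracy_mean and cost_mean per router from per-prompt rows."""
    totals = defaultdict(lambda: {"n": 0, "correct": 0.0, "cost": 0.0})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            t = totals[row["router"]]
            t["n"] += 1
            t["correct"] += float(row["correct"])  # 1.0 if judged correct, else 0.0
            t["cost"] += float(row["cost"])        # cost of the routed call
    return {
        name: {
            "accuracy_mean": t["correct"] / t["n"],
            "cost_mean": t["cost"] / t["n"],
        }
        for name, t in totals.items()
    }
```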

Pairwise Differences (Compitum minus baseline)

Each delta is Compitum's mean minus the baseline's mean, so a positive accuracy_mean delta and a negative cost_mean delta favor Compitum (a computation sketch follows the list).

  • vs WizardLM/WizardLM-13B-V1.2: accuracy_mean +0.0599, cost_mean +0.9257
  • vs claude-instant-v1: accuracy_mean +0.1100, cost_mean +0.7588
  • vs claude-v1: accuracy_mean +0.0410, cost_mean -1.4759
  • vs claude-v2: accuracy_mean -0.0089, cost_mean -1.5958
  • vs gpt-3.5-turbo-1106: accuracy_mean -0.0820, cost_mean +0.7152
  • vs gpt-4-1106-preview: accuracy_mean -0.2069, cost_mean -2.3555
  • vs meta/code-llama-instruct-34b-chat: accuracy_mean +0.0747, cost_mean +0.7895
  • vs meta/llama-2-70b-chat: accuracy_mean +0.0169, cost_mean +0.7519
  • vs mistralai/mistral-7b-chat: accuracy_mean +0.0622, cost_mean +0.9570
  • vs mistralai/mixtral-8x7b-chat: accuracy_mean -0.0745, cost_mean +0.8402
  • vs oracle: accuracy_mean -0.3682, cost_mean +0.8095
  • vs zero-one-ai/Yi-34B-Chat: accuracy_mean -0.0860, cost_mean +0.7791
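
The deltas above follow mechanically from the summary table. A minimal sketch, assuming a summary dict shaped like the output of the summarize() helper sketched earlier:

```python
# Deltas are Compitum's mean minus each baseline's mean, matching the list above.
def pairwise_deltas(summary: dict, router: str = "compitum") -> dict:
    ours = summary[router]
    return {
        name: {
            "accuracy_delta": round(ours["accuracy_mean"] - m["accuracy_mean"], 4),
            "cost_delta": round(ours["cost_mean"] - m["cost_mean"], 4),
        }
        for name, m in summary.items()
        if name != router
    }
```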

Regret (accuracy gap to oracle)

Regret is the oracle's accuracy_mean minus each router's accuracy_mean, so lower is better; for Compitum this is 0.8704 - 0.5022 = 0.3682 (a computation sketch follows the list).

  • Compitum: +0.3682

  • WizardLM/WizardLM-13B-V1.2: +0.4281

  • claude-instant-v1: +0.4782

  • claude-v1: +0.4092

  • claude-v2: +0.3593

  • gpt-3.5-turbo-1106: +0.2863

  • gpt-4-1106-preview: +0.1613

  • meta/code-llama-instruct-34b-chat: +0.4429

  • meta/llama-2-70b-chat: +0.3851

  • mistralai/mistral-7b-chat: +0.4304

  • mistralai/mixtral-8x7b-chat: +0.2937

  • zero-one-ai/Yi-34B-Chat: +0.2822
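
A one-function sketch of this regret computation, again assuming the summary dict shape from the earlier sketch:

```python
# Regret = oracle accuracy_mean minus the router's accuracy_mean (lower is better).
def regret(summary: dict, router: str, oracle: str = "oracle") -> float:
    return summary[oracle]["accuracy_mean"] - summary[router]["accuracy_mean"]
```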

Determinism

Compitum routing is deterministic given fixed models and parameters, which reduces variance and improves reproducibility compared with routers that rely on stochastic LLM calls to make routing decisions.
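
To illustrate the reproducibility point, here is a toy deterministic router: with fixed feature weights, the same input always maps to the same model, so repeated runs reproduce the routing exactly. The weights, feature names, and model names below are made up for illustration and do not reflect Compitum's actual scoring function.

```python
# Toy illustration of deterministic routing; the feature weights and model
# names are hypothetical and not Compitum's real scoring logic.
def route(features: dict[str, float], weights: dict[str, dict[str, float]]) -> str:
    """Pick the model with the highest fixed linear score for these features."""
    scores = {
        model: sum(w.get(k, 0.0) * v for k, v in features.items())
        for model, w in weights.items()
    }
    # Iterate over sorted model names so ties break deterministically.
    return max(sorted(scores), key=scores.__getitem__)

weights = {
    "cheap-model": {"length": -0.01, "code": -0.5},
    "strong-model": {"length": 0.02, "code": 1.0},
}
# Identical inputs and parameters always yield the identical routing decision.
assert route({"length": 120, "code": 1.0}, weights) == route({"length": 120, "code": 1.0}, weights)
```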