codeval · User Guide

Compare code from any LLM — measure what actually matters

codeval lets you paste outputs from different AI models side by side and instantly see which one is closest to the minimal optimal solution, which added unnecessary complexity, and why.

Multi-model comparison · Complexity analysis · Manifold deviation · PDF export
01

What is codeval?

codeval is a code comparison tool designed to evaluate how well different large language models solve the same programming task. Beyond checking if the code runs, it measures how unnecessarily complex each solution is compared to the theoretical optimal.

Core idea

For any programming task there exists a minimal optimal solution — the simplest correct code possible. codeval measures the manifold deviation: how far each model's output drifted from that optimal center. A model can produce correct code that still scores poorly because it over-engineered the solution.

Describe the task · Write or paste your prompt
Add model columns · One column per LLM
Paste the code · Copy output from each model
Click Analyze · Runs all metrics instantly
Export PDF · Shareable report
02

Interface layout

The app is divided into three zones: the left sidebar for configuration, the central columns area for code input, and the bottom panel for analysis results.

Left sidebar
Task / Prompt
Language ▾ Python
Quick examples: is_even(n) · py, factorial(n) · py, debounce(fn, ms) · js

Example columns with their grade badges:

A · GPT-4o · optimal
def is_even(n):
    return n % 2 == 0

C · Claude 3.5 · verbose
def is_even(n):
    if n % 2 == 0:
        return True
    else:
        return False

F · Gemini · bloated
class EvenChecker:
    def check(self, n):
        return n % 2 == 0

def is_even(n):
    return EvenChecker().check(n)

Results panel tabs: Overview · Complexity · Tests · Diff · Similarity · Issues
Tip — columns

Columns are fully dynamic. Add as many as you need with + column in the top-right. Each column can be renamed and assigned to a specific model using the colored badge on the column header.

03

Step-by-step guide

Follow these steps to run your first comparison in under two minutes.

1

Describe the task

In the Task / Prompt field on the left sidebar, type or paste the exact prompt you gave to each model. For example: Write a function factorial(n) that returns n!

Quick start

Use the Quick examples section in the sidebar to auto-load pre-filled prompts with sample code from multiple models — great for understanding the interface before using your own data.

2

Select the programming language

Use the Language dropdown to set the language of the code being compared. This helps the analyzer apply the right complexity rules. Supported: Python, JavaScript, TypeScript, Java, Go, Rust, C++.

3

Add model columns

Click + column in the top-right of the toolbar (or the + button at the end of the columns area) to add a new column for each model you want to compare.

Click the colored model badge (shows "Custom" by default) on each column header to pick the model from the list. The badge updates with the model's brand color.

GPT-4o · Claude 3.5 · Claude 3.7 · Gemini 1.5 · Gemini 2.0 · Qwen3-14B · Llama 3.1 · DeepSeek · Custom

You can also type a custom label in the text field next to the badge — useful when comparing versions of the same model with different system prompts.

4

Paste the code

Copy the output from each model and paste it into the corresponding column's code editor. The editor accepts any code — no formatting required. You can paste entire responses including comments and docstrings.

Important

Clicking ▶ Analyze while some columns are empty shows a warning; after you confirm, the empty columns are skipped and only columns containing code are analyzed. At least one column must contain code for the analysis to run.

5

Click ▶ Analyze

Press the orange ▶ Analyze button in the top-right. The tool sends all versions to the backend, which runs the full complexity analysis pipeline. Each column header updates with a letter grade (A–F) within seconds.

The results panel slides up at the bottom of the screen, and the ⬇ Export PDF button becomes visible.

6

Read the results

Navigate the six tabs in the results panel at the bottom. Start with Overview for a quick score summary, then go to Complexity for the full metric table. See the next section for a detailed explanation of each tab.

04

Result tabs explained

After clicking Analyze, a panel with six tabs appears at the bottom. Each tab shows a different lens on the same analysis data.

Overview

Score summary + deviation bars

Shows the overall score percentage for each model and a horizontal bar representing manifold deviation. Green = optimal, amber = minor deviation, red = unnecessary complexity. Best starting point.

Complexity

Full metrics comparison table

All 15 metrics in a side-by-side table. Grouped into: Manifold, Complexity, Structure, Code Quality, and Correctness. Color-coded values make it easy to spot which model is worst on each dimension.

Tests

Test case results per model

Shows pass/fail for each test case defined in the app. If no tests are defined (the test section was removed), shows 0/0. Load a Quick Example to see this tab populated with real test data.

Diff

Line-by-line diff vs reference

Requires a reference solution to be defined in the backend. Shows added lines (green) and removed lines (red) for each model's output. Highlights exactly where each model diverged from the optimal.

Similarity

Token similarity heatmap

A matrix showing how similar each pair of models is to each other (0% = completely different, 100% = identical). Useful for detecting when two models produced essentially the same solution.
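The backend's exact similarity measure isn't documented in this guide; as a rough mental model, a token-level ratio like the one in Python's standard difflib produces comparable 0–100% numbers (the function name here is illustrative, not a codeval API):

```python
import difflib

def token_similarity(code_a: str, code_b: str) -> float:
    """Percentage similarity between two snippets' whitespace-split tokens."""
    tokens_a, tokens_b = code_a.split(), code_b.split()
    return difflib.SequenceMatcher(None, tokens_a, tokens_b).ratio() * 100

print(round(token_similarity("return n % 2 == 0", "return n % 2 == 0")))  # 100
```

Identical snippets score 100%, completely disjoint token sequences score 0%, and everything in between reflects how much of the code the two models share.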

Issues

Detected complexity problems

A plain-language list of specific issues found in each model's code: trivial intermediate variables, unnecessary if/else, dead assignments, premature abstraction into classes, nested functions, etc.
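Checks like these are straightforward to express over a syntax tree. As an illustration (not codeval's actual detector), here is how the `result = expr; return result` pattern from the list above can be flagged with Python's standard ast module:

```python
import ast

def find_trivial_returns(source: str) -> list[int]:
    """Line numbers of `x = expr` statements immediately followed by `return x`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        body = getattr(node, "body", None)
        if not isinstance(body, list):  # skip nodes whose body isn't a statement list
            continue
        for first, second in zip(body, body[1:]):
            if (isinstance(first, ast.Assign)
                    and len(first.targets) == 1
                    and isinstance(first.targets[0], ast.Name)
                    and isinstance(second, ast.Return)
                    and isinstance(second.value, ast.Name)
                    and second.value.id == first.targets[0].id):
                hits.append(first.lineno)
    return hits

print(find_trivial_returns("def f(x):\n    result = x + 1\n    return result"))  # [2]
```

The same walk-and-match approach extends naturally to dead assignments, unnecessary if/else around a boolean expression, and the other issue types listed above.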

05

All metrics explained

The Complexity tab shows 15 metrics organized in five groups. Here is what each one measures and when to be concerned.

Metric · What it measures · When to worry

Manifold

Manifold deviation · Distance of the code from the minimal optimal solution. Combines all complexity signals into a single 0–100% score. Worry if > 20% for a simple task.
Overall score · Weighted combination of correctness and manifold penalty. The final grade is derived from this. Below 65% indicates real problems.

Complexity

Cyclomatic complexity · Number of independent execution paths (branches from if, for, while, and/or). For a simple function the optimal value is 1. Values above 5 require many tests to cover all paths.
Cognitive complexity · Like cyclomatic, but each nesting level increases the penalty. A loop inside an if inside another if costs much more than three flat conditions. Optimal is 0 for one-liner tasks.
AST node count · Total nodes in the abstract syntax tree, a raw measure of code volume. A return a + b has ~12 nodes. Compare across models: the lower, the more concise the solution.
Max nesting depth · Deepest level of nested control structures. if inside for inside while = depth 3. Optimal for a simple task is 0 or 1. Depth > 3 is a strong signal of over-engineering.

Structure

Function count · Number of functions defined. If the task asked for one function and the model defined three, that is abstraction overhead. Compare against the expected count from the prompt.
Class count · Number of classes defined. A non-zero value for a simple function task is almost always premature abstraction.
Abstraction overhead · Percentage of extra functions/classes beyond what the prompt required. 0% = perfect. 100% means the model defined twice as many abstractions as needed; 200% means triple.

Code Quality

Dead assignments · Variables that are assigned but never read within the same function. A value > 0 indicates the model changed logic mid-generation and left orphaned code behind.
Trivial returns · Patterns like result = expr; return result instead of return expr directly. Each occurrence adds no value but increases verbosity and cognitive load.

Correctness

Syntax valid · Whether the code parses without syntax errors. A ✗ here means the model produced broken code; all other metrics are unreliable for that version.
Execution OK · Whether the code runs without raising an exception when imported. Does not check correctness of output, only that it doesn't crash on load.
Tests passed · Ratio of test cases that returned the expected value. Requires test cases to be defined via the API (the Quick Examples load them automatically). Shows 0/0 if no tests are defined.
Prompt alignment · How well the code matches the prompt: keyword coverage, expected function name, correct language. A model can score 100% alignment but still have high deviation: it understood the task but solved it badly.
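Two of the structural metrics above are easy to approximate for Python with the standard ast module. The analyzer's exact rules aren't published in this guide, so treat this as a sketch of the definitions given in the table, not the production implementation:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Decision points + 1: if/for/while/ternary branches plus and/or operands."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):  # `a and b and c` adds 2 paths
            complexity += len(node.values) - 1
    return complexity

def ast_node_count(source: str) -> int:
    """Raw code volume: total nodes in the abstract syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

print(cyclomatic_complexity("def is_even(n):\n    return n % 2 == 0"))  # 1
```

The optimal is_even has no branches at all, so its cyclomatic complexity is 1; every added if, loop, or boolean operator pushes the number up and with it the number of tests needed to cover all paths.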

Understanding manifold deviation

0–20%
Optimal
Code is at or near the minimal solution. No unnecessary branches, abstractions, or verbosity.
20–45%
Minor deviation
Slightly more complex than needed — perhaps an extra variable or a redundant check. Functional but not elegant.
45–100%
Unnecessary complexity
Significant over-engineering. Extra functions, deep nesting, defensive guards, or premature abstraction into classes.
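The three bands above translate directly into code. A minimal sketch, treating 20% and 45% as exclusive upper bounds (the ranges as written leave the exact boundary behavior ambiguous):

```python
def deviation_band(deviation_pct: float) -> str:
    """Map a manifold deviation percentage (0-100) to its descriptive band."""
    if deviation_pct < 20:
        return "optimal"
    if deviation_pct < 45:
        return "minor deviation"
    return "unnecessary complexity"

print(deviation_band(12))  # optimal
```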
06

Grade scale

Each column header displays a letter grade derived from the overall score after applying the manifold deviation penalty. The overall score itself can be high (code is correct) while the grade drops because of unnecessary complexity.

A
≥ 90%
Near-optimal. Correct, clean, minimal. Very close to the ideal manifold center.
B
75–89%
Good. Minor deviations — perhaps one extra variable or slightly verbose style.
C
60–74%
Acceptable. Noticeable complexity overhead. Extra branches or an unnecessary helper function.
D
40–59%
Poor. Significant over-engineering. The model drifted far from the optimal manifold.
F
< 40%
Failing. Either incorrect code or extreme unnecessary complexity — classes, multiple functions, deep nesting for a trivial task.
How the grade is calculated

Graded score = correctness score − (manifold_deviation × 0.25), and the letter is then read from the scale above. A model with 100% correctness but 80% manifold deviation ends up with a score of 80%, which is a B. This is intentional: code quality matters, not just whether it runs.
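The penalty and the grade scale combine like this. A sketch assuming scores on a 0–100 scale and the thresholds from the table above; `letter_grade` is an illustrative name, not a codeval API:

```python
def letter_grade(correctness: float, manifold_deviation: float) -> tuple:
    """Apply the manifold penalty, then map the final score to a letter."""
    score = correctness - manifold_deviation * 0.25
    for threshold, letter in ((90, "A"), (75, "B"), (60, "C"), (40, "D")):
        if score >= threshold:
            return score, letter
    return score, "F"

print(letter_grade(100, 80))  # 100% correct but 80% deviation: (80.0, 'B')
```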

07

Supported models

codeval is model-agnostic — it analyzes whatever code you paste, regardless of which model generated it. The model badge is a label for your reference and affects color coding in the PDF report.

Available model labels

GPT-4o · OpenAI · green badge
GPT-4.1 · OpenAI · green badge
Claude 3.5 · Anthropic · amber badge
Claude 3.7 · Anthropic · amber badge
Gemini 1.5 · Google · blue badge
Gemini 2.0 · Google · blue badge

More models

Qwen3-14B · Alibaba · purple badge
Llama 3.1 · Meta · amber badge
Mistral · Mistral AI · amber badge
DeepSeek · DeepSeek · green badge
Custom · Any model not in the list · grey badge. Type any name in the column label field.
Using custom models

If your model isn't in the list, select Custom from the picker and type the model name directly into the column label field. The badge will show grey, but all analysis functions work identically.

08

Quick examples

The sidebar includes six pre-loaded examples that fill every field automatically — prompt, language, and one column per model with realistic outputs showing different complexity levels.

py · is_even(n)

Boolean check

3 models: GPT-4o (1-liner), Claude (if/else), Gemini (intermediate variable). Shows how a trivial task can be over-expressed.

py · sum(a, b)

Arithmetic

3 models: optimal return, extra variable, defensive None check. Classic unnecessary guard example.

py · max_list(lst)

List max

2 models: return max(lst) vs. a manual loop. The loop version reimplements a built-in — a common LLM pattern.

py · reverse(s)

String reversal

2 models: slice [::-1] vs. character loop. Tests whether the model knows idiomatic Python.

py · factorial(n)

Recursion

3 models: recursive (optimal), iterative (valid), and wrapped inside a class (over-engineered). Good example of abstraction overhead.

js · debounce(fn, ms)

JavaScript closure

2 models: minimal arrow function vs. verbose function with null-check and cleanup. Shows the tool works for JS too.

09

Exporting the PDF report

After clicking Analyze, a bright ⬇ Export PDF button appears in the top toolbar. Clicking it generates a complete A4 PDF report and downloads it automatically.

1

Run the analysis first

The Export PDF button is hidden until you click ▶ Analyze. It appears automatically once results are available.

2

Click ⬇ Export PDF

The button is in the toolbar above the columns, to the right of "clear results". It has a light background so it stands out from the dark interface.

3

The report downloads automatically

File name format: codeval_report_YYYY-MM-DD.pdf. The report includes: task prompt, model summary cards with scores, manifold deviation bars, full metrics table grouped by category, comparison analysis in prose, and all detected issues.

PDF contents

The report is entirely in English and uses a white background with color-coded pill badges for metric values. Each section is labelled and can be shared with teammates or attached to research notes.

10

Frequently asked questions

Why is my code getting a low score even though it runs correctly?
The overall score penalizes manifold deviation — unnecessary complexity beyond what the task requires. A model can produce a 100% correct solution that still scores 36% overall because it used a class, three helper functions, and five conditional branches for a problem that needed a single return statement. Correctness is necessary but not sufficient.
What does "manifold deviation" actually mean?
It's a geometric metaphor. All valid solutions to a programming task exist in a high-dimensional vector space. The simplest correct solution occupies the center of this space — the "manifold." More complex solutions (extra branches, more functions, deeper nesting) exist at greater distances from that center. Manifold deviation is that distance, normalized to 0–100%.
Does codeval execute my code? Is it safe?
Yes — codeval executes code in a sandboxed Python runner with a 5-second timeout. Dangerous imports (os, subprocess, socket, etc.) are blocked automatically. The runner captures stdout/stderr and returns them without affecting the host system.
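codeval's runner itself isn't reproduced here, but the general pattern is a separate interpreter process with a hard timeout. A minimal sketch (real import blocking and output capture need more than this; `run_untrusted` is an illustrative name):

```python
import subprocess
import sys

def run_untrusted(source: str, timeout: float = 5.0) -> tuple:
    """Run code in a separate, isolated Python process with a hard timeout."""
    proc = subprocess.run(
        [sys.executable, "-I", "-c", source],  # -I: isolated mode, ignores env and site
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return proc.stdout, proc.stderr
```

Because the code runs in its own process, an infinite loop or crash in the pasted snippet cannot take down the host application.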
Why does "Tests passed" show 0/0?
The test cases section was removed from the sidebar to keep the interface focused on complexity comparison. Tests still run if you load one of the Quick Examples from the sidebar, which include pre-defined test cases. If you need custom tests for your own prompts, they can be added via the backend API directly.
Can I compare more than two models at once?
Yes. Click + column as many times as you need. The columns scroll horizontally. The similarity matrix in the Similarity tab adapts to show all pairwise comparisons. The PDF report also handles N models — the summary cards and metrics table scale accordingly.
What languages are supported?
The complexity analyzer is built on Python's ast module and is therefore most accurate for Python. JavaScript and other languages will parse for syntax validity but AST-based metrics (cyclomatic complexity, cognitive complexity, nesting depth, dead assignments) will default to approximate values. The language selector in the sidebar is informational for now — full multi-language AST support is a planned feature.
How is abstraction overhead calculated?
The tool infers the expected number of functions from the prompt (usually 1). Abstraction overhead = (actual functions + classes − expected) / expected × 100%. If the prompt says "write a function" and the model defined 1 function and 1 class, overhead = (1 + 1 − 1) / 1 = 100%. Two extra functions would be 200%.
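The formula from the answer above, as code (the function name is illustrative):

```python
def abstraction_overhead(functions: int, classes: int, expected: int = 1) -> float:
    """Percentage of abstractions beyond what the prompt required."""
    return (functions + classes - expected) / expected * 100

print(abstraction_overhead(1, 1))  # one function plus one class for a one-function prompt: 100.0
```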
The ⬇ Export PDF button isn't showing. What should I do?
The button only appears after a successful analysis run. Make sure: (1) at least one column has code pasted in, (2) you clicked ▶ Analyze, and (3) the results panel appeared at the bottom. If you clicked "clear results", the button hides again and you need to re-run the analysis.