Compare code from any LLM — measure what actually matters
codeval lets you paste outputs from different AI models side by side and instantly see which one is closest to the minimal optimal solution, which added unnecessary complexity, and why.
What is codeval?
codeval is a code comparison tool designed to evaluate how well different large language models solve the same programming task. Beyond checking if the code runs, it measures how unnecessarily complex each solution is compared to the theoretical optimal.
For any programming task there exists a minimal optimal solution — the simplest correct code possible. codeval measures the manifold deviation: how far each model's output drifted from that optimal center. A model can produce correct code that still scores poorly because it over-engineered the solution.
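As an illustration of the idea (a hypothetical task, not output from codeval itself), both functions below are correct, but the second drifts from the minimal optimal solution:

```python
# Task: write a function is_even(n) that returns True if n is even.

# Minimal optimal solution: one expression, a single execution path.
def is_even(n):
    return n % 2 == 0

# Correct but over-engineered: an intermediate variable and an
# unnecessary if/else add complexity without changing behavior.
def is_even_verbose(n):
    remainder = n % 2
    if remainder == 0:
        return True
    else:
        return False
```

Both versions return identical results for every input; the extra branch and intermediate variable in the second are exactly the kind of drift the manifold deviation score captures.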
Interface layout
The app is divided into three zones: the left sidebar for configuration, the central columns area for code input, and the bottom panel for analysis results.
Columns are fully dynamic. Add as many as you need with + column in the top-right. Each column can be renamed and assigned to a specific model using the colored badge on the column header.
Step-by-step guide
Follow these steps to run your first comparison in under two minutes.
Describe the task
In the Task / Prompt field on the left sidebar, type or paste the exact prompt you gave to each model. For example: Write a function factorial(n) that returns n!
Use the Quick examples section in the sidebar to auto-load pre-filled prompts with sample code from multiple models — great for understanding the interface before using your own data.
Select the programming language
Use the Language dropdown to set the language of the code being compared. This helps the analyzer apply the right complexity rules. Supported: Python, JavaScript, TypeScript, Java, Go, Rust, C++.
Add model columns
Click + column in the top-right of the toolbar (or the + button at the end of the columns area) to add a new column for each model you want to compare.
Click the colored model badge (shows "Custom" by default) on each column header to pick the model from the list. The badge updates with the model's brand color.
You can also type a custom label in the text field next to the badge — useful when comparing versions of the same model with different system prompts.
Paste the code
Copy the output from each model and paste it into the corresponding column's code editor. The editor accepts any code — no formatting required. You can paste entire responses including comments and docstrings.
Clicking ▶ Analyze while some columns are empty shows a warning. After you confirm, the empty columns are skipped automatically and the analysis runs on the columns that do contain code.
Click ▶ Analyze
Press the orange ▶ Analyze button in the top-right. The tool sends all versions to the backend, which runs the full complexity analysis pipeline. Each column header updates with a letter grade (A–F) within seconds.
The results panel slides up at the bottom of the screen, and the ⬇ Export PDF button becomes visible.
Read the results
Navigate the six tabs in the results panel at the bottom. Start with Overview for a quick score summary, then go to Complexity for the full metric table. See the next section for a detailed explanation of each tab.
Result tabs explained
After clicking Analyze, a panel with six tabs appears at the bottom. Each tab shows a different lens on the same analysis data.
Score summary + deviation bars
Shows the overall score percentage for each model and a horizontal bar representing manifold deviation. Green = optimal, amber = minor deviation, red = unnecessary complexity. Best starting point.
Full metrics comparison table
All 15 metrics in a side-by-side table. Grouped into: Manifold, Complexity, Structure, Code Quality, and Correctness. Color-coded values make it easy to spot which model is worst on each dimension.
Test case results per model
Shows pass/fail for each test case defined in the app. If no tests are defined (the test section was removed), shows 0/0. Load a Quick Example to see this tab populated with real test data.
Line-by-line diff vs reference
Requires a reference solution to be defined in the backend. Shows added lines (green) and removed lines (red) for each model's output. Highlights exactly where each model diverged from the optimal.
Token similarity heatmap
A matrix showing how similar each pair of models is to each other (0% = completely different, 100% = identical). Useful for detecting when two models produced essentially the same solution.
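The exact similarity measure codeval uses is not specified here; a simple sketch of pairwise token similarity, using Python's standard difflib as an assumed stand-in, could look like:

```python
import difflib

def token_similarity(code_a: str, code_b: str) -> float:
    """Rough similarity between two code snippets, 0.0-1.0,
    based on their whitespace-delimited token sequences."""
    tokens_a = code_a.split()
    tokens_b = code_b.split()
    return difflib.SequenceMatcher(None, tokens_a, tokens_b).ratio()

# Identical solutions score 1.0; unrelated code scores near 0.0.
a = "def f(lst): return max(lst)"
b = "def f(lst): return max(lst)"
print(token_similarity(a, b))  # 1.0
```

Computing this ratio for every pair of columns yields exactly the kind of matrix this tab displays.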
Detected complexity problems
A plain-language list of specific issues found in each model's code: trivial intermediate variables, unnecessary if/else, dead assignments, premature abstraction into classes, nested functions, etc.
All metrics explained
The Complexity tab shows 15 metrics organized in five groups. Here is what each one measures and when to be concerned.
| Metric | What it measures · When to worry |
|---|---|
| Manifold | |
| Manifold deviation | Distance of the code from the minimal optimal solution. Combines all complexity signals into a single 0–100% score. Worry if > 20% for a simple task. |
| Overall score | Weighted combination of correctness and manifold penalty. The final grade is derived from this. Below 65% indicates real problems. |
| Complexity | |
| Cyclomatic complexity | Number of independent execution paths (branches from if, for, while, and/or). For a simple function the optimal value is 1. Values above 5 require many tests to cover all paths. |
| Cognitive complexity | Like cyclomatic, but each nesting level increases the penalty. A loop inside an if inside another if costs much more than three flat conditions. Optimal is 0 for one-liner tasks. |
| AST node count | Total nodes in the abstract syntax tree — a raw measure of code volume. A return a + b has ~12 nodes. Compare across models: the lower, the more concise the solution. |
| Max nesting depth | Deepest level of nested control structures. if inside for inside while = depth 3. Optimal for a simple task is 0 or 1. Depth > 3 is a strong signal of over-engineering. |
| Structure | |
| Function count | Number of functions defined. If the task asked for one function and the model defined three, that is abstraction overhead. Compare against the expected count from the prompt. |
| Class count | Number of classes defined. A non-zero value for a simple function task is almost always premature abstraction. |
| Abstraction overhead | Percentage of extra functions/classes beyond what the prompt required. 0% = perfect. 100% means the model defined twice as many abstractions as needed. 200% means triple. |
| Code Quality | |
| Dead assignments | Variables that are assigned but never read within the same function. A value > 0 indicates the model changed logic mid-generation and left orphaned code behind. |
| Trivial returns | Patterns like result = expr; return result instead of return expr directly. Each occurrence adds 0 value but increases verbosity and cognitive load. |
| Correctness | |
| Syntax valid | Whether the code parses without syntax errors. A ✗ here means the model produced broken code — all other metrics are unreliable for that version. |
| Execution OK | Whether the code runs without raising an exception when imported. Does not check correctness of output — only that it doesn't crash on load. |
| Tests passed | Ratio of test cases that returned the expected value. Requires test cases to be defined via the API (the Quick Examples load them automatically). Shows 0/0 if no tests are defined. |
| Prompt alignment | How well the code matches the prompt: keyword coverage, expected function name, correct language. A model can score 100% alignment but still have high deviation — it understood the task but solved it badly. |
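Several of the AST-based metrics in the table can be sketched with Python's standard ast module. This is a simplified approximation for illustration, not codeval's actual implementation:

```python
import ast

def ast_node_count(source: str) -> int:
    """Total number of nodes in the abstract syntax tree."""
    return len(list(ast.walk(ast.parse(source))))

def cyclomatic_complexity(source: str) -> int:
    """1 + number of branch points. Counts each if/for/while and each
    and/or chain once -- a coarse approximation of the real metric."""
    branches = (ast.If, ast.For, ast.While, ast.BoolOp)
    return 1 + sum(isinstance(n, branches) for n in ast.walk(ast.parse(source)))

def max_nesting_depth(source: str) -> int:
    """Deepest nesting of control structures (if/for/while/with/try)."""
    control = (ast.If, ast.For, ast.While, ast.With, ast.Try)
    def depth(node, current=0):
        current += isinstance(node, control)
        children = [depth(c, current) for c in ast.iter_child_nodes(node)]
        return max(children, default=current)
    return depth(ast.parse(source))

print(ast_node_count("def f(a, b): return a + b"))  # 12, as noted in the table
```

A flat, single-expression function scores cyclomatic complexity 1 and nesting depth 0, matching the optimal values given above.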
Understanding manifold deviation
Grade scale
Each column header displays a letter grade derived from the overall score, which applies the manifold deviation penalty to the correctness score. Correctness can be 100% (the code runs and passes every test) while the grade still drops because of unnecessary complexity.
overall_score = correctness − (manifold_deviation × 0.25). A model with 100% correctness but 80% manifold deviation ends up with an overall score of 80%, which maps to a B. This is intentional — code quality matters, not just whether it runs.
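The scoring described above can be sketched as follows. The letter thresholds below are assumed 10-point bands for illustration; the text only confirms that 80% maps to a B:

```python
def overall_score(correctness: float, manifold_deviation: float) -> float:
    """Apply the manifold deviation penalty to the correctness score (both 0-100)."""
    return correctness - manifold_deviation * 0.25

def letter_grade(score: float) -> str:
    # Assumed grade bands; only the B at 80% is confirmed by the text.
    for threshold, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= threshold:
            return letter
    return "F"

score = overall_score(100, 80)
print(score, letter_grade(score))  # 80.0 B
```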
Supported models
codeval is model-agnostic — it analyzes whatever code you paste, regardless of which model generated it. The model badge is purely a label for your reference; it affects only the color coding in the PDF report.
More models
If your model isn't in the list, select Custom from the picker and type the model name directly into the column label field. The badge will show grey, but all analysis functions work identically.
Quick examples
The sidebar includes six pre-loaded examples that fill every field automatically — prompt, language, and one column per model with realistic outputs showing different complexity levels.
Boolean check
3 models: GPT-4o (1-liner), Claude (if/else), Gemini (intermediate variable). Shows how a trivial task can be over-expressed.
Arithmetic
3 models: optimal return, extra variable, defensive None check. Classic unnecessary guard example.
List max
2 models: return max(lst) vs. a manual loop. The loop version reimplements a built-in — a common LLM pattern.
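Schematically, the two versions in this example look like the pair below (an illustration of the pattern, not the exact pre-loaded code):

```python
# Minimal optimal: use the built-in.
def list_max(lst):
    return max(lst)

# Manual loop: correct, but reimplements max() -- a common LLM pattern.
def list_max_loop(lst):
    best = lst[0]
    for item in lst[1:]:
        if item > best:
            best = item
    return best

print(list_max([3, 1, 4, 1, 5]), list_max_loop([3, 1, 4, 1, 5]))  # 5 5
```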
String reversal
2 models: slice [::-1] vs. character loop. Tests whether the model knows idiomatic Python.
Recursion
3 models: recursive (optimal), iterative (valid), and wrapped inside a class (over-engineered). Good example of abstraction overhead.
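The class-wrapped variant in this example might look like the pair below (a hypothetical recursive task illustrating the pattern, not the exact pre-loaded code):

```python
# Optimal: direct recursion.
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Over-engineered: the same logic wrapped in an unnecessary class,
# which shows up as a non-zero class count and abstraction overhead.
class FibonacciCalculator:
    def compute(self, n):
        if n < 2:
            return n
        return self.compute(n - 1) + self.compute(n - 2)

print(fib(10), FibonacciCalculator().compute(10))  # 55 55
```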
JavaScript closure
2 models: minimal arrow function vs. verbose function with null-check and cleanup. Shows the tool works for JS too.
Exporting the PDF report
After clicking Analyze, a bright ⬇ Export PDF button appears in the top toolbar. Clicking it generates a complete A4 PDF report and downloads it automatically.
Run the analysis first
The Export PDF button is hidden until you click ▶ Analyze. It appears automatically once results are available.
Click ⬇ Export PDF
The button is in the toolbar above the columns, to the right of "clear results". It has a light background so it stands out from the dark interface.
The report downloads automatically
File name format: codeval_report_YYYY-MM-DD.pdf. The report includes: task prompt, model summary cards with scores, manifold deviation bars, full metrics table grouped by category, comparison analysis in prose, and all detected issues.
The report is entirely in English and uses a white background with color-coded pill badges for metric values. Each section is labelled and can be shared with teammates or attached to research notes.
Frequently asked questions
Can correct code still get a bad grade?
Yes. A model can pass every test and still score poorly if it drifts far from the minimal optimal solution — for a trivial task the optimum may be a single return statement. Correctness is necessary but not sufficient.
Is it safe to execute pasted code?
Dangerous imports (os, subprocess, socket, etc.) are blocked automatically. The runner captures stdout/stderr and returns them without affecting the host system.
Which languages are fully supported?
The analyzer is built on Python's ast module and is therefore most accurate for Python. JavaScript and other languages will parse for syntax validity, but AST-based metrics (cyclomatic complexity, cognitive complexity, nesting depth, dead assignments) will default to approximate values. The language selector in the sidebar is informational for now — full multi-language AST support is a planned feature.
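A minimal sketch of the syntax-validity check described here, assuming ast.parse is used directly (the real pipeline may differ):

```python
import ast

def syntax_valid(source: str) -> bool:
    """True if the source parses as Python without syntax errors."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

print(syntax_valid("def f(n):\n    return n * 2"))  # True
print(syntax_valid("def f(n) return n * 2"))        # False
```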