Documentation

Learn how to use the LLM Evaluation Platform

Getting Started

The LLM Evaluation Platform allows you to test and compare different Large Language Models (LLMs) in real time. This helps you make data-driven decisions about which model best suits your specific needs.

Key Features

  • Test prompts across multiple LLMs simultaneously
  • Compare response quality, speed, and cost
  • Analyze detailed performance metrics
  • Visualize results with interactive charts
  • Save and share experiments

Prerequisites

To use the platform, you'll need:

  • API keys for the models you want to test (OpenAI, Groq, etc.)
  • Basic understanding of prompt engineering

Running Experiments

Running an experiment is simple. Follow these steps to test your prompts across multiple models:

  1. Navigate to the New Experiment page

    Click on "New Experiment" in the navigation bar or go to /new-experiment.

  2. Configure your experiment settings

    Set parameters such as temperature and max tokens, and optionally add a system prompt (a code sketch of these settings follows the steps below).

  3. Enter your prompt

    Type or paste the prompt you want to test in the prompt field.

  4. Run the experiment

    Click "Run Experiment" to send your prompt to all selected models.

  5. View the results

    Results will appear in real time as each model responds.
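
If it helps to see the settings from step 2 in code form, below is a minimal sketch that combines them with the request body documented in the API Reference. The system_prompt field is an assumption: step 2 mentions an optional system prompt, but the documented request body does not include it.

// Hypothetical experiment settings, mirroring the documented request body.
// system_prompt is an assumed field name -- the API Reference below only
// documents prompt, model, temperature, and max_tokens.
const experimentSettings = {
  prompt: "Explain quantum computing in simple terms",
  model: "gpt-4",          // or "llama-3.3-70b", "gemma2-9b"
  temperature: 0.7,        // higher values produce more varied output
  max_tokens: 1000,        // upper bound on the completion length
  system_prompt: "You are a helpful assistant.", // assumed, see note above
};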

Pro Tip

For the most accurate comparisons, try to keep your prompts consistent across experiments. Small changes in wording can significantly impact model responses.

Analyzing Results

After running an experiment, you can analyze the results in several ways:

Results Tab

The Results tab shows each model's response along with key metrics:

  • Response Time: How long the model took to generate a response
  • Token Usage: Total tokens, prompt tokens, and completion tokens
  • Cost: Estimated cost of the API call

Comparison Tab

The Comparison tab provides a side-by-side comparison of all models, making it easy to spot differences in performance.

Dashboard

For a more visual analysis, visit the Dashboard to see charts and graphs of your experiment results:

  • Response Time Chart: Compare how quickly each model responds
  • Token Usage Breakdown: Visualize token usage across models
  • Cost Comparison: See which models are most cost-effective

Supported Models

The platform currently supports several models, including:

GPT-4

OpenAI's most advanced model, with broad general knowledge and domain expertise.

Strengths

  • Reasoning
  • Creative writing
  • Code generation
  • Multi-modal capabilities

Specifications

  • Context Window: 128,000 tokens
  • Cost: $0.03 per 1K tokens
  • Provider: OpenAI

For more detailed information about each model, visit the Models page.

Evaluation Metrics

The platform collects and analyzes several metrics to help you evaluate model performance:

Technical Metrics

  • Response Time: The time taken for the model to generate a response, measured in seconds.
  • Token Count: The total number of tokens used in the request and response.
  • Prompt Tokens: The number of tokens in your input prompt.
  • Completion Tokens: The number of tokens in the model's response.
  • Cost: The estimated cost of the API call, calculated from token usage and model pricing (see the worked example below).
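
To make the cost calculation concrete, here is a short sketch under one assumption: that the platform applies a single per-1K-token rate to the total token count (providers often price prompt and completion tokens separately). With the GPT-4 rate listed above ($0.03 per 1K tokens), the numbers match the example response in the API Reference below.

// Sketch of cost estimation from token usage and a flat per-1K-token rate.
// Applying one rate to all tokens is an assumption; providers often charge
// different rates for prompt and completion tokens.
function estimateCost(totalTokens: number, pricePer1K: number): number {
  return (totalTokens / 1000) * pricePer1K;
}

// 150 total tokens at $0.03 per 1K tokens:
// (150 / 1000) * 0.03 = 0.0045
console.log(estimateCost(150, 0.03)); // 0.0045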

Future Metrics

We're working on adding more advanced evaluation metrics, including:

  • Relevancy: How relevant the response is to the prompt.
  • Accuracy: The factual correctness of the response.
  • Consistency: How consistent the model's responses are across similar prompts.
  • Response Quality: An overall quality score based on multiple factors.

API Reference

The platform provides a simple API for running experiments programmatically:

Evaluate Endpoint

POST /api/evaluate

Request Body

{
  "prompt": "Explain quantum computing in simple terms",
  "model": "gpt-4",
  "temperature": 0.7,
  "max_tokens": 1000
}

Supported values for "model" include "gpt-4", "llama-3.3-70b", and "gemma2-9b".

Response

{
  "modelName": "gpt-4",
  "response": "Model's response text",
  "responseTime": 1.23,
  "metrics": {
    "tokenCount": 150,
    "promptTokens": 50,
    "completionTokens": 100,
    "cost": 0.0045
  }
}
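
As a usage sketch, the endpoint can be called with fetch (built into Node 18+). The request and response shapes come from the documentation above; the bearer-token Authorization header and the EVAL_API_KEY environment variable are assumptions, since the authentication scheme is not documented here (see the note below).

// Minimal sketch of calling the evaluate endpoint. The request and response
// shapes come from the docs above; the auth scheme and environment variable
// name (EVAL_API_KEY) are assumptions.
async function evaluate(prompt: string, model: string) {
  const res = await fetch("/api/evaluate", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.EVAL_API_KEY}`, // assumed scheme
    },
    body: JSON.stringify({ prompt, model, temperature: 0.7, max_tokens: 1000 }),
  });
  if (!res.ok) throw new Error(`Evaluate request failed: ${res.status}`);
  return res.json();
}

// Example: log the estimated cost for one model's response.
evaluate("Explain quantum computing in simple terms", "gpt-4")
  .then((result) => console.log(result.metrics.cost));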

Note

API access requires authentication. Contact us for API keys and rate limit information.