Documentation

Learn how to use the LLM Evaluation Platform

Getting Started

The LLM Evaluation Platform allows you to test and compare different Large Language Models (LLMs) in real time. This helps you make data-driven decisions about which model best suits your specific needs.

Key Features

  • Test prompts across multiple LLMs simultaneously
  • Compare response quality, speed, and cost
  • Analyze detailed performance metrics
  • Visualize results with interactive charts
  • Save and share experiments

Prerequisites

To use the platform, you'll need:

  • API keys for the models you want to test (OpenAI, Groq, etc.)
  • Basic understanding of prompt engineering

Running Experiments

Running an experiment is simple. Follow these steps to test your prompts across multiple models:

  1. Navigate to the New Experiment page

    Click on "New Experiment" in the navigation bar or go to /new-experiment.

  2. Configure your experiment settings

    Set parameters such as temperature and max tokens, and optionally add a system prompt (a code sketch of these settings follows the steps below).

  3. Enter your prompt

    Type or paste the prompt you want to test in the prompt field.

  4. Run the experiment

    Click "Run Experiment" to send your prompt to all selected models.

  5. View the results

    Results will appear in real time as each model responds.
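
If it helps to see the settings from step 2 in code form, below is a minimal sketch that combines them with the request body documented in the API Reference. The system_prompt field is an assumption: step 2 mentions an optional system prompt, but the documented request body does not include it.

// Hypothetical experiment settings, mirroring the documented request body.
// system_prompt is an assumed field name -- the API Reference below only
// documents prompt, model, temperature, and max_tokens.
const experimentSettings = {
  prompt: "Explain quantum computing in simple terms",
  model: "gpt-4",          // or "llama-3.3-70b", "gemma2-9b"
  temperature: 0.7,        // higher values produce more varied output
  max_tokens: 1000,        // upper bound on the completion length
  system_prompt: "You are a helpful assistant.", // assumed, see note above
};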

Pro Tip

For the most accurate comparisons, try to keep your prompts consistent across experiments. Small changes in wording can significantly impact model responses.

Analyzing Results

After running an experiment, you can analyze the results in several ways:

Results Tab

The Results tab shows each model's response along with key metrics:

  • Response Time: How long the model took to generate a response
  • Token Usage: Total tokens, prompt tokens, and completion tokens
  • Cost: Estimated cost of the API call

Comparison Tab

The Comparison tab provides a side-by-side comparison of all models, making it easy to spot differences in performance.

Dashboard

For a more visual analysis, visit the Dashboard to see charts and graphs of your experiment results:

  • Response Time Chart: Compare how quickly each model responds
  • Token Usage Breakdown: Visualize token usage across models
  • Cost Comparison: See which models are most cost-effective

Supported Models

The platform currently supports several models, including:

GPT-4

OpenAI's most advanced model, with broad general knowledge and domain expertise.

Strengths

  • Reasoning
  • Creative writing
  • Code generation
  • Multi-modal capabilities

Specifications

  • Context Window: 128,000 tokens
  • Cost: $0.03 per 1K tokens
  • Provider: OpenAI

For more detailed information about each model, visit the Models page.

Evaluation Metrics

The platform collects and analyzes several metrics to help you evaluate model performance:

Technical Metrics

  • Response Time: The time taken for the model to generate a response, measured in seconds.
  • Token Count: The total number of tokens used in the request and response.
  • Prompt Tokens: The number of tokens in your input prompt.
  • Completion Tokens: The number of tokens in the model's response.
  • Cost: The estimated cost of the API call, calculated from token usage and model pricing (see the worked example below).
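
To make the cost calculation concrete, here is a short sketch under one assumption: that the platform applies a single per-1K-token rate to the total token count (providers often price prompt and completion tokens separately). With the GPT-4 rate listed above ($0.03 per 1K tokens), the numbers match the example response in the API Reference below.

// Sketch of cost estimation from token usage and a flat per-1K-token rate.
// Applying one rate to all tokens is an assumption; providers often charge
// different rates for prompt and completion tokens.
function estimateCost(totalTokens: number, pricePer1K: number): number {
  return (totalTokens / 1000) * pricePer1K;
}

// 150 total tokens at $0.03 per 1K tokens:
// (150 / 1000) * 0.03 = 0.0045
console.log(estimateCost(150, 0.03)); // 0.0045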

Future Metrics

We're working on adding more advanced evaluation metrics, including:

  • Relevancy: How relevant the response is to the prompt.
  • Accuracy: The factual correctness of the response.
  • Consistency: How consistent the model's responses are across similar prompts.
  • Response Quality: An overall quality score based on multiple factors.

API Reference

The platform provides a simple API for running experiments programmatically:

Evaluate Endpoint

POST /api/evaluate

Request Body

{
  "prompt": "Explain quantum computing in simple terms",
  "model": "gpt-4",
  "temperature": 0.7,
  "max_tokens": 1000
}

Supported values for "model" include "gpt-4", "llama-3.3-70b", and "gemma2-9b".

Response

{
  "modelName": "gpt-4",
  "response": "Model's response text",
  "responseTime": 1.23,
  "metrics": {
    "tokenCount": 150,
    "promptTokens": 50,
    "completionTokens": 100,
    "cost": 0.0045
  }
}
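
As a usage sketch, the endpoint can be called with fetch (built into Node 18+). The request and response shapes come from the documentation above; the bearer-token Authorization header and the EVAL_API_KEY environment variable are assumptions, since the authentication scheme is not documented here (see the note below).

// Minimal sketch of calling the evaluate endpoint. The request and response
// shapes come from the docs above; the auth scheme and environment variable
// name (EVAL_API_KEY) are assumptions.
async function evaluate(prompt: string, model: string) {
  const res = await fetch("/api/evaluate", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.EVAL_API_KEY}`, // assumed scheme
    },
    body: JSON.stringify({ prompt, model, temperature: 0.7, max_tokens: 1000 }),
  });
  if (!res.ok) throw new Error(`Evaluate request failed: ${res.status}`);
  return res.json();
}

// Example: log the estimated cost for one model's response.
evaluate("Explain quantum computing in simple terms", "gpt-4")
  .then((result) => console.log(result.metrics.cost));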

Note

API access requires authentication. Contact us for API keys and rate limit information.