FAQs
The most common questions about our service
Why Does My LLM Return Incorrect Information About Its Underlying Model?
Model Hallucination: When the LLM Sounds Confident But Is Wrong
When working with large language models like GPT-4 or Claude, developers may encounter cases where the model gives incorrect information about its architecture, origin, or performance. This behavior is a form of hallucination—a confident yet unfounded response. It is especially common when using model aggregation platforms or proxy services, and should not be mistaken for deliberate tampering by the platform.
You can cross-check the actual backend model using traceable system-level signals such as:
- Context window size: For example, GPT-4 Turbo supports up to 128k tokens. You can validate this by passing the `max_tokens` parameter or by inferring from where context truncation occurs.
- First-token latency: Different models exhibit distinct response delays before the first token is returned.
- Generation throughput (tokens/sec): GPT-3.5 and Claude-Haiku are noticeably faster than GPT-4 or Claude-Opus.
- System API response headers or metadata: Fields like `model-id` and `provider` can reveal backend identity.
- Native call constraints: For instance, OpenAI or Gemini models cannot be queried via Claude-native interfaces, and the same applies to other provider mismatches.
- Stylistic fingerprinting: Claude models tend to be more restrained and implicit, while GPT models are generally more logic-driven and declarative.
Observing these signals in combination can help verify the actual model running under the hood and avoid mistaking hallucinated claims for ground truth.
If you need test scripts for measuring first-token latency or throughput, you can download them here.
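As a starting point, here is a minimal sketch of such a script, assuming an OpenAI-compatible streaming endpoint; the base URL, API key, and model ID below are placeholders to replace with your own:

```python
import time
from openai import OpenAI  # assumes the openai Python SDK is installed

# Placeholder endpoint and key: point these at the service you want to test.
client = OpenAI(base_url="https://your-proxy.example.com/v1", api_key="sk-...")

def measure(model: str, prompt: str = "Write 200 words about the ocean.") -> None:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue  # skip empty keep-alive and metadata chunks
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time of first content token
        chunks += 1
    if first_token_at is None:
        print(f"{model}: no content received")
        return
    elapsed = time.perf_counter() - first_token_at
    # Rough throughput: stream chunks approximate tokens but are not exact.
    print(f"{model}: first token after {first_token_at - start:.2f}s, "
          f"~{chunks / max(elapsed, 1e-9):.1f} chunks/sec")

measure("gpt-4-turbo")  # placeholder model ID
```

Running the same prompt against each model you suspect and comparing the numbers side by side makes latency and throughput differences easy to spot.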
Common Hallucination Scenarios
| Type of Question | Example Question | Typical Hallucinated Answer |
|---|---|---|
| Model Identity | "Are you GPT-4?" | "I'm GPT-4 Turbo, released in June 2024." |
| Source Attribution | "Are you Claude?" | "I'm Claude 3.5 Sonnet, provided by Anthropic." |
| Performance Claims | "Who's faster, you or Gemini?" | "I'm faster and have more parameters." (fabricated) |
Why Does This Happen?
- **Language models are not aware of their environment.** LLMs have no actual awareness of where they are running. They respond based purely on patterns in text, not on introspective system data.
- **Injected prompts can be inaccurate or misleading.** Some platforms inject system prompts that assert identity, like "You are Claude Sonnet," which influences the model's response regardless of the actual backend.
- **Aggregation layers obscure real runtime details.** When proxying across multiple providers, the model itself cannot detect the infrastructure behind it; it only sees the prompt.
Mitigation Strategies
1. Do not trust identity claims made by the model itself
Any self-description like “I am GPT-4” or “I run on Claude” should be treated as textual output, not system truth.
2. Propagate trusted metadata from the backend
Use system-level APIs to provide metadata about the model rather than relying on model responses. For example:
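One possible shape, sketched in Python; the wrapper class and its field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@dataclass
class VerifiedCompletion:
    text: str       # model-generated output: never treat it as identity truth
    model_id: str   # reported by the backend API response, not by the model
    provider: str   # set by your own routing layer

def complete(prompt: str) -> VerifiedCompletion:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
    )
    # resp.model is the backend's own report of what served the request;
    # propagate it through the call chain instead of trusting the model's text.
    return VerifiedCompletion(
        text=resp.choices[0].message.content,
        model_id=resp.model,
        provider="openai",
    )
```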
This metadata should be passed through the call chain and surfaced where needed.
3. Inject identity explicitly through the system prompt
If you need the model to refer to its own identity, do so via a fixed system prompt, and lock it down to prevent hallucination:
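A minimal sketch, assuming an OpenAI-compatible chat API; the prompt wording, proxy name, and model ID are illustrative:

```python
from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pin the self-description in the system prompt so identity answers come from
# configuration you control, not from the model's pattern-matching.
SYSTEM_PROMPT = (
    "You are an assistant served via ExampleProxy on top of gpt-4-turbo "
    "(identity verified by the platform). If asked who or what you are, "
    "state exactly this information and do not speculate about versions, "
    "release dates, or parameter counts."
)

resp = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; must match what the prompt asserts
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Are you GPT-4?"},
    ],
)
print(resp.choices[0].message.content)
```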
4. Surface model info in the frontend from system, not model output
Avoid presenting the model’s own claims as “official model information.” Clearly label such content as “system verified” versus “model generated.”
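For instance, a response payload might keep the two sources apart so the UI can label each one; this is a sketch, and the field names are illustrative:

```python
# System-verified identity and model-generated text travel separately, so the
# frontend can render the first as official and label the second as unverified.
payload = {
    "model_info": {
        "source": "system_verified",   # from backend metadata
        "model_id": "gpt-4-turbo",     # placeholder model ID
        "provider": "openai",
    },
    "completion": {
        "source": "model_generated",   # from the LLM itself
        "text": "I'm GPT-4 Turbo, released in June 2024.",
    },
}
```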
5. Use hallucination detection and confidence scoring
Implement detection logic to flag statements like “I am GPT-4,” assigning low confidence scores or adding warnings. This can involve keyword heuristics, secondary model evaluation, or fine-tuned classifiers.
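A minimal keyword-heuristic sketch of this idea follows; the regex and the low confidence score are illustrative placeholders, and a production system would layer on secondary-model evaluation or a fine-tuned classifier:

```python
import re

# Flag self-identity claims in model output and downgrade their confidence.
IDENTITY_CLAIM = re.compile(
    r"\bI(?:'m| am) (?:GPT|Claude|Gemini|Llama)[\w. -]*", re.IGNORECASE
)

def score_response(text: str) -> dict:
    if IDENTITY_CLAIM.search(text):
        return {
            "text": text,
            "confidence": 0.2,  # low: self-identity claims are unverifiable
            "warning": "Model-generated identity claim; verify via system metadata.",
        }
    return {"text": text, "confidence": 1.0, "warning": None}

print(score_response("I'm GPT-4 Turbo, released in June 2024."))
```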
Conclusion
Language models are not reliable sources for identifying their own backend, provider, or performance tier. To build trustworthy AI systems, model identity must come from the system—not from the model’s own mouth.
Why do I get different outputs when using the same prompt and input on Claude.ai versus the API?
Claude’s web app (Claude.ai) and mobile apps prepend a system prompt to each conversation by default. This prompt provides critical context, such as the current date, stylistic guidance (e.g., prefer Markdown for code), tone calibration, and behavioral shaping for the assistant.
Anthropic regularly updates these prompts to improve model performance. Importantly, the full system prompts used on Claude.ai are publicly available, and you can find the latest version for each model in Anthropic’s official documentation.
In contrast, API calls do not include any default system prompt unless you provide one. This difference in context explains why the same prompt may yield different outputs between the web interface and the API.
To replicate Claude.ai’s behavior via API, you can explicitly include the published system prompt in your call.
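For example, with Anthropic's Python SDK; the model ID is illustrative, and you would paste in the published prompt for the model you actually use:

```python
import anthropic  # assumes the anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Paste the published Claude.ai system prompt for your model here; Anthropic
# documents the current prompts for each model in its official documentation.
CLAUDE_AI_SYSTEM_PROMPT = "<published system prompt for your chosen model>"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # example model ID; use yours
    max_tokens=1024,
    system=CLAUDE_AI_SYSTEM_PROMPT,      # replicates the web app's default context
    messages=[{"role": "user", "content": "Hello, Claude"}],
)
print(response.content[0].text)
```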
Why does GPT-4 consume tokens so quickly?
- GPT-4 consumes token quota 20 to 40 times faster than GPT-3.5-turbo. If you purchase 90,000 tokens, then at an average multiplier of 30 that works out to about 3,000 words. Including historical messages in each request further reduces the number of messages you can send; in extreme cases, a single message can consume all 90,000 tokens, so please use them cautiously.
What are some tips to save tokens when using Next Web?
- Click the settings button above the chat box and find the following options:
- Number of historical messages: The fewer you keep, the lower the token consumption, but GPT will forget earlier parts of the conversation.
- Historical summary: Used to record long-term topics; turning it off can reduce token consumption.
- Inject system-level prompts: Used to improve ChatGPT’s response quality; turning it off can reduce token consumption.
- Click the settings button in the lower-left corner to turn off automatic title generation, which can reduce token consumption.
- During a conversation, click the robot icon above the chat box to quickly switch models. Prefer using 3.5 for Q&A, and if unsatisfied, switch to 4.0 to ask again.
Why doesn’t the backend show the used quota when creating a key?
When set to unlimited quota, the used quota won’t update. Change the unlimited quota to a limited quota to see the usage.
User Agreement
Payment constitutes acceptance of this agreement; if you do not agree, please do not pay!
- This service will not persistently store any user’s chat information in any form;
- This service does not and cannot inspect any text content that users transmit through it. Users bear full responsibility for any illegal or criminal consequences arising from their use of this service, and this service will fully cooperate with any related investigations that may arise.