Sampling
When generating text, a large language model selects the next word from a set of candidates according to a probability distribution. Temperature, Top-P, and Top-K are the three core sampling parameters that together determine the randomness, creativity, and determinism of the model's output.
Understanding and properly configuring these three parameters lets you get the generation behavior you want in different scenarios.
Core Concepts
| Parameter | Effect | Range | Default | Use Cases |
|---|---|---|---|---|
| Temperature | Controls overall randomness | 0.0 - 2.0 | 0.7 | All scenarios |
| Top-P | Truncates candidate words by cumulative probability | 0.0 - 1.0 | 0.9 | Need to balance creativity and coherence |
| Top-K | Truncates candidate words by ranking | 1 - 100 | 50 | Need to strictly limit candidate range |
Quick Selection Guide:
- Pursuing accuracy (code, math, factual Q&A): Low Temperature + Low Top-P
- Daily conversation: Temperature 0.7 + Top-P 0.9 (default)
- Creative writing: High Temperature + High Top-P
Temperature
Principle Explanation
Temperature controls the “smoothness” of the model's output probability distribution, as the sketch below illustrates:
- Low Temperature (close to 0): the distribution becomes sharper; the model strongly favors the highest-probability word, so output is more deterministic and conservative
- High Temperature (greater than 1): the distribution becomes flatter; the model is more likely to pick lower-probability words, so output is more random and creative
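To make the reshaping concrete, here is a minimal sketch of temperature scaling over a toy distribution. It is illustrative only (not Ant Ling's internal implementation); the logits values and the function name are made up for the example:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Dividing logits by temperature sharpens (T < 1) or flattens (T > 1) the distribution
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # raw scores for 4 candidate words
for t in [0.1, 1.0, 1.5]:
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# At 0.1 nearly all probability sits on the first word; at 1.5 it spreads out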
Intuitive Understanding
Imagine a bag full of colored balls (representing candidate words):
- Temperature = 0.1: The bag contains balls of almost only one color; the result of a draw is highly deterministic
- Temperature = 1.0: Balls of every color appear in their original proportions; a draw has some randomness
- Temperature = 1.5: The mix of balls is “scrambled”: originally rare colors become more common, so the result of a draw is harder to predict
Code Example
Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Low Temperature: deterministic output
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "What is the capital of China?"}],
    temperature=0.1  # Almost deterministic response
)

# High Temperature: creative output
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a poem about the moon"}],
    temperature=1.2  # More creative expression
)
Temperature Recommended Values
| Scenario | Recommended Value | Description |
|---|---|---|
| Fact Q&A, code generation | 0.0 - 0.3 | Pursue accuracy and consistency |
| Text summarization, translation | 0.3 - 0.5 | Preserve original meaning, moderately fluent |
| General purpose | 0.5 - 0.8 | Balance naturalness and accuracy |
| Creative writing, brainstorming | 0.8 - 1.2 | Encourage diversity and creativity |
| Exploratory generation | 1.2 - 1.5 | Get unexpected output |
Top-P
Principle Explanation
Top-P (Nucleus Sampling) is a dynamic truncation strategy:
- The model sorts all candidate words from highest to lowest probability
- It accumulates their probabilities from the top until the cumulative total reaches the configured P value
- It then samples only from this “core set” (the nucleus)
For example, Top-P = 0.9 means: consider only the smallest set of words whose probabilities sum to 90%, and ignore the low-probability tail that makes up the remaining 10% of the probability mass.
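The truncation takes only a few lines to express. A minimal sketch (illustrative, not Ant Ling's internals); the toy probabilities are made up and assumed to be already sorted:

import numpy as np

probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])  # candidates, sorted high to low
top_p = 0.9

cumulative = np.cumsum(probs)                     # 0.50, 0.75, 0.90, 0.97, 1.00
cutoff = np.searchsorted(cumulative, top_p) + 1   # keep words until the mass reaches top_p
nucleus = probs[:cutoff] / probs[:cutoff].sum()   # renormalize the "core set"
print(cutoff, nucleus)                            # 3 words survive at top_p = 0.9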
Intuitive Understanding
Imagine you’re choosing books in a bookstore:
- Top-P = 0.3: Choose only from the top 30% of the bestseller list; the selection range is very small and the result predictable
- Top-P = 0.9: Choose from the top 90% of the bestseller list; this includes both popular books and niche quality works, balancing quality and diversity
- Top-P = 1.0: Choose freely from the entire bookstore; you might pick books of uneven quality
Difference from Temperature
| Feature | Temperature | Top-P |
|---|---|---|
| How it works | Changes probability distribution shape | Truncates candidate word set |
| Scope of impact | Global (reshapes the whole distribution) | Local (truncates candidates at each sampling step) |
| Low value effect | More deterministic | Fewer candidate words |
| High value effect | More random | More candidate words |
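In a typical sampling pipeline the two are applied in sequence: Temperature first reshapes the whole distribution, then Top-P cuts off its tail. A minimal sketch under that assumption (actual serving stacks may order and optimize this differently):

import numpy as np

def sample_step(logits, temperature, top_p):
    # Toy example: assumes logits are already sorted from most to least likely
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                             # Temperature: global reshaping
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    nucleus = probs[:cutoff] / probs[:cutoff].sum()  # Top-P: local truncation
    return np.random.choice(cutoff, p=nucleus)       # index of the sampled word

print(sample_step(np.array([2.0, 1.0, 0.5, 0.1]), temperature=0.7, top_p=0.9))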
Code Example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Conservative Top-P: only consider high-probability words
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Explain the basic principles of quantum computing"}],
    top_p=0.5,  # Sample only from words covering the top 50% of probability mass
    temperature=0.5
)

# Open Top-P: consider more candidate words
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Brainstorm the beginning of a sci-fi story"}],
    top_p=0.95,  # Consider words covering 95% of the probability mass
    temperature=0.8
)
Top-P Recommended Values
| Scenario | Recommended Value | Description |
|---|---|---|
| Code generation, mathematical reasoning | 0.1 - 0.5 | Strictly limit, ensure logical correctness |
| Technical documentation, academic writing | 0.5 - 0.8 | Professional and coherent |
| General purpose | 0.8 - 0.95 | Natural and smooth |
| Creative writing | 0.9 - 1.0 | Retain more possibilities |
Top-K
Principle Explanation
Top-K is a fixed truncation strategy:
- The model sorts all candidate words by probability
- Only keep the top K highest probability words
- Sample from these K words
For example, Top-K = 50 means: at each step, sample only from the 50 highest-probability words and ignore all the rest.
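In code, the filter is just a sort and a slice. A minimal sketch (illustrative, not Ant Ling's internals); the toy probabilities are made up:

import numpy as np

probs = np.array([0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05])
top_k = 3

top_indices = np.argsort(probs)[::-1][:top_k]         # keep the K most likely words
kept = probs[top_indices] / probs[top_indices].sum()  # renormalize over the K survivors
print(top_indices, np.round(kept, 3))                 # only these 3 words can be sampled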
Intuitive Understanding
Imagine a lottery box:
- Top-K = 1: The box only has 1 ball, result is completely deterministic
- Top-K = 10: The box has 10 balls, all are popular choices
- Top-K = 100: The box has 100 balls, wider selection range
Comparison with Top-P
| Feature | Top-K | Top-P |
|---|---|---|
| Filtering method | Fixed quantity | Dynamic probability accumulation |
| Candidate word count | Fixed to K | Dynamic change |
| Use Cases | Need strict limit | Need adaptive balance |
| Combined use | Usually pick either Top-K or Top-P, not both | The more commonly used of the two |
Code Example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Use Top-K to limit the candidate range
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "List three renewable energy sources"}],
    top_k=40,  # Only choose from the 40 highest-probability words
    temperature=0.6
)
Top-K Recommended Values
| Scenario | Recommended Value | Description |
|---|---|---|
| Deterministic tasks | 1 - 10 | Near-greedy decoding |
| Code generation | 20 - 40 | Ensure grammatical correctness |
| General tasks | 40 - 60 | Balance quality and diversity |
| Creative tasks | 60 - 100 | Retain more choices |
Parameter Combination Strategy
Different parameter combinations suit different scenarios. Here are four common patterns:
1. Precision Mode
Use Cases: Code generation, mathematical reasoning, fact Q&A
Characteristics: Highly deterministic output; results are essentially consistent across repeated calls
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a quick sort in Python"}],
    temperature=0.1,
    top_p=0.1
    # or top_k=5
)
2. Balance Mode
Use Cases: Chat, intelligent customer service
Characteristics: A natural, fluent conversation experience; this is the Ant Ling default configuration
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[
        {"role": "system", "content": "You are a professional customer service assistant"},
        {"role": "user", "content": "How do I reset my password?"}
    ],
    temperature=0.7,
    top_p=0.9
)
3. Creative Mode
Use Cases: Writing, brainstorming, marketing copy
Characteristics: Rich expression, unexpected associations
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a slogan for the new smart watch"}],
    temperature=1.0,
    top_p=0.95
)
4. Adventure Mode
Use Cases: Art generation, style transfer, concept exploration
Characteristics: High randomness, may produce unexpected results
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Describe a city in the future"}],
    temperature=1.3,
    top_p=1.0
)
Parameter Relationship Diagram
Temperature controls overall randomness
                   ↓
┌────────────────────────────────────────┐
│ Low ←───────────────────────────→ High │
│ Deterministic    Balanced      Random  │
│ 0.0                0.7           1.5+  │
└────────────────────────────────────────┘

Top-P controls candidate word range
                   ↓
┌────────────────────────────────────────┐
│ Narrow ←────────────────────────→ Wide │
│ Strict          Moderate         Open  │
│ 0.1                0.9            1.0  │
└────────────────────────────────────────┘

Combined effect examples:
┌─────────────┬───────┬─────────────────────────────────────────────┐
│ Temperature │ Top-P │ Effect                                      │
├─────────────┼───────┼─────────────────────────────────────────────┤
│ 0.1         │ 0.1   │ Extremely deterministic, suitable for code  │
│ 0.5         │ 0.7   │ Conservative but natural                    │
│ 0.7         │ 0.9   │ Balanced, default recommendation            │
│ 1.0         │ 0.95  │ Creative writing                            │
│ 1.3         │ 1.0   │ Experimental, high risk, high return        │
└─────────────┴───────┴─────────────────────────────────────────────┘
Common Questions
Q1: What happens when Temperature is set to 0?
When Temperature = 0, the model (almost) always selects the highest-probability word, which is known as greedy decoding. This produces the most deterministic output and suits scenarios that require consistent results. Note, however, that because of floating-point precision, Temperature = 0 still cannot guarantee 100% identical output.
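A quick way to check this yourself is to repeat the same request a few times (reusing the client from the examples above) and compare the answers:

# Repeated calls at temperature=0 should produce near-identical output
for _ in range(3):
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "What is the capital of China?"}],
        temperature=0  # near-greedy decoding
    )
    print(response.choices[0].message.content)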
Q2: Can Top-P and Top-K be used together?
Yes, but it is usually not recommended. Applying both restrictions at once can leave too few candidate words. Ant Ling recommends Top-P (it adapts to the shape of the distribution); use Top-K only when strict control is needed.
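The over-restriction is easy to see numerically: the surviving candidate set is the intersection of the two truncations, so the stricter filter always wins. A minimal sketch with made-up, pre-sorted probabilities:

import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])  # sorted high to low
top_k, top_p = 2, 0.9

k_kept = min(top_k, len(probs))                        # Top-K alone keeps 2 words
p_kept = np.searchsorted(np.cumsum(probs), top_p) + 1  # Top-P alone keeps 4 words
print(k_kept, p_kept, min(k_kept, p_kept))             # together, only 2 remain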
Q3: Why sometimes parameter adjustments have no effect?
- Short text: Short prompts leave little room for variation, so parameter effects are hard to see
- Deterministic tasks: For tasks such as math calculations, the model itself already tends toward deterministic output
- Context constraints: Strong system prompts or few-shot examples can dominate generation and mask parameter effects
Q4: How do I debug parameter effects?
Run the same prompt several times with different settings and observe how the output changes:
# Test the effect of different Temperature values
for temp in [0.1, 0.5, 1.0, 1.5]:
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Describe spring"}],
        temperature=temp,
        n=3  # Generate 3 completions per call
    )
    print(f"\nTemperature={temp}:")
    for choice in response.choices:
        print(f"  - {choice.message.content[:50]}...")
Q5: Are there best practices for parameter settings?
- Start from default values (temperature=0.7, top_p=0.9)
- Need accuracy → lower temperature
- Need creativity → raise temperature
- Need diversity → raise top_p
- Need stability → lower top_p
Parameter Quick Reference
| Scenario Type | Typical Application | Temperature | Top-P | Top-K |
|---|---|---|---|---|
| Precision Mode | Code generation, math calculation | 0.0 - 0.3 | 0.1 - 0.5 | 1 - 10 |
| Content Generation | Article summarization, translation | 0.3 - 0.6 | 0.7 - 0.9 | 20 - 40 |
| Conversation Interaction | Chatbots, customer service | 0.6 - 0.9 ⭐ | 0.85 - 0.95 ⭐ | 40 - 60 |
| Creative Creation | Story writing, ad copy | 0.9 - 1.3 | 0.9 - 1.0 | 60 - 100 |
| Exploration Experiment | Style transfer, art generation | 1.2 - 1.5 | 0.95 - 1.0 | 80 - 100 |
⭐ Indicates Ant Ling default values
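If you switch between these scenarios often, it can help to wrap the table in a small lookup. This is a hypothetical convenience pattern built from midpoints of the ranges above, not an Ant Ling API; the preset names and exact values are illustrative:

# Hypothetical presets derived from the quick-reference table
PRESETS = {
    "precision": {"temperature": 0.2, "top_p": 0.3},
    "content":   {"temperature": 0.5, "top_p": 0.8},
    "chat":      {"temperature": 0.7, "top_p": 0.9},   # Ant Ling defaults
    "creative":  {"temperature": 1.1, "top_p": 0.95},
    "explore":   {"temperature": 1.3, "top_p": 1.0},
}

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a product description"}],
    **PRESETS["creative"]
)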
Related Resources
- Streaming - Get generation results in real-time
- Model Selection - Learn about the characteristics of Ling, Ring, Ming models
- Quickstart - Complete your first API call in 5 minutes