Sampling
When generating text, a large language model selects the next word from a set of candidates according to a probability distribution. Temperature, Top-P, and Top-K are the three core sampling parameters that together determine the randomness, creativity, and determinism of the model's output.
Understanding and properly configuring these three parameters lets you get the generation behavior you want in different scenarios.
Core Concepts
| Parameter | Effect | Range | Default | Use Cases |
|---|---|---|---|---|
| Temperature | Controls overall randomness | 0.0 - 2.0 | 0.7 | All scenarios |
| Top-P | Truncates candidate words by cumulative probability | 0.0 - 1.0 | 0.9 | Need to balance creativity and coherence |
| Top-K | Truncates candidate words by ranking | 1 - 100 | 50 | Need to strictly limit candidate range |
Quick Selection Guide:
- Pursuing accuracy (code, math, factual Q&A): Low Temperature + Low Top-P
- Daily conversation: Temperature 0.7 + Top-P 0.9 (default)
- Creative writing: High Temperature + High Top-P
Temperature
Principle Explanation
Temperature controls the “smoothness” of the model's output probability distribution, as the sketch below illustrates:
- Low Temperature (close to 0): the distribution becomes sharper; the model strongly favors the highest-probability word, so output is more deterministic and conservative
- High Temperature (greater than 1): the distribution becomes flatter; the model is more likely to pick lower-probability words, so output is more random and creative
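To make the reshaping concrete, here is a minimal sketch of temperature scaling over a toy distribution. It is illustrative only (not Ant Ling's internal implementation); the logits values and the function name are made up for the example:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Dividing logits by temperature sharpens (T < 1) or flattens (T > 1) the distribution
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])  # raw scores for 4 candidate words
for t in [0.1, 1.0, 1.5]:
    print(t, np.round(softmax_with_temperature(logits, t), 3))
# At 0.1 nearly all probability sits on the first word; at 1.5 it spreads out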
Intuitive Understanding
Imagine a bag full of colored balls (representing candidate words):
- Temperature = 0.1: The bag contains balls of almost only one color; the result of a draw is highly deterministic
- Temperature = 1.0: Balls of every color appear in their original proportions; a draw has some randomness
- Temperature = 1.5: The mix of balls is “scrambled”: originally rare colors become more common, so the result of a draw is harder to predict
Code Example
Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Low Temperature: deterministic output
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "What is the capital of China?"}],
    temperature=0.1  # Almost deterministic response
)

# High Temperature: creative output
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a poem about the moon"}],
    temperature=1.2  # More creative expression
)
Temperature Recommended Values
| Scenario | Recommended Value | Description |
|---|---|---|
| Fact Q&A, code generation | 0.0 - 0.3 | Pursue accuracy and consistency |
| Text summarization, translation | 0.3 - 0.5 | Preserve original meaning, moderately fluent |
| General purpose | 0.5 - 0.8 | Balance naturalness and accuracy |
| Creative writing, brainstorming | 0.8 - 1.2 | Encourage diversity and creativity |
| Exploratory generation | 1.2 - 1.5 | Get unexpected output |
Top-P
Principle Explanation
Top-P (Nucleus Sampling) is a dynamic truncation strategy:
- The model sorts all candidate words from highest to lowest probability
- It accumulates their probabilities from the top until the cumulative total reaches the configured P value
- It then samples only from this “core set” (the nucleus)
For example, Top-P = 0.9 means: consider only the smallest set of words whose probabilities sum to 90%, and ignore the low-probability tail that makes up the remaining 10% of the probability mass.
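The truncation takes only a few lines to express. A minimal sketch (illustrative, not Ant Ling's internals); the toy probabilities are made up and assumed to be already sorted:

import numpy as np

probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])  # candidates, sorted high to low
top_p = 0.9

cumulative = np.cumsum(probs)                     # 0.50, 0.75, 0.90, 0.97, 1.00
cutoff = np.searchsorted(cumulative, top_p) + 1   # keep words until the mass reaches top_p
nucleus = probs[:cutoff] / probs[:cutoff].sum()   # renormalize the "core set"
print(cutoff, nucleus)                            # 3 words survive at top_p = 0.9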
Intuitive Understanding
Imagine you’re choosing books in a bookstore:
- Top-P = 0.3: Choose only from the top 30% of the bestseller list; the selection range is very small and the result predictable
- Top-P = 0.9: Choose from the top 90% of the bestseller list; this includes both popular books and niche quality works, balancing quality and diversity
- Top-P = 1.0: Choose freely from the entire bookstore; you might pick books of uneven quality
Difference from Temperature
| Feature | Temperature | Top-P |
|---|---|---|
| How it works | Changes probability distribution shape | Truncates candidate word set |
| Scope of impact | Global (reshapes the whole distribution) | Local (truncates candidates at each sampling step) |
| Low value effect | More deterministic | Fewer candidate words |
| High value effect | More random | More candidate words |
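In a typical sampling pipeline the two are applied in sequence: Temperature first reshapes the whole distribution, then Top-P cuts off its tail. A minimal sketch under that assumption (actual serving stacks may order and optimize this differently):

import numpy as np

def sample_step(logits, temperature, top_p):
    # Toy example: assumes logits are already sorted from most to least likely
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                             # Temperature: global reshaping
    cutoff = np.searchsorted(np.cumsum(probs), top_p) + 1
    nucleus = probs[:cutoff] / probs[:cutoff].sum()  # Top-P: local truncation
    return np.random.choice(cutoff, p=nucleus)       # index of the sampled word

print(sample_step(np.array([2.0, 1.0, 0.5, 0.1]), temperature=0.7, top_p=0.9))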
Code Example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Conservative Top-P: only consider high-probability words
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Explain the basic principles of quantum computing"}],
    top_p=0.5,  # Sample only from words covering the top 50% of probability mass
    temperature=0.5
)

# Open Top-P: consider more candidate words
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Brainstorm the beginning of a sci-fi story"}],
    top_p=0.95,  # Consider words covering 95% of the probability mass
    temperature=0.8
)
Top-P Recommended Values
| Scenario | Recommended Value | Description |
|---|---|---|
| Code generation, mathematical reasoning | 0.1 - 0.5 | Strictly limit, ensure logical correctness |
| Technical documentation, academic writing | 0.5 - 0.8 | Professional and coherent |
| General purpose | 0.8 - 0.95 | Natural and smooth |
| Creative writing | 0.9 - 1.0 | Retain more possibilities |
Top-K
Principle Explanation
Top-K is a fixed truncation strategy:
- The model sorts all candidate words by probability
- Only keep the top K highest probability words
- Sample from these K words
For example, Top-K = 50 means: at each step, sample only from the 50 highest-probability words and ignore all the rest.
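In code, the filter is just a sort and a slice. A minimal sketch (illustrative, not Ant Ling's internals); the toy probabilities are made up:

import numpy as np

probs = np.array([0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05])
top_k = 3

top_indices = np.argsort(probs)[::-1][:top_k]         # keep the K most likely words
kept = probs[top_indices] / probs[top_indices].sum()  # renormalize over the K survivors
print(top_indices, np.round(kept, 3))                 # only these 3 words can be sampled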
Intuitive Understanding
Imagine a lottery box:
- Top-K = 1: The box only has 1 ball, result is completely deterministic
- Top-K = 10: The box has 10 balls, all are popular choices
- Top-K = 100: The box has 100 balls, wider selection range
Comparison with Top-P
| Feature | Top-K | Top-P |
|---|---|---|
| Filtering method | Fixed quantity | Dynamic probability accumulation |
| Candidate word count | Fixed to K | Dynamic change |
| Use Cases | Need strict limit | Need adaptive balance |
| Combined use | Usually pick either Top-K or Top-P, not both | The more commonly used of the two |
Code Example
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Use Top-K to limit the candidate range
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "List three renewable energy sources"}],
    top_k=40,  # Only choose from the 40 highest-probability words
    temperature=0.6
)
Top-K Recommended Values
| Scenario | Recommended Value | Description |
|---|---|---|
| Deterministic tasks | 1 - 10 | Near-greedy decoding |
| Code generation | 20 - 40 | Ensure grammatical correctness |
| General tasks | 40 - 60 | Balance quality and diversity |
| Creative tasks | 60 - 100 | Retain more choices |
Parameter Combination Strategy
Different parameter combinations suit different scenarios. Here are four common patterns:
1. Precision Mode
Use Cases: Code generation, mathematical reasoning, fact Q&A
Characteristics: Highly deterministic output; results are essentially consistent across repeated calls
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a quick sort in Python"}],
    temperature=0.1,
    top_p=0.1
    # or top_k=5
)
2. Balance Mode
Use Cases: Chat, intelligent customer service
Characteristics: A natural, fluent conversation experience; this is the Ant Ling default configuration
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[
        {"role": "system", "content": "You are a professional customer service assistant"},
        {"role": "user", "content": "How do I reset my password?"}
    ],
    temperature=0.7,
    top_p=0.9
)
3. Creative Mode
Use Cases: Writing, brainstorming, marketing copy
Characteristics: Rich expression, unexpected associations
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a slogan for the new smart watch"}],
    temperature=1.0,
    top_p=0.95
)
4. Adventure Mode
Use Cases: Art generation, style transfer, concept exploration
Characteristics: High randomness, may produce unexpected results
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Describe a city in the future"}],
    temperature=1.3,
    top_p=1.0
)
Parameter Relationship Diagram
Temperature controls overall randomness
                   ↓
┌────────────────────────────────────────┐
│ Low ←───────────────────────────→ High │
│ Deterministic    Balanced      Random  │
│ 0.0                0.7           1.5+  │
└────────────────────────────────────────┘

Top-P controls candidate word range
                   ↓
┌────────────────────────────────────────┐
│ Narrow ←────────────────────────→ Wide │
│ Strict          Moderate         Open  │
│ 0.1                0.9            1.0  │
└────────────────────────────────────────┘

Combined effect examples:
┌─────────────┬───────┬─────────────────────────────────────────────┐
│ Temperature │ Top-P │ Effect                                      │
├─────────────┼───────┼─────────────────────────────────────────────┤
│ 0.1         │ 0.1   │ Extremely deterministic, suitable for code  │
│ 0.5         │ 0.7   │ Conservative but natural                    │
│ 0.7         │ 0.9   │ Balanced, default recommendation            │
│ 1.0         │ 0.95  │ Creative writing                            │
│ 1.3         │ 1.0   │ Experimental, high risk, high return        │
└─────────────┴───────┴─────────────────────────────────────────────┘
Common Questions
Q1: What happens when Temperature is set to 0?
When Temperature = 0, the model (almost) always selects the highest-probability word, which is known as greedy decoding. This produces the most deterministic output and suits scenarios that require consistent results. Note, however, that because of floating-point precision, Temperature = 0 still cannot guarantee 100% identical output.
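A quick way to check this yourself is to repeat the same request a few times (reusing the client from the examples above) and compare the answers:

# Repeated calls at temperature=0 should produce near-identical output
for _ in range(3):
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "What is the capital of China?"}],
        temperature=0  # near-greedy decoding
    )
    print(response.choices[0].message.content)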
Q2: Can Top-P and Top-K be used together?
Yes, but it is usually not recommended. Applying both restrictions at once can leave too few candidate words. Ant Ling recommends Top-P (it adapts to the shape of the distribution); use Top-K only when strict control is needed.
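The over-restriction is easy to see numerically: the surviving candidate set is the intersection of the two truncations, so the stricter filter always wins. A minimal sketch with made-up, pre-sorted probabilities:

import numpy as np

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])  # sorted high to low
top_k, top_p = 2, 0.9

k_kept = min(top_k, len(probs))                        # Top-K alone keeps 2 words
p_kept = np.searchsorted(np.cumsum(probs), top_p) + 1  # Top-P alone keeps 4 words
print(k_kept, p_kept, min(k_kept, p_kept))             # together, only 2 remain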
Q3: Why sometimes parameter adjustments have no effect?
- Short text: Short prompts leave little room for variation, so parameter effects are hard to see
- Deterministic tasks: For tasks such as math calculations, the model itself already tends toward deterministic output
- Context constraints: Strong system prompts or few-shot examples can dominate generation and mask parameter effects
Q4: How do I debug parameter effects?
Run the same prompt several times with different settings and observe how the output changes:
# Test the effect of different Temperature values
for temp in [0.1, 0.5, 1.0, 1.5]:
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Describe spring"}],
        temperature=temp,
        n=3  # Generate 3 completions per call
    )
    print(f"\nTemperature={temp}:")
    for choice in response.choices:
        print(f"  - {choice.message.content[:50]}...")
Q5: Are there best practices for parameter settings?
- Start from default values (temperature=0.7, top_p=0.9)
- Need accuracy → lower temperature
- Need creativity → raise temperature
- Need diversity → raise top_p
- Need stability → lower top_p
Parameter Quick Reference
| Scenario Type | Typical Application | Temperature | Top-P | Top-K |
|---|---|---|---|---|
| Precision Mode | Code generation, math calculation | 0.0 - 0.3 | 0.1 - 0.5 | 1 - 10 |
| Content Generation | Article summarization, translation | 0.3 - 0.6 | 0.7 - 0.9 | 20 - 40 |
| Conversation Interaction | Chatbots, customer service | 0.6 - 0.9 ⭐ | 0.85 - 0.95 ⭐ | 40 - 60 |
| Creative Creation | Story writing, ad copy | 0.9 - 1.3 | 0.9 - 1.0 | 60 - 100 |
| Exploration Experiment | Style transfer, art generation | 1.2 - 1.5 | 0.95 - 1.0 | 80 - 100 |
⭐ Indicates Ant Ling default values
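If you switch between these scenarios often, it can help to wrap the table in a small lookup. This is a hypothetical convenience pattern built from midpoints of the ranges above, not an Ant Ling API; the preset names and exact values are illustrative:

# Hypothetical presets derived from the quick-reference table
PRESETS = {
    "precision": {"temperature": 0.2, "top_p": 0.3},
    "content":   {"temperature": 0.5, "top_p": 0.8},
    "chat":      {"temperature": 0.7, "top_p": 0.9},   # Ant Ling defaults
    "creative":  {"temperature": 1.1, "top_p": 0.95},
    "explore":   {"temperature": 1.3, "top_p": 1.0},
}

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a product description"}],
    **PRESETS["creative"]
)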
Related Resources
- Streaming - Get generation results in real-time
- Model Selection - Learn about the characteristics of Ling, Ring, Ming models
- Quickstart - Complete your first API call in 5 minutes