
Sampling

When generating text, large language models select the next word from a set of candidates according to a probability distribution. Temperature, Top-P, and Top-K are the three core sampling parameters that together determine how random, creative, or deterministic the model's output is.

Understanding and properly configuring these three parameters allows you to achieve ideal generation results in different scenarios.


Core Concepts

| Parameter | Effect | Range | Default | Use Cases |
| --- | --- | --- | --- | --- |
| Temperature | Controls overall randomness | 0.0 - 2.0 | 0.7 | All scenarios |
| Top-P | Truncates candidate words by cumulative probability | 0.0 - 1.0 | 0.9 | Need to balance creativity and coherence |
| Top-K | Truncates candidate words by ranking | 1 - 100 | 50 | Need to strictly limit the candidate range |

Quick Selection Guide:

  • Pursuing accuracy (code, math, fact Q&A): Low Temperature + Low Top-P
  • Daily conversation: Temperature 0.7 + Top-P 0.9 (default)
  • Creative writing: High Temperature + High Top-P

Temperature

Principle Explanation

Temperature controls the “smoothness” of the model's output probability distribution:

  • Low Temperature (close to 0): The distribution becomes sharper; the model tends to select the highest-probability word, so output is more deterministic and conservative
  • High Temperature (greater than 1): The distribution becomes flatter; lower-probability words are chosen more often, so output is more random and creative
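
To make this concrete, temperature-scaled softmax can be sketched in a few lines of plain Python. This is an illustrative toy, not part of the Ant Ling API; the logits are invented:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate words

for t in (0.1, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Low temperature concentrates probability on the top word;
# high temperature flattens the distribution, giving rarer words more chance.
```

Running the loop shows the top word's probability shrinking as temperature rises, which is exactly the “sharp vs. flat” behavior described above.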

Intuitive Understanding

Imagine a bag full of colored balls (representing candidate words):

  • Temperature = 0.1: The bag contains almost entirely balls of one color, so the draw result is very predictable
  • Temperature = 1.0: Balls of each color appear in their original proportions, so the draw has some randomness
  • Temperature = 1.5: The mix is “scrambled” so that originally rare colors become more common, making the draw result harder to predict

Code Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Low Temperature: deterministic output
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "What is the capital of China?"}],
    temperature=0.1  # Almost deterministic response
)

# High Temperature: creative output
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a poem about the moon"}],
    temperature=1.2  # More creative expression
)
```
Recommended Values

| Scenario | Recommended Value | Description |
| --- | --- | --- |
| Fact Q&A, code generation | 0.0 - 0.3 | Pursue accuracy and consistency |
| Text summarization, translation | 0.3 - 0.5 | Preserve original meaning, moderately fluent |
| General purpose | 0.5 - 0.8 | Balance naturalness and accuracy |
| Creative writing, brainstorming | 0.8 - 1.2 | Encourage diversity and creativity |
| Exploratory generation | 1.2 - 1.5 | Get unexpected output |

Top-P

Principle Explanation

Top-P (Nucleus Sampling) is a dynamic truncation strategy:

  1. The model sorts all candidate words from highest to lowest probability
  2. Starting from the top, it accumulates their probabilities until the total reaches the set value P
  3. It samples only from this “core set” (the nucleus)

For example, Top-P = 0.9 means: only the top-ranked words whose probabilities sum to 90% are considered; the long tail of low-probability words making up the remaining 10% is ignored.
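
The three steps above can be sketched on a toy distribution. This is illustrative plain Python with invented words and probabilities; real implementations operate on the model's full token distribution:

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set of words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize the surviving "nucleus" so its probabilities sum to 1
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "axolotl": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(nucleus)  # "axolotl" falls outside the 90% nucleus and is dropped

# Sample the next word from the renormalized nucleus
word = random.choices(list(nucleus), weights=list(nucleus.values()))[0]
```

Note how the cutoff adapts to the shape of the distribution: a very confident distribution keeps only one or two words, while a flat one keeps many.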

Intuitive Understanding

Imagine you’re choosing books in a bookstore:

  • Top-P = 0.3: Only choose from the top 30% of bestseller list, very small selection range, predictable result
  • Top-P = 0.9: Choose from the top 90% of bestseller list, includes both popular books and niche quality works, balances quality and diversity
  • Top-P = 1.0: Choose freely from the entire bookstore, might pick books with uneven quality

Difference from Temperature

| Feature | Temperature | Top-P |
| --- | --- | --- |
| How it works | Reshapes the probability distribution | Truncates the candidate word set |
| Scope of impact | Global (all candidates) | Local (per sampling step) |
| Low value effect | More deterministic | Fewer candidate words |
| High value effect | More random | More candidate words |

Code Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Conservative Top-P: only consider high-probability words
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Explain the basic principles of quantum computing"}],
    top_p=0.5,  # Only sample from words covering the top 50% of probability mass
    temperature=0.5
)

# Open Top-P: consider more candidate words
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Brainstorm the beginning of a sci-fi story"}],
    top_p=0.95,  # Sample from words covering 95% of probability mass
    temperature=0.8
)
```
Recommended Values

| Scenario | Recommended Value | Description |
| --- | --- | --- |
| Code generation, mathematical reasoning | 0.1 - 0.5 | Strictly limit candidates, ensure logical correctness |
| Technical documentation, academic writing | 0.5 - 0.8 | Professional and coherent |
| General purpose | 0.8 - 0.95 | Natural and smooth |
| Creative writing | 0.9 - 1.0 | Retain more possibilities |

Top-K

Principle Explanation

Top-K is a fixed truncation strategy:

  1. The model sorts all candidate words by probability
  2. Only keep the top K highest probability words
  3. Sample from these K words

For example, Top-K = 50 means: at each step the model chooses only among the 50 highest-probability words and completely ignores all others.
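
The same idea can be sketched on a toy distribution. This is illustrative plain Python with invented words and probabilities, not the model's actual vocabulary:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability words and renormalize them."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {word: prob / total for word, prob in ranked}

probs = {"sun": 0.4, "moon": 0.3, "star": 0.2, "cloud": 0.07, "fog": 0.03}
shortlist = top_k_filter(probs, 3)
print(shortlist)  # only "sun", "moon", "star" survive, renormalized to sum to 1
```

Unlike Top-P, the cutoff here is a fixed count: exactly k words survive regardless of how the probability mass is distributed among them.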

Intuitive Understanding

Imagine a lottery box:

  • Top-K = 1: The box only has 1 ball, result is completely deterministic
  • Top-K = 10: The box has 10 balls, all are popular choices
  • Top-K = 100: The box has 100 balls, wider selection range

Comparison with Top-P

| Feature | Top-K | Top-P |
| --- | --- | --- |
| Filtering method | Fixed count | Dynamic probability accumulation |
| Candidate word count | Fixed at K | Changes dynamically |
| Use cases | Strict limits needed | Adaptive balance needed |
| Combined use | Usually pick one of the two | The more commonly used of the two |

Code Example

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Use Top-K to limit the candidate range.
# The OpenAI SDK does not expose top_k directly, so pass it via extra_body.
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "List three renewable energy sources"}],
    extra_body={"top_k": 40},  # Only choose from the 40 highest-probability words
    temperature=0.6
)
```
Recommended Values

| Scenario | Recommended Value | Description |
| --- | --- | --- |
| Deterministic tasks | 1 - 10 | Near-greedy decoding |
| Code generation | 20 - 40 | Ensure grammatical correctness |
| General tasks | 40 - 60 | Balance quality and diversity |
| Creative tasks | 60 - 100 | Retain more choices |

Parameter Combination Strategy

Different parameter combinations suit different scenarios. Here are four common patterns:

1. Precision Mode

Use Cases: Code generation, mathematical reasoning, fact Q&A

Characteristics: Highly deterministic output, results basically consistent across multiple calls

```python
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a quick sort in Python"}],
    temperature=0.1,
    top_p=0.1  # or top_k=5
)
```

2. Balance Mode

Use Cases: Chat, intelligent customer service

Characteristics: Natural and fluent conversation; this is the Ant Ling default configuration

```python
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[
        {"role": "system", "content": "You are a professional customer service assistant"},
        {"role": "user", "content": "How do I reset my password?"}
    ],
    temperature=0.7,
    top_p=0.9
)
```

3. Creative Mode

Use Cases: Writing, brainstorming, marketing copy

Characteristics: Rich expression, unexpected associations

```python
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a slogan for the new smart watch"}],
    temperature=1.0,
    top_p=0.95
)
```

4. Adventure Mode

Use Cases: Art generation, style transfer, concept exploration

Characteristics: High randomness, may produce unexpected results

```python
response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Describe a city in the future"}],
    temperature=1.3,
    top_p=1.0
)
```

Parameter Relationship Diagram

```
Temperature controls overall randomness

┌─────────────────────────────────────┐
│ Low ←────────────────────────→ High │
│ Deterministic   Balanced     Random │
│ 0.0             0.7          1.5+   │
└─────────────────────────────────────┘

Top-P controls the candidate word range

┌─────────────────────────────────────┐
│ Narrow ←─────────────────────→ Wide │
│ Strict      Moderate        Open    │
│ 0.1         0.9             1.0     │
└─────────────────────────────────────┘

Combined effect examples:

┌─────────────┬───────┬────────────────────────────────────┐
│ Temperature │ Top-P │ Effect                             │
├─────────────┼───────┼────────────────────────────────────┤
│ 0.1         │ 0.1   │ Extremely deterministic, for code  │
│ 0.5         │ 0.7   │ Conservative but natural           │
│ 0.7         │ 0.9   │ Balanced, default recommendation   │
│ 1.0         │ 0.95  │ Creative writing                   │
│ 1.3         │ 1.0   │ Experimental, high risk and reward │
└─────────────┴───────┴────────────────────────────────────┘
```

Common Questions

Q1: What happens when Temperature is set to 0?

When Temperature = 0, the model almost always selects the highest-probability word (greedy decoding). This produces the most deterministic output and suits scenarios that require consistent results. Note, however, that due to floating-point precision, Temperature = 0 cannot guarantee 100% determinism.

Q2: Can Top-P and Top-K be used together?

Yes, but it is usually not recommended. Applying both restrictions at once may leave too few candidate words. Ant Ling recommends Top-P (more flexible), and suggests using Top-K only when strict control is needed.

Q3: Why sometimes parameter adjustments have no effect?

  • Short text: Short prompts have limited generation space, parameter effects are not obvious
  • Deterministic tasks: Like math calculations, the model itself tends toward deterministic output
  • Context constraints: Strong system prompts or examples will override parameter effects

Q4: How to debug parameter effects?

It is recommended to use the same prompt for multiple calls and observe output changes:

```python
# Test the effect of different Temperature values
for temp in [0.1, 0.5, 1.0, 1.5]:
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Describe spring"}],
        temperature=temp,
        n=3  # Generate 3 results
    )
    print(f"\nTemperature={temp}:")
    for choice in response.choices:
        print(f"  - {choice.message.content[:50]}...")
```

Q5: Are there best practices for parameter settings?

  • Start from default values (temperature=0.7, top_p=0.9)
  • Need accuracy → lower temperature
  • Need creativity → raise temperature
  • Need diversity → raise top_p
  • Need stability → lower top_p

Parameter Quick Reference

| Scenario Type | Typical Application | Temperature | Top-P | Top-K |
| --- | --- | --- | --- | --- |
| Precision Mode | Code generation, math calculation | 0.0 - 0.3 | 0.1 - 0.5 | 1 - 10 |
| Content Generation | Article summarization, translation | 0.3 - 0.6 | 0.7 - 0.9 | 20 - 40 |
| Conversational Interaction | Chatbots, customer service | 0.6 - 0.9 ⭐ | 0.85 - 0.95 ⭐ | 40 - 60 |
| Creative Writing | Story writing, ad copy | 0.9 - 1.3 | 0.9 - 1.0 | 60 - 100 |
| Exploratory Experiments | Style transfer, art generation | 1.2 - 1.5 | 0.95 - 1.0 | 80 - 100 |

⭐ Indicates Ant Ling default values


  • Streaming - Get generation results in real-time
  • Model Selection - Learn about the characteristics of Ling, Ring, Ming models
  • Quickstart - Complete your first API call in 5 minutes