Streaming
Streaming allows the model to return results in real time as it generates content, instead of waiting for the complete response. This significantly reduces Time To First Byte (TTFB), giving users a smoother interaction experience.
Why Use Streaming?
Lower Perceived Latency
In non-streaming mode, users need to wait for the model to generate a complete response before seeing results. For long text, this may take several seconds or even longer.
In streaming mode, users can see the first character almost immediately, then content progressively appears, greatly reducing perceived latency.
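To see the effect yourself, the sketch below measures time to first token against total generation time. It reuses the client setup from the Quick Start section further down (YOUR_API_KEY is a placeholder), and the numbers will vary by model and prompt.

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

start = time.monotonic()
stream = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Please write a poem about spring"}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_at is None:
            # Time until the first visible fragment: the latency users perceive
            first_token_at = time.monotonic() - start

total = time.monotonic() - start
print(f"First token after {first_token_at:.2f}s; full response after {total:.2f}s")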
Better User Experience
- Immediate feedback: Users know the model is working
- Progressive reading: Users can read while content is being generated
- Early termination: If content doesn’t meet expectations, it can be stopped midway
Use Cases
| Scenario | Recommended Mode | Reason |
|---|---|---|
| Conversation apps | Streaming | Immediate feedback, natural experience |
| Long text generation | Streaming | Reduce waiting anxiety |
| Real-time assistant | Streaming | Fast response, smooth interaction |
| Batch processing | Non-streaming | Simpler code, easier to store |
| Simple queries | Non-streaming | Fast response, no need for streaming |
Quick Start
Simply add stream: true to your request to enable streaming.
Python
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[
        {"role": "user", "content": "Please write a poem about spring"}
    ],
    stream=True  # Enable streaming
)

# Read the response chunk by chunk
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Streaming Response Format
Ant Ling uses the Server-Sent Events (SSE) protocol for streaming. Each event starts with data: and is terminated by two newlines.
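To make the framing concrete, here is a minimal, dependency-free sketch of an SSE event parser; the helper name iter_sse_events and the sample lines are illustrative, not part of the API.

import json

def iter_sse_events(lines):
    """Yield parsed events from an iterable of SSE lines.

    Each event is a 'data: ...' line; blank lines separate events.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Blank line: event boundary
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                return  # End-of-stream marker
            yield json.loads(payload)

# Example using the event shapes shown under Event Structure below:
raw = [
    'data: {"choices":[{"index":0,"delta":{"content":"S"},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]
for event in iter_sse_events(raw):
    print(event["choices"][0]["delta"].get("content", ""), end="")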
Event Structure
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"content":"S"},"finish_reason":null}]}
data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"content":"pring"},"finish_reason":null}]}
data: [DONE]

Field Description
| Field | Type | Description |
|---|---|---|
| id | string | Unique request identifier |
| object | string | Fixed to chat.completion.chunk |
| created | integer | Response creation time (Unix timestamp) |
| model | string | Model name used |
| choices | array | Array of generation results |
| choices[].delta | object | Incremental content |
| choices[].delta.role | string | Role (present only in the first chunk) |
| choices[].delta.content | string | Generated text fragment |
| choices[].finish_reason | string | End reason: stop, length, content_filter, or null |
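To connect these fields to code, the following sketch accumulates the incremental deltas into the complete assistant message; it assumes response is a stream created with stream=True, as in the Quick Start above.

full_text = ""
role = None

for chunk in response:
    choice = chunk.choices[0]
    if choice.delta.role:          # Role arrives only in the first chunk
        role = choice.delta.role
    if choice.delta.content:       # Text arrives as incremental fragments
        full_text += choice.delta.content
    if choice.finish_reason:       # stop, length, or content_filter
        print(f"\n[{role}] finished: {choice.finish_reason}")

print(full_text)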
End Marker
A streaming response ends with data: [DONE].
Complete Example: Building a Chat Interface
The following is a complete web chat example showing how to use streaming in a real project.
Backend (Node.js + Express)
import express from 'express';
import OpenAI from 'openai';
const app = express();
app.use(express.json()); // Needed so req.body is populated from JSON requests

const client = new OpenAI({
  baseURL: 'https://api.ant-ling.com/v1',
  apiKey: process.env.ANT_LING_API_KEY,
});

app.post('/chat', async (req, res) => {
  const { messages } = req.body;

  // Set SSE response headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await client.chat.completions.create({
      model: 'Ling-2.6-flash',
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        // Send SSE-formatted data
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
      }
    }

    // Send the end marker
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (error) {
    res.write(`data: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
});

app.listen(3000);

Frontend (JavaScript)
async function sendMessage(messages) {
  const response = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n\n');
    buffer = lines.pop(); // Keep the incomplete trailing part

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') {
          console.log('Streaming response ended');
          return;
        }
        try {
          const parsed = JSON.parse(data);
          if (parsed.content) {
            // Update the UI with the new content
            appendToChat(parsed.content);
          }
        } catch (e) {
          console.error('Parse error:', e);
        }
      }
    }
  }
}

function appendToChat(content) {
  // Append content to the chat interface
  const chatElement = document.getElementById('chat-output');
  chatElement.textContent += content;
}

Advanced Usage
Async Iteration Processing
In Python, you can use async for with the AsyncOpenAI client for asynchronous processing:
import asyncio
from openai import AsyncOpenAI
client = AsyncOpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

async def stream_chat():
    response = await client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Tell a story"}],
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(stream_chat())

Cancelling Streaming Requests
When users stop generation midway, requests should be cancelled promptly to save resources:
Python
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Use a with statement to ensure the connection is released properly
with client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a long novel"}],
    stream=True
) as response:
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="")
        # Break out of the loop when the user clicks stop;
        # user_clicked_stop() is a placeholder for your app's stop signal
        if user_clicked_stop():
            break  # Leaving the with block closes the connection

Error Handling
Errors in streaming responses may surface at two stages (see the sketch after this list):
- Connection-stage errors: the HTTP status code is not 200 and the body is a plain JSON error object
- Streaming-stage errors: an error event is emitted inside the SSE stream
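For illustration, the sketch below handles both stages while reading the raw SSE stream with the requests library. The endpoint path and the shape of the in-stream error event follow common OpenAI-compatible conventions rather than a documented contract, so verify them against the API reference.

import json
import requests

resp = requests.post(
    "https://api.ant-ling.com/v1/chat/completions",  # assumed OpenAI-compatible path
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "Ling-2.6-flash",
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)

if resp.status_code != 200:
    # Connection-stage error: the body is a plain JSON error object
    print("Request failed:", resp.json())
else:
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "error" in event:
            # Streaming-stage error: an error event inside the SSE stream
            print("Stream error:", event["error"])
            break
        print(event["choices"][0]["delta"].get("content", ""), end="")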
Error Handling Example
from openai import OpenAI, APIError
client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

try:
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True
    )
    # Keep the loop inside the try block: errors can also be raised
    # mid-stream while iterating, not only when the request is created
    for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
except APIError as e:
    print(f"API Error: {e.message}")
except Exception as e:
    print(f"Other Error: {e}")

Best Practices
1. Always Handle Empty Content
Streaming responses may contain chunks whose content is empty or None (for example, the first chunk that only carries the role), so always check before using it:
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:  # Make sure content is not empty
        print(content, end="")

2. Set Timeout Appropriately
Streaming requests may stay open for a long time, so it is recommended to set an appropriate timeout:
from openai import OpenAI
client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY",
    timeout=60.0  # 60-second timeout
)

3. Buffer Rendering Optimization
When rendering on the frontend, a buffering strategy avoids overly frequent DOM updates:
let buffer = '';
let renderTimer = null;

function bufferContent(content) {
  buffer += content;
  // Flush at most once every 50 ms
  if (!renderTimer) {
    renderTimer = setTimeout(() => {
      appendToChat(buffer);
      buffer = '';
      renderTimer = null;
    }, 50);
  }
}

4. Monitor Token Usage
The last chunks of a streaming response contain finish_reason and, depending on the API, usage information:
for chunk in response:
    if chunk.choices:
        # Process content
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
        # Check the finish reason
        finish_reason = chunk.choices[0].finish_reason
        if finish_reason:
            print(f"\nGeneration ended, reason: {finish_reason}")
    # Depending on the API, a trailing chunk may carry usage statistics
    if getattr(chunk, "usage", None):
        print(f"Token usage: {chunk.usage.total_tokens}")

Related Resources
- Quickstart - Complete your first API call in 5 minutes
- Model Selection - Learn about use cases for the Ling, Ring, and Ming models