
Streaming

Streaming lets the model return output in real time as it is generated, instead of waiting for the complete response to finish. This significantly reduces the Time To First Byte (TTFB), giving users a smoother, more responsive experience.


Why Use Streaming?

Lower Perceived Latency

In non-streaming mode, users need to wait for the model to generate a complete response before seeing results. For long text, this may take several seconds or even longer.

In streaming mode, users can see the first character almost immediately, then content progressively appears, greatly reducing perceived latency.

Better User Experience

  • Immediate feedback: Users know the model is working
  • Progressive reading: Users can read while content is being generated
  • Early termination: If content doesn’t meet expectations, it can be stopped midway

Use Cases

| Scenario | Recommended Mode | Reason |
| --- | --- | --- |
| Conversation apps | Streaming | Immediate feedback, natural experience |
| Long text generation | Streaming | Reduces waiting anxiety |
| Real-time assistants | Streaming | Fast response, smooth interaction |
| Batch processing | Non-streaming | Simpler code, easier to store |
| Simple queries | Non-streaming | Fast response, no need for streaming |

Quick Start

Simply add stream: true to your request to enable streaming.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[
        {"role": "user", "content": "Please write a poem about spring"}
    ],
    stream=True  # Enable streaming
)

# Read the response chunk by chunk
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Streaming Response Format

Ant Ling uses the Server-Sent Events (SSE) protocol for streaming. Each event starts with data: and is terminated by two newlines (a blank line).

Event Structure

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"content":"S"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"content":"pring"},"finish_reason":null}]}

data: [DONE]

Field Description

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique request identifier |
| object | string | Fixed to chat.completion.chunk |
| created | integer | Response creation timestamp |
| model | string | Name of the model used |
| choices | array | Array of generation results |
| choices[].delta | object | Incremental content |
| choices[].delta.role | string | Role (present only in the first chunk) |
| choices[].delta.content | string | Generated text fragment |
| choices[].finish_reason | string | Finish reason: stop, length, content_filter, or null |

End Marker

The streaming response ends with a data: [DONE] event.
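If you are not using an SDK, you can consume the SSE stream directly. The following is a minimal sketch using the httpx library (an assumption; any HTTP client with streaming support works) and assuming the OpenAI-compatible /chat/completions path. It watches for the data: prefix and stops at the [DONE] marker:

import json
import httpx

payload = {
    "model": "Ling-2.6-flash",
    "messages": [{"role": "user", "content": "Please write a poem about spring"}],
    "stream": True,
}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Stream the raw SSE response line by line
with httpx.stream("POST", "https://api.ant-ling.com/v1/chat/completions",
                  json=payload, headers=headers, timeout=None) as response:
    response.raise_for_status()  # Connection-stage errors surface here
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # Skip blank separator lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # End marker
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            print(delta["content"], end="")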


Complete Example: Building a Chat Interface

The following is a complete web chat application example showing how to use streaming in a real project.

Backend (Node.js + Express)

import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json()); // Parse JSON request bodies

const client = new OpenAI({
  baseURL: 'https://api.ant-ling.com/v1',
  apiKey: process.env.ANT_LING_API_KEY,
});

app.post('/chat', async (req, res) => {
  const { messages } = req.body;

  // Set SSE response headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await client.chat.completions.create({
      model: 'Ling-2.6-flash',
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        // Send data in SSE format
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
      }
    }

    // Send the end marker
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (error) {
    res.write(`data: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
});

app.listen(3000);

Frontend (JavaScript)

async function sendMessage(messages) {
  const response = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n\n');
    buffer = lines.pop(); // Keep the incomplete part

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') {
          console.log('Streaming response ended');
          return;
        }
        try {
          const parsed = JSON.parse(data);
          if (parsed.content) {
            // Update the UI with the new content
            appendToChat(parsed.content);
          }
        } catch (e) {
          console.error('Parse error:', e);
        }
      }
    }
  }
}

function appendToChat(content) {
  // Append content to the chat interface
  const chatElement = document.getElementById('chat-output');
  chatElement.textContent += content;
}

Advanced Usage

Async Iteration Processing

In Python, you can use async for to process the stream asynchronously:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

async def stream_chat():
    response = await client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Tell a story"}],
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(stream_chat())

Cancelling Streaming Requests

When a user stops generation midway, cancel the request promptly to free up resources:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Use a with statement to ensure the connection is released properly
with client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a long novel"}],
    stream=True
) as response:
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="")

        # Break out of the loop when the user clicks the stop button
        if user_clicked_stop():
            break
# The connection is closed automatically when the with block exits

Error Handling

Errors in a streaming response can surface in two ways:

  1. Connection-stage errors: the HTTP status code is not 200 and a JSON error body is returned directly
  2. Streaming-stage errors: an error event is returned inside the SSE stream

Error Handling Example

from openai import OpenAI, APIError

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

try:
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True
    )
    for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
except APIError as e:
    print(f"API Error: {e.message}")
except Exception as e:
    print(f"Other Error: {e}")

Best Practices

1. Always Handle Empty Content

A streaming response may contain chunks whose content is empty, so always check before using it:

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:  # Make sure content is not empty
        print(content, end="")

2. Set Timeout Appropriately

Streaming requests can stay open for a long time, so set an appropriate timeout:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY",
    timeout=60.0  # 60-second timeout
)
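If most of your requests are short but a few streaming calls need more headroom, the OpenAI Python SDK also supports a per-request override via with_options. This is a sketch under that assumption; check that your SDK version provides with_options:

# Override the timeout for a single long-running streaming request
response = client.with_options(timeout=300.0).chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a long novel"}],
    stream=True
)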

3. Buffer Rendering Optimization

When rendering on the frontend, you can use a buffering strategy to avoid overly frequent DOM updates:

let buffer = '';
let renderTimer = null;

function bufferContent(content) {
  buffer += content;

  // Batch updates: flush at most once every 50 ms
  if (!renderTimer) {
    renderTimer = setTimeout(() => {
      appendToChat(buffer);
      buffer = '';
      renderTimer = null;
    }, 50);
  }
}

4. Monitor Token Usage

The chunk that ends generation carries a non-null finish_reason; depending on the API, a trailing chunk may also include usage information:

for chunk in response:
    # Process content
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

    # Check the finish reason
    finish_reason = chunk.choices[0].finish_reason
    if finish_reason:
        print(f"\nGeneration ended, reason: {finish_reason}")
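On OpenAI-compatible APIs, token usage for a streamed request is typically reported only when you opt in via stream_options; whether Ant Ling supports this parameter is an assumption, so treat the following as a sketch:

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True}  # Assumed to be supported; opts in to usage stats
)

for chunk in response:
    # The usage chunk is typically sent last and has an empty choices list
    if chunk.usage:
        print(f"\nTokens used: {chunk.usage.total_tokens}")
    elif chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")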
