
Streaming

Streaming lets the model return output in real time as it is generated, instead of waiting for the complete response to finish. This significantly reduces the Time To First Byte (TTFB), giving users a smoother, more responsive experience.


Why Use Streaming?

Lower Perceived Latency

In non-streaming mode, users need to wait for the model to generate a complete response before seeing results. For long text, this may take several seconds or even longer.

In streaming mode, users can see the first character almost immediately, then content progressively appears, greatly reducing perceived latency.

Better User Experience

  • Immediate feedback: Users know the model is working
  • Progressive reading: Users can read while content is being generated
  • Early termination: If content doesn’t meet expectations, it can be stopped midway

Use Cases

| Scenario | Recommended Mode | Reason |
| --- | --- | --- |
| Conversation apps | Streaming | Immediate feedback, natural experience |
| Long text generation | Streaming | Reduces waiting anxiety |
| Real-time assistants | Streaming | Fast response, smooth interaction |
| Batch processing | Non-streaming | Simpler code, easier to store |
| Simple queries | Non-streaming | Fast response, no need for streaming |

Quick Start

Simply add stream: true to your request to enable streaming.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[
        {"role": "user", "content": "Please write a poem about spring"}
    ],
    stream=True  # Enable streaming
)

# Read the response chunk by chunk
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Streaming Response Format

Ant Ling uses the Server-Sent Events (SSE) protocol for streaming. Each event starts with data: and is terminated by two newlines (a blank line).

Event Structure

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"content":"S"},"finish_reason":null}]}

data: {"id":"chatcmpl-xxx","object":"chat.completion.chunk","created":1234567890,"model":"Ling-2.6-flash","choices":[{"index":0,"delta":{"content":"pring"},"finish_reason":null}]}

data: [DONE]

Field Description

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique request identifier |
| object | string | Fixed to chat.completion.chunk |
| created | integer | Response creation timestamp |
| model | string | Name of the model used |
| choices | array | Array of generation results |
| choices[].delta | object | Incremental content |
| choices[].delta.role | string | Role (present only in the first chunk) |
| choices[].delta.content | string | Generated text fragment |
| choices[].finish_reason | string | Finish reason: stop, length, content_filter, or null |

End Marker

The streaming response ends with a data: [DONE] event.
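If you are not using an SDK, you can consume the SSE stream directly. The following is a minimal sketch using the httpx library (an assumption; any HTTP client with streaming support works) and assuming the OpenAI-compatible /chat/completions path. It watches for the data: prefix and stops at the [DONE] marker:

import json
import httpx

payload = {
    "model": "Ling-2.6-flash",
    "messages": [{"role": "user", "content": "Please write a poem about spring"}],
    "stream": True,
}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

# Stream the raw SSE response line by line
with httpx.stream("POST", "https://api.ant-ling.com/v1/chat/completions",
                  json=payload, headers=headers, timeout=None) as response:
    response.raise_for_status()  # Connection-stage errors surface here
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue  # Skip blank separator lines between events
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # End marker
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            print(delta["content"], end="")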


Complete Example: Building a Chat Interface

The following is a complete web chat application example showing how to use streaming in a real project.

Backend (Node.js + Express)

import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json()); // Parse JSON request bodies

const client = new OpenAI({
  baseURL: 'https://api.ant-ling.com/v1',
  apiKey: process.env.ANT_LING_API_KEY,
});

app.post('/chat', async (req, res) => {
  const { messages } = req.body;

  // Set SSE response headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = await client.chat.completions.create({
      model: 'Ling-2.6-flash',
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        // Send data in SSE format
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
      }
    }

    // Send the end marker
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (error) {
    res.write(`data: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
});

app.listen(3000);

Frontend (JavaScript)

async function sendMessage(messages) {
  const response = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n\n');
    buffer = lines.pop(); // Keep the incomplete part

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') {
          console.log('Streaming response ended');
          return;
        }
        try {
          const parsed = JSON.parse(data);
          if (parsed.content) {
            // Update the UI with the new content
            appendToChat(parsed.content);
          }
        } catch (e) {
          console.error('Parse error:', e);
        }
      }
    }
  }
}

function appendToChat(content) {
  // Append content to the chat interface
  const chatElement = document.getElementById('chat-output');
  chatElement.textContent += content;
}

Advanced Usage

Async Iteration Processing

In Python, you can use async for to process the stream asynchronously:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

async def stream_chat():
    response = await client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Tell a story"}],
        stream=True
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(stream_chat())

Cancelling Streaming Requests

When a user stops generation midway, cancel the request promptly to free up resources:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

# Use a with statement to ensure the connection is released properly
with client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a long novel"}],
    stream=True
) as response:
    for chunk in response:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="")

        # Break out of the loop when the user clicks the stop button
        if user_clicked_stop():
            break
# The connection is closed automatically when the with block exits

Error Handling

Errors in a streaming response can surface in two ways:

  1. Connection-stage errors: the HTTP status code is not 200 and a JSON error body is returned directly
  2. Streaming-stage errors: an error event is returned inside the SSE stream

Error Handling Example

from openai import OpenAI, APIError

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY"
)

try:
    response = client.chat.completions.create(
        model="Ling-2.6-flash",
        messages=[{"role": "user", "content": "Hello"}],
        stream=True
    )
    for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
except APIError as e:
    print(f"API Error: {e.message}")
except Exception as e:
    print(f"Other Error: {e}")

Best Practices

1. Always Handle Empty Content

A streaming response may contain chunks whose content is empty, so always check before using it:

for chunk in response:
    content = chunk.choices[0].delta.content
    if content:  # Make sure content is not empty
        print(content, end="")

2. Set Timeout Appropriately

Streaming requests can stay open for a long time, so set an appropriate timeout:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.ant-ling.com/v1",
    api_key="YOUR_API_KEY",
    timeout=60.0  # 60-second timeout
)
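If most of your requests are short but a few streaming calls need more headroom, the OpenAI Python SDK also supports a per-request override via with_options. This is a sketch under that assumption; check that your SDK version provides with_options:

# Override the timeout for a single long-running streaming request
response = client.with_options(timeout=300.0).chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Write a long novel"}],
    stream=True
)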

3. Buffer Rendering Optimization

When rendering on the frontend, you can use a buffering strategy to avoid overly frequent DOM updates:

let buffer = '';
let renderTimer = null;

function bufferContent(content) {
  buffer += content;

  // Batch updates: flush at most once every 50 ms
  if (!renderTimer) {
    renderTimer = setTimeout(() => {
      appendToChat(buffer);
      buffer = '';
      renderTimer = null;
    }, 50);
  }
}

4. Monitor Token Usage

The chunk that ends generation carries a non-null finish_reason; depending on the API, a trailing chunk may also include usage information:

for chunk in response:
    # Process content
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

    # Check the finish reason
    finish_reason = chunk.choices[0].finish_reason
    if finish_reason:
        print(f"\nGeneration ended, reason: {finish_reason}")
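On OpenAI-compatible APIs, token usage for a streamed request is typically reported only when you opt in via stream_options; whether Ant Ling supports this parameter is an assumption, so treat the following as a sketch:

response = client.chat.completions.create(
    model="Ling-2.6-flash",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={"include_usage": True}  # Assumed to be supported; opts in to usage stats
)

for chunk in response:
    # The usage chunk is typically sent last and has an empty choices list
    if chunk.usage:
        print(f"\nTokens used: {chunk.usage.total_tokens}")
    elif chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")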
