Response streaming transforms user experience by sending data progressively as it becomes available. This post also describes a Serverlessland pattern demonstrating an API Gateway REST API that invokes a Python Lambda function.
Originally published on https://builder.aws.com/content/36Fep43fu6dkfrif200G5nM7Wj0/response-streaming-on-aws-serverless
Response streaming allows you to incrementally stream responses back to clients rather than waiting for the entire response to be buffered first, reducing Time to First Byte (TTFB) and making your applications more responsive to users. Let's start with a quick review of how web technologies have evolved and an overview of what response streaming is and how it works, before diving into how response streaming is supported on AWS.
The web started with simple request-response interactions where users waited for complete page reloads after every click. AJAX and JavaScript in the early 2000s enabled dynamic updates without page refreshes, but still required waiting for complete responses from the server. REST APIs and GraphQL improved how applications exchanged structured data, yet both still delivered complete payloads after processing finished. WebSockets introduced real-time bidirectional communication. Today's applications often need to handle large datasets or AI inference that takes time, creating poor user experiences with long loading periods. Response streaming solves this by sending data progressively as it becomes available, letting users see results appear incrementally rather than waiting for everything to complete.
HTTP/1.1 introduced chunked transfer encoding and persistent connections, which laid the foundation for response streaming but was limited by head-of-line blocking, where only one request could be processed at a time per connection. HTTP/2 revolutionized streaming with binary multiplexing, allowing multiple concurrent streams over a single connection, plus flow control and stream prioritization to manage backpressure and resource allocation. HTTP/3, built on QUIC, eliminated TCP-level head-of-line blocking entirely and provided faster connection establishment, making streaming even more efficient. These protocol advances transformed streaming from a workaround requiring multiple connections into a first-class feature where applications can handle dozens of concurrent streams efficiently on a single connection. Response streaming doesn't require HTTP/3 and QUIC: it works with HTTP/1.1's chunked transfer encoding (available since 1997), which most streaming APIs, like ChatGPT and AWS Lambda Response Streaming, use today. HTTP/2 and HTTP/3 provide performance improvements like multiplexing and better mobile performance, but they're enhancements, not requirements for streaming functionality.

Response streaming works by keeping HTTP connections open and sending data in chunks as it becomes available, rather than waiting for complete responses. The server uses Transfer-Encoding: chunked or Server-Sent Events (text/event-stream) to send progressive data, while clients use JavaScript's Fetch API with streams or EventSource to process chunks immediately as they arrive. This requires careful connection management, backpressure handling to prevent overwhelming slow clients, and error handling for mid-stream failures. The key benefit is transforming the user experience from "wait, then see everything" to "see results as they happen," reducing perceived latency and memory usage while improving responsiveness for applications like AI text generation, large dataset queries, and real-time updates.

Response streaming serves two key purposes: improving responsiveness for slow operations (like AI generation, where users see progress immediately) and enabling memory-efficient handling of large files (streaming gigabyte files with only kilobytes of RAM instead of loading everything into memory). This makes streaming essential both for user experience and for server scalability when dealing with large data.

Response streaming dramatically improves TTFB by sending the first chunk within 50-200 ms instead of waiting seconds or minutes for complete processing, reducing TTFB by 80-99% for complex operations. While total processing time remains the same, perceived latency drops significantly because users see immediate progress rather than blank loading screens. For example, AI text generation goes from a 30-second TTFB to 100 ms, with words appearing progressively, making applications feel 30x faster even though the underlying computation takes the same time.
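To make the client side concrete, here is a minimal Python sketch of consuming a chunked/SSE response stream with the requests library. The endpoint URL and the event payload format are assumptions for illustration, not a real API.

```python
import json
import requests

# Hypothetical streaming endpoint; URL and event format are placeholders.
STREAM_URL = "https://example.com/generate"

with requests.post(STREAM_URL, json={"prompt": "Hello"}, stream=True) as resp:
    resp.raise_for_status()
    # stream=True tells requests not to buffer the whole body;
    # iter_lines() yields each line as soon as its chunk arrives.
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue  # SSE events are separated by blank lines
        if line.startswith("data: "):
            payload = line[len("data: "):]
            if payload == "[DONE]":
                break  # a common sentinel marking the end of the stream
            print(json.loads(payload).get("text", ""), end="", flush=True)
```

The client prints text as each chunk lands instead of waiting for the full response, which is exactly the TTFB improvement described above.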
If you wanted to build an application or API with response streaming, what AWS services would you use?

AWS introduced native response streaming with Lambda Response Streaming in April 2023, marking the first time serverless functions could stream responses directly. Prior to this, AWS had streaming capabilities in S3, CloudFront, and Kinesis, but developers needed complex workarounds to achieve response streaming in serverless applications. The 2023 Lambda announcement was significant because it eliminated the need for multi-service architectures just to stream responses from functions. However, native Lambda response streaming was supported only for Node.js, only through Lambda Function URLs, and not through API Gateway or ALB. For other runtimes like Python or Java, you can use the Lambda Web Adapter for response streaming. For response streaming with WebSockets, you can use API Gateway WebSocket APIs or AppSync. This blog post covers three serverless options for response streaming:
Since November 2025, API Gateway REST APIs also support response streaming. This is exciting because it completes AWS's serverless streaming story: you can now build fully streaming AI applications without complex workarounds, escape the 10 MB payload limit and 29-second timeout restrictions, and deliver real-time user experiences where AI responses appear word by word instead of after long waits.
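For Python and other runtimes without native Lambda streaming, the Lambda Web Adapter route mentioned above looks roughly like the following minimal sketch. It assumes the Lambda Web Adapter layer is attached to the function and its invoke mode is set to response streaming (e.g. AWS_LWA_INVOKE_MODE=RESPONSE_STREAM); the route and payload are purely illustrative.

```python
# Minimal sketch: a FastAPI app that streams its response. Behind the
# Lambda Web Adapter (with response streaming enabled), each yielded
# chunk is flushed to the client as it is produced.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/stream")
async def stream_words():
    async def generate():
        for word in ["streaming", "from", "a", "python", "lambda", "function"]:
            yield word + " "          # one chunk per word
            await asyncio.sleep(0.2)  # simulate slow, incremental generation
    return StreamingResponse(generate(), media_type="text/plain")
```

The same application runs unchanged locally under uvicorn, which is part of the appeal of the Web Adapter approach.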
To demonstrate practically how to build a response-streaming GenAI application, I have submitted a Serverlessland pattern (still under review; for now, check the GitHub issue, PR, and code). This pattern deploys an API Gateway REST API that invokes a Python Lambda function, which in turn calls Bedrock, and all services are response-streaming enabled. Let's walk through how the pattern works:
- Run `sam build && sam deploy` to get it working
- `responseTransferMode: "STREAM"` is set on the API GW resource
- The Lambda function is invoked via `InvokeWithResponseStream`
- Bedrock is called through the `InvokeModelWithResponseStream` method (see the sketch after this list)
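To illustrate the Bedrock side, here is a minimal sketch of calling InvokeModelWithResponseStream from Python and relaying chunks as they arrive. The model ID, request body shape, and surrounding handler wiring are assumptions for illustration, not the exact code of the pattern.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def stream_model_output(prompt: str):
    """Yield text chunks from Bedrock as soon as the model produces them."""
    response = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    # response["body"] is an event stream; each event carries one chunk of
    # the model's output, so text can be forwarded without buffering the reply.
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        if chunk.get("type") == "content_block_delta":
            yield chunk["delta"].get("text", "")
```

In the deployed pattern, each yielded chunk would be written to the function's streaming response so API Gateway can forward it to the client as it arrives, giving the word-by-word experience described earlier.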