December 31, 2025

Response Streaming on AWS Serverless

Response streaming transforms user experience by sending data progressively as it becomes available. This post includes a description of a Serverless Land pattern demonstrating an API Gateway REST API that invokes a Python Lambda function.

Originally published on https://builder.aws.com/content/36Fep43fu6dkfrif200G5nM7Wj0/response-streaming-on-aws-serverless

Response streaming allows you to incrementally stream responses back to clients rather than waiting for the entire response to be buffered first, reducing Time to First Byte (TTFB) and making your applications more responsive to users. Let's start with a quick review of how web technology has evolved and an overview of what response streaming is and how it works, before diving into how response streaming is supported on AWS.

Evolution of the web

The web started with simple request-response interactions where users waited for complete page reloads after every click. AJAX and JavaScript in the early 2000s enabled dynamic updates without page refreshes, but still required waiting for complete responses from the server. REST APIs and GraphQL improved how applications exchanged structured data, yet both still delivered complete payloads after processing finished. WebSockets introduced real-time bidirectional communication. Today's applications often need to handle large datasets or AI inference that takes time, creating poor user experiences with long loading periods. Response streaming solves this by sending data progressively as it becomes available, letting users see results appear incrementally rather than waiting for everything to complete.

Inside the tech

HTTP/1.1 introduced chunked transfer encoding and persistent connections, which laid the foundation for response streaming but was limited by head-of-line blocking, where only one request could be processed at a time per connection. HTTP/2 revolutionized streaming with binary multiplexing, allowing multiple concurrent streams over a single connection, plus flow control and stream prioritization to manage backpressure and resource allocation. HTTP/3, built on QUIC, eliminated TCP-level head-of-line blocking entirely and provided faster connection establishment, making streaming even more efficient. These protocol advances transformed streaming from a workaround requiring multiple connections to a first-class feature where applications can handle dozens of concurrent streams efficiently on a single connection. Response streaming doesn't require HTTP/3 and QUIC. It works with HTTP/1.1's chunked transfer encoding (available since 1997), which most streaming APIs like ChatGPT and AWS Lambda Response Streaming use today. HTTP/2 and HTTP/3 provide performance improvements like multiplexing and better mobile performance, but they're enhancements, not requirements for streaming functionality.

Response streaming works by keeping HTTP connections open and sending data in chunks as it becomes available, rather than waiting for complete responses. The server uses Transfer-Encoding: chunked or Server-Sent Events (text/event-stream) to send progressive data, while clients use JavaScript's Fetch API with streams or EventSource to process chunks immediately as they arrive. This requires careful connection management, backpressure handling to prevent overwhelming slow clients, and error handling for mid-stream failures. The key benefit is transforming user experience from "wait then see everything" to "see results as they happen," reducing perceived latency and memory usage while improving responsiveness for applications like AI text generation, large dataset queries, and real-time updates.

Response streaming serves two key purposes: improving responsiveness for slow operations (like AI generation, where users see progress immediately) and enabling memory-efficient handling of large files (streaming gigabyte files with only kilobytes of RAM usage instead of loading everything into memory). This makes streaming essential both for user experience and for server scalability when dealing with large data.

Response streaming dramatically improves TTFB by sending the first chunk within 50-200ms instead of waiting seconds or minutes for complete processing, reducing TTFB by 80-99% for complex operations. While total processing time remains the same, perceived latency drops significantly because users see immediate progress rather than blank loading screens. For example, AI text generation goes from a 30-second TTFB to 100ms with words appearing progressively, making applications feel 30x faster even though the underlying computation takes the same time.
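To illustrate the client side of this, here is a minimal sketch (not taken from any particular service) of consuming a chunked HTTP response in Python with the requests library. The endpoint URL is purely hypothetical; a browser would do the equivalent with the Fetch API and a readable stream.

```python
import requests

# Hypothetical streaming endpoint - the URL is illustrative only.
STREAM_URL = "https://api.example.com/generate"

# stream=True keeps the connection open and lets us read the body
# incrementally instead of buffering the complete response in memory.
with requests.get(STREAM_URL, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1024):
        # Each chunk is handled as soon as it arrives, so the user sees
        # partial output long before the full response has finished.
        print(chunk.decode("utf-8", errors="replace"), end="", flush=True)
```

On the server side, the progressive data typically comes from a generator or event stream flushed chunk by chunk; a sketch of that side appears later in this post.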

Response streaming on AWS

If you wanted to build an application or API with response streaming, what AWS services would you use? AWS introduced native response streaming with Lambda Response Streaming in April 2023, marking the first time serverless functions could stream responses directly. Prior to this, AWS had streaming capabilities in S3, CloudFront, and Kinesis, but developers needed complex workarounds to achieve response streaming in serverless applications. The 2023 Lambda announcement was significant because it eliminated the need for multi-service architectures just to stream responses from functions. However, Lambda supported native response streaming only for Node.js, and only through Lambda Function URLs, not API Gateway or ALB. For other runtimes like Python or Java, you can use Lambda Web Adapter for response streaming. For response streaming with WebSockets, you can use API Gateway WebSocket API or AppSync. This blog post covers three serverless options for response streaming (a short sketch of consuming a streamed Lambda response follows the list):

  1. AWS Lambda function URLs with response streaming
  2. Amazon API Gateway WebSocket APIs
  3. AWS AppSync GraphQL subscriptions
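To make the first option concrete, here is a minimal, hypothetical sketch of consuming a streamed Lambda response from Python with boto3's invoke_with_response_stream call (the InvokeWithResponseStream action mentioned later in this post). The function name and payload are placeholders; a client calling a streaming Function URL over HTTPS would instead read the HTTP stream directly, as in the earlier requests example.

```python
import boto3

lambda_client = boto3.client("lambda")

# "my-streaming-function" is a placeholder for a function configured for
# response streaming (e.g. a Node.js handler using streamifyResponse, or
# another runtime behind Lambda Web Adapter).
response = lambda_client.invoke_with_response_stream(
    FunctionName="my-streaming-function",
    Payload=b'{"prompt": "Tell me a story"}',
)

# The EventStream yields chunks as the function writes them, rather than a
# single buffered payload like the regular Invoke API returns.
for event in response["EventStream"]:
    if "PayloadChunk" in event:
        print(event["PayloadChunk"]["Payload"].decode("utf-8"), end="", flush=True)
    elif "InvokeComplete" in event:
        break
```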

Since November 2025, API Gateway REST APIs also support response streaming. This is exciting because it completes AWS's serverless streaming story: you can now build fully streaming AI applications without complex workarounds, avoid the 10MB payload limit and 29-second timeout restrictions, and deliver real-time user experiences where AI responses appear word by word instead of after long waits.

Amazon API Gateway REST API to AWS Lambda Python pattern

To demonstrate practically how to build a response streaming GenAI application, I have submitted a Serverless Land pattern (still being reviewed; for now, check the GitHub issue and PR, and code). This pattern deploys an API Gateway REST API to a Python Lambda function that invokes Bedrock, and all services are response-streaming enabled. Let's walk through how the pattern works:

  1. The pattern uses AWS SAM as IaC - you simply need to run sam build && sam deploy to get it working
  2. To enable response streaming on an API Gateway REST API, it sets responseTransferMode: "STREAM" on the API GW resource
  3. Then, on the API GW Lambda proxy integration, the integration ARN or URI uses a different API version date and a different service action compared to a standard Lambda proxy integration. This is because API GW now invokes Lambda using the InvokeWithResponseStream action.
  4. In the Lambda function, I chose to use the Python runtime. However, at the time of writing (December 2025), Lambda only natively supports the Node.js runtime for response streaming (native support for Python is on the Lambda roadmap). To use response streaming with another runtime, we need to use Lambda Web Adapter, which allows developers to package familiar HTTP 1.1/1.0 web applications, such as Express.js, Next.js, Flask, SpringBoot, or Laravel, and deploy them on AWS Lambda. With Python and Lambda Web Adapter, I chose to use FastAPI, based closely on this FastAPI response streaming with Function URL example (a minimal sketch appears after this list).
  5. The Lambda function then calls Bedrock using the InvokeModelWithResponseStream API.
  6. The streamed response is passed back from Bedrock to Lambda, to API Gateway, and then on to the client.
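As an illustration of steps 4-6, here is a minimal sketch of what such a FastAPI handler can look like. It is not the pattern's actual code (check the linked repository for that), and the model ID, route, and request body shape are assumptions for illustration only.

```python
import json

import boto3
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
bedrock = boto3.client("bedrock-runtime")

# Model ID is an assumption for illustration; the pattern may use another model.
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"


@app.get("/stream")
def stream(prompt: str):
    def generate():
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
        })
        # Bedrock streams the model output back as a series of chunk events.
        response = bedrock.invoke_model_with_response_stream(
            modelId=MODEL_ID, body=body
        )
        for event in response["body"]:
            if "chunk" not in event:
                continue
            chunk = json.loads(event["chunk"]["bytes"])
            # For Anthropic models, incremental text arrives in
            # content_block_delta events; other models use different shapes.
            if chunk.get("type") == "content_block_delta":
                yield chunk["delta"].get("text", "")

    # StreamingResponse flushes each yielded piece to the client as it is
    # produced instead of buffering the whole completion.
    return StreamingResponse(generate(), media_type="text/plain")
```

With Lambda Web Adapter configured for its response streaming invoke mode, each chunk yielded by the generator is forwarded through Lambda and API Gateway to the client as soon as it is produced, which is what makes the word-by-word experience possible end to end.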