Upload app.py

# Implement Real-Time Streaming for Chat Responses
## Description
This PR introduces real-time streaming to our chat interface, aiming to enhance the user experience by delivering immediate, token-by-token responses.
## Changes
- Enabled streaming in the HuggingFaceEndpoint configuration
- Implemented an asynchronous streaming process using `astream()`
- Modified the chat function to yield partial results in real-time
- Updated the Gradio event chains to support streaming responses (`queue=False` on the `start_chat` steps, queued streaming on the `chat` steps); a minimal sketch of this pattern follows the list
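
The wiring follows a simple pattern: an un-queued step posts the user's turn to the chatbot immediately, a queued step streams the answer by yielding partial results, and a final step (`finish_chat` in app.py) cleans up. The snippet below is a hypothetical, minimal reproduction of that pattern, not app.py itself; the handler names and the echoed text are illustrative only.

```python
import asyncio
import gradio as gr

def start_chat(message, history):
    # Runs with queue=False so the user's turn appears without waiting in line.
    return "", (history or []) + [(message, None)]

async def chat(history):
    # Queued streaming step: Gradio re-renders the Chatbot on every yield.
    query = history[-1][0]
    answer = ""
    for word in f"(streamed echo of: {query})".split():
        answer += word + " "
        await asyncio.sleep(0.05)          # stand-in for waiting on the next token
        history[-1] = (query, answer)
        yield history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    textbox = gr.Textbox(placeholder="Ask something...")
    (textbox
        .submit(start_chat, [textbox, chatbot], [textbox, chatbot], queue=False)
        .then(chat, [chatbot], [chatbot], queue=True))   # streaming generators run on the queue

if __name__ == "__main__":
    demo.queue().launch()
```

The split matters: generator handlers stream their partial outputs through Gradio's queue, while keeping the first step un-queued lets the user's message appear instantly instead of waiting behind other requests.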
## Expected Behavior
- Responses should start appearing immediately after a question is asked
- Text should stream in smoothly, word by word or token by token
- The final response should be identical to the non-streaming version
## Technical Details
Key components of the implementation (a condensed sketch of all four follows the list):
1. **Streaming Callback**: Attached a `StreamingStdOutCallbackHandler` to handle tokens as they arrive.
2. **LLM Configuration**: Added `streaming=True` to `HuggingFaceEndpoint` setup.
3. **Asynchronous Streaming**: Added an async `process_stream()` generator that consumes `astream()` for token-by-token response generation.
4. **Real-Time Updates**: Modified main loop to yield updates as they become available.
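
Taken together, the four pieces reduce to roughly the sketch below. It is a condensed, parameterized sketch rather than the code itself: the import paths are an assumption and vary with the installed LangChain version (app.py's own imports apply), and in app.py the endpoint URL and token come from `model_config` and `HF_token`, while `process_stream()` is a closure inside `chat()` that updates `answer_yet` via `nonlocal`.

```python
# Condensed sketch of the streaming path; assumes the langchain-huggingface /
# langchain-core import locations, which may differ from those used in app.py.
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from langchain_core.callbacks import StreamingStdOutCallbackHandler

def build_streaming_chat_model(endpoint_url: str, hf_token: str) -> ChatHuggingFace:
    # (1) + (2): endpoint configured with streaming enabled and a stdout callback attached
    llm_qa = HuggingFaceEndpoint(
        endpoint_url=endpoint_url,
        max_new_tokens=512,
        repetition_penalty=1.03,
        timeout=70,
        huggingfacehub_api_token=hf_token,
        streaming=True,
        callbacks=[StreamingStdOutCallbackHandler()],
    )
    return ChatHuggingFace(llm=llm_qa)

async def stream_answer(chat_model, messages, query, history, docs_html):
    # (3) + (4): accumulate tokens from astream() and yield one partial update per token
    answer_yet = ""
    async for chunk in chat_model.astream(messages):
        answer_yet += chunk.content          # app.py additionally parses this into
        history[-1] = (query, answer_yet)    # parsed_answer before display
        yield [tuple(x) for x in history], docs_html
```

Each yielded `(history, docs_html)` pair maps onto the `[chatbot, sources_textbox]` outputs in the Gradio event chain, which is what produces the token-by-token effect in the UI.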
Relevant hunks from the `app.py` diff:

```diff
@@ -217,29 +217,37 @@ async def chat(query,history,sources,reports,subtype,year):
 
     ##-----------------------getting inference endpoints------------------------------
 
+    # Set up the streaming callback handler
     callback = StreamingStdOutCallbackHandler()
 
+    # Initialize the HuggingFaceEndpoint with streaming enabled
     llm_qa = HuggingFaceEndpoint(
         endpoint_url=model_config.get('reader', 'ENDPOINT'),
         max_new_tokens=512,
         repetition_penalty=1.03,
         timeout=70,
         huggingfacehub_api_token=HF_token,
-        streaming=True,
-        callbacks=[callback]
+        streaming=True,       # Enable streaming for real-time token generation
+        callbacks=[callback]  # Add the streaming callback handler
     )
 
+    # Create a ChatHuggingFace instance with the streaming-enabled endpoint
     chat_model = ChatHuggingFace(llm=llm_qa)
 
+    # Prepare the HTML for displaying source documents
     docs_html = []
     for i, d in enumerate(context_retrieved, 1):
         docs_html.append(make_html_source(d, i))
     docs_html = "".join(docs_html)
 
+    # Initialize the variable to store the accumulated answer
     answer_yet = ""
 
+    # Define an asynchronous generator function to process the streaming response
     async def process_stream():
-        nonlocal answer_yet
+        # Without nonlocal, Python would create a new local variable answer_yet inside process_stream(), instead of modifying the one from the outer scope.
+        nonlocal answer_yet  # Use the outer scope's answer_yet variable
+        # Iterate over the streaming response chunks
         async for chunk in chat_model.astream(messages):
             token = chunk.content
             answer_yet += token
@@ -247,9 +255,10 @@ async def chat(query,history,sources,reports,subtype,year):
             history[-1] = (query, parsed_answer)
             yield [tuple(x) for x in history], docs_html
 
+    # Stream the response updates
     async for update in process_stream():
         yield update
-
+
     # #callbacks = [StreamingStdOutCallbackHandler()]
     # llm_qa = HuggingFaceEndpoint(
     #     endpoint_url= model_config.get('reader','ENDPOINT'),
@@ -508,11 +517,13 @@ with gr.Blocks(title="Audit Q&A", css= "style.css", theme=theme,elem_id = "main-
     # https://www.gradio.app/docs/gradio/textbox#event-listeners-arguments
     (textbox
         .submit(start_chat, [textbox, chatbot], [textbox, tabs, chatbot], queue=False, api_name="start_chat_textbox")
+        # queue must be set as False (default) so the process is not waiting for another to be finished
         .then(chat, [textbox, chatbot, dropdown_sources, dropdown_reports, dropdown_category, dropdown_year], [chatbot, sources_textbox], queue=True, concurrency_limit=8, api_name="chat_textbox")
         .then(finish_chat, None, [textbox], api_name="finish_chat_textbox"))
 
     (examples_hidden
         .change(start_chat, [examples_hidden, chatbot], [textbox, tabs, chatbot], queue=False, api_name="start_chat_examples")
+        # queue must be set as False (default) so the process is not waiting for another to be finished
         .then(chat, [examples_hidden, chatbot, dropdown_sources, dropdown_reports, dropdown_category, dropdown_year], [chatbot, sources_textbox], concurrency_limit=8, api_name="chat_examples")
         .then(finish_chat, None, [textbox], api_name="finish_chat_examples")
     )
```