Feature #289


AI Communication Coach Service

Added by Harikrishnan Murugan about 1 month ago. Updated about 1 month ago.

Status: Pending
Priority: High
Assignee:
Target version: -
Start date: 11/24/2025
Due date: 11/28/2025 (34 days late)
% Done: 0%
Estimated time:
Prioritization: P0

Description

Task Description

Build the AI Communication Coach service that helps users practice English communication skills through realistic conversation scenarios. The coach generates appropriate practice scenarios, conducts natural conversations, tracks duration, and provides comprehensive feedback on fluency, pronunciation, grammar, and vocabulary.

The coach uses two integrated tools (scenario generation and feedback generation) to deliver the complete practice experience. Work is split between system prompt design (Prisha) and technical integration (Ishita), with both collaborating on end-to-end testing.

Repository: https://github.com/harikrishnan-crayond/ai-coach-gen-ai
Branching: Check out from feature/production-setup and create a new branch for this work

Complete User Journey

User starts practice session:

  1. User selects their profile (user_id), proficiency level (beginner/intermediate/advanced), and optional topic of interest
  2. Backend initializes AI Communication Coach with user context
  3. First-time users: Coach welcomes user and informs them about generating a practice scenario
    Returning users: Coach directly informs about generating next scenario (skip welcome)
  4. Loading state shown: "Generating your practice scenario..."
  5. Scenario appears on screen with title, description, and duration (e.g., 8 minutes)

Conversation practice:
6. User and coach engage in natural conversation based on the scenario
7. Coach tracks elapsed time and plans to wrap up gracefully within ±30 seconds of target duration
8. All audio (user and coach) is captured throughout the session

Feedback delivery:
9. When duration completes, coach calls feedback_generation tool
10. Tool immediately stops recording (audiobuffer.stop_recording()) to save compute resources
11. Loading state shown: "Analyzing your performance..."
12. Tool generates comprehensive feedback while user waits
13. User sees detailed analysis displayed as text: overall scores, fluency metrics, mispronounced words, grammar corrections, vocabulary suggestions
14. User can choose to practice the same scenario again or start a new one

Why the session stops during feedback generation: recording stops to free compute resources (the WebRTC connection), since feedback generation takes time and the user reviews the feedback before starting the next session.

System Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     USER STARTS SESSION                          │
│  Frontend: User selects user_id, level, topic (optional)        │
│  POST /sessions/start {user_id, level, topic?}                  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│              AI COMMUNICATION COACH INITIALIZES                  │
│                                                                   │
│  • Load user profile from users.json (industry, role)            │
│  • Check if first-time user (for welcome message)                │
│  • System Prompt with user context                               │
│  • Register Tools: scenario_generation, feedback_generation      │
│  • AudioBufferProcessor ready for audio capture                  │
│  • WebRTC connection established                                 │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│               SCENARIO GENERATION (Tool Call)                    │
│                                                                   │
│  First-time: Welcome + inform about scenario generation          │
│  Returning: Directly inform about next scenario generation       │
│                                                                   │
│  Tool start event → Frontend shows: "Generating scenario..."     │
│  Coach → scenario_generation(user_id, topic?, personalization)   │
│  Returns: {title, description, duration}                         │
│  Update: active_sessions[session_id]["scenario_details"]         │
│  Sent to frontend via rtvi.send_server_message()                │
│  Store scenario_start_timestamp                                  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│           CONVERSATION PRACTICE (Live Audio via WebRTC)          │
│                                                                   │
│  • Scenario displayed, conversation auto-starts                  │
│  • Coach conducts natural conversation based on scenario         │
│  • Time tracking: Inject periodic time updates via               │
│    LLMMessagesAppendFrame ("[Time remaining: X.X minutes]")      │
│  • Coach plans wrap-up at duration ±30 seconds                   │
│  • AudioBufferProcessor captures user + coach audio              │
│  • Continuous update: active_sessions[session_id]["audio_msgs"] │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│            FEEDBACK GENERATION (Tool Call)                       │
│                                                                   │
│  Tool start event → Frontend shows: "Analyzing performance..."   │
│  FIRST: audiobuffer.stop_recording() (save compute)              │
│  Coach → feedback_generation(session_id)                         │
│  Tool retrieves: audio_messages + scenario_details from store    │
│  Analyzes with Gemini                                            │
│  Returns: {overall_scores, fluency_analysis, pronunciation,      │
│            grammar_analysis, vocabulary_analysis}                │
│  Update: active_sessions[session_id]["feedback"]                 │
│  Sent to frontend via rtvi.send_server_message()                │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│             FEEDBACK DISPLAYED TO USER (Text Only)               │
│  User reviews performance analysis in text format                │
│  • Overall scores                                                │
│  • Fluency metrics                                               │
│  • Mispronounced words (text list)                               │
│  • Grammar corrections                                           │
│  • Vocabulary suggestions                                        │
│                                                                   │
│  Options: "Practice Same Scenario" or "New Scenario"             │
└─────────────────────────────────────────────────────────────────┘

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                    IN-MEMORY STORAGE STRUCTURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Python FastAPI global dictionary:

active_sessions = {}

active_sessions[session_id] = {
    "user_id": "user_001",
    "scenario_details": {
        "input": {
            "user_id": "user_001",
            "industry": "technology",
            "role": "software engineer",
            "level": "intermediate",
            "topic": "client meeting",
            "personalization_note": ""
        },
        "output": {
            "title": "Client Meeting Simulation",
            "description": "Practice professional communication...",
            "duration": 8
        }
    },
    "audio_messages": [
        {"role": "user", "audio": "base64...", "timestamp": 1234.56, "duration_ms": 3000},
        {"role": "assistant", "audio": "base64...", "timestamp": 1238.12, "duration_ms": 2500}
    ],
    "feedback": {
        "overall_scores": {"fluency": 8.0, "pronunciation": 7.0, ...},
        "fluency_analysis": {...},
        "pronunciation": {...},
        "grammar_analysis": {...},
        "vocabulary_analysis": {...}
    },
    "scenario_start_timestamp": 1732467234.56,
    ...
}

Storage updates during session:
- scenario_generation tool → Updates scenario_details
- AudioBufferProcessor → Continuously appends to audio_messages
- feedback_generation tool → Updates feedback

Technical Approach

Branching Strategy:

  1. Check out the feature/production-setup branch
  2. Create new branch for AI Communication Coach integration work
  3. Push changes to new branch

System Prompt Design (Prisha):

  • Create AI Communication Coach system message with user context (industry, role, level)
  • Include first-time user detection logic (check if user has previous sessions)
  • Define coach behavior:
    • First-time: Welcome user + inform about scenario generation (user-friendly, no technical details)
    • Returning: Directly inform about next scenario generation
  • Define coach responsibilities: generate scenarios, conduct natural conversations, track duration, call feedback_generation tool
  • Document available tools and their usage patterns
  • Single system message at initialization
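The prompt-construction logic above can be sketched as a small helper. This is an illustrative draft only (the function name and exact wording are assumptions, not the final prompt Prisha will write):

```python
# Hypothetical sketch of the system-prompt builder; the helper name,
# wording, and user dict keys are illustrative assumptions.
def build_coach_system_prompt(user: dict, is_first_time: bool) -> str:
    """Compose the single system message sent at initialization."""
    greeting_rule = (
        "Welcome the user warmly, then let them know you are generating "
        "a practice scenario."
        if is_first_time
        else "Skip the welcome; directly say you are generating the next scenario."
    )
    return (
        "You are an AI Communication Coach helping the user practice English.\n"
        f"User context: industry={user['industry']}, role={user['role']}, "
        f"level={user['level']}.\n"
        f"{greeting_rule}\n"
        "Responsibilities: generate scenarios via the scenario_generation tool, "
        "conduct a natural conversation, track duration, and call the "
        "feedback_generation tool when the scenario duration completes.\n"
        "Never mention tool names or technical details to the user."
    )
```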

Tool Registration & Loading States (Ishita):

  • Register scenario_generation tool with Pipecat
  • Register feedback_generation tool with Pipecat
  • Implement tool start events for loading states:
    • scenario_generation starts → Send event to frontend: "Generating your practice scenario..."
    • feedback_generation starts → Send event to frontend: "Analyzing your performance..."
  • Implement event-driven tool responses using rtvi.send_server_message()
  • Both tools return results via params.result_callback() for LLM context

Temporal Context Implementation (Ishita):

  • Implement time updates using LLMMessagesAppendFrame
  • Inject periodic time remaining messages as user role
  • Calculate: remaining_time = scenario_duration - (current_time - scenario_start_time)
  • Enable AI coach to plan graceful wrap-up at duration ±30 seconds
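The remaining-time calculation and message format above can be sketched as a pure function (the frame injection itself goes through Pipecat's LLMMessagesAppendFrame and is not shown here):

```python
# Sketch of the periodic time-update message, using the format from the
# spec: "[Time remaining: X.X minutes]". Timestamps are epoch seconds.
def time_remaining_message(scenario_duration_min: float,
                           scenario_start_ts: float,
                           now_ts: float) -> str:
    elapsed_min = (now_ts - scenario_start_ts) / 60.0
    # Clamp at zero so the coach never sees negative time remaining.
    remaining = max(scenario_duration_min - elapsed_min, 0.0)
    return f"[Time remaining: {remaining:.1f} minutes]"
```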

In-Memory Storage Management (Ishita):

  • Implement global active_sessions = {} dictionary
  • Structure per session: user_id, scenario_details (input + output), audio_messages, feedback, timestamps
  • scenario_generation tool updates: active_sessions[session_id]["scenario_details"]
  • AudioBufferProcessor continuously updates: active_sessions[session_id]["audio_messages"]
  • feedback_generation tool updates: active_sessions[session_id]["feedback"]
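The three writers above can be sketched as small helpers around the global dictionary (helper names are illustrative, not prescribed by the spec):

```python
# Minimal sketch of the in-memory session store and its writers.
active_sessions: dict = {}

def init_session(session_id: str, user_id: str) -> None:
    active_sessions[session_id] = {
        "user_id": user_id,
        "scenario_details": None,
        "audio_messages": [],
        "feedback": None,
        "scenario_start_timestamp": None,
    }

def store_scenario(session_id: str, inputs: dict, output: dict,
                   start_ts: float) -> None:
    # Called by the scenario_generation tool.
    s = active_sessions[session_id]
    s["scenario_details"] = {"input": inputs, "output": output}
    s["scenario_start_timestamp"] = start_ts

def append_audio(session_id: str, role: str, audio_b64: str,
                 timestamp: float, duration_ms: int) -> None:
    # Called continuously by the AudioBufferProcessor callback.
    active_sessions[session_id]["audio_messages"].append(
        {"role": role, "audio": audio_b64,
         "timestamp": timestamp, "duration_ms": duration_ms})
```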

Feedback Generation Tool Logic (Reference Issue #288):

  • FIRST action when tool called: audiobuffer.stop_recording() (saves compute)
  • Retrieve audio_messages and scenario_details from active_sessions storage
  • Generate feedback with Gemini
  • Update storage with feedback results
  • Send to frontend

Session Management (Ishita):

  • Track user session history to determine first-time vs returning user
  • Store flag in active_sessions for coach behavior differentiation
  • Implement session stop logic inside the feedback_generation tool (not triggered manually by the coach)
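The first-time detection above can be sketched as a one-line check (this assumes session history is kept as a mapping from user_id to past session IDs, which is an illustrative choice):

```python
# Sketch of first-time detection; the history structure is an assumption.
def is_first_time_user(user_id: str, session_history: dict) -> bool:
    """True when the user has no recorded previous sessions."""
    return len(session_history.get(user_id, [])) == 0
```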

Acceptance Criteria

Complete End-to-End Flow Working:

  • User selects user_id, level, topic → AI coach detects first-time vs returning user
  • First-time users: Coach welcomes and informs about scenario generation (user-friendly language)
  • Returning users: Coach directly informs about next scenario generation (no welcome)
  • Loading state displayed: "Generating your practice scenario..."
  • Scenario displayed with title, description, duration → Conversation auto-starts
  • AI coach conducts natural conversation based on scenario context
  • Time tracking enables coach to plan wrap-up at duration ±30 seconds
  • Coach calls feedback_generation tool
  • Tool stops recording immediately, loading state displayed: "Analyzing your performance..."
  • Comprehensive feedback displayed as text only: overall scores, fluency analysis, pronunciation errors (text list), grammar corrections, vocabulary suggestions
  • User can choose to practice same scenario again or start new scenario
  • Event-driven tool responses working (no polling required)
  • All 3 static users tested successfully with different scenarios (both first-time and returning behavior)
  • Audio messages captured correctly throughout all sessions

System Prompt (Prisha):

  • System message includes user context (industry, role, level)
  • First-time user detection logic included
  • Defines coach behavior for first-time vs returning users
  • Coach communication is user-friendly without technical details
  • Lists available tools with usage patterns
  • Single system message at initialization

Tool Integration & Loading States (Ishita):

  • Both tools registered with Pipecat
  • Tool start events implemented for both tools with appropriate loading messages
  • Event-driven responses via rtvi.send_server_message()
  • Results also returned via params.result_callback() for LLM context
  • Frontend receives tool results without polling
  • Loading states displayed correctly during tool execution

Temporal Context (Ishita):

  • LLMMessagesAppendFrame injects time updates
  • Time remaining format: "[Time remaining: X.X minutes]"
  • Updates sent periodically during conversation
  • AI coach receives temporal context for wrap-up planning

In-Memory Storage (Ishita):

  • Global active_sessions dictionary implemented
  • Storage structure matches specification (user_id, scenario_details with input/output, audio_messages, feedback)
  • scenario_generation tool updates storage correctly
  • AudioBufferProcessor updates storage continuously
  • feedback_generation tool retrieves and updates storage correctly

Session Management (Ishita):

  • User session history tracked correctly
  • First-time vs returning user flag working
  • feedback_generation tool stops recording first (audiobuffer.stop_recording())
  • Session stop saves compute resources as intended
  • Text-only feedback display working

Success Indicators:

  • Complete user journey works seamlessly from session start to feedback display
  • Coach behavior adapts appropriately for first-time vs returning users
  • AI coach communication is natural and user-friendly
  • Loading states provide clear feedback during tool execution
  • Duration tracking enables proper conversation pacing
  • feedback_generation tool handles session stop correctly
  • In-memory storage structure working correctly across all tool calls
  • Feedback displayed as text format successfully
  • System ready for user testing with 3 static profiles

Resources/References

#1

Updated by Harikrishnan Murugan about 1 month ago

  • Tracker changed from Bug to Feature