To Do #156: Speech Analysis POC with Gemini - AI Communication Coach - Redmine

Added by Harikrishnan Murugan 5 months ago. Updated 4 months ago.

Status: Closed
Priority: High
Assignee: Prisha S
Target version:
Start date: 10/07/2025
Due date: 10/17/2025
% Done: 100%
Estimated time:
Prioritization:

Description

Task Description

Conduct a proof of concept to test whether Gemini 2.5 Flash (or Flash Lite) can analyze audio input and produce a detailed speech assessment comparable to SpeechSuper's unscripted speech evaluation. Try the Fluently App (https://getfluently.app/) first to see how speech analysis works in practice and to gauge the expected output quality and structure.

Technical Approach

Use Gemini's multimodal capability via the google-genai Python package to process audio input directly (audio length: 5 seconds to 10 minutes)
Create a test suite with 5-10 diverse audio samples (different Indian English accents and proficiency levels, varying lengths)
Implement prompt engineering to analyze speech for:
- Fluency and Coherence: Speech length, speaking rate, pause fillers, long pauses
- Pronunciation: Identify ONLY mispronounced words with accuracy scoring (these words will be converted to audio for correct pronunciation playback)
- Grammar: Error detection with correction suggestions
- Vocabulary: Word choice assessment with better vocabulary suggestions
Design a JSON output structure based on the SpeechSuper format (reference: the unscripted speech assessment at https://www.speechsuper.com/)
Generate scores out of 10 to one decimal place (e.g., 4.5, 6.5, 8.0, 9.5), each with a detailed breakdown
Test both Gemini 2.5 Flash and Flash Lite models
Document response time and output quality
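
The approach above could be sketched roughly as follows. This is a minimal sketch, not the implementation used in the POC: the prompt wording, the helper names (guess_mime_type, analyze_audio), and the extension-to-MIME mapping are assumptions made for illustration, and the model IDs should be checked against the current Gemini model list. Audio is sent as inline bytes here, which suits short clips; for recordings near the 10-minute end of the range, the SDK's file upload (client.files.upload) may be the better fit.

```python
import json
import time
from pathlib import Path

# Assumed model IDs for the two variants named in the ticket; verify before use.
MODELS = ("gemini-2.5-flash", "gemini-2.5-flash-lite")

# Hypothetical mapping from file extension to the MIME type sent with the audio.
MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

# Illustrative prompt only; the real prompt is the subject of the POC.
PROMPT = (
    "You are a speech assessment engine. Listen to the attached audio and "
    "return JSON only. Score fluency, pronunciation, grammar and vocabulary "
    "out of 10 with one decimal. Report speech length, speaking rate, pause "
    "fillers and long pauses. List ONLY mispronounced words with an accuracy "
    "score, grammar errors with corrections, and better vocabulary suggestions."
)


def guess_mime_type(path: str) -> str:
    """Map an audio file extension to a MIME type, rejecting unknown ones."""
    suffix = Path(path).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"unsupported audio extension: {suffix!r}")
    return MIME_TYPES[suffix]


def analyze_audio(path: str, model: str, api_key: str) -> tuple[dict, float]:
    """Send one audio file to Gemini; return (parsed JSON, seconds elapsed)."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    audio = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_mime_type(path),
    )
    started = time.perf_counter()
    response = client.models.generate_content(
        model=model,
        contents=[PROMPT, audio],
        # Ask for a JSON body so the reply can be parsed directly.
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    elapsed = time.perf_counter() - started
    return json.loads(response.text), elapsed
```

Calling analyze_audio once per model in MODELS for each sample would give the per-model response times and outputs that the last two items ask to be documented.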

Acceptance Criteria

Audio input successfully processed by Gemini using the google-genai package (5 seconds to 10 minutes range)
JSON output includes:
- Overall scores for: Fluency, Pronunciation, Grammar, Vocabulary (out of 10 with decimals)
- Detailed metrics (speech length, speaking rate, pause fillers, etc.)
- Mispronounced words only with accuracy scores (for audio playback generation)
- Grammar errors with correction suggestions
- Vocabulary assessment with better word suggestions
Test suite covers 5-10 diverse audio samples with varying lengths
Both Gemini 2.5 Flash and Flash Lite tested
Response time documented for different audio lengths
Output quality assessment documented
JSON structure clearly defined and matches SpeechSuper format
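
One way to check a model reply against these criteria is a small validator, sketched below. The field names (scores, metrics, mispronounced_words, grammar_errors, vocabulary_suggestions) are hypothetical placeholders chosen for this example; they are not taken from SpeechSuper's actual response format, which the POC still has to map. The sample report contains invented values for illustration only.

```python
# Hypothetical field names; align them with the agreed JSON structure.
REQUIRED_SCORES = ("fluency", "pronunciation", "grammar", "vocabulary")
REQUIRED_SECTIONS = (
    "metrics",
    "mispronounced_words",
    "grammar_errors",
    "vocabulary_suggestions",
)


def validate_report(report: dict) -> list[str]:
    """Return a list of problems; an empty list means the report passed."""
    problems = []
    scores = report.get("scores", {})
    for name in REQUIRED_SCORES:
        value = scores.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"scores.{name}: missing or not a number")
        elif not 0 <= value <= 10:
            problems.append(f"scores.{name}: {value} is outside 0-10")
    for key in REQUIRED_SECTIONS:
        if key not in report:
            problems.append(f"{key}: missing")
    for i, entry in enumerate(report.get("mispronounced_words", [])):
        if "word" not in entry or "accuracy" not in entry:
            problems.append(f"mispronounced_words[{i}]: needs 'word' and 'accuracy'")
    return problems


# Invented sample, for illustration only.
sample = {
    "scores": {"fluency": 6.5, "pronunciation": 8.0, "grammar": 4.5, "vocabulary": 9.5},
    "metrics": {"speech_length_seconds": 42.0, "speaking_rate_wpm": 128,
                "pause_fillers": 3, "long_pauses": 1},
    "mispronounced_words": [{"word": "develop", "accuracy": 5.5}],
    "grammar_errors": [{"error": "He go to office",
                        "correction": "He goes to the office"}],
    "vocabulary_suggestions": [{"word": "good", "suggestion": "effective"}],
}

print(validate_report(sample))  # an empty list means every check passed
```

Running the validator over every sample and both models gives a quick pass/fail signal for the structural criteria, leaving the output quality assessment to manual review.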

Resources/References

Fluently App (try for understanding): https://getfluently.app/
SpeechSuper Unscripted Speech Assessment: https://www.speechsuper.com/
Google Gemini Python SDK: https://github.com/googleapis/python-genai

Updated by Harikrishnan Murugan 4 months ago