Speech Recognition and Response on Web or Mobile Without Depending on Alexa or Google Home


Feb. 21



In the era of voice-enabled devices such as Google Assistant and Amazon Alexa, it seems likely that voice-enabled services will soon reach most aspects of our daily routine.

Because voice offers better interactivity and easier accessibility, it could be a game-changer for the next generation. There are already smart homes where nearly every device can talk to you and respond to your commands. A voice-enabled device needs no GUI or on-screen content; the main concern is speed. For many simple tasks, speaking a request is faster than navigating a screen.

There are many libraries and APIs you can use to get started with your voice bot, such as:

  1. Microsoft Bing Speech
  2. Google Web Speech API
  3. Google Cloud Speech
  4. IBM Speech to Text

We are going to use the Google Web Speech API through the SpeechRecognition library. It is easy to use because a default API key is hard-coded into the library, so you can get started without any configuration or authentication. Like most free APIs, it has a quota: about 50 requests per day with the default key, and that limit cannot be raised. This makes it a good choice for experiments. For production use, you will need one of the paid services listed above.
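Because the daily limit is easy to exhaust while experimenting, it can help to count requests on your side before calling the API. The following is a minimal sketch using only the standard library. The DailyQuota class and its try_acquire method are hypothetical helpers written for this illustration, not part of SpeechRecognition, and the counter lives in memory only.

```python
import datetime
from typing import Optional


class DailyQuota:
    """Count requests per calendar day and refuse once the limit is reached.

    Hypothetical helper for illustration; not part of any speech library.
    """

    def __init__(self, limit: int = 50) -> None:
        self.limit = limit
        self.day: Optional[datetime.date] = None
        self.used = 0

    def try_acquire(self, today: Optional[datetime.date] = None) -> bool:
        """Return True and count the request if quota remains for today."""
        today = today or datetime.date.today()
        if today != self.day:
            # A new day has started: reset the counter
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


quota = DailyQuota(limit=2)  # small limit so the effect is easy to see
day = datetime.date(2021, 2, 21)
print([quota.try_acquire(day) for _ in range(3)])
```

In a server, you would call try_acquire() before each recognition request and return an error message to the user when it returns False.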

Every voice-enabled device follows a three-step process:

  1. Speech to text: In this phase, we let the bot understand what we are saying. We provide either an audio file or a direct stream from the microphone, and the bot converts the sound signal into text using the Google speech recognition API.
  2. Processing: Once your voice has been converted to text, the bot processes it and responds just as a text-based bot would. The processing might be searching the web for a song, or setting an alarm or reminder.
  3. Text to speech: Once the bot has finished processing and the output is ready, the last step is to return the response to the user as speech, which can be done with the gTTS (Google Text-to-Speech) library.
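The three steps above can be sketched as a small pipeline in which each step is a function you can swap out. This is only an illustration under stated assumptions: run_voice_turn and the stand-in functions are hypothetical names, and the stand-ins simply pass text through as bytes so that the sketch runs without any speech service.

```python
from typing import Callable


def run_voice_turn(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],
    process: Callable[[str], str],
    text_to_speech: Callable[[str], bytes],
) -> bytes:
    """Run one request through the three steps and return audio bytes."""
    text = speech_to_text(audio)   # step 1: speech to text
    reply = process(text)          # step 2: processing
    return text_to_speech(reply)   # step 3: text to speech


# Stand-in implementations (assumptions): real code would call a speech API.
def fake_speech_to_text(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_process(text: str) -> str:
    if text:
        return "Did you mean, " + text + " ?"
    return "Hello There"


def fake_text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_voice_turn(
    b"play a song", fake_speech_to_text, echo_process, fake_text_to_speech
)
print(reply_audio)
```

In the server built in this tutorial, the three roles are played by the speech-to-text, conversation, and text-to-speech endpoints, each called in turn by the front end.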

So, let's get started with developing your first voice-enabled bot.

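Dependencies:
  1. SpeechRecognition (Google speech recognition library)
    pip install SpeechRecognition
  2. PyAudio (needed only for live microphone input)
    pip install pyaudio
  3. Flask
    pip install Flask
  4. gTTS, Flask-SocketIO and Flask-CORS (imported by the server code)
    pip install gTTS
    pip install flask-socketio
    pip install flask-cors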

import io
import json

import speech_recognition as sr
from flask import Flask, Response, jsonify, redirect, request
from flask_cors import CORS
from flask_socketio import SocketIO
from gtts import gTTS

import ss  # local helper module, listed after this file

app = Flask(__name__)
CORS(app)  # assumed: CORS was imported but its use was lost in the listing
socketio = SocketIO(app)


# Redirect http to https on CloudFoundry
@app.before_request
def before_request():
    fwd = request.headers.get('x-forwarded-proto')
    if fwd == "http":
        url = request.url.replace('http://', 'https://', 1)
        return redirect(url, code=301)
    # Header missing or already https: continue with the request
    return None


@app.route('/')
def Welcome():
    return app.send_static_file('index.html')


@app.route('/api/conversation', methods=['POST', 'GET'])
def getConvResponse():
    convText = request.form.get('convText')
    convContext = request.form.get('context', "{}")
    jsonContext = json.loads(convContext)
    if convText:
        response = "Did you mean, " + convText + " ?"
    else:
        response = "Hello There"
    # The original dict was truncated after 'responseText';
    # returning the context here is an assumed reconstruction.
    responseDetails = {'responseText': response,
                       'context': jsonContext}
    return jsonify(results=responseDetails)


@app.route('/api/text-to-speech', methods=['POST'])
def getSpeechFromText():
    inputText = request.form.get('text')

    def generate():
        if inputText:
            # Synthesize the text and read the resulting MP3 back as bytes
            audioOut = gTTS(text=inputText, lang='en', slow=False)
            audioOut.save("welcome.mp3")
            with open("welcome.mp3", 'rb') as f:
                data = f.read()
        else:
            print("Empty response")
            data = "I have no response to that."

        yield data

    # gTTS writes MP3 data, so the response is served as audio/mpeg
    return Response(response=generate(), mimetype="audio/mpeg")


@app.route('/api/speech-to-text', methods=['POST'])
def getTextFromSpeech():
    recognizer = sr.Recognizer()
    f = request.files['audio_data']
    # Copy the upload into memory so sr.AudioFile can read it
    file_obj = io.BytesIO(f.read())
    mic = sr.AudioFile(file_obj)
    response = ss.recognize_speech_from_mic(recognizer, mic)
    # The format arguments were truncated in the original; these are assumed
    print('\nSuccess : {}\nError : {}\n\nText from Speech\n{}\n\n{}'.format(
        response['success'], response['error'],
        '-' * 17, response['transcription']))
    # transcription is None when recognition fails, so fall back to ''
    return Response(response=response['transcription'] or '',
                    mimetype='text/plain')


port = 5000
if __name__ == "__main__":
    # Assumed reconstruction of a truncated line: serve on all interfaces
    socketio.run(app, host='0.0.0.0', port=int(port))

# ss.py: helper module imported by the server as "ss"
import speech_recognition as sr


def recognize_speech_from_mic(recognizer, microphone):
    """Transcribe speech from an audio source.

    Returns a dict with "success", "error" and "transcription" keys.
    """
    # Read the whole audio source into an AudioData object
    with microphone as source:
        audio = recognizer.record(source)

    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API unavailable/unresponsive"
    except sr.UnknownValueError:
        # speech was unintelligible
        response["error"] = "Unable to recognize speech"

    return response

Run the file and it will start your server on port 5000. You will need to call the endpoints defined above from your front end, i.e. HTML and JavaScript.
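Before wiring up the front end, it can be handy to exercise the conversation endpoint from a small script. The sketch below only builds the form-encoded POST request and does not send it, so it runs without a server. BASE_URL and build_conversation_request are assumptions made for this example; the field names convText and context are the ones the server reads.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # assumed: server running locally on port 5000


def build_conversation_request(text: str, context: dict) -> Request:
    """Build (but do not send) a form-encoded POST for /api/conversation."""
    # The server reads 'convText' as plain text and 'context' as a JSON string
    form = urlencode({"convText": text, "context": json.dumps(context)})
    return Request(
        BASE_URL + "/api/conversation",
        data=form.encode("utf-8"),
        method="POST",
    )


req = build_conversation_request("play a song", {})
print(req.get_method(), req.full_url)
print(req.data)
```

Once the server is running, passing the request to urllib.request.urlopen sends it, and the JSON reply carries the bot's text under results.responseText.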
Let’s understand the code first.

  1. getConvResponse: This function keeps track of the conversation context and returns the bot's reply to your HTML front end.
  2. getSpeechFromText: This function converts the processed text output into voice output.
  3. getTextFromSpeech: This is the most important function. It receives voice input from the web recorder and converts it to text using the SpeechRecognition API. The text is then passed to getConvResponse to save the context and process it.
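One practical detail for getTextFromSpeech: sr.AudioFile reads WAV, AIFF and FLAC data, so the web recorder has to upload one of those formats rather than, for example, a compressed browser recording. As a rough sketch using only the standard library, the helpers below create a short silent WAV file in memory for testing the endpoint and check whether uploaded bytes parse as WAV. The names make_silent_wav and is_wav are hypothetical and exist only in this example.

```python
import io
import wave


def make_silent_wav(seconds: float = 1.0, sample_rate: int = 16000) -> bytes:
    """Return a mono, 16-bit PCM WAV file containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def is_wav(data: bytes) -> bool:
    """Check whether the bytes parse as a PCM WAV file."""
    try:
        with wave.open(io.BytesIO(data), "rb"):
            return True
    except (wave.Error, EOFError):
        return False


sample = make_silent_wav()
print(is_wav(sample), is_wav(b"not audio"))
```

A check like is_wav could run at the start of the endpoint to reject unsupported uploads with a clear message instead of an exception from the recognizer.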

In this tutorial, we built a simple voice bot to show how voice recognition works. You can use it in a live project by adding more functionality. Feel free to contact us with any questions or to learn more about our other Voice Assistant App development services.


Posted by Lets Nurture