How to build a text and voice-powered ChatGPT bot with text-to-speech and speech-to-text capabilities

Chatbots are becoming increasingly popular for providing quick and efficient customer support, answering questions, and helping users navigate through complex tasks.

In this blog post, I'll walk you through the process of building an AI-powered chatbot using ReactJS, OpenAI API and AWS Lambda.

The chatbot is designed to engage in conversation based on the user's input.

Do you want the full source code?
This tutorial is quite extensive, and I've prepared the entire source code.

Visit this page to download the entire source code. You'll get instant access to the files, which you can use as a reference or as a starting point for your own voice-powered ChatGPT bot project.

Here's the simple chatbot interface we'll be building together:

Chatbot app overview

Here are some of its key features:

1. Text-based interaction
Users can type their questions

2. Voice input and output
Users can send voice messages and the chatbot can transcribe them and reply with both text and audio responses

3. Context-aware conversations
The chatbot leverages the OpenAI ChatGPT API to maintain context throughout the conversation, which enables coherent interactions. A quick sketch of what that context looks like follows below.
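
Context is maintained by sending the running conversation to the API as an array of role/content objects, which is exactly the structure we'll build on the frontend later in this guide. Here's a minimal sketch (the message wording is purely illustrative):

// Conversation history sent with every request so the model keeps context.
// "user" is the person typing or speaking; "assistant" is the chatbot.
const conversation = [
  { role: "user", content: "What's a good name for a coffee shop?" },
  { role: "assistant", content: "How about 'Brewed Awakening'?" },
  { role: "user", content: "Can you suggest a tagline for it?" }, // the model sees the earlier turns
];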


We'll be using the following technologies:

1. ReactJS
A popular JavaScript library for building user interfaces.

2. OpenAI API
Powered by GPT-3.5-turbo to generate human-like responses

3. AWS Lambda
A serverless compute service, where we can run our backend code without provisioning or managing servers. We'll use Lambda to handle audio transcription, text-to-speech, and calling the OpenAI API.

4. Material UI
A popular React UI framework with components and styling.

5. ElevenLabs API
A powerful API developed by ElevenLabs that offers state-of-the-art text-to-speech, voice cloning, and synthetic voice design capabilities (see the rough request sketch after this list).
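
To give you a feel for what the text-to-speech integration involves, here's a rough, hypothetical sketch of the kind of request the backend will make to ElevenLabs (shown in JavaScript for illustration; the voice ID and API key are placeholders, and the exact fields should be checked against the ElevenLabs documentation):

// Hypothetical sketch only: not the tutorial's actual backend code.
const ELEVENLABS_API_KEY = "your-api-key"; // placeholder
const VOICE_ID = "your-voice-id";          // placeholder

async function synthesizeSpeech(text) {
  const response = await fetch(`https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}`, {
    method: "POST",
    headers: {
      "xi-api-key": ELEVENLABS_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  });

  // The response body is the synthesized speech as MP3 bytes.
  return await response.arrayBuffer();
}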

In the upcoming sections, I'll guide you through the entire process of building the chatbot, from setting up the frontend and backend to deploying the chatbot.

Let's get started!


1. Create a new ReactJS app

To begin, create a parent folder for your new chatbot project. Over the next steps, we'll build up this folder structure:

Folder structure

Navigate to the location you'd like to have your project and then run the following command in your terminal or command prompt:

mkdir your-project-name

Replace your-project-name with the desired name for your chatbot project. Then navigate to that new folder by running the following command:

cd your-project-name

Then, let's create a new ReactJS app using create-react-app.

This command-line tool helps us quickly set up a React project with the necessary build configurations and folder structure.

Run the following command in your terminal to create the app:

npx create-react-app frontend

After the project is created, navigate into the folder and start the development server:

cd frontend
npm start

This command will launch a new browser window, showing the default React app starter template:

React app starter

Now that our React app is up and running, let's install the required libraries.


2. Install libraries

We'll need several libraries for our chatbot project.
First, we'll use Material UI (MUI) v5 for styling and UI components. MUI is a fully-loaded component library and design system with production-ready components.

To install MUI, run the following command inside the frontend folder that was created earlier:

npm install @mui/material @emotion/react @emotion/styled

Additionally, we'll install MUI's icon package, which provides a set of SVG icons exported as React components:

npm install @mui/icons-material

Next, we'll need a library to handle microphone recording and output the audio as an mp3 file.
For this guide, we'll use the mic-recorder-to-mp3 library, but you can pick any library that will record your microphone and output an audio file in mp3:

npm install mic-recorder-to-mp3

The mic-recorder-to-mp3 library also enables playback of recorded audio, which is a useful feature for our chatbot.
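
As a quick preview of the flow we'll wire into React in section 3.3, the library's basic usage looks roughly like this (a minimal sketch, outside of any component):

import MicRecorder from 'mic-recorder-to-mp3';

const recorder = new MicRecorder({ bitRate: 128 });

async function recordAndPlayBack() {
  // Start recording (the browser will ask for microphone permission).
  await recorder.start();

  // ...later: stop recording and get the result as MP3 data.
  const [buffer, blob] = await recorder.stop().getMp3();

  // Wrap the MP3 data in a File and play it back through an Audio element.
  const file = new File(buffer, 'voice-message.mp3', { type: blob.type });
  new Audio(URL.createObjectURL(file)).play();
}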

Finally, let's install aws-amplify. This library will help us send the recorded audio to our backend using AWS Amplify:

npm install aws-amplify
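
Note that Amplify needs to know where your API lives before calls like API.post will work. Here's a minimal configuration sketch for the aws-amplify version used here, assuming a REST endpoint named "api" and a placeholder URL you'll get after deploying the backend later in this guide (you could place this in src/index.js):

import { Amplify } from 'aws-amplify';

Amplify.configure({
  API: {
    endpoints: [
      {
        // Must match the first argument of API.post("api", ...)
        name: 'api',
        // Placeholder: replace with your API Gateway URL after deploying the backend
        endpoint: 'https://your-api-id.execute-api.us-east-1.amazonaws.com/dev',
      },
    ],
  },
});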

With all the necessary libraries installed, we're ready to start building the audio recording functionality for our chatbot app.


3. Create the chat interface components

In this section, we'll build the components needed for a simple chatbot interface that allows users to record audio, stop recording, playback the recorded audio, and upload the audio to the backend:

Chatbot app overview

We'll create the following components for the chat interface:

1. ChatHeader - to display the chatbot title and header information
2. ChatMessages - to display chat messages exchanged between the user and the chatbot
3. AudioControls - to provide the user with audio controls, including recording and uploading audio
4. MessageInput - to provide the user with a text input option
5. ResponseFormatToggle - to give the user the option to receive audio responses in addition to text responses

Let's start by changing the title of the app. Open up public/index.html and change the title tag to your desired name:

 <title>ChatGPT audio chatbot</title>

Create React App comes with hot reloading and ES6 support, so you should already see the change in the browser tab:

New title

Let's now set up our App.js file.

Open App.js from your src folder and remove all the code within the return statement.

Also, delete the default logo import and import React. Your App.js file should now look like this:

import React from "react";
import './App.css';

function App() {
  return (

  );
}

export default App;

Now, let's import the necessary MUI components, such as Container and Grid.

Wrap your app with a Container component and add a maxWidth of sm to keep the window narrow for the chat interface. Additionally, add some padding to the top.

Your App.js should now look like this:

import React from "react";
import './App.css';

// Material UI
import { Container, Grid } from '@mui/material';

function App() {
  return (
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      
    </Container>

  );
}

export default App;

3.1. Create the ChatHeader component

The ChatHeader component will display the chatbot title and any relevant header information. This component will be positioned at the top of the chat interface.

Start by creating a ChatHeader component inside your App function. We'll use Typography, so import that component from MUI:

import { Typography } from '@mui/material';

Then, define the ChatHeader component with a headline for the chatbot:

const ChatHeader = () => {
    return(
        <Typography variant="h4" align="center" gutterBottom>
            Voice Chatbot
        </Typography>
    )
}

The Typography component from MUI is used to display text in a consistent and responsive manner. The variant prop sets the font size and style, while align adjusts the text alignment, and gutterBottom adds a bottom margin to create a space below the headline.

Next, include the ChatHeader component in the return statement of the App function:

return(
    <Container maxWidth="sm" sx={{ pt: 2 }}>
        <ChatHeader />
    </Container>
)

By adding the ChatHeader component to the Container, it is now integrated into the overall layout of the application.

Your app should now look like this:

New title


3.2. Create the ChatMessages component

The ChatMessages component will display chat messages exchanged between the user and the chatbot. It should update dynamically as new messages are added:

ChatMessages component

First, create an initial greeting message from the chatbot inside your App function:

const mockMessages = [
  {
    role: 'assistant',
    content: 'Hello, how can I help you today?',
    text: 'Hello, how can I help you today?'
  },
];

Then, import the useState hook and save the mockMessages to the state:

import React, { useState } from "react";
const [messages, setMessages] = useState(mockMessages);

Each object in the messages array will have 3 keys:

  • role determines if it is the chatbot or the user talking
  • text key is the text shown in the app
  • content is the text we'll use to send to the backend to create a completion.

The text key will store both text and React components, which we'll get back to later.

Import necessary components from MUI, such as List, ListItem, ListItemText, Box, and Paper:

import {
    Box, // Add to imports
    Container,
    Grid,
    IconButton, // Add to imports
    List, // Add to imports
    ListItem, // Add to imports
    ListItemText, // Add to imports
    Paper, // Add to imports
    Typography
} from "@mui/material";

To style our components, import useTheme and styled from MUI:

import { useTheme } from '@mui/material/styles';
import { styled } from '@mui/system';

Before creating the chat area, define three styled components for the chat messages inside your App function: one for user messages, one for agent messages, and a MessageWrapper that wraps both.

The user message style uses the audio prop to adjust the right padding when an audio icon is present:

const UserMessage = styled('div', { shouldForwardProp: (prop) => prop !== 'audio' })`
  position: relative;
  background-color: ${({ theme }) => theme.palette.primary.main};
  color: ${({ theme }) => theme.palette.primary.contrastText};
  padding: ${({ theme }) => theme.spacing(1, 2)};
  padding-right: ${({ theme, audio }) => (audio ? theme.spacing(6) : theme.spacing(2))};
  border-radius: 1rem;
  border-top-right-radius: 0;
  align-self: flex-end;
  max-width: 80%;
  word-wrap: break-word;
`;

Then create the styling for the Agent messages:

const AgentMessage = styled('div')`
  position: relative;
  background-color: ${({ theme }) => theme.palette.grey[300]};
  color: ${({ theme }) => theme.palette.text.primary};
  padding: ${({ theme }) => theme.spacing(1, 2)};
  border-radius: 1rem;
  border-top-left-radius: 0;
  align-self: flex-end;
  max-width: 80%;
  word-wrap: break-word;
`;

Finally, let's create the styling for the MessageWrapper that wraps both the agent messages and the user messages:

const MessageWrapper = styled('div')`
  display: flex;
  margin-bottom: ${({ theme }) => theme.spacing(1)};
  justify-content: ${({ align }) => (align === 'user' ? 'flex-end' : 'flex-start')};
`;

Each message in the ChatMessages will have a play icon if any audio is available, so we'll import a fitting icon component:

import VolumeUpIcon from '@mui/icons-material/VolumeUp';

Now, create a ChatMessages component that displays the messages from the messages array:

const ChatMessages = ({messages}) => {
}

Then add a useTheme hook to access the MUI theme:

const ChatMessages = ({messages}) => {
  const theme = useTheme();
}

To improve user experience, we want the chat window to automatically scroll to the bottom whenever a new message is added to the conversation.

Start by importing useEffect and useRef hooks from React:

import React, { 
  useEffect, // Add to imports
  useRef, // Add to imports
  useState 
} from "react";

useEffect allows us to run side effects, such as scrolling the chat window, in response to changes in the component's state or properties. useRef is used to create a reference to a DOM element so that we can interact with it programmatically.

Continue with defining a local variable bottomRef in the ChatMessages component, to create a reference to the bottom of the chat window:

const bottomRef = useRef(null);

Then create the scrollToBottom function, which will be responsible for scrolling the chat window to the bottom:

const scrollToBottom = () => {
  if (bottomRef.current) {
    if (typeof bottomRef.current.scrollIntoViewIfNeeded === 'function') {
      bottomRef.current.scrollIntoViewIfNeeded({ behavior: 'smooth' });
    } else {
      bottomRef.current.scrollIntoView({ behavior: 'smooth' });
    }
  }
};

This function first checks if bottomRef.current is defined. If it is, it then checks whether the scrollIntoViewIfNeeded method is available.
If available, it smoothly scrolls the chat window to the bottom using scrollIntoViewIfNeeded. Since scrollIntoViewIfNeeded is a non-standard method that not all browsers implement (Firefox, for example, doesn't support it), the function falls back to scrollIntoView, which is widely supported, to achieve the same effect.

Next, add a useEffect hook that triggers the scrollToBottom function whenever the messages prop changes:

useEffect(() => {
  scrollToBottom();
}, [messages]);

This will ensure that the chat window always scrolls to the bottom when new messages are added to the conversation.

Then finally create the components where the chat messages will be displayed in the return statement of ChatMessages:

return(
    <Container>
        <Box sx={{ width: '100%', mt: 4, maxHeight: 300, minHeight: 300, overflow: 'auto' }}>
            <Paper elevation={0} sx={{ padding: 2 }}>
                <List>
                    {messages.map((message, index) => (
                    <ListItem key={index} sx={{ padding: 0 }}>
                        <ListItemText
                        sx={{ margin: 0 }}
                        primary={
                            <MessageWrapper align={message.role}>
                            {message.role === 'user' ? (
                                <>
                                <UserMessage theme={theme} audio={message.audio}>
                                    {message.text}
                                    {message.audio && ( 
                                    <IconButton
                                        size="small"
                                        sx={{ 
                                            position: 'absolute', 
                                            top: '50%', 
                                            right: 8, 
                                            transform: 'translateY(-50%)' 
                                            }}
                                        onClick={() => message.audio.play()}
                                    >
                                        <VolumeUpIcon fontSize="small" />
                                    </IconButton>
                                    )}
                                </UserMessage>
                                </>
                            ) : (
                                <AgentMessage theme={theme}>
                                    {message.text}
                                </AgentMessage>
                            )}
                            </MessageWrapper>
                        }
                        />
                    </ListItem>
                    ))}
                </List>
            </Paper>
        </Box>
    </Container>
)

Lastly, add the bottomRef to your List component to make the auto-scrolling functionality work:

  // ........
      </ListItem>
    ))}
    <div ref={bottomRef} /> // Add this ref
  </List>
</Paper>
 // ........

By adding the bottomRef to an empty <div> at the end of the List component, we can now programmatically scroll the chat window to the bottom whenever new messages are added to the conversation.

Let's break down what we're doing in the ChatMessages component in detail.

We start by defining the ChatMessages component, which takes the messages prop. We also use the useTheme hook to access the Material-UI theme:

const ChatMessages = ({messages}) => {
    const theme = useTheme()

We then wrap the chat area with a Container component. Inside the container, we use a Box component with specific styles for width, margin, maximum height, and overflow. This ensures that the chat area has a fixed height and scrolls if there are more messages than can fit in the available space.

We then use a Paper component with an elevation of 0 to remove the raised shadow effect, so the chat area sits flat against the background. We also add some padding to the Paper component.

Inside the Paper component, we use a List component to hold the chat messages:

{messages.map((message, index) => (
  <ListItem key={index} sx={{ padding: 0 }}>
    <ListItemText
      sx={{ margin: 0 }}
      primary={
        <MessageWrapper align={message.role}>

We iterate over the messages array and create a ListItem component for each message, setting its padding to 0 and providing a unique key using the index. We then use the ListItemText component to display the message content.

We conditionally align the message based on the role using the MessageWrapper component. The MessageWrapper component uses the align prop to justify the content to either

- flex-end for user messages or

- flex-start for agent messages

{message.role === 'user' ? (
  <>
    <UserMessage theme={theme} audio={message.audio}>
      {message.text}
      {message.audio && (
        <IconButton
          size="small"
          sx={{
            position: 'absolute',
            top: '50%',
            right: 8,
            transform: 'translateY(-50%)',
          }}
          onClick={() => message.audio.play()}
        >
          <VolumeUpIcon fontSize="small" />
        </IconButton>
      )}
    </UserMessage>
  </>
) : (
  <AgentMessage theme={theme}>
    {message.text}
  </AgentMessage>
)}

We conditionally apply the UserMessage or AgentMessage styling based on the role.

We pass the Material-UI theme and the audio prop, if available, to the UserMessage component. If the message has associated audio, we display an IconButton component with the VolumeUpIcon. The IconButton has an onClick event that plays the audio when clicked.

The same structure is applied to the AgentMessage component. The styling for the AgentMessage is slightly different, but the functionality remains the same.

In summary, the ChatMessages component is responsible for displaying chat messages in a styled, scrollable area. It takes an array of messages and iterates over them, creating a list of messages aligned based on the role, user or agent.

It also displays an audio icon for messages with associated audio, allowing users to play the audio by clicking the icon.

Now we're ready to include the ChatMessages component in the return statement of the App function. Your return statement should now look like this:

return (
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      <ChatHeader />
      <ChatMessages messages={messages} />
    </Container>
)

Your app should now look like this with a greeting message:

App with agent greeting

Let's go ahead and create the audio controls in the next segment.


3.3 Create the AudioControls

The next step is to create the audio controls:

App with audio controls

Start by importing the MicRecorder library:

import MicRecorder from 'mic-recorder-to-mp3';

Then, go ahead and define the AudioControls component outside the App function and create four new state variables:

const AudioControls = () => {
    const [isRecording, setIsRecording] = useState(false);
    const [recorder, setRecorder] = useState(null);
    const [player, setPlayer] = useState(null);
    const [audioFile, setAudioFile] = useState(null);
}

AudioControls is placed outside of the App function to encapsulate its state and logic, making the component reusable and easier to maintain. This separation of concerns also helps prevent unnecessary re-renders of the App component when state changes occur within the AudioControls component.

By defining the AudioControls component outside of the App function, you can more efficiently manage the state related to recording, playing, and uploading audio, making your application more modular and organized.

We'll have four buttons in the AudioControls component:

1. Start a recording
2. Stop the recording
3. Play the recording
4. Upload audio

App with audio controls parts

For the icon buttons, we'll need a microphone icon and a record dot. Import those icon components:

import FiberManualRecordIcon from '@mui/icons-material/FiberManualRecord';
import MicIcon from '@mui/icons-material/Mic';

Also, import the Button component from MUI:

import {
    Button, // Add to imports
    Container,
    Grid,
    IconButton,
    List,
    ListItem, 
    ListItemText,
    Paper,
    Typography
} from "@mui/material";

Let's create the function for starting an audio recording inside the AudioControls function:

const startRecording = async () => {
    const newRecorder = new MicRecorder({ bitRate: 128 });

    try {
    await newRecorder.start();
    setIsRecording(true);
    setRecorder(newRecorder);
    } catch (e) {
    console.error(e);
    alert(e)
    }
};

Let's break down what we're doing in the function. We're declaring an asynchronous function using async:

const startRecording = async () => {

This allows us to use the keyword await within the function to handle the Promise from MicRecorder.

The next step is to create a new instance of MicRecorder with a bitrate of 128 kbps. The bitRate option specifies the quality of the recorded audio; a higher bitrate means better quality but a larger file size:

const newRecorder = new MicRecorder({ bitRate: 128 });

Then we're calling the start() method on the newRecorder instance to start recording in a try block:

try {
  await newRecorder.start();

The await keyword is used with newRecorder.start() to pause the function's execution until the Promise resolves or rejects.

If the audio recording starts successfully, the Promise resolves and we proceed to update the React component's state:

setIsRecording(true);
setRecorder(newRecorder);

  • The setIsRecording(true) call sets the isRecording state to true, indicating that the recording is in progress.

  • The setRecorder(newRecorder) call sets the recorder state to the newRecorder instance, so it can be used later to stop the recording.

If the start() method fails, which could be due to permission issues or the microphone being unavailable, then the catch block gets executed:

catch (e) {
  console.error(e);
  alert(e)
}

This block logs the error and shows an alert so you can troubleshoot the issue.

Let's also create the function for stopping the audio recording:

const stopRecording = async () => {
    if (!recorder) return;

    try {
    const [buffer, blob] = await recorder.stop().getMp3();
    const audioFile = new File(buffer, "voice-message.mp3", {
        type: blob.type,
        lastModified: Date.now(),
    });
    setPlayer(new Audio(URL.createObjectURL(audioFile)));
    setIsRecording(false);
    setAudioFile(audioFile); // Add this line
    } catch (e) {
    console.error(e);
    alert(e)
    }
};

Here's a breakdown of what we did, starting with declaring the function as asynchronous using the async keyword to handle Promises:

const stopRecording = async () => {

Then we added the try block to attempt to stop the recording and get the MP3 data:

try {
  const [buffer, blob] = await recorder.stop().getMp3();

The await keyword is used with recorder.stop().getMp3() to pause the function's execution until the Promise is resolved or rejected.

If the Promise is resolved, the buffer and blob variables are assigned values returned by the getMp3() method.

Then we converted the recorded audio into an MP3 file:

const audioFile = new File(buffer, 'voice-message.mp3', {
  type: blob.type,
  lastModified: Date.now(),
});

In this code, the File constructor is used to create a new File object with the audio data, the name voice-message.mp3, the appropriate file type and the last-modified timestamp.

The MP3 file is then used to create a new Audio object, which can be played back:

setPlayer(new Audio(URL.createObjectURL(audioFile)));

The URL.createObjectURL(audioFile) method creates a URL representing the file, and the new Audio() constructor creates a new Audio object using that URL.

The setPlayer call updates the React component's player state with the new Audio object.

In the next step, we update the React component's state:

setIsRecording(false);

The setIsRecording(false) call sets the isRecording state to false, indicating that the recording is no longer in progress.

If the stop().getMp3() method fails, which could be due to an issue with the recorder, the catch block is executed:

catch (e) {
  console.error(e);
  alert(e);
}

Let's also create the function for playing a recording:

const playRecording = () => {
    if (player) {
    player.play();
    }
};

Now that we have the audio control functions ready, we can create the return statement of the AudioControls component:

return (
    <Container>
        <Box sx={{ width: "100%", mt: 4 }}>
            <Grid container spacing={2} justifyContent="flex-end">
                <Grid item xs={12} md>
                    <IconButton
                    color="primary"
                    aria-label="start recording"
                    onClick={startRecording}
                    disabled={isRecording}
                    >
                    <MicIcon />
                    </IconButton>
                </Grid>
                <Grid item xs={12} md>
                    <IconButton
                    color="secondary"
                    aria-label="stop recording"
                    onClick={stopRecording}
                    disabled={!isRecording}
                    >
                    <FiberManualRecordIcon />
                    </IconButton>
                </Grid>
                <Grid item xs="auto">
                    <Button
                    variant="contained"
                    disableElevation
                    onClick={playRecording}
                    disabled={isRecording}
                    >
                    Play Recording
                    </Button>
                </Grid>
            </Grid>
        </Box>
    </Container>
)

We're ready to include the AudioControls component in the return statement of the App function, and your return statement should now look like this:

return(
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      <ChatHeader />
      <ChatMessages messages={messages} />
      <AudioControls />
    </Container>
)

If you look at your app, you'll see that one button is still missing: the Upload Audio button:

App with audio controls missing a button

We'll create it in the coming sections, but first, let's build the logic for switching between audio and text responses.


3.4 Create the audio response toggle

In this section, we'll build the ResponseFormatToggle, which allows users to decide if they want an audio response in addition to the text response:

App with the audio response format

Just like we did with AudioControls, we'll place the ResponseFormatToggle outside of the App function to encapsulate its state and logic, making the component reusable and easier to maintain.

First, add the isAudioResponse and setIsAudioResponse variables to your main state:

const [isAudioResponse, setIsAudioResponse] = useState(false);

Next, create the ResponseFormatToggle component outside of the App function and pass the variables as props:

const ResponseFormatToggle = ({ isAudioResponse, setIsAudioResponse }) => {
}

Define the function for handling the toggle change in the ResponseFormatToggle function:

const handleToggleChange = (event) => {
  setIsAudioResponse(event.target.checked);
};

We'll need to import two new MUI components, FormControlLabel and Switch:

import {
    Button,
    Container,
    FormControlLabel, // Add to imports
    Grid,
    IconButton,
    List,
    ListItem, 
    ListItemText,
    Paper,
    Switch, // Add to imports
    Typography
} from "@mui/material";

Now, create the component for the toggle, and your ResponseFormatToggle should now look like this:

const ResponseFormatToggle = ({ isAudioResponse, setIsAudioResponse }) => {

  const handleToggleChange = (event) => {
    setIsAudioResponse(event.target.checked);
  };

  return (
    <Box sx={{ display: "flex", justifyContent: "center", marginTop: 2 }}>
      <FormControlLabel
        control={
          <Switch
            checked={isAudioResponse}
            onChange={handleToggleChange}
            color="primary"
          />
        }
        label="Audio response"
      />
    </Box>
  );
};

Finally, add the ResponseFormatToggle to the return statement of the App function:

return (
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      <ChatHeader />
      <ChatMessages messages={messages} />
      <AudioControls />
      <ResponseFormatToggle isAudioResponse={isAudioResponse} setIsAudioResponse={setIsAudioResponse} />
    </Container>
  );

Your app should now display a functioning toggle button:

App with toggle button

With the toggle button in place, we're ready to create the missing UploadButton:

App with audio controls missing a button


3.5 Create the upload button

The SendButton is part of the AudioControls component and is responsible for uploading the audio file to the backend.

To keep the user informed while the audio is being sent and processed in the backend, we'll create a new component, ThinkingBubble, that pulses while the chatbot is "thinking".

Both ThinkingBubble and SendButton are placed outside of the App function to encapsulate their state and logic, making the components reusable and easier to maintain.

To create the pulse motion, we'll need to import keyframes from MUI:

import {
  keyframes, // Add this import
  styled
} from '@mui/system';

Then define the pulse motion outside of your App function:

const pulse = keyframes`
  0% {
    transform: scale(1);
    opacity: 1;
  }
  50% {
    transform: scale(1.1);
    opacity: 0.7;
  }
  100% {
    transform: scale(1);
    opacity: 1;
  }
`;

We'll use the MoreHorizIcon for the thinking bubble, so import it from MUI:

import MoreHorizIcon from '@mui/icons-material/MoreHoriz';

Now, create the ThinkingBubbleStyled component with the pulse animation below the pulse definition:

const ThinkingBubbleStyled = styled(MoreHorizIcon)`
  animation: ${pulse} 1.2s ease-in-out infinite;
  margin-bottom: -5px;
`;

Finally, create the ThinkingBubble component:

const ThinkingBubble = () => {
  const theme = useTheme();
  return <ThinkingBubbleStyled theme={theme} sx={{ marginBottom: '-5px' }} />;
};

The ThinkingBubble component is styled with MUI, so it uses the useTheme hook to access and pass down the theme.

Now we're ready to create the SendButton component, begin by defining it with a useTheme hook:

const SendButton = ({audioFile}) => {
  const theme = useTheme();
}

Continue by creating a function in SendButton for uploading the audio file to the backend, which starts with a check for whether an audio file exists:

const uploadAudio = async () => {

  if (!audioFile) {
    console.log("No audio file to upload");
    return;
  }
}

Before we add the backend API call, let's create a helper function that builds the message objects needed for the ChatGPT prompt. Add this function to the main App function, since we'll use it in components both inside and outside of App:

function filterMessageObjects(list) {
  return list.map(({ role, content }) => ({ role, content }));
}
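
For example, given messages that also carry text, audio, and id keys, the helper strips everything except what the ChatGPT API needs:

const example = [
  { role: 'assistant', content: 'Hello, how can I help you today?', text: 'Hello, how can I help you today?' },
  { role: 'user', content: '🎤 Audio Message', text: '🎤 Audio Message', audio: null, id: 1681900000000 },
];

console.log(filterMessageObjects(example));
// [
//   { role: 'assistant', content: 'Hello, how can I help you today?' },
//   { role: 'user', content: '🎤 Audio Message' }
// ]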

Make sure to add filterMessageObjects as a prop, so SendButton now has two props:

const SendButton = ({audioFile, filterMessageObjects}) => {
  // .....
}

This function maps over the messages and creates a new array containing only the role and content keys. For the backend call itself, we'll use Amplify, which we installed earlier. Go ahead and import the library:

import { API } from "aws-amplify";

The next step is adding the async backend call and your uploadAudio function should now look like this:

const uploadAudio = async () => {

  if (!audioFile) {
    console.log("No audio file to upload");
    return;
  }

  try {

    const reader = new FileReader();
    reader.onloadend = async () => {
      const base64Audio = reader.result;

      // Add a unique id to the message to be able to update it later
      const messageId = new Date().getTime();

      // Create the message objects
      let messageObjects = filterMessageObjects(messages)

      // Add user's audio message to the messages array
      setMessages((prevMessages) => [
        ...prevMessages,
        { role: "user", content: "🎤 Audio Message", audio: new Audio(base64Audio), text: "🎤 Audio Message", id: messageId },
      ]);

      // Add thinking bubble
      setMessages((prevMessages) => [
        ...prevMessages,
        { role: "assistant", content: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />, text: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />, key: "thinking" },
      ]);

      const response = await API.post("api", "/get-answer", {
        headers: {
          "Content-Type": "application/json",
        },
        body: {
          audio: base64Audio,
          messages: messageObjects,
          isAudioResponse
        },
      });

      // Remove the thinking bubble
      setMessages((prevMessages) => {
        return prevMessages.filter((message) => message.key !== "thinking");
      });
    };
    reader.readAsDataURL(audioFile);

  } catch (error) {
    console.error("Error uploading audio file:", error);
    alert(error)
  }
};

Let's break down how the uploadAudio function is built and examine each step in detail:

1. Check if an audio file exists
The function starts by checking if an audioFile exists. If not, it logs a message and returns early to prevent further execution.

if (!audioFile) {
  console.log("No audio file to upload");
  return;
}

2. Create a FileReader instance
A new FileReader instance is created to read the audio file's content. The reader.onloadend event fires when the file has been read; we declare the handler as async so the API call inside it can be awaited:

const reader = new FileReader();
reader.onloadend = async () => {
  // ... remaining steps
};

3. Convert the audio file to Base64
The reader.result contains the audio file's content in Base64 format. This is needed for further processing and transmitting the file to the backend:

const base64Audio = reader.result;

4. Generate a unique message ID
To uniquely identify messages, generate a unique ID based on the current timestamp. We're doing this to keep track of a placeholder message (the pulsing ThinkingBubble) while the backend is processing the audio file:

const messageId = new Date().getTime();

5. Create message objects
Use the filterMessageObjects helper function to create an array containing only the role and content keys for each message:

let messageObjects = filterMessageObjects(messages);

6. Add the user's audio message
Update the messages array with the new audio message, including its role, content, audio, text, and the unique ID:

setMessages((prevMessages) => [
  ...prevMessages,
  {
    role: "user",
    content: "🎤 Audio Message",
    audio: new Audio(base64Audio),
    text: "🎤 Audio Message",
    id: messageId,
  },
]);

The unique ID is used later to update the content key with the transcribed audio message from the backend.

7. Add the thinking bubble
Display the ThinkingBubble component to indicate that the chatbot is processing the user's input:

setMessages((prevMessages) => [
  ...prevMessages,
  {
    role: "assistant",
    content: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />,
    text: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />,
    key: "thinking",
  },
]);

We add the key thinking so we can find and remove this placeholder from the array later.

8. Make the backend call
Use the API.post method from Amplify to send the Base64 audio file, message objects, and the isAudioResponse flag to the backend for processing:

const response = await API.post("api", "/get-answer", {
  headers: {
    "Content-Type": "application/json",
  },
  body: {
    audio: base64Audio,
    messages: messageObjects,
    isAudioResponse,
  },
});

9. Remove the thinking bubble
Once the response is received, remove the ThinkingBubble component from the messages array:

setMessages((prevMessages) => {
  return prevMessages.filter((message) => message.key !== "thinking");
});

10. Read the audio file
Lastly, initiate the process of reading the audio file using the reader.readAsDataURL(audioFile) method:

reader.readAsDataURL(audioFile);

Let's update the SendButton component to include the necessary isAudioResponse, messages and setMessages as props:

const SendButton = ({
  audioFile, 
  isAudioResponse,
  filterMessageObjects, 
  messages, 
  setMessages}) => {
  // .....
}

Let's also create the Button component. We'll need the CloudUploadIcon, so start by importing it and then add the Button component to the return statement of SendButton:

import CloudUploadIcon from "@mui/icons-material/CloudUpload";

return (
  <Grid item xs="auto">
    <Button
      variant="contained"
      color="primary"
      disableElevation
      onClick={uploadAudio}
      disabled={!audioFile}
      startIcon={<CloudUploadIcon />}
    >
      Upload Audio
    </Button>
  </Grid>
);

Now that the SendButton component is complete, incorporate it into the AudioControls component created earlier:

const AudioControls = () => {
  // startRecording ...
  // stopRecording ...
  // playRecording ...
  // ...
   return ( 
    // ...
      <Grid item xs="auto">
        <Button
          variant="contained"
          disableElevation
          onClick={playRecording}
          disabled={isRecording}
        >
          Play Recording
        </Button>
      </Grid>
      <SendButton
        audioFile={audioFile}
        isAudioResponse={isAudioResponse}
        filterMessageObjects={filterMessageObjects}
        messages={messages}
        setMessages={setMessages} /> {/* Add the SendButton component */}
    // ....

Since SendButton requires the props isAudioResponse, filterMessageObjects, messages, and setMessages, make sure to pass them down to AudioControls from the App function's return statement:

return (
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      <ChatHeader />
      <ChatMessages messages={messages} />
      <AudioControls isAudioResponse={isAudioResponse} filterMessageObjects={filterMessageObjects} messages={messages} setMessages={setMessages}  />
      <ResponseFormatToggle isAudioResponse={isAudioResponse} setIsAudioResponse={setIsAudioResponse} />
    </Container>
  );

Also, add isAudioResponse, filterMessageObjects, messages, and setMessages as props for AudioControls:

const AudioControls = ({isAudioResponse, filterMessageObjects, messages, setMessages}) => {
 // ....
}

With these updates, your SendButton component receives the necessary props and is now integrated into the AudioControls component.

Your app should now have an Upload Audio button:

App with audio controls with upload audio button

Now you have a functional SendButton component that uploads the audio file to the backend and displays a ThinkingBubble component while the chatbot processes the user's input. Once the response is received, the ThinkingBubble is removed, and the assistant's response is displayed.


3.6 Create the message input

For this guide, we're giving the users the option to send both audio and text messages. Let's create the last component, the MessageInput, which will allow users to type and send text messages.

Start by defining a message variable in the main App function:

// Main app
const [message, setMessage] = useState("");

Then continue with defining the component outside of the App function:

const MessageInput = () => {
  }

This component will need to send the isAudioResponse flag to the backend, so add it as a prop:

const MessageInput = ({isAudioResponse}) => {
}

Also, add the variables message and setMessage as props:

const MessageInput = ({message, setMessage, isAudioResponse}) => {
}

Next, create a function to handle the text input change, and place this function inside the MessageInput function:

const handleInputChange = (event) => {
  setMessage(event.target.value);
};

Now, add a function that sends the text message to the backend, and place it inside the App function:

const handleSendMessage = async () => {
  if (message.trim() !== "") {
    
    // Send the message to the chat

    // Add the new message to the chat area
    setMessages((prevMessages) => [
      ...prevMessages,
      { role: "user", content: message, text: message, audio: null },
    ]);

    // Clear the input field
    setMessage("");

    // Add thinking bubble
    setMessages((prevMessages) => [
      ...prevMessages,
      { role: "assistant", content: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />, text: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />, key: "thinking" },
    ]);

    // Create backend chat input
    let messageObjects = filterMessageObjects(messages)
    messageObjects.push({ role: "user", content: message })

    // Create endpoint for just getting the completion
    try {
      // Send the text message to the backend
      const response = await API.post("api", "/get-answer", {
        headers: {
          "Content-Type": "application/json",
        },
        body: {
          text: message,
          messages: messageObjects,
          isAudioResponse
        },
      });

      // Remove the thinking bubble
      setMessages((prevMessages) => {
        return prevMessages.filter((message) => message.key !== "thinking");
      });

    } catch (error) {
      console.error("Error sending text message:", error);
      alert(error);
    }


  }
};

The handleSendMessage function uses the theme so let's add a useTheme hook to access the MUI theme in the main App function:

const theme = useTheme();

Let's break down what we're doing in handleSendMessage and examine each step in detail:

1. Check if the message is not empty
The function starts by checking that the message is not an empty string (ignoring leading and trailing whitespace). If it's empty, the function doesn't proceed any further:

if (message.trim() !== "") {
  // ... remaining steps
}

2. Add the user's text message
Update the messages array with the new text message, including its role, content, text and audio:

setMessages((prevMessages) => [
  ...prevMessages,
  { role: "user", content: message, text: message, audio: null },
]);

3. Clear the input field
Clear the input field to allow the user to enter a new message after the response:

setMessage("");

4. Add the thinking bubble
Display the ThinkingBubble component to indicate that the chatbot is processing the user's input.

setMessages((prevMessages) => [
  ...prevMessages,
  {
    role: "assistant",
    content: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />,
    text: <ThinkingBubble theme={theme} sx={{ marginBottom: '-5px' }} />,
    key: "thinking",
  },
]);

5. Create message objects
Use the filterMessageObjects helper function to create an array containing only the role and content keys for each message. Then, push the new text message into the array:

let messageObjects = filterMessageObjects(messages);
messageObjects.push({ role: "user", content: message });

6. Make the backend API call
Use the API.post method from Amplify to send the text message, message object, and the isAudioResponse flag to the backend for processing:

const response = await API.post("api", "/get-answer", {
  headers: {
    "Content-Type": "application/json",
  },
  body: {
    text: message,
    messages: messageObjects,
    isAudioResponse,
  },
});

7. Remove the thinking bubble
Once the response is received, remove the ThinkingBubble component from the messages array:

setMessages((prevMessages) => {
  return prevMessages.filter((message) => message.key !== "thinking");
});

8. Catch any errors
If there are any errors while sending the text message to the backend, log the error message and show an alert:

catch (error) {
  console.error("Error sending text message:", error);
  alert(error);
}

The handleSendMessage function is now handling sending the text message, updating the UI with a thinking bubble, and making a backend API call to process the user's input.

To add functionality for listening to a key event within the MessageInput component, define the handleKeyPress function:

const handleKeyPress = (event) => {
  if (event.key === "Enter") {
    handleSendMessage();
  }
};

The handleKeyPress function checks if the Enter key is pressed. If so, it calls the handleSendMessage function, triggering the message-sending process.

Add handleSendMessage as a prop in MessageInput, which should now look like this:

const MessageInput = ({ message, setMessage, isAudioResponse, handleSendMessage }) => {
  // ....
}

We now just need to add a TextField so the user can use it to type and send a text message. Start by importing the TextField component from MUI:

import {
    Button,
    Container,
    FormControlLabel,
    Grid,
    IconButton,
    List,
    ListItem, 
    ListItemText,
    Paper,
    Switch,
    TextField, // Add to imports
    Typography
} from "@mui/material";

And then import the SendIcon:

import SendIcon from "@mui/icons-material/Send";

Then add the TextField and IconButton within the return statement of the MessageInput component:

return (
  <Box sx={{ display: "flex", alignItems: "center", marginTop: 2 }}>
    <TextField
      variant="outlined"
      fullWidth
      label="Type your message"
      value={message}
      onChange={handleInputChange}
      onKeyPress={handleKeyPress}
    />
    <IconButton
      color="primary"
      onClick={() => handleSendMessage(isAudioResponse)}
      disabled={message.trim() === ""}
    >
      <SendIcon />
    </IconButton>
  </Box>
);

Lastly, add the MessageInput component in the return statement above the ResponseFormatToggle in your App function:

return (
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      <ChatHeader />
      <ChatMessages messages={messages} />
      <AudioControls isAudioResponse={isAudioResponse} filterMessageObjects={filterMessageObjects} messages={messages} setMessages={setMessages}  />
      <MessageInput message={message} setMessage={setMessage} isAudioResponse={isAudioResponse} handleSendMessage={handleSendMessage} />
      <ResponseFormatToggle isAudioResponse={isAudioResponse} setIsAudioResponse={setIsAudioResponse} />
    </Container>
  );

If you check your app, you should now see a text input field where you can type a text message:

App with text input


3.7 Create the backend response handling

Before we can start building the backend, there is one last function we need to build: handleBackendResponse. This function is responsible for transforming the backend response into the format required by the ChatMessages component and is placed inside the App function.

Start by defining the function:

const handleBackendResponse = (response, id = null) => {
  }

We have two arguments: the backend response and id. The id is used to track the user message when it is an audio file and has been transcribed.

Whenever a user sends an audio message, the placeholder chat message is 🎤 Audio Message.
Once the audio has been transcribed into text, we want to add the transcription to the messages array so we can keep track of what the user said to the chatbot.
That's why we keep track of the chat message id.

The backend response will have three keys:

- The generated text (the ChatGPT answer)
- The generated audio (if isAudioResponse is true)
- The transcription of the user's audio message
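
Concretely, the response handled below will look something like this (values are illustrative, and generated_audio is only populated when isAudioResponse is true):

const exampleResponse = {
  generated_text: "Sure, here's an idea for you...",      // the ChatGPT answer
  generated_audio: "SUQzBAAAAA...",                        // Base64-encoded MP3, or null
  transcription: "What's a good name for a coffee shop?", // text version of the user's audio message
};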

Create local variables of each response key:

const generatedText = response.generated_text;
const generatedAudio = response.generated_audio;
const transcription = response.transcription;

Next, let's create an audio element if generated audio is present:

const audioElement = generatedAudio
  ? new Audio(`data:audio/mpeg;base64,${generatedAudio}`)
  : null;

Now, create an AudioMessage component. This chat message can be clicked on by the user if audio is present:

const AudioMessage = () => (
  <span>
    {generatedText}{" "}
    {audioElement && (
      <IconButton
        aria-label="play-message"
        onClick={() => {
          audioElement.play();
        }}
      >
        <VolumeUpIcon style={{ cursor: "pointer" }} fontSize="small" />
      </IconButton>
    )}
  </span>
);

The final step is to add a conditional statement for updating the messages array, put it below the AudioMessage component:

if (id) {
  setMessages((prevMessages) => {
    const updatedMessages = prevMessages.map((message) => {
      if (message.id && message.id === id) {
        return {
          ...message,
          content: transcription,
        };
      }
      return message;
    });
    return [
      ...updatedMessages,
      {
        role: "assistant",
        content: generatedText,
        audio: audioElement,
        text: <AudioMessage />,
      },
    ];
  });
} else {
  // Simply add the response when no messageId is involved
  setMessages((prevMessages) => [
    ...prevMessages,
    {
      role: "assistant",
      content: generatedText,
      audio: audioElement,
      text: <AudioMessage />,
    },
  ]);
}

Let's break down the conditional statement within the handleBackendResponse function and examine each step in detail:

1. Check if id is present
The conditional statement checks if the id argument is provided. If id is present, it means the message is an audio transcription, and we need to update the existing message with the transcribed text. If id is not present, we directly add the chatbot's response to the messages array:

if (id) {
  // ... update existing message and add chatbot's response
} else {
  // ... directly add the chatbot's response
}

2. Update the existing message with the transcription
If id is present, we iterate through the messages array using the map function. For each message, if the message's id matches the provided id, we create a new message object with the same properties and update its content with the transcription:

const updatedMessages = prevMessages.map((message) => {
  if (message.id && message.id === id) {
    return {
      ...message,
      content: transcription,
    };
  }
  return message;
});

3. Add the chatbot's response to the updated messages array
Next, we add the chatbot's response, including the generated text, audio element, and AudioMessage component, to the updatedMessages array:

return [
  ...updatedMessages,
  {
    role: "assistant",
    content: generatedText,
    audio: audioElement,
    text: <AudioMessage />,
  },
];

4. Set the updated messages array
The setMessages function is called with the updated messages array, which contains the transcribed message and the chatbot's response:

setMessages((prevMessages) => {
  // ... logic for updating messages and adding chatbot's response
});

5. Directly add the chatbot's response when no id is involved
If the id is not present, we don't need to update any existing messages. Instead, we directly add the chatbot's response, including the generated text, audio element and AudioMessage component, to the messages array:

setMessages((prevMessages) => [
  ...prevMessages,
  {
    role: "assistant",
    content: generatedText,
    audio: audioElement,
    text: <AudioMessage />,
  },
]);

The entire process ensures that the messages array is updated correctly, whether the user input is a transcribed audio message or a simple text message.

Finally, you'll need to call the handleBackendResponse function in two locations within your code:

1. After removing the thinking bubble in the SendButton component

Add handleBackendResponse as a prop and call the function:

const SendButton = ({ audioFile, isAudioResponse, handleBackendResponse, filterMessageObjects, messages, setMessages }) => {
  // ......
  setMessages((prevMessages) => {
    return prevMessages.filter((message) => message.key !== "thinking");
  });
  handleBackendResponse(response, messageId); // Add function call
// ......

2. After removing the thinking bubble in the handleSendMessage function

Add a call to the handleBackendResponse function:

const handleSendMessage = async () => {
  // ......
  setMessages((prevMessages) => {
    return prevMessages.filter((message) => message.key !== "thinking");
  });
  handleBackendResponse(response); // Add function call
  // ......
}

After adding handleBackendResponse as a prop, update the AudioControls component so it passes the function down to SendButton:

const AudioControls = ({ isAudioResponse, handleBackendResponse, messages, filterMessageObjects, setMessages }) => {
  // ....
  <SendButton 
    audioFile={audioFile} 
    isAudioResponse={isAudioResponse} 
    handleBackendResponse={handleBackendResponse} // Add handleBackendResponse 
    filterMessageObjects={filterMessageObjects} 
    messages={messages} 
    setMessages={setMessages} /> 

}

Also, update the return statement to this:

return (
    <Container maxWidth="sm" sx={{ pt: 2 }}>
      <ChatHeader />
      <ChatMessages messages={messages} />
      <AudioControls isAudioResponse={isAudioResponse} filterMessageObjects={filterMessageObjects} messages={messages} setMessages={setMessages} handleBackendResponse={handleBackendResponse}  />
      <MessageInput message={message} setMessage={setMessage} isAudioResponse={isAudioResponse} handleSendMessage={handleSendMessage} handleBackendResponse={handleBackendResponse}  />
      <ResponseFormatToggle isAudioResponse={isAudioResponse} setIsAudioResponse={setIsAudioResponse} />
    </Container>
  );

We're all set to start building our backend in Python.


4. Create an AWS account

In this guide, we'll use AWS Lambda for the Python backend, powered by AWS API Gateway to handle the REST calls. We'll create the Lambda with our Python code using the Serverless framework.

To begin, you'll need to create a new AWS account if you don't already have one.

1. Visit https://aws.amazon.com and click Sign In to the Console:

AWS website

2. Click Create a new AWS account:

AWS create a new AWS account

3. Complete the signup process:

Finalize signup process


Important

Before proceeding, create a billing alarm to ensure you receive a notification if your bill increases unexpectedly.

Follow these steps to set up a billing alarm: AWS docs


4. Next, create a user on your account. In the top menu, type IAM, then click on IAM from the dropdown:

AWS console

5. Click on Users in the left menu:

AWS IAM console

6. Click on Add users:

AWS IAM users

7. Choose a username for the new user and click Next:

IAM new user

8. Set up permissions for the new user. Click on Attach policies directly:

IAM attach policies

9. Scroll down and type admin in the search field, then select AdministratorAccess:

IAM admin policies

10. Scroll to the bottom of the page and click Next:

IAM admin policies

11. Review the policies, then scroll down to the bottom of the page and click Create user:

IAM review

12. Click on the user you just created:

IAM new user

13. Click on Security credentials:

IAM new user

14. In the Security credentials menu, scroll down to the Access keys section and click Create access key:

IAM user access key

15. Choose Command Line Interface and scroll down to the bottom of the page, then click Next:

IAM user cli

16. Optionally, add tags for the new user, then click Create access key:

IAM user tags

17. You've now reached the final step of creating a new IAM user. Be sure to save the access key and the secret access key:

IAM access key

Either copy the keys and store them in a secure location or download the CSV file. This is crucial, since you won't be able to reveal the secret access key again after this step.

We'll configure your AWS user in the next step, so make sure to have both the access key and the secret access key available.


5. Set up AWS CLI and configure your account

In this section, we'll guide you through installing the AWS Command Line Interface (CLI) and configuring it with your AWS account.

5.1 Install AWS CLI

First, you'll need to install the AWS CLI on your computer.

Follow the installation instructions for your operating system: AWS docs

After the installation is complete, you can verify that the AWS CLI is installed by running the following command in your command prompt:

aws --version

You should see an output similar to this:

aws-cli/2.3.0 Python/3.8.8 Linux/4.14.193-113.317.amzn2.x86_64 botocore/2.0.0

5.2 Configure your AWS CLI

Now that the AWS CLI is installed, you'll need to configure it with your AWS account. Make sure you have your access key and the secret access key from the previous section.

Run the following command in your terminal or command prompt:

aws configure

You'll be prompted to enter your AWS credentials:

- AWS Access Key ID [None]: Enter your access key

- AWS Secret Access Key [None]: Enter your secret access key

Next, you'll need to specify a default region and output format. The region is where your AWS resources will be created. Choose the region that's closest to you or your target audience.

You can find a complete list of available regions and their codes in the AWS documentation: https://docs.aws.amazon.com/general/latest/gr/rande.html

For example:

- Default region name [None]: Enter your desired region code, such as us-east-1

- Default output format [None]: Enter the output format you prefer, such as json

Your AWS CLI is now configured, and you can start using it to interact with your AWS account.
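If you'd like a quick sanity check that the credentials work, here's an optional sketch using boto3 (an extra step of my own, not required for the tutorial; it assumes you've run pip install boto3). boto3 picks up the same credentials that aws configure just stored:

# Optional sanity check: ask AWS who you are with the credentials you just configured
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()

print(identity["Account"])  # Your 12-digit AWS account id
print(identity["Arn"])      # The ARN of the IAM user you created in the previous section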

In the next section, we'll create an AWS Lambda function and configure it with the necessary resources to power your chatbot's backend.


6. Set up a Serverless project with a handler.py file

This section will guide you through creating a new Serverless project with a handler.py file using the Serverless Framework. The handler.py file will contain the code for your AWS Lambda function, which will power your chatbot's backend.


6.1 Install the Serverless Framework

First, you need to install the Serverless Framework on your computer. Make sure you have Node.js installed and then run the following command in your terminal or command prompt:

npm install -g serverless

After the installation is complete, verify that the Serverless Framework is installed by running the following command:

serverless --version

You should see output similar to this:

2.71.1

6.2 Create a new Serverless project

Now that the Serverless Framework is installed, you can create a new Serverless project. First, navigate to the project folder we created earlier, in this example my-chatbot-project:

cd my-chatbot-project

Then run the following command in your terminal or command prompt:

serverless create --template aws-python3 --path backend

We're using --path backend here to create the new project in a folder called backend.

Then navigate to the new backend folder by running:

cd backend

Inside the folder, you'll find two files:

- handler.py: This is the file that contains your AWS Lambda function code

- serverless.yml: This is the configuration file for your Serverless project, which defines the resources, function, and events in your application


6.3 Configure the serverless.yml file

In this section we'll walk through the serverless.yml file configuration, explaining the purpose of each part.

Open the serverless.yml file in your favorite text editor or IDE. You'll need to customize this file to define your chatbot's backend resources, function, and events.

Replace the current code with the following:

service: your-service-name

provider:
  name: aws
  runtime: python3.9
  stage: dev
  region: us-east-1

plugins:
  - serverless-python-requirements

functions:
  chatgpt-audio-chatbot:
    handler: handler.handler
    timeout: 30
    events:
      - http:
          path: get-answer
          method: post
          cors: true

Let's break down and explain the purpose of each part.

1. Service name

service: your-service-name

This line defines the name of your service, which is used by the Serverless Framework to group related resources and functions. Make sure to replace your-service-name with your own service name, for example chatgpt-audio-chatbot.

2. Provider configuration

provider:
  name: aws
  runtime: python3.9
  stage: dev
  region: us-east-1

This section specifies the cloud provider, in our case AWS, and sets up some basic configurations:

- name: The cloud provider for your service (aws)

- runtime: The runtime for your Lambda function (python3.9)

- stage: The stage of your service deployment (dev) - you can use different stages for different environments (e.g. development, staging, production)

- region: The AWS region where your service will be deployed (us-east-1). Make sure to select a region that supports the required services and is closest to your users for lower latency

3. Plugins

plugins:
  - serverless-python-requirements

This section lists any Serverless Framework plugins you want to use in your project. In this case, we're using the serverless-python-requirements plugin to automatically package and deploy any Python dependencies your Lambda function requires.

4. Functions

functions:
  chatgpt-audio-chatbot:
    handler: handler.handler
    timeout: 30
    events:
      - http:
          path: get-answer
          method: post
          cors: true

This section defines the Lambda functions in your service:

- chatgpt-audio-chatbot: The name of the Lambda function

- handler: The reference to the function within your handler.py file (handler.handler) - this tells the Serverless Framework to use the handler function defined in the handler.py file

- timeout: The maximum time your Lambda function is allowed to run before it's terminated, in seconds. We've set it to 30 seconds.

- events: The events that trigger your Lambda function. In this case, we've set up an HTTP event, which is triggered by a POST request to the /get-answer endpoint. The cors: true setting enables CORS (Cross-Origin Resource Sharing) for this endpoint, allowing requests from different origins (e.g. your frontend application)

Now that you have a better understanding of the serverless.yml file, you can customize it to suit the future needs of your chatbot's backend.

In the next section, we'll walk through implementing the Lambda function in the handler.py file.


7. Create the python backend

7.1 Import necessary libraries

Open up the handler.py file, delete all the prewritten code and let's start by importing the necessary libraries:

import json
import base64
import io
import openai
import requests
from requests.exceptions import Timeout

We'll need json to parse and handle JSON data from incoming requests and to generate JSON responses.

base64 will be used to encode and decode the audio data sent and received in requests.

The io library is necessary for handling the in-memory file-like objects used in the audio transcription process.

The openai library enables us to interact with the OpenAI API for transcribing and generating text, while requests will be used to make HTTP requests to the Eleven Labs API for converting text to speech.


7.2 Add your OpenAI API key

Next, let's add our OpenAI API key. Here are the steps for getting your OpenAI API key if you don't already have one.

Go to https://beta.openai.com/, log in, click on your avatar and select View API keys:

Open AI API keys

Then create a new secret key and save it for the request:

Create OpenAI API key

Remember that you'll only be able to reveal the secret key once, so make sure to save it somewhere for the next step.

Then add your API key to the handler.py file below the library imports:

openai.api_key = "sk-YOUR_API_KEY"

7.3 Create function to transcribe audio

For the user audio, we'll need to create a function that takes the audio data as input and returns the transcribed text using the OpenAI API. This function will be called transcribe_audio and will accept a single argument, audio_data.

Add the function to handler.py:

def transcribe_audio(audio_data):
    # Convert the audio data into a file-like object using io.BytesIO
    with io.BytesIO(audio_data) as audio_file:
        audio_file.name = "audio.mp3"  # Add a name attribute to the BytesIO object
        
        # Use the OpenAI API to transcribe the audio, specifying the model, file, and language
        response = openai.Audio.transcribe(model="whisper-1", file=audio_file, language="en")
    
    # Extract the transcribed text from the response
    transcription = response["text"]
    
    return transcription

In this function, we first create a file-like object using io.BytesIO by passing in the audio_data. We then add a name attribute to the BytesIO object to indicate that it is an MP3 file.

Next, we call the openai.Audio.transcribe method, providing the model whisper-1, the audio_file object, and specifying the language as en (English).

The API call returns a response containing the transcribed text, which we extract and return from the function.
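If you want to try the function on its own before wiring up the Lambda handler, here's a minimal local sketch. It assumes a short sample.mp3 file sitting next to handler.py (a hypothetical file name) and that openai.api_key is already set:

# Quick local test for transcribe_audio, run from the bottom of handler.py
with open("sample.mp3", "rb") as f:  # sample.mp3 is a hypothetical test recording
    audio_bytes = f.read()

print(transcribe_audio(audio_bytes))  # Prints the transcribed English text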


7.4 Create function to generate a text reply

Once we have the audio transcribed, we'll need to create a function that calls the OpenAI API to generate a chat completion based on the user message.

Let's create the function generate_chat_completion to achieve this:

def generate_chat_completion(messages):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=100, # Optional, change to desired value
    )
    return response.choices[0].message["content"]

Now, let's break down the function:

1. The generate_chat_completion function takes a single argument, messages, which is a list of message objects from the frontend.

2. We call the openai.ChatCompletion.create method to generate a chat completion using the gpt-3.5-turbo model. We pass the messages list as an argument to the method.
The messages are formatted in the frontend and should be a list of dictionaries, each containing a role, either system, user, or assistant, and content, which is the message text. We've also added the max_tokens parameter and set it to 100.

When generating chat completions using GPT, you might want to limit the length of the responses to prevent excessively long answers. You can do this by setting the max_tokens parameter when making the API call. In our generate_chat_completion function, we've added the max_tokens parameter and set it to 100.

By setting max_tokens to 100, we limit the response to a maximum of 100 tokens. You can adjust this value according to your requirements.

Keep in mind that if you set it too low, the generated text might be cut off and not make sense to users. Experiment with different values to find the best balance between response length and usability.

3. The API call returns a response that contains a list of choices, with each choice representing a possible chat completion. In our case, we simply select the first choice response.choices[0].

4. Finally, we extract the content of the message from the first choice using response.choices[0].message["content"].

With this function in place, we can now generate a text reply based on the transcribed user audio and any other messages provided in the messages list.
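As a quick illustration, here's a minimal sketch of what the messages list from the frontend typically looks like and how the function is called (the content is just an example):

# Example message list in the format the ChatGPT API expects
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

reply = generate_chat_completion(example_messages)
print(reply)  # e.g. "The capital of France is Paris."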


7.5 Create function to generate audio from text

Now that we have the text reply generated by our chatbot, we might want to convert it back to audio if the flag isAudioResponse is true. For this, we'll create a function called generate_audio that uses the ElevenLabs API to synthesize speech from the generated text.

ElevenLabs has a generous free tier with API access - just remember to add an attribution to elevenlabs.io when on the free tier:

ElevenLabs pricing
Screenshot from March 2023

Start by creating a free ElevenLabs account, if you don't already have one. Visit https://beta.elevenlabs.io/ and click Sign up:

ElevenLabs sign up

Then click on the avatar in the upper right corner and click Profile:

ElevenLabs dashboard

Copy your API key and have it available for the next step when we're calling the ElevenLabs API:

ElevenLabs API key

The last step is to get a voice id for the API call. Go back to your dashboard, click Resources and then API:

ElevenLabs resources

Click to expand the documentation for text-to-speech:

ElevenLabs API overview

Here you'll find a voice id we'll use when synthesizing the audio in our backend, copy and save the voice id for the next steps:

ElevenLabs voice id

Let's finally create the function to synthesize speech from the generated text:

def generate_audio(generated_text):
    # API key
    api_key = "YOUR_API_KEY"

    # Voice id
    voice_id = "21m00Tcm4TlvDq8ikWAM"

    # Voice params
    data = {
        "text": generated_text,
        "voice_settings": {
            "stability": 0,
            "similarity_boost": 0
        }
    }

    # Call endpoint
    url = f'https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?api_key={api_key}'
    headers = {
        'accept': 'audio/mpeg',
        'Content-Type': 'application/json'
    }

    try:
        # Abort the request if ElevenLabs doesn't respond within 15 seconds
        response = requests.post(url, headers=headers, json=data, timeout=15)
    except Timeout:
        print("ElevenLabs request timed out")
        return None

    # Bytes type is not JSON serializable
    # Convert to a Base64 string
    return base64.b64encode(response.content).decode('utf-8')

Let's break down this function step by step:

1. We define the API key and voice ID as variables. Replace YOUR_API_KEY with the ElevenLabs API key you copied earlier:

api_key = "YOUR_API_KEY"

voice_id = "21m00Tcm4TlvDq8ikWAM"

2. We create a dictionary called data that contains the generated text and voice settings. The text key contains the text that we want to convert to speech. The voice_settings key is a dictionary containing options for controlling the stability and similarity of the generated voice:

data = {
    "text": generated_text,
    "voice_settings": {
        "stability": 0,
        "similarity_boost": 0
    }
}

3. We define the API endpoint URL using the voice ID and the API key. The URL includes the base endpoint, https://api.elevenlabs.io/v1/text-to-speech/ followed by the voice ID and the API key as a query parameter:

url = f'https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?api_key={api_key}'

4. We set up the HTTP headers for our API request. The accept header indicates that we expect the response to be in the audio/mpeg format, while the Content-Type header specifies that we will send JSON data in our request:

headers = {
    'accept': 'audio/mpeg',
    'Content-Type': 'application/json'
}

5. We then use the requests.post method to make a POST request to the API endpoint, passing the headers and JSON data as arguments, along with a 15-second timeout. The API call returns a response containing the synthesized audio data:

response = requests.post(url, headers=headers, json=data, timeout=15)

Try block for timeout
In some cases, the ElevenLabs API takes a long time to respond, which can cause the API Gateway to time out while waiting. To handle this, we've added a timeout of 15 seconds to the generate_audio function. This ensures that our application doesn't hang indefinitely while waiting for a response from the API and provides a more predictable user experience.

If the API doesn't respond within 15 seconds, the request is terminated and the function returns None. We wrap the request in a try block and catch the requests.exceptions.Timeout exception.

6. Since the audio data is in bytes format, which is not JSON serializable, we need to convert it to a Base64 string. We use the base64.b64encode method to do this and then decode the result to a UTF-8 string using the decode method:

base64.b64encode(response.content).decode('utf-8')

7. Finally, we return the Base64-encoded audio data as the output of the function.

With this generate_audio function, we can now convert the text reply generated by our chatbot back into an audio format that can be played by the user.
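To hear the result locally before deploying, you can decode the Base64 string back to bytes and write it to an MP3 file. A minimal sketch, assuming you run it at the bottom of handler.py where base64 is already imported and your ElevenLabs API key is set inside generate_audio:

# Quick local test: synthesize a short phrase and save it as reply.mp3
audio_base64 = generate_audio("Hello! This is a quick ElevenLabs test.")

if audio_base64 is not None:  # generate_audio returns None if the request times out
    with open("reply.mp3", "wb") as f:
        f.write(base64.b64decode(audio_base64))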


7.6 Create the handler function to tie everything together

Finally, we need to create the main handler function that will be triggered by the API Gateway event. This function will tie together all the other functions we've created, allowing us to process the incoming request, transcribe audio, generate chat completions, and create audio responses.

Add the handler function to your handler.py file:

def handler(event, context):
    try:
        body = json.loads(event["body"])

        if 'audio' in body:
            audio_base64 = body["audio"]
            audio_data = base64.b64decode(audio_base64.split(",")[-1])
            transcription = transcribe_audio(audio_data)
            message_objects = body['messages'] + [{"role": "user", "content": transcription}]

        elif 'text' in body:
            transcription = body['text']
            message_objects = body['messages']
        else:
            raise ValueError("Invalid request format. Either 'audio' or 'text' key must be provided.")

        generated_text = generate_chat_completion(message_objects)
        
        # Check if audio response
        is_audio_response = body.get('isAudioResponse', False)

        if is_audio_response:
            generated_audio = generate_audio(generated_text)
        else:
            generated_audio = None

        response = {
            "statusCode": 200,
            "headers": {"Access-Control-Allow-Origin": "*"},
            "body": json.dumps(
                {"transcription": transcription, "generated_text": generated_text, "generated_audio": generated_audio}),
        }
        return response

    except ValueError as ve:
        import traceback
        print(traceback.format_exc())
        print(f"ValueError: {str(ve)}")
        response = {
            "statusCode": 400,
            "body": json.dumps({"message": str(ve)}),
        }
        return response
    except Exception as e:
        import traceback
        print(traceback.format_exc())
        print(f"Error: {str(e)}")
        response = {
            "statusCode": 500,
            "body": json.dumps({"message": "An error occurred while processing the request."}),
        }
        return response

Let's break down this handler function step by step:

1. We start by defining the handler function with two arguments: event and context. The event object contains the data from the API Gateway event, and context contains runtime information:

def handler(event, context):

2. We then extract the body from the event object by loading it as a JSON object:

body = json.loads(event["body"])

3. We then check if the body contains an audio key. If it does, we decode the base64-encoded audio data and transcribe it using the transcribe_audio function. We create a message_objects list by combining the existing messages from the frontend data with the transcribed message:

if 'audio' in body:
    audio_base64 = body["audio"]
    audio_data = base64.b64decode(audio_base64.split(",")[-1])
    transcription = transcribe_audio(audio_data)
    message_objects = body['messages'] + [{"role": "user", "content": transcription}]

4. If the body contains a text key instead, we simply use the text provided and create the message_objects list from the frontend data:

elif 'text' in body:
    transcription = body['text']
    message_objects = body['messages']

5. If neither audio nor text keys are present, we raise a ValueError to indicate that the request format is invalid:

else:
    raise ValueError("Invalid request format. Either 'audio' or 'text' key must be provided.")

6. We then call the generate_chat_completion function, passing the message_objects list as an argument. This returns the generated text response from our Chatbot:

generated_text = generate_chat_completion(message_objects)

7. We check if the body contains an isAudioResponse key and use its value to determine if we should generate an audio response from the generated text:

is_audio_response = body.get('isAudioResponse', False)

8. If an audio response is requested from the frontend, we call the generate_audio function to convert the generated text back to audio. If not, we set generated_audio to None:

if is_audio_response:
    generated_audio = generate_audio(generated_text)
else:
    generated_audio = None

9. We create a response dictionary with the following keys:

- statusCode: The HTTP status code for the response. We set it to 200, indicating a successful operation.

- headers: The HTTP headers to include in the response. We set the Access-Control-Allow-Origin header to * to enable cross-origin requests.

- body: The response body, which we serialize as a JSON object. It contains the following keys:
  - transcription: The transcribed text from the user's audio input
  - generated_text: The generated text response from the chatbot
  - generated_audio: The generated audio response if requested, encoded as a Base64 string:

response = {
    "statusCode": 200,
    "headers": {"Access-Control-Allow-Origin": "*"},
    "body": json.dumps(
        {"transcription": transcription, "generated_text": generated_text, "generated_audio": generated_audio}),
}

10. We return the response dictionary:

return response

11. If a ValueError occurs, e.g., due to an invalid request format, we catch the exception, print the traceback, and return a 400 status code along with an error message:

except ValueError as ve:
    import traceback
    print(traceback.format_exc())
    print(f"ValueError: {str(ve)}")
    response = {
        "statusCode": 400,
        "body": json.dumps({"message": str(ve)}),
    }
    return response

12. If any other exception occurs, we catch the exception, print the traceback, and return a 500 status code along with a generic error message:

except Exception as e:
    import traceback
    print(traceback.format_exc())
    print(f"Error: {str(e)}")
    response = {
        "statusCode": 500,
        "body": json.dumps({"message": "An error occurred while processing the request."}),
    }
    return response

With the handler function complete, we now have a fully functional backend for our chatbot that can handle text and audio input, generate chat completions using OpenAI and return text or audio responses as needed.
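Before deploying, you can also invoke the handler locally with a fake API Gateway event to make sure the pieces fit together. A minimal sketch, run from the bottom of handler.py (the context argument isn't used by our code, so passing None is fine):

# Simulate the event object API Gateway would pass to the Lambda function
fake_event = {
    "body": json.dumps({
        "messages": [{"role": "system", "content": "You are a helpful assistant."}],
        "text": "What is the capital of France?",
        "isAudioResponse": False
    })
}

result = handler(fake_event, None)
print(result["statusCode"])                          # 200 if everything worked
print(json.loads(result["body"])["generated_text"])  # The chatbot's reply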


8. Deploying the backend to AWS

Now that we have our chatbot backend implemented in the handler.py file, it's time to deploy it to AWS using the Serverless Framework. In this section, we'll go through the deployment process step by step.


8.1 Ensure AWS credentials are configured

Before deploying, ensure that you have properly set up your AWS credentials on your local machine. If you haven't done this yet, refer to section 5.2 for a detailed guide on configuring your AWS CLI.


8.2 Install dependencies

Before deploying the backend, we need to install the required Python packages. In your backend folder, create a requirements.txt file and add the following dependencies:

openai
requests

8.3 Install and configure the serverless-python-requirements plugin

Before deploying the Serverless project, you need to ensure that you have the serverless-python-requirements plugin installed and configured. This plugin is essential for handling your Python dependencies and packaging them with your Lambda function.

To install the plugin, run the following command in your project directory:

npm install --save serverless-python-requirements

This command will add the plugin to your project package.json file and install it in the node_modules folder.

With the plugin installed, the Serverless Framework will automatically package the Python dependencies listed in requirements.txt (openai and requests) and include them in the deployment.


8.4 Deploy the backend

Now that we have our AWS credentials configured and our dependencies installed, it's time to deploy the backend. Open a terminal, navigate to your backend folder located in your project folder, and run the following command:

serverless deploy

This command will package and deploy your Serverless service to AWS Lambda. The deployment process might take a few minutes. Once the deployment is completed, you'll see output similar to this:

Service Information
service: chatgpt-audio-chatbot
stage: dev
region: us-east-1
stack: chatgpt-audio-chatbot-dev
endpoints:
  POST - https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev/get-answer
functions:
  chatgpt-audio-chatbot: chatgpt-audio-chatbot-dev-chatgpt-audio-chatbot

Take note of the endpoints section, as it contains the API Gateway URL for your deployed Lambda function. You'll need this URL in the next section when we make requests to the chatbot backend from the frontend.


8.5 Locating the deployed Lambda function in the AWS console

Once your backend is successfully deployed, you may want to explore and manage your Lambda function using the AWS Console. In this section, we'll guide you through the process of finding your deployed Lambda function in the AWS Console.

1. Sign in to your AWS Management Console: https://aws.amazon.com/console/

2. Under the "Services" menu, navigate to "Lambda" or use the search bar to find and select "Lambda" to open the AWS Lambda Console.

3. In the AWS Lambda Console, you'll see a list of all the Lambda functions deployed in the selected region. The default function name will be in the format service-stage-function, where service is the service name defined in your serverless.yml file, stage is the stage you deployed to (e.g., dev), and function is the function name you defined in the same file.

For example, if your serverless.yml has the following configurations:

service: chatgpt-audio-chatbot
...
functions:
  chatgpt-audio-chatbot:
    handler: handler.handler

The Lambda function will have a name like chatgpt-audio-chatbot-dev-chatgpt-audio-chatbot.

4. Click on the Lambda function name in the list to view its details, configuration, and monitoring information. On the Lambda function details page, you can:

- Edit the function code in the inline code editor (for smaller functions), or download the deployment package to make changes offline.

- Modify environment variables, memory, timeout, and other settings.

- Add triggers, layers, or destinations.

- View monitoring data, such as invocation count, duration, and error rate in the Monitoring tab.

- Access CloudWatch Logs to view and search the function's logs in the Monitoring tab, by clicking on View logs in CloudWatch

5. Additionally, you can navigate to the API Gateway console to view and manage the API Gateway that's integrated with your Lambda function:

- In the AWS Management Console, search for API Gateway under the Services menu or use the search bar.

- Select the API Gateway that corresponds to your serverless.yml configuration (e.g., chatgpt-audio-chatbot-dev if your service name is chatgpt-audio-chatbot and the stage is dev).

- In the API Gateway Console, you can view and manage resources, methods, stages, and other settings for your API. You can also test the API endpoints directly from the console.

By following these steps, you can locate, manage, and monitor your deployed Lambda function and other AWS resources from the AWS Management Console. This allows you to better understand your application's performance, troubleshoot any issues, and make any necessary updates to the backend as needed.
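If you prefer checking from code rather than clicking through the console, here's an optional boto3 sketch (an extra of my own, not part of the tutorial setup; it requires pip install boto3 and the CLI credentials from section 5) that lists the deployed functions matching our service name:

# List deployed Lambda functions whose names contain our service name
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

for function in lambda_client.list_functions()["Functions"]:
    if "chatgpt-audio-chatbot" in function["FunctionName"]:
        print(function["FunctionName"], function["Runtime"])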


8.6 Test the deployed backend

To ensure that your backend is working correctly, you can use a tool like Postman or curl to send a test request to the API Gateway URL. Replace https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com with the API Gateway URL you received when you deployed the backend:

For a text-based request:

curl -X POST https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev/get-answer \
-H "Content-Type: application/json" \
-d '{
    "messages": [{"role": "system", "content": "You are a helpful assistant."}],
    "text": "What is the capital of France?",
    "isAudioResponse": false
}'

For an audio-based request, replace your_base64_encoded_audio_string with an actual Base64 encoded audio string:

curl -X POST https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev/get-answer \
-H "Content-Type: application/json" \
-d '{
    "messages": [{"role": "system", "content": "You are a helpful assistant."}],
    "audio": "your_base64_encoded_audio_string",
    "isAudioResponse": false
}'

You should receive a response containing the transcription of the user's input, the generated text from the chatbot, and (optionally) the generated audio if isAudioResponse is set to true. For example:

{
    "transcription": "What is the capital of France?",
    "generated_text": "The capital of France is Paris.",
    "generated_audio": null
}

If you receive an error, double-check your request payload and ensure that your Lambda function has the correct permissions and environment variables set.
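If you'd rather test from Python than curl, here's an equivalent sketch using the requests library (replace the URL with your own API Gateway endpoint):

import requests

api_url = "https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev/get-answer"

payload = {
    "messages": [{"role": "system", "content": "You are a helpful assistant."}],
    "text": "What is the capital of France?",
    "isAudioResponse": False
}

response = requests.post(api_url, json=payload, timeout=30)
print(response.status_code)
print(response.json())  # {'transcription': ..., 'generated_text': ..., 'generated_audio': None}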


9. Update the frontend

Now that your backend is deployed and working correctly, let's update the frontend application to use the API Gateway URL. We'll leverage AWS Amplify to configure the API call and make it easy to interact with our backend.

First, open the App.js file in your frontend project. Import Amplify from aws-amplify:

import { 
  Amplify, // Add to imports
  API 
} from "aws-amplify";

Just before the function App(), add the Amplify configuration, including the API endpoint you received when you deployed the backend:

Amplify.configure({
  // OPTIONAL - if your API requires authentication 
  Auth: {
    mandatorySignIn: false,
  },
  API: {
    endpoints: [
      {
        name: "api",
        endpoint: "https://xxxxxxxxxx.execute-api.us-east-1.amazonaws.com/dev"
      }
    ]
  }
});

Make sure to replace xxxxxxxxxx with the actual endpoint from the backend deployment.

With your backend deployed and your frontend updated, your ChatGPT Audio Chatbot is now ready to use!

Let's try it out. Here's how it works:

Chatbot demo


10. Redeploying after changes

If you make any changes to your backend code or serverless.yml configuration, you can redeploy your service by running serverless deploy again. The Serverless Framework will update your AWS resources accordingly.

Remember to test your backend after each deployment to ensure everything is working as expected.

That's it! You have successfully created and deployed a ChatGPT Audio Chatbot using OpenAI, AWS Lambda, and the Serverless Framework. Your chatbot is now ready to receive and respond to both text and audio-based requests.


The source code

Do you want the full source code?
This tutorial is quite extensive, and following along step-by-step may be time-consuming.

Visit this page to download the entire source code, you'll get instant access to the files, which you can use as a reference or as a starting point for your own voice-powered ChatGPT bot project.


Improvements

Protecting the Lambda API Endpoint
Currently, our Lambda function is openly accessible, which can lead to potential misuse or abuse. To secure the API endpoint, you can use Amazon API Gateway's built-in authentication and authorization mechanisms. One such mechanism is Amazon Cognito, which provides user sign-up and sign-in functionality, as well as identity management.

By integrating Amazon Cognito with your API Gateway, you can ensure that only authenticated users have access to your chatbot API. This not only secures your API but also enables you to track and manage user access, providing a more robust and secure experience.

Error Handling
The chatbot application could benefit from more comprehensive error handling. This would involve checking for error responses from the text-to-speech API, the speech-to-text API, and the Lambda function, and gracefully displaying relevant error messages to the user. This would help users understand any issues encountered during their interaction with the chatbot.
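For example, on the backend you could inspect the status code of the ElevenLabs response before encoding it, so the frontend receives a clear error message instead of broken audio. A small, hypothetical helper as a sketch of the idea:

import base64

def encode_tts_response(response):
    # Hypothetical helper: return the Base64 audio on success, or a readable error on failure
    if response.status_code != 200:
        return {"error": f"Text-to-speech failed with status {response.status_code}"}
    return {"audio": base64.b64encode(response.content).decode("utf-8")}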

Saving Chat History to a Database
Currently, the chat history between the user and the chatbot is stored in the application's state, which means that the messages disappear when the page is refreshed. To preserve the chat history, you can save the conversation to a database. This can be achieved using a variety of database solutions, such as Amazon DynamoDB or MongoDB.
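As a starting point, here's a minimal boto3 sketch for DynamoDB, assuming a hypothetical ChatHistory table with a conversation_id partition key and a numeric timestamp sort key:

import time
import boto3

# Hypothetical table: partition key "conversation_id", sort key "timestamp"
table = boto3.resource("dynamodb", region_name="us-east-1").Table("ChatHistory")

def save_message(conversation_id, role, content):
    # Store one chat message so the conversation survives page refreshes
    table.put_item(Item={
        "conversation_id": conversation_id,
        "timestamp": int(time.time() * 1000),
        "role": role,
        "content": content,
    })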

Get in Touch for Assistance or Questions

Do you need help implementing the ChatGPT chatbot, or have any other questions related to this tutorial? I'm more than happy to help. Don't hesitate to reach out by sending an email to norah@braine.ai

Alternatively, feel free to shoot me a DM on Twitter @norahsakal.

I look forward to hearing from you and assisting with your ChatGPT chatbot journey!


Originally published at https://norahsakal.com/blog/chatgpt-audio-chatbot
