U.S. Pat. No. 11,605,388
Speaker Conversion for Video Games
Assignee: Electronic Arts Inc.
Issue Date: November 9, 2020
Abstract
This specification describes a computer-implemented method of generating speech audio for use in a video game, wherein the speech audio is generated using a voice convertor that has been trained to convert audio data for a source speaker into audio data for a target speaker. The method comprises receiving: (i) source speech audio, and (ii) a target speaker identifier. The source speech audio comprises speech content in the voice of a source speaker. Source acoustic features are determined for the source speech audio. A target speaker embedding associated with the target speaker identifier is generated as output of a speaker encoder of the voice convertor. The target speaker embedding and the source acoustic features are inputted into an acoustic feature encoder of the voice convertor. One or more acoustic feature encodings are generated as output of the acoustic feature encoder. The one or more acoustic feature encodings are derived from the target speaker embedding and the source acoustic features. Target speech audio is generated for the target speaker. The target speech audio comprises the speech content in the voice of the target speaker. The generating comprises decoding the one or more acoustic feature encodings using an acoustic feature decoder of the voice convertor.
Description
DETAILED DESCRIPTION
General Definitions
The following terms are defined to aid the present disclosure and not limit the scope thereof.
A “user” or “player”, as used in some embodiments herein, refers to an individual and/or the computing system(s) or device(s) corresponding to (e.g., associated with, operated by) that individual.
A “client” as used in some embodiments described herein, is a software application with which a user interacts, and which can be executed on a computing system or device locally, remotely, or over a cloud service.
A “server” as used in some embodiments described herein, is a software application configured to provide certain services to a client, e.g. content and/or functionality.
A “video game” as used in some embodiments described herein, is a virtual interactive environment in which players engage. Video game environments may be facilitated through a client-server framework in which a client may connect with the server to access at least some of the content and functionality of the video game.
“Speech” as used in some embodiments described herein may include sounds in the form of spoken words in any language, whether real or invented and/or other utterances including paralinguistics such as sighs, yawns, moans etc. “Speech audio” refers to audio (e.g. audio data) which includes or represents speech, and may comprise data in any suitable audio file format whether in a compressed or uncompressed format.
“Acoustic features” as used in some embodiments described herein may include any suitable acoustic representation of frequency, magnitude and/or phase information. For example, acoustic features may comprise linear spectrograms, log-mel-spectrograms, linear predictive coding (LPC) coefficients, Mel-Frequency Cepstral Coefficients (MFCC), log fundamental frequency (LF0), band aperiodicity (bap) or combinations thereof.
Example implementations provide systems and methods for generating speech audio in a video game, using a voice convertor to generate audio data for a target speaker. Specifically, the voice convertor is configured to convert acoustic features relating to a source speaker into audio data (e.g. acoustic features, speech audio) for the target speaker.
The described systems and methods are particularly advantageous for use in the context of video games (e.g. in video game development, or after deployment of a video game). In video games, it may be desirable to align speech audio with the animation of a particular character or scene in the game. In video games that allow players to select different voices for characters, it may be useful to ensure that the timings of recorded speech audio for the different voices closely match each other, and also match the animation of the character or scene.
The methods and systems described in this specification enable speech audio to be generated in a target speaker's voice, while maintaining the performance (e.g. speech prosody) and timing of source speech audio from which the acoustic features relating to a source speaker are derived. This may be achieved by, for example, learning suitable speaker embeddings and using these learned speaker embeddings when converting voices. Suitable speaker embeddings may accurately reflect characteristics of a speaker's voice and disentangle these voice characteristics from other aspects of speech audio (such as speech prosody). In some embodiments, a speaker embedding may be generated using a trained speaker encoder, as described in more detail below. Alternatively, or in addition, new voices may be created by combining previously determined speaker embeddings (for example by summing two or more embeddings together) and/or by varying known embeddings, so as to generate new embeddings with different voice characteristics (e.g. different gender, age etc). In this way, a “space” of voices may be created which an operator may explore until they find an embedding corresponding to a desired voice for the target speaker. In such embodiments, a target speaker embedding may thus be determined without the use of a trained speaker encoder and the speaker encoder may be omitted from the voice convertor during inference.
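By way of a non-limiting illustration, the creation of new voices by combining previously determined speaker embeddings may be sketched as follows. The function names, the four-dimensional vectors, and the use of interpolation weights are assumptions made purely for illustration; the specification itself only gives summing and varying embeddings as examples.

```python
def combine_embeddings(embeddings, weights=None):
    """Return a weighted sum of equal-length speaker embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("embeddings must share one dimensionality")
    if weights is None:
        # Plain summation, as in the "summing two or more embeddings" example.
        weights = [1.0] * len(embeddings)
    if len(weights) != len(embeddings):
        raise ValueError("one weight per embedding is required")
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]


def interpolate(a, b, t):
    """Move from embedding a (t=0) towards embedding b (t=1)."""
    return combine_embeddings([a, b], [1.0 - t, t])


# Hypothetical 4-dimensional embeddings; real embeddings would be learned.
speaker_a = [0.5, -1.0, 0.25, 2.0]
speaker_b = [1.5, 1.0, -0.25, 0.0]

summed = combine_embeddings([speaker_a, speaker_b])
halfway = interpolate(speaker_a, speaker_b, 0.5)
```

Sweeping the interpolation parameter between 0 and 1 is one simple way an operator might explore the “space” of voices lying between two known speakers.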
Timing information of source speech audio may be maintained through use of an attention mechanism in the voice convertor. The attention mechanism may be trained to utilize appropriate segments/frames of an encoding of source acoustic features when generating target acoustic features. Furthermore, the systems and methods described herein can accommodate differences in the length of a target speaker's waveform compared to the source speaker's waveform due to the use of sequence-to-sequence and encoder-decoder architectures as described below.
The methods and systems described herein allow voice conversion to (or voice cloning of) a target voice using a small amount of speech audio data for the target speaker relative to other techniques which do not use a voice convertor as described herein. However, in some embodiments, additional speech audio data for the target speaker may be leveraged (e.g. as additional training data) in order to provide improved quality output. In some examples, a pre-trained speaker encoder may be provided to generate speaker embeddings as described below. While in some examples the speaker encoder may be retrained or finetuned using audio samples from the target speaker, in other examples the speaker encoder may be entirely pretrained, using speech audio samples provided by other speakers. Training using more speech samples provided by different speakers may allow for better quality (e.g. more realistic) speech audio to be output. For example, a more representative speaker encoder that outputs more suitable speaker embeddings (e.g. that more accurately reflect characteristics of a speaker's voice) may be learned as a result of training with more speech samples. A target speaker embedding learned in this way may be used to realistically convert the voice of source speech audio into the voice of the target speaker. In some embodiments, the target speaker may be the player of a video game. That is, the methods and systems described in this specification enable speech audio to be generated in the voice of a player of a video game, using speech audio provided by the player (e.g. a small amount of speech audio such as minutes of speech audio from the player), and the voice convertor.
Example Video Game Environment
FIG. 1 illustrates an example of a computer system configured to provide a video game environment 100 to players of a video game.
The video game environment 100 includes video game server apparatus 107, and one or more client computing devices 101. Each client computing device 101 is operable by a user and provides a client in the form of gaming application 102 to the user. The client computing device 101 is configured to communicate with the video game server apparatus 107 which provides a game server 111 for providing content and functionality to the gaming application 102. For the sake of clarity, the video game environment 100 is illustrated as comprising a specific number of devices. Any of the functionality described as being performed by a specific device may instead be performed across a number of computing devices, and/or functionality described as being performed by multiple devices may be performed on a single device. For example, multiple instances of the video game server apparatus 107 (or components thereof) may be hosted as virtual machines or containers on one or more computing devices of a public or private cloud computing environment.
The video game server apparatus 107 provides speech audio generator 108. The speech audio generator 108 receives source speech audio comprising speech content (e.g. pre-recorded speech audio), and a target speaker identifier for a target speaker, and outputs speech audio in the target speaker's voice corresponding to the speech content. The source speech audio may be pre-recorded speech audio and may be derived from game content 105, such as from speech audio 106 stored on the client computing device 101, or speech audio 113 stored on the server apparatus 107. The speaker identifier is any data that can be associated with (e.g. used to identify) an individual speaker. In some embodiments, the speaker identifier is a different one-hot vector for each speaker whose voice can be synthesized in output of the speech audio generator 108. In other embodiments, speech samples (or indications thereof, e.g. acoustic features) provided by that particular speaker are used as a speaker identifier. The target speaker may be a user of the computing device 101, such as a player of the video game, or any other speaker whose voice samples were used to train the speech audio generator 108. Other inputs, such as animation information, linguistic information, timing information, and speaker attributes (such as gender and/or age of the speaker) may also be provided as input to the speech audio generator 108.
The client computing device 101 can be any computing device suitable for providing the gaming application 102 to the user. For example, the client computing device 101 may be any of a laptop computer, a desktop computer, a tablet computer, a video games console, or a smartphone. For displaying the graphical user interfaces of computer programs to the user, the client computing device includes or is connected to a display (not shown). Input device(s) (not shown) are also included or connected to the client. Examples of suitable input devices include keyboards, touchscreens, mice, video game controllers, microphones and cameras.
Gaming application 102 provides a video game to the user of the client computing device 101. The gaming application 102 may be configured to cause the client computing device 101 to request video game content from the video game server apparatus 107 while the user is playing the video game. Requests made by the gaming application 102 are received at the request router 112 of game server 111, which processes the request, and returns a corresponding response (e.g. synthesized speech audio generated by the speech audio generator 108) to gaming application 102. Examples of requests include Application Programming Interface (API) requests, e.g. a representational state transfer (REST) call, a Simple Object Access Protocol (SOAP) call, a message queue; or any other suitable request.
The gaming application 102 provides an audio input module 103 for use by the user of computing device 101. The audio input module 103 is configured to enable a player of the video game to input player speech samples for use in refining the voice convertor 109 (or components thereof) of speech audio generator 108. The audio input module 103 transmits player speech audio to the request router 112, from which it is subsequently transmitted to the speech audio generator 108. The player speech audio may be any suitable digital data and may for example represent a waveform of the player speech samples (e.g. transmitted as an MP3 file, a WAV file, etc). The player speech audio may comprise acoustic features of the player speech sample. Acoustic features may comprise any low-level acoustic representation of frequency, magnitude and phase information such as linear spectrograms, log-mel-spectrograms, linear predictive coding (LPC) coefficients, Mel-Frequency Cepstral Coefficients (MFCC), log fundamental frequency (LF0), band aperiodicity (bap) or combinations thereof. The acoustic features may comprise a sequence of vectors, each vector representing acoustic information in a short time period, e.g. 50 milliseconds.
The gaming application 102 provides an audio receiver module 104 configured to receive output of the speech audio generator 108. The audio receiver module 104 may be configured to request speech audio from the speech audio generator 108 throughout different stages of the video game. For example, some of the speech content may be predetermined (e.g. stored in game content 105, and/or at game server 111) and so the audio receiver module 104 may request synthesized speech audio for the predetermined content at the same time, e.g. during a loading process. The speech audio may be received at the audio receiver module 104 as a waveform (e.g. represented in an MP4 file, a WAV file, etc).
The gaming application 102 comprises game content 105 accessed while the video game is being played by the player. The game content 105 includes speech audio 106, and other assets such as speech scripts, markup-language files, scripts, images and music. The speech audio 106 comprises audio data for entities/characters in the video game, which may be output by the gaming application 102 at appropriate stages of the video game. The speech audio 106 (or a portion thereof) may have corresponding speech scripts which are transcriptions of the speech audio 106.
The speech audio 106 (and, potentially, speech scripts/transcriptions thereof) is also used if a player of the video game decides to add their voice to speech audio generator 108. During an initialization process, the player is provided with examples of speech audio 106 and is asked to provide player speech samples corresponding to the speech audio 106, which samples are used to refine (i.e. further train) the voice convertor 109 (or components thereof). For example, the user may be asked to mimic an example of speech audio 106 such that the player provides the same speech content as the example, in a speech style (e.g. prosody) similar to that of the example. Additionally or alternatively, a transcript may be provided for the user to recite. The resulting player speech sample may be associated with the example of speech audio 106 as a “paired” training example for use in refining the voice convertor 109 (or components thereof), as will be described in relation to FIGS. 5 and 6. In some implementations, speech audio 113 and/or speech scripts may also be stored at game server 111.
As will be described in further detail in relation to FIG. 2, the speech audio generator 108 comprises a voice convertor 109, and optionally, a vocoder 110. The speech audio generator 108 receives source speech audio comprising speech content and determines source acoustic features thereof. The source speech audio comprises a waveform of speech audio. The source acoustic features are transformed, in accordance with a target speaker identifier, by the voice convertor 109 into audio data relating to the speech content in the voice of the target speaker. For example, the audio data may comprise target acoustic features. Speech audio in the target speaker's voice (otherwise referred to as target speech audio herein) may be output by the vocoder 110 after processing the target acoustic features. Alternatively, the audio data generated by the voice convertor 109 may comprise speech audio for the speech content in the target speaker's voice. The target speech audio comprises a waveform of speech audio.
The video game server apparatus 107 provides the game server 111, which communicates with the client-side gaming application 102. As shown in FIG. 1, the game server 111 includes request router 112, and optionally, speech audio 113 as described previously. The request router 112 receives requests from the gaming application 102, and provides video game content responsive to the request to the gaming application 102. Examples of requests include Application Programming Interface (API) requests, e.g. a representational state transfer (REST) call, a Simple Object Access Protocol (SOAP) call, a message queue; or any other suitable request.
Although FIG. 1 shows the speech audio generator 108 implemented by video game server apparatus 107, it will be appreciated that one or more components of the speech audio generator 108 may be implemented by computing device 101. For example, one or more components of the voice convertor 109 may be implemented by computing device 101, avoiding the need for player speech samples to be transmitted to video game server apparatus 107.
Example Speech Audio Generator Method
FIG. 2 illustrates an example method for a speech audio generator 200 for generating speech audio in a voice of a target speaker, using a voice convertor.
The speech audio generator 200 comprises a voice convertor 204 used to transform source acoustic features into target acoustic features. As will be described in relation to FIGS. 4-6, the voice convertor 204 comprises machine-learned models that may be initially trained using recordings or input from speakers for whom there are many speech samples. If a new speaker is added to voice convertor 204 (e.g. a player of the video game), the voice convertor 204 may be refined/further trained using a few speech samples provided by the new speaker. In this way, target acoustic features corresponding to the speech content in the new speaker's voice can be generated.
The speech audio generator 200 is configured to receive source speech audio 201. The source speech audio 201 comprises speech content in the voice of a source speaker and may comprise pre-recorded speech audio. The speech content of the source speech audio 201 may include (or be) lexical utterances such as words, non-lexical utterances, or a combination of lexical and non-lexical utterances. Non-lexical utterances may include noises (e.g. a sigh or moan), disfluencies (e.g. um, oh, uh), and the like. Any paralinguistic utterance may form part of the speech content, such as sighs, yawns, moans, laughs, grunts, etc. The source speech audio 201 may be any suitable digital data and may for example represent a waveform of speech audio (e.g. transmitted as an MP3 file, a WAV file, etc). The source speaker may be a speaker whose voice samples were used to initially train voice convertor 204.
The source speech audio 201 may be pre-processed before being processed by the speech audio generator 200. Pre-processing steps may include denoising, silence removal, compression, and/or downsampling.
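As a minimal sketch of two of the listed pre-processing steps, the following shows silence removal and downsampling on a waveform held as a list of floating-point samples. The amplitude threshold and the block-averaging decimator are assumptions chosen for brevity, not techniques named in the specification.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:
        return []
    return samples[voiced[0]:voiced[-1] + 1]


def downsample(samples, factor):
    """Naive decimation: average each complete block of `factor` samples.

    Block averaging acts as a crude low-pass filter; a production system
    would apply a proper anti-aliasing filter before decimating.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    starts = range(0, len(samples) - factor + 1, factor)
    return [sum(samples[i:i + factor]) / factor for i in starts]


# Hypothetical toy waveform with near-silent edges.
raw = [0.0, 0.001, 0.5, -0.5, 0.25, 0.75, 0.0, 0.0]
trimmed = trim_silence(raw)
reduced = downsample(trimmed, 2)
```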
The source speech audio 201 is processed by the speech audio generator 200, and source acoustic features 202 of the source speech audio 201 are determined. The source acoustic features 202 may comprise any low-level acoustic representation of frequency, magnitude and phase information such as linear spectrograms, log-mel-spectrograms, linear predictive coding (LPC) coefficients, Mel-Frequency Cepstral Coefficients (MFCC), log fundamental frequency (LF0), band aperiodicity (bap) or combinations thereof. The source acoustic features 202 may comprise a sequence of vectors, each vector representing acoustic information in a short time period, e.g. 50 milliseconds. Source acoustic features 202 may be determined in any suitable manner, e.g. by performing a Fast Fourier Transform on source speech audio 201.
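The determination of acoustic features as a sequence of per-frame vectors may be illustrated with a linear magnitude spectrogram. This is a sketch only: the frame and hop lengths are arbitrary, no window function is applied, and a direct discrete Fourier transform is used for readability where a Fast Fourier Transform would compute the same values more efficiently.

```python
import cmath
import math


def frame_signal(samples, frame_length, hop_length):
    """Split a waveform into overlapping fixed-length frames."""
    return [samples[start:start + frame_length]
            for start in range(0, len(samples) - frame_length + 1, hop_length)]


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def linear_spectrogram(samples, frame_length, hop_length):
    """One feature vector (a magnitude spectrum) per frame."""
    return [magnitude_spectrum(frame)
            for frame in frame_signal(samples, frame_length, hop_length)]


# Toy input: a cosine completing exactly 2 cycles per 8-sample frame.
wave = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
features = linear_spectrogram(wave, frame_length=8, hop_length=4)
```

Each entry of `features` corresponds to one short time period of the waveform, and for this toy input the energy concentrates in frequency bin 2 of every frame.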
The voice convertor 204 is configured to receive the source acoustic features 202 and a speaker identifier 203. The speaker identifier 203 is any data that can be associated with (e.g. used to identify) an individual speaker. In some embodiments, the speaker identifier 203 is a different one-hot vector for each speaker whose voice can be synthesized in output of the speech audio generator 200. In other embodiments, speech samples (or acoustic features thereof) provided by that particular speaker are used as a speaker identifier 203. The source acoustic features 202 and speaker identifier 203 are processed by the voice convertor module 204 to output target acoustic features 205. The target acoustic features 205 comprise acoustic features for the speech content of source speech audio 201, but in a voice of the target speaker associated with speaker identifier 203. The target speaker may be a player of the video game, or any other speaker whose voice samples were used to train the voice convertor 204.
The vocoder 206 is configured to receive the target acoustic features 205. The vocoder 206 processes the target acoustic features to produce a waveform of speech audio 207. The speech audio 207 is synthesized speech audio in the target speaker's voice corresponding to the speech content of source speech audio 201. The speech audio 207 comprises an amplitude sample for each of a plurality of audio frames.
The vocoder 206 may be pre-trained using recordings or input from speakers for whom there are many speech samples. In some cases, the same vocoder 206 may be used for many speakers without the need for retraining based on new speakers, i.e. the vocoder 206 may comprise a universal vocoder. For example, the vocoder 206 may be pre-trained using training examples derived from speech samples wherein each training example comprises acoustic features for the speech sample and a corresponding ground-truth waveform of speech audio. The vocoder 206 processes the acoustic features of one or more training examples and generates a predicted waveform of speech audio for the one or more training examples. The vocoder 206 is trained in dependence on an objective function, wherein the objective function comprises a comparison between the predicted waveform of speech audio and the ground-truth waveform of speech audio. The parameters of the vocoder 206 are updated by optimizing the objective function using any suitable optimization procedure. For example, the objective function may be optimized using gradient-based methods such as stochastic gradient descent, mini-batch gradient descent, or batch gradient descent.
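The training procedure described above (predict, compare with the ground truth via an objective function, update parameters by gradient descent) may be sketched with a deliberately tiny stand-in model. The linear "vocoder", the mean-squared-error objective, the learning rate, and the synthetic data are all assumptions for illustration; an actual vocoder would be a far larger neural network.

```python
import random


def predict(weights, features):
    """Toy stand-in for a vocoder: one output sample per feature vector."""
    return sum(w * f for w, f in zip(weights, features))


def mse(weights, examples):
    """Objective: mean squared error between predicted and ground-truth samples."""
    return sum((predict(weights, f) - y) ** 2 for f, y in examples) / len(examples)


def sgd_epoch(weights, examples, learning_rate):
    """One pass of stochastic gradient descent, one example per update."""
    for features, target in examples:
        error = predict(weights, features) - target
        # Gradient of (prediction - target)^2 with respect to each weight.
        weights = [w - learning_rate * 2 * error * f
                   for w, f in zip(weights, features)]
    return weights


rng = random.Random(0)
true_weights = [0.5, -0.25]  # hypothetical parameters used to synthesize targets
examples = []
for _ in range(50):
    f = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    examples.append((f, predict(true_weights, f)))

weights = [0.0, 0.0]
initial_loss = mse(weights, examples)
for _ in range(200):
    weights = sgd_epoch(weights, examples, learning_rate=0.1)
final_loss = mse(weights, examples)
```

Updating once per example corresponds to stochastic gradient descent; averaging the gradient over several examples before each update would give the mini-batch or batch variants.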
Although FIG. 2 depicts a vocoder 206 generating speech audio, it will be appreciated that the vocoder 206 may be omitted from the speech audio generator 200 (as described in relation to FIG. 1), and that the voice convertor 204 may be configured to output the speech audio 207 for the target speaker (in addition to, or in lieu of, the target acoustic features 205).
The speech audio 207 may be post-processed. Post-processing steps may include denoising, upsampling, and/or decompression to a full sample rate.
Example Voice Convertor Method
FIG. 3 illustrates an example method 300 for a voice convertor configured to transform source acoustic features into target acoustic features.
The voice convertor 301 comprises a speaker encoder 303, an acoustic feature encoder 306, and an acoustic feature decoder 307.
The speaker encoder 303 is configured to receive a speaker identifier 302. The speaker identifier 302 is any data that can be associated with (e.g. used to identify) an individual speaker. In some embodiments, the speaker identifier 302 is a different one-hot vector for each speaker whose voice can be synthesized in output of the speech audio generator. In other embodiments, speech samples (or indications thereof, e.g. acoustic features) provided by that particular speaker may be used as a speaker identifier 302.
The speaker encoder 303 processes the speaker identifier 302 and outputs speaker embedding 305. The speaker embedding 305 is a representation of the voice of the speaker associated with speaker identifier 302. The speaker embedding 305 is a vector of a learned embedding space, such that different speakers are represented in different regions of the embedding space. In embodiments where the speaker identifier 302 comprises speech samples (or acoustic features thereof) from the speaker, the speaker embedding 305 may be determined from an average of one or more embeddings which each may be determined by inputting a different speech sample (or acoustic features thereof) from the speaker into speaker encoder 303. New voices may be created by combining (or otherwise varying) learned speaker embeddings, and inputting these modified speaker embeddings into the acoustic feature encoder 306. For example, a modified speaker embedding may be generated by adding together the speaker embeddings of two or more different speakers.
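Determining a single speaker embedding from several per-sample embeddings may be sketched as an element-wise mean. The function name and the three-dimensional vectors below are hypothetical; in practice each per-sample embedding would be the output of the speaker encoder for one speech sample.

```python
def mean_embedding(sample_embeddings):
    """Element-wise mean of per-sample embeddings for one speaker."""
    if not sample_embeddings:
        raise ValueError("at least one embedding is required")
    count = len(sample_embeddings)
    return [sum(values) / count for values in zip(*sample_embeddings)]


# Hypothetical encoder outputs for three speech samples from one speaker.
per_sample = [
    [1.0, 0.0, 2.0],
    [3.0, 1.0, 2.0],
    [2.0, 2.0, 2.0],
]
speaker_embedding = mean_embedding(per_sample)
```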
In some embodiments, and as will be described in relation to FIG. 4, the speaker encoder 303 has been trained separately to acoustic feature encoder 306 and acoustic feature decoder 307. In these embodiments, the speaker encoder 303 may comprise a recurrent neural network comprising one or more recurrent layers. The recurrent neural network is configured to receive a speaker identifier comprising speech audio (or acoustic features thereof) from a particular speaker.
In a recurrent neural network, each recurrent layer comprises a hidden state that is updated as the recurrent neural network processes data input to the network. For each time step, a recurrent layer receives its hidden state from the previous time step, and an input to the recurrent layer for the current time step. The recurrent layer processes its previous hidden state and the current input in accordance with its parameters and generates an updated hidden state for the current time step. For example, the recurrent layer may apply a first linear transformation to the previous hidden state and a second linear transformation to the current input and combine the results of the two linear transformations, e.g. by adding the two results together. The recurrent layer may then apply a non-linear activation function (e.g. a tanh activation function, a sigmoid activation function, a ReLU activation function, etc.) to the combined result to generate an updated hidden state for the current time step.
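The recurrent update described above can be written as h_t = tanh(W_h h_{t-1} + W_x x_t). A minimal sketch follows, assuming a tanh activation, a zero initial hidden state, and small hand-picked matrices; all names and values are illustrative, and learned parameters would replace them in a trained network.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def recurrent_step(w_hidden, w_input, prev_hidden, current_input):
    """h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    from_hidden = matvec(w_hidden, prev_hidden)   # first linear transformation
    from_input = matvec(w_input, current_input)   # second linear transformation
    return [math.tanh(a + b) for a, b in zip(from_hidden, from_input)]


def run_layer(w_hidden, w_input, inputs):
    """Process a sequence, returning the hidden state at every time step."""
    hidden = [0.0] * len(w_hidden)  # assumed zero initial state
    states = []
    for x in inputs:
        hidden = recurrent_step(w_hidden, w_input, hidden, x)
        states.append(hidden)
    return states


# Hypothetical parameters: 2 hidden units, 3-dimensional inputs.
W_H = [[0.5, 0.0], [0.0, 0.5]]
W_X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
states = run_layer(W_H, W_X, [[0.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
```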
In some embodiments, and as will be described in relation to FIG. 6, the speaker encoder 303 has been trained jointly with acoustic feature encoder 306 and acoustic feature decoder 307. In these embodiments, the speaker encoder 303 may comprise a feedforward neural network comprising one or more fully connected layers. The feedforward neural network is configured to receive a speaker identifier for a particular speaker, for example a one-hot vector indicating the particular speaker.
A fully connected layer receives an input and applies a learned linear transformation to its input. The fully connected layer may further apply a non-linear transformation to generate an output for the layer.
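As an illustrative sketch, a single fully connected layer applied to a one-hot speaker identifier is shown below. The weight values, the bias term, and the optional ReLU activation are assumptions for the example. With a one-hot input and zero bias, the linear transformation simply selects the weight column belonging to the indicated speaker.

```python
def one_hot(index, num_speakers):
    """Vector of zeros with a single 1.0 at the speaker's index."""
    if not 0 <= index < num_speakers:
        raise ValueError("speaker index out of range")
    vector = [0.0] * num_speakers
    vector[index] = 1.0
    return vector


def fully_connected(weights, bias, inputs, activation=None):
    """Linear transformation W x + b, optionally followed by a non-linearity."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, bias)]
    if activation is not None:
        outputs = [activation(o) for o in outputs]
    return outputs


def relu(value):
    return max(0.0, value)


# Hypothetical parameters: 3 known speakers, 2-dimensional embeddings.
WEIGHTS = [[0.2, -0.4, 0.6],
           [0.1, 0.3, -0.5]]
BIAS = [0.0, 0.0]

embedding = fully_connected(WEIGHTS, BIAS, one_hot(1, 3))
rectified = fully_connected(WEIGHTS, BIAS, one_hot(1, 3), activation=relu)
```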
Acoustic feature encoder 306 is configured to receive source acoustic features 304 and target speaker embedding 305. The acoustic feature encoder outputs one or more acoustic feature encodings. An acoustic feature encoding may be determined for each input time step of a plurality of input time steps of the source acoustic features. The acoustic feature encoding for each input time step may comprise a combination of the speaker embedding for the speaker and an encoding of the source acoustic features for the input time step. For example, the combination may be performed by a concatenation operation, an addition operation, a dot product operation, etc. The acoustic feature encoder 306 may comprise a recurrent neural network comprising one or more recurrent layers.
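Two of the combination operations mentioned above, concatenation and addition, may be sketched as follows. The per-time-step encodings and the speaker embedding are hypothetical values; in the described system they would come from the acoustic feature encoder and the speaker encoder respectively.

```python
def concatenate_speaker(frame_encodings, speaker_embedding):
    """Append the speaker embedding to the encoding of every input time step."""
    return [list(encoding) + list(speaker_embedding)
            for encoding in frame_encodings]


def add_speaker(frame_encodings, speaker_embedding):
    """Element-wise addition; requires matching dimensionality."""
    for encoding in frame_encodings:
        if len(encoding) != len(speaker_embedding):
            raise ValueError("addition needs encodings and embedding of equal size")
    return [[e + s for e, s in zip(encoding, speaker_embedding)]
            for encoding in frame_encodings]


# Hypothetical values: 3 input time steps, 2-dimensional encodings.
frame_encodings = [[0.5, 1.0], [1.5, 2.0], [2.5, 3.0]]
speaker_embedding = [10.0, 20.0]

concatenated = concatenate_speaker(frame_encodings, speaker_embedding)
added = add_speaker(frame_encodings, speaker_embedding)
```

Concatenation leaves the two sources of information in separate dimensions and imposes no size constraint, whereas addition keeps the encoding size unchanged but requires the embedding to match it.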
Acoustic feature decoder 307 is configured to receive the one or more acoustic feature encodings output by acoustic feature encoder 306. The acoustic feature decoder 307 outputs target acoustic features 308. The target acoustic features 308 comprise target acoustic features for a plurality of output time steps. The acoustic feature decoder 307 may comprise an attention mechanism. For each output time step of the plurality of output time steps, the acoustic feature encoding for each input time step may be received. The attention mechanism may generate an attention weight for each acoustic feature encoding. The attention mechanism may generate a context vector for the output time step by computing a weighted average of the acoustic feature encodings using the respective attention weights. The acoustic feature decoder 307 may process the context vector of the output time step to generate target acoustic features for the output time step. The acoustic feature decoder 307 may comprise a recurrent neural network comprising one or more recurrent layers.
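The computation of a context vector for one output time step may be sketched as below. The specification does not state how attention weights are scored, so the dot-product scoring against a decoder "query" vector and the softmax normalization used here are assumptions; only the final weighted combination of encodings follows directly from the description.

```python
import math


def softmax(scores):
    """Normalize scores into non-negative weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(encodings, query):
    """Attention-weighted average of the per-input-time-step encodings."""
    # Assumed scoring function: dot product between query and each encoding.
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encodings]
    weights = softmax(scores)
    dim = len(encodings[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encodings))
               for i in range(dim)]
    return context, weights


# Hypothetical encodings for 3 input time steps.
encodings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform_context, uniform_weights = context_vector(encodings, [0.0, 0.0])
focused_context, focused_weights = context_vector(encodings, [10.0, 0.0])
```

A zero query scores every input time step equally, giving a plain average, while the second query concentrates the weights on the encodings whose first component is large.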
Although depicted in FIG. 3 as two separate components, it will be appreciated that the acoustic feature encoder 306 and acoustic feature decoder 307 may be combined as a single encoder-decoder model. For example, they may be combined as an encoder-decoder (e.g. sequence-to-sequence) neural network with or without attention.
A model trainer may be used, after an initial training procedure for the components of the voice convertor 301, to refine components of the voice convertor 301 when adding a new speaker's voice (e.g. a player's voice) to the speech audio generator. During the process of adding the new speaker's voice to the speech audio generator, the new speaker provides speech samples from which acoustic features are determined to refine (i.e. further train) the acoustic feature encoder 306, and acoustic feature decoder 307. In some implementations, the speaker encoder 303 is also refined using the speech samples provided by the new speaker.
The speech samples may be used to form a “paired training example” or an “unpaired training example”. In a “paired training example”, the speech sample provided by the new speaker closely matches an example used to train components of the voice convertor 301. For example, a player may be asked to mimic an example of speech audio such that the player speaks the same words as the example, in a speech style (e.g. prosody) and with timing similar to that of the example. It should be noted that the systems and methods described herein can accommodate differences in the length of the target speaker's waveform compared to the source speaker's waveform due to the use of sequence-to-sequence and encoder-decoder architectures as described above. In some embodiments, and as will be described in relation to FIG. 6, paired training examples may be used to jointly train speaker encoder 303, acoustic feature encoder 306, and acoustic feature decoder 307. In an “unpaired training example”, the speech sample provided by the new speaker does not closely match an example used to train components of the voice convertor module 301. In some embodiments, and as will be described in relation to FIG. 4, unpaired training examples may be used to separately train speaker encoder 303.
Example Speaker Encoder Training Method
FIG. 4 illustrates an example method 400 for training a speaker encoder to generate speaker embeddings. In the embodiment depicted in FIG. 4, the speaker encoder 406 is trained separately from the acoustic feature encoder and acoustic feature decoder. The separate training of the acoustic feature encoder and acoustic feature decoder will be described in relation to FIG. 5.
As shown in FIG. 4, the speaker encoder is being trained on a speaker verification task, although it will be appreciated that the speaker encoder may be trained on other similar tasks (such as speaker classification). In the speaker verification task, speech audio (or acoustic features thereof) is processed in order to verify the speaker's identity. Separate training of the speaker encoder 406 may result in a more representative speaker encoder, such that speaker embeddings 407 output by the speaker encoder 406 more accurately reflect the characteristics of a speaker's voice. For example, speech samples spoken by the same speaker may have different embeddings that are nonetheless closer to each other (in an embedding space) than to embeddings for speech samples from different speakers. In cases where the performance of the speaker differs between speech samples provided by that speaker (for example whispering compared to screaming), the variety in performance may be captured in the embeddings since a different embedding is output for each speech sample. In addition, speaker encoder 406 may have sufficient representation power to encode speech audio from speakers that are not present in the training set used to train the speaker encoder 406. Given a new speaker, the speaker encoder 406 may not need to be retrained and can be used as a preprocessing module to obtain speaker embeddings. Furthermore, unpaired training examples may be used to train speaker encoder 406, allowing a large corpus of publicly available data to be used to train the speaker encoder 406.
The speaker encoder 406 is trained using training set 401 comprising training examples 402-1, 402-2, 402-3. Each training example 402 comprises speech audio 404 and a speaker label 403 corresponding to the speaker identity of the respective speech audio 404. For example, if speech audio 404-1 and speech audio 404-2 were provided by the same speaker, then speaker labels 403-1 and 403-2 are identical. Speaker label 403-3 differs from speaker labels 403-1, 403-2 if speech audio 404-3 was provided by a different speaker to that of speech audio 404-1, 404-2. Speaker labels 403 may be represented by one-hot vectors such that a different one-hot vector indicates each speaker in the training set 401. In general, the training set 401 comprises a plurality of examples of speech audio 404 for each speaker.
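The labelling scheme above can be illustrated with a minimal sketch. The speaker names and the helper `one_hot_labels` are hypothetical and chosen only for illustration; the source does not prescribe how the one-hot vectors are built.

```python
def one_hot_labels(speaker_names):
    """Map each distinct speaker name to a one-hot vector (list of 0/1 ints)."""
    speakers = sorted(set(speaker_names))
    table = {}
    for position, name in enumerate(speakers):
        vector = [0] * len(speakers)
        vector[position] = 1
        table[name] = vector
    return table

# Hypothetical training set: the first two utterances share a speaker,
# mirroring speech audio 404-1 and 404-2 in the description.
utterance_speakers = ["alice", "alice", "bob"]
table = one_hot_labels(utterance_speakers)
labels = [table[name] for name in utterance_speakers]
```

Here `labels[0]` and `labels[1]` are identical, while `labels[2]` differs, as the description requires for utterances from the same and from different speakers.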
As shown in FIG. 4, speech audio 404-2 of training example 402-2 is processed by acoustic feature extractor 405 to determine acoustic features for speech audio 404-2. Acoustic feature extractor 405 may determine acoustic features in any suitable manner, e.g. by performing a Fast Fourier Transform on speech audio 404. Acoustic features determined by the acoustic feature extractor 405 may comprise any low-level acoustic representation of frequency, magnitude and phase information such as linear spectrograms, log-mel-spectrograms, linear predictive coding (LPC) coefficients, Mel-Frequency Cepstral Coefficients (MFCC), log fundamental frequency (LF0), band aperiodicity (bap) or combinations thereof. The acoustic features may comprise a sequence of vectors, each vector representing acoustic information in a short time period, e.g. 50 milliseconds.
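As a rough illustration of how a waveform becomes a sequence of per-frame feature vectors, the sketch below splits a signal into 50 millisecond frames and computes a magnitude spectrum for each frame. It uses a naive discrete Fourier transform for self-containment (an FFT yields the same values more efficiently). The sample rate, hop length, test tone, and function names are assumptions made for this example only.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a waveform into fixed-length, possibly overlapping frames."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(acc))
    return magnitudes

# Assumed toy settings: 800 Hz sample rate, 50 ms frames (40 samples), 25 ms hop.
sample_rate = 800
frame_len = sample_rate * 50 // 1000
hop_len = frame_len // 2

# A 100 Hz test tone lasting 0.2 seconds (160 samples).
signal = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(160)]

# One feature vector (a linear magnitude spectrum) per frame.
features = [magnitude_spectrum(f) for f in frame_signal(signal, frame_len, hop_len)]
peak_bin = max(range(len(features[0])), key=features[0].__getitem__)
```

With these settings each bin spans 20 Hz, so the 100 Hz tone peaks in bin 5 of every frame.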
The acoustic features are received by speaker encoder 406, which processes the acoustic features in accordance with a current set of parameters and outputs a speaker embedding 407 for speech audio 404-2.
The speaker embedding407is processed in order to verify the speaker identity of speech audio404-2. For example, the speaker embedding407may be used as part of a speaker verification loss comprising a generalized end-to-end speaker loss, with the speaker encoder406trained to optimize the loss. The generalized end-to-end speaker loss may be used to train the speaker encoder406to output embeddings of utterances from the same speaker with a high similarity (which may be measured by cosine similarity), while those of utterances from different speakers are far apart in the embedding space. For example, the generalized end-to-end speaker loss may involve finding a centroid for each speaker by averaging embeddings for speech samples provided by the speaker. A similarity matrix may be determined measuring the similarity (e.g. cosine similarity, or a linear transformation thereof) between the embedding for each utterance in a training batch and the centroid for each speaker. The generalized end-to-end speaker loss may encourage the similarity matrix to have high values for matching speaker-centroid values (e.g. values representing a similarity between an embedding for an utterance by a speaker and the centroid for the same speaker), and low values for non-matching speaker-centroid values. Alternatively, the speaker embedding for speech audio404-2may be received by an output classification layer which processes the speaker embedding in accordance with a current set of parameters, and outputs a speaker identity output. For example, the output classification layer may comprise a softmax layer, and the speaker identity output may comprise a probability vector indicating a probability, for each speaker out of the set of speakers included in the training set401, that speech audio404-2was provided by the speaker.
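The centroid and similarity-matrix computation described above can be sketched as follows. This is a simplified illustration, not the patented training procedure: it includes each utterance in its own speaker centroid, uses plain cosine similarity with an assumed fixed `scale` factor, and applies a softmax-style loss over centroids. The batch contents and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def speaker_centroids(batch):
    """Average each speaker's utterance embeddings into one centroid."""
    cents = {}
    for speaker, embeddings in batch.items():
        dim = len(embeddings[0])
        cents[speaker] = [sum(e[d] for e in embeddings) / len(embeddings)
                          for d in range(dim)]
    return cents

def similarity_rows(batch):
    """One row per utterance: similarity to every speaker centroid."""
    cents = speaker_centroids(batch)
    rows = []
    for speaker, embeddings in batch.items():
        for emb in embeddings:
            rows.append((speaker, {name: cosine(emb, c) for name, c in cents.items()}))
    return rows

def softmax_speaker_loss(rows, scale=10.0):
    """Low when each utterance is most similar to its own speaker's centroid."""
    total = 0.0
    for speaker, sims in rows:
        log_sum = math.log(sum(math.exp(scale * s) for s in sims.values()))
        total += log_sum - scale * sims[speaker]
    return total / len(rows)

# Hypothetical 2-D embeddings for two speakers, two utterances each.
batch = {
    "spk_a": [[1.0, 0.1], [0.9, 0.2]],
    "spk_b": [[0.1, 1.0], [0.2, 0.9]],
}
rows = similarity_rows(batch)
loss = softmax_speaker_loss(rows)
```

In this toy batch the matching speaker-centroid entries are the largest in every row, so the loss is small; mislabelled utterances would pull the centroids together and raise it.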
Model trainer408receives speaker embedding407for speech audio404-2, and speaker label403-2for speech audio404-2and updates the parameters of speaker encoder406in order to optimize an objective function. The objective function comprises a loss in dependence on the speaker label403-2and speaker embedding407. For example, the loss may be a speaker verification loss. Additionally or alternatively, the loss may measure a cross-entropy loss between speaker label403-2and a speaker identity output. The objective function may additionally comprise a regularization term, for example the objective function may be a linear combination of the loss and the regularization term. The parameters of the speaker encoder406may be updated by optimizing the objective function using any suitable optimization procedure. For example, the objective function may be optimized using gradient-based methods such as stochastic gradient descent, mini-batch gradient descent, or batch gradient descent, including momentum-based methods such as Adam, RMSProp, and AdaGrad. In the event that an output classification layer is included, optimizing the objective function using the model trainer408may include updating the parameters of the output classification layer.
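For the classification variant, the objective can be sketched as a cross-entropy loss between the one-hot speaker label and the softmax output, combined linearly with a regularization term. The L2 form of the regularizer, the weight `reg_weight`, and the example values are assumptions for illustration; the source leaves these choices open.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot_label, probs):
    """Cross-entropy between a one-hot label and a probability vector."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)

def objective(one_hot_label, logits, params, reg_weight=0.01):
    """Linear combination of the loss and an (assumed) L2 regularization term."""
    loss = cross_entropy(one_hot_label, softmax(logits))
    l2_penalty = sum(p * p for p in params)
    return loss + reg_weight * l2_penalty

label = [1, 0, 0]                 # hypothetical speaker label
params = [0.5, -0.5]              # stand-in for model parameters
confident = objective(label, [4.0, 0.0, 0.0], params)
mistaken = objective(label, [0.0, 4.0, 0.0], params)
```

The objective is lower when the classification layer assigns high probability to the labelled speaker (`confident`) than when it favours another speaker (`mistaken`).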
AlthoughFIG.4shows the training process with processing of a single training example, it will be appreciated that any number of training examples may be used when updating the parameters of the speaker encoder406. The training process is repeated for a number of passes through the training set401, and is terminated at a suitable point in time, e.g. when a speaker identity output derived from speaker embedding407can be reliably used to correctly verify speaker identity. After training has completed, the speaker encoder406is retained for use in generating speaker embeddings for speech samples, as described previously. Although the speaker encoder may in some cases be subsequently finetuned by way of further training using speech audio data for new voices, this may not be necessary if the speaker encoder is pretrained using a sufficient number of speakers. This may be achieved as the speaker encoder may not need paired training data and may be pretrained using data from many voices.
Example Acoustic Feature Encoder/Decoder Training Method
FIG.5illustrates an example method500for training an acoustic feature encoder and an acoustic feature decoder to generate target acoustic features. The method displayed inFIG.5occurs after separate training of the speaker encoder, resulting in a pre-trained speaker encoder504, as described in relation toFIG.4. During the training process displayed inFIG.5, the parameters of the pre-trained speaker encoder504are fixed. The acoustic feature encoder505and acoustic feature decoder506are initially trained prior to adding a new speaker's voice (e.g. a player's voice) and are subsequently refined when the new speaker adds their voice and provides speech samples.
As shown in FIG. 5, acoustic feature encoder 505 and acoustic feature decoder 506 are trained on a voice conversion task using one or more training examples 501. In a voice conversion task, source acoustic features are transformed into target acoustic features such that speech audio synthesized from the target acoustic features closely matches the content, performance, and timing of the source acoustic features, while changing the voice represented in the source acoustic features into that of the target speaker.
Training example 501 comprises source acoustic features 503 and corresponding target acoustic features 502. During training, the goal of the acoustic feature encoder 505 and the acoustic feature decoder 506 is to transform source acoustic features 503 of a training example 501 into the target acoustic features 502 of the training example 501. The training example may be referred to as a “paired” training example, wherein the source speech audio (from which the source acoustic features 503 are determined) and the target speech audio (from which the target acoustic features 502 are determined) differ only in speaker identity. In other words, the content (e.g. the words spoken), the performance, and the timing of the source and target speech audio may closely match each other in a paired training example.
When adding a new player's voice, the target acoustic features502correspond to acoustic features from player speech samples provided by the player. As described previously, the player speech samples may be paired with source speech audio (e.g. when the player is asked to mimic the source speech audio).
The target acoustic features502are received by the pre-trained speaker encoder, which processes the target acoustic features502in accordance with a learned set of parameters, and outputs a target speaker embedding for the target acoustic features502.
The target speaker embedding and source acoustic features503are received by acoustic feature encoder505, which processes the received inputs in accordance with a current set of parameters to output one or more acoustic feature encodings. The one or more acoustic feature encodings are processed by the acoustic feature decoder506in accordance with a current set of parameters to output predicted target acoustic features507.
Model trainer 508 receives the predicted target acoustic features 507 and the “ground-truth” target acoustic features 502, and updates the parameters of acoustic feature encoder 505 and acoustic feature decoder 506 in order to optimize an objective function. The objective function comprises a loss in dependence on the predicted target acoustic features 507 and the ground-truth target acoustic features 502. For example, the loss may measure a mean-squared error between the predicted target acoustic features 507 and the ground-truth target acoustic features 502. The objective function may additionally comprise a regularization term, for example the objective function may be a linear combination of the loss and the regularization term. The objective function may further comprise other weighted losses such as a speaker classifier loss (to emphasize that the target acoustic features have target speaker characteristics) or an alignment loss (to emphasize the correct alignment between paired source and target acoustic features). The parameters of the acoustic feature encoder 505 and acoustic feature decoder 506 may be updated by optimizing the objective function using any suitable optimization procedure. For example, the objective function may be optimized using gradient-based methods such as stochastic gradient descent, mini-batch gradient descent, or batch gradient descent, including momentum-based methods such as Adam, RMSProp, and AdaGrad.
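The mean-squared-error objective with a regularization term can be sketched as below. The feature sequences are lists of per-frame vectors; the L2 regularizer, its weight, and the toy values are assumptions for illustration rather than details taken from the source.

```python
def mean_squared_error(predicted, target):
    """Average squared difference over every element of two feature sequences."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same number of frames")
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(predicted, target):
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

def objective(predicted, target, params, reg_weight=1e-3):
    """Linear combination of the MSE loss and an (assumed) L2 regularization term."""
    return mean_squared_error(predicted, target) + reg_weight * sum(p * p for p in params)

# Hypothetical two-frame, two-dimensional feature sequences.
predicted_features = [[1.0, 2.0], [3.0, 4.0]]
ground_truth_features = [[1.0, 2.0], [3.0, 6.0]]
loss = mean_squared_error(predicted_features, ground_truth_features)
```

Further weighted terms, such as a speaker classifier loss or an alignment loss, would be added to `objective` in the same linear fashion.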
The training process is repeated for a number of training examples, and is terminated at a suitable point in time, e.g. when predicted target acoustic features507closely match ground-truth target acoustic features502. After an initial training process, the acoustic feature encoder505and acoustic feature decoder506may be further trained/refined using target acoustic features502determined from player speech samples. Subsequently, the speaker encoder504, acoustic feature encoder505and acoustic feature decoder506can be used to convert any source acoustic features into target acoustic features corresponding to the player's voice.
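The repeat-until-close training loop can be illustrated with a deliberately tiny stand-in model. The sketch below fits a one-dimensional linear map by batch gradient descent on a mean-squared error and stops once the loss falls below a tolerance. The linear model, learning rate, tolerance, and data are all assumptions; the encoder and decoder in the source are neural networks, which this example does not attempt to reproduce.

```python
def predict(w, b, xs):
    """Stand-in 'converter': a linear map from source value to target value."""
    return [w * x + b for x in xs]

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def gradient_step(w, b, xs, ys, lr):
    """One batch gradient descent update of the two parameters."""
    pred = predict(w, b, xs)
    n = len(xs)
    grad_w = sum(2 * (p - y) * x for p, y, x in zip(pred, ys, xs)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(pred, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

def train(xs, ys, lr=0.05, tolerance=1e-4, max_steps=1000):
    """Repeat updates until predictions closely match the targets."""
    w, b = 0.0, 0.0
    for step in range(max_steps):
        loss = mse(predict(w, b, xs), ys)
        if loss < tolerance:
            return w, b, loss, step
        w, b = gradient_step(w, b, xs, ys, lr)
    return w, b, mse(predict(w, b, xs), ys), max_steps

# Hypothetical paired data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, final_loss, steps = train(xs, ys)
```

Refinement with player speech samples corresponds to calling the same loop again on new paired data, starting from the already trained parameters instead of zeros.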
Joint Training Method
FIG.6illustrates an example method600for training a speaker encoder, an acoustic feature encoder, and an acoustic feature decoder to generate target acoustic features. In the embodiment depicted inFIG.6, the speaker encoder605is trained jointly with the acoustic feature encoder606and acoustic feature decoder607on a voice conversion task.
Training example 601 comprises source acoustic features 604, corresponding target acoustic features 602, and a target speaker identifier 603. During training, the goal is to transform source acoustic features 604 of a training example 601 into the target acoustic features 602 of the training example 601. The training example may be referred to as a “paired” training example, wherein the source speech audio (from which the source acoustic features 604 are determined) and the target speech audio (from which the target acoustic features 602 are determined) differ only in speaker identity. In other words, the content (e.g. the words spoken), the performance, and the timing of the source and target speech audio may closely match each other in a paired training example. The target speaker identifier 603 is a label used to identify the speaker corresponding to target acoustic features 602. Training examples 601 with target acoustic features corresponding to the same speaker have identical target speaker identifiers 603. Target speaker identifiers 603 may be represented by one-hot vectors such that a different one-hot vector indicates each speaker in the training set. In addition, one or more one-hot vectors may be reserved as target speaker identifiers 603 for speakers to be added later, e.g. for players who wish to synthesize speech audio in their voice.
Target speaker identifier 603 is received by speaker encoder 605, which processes the target speaker identifier 603 in accordance with a current set of parameters and outputs a target speaker embedding for the target speaker.
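One simple way to realise a speaker encoder that takes a one-hot identifier is a learned embedding table, where multiplying the one-hot vector by the table selects one row. This is an assumed minimal form, not necessarily the encoder used in the source; the table values, including the reserved slot for a later-added player, are made up for illustration.

```python
def speaker_embedding(one_hot, table):
    """Multiply a one-hot row vector by the table, which selects a single row."""
    dim = len(table[0])
    return [sum(flag * row[d] for flag, row in zip(one_hot, table))
            for d in range(dim)]

# Hypothetical learned parameters: one row per one-hot position.
embedding_table = [
    [0.2, -0.1, 0.7],   # speaker 0
    [0.9, 0.4, -0.3],   # speaker 1
    [0.0, 0.0, 0.0],    # reserved slot for a player added later
]

target_speaker_id = [0, 1, 0]
target_embedding = speaker_embedding(target_speaker_id, embedding_table)
```

During joint training the rows of the table would be updated along with the acoustic feature encoder and decoder; refining the model for a new player would train the reserved row.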
The target speaker embedding and source acoustic features604are received by acoustic feature encoder606, which processes the received inputs in accordance with a current set of parameters to output one or more acoustic feature encodings. The one or more acoustic feature encodings are processed by the acoustic feature decoder607in accordance with a current set of parameters to output predicted target acoustic features608.
Model trainer609receives the predicted target acoustic features608and the “ground-truth” target acoustic features602, and updates the parameters of the speaker encoder605, acoustic feature encoder606and acoustic feature decoder607in order to optimize an objective function. The objective function comprises a loss in dependence on the predicted target acoustic features608and the ground-truth target acoustic features602. For example, the loss may measure a mean-squared error between the predicted target acoustic features608and the ground-truth target acoustic features602. The objective function may additionally comprise a regularization term, for example the objective function may be a linear combination of the loss and the regularization term. The parameters of the speaker encoder605, acoustic feature encoder606and acoustic feature decoder607may be updated by optimizing the objective function using any suitable optimization procedure. For example, the objective function may be optimized using gradient-based methods such as stochastic gradient descent, mini-batch gradient descent, or batch gradient descent, including momentum-based methods such as Adam, RMSProp, and AdaGrad.
The training process is repeated for a number of training examples, and is terminated at a suitable point in time, e.g. when predicted target acoustic features608closely match ground-truth target acoustic features602. After an initial training process, the speaker encoder605, acoustic feature encoder606and acoustic feature decoder607may be further trained/refined using target acoustic features602determined from player speech samples. Subsequently, the speaker encoder605, acoustic feature encoder606and acoustic feature decoder607can be used to convert any source acoustic features into target acoustic features corresponding to the player's voice.
FIG. 7 is a flow diagram illustrating an example method 700 of generating speech audio for use in a video game, wherein the speech audio is generated using a voice convertor that has been trained to convert audio data for a source speaker into audio data for a target speaker.
In step 7.1, source speech audio and a target speaker identifier are received.
The source speech audio comprises speech content in the voice of a source speaker. The speech content of the source speech audio may include (or be) lexical utterances such as words, non-lexical utterances, or a combination of lexical and non-lexical utterances. Non-lexical utterances may include noises (e.g. a sigh or moan), disfluencies (e.g. um, oh, uh), and the like. Any paralinguistic utterance may form part of the speech content, such as sighs, yawns, moans, laughs, grunts, etc. The source speech audio may be any suitable digital data and may for example represent a waveform of speech audio (e.g. transmitted as an MP3 file, a WAV file, etc). The source speaker may be a speaker whose voice samples were used to initially train the voice convertor.
The target speaker identifier is any data that can be associated with (e.g. used to identify) the target speaker. In some embodiments, the target speaker identifier is a one-hot vector indicating the target speaker. In other embodiments, speech samples (or indications thereof, e.g. acoustic features) provided by the target speaker are used as a speaker identifier. The target speaker may be a player of the video game, or any other speaker whose voice samples were used to train the speech audio generator.
In step 7.2, source acoustic features for the source speech audio are determined. The source acoustic features may comprise any low-level acoustic representation of frequency, magnitude and phase information such as linear spectrograms, log-mel-spectrograms, linear predictive coding (LPC) coefficients, Mel-Frequency Cepstral Coefficients (MFCC), log fundamental frequency (LF0), band aperiodicity (bap) or combinations thereof. The source acoustic features may comprise a sequence of vectors, each vector representing acoustic information in a short time period, e.g. 50 milliseconds. Source acoustic features may be determined in any suitable manner, e.g. by performing a Fast Fourier Transform on the source speech audio.
In step 7.3, a target speaker embedding associated with the target speaker identifier is generated as output of a speaker encoder of the voice convertor. The target speaker embedding is a representation of the voice of the target speaker which is associated with the target speaker identifier.
Generating a target speaker embedding associated with the target speaker identifier may comprise inputting, into the speaker encoder, one or more examples of speech audio associated with the target speaker identifier. An embedding for each example of speech audio may be generated as output of the speaker encoder. A target speaker embedding may be generated based on the one or more embeddings. For example, the target speaker embedding may be an average of the one or more embeddings.
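Averaging per-utterance embeddings into a single target speaker embedding can be sketched as follows. The embedding values are hypothetical, and the helper name `average_embedding` is introduced only for this example.

```python
def average_embedding(embeddings):
    """Element-wise mean of one or more equal-length embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]

# Hypothetical speaker-encoder outputs for two speech samples from one speaker.
utterance_embeddings = [[1.0, 2.0], [3.0, 4.0]]
target_speaker_embedding = average_embedding(utterance_embeddings)
```

With a single speech sample the average is simply that sample's embedding, so the same helper covers the "one or more" case described above.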
Alternatively, generating a target speaker embedding associated with the target speaker identifier may comprise inputting into the speaker encoder, the target speaker identifier. The target speaker embedding may be generated as output of the speaker encoder.
The speaker encoder may comprise neural network layers. For example, the neural network layers may comprise feedforward layers, e.g. fully connected layers and/or convolutional layers. Additionally or alternatively, the neural network layers may comprise recurrent layers, e.g. LSTM layers and/or bidirectional LSTM layers.
In step 7.4, the target speaker embedding and the source acoustic features are inputted into an acoustic feature encoder of the voice convertor. The acoustic feature encoder may comprise neural network layers. For example, the neural network layers may comprise feedforward layers, e.g. fully connected layers and/or convolutional layers. Additionally or alternatively, the neural network layers may comprise recurrent layers, e.g. LSTM layers and/or bidirectional LSTM layers.
In step 7.5, one or more acoustic feature encodings are generated as output of the acoustic feature encoder. The one or more acoustic feature encodings are derived from the target speaker embedding and the source acoustic features. Each of the one or more acoustic feature encodings may comprise a combination of the target speaker embedding and an encoding of the source acoustic features.
Generating the one or more acoustic feature encodings may comprise generating an acoustic feature encoding for each input time step of a plurality of input time steps of the source acoustic features. The acoustic feature encoding for each input time step comprises a combination of the target speaker embedding and an encoding of the source acoustic features for the input time step.
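The per-time-step combination can be illustrated with concatenation, which is one common way of combining a fixed speaker embedding with a sequence of encodings. The source only says "combination", so concatenation is an assumption here, as are the example values.

```python
def combine_per_step(frame_encodings, speaker_embedding):
    """Append the same speaker embedding to the encoding of every input time step."""
    return [list(encoding) + list(speaker_embedding) for encoding in frame_encodings]

# Hypothetical encodings of the source acoustic features for three input time steps.
frame_encodings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target_speaker_embedding = [0.9, -0.9, 0.0]

acoustic_feature_encodings = combine_per_step(frame_encodings, target_speaker_embedding)
```

The result keeps one encoding per input time step, each carrying both the content of that step and the identity of the target voice.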
In step 7.6, target speech audio for the target speaker is generated. The target speech audio comprises the speech content in the voice of the target speaker. The generating comprises decoding the one or more acoustic feature encodings using an acoustic feature decoder of the voice convertor.
Decoding the one or more acoustic feature encodings may comprise, for each output time step of a plurality of output time steps, receiving the acoustic feature encoding for each input time step. An attention weight for each acoustic feature encoding may be generated by an attention mechanism. A context vector for the output time step may be generated by the attention mechanism as a weighted average of the acoustic feature encodings, each weighted by its respective attention weight. The context vector of the output time step may be processed by the acoustic feature decoder to generate target acoustic features for the output time step.
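The attention step for a single output time step can be sketched as below. Dot-product scoring against a decoder query vector is an assumption (the source does not specify how attention weights are computed), and the encodings and query are toy values.

```python
import math

def attention_weights(query, encodings):
    """Softmax over dot-product scores between the query and each encoding."""
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encodings]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(weights, encodings):
    """Weighted average of the encodings using the attention weights."""
    dim = len(encodings[0])
    return [sum(w * enc[d] for w, enc in zip(weights, encodings)) for d in range(dim)]

# Hypothetical encodings for two input time steps and a decoder query
# that points towards the first of them.
encodings = [[1.0, 0.0], [0.0, 1.0]]
query = [5.0, 0.0]

weights = attention_weights(query, encodings)
context = context_vector(weights, encodings)
```

Because the weights sum to one, the context vector is a convex combination of the input encodings; the decoder would consume one such vector per output time step.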
The acoustic feature decoder may comprise neural network layers. For example, the neural network layers may comprise feedforward layers, e.g. fully connected layers and/or convolutional layers. Additionally or alternatively, the neural network layers may comprise recurrent layers, e.g. LSTM layers and/or bidirectional LSTM layers.
The acoustic feature encoder and acoustic feature decoder may be combined as a single encoder-decoder model. For example, they may be combined as an encoder-decoder (e.g. sequence-to-sequence) neural network with or without attention, or as a transformer network, etc. Furthermore, an encoder-decoder model may be implemented by architectures such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).
The acoustic feature decoder may generate target acoustic features for the speech content in the voice of the target speaker. The target speech audio may be generated by processing the target acoustic features using a vocoder. The vocoder may comprise neural network layers. For example, the neural network layers may comprise feedforward layers, e.g. fully connected layers and/or convolutional layers. Additionally or alternatively, the neural network layers may comprise recurrent layers, e.g. LSTM layers and/or bidirectional LSTM layers.
Additionally or alternatively, the acoustic feature decoder may generate the target speech audio.
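Steps 7.1 to 7.6 can be summarised as a composition of the components described above. In the sketch below every component is passed in as a callable, and the stand-ins used in the usage example are trivial placeholders rather than trained models; all names and signatures are assumptions made for illustration.

```python
def convert_voice(source_audio, target_speaker_id, extract_features,
                  speaker_encoder, feature_encoder, feature_decoder, vocoder):
    """Chain the voice convertor components in the order of method 700."""
    source_features = extract_features(source_audio)                   # step 7.2
    speaker_embedding = speaker_encoder(target_speaker_id)             # step 7.3
    encodings = feature_encoder(source_features, speaker_embedding)    # steps 7.4 and 7.5
    target_features = feature_decoder(encodings)                       # step 7.6 (decoding)
    return vocoder(target_features)                                    # features to waveform

# Placeholder components standing in for trained networks.
def toy_extract(audio):
    return [[sample] for sample in audio]

def toy_speaker_encoder(speaker_id):
    return [float(speaker_id)]

def toy_feature_encoder(features, embedding):
    return [frame + embedding for frame in features]

def toy_feature_decoder(encodings):
    return [[enc[0] + enc[1]] for enc in encodings]

def toy_vocoder(features):
    return [frame[0] for frame in features]

target_audio = convert_voice([0.1, 0.2], 1, toy_extract, toy_speaker_encoder,
                             toy_feature_encoder, toy_feature_decoder, toy_vocoder)
```

Where the acoustic feature decoder generates the target speech audio directly, the `vocoder` argument would be an identity function.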
FIG. 8 is a flow diagram illustrating an example method 800 of training a voice convertor, for use in a video game, to convert acoustic features for a source speaker into acoustic features for a target speaker.
In step 8.1, one or more training examples are received. Each training example comprises: (i) source acoustic features for speech content in the voice of the source speaker, and (ii) ground-truth target acoustic features for the speech content in the voice of the target speaker.
Step 8.2 comprises steps 8.2.1, 8.2.2, 8.2.3, and 8.2.4, each of which is performed for each of the one or more training examples.
In step 8.2.1, a target speaker embedding for the training example and the source acoustic features are inputted into an acoustic feature encoder of the voice convertor.
In step 8.2.2, one or more acoustic feature encodings are generated as output of the acoustic feature encoder. The one or more acoustic feature encodings are derived from the target speaker embedding and the source acoustic features. Each of the one or more acoustic feature encodings may comprise a combination of the target speaker embedding and an encoding of the source acoustic features.
In step 8.2.3, the one or more acoustic feature encodings are inputted into an acoustic feature decoder of the voice convertor.
In step 8.2.4, predicted target acoustic features for the training example are generated as output of the acoustic feature decoder.
In step 8.3, parameters of the acoustic feature encoder and the acoustic feature decoder are updated based on an objective function. The objective function comprises a comparison between the predicted target acoustic features and the ground-truth target acoustic features.
The target speaker embedding may be generated by a speaker encoder of the voice convertor that has been trained separately from the acoustic feature encoder and the acoustic feature decoder. The separate training of the speaker encoder may comprise receiving one or more further training examples. Each further training example may comprise: (i) acoustic features for speech content in the voice of a speaker, and (ii) a speaker identifier for the speaker. For each further training example of the one or more further training examples, the acoustic features may be inputted into the speaker encoder. A speaker embedding for the further training example may be generated as output of the speaker encoder. Parameters of the speaker encoder may be updated based on a further objective function. The further objective function may depend on the speaker embedding and the speaker identifier. The further objective function may comprise a similarity metric that measures the similarity of the generated speaker embeddings of the further training examples. For example, the similarity metric may be a cosine similarity.
Alternatively, the target speaker embedding may be generated by a speaker encoder of the voice convertor that is being trained jointly with the acoustic feature encoder and the acoustic feature decoder. Each training example may further comprise a target speaker identifier for the target speaker, and the joint training of the speaker encoder may comprise, for each training example of the one or more training examples, inputting, into the speaker encoder, the target speaker identifier of the training example. A target speaker embedding may be generated as output of the speaker encoder. Parameters of the speaker encoder may be updated based on the objective function.
The method800may further comprise receiving target acoustic features for speech content in the voice of a player of the video game. The received target acoustic features may be associated with the source acoustic features of a particular training example. Predicted target acoustic features may be generated by converting the source acoustic features using the voice convertor. Parameters of the acoustic feature decoder and the acoustic feature encoder may be updated based on the objective function.
FIG.9shows a schematic example of a system/apparatus for performing methods described herein. The system/apparatus shown is an example of a computing device. It will be appreciated by the skilled person that other types of computing devices/systems may alternatively be used to implement the methods described herein, such as a distributed computing system.
The apparatus (or system)900comprises one or more processors902. The one or more processors control operation of other components of the system/apparatus900. The one or more processors902may, for example, comprise a general purpose processor. The one or more processors902may be a single core device or a multiple core device. The one or more processors902may comprise a central processing unit (CPU) or a graphical processing unit (GPU). Alternatively, the one or more processors902may comprise specialised processing hardware, for instance a RISC processor or programmable hardware with embedded firmware. Multiple processors may be included.
The system/apparatus comprises a working or volatile memory904. The one or more processors may access the volatile memory904in order to process data and may control the storage of data in memory. The volatile memory904may comprise RAM of any type, for example Static RAM (SRAM), Dynamic RAM (DRAM), or it may comprise Flash memory, such as an SD-Card.
The system/apparatus comprises a non-volatile memory 906. The non-volatile memory 906 stores a set of operating instructions 908 for controlling the operation of the processors 902 in the form of computer readable instructions. The non-volatile memory 906 may be a memory of any kind such as a Read Only Memory (ROM), a Flash memory or a magnetic drive memory.
The one or more processors902are configured to execute operating instructions908to cause the system/apparatus to perform any of the methods described herein. The operating instructions908may comprise code (i.e. drivers) relating to the hardware components of the system/apparatus900, as well as code relating to the basic operation of the system/apparatus900. Generally speaking, the one or more processors902execute one or more instructions of the operating instructions908, which are stored permanently or semi-permanently in the non-volatile memory906, using the volatile memory904to temporarily store data generated during execution of said operating instructions908.
Implementations of the methods described herein may be realised in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These may include computer program products (such as software stored on e.g. magnetic disks, optical disks, memory, Programmable Logic Devices) comprising computer readable instructions that, when executed by a computer, such as that described in relation to FIG. 9, cause the computer to perform one or more of the methods described herein.
Any system feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure. In particular, method aspects may be applied to system aspects, and vice versa.
Furthermore, any, some and/or all features in one aspect can be applied to any, some and/or all features in any other aspect, in any appropriate combination. It should also be appreciated that particular combinations of the various features described and defined in any aspects of the invention can be implemented and/or supplied and/or used independently.
Although several embodiments have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles of this disclosure, the scope of which is defined in the claims.
Claims
- A computer-implemented method of generating speech audio for use in a video game, wherein the speech audio is generated using a voice convertor that has been trained to convert audio data for a source speaker into audio data for a target speaker, the method comprising: receiving: (i) source speech audio, wherein the source speech audio comprises speech content in the voice of the source speaker, and (ii) a target speaker identifier; determining source acoustic features for the source speech audio; generating, as output of a speaker encoder of the voice convertor, a target speaker embedding associated with the target speaker identifier; inputting the target speaker embedding and the source acoustic features into an acoustic feature encoder of the voice convertor; generating, as output of the acoustic feature encoder, one or more acoustic feature encodings derived from the target speaker embedding and the source acoustic features; and generating target speech audio for the target speaker, wherein the target speech audio comprises the speech content in the voice of the target speaker, the generating comprising decoding the one or more acoustic feature encodings using an acoustic feature decoder of the voice convertor.
- The method of claim 1, wherein each of the one or more acoustic feature encodings comprise a combination of the target speaker embedding and an encoding of the source acoustic features.
- The method of claim 1, wherein generating the one or more acoustic feature encodings comprises generating an acoustic feature encoding for each input time step of a plurality of input time steps of the source acoustic features, wherein the acoustic feature encoding for each input time step comprises a combination of the target speaker embedding and an encoding of the source acoustic features for the input time step.
- The method of claim 3, wherein decoding the one or more acoustic feature encodings comprises, for each output time step of a plurality of output time steps: receiving the acoustic feature encoding for each input time step; generating, by an attention mechanism, an attention weight for each acoustic feature encoding; generating, by the attention mechanism, a context vector for the output time step by averaging each acoustic feature encoding using the respective attention weight; and processing, by the acoustic feature decoder, the context vector of the output time step to generate target acoustic features for the output time step.
- The method of claim 1, wherein the acoustic feature decoder generates target acoustic features for the speech content in the voice of the target speaker, and the target speech audio is generated by processing the target acoustic features using a vocoder.
- The method of claim 1, wherein generating, as output of the speaker encoder of the voice convertor, the target speaker embedding associated with the target speaker identifier comprises: inputting, into the speaker encoder, one or more examples of speech audio associated with the target speaker identifier; generating, as output of the speaker encoder, an embedding for each example of speech audio; and generating the target speaker embedding based on the one or more embeddings.
- The method of claim 1, wherein generating, as output of the speaker encoder of the voice convertor, the target speaker embedding associated with the target speaker identifier comprises: inputting, into the speaker encoder, the target speaker identifier; and generating, as output of the speaker encoder, the target speaker embedding.
- The method of claim 1, wherein the target speaker is a player of the video game.
- A computer-implemented method of training a voice convertor, for use in a video game, to convert acoustic features for a source speaker into acoustic features for a target speaker, the method comprising: receiving one or more training examples, each training example comprising: (i) source acoustic features for speech content in the voice of the source speaker, and (ii) ground-truth target acoustic features for the speech content in the voice of the target speaker; for each training example of the one or more training examples: inputting, into an acoustic feature encoder of the voice convertor, a target speaker embedding for the training example and the source acoustic features; generating, as output of the acoustic feature encoder, one or more acoustic feature encodings derived from the target speaker embedding and the source acoustic features; inputting, into an acoustic feature decoder of the voice convertor, the one or more acoustic feature encodings; and generating, as output of the acoustic feature decoder, predicted target acoustic features for the training example; and updating parameters of the acoustic feature encoder and the acoustic feature decoder based on an objective function, wherein the objective function comprises a comparison between the predicted target acoustic features and the ground-truth target acoustic features.
- The method of claim 9, wherein the target speaker embedding is generated by a speaker encoder of the voice convertor that has been trained separately to the acoustic feature encoder and the acoustic feature decoder.
- The method of claim 10, wherein the separate training of the speaker encoder comprises: receiving one or more further training examples, each further training example comprising: (i) acoustic features for speech content in the voice of a speaker, and (ii) a speaker identifier for the speaker; for each further training example of the one or more further training examples: inputting, into the speaker encoder, the acoustic features; and generating, as output of the speaker encoder, a speaker embedding for the further training example; and updating parameters of the speaker encoder based on a further objective function, wherein the further objective function depends on the speaker embedding and the speaker identifier.
- The method of claim 11, wherein the further objective function comprises a similarity metric that measures the similarity of the generated speaker embeddings of the further training examples.
- The method of claim 9, wherein the target speaker embedding is generated by a speaker encoder of the voice convertor that is being trained jointly with the acoustic feature encoder and the acoustic feature decoder.
- The method of claim 13, wherein each training example further comprises a target speaker identifier for the target speaker, and the joint training of the speaker encoder comprises: for each training example of the one or more training examples: inputting, into the speaker encoder, the target speaker identifier of the training example; and generating, as output of the speaker encoder, the target speaker embedding; and updating parameters of the speaker encoder based on the objective function.
- The method of claim 9, further comprising: receiving target acoustic features for speech content in the voice of a player of the video game, wherein the received target acoustic features are associated with the source acoustic features of a particular training example; generating, using the voice convertor, predicted target acoustic features by converting the source acoustic features; and updating the parameters of the acoustic feature decoder and the acoustic feature encoder based on the objective function.
- A non-transitory computer-readable medium storing instructions, which when executed by a processor, cause the processor to: receive source speech audio, wherein the source speech audio comprises speech content in the voice of a source speaker; determine source acoustic features for the source speech audio; input a target speaker embedding for a target speaker and the source acoustic features into an acoustic feature encoder of a voice convertor; generate, as output of the acoustic feature encoder, one or more acoustic feature encodings derived from the target speaker embedding and the source acoustic features; and generate target speech audio for the target speaker, wherein the target speech audio comprises the speech content in the voice of the target speaker, the generating comprising decoding the one or more acoustic feature encodings using an acoustic feature decoder of the voice convertor.
- The non-transitory computer-readable medium of claim 16, wherein the target speaker embedding is generated from an output of a speaker encoder of the voice convertor, the generating comprising: inputting, into the speaker encoder, one or more examples of speech audio associated with the target speaker; generating, as output of the speaker encoder, an embedding for each example of speech audio; and generating the target speaker embedding based on the one or more embeddings.
- The non-transitory computer-readable medium of claim 16, wherein the target speaker embedding is generated as an output of a speaker encoder of the voice convertor, the generating comprising: inputting, into the speaker encoder, a target speaker identifier associated with the target speaker; and generating, as output of the speaker encoder, the target speaker embedding.
- The non-transitory computer-readable medium of claim 16, wherein the target speaker embedding is generated by combining two or more different speaker embeddings.
- The non-transitory computer-readable medium of claim 19, wherein combining the two or more different speaker embeddings comprises summing the two or more different speaker embeddings together.
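By way of illustration only, the data flow recited in claims 3, 4 and 20 may be sketched as follows. The sketch forms no part of the claims and does not limit them. It assumes that the "combination" of claim 3 is a concatenation, that attention scores are dot products against a hypothetical decoder query vector, and that the attention weights are normalised with a softmax; none of these choices is specified by the claims. All function names are hypothetical, and the encodings are toy values rather than outputs of a trained acoustic feature encoder.

```python
import math


def combine_speaker_embeddings(embeddings):
    """Element-wise sum of two or more speaker embeddings (cf. claim 20)."""
    return [sum(values) for values in zip(*embeddings)]


def encode_time_steps(frame_encodings, speaker_embedding):
    """Attach the speaker embedding to the encoding of every input time step.

    Concatenation is an assumption; claim 3 only requires "a combination".
    """
    return [list(frame) + list(speaker_embedding) for frame in frame_encodings]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(encodings, query):
    """Attention weights and their weighted average for one output time step.

    Dot-product scoring against `query` (a stand-in for decoder state) is an
    assumption; claim 4 only requires a weight per encoding and an average of
    the encodings using those weights.
    """
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encodings]
    weights = softmax(scores)
    dim = len(encodings[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encodings))
               for i in range(dim)]
    return weights, context


# Toy example: two speaker embeddings summed, three input time steps.
speaker = combine_speaker_embeddings([[0.5, 0.0], [0.5, 1.0]])
encodings = encode_time_steps([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], speaker)
# A zero query gives equal scores, so every time step gets the same weight.
weights, context = context_vector(encodings, query=[0.0, 0.0, 0.0, 0.0])
```

In a full system the context vector for each output time step would be passed to the acoustic feature decoder, and the decoder output to a vocoder as in claim 5; those components are omitted here.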