We created John Q-Train: an AI jazz musician that improvises a melody in real time while a backup track plays chords. John Q-Train formulates improvisation as an MDP and learns with an RNN-based Q-network, trained by iterating through 30 jazz MIDI files and updating its parameters to minimize loss. We defined our loss function as the mean squared error between the target Q-value and the predicted Q-value. We successfully decreased the loss through training and found that the quality of the model's improvisation improved significantly over the course of training.
Jazz improvisation embodies decision-making under uncertainty: musicians must listen to each other and decide in real time which note to play, and at what tempo and volume, without referring to a written composition.
There have been multiple prior attempts at an AI jazz musician, but they have yet to achieve a subjective quality that consistently matches human performance. One example is “The Jazz Transformer on the Front Line”, which used a Transformer to model jazz lead sheets. Another example is “On the Adaptability of Recurrent Neural Networks for Real-Time Jazz Improvisation Accompaniment”, which implemented a jazz accompanist using a Recurrent Neural Network.
First, we selected 30 structurally simple jazz MIDI songs from the Kaggle dataset "Jazz ML ready MIDI" and extracted the lead and backup tracks from each.
Next, we created our own data structures (Song and MusicInterval) so that the songs could be operated on by our model. Song contains a song's title, key, and notes. The notes are an array of MusicInterval objects, each of which represents a single eighth-note timestep. Each MusicInterval contains the note and velocity played by the lead and the notes and velocities played by the backup.
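In Python, these structures can be sketched with dataclasses (the field names here are our illustration, not necessarily the exact ones in the codebase):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MusicInterval:
    """One eighth-note timestep of a song."""
    lead_note: Optional[int] = None          # MIDI pitch the lead plays (None = rest)
    lead_velocity: int = 0                   # 0 when the lead rests
    backup_notes: List[int] = field(default_factory=list)       # chord pitches
    backup_velocities: List[int] = field(default_factory=list)  # one per chord pitch

@dataclass
class Song:
    title: str
    key: List[float]                         # probability distribution over keys
    notes: List[MusicInterval] = field(default_factory=list)    # one per timestep
```

Each Song is then just a title, a key distribution, and an ordered list of timesteps.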
Finally, we populated these data structures. We represented each song's key by counting the pitches within the song and using those counts to compute a probability distribution over the possible keys. Then, for each song, we populated its Song object by finding the MusicInterval corresponding to each note and filling in that note and its velocity.
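A minimal sketch of the key-estimation idea, scoring each of the 12 major keys by how many note occurrences fall in-scale (the report does not specify the exact scoring, so this is an assumption):

```python
from collections import Counter

# 1 marks the pitch classes belonging to a major scale rooted at index 0.
MAJOR_TEMPLATE = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]

def key_distribution(midi_pitches):
    """Count pitch classes, score every major-key rotation of the template,
    then normalize the scores into a probability distribution over 12 keys."""
    counts = Counter(p % 12 for p in midi_pitches)
    scores = []
    for tonic in range(12):
        in_scale = sum(counts[pc] for pc in range(12)
                       if MAJOR_TEMPLATE[(pc - tonic) % 12])
        scores.append(in_scale)
    total = sum(scores) or 1
    return [s / total for s in scores]
```

The song's key representation is then the full distribution rather than a single hard label.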
In the end, we had 30 populated Song objects that were ready to be processed by our model.
For this problem, we designed a neural network that takes the state as input and outputs the corresponding Q-value for each action.
To simplify the problem, we restricted our model to the role of a lead, meaning it plays at most one note at each timestep. We also simplified the environment to a single backup accompaniment playing chord voicings.
The first component is an input preprocessing layer that takes the dot product of the state's notes with a learnable vector of shape (12), repeated along the length of the input and shifted by i places for each i in [0, 11]. We hoped this learnable vector would capture the key at the current timestep.
Because of the linear, temporal nature of music, we then used an LSTM to process the state sequence, followed by a final linear layer that maps the hidden state to the size of the action space.
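Putting those pieces together, here is a minimal PyTorch sketch of the architecture (layer sizes, the action count, and all names are our illustration; the report does not give exact dimensions):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Preprocessing layer + LSTM + linear head, as described above."""

    def __init__(self, n_actions: int = 13, hidden_size: int = 128):
        super().__init__()
        # Learnable shape-(12,) key template, dotted against every circular
        # shift of the state's pitch-class content.
        self.key_template = nn.Parameter(torch.randn(12))
        # Per-timestep input: 12 pitch-class features + 12 key correlations.
        self.lstm = nn.LSTM(input_size=24, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)

    def forward(self, pitch_classes, state=None):
        # pitch_classes: (batch, time, 12) multi-hot of sounding pitch classes.
        shifts = torch.stack([torch.roll(self.key_template, i)
                              for i in range(12)])        # (12, 12)
        key_feats = pitch_classes @ shifts.T              # (batch, time, 12)
        x = torch.cat([pitch_classes, key_feats], dim=-1)
        h, state = self.lstm(x, state)
        return self.head(h), state                        # Q-values per action
```

Returning the recurrent state lets the same module be stepped one timestep at a time during real-time generation.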
To train the model, we took inspiration from the Q-learning paradigm, with a function-approximation twist: the loss we minimized was the mean squared error of the Q-learning objective:
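Written out (in our notation, with \(\theta\) the network parameters), the per-timestep term being averaged is:

```latex
L(\theta) = \Big( \underbrace{r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a')}_{\text{target}}
            \;-\; \underbrace{Q_\theta(s_t, a_t)}_{\text{prediction}} \Big)^2
```

where \(a_t\) is the action actually taken in the training data at timestep \(t\).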
At each timestep, we fed one MusicInterval into the model to compute the Q-vector over all possible actions. We then took only the Q-value of the action actually taken in the training set and computed the loss between that value and the reward plus the discounted maximum Q-value from the model's next timestep. Rather than Monte Carlo sampling, we accumulated the mean squared error over each entire song before taking a gradient step.
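A sketch of that per-song loss in PyTorch (here `model` is any callable returning a `(batch, time, n_actions)` Q-tensor plus a recurrent state; the function name and tensor layouts are our assumptions):

```python
import torch

def song_q_loss(model, states, actions, rewards, gamma: float = 0.9):
    """Mean squared Q-learning error accumulated over one whole song.

    states:  (1, T, features) tensor of encoded MusicIntervals
    actions: (T,) long tensor of the actions actually taken in the data
    rewards: (T,) float tensor of per-timestep rewards
    """
    q_all, _ = model(states)                               # (1, T, n_actions)
    # Q-value of the action actually taken, for every timestep but the last.
    q_taken = q_all[0, :-1].gather(1, actions[:-1, None])[:, 0]
    with torch.no_grad():                                  # target is not back-propped
        target = rewards[:-1] + gamma * q_all[0, 1:].max(dim=1).values
    return torch.mean((target - q_taken) ** 2)
```

One optimizer step per song then follows the usual `loss.backward(); optimizer.step()` pattern.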
The reward function used during training was a mixture of four components:
We don't listen to music by looking at the score; we have to hear it sung.
To generate a lead melody from the model, we fed in each MusicInterval from the test set and queried the model for the Q-values of taking each action from the current state. To sample from the model, we used temperature-scaled softmax sampling (the probability of taking action a is proportional to exp(λ·Q(s, a))), which yields a different melody on each run while still prioritizing the notes likely to sound good.
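A sketch of that sampler (NumPy; λ acts as an inverse temperature, and the function name is ours):

```python
import numpy as np

def sample_note(q_values, lam: float = 2.0, rng=None):
    """Sample an action index with probability proportional to exp(lam * Q)."""
    rng = rng if rng is not None else np.random.default_rng()
    z = lam * np.asarray(q_values, dtype=float)
    z -= z.max()                        # shift for numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Increasing λ (i.e., lowering the temperature) makes sampling greedier; in the limit it recovers a pure argmax over Q-values.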
We were able to significantly improve the model simply by tweaking hyperparameters and retraining the model. Here are the hyperparameters we tweaked to get the final result:
For training, we decreased the discount factor, increased the reward for closeness, and increased the reward for playing in key.
For evaluation, we decreased the softmax temperature (making the distribution more peaked) and increased the scale-lock.
Ultimately, John Q-Train was able to successfully learn melodic jazz improvisation through Q-learning. Here are a few key decisions that were most influential on its resulting performance.
Here is some selected code that highlights the primary functions of John Q-Train: