This is the accompanying page for the paper “PitchSTAR: Pitch Style Transfer with Auto-Regularized Flow Matching for Singing Voice”, currently under review. PitchSTAR is a self-supervised framework for arbitrary pitch style transfer (PST). The PST task is defined as generating a pitch curve given a reference ornamented pitch (style) and notes (content), as illustrated in Figure 1.

PitchSTAR
Figure 1. The Pitch Style Transfer (PST) task.

PitchSTAR is based on flow matching and operates on note-relative pitch modulation, which allows it to disentangle note tone from pitch ornaments. PitchSTAR also uses an auto-regularization strategy that exploits the noisy inputs inherent to flow matching training to allow conditioning on the full reference through a blurred cross-attention. This forces the model to capture both global and local stylistic characteristics while avoiding trivial copying of the reference. The training scheme is shown in Figure 2.

PitchSTAR
Figure 2. Training scheme of PitchSTAR. Two optimization steps at low and high noise levels are shown with highlighted sharp and blurred cross-attention matrices, respectively.
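As a rough illustration of the two ingredients named above, the sketch below computes a note-relative pitch modulation (the pitch deviation from the underlying note, in semitones) and builds one noisy training input for flow matching. It is a minimal sketch under stated assumptions, not the paper's implementation: the semitone representation, the linear interpolation path, the convention that t=0 is pure noise, and all function names are assumptions made for this example.

```python
import numpy as np


def hz_to_midi(f0_hz):
    """Convert frequency in Hz to fractional MIDI semitones (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / 440.0)


def note_relative_modulation(f0_hz, note_midi):
    """Per-frame pitch deviation from the underlying note, in semitones.

    Assumption: the modulation is the pitch curve minus the note pitch.
    """
    return hz_to_midi(f0_hz) - np.asarray(note_midi, dtype=float)


def flow_matching_pair(x1, t, rng):
    """Noisy input x_t and regression target for a linear flow matching path.

    Assumption: x0 ~ N(0, I) is noise, x1 is data, and t in [0, 1] with
    t=0 being pure noise (so a small t corresponds to a high noise level).
    """
    x0 = rng.standard_normal(x1.shape)
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity


# Toy example: an A4 note held for 200 frames, sung with a +/-0.5 semitone vibrato.
rng = np.random.default_rng(0)
frames = np.arange(200)
note_midi = np.full(200, 69.0)
vibrato = 0.5 * np.sin(2.0 * np.pi * frames / 40.0)
f0_hz = 440.0 * 2.0 ** (vibrato / 12.0)

modulation = note_relative_modulation(f0_hz, note_midi)
x_t, target_velocity = flow_matching_pair(modulation, t=0.3, rng=rng)
```

In this toy case the modulation recovers exactly the vibrato, independently of which note is sung, which is the sense in which a note-relative representation separates ornament from note content.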

Sound Samples

For each model and ornament, we select the reference-plus-notes combination from the style consistency experiment that yielded the highest confidence score for the correct class from the pitch style classifiers. Below we show the output stylized pitch curves with the corresponding reference and input notes. The audio was synthesized with a Serenade model, using each displayed pitch curve as input conditioning.
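The selection rule can be sketched as follows. This is an illustrative example only: the array layout (one row of class probabilities per generated sample) and the function name `select_best_sample` are assumptions, and the probabilities are made-up toy values, not results from the paper.

```python
import numpy as np


def select_best_sample(probs, target_class):
    """Return the index of the sample whose classifier probability for the
    target (correct) ornament class is highest.

    probs: array of shape (n_samples, n_classes), hypothetical classifier outputs.
    """
    probs = np.asarray(probs, dtype=float)
    return int(np.argmax(probs[:, target_class]))


# Toy values: three generated samples, three ornament classes.
probs = np.array([
    [0.20, 0.70, 0.10],
    [0.10, 0.85, 0.05],
    [0.60, 0.30, 0.10],
])
best = select_best_sample(probs, target_class=1)
```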

In-Domain

Sample 1

Notes
source spectrogram
Reference
reference spectrogram
PitchSTAR
spectrogram
PitchSTAR w/o Flow
spectrogram
StylePitcher w/ Mod
spectrogram
StylePitcher
spectrogram

Sample 2

Notes
source spectrogram
Reference
reference spectrogram
PitchSTAR
spectrogram
PitchSTAR w/o Flow
spectrogram
StylePitcher w/ Mod
spectrogram
StylePitcher
spectrogram

Sample 3

Notes
source spectrogram
Reference
reference spectrogram
PitchSTAR
spectrogram
PitchSTAR w/o Flow
spectrogram
StylePitcher w/ Mod
spectrogram
StylePitcher
spectrogram

Sample 4

Notes
source spectrogram
Reference
reference spectrogram
PitchSTAR
spectrogram
PitchSTAR w/o Flow
spectrogram
StylePitcher w/ Mod
spectrogram
StylePitcher
spectrogram

Effect of CFG

This section shows the effect of the classifier-free guidance (CFG) scale, which balances style-guided and unguided generation.
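For readers unfamiliar with classifier-free guidance, the sketch below shows one common way a guidance scale combines a style-conditioned prediction with an unconditioned one. It assumes the convention in which a scale of 0 gives the unguided prediction and a scale of 1 gives the conditioned prediction; whether PitchSTAR uses exactly this parameterization is an assumption, and the function and variable names are hypothetical.

```python
import numpy as np


def cfg_velocity(v_uncond, v_cond, scale):
    """Blend unguided and style-guided predictions.

    scale = 0 -> unguided prediction only
    scale = 1 -> style-conditioned prediction only
    scale > 1 -> extrapolates past the conditioned prediction
    """
    v_uncond = np.asarray(v_uncond, dtype=float)
    v_cond = np.asarray(v_cond, dtype=float)
    return v_uncond + scale * (v_cond - v_uncond)


# Toy predictions for two frames.
v_uncond = np.array([0.0, 0.0])
v_cond = np.array([1.0, -1.0])

guided = {s: cfg_velocity(v_uncond, v_cond, s) for s in (0.0, 0.25, 0.5, 1.0, 2.0)}
```

Under this convention, scales between 0 and 1 interpolate between the two predictions, while a scale of 2 pushes the output further in the direction suggested by the style reference.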

Sample 1

Notes
source spectrogram
Reference
reference spectrogram
CFG=0.0
spectrogram
CFG=0.25
spectrogram
CFG=0.50
spectrogram
CFG=1.00
spectrogram
CFG=2.00
spectrogram

Sample 2

Notes
source spectrogram
Reference
reference spectrogram
CFG=0.0
spectrogram
CFG=0.25
spectrogram
CFG=0.50
spectrogram
CFG=1.00
spectrogram
CFG=2.00
spectrogram

Pitch Style Classifier

This section presents the results of the trained Pitch Style Classifier as confusion matrices.
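As a reminder of how such a matrix is built, the sketch below counts (true class, predicted class) pairs and normalizes each row so that it sums to one. The labels are toy values and the helper names are chosen for this example; the row normalization is an assumption about how the figures are displayed.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def row_normalize(cm):
    """Divide each row by its total; rows with no samples stay at zero."""
    totals = cm.sum(axis=1, keepdims=True)
    out = np.zeros(cm.shape, dtype=float)
    return np.divide(cm, totals, out=out, where=totals > 0)


# Toy labels for three ornament classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
cm_normalized = row_normalize(cm)
```

The diagonal of the normalized matrix gives the per-class recall, which is the quantity most easily read off a confusion matrix plot.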

Test Data

This confusion matrix is obtained on the GTSinger test set.

PitchSTAR
Figure 3. Confusion matrix of the Pitch Style Classifier on the test set.

Pitch Style Transfer Models

These confusion matrices are obtained by applying the classifier to the transfer outputs of each model.

PitchSTAR
PitchSTAR w/o Flow
StylePitcher w/ Mod
StylePitcher
Figures 4-7. Confusion matrices of the Pitch Style Classifier on the transfer outputs of each model.

Synthetic Dataset Samples

Four examples from the synthetic ornament dataset are shown: two with an ornament and two without.

With Ornament 1

PitchSTAR PitchSTAR
Figure 8. Synthetic notes and their corresponding ornamented pitch.

With Ornament 2

PitchSTAR PitchSTAR
Figure 9. Synthetic notes and their corresponding ornamented pitch.

Without Ornament 1

PitchSTAR PitchSTAR
Figure 10. Synthetic notes and their corresponding non-ornamented pitch.

Without Ornament 2

PitchSTAR PitchSTAR
Figure 11. Synthetic notes and their corresponding non-ornamented pitch.