PitchSTAR: Pitch Style Transfer with Auto-Regularized Flow Matching for Singing Voice
This is the accompanying page for the paper “PitchSTAR: Pitch Style Transfer with Auto-Regularized Flow Matching for Singing Voice”, currently under review. PitchSTAR is a self-supervised framework for arbitrary pitch style transfer (PST). The PST task is defined as generating a pitch curve from a reference ornamented pitch (style) and a sequence of notes (content), as illustrated in Figure 1.
PitchSTAR is based on flow matching and operates on note-relative pitch modulation, which allows it to disentangle note tone from pitch ornaments. It also uses an auto-regularization strategy that exploits the noisy inputs inherent to flow matching training: the model is conditioned on the full reference through a blurred cross-attention, which forces it to capture both global and local stylistic characteristics while avoiding trivial copying of the reference. The training procedure is shown in Figure 2.
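To make the idea of note-relative pitch modulation concrete, here is a minimal sketch. It assumes the modulation is the per-frame difference, in semitones, between the sung pitch and the aligned score note; the paper may define it differently, and the function names are hypothetical.

```python
import math

def hz_to_midi(f0_hz):
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)

def note_relative_modulation(f0_hz, note_midi):
    """Per-frame deviation of the sung pitch from its note, in semitones.

    Assumption: both sequences are frame-aligned and contain voiced frames only.
    """
    if len(f0_hz) != len(note_midi):
        raise ValueError("f0 and note sequences must have the same length")
    return [hz_to_midi(f) - n for f, n in zip(f0_hz, note_midi)]

def reconstruct_f0(modulation, note_midi):
    """Inverse mapping: add the modulation back onto the notes and return Hz."""
    return [440.0 * 2.0 ** ((m + n - 69.0) / 12.0)
            for m, n in zip(modulation, note_midi)]
```

Under this assumption, a flat note sung exactly in tune yields a modulation of zero, so whatever remains in the curve is the ornament (for example a vibrato oscillation around zero) rather than the melody.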
Sound Samples
For each model and ornament, we select the reference-plus-notes combination from the style consistency experiment for which the pitch style classifiers gave the highest confidence score to the correct class. Below we show the resulting stylized pitch curves together with the corresponding reference and input notes. The audio was synthesized with a Serenade model, using each displayed pitch curve as input conditioning.
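The selection rule described above can be sketched as follows. The data layout (a list of dictionaries with an `id` and per-class probabilities) and the class names are hypothetical and only serve to illustrate picking the candidate with the highest confidence for the correct ornament.

```python
def select_best_sample(candidates, target_class):
    """Return the candidate whose classifier confidence for target_class is highest.

    candidates: list of dicts with keys "id" and "probs" (class name -> probability).
    This layout is an assumption made for the example.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda c: c["probs"][target_class])
```

For example, with hypothetical classifier outputs:

```python
candidates = [
    {"id": "ref03_notes11", "probs": {"vibrato": 0.71, "glissando": 0.29}},
    {"id": "ref07_notes02", "probs": {"vibrato": 0.93, "glissando": 0.07}},
]
best = select_best_sample(candidates, "vibrato")
```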
In-Domain
Sample 1
Inputs: Notes, Reference
Outputs: PitchSTAR, PitchSTAR w/o Flow, StylePitcher w/ Mod, StylePitcher
Sample 2
Inputs: Notes, Reference
Outputs: PitchSTAR, PitchSTAR w/o Flow, StylePitcher w/ Mod, StylePitcher
Sample 3
Inputs: Notes, Reference
Outputs: PitchSTAR, PitchSTAR w/o Flow, StylePitcher w/ Mod, StylePitcher
Sample 4
Inputs: Notes, Reference
Outputs: PitchSTAR, PitchSTAR w/o Flow, StylePitcher w/ Mod, StylePitcher
Effect of CFG
In this section we show the effect of the classifier-free guidance (CFG) scale, which balances style-guided and unguided generation.
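As a rough illustration of what the CFG scale does, the sketch below blends a style-conditioned and an unconditioned velocity prediction. It assumes the common convention in which a scale of 0 gives purely unguided generation and a scale of 1 gives the plain conditional prediction; the page does not state which convention PitchSTAR uses, and the function name is hypothetical.

```python
def guided_velocity(v_cond, v_uncond, scale):
    """Blend conditional and unconditional velocity predictions, frame by frame.

    Assumed convention: v = v_uncond + scale * (v_cond - v_uncond).
    scale = 0 ignores the style reference, scale = 1 is the conditional
    prediction, and scale > 1 extrapolates further toward the style.
    """
    if len(v_cond) != len(v_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

Under this convention, values above 1 push the generated curve beyond the conditional prediction, which is one way a larger scale could exaggerate the reference style.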
Sample 1
Inputs: Notes, Reference
Outputs: CFG=0.00, CFG=0.25, CFG=0.50, CFG=1.00, CFG=2.00
Sample 2
Inputs: Notes, Reference
Outputs: CFG=0.00, CFG=0.25, CFG=0.50, CFG=1.00, CFG=2.00
Pitch Style Classifier
In this section we report the results of the trained Pitch Ornament Classifier as confusion matrices.
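For readers unfamiliar with confusion matrices, the following minimal sketch shows how one can be computed from true and predicted labels. The class names are hypothetical placeholders, not the classifier's actual label set.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def row_normalize(matrix):
    """Turn counts into per-true-class rates; rows with no samples stay all zero."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized
```

Each diagonal entry of the row-normalized matrix is the fraction of samples of that class that were classified correctly; off-diagonal entries show which classes are confused with one another.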
Test Data
This confusion matrix is obtained on the GTSinger test set.
Pitch Style Transfer Models
These confusion matrices are obtained by applying the classifier to the transferred pitch curves produced by each model.
Synthetic Dataset Samples
Four examples from the synthetic ornament dataset are shown: two with an ornament and two without.
With Ornament 1
With Ornament 2
Without Ornament 1
Without Ornament 2