I am broadly interested in Music AI and NLP. My research goal is to draw parallels between music and language processing in order to 1) devise methods for better understanding language and music models, 2) devise generalizable methods for music AI by drawing inspiration from NLP, and 3) develop effective, useful music AI applications. I have gained rich research experience under the supervision of Professors Gus Xia and Timothy Baldwin. To further grow as a researcher and contribute to these promising fields, I intend to pursue a PhD.

Past Research

Generalizable Methods for Music AI

Music AI encompasses a plethora of understanding and generation tasks, each with its own characteristics. I am interested in unifying them into a shared, general framework, taking inspiration from generalizable methods in NLP. In a research project supervised by Professor Gus Xia, I attempted to unify a series of music information retrieval (MIR) tasks under an audio-to-audio format. To accommodate this setup, I modified the transformer prior model of Jukebox with an adaptor module, following ControlNet, so that it can take additional audio as a conditioning input. The advantage of this setup lies in its versatility: many MIR tasks that involve symbolic inputs or outputs can be recast into the audio-to-audio format. For instance, MIDI-to-audio synthesis can be recast by synthesizing the input condition using simple sine waves, and music transcription can be recast by synthesizing the output. Small-scale experiments show that this method can handle tasks such as vocal source separation and MIDI-conditioned synthesis.
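The following is a minimal sketch of the ControlNet-style conditioning idea: an adaptor branch encodes the condition audio's tokens and injects them into the frozen prior through a zero-initialized projection, so training starts from the pretrained model's behavior. Module names (`CondAdaptor`, `prior_with_condition`) and shapes are illustrative assumptions, not the actual Jukebox internals.

```python
# Sketch only: a ControlNet-style adaptor for conditioning a frozen autoregressive prior.
import torch
import torch.nn as nn

class CondAdaptor(nn.Module):
    """Encodes condition tokens and injects them into the prior's residual stream."""
    def __init__(self, d_model: int, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Zero-initialized projection: the adaptor is a no-op at initialization,
        # so the frozen prior's behavior is preserved at the start of training.
        self.zero_proj = nn.Linear(d_model, d_model)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, cond_emb: torch.Tensor) -> torch.Tensor:
        return self.zero_proj(self.encoder(cond_emb))

def prior_with_condition(prior_layers, x_emb, cond_emb, adaptor):
    """x_emb: embedded target tokens; cond_emb: embedded condition-audio tokens.
    Assumes the two token streams are time-aligned (same sequence length)."""
    h = x_emb + adaptor(cond_emb)   # add the conditioning signal to the input stream
    for layer in prior_layers:      # frozen pretrained prior layers
        h = layer(h)
    return h                        # output head omitted for brevity
```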

Another common challenge in music AI and NLP is improving the inference speed of autoregressive generation models; slow inference is a particular impediment to using music AI models as interactive tools. In another project, I noted that certain generation tasks, such as harmonic style transfer, can be approximated with rule-based operations, after which the transformed output needs only minor refinements. Following this intuition, I devised an edit-based model (a formulation commonly used in NLP tasks such as grammatical error correction) that accommodates symbolic music, augmented it with a rule-based approximation step, and experimented with having it learn from an autoregressive style transfer model in a distillation setup. While a gap in output quality remains, the semi-autoregressive edit model yields a 4x improvement in inference speed, and I believe using learned rule-based transformations can make it useful for many more tasks.
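Below is a minimal sketch of the edit-based refinement idea: a rule-based pass produces a rough draft, and a non-autoregressive tagger then predicts a per-token edit (keep, delete, or substitute) in a single pass, which is what makes inference fast. The edit vocabulary, model size, and names (`EditTagger`, `refine`) are illustrative assumptions rather than the exact system I built.

```python
# Sketch only: non-autoregressive edit tagging over a rule-based draft.
import torch
import torch.nn as nn

EDIT_OPS = ["KEEP", "DELETE"] + [f"SUB_{p}" for p in range(128)]  # substitute with MIDI pitch p

class EditTagger(nn.Module):
    def __init__(self, vocab_size: int = 128, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, len(EDIT_OPS))

    def forward(self, draft_tokens: torch.Tensor) -> torch.Tensor:
        # draft_tokens: (batch, time) pitch tokens from the rule-based approximation
        return self.head(self.encoder(self.embed(draft_tokens)))  # (batch, time, |EDIT_OPS|)

def refine(draft_tokens: torch.Tensor, tagger: EditTagger) -> list:
    """Apply the argmax edit at every position in parallel; draft_tokens: (1, time)."""
    ops = tagger(draft_tokens).argmax(-1)
    out = []
    for tok, op in zip(draft_tokens.tolist()[0], ops.tolist()[0]):
        name = EDIT_OPS[op]
        if name == "KEEP":
            out.append(tok)
        elif name.startswith("SUB_"):
            out.append(int(name.split("_")[1]))
        # DELETE: drop the token
    return out
```

In the distillation setup, the edit labels would be derived by aligning the rule-based draft against the autoregressive teacher's output.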

Emergent Concepts

It is a common belief that certain innate priors enable the human understanding of concepts across different modalities. Previous work has relied on domain knowledge to induce human-aligned concept representations with self-supervised learning. In (Liu et al., 2023), published at NeurIPS 2023, we examined physical symmetry as a minimal unifying inductive bias applicable to different concepts across modalities. We assume that a set of symmetric relationships (e.g., translational equivariance) governs the dynamics underlying human concepts (such as sequences of pitches). Following this assumption, we implemented a VAE with an equivariance-regularized prior model and showed that it can discover concepts such as musical pitch and 3D coordinates from unlabeled perceptual data with minimal domain knowledge. Within this project, I designed and conducted experiments on learning from natural melodies and with time-invariant timbre, to apply our method to more realistic setups. This project helped demonstrate the commonality across models of different modalities.
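To make the symmetry constraint concrete, here is a minimal sketch of an equivariance regularizer on the latent prior: the prior R that predicts the next latent state should commute with a symmetry transform T (e.g., a translation corresponding to a pitch shift). The network sizes and the choice of T are illustrative assumptions, not the exact architecture from the paper.

```python
# Sketch only: symmetry (equivariance) regularization of a latent prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPrior(nn.Module):
    """Predicts the next latent state from the current one."""
    def __init__(self, z_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))

    def forward(self, z):
        return self.net(z)

def translate(z: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
    """Symmetry transform T: translation in latent space (e.g. a pitch shift)."""
    return z + shift

def equivariance_loss(prior: LatentPrior, z_t: torch.Tensor) -> torch.Tensor:
    shift = torch.randn(1, z_t.size(-1))            # a random group element
    pred_then_shift = translate(prior(z_t), shift)  # T(R(z))
    shift_then_pred = prior(translate(z_t, shift))  # R(T(z))
    return F.mse_loss(pred_then_shift, shift_then_pred)

# Schematic objective: VAE reconstruction/ELBO terms + lambda * equivariance_loss
```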

Future Research

Understanding Music Models and Language Models

Previous work on NLP interpretability has used probing to align intermediate representations with human concepts, an approach that relies on, and is limited by, pre-defined linguistic knowledge. This limitation is more salient for music, where the domain knowledge is more ambiguous. This motivates a search for a different anchor, or proxy, of interpretable concepts. In (Liu et al., 2023), we investigated physical symmetry as one such anchor, but the method cannot be easily adapted to more complex data such as natural language and polyphonic music. Since it has been established that pretrained transformers can act as universal computation engines, I believe an alternative way to understand models and, ultimately, human concepts is to study the parallels between uni-modal models of different modalities. Music and language are ideal subjects for this comparison since both involve structure and syntax. I am interested in examining the structural understanding of music and language models through methods such as probing and mechanistic approaches, and in identifying their commonalities. In the long run, I envision that we can identify and manipulate the implicit functions behind language and music processing such that we can fuse language and music models in a much more data-efficient manner.
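As a concrete starting point, probing for structural concepts could look like the sketch below: a small linear classifier trained on frozen hidden states to predict a structural label (e.g., phrase boundaries in music or syntactic depth in text). The label set, layer choice, and the Hugging Face-style `output_hidden_states` interface are assumptions for illustration.

```python
# Sketch only: a linear probe over frozen intermediate representations.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, hidden_size: int, n_labels: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, n_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, time, hidden) taken from one layer of a frozen model
        return self.proj(hidden_states)

def train_probe(frozen_model, layer_idx, batches, n_labels, hidden_size, steps=1000):
    probe = LinearProbe(hidden_size, n_labels)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _, (tokens, labels) in zip(range(steps), batches):
        with torch.no_grad():  # the pretrained model stays frozen
            hidden = frozen_model(tokens, output_hidden_states=True).hidden_states[layer_idx]
        loss = loss_fn(probe(hidden).flatten(0, 1), labels.flatten())
        opt.zero_grad(); loss.backward(); opt.step()
    return probe
```

Comparing probe accuracy across layers of a music model and a language model would be one simple way to quantify where structural information emerges in each.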

Generalizable Methods for Music AI

Past work in music AI has predominantly focused on utilizing domain knowledge to solve individual, specific tasks. Recent breakthroughs in generalizable methods in NLP (e.g. LLM prompting and parameter-efficient finetuning) offer a new opportunity to unify previously disparate music tasks. I have experimented with unifying certain MIR tasks as audio-to-audio by modifying Jukebox, which is akin to sequence-to-sequence in NLP. Following the historical trajectory of NLP paradigms (seq2seq, then T5-like multi-tasking, then LLM-based prompting), a natural extension is to apply multi-task training. To go even further, I am interested in investigating in-context learning (ICL) and other advanced prompting methods in generative music models. Large music audio models have recently been introduced following the success of textual LLMs, but characteristic LLM features such as ICL and prompting remain understudied. Enabling such features would improve controllability and expressiveness and unify complex music tasks such as motif-conditioned generation or music genre transfer under a single, powerful model. I intend to start with simple setups, such as framing iterative score reduction as a chain-of-thought problem. Going further, a key impediment is the ambiguity of describing music with music (e.g. (Anonymous, 2023) examined reduced music scores as an intermediate product in generation, but tasks like music summarization are difficult to define). I am also interested in instruction finetuning for music audio models through adaptors.
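As a rough illustration of this direction, the snippet below sketches how disparate music tasks might be serialized into a single prompted-generation format with task prefixes and in-context demonstrations, mirroring T5-style multi-tasking and LLM prompting. All token names are placeholders, not an existing model's vocabulary.

```python
# Sketch only: unifying music tasks as prompted sequence generation.
def build_prompt(task: str, examples, query_tokens):
    """Serialize a task as: <task> (<in> ... <out> ...)* <in> query <out> ???"""
    seq = [f"<task:{task}>"]
    for inp, out in examples:                   # in-context demonstrations
        seq += ["<in>"] + inp + ["<out>"] + out
    seq += ["<in>"] + query_tokens + ["<out>"]  # the model continues from here
    return seq

# e.g. build_prompt("genre_transfer", [(source_clip_tokens, target_clip_tokens)], query_tokens)
```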

Applications in Music AI

A strong appeal of music AI to me is that it can be directly applied to practical problems in music-making. As an amateur guitarist, a common frustration for me during recording is the laborious process of splicing and editing multiple flawed takes to produce a refined track. This complex task of recording refinement involves both performance-level flaws (timing and techniques) and score-level flaws (wrong notes), and it is under-explored in the context of neural methods. An intuitive solution is feature fusion, where we combine the latent representations of the multiple takes and feed them into a quality-aware decoder trained with a denoising objective. However, real-world use cases require more controllability, as users may want to specify particular notes/audio segments or specific aspects (timing, velocity, etc.) to edit. Therefore, a superior approach may be partial re-synthesis (where we explicitly model notes, performance, and timbre as intermediate representations) or program synthesis (where a controller model applies a set of pre-defined or learned edit operations). As an additional use case, such a system could be used in music tutoring to point out flaws in a student's performance and provide demonstrations.
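To make the feature-fusion baseline concrete, here is a rough sketch: each take is encoded, the latents are fused with learned quality-aware attention weights over takes, and a decoder reconstructs a refined track under a denoising objective. The encoder and decoder are placeholders, and the module (`TakeFusion`) is a hypothetical name for illustration.

```python
# Sketch only: quality-aware fusion of multiple recorded takes.
import torch
import torch.nn as nn

class TakeFusion(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, d_latent: int):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.quality_score = nn.Linear(d_latent, 1)   # scores each take at each frame

    def forward(self, takes: torch.Tensor) -> torch.Tensor:
        # takes: (batch, n_takes, time, feat) -- multiple flawed recordings of the same passage
        z = self.encoder(takes)                           # (batch, n_takes, time, d_latent)
        w = torch.softmax(self.quality_score(z), dim=1)   # attend over takes per frame
        fused = (w * z).sum(dim=1)                        # (batch, time, d_latent)
        return self.decoder(fused)                        # reconstruct a refined track

# Training target: a clean reference performance, with the flawed takes as noisy inputs.
```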