Melodfy: How a Python App Turns Your Piano Playing Into MIDI

Melodfy is a Python application with a Qt6 GUI that takes a piano audio recording and converts it into a MIDI file. The conversion runs entirely on your machine — no internet connection, no cloud API, no subscription. You drop in a recording, click convert, and get a .mid file you can open in FL Studio, Ableton, Logic, GarageBand, or any other DAW.

At its core, Melodfy is a GUI wrapper around ByteDance’s piano transcription model, packaged by Hemant Kumar into a usable desktop application. Understanding what Melodfy does means understanding what that underlying model does and how the pieces connect.

What the Problem Actually Is

Converting audio to MIDI is genuinely hard. Audio is a continuous waveform — air pressure changes over time. MIDI is a discrete event log — note on, note off, pitch, velocity, time. The translation requires the software to hear a waveform and answer: which piano keys are being pressed, at what exact moments, with how much force, and when do they release?
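To make the "discrete event log" concrete, here is a minimal sketch of how a single note might be represented. The NoteEvent class and its field names are illustrative assumptions, not Melodfy's internal types. The pitch-to-frequency formula is the standard equal-temperament relation with A4 (MIDI note 69) at 440 Hz.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteEvent:
    """One transcribed note (hypothetical structure, for illustration only)."""
    pitch: int       # MIDI note number, 21 (A0) to 108 (C8) on an 88-key piano
    velocity: int    # strike strength, 1 to 127
    start: float     # onset time in seconds
    end: float       # release time in seconds


def pitch_to_hz(pitch: int) -> float:
    """Fundamental frequency of a MIDI pitch in equal temperament (A4 = 440 Hz)."""
    return 440.0 * 2.0 ** ((pitch - 69) / 12)


# A C major triad held for one second: three overlapping events in the log.
chord = [
    NoteEvent(60, 80, 0.0, 1.0),
    NoteEvent(64, 78, 0.0, 1.0),
    NoteEvent(67, 82, 0.0, 1.0),
]
for note in chord:
    print(note.pitch, round(pitch_to_hz(note.pitch), 2))
```

The transcription problem is the inverse of this: given only the summed waveform of those three frequencies and their overtones, recover the three events.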

For a single note in isolation, this is manageable. For a full piano performance with chords, pedal use, and notes ringing over each other, it requires a model that understands the harmonic structure of the piano and can separate overlapping frequencies into individual note events.

[Figure: Melodfy application interface]

The ByteDance Piano Transcription Model

Melodfy uses the piano transcription system developed by ByteDance Research. The model architecture is a CRNN (Convolutional Recurrent Neural Network), named Regress_onset_offset_frame_velocity_CRNN in the codebase. The name describes what it predicts: onset times (when a note begins), offset times (when it ends), frame-level note activity, and velocity (how hard the key was struck).
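How frame-level outputs relate to note events can be sketched for a single key. The version below is a deliberately simplified threshold decoder, not the regression-based post-processing the ByteDance code uses. The function name, the 0.3 threshold, and the rate of 100 frames per second are assumptions chosen for illustration.

```python
def decode_key(onset_prob, frame_prob, velocity, fps=100, threshold=0.3):
    """Turn per-frame probabilities for one piano key into (start, end, velocity) tuples.

    A note starts at a frame where the onset probability rises above the
    threshold and lasts while the frame-activity probability stays above it.
    """
    notes = []
    start = None
    vel = 0
    for i, (onset, active) in enumerate(zip(onset_prob, frame_prob)):
        new_onset = onset >= threshold and (i == 0 or onset_prob[i - 1] < threshold)
        if new_onset:
            if start is not None:      # key struck again: close the previous note
                notes.append((start / fps, i / fps, vel))
            start = i
            # Map the 0..1 velocity estimate onto the MIDI range 1..127.
            vel = max(1, min(127, round(velocity[i] * 127)))
        elif start is not None and active < threshold:
            notes.append((start / fps, i / fps, vel))
            start = None
    if start is not None:              # note still sounding when the audio ends
        notes.append((start / fps, len(frame_prob) / fps, vel))
    return notes
```

In this sketch the onset output decides when a note begins, the frame output decides how long it lasts, and the velocity output is sampled only at the onset frame. The real model's separate offset output is ignored here for brevity.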

The model was trained on the MAESTRO dataset: roughly 200 hours of professional piano performances in which the audio waveforms and MIDI note labels are aligned to within about 3 ms. That alignment precision is what lets the model learn exact onset and offset timing rather than approximate note presence.

The model does not operate on raw audio. Before running inference, the audio must be converted to a mel spectrogram: a 2D representation of frequency content over time, scaled to match how human hearing perceives pitch. Melodfy’s inference.py handles this preprocessing — loading the audio, resampling to 16 kHz mono, computing the mel spectrogram, and feeding it to the model in overlapping chunks.
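The chunking step can be sketched without any audio library. The 10-second segment length and 50% overlap below are assumptions for illustration, not values confirmed from Melodfy's inference.py. The mel conversion uses the common HTK formula, which may differ from the exact filterbank Melodfy computes.

```python
import math

SAMPLE_RATE = 16_000  # the model expects 16 kHz mono audio


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def segment_bounds(num_samples, segment_seconds=10.0, overlap=0.5):
    """Return (start, end) sample indices of overlapping chunks covering the audio.

    The final chunk may be shorter than the rest; in practice it would be
    zero-padded to full length before being passed to the model.
    """
    seg = int(segment_seconds * SAMPLE_RATE)
    hop = int(seg * (1.0 - overlap))
    if hop <= 0:
        raise ValueError("overlap must be less than 1.0")
    bounds = []
    start = 0
    while True:
        end = min(start + seg, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += hop
    return bounds


# A 25-second recording yields chunks that start every 5 seconds.
print(segment_bounds(25 * SAMPLE_RATE))
```

Overlapping the chunks means a note that straddles a chunk boundary is seen whole in at least one of them, at the cost of running the model on some audio twice.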

The ONNX Runtime: Running It Offline

The original ByteDance model is implemented in PyTorch. Running it directly pulls in PyTorch as a large dependency, and it is typically paired with a GPU for acceptable speed. Melodfy instead uses an ONNX export of the model, running via the ONNX Runtime library.

ONNX (Open Neural Network Exchange) is a format for representing trained models that is independent of the training framework. The ONNX Runtime is a high-performance inference engine that can execute these models on CPU or GPU using hardware-specific optimisation backends. On CPU, it typically uses MLAS (Microsoft Linear Algebra Subprograms) for matrix operations. The result is that the model runs efficiently on a standard laptop CPU without requiring CUDA or any GPU.

The model weights live in the models/ directory of the repository. The inference.py file loads the model via onnxruntime.InferenceSession and runs the transcription in segments, with the results assembled into a piano roll and then exported as MIDI using pretty_midi.
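A minimal sketch of segment-wise inference with ONNX Runtime follows. The helper names and the assumption that the model takes a single input tensor are illustrative, and Melodfy's actual inference.py may be organised differently. Only InferenceSession, get_inputs, and run are real ONNX Runtime calls here.

```python
def load_session(model_path):
    """Create a CPU-only ONNX Runtime session (requires `pip install onnxruntime`)."""
    import onnxruntime as ort  # imported lazily so run_segments has no hard dependency

    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def run_segments(session, segments, log=print):
    """Run the model on each segment in order and collect the raw outputs.

    `segments` is an iterable of arrays already shaped the way the model expects.
    Assumes the model has exactly one input; its name is read from the session.
    """
    input_name = session.get_inputs()[0].name
    outputs = []
    for index, segment in enumerate(segments, start=1):
        # Passing None as the first argument requests every model output.
        outputs.append(session.run(None, {input_name: segment}))
        log(f"segment {index} done")
    return outputs
```

Taking the logger as a parameter keeps the loop independent of the GUI: a desktop front end can pass its own callback to show per-segment progress.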

The Qt6 Interface

The GUI is built with PySide6 — the official Python bindings for Qt6. The main window is defined in mainui.ui (Qt Designer XML) with the corresponding Python logic in mainui.py. The workflow from the user’s perspective is: select an audio file, set an output directory, and start conversion. The application logs progress segment-by-segment in real time as inference runs, so you can see conversion happening rather than waiting for a silent progress bar.

[Figure: Melodfy MIDI output open in a DAW]

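Despite its name, piano_vad.py is probably not about skipping silence. In the upstream ByteDance inference package, the file of that name holds the post-processing routines that turn the model's frame-level onset, offset, and frame outputs into discrete note and pedal events, and Melodfy's copy appears to serve the same purpose. The utilities.py and config.py files handle path management, format configuration, and output settings.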

What the Output Looks Like

The output is a standard .mid file, General MIDI compatible and readable by virtually any DAW or notation editor. Each note event carries pitch (which key), velocity (how hard), start time, and end time. Pedal events are also captured when present, since the ByteDance model specifically includes pedal regression in its output representation.
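Writing such a file with pretty_midi can be sketched as follows. The function names and tuple layouts are assumptions for illustration, not Melodfy's own code. The sustain pedal is encoded in the conventional way, as MIDI control change 64, where values of 64 or above mean the pedal is down.

```python
SUSTAIN_CC = 64  # MIDI controller number for the sustain (damper) pedal


def pedal_to_cc(pedal_intervals):
    """Convert (start, end) pedal intervals into time-ordered (time, value) changes."""
    changes = []
    for start, end in pedal_intervals:
        changes.append((start, 127))  # pedal pressed
        changes.append((end, 0))      # pedal released
    return sorted(changes)


def write_midi(notes, pedal_intervals, path):
    """Write (pitch, velocity, start, end) notes and pedal intervals to a .mid file."""
    import pretty_midi  # lazy import; requires `pip install pretty_midi`

    midi = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano
    for pitch, velocity, start, end in notes:
        piano.notes.append(
            pretty_midi.Note(velocity=velocity, pitch=pitch, start=start, end=end)
        )
    for time, value in pedal_to_cc(pedal_intervals):
        piano.control_changes.append(
            pretty_midi.ControlChange(number=SUSTAIN_CC, value=value, time=time)
        )
    midi.instruments.append(piano)
    midi.write(path)
```

A call such as write_midi([(60, 80, 0.0, 1.0)], [(0.0, 1.5)], "out.mid") would produce a one-note file with the pedal held slightly past the note's release.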

The quality of transcription depends on the recording. A clean solo piano recording in a dry room can give near-perfect results for clearly voiced passages. Heavy reverb, low recording volume, and very fast ornamental runs are harder. The model was trained on professional recordings, so it performs best on material that resembles them.

Getting Started

Melodfy runs on Windows, with prebuilt binaries available on the GitHub releases page and SourceForge. Since v1.0.0+22, the Windows build is fully self-contained: FFmpeg is bundled, so you do not need to install it separately.

Linux and macOS support is in progress. Until prebuilt binaries are available for those platforms, you can run Melodfy directly from source by cloning the repository and installing the dependencies from requirements.txt. When running from source, FFmpeg must be on your PATH.
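Before running from source, you may want to confirm that FFmpeg is actually discoverable. Here is a small check using only the standard library. The function name is an assumption for illustration, not part of Melodfy.

```python
import shutil


def ffmpeg_available(executable="ffmpeg"):
    """Return True if the FFmpeg executable can be found on the PATH."""
    return shutil.which(executable) is not None


if not ffmpeg_available():
    print("FFmpeg was not found on PATH; install it before running from source.")
```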

The repository is on GitHub at HemantKArya/Melodfy. It is MIT licensed — free for personal and commercial use.