Voice Beat Counter

In dance training, keeping a consistent rhythm and counting the beats correctly is essential. However, many dancers struggle to count while focusing on their movements, and teachers often need tools that help students stay synchronized with the music.

This project is a Python-based audio tool that automatically adds a spoken counting voice over any music track. The program analyzes the rhythm of the song, detects the beats, and generates a clear voice that counts the measures (for example: 1 to 8 for ballet, 1 to 3 for waltz, or 1 to 4 for tango).

The main goal of this project is to help dancers practice more efficiently by providing an automatic voice guide that stays synchronized with the music. This system can be adapted for different dance styles and tempos, making it a flexible and practical learning tool.

Supplies

To build and run this project, you will need the following:

Hardware

A computer (Windows, macOS, or Linux)
Speakers or headphones

Software

Python 3.8 or higher

Python Libraries

You need to install the following libraries:

librosa – for beat detection and audio analysis
numpy – for numerical processing
pydub – for audio editing and mixing
pyttsx3 – for text-to-speech voice generation

Optional

Any .wav music file to test the project
A code editor such as VS Code, PyCharm, or any text editor

Installing the Libraries

Before running the project, you need to install the required Python libraries.

Open a terminal or command prompt and run the following command:

pip install librosa numpy pydub pyttsx3

Additional requirement (Windows users)

Pydub relies on FFmpeg to read and write audio formats other than WAV. If FFmpeg is not already installed, install it and add it to your system PATH; on Windows this usually has to be done manually.

Once the installation is complete, you are ready to move to the next step.
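Before moving on, it can help to confirm that the setup worked. The following is a minimal sketch, not part of the original project: the helper names missing_libraries() and ffmpeg_available() are hypothetical, and the check only looks for importable packages and an ffmpeg executable on PATH.

```python
import importlib.util
import shutil

# Libraries listed in the Supplies section.
REQUIRED = ["librosa", "numpy", "pydub", "pyttsx3"]


def missing_libraries(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ffmpeg_available():
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


print("Missing libraries:", missing_libraries(REQUIRED))
print("FFmpeg found:", ffmpeg_available())
```

If the first line prints an empty list, all four libraries are installed.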

Generating the Voice Numbers

This project uses a function called generar_numero_wav() to create spoken numbers as audio files.

What this function does

This function:

Converts a number into a human voice
Saves that voice as a .wav audio file
Optionally makes the number “1” louder to mark the start of a measure

The code

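def generar_numero_wav(numero, accent):
    # Start the text-to-speech engine and set the speaking rate (words per minute).
    engine = pyttsx3.init()
    engine.setProperty("rate", 170)

    # Say "UNO" for an accented 1; otherwise say the number itself.
    texto = "UNO" if (accent and numero == 1) else str(numero)

    # Reserve a temporary .wav path and close the handle so the engine can write to it.
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
    path = temp_file.name
    temp_file.close()

    # Render the speech into the file.
    engine.save_to_file(texto, path)
    engine.runAndWait()

    # Boost every number by 8 dB so it can be heard over the music.
    audio = AudioSegment.from_wav(path)
    audio = audio + 8

    # Give the accented 1 an extra 6 dB.
    if accent and numero == 1:
        audio = audio + 6

    audio.export(path, format="wav")
    return path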

How it works

It initializes the text-to-speech engine.
It sets the voice speed with:

engine.setProperty("rate", 170)

If the number is 1 and the accent is enabled, it says "UNO" instead of "1".
It creates a temporary .wav file.
It saves the spoken number in that file.
It raises the volume of every number by 8 dB, and by a further 6 dB when the number is 1 and the accent is enabled.
Finally, it returns the path to the generated audio file.

This function is the base for creating the counting voice in the project.
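In pydub, adding a number to an AudioSegment applies a gain in decibels, so the "+ 8" and "+ 6" in the function are dB values, not linear multipliers. The sketch below shows what those values mean as amplitude ratios; the helper name db_to_amplitude_ratio() is hypothetical and only illustrates the arithmetic.

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in decibels to a linear amplitude ratio."""
    return 10 ** (gain_db / 20)


# Every spoken number gets +8 dB; an accented 1 gets +8 dB and then +6 dB.
normal_gain = db_to_amplitude_ratio(8)
accent_gain = db_to_amplitude_ratio(8 + 6)

print(f"+8 dB  is about {normal_gain:.2f}x the amplitude")
print(f"+14 dB is about {accent_gain:.2f}x the amplitude")
```

An accented 1 is therefore amplified roughly twice as much as the other numbers. If the voice sounds distorted on your system, lowering these gains is a reasonable first thing to try.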

Generating the Voice on the Music Beats

This step explains the function generar_voz_compases(), which places the spoken numbers in the correct moments of the music.

What this function does

This function:

Receives the detected beat times of the music
Divides the song into measures (bars)
Plays a spoken number only on odd-numbered measures
Repeats the count from 1 up to a maximum number, depending on the dance style

The code

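def generar_voz_compases(beats_time, pasos_por_compas, max_conteo, accent):
    pistas = []        # (audio, position in ms) pairs
    compas_count = 1   # number the voice will say next
    total_beats = len(beats_time)

    # Step through the beats one measure at a time.
    for inicio_compas in range(0, total_beats, pasos_por_compas):
        numero_compas_actual = (inicio_compas // pasos_por_compas) + 1

        # Speak only on odd-numbered measures.
        if numero_compas_actual % 2 == 1:
            # Honor the caller's accent setting.
            path = generar_numero_wav(compas_count, accent=accent)
            audio = AudioSegment.from_wav(path)
            os.remove(path)

            # First beat of the measure in ms, shifted by OFFSET_MS and never negative.
            t = max(0, beats_time[inicio_compas] * 1000 + OFFSET_MS)
            pistas.append((audio, t))

            # Advance the count and wrap back to 1 after the maximum.
            compas_count += 1
            if compas_count > max_conteo:
                compas_count = 1

    return pistas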

How it works

It starts with a counter called compas_count that begins at 1.
It loops through the music in steps of pasos_por_compas.
It calculates the current measure number.
If the measure is odd, it:
Generates a spoken number as a WAV file.
Loads the audio into memory.
Deletes the temporary file.
It calculates the time in milliseconds where the voice should play: the first beat of the measure plus OFFSET_MS, never less than 0.
It saves the voice track and its position in a list.
It increases the counter and resets it to 1 once it passes the maximum number.

This creates a list of voice clips, each paired with the position of a detected beat in the music.
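To see the timing logic without generating any audio, the steps above can be sketched with plain numbers. This is an illustration under stated assumptions: plan_counts() is a hypothetical helper, the beat times are invented, and the counter is assumed to advance only when a number is actually spoken.

```python
def plan_counts(beat_times_s, beats_per_measure, max_count, offset_ms=-20):
    """Return (spoken number, position in ms) pairs for odd-numbered measures."""
    plan = []
    count = 1
    for start in range(0, len(beat_times_s), beats_per_measure):
        measure_number = start // beats_per_measure + 1
        if measure_number % 2 == 1:
            # First beat of the measure, shifted by the offset, clamped at 0.
            t_ms = max(0, beat_times_s[start] * 1000 + offset_ms)
            plan.append((count, t_ms))
            # Wrap the count back to 1 after reaching the maximum.
            count = count + 1 if count < max_count else 1
    return plan


# Invented example: 24 beats, one every 0.5 s (120 BPM), counted like a waltz.
beats = [i * 0.5 for i in range(24)]
for number, t_ms in plan_counts(beats, beats_per_measure=3, max_count=3):
    print(number, t_ms)
```

With these inputs the voice says 1, 2, 3, 1 at the start of measures 1, 3, 5, and 7.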

Changing Speed

This step explains the function change_speed(), which makes the music and the counting play faster or slower depending on the number it receives.

What this function does

This function:

Receives an audio segment and a speed factor
Returns the audio unchanged when the speed is 1
Otherwise reinterprets the same samples at a scaled frame rate
Converts the result back to the original frame rate

The code

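def change_speed(sound: AudioSegment, speed: float) -> AudioSegment:
    # A speed of 1 means no change.
    if speed == 1:
        return sound

    # Reinterpret the same samples at a scaled frame rate:
    # a higher rate plays faster, a lower rate plays slower.
    new_frame_rate = int(sound.frame_rate * speed)
    changed = sound._spawn(sound.raw_data, overrides={"frame_rate": new_frame_rate})

    # Resample to the original frame rate so the result mixes cleanly with other segments.
    return changed.set_frame_rate(sound.frame_rate)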

How it works

If the user asks for a speed of 1.0, that means normal speed, so it simply returns the original audio.
It calculates a new frame rate
If speed > 1 → the audio plays faster
If speed < 1 → the audio plays slower
It creates a new audio object with the modified frame rate
_spawn() avoids re-encoding the audio.
It keeps the same raw audio data
It resets the frame rate back to the original
Preserves the speed change
Resamples the audio, so the pitch rises or falls together with the speed
Ensures compatibility with the rest of the audio processing

This is a simple way to speed up or slow down music or voice tracks. Because the pitch follows the speed, large changes sound noticeably higher or lower.
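Because this technique replays the same samples at a different rate, the duration and the pitch change together. The sketch below works through that arithmetic; speed_effects() is a hypothetical helper, and it ignores the small rounding that int() introduces in the real frame rate.

```python
import math


def speed_effects(duration_ms, speed):
    """Return (new duration in ms, pitch shift in semitones) for a speed factor."""
    new_duration_ms = duration_ms / speed
    pitch_shift_semitones = 12 * math.log2(speed)
    return new_duration_ms, pitch_shift_semitones


# A 3-minute track at a few speed factors.
for speed in (0.8, 1.0, 1.25, 2.0):
    duration, shift = speed_effects(180_000, speed)
    print(f"speed {speed}: {duration / 1000:.1f} s, {shift:+.2f} semitones")
```

Doubling the speed halves the duration and raises the pitch by one octave (12 semitones). Keeping the pitch fixed would need a different technique, such as time stretching, which this project does not use.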

Mixing the Voice With the Music

In this step, the function generador_pista_con_conteo() combines the original music track with the generated spoken counting.

What this part does

This part of the code:

Loads the original audio file
Changes the playback speed if needed
Places the voice numbers at the correct times
Ensures that the voices do not overlap
Exports the final result as a new .wav file

The code

def generador_pista_con_conteo(in_audio, out_audio, mode, modo_conteo, accent, speed):
    # ------------------ LOAD AUDIO AND DETECT BEATS ------------------
    y, sr = librosa.load(in_audio, sr=None)
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    beats_time = librosa.frames_to_time(beats, sr=sr)

    # Some librosa versions return the tempo as an array; reduce it to a float.
    tempo = float(np.atleast_1d(tempo)[0])

    print(f"Tempo detectado: {tempo:.2f} BPM")
    print(f"Beats detectados: {len(beats)}")

    # ------------------ LOAD ORIGINAL MUSIC ------------------
    musica = AudioSegment.from_file(in_audio)
    musica = change_speed(musica, speed)

    # ------------------ RHYTHM AND COUNT SETTINGS ------------------
    pasos_por_compas = VELOCIDAD[mode]
    max_conteo = CONTEO_MAX[modo_conteo]

    # ------------------ GENERATE VOICE ------------------
    pistas_voz = generar_voz_compases(beats_time, pasos_por_compas, max_conteo, accent)

    # ------------------ MIX ------------------
    mezcla = musica
    fin_ultima_frase = 0  # end of the previous voice clip, in ms

    for voz, t_ms in pistas_voz:
        if speed != 1:
            voz = change_speed(voz, speed)

        # Beats were detected on the original audio, so rescale their times.
        start = int(t_ms / speed)

        # Push the clip later if it would overlap the previous one.
        start = max(start, fin_ultima_frase)

        mezcla = mezcla.overlay(voz, position=start)
        fin_ultima_frase = start + len(voz)

    # ------------------ EXPORT ------------------
    mezcla.export(out_audio, format="wav")
    print(f"Pista final generada: {out_audio}")

How it works

The music file is loaded:

musica = AudioSegment.from_file(in_audio)

The program adjusts the speed if needed.
It loops through every generated voice:
Calculates its position on the timeline
Moves it if it would overlap with a previous voice
The voice is mixed into the music using overlay().
The final track is exported as a .wav file.

This step creates a synchronized music track with a spoken counting voice.
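The overlap rule in the mixing loop can be tried on its own with plain numbers. This is a minimal sketch: place_without_overlap() is a hypothetical helper, the clip durations are invented, and the durations are assumed to be measured after any speed change.

```python
def place_without_overlap(clips, speed=1.0):
    """clips: (duration_ms, beat_time_ms) pairs. Return each clip's start in ms."""
    starts = []
    end_of_last = 0
    for duration_ms, t_ms in clips:
        # Beat times come from the original audio, so rescale them by the speed.
        start = int(t_ms / speed)
        # Delay the clip if the previous one is still playing.
        start = max(start, end_of_last)
        starts.append(start)
        end_of_last = start + duration_ms
    return starts


# Three 600 ms clips; the second is scheduled before the first one has finished.
clips = [(600, 0), (600, 400), (600, 3000)]
print(place_without_overlap(clips))             # normal speed
print(place_without_overlap(clips, speed=2.0))  # double speed
```

The trade-off of this rule is that a delayed clip no longer lands exactly on its beat, so it works best when the voice clips are shorter than the gap between spoken measures.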

What Else You Need to Put in the Code

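At the beginning of the code, you need the imports, the number of beats per measure, the dance styles you are going to use with the number each one counts up to, and an OFFSET_MS value that fine-tunes the synchronization:

import os
import tempfile

import librosa
import numpy as np
import pyttsx3
from pydub import AudioSegment

# --------------------------- CONFIGURATION ---------------------------

# Beats per measure for each mode
VELOCIDAD = {
    "normal": 3
}

# Controls only the number the voice counts up to
CONTEO_MAX = {
    "ballet": 8,
    "moderno": 4,
    "vals": 3,
    "tango": 4,
    "flamenco": 12
}

OFFSET_MS = -20  # small sync adjustment in ms; a negative value plays the voice slightly early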
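A mistyped mode or dance style would otherwise surface as a bare KeyError in the middle of processing. One option is to validate both settings up front. The helper resolver_config() below is hypothetical and not part of the original code; the two dictionaries are repeated only so the sketch runs on its own.

```python
VELOCIDAD = {"normal": 3}
CONTEO_MAX = {"ballet": 8, "moderno": 4, "vals": 3, "tango": 4, "flamenco": 12}


def resolver_config(mode, modo_conteo):
    """Return (beats per measure, maximum count) or raise a descriptive error."""
    if mode not in VELOCIDAD:
        raise ValueError(f"Unknown mode {mode!r}; choose one of {sorted(VELOCIDAD)}")
    if modo_conteo not in CONTEO_MAX:
        raise ValueError(
            f"Unknown dance style {modo_conteo!r}; choose one of {sorted(CONTEO_MAX)}"
        )
    return VELOCIDAD[mode], CONTEO_MAX[modo_conteo]


print(resolver_config("normal", "vals"))
```

To support a new dance style, add an entry to CONTEO_MAX with the number the voice should count up to.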

Place a .wav file with the music in the same folder as the script so the code can read it. You can use this main block:

# --------------------------- USAGE ---------------------------

if __name__ == "__main__":
    in_audio = "name_in_audio.wav"
    out_audio = "name_out_audio.wav"

    # Key into VELOCIDAD (beats per measure)
    mode = "normal"

    # Controls the number the voice counts up to
    modo_conteo = "ballet"

    accent = True

    # Playback speed (1.0 = unchanged)
    speed = 1.0

    generador_pista_con_conteo(in_audio, out_audio, mode, modo_conteo, accent, speed)

With that, you have your music with a spoken beat count to help you practice.
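If you would rather not edit the script for every song, the same settings could be read from the command line. The sketch below only builds and demonstrates the argument parser; build_parser() and the option names are assumptions for illustration, and the parsed values would then be passed to generador_pista_con_conteo().

```python
import argparse


def build_parser():
    """Build a command-line parser for the settings used in the main block."""
    parser = argparse.ArgumentParser(description="Add a spoken beat count to a music track.")
    parser.add_argument("in_audio", help="input music file")
    parser.add_argument("out_audio", help="output .wav file")
    parser.add_argument("--mode", default="normal", help="key into VELOCIDAD")
    parser.add_argument(
        "--modo-conteo",
        default="ballet",
        choices=["ballet", "moderno", "vals", "tango", "flamenco"],
        help="dance style that sets the maximum count",
    )
    parser.add_argument("--no-accent", dest="accent", action="store_false",
                        help="do not emphasize the number 1")
    parser.add_argument("--speed", type=float, default=1.0, help="playback speed factor")
    return parser


# Example: parse a sample argument list instead of the real command line.
args = build_parser().parse_args(["song.wav", "song_count.wav", "--modo-conteo", "vals"])
print(args.in_audio, args.out_audio, args.mode, args.modo_conteo, args.accent, args.speed)
```

In a real script, you would call build_parser().parse_args() with no arguments inside the main block so that it reads the actual command line.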