If you would like a detailed explanation of this project, please refer to the Medium article below.


The project is also available for testing on Hugging Face.


Audio-Classification-Raw-Audio-to-Mel-Spectrogram-CNNs

Complete end-to-end audio classification pipeline using deep learning. From raw recordings to Mel spectrogram CNNs, it includes preprocessing, augmentation, dataset validation, model training, and evaluation: a reproducible blueprint for speech, environmental, or general sound classification tasks.


Audio Classification Pipeline: From Raw Audio to Mel-Spectrogram CNNs

“In machine learning, the model is rarely the problem; the data almost always is.” A reminder I kept repeating to myself while building this project.

This repository contains a complete, end-to-end pipeline for audio classification with deep learning, starting from raw, messy audio recordings and ending with a fully trained CNN built on Mel spectrograms.

The workflow includes:

  • Raw audio loading
  • Cleaning & normalization
  • Silence trimming
  • Noise reduction
  • Chunking
  • Data augmentation
  • Mel spectrogram generation
  • Dataset validation
  • CNN training
  • Evaluation & metrics

It is a fully reproducible blueprint for real-world audio classification tasks.


Project Structure

Here is a quick table summarizing the core stages of the pipeline:

Stage                | Description                               | Output
1. Raw Audio         | Unprocessed WAV/MP3 files                 | Audio dataset
2. Preprocessing     | Trimming, cleaning, resampling            | Cleaned signals
3. Augmentation      | Pitch shift, time stretch, noise          | Expanded dataset
4. Mel Spectrograms  | Converts audio → images                   | PNG/IMG files
5. CNN Training      | Deep model learns spectrogram patterns    | .h5 model
6. Evaluation        | Accuracy, F1, confusion matrix            | Metrics + plots

1. Loading & Inspecting Raw Audio

The dataset is indexed from its directory structure, where each subfolder name is the class label:

from pathlib import Path
import pandas as pd

# audio_extensions: file suffixes to keep (example set; adjust as needed)
audio_extensions = {'.wav', '.mp3', '.flac', '.ogg'}

paths = [(path.parts[-2], path.name, str(path))
         for path in Path(extract_to).rglob('*.*')
         if path.suffix.lower() in audio_extensions]

df = pd.DataFrame(paths, columns=['class', 'filename', 'full_path'])
df = df.sort_values('class').reset_index(drop=True)

During EDA, I computed:

  • Duration
  • Sample rate
  • Peak amplitude
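
A minimal sketch of how these per-file statistics can be gathered (assuming librosa for loading; the column names are illustrative):

import librosa
import numpy as np

def audio_stats(path):
    # Load at the native sample rate and report basic signal properties
    y, sr = librosa.load(path, sr=None)
    return pd.Series({
        'duration': len(y) / sr,
        'sample_rate': sr,
        'peak_amplitude': float(np.abs(y).max())
    })

df[['duration', 'sample_rate', 'peak_amplitude']] = df['full_path'].apply(audio_stats)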

I then visualized the duration distribution:

plt.hist(df['duration'], bins=30, edgecolor='black')
plt.xlabel("Duration (seconds)")
plt.ylabel("Number of recordings")
plt.title("Audio Duration Distribution")
plt.show()

2. Audio Cleaning & Normalization

Bad samples were removed, silent files filtered, and amplitudes normalized:

# Peak-normalize so the loudest sample sits just below full scale
peak = np.abs(y).max()
if peak > 0:
    y = y / peak * 0.99

This ensures consistency and prevents the model from learning from corrupted audio.
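
For the silent-file filter mentioned above, a minimal sketch (the RMS threshold is an assumption, not a value from the notebook; tune it to the dataset):

import numpy as np

SILENCE_RMS_THRESHOLD = 1e-4  # assumed cutoff; tune per dataset

def is_silent(y):
    # A clip counts as silent if its overall RMS energy falls below the threshold
    return np.sqrt(np.mean(y ** 2)) < SILENCE_RMS_THRESHOLD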


3. Advanced Preprocessing

Preprocessing included:

  • Silence trimming
  • Noise reduction
  • Resampling β†’ 16 kHz
  • Mono conversion
  • 5-second chunking

TARGET_DURATION = 5.0    # seconds per chunk
TARGET_SR = 16000        # target sample rate (Hz)
TARGET_LENGTH = int(TARGET_DURATION * TARGET_SR)   # samples per chunk

Every audio file becomes a clean, consistent chunk ready for feature extraction.
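
As a rough sketch of how these steps can be chained with librosa (noise reduction is omitted here; a library such as noisereduce would slot in where noted):

import librosa
import numpy as np

def preprocess(path):
    # Load as mono at the target sample rate
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    # Trim leading/trailing silence
    y, _ = librosa.effects.trim(y, top_db=30)
    # (noise reduction would be applied here)
    # Split into fixed-length chunks, zero-padding the final one
    chunks = []
    for start in range(0, len(y), TARGET_LENGTH):
        chunk = y[start:start + TARGET_LENGTH]
        if len(chunk) < TARGET_LENGTH:
            chunk = np.pad(chunk, (0, TARGET_LENGTH - len(chunk)))
        chunks.append(chunk)
    return chunks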


4. Audio Augmentation

To improve generalization, I applied augmentations:

from audiomentations import Compose, Shift, PitchShift, TimeStretch, AddGaussianNoise

augment = Compose([
    Shift(min_shift=-0.3, max_shift=0.3, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5)
])

Every augmented file receives a unique name to avoid collisions.
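
One way to do that (a sketch; the uuid-based suffix is my choice, not necessarily what the notebook uses):

import uuid
import soundfile as sf
from pathlib import Path

def save_augmented(y, sr, out_dir: Path, stem: str):
    # Append a short random suffix so repeated augmentations never overwrite each other
    out_path = out_dir / f"{stem}_aug_{uuid.uuid4().hex[:8]}.wav"
    sf.write(out_path, augment(samples=y, sample_rate=sr), sr)
    return out_path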


5. Mel Spectrogram Generation

Each cleaned audio chunk is transformed into a Mel spectrogram:

S = librosa.feature.melspectrogram(
    y=y, sr=SR,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS
)
S_dB = librosa.power_to_db(S, ref=np.max)

  • Output: 128×128 PNG images
  • Separate directories per class
  • Supports both original & augmented samples

These images become the CNN input.
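
Saving each spectrogram as a fixed-size image could look roughly like this (a sketch; the 128×128 target follows the bullet above, and the helper name is mine):

import matplotlib.pyplot as plt

def save_spectrogram_png(S_dB, out_path, size_px=128, dpi=100):
    # Render the dB-scaled Mel spectrogram as a square image with no axes or margins
    fig = plt.figure(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    ax = fig.add_axes([0, 0, 1, 1])
    ax.axis('off')
    ax.imshow(S_dB, aspect='auto', origin='lower', cmap='magma')
    fig.savefig(out_path, dpi=dpi)
    plt.close(fig)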

Example of Mel Spectrogram Images


6. Dataset Validation

After spectrogram creation:

  • Corrupted images removed
  • Duplicate hashes filtered
  • Filename integrity checked
  • Class folders validated

df['file_hash'] = df['full_path'].apply(get_hash)
duplicate_hashes = df[df.duplicated(subset=['file_hash'], keep=False)]
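
A possible get_hash helper (MD5 over the file bytes is my assumption; any stable content hash works):

import hashlib

def get_hash(path, chunk_size=8192):
    # Hash file contents in blocks so large files never need to fit in memory
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            md5.update(block)
    return md5.hexdigest()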

This step ensures clean, reliable training data.


7. Building TensorFlow Datasets

The tf.data pipeline is built with shuffling, batching, and prefetching:

train_ds = tf.data.Dataset.from_tensor_slices((train_paths, train_labels))
train_ds = train_ds.map(load_and_preprocess, num_parallel_calls=AUTOTUNE)
train_ds = train_ds.shuffle(1024).batch(batch_size).prefetch(AUTOTUNE)
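
Here load_and_preprocess turns an image path into a model-ready tensor; a minimal sketch (the 231×232×4 target mirrors the input layer below and is otherwise an assumption):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
IMG_HEIGHT, IMG_WIDTH = 231, 232  # matches the InputLayer shape used below

def load_and_preprocess(path, label):
    # Read the PNG, keep all four channels, resize, and scale pixels to [0, 1]
    img = tf.io.read_file(path)
    img = tf.io.decode_png(img, channels=4)
    img = tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])
    img = tf.cast(img, tf.float32) / 255.0
    return img, label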

I used a simple image-level augmentation pipeline:

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(231, 232, 4)),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

8. CNN Architecture

The CNN captures deep frequency-time patterns across Mel images.

Key features:

  • Multiple Conv2D + BatchNorm blocks
  • Dropout
  • L2 regularization
  • Softmax output

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.regularizers import l2

# weight_decay and num_classes come from the training configuration
model = Sequential([
    data_augmentation,
    Conv2D(32, (3,3), padding='same', activation='relu', kernel_regularizer=l2(weight_decay)),
    BatchNormalization(),
    MaxPooling2D((2,2)),
    Dropout(0.2),
    # ... more layers ...
    Flatten(),
    Dense(num_classes, activation='softmax')
])
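
Before training, the model also needs to be compiled; a typical choice for this multi-class setup (the optimizer and loss below are assumptions, not taken from the notebook):

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',  # integer labels from the tf.data pipeline
    metrics=['accuracy']
)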

9. Training Strategy

from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10)
early_stopping = EarlyStopping(monitor='val_loss', patience=40, restore_best_weights=True)

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,
    callbacks=[reduce_lr, early_stopping]
)

The model converges smoothly while avoiding overfitting.


10. Evaluation

Performance is evaluated using:

  • Accuracy
  • Precision, recall, F1-score
  • Confusion matrix
  • ROC/AUC curves

from sklearn.metrics import classification_report, confusion_matrix

y_pred = np.argmax(model.predict(test_ds), axis=1)
print(classification_report(y_true, y_pred, target_names=le.classes_))
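
Here y_true holds the integer test labels; one way to collect them from the batched dataset (a sketch):

import numpy as np

# Concatenate the label tensors from every batch of the test dataset
y_true = np.concatenate([labels.numpy() for _, labels in test_ds], axis=0)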

Confusion matrix:

import seaborn as sns

sns.heatmap(confusion_matrix(y_true, y_pred), annot=True, cmap='Blues')
plt.title("Confusion Matrix")
plt.show()

11. Saving the Model & Dataset

import shutil

model.save("Audio_Model_Classification.h5")
shutil.make_archive("/content/spectrograms", 'zip', "/content/spectrograms")

The entire spectrogram dataset is also zipped for sharing or deployment.


Final Notes

This project demonstrates:

  • How to clean & prepare raw audio at a professional level
  • Audio augmentation best practices
  • How Mel spectrograms unlock CNN performance
  • A full TensorFlow training pipeline
  • Proper evaluation, reporting, and dataset integrity

If you're working on sound recognition, speech tasks, or environmental audio detection, this pipeline gives you a complete production-grade foundation.


Results

Note: Click the image below to view the video showcasing the project’s results.


Note: If the video above is not working, you can access it directly via the link below.

Watch Demo Video
