Optimize the data-handling code
Browse files
README.md
CHANGED
|
@@ -42,3 +42,12 @@ Word Error Rate is calculated between:
|
|
| 42 |
Lower WER values indicate better transcription accuracy.
|
| 43 |
|
| 44 |
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 42 |
Lower WER values indicate better transcription accuracy.
|
| 43 |
|
| 44 |
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
| 45 |
+
|
| 46 |
+
## Table Structure
|
| 47 |
+
|
| 48 |
+
The leaderboard is displayed as a table with:
|
| 49 |
+
|
| 50 |
+
- **Rows**: "Number of Examples" and "Word Error Rate (WER)"
|
| 51 |
+
- **Columns**: Different data sources (CHiME4, CORAAL, CommonVoice, etc.) and OVERALL
|
| 52 |
+
|
| 53 |
+
Each cell shows the corresponding metric for that specific data source. The OVERALL column shows aggregate metrics across all sources.
|
app.py
CHANGED
|
@@ -259,12 +259,18 @@ def get_wer_metrics(dataset):
|
|
| 259 |
|
| 260 |
# Create a transposed DataFrame with metrics as rows and sources as columns
|
| 261 |
metrics = ["Count", "No LM Baseline"]
|
| 262 |
-
result_df = pd.DataFrame(index=metrics, columns=all_sources + ["OVERALL"])
|
|
|
|
|
|
|
|
|
|
| 263 |
|
| 264 |
for source in all_sources + ["OVERALL"]:
|
| 265 |
for metric in metrics:
|
| 266 |
result_df.loc[metric, source] = source_results[source][metric]
|
| 267 |
|
|
|
|
|
|
|
|
|
|
| 268 |
return result_df
|
| 269 |
|
| 270 |
except Exception as e:
|
|
@@ -278,17 +284,23 @@ def format_dataframe(df):
|
|
| 278 |
# Use vectorized operations instead of apply
|
| 279 |
df = df.copy()
|
| 280 |
|
| 281 |
-
#
|
| 282 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 283 |
# Convert to object type first to avoid warnings
|
| 284 |
-
df.loc[
|
| 285 |
|
| 286 |
for col in df.columns:
|
| 287 |
-
value = df.loc[
|
| 288 |
if pd.notna(value):
|
| 289 |
-
df.loc[
|
| 290 |
else:
|
| 291 |
-
df.loc[
|
| 292 |
|
| 293 |
return df
|
| 294 |
|
|
|
|
| 259 |
|
| 260 |
# Create a transposed DataFrame with metrics as rows and sources as columns
|
| 261 |
metrics = ["Count", "No LM Baseline"]
|
| 262 |
+
result_df = pd.DataFrame(index=metrics, columns=["Metric"] + all_sources + ["OVERALL"])
|
| 263 |
+
|
| 264 |
+
# Add descriptive column
|
| 265 |
+
result_df["Metric"] = ["Number of Examples", "Word Error Rate (WER)"]
|
| 266 |
|
| 267 |
for source in all_sources + ["OVERALL"]:
|
| 268 |
for metric in metrics:
|
| 269 |
result_df.loc[metric, source] = source_results[source][metric]
|
| 270 |
|
| 271 |
+
# Set Metric as index for better display
|
| 272 |
+
result_df = result_df.set_index("Metric")
|
| 273 |
+
|
| 274 |
return result_df
|
| 275 |
|
| 276 |
except Exception as e:
|
|
|
|
| 284 |
# Use vectorized operations instead of apply
|
| 285 |
df = df.copy()
|
| 286 |
|
| 287 |
+
# Find the row containing WER values (now with new index name)
|
| 288 |
+
wer_row_index = None
|
| 289 |
+
for idx in df.index:
|
| 290 |
+
if "WER" in idx or "Error Rate" in idx:
|
| 291 |
+
wer_row_index = idx
|
| 292 |
+
break
|
| 293 |
+
|
| 294 |
+
if wer_row_index:
|
| 295 |
# Convert to object type first to avoid warnings
|
| 296 |
+
df.loc[wer_row_index] = df.loc[wer_row_index].astype(object)
|
| 297 |
|
| 298 |
for col in df.columns:
|
| 299 |
+
value = df.loc[wer_row_index, col]
|
| 300 |
if pd.notna(value):
|
| 301 |
+
df.loc[wer_row_index, col] = f"{value:.4f}"
|
| 302 |
else:
|
| 303 |
+
df.loc[wer_row_index, col] = "N/A"
|
| 304 |
|
| 305 |
return df
|
| 306 |
|