Spaces: danielrosehill/STT-Comparison

danielrosehill committed 28 days ago

Commit

0aa8adc

1 Parent(s): 4a63305

Fix SRT timestamp alignment with ground truth

Adjusted all transcription SRT files to start at 00:00:00,000 to match the ground truth timeline. This resolves the synchronization issue where transcripts were not displaying in sync with the audio playback.

Changes:
- AssemblyAI: removed 80ms offset
- Nova3: removed 80ms offset
- Speechmatics: removed 120ms offset
- Gladia: already aligned, no changes needed

Added adjust_srt_timing.py script to automate timing corrections for future updates.
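
As a rough illustration of what "removing an offset" means for a single cue, the sketch below shifts one SRT timing line by 80 ms. The names `CUE`, `to_ms`, `to_srt`, and `shift_cue` are hypothetical and are not part of this commit; they only mirror the arithmetic described in the change list above.

```python
import re

# Hypothetical helpers for illustration only; not the script added in this commit.
CUE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_ms(h, m, s, ms):
    """Convert SRT timestamp fields (as strings) to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def to_srt(total_ms):
    """Format total milliseconds as HH:MM:SS,mmm."""
    s, ms = divmod(total_ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def shift_cue(line, offset_ms):
    """Subtract offset_ms from every timestamp on one timing line, clamping at zero."""
    return CUE.sub(
        lambda mt: to_srt(max(0, to_ms(*mt.groups()) - offset_ms)),
        line,
    )


# First AssemblyAI cue from this commit, shifted by the stated 80 ms offset.
print(shift_cue("00:00:00,080 --> 00:00:05,680", 80))
# prints: 00:00:00,000 --> 00:00:05,600
```

Because every cue moves by the same amount, cue durations and the gaps between cues are unchanged; only the start of the timeline moves.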

Files changed (4)

adjust_srt_timing.py +123 -0
srt-out/assembly.srt +470 -470
srt-out/nova3.srt +576 -576
srt-out/speechmatics.srt +414 -414

adjust_srt_timing.py ADDED

	@@ -0,0 +1,123 @@

+#!/usr/bin/env python3
+"""
+Adjust SRT file timestamps to align with ground truth.
+This script removes timing offsets to ensure all transcripts start at 00:00:00,000.
+"""
+import re
+from pathlib import Path
+def parse_timestamp(timestamp_str):
+    """Convert SRT timestamp to milliseconds."""
+    # Format: HH:MM:SS,mmm
+    time_part, ms_part = timestamp_str.split(',')
+    h, m, s = map(int, time_part.split(':'))
+    ms = int(ms_part)
+    return (h * 3600 + m * 60 + s) * 1000 + ms
+def format_timestamp(ms):
+    """Convert milliseconds to SRT timestamp format."""
+    hours = ms // 3600000
+    ms %= 3600000
+    minutes = ms // 60000
+    ms %= 60000
+    seconds = ms // 1000
+    milliseconds = ms % 1000
+    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"
+def adjust_srt_timing(input_path, output_path, offset_ms):
+    """
+    Adjust all timestamps in an SRT file by subtracting offset_ms.
+    Args:
+        input_path: Path to input SRT file
+        output_path: Path to output SRT file
+        offset_ms: Offset in milliseconds to subtract from all timestamps
+    """
+    with open(input_path, 'r', encoding='utf-8') as f:
+        content = f.read()
+    # Remove BOM if present
+    content = content.lstrip('\ufeff')
+    # Pattern to match timestamp lines: HH:MM:SS,mmm --> HH:MM:SS,mmm
+    timestamp_pattern = re.compile(
+        r'(\d{2}:\d{2}:\d{2},\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2},\d{3})'
+    )
+    def adjust_match(match):
+        start_str = match.group(1)
+        end_str = match.group(2)
+        start_ms = parse_timestamp(start_str)
+        end_ms = parse_timestamp(end_str)
+        # Subtract offset
+        new_start_ms = max(0, start_ms - offset_ms)
+        new_end_ms = max(0, end_ms - offset_ms)
+        new_start = format_timestamp(new_start_ms)
+        new_end = format_timestamp(new_end_ms)
+        return f"{new_start} --> {new_end}"
+    adjusted_content = timestamp_pattern.sub(adjust_match, content)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        f.write(adjusted_content)
+    print(f"✓ Adjusted {input_path.name}: offset={offset_ms}ms → {output_path.name}")
+def find_first_timestamp(srt_path):
+    """Find the first timestamp in an SRT file."""
+    with open(srt_path, 'r', encoding='utf-8') as f:
+        content = f.read()
+    timestamp_pattern = re.compile(r'(\d{2}:\d{2}:\d{2},\d{3})\s*-->')
+    match = timestamp_pattern.search(content)
+    if match:
+        return parse_timestamp(match.group(1))
+    return 0
+def main():
+    srt_dir = Path(__file__).parent / "srt-out"
+    # Files to adjust
+    srt_files = [
+        "assembly.srt",
+        "gladia.srt",
+        "nova3.srt",
+        "speechmatics.srt"
+    ]
+    print("Analyzing SRT files for timing offset...\n")
+    for filename in srt_files:
+        input_path = srt_dir / filename
+        if not input_path.exists():
+            print(f"⚠ Skipping {filename} (not found)")
+            continue
+        # Find first timestamp
+        first_ts_ms = find_first_timestamp(input_path)
+        if first_ts_ms == 0:
+            print(f"✓ {filename} already starts at 00:00:00,000 (no adjustment needed)")
+            continue
+        # Calculate offset
+        offset_ms = first_ts_ms
+        # Adjust the file in place
+        adjust_srt_timing(input_path, input_path, offset_ms)
+    print("\n✅ All SRT files have been adjusted to start at 00:00:00,000")
+if __name__ == "__main__":
+    main()
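
One way to sanity-check the result of a script like the one above is to compare a file's cues before and after adjustment and confirm that every timestamp moved by the same amount. The sketch below is an assumption-laden illustration, not part of the commit: `TIMING`, `cue_spans`, and `check_shift` are hypothetical names, and the "after" sample is computed by hand from the stated 80 ms offset rather than copied from the diff.

```python
import re

# Hypothetical verification helpers; not part of adjust_srt_timing.py.
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def cue_spans(srt_text):
    """Return a list of (start_ms, end_ms) for every timing line in the text."""
    spans = []
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = (h1 * 3600 + m1 * 60 + s1) * 1000 + ms1
        end = (h2 * 3600 + m2 * 60 + s2) * 1000 + ms2
        spans.append((start, end))
    return spans


def check_shift(before_text, after_text):
    """Return the uniform offset in ms if `after` equals `before` shifted so
    that the first cue starts at zero; return None otherwise."""
    before, after = cue_spans(before_text), cue_spans(after_text)
    if not before or len(before) != len(after):
        return None
    offset = before[0][0]
    for (b_start, b_end), (a_start, a_end) in zip(before, after):
        if (b_start - offset, b_end - offset) != (a_start, a_end):
            return None
    return offset


# "Before" timings are the first two AssemblyAI cues removed in this commit;
# "after" timings are hand-computed by subtracting 80 ms (an assumption).
BEFORE = "1\n00:00:00,080 --> 00:00:05,680\nHello\n\n2\n00:00:05,680 --> 00:00:10,640\nworld\n"
AFTER = "1\n00:00:00,000 --> 00:00:05,600\nHello\n\n2\n00:00:05,600 --> 00:00:10,560\nworld\n"

print(check_shift(BEFORE, AFTER))
# prints: 80
```

A `None` result would indicate either a different cue count or a non-uniform shift, both of which would suggest the adjustment changed more than the timeline origin.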

srt-out/assembly.srt CHANGED

@@ -1,1880 +1,1880 @@
 1
-00:00:00,080 --> 00:00:05,680
 Hello and welcome to a audio data set consisting
 2
-00:00:05,680 --> 00:00:10,640
 of one single episode of a non-existent podcast. Or I
 3
-00:00:10,720 --> 00:00:13,360
 may append this to a podcast that I set up
 4
-00:00:13,600 --> 00:00:19,200
 recently regarding my with my thoughts on speech
 5
-00:00:19,280 --> 00:00:24,000
 tech and AI in particular, more AI in generative AI,
 6
-00:00:24,240 --> 00:00:28,640
 I would say. But in any event, the purpose of
 7
-00:00:28,720 --> 00:00:33,850
 this Voice recording is actually to create a lengthy
 8
-00:00:33,930 --> 00:00:37,130
 voice sample for a quick evaluation, a back of the
 9
-00:00:37,130 --> 00:00:40,650
 envelope evaluation, as they might say, for different speech attack
 10
-00:00:40,890 --> 00:00:43,450
 models. And I'm doing this because I thought I had
 11
-00:00:43,450 --> 00:00:46,810
 made a great breakthrough in my journey with speech tech,
 12
-00:00:47,130 --> 00:00:50,730
 and that was succeeding in the elusive task of fine-tuning
 13
-00:00:50,730 --> 00:00:54,810
 Whisper. Whisper is, and I'm going to just talk, I'm
 14
-00:00:54,890 --> 00:00:58,250
 trying to mix up, I'm going to try a few
 15
-00:00:58,410 --> 00:01:01,530
 different styles of speaking. I might whisper something at some
 16
-00:01:01,610 --> 00:01:04,880
 point. As well. And I'll go back to speaking loud
 17
-00:01:04,960 --> 00:01:08,080
 in, in different parts. I'm going to sound really like
 18
-00:01:08,160 --> 00:01:11,120
 a crazy person because I'm also going to try to
 19
-00:01:11,280 --> 00:01:16,240
 speak at different pitches and cadences in order to really
 20
-00:01:16,560 --> 00:01:20,560
 try to put a speech attacks model through its paces,
 21
-00:01:20,720 --> 00:01:23,040
 which is trying to make sense of is this guy
 22
-00:01:23,200 --> 00:01:28,060
 just rambling on incoherently in one long sentence or are
 23
-00:01:28,460 --> 00:01:34,220
 these just actually a series of step, standalone,
 24
-00:01:34,380 --> 00:01:37,420
 step alone, standalone sentences? And how is it gonna handle
 25
-00:01:37,500 --> 00:01:40,460
 step alone? That's not a word. What happens when you
 26
-00:01:40,540 --> 00:01:43,020
 use speech to text and you use a fake word?
 27
-00:01:43,180 --> 00:01:45,580
 And then you're like, wait, that's not actually, that word
 28
-00:01:45,740 --> 00:01:50,220
 doesn't exist. How does AI handle that? And these and
 29
-00:01:50,460 --> 00:01:54,300
 more are all the questions that I'm seeking to answer
 30
-00:01:54,460 --> 00:01:57,500
 in this training data. Now, why was it trying to
 31
-00:01:57,500 --> 00:02:00,290
 fine tune Whisper? And what is Whisper? As I said,
 32
-00:02:00,370 --> 00:02:03,010
 I'm going to try to record this at a couple
 33
-00:02:03,170 --> 00:02:07,490
 of different levels of technicality for folks who are, you
 34
-00:02:07,490 --> 00:02:11,730
 know, in the normal world and not totally stuck down
 35
-00:02:11,810 --> 00:02:13,810
 the rabbit hole of AI, which I have to say
 36
-00:02:13,970 --> 00:02:18,130
 is a really wonderful rabbit hole to be down. It's
 37
-00:02:18,210 --> 00:02:21,570
 a really interesting area and speech and voice tech is
 38
-00:02:21,970 --> 00:02:24,610
 the aspect of it that I find actually the most,
 39
-00:02:25,010 --> 00:02:27,410
 I'm not sure I would say the most interesting because
 40
-00:02:27,650 --> 00:02:31,370
 there's just so much that is fascinating in AI. But
 41
-00:02:31,530 --> 00:02:34,330
 the most that I find the most personally transformative in
 42
-00:02:34,410 --> 00:02:38,970
 terms of the impact that it's had on my daily
 43
-00:02:39,050 --> 00:02:41,530
 work life and productivity and how I sort of work.
 44
-00:02:42,170 --> 00:02:47,290
 And I'm persevering hard with the task of trying
 45
-00:02:47,290 --> 00:02:50,330
 to get a good solution working for Linux, which if
 46
-00:02:50,330 --> 00:02:52,330
 anyone actually does listen to this, not just for the
 47
-00:02:52,330 --> 00:02:56,490
 training data and for the actual content, this is sparked
 48
-00:02:56,830 --> 00:03:00,030
 I had, besides the fine tune not working, well, that
 49
-00:03:00,110 --> 00:03:05,310
 was the failure. Um, I used Claude code because one
 50
-00:03:05,550 --> 00:03:10,030
 thinks these days that there is nothing short of solving,
 51
-00:03:11,070 --> 00:03:15,470
 you know, the, the reason of life or something, that
 52
-00:03:15,870 --> 00:03:19,070
 Claude and agentic AI can't do, which is not really
 53
-00:03:19,150 --> 00:03:22,270
 the case. Uh, it does seem that way sometimes, but
 54
-00:03:22,430 --> 00:03:24,270
 it fails a lot as well. And this is one
 55
-00:03:24,270 --> 00:03:27,710
 of those, instances where last week I put together an
 56
-00:03:27,790 --> 00:03:32,090
 hour of voice training data, basically speaking, just random things
 57
-00:03:32,330 --> 00:03:37,130
 for 3 minutes. And it was actually kind of tedious
 58
-00:03:37,210 --> 00:03:39,290
 because the texts were really weird. Some of them were
 59
-00:03:39,530 --> 00:03:43,130
 it was like it was AI generated. I tried before
 60
-00:03:43,290 --> 00:03:45,210
 to read Sherlock Holmes for an hour and I just
 61
-00:03:45,210 --> 00:03:48,410
 couldn't. I was so bored after 10 minutes that I
 62
-00:03:48,410 --> 00:03:50,810
 was like, okay, no, I'm just going to have to
 63
-00:03:50,810 --> 00:03:55,370
 find something else to read. So I used a created
 64
-00:03:55,770 --> 00:04:01,360
 with AI studio vibe coded a synthetic text generator. Which
 65
-00:04:01,680 --> 00:04:03,920
 actually I thought was probably a better way of doing
 66
-00:04:04,000 --> 00:04:07,520
 it because it would give me more short samples with
 67
-00:04:07,760 --> 00:04:10,560
 more varied content. So I was like, okay, give me
 68
-00:04:10,960 --> 00:04:13,840
 a voice note, like I'm recording an email, give me
 69
-00:04:14,080 --> 00:04:17,760
 a short story to read, give me prose to read.
 70
-00:04:18,080 --> 00:04:20,480
 So I came up with all these different things and
 71
-00:04:20,640 --> 00:04:22,640
 they added a little timer to it so I could
 72
-00:04:22,800 --> 00:04:26,480
 see how close I was to one hour. And I
 73
-00:04:26,640 --> 00:04:29,680
 spent like an hour one afternoon or probably two hours
 74
-00:04:29,840 --> 00:04:33,410
 by the time you you do retakes. And whatever, because
 75
-00:04:33,490 --> 00:04:36,690
 you want to, it gave me a source of truth,
 76
-00:04:37,410 --> 00:04:40,130
 which I'm not sure if that's the scientific way to
 77
-00:04:40,290 --> 00:04:44,290
 approach this topic of gathering, training data, but I thought
 78
-00:04:44,530 --> 00:04:48,210
 made sense. Um, I have a lot of audio data
 79
-00:04:48,290 --> 00:04:50,850
 from recording voice notes, which I've also kind of used,
 80
-00:04:52,130 --> 00:04:55,890
 been experimenting with using for a different purpose, slightly different
 81
-00:04:56,290 --> 00:05:01,490
 annotating task types. It's more a text classification experiment
 82
-00:05:01,810 --> 00:05:04,240
 or, Well, it's more than that actually. I'm working on
 83
-00:05:04,240 --> 00:05:08,160
 a voice app. So it's a prototype, I guess, is
 84
-00:05:08,320 --> 00:05:12,800
 really more accurate. But you can do that and you
 85
-00:05:12,800 --> 00:05:15,280
 can work backwards. You're like, you listen back to a
 86
-00:05:15,280 --> 00:05:18,800
 voice note and you painfully go through one of those
 87
-00:05:19,120 --> 00:05:21,920
 transcribing, you know, where you start and stop and scrub
 88
-00:05:22,080 --> 00:05:24,000
 around it and you fix the errors, but it's really,
 89
-00:05:24,160 --> 00:05:26,800
 really boring to do that. So I thought it would
 90
-00:05:26,880 --> 00:05:29,120
 be less tedious in the long term if I just
 91
-00:05:30,139 --> 00:05:33,020
 recorded the source of truth. So it gave me these
 92
-00:05:33,100 --> 00:05:36,220
 three minute snippets. I recorded them. It saved an MP3
 93
-00:05:36,460 --> 00:05:39,580
 and a TXT in the same folder, and I created
 94
-00:05:39,660 --> 00:05:42,940
 an error with that data. So I was very hopeful,
 95
-00:05:43,340 --> 00:05:46,940
 quietly, a little bit hopeful that I could actually fine
 96
-00:05:47,020 --> 00:05:50,540
 tune Whisper. I want to fine tune Whisper because when
 97
-00:05:50,620 --> 00:05:54,860
 I got into Voicetech last November, my wife was in
 98
-00:05:54,860 --> 00:05:58,220
 the US and I was alone at home. And when
 99
-00:05:58,680 --> 00:06:01,480
 crazy people like me do really wild things like use
 100
-00:06:01,720 --> 00:06:06,200
 voice to tech technology. That was basically when I started
 101
-00:06:06,280 --> 00:06:08,840
 doing it, I didn't feel like a crazy person speaking
 102
-00:06:08,920 --> 00:06:13,800
 to myself. And my expectations weren't that high. I used
 103
-00:06:14,360 --> 00:06:17,720
 speech tech now and again, tried it out. It was
 104
-00:06:17,720 --> 00:06:19,240
 like, it'd be really cool if you could just, like,
 105
-00:06:19,400 --> 00:06:22,840
 speak into your computer. And whatever I tried out that
 106
-00:06:23,080 --> 00:06:26,670
 had Linux support was just. It was not good, basically.
 107
-00:06:27,310 --> 00:06:29,550
 And this blew me away from the first go. I
 108
-00:06:29,550 --> 00:06:32,830
 mean, it wasn't 100% accurate out of the box and
 109
-00:06:32,910 --> 00:06:34,990
 it took work, but it was good enough that there
 110
-00:06:35,070 --> 00:06:37,550
 was a solid foundation and it kind of passed that
 111
-00:06:38,750 --> 00:06:41,950
 pivot point that it's actually worth doing this. You know,
 112
-00:06:42,110 --> 00:06:44,750
 there's a point where it's so like the transcript is
 113
-00:06:44,990 --> 00:06:47,390
 you don't have to get 100% accuracy for it to
 114
-00:06:47,390 --> 00:06:50,110
 be worth your time for speech attacks to be a
 115
-00:06:50,110 --> 00:06:52,510
 worthwhile addition to your productivity, but you do need to
 116
-00:06:52,510 --> 00:06:56,050
 get above, let's say, I don't know, 85%. If it's
 117
-00:06:56,210 --> 00:06:59,890
 60% or 50%, you inevitably say, screw it, I'll just
 118
-00:06:59,890 --> 00:07:02,850
 type it because you end up missing errors in the
 119
-00:07:02,850 --> 00:07:05,570
 transcript and it becomes actually worse. You end up in
 120
-00:07:05,570 --> 00:07:07,650
 a worse position than you started with. That's been my
 121
-00:07:07,730 --> 00:07:12,050
 experience. So I was like, oh, this is actually really,
 122
-00:07:12,210 --> 00:07:14,050
 really good now. How did that happen? And the answer
 123
-00:07:14,210 --> 00:07:19,490
 is ASR whisper being open source and the transformer
 124
-00:07:19,490 --> 00:07:23,250
 architecture. If you want to go back to the to
 125
-00:07:23,330 --> 00:07:26,450
 the underpinnings, which really blows my mind and it's on
 126
-00:07:26,530 --> 00:07:30,760
 my list. To read through that paper. All you need
 127
-00:07:30,840 --> 00:07:36,040
 is attention as attentively as can be done
 128
-00:07:36,280 --> 00:07:39,400
 with my limited brain because it's super, super high level
 129
-00:07:39,720 --> 00:07:44,600
 stuff, super advanced stuff, I mean. But that, I think
 130
-00:07:44,760 --> 00:07:49,400
 of all the things that are fascinating about the sudden
 131
-00:07:49,720 --> 00:07:53,780
 rise in AI and the dramatic capabilities. I find it
 132
-00:07:53,780 --> 00:07:56,180
 fascinating that a few people are like, hang on, you've
 133
-00:07:56,180 --> 00:07:58,500
 got this thing that can speak to you, like a
 134
-00:07:58,500 --> 00:08:03,060
 chatbot, an LLM, and then you've got image generation. Okay,
 135
-00:08:03,140 --> 00:08:06,660
 so firstly, those two things on the surface have nothing
 136
-00:08:06,980 --> 00:08:10,820
 in common. So like, how are they, how did that
 137
-00:08:10,980 --> 00:08:12,580
 just happen all at the same time? And then when
 138
-00:08:12,580 --> 00:08:16,660
 you extend that further, you're like, Suno, right? You can
 139
-00:08:17,140 --> 00:08:20,110
 sing a song and AI will come up with and
 140
-00:08:20,270 --> 00:08:23,470
 instrumental. And then you've got Whisper and you're like, wait
 141
-00:08:23,470 --> 00:08:25,950
 a second, how did all this stuff, like, if it's
 142
-00:08:25,950 --> 00:08:29,310
 all AI, what's like, there has to be some commonality.
 143
-00:08:29,550 --> 00:08:34,670
 Otherwise, these are totally different technologies on the surface of
 144
-00:08:34,670 --> 00:08:38,910
 it. And the Transformer architecture is, as far as I
 145
-00:08:38,990 --> 00:08:41,630
 know, the answer. And I can't even say, can't even
 146
-00:08:41,710 --> 00:08:46,350
 pretend that I really understand what the Transformer architecture means.
 147
-00:08:46,850 --> 00:08:49,330
 In depth, but I have scanned it and as I
 148
-00:08:49,490 --> 00:08:51,890
 said, I want to print it and really kind of
 149
-00:08:52,290 --> 00:08:56,130
 think over it at some point. And I'll probably feel
 150
-00:08:56,370 --> 00:08:59,330
 bad about myself, I think, because weren't those guys in
 151
-00:08:59,410 --> 00:09:03,490
 their 20s? Like, that's crazy. I think I asked ChatGPT
 152
-00:09:03,570 --> 00:09:07,970
 once who wrote that paper and how old were they
 153
-00:09:08,130 --> 00:09:10,850
 when it was published in Arciv? And I was expecting,
 154
-00:09:11,090 --> 00:09:13,970
 like, I don't know, What do you imagine? I personally
 155
-00:09:14,050 --> 00:09:16,290
 imagine kind of like, you know, you have these breakthroughs
 156
-00:09:16,450 --> 00:09:19,890
 during COVID and things like that where like these kind
 157
-00:09:19,970 --> 00:09:22,850
 of really obscure scientists are like in their 50s and
 158
-00:09:22,850 --> 00:09:27,250
 they've just kind of been laboring in labs and wearily
 159
-00:09:27,250 --> 00:09:30,530
 in writing and publishing in kind of obscure academic publications.
 160
-00:09:30,850 --> 00:09:33,250
 And they finally like hit a big or win a
 161
-00:09:33,250 --> 00:09:37,330
 Nobel Prize and then their household names. So that was
 162
-00:09:37,410 --> 00:09:39,070
 kind of what I had in mind. That was the
 163
-00:09:39,070 --> 00:09:43,070
 mental image I'd formed of the birth of Arcsight. Like
 164
-00:09:43,070 --> 00:09:46,350
 I wasn't expecting 20-somethings in San Francisco, though. I thought
 165
-00:09:46,430 --> 00:09:48,910
 that was both very, very funny, very cool, and actually
 166
-00:09:49,070 --> 00:09:52,590
 kind of inspiring. It's nice to think that people who,
 167
-00:09:53,390 --> 00:09:56,190
 you know, just you might put them in the kind
 168
-00:09:56,270 --> 00:09:59,630
 of milieu or bubble or world that you are in
 169
-00:09:59,710 --> 00:10:03,310
 are credibly in through, you know, the series of connections
 170
-00:10:03,390 --> 00:10:07,470
 that are coming up with such literally world changing innovations.
 171
-00:10:07,950 --> 00:10:11,540
 So that was, I thought, anyway. That's that was cool.
 172
-00:10:11,940 --> 00:10:14,580
 Okay, voice training data. How are we doing? We're about
 173
-00:10:14,580 --> 00:10:18,660
 10 minutes and I'm still talking about voice technology. So
 174
-00:10:18,740 --> 00:10:22,180
 Whisper was brilliant and I was so excited that I
 175
-00:10:22,260 --> 00:10:25,460
 was my first instinct was to like guess like, oh
 176
-00:10:25,460 --> 00:10:26,900
 my gosh, I have to get like a really good
 177
-00:10:26,900 --> 00:10:30,660
 microphone for this. So I didn't go on a spending
 178
-00:10:30,660 --> 00:10:32,820
 spree because I said, I'm gonna have to just wait
 179
-00:10:32,820 --> 00:10:35,220
 a month and see if I still use this. And
 180
-00:10:36,510 --> 00:10:38,990
 It just kind of became, it's become really part of
 181
-00:10:39,150 --> 00:10:43,470
 my daily routine. Like if I'm writing an email, I'll
 182
-00:10:43,550 --> 00:10:47,070
 record a voice note. And then I've developed and it's
 183
-00:10:47,070 --> 00:10:49,150
 nice to see that everyone is like developing the same
 184
-00:10:49,630 --> 00:10:52,030
 things in parallel. Like that's my kind of a weird
 185
-00:10:52,030 --> 00:10:54,590
 thing to say, but when I look, I kind of
 186
-00:10:54,750 --> 00:10:59,070
 came, when I started working on this, these prototypes on
 187
-00:10:59,150 --> 00:11:01,550
 GitHub, which is where I just kind of share very
 188
-00:11:01,790 --> 00:11:06,810
 freely and loosely, ideas and first iterations on concepts.
 189
-00:11:08,570 --> 00:11:10,730
 And for want of a better word, I called it
 190
-00:11:10,810 --> 00:11:15,530
 like LLM post-processing or cleanup or basically a system prompt
 191
-00:11:15,610 --> 00:11:18,970
 that after you get back the raw text from Whisper,
 192
-00:11:19,130 --> 00:11:22,090
 you run it through a model and say, okay, this
 193
-00:11:22,170 --> 00:11:27,050
 is crappy text, like add sentence structure and fix it
 194
-00:11:27,130 --> 00:11:32,330
 up. And now when I'm exploring the different tools that
 195
-00:11:32,410 --> 00:11:35,260
 are out there that people have built, I see quite
 196
-00:11:35,500 --> 00:11:39,180
 a number of projects have basically done the same thing,
 197
-00:11:40,540 --> 00:11:43,260
 lest that be misconstrued. I'm not saying for a millisecond
 198
-00:11:43,340 --> 00:11:46,300
 that I inspired them. I'm sure this has been a
 199
-00:11:46,380 --> 00:11:49,580
 thing that's been integrated into tools for a while, but
 200
-00:11:50,460 --> 00:11:52,380
 it's the kind of thing that when you start using
 201
-00:11:52,380 --> 00:11:54,860
 these tools every day, the need for it is almost
 202
-00:11:55,020 --> 00:11:59,500
 instantly apparent because text that doesn't have any punctuation or
 203
-00:11:59,880 --> 00:12:03,080
 Paragraph spacing takes a long time to, you know, it
 204
-00:12:03,240 --> 00:12:05,480
 takes so long to get it into a presentable email
 205
-00:12:05,640 --> 00:12:09,800
 that again, it's, it's, it, it moves speech tech into
 206
-00:12:10,040 --> 00:12:13,560
 that before that inflection point where you're like, no, it's
 207
-00:12:13,560 --> 00:12:16,040
 just not worth it. It's like, it's, it'll just be
 208
-00:12:16,120 --> 00:12:18,600
 quicker to type this. So it's a big, it's a
 209
-00:12:18,600 --> 00:12:21,640
 little touch that actually is a big deal. Uh, so
 210
-00:12:21,800 --> 00:12:25,720
 I was on Whisper and I've been using Whisper and
 211
-00:12:25,720 --> 00:12:28,190
 I kind of, early on found a couple of tools.
 212
-00:12:28,350 --> 00:12:30,590
 I couldn't find what I was looking for on Linux,
 213
-00:12:30,750 --> 00:12:35,550
 which is basically just something that'll run in the background.
 214
-00:12:35,790 --> 00:12:38,110
 It'll give it an API key and it will just
 215
-00:12:38,270 --> 00:12:42,990
 like transcribe with like a little key to start and
 216
-00:12:43,070 --> 00:12:47,390
 stop the dictation. And the issues were I discovered that
 217
-00:12:47,550 --> 00:12:51,150
 like most people involved in creating these projects were very
 218
-00:12:51,310 --> 00:12:55,150
 much focused on local models, running Whisper locally because you
 219
-00:12:55,230 --> 00:12:58,020
 can. And I tried that a bunch of times and
 220
-00:12:58,100 --> 00:13:00,420
 just never got results that were as good as the
 221
-00:13:00,420 --> 00:13:03,220
 cloud. And when I began looking at the cost of
 222
-00:13:03,300 --> 00:13:05,780
 the speech to text APIs and what I was spending,
 223
-00:13:06,340 --> 00:13:09,540
 I just thought there is, it's actually, in my opinion,
 224
-00:13:09,700 --> 00:13:12,900
 just one of the better deals in API spending and
 225
-00:13:12,900 --> 00:13:15,220
 in cloud. Like it's just not that expensive for very,
 226
-00:13:15,380 --> 00:13:19,380
 very good models that are much more, you know, you're
 227
-00:13:19,380 --> 00:13:21,960
 gonna be able to run the full model. The latest
 228
-00:13:21,960 --> 00:13:25,960
 model versus whatever you can run on your average GPU,
 229
-00:13:26,200 --> 00:13:29,240
 unless you want to buy a crazy GPU. It doesn't
 230
-00:13:29,240 --> 00:13:31,160
 really make sense to me. Now, privacy is another concern
 231
-00:13:32,200 --> 00:13:33,960
 that I know is kind of like a very much
 232
-00:13:34,040 --> 00:13:36,840
 a separate thing that people just don't want their voice
 233
-00:13:37,080 --> 00:13:40,760
 data and their voice leaving their local environment, maybe for
 234
-00:13:40,760 --> 00:13:44,280
 regulatory reasons as well. But I'm not in that. I
 235
-00:13:44,680 --> 00:13:48,920
 neither really care about people listening to my grocery list
 236
-00:13:49,160 --> 00:13:51,800
 consisting of reminding myself that I need to buy more
 237
-00:13:51,880 --> 00:13:55,230
 beer, Cheetos, and hummus, which is kind of the three
 238
-00:13:55,390 --> 00:13:59,950
 staples of my diet during periods of poorer nutrition. But
 239
-00:14:00,030 --> 00:14:02,510
 the kind of stuff that I transcribe, it's just not,
 240
-00:14:04,030 --> 00:14:07,790
 it's not a privacy thing I'm that sort of sensitive
 241
-00:14:07,870 --> 00:14:13,230
 about and I don't do anything so sensitive or secure
 242
-00:14:13,310 --> 00:14:16,510
 that requires air gapping. So I looked at the pricing
 243
-00:14:16,590 --> 00:14:19,870
 and especially the kind of older model mini Some of
 244
-00:14:19,950 --> 00:14:22,030
 them are very, very affordable. And I did a back
 245
-00:14:22,270 --> 00:14:25,950
 of the, I did a calculation once with ChatGPT and
 246
-00:14:25,950 --> 00:14:29,310
 I was like, okay, this is the API price for
 247
-00:14:29,470 --> 00:14:32,350
 I can't remember whatever the model was. Let's say I
 248
-00:14:32,430 --> 00:14:35,310
 just go at it like nonstop, which it rarely happens.
 249
-00:14:35,550 --> 00:14:38,910
 Probably, I would say on average, I might dictate 30
 250
-00:14:38,990 --> 00:14:41,870
 to 60 minutes per day if I was probably summing
 251
-00:14:41,870 --> 00:14:47,070
 up the emails, documents, outlines, which
 252
-00:14:47,310 --> 00:14:49,950
 is a lot, but it's still a fairly modest amount.
 253
-00:14:50,110 --> 00:14:52,020
 And I was like, Some days I do go on
 254
-00:14:52,180 --> 00:14:54,980
 like one or two days where I've been usually when
 255
-00:14:54,980 --> 00:14:57,060
 I'm like kind of out of the house and just
 256
-00:14:57,300 --> 00:15:00,580
 have something like I have nothing else to do. Like
 257
-00:15:00,740 --> 00:15:04,100
 if I'm at a hospital, we have a newborn and
 258
-00:15:04,260 --> 00:15:07,380
 you're waiting for like eight hours and hours for an
 259
-00:15:07,460 --> 00:15:10,900
 appointment. And I would probably have listened to podcasts before
 260
-00:15:11,460 --> 00:15:14,260
 becoming a speech fanatic. And I'm like, oh, wait, let
 261
-00:15:14,420 --> 00:15:16,339
 me just get down. Let me just get these ideas
 262
-00:15:16,500 --> 00:15:18,620
 out of my head. And that's when I'll go on
 263
-00:15:19,340 --> 00:15:21,900
 my speech binges. But those are like once every few
 264
-00:15:21,900 --> 00:15:25,020
 months, like not frequently. But I said, okay, let's just
 265
-00:15:25,100 --> 00:15:29,180
 say if I'm gonna price out Cloud SCT, if I
 266
-00:15:29,260 --> 00:15:33,980
 was like dedicated every second of every waking hour to
 267
-00:15:34,140 --> 00:15:37,980
 transcribing for some odd reason, I mean, I'd have to
 268
-00:15:38,060 --> 00:15:40,860
 like eat and use the toilet. Like, you know, there's
 269
-00:15:40,940 --> 00:15:43,500
 only so many hours I'm awake for. So like, let's
 270
-00:15:43,500 --> 00:15:46,700
 just say a maximum of like 40 hour, 45 minutes.
 271
-00:15:47,290 --> 00:15:49,370
 In the hour. Then I said, all right, let's just
 272
-00:15:49,370 --> 00:15:52,970
 say 50. Who knows? You're dictating on the toilet. We
 273
-00:15:53,130 --> 00:15:55,130
 do it. So it could be. You could just do
 274
-00:15:55,210 --> 00:15:59,370
 60. But whatever I did. And every day, like, you're
 275
-00:15:59,450 --> 00:16:02,810
 going flat out seven days a week dictating non-stop I
 276
-00:16:02,810 --> 00:16:05,930
 was like, what's my monthly API bill gonna be at
 277
-00:16:06,010 --> 00:16:08,650
 this price? And it came out to, like, 70 or
 278
-00:16:08,650 --> 00:16:10,810
 80 bucks. And I was like, well, that would be
 279
-00:16:11,210 --> 00:16:15,780
 an extraordinary. Amount of dictation. And I would hope that
 280
-00:16:16,260 --> 00:16:20,020
 there was some compelling reason more worth more than $70
 281
-00:16:20,340 --> 00:16:23,540
 that I embarked upon that project. So given that that's
 282
-00:16:23,540 --> 00:16:25,540
 kind of the max point for me, I said that's
 283
-00:16:25,620 --> 00:16:29,220
 actually very, very affordable. Now you're gonna, if you want
 284
-00:16:29,300 --> 00:16:31,780
 to spec out the costs and you want to do
 285
-00:16:31,780 --> 00:16:36,340
 the post-processing that I really do feel is valuable, that's
 286
-00:16:36,420 --> 00:16:40,900
 gonna cost some more as well, unless you're using Gemini,
 287
-00:16:41,380 --> 00:16:44,500
 which needless to say is a random person sitting in
 288
-00:16:44,580 --> 00:16:49,140
 Jerusalem. I have no affiliation, nor with Google, nor anthropic,
 289
-00:16:49,220 --> 00:16:52,100
 nor Gemini, nor any major tech vendor for that matter.
 290
-00:16:53,700 --> 00:16:56,900
 I like Gemini not so much as a everyday model.
 291
-00:16:57,380 --> 00:16:59,940
 It's kind of underwhelmed in that respect, I would say.
 292
-00:17:00,340 --> 00:17:02,820
 But for multimodal, I think it's got a lot to
 293
-00:17:02,820 --> 00:17:06,580
 offer. And I think that the transcribing functionality whereby it
 294
-00:17:06,660 --> 00:17:11,980
 can process audio with a system prompt and both give
 295
-00:17:12,140 --> 00:17:15,180
 you transcription that's cleaned up that reduces two steps to
 296
-00:17:15,340 --> 00:17:18,300
 one. And that for me is a very, very big
 297
-00:17:18,460 --> 00:17:21,660
 deal. And I feel like even Google has haven't really
 298
-00:17:21,900 --> 00:17:26,780
 sort of thought through how useful the that modality is
 299
-00:17:26,860 --> 00:17:29,340
 and what kind of use cases you can achieve with
 300
-00:17:29,420 --> 00:17:31,340
 it. Because I found in the course of this year,
 301
-00:17:31,980 --> 00:17:36,620
 just an endless list of really kind of system prompt
 302
-00:17:36,940 --> 00:17:40,300
 system prompt stuff that I can say, okay, I've used
 303
-00:17:40,300 --> 00:17:43,500
 it to capture context data for AI, which is literally
 304
-00:17:43,580 --> 00:17:45,740
 I might speak for if I wanted to have a
 305
-00:17:45,740 --> 00:17:49,820
 good bank of context data about who knows my childhood
 306
-00:17:50,380 --> 00:17:54,300
 more realistically, maybe my career goals, something that would just
 307
-00:17:54,380 --> 00:17:56,780
 be like really boring to type out. So I'll just
 308
-00:17:56,860 --> 00:18:00,860
 like sit in my car and record it for 10
 309
-00:18:00,940 --> 00:18:03,180
 minutes. And that 10 minutes you get a lot of
 310
-00:18:03,340 --> 00:18:08,730
 information in. Um, emails, which is short text, just
 311
-00:18:09,130 --> 00:18:12,330
 there is a whole bunch and all these workflows kind
 312
-00:18:12,490 --> 00:18:14,490
 of require a little bit of treatment afterwards and different
 313
-00:18:14,730 --> 00:18:18,170
 treatment. My context pipeline is kind of like just extract
 314
-00:18:18,250 --> 00:18:21,050
 the bare essentials. So you end up with me talking
 315
-00:18:21,130 --> 00:18:23,050
 very loosely about sort of what I've done in my
 316
-00:18:23,130 --> 00:18:25,450
 career, where I've worked, where I might like to work.
 317
-00:18:25,930 --> 00:18:29,050
 And it goes, it condenses that down to very robotic
 318
-00:18:29,290 --> 00:18:32,570
 language that is easy to chunk parse and maybe put
 319
-00:18:32,650 --> 00:18:36,630
 into a vector database. Daniel has worked in technology. Daniel
 320
-00:18:37,510 --> 00:18:40,230
 has been working in, you know, stuff like that. That's
 321
-00:18:40,230 --> 00:18:43,190
 not how you would speak, but I figure it's probably
 322
-00:18:43,430 --> 00:18:47,430
 easier to parse for, after all, robots. So we've almost
 323
-00:18:47,510 --> 00:18:49,350
 got to 20 minutes and this is actually a success
 324
-00:18:49,830 --> 00:18:55,190
 because I wasted 20 minutes of the evening speaking
 325
-00:18:55,270 --> 00:18:59,990
 into a microphone and the levels were shot and it
 326
-00:18:59,990 --> 00:19:01,670
 was clipping and I said, I can't really do an
 327
-00:19:01,750 --> 00:19:04,070
 evaluation. I have to be fair. I have to give
 328
-00:19:04,640 --> 00:19:08,000
 the models a chance to do their thing. What am
 329
-00:19:08,000 --> 00:19:10,400
 I hoping to achieve in this? Okay, my fine tune
 330
-00:19:10,400 --> 00:19:13,440
 was a dud as mentioned. DeepChrom ST, I'm really, really
 331
-00:19:13,520 --> 00:19:16,560
 hopeful that this prototype will work and it's a build
 332
-00:19:16,800 --> 00:19:19,360
 in public open source, so anyone is welcome to use
 333
-00:19:19,440 --> 00:19:22,400
 it if I make anything good. But that was really
 334
-00:19:22,560 --> 00:19:26,560
 exciting for me last night when after hours of trying
 335
-00:19:26,640 --> 00:19:30,560
 my own prototype, seeing someone just made something that works
 336
-00:19:30,720 --> 00:19:32,480
 like that, you know, you're not gonna have to build
 337
-00:19:32,720 --> 00:19:37,540
 a custom conda environment and image. I have AMD GPU,
 338
-00:19:37,700 --> 00:19:41,060
 which makes things much more complicated. I didn't find it.
 339
-00:19:41,620 --> 00:19:43,060
 And I was about to give up and I said,
 340
-00:19:43,140 --> 00:19:45,540
 all right, let me just give Deep Grams Linux thing
 341
-00:19:46,020 --> 00:19:49,300
 a shot. And if this doesn't work, I'm just going
 342
-00:19:49,300 --> 00:19:51,060
 to go back to trying to Vibe code something myself.
 343
-00:19:51,700 --> 00:19:55,540
 And when I ran the script, I was using Claude
 344
-00:19:55,620 --> 00:19:59,140
 code to do the installation process. It ran the script
 345
-00:19:59,220 --> 00:20:02,100
 and oh my gosh, it works just like that. The
 346
-00:20:02,180 --> 00:20:06,060
 tricky thing For all those who want to know all
 347
-00:20:06,060 --> 00:20:11,340
 the nitty gritty details, was that I
 348
-00:20:11,340 --> 00:20:14,460
 don't think it was actually struggling with transcription, but pasting
 349
-00:20:14,780 --> 00:20:18,220
 Wayland makes life very hard. And I think there was
 350
-00:20:18,300 --> 00:20:21,580
 something not running the right time. Anyway, Deepgram, I looked
 351
-00:20:21,580 --> 00:20:23,900
 at how they actually handled that because it worked out
 352
-00:20:23,980 --> 00:20:26,620
 of the box when other stuff didn't. And it was
 353
-00:20:27,180 --> 00:20:30,650
 quite a clever little mechanism. And but more so than
 354
-00:20:30,730 --> 00:20:33,370
 that, the accuracy was brilliant. Now, what am I doing
 355
-00:20:33,370 --> 00:20:36,010
 here? This is going to be a 20 minute audio
 356
-00:20:36,570 --> 00:20:42,090
 sample. And I think I've done one or two
 357
-00:20:42,250 --> 00:20:46,650
 of these before, but I did it with short snappy
 358
-00:20:46,810 --> 00:20:49,850
 voice notes. This is kind of long form. This actually
 359
-00:20:50,090 --> 00:20:52,250
 might be a better approximation for what's useful to me
 360
-00:20:52,410 --> 00:20:55,970
 than voice memos. Like, I need to buy three Bread,
 361
-00:20:56,050 --> 00:20:58,690
 eaters of milk tomorrow and Peter bread, which is probably
 362
-00:20:58,850 --> 00:21:01,410
 how like half my voice notes sound. Like if anyone
 363
-00:21:01,890 --> 00:21:04,130
 were to, I don't know, like find my phone, they'd
 364
-00:21:04,130 --> 00:21:05,650
 be like, this is the most boring person in the
 365
-00:21:05,650 --> 00:21:09,410
 world. Although actually, there are some like kind of journaling
 366
-00:21:09,410 --> 00:21:11,570
 thoughts as well, but it's a lot of content like
 367
-00:21:11,570 --> 00:21:14,530
 that. And the probably for the evaluation, the most useful
 368
-00:21:14,610 --> 00:21:20,290
 thing is slightly obscure tech, GitHub, NeocleNo, hugging
 369
-00:21:20,370 --> 00:21:23,020
 face, Not so obscure that it's not going to have
 370
-00:21:23,100 --> 00:21:26,540
 a chance of knowing it, but hopefully sufficiently well known
 371
-00:21:26,540 --> 00:21:28,780
 that the model should get it. I tried to do
 372
-00:21:28,860 --> 00:21:31,660
 a little bit of speaking really fast and speaking very
 373
-00:21:31,820 --> 00:21:35,100
 slowly. I would say in general, I've spoken, delivered this
 374
-00:21:35,260 --> 00:21:37,580
 at a faster pace than I usually would owing to
 375
-00:21:38,060 --> 00:21:42,540
 strong coffee flowing through my bloodstream. And the thing that
 376
-00:21:42,540 --> 00:21:44,780
 I'm not going to get in this benchmark is background
 377
-00:21:44,860 --> 00:21:46,540
 noise, which in my first take that I had to
 378
-00:21:46,540 --> 00:21:49,790
 get rid of, My wife came in with my son
 379
-00:21:50,110 --> 00:21:52,430
 and for a goodnight kiss. And that actually would have
 380
-00:21:52,430 --> 00:21:56,590
 been super helpful to get in because it was non
 381
-00:21:56,670 --> 00:22:00,270
 diarized or if we had diarization, a female, I could
 382
-00:22:00,270 --> 00:22:02,510
 say, I want the male voice and that wasn't intended
 383
-00:22:02,510 --> 00:22:05,950
 for transcription. And we're not going to get background noise
 384
-00:22:06,030 --> 00:22:08,350
 like people honking their horns, which is something I've done
 385
-00:22:08,510 --> 00:22:11,230
 in my main data set where I am trying to
 386
-00:22:11,470 --> 00:22:14,420
 go back to some of my voice notes. Annotate them
 387
-00:22:14,660 --> 00:22:16,500
 and run a benchmark. But this is going to be
 388
-00:22:16,500 --> 00:22:21,780
 just a pure quick test. And as someone,
 389
-00:22:22,340 --> 00:22:24,740
 I'm working on a voice note idea. That's my sort
 390
-00:22:24,740 --> 00:22:28,740
 of end motivation. Besides thinking it's an ask to the
 391
-00:22:28,740 --> 00:22:32,420
 outstanding technology that's coming to viability. And really, I know
 392
-00:22:32,500 --> 00:22:36,020
 this sounds cheesy, can actually have a very transformative effect.
 393
-00:22:37,060 --> 00:22:41,210
 It's, you know, voice technology has been life changing for
 394
-00:22:42,010 --> 00:22:47,050
 folks living with disabilities. And I think
 395
-00:22:47,210 --> 00:22:49,050
 there's something really nice about the fact that it can
 396
-00:22:49,210 --> 00:22:52,570
 also benefit, you know, folks who are able bodied and
 397
-00:22:52,730 --> 00:22:57,770
 like we can all in different ways make this tech
 398
-00:22:57,850 --> 00:23:00,490
 as useful as possible, regardless of the exact way that
 399
-00:23:00,490 --> 00:23:03,850
 we're using it. And I think there's something very powerful
 400
-00:23:03,930 --> 00:23:06,520
 in that and it can be very cool. I see
 401
-00:23:06,680 --> 00:23:10,280
 huge potential. What excites me about Voicetech? A lot of
 402
-00:23:10,360 --> 00:23:14,440
 things actually. Firstly, the fact that it's cheap and accurate,
 403
-00:23:14,520 --> 00:23:17,160
 as I mentioned at the very start of this. And
 404
-00:23:17,320 --> 00:23:19,960
 it's getting better and better with stuff like accent handling.
 405
-00:23:20,760 --> 00:23:23,480
 I'm not sure my fine-tune will actually ever come to
 406
-00:23:23,560 --> 00:23:25,400
 fruition in the sense that I'll use it day to
 407
-00:23:25,480 --> 00:23:28,920
 day as I imagine. I get like superb flawless words
 408
-00:23:29,000 --> 00:23:33,420
 error rates because I'm just kind of skeptical about Local
 409
-00:23:33,580 --> 00:23:37,180
 speech to text, as I mentioned, and I think the
 410
-00:23:37,260 --> 00:23:40,780
 pace of innovation and improvement in the models, the main
 411
-00:23:40,940 --> 00:23:44,700
 reasons for fine tuning from what I've seen have been
 412
-00:23:44,860 --> 00:23:47,500
 people who are something that really blows my mind about
 413
-00:23:48,060 --> 00:23:53,180
 ASR is the idea that it's inherently a lingual or
 414
-00:23:53,340 --> 00:23:58,650
 multilingual phonetic based. So as folks who use speak
 415
-00:23:58,970 --> 00:24:02,330
 very obscure languages, that there might be a paucity of
 416
-00:24:02,330 --> 00:24:04,970
 training data or almost none at all, and therefore the
 417
-00:24:04,970 --> 00:24:10,170
 accuracy is significantly reduced. Or folks in very critical
 418
-00:24:10,410 --> 00:24:14,330
 environments, I know this is used extensively in medical transcription
 419
-00:24:14,410 --> 00:24:19,210
 and dispatcher work, the call centers who send out ambulances,
 420
-00:24:19,290 --> 00:24:23,210
 et cetera, where accuracy is absolutely paramount. And in the
 421
-00:24:23,210 --> 00:24:26,940
 case of doctors, radiologist, they might be using very specialized
 422
-00:24:26,940 --> 00:24:29,500
 vocab all the time. So those are kind of the
 423
-00:24:29,580 --> 00:24:31,500
 main two things that I'm not sure that really just
 424
-00:24:31,580 --> 00:24:35,020
 for trying to make it better on a few random
 425
-00:24:35,020 --> 00:24:37,980
 tech words with my slightly, I mean, I have an
 426
-00:24:38,060 --> 00:24:41,100
 accent, but like not, you know, an accent that a
 427
-00:24:41,180 --> 00:24:45,980
 few other million people have ish. I'm not sure that
 428
-00:24:46,460 --> 00:24:50,380
 my little fine tune is gonna actually like the bump
 429
-00:24:50,540 --> 00:24:53,580
 in word error reduction, if I ever actually figure out
 430
-00:24:53,580 --> 00:24:54,700
 how to do it and get it up to the
 431
-00:24:54,780 --> 00:24:57,950
 cloud. By the time we've done that, I suspect that
 432
-00:24:58,270 --> 00:25:00,510
 the next generation of ASR will just be so good
 433
-00:25:00,590 --> 00:25:03,070
 that it will kind of be, well, that would have
 434
-00:25:03,070 --> 00:25:04,750
 been cool if it worked out, but I'll just use
 435
-00:25:04,830 --> 00:25:08,590
 this instead. So that's going to be it for today's
 436
-00:25:08,910 --> 00:25:14,110
 episode of voice training data. Single long shot evaluation.
 437
-00:25:14,430 --> 00:25:17,230
 Who am I going to compare? Whisper is always good
 438
-00:25:17,230 --> 00:25:20,590
 as a benchmark, but I'm more interested in seeing Whisper
 439
-00:25:20,670 --> 00:25:24,590
 head to head with two things, really. One is Whisper
 440
-00:25:24,670 --> 00:25:29,780
 variants. So you've got these projects like faster Distill Whisper,
 441
-00:25:29,860 --> 00:25:31,780
 it's a bit confusing, there's a whole bunch of them.
 442
-00:25:32,100 --> 00:25:35,380
 And the emerging ASRs, which are also a thing. My
 443
-00:25:35,460 --> 00:25:37,300
 intention for this is I'm not sure I'm going to
 444
-00:25:37,300 --> 00:25:39,940
 have the time in any point in the foreseeable future
 445
-00:25:40,260 --> 00:25:44,660
 to go back through this whole episode and create a
 446
-00:25:44,740 --> 00:25:49,780
 proper source truth, where I fix everything. Might do
 447
-00:25:49,860 --> 00:25:52,820
 it if I can get one transcriptions that sufficiently close
 448
-00:25:53,060 --> 00:25:57,120
 to perfection. But what I would actually love to do
 449
-00:25:57,280 --> 00:26:00,000
 on Hugging Face, I think would be a great probably
 450
-00:26:00,320 --> 00:26:02,960
 how I might visualize this is having the audio waveform
 451
-00:26:03,280 --> 00:26:08,240
 play and then have the transcript for each model below
 452
-00:26:08,240 --> 00:26:12,640
 it and maybe even a like, you know, to scale
 453
-00:26:13,200 --> 00:26:15,680
 and maybe even a local one as well, like local
 454
-00:26:15,840 --> 00:26:21,180
 whisper versus OpenAI API, et cetera. And, I
 455
-00:26:21,260 --> 00:26:23,580
 can then actually listen back to segments or anyone who
 456
-00:26:23,580 --> 00:26:25,900
 wants to can listen back to segments of this recording
 457
-00:26:26,220 --> 00:26:31,020
 and see where a particular model struggled and others didn't,
 458
-00:26:31,500 --> 00:26:33,420
 as well as the sort of headline finding of which
 459
-00:26:33,580 --> 00:26:36,940
 had the best WER, but that would require the source
 460
-00:26:36,940 --> 00:26:39,660
 of truth. Okay, that's it. I hope this was, I
 461
-00:26:39,660 --> 00:26:42,620
 don't know, maybe useful for other folks interested in STT.
 462
-00:26:42,940 --> 00:26:45,740
 You want to see that I always feel think I've
 463
-00:26:45,740 --> 00:26:48,950
 just said as something I didn't intend to. STT, I
 464
-00:26:48,950 --> 00:26:52,550
 said for those. Listen carefully, including hopefully the models themselves.
 465
-00:26:53,270 --> 00:26:57,350
 This has been myself, Daniel Rosell. For more jumbled repositories
 466
-00:26:57,430 --> 00:27:01,830
 about my roving interests in AI, but particularly agentic, MCP
 467
-00:27:02,070 --> 00:27:07,109
 and Voicetech, you can find me on GitHub, huggingface.com,
 468
-00:27:10,310 --> 00:27:13,350
 which is my personal website, as well as this podcast,
 469
-00:27:13,590 --> 00:27:17,030
 whose name I sadly cannot remember. Until next time, thanks
 470
-00:27:17,030 --> 00:27:17,590
 for listening.

 1
+00:00:00,000 --> 00:00:05,600
 Hello and welcome to a audio data set consisting
 2
+00:00:05,600 --> 00:00:10,560
 of one single episode of a non-existent podcast. Or I
 3
+00:00:10,640 --> 00:00:13,280
 may append this to a podcast that I set up
 4
+00:00:13,520 --> 00:00:19,120
 recently regarding my with my thoughts on speech
 5
+00:00:19,200 --> 00:00:23,920
 tech and AI in particular, more AI in generative AI,
 6
+00:00:24,160 --> 00:00:28,560
 I would say. But in any event, the purpose of
 7
+00:00:28,640 --> 00:00:33,770
 this Voice recording is actually to create a lengthy
 8
+00:00:33,850 --> 00:00:37,050
 voice sample for a quick evaluation, a back of the
 9
+00:00:37,050 --> 00:00:40,570
 envelope evaluation, as they might say, for different speech attack
 10
+00:00:40,810 --> 00:00:43,370
 models. And I'm doing this because I thought I had
 11
+00:00:43,370 --> 00:00:46,730
 made a great breakthrough in my journey with speech tech,
 12
+00:00:47,050 --> 00:00:50,650
 and that was succeeding in the elusive task of fine-tuning
 13
+00:00:50,650 --> 00:00:54,730
 Whisper. Whisper is, and I'm going to just talk, I'm
 14
+00:00:54,810 --> 00:00:58,170
 trying to mix up, I'm going to try a few
 15
+00:00:58,330 --> 00:01:01,450
 different styles of speaking. I might whisper something at some
 16
+00:01:01,530 --> 00:01:04,800
 point. As well. And I'll go back to speaking loud
 17
+00:01:04,880 --> 00:01:08,000
 in, in different parts. I'm going to sound really like
 18
+00:01:08,080 --> 00:01:11,040
 a crazy person because I'm also going to try to
 19
+00:01:11,200 --> 00:01:16,160
 speak at different pitches and cadences in order to really
 20
+00:01:16,480 --> 00:01:20,480
 try to put a speech attacks model through its paces,
 21
+00:01:20,640 --> 00:01:22,960
 which is trying to make sense of is this guy
 22
+00:01:23,120 --> 00:01:27,980
 just rambling on incoherently in one long sentence or are
 23
+00:01:28,380 --> 00:01:34,140
 these just actually a series of step, standalone,
 24
+00:01:34,300 --> 00:01:37,340
 step alone, standalone sentences? And how is it gonna handle
 25
+00:01:37,420 --> 00:01:40,380
 step alone? That's not a word. What happens when you
 26
+00:01:40,460 --> 00:01:42,940
 use speech to text and you use a fake word?
 27
+00:01:43,100 --> 00:01:45,500
 And then you're like, wait, that's not actually, that word
 28
+00:01:45,660 --> 00:01:50,140
 doesn't exist. How does AI handle that? And these and
 29
+00:01:50,380 --> 00:01:54,220
 more are all the questions that I'm seeking to answer
 30
+00:01:54,380 --> 00:01:57,420
 in this training data. Now, why was it trying to
 31
+00:01:57,420 --> 00:02:00,210
 fine tune Whisper? And what is Whisper? As I said,
 32
+00:02:00,290 --> 00:02:02,930
 I'm going to try to record this at a couple
 33
+00:02:03,090 --> 00:02:07,410
 of different levels of technicality for folks who are, you
 34
+00:02:07,410 --> 00:02:11,650
 know, in the normal world and not totally stuck down
 35
+00:02:11,730 --> 00:02:13,730
 the rabbit hole of AI, which I have to say
 36
+00:02:13,890 --> 00:02:18,050
 is a really wonderful rabbit hole to be down. It's
 37
+00:02:18,130 --> 00:02:21,490
 a really interesting area and speech and voice tech is
 38
+00:02:21,890 --> 00:02:24,530
 the aspect of it that I find actually the most,
 39
+00:02:24,930 --> 00:02:27,330
 I'm not sure I would say the most interesting because
 40
+00:02:27,570 --> 00:02:31,290
 there's just so much that is fascinating in AI. But
 41
+00:02:31,450 --> 00:02:34,250
 the most that I find the most personally transformative in
 42
+00:02:34,330 --> 00:02:38,890
 terms of the impact that it's had on my daily
 43
+00:02:38,970 --> 00:02:41,450
 work life and productivity and how I sort of work.
 44
+00:02:42,090 --> 00:02:47,210
 And I'm persevering hard with the task of trying
 45
+00:02:47,210 --> 00:02:50,250
 to get a good solution working for Linux, which if
 46
+00:02:50,250 --> 00:02:52,250
 anyone actually does listen to this, not just for the
 47
+00:02:52,250 --> 00:02:56,410
 training data and for the actual content, this is sparked
 48
+00:02:56,750 --> 00:02:59,950
 I had, besides the fine tune not working, well, that
 49
+00:03:00,030 --> 00:03:05,230
 was the failure. Um, I used Claude code because one
 50
+00:03:05,470 --> 00:03:09,950
 thinks these days that there is nothing short of solving,
 51
+00:03:10,990 --> 00:03:15,390
 you know, the, the reason of life or something, that
 52
+00:03:15,790 --> 00:03:18,990
 Claude and agentic AI can't do, which is not really
 53
+00:03:19,070 --> 00:03:22,190
 the case. Uh, it does seem that way sometimes, but
 54
+00:03:22,350 --> 00:03:24,190
 it fails a lot as well. And this is one
 55
+00:03:24,190 --> 00:03:27,630
 of those, instances where last week I put together an
 56
+00:03:27,710 --> 00:03:32,010
 hour of voice training data, basically speaking, just random things
 57
+00:03:32,250 --> 00:03:37,050
 for 3 minutes. And it was actually kind of tedious
 58
+00:03:37,130 --> 00:03:39,210
 because the texts were really weird. Some of them were
 59
+00:03:39,450 --> 00:03:43,050
 it was like it was AI generated. I tried before
 60
+00:03:43,210 --> 00:03:45,130
 to read Sherlock Holmes for an hour and I just
 61
+00:03:45,130 --> 00:03:48,330
 couldn't. I was so bored after 10 minutes that I
 62
+00:03:48,330 --> 00:03:50,730
 was like, okay, no, I'm just going to have to
 63
+00:03:50,730 --> 00:03:55,290
 find something else to read. So I used a created
 64
+00:03:55,690 --> 00:04:01,280
 with AI studio vibe coded a synthetic text generator. Which
 65
+00:04:01,600 --> 00:04:03,840
 actually I thought was probably a better way of doing
 66
+00:04:03,920 --> 00:04:07,440
 it because it would give me more short samples with
 67
+00:04:07,680 --> 00:04:10,480
 more varied content. So I was like, okay, give me
 68
+00:04:10,880 --> 00:04:13,760
 a voice note, like I'm recording an email, give me
 69
+00:04:14,000 --> 00:04:17,680
 a short story to read, give me prose to read.
 70
+00:04:18,000 --> 00:04:20,400
 So I came up with all these different things and
 71
+00:04:20,560 --> 00:04:22,560
 they added a little timer to it so I could
 72
+00:04:22,720 --> 00:04:26,400
 see how close I was to one hour. And I
 73
+00:04:26,560 --> 00:04:29,600
 spent like an hour one afternoon or probably two hours
 74
+00:04:29,760 --> 00:04:33,330
 by the time you you do retakes. And whatever, because
 75
+00:04:33,410 --> 00:04:36,610
 you want to, it gave me a source of truth,
 76
+00:04:37,330 --> 00:04:40,050
 which I'm not sure if that's the scientific way to
 77
+00:04:40,210 --> 00:04:44,210
 approach this topic of gathering, training data, but I thought
 78
+00:04:44,450 --> 00:04:48,130
 made sense. Um, I have a lot of audio data
 79
+00:04:48,210 --> 00:04:50,770
 from recording voice notes, which I've also kind of used,
 80
+00:04:52,050 --> 00:04:55,810
 been experimenting with using for a different purpose, slightly different
 81
+00:04:56,210 --> 00:05:01,410
 annotating task types. It's more a text classification experiment
 82
+00:05:01,730 --> 00:05:04,160
 or, Well, it's more than that actually. I'm working on
 83
+00:05:04,160 --> 00:05:08,080
 a voice app. So it's a prototype, I guess, is
 84
+00:05:08,240 --> 00:05:12,720
 really more accurate. But you can do that and you
 85
+00:05:12,720 --> 00:05:15,200
 can work backwards. You're like, you listen back to a
 86
+00:05:15,200 --> 00:05:18,720
 voice note and you painfully go through one of those
 87
+00:05:19,040 --> 00:05:21,840
 transcribing, you know, where you start and stop and scrub
 88
+00:05:22,000 --> 00:05:23,920
 around it and you fix the errors, but it's really,
 89
+00:05:24,080 --> 00:05:26,720
 really boring to do that. So I thought it would
 90
+00:05:26,800 --> 00:05:29,040
 be less tedious in the long term if I just
 91
+00:05:30,059 --> 00:05:32,940
 recorded the source of truth. So it gave me these
 92
+00:05:33,020 --> 00:05:36,140
 three minute snippets. I recorded them. It saved an MP3
 93
+00:05:36,380 --> 00:05:39,500
 and a TXT in the same folder, and I created
 94
+00:05:39,580 --> 00:05:42,860
 an error with that data. So I was very hopeful,
 95
+00:05:43,260 --> 00:05:46,860
 quietly, a little bit hopeful that I could actually fine
 96
+00:05:46,940 --> 00:05:50,460
 tune Whisper. I want to fine tune Whisper because when
 97
+00:05:50,540 --> 00:05:54,780
 I got into Voicetech last November, my wife was in
 98
+00:05:54,780 --> 00:05:58,140
 the US and I was alone at home. And when
 99
+00:05:58,600 --> 00:06:01,400
 crazy people like me do really wild things like use
 100
+00:06:01,640 --> 00:06:06,120
 voice to tech technology. That was basically when I started
 101
+00:06:06,200 --> 00:06:08,760
 doing it, I didn't feel like a crazy person speaking
 102
+00:06:08,840 --> 00:06:13,720
 to myself. And my expectations weren't that high. I used
 103
+00:06:14,280 --> 00:06:17,640
 speech tech now and again, tried it out. It was
 104
+00:06:17,640 --> 00:06:19,160
 like, it'd be really cool if you could just, like,
 105
+00:06:19,320 --> 00:06:22,760
 speak into your computer. And whatever I tried out that
 106
+00:06:23,000 --> 00:06:26,590
 had Linux support was just. It was not good, basically.
 107
+00:06:27,230 --> 00:06:29,470
 And this blew me away from the first go. I
 108
+00:06:29,470 --> 00:06:32,750
 mean, it wasn't 100% accurate out of the box and
 109
+00:06:32,830 --> 00:06:34,910
 it took work, but it was good enough that there
 110
+00:06:34,990 --> 00:06:37,470
 was a solid foundation and it kind of passed that
 111
+00:06:38,670 --> 00:06:41,870
 pivot point that it's actually worth doing this. You know,
 112
+00:06:42,030 --> 00:06:44,670
 there's a point where it's so like the transcript is
 113
+00:06:44,910 --> 00:06:47,310
 you don't have to get 100% accuracy for it to
 114
+00:06:47,310 --> 00:06:50,030
 be worth your time for speech attacks to be a
 115
+00:06:50,030 --> 00:06:52,430
 worthwhile addition to your productivity, but you do need to
 116
+00:06:52,430 --> 00:06:55,970
 get above, let's say, I don't know, 85%. If it's
 117
+00:06:56,130 --> 00:06:59,810
 60% or 50%, you inevitably say, screw it, I'll just
 118
+00:06:59,810 --> 00:07:02,770
 type it because you end up missing errors in the
 119
+00:07:02,770 --> 00:07:05,490
 transcript and it becomes actually worse. You end up in
 120
+00:07:05,490 --> 00:07:07,570
 a worse position than you started with. That's been my
 121
+00:07:07,650 --> 00:07:11,970
 experience. So I was like, oh, this is actually really,
 122
+00:07:12,130 --> 00:07:13,970
 really good now. How did that happen? And the answer
 123
+00:07:14,130 --> 00:07:19,410
 is ASR whisper being open source and the transformer
 124
+00:07:19,410 --> 00:07:23,170
 architecture. If you want to go back to the to
 125
+00:07:23,250 --> 00:07:26,370
 the underpinnings, which really blows my mind and it's on
 126
+00:07:26,450 --> 00:07:30,680
 my list. To read through that paper. All you need
 127
+00:07:30,760 --> 00:07:35,960
 is attention as attentively as can be done
 128
+00:07:36,200 --> 00:07:39,320
 with my limited brain because it's super, super high level
 129
+00:07:39,640 --> 00:07:44,520
 stuff, super advanced stuff, I mean. But that, I think
 130
+00:07:44,680 --> 00:07:49,320
 of all the things that are fascinating about the sudden
 131
+00:07:49,640 --> 00:07:53,700
 rise in AI and the dramatic capabilities. I find it
 132
+00:07:53,700 --> 00:07:56,100
 fascinating that a few people are like, hang on, you've
 133
+00:07:56,100 --> 00:07:58,420
 got this thing that can speak to you, like a
 134
+00:07:58,420 --> 00:08:02,980
 chatbot, an LLM, and then you've got image generation. Okay,
 135
+00:08:03,060 --> 00:08:06,580
 so firstly, those two things on the surface have nothing
 136
+00:08:06,900 --> 00:08:10,740
 in common. So like, how are they, how did that
 137
+00:08:10,900 --> 00:08:12,500
 just happen all at the same time? And then when
 138
+00:08:12,500 --> 00:08:16,580
 you extend that further, you're like, Suno, right? You can
 139
+00:08:17,060 --> 00:08:20,030
 sing a song and AI will come up with and
 140
+00:08:20,190 --> 00:08:23,390
 instrumental. And then you've got Whisper and you're like, wait
 141
+00:08:23,390 --> 00:08:25,870
 a second, how did all this stuff, like, if it's
 142
+00:08:25,870 --> 00:08:29,230
 all AI, what's like, there has to be some commonality.
 143
+00:08:29,470 --> 00:08:34,590
 Otherwise, these are totally different technologies on the surface of
 144
+00:08:34,590 --> 00:08:38,830
 it. And the Transformer architecture is, as far as I
 145
+00:08:38,910 --> 00:08:41,550
 know, the answer. And I can't even say, can't even
 146
+00:08:41,630 --> 00:08:46,270
 pretend that I really understand what the Transformer architecture means.
 147
+00:08:46,770 --> 00:08:49,250
 In depth, but I have scanned it and as I
 148
+00:08:49,410 --> 00:08:51,810
 said, I want to print it and really kind of
 149
+00:08:52,210 --> 00:08:56,050
 think over it at some point. And I'll probably feel
 150
+00:08:56,290 --> 00:08:59,250
 bad about myself, I think, because weren't those guys in
 151
+00:08:59,330 --> 00:09:03,410
 their 20s? Like, that's crazy. I think I asked ChatGPT
 152
+00:09:03,490 --> 00:09:07,890
 once who wrote that paper and how old were they
 153
+00:09:08,050 --> 00:09:10,770
 when it was published in Arciv? And I was expecting,
 154
+00:09:11,010 --> 00:09:13,890
 like, I don't know, What do you imagine? I personally
 155
+00:09:13,970 --> 00:09:16,210
 imagine kind of like, you know, you have these breakthroughs
 156
+00:09:16,370 --> 00:09:19,810
 during COVID and things like that where like these kind
 157
+00:09:19,890 --> 00:09:22,770
 of really obscure scientists are like in their 50s and
 158
+00:09:22,770 --> 00:09:27,170
 they've just kind of been laboring in labs and wearily
 159
+00:09:27,170 --> 00:09:30,450
 in writing and publishing in kind of obscure academic publications.
 160
+00:09:30,770 --> 00:09:33,170
 And they finally like hit a big or win a
 161
+00:09:33,170 --> 00:09:37,250
 Nobel Prize and then their household names. So that was
 162
+00:09:37,330 --> 00:09:38,990
 kind of what I had in mind. That was the
 163
+00:09:38,990 --> 00:09:42,990
 mental image I'd formed of the birth of Arcsight. Like
 164
+00:09:42,990 --> 00:09:46,270
 I wasn't expecting 20-somethings in San Francisco, though. I thought
 165
+00:09:46,350 --> 00:09:48,830
 that was both very, very funny, very cool, and actually
 166
+00:09:48,990 --> 00:09:52,510
 kind of inspiring. It's nice to think that people who,
 167
+00:09:53,310 --> 00:09:56,110
 you know, just you might put them in the kind
 168
+00:09:56,190 --> 00:09:59,550
 of milieu or bubble or world that you are in
 169
+00:09:59,630 --> 00:10:03,230
 are credibly in through, you know, the series of connections
 170
+00:10:03,310 --> 00:10:07,390
 that are coming up with such literally world changing innovations.
 171
+00:10:07,870 --> 00:10:11,460
 So that was, I thought, anyway. That's that was cool.
 172
+00:10:11,860 --> 00:10:14,500
 Okay, voice training data. How are we doing? We're about
 173
+00:10:14,500 --> 00:10:18,580
 10 minutes and I'm still talking about voice technology. So
 174
+00:10:18,660 --> 00:10:22,100
 Whisper was brilliant and I was so excited that I
 175
+00:10:22,180 --> 00:10:25,380
 was my first instinct was to like guess like, oh
 176
+00:10:25,380 --> 00:10:26,820
 my gosh, I have to get like a really good
 177
+00:10:26,820 --> 00:10:30,580
 microphone for this. So I didn't go on a spending
 178
+00:10:30,580 --> 00:10:32,740
 spree because I said, I'm gonna have to just wait
 179
+00:10:32,740 --> 00:10:35,140
 a month and see if I still use this. And
 180
+00:10:36,430 --> 00:10:38,910
 It just kind of became, it's become really part of
 181
+00:10:39,070 --> 00:10:43,390
 my daily routine. Like if I'm writing an email, I'll
 182
+00:10:43,470 --> 00:10:46,990
 record a voice note. And then I've developed and it's
 183
+00:10:46,990 --> 00:10:49,070
 nice to see that everyone is like developing the same
 184
+00:10:49,550 --> 00:10:51,950
 things in parallel. Like that's my kind of a weird
 185
+00:10:51,950 --> 00:10:54,510
 thing to say, but when I look, I kind of
 186
+00:10:54,670 --> 00:10:58,990
 came, when I started working on this, these prototypes on
 187
+00:10:59,070 --> 00:11:01,470
 GitHub, which is where I just kind of share very
 188
+00:11:01,710 --> 00:11:06,730
 freely and loosely, ideas and first iterations on concepts.
 189
+00:11:08,490 --> 00:11:10,650
 And for want of a better word, I called it
 190
+00:11:10,730 --> 00:11:15,450
 like LLM post-processing or cleanup or basically a system prompt
 191
+00:11:15,530 --> 00:11:18,890
 that after you get back the raw text from Whisper,
 192
+00:11:19,050 --> 00:11:22,010
 you run it through a model and say, okay, this
 193
+00:11:22,090 --> 00:11:26,970
 is crappy text, like add sentence structure and fix it
 194
+00:11:27,050 --> 00:11:32,250
 up. And now when I'm exploring the different tools that
 195
+00:11:32,330 --> 00:11:35,180
 are out there that people have built, I see quite
 196
+00:11:35,420 --> 00:11:39,100
 a number of projects have basically done the same thing,
 197
+00:11:40,460 --> 00:11:43,180
 lest that be misconstrued. I'm not saying for a millisecond
 198
+00:11:43,260 --> 00:11:46,220
 that I inspired them. I'm sure this has been a
 199
+00:11:46,300 --> 00:11:49,500
 thing that's been integrated into tools for a while, but
 200
+00:11:50,380 --> 00:11:52,300
 it's the kind of thing that when you start using
 201
+00:11:52,300 --> 00:11:54,780
 these tools every day, the need for it is almost
 202
+00:11:54,940 --> 00:11:59,420
 instantly apparent because text that doesn't have any punctuation or
 203
+00:11:59,800 --> 00:12:03,000
 Paragraph spacing takes a long time to, you know, it
 204
+00:12:03,160 --> 00:12:05,400
 takes so long to get it into a presentable email
 205
+00:12:05,560 --> 00:12:09,720
 that again, it's, it's, it, it moves speech tech into
 206
+00:12:09,960 --> 00:12:13,480
 that before that inflection point where you're like, no, it's
 207
+00:12:13,480 --> 00:12:15,960
 just not worth it. It's like, it's, it'll just be
 208
+00:12:16,040 --> 00:12:18,520
 quicker to type this. So it's a big, it's a
 209
+00:12:18,520 --> 00:12:21,560
 little touch that actually is a big deal. Uh, so
 210
+00:12:21,720 --> 00:12:25,640
 I was on Whisper and I've been using Whisper and
 211
+00:12:25,640 --> 00:12:28,110
 I kind of, early on found a couple of tools.
 212
+00:12:28,270 --> 00:12:30,510
 I couldn't find what I was looking for on Linux,
 213
+00:12:30,670 --> 00:12:35,470
 which is basically just something that'll run in the background.
 214
+00:12:35,710 --> 00:12:38,030
 It'll give it an API key and it will just
 215
+00:12:38,190 --> 00:12:42,910
 like transcribe with like a little key to start and
 216
+00:12:42,990 --> 00:12:47,310
 stop the dictation. And the issues were I discovered that
 217
+00:12:47,470 --> 00:12:51,070
 like most people involved in creating these projects were very
 218
+00:12:51,230 --> 00:12:55,070
 much focused on local models, running Whisper locally because you
 219
+00:12:55,150 --> 00:12:57,940
 can. And I tried that a bunch of times and
 220
+00:12:58,020 --> 00:13:00,340
 just never got results that were as good as the
 221
+00:13:00,340 --> 00:13:03,140
 cloud. And when I began looking at the cost of
 222
+00:13:03,220 --> 00:13:05,700
 the speech to text APIs and what I was spending,
 223
+00:13:06,260 --> 00:13:09,460
 I just thought there is, it's actually, in my opinion,
 224
+00:13:09,620 --> 00:13:12,820
 just one of the better deals in API spending and
 225
+00:13:12,820 --> 00:13:15,140
 in cloud. Like it's just not that expensive for very,
 226
+00:13:15,300 --> 00:13:19,300
 very good models that are much more, you know, you're
 227
+00:13:19,300 --> 00:13:21,880
 gonna be able to run the full model. The latest
 228
+00:13:21,880 --> 00:13:25,880
 model versus whatever you can run on your average GPU,
 229
+00:13:26,120 --> 00:13:29,160
 unless you want to buy a crazy GPU. It doesn't
 230
+00:13:29,160 --> 00:13:31,080
 really make sense to me. Now, privacy is another concern
 231
+00:13:32,120 --> 00:13:33,880
 that I know is kind of like a very much
 232
+00:13:33,960 --> 00:13:36,760
 a separate thing that people just don't want their voice
 233
+00:13:37,000 --> 00:13:40,680
 data and their voice leaving their local environment, maybe for
 234
+00:13:40,680 --> 00:13:44,200
 regulatory reasons as well. But I'm not in that. I
 235
+00:13:44,600 --> 00:13:48,840
 neither really care about people listening to my grocery list
 236
+00:13:49,080 --> 00:13:51,720
 consisting of reminding myself that I need to buy more
 237
+00:13:51,800 --> 00:13:55,150
 beer, Cheetos, and hummus, which is kind of the three
 238
+00:13:55,310 --> 00:13:59,870
 staples of my diet during periods of poorer nutrition. But
 239
+00:13:59,950 --> 00:14:02,430
 the kind of stuff that I transcribe, it's just not,
 240
+00:14:03,950 --> 00:14:07,710
 it's not a privacy thing I'm that sort of sensitive
 241
+00:14:07,790 --> 00:14:13,150
 about and I don't do anything so sensitive or secure
 242
+00:14:13,230 --> 00:14:16,430
 that requires air gapping. So I looked at the pricing
 243
+00:14:16,510 --> 00:14:19,790
 and especially the kind of older model mini Some of
 244
+00:14:19,870 --> 00:14:21,950
 them are very, very affordable. And I did a back
 245
+00:14:22,190 --> 00:14:25,870
 of the, I did a calculation once with ChatGPT and
 246
+00:14:25,870 --> 00:14:29,230
 I was like, okay, this is the API price for
 247
+00:14:29,390 --> 00:14:32,270
 I can't remember whatever the model was. Let's say I
 248
+00:14:32,350 --> 00:14:35,230
 just go at it like nonstop, which it rarely happens.
 249
+00:14:35,470 --> 00:14:38,830
 Probably, I would say on average, I might dictate 30
 250
+00:14:38,910 --> 00:14:41,790
 to 60 minutes per day if I was probably summing
 251
+00:14:41,790 --> 00:14:46,990
 up the emails, documents, outlines, which
 252
+00:14:47,230 --> 00:14:49,870
 is a lot, but it's still a fairly modest amount.
 253
+00:14:50,030 --> 00:14:51,940
 And I was like, Some days I do go on
 254
+00:14:52,100 --> 00:14:54,900
 like one or two days where I've been usually when
 255
+00:14:54,900 --> 00:14:56,980
 I'm like kind of out of the house and just
 256
+00:14:57,220 --> 00:15:00,500
 have something like I have nothing else to do. Like
 257
+00:15:00,660 --> 00:15:04,020
 if I'm at a hospital, we have a newborn and
 258
+00:15:04,180 --> 00:15:07,300
 you're waiting for like eight hours and hours for an
 259
+00:15:07,380 --> 00:15:10,820
 appointment. And I would probably have listened to podcasts before
 260
+00:15:11,380 --> 00:15:14,180
 becoming a speech fanatic. And I'm like, oh, wait, let
 261
+00:15:14,340 --> 00:15:16,259
 me just get down. Let me just get these ideas
 262
+00:15:16,420 --> 00:15:18,540
 out of my head. And that's when I'll go on
 263
+00:15:19,260 --> 00:15:21,820
 my speech binges. But those are like once every few
 264
+00:15:21,820 --> 00:15:24,940
 months, like not frequently. But I said, okay, let's just
 265
+00:15:25,020 --> 00:15:29,100
 say if I'm gonna price out Cloud SCT, if I
 266
+00:15:29,180 --> 00:15:33,900
 was like dedicated every second of every waking hour to
 267
+00:15:34,060 --> 00:15:37,900
 transcribing for some odd reason, I mean, I'd have to
 268
+00:15:37,980 --> 00:15:40,780
 like eat and use the toilet. Like, you know, there's
 269
+00:15:40,860 --> 00:15:43,420
 only so many hours I'm awake for. So like, let's
 270
+00:15:43,420 --> 00:15:46,620
 just say a maximum of like 40 hour, 45 minutes.
 271
+00:15:47,210 --> 00:15:49,290
 In the hour. Then I said, all right, let's just
 272
+00:15:49,290 --> 00:15:52,890
 say 50. Who knows? You're dictating on the toilet. We
 273
+00:15:53,050 --> 00:15:55,050
 do it. So it could be. You could just do
 274
+00:15:55,130 --> 00:15:59,290
 60. But whatever I did. And every day, like, you're
 275
+00:15:59,370 --> 00:16:02,730
 going flat out seven days a week dictating non-stop I
 276
+00:16:02,730 --> 00:16:05,850
 was like, what's my monthly API bill gonna be at
 277
+00:16:05,930 --> 00:16:08,570
 this price? And it came out to, like, 70 or
 278
+00:16:08,570 --> 00:16:10,730
 80 bucks. And I was like, well, that would be
 279
+00:16:11,130 --> 00:16:15,700
 an extraordinary. Amount of dictation. And I would hope that
 280
+00:16:16,180 --> 00:16:19,940
 there was some compelling reason more worth more than $70
 281
+00:16:20,260 --> 00:16:23,460
 that I embarked upon that project. So given that that's
 282
+00:16:23,460 --> 00:16:25,460
 kind of the max point for me, I said that's
 283
+00:16:25,540 --> 00:16:29,140
 actually very, very affordable. Now you're gonna, if you want
 284
+00:16:29,220 --> 00:16:31,700
 to spec out the costs and you want to do
 285
+00:16:31,700 --> 00:16:36,260
 the post-processing that I really do feel is valuable, that's
 286
+00:16:36,340 --> 00:16:40,820
 gonna cost some more as well, unless you're using Gemini,
 287
+00:16:41,300 --> 00:16:44,420
 which needless to say is a random person sitting in
 288
+00:16:44,500 --> 00:16:49,060
 Jerusalem. I have no affiliation, nor with Google, nor anthropic,
 289
+00:16:49,140 --> 00:16:52,020
 nor Gemini, nor any major tech vendor for that matter.
 290
+00:16:53,620 --> 00:16:56,820
 I like Gemini not so much as a everyday model.
 291
+00:16:57,300 --> 00:16:59,860
 It's kind of underwhelmed in that respect, I would say.
 292
+00:17:00,260 --> 00:17:02,740
 But for multimodal, I think it's got a lot to
 293
+00:17:02,740 --> 00:17:06,500
 offer. And I think that the transcribing functionality whereby it
 294
+00:17:06,580 --> 00:17:11,900
 can process audio with a system prompt and both give
 295
+00:17:12,060 --> 00:17:15,100
 you transcription that's cleaned up that reduces two steps to
 296
+00:17:15,260 --> 00:17:18,220
 one. And that for me is a very, very big
 297
+00:17:18,380 --> 00:17:21,580
 deal. And I feel like even Google has haven't really
 298
+00:17:21,820 --> 00:17:26,700
 sort of thought through how useful the that modality is
 299
+00:17:26,780 --> 00:17:29,260
 and what kind of use cases you can achieve with
 300
+00:17:29,340 --> 00:17:31,260
 it. Because I found in the course of this year,
 301
+00:17:31,900 --> 00:17:36,540
 just an endless list of really kind of system prompt
 302
+00:17:36,860 --> 00:17:40,220
 system prompt stuff that I can say, okay, I've used
 303
+00:17:40,220 --> 00:17:43,420
 it to capture context data for AI, which is literally
 304
+00:17:43,500 --> 00:17:45,660
 I might speak for if I wanted to have a
 305
+00:17:45,660 --> 00:17:49,740
 good bank of context data about who knows my childhood
 306
+00:17:50,300 --> 00:17:54,220
 more realistically, maybe my career goals, something that would just
 307
+00:17:54,300 --> 00:17:56,700
 be like really boring to type out. So I'll just
 308
+00:17:56,780 --> 00:18:00,780
 like sit in my car and record it for 10
 309
+00:18:00,860 --> 00:18:03,100
 minutes. And that 10 minutes you get a lot of
 310
+00:18:03,260 --> 00:18:08,650
 information in. Um, emails, which is short text, just
 311
+00:18:09,050 --> 00:18:12,250
 there is a whole bunch and all these workflows kind
 312
+00:18:12,410 --> 00:18:14,410
 of require a little bit of treatment afterwards and different
 313
+00:18:14,650 --> 00:18:18,090
 treatment. My context pipeline is kind of like just extract
 314
+00:18:18,170 --> 00:18:20,970
 the bare essentials. So you end up with me talking
 315
+00:18:21,050 --> 00:18:22,970
 very loosely about sort of what I've done in my
 316
+00:18:23,050 --> 00:18:25,370
 career, where I've worked, where I might like to work.
 317
+00:18:25,850 --> 00:18:28,970
 And it goes, it condenses that down to very robotic
 318
+00:18:29,210 --> 00:18:32,490
 language that is easy to chunk parse and maybe put
 319
+00:18:32,570 --> 00:18:36,550
 into a vector database. Daniel has worked in technology. Daniel
 320
+00:18:37,430 --> 00:18:40,150
 has been working in, you know, stuff like that. That's
 321
+00:18:40,150 --> 00:18:43,110
 not how you would speak, but I figure it's probably
 322
+00:18:43,350 --> 00:18:47,350
 easier to parse for, after all, robots. So we've almost
 323
+00:18:47,430 --> 00:18:49,270
 got to 20 minutes and this is actually a success
 324
+00:18:49,750 --> 00:18:55,110
 because I wasted 20 minutes of the evening speaking
 325
+00:18:55,190 --> 00:18:59,910
 into a microphone and the levels were shot and it
 326
+00:18:59,910 --> 00:19:01,590
 was clipping and I said, I can't really do an
 327
+00:19:01,670 --> 00:19:03,990
 evaluation. I have to be fair. I have to give
 328
+00:19:04,560 --> 00:19:07,920
 the models a chance to do their thing. What am
 329
+00:19:07,920 --> 00:19:10,320
 I hoping to achieve in this? Okay, my fine tune
 330
+00:19:10,320 --> 00:19:13,360
 was a dud as mentioned. DeepChrom ST, I'm really, really
 331
+00:19:13,440 --> 00:19:16,480
 hopeful that this prototype will work and it's a build
 332
+00:19:16,720 --> 00:19:19,280
 in public open source, so anyone is welcome to use
 333
+00:19:19,360 --> 00:19:22,320
 it if I make anything good. But that was really
 334
+00:19:22,480 --> 00:19:26,480
 exciting for me last night when after hours of trying
 335
+00:19:26,560 --> 00:19:30,480
 my own prototype, seeing someone just made something that works
 336
+00:19:30,640 --> 00:19:32,400
 like that, you know, you're not gonna have to build
 337
+00:19:32,640 --> 00:19:37,460
 a custom conda environment and image. I have AMD GPU,
 338
+00:19:37,620 --> 00:19:40,980
 which makes things much more complicated. I didn't find it.
 339
+00:19:41,540 --> 00:19:42,980
 And I was about to give up and I said,
 340
+00:19:43,060 --> 00:19:45,460
 all right, let me just give Deep Grams Linux thing
 341
+00:19:45,940 --> 00:19:49,220
 a shot. And if this doesn't work, I'm just going
 342
+00:19:49,220 --> 00:19:50,980
 to go back to trying to Vibe code something myself.
 343
+00:19:51,620 --> 00:19:55,460
 And when I ran the script, I was using Claude
 344
+00:19:55,540 --> 00:19:59,060
 code to do the installation process. It ran the script
 345
+00:19:59,140 --> 00:20:02,020
 and oh my gosh, it works just like that. The
 346
+00:20:02,100 --> 00:20:05,980
 tricky thing For all those who want to know all
 347
+00:20:05,980 --> 00:20:11,260
 the nitty gritty details, was that I
 348
+00:20:11,260 --> 00:20:14,380
 don't think it was actually struggling with transcription, but pasting
 349
+00:20:14,700 --> 00:20:18,140
 Wayland makes life very hard. And I think there was
 350
+00:20:18,220 --> 00:20:21,500
 something not running the right time. Anyway, Deepgram, I looked
 351
+00:20:21,500 --> 00:20:23,820
 at how they actually handled that because it worked out
 352
+00:20:23,900 --> 00:20:26,540
 of the box when other stuff didn't. And it was
 353
+00:20:27,100 --> 00:20:30,570
 quite a clever little mechanism. And but more so than
 354
+00:20:30,650 --> 00:20:33,290
 that, the accuracy was brilliant. Now, what am I doing
 355
+00:20:33,290 --> 00:20:35,930
 here? This is going to be a 20 minute audio
 356
+00:20:36,490 --> 00:20:42,010
 sample. And I think I've done one or two
 357
+00:20:42,170 --> 00:20:46,570
 of these before, but I did it with short snappy
 358
+00:20:46,730 --> 00:20:49,770
 voice notes. This is kind of long form. This actually
 359
+00:20:50,010 --> 00:20:52,170
 might be a better approximation for what's useful to me
 360
+00:20:52,330 --> 00:20:55,890
 than voice memos. Like, I need to buy three Bread,
 361
+00:20:55,970 --> 00:20:58,610
 eaters of milk tomorrow and Peter bread, which is probably
 362
+00:20:58,770 --> 00:21:01,330
 how like half my voice notes sound. Like if anyone
 363
+00:21:01,810 --> 00:21:04,050
 were to, I don't know, like find my phone, they'd
 364
+00:21:04,050 --> 00:21:05,570
 be like, this is the most boring person in the
 365
+00:21:05,570 --> 00:21:09,330
 world. Although actually, there are some like kind of journaling
 366
+00:21:09,330 --> 00:21:11,490
 thoughts as well, but it's a lot of content like
 367
+00:21:11,490 --> 00:21:14,450
 that. And the probably for the evaluation, the most useful
 368
+00:21:14,530 --> 00:21:20,210
 thing is slightly obscure tech, GitHub, NeocleNo, hugging
 369
+00:21:20,290 --> 00:21:22,940
 face, Not so obscure that it's not going to have
 370
+00:21:23,020 --> 00:21:26,460
 a chance of knowing it, but hopefully sufficiently well known
 371
+00:21:26,460 --> 00:21:28,700
 that the model should get it. I tried to do
 372
+00:21:28,780 --> 00:21:31,580
 a little bit of speaking really fast and speaking very
 373
+00:21:31,740 --> 00:21:35,020
 slowly. I would say in general, I've spoken, delivered this
 374
+00:21:35,180 --> 00:21:37,500
 at a faster pace than I usually would owing to
 375
+00:21:37,980 --> 00:21:42,460
 strong coffee flowing through my bloodstream. And the thing that
 376
+00:21:42,460 --> 00:21:44,700
 I'm not going to get in this benchmark is background
 377
+00:21:44,780 --> 00:21:46,460
 noise, which in my first take that I had to
 378
+00:21:46,460 --> 00:21:49,710
 get rid of, My wife came in with my son
 379
+00:21:50,030 --> 00:21:52,350
 and for a goodnight kiss. And that actually would have
 380
+00:21:52,350 --> 00:21:56,510
 been super helpful to get in because it was non
 381
+00:21:56,590 --> 00:22:00,190
 diarized or if we had diarization, a female, I could
 382
+00:22:00,190 --> 00:22:02,430
 say, I want the male voice and that wasn't intended
 383
+00:22:02,430 --> 00:22:05,870
 for transcription. And we're not going to get background noise
 384
+00:22:05,950 --> 00:22:08,270
 like people honking their horns, which is something I've done
 385
+00:22:08,430 --> 00:22:11,150
 in my main data set where I am trying to
 386
+00:22:11,390 --> 00:22:14,340
 go back to some of my voice notes. Annotate them
 387
+00:22:14,580 --> 00:22:16,420
 and run a benchmark. But this is going to be
 388
+00:22:16,420 --> 00:22:21,700
 just a pure quick test. And as someone,
 389
+00:22:22,260 --> 00:22:24,660
 I'm working on a voice note idea. That's my sort
 390
+00:22:24,660 --> 00:22:28,660
 of end motivation. Besides thinking it's an ask to the
 391
+00:22:28,660 --> 00:22:32,340
 outstanding technology that's coming to viability. And really, I know
 392
+00:22:32,420 --> 00:22:35,940
 this sounds cheesy, can actually have a very transformative effect.
 393
+00:22:36,980 --> 00:22:41,130
 It's, you know, voice technology has been life changing for
 394
+00:22:41,930 --> 00:22:46,970
 folks living with disabilities. And I think
 395
+00:22:47,130 --> 00:22:48,970
 there's something really nice about the fact that it can
 396
+00:22:49,130 --> 00:22:52,490
 also benefit, you know, folks who are able bodied and
 397
+00:22:52,650 --> 00:22:57,690
 like we can all in different ways make this tech
 398
+00:22:57,770 --> 00:23:00,410
 as useful as possible, regardless of the exact way that
 399
+00:23:00,410 --> 00:23:03,770
 we're using it. And I think there's something very powerful
 400
+00:23:03,850 --> 00:23:06,440
 in that and it can be very cool. I see
 401
+00:23:06,600 --> 00:23:10,200
 huge potential. What excites me about Voicetech? A lot of
 402
+00:23:10,280 --> 00:23:14,360
 things actually. Firstly, the fact that it's cheap and accurate,
 403
+00:23:14,440 --> 00:23:17,080
 as I mentioned at the very start of this. And
 404
+00:23:17,240 --> 00:23:19,880
 it's getting better and better with stuff like accent handling.
 405
+00:23:20,680 --> 00:23:23,400
 I'm not sure my fine-tune will actually ever come to
 406
+00:23:23,480 --> 00:23:25,320
 fruition in the sense that I'll use it day to
 407
+00:23:25,400 --> 00:23:28,840
 day as I imagine. I get like superb flawless words
 408
+00:23:28,920 --> 00:23:33,340
 error rates because I'm just kind of skeptical about Local
 409
+00:23:33,500 --> 00:23:37,100
 speech to text, as I mentioned, and I think the
 410
+00:23:37,180 --> 00:23:40,700
 pace of innovation and improvement in the models, the main
 411
+00:23:40,860 --> 00:23:44,620
 reasons for fine tuning from what I've seen have been
 412
+00:23:44,780 --> 00:23:47,420
 people who are something that really blows my mind about
 413
+00:23:47,980 --> 00:23:53,100
 ASR is the idea that it's inherently a lingual or
 414
+00:23:53,260 --> 00:23:58,570
 multilingual phonetic based. So as folks who use speak
 415
+00:23:58,890 --> 00:24:02,250
 very obscure languages, that there might be a paucity of
 416
+00:24:02,250 --> 00:24:04,890
 training data or almost none at all, and therefore the
 417
+00:24:04,890 --> 00:24:10,090
 accuracy is significantly reduced. Or folks in very critical
 418
+00:24:10,330 --> 00:24:14,250
 environments, I know this is used extensively in medical transcription
 419
+00:24:14,330 --> 00:24:19,130
 and dispatcher work, the call centers who send out ambulances,
 420
+00:24:19,210 --> 00:24:23,130
 et cetera, where accuracy is absolutely paramount. And in the
 421
+00:24:23,130 --> 00:24:26,860
 case of doctors, radiologist, they might be using very specialized
 422
+00:24:26,860 --> 00:24:29,420
 vocab all the time. So those are kind of the
 423
+00:24:29,500 --> 00:24:31,420
 main two things that I'm not sure that really just
 424
+00:24:31,500 --> 00:24:34,940
 for trying to make it better on a few random
 425
+00:24:34,940 --> 00:24:37,900
 tech words with my slightly, I mean, I have an
 426
+00:24:37,980 --> 00:24:41,020
 accent, but like not, you know, an accent that a
 427
+00:24:41,100 --> 00:24:45,900
 few other million people have ish. I'm not sure that
 428
+00:24:46,380 --> 00:24:50,300
 my little fine tune is gonna actually like the bump
 429
+00:24:50,460 --> 00:24:53,500
 in word error reduction, if I ever actually figure out
 430
+00:24:53,500 --> 00:24:54,620
 how to do it and get it up to the
 431
+00:24:54,700 --> 00:24:57,870
 cloud. By the time we've done that, I suspect that
 432
+00:24:58,190 --> 00:25:00,430
 the next generation of ASR will just be so good
 433
+00:25:00,510 --> 00:25:02,990
 that it will kind of be, well, that would have
 434
+00:25:02,990 --> 00:25:04,670
 been cool if it worked out, but I'll just use
 435
+00:25:04,750 --> 00:25:08,510
 this instead. So that's going to be it for today's
 436
+00:25:08,830 --> 00:25:14,030
 episode of voice training data. Single long shot evaluation.
 437
+00:25:14,350 --> 00:25:17,150
 Who am I going to compare? Whisper is always good
 438
+00:25:17,150 --> 00:25:20,510
 as a benchmark, but I'm more interested in seeing Whisper
 439
+00:25:20,590 --> 00:25:24,510
 head to head with two things, really. One is Whisper
 440
+00:25:24,590 --> 00:25:29,700
 variants. So you've got these projects like faster Distill Whisper,
 441
+00:25:29,780 --> 00:25:31,700
 it's a bit confusing, there's a whole bunch of them.
 442
+00:25:32,020 --> 00:25:35,300
 And the emerging ASRs, which are also a thing. My
 443
+00:25:35,380 --> 00:25:37,220
 intention for this is I'm not sure I'm going to
 444
+00:25:37,220 --> 00:25:39,860
 have the time in any point in the foreseeable future
 445
+00:25:40,180 --> 00:25:44,580
 to go back through this whole episode and create a
 446
+00:25:44,660 --> 00:25:49,700
 proper source truth, where I fix everything. Might do
 447
+00:25:49,780 --> 00:25:52,740
 it if I can get one transcriptions that sufficiently close
 448
+00:25:52,980 --> 00:25:57,040
 to perfection. But what I would actually love to do
 449
+00:25:57,200 --> 00:25:59,920
 on Hugging Face, I think would be a great probably
 450
+00:26:00,240 --> 00:26:02,880
 how I might visualize this is having the audio waveform
 451
+00:26:03,200 --> 00:26:08,160
 play and then have the transcript for each model below
 452
+00:26:08,160 --> 00:26:12,560
 it and maybe even a like, you know, to scale
 453
+00:26:13,120 --> 00:26:15,600
 and maybe even a local one as well, like local
 454
+00:26:15,760 --> 00:26:21,100
 whisper versus OpenAI API, et cetera. And, I
 455
+00:26:21,180 --> 00:26:23,500
 can then actually listen back to segments or anyone who
 456
+00:26:23,500 --> 00:26:25,820
 wants to can listen back to segments of this recording
 457
+00:26:26,140 --> 00:26:30,940
 and see where a particular model struggled and others didn't,
 458
+00:26:31,420 --> 00:26:33,340
 as well as the sort of headline finding of which
 459
+00:26:33,500 --> 00:26:36,860
 had the best WER, but that would require the source
 460
+00:26:36,860 --> 00:26:39,580
 of truth. Okay, that's it. I hope this was, I
 461
+00:26:39,580 --> 00:26:42,540
 don't know, maybe useful for other folks interested in STT.
 462
+00:26:42,860 --> 00:26:45,660
 You want to see that I always feel think I've
 463
+00:26:45,660 --> 00:26:48,870
 just said as something I didn't intend to. STT, I
 464
+00:26:48,870 --> 00:26:52,470
 said for those. Listen carefully, including hopefully the models themselves.
 465
+00:26:53,190 --> 00:26:57,270
 This has been myself, Daniel Rosell. For more jumbled repositories
 466
+00:26:57,350 --> 00:27:01,750
 about my roving interests in AI, but particularly agentic, MCP
 467
+00:27:01,990 --> 00:27:07,029
 and Voicetech, you can find me on GitHub, huggingface.com,
 468
+00:27:10,230 --> 00:27:13,270
 which is my personal website, as well as this podcast,
 469
+00:27:13,510 --> 00:27:16,950
 whose name I sadly cannot remember. Until next time, thanks
 470
+00:27:16,950 --> 00:27:17,510
 for listening.

srt-out/nova3.srt CHANGED

@@ -1,2304 +1,2304 @@
 1
-00:00:00,080 --> 00:00:06,240
 Hello and welcome to a audio dataset consisting of one
 2
-00:00:06,240 --> 00:00:08,400
 single episode of a nonexistent podcast.
 3
-00:00:08,800 --> 00:00:12,880
 Or it I may append this to a podcast that
 4
-00:00:12,880 --> 00:00:18,814
 I set up recently regarding my with my thoughts on
 5
-00:00:18,815 --> 00:00:20,815
 speech tech and A.
 6
-00:00:20,815 --> 00:00:21,214
 I.
 7
-00:00:21,214 --> 00:00:22,814
 In particular, more A.
 8
-00:00:22,814 --> 00:00:23,054
 I.
 9
-00:00:23,054 --> 00:00:23,935
 And generative A.
 10
-00:00:23,935 --> 00:00:24,095
 I.
 11
-00:00:24,095 --> 00:00:26,494
 I would I would say.
 12
-00:00:26,814 --> 00:00:30,869
 But in any event, the purpose of this voice recording
 13
-00:00:30,869 --> 00:00:35,590
 is actually to create a lengthy voice sample for a
 14
-00:00:35,590 --> 00:00:38,950
 quick evaluation, a back of the envelope evaluation, they might
 15
-00:00:38,950 --> 00:00:41,429
 say, for different speech attacks models.
 16
-00:00:41,429 --> 00:00:43,945
 I'm doing this because I thought I'd made a great
 17
-00:00:43,945 --> 00:00:47,784
 breakthrough in my journey with speech tech and that was
 18
-00:00:47,784 --> 00:00:51,385
 succeeding in the elusive task of fine tuning whisper.
 19
-00:00:51,704 --> 00:00:56,424
 Whisper is, and I'm to just talk, I'm trying to
 20
-00:00:55,829 --> 00:00:56,789
 mix up.
 21
-00:00:56,869 --> 00:01:00,390
 I'm going to try a few different styles of speaking
 22
-00:01:00,390 --> 00:01:02,869
 whisper something at some points as well.
 23
-00:01:03,350 --> 00:01:06,790
 And I'll go back to speaking loud in in different
 24
-00:01:06,790 --> 00:01:09,030
 parts are going to sound really like a crazy person
 25
-00:01:09,030 --> 00:01:12,424
 because I'm also going to try to speak at different
 26
-00:01:12,984 --> 00:01:18,025
 pitches and cadences in order to really try to push
 27
-00:01:18,344 --> 00:01:21,145
 a speech to text model through its paces, which is
 28
-00:01:21,145 --> 00:01:24,609
 trying to make sense of is this guy just rambling
 29
-00:01:24,609 --> 00:01:30,049
 on incoherently in one long sentence or are these just
 30
-00:01:30,049 --> 00:01:36,450
 actually a series of step standalone, standalone, standalone sentences?
 31
-00:01:36,450 --> 00:01:38,130
 And how is it going to handle step alone?
 32
-00:01:38,130 --> 00:01:38,770
 That's not a word.
 33
-00:01:39,704 --> 00:01:42,025
 What happens when you use speech to text and you
 34
-00:01:42,025 --> 00:01:43,384
 use a fake word?
 35
-00:01:43,384 --> 00:01:45,784
 And then you're like, wait, that's not actually that word
 36
-00:01:45,784 --> 00:01:46,665
 doesn't exist.
 37
-00:01:46,984 --> 00:01:48,584
 How does AI handle that?
 38
-00:01:48,584 --> 00:01:53,750
 And these and more are all the questions that I'm
 39
-00:01:53,750 --> 00:01:55,750
 seeking to answer in this training data.
 40
-00:01:55,829 --> 00:01:58,549
 Now, why was I trying to fine tune Whisper?
 41
-00:01:58,549 --> 00:01:59,750
 And what is Whisper?
 42
-00:01:59,750 --> 00:02:02,710
 As I said, I'm going to try to record this
 43
-00:02:02,710 --> 00:02:06,644
 at a couple of different levels of technicality for folks
 44
-00:02:06,644 --> 00:02:11,764
 who are in the normal world and not totally stuck
 45
-00:02:11,764 --> 00:02:13,764
 down the rabbit hole of AI, which you have to
 46
-00:02:13,764 --> 00:02:17,685
 say is a really wonderful rabbit hole to be done.
 47
-00:02:17,844 --> 00:02:20,919
 It's a really interesting area and speech and voice tech
 48
-00:02:20,919 --> 00:02:24,359
 is is the aspect of it that I find actually
 49
-00:02:24,359 --> 00:02:27,239
 most I'm not sure I would say the most interesting
 50
-00:02:27,239 --> 00:02:30,759
 because there's just so much that is fascinating in AI.
 51
-00:02:31,400 --> 00:02:34,134
 But the most that I find the most personally transformative
 52
-00:02:34,134 --> 00:02:38,534
 in terms of the impact that it's had on my
 53
-00:02:38,534 --> 00:02:41,254
 daily work life and productivity and how I sort of
 54
-00:02:41,254 --> 00:02:41,895
 work.
 55
-00:02:42,935 --> 00:02:47,500
 I'm persevering hard with the task of trying to get
 56
-00:02:47,500 --> 00:02:50,939
 a good solution working for Linux, which if anyone actually
 57
-00:02:50,939 --> 00:02:52,939
 does listen to this, not just for the training data
 58
-00:02:52,939 --> 00:02:56,700
 and for the actual content, is sparked.
 59
-00:02:56,700 --> 00:02:59,980
 I had, besides the fine tune not working, well that
 60
-00:02:59,980 --> 00:03:01,385
 was the failure.
 61
-00:03:02,504 --> 00:03:06,745
 I used Claude code because one thinks these days that
 62
-00:03:06,745 --> 00:03:13,280
 there is nothing short of solving, you know, the the
 63
-00:03:13,280 --> 00:03:17,599
 reason of life or something that clause and agentic AI
 64
-00:03:17,599 --> 00:03:19,680
 can't do, which is not really the case.
 65
-00:03:19,680 --> 00:03:23,199
 It does seem that way sometimes, but it fails a
 66
-00:03:23,199 --> 00:03:23,759
 lot as well.
 67
-00:03:23,759 --> 00:03:26,639
 And this is one of those instances where last week
 68
-00:03:26,639 --> 00:03:30,824
 I put together an hour of voice training data, basically
 69
-00:03:30,824 --> 00:03:33,465
 speaking just random things for three minutes.
 70
-00:03:35,465 --> 00:03:38,104
 It was actually kind of tedious because the texts were
 71
-00:03:38,104 --> 00:03:38,664
 really weird.
 72
-00:03:38,664 --> 00:03:41,370
 Some of them were, it was like it was AI
 73
-00:03:41,370 --> 00:03:42,250
 generated.
 74
-00:03:42,569 --> 00:03:44,889
 I tried before to read Sherlock Holmes for an hour
 75
-00:03:44,889 --> 00:03:47,689
 and I just couldn't, I was so bored after ten
 76
-00:03:47,689 --> 00:03:50,569
 minutes that I was like, okay, no, I'm just gonna
 77
-00:03:50,569 --> 00:03:51,930
 have to find something else to read.
 78
-00:03:51,930 --> 00:03:58,284
 So I used a created with AI Studio, VibeCoded, a
 79
-00:03:58,284 --> 00:04:03,164
 synthetic text generator which actually I thought was probably a
 80
-00:04:03,164 --> 00:04:05,245
 better way of doing it because it would give me
 81
-00:04:05,245 --> 00:04:09,069
 more short samples with more varied content.
 82
-00:04:09,069 --> 00:04:11,710
 So I was like, okay, give me a voice note
 83
-00:04:11,710 --> 00:04:14,909
 like I'm recording an email, give me a short story
 84
-00:04:14,909 --> 00:04:18,189
 to read, give me prose to read.
 85
-00:04:18,189 --> 00:04:20,634
 So I came up with all these different things and
 86
-00:04:20,634 --> 00:04:22,714
 they added a little timer to it so I could
 87
-00:04:22,714 --> 00:04:24,955
 see how close I was to one hour.
 88
-00:04:25,915 --> 00:04:29,115
 And I spent like an hour one afternoon or probably
 89
-00:04:29,115 --> 00:04:33,115
 two hours by the time you do retakes and whatever
 90
-00:04:33,115 --> 00:04:36,169
 because you want to it gave me a source of
 91
-00:04:36,169 --> 00:04:40,009
 truth which I'm not sure if that's the scientific way
 92
-00:04:40,009 --> 00:04:44,169
 to approach this topic of gathering training data but I
 93
-00:04:44,169 --> 00:04:45,449
 thought made sense.
 94
-00:04:46,490 --> 00:04:49,464
 I have a lot of audio data from recording voice
 95
-00:04:49,464 --> 00:04:53,544
 notes which I've also kind of used, been experimenting with
 96
-00:04:53,544 --> 00:04:55,064
 using for a different purpose.
 97
-00:04:55,384 --> 00:04:58,745
 Slightly different annotating task types.
 98
-00:04:58,745 --> 00:05:03,250
 It's more a text classification experiment or Well, it's more
 99
-00:05:03,250 --> 00:05:03,810
 than that actually.
 100
-00:05:03,810 --> 00:05:05,009
 I'm working on a voice app.
 101
-00:05:05,009 --> 00:05:09,329
 So it's a prototype, I guess, is really more accurate.
 102
-00:05:11,409 --> 00:05:13,969
 But you can do that and you can work backwards.
 103
-00:05:13,969 --> 00:05:18,354
 Listen back to a voice note and you painfully go
 104
-00:05:18,354 --> 00:05:21,474
 through one of those transcribing, where you start and stop
 105
-00:05:21,474 --> 00:05:23,634
 and scrub around it and you fix the errors, but
 106
-00:05:23,634 --> 00:05:25,875
 it's really, really pouring to do that.
 107
-00:05:26,115 --> 00:05:28,034
 So I thought it would be less tedious in the
 108
-00:05:28,034 --> 00:05:31,714
 long term if I just recorded the source of truth.
 109
-00:05:32,069 --> 00:05:34,389
 So it gave me these three minutes snippets.
 110
-00:05:34,389 --> 00:05:37,509
 I recorded them and saved an MP3 and a TXT
 111
-00:05:37,750 --> 00:05:40,310
 in the same folder and I created an error that
 112
-00:05:40,310 --> 00:05:40,949
 data.
 113
-00:05:41,990 --> 00:05:44,870
 So I was very hopeful, quietly, a little bit hopeful
 114
-00:05:44,870 --> 00:05:47,029
 that I would be able, that I could actually fine
 115
-00:05:47,029 --> 00:05:47,750
 tune Whisper.
 116
-00:05:48,365 --> 00:05:51,085
 I want to fine tune Whisper because when I got
 117
-00:05:51,085 --> 00:05:55,004
 into voice tech last November, my wife was in the
 118
-00:05:55,004 --> 00:05:57,245
 US and I was alone at home.
 119
-00:05:57,324 --> 00:06:01,004
 And when crazy people like me do really wild things
 120
-00:06:01,004 --> 00:06:03,980
 like use voice to tech technology.
 121
-00:06:03,980 --> 00:06:06,939
 That was basically when I started doing it, I didn't
 122
-00:06:06,939 --> 00:06:09,580
 feel like a crazy person speaking to myself.
 123
-00:06:09,980 --> 00:06:12,780
 And my expectations weren't that high.
 124
-00:06:13,180 --> 00:06:17,685
 I'd used speech tech now and again, tried it out.
 125
-00:06:17,685 --> 00:06:18,884
 I was like, it'd be really cool if you could
 126
-00:06:18,884 --> 00:06:22,404
 just like speak into your computer and whatever I tried
 127
-00:06:22,404 --> 00:06:25,925
 out that had Linux support was just, it was not
 128
-00:06:25,925 --> 00:06:26,805
 good basically.
 129
-00:06:27,365 --> 00:06:29,524
 And this blew me away from the first go.
 130
-00:06:29,524 --> 00:06:32,339
 I mean, it wasn't one hundred percent accurate out of
 131
-00:06:32,339 --> 00:06:34,500
 the box and it took work, but it was good
 132
-00:06:34,500 --> 00:06:36,819
 enough that there was a solid foundation and it kind
 133
-00:06:36,819 --> 00:06:41,139
 of passed that pivot point that it's actually worth doing
 134
-00:06:41,139 --> 00:06:41,620
 this.
 135
-00:06:41,939 --> 00:06:43,939
 You know, there's a point where it's so like, the
 136
-00:06:43,939 --> 00:06:46,485
 transcript is you don't have to get one hundred percent
 137
-00:06:46,485 --> 00:06:49,525
 accuracy for it to be worth your time for speech
 138
-00:06:49,525 --> 00:06:51,925
 to text to be a worthwhile addition to your productivity.
 139
-00:06:51,925 --> 00:06:53,685
 But you do need to get above, let's say, I
 140
-00:06:53,685 --> 00:06:55,125
 don't know, eighty five percent.
 141
-00:06:55,605 --> 00:06:58,805
 If it's sixty percent or fifty percent, you inevitably say,
 142
-00:06:59,040 --> 00:07:00,319
 Screw it, I'll just type it.
 143
-00:07:00,319 --> 00:07:03,680
 Because you end up missing errors in the transcript and
 144
-00:07:03,680 --> 00:07:05,040
 it becomes actually worse.
 145
-00:07:05,040 --> 00:07:06,720
 You end up in a worse position than you started
 146
-00:07:06,720 --> 00:07:07,040
 with it.
 147
-00:07:07,040 --> 00:07:08,240
 That's been my experience.
 148
-00:07:08,560 --> 00:07:12,480
 So I was like, Oh, this is actually really, really
 149
-00:07:12,480 --> 00:07:12,960
 good now.
 150
-00:07:12,960 --> 00:07:13,680
 How did that happen?
 151
-00:07:13,680 --> 00:07:17,995
 And the answer is ASR, Whisper being open sourced and
 152
-00:07:18,714 --> 00:07:21,594
 the transformer architecture, if you want to go back to
 153
-00:07:21,594 --> 00:07:26,394
 the underpinnings, which really blows my mind and it's on
 154
-00:07:26,394 --> 00:07:29,830
 my list to read through that paper.
 155
-00:07:30,389 --> 00:07:35,990
 All you need is attention as attentively as can be
 156
-00:07:35,990 --> 00:07:39,350
 done with my limited brain because it's super super high
 157
-00:07:39,350 --> 00:07:43,045
 level stuff, super advanced stuff, mean.
 158
-00:07:43,285 --> 00:07:48,084
 That I think of all the things that are fascinating
 159
-00:07:48,084 --> 00:07:52,564
 about the sudden rise in AI and the dramatic capabilities,
 160
-00:07:53,339 --> 00:07:55,419
 I find it fascinating that few people are like, hang
 161
-00:07:55,419 --> 00:07:58,300
 on, you've got this thing that can speak to you
 162
-00:07:58,300 --> 00:08:00,060
 like a chatbot, an LLM.
 163
-00:08:00,620 --> 00:08:02,860
 And then you've got image generation.
 164
-00:08:02,860 --> 00:08:03,180
 Okay.
 165
-00:08:03,180 --> 00:08:07,100
 So firstly, two things on the surface have nothing in
 166
-00:08:07,100 --> 00:08:07,419
 common.
 167
-00:08:08,365 --> 00:08:12,044
 So how did that just happen all at the same
 168
-00:08:12,044 --> 00:08:12,285
 time?
 169
-00:08:12,285 --> 00:08:15,964
 And then when you extend that further, you're like, Suno.
 170
-00:08:15,964 --> 00:08:19,485
 You can sing a song and AI will come up
 171
-00:08:19,485 --> 00:08:21,165
 with an instrumental.
 172
-00:08:21,485 --> 00:08:23,485
 And then you've got Whisper and you're like, Wait a
 173
-00:08:23,485 --> 00:08:23,725
 second.
 174
-00:08:24,100 --> 00:08:28,180
 How did all this stuff If it's all AI, there
 175
-00:08:28,180 --> 00:08:29,540
 has to be some commonality.
 176
-00:08:29,540 --> 00:08:35,139
 Otherwise, are totally different technologies on the surface of it.
 177
-00:08:35,220 --> 00:08:39,384
 And the transformer architecture is, as far as I know,
 178
-00:08:39,384 --> 00:08:40,264
 the answer.
 179
-00:08:40,264 --> 00:08:42,985
 And I can't even say, can't even pretend that I
 180
-00:08:42,985 --> 00:08:47,384
 really understand what the transformer architecture means in-depth.
 181
-00:08:47,384 --> 00:08:49,865
 But I have scanned this and as I said, I
 182
-00:08:49,865 --> 00:08:52,879
 want to print it and really kind of think over
 183
-00:08:52,879 --> 00:08:54,160
 it at some point.
 184
-00:08:54,879 --> 00:08:58,080
 And I'll probably feel bad about myself, I think, because
 185
-00:08:58,080 --> 00:08:59,679
 weren't those guys in twenties?
 186
-00:09:00,320 --> 00:09:01,840
 Like, that's crazy.
 187
-00:09:02,160 --> 00:09:06,160
 I think I asked ChatGPT once who wrote that paper
 188
-00:09:06,545 --> 00:09:09,264
 and how old were they when it was published in
 189
-00:09:09,264 --> 00:09:09,825
 ArcSiv?
 190
-00:09:09,825 --> 00:09:13,105
 And I was expecting like, I don't know, what do
 191
-00:09:13,105 --> 00:09:13,585
 you imagine?
 192
-00:09:13,585 --> 00:09:15,665
 I personally imagine kind of like, you you have these
 193
-00:09:15,665 --> 00:09:19,745
 breakthroughs during COVID and things like that, where like these
 194
-00:09:19,745 --> 00:09:22,629
 kind of really obscure scientists who are in their 50s
 195
-00:09:22,629 --> 00:09:26,870
 and they've just kind of been laboring in labs and
 196
-00:09:26,870 --> 00:09:29,830
 wearily in writing and publishing in kind of obscure academic
 197
-00:09:29,830 --> 00:09:30,710
 publications.
 198
-00:09:30,870 --> 00:09:33,669
 And they finally hit a big or win a Nobel
 199
-00:09:33,669 --> 00:09:36,235
 Prize and then their household names.
 200
-00:09:36,634 --> 00:09:38,634
 So that was kind of what I had in mind.
 201
-00:09:38,634 --> 00:09:42,154
 That was the mental image I'd formed of the birth
 202
-00:09:42,154 --> 00:09:42,955
 of ArcSim.
 203
-00:09:42,955 --> 00:09:45,595
 Like I wasn't expecting twenty somethings in San Francisco.
 204
-00:09:45,595 --> 00:09:48,794
 I thought that was both very funny, very cool, and
 205
-00:09:48,794 --> 00:09:50,075
 actually kind of inspiring.
 206
-00:09:50,554 --> 00:09:55,230
 It's nice to think that people who just you might
 207
-00:09:55,230 --> 00:09:58,509
 put them in the kind of milieu or bubble or
 208
-00:09:58,509 --> 00:10:02,669
 world that you are in incredibly in through a series
 209
-00:10:02,669 --> 00:10:05,835
 of connections that are coming up with such literally world
 210
-00:10:05,835 --> 00:10:07,835
 changing innovations.
 211
-00:10:07,914 --> 00:10:11,274
 So that was I thought anyway, that's that that was
 212
-00:10:11,274 --> 00:10:11,835
 cool.
 213
-00:10:12,235 --> 00:10:12,554
 Okay.
 214
-00:10:12,554 --> 00:10:13,434
 Voice training data.
 215
-00:10:13,434 --> 00:10:14,154
 How are we doing?
 216
-00:10:14,154 --> 00:10:17,355
 We're about ten minutes, and I'm still talking about voice
 217
-00:10:17,355 --> 00:10:18,235
 technology.
 218
-00:10:18,634 --> 00:10:22,179
 So Whisper was brilliant, and I was so excited that
 219
-00:10:22,179 --> 00:10:25,860
 my first instinct was to guess, like, Oh my gosh,
 220
-00:10:25,860 --> 00:10:28,019
 I have to get a really good microphone for this.
 221
-00:10:28,179 --> 00:10:31,379
 So I didn't go on a spending spree because I
 222
-00:10:31,379 --> 00:10:33,299
 said, I'm gonna have to just wait a month and
 223
-00:10:33,299 --> 00:10:34,740
 see if I still use this.
 224
-00:10:35,220 --> 00:10:38,875
 And it just kind of became it's become really part
 225
-00:10:38,875 --> 00:10:40,955
 of my daily routine.
 226
-00:10:41,754 --> 00:10:44,315
 Like if I'm writing an email, I'll record a voice
 227
-00:10:44,315 --> 00:10:47,595
 note and then I've developed and it's nice to see
 228
-00:10:47,595 --> 00:10:50,759
 that everyone is like developing the same things in parallel.
 229
-00:10:50,759 --> 00:10:53,399
 That's kind of a weird thing to say, when I
 230
-00:10:53,399 --> 00:11:00,279
 started working on these prototypes on GitHub, which is where
 231
-00:11:00,279 --> 00:11:04,039
 I just kind of share very freely and loosely ideas
 232
-00:11:04,039 --> 00:11:06,945
 and first iterations on concepts.
 233
-00:11:09,024 --> 00:11:10,704
 And for want of a better word, I called it
 234
-00:11:10,704 --> 00:11:14,945
 like LLM post processing or clean up or basically a
 235
-00:11:14,945 --> 00:11:17,745
 system prompt that after you get back the raw text
 236
-00:11:17,745 --> 00:11:21,620
 from Whisper, you run it through a model and say,
 237
-00:11:21,620 --> 00:11:26,339
 okay, this is crappy text like add sentence structure and,
 238
-00:11:26,339 --> 00:11:27,459
 you know, fix it up.
 239
-00:11:27,860 --> 00:11:32,579
 And now when I'm exploring the different tools that are
 240
-00:11:32,579 --> 00:11:35,634
 out there that people have built, I see quite a
 241
-00:11:35,634 --> 00:11:39,475
 number of projects have basically done the same thing.
 242
-00:11:40,754 --> 00:11:43,235
 Lest that be misconstrued, I'm not saying for a millisecond
 243
-00:11:43,235 --> 00:11:44,595
 that I inspired them.
 244
-00:11:44,595 --> 00:11:48,034
 I'm sure this has been a thing that's been integrated
 245
-00:11:48,034 --> 00:11:51,290
 into tools for a while, but it's the kind of
 246
-00:11:51,290 --> 00:11:53,690
 thing that when you start using these tools every day,
 247
-00:11:53,690 --> 00:11:57,610
 the need for it is almost instantly apparent because text
 248
-00:11:57,610 --> 00:12:01,529
 that doesn't have any punctuation or paragraph spacing takes a
 249
-00:12:01,529 --> 00:12:03,965
 long time to, you know, it takes so long to
 250
-00:12:03,965 --> 00:12:09,004
 get it into a presentable email that again, moves speech
 251
-00:12:09,004 --> 00:12:13,085
 tech into that before that inflection point where you're like,
 252
-00:12:13,085 --> 00:12:13,965
 nah, it's just not worth it.
 253
-00:12:13,965 --> 00:12:16,924
 It's like, it'll just be quicker to type this.
 254
-00:12:17,279 --> 00:12:19,840
 So it's a big, it's a little touch that actually
 255
-00:12:20,080 --> 00:12:21,200
 is a big deal.
 256
-00:12:21,519 --> 00:12:25,440
 So I was on Whisper and I've been using Whisper
 257
-00:12:25,440 --> 00:12:27,759
 and I kind of early on found a couple of
 258
-00:12:27,759 --> 00:12:28,399
 tools.
 259
-00:12:28,399 --> 00:12:30,639
 I couldn't find what I was looking for on Linux,
 260
-00:12:30,639 --> 00:12:35,924
 which is basically just something that'll run-in the background.
 261
-00:12:35,924 --> 00:12:38,245
 You'll give it an API key and it will just
 262
-00:12:38,245 --> 00:12:43,044
 like transcribe with like a little key to start and
 263
-00:12:43,044 --> 00:12:43,845
 stop the dictation.
 264
-00:12:45,080 --> 00:12:48,440
 And the issues where I discovered that like most people
 265
-00:12:48,440 --> 00:12:52,040
 involved in creating these projects were very much focused on
 266
-00:12:52,040 --> 00:12:55,800
 local models, running Whisper locally because you can.
 267
-00:12:56,279 --> 00:12:58,200
 And I tried that a bunch of times and just
 268
-00:12:58,200 --> 00:13:01,054
 never got results that were as good as the cloud.
 269
-00:13:01,455 --> 00:13:03,615
 And when I began looking at the cost of the
 270
-00:13:03,615 --> 00:13:06,654
 speech to text APIs and what I was spending, I
 271
-00:13:06,654 --> 00:13:09,855
 just thought there is it's actually, in my opinion, just
 272
-00:13:09,855 --> 00:13:13,160
 one of the better deals in API spending in the
 273
-00:13:13,160 --> 00:13:13,480
 cloud.
 274
-00:13:13,480 --> 00:13:15,720
 Like, it's just not that expensive for very, very good
 275
-00:13:15,720 --> 00:13:19,639
 models that are much more, you know, you're gonna be
 276
-00:13:19,639 --> 00:13:22,759
 able to run the full model, the latest model versus
 277
-00:13:22,759 --> 00:13:26,605
 whatever you can run on your average GPU unless you
 278
-00:13:26,605 --> 00:13:28,845
 want to buy a crazy GPU.
 279
-00:13:28,845 --> 00:13:30,044
 It doesn't really make sense to me.
 280
-00:13:30,044 --> 00:13:33,164
 Privacy is another concern that I know is kind of
 281
-00:13:33,164 --> 00:13:35,325
 like a very much a separate thing that people just
 282
-00:13:35,325 --> 00:13:38,845
 don't want their voice data and their voice leaving their
 283
-00:13:38,845 --> 00:13:42,460
 local environment maybe for regulatory reasons as well.
 284
-00:13:42,700 --> 00:13:43,980
 But I'm not in that.
 285
-00:13:44,220 --> 00:13:48,540
 I neither really care about people listening to my, grocery
 286
-00:13:48,540 --> 00:13:51,580
 list, consisting of, reminding myself that I need to buy
 287
-00:13:51,580 --> 00:13:54,779
 more beer, Cheetos, and hummus, which is kind of the
 288
-00:13:55,334 --> 00:13:59,574
 three staples of my diet during periods of poor nutrition.
 289
-00:13:59,894 --> 00:14:02,375
 But the kind of stuff that I transcribe, it's just
 290
-00:14:02,375 --> 00:14:02,694
 not.
 291
-00:14:02,694 --> 00:14:07,814
 It's not a privacy thing I'm that sort of sensitive
 292
-00:14:07,814 --> 00:14:13,269
 about and I don't do anything so sensitive or secure
 293
-00:14:13,269 --> 00:14:14,790
 that requires air capping.
 294
-00:14:15,670 --> 00:14:17,590
 I looked at the pricing and especially the kind of
 295
-00:14:17,590 --> 00:14:18,950
 older model mini.
 296
-00:14:19,590 --> 00:14:21,910
 Some of them are very, very affordable and I did
 297
-00:14:21,910 --> 00:14:26,764
 a calculation once with ChatGPT and I was like, okay,
 298
-00:14:26,764 --> 00:14:30,365
 this is the API price for I can't remember whatever
 299
-00:14:30,365 --> 00:14:31,404
 the model was.
 300
-00:14:31,804 --> 00:14:34,445
 Let's say I just go at it like nonstop, which
 301
-00:14:34,445 --> 00:14:35,565
 rarely happens.
 302
-00:14:35,644 --> 00:14:38,959
 Probably, I would say on average I might dictate thirty
 303
-00:14:38,959 --> 00:14:41,759
 to sixty minutes per day if I was probably summing
 304
-00:14:41,759 --> 00:14:48,000
 up the emails, documents, outlines, which is a lot, but
 305
-00:14:48,000 --> 00:14:50,159
 it's it's still a fairly modest amount.
 306
-00:14:50,159 --> 00:14:51,839
 And I was like, well, some days I do go
 307
-00:14:51,839 --> 00:14:54,934
 on like one or two days where I've been usually
 308
-00:14:54,934 --> 00:14:56,855
 when I'm like kind of out of the house and
 309
-00:14:56,855 --> 00:15:00,535
 just have something like I have nothing else to do.
 310
-00:15:00,535 --> 00:15:03,175
 Like if I'm at a hospital, we have a newborn
 311
-00:15:03,575 --> 00:15:07,299
 and you're waiting for like eight hours and hours for
 312
-00:15:07,299 --> 00:15:08,100
 an appointment.
 313
-00:15:08,179 --> 00:15:12,019
 And I would probably have listened to podcasts before becoming
 314
-00:15:12,019 --> 00:15:12,980
 a speech fanatic.
 315
-00:15:12,980 --> 00:15:15,379
 And I'm like, Oh, wait, let me just get down.
 316
-00:15:15,379 --> 00:15:17,379
 Let me just get these ideas out of my head.
 317
-00:15:17,540 --> 00:15:20,745
 And that's when I'll go on my speech binges.
 318
-00:15:20,745 --> 00:15:22,664
 But those are like once every few months, like not
 319
-00:15:22,664 --> 00:15:23,544
 frequently.
 320
-00:15:23,784 --> 00:15:25,784
 But I said, okay, let's just say if I'm going
 321
-00:15:25,784 --> 00:15:28,184
 to price out cloud STT.
 322
-00:15:28,985 --> 00:15:33,500
 If I was like dedicated every second of every waking
 323
-00:15:33,500 --> 00:15:37,820
 hour to transcribing for some odd reason, I mean I'd
 324
-00:15:37,820 --> 00:15:39,820
 have to eat and use the toilet.
 325
-00:15:40,540 --> 00:15:42,700
 There's only so many hours I'm awake for.
 326
-00:15:42,700 --> 00:15:47,019
 So let's just say a maximum of forty five minutes
 327
-00:15:47,205 --> 00:15:49,205
 in the hour, then I said, All right, let's just
 328
-00:15:49,205 --> 00:15:50,165
 say fifty.
 329
-00:15:50,644 --> 00:15:51,365
 Who knows?
 330
-00:15:51,365 --> 00:15:52,804
 You're dictating on the toilet.
 331
-00:15:52,804 --> 00:15:53,605
 We do it.
 332
-00:15:53,924 --> 00:15:56,884
 So you could just do sixty, but whatever I did
 333
-00:15:57,125 --> 00:16:01,179
 and every day, like you're going flat out seven days
 334
-00:16:01,179 --> 00:16:02,620
 a week dictating nonstop.
 335
-00:16:02,620 --> 00:16:05,579
 I was like, What's my monthly API bill going to
 336
-00:16:05,579 --> 00:16:06,700
 be at this price?
 337
-00:16:06,779 --> 00:16:09,339
 And it came out to like seventy or eighty bucks.
 338
-00:16:09,339 --> 00:16:12,620
 And I was like, Well, that would be an extraordinary
 339
-00:16:12,940 --> 00:16:14,379
 amount of dictation.
 340
-00:16:14,379 --> 00:16:18,105
 And I would hope that there was some compelling reason
 341
-00:16:18,745 --> 00:16:21,784
 worth more than seventy dollars that I embarked upon that
 342
-00:16:21,784 --> 00:16:22,424
 project.
 343
-00:16:22,664 --> 00:16:24,585
 So given that that's kind of the max point for
 344
-00:16:24,585 --> 00:16:27,304
 me I said that's actually very very affordable.
 345
-00:16:28,024 --> 00:16:30,504
 Now you're gonna if you want to spec out the
 346
-00:16:30,504 --> 00:16:33,909
 costs and you want to do the post processing that
 347
-00:16:33,909 --> 00:16:36,789
 I really do feel is valuable, that's going to cost
 348
-00:16:36,789 --> 00:16:37,750
 some more as well.
 349
-00:16:38,070 --> 00:16:43,269
 Unless you're using Gemini, which needless to say is a
 350
-00:16:43,269 --> 00:16:45,190
 random person sitting in Jerusalem.
 351
-00:16:45,855 --> 00:16:49,455
 I have no affiliation nor with Google nor Anthropic nor
 352
-00:16:49,455 --> 00:16:52,414
 Gemini nor any major tech vendor for that matter.
 353
-00:16:53,855 --> 00:16:57,215
 I like Gemini not so much as a everyday model.
 354
-00:16:57,455 --> 00:16:59,934
 It's kind of underwhelmed in that respect, I would say.
 355
-00:17:00,379 --> 00:17:02,779
 But for multimodal, I think it's got a lot to
 356
-00:17:02,779 --> 00:17:03,339
 offer.
 357
-00:17:03,659 --> 00:17:07,179
 And I think that the transcribing functionality whereby it can,
 358
-00:17:08,059 --> 00:17:12,380
 process audio with a system prompt and both give you
 359
-00:17:12,380 --> 00:17:13,900
 transcription that's cleaned up.
 360
-00:17:13,900 --> 00:17:15,339
 That reduces two steps to one.
 361
-00:17:15,835 --> 00:17:18,954
 And that for me is a very, very big deal.
 362
-00:17:18,955 --> 00:17:22,474
 And I feel like even Google hasn't really sort of
 363
-00:17:22,555 --> 00:17:27,195
 thought through how useful the that modality is and what
 364
-00:17:27,195 --> 00:17:29,700
 kind of use cases you can achieve with it.
 365
-00:17:29,700 --> 00:17:32,339
 Because I found in the course of this year just
 366
-00:17:32,339 --> 00:17:38,019
 an endless list of really kind of system prompt stuff
 367
-00:17:38,019 --> 00:17:40,900
 that I can say, okay, I've used it to capture
 368
-00:17:40,900 --> 00:17:44,115
 context data for AI, which is literally I might speak
 369
-00:17:44,115 --> 00:17:46,755
 for if I wanted to have a good bank of
 370
-00:17:46,755 --> 00:17:50,035
 context data about who knows my childhood.
 371
-00:17:50,434 --> 00:17:54,355
 More realistically, maybe my career goals, something that would just
 372
-00:17:54,355 --> 00:17:56,195
 be like really boring to type out.
 373
-00:17:56,195 --> 00:18:00,500
 So I'll just like sit in my car and record
 374
-00:18:00,500 --> 00:18:01,460
 it for ten minutes.
 375
-00:18:01,460 --> 00:18:03,779
 And that ten minutes you get a lot of information
 376
-00:18:03,779 --> 00:18:04,419
 in.
 377
-00:18:05,619 --> 00:18:07,700
 Emails, which is short text.
 378
-00:18:08,660 --> 00:18:10,419
 Just there is a whole bunch.
 379
-00:18:10,420 --> 00:18:13,375
 And all these workflows kind of require a little bit
 380
-00:18:13,375 --> 00:18:15,134
 of treatment afterwards and different treatment.
 381
-00:18:15,134 --> 00:18:18,414
 My context pipeline is kind of like just extract the
 382
-00:18:18,414 --> 00:18:19,295
 bare essentials.
 383
-00:18:19,295 --> 00:18:22,174
 You end up with me talking very loosely about sort
 384
-00:18:22,174 --> 00:18:24,494
 of what I've done in my career, where I've worked,
 385
-00:18:24,494 --> 00:18:25,454
 where I might like to work.
 386
-00:18:26,000 --> 00:18:29,119
 And it goes, it condenses that down to very robotic
 387
-00:18:29,119 --> 00:18:32,720
 language that is easy to chunk parse and maybe put
 388
-00:18:32,720 --> 00:18:34,000
 into a vector database.
 389
-00:18:34,000 --> 00:18:36,240
 Daniel has worked in technology.
 390
-00:18:36,240 --> 00:18:39,840
 Daniel has been working in, know, stuff like that.
 391
-00:18:39,840 --> 00:18:43,055
 That's not how you would speak, but I figure it's
 392
-00:18:43,055 --> 00:18:46,494
 probably easier to parse for, after all, robots.
 393
-00:18:46,815 --> 00:18:48,734
 So we've almost got to twenty minutes and this is
 394
-00:18:48,734 --> 00:18:53,134
 actually a success because I wasted twenty minutes of my
 395
-00:18:53,535 --> 00:18:57,200
 of the evening speaking into you in microphone and the
 396
-00:18:57,200 --> 00:19:01,119
 levels were shot and was clipping and I said I
 397
-00:19:01,119 --> 00:19:02,400
 can't really do an evaluation.
 398
-00:19:02,400 --> 00:19:03,440
 I have to be fair.
 399
-00:19:03,440 --> 00:19:06,400
 I have to give the models a chance to do
 400
-00:19:06,400 --> 00:19:06,960
 their thing.
 401
-00:19:07,505 --> 00:19:09,585
 What am I hoping to achieve in this?
 402
-00:19:09,585 --> 00:19:11,664
 Okay, my fine tune was a dud as mentioned.
 403
-00:19:11,745 --> 00:19:15,265
 Deepgram STT, I'm really, really hopeful that this prototype will
 404
-00:19:15,265 --> 00:19:18,065
 work and it's a build in public open source so
 405
-00:19:18,065 --> 00:19:20,384
 anyone is welcome to use it if I make anything
 406
-00:19:20,384 --> 00:19:20,705
 good.
 407
-00:19:21,640 --> 00:19:23,880
 But that was really exciting for me last night when
 408
-00:19:23,880 --> 00:19:28,920
 after hours of trying my own prototype, seeing someone just
 409
-00:19:28,920 --> 00:19:32,119
 made something that works like that, you you're not gonna
 410
-00:19:32,119 --> 00:19:36,454
 have to build a custom conda environment and image.
 411
-00:19:36,454 --> 00:19:40,054
 I have AMD GPU which makes things much more complicated.
 412
-00:19:40,294 --> 00:19:42,694
 I didn't find it and I was about to give
 413
-00:19:42,694 --> 00:19:43,974
 up and I said, All right, let me just give
 414
-00:19:43,974 --> 00:19:46,535
 Deepgram's Linux thing a shot.
 415
-00:19:47,109 --> 00:19:49,669
 And if this doesn't work, I'm just gonna go back
 416
-00:19:49,669 --> 00:19:51,429
 to trying to vibe code something myself.
 417
-00:19:51,750 --> 00:19:55,589
 And when I ran the script, I was using Cloud
 418
-00:19:55,589 --> 00:19:59,109
 Code to do the installation process, it ran the script
 419
-00:19:59,109 --> 00:20:01,269
 and, oh my gosh, it works just like that.
 420
-00:20:01,904 --> 00:20:06,065
 The tricky thing for all those who wants to know
 421
-00:20:06,065 --> 00:20:11,505
 all the nitty, ditty, nitty gritty details was that I
 422
-00:20:11,505 --> 00:20:14,704
 don't think it was actually struggling with transcription, but pasting
 423
-00:20:14,785 --> 00:20:17,619
 Weyland makes life very hard.
 424
-00:20:17,619 --> 00:20:19,220
 And I think there was something not running at the
 425
-00:20:19,220 --> 00:20:19,779
 right time.
 426
-00:20:19,779 --> 00:20:23,059
 Anyway, Deepgram, I looked at how they actually handle that
 427
-00:20:23,059 --> 00:20:25,220
 because it worked out of the box when other stuff
 428
-00:20:25,220 --> 00:20:25,859
 didn't.
 429
-00:20:26,180 --> 00:20:28,980
 And it was quite a clever little mechanism.
 430
-00:20:29,575 --> 00:20:32,215
 And but more so than that, the accuracy was brilliant.
 431
-00:20:32,215 --> 00:20:33,654
 Now what am I what am I doing here?
 432
-00:20:33,654 --> 00:20:37,255
 This is gonna be a twenty minute audio sample.
 433
-00:20:38,455 --> 00:20:42,490
 And I'm I think I've done one or two of
 434
-00:20:42,490 --> 00:20:47,210
 these before, but I did it with short, snappy voice
 435
-00:20:47,210 --> 00:20:47,690
 notes.
 436
-00:20:47,690 --> 00:20:49,450
 This is kind of long form.
 437
-00:20:49,529 --> 00:20:52,009
 This actually might be a better approximation for what's useful
 438
-00:20:52,009 --> 00:20:53,929
 to me than voice memos.
 439
-00:20:53,929 --> 00:20:56,974
 Like, I need to buy three liters of milk tomorrow
 440
-00:20:56,974 --> 00:21:00,255
 and peter bread, which is probably how half my voice
 441
-00:21:00,255 --> 00:21:00,815
 notes sound.
 442
-00:21:00,815 --> 00:21:04,174
 Like if anyone were to find my phone they'd be
 443
-00:21:04,174 --> 00:21:06,014
 like this is the most boring person in the world.
 444
-00:21:06,095 --> 00:21:10,130
 Although actually there are some journaling thoughts as well, but
 445
-00:21:10,130 --> 00:21:11,890
 it's a lot of content like that.
 446
-00:21:11,890 --> 00:21:14,690
 And the probably for the evaluation, the most useful thing
 447
-00:21:14,690 --> 00:21:21,914
 is slightly obscure tech, GitHub, Nucleano, hugging face, not so
 448
-00:21:21,914 --> 00:21:24,554
 obscure that it's not gonna have a chance of knowing
 449
-00:21:24,554 --> 00:21:27,274
 it, but hopefully sufficiently well known that the model should
 450
-00:21:27,274 --> 00:21:27,914
 get it.
 451
-00:21:27,994 --> 00:21:30,075
 I tried to do a little bit of speaking really
 452
-00:21:30,075 --> 00:21:32,474
 fast and speaking very slowly.
 453
-00:21:32,474 --> 00:21:35,609
 Would say in general, I've spoken, delivered this at a
 454
-00:21:35,609 --> 00:21:39,210
 faster pace than I usually would owing to strong coffee
 455
-00:21:39,210 --> 00:21:40,650
 flowing through my bloodstream.
 456
-00:21:41,210 --> 00:21:43,609
 And the thing that I'm not gonna get in this
 457
-00:21:43,609 --> 00:21:46,170
 benchmark is background noise, which in my first take that
 458
-00:21:46,170 --> 00:21:48,535
 I had to get rid of, my wife came in
 459
-00:21:48,535 --> 00:21:51,575
 with my son and for a good night kiss.
 460
-00:21:51,654 --> 00:21:55,174
 And that actually would have been super helpful to get
 461
-00:21:55,174 --> 00:21:57,894
 in because it was non diarized or if we had
 462
-00:21:57,894 --> 00:21:58,775
 diarization.
 463
-00:21:59,414 --> 00:22:01,494
 A female, I could say, I want the male voice
 464
-00:22:01,494 --> 00:22:03,174
 and that wasn't intended for transcription.
 465
-00:22:04,589 --> 00:22:06,349
 And we're not going to get background noise like people
 466
-00:22:06,349 --> 00:22:09,069
 honking their horns, which is something I've done in my
 467
-00:22:09,230 --> 00:22:11,950
 main data set where I am trying to go back
 468
-00:22:11,950 --> 00:22:15,150
 to some of my voice notes, annotate them and run
 469
-00:22:15,150 --> 00:22:15,789
 a benchmark.
 470
-00:22:15,789 --> 00:22:18,345
 But this is going to be just a pure quick
 471
-00:22:18,345 --> 00:22:19,144
 test.
 472
-00:22:19,865 --> 00:22:24,105
 And as someone I'm working on a voice note idea.
 473
-00:22:24,105 --> 00:22:28,265
 That's my sort of end motivation besides thinking it's an
 474
-00:22:28,265 --> 00:22:31,865
 absolutely outstanding technology that's coming to viability.
 475
-00:22:31,865 --> 00:22:34,480
 And really, I know this sounds cheesy, can actually have
 476
-00:22:34,480 --> 00:22:36,559
 a very transformative effect.
 477
-00:22:38,000 --> 00:22:43,200
 Voice technology has been life changing for folks living with
 478
-00:22:44,079 --> 00:22:45,119
 disabilities.
 479
-00:22:46,000 --> 00:22:48,625
 And I think there's something really nice about the fact
 480
-00:22:48,625 --> 00:22:52,625
 that it can also benefit folks who are able-bodied and
 481
-00:22:52,625 --> 00:22:57,984
 we can all in different ways make this tech as
 482
-00:22:57,984 --> 00:23:00,785
 useful as possible regardless of the exact way that we're
 483
-00:23:00,785 --> 00:23:01,105
 using it.
 484
-00:23:02,279 --> 00:23:04,519
 And I think there's something very powerful in that, and
 485
-00:23:04,519 --> 00:23:05,639
 it can be very cool.
 486
-00:23:06,200 --> 00:23:07,639
 I see huge potential.
 487
-00:23:07,639 --> 00:23:09,399
 What excites me about voice tech?
 488
-00:23:09,799 --> 00:23:11,239
 A lot of things actually.
 489
-00:23:12,200 --> 00:23:14,919
 Firstly, the fact that it's cheap and accurate, as I
 490
-00:23:14,919 --> 00:23:17,865
 mentioned at the very start of this, and it's getting
 491
-00:23:17,865 --> 00:23:20,184
 better and better with stuff like accent handling.
 492
-00:23:20,825 --> 00:23:23,384
 I'm not sure my fine tune will actually ever come
 493
-00:23:23,384 --> 00:23:25,305
 to fruition in the sense that I'll use it day
 494
-00:23:25,305 --> 00:23:26,664
 to day as I imagine.
 495
-00:23:26,744 --> 00:23:30,585
 I get like superb, flawless words error rates because I'm
 496
-00:23:30,585 --> 00:23:35,029
 just kind of skeptical about local speech to text, as
 497
-00:23:35,029 --> 00:23:35,750
 I mentioned.
 498
-00:23:36,150 --> 00:23:39,910
 And I think the pace of innovation and improvement in
 499
-00:23:39,910 --> 00:23:42,390
 the models, the main reasons for fine tuning from what
 500
-00:23:42,390 --> 00:23:46,230
 I've seen have been people who are something that really
 501
-00:23:46,230 --> 00:23:50,455
 blows blows my mind about ASR is the idea that
 502
-00:23:50,455 --> 00:23:55,654
 it's inherently ailingual or multilingual, phonetic based.
 503
-00:23:56,375 --> 00:24:00,455
 So as folks who use speak very obscure languages that
 504
-00:24:00,455 --> 00:24:03,174
 there may be very there might be a paucity of
 505
-00:24:02,309 --> 00:24:05,110
 training data or almost none at all, and therefore the
 506
-00:24:05,110 --> 00:24:06,870
 accuracy is significantly reduced.
 507
-00:24:06,870 --> 00:24:11,430
 Or folks in very critical environments, I know there are
 508
-00:24:11,590 --> 00:24:15,430
 this is used extensively in medical transcription and dispatcher work
 509
-00:24:15,430 --> 00:24:19,144
 as, you know the call centers who send out ambulances
 510
-00:24:19,144 --> 00:24:19,944
 etc.
 511
-00:24:20,345 --> 00:24:23,625
 Where accuracy is absolutely paramount and in the case of
 512
-00:24:23,625 --> 00:24:27,625
 doctors radiologists they might be using very specialized vocab all
 513
-00:24:27,625 --> 00:24:27,945
 the time.
 514
-00:24:28,710 --> 00:24:30,309
 So those are kind of the main two things, and
 515
-00:24:30,309 --> 00:24:32,230
 I'm not sure that really just for trying to make
 516
-00:24:32,230 --> 00:24:36,470
 it better on a few random tech words with my
 517
-00:24:36,470 --> 00:24:39,509
 slightly I mean, I have an accent, but, like, not,
 518
-00:24:39,509 --> 00:24:42,549
 you know, an accent that a few other million people
 519
-00:24:42,950 --> 00:24:43,990
 have ish.
 520
-00:24:44,765 --> 00:24:48,045
 I'm not sure that my little fine tune is gonna
 521
-00:24:48,045 --> 00:24:52,684
 actually like, the bump in word error reduction, if I
 522
-00:24:52,684 --> 00:24:54,285
 ever actually figure out how to do it and get
 523
-00:24:54,285 --> 00:24:56,445
 it up to the cloud, by the time we've done
 524
-00:24:56,445 --> 00:25:00,039
 that, I suspect that the next generation of ASR will
 525
-00:25:00,039 --> 00:25:01,799
 just be so good that it will kind of be,
 526
-00:25:02,039 --> 00:25:04,039
 well, that would have been cool if it worked out,
 527
-00:25:04,039 --> 00:25:05,559
 but I'll just use this instead.
 528
-00:25:05,799 --> 00:25:10,759
 So that's gonna be it for today's episode of voice
 529
-00:25:10,759 --> 00:25:11,720
 training data.
 530
-00:25:11,960 --> 00:25:14,335
 Single, long shot evaluation.
 531
-00:25:14,575 --> 00:25:15,774
 Who am I gonna compare?
 532
-00:25:16,494 --> 00:25:18,654
 Whisper is always good as a benchmark, but I'm more
 533
-00:25:18,654 --> 00:25:22,255
 interested in seeing Whisper head to head with two things
 534
-00:25:22,255 --> 00:25:22,974
 really.
 535
-00:25:23,375 --> 00:25:25,214
 One is Whisper variants.
 536
-00:25:25,214 --> 00:25:27,775
 So you've got these projects like Faster Whisper.
 537
-00:25:29,190 --> 00:25:30,069
 Distill Whisper.
 538
-00:25:30,069 --> 00:25:30,789
 It's a bit confusing.
 539
-00:25:30,789 --> 00:25:31,989
 There's a whole bunch of them.
 540
-00:25:32,230 --> 00:25:35,190
 And the emerging ASRs, which are also a thing.
 541
-00:25:35,349 --> 00:25:37,190
 My intention for this is I'm not sure I'm gonna
 542
-00:25:37,190 --> 00:25:39,990
 have the time in any point in the foreseeable future
 543
-00:25:39,990 --> 00:25:44,855
 to go back to this whole episode and create a
 544
-00:25:44,855 --> 00:25:48,374
 proper source truth where I fix everything.
 545
-00:25:49,335 --> 00:25:51,974
 Might do it if I can get one transcription that's
 546
-00:25:51,974 --> 00:25:54,214
 sufficiently close to perfection.
 547
-00:25:55,014 --> 00:25:58,480
 But what I would actually love to do on Hugging
 548
-00:25:58,480 --> 00:26:00,559
 Face, I think would be a great probably how I
 549
-00:26:00,559 --> 00:26:04,480
 might visualize this is having the audio waveform play and
 550
-00:26:04,480 --> 00:26:08,960
 then have the transcript for each model below it and
 551
-00:26:08,960 --> 00:26:13,845
 maybe even a, like, you know, to scale and maybe
 552
-00:26:13,845 --> 00:26:16,724
 even a local one as well, like local whisper versus
 553
-00:26:16,724 --> 00:26:19,764
 OpenAI API, etcetera.
 554
-00:26:19,845 --> 00:26:23,204
 And I can then actually listen back to segments or
 555
-00:26:23,204 --> 00:26:25,365
 anyone who wants to can listen back to segments of
 556
-00:26:25,365 --> 00:26:30,299
 this recording and see where a particular model struggled and
 557
-00:26:30,299 --> 00:26:33,179
 others didn't as well as the sort of headline finding
 558
-00:26:33,179 --> 00:26:35,659
 of which had the best W E R but that
 559
-00:26:35,659 --> 00:26:37,739
 would require the source of truth.
 560
-00:26:37,740 --> 00:26:38,539
 Okay, that's it.
 561
-00:26:38,505 --> 00:26:41,065
 I hope this was, I don't know, maybe useful for
 562
-00:26:41,065 --> 00:26:42,984
 other folks interested in STT.
 563
-00:26:43,065 --> 00:26:46,025
 You want to see I always think I've just said
 564
-00:26:46,025 --> 00:26:47,704
 it as something I didn't intend to.
 565
-00:26:47,944 --> 00:26:49,704
 STT, I said for those.
 566
-00:26:49,704 --> 00:26:53,129
 Listen carefully, including hopefully the models themselves.
 567
-00:26:53,369 --> 00:26:55,129
 This has been myself, Daniel Rosol.
 568
-00:26:55,129 --> 00:26:59,450
 For more jumbled repositories about my roving interest in AI
 569
-00:26:59,450 --> 00:27:04,089
 but particularly AgenTic, MCP and VoiceTech you can find me
 570
-00:27:04,089 --> 00:27:05,769
 on GitHub.
 571
-00:27:06,009 --> 00:27:06,730
 Hugging Face.
 572
-00:27:08,125 --> 00:27:09,004
 Where else?
 573
-00:27:09,005 --> 00:27:11,805
 DanielRosel dot com, which is my personal website, as well
 574
-00:27:11,805 --> 00:27:15,565
 as this podcast whose name I sadly cannot remember.
 575
-00:27:15,724 --> 00:27:16,765
 Until next time.
 576
-00:27:16,765 --> 00:27:17,404
 Thanks for listening.

 1
+00:00:00,000 --> 00:00:06,160
 Hello and welcome to a audio dataset consisting of one
 2
+00:00:06,160 --> 00:00:08,320
 single episode of a nonexistent podcast.
 3
+00:00:08,720 --> 00:00:12,800
 Or it I may append this to a podcast that
 4
+00:00:12,800 --> 00:00:18,734
 I set up recently regarding my with my thoughts on
 5
+00:00:18,735 --> 00:00:20,735
 speech tech and A.
 6
+00:00:20,735 --> 00:00:21,134
 I.
 7
+00:00:21,134 --> 00:00:22,734
 In particular, more A.
 8
+00:00:22,734 --> 00:00:22,974
 I.
 9
+00:00:22,974 --> 00:00:23,855
 And generative A.
 10
+00:00:23,855 --> 00:00:24,015
 I.
 11
+00:00:24,015 --> 00:00:26,414
 I would I would say.
 12
+00:00:26,734 --> 00:00:30,789
 But in any event, the purpose of this voice recording
 13
+00:00:30,789 --> 00:00:35,510
 is actually to create a lengthy voice sample for a
 14
+00:00:35,510 --> 00:00:38,870
 quick evaluation, a back of the envelope evaluation, they might
 15
+00:00:38,870 --> 00:00:41,349
 say, for different speech attacks models.
 16
+00:00:41,349 --> 00:00:43,865
 I'm doing this because I thought I'd made a great
 17
+00:00:43,865 --> 00:00:47,704
 breakthrough in my journey with speech tech and that was
 18
+00:00:47,704 --> 00:00:51,305
 succeeding in the elusive task of fine tuning whisper.
 19
+00:00:51,624 --> 00:00:56,344
 Whisper is, and I'm to just talk, I'm trying to
 20
+00:00:55,749 --> 00:00:56,709
 mix up.
 21
+00:00:56,789 --> 00:01:00,310
 I'm going to try a few different styles of speaking
 22
+00:01:00,310 --> 00:01:02,789
 whisper something at some points as well.
 23
+00:01:03,270 --> 00:01:06,710
 And I'll go back to speaking loud in in different
 24
+00:01:06,710 --> 00:01:08,950
 parts are going to sound really like a crazy person
 25
+00:01:08,950 --> 00:01:12,344
 because I'm also going to try to speak at different
 26
+00:01:12,904 --> 00:01:17,945
 pitches and cadences in order to really try to push
 27
+00:01:18,264 --> 00:01:21,065
 a speech to text model through its paces, which is
 28
+00:01:21,065 --> 00:01:24,529
 trying to make sense of is this guy just rambling
 29
+00:01:24,529 --> 00:01:29,969
 on incoherently in one long sentence or are these just
 30
+00:01:29,969 --> 00:01:36,370
 actually a series of step standalone, standalone, standalone sentences?
 31
+00:01:36,370 --> 00:01:38,050
 And how is it going to handle step alone?
 32
+00:01:38,050 --> 00:01:38,690
 That's not a word.
 33
+00:01:39,624 --> 00:01:41,945
 What happens when you use speech to text and you
 34
+00:01:41,945 --> 00:01:43,304
 use a fake word?
 35
+00:01:43,304 --> 00:01:45,704
 And then you're like, wait, that's not actually that word
 36
+00:01:45,704 --> 00:01:46,585
 doesn't exist.
 37
+00:01:46,904 --> 00:01:48,504
 How does AI handle that?
 38
+00:01:48,504 --> 00:01:53,670
 And these and more are all the questions that I'm
 39
+00:01:53,670 --> 00:01:55,670
 seeking to answer in this training data.
 40
+00:01:55,749 --> 00:01:58,469
 Now, why was I trying to fine tune Whisper?
 41
+00:01:58,469 --> 00:01:59,670
 And what is Whisper?
 42
+00:01:59,670 --> 00:02:02,630
 As I said, I'm going to try to record this
 43
+00:02:02,630 --> 00:02:06,564
 at a couple of different levels of technicality for folks
 44
+00:02:06,564 --> 00:02:11,684
 who are in the normal world and not totally stuck
 45
+00:02:11,684 --> 00:02:13,684
 down the rabbit hole of AI, which you have to
 46
+00:02:13,684 --> 00:02:17,605
 say is a really wonderful rabbit hole to be done.
 47
+00:02:17,764 --> 00:02:20,839
 It's a really interesting area and speech and voice tech
 48
+00:02:20,839 --> 00:02:24,279
 is is the aspect of it that I find actually
 49
+00:02:24,279 --> 00:02:27,159
 most I'm not sure I would say the most interesting
 50
+00:02:27,159 --> 00:02:30,679
 because there's just so much that is fascinating in AI.
 51
+00:02:31,320 --> 00:02:34,054
 But the most that I find the most personally transformative
 52
+00:02:34,054 --> 00:02:38,454
 in terms of the impact that it's had on my
 53
+00:02:38,454 --> 00:02:41,174
 daily work life and productivity and how I sort of
 54
+00:02:41,174 --> 00:02:41,815
 work.
 55
+00:02:42,855 --> 00:02:47,420
 I'm persevering hard with the task of trying to get
 56
+00:02:47,420 --> 00:02:50,859
 a good solution working for Linux, which if anyone actually
 57
+00:02:50,859 --> 00:02:52,859
 does listen to this, not just for the training data
 58
+00:02:52,859 --> 00:02:56,620
 and for the actual content, is sparked.
 59
+00:02:56,620 --> 00:02:59,900
 I had, besides the fine tune not working, well that
 60
+00:02:59,900 --> 00:03:01,305
 was the failure.
 61
+00:03:02,424 --> 00:03:06,665
 I used Claude code because one thinks these days that
 62
+00:03:06,665 --> 00:03:13,200
 there is nothing short of solving, you know, the the
 63
+00:03:13,200 --> 00:03:17,519
 reason of life or something that clause and agentic AI
 64
+00:03:17,519 --> 00:03:19,600
 can't do, which is not really the case.
 65
+00:03:19,600 --> 00:03:23,119
 It does seem that way sometimes, but it fails a
 66
+00:03:23,119 --> 00:03:23,679
 lot as well.
 67
+00:03:23,679 --> 00:03:26,559
 And this is one of those instances where last week
 68
+00:03:26,559 --> 00:03:30,744
 I put together an hour of voice training data, basically
 69
+00:03:30,744 --> 00:03:33,385
 speaking just random things for three minutes.
 70
+00:03:35,385 --> 00:03:38,024
 It was actually kind of tedious because the texts were
 71
+00:03:38,024 --> 00:03:38,584
 really weird.
 72
+00:03:38,584 --> 00:03:41,290
 Some of them were, it was like it was AI
 73
+00:03:41,290 --> 00:03:42,170
 generated.
 74
+00:03:42,489 --> 00:03:44,809
 I tried before to read Sherlock Holmes for an hour
 75
+00:03:44,809 --> 00:03:47,609
 and I just couldn't, I was so bored after ten
 76
+00:03:47,609 --> 00:03:50,489
 minutes that I was like, okay, no, I'm just gonna
 77
+00:03:50,489 --> 00:03:51,850
 have to find something else to read.
 78
+00:03:51,850 --> 00:03:58,204
 So I used a created with AI Studio, VibeCoded, a
 79
+00:03:58,204 --> 00:04:03,084
 synthetic text generator which actually I thought was probably a
 80
+00:04:03,084 --> 00:04:05,165
 better way of doing it because it would give me
 81
+00:04:05,165 --> 00:04:08,989
 more short samples with more varied content.
 82
+00:04:08,989 --> 00:04:11,630
 So I was like, okay, give me a voice note
 83
+00:04:11,630 --> 00:04:14,829
 like I'm recording an email, give me a short story
 84
+00:04:14,829 --> 00:04:18,109
 to read, give me prose to read.
 85
+00:04:18,109 --> 00:04:20,554
 So I came up with all these different things and
 86
+00:04:20,554 --> 00:04:22,634
 they added a little timer to it so I could
 87
+00:04:22,634 --> 00:04:24,875
 see how close I was to one hour.
 88
+00:04:25,835 --> 00:04:29,035
 And I spent like an hour one afternoon or probably
 89
+00:04:29,035 --> 00:04:33,035
 two hours by the time you do retakes and whatever
 90
+00:04:33,035 --> 00:04:36,089
 because you want to it gave me a source of
 91
+00:04:36,089 --> 00:04:39,929
 truth which I'm not sure if that's the scientific way
 92
+00:04:39,929 --> 00:04:44,089
 to approach this topic of gathering training data but I
 93
+00:04:44,089 --> 00:04:45,369
 thought made sense.
 94
+00:04:46,410 --> 00:04:49,384
 I have a lot of audio data from recording voice
 95
+00:04:49,384 --> 00:04:53,464
 notes which I've also kind of used, been experimenting with
 96
+00:04:53,464 --> 00:04:54,984
 using for a different purpose.
 97
+00:04:55,304 --> 00:04:58,665
 Slightly different annotating task types.
 98
+00:04:58,665 --> 00:05:03,170
 It's more a text classification experiment or Well, it's more
 99
+00:05:03,170 --> 00:05:03,730
 than that actually.
 100
+00:05:03,730 --> 00:05:04,929
 I'm working on a voice app.
 101
+00:05:04,929 --> 00:05:09,249
 So it's a prototype, I guess, is really more accurate.
 102
+00:05:11,329 --> 00:05:13,889
 But you can do that and you can work backwards.
 103
+00:05:13,889 --> 00:05:18,274
 Listen back to a voice note and you painfully go
 104
+00:05:18,274 --> 00:05:21,394
 through one of those transcribing, where you start and stop
 105
+00:05:21,394 --> 00:05:23,554
 and scrub around it and you fix the errors, but
 106
+00:05:23,554 --> 00:05:25,795
 it's really, really pouring to do that.
 107
+00:05:26,035 --> 00:05:27,954
 So I thought it would be less tedious in the
 108
+00:05:27,954 --> 00:05:31,634
 long term if I just recorded the source of truth.
 109
+00:05:31,989 --> 00:05:34,309
 So it gave me these three minutes snippets.
 110
+00:05:34,309 --> 00:05:37,429
 I recorded them and saved an MP3 and a TXT
 111
+00:05:37,670 --> 00:05:40,230
 in the same folder and I created an error that
 112
+00:05:40,230 --> 00:05:40,869
 data.
 113
+00:05:41,910 --> 00:05:44,790
 So I was very hopeful, quietly, a little bit hopeful
 114
+00:05:44,790 --> 00:05:46,949
 that I would be able, that I could actually fine
 115
+00:05:46,949 --> 00:05:47,670
 tune Whisper.
 116
+00:05:48,285 --> 00:05:51,005
 I want to fine tune Whisper because when I got
 117
+00:05:51,005 --> 00:05:54,924
 into voice tech last November, my wife was in the
 118
+00:05:54,924 --> 00:05:57,165
 US and I was alone at home.
 119
+00:05:57,244 --> 00:06:00,924
 And when crazy people like me do really wild things
 120
+00:06:00,924 --> 00:06:03,900
 like use voice to tech technology.
 121
+00:06:03,900 --> 00:06:06,859
 That was basically when I started doing it, I didn't
 122
+00:06:06,859 --> 00:06:09,500
 feel like a crazy person speaking to myself.
 123
+00:06:09,900 --> 00:06:12,700
 And my expectations weren't that high.
 124
+00:06:13,100 --> 00:06:17,605
 I'd used speech tech now and again, tried it out.
 125
+00:06:17,605 --> 00:06:18,804
 I was like, it'd be really cool if you could
 126
+00:06:18,804 --> 00:06:22,324
 just like speak into your computer and whatever I tried
 127
+00:06:22,324 --> 00:06:25,845
 out that had Linux support was just, it was not
 128
+00:06:25,845 --> 00:06:26,725
 good basically.
 129
+00:06:27,285 --> 00:06:29,444
 And this blew me away from the first go.
 130
+00:06:29,444 --> 00:06:32,259
 I mean, it wasn't one hundred percent accurate out of
 131
+00:06:32,259 --> 00:06:34,420
 the box and it took work, but it was good
 132
+00:06:34,420 --> 00:06:36,739
 enough that there was a solid foundation and it kind
 133
+00:06:36,739 --> 00:06:41,059
 of passed that pivot point that it's actually worth doing
 134
+00:06:41,059 --> 00:06:41,540
 this.
 135
+00:06:41,859 --> 00:06:43,859
 You know, there's a point where it's so like, the
 136
+00:06:43,859 --> 00:06:46,405
 transcript is you don't have to get one hundred percent
 137
+00:06:46,405 --> 00:06:49,445
 accuracy for it to be worth your time for speech
 138
+00:06:49,445 --> 00:06:51,845
 to text to be a worthwhile addition to your productivity.
 139
+00:06:51,845 --> 00:06:53,605
 But you do need to get above, let's say, I
 140
+00:06:53,605 --> 00:06:55,045
 don't know, eighty five percent.
 141
+00:06:55,525 --> 00:06:58,725
 If it's sixty percent or fifty percent, you inevitably say,
 142
+00:06:58,960 --> 00:07:00,239
 Screw it, I'll just type it.
 143
+00:07:00,239 --> 00:07:03,600
 Because you end up missing errors in the transcript and
 144
+00:07:03,600 --> 00:07:04,960
 it becomes actually worse.
 145
+00:07:04,960 --> 00:07:06,640
 You end up in a worse position than you started
 146
+00:07:06,640 --> 00:07:06,960
 with it.
 147
+00:07:06,960 --> 00:07:08,160
 That's been my experience.
 148
+00:07:08,480 --> 00:07:12,400
 So I was like, Oh, this is actually really, really
 149
+00:07:12,400 --> 00:07:12,880
 good now.
 150
+00:07:12,880 --> 00:07:13,600
 How did that happen?
 151
+00:07:13,600 --> 00:07:17,915
 And the answer is ASR, Whisper being open sourced and
 152
+00:07:18,634 --> 00:07:21,514
 the transformer architecture, if you want to go back to
 153
+00:07:21,514 --> 00:07:26,314
 the underpinnings, which really blows my mind and it's on
 154
+00:07:26,314 --> 00:07:29,750
 my list to read through that paper.
 155
+00:07:30,309 --> 00:07:35,910
 All you need is attention as attentively as can be
 156
+00:07:35,910 --> 00:07:39,270
 done with my limited brain because it's super super high
 157
+00:07:39,270 --> 00:07:42,965
 level stuff, super advanced stuff, mean.
 158
+00:07:43,205 --> 00:07:48,004
 That I think of all the things that are fascinating
 159
+00:07:48,004 --> 00:07:52,484
 about the sudden rise in AI and the dramatic capabilities,
 160
+00:07:53,259 --> 00:07:55,339
 I find it fascinating that few people are like, hang
 161
+00:07:55,339 --> 00:07:58,220
 on, you've got this thing that can speak to you
 162
+00:07:58,220 --> 00:07:59,980
 like a chatbot, an LLM.
 163
+00:08:00,540 --> 00:08:02,780
 And then you've got image generation.
 164
+00:08:02,780 --> 00:08:03,100
 Okay.
 165
+00:08:03,100 --> 00:08:07,020
 So firstly, two things on the surface have nothing in
 166
+00:08:07,020 --> 00:08:07,339
 common.
 167
+00:08:08,285 --> 00:08:11,964
 So how did that just happen all at the same
 168
+00:08:11,964 --> 00:08:12,205
 time?
 169
+00:08:12,205 --> 00:08:15,884
 And then when you extend that further, you're like, Suno.
 170
+00:08:15,884 --> 00:08:19,405
 You can sing a song and AI will come up
 171
+00:08:19,405 --> 00:08:21,085
 with an instrumental.
 172
+00:08:21,405 --> 00:08:23,405
 And then you've got Whisper and you're like, Wait a
 173
+00:08:23,405 --> 00:08:23,645
 second.
 174
+00:08:24,020 --> 00:08:28,100
 How did all this stuff If it's all AI, there
 175
+00:08:28,100 --> 00:08:29,460
 has to be some commonality.
 176
+00:08:29,460 --> 00:08:35,059
 Otherwise, are totally different technologies on the surface of it.
 177
+00:08:35,140 --> 00:08:39,304
 And the transformer architecture is, as far as I know,
 178
+00:08:39,304 --> 00:08:40,184
 the answer.
 179
+00:08:40,184 --> 00:08:42,905
 And I can't even say, can't even pretend that I
 180
+00:08:42,905 --> 00:08:47,304
 really understand what the transformer architecture means in-depth.
 181
+00:08:47,304 --> 00:08:49,785
 But I have scanned this and as I said, I
 182
+00:08:49,785 --> 00:08:52,799
 want to print it and really kind of think over
 183
+00:08:52,799 --> 00:08:54,080
 it at some point.
 184
+00:08:54,799 --> 00:08:58,000
 And I'll probably feel bad about myself, I think, because
 185
+00:08:58,000 --> 00:08:59,599
 weren't those guys in twenties?
 186
+00:09:00,240 --> 00:09:01,760
 Like, that's crazy.
 187
+00:09:02,080 --> 00:09:06,080
 I think I asked ChatGPT once who wrote that paper
 188
+00:09:06,465 --> 00:09:09,184
 and how old were they when it was published in
 189
+00:09:09,184 --> 00:09:09,745
 ArcSiv?
 190
+00:09:09,745 --> 00:09:13,025
 And I was expecting like, I don't know, what do
 191
+00:09:13,025 --> 00:09:13,505
 you imagine?
 192
+00:09:13,505 --> 00:09:15,585
 I personally imagine kind of like, you you have these
 193
+00:09:15,585 --> 00:09:19,665
 breakthroughs during COVID and things like that, where like these
 194
+00:09:19,665 --> 00:09:22,549
 kind of really obscure scientists who are in their 50s
 195
+00:09:22,549 --> 00:09:26,790
 and they've just kind of been laboring in labs and
 196
+00:09:26,790 --> 00:09:29,750
 wearily in writing and publishing in kind of obscure academic
 197
+00:09:29,750 --> 00:09:30,630
 publications.
 198
+00:09:30,790 --> 00:09:33,589
 And they finally hit a big or win a Nobel
 199
+00:09:33,589 --> 00:09:36,155
 Prize and then their household names.
 200
+00:09:36,554 --> 00:09:38,554
 So that was kind of what I had in mind.
 201
+00:09:38,554 --> 00:09:42,074
 That was the mental image I'd formed of the birth
 202
+00:09:42,074 --> 00:09:42,875
 of ArcSim.
 203
+00:09:42,875 --> 00:09:45,515
 Like I wasn't expecting twenty somethings in San Francisco.
 204
+00:09:45,515 --> 00:09:48,714
 I thought that was both very funny, very cool, and
 205
+00:09:48,714 --> 00:09:49,995
 actually kind of inspiring.
 206
+00:09:50,474 --> 00:09:55,150
 It's nice to think that people who just you might
 207
+00:09:55,150 --> 00:09:58,429
 put them in the kind of milieu or bubble or
 208
+00:09:58,429 --> 00:10:02,589
 world that you are in incredibly in through a series
 209
+00:10:02,589 --> 00:10:05,755
 of connections that are coming up with such literally world
 210
+00:10:05,755 --> 00:10:07,755
 changing innovations.
 211
+00:10:07,834 --> 00:10:11,194
 So that was I thought anyway, that's that that was
 212
+00:10:11,194 --> 00:10:11,755
 cool.
 213
+00:10:12,155 --> 00:10:12,474
 Okay.
 214
+00:10:12,474 --> 00:10:13,354
 Voice training data.
 215
+00:10:13,354 --> 00:10:14,074
 How are we doing?
 216
+00:10:14,074 --> 00:10:17,275
 We're about ten minutes, and I'm still talking about voice
 217
+00:10:17,275 --> 00:10:18,155
 technology.
 218
+00:10:18,554 --> 00:10:22,099
 So Whisper was brilliant, and I was so excited that
 219
+00:10:22,099 --> 00:10:25,780
 my first instinct was to guess, like, Oh my gosh,
 220
+00:10:25,780 --> 00:10:27,939
 I have to get a really good microphone for this.
 221
+00:10:28,099 --> 00:10:31,299
 So I didn't go on a spending spree because I
 222
+00:10:31,299 --> 00:10:33,219
 said, I'm gonna have to just wait a month and
 223
+00:10:33,219 --> 00:10:34,660
 see if I still use this.
 224
+00:10:35,140 --> 00:10:38,795
 And it just kind of became it's become really part
 225
+00:10:38,795 --> 00:10:40,875
 of my daily routine.
 226
+00:10:41,674 --> 00:10:44,235
 Like if I'm writing an email, I'll record a voice
 227
+00:10:44,235 --> 00:10:47,515
 note and then I've developed and it's nice to see
 228
+00:10:47,515 --> 00:10:50,679
 that everyone is like developing the same things in parallel.
 229
+00:10:50,679 --> 00:10:53,319
 That's kind of a weird thing to say, when I
 230
+00:10:53,319 --> 00:11:00,199
 started working on these prototypes on GitHub, which is where
 231
+00:11:00,199 --> 00:11:03,959
 I just kind of share very freely and loosely ideas
 232
+00:11:03,959 --> 00:11:06,865
 and first iterations on concepts.
 233
+00:11:08,944 --> 00:11:10,624
 And for want of a better word, I called it
 234
+00:11:10,624 --> 00:11:14,865
 like LLM post processing or clean up or basically a
 235
+00:11:14,865 --> 00:11:17,665
 system prompt that after you get back the raw text
 236
+00:11:17,665 --> 00:11:21,540
 from Whisper, you run it through a model and say,
 237
+00:11:21,540 --> 00:11:26,259
 okay, this is crappy text like add sentence structure and,
 238
+00:11:26,259 --> 00:11:27,379
 you know, fix it up.
 239
+00:11:27,780 --> 00:11:32,499
 And now when I'm exploring the different tools that are
 240
+00:11:32,499 --> 00:11:35,554
 out there that people have built, I see quite a
 241
+00:11:35,554 --> 00:11:39,395
 number of projects have basically done the same thing.
 242
+00:11:40,674 --> 00:11:43,155
 Lest that be misconstrued, I'm not saying for a millisecond
 243
+00:11:43,155 --> 00:11:44,515
 that I inspired them.
 244
+00:11:44,515 --> 00:11:47,954
 I'm sure this has been a thing that's been integrated
 245
+00:11:47,954 --> 00:11:51,210
 into tools for a while, but it's the kind of
 246
+00:11:51,210 --> 00:11:53,610
 thing that when you start using these tools every day,
 247
+00:11:53,610 --> 00:11:57,530
 the need for it is almost instantly apparent because text
 248
+00:11:57,530 --> 00:12:01,449
 that doesn't have any punctuation or paragraph spacing takes a
 249
+00:12:01,449 --> 00:12:03,885
 long time to, you know, it takes so long to
 250
+00:12:03,885 --> 00:12:08,924
 get it into a presentable email that again, moves speech
 251
+00:12:08,924 --> 00:12:13,005
 tech into that before that inflection point where you're like,
 252
+00:12:13,005 --> 00:12:13,885
 nah, it's just not worth it.
 253
+00:12:13,885 --> 00:12:16,844
 It's like, it'll just be quicker to type this.
 254
+00:12:17,199 --> 00:12:19,760
 So it's a big, it's a little touch that actually
 255
+00:12:20,000 --> 00:12:21,120
 is a big deal.
 256
+00:12:21,439 --> 00:12:25,360
 So I was on Whisper and I've been using Whisper
 257
+00:12:25,360 --> 00:12:27,679
 and I kind of early on found a couple of
 258
+00:12:27,679 --> 00:12:28,319
 tools.
 259
+00:12:28,319 --> 00:12:30,559
 I couldn't find what I was looking for on Linux,
 260
+00:12:30,559 --> 00:12:35,844
 which is basically just something that'll run-in the background.
 261
+00:12:35,844 --> 00:12:38,165
 You'll give it an API key and it will just
 262
+00:12:38,165 --> 00:12:42,964
 like transcribe with like a little key to start and
 263
+00:12:42,964 --> 00:12:43,765
 stop the dictation.
 264
+00:12:45,000 --> 00:12:48,360
 And the issues where I discovered that like most people
 265
+00:12:48,360 --> 00:12:51,960
 involved in creating these projects were very much focused on
 266
+00:12:51,960 --> 00:12:55,720
 local models, running Whisper locally because you can.
 267
+00:12:56,199 --> 00:12:58,120
 And I tried that a bunch of times and just
 268
+00:12:58,120 --> 00:13:00,974
 never got results that were as good as the cloud.
 269
+00:13:01,375 --> 00:13:03,535
 And when I began looking at the cost of the
 270
+00:13:03,535 --> 00:13:06,574
 speech to text APIs and what I was spending, I
 271
+00:13:06,574 --> 00:13:09,775
 just thought there is it's actually, in my opinion, just
 272
+00:13:09,775 --> 00:13:13,080
 one of the better deals in API spending in the
 273
+00:13:13,080 --> 00:13:13,400
 cloud.
 274
+00:13:13,400 --> 00:13:15,640
 Like, it's just not that expensive for very, very good
 275
+00:13:15,640 --> 00:13:19,559
 models that are much more, you know, you're gonna be
 276
+00:13:19,559 --> 00:13:22,679
 able to run the full model, the latest model versus
 277
+00:13:22,679 --> 00:13:26,525
 whatever you can run on your average GPU unless you
 278
+00:13:26,525 --> 00:13:28,765
 want to buy a crazy GPU.
 279
+00:13:28,765 --> 00:13:29,964
 It doesn't really make sense to me.
 280
+00:13:29,964 --> 00:13:33,084
 Privacy is another concern that I know is kind of
 281
+00:13:33,084 --> 00:13:35,245
 like a very much a separate thing that people just
 282
+00:13:35,245 --> 00:13:38,765
 don't want their voice data and their voice leaving their
 283
+00:13:38,765 --> 00:13:42,380
 local environment maybe for regulatory reasons as well.
 284
+00:13:42,620 --> 00:13:43,900
 But I'm not in that.
 285
+00:13:44,140 --> 00:13:48,460
 I neither really care about people listening to my, grocery
 286
+00:13:48,460 --> 00:13:51,500
 list, consisting of, reminding myself that I need to buy
 287
+00:13:51,500 --> 00:13:54,699
 more beer, Cheetos, and hummus, which is kind of the
 288
+00:13:55,254 --> 00:13:59,494
 three staples of my diet during periods of poor nutrition.
 289
+00:13:59,814 --> 00:14:02,295
 But the kind of stuff that I transcribe, it's just
 290
+00:14:02,295 --> 00:14:02,614
 not.
 291
+00:14:02,614 --> 00:14:07,734
 It's not a privacy thing I'm that sort of sensitive
 292
+00:14:07,734 --> 00:14:13,189
 about and I don't do anything so sensitive or secure
 293
+00:14:13,189 --> 00:14:14,710
 that requires air capping.
 294
+00:14:15,590 --> 00:14:17,510
 I looked at the pricing and especially the kind of
 295
+00:14:17,510 --> 00:14:18,870
 older model mini.
 296
+00:14:19,510 --> 00:14:21,830
 Some of them are very, very affordable and I did
 297
+00:14:21,830 --> 00:14:26,684
 a calculation once with ChatGPT and I was like, okay,
 298
+00:14:26,684 --> 00:14:30,285
 this is the API price for I can't remember whatever
 299
+00:14:30,285 --> 00:14:31,324
 the model was.
 300
+00:14:31,724 --> 00:14:34,365
 Let's say I just go at it like nonstop, which
 301
+00:14:34,365 --> 00:14:35,485
 rarely happens.
 302
+00:14:35,564 --> 00:14:38,879
 Probably, I would say on average I might dictate thirty
 303
+00:14:38,879 --> 00:14:41,679
 to sixty minutes per day if I was probably summing
 304
+00:14:41,679 --> 00:14:47,920
 up the emails, documents, outlines, which is a lot, but
 305
+00:14:47,920 --> 00:14:50,079
 it's it's still a fairly modest amount.
 306
+00:14:50,079 --> 00:14:51,759
 And I was like, well, some days I do go
 307
+00:14:51,759 --> 00:14:54,854
 on like one or two days where I've been usually
 308
+00:14:54,854 --> 00:14:56,775
 when I'm like kind of out of the house and
 309
+00:14:56,775 --> 00:15:00,455
 just have something like I have nothing else to do.
 310
+00:15:00,455 --> 00:15:03,095
 Like if I'm at a hospital, we have a newborn
 311
+00:15:03,495 --> 00:15:07,219
 and you're waiting for like eight hours and hours for
 312
+00:15:07,219 --> 00:15:08,020
 an appointment.
 313
+00:15:08,099 --> 00:15:11,939
 And I would probably have listened to podcasts before becoming
 314
+00:15:11,939 --> 00:15:12,900
 a speech fanatic.
 315
+00:15:12,900 --> 00:15:15,299
 And I'm like, Oh, wait, let me just get down.
 316
+00:15:15,299 --> 00:15:17,299
 Let me just get these ideas out of my head.
 317
+00:15:17,460 --> 00:15:20,665
 And that's when I'll go on my speech binges.
 318
+00:15:20,665 --> 00:15:22,584
 But those are like once every few months, like not
 319
+00:15:22,584 --> 00:15:23,464
 frequently.
 320
+00:15:23,704 --> 00:15:25,704
 But I said, okay, let's just say if I'm going
 321
+00:15:25,704 --> 00:15:28,104
 to price out cloud STT.
 322
+00:15:28,905 --> 00:15:33,420
 If I was like dedicated every second of every waking
 323
+00:15:33,420 --> 00:15:37,740
 hour to transcribing for some odd reason, I mean I'd
 324
+00:15:37,740 --> 00:15:39,740
 have to eat and use the toilet.
 325
+00:15:40,460 --> 00:15:42,620
 There's only so many hours I'm awake for.
 326
+00:15:42,620 --> 00:15:46,939
 So let's just say a maximum of forty five minutes
 327
+00:15:47,125 --> 00:15:49,125
 in the hour, then I said, All right, let's just
 328
+00:15:49,125 --> 00:15:50,085
 say fifty.
 329
+00:15:50,564 --> 00:15:51,285
 Who knows?
 330
+00:15:51,285 --> 00:15:52,724
 You're dictating on the toilet.
 331
+00:15:52,724 --> 00:15:53,525
 We do it.
 332
+00:15:53,844 --> 00:15:56,804
 So you could just do sixty, but whatever I did
 333
+00:15:57,045 --> 00:16:01,099
 and every day, like you're going flat out seven days
 334
+00:16:01,099 --> 00:16:02,540
 a week dictating nonstop.
 335
+00:16:02,540 --> 00:16:05,499
 I was like, What's my monthly API bill going to
 336
+00:16:05,499 --> 00:16:06,620
 be at this price?
 337
+00:16:06,699 --> 00:16:09,259
 And it came out to like seventy or eighty bucks.
 338
+00:16:09,259 --> 00:16:12,540
 And I was like, Well, that would be an extraordinary
 339
+00:16:12,860 --> 00:16:14,299
 amount of dictation.
 340
+00:16:14,299 --> 00:16:18,025
 And I would hope that there was some compelling reason
 341
+00:16:18,665 --> 00:16:21,704
 worth more than seventy dollars that I embarked upon that
 342
+00:16:21,704 --> 00:16:22,344
 project.
 343
+00:16:22,584 --> 00:16:24,505
 So given that that's kind of the max point for
 344
+00:16:24,505 --> 00:16:27,224
 me I said that's actually very very affordable.
 345
+00:16:27,944 --> 00:16:30,424
 Now you're gonna if you want to spec out the
 346
+00:16:30,424 --> 00:16:33,829
 costs and you want to do the post processing that
 347
+00:16:33,829 --> 00:16:36,709
 I really do feel is valuable, that's going to cost
 348
+00:16:36,709 --> 00:16:37,670
 some more as well.
 349
+00:16:37,990 --> 00:16:43,189
 Unless you're using Gemini, which needless to say is a
 350
+00:16:43,189 --> 00:16:45,110
 random person sitting in Jerusalem.
 351
+00:16:45,775 --> 00:16:49,375
 I have no affiliation nor with Google nor Anthropic nor
 352
+00:16:49,375 --> 00:16:52,334
 Gemini nor any major tech vendor for that matter.
 353
+00:16:53,775 --> 00:16:57,135
 I like Gemini not so much as a everyday model.
 354
+00:16:57,375 --> 00:16:59,854
 It's kind of underwhelmed in that respect, I would say.
 355
+00:17:00,299 --> 00:17:02,699
 But for multimodal, I think it's got a lot to
 356
+00:17:02,699 --> 00:17:03,259
 offer.
 357
+00:17:03,579 --> 00:17:07,099
 And I think that the transcribing functionality whereby it can,
 358
+00:17:07,979 --> 00:17:12,300
 process audio with a system prompt and both give you
 359
+00:17:12,300 --> 00:17:13,820
 transcription that's cleaned up.
 360
+00:17:13,820 --> 00:17:15,259
 That reduces two steps to one.
 361
+00:17:15,755 --> 00:17:18,874
 And that for me is a very, very big deal.
 362
+00:17:18,875 --> 00:17:22,394
 And I feel like even Google hasn't really sort of
 363
+00:17:22,475 --> 00:17:27,115
 thought through how useful the that modality is and what
 364
+00:17:27,115 --> 00:17:29,620
 kind of use cases you can achieve with it.
 365
+00:17:29,620 --> 00:17:32,259
 Because I found in the course of this year just
 366
+00:17:32,259 --> 00:17:37,939
 an endless list of really kind of system prompt stuff
 367
+00:17:37,939 --> 00:17:40,820
 that I can say, okay, I've used it to capture
 368
+00:17:40,820 --> 00:17:44,035
 context data for AI, which is literally I might speak
 369
+00:17:44,035 --> 00:17:46,675
 for if I wanted to have a good bank of
 370
+00:17:46,675 --> 00:17:49,955
 context data about who knows my childhood.
 371
+00:17:50,354 --> 00:17:54,275
 More realistically, maybe my career goals, something that would just
 372
+00:17:54,275 --> 00:17:56,115
 be like really boring to type out.
 373
+00:17:56,115 --> 00:18:00,420
 So I'll just like sit in my car and record
 374
+00:18:00,420 --> 00:18:01,380
 it for ten minutes.
 375
+00:18:01,380 --> 00:18:03,699
 And that ten minutes you get a lot of information
 376
+00:18:03,699 --> 00:18:04,339
 in.
 377
+00:18:05,539 --> 00:18:07,620
 Emails, which is short text.
 378
+00:18:08,580 --> 00:18:10,339
 Just there is a whole bunch.
 379
+00:18:10,340 --> 00:18:13,295
 And all these workflows kind of require a little bit
 380
+00:18:13,295 --> 00:18:15,054
 of treatment afterwards and different treatment.
 381
+00:18:15,054 --> 00:18:18,334
 My context pipeline is kind of like just extract the
 382
+00:18:18,334 --> 00:18:19,215
 bare essentials.
 383
+00:18:19,215 --> 00:18:22,094
 You end up with me talking very loosely about sort
 384
+00:18:22,094 --> 00:18:24,414
 of what I've done in my career, where I've worked,
 385
+00:18:24,414 --> 00:18:25,374
 where I might like to work.
 386
+00:18:25,920 --> 00:18:29,039
 And it goes, it condenses that down to very robotic
 387
+00:18:29,039 --> 00:18:32,640
 language that is easy to chunk parse and maybe put
 388
+00:18:32,640 --> 00:18:33,920
 into a vector database.
 389
+00:18:33,920 --> 00:18:36,160
 Daniel has worked in technology.
 390
+00:18:36,160 --> 00:18:39,760
 Daniel has been working in, know, stuff like that.
 391
+00:18:39,760 --> 00:18:42,975
 That's not how you would speak, but I figure it's
 392
+00:18:42,975 --> 00:18:46,414
 probably easier to parse for, after all, robots.
 393
+00:18:46,735 --> 00:18:48,654
 So we've almost got to twenty minutes and this is
 394
+00:18:48,654 --> 00:18:53,054
 actually a success because I wasted twenty minutes of my
 395
+00:18:53,455 --> 00:18:57,120
 of the evening speaking into you in microphone and the
 396
+00:18:57,120 --> 00:19:01,039
 levels were shot and was clipping and I said I
 397
+00:19:01,039 --> 00:19:02,320
 can't really do an evaluation.
 398
+00:19:02,320 --> 00:19:03,360
 I have to be fair.
 399
+00:19:03,360 --> 00:19:06,320
 I have to give the models a chance to do
 400
+00:19:06,320 --> 00:19:06,880
 their thing.
 401
+00:19:07,425 --> 00:19:09,505
 What am I hoping to achieve in this?
 402
+00:19:09,505 --> 00:19:11,584
 Okay, my fine tune was a dud as mentioned.
 403
+00:19:11,665 --> 00:19:15,185
 Deepgram STT, I'm really, really hopeful that this prototype will
 404
+00:19:15,185 --> 00:19:17,985
 work and it's a build in public open source so
 405
+00:19:17,985 --> 00:19:20,304
 anyone is welcome to use it if I make anything
 406
+00:19:20,304 --> 00:19:20,625
 good.
 407
+00:19:21,560 --> 00:19:23,800
 But that was really exciting for me last night when
 408
+00:19:23,800 --> 00:19:28,840
 after hours of trying my own prototype, seeing someone just
 409
+00:19:28,840 --> 00:19:32,039
 made something that works like that, you you're not gonna
 410
+00:19:32,039 --> 00:19:36,374
 have to build a custom conda environment and image.
 411
+00:19:36,374 --> 00:19:39,974
 I have AMD GPU which makes things much more complicated.
 412
+00:19:40,214 --> 00:19:42,614
 I didn't find it and I was about to give
 413
+00:19:42,614 --> 00:19:43,894
 up and I said, All right, let me just give
 414
+00:19:43,894 --> 00:19:46,455
 Deepgram's Linux thing a shot.
 415
+00:19:47,029 --> 00:19:49,589
 And if this doesn't work, I'm just gonna go back
 416
+00:19:49,589 --> 00:19:51,349
 to trying to vibe code something myself.
 417
+00:19:51,670 --> 00:19:55,509
 And when I ran the script, I was using Cloud
 418
+00:19:55,509 --> 00:19:59,029
 Code to do the installation process, it ran the script
 419
+00:19:59,029 --> 00:20:01,189
 and, oh my gosh, it works just like that.
 420
+00:20:01,824 --> 00:20:05,985
 The tricky thing for all those who wants to know
 421
+00:20:05,985 --> 00:20:11,425
 all the nitty, ditty, nitty gritty details was that I
 422
+00:20:11,425 --> 00:20:14,624
 don't think it was actually struggling with transcription, but pasting
 423
+00:20:14,705 --> 00:20:17,539
 Weyland makes life very hard.
 424
+00:20:17,539 --> 00:20:19,140
 And I think there was something not running at the
 425
+00:20:19,140 --> 00:20:19,699
 right time.
 426
+00:20:19,699 --> 00:20:22,979
 Anyway, Deepgram, I looked at how they actually handle that
 427
+00:20:22,979 --> 00:20:25,140
 because it worked out of the box when other stuff
 428
+00:20:25,140 --> 00:20:25,779
 didn't.
 429
+00:20:26,100 --> 00:20:28,900
 And it was quite a clever little mechanism.
 430
+00:20:29,495 --> 00:20:32,135
 And but more so than that, the accuracy was brilliant.
 431
+00:20:32,135 --> 00:20:33,574
 Now what am I what am I doing here?
 432
+00:20:33,574 --> 00:20:37,175
 This is gonna be a twenty minute audio sample.
 433
+00:20:38,375 --> 00:20:42,410
 And I'm I think I've done one or two of
 434
+00:20:42,410 --> 00:20:47,130
 these before, but I did it with short, snappy voice
 435
+00:20:47,130 --> 00:20:47,610
 notes.
 436
+00:20:47,610 --> 00:20:49,370
 This is kind of long form.
 437
+00:20:49,449 --> 00:20:51,929
 This actually might be a better approximation for what's useful
 438
+00:20:51,929 --> 00:20:53,849
 to me than voice memos.
 439
+00:20:53,849 --> 00:20:56,894
 Like, I need to buy three liters of milk tomorrow
 440
+00:20:56,894 --> 00:21:00,175
 and peter bread, which is probably how half my voice
 441
+00:21:00,175 --> 00:21:00,735
 notes sound.
 442
+00:21:00,735 --> 00:21:04,094
 Like if anyone were to find my phone they'd be
 443
+00:21:04,094 --> 00:21:05,934
 like this is the most boring person in the world.
 444
+00:21:06,015 --> 00:21:10,050
 Although actually there are some journaling thoughts as well, but
 445
+00:21:10,050 --> 00:21:11,810
 it's a lot of content like that.
 446
+00:21:11,810 --> 00:21:14,610
 And the probably for the evaluation, the most useful thing
 447
+00:21:14,610 --> 00:21:21,834
 is slightly obscure tech, GitHub, Nucleano, hugging face, not so
 448
+00:21:21,834 --> 00:21:24,474
 obscure that it's not gonna have a chance of knowing
 449
+00:21:24,474 --> 00:21:27,194
 it, but hopefully sufficiently well known that the model should
 450
+00:21:27,194 --> 00:21:27,834
 get it.
 451
+00:21:27,914 --> 00:21:29,995
 I tried to do a little bit of speaking really
 452
+00:21:29,995 --> 00:21:32,394
 fast and speaking very slowly.
 453
+00:21:32,394 --> 00:21:35,529
 Would say in general, I've spoken, delivered this at a
 454
+00:21:35,529 --> 00:21:39,130
 faster pace than I usually would owing to strong coffee
 455
+00:21:39,130 --> 00:21:40,570
 flowing through my bloodstream.
 456
+00:21:41,130 --> 00:21:43,529
 And the thing that I'm not gonna get in this
 457
+00:21:43,529 --> 00:21:46,090
 benchmark is background noise, which in my first take that
 458
+00:21:46,090 --> 00:21:48,455
 I had to get rid of, my wife came in
 459
+00:21:48,455 --> 00:21:51,495
 with my son and for a good night kiss.
 460
+00:21:51,574 --> 00:21:55,094
 And that actually would have been super helpful to get
 461
+00:21:55,094 --> 00:21:57,814
 in because it was non diarized or if we had
 462
+00:21:57,814 --> 00:21:58,695
 diarization.
 463
+00:21:59,334 --> 00:22:01,414
 A female, I could say, I want the male voice
 464
+00:22:01,414 --> 00:22:03,094
 and that wasn't intended for transcription.
 465
+00:22:04,509 --> 00:22:06,269
 And we're not going to get background noise like people
 466
+00:22:06,269 --> 00:22:08,989
 honking their horns, which is something I've done in my
 467
+00:22:09,150 --> 00:22:11,870
 main data set where I am trying to go back
 468
+00:22:11,870 --> 00:22:15,070
 to some of my voice notes, annotate them and run
 469
+00:22:15,070 --> 00:22:15,709
 a benchmark.
 470
+00:22:15,709 --> 00:22:18,265
 But this is going to be just a pure quick
 471
+00:22:18,265 --> 00:22:19,064
 test.
 472
+00:22:19,785 --> 00:22:24,025
 And as someone I'm working on a voice note idea.
 473
+00:22:24,025 --> 00:22:28,185
 That's my sort of end motivation besides thinking it's an
 474
+00:22:28,185 --> 00:22:31,785
 absolutely outstanding technology that's coming to viability.
 475
+00:22:31,785 --> 00:22:34,400
 And really, I know this sounds cheesy, can actually have
 476
+00:22:34,400 --> 00:22:36,479
 a very transformative effect.
 477
+00:22:37,920 --> 00:22:43,120
 Voice technology has been life changing for folks living with
 478
+00:22:43,999 --> 00:22:45,039
 disabilities.
 479
+00:22:45,920 --> 00:22:48,545
 And I think there's something really nice about the fact
 480
+00:22:48,545 --> 00:22:52,545
 that it can also benefit folks who are able-bodied and
 481
+00:22:52,545 --> 00:22:57,904
 we can all in different ways make this tech as
 482
+00:22:57,904 --> 00:23:00,705
 useful as possible regardless of the exact way that we're
 483
+00:23:00,705 --> 00:23:01,025
 using it.
 484
+00:23:02,199 --> 00:23:04,439
 And I think there's something very powerful in that, and
 485
+00:23:04,439 --> 00:23:05,559
 it can be very cool.
 486
+00:23:06,120 --> 00:23:07,559
 I see huge potential.
 487
+00:23:07,559 --> 00:23:09,319
 What excites me about voice tech?
 488
+00:23:09,719 --> 00:23:11,159
 A lot of things actually.
 489
+00:23:12,120 --> 00:23:14,839
 Firstly, the fact that it's cheap and accurate, as I
 490
+00:23:14,839 --> 00:23:17,785
 mentioned at the very start of this, and it's getting
 491
+00:23:17,785 --> 00:23:20,104
 better and better with stuff like accent handling.
 492
+00:23:20,745 --> 00:23:23,304
 I'm not sure my fine tune will actually ever come
 493
+00:23:23,304 --> 00:23:25,225
 to fruition in the sense that I'll use it day
 494
+00:23:25,225 --> 00:23:26,584
 to day as I imagine.
 495
+00:23:26,664 --> 00:23:30,505
 I get like superb, flawless words error rates because I'm
 496
+00:23:30,505 --> 00:23:34,949
 just kind of skeptical about local speech to text, as
 497
+00:23:34,949 --> 00:23:35,670
 I mentioned.
 498
+00:23:36,070 --> 00:23:39,830
 And I think the pace of innovation and improvement in
 499
+00:23:39,830 --> 00:23:42,310
 the models, the main reasons for fine tuning from what
 500
+00:23:42,310 --> 00:23:46,150
 I've seen have been people who are something that really
 501
+00:23:46,150 --> 00:23:50,375
 blows blows my mind about ASR is the idea that
 502
+00:23:50,375 --> 00:23:55,574
 it's inherently ailingual or multilingual, phonetic based.
 503
+00:23:56,295 --> 00:24:00,375
 So as folks who use speak very obscure languages that
 504
+00:24:00,375 --> 00:24:03,094
 there may be very there might be a paucity of
 505
+00:24:02,229 --> 00:24:05,030
 training data or almost none at all, and therefore the
 506
+00:24:05,030 --> 00:24:06,790
 accuracy is significantly reduced.
 507
+00:24:06,790 --> 00:24:11,350
 Or folks in very critical environments, I know there are
 508
+00:24:11,510 --> 00:24:15,350
 this is used extensively in medical transcription and dispatcher work
 509
+00:24:15,350 --> 00:24:19,064
 as, you know the call centers who send out ambulances
 510
+00:24:19,064 --> 00:24:19,864
 etc.
 511
+00:24:20,265 --> 00:24:23,545
 Where accuracy is absolutely paramount and in the case of
 512
+00:24:23,545 --> 00:24:27,545
 doctors radiologists they might be using very specialized vocab all
 513
+00:24:27,545 --> 00:24:27,865
 the time.
 514
+00:24:28,630 --> 00:24:30,229
 So those are kind of the main two things, and
 515
+00:24:30,229 --> 00:24:32,150
 I'm not sure that really just for trying to make
 516
+00:24:32,150 --> 00:24:36,390
 it better on a few random tech words with my
 517
+00:24:36,390 --> 00:24:39,429
 slightly I mean, I have an accent, but, like, not,
 518
+00:24:39,429 --> 00:24:42,469
 you know, an accent that a few other million people
 519
+00:24:42,870 --> 00:24:43,910
 have ish.
 520
+00:24:44,685 --> 00:24:47,965
 I'm not sure that my little fine tune is gonna
 521
+00:24:47,965 --> 00:24:52,604
 actually like, the bump in word error reduction, if I
 522
+00:24:52,604 --> 00:24:54,205
 ever actually figure out how to do it and get
 523
+00:24:54,205 --> 00:24:56,365
 it up to the cloud, by the time we've done
 524
+00:24:56,365 --> 00:24:59,959
 that, I suspect that the next generation of ASR will
 525
+00:24:59,959 --> 00:25:01,719
 just be so good that it will kind of be,
 526
+00:25:01,959 --> 00:25:03,959
 well, that would have been cool if it worked out,
 527
+00:25:03,959 --> 00:25:05,479
 but I'll just use this instead.
 528
+00:25:05,719 --> 00:25:10,679
 So that's gonna be it for today's episode of voice
 529
+00:25:10,679 --> 00:25:11,640
 training data.
 530
+00:25:11,880 --> 00:25:14,255
 Single, long shot evaluation.
 531
+00:25:14,495 --> 00:25:15,694
 Who am I gonna compare?
 532
+00:25:16,414 --> 00:25:18,574
 Whisper is always good as a benchmark, but I'm more
 533
+00:25:18,574 --> 00:25:22,175
 interested in seeing Whisper head to head with two things
 534
+00:25:22,175 --> 00:25:22,894
 really.
 535
+00:25:23,295 --> 00:25:25,134
 One is Whisper variants.
 536
+00:25:25,134 --> 00:25:27,695
 So you've got these projects like Faster Whisper.
 537
+00:25:29,110 --> 00:25:29,989
 Distill Whisper.
 538
+00:25:29,989 --> 00:25:30,709
 It's a bit confusing.
 539
+00:25:30,709 --> 00:25:31,909
 There's a whole bunch of them.
 540
+00:25:32,150 --> 00:25:35,110
 And the emerging ASRs, which are also a thing.
 541
+00:25:35,269 --> 00:25:37,110
 My intention for this is I'm not sure I'm gonna
 542
+00:25:37,110 --> 00:25:39,910
 have the time in any point in the foreseeable future
 543
+00:25:39,910 --> 00:25:44,775
 to go back to this whole episode and create a
 544
+00:25:44,775 --> 00:25:48,294
 proper source truth where I fix everything.
 545
+00:25:49,255 --> 00:25:51,894
 Might do it if I can get one transcription that's
 546
+00:25:51,894 --> 00:25:54,134
 sufficiently close to perfection.
 547
+00:25:54,934 --> 00:25:58,400
 But what I would actually love to do on Hugging
 548
+00:25:58,400 --> 00:26:00,479
 Face, I think would be a great probably how I
 549
+00:26:00,479 --> 00:26:04,400
 might visualize this is having the audio waveform play and
 550
+00:26:04,400 --> 00:26:08,880
 then have the transcript for each model below it and
 551
+00:26:08,880 --> 00:26:13,765
 maybe even a, like, you know, to scale and maybe
 552
+00:26:13,765 --> 00:26:16,644
 even a local one as well, like local whisper versus
 553
+00:26:16,644 --> 00:26:19,684
 OpenAI API, etcetera.
 554
+00:26:19,765 --> 00:26:23,124
 And I can then actually listen back to segments or
 555
+00:26:23,124 --> 00:26:25,285
 anyone who wants to can listen back to segments of
 556
+00:26:25,285 --> 00:26:30,219
 this recording and see where a particular model struggled and
 557
+00:26:30,219 --> 00:26:33,099
 others didn't as well as the sort of headline finding
 558
+00:26:33,099 --> 00:26:35,579
 of which had the best W E R but that
 559
+00:26:35,579 --> 00:26:37,659
 would require the source of truth.
 560
+00:26:37,660 --> 00:26:38,459
 Okay, that's it.
 561
+00:26:38,425 --> 00:26:40,985
 I hope this was, I don't know, maybe useful for
 562
+00:26:40,985 --> 00:26:42,904
 other folks interested in STT.
 563
+00:26:42,985 --> 00:26:45,945
 You want to see I always think I've just said
 564
+00:26:45,945 --> 00:26:47,624
 it as something I didn't intend to.
 565
+00:26:47,864 --> 00:26:49,624
 STT, I said for those.
 566
+00:26:49,624 --> 00:26:53,049
 Listen carefully, including hopefully the models themselves.
 567
+00:26:53,289 --> 00:26:55,049
 This has been myself, Daniel Rosol.
 568
+00:26:55,049 --> 00:26:59,370
 For more jumbled repositories about my roving interest in AI
 569
+00:26:59,370 --> 00:27:04,009
 but particularly AgenTic, MCP and VoiceTech you can find me
 570
+00:27:04,009 --> 00:27:05,689
 on GitHub.
 571
+00:27:05,929 --> 00:27:06,650
 Hugging Face.
 572
+00:27:08,045 --> 00:27:08,924
 Where else?
 573
+00:27:08,925 --> 00:27:11,725
 DanielRosel dot com, which is my personal website, as well
 574
+00:27:11,725 --> 00:27:15,485
 as this podcast whose name I sadly cannot remember.
 575
+00:27:15,644 --> 00:27:16,685
 Until next time.
 576
+00:27:16,685 --> 00:27:17,324
 Thanks for listening.
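
Note on the timing change: the commit message says the Speechmatics file had a 120 ms offset removed, and the removed lines in the next file begin at `00:00:00,120`. The sketch below shows how such a shift could be applied to a single cue line. It reuses the `parse_timestamp` and `format_timestamp` names from `adjust_srt_timing.py`; `shift_cue_line`, the `CUE_TIMING` pattern, and the clamp at zero are assumptions made for illustration, not code confirmed to be in the committed script.

```python
import re


def parse_timestamp(timestamp_str):
    """Convert an SRT timestamp (HH:MM:SS,mmm) to milliseconds."""
    time_part, ms_part = timestamp_str.split(',')
    h, m, s = map(int, time_part.split(':'))
    return (h * 3600 + m * 60 + s) * 1000 + int(ms_part)


def format_timestamp(ms):
    """Convert milliseconds back to the SRT timestamp format."""
    hours, ms = divmod(ms, 3600000)
    minutes, ms = divmod(ms, 60000)
    seconds, milliseconds = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


# Matches a cue timing line such as "00:00:00,120 --> 00:00:06,520".
CUE_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def shift_cue_line(line, offset_ms):
    """Hypothetical helper: subtract offset_ms from both timestamps of a cue line.

    Lines that are not timing lines (cue numbers, subtitle text) are returned
    unchanged. Clamping at zero is an assumption made here so that a shift can
    never produce a negative timestamp.
    """
    match = CUE_TIMING.fullmatch(line.strip())
    if match is None:
        return line
    start, end = (max(0, parse_timestamp(t) - offset_ms) for t in match.groups())
    return f"{format_timestamp(start)} --> {format_timestamp(end)}"


if __name__ == "__main__":
    # First Speechmatics cue from the diff, shifted by the 120 ms named in the commit message.
    print(shift_cue_line("00:00:00,120 --> 00:00:06,520", 120))
```

Matching the whole line, rather than searching for timestamps anywhere in it, keeps subtitle text that happens to contain digits and colons from being rewritten.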

srt-out/speechmatics.srt CHANGED

@@ -1,2069 +1,2069 @@
 1
-00:00:00,120 --> 00:00:06,520
 Hello and welcome to a audio data
 set consisting of one single
 2
-00:00:06,520 --> 00:00:12,120
 episode of a non-existent podcast.
 Or it, uh, I may append this to a
 3
-00:00:12,120 --> 00:00:16,640
 podcast that I set up recently.
 Um, regarding my, uh,
 4
-00:00:16,680 --> 00:00:21,960
 with my thoughts on speech,
 tech and AI in particular,
 5
-00:00:22,240 --> 00:00:27,960
 more AI and generative AI, I would,
 uh, I would say, but in any event,
 6
-00:00:27,960 --> 00:00:32,480
 the purpose of this, um,
 voice recording is actually to create
 7
-00:00:32,680 --> 00:00:37,560
 a lengthy voice sample for a quick
 evaluation, a back of the envelope
 8
-00:00:37,560 --> 00:00:41,160
 evaluation, as they might say,
 for different speech to text models.
 9
-00:00:41,160 --> 00:00:43,800
 And I'm doing this because I,
 uh, I thought I'd made a great
 10
-00:00:43,800 --> 00:00:48,320
 breakthrough in my journey with
 speech tech, and that was succeeding
 11
-00:00:48,320 --> 00:00:52,720
 in the elusive task of fine tuning.
 Whisper, whisper is.
 12
-00:00:52,840 --> 00:00:56,960
 And I'm going to just talk.
 I'm trying to mix up, uh,
 13
-00:00:56,960 --> 00:01:00,470
 I'm going to try a few different
 styles of speaking.
 14
-00:01:00,470 --> 00:01:02,630
 I might whisper something at
 some point as well,
 15
-00:01:03,190 --> 00:01:07,150
 and I'll go back to speaking loud in,
 uh, in different parts.
 16
-00:01:07,150 --> 00:01:09,710
 I'm going to sound really like a
 crazy person, because I'm also
 17
-00:01:09,710 --> 00:01:15,870
 going to try to speak at different
 pitches and cadences in order to
 18
-00:01:15,910 --> 00:01:20,630
 really try to put a speech to
 text model through its paces,
 19
-00:01:20,630 --> 00:01:25,870
 which is trying to make sense of,
 is this guy just on incoherently in
 20
-00:01:25,870 --> 00:01:34,350
 one long sentence, or are these just
 actually a series of step standalone,
 21
-00:01:34,350 --> 00:01:37,510
 standalone, standalone sentences?
 And how is it going to handle
 22
-00:01:37,510 --> 00:01:40,750
 step alone? That's not a word.
 Uh, what happens when you use
 23
-00:01:40,750 --> 00:01:44,030
 speech to text and you use a fake
 word and then you're like, wait,
 24
-00:01:44,030 --> 00:01:48,350
 that's not actually that word doesn't
 exist. How does AI handle that?
 25
-00:01:48,390 --> 00:01:53,910
 And, uh, these and more are all
 the questions that I'm seeking
 26
-00:01:53,910 --> 00:01:57,350
 to answer in this training data.
 Now, why did why was it trying
 27
-00:01:57,350 --> 00:01:59,740
 to fine tune a whisper?
 And what is whisper?
 28
-00:01:59,780 --> 00:02:03,540
 As I said, I'm gonna try to, uh,
 record this at a couple of different
 29
-00:02:03,540 --> 00:02:09,060
 levels of technicality for folks who
 are, uh, you know, in the normal, uh,
 30
-00:02:09,060 --> 00:02:13,460
 world and not totally stuck down
 the rabbit hole of AI, uh, which I
 31
-00:02:13,460 --> 00:02:17,460
 have to say is a really wonderful,
 uh, rabbit hole to be to be down.
 32
-00:02:17,580 --> 00:02:21,700
 Um, it's a really interesting area.
 And speech and voice tech is is
 33
-00:02:21,940 --> 00:02:24,980
 the aspect of it that I find
 actually most.
 34
-00:02:25,180 --> 00:02:28,340
 I'm not sure I would say the most
 interesting, because there's just
 35
-00:02:28,340 --> 00:02:32,700
 so much that is fascinating in AI.
 Uh, but the most that I find the
 36
-00:02:32,700 --> 00:02:36,220
 most personally transformative
 in terms of the impact that it's
 37
-00:02:36,220 --> 00:02:41,660
 had on my daily work life and
 productivity and how I sort of work.
 38
-00:02:41,940 --> 00:02:48,020
 And I'm persevering hard with the
 task of trying to guess a good
 39
-00:02:48,020 --> 00:02:51,700
 solution working for Linux, which if
 anyone actually does listen to this,
 40
-00:02:51,700 --> 00:02:55,100
 not just for the training data
 and for the actual content, uh,
 41
-00:02:55,140 --> 00:02:59,600
 this is this is has sparked I had
 besides the fine tune not working.
 42
-00:02:59,600 --> 00:03:05,560
 Well, that was the failure.
 Um, I used clod code because one
 43
-00:03:05,560 --> 00:03:10,160
 thinks these days that there is
 nothing short of solving,
 44
-00:03:11,040 --> 00:03:14,680
 you know, the, uh,
 the reason of life or something.
 45
-00:03:15,080 --> 00:03:19,560
 Uh, that clod and agentic AI can't
 do, uh, which is not really the case.
 46
-00:03:19,600 --> 00:03:23,600
 Uh, it does seem that way sometimes,
 but it fails a lot as well.
 47
-00:03:23,600 --> 00:03:26,960
 And this is one of those, uh,
 instances where last week I put
 48
-00:03:26,960 --> 00:03:31,400
 together an hour of voice training
 data, basically speaking just
 49
-00:03:31,400 --> 00:03:35,040
 random things for three minutes.
 And, um,
 50
-00:03:35,720 --> 00:03:38,520
 it was actually kind of tedious
 because the texts were really weird.
 51
-00:03:38,520 --> 00:03:42,120
 Some of them were it was like it
 was AI generated.
 52
-00:03:42,320 --> 00:03:44,920
 Um, I tried before to read
 Sherlock Holmes for an hour and
 53
-00:03:44,920 --> 00:03:47,000
 I just couldn't.
 I was so bored, uh,
 54
-00:03:47,040 --> 00:03:50,800
 after ten minutes that I was like,
 okay, now I'm just gonna have to
 55
-00:03:50,800 --> 00:03:56,470
 find something else to read.
 So I used a created with AI
 56
-00:03:56,510 --> 00:04:00,150
 studio vibe coded.
 A synthetic text generator.
 57
-00:04:00,390 --> 00:04:03,990
 Um, which actually I thought was
 probably a better way of doing it
 58
-00:04:03,990 --> 00:04:08,870
 because it would give me more short
 samples with more varied content.
 59
-00:04:08,870 --> 00:04:13,310
 So I was like, okay, give me a voice
 note, like I'm recording an email,
 60
-00:04:13,310 --> 00:04:18,110
 give me a short story to read,
 give me prose, um, to read.
 61
-00:04:18,110 --> 00:04:21,310
 So I came up with all these
 different things, and I added a
 62
-00:04:21,310 --> 00:04:24,750
 little timer to it so I could
 see how close I was to one hour.
 63
-00:04:24,990 --> 00:04:29,830
 Um, and, uh, I spent like an hour one
 afternoon or probably two hours by
 64
-00:04:29,830 --> 00:04:34,190
 the time you, um, you do retakes
 or whatever because you want to.
 65
-00:04:34,990 --> 00:04:39,190
 It gave me a source of truth,
 which I'm not sure if that's the
 66
-00:04:39,190 --> 00:04:43,550
 scientific way to approach this topic
 of gathering, uh, training data,
 67
-00:04:43,550 --> 00:04:48,070
 but I thought it made sense.
 Um, I have a lot of audio data
 68
-00:04:48,070 --> 00:04:52,070
 from recording voice notes,
 which I've also kind of used, um,
 69
-00:04:52,070 --> 00:04:55,780
 been experimenting with using for
 a different purpose, slightly
 70
-00:04:55,780 --> 00:05:00,820
 different annotating task types.
 It's more text classification
 71
-00:05:00,820 --> 00:05:03,740
 experiment or uh, well,
 it's more than that, actually.
 72
-00:05:03,740 --> 00:05:08,100
 I'm working on a voice app,
 so it's a prototype I guess is
 73
-00:05:08,100 --> 00:05:12,780
 really more accurate.
 Um, but you can do that and you
 74
-00:05:12,780 --> 00:05:14,220
 can work backwards.
 You're like,
 75
-00:05:14,260 --> 00:05:18,620
 you listen back to a voice note
 and you painfully go through one
 76
-00:05:18,620 --> 00:05:21,980
 of those transcribing, you know,
 where you start and stop and scrub
 77
-00:05:21,980 --> 00:05:24,100
 around it and you fix the errors.
 But it's really,
 78
-00:05:24,100 --> 00:05:27,220
 really boring to do that.
 So I thought it would be less
 79
-00:05:27,220 --> 00:05:31,860
 tedious in the long term if I just
 recorded The Source of truth.
 80
-00:05:32,180 --> 00:05:34,300
 So it gave me these three minute
 snippets.
 81
-00:05:34,300 --> 00:05:38,780
 I recorded them and saved an MP3
 and a txt in the same folder,
 82
-00:05:38,780 --> 00:05:43,820
 and I created an hour of that data.
 Uh, so I was very hopeful, quietly,
 83
-00:05:43,860 --> 00:05:46,380
 you know, a little bit hopeful
 that I would be able that I could
 84
-00:05:46,380 --> 00:05:49,700
 actually fine tune, whisper.
 Um, I want to fine tune whisper
 85
-00:05:49,700 --> 00:05:54,840
 because when I got into voice tech
 last November, my wife was in
 86
-00:05:54,840 --> 00:05:59,600
 the US and I was alone at home.
 And you know, when crazy people
 87
-00:05:59,600 --> 00:06:03,760
 like me do really wild things like
 use voice to tech, uh, technology.
 88
-00:06:03,760 --> 00:06:06,520
 That was basically, um,
 when I started doing it,
 89
-00:06:06,520 --> 00:06:10,280
 I didn't feel like a crazy person
 speaking to myself, and my
 90
-00:06:10,280 --> 00:06:16,120
 expectations weren't that high.
 Uh, I used speech tech now and again.
 91
-00:06:16,200 --> 00:06:18,480
 Um, tried it out.
 I was like, it'd be really cool
 92
-00:06:18,480 --> 00:06:20,520
 if you could just, like,
 speak into your computer.
 93
-00:06:20,880 --> 00:06:24,720
 And whatever I tried out that
 had Linux support was just.
 94
-00:06:25,440 --> 00:06:28,640
 It was not good, basically.
 Um, and this blew me away from
 95
-00:06:28,640 --> 00:06:32,040
 the first go.
 I mean, it wasn't 100% accurate
 96
-00:06:32,080 --> 00:06:35,160
 out of the box and it took work,
 but it was good enough that there was
 97
-00:06:35,160 --> 00:06:39,720
 a solid foundation and it kind of
 passed that, uh, pivot point that
 98
-00:06:39,720 --> 00:06:42,880
 it's actually worth doing this.
 You know, there's a point where
 99
-00:06:42,880 --> 00:06:46,920
 it's so like the transcript is you
 don't have to get 100% accuracy
 100
-00:06:46,920 --> 00:06:50,630
 for it to be worth your time for
 speech to text to be a worthwhile
 101
-00:06:50,630 --> 00:06:53,070
 addition to your productivity.
 But you do need to get above.
 102
-00:06:53,110 --> 00:06:57,750
 Let's say, I don't know, 85%.
 If it's 60% or 50%,
 103
-00:06:57,750 --> 00:07:00,790
 you inevitably say, screw it.
 I'll just type it because you end up
 104
-00:07:00,790 --> 00:07:05,070
 missing errors in the transcript
 and it becomes actually worse.
 105
-00:07:05,070 --> 00:07:06,830
 You end up in a worse position
 than you started with.
 106
-00:07:06,830 --> 00:07:11,030
 And that's been my experience.
 So, um, I was like, oh,
 107
-00:07:11,070 --> 00:07:13,550
 this is actually really, really good.
 Now how did that happen?
 108
-00:07:13,550 --> 00:07:18,910
 And the answer is ASR whisper
 being open sourced and the
 109
-00:07:18,910 --> 00:07:21,910
 transformer architecture,
 if you want to go back to the,
 110
-00:07:22,510 --> 00:07:26,750
 um, to the underpinnings, which
 really blows my mind and it's on my
 111
-00:07:26,750 --> 00:07:32,430
 list to read through that paper.
 Um, all you need is attention as
 112
-00:07:33,470 --> 00:07:38,470
 attentively as can be done with my
 limited brain because it's super,
 113
-00:07:38,470 --> 00:07:42,310
 super high level stuff.
 Um, super advanced stuff.
 114
-00:07:42,350 --> 00:07:48,070
 I mean, uh, but that I think of all
 the things that are fascinating
 115
-00:07:48,180 --> 00:07:52,820
 about the sudden rise in AI and
 the dramatic capabilities.
 116
-00:07:53,420 --> 00:07:55,700
 I find it fascinating that few
 people are like, hang on,
 117
-00:07:55,860 --> 00:07:59,740
 you've got this thing that can speak
 to you like a chatbot, an LLM,
 118
-00:08:00,420 --> 00:08:05,580
 and then you've got image generation.
 Okay, so firstly, those two things on
 119
-00:08:05,580 --> 00:08:10,860
 the surface have nothing in common.
 Um, so like how are they how did that
 120
-00:08:10,860 --> 00:08:13,100
 just happen all at the same time.
 And then when you extend that
 121
-00:08:13,100 --> 00:08:16,180
 further, um, you're like sooner,
 right?
 122
-00:08:16,180 --> 00:08:21,700
 You can sing a song and AI will like,
 come up with an instrumental and then
 123
-00:08:21,700 --> 00:08:23,860
 you've got whisper and you're like,
 wait a second,
 124
-00:08:24,060 --> 00:08:28,100
 how did all this stuff, like,
 if it's all AI, what's like there
 125
-00:08:28,100 --> 00:08:30,700
 has to be some commonality.
 Otherwise these are four.
 126
-00:08:30,780 --> 00:08:34,780
 These are totally different
 technologies on the surface of it.
 127
-00:08:34,780 --> 00:08:40,220
 And, uh, the transformer architecture
 is, as far as I know, the answer.
 128
-00:08:40,220 --> 00:08:43,860
 And I can't even say can't even
 pretend that I really understand
 129
-00:08:44,140 --> 00:08:47,290
 what the transformer
 architecture means in depth,
 130
-00:08:47,290 --> 00:08:51,810
 but I have scanned it and as I said,
 I want to print it and really kind
 131
-00:08:51,810 --> 00:08:56,770
 of think over it at some point,
 and I'll probably feel bad about
 132
-00:08:56,770 --> 00:08:59,090
 myself, I think,
 because weren't those guys in their
 133
-00:08:59,130 --> 00:09:04,010
 in their 20s like, that's crazy.
 I think I asked ChatGPT once who
 134
-00:09:04,050 --> 00:09:08,370
 were the who wrote that paper
 and how old were they when it
 135
-00:09:08,370 --> 00:09:11,290
 was published in arXiv?
 And I was expecting like,
 136
-00:09:11,530 --> 00:09:13,450
 I don't know,
 what do you what do you imagine?
 137
-00:09:13,450 --> 00:09:15,050
 I personally imagine kind of like,
 you know,
 138
-00:09:15,090 --> 00:09:19,210
 you have these breakthroughs during
 Covid and things like that where
 139
-00:09:19,250 --> 00:09:22,210
 like these kind of really obscure
 scientists who are like in their
 140
-00:09:22,210 --> 00:09:27,250
 50s and they've just kind of been
 laboring in labs and, uh, wearily
 141
-00:09:27,250 --> 00:09:30,650
 and writing in publishing in kind
 of obscure academic publications.
 142
-00:09:30,850 --> 00:09:34,050
 And they finally, like,
 hit a big or win a Nobel Prize and
 143
-00:09:34,050 --> 00:09:37,930
 then their household household names.
 Uh, so that was kind of what I
 144
-00:09:37,930 --> 00:09:39,770
 had in mind.
 That was the mental image I'd
 145
-00:09:39,770 --> 00:09:44,010
 formed of the birth of arXiv.
 Like, I wasn't expecting 20
 146
-00:09:44,050 --> 00:09:47,430
 somethings in San Francisco,
 though I thought that was both very,
 147
-00:09:47,430 --> 00:09:49,990
 very funny, very cool,
 and actually kind of inspiring.
 148
-00:09:50,510 --> 00:09:55,630
 It's nice to think that people who,
 you know, just you might put them
 149
-00:09:55,630 --> 00:10:01,030
 in the kind of milieu or bubble or
 world that you are in or credibly in,
 150
-00:10:01,070 --> 00:10:03,710
 through, you know,
 a series of connections that are
 151
-00:10:03,710 --> 00:10:07,750
 coming up with such literally
 world changing, um, innovations.
 152
-00:10:07,790 --> 00:10:11,550
 Uh, so that was, I thought,
 anyway, that, that that was cool.
 153
-00:10:12,190 --> 00:10:14,070
 Okay. Voice training data.
 How are we doing?
 154
-00:10:14,070 --> 00:10:18,110
 We're about ten minutes, and I'm
 still talking about voice technology.
 155
-00:10:18,310 --> 00:10:22,470
 Um, so whisper was brilliant,
 and I was so excited that I was.
 156
-00:10:22,470 --> 00:10:25,750
 My first instinct was to, like,
 get like, oh, my gosh,
 157
-00:10:25,750 --> 00:10:27,830
 I have to get, like,
 a really good microphone for this.
 158
-00:10:28,070 --> 00:10:31,750
 So, um, I didn't go on a
 spending spree because I said,
 159
-00:10:31,790 --> 00:10:34,590
 I'm gonna have to just wait a
 month and see if I still use this.
 160
-00:10:35,030 --> 00:10:40,110
 And it just kind of became it's
 become really part of my daily
 161
-00:10:40,110 --> 00:10:43,110
 routine.
 Like, if I'm writing an email,
 162
-00:10:43,110 --> 00:10:47,140
 I'll record a voice note.
 And then I've developed and it's
 163
-00:10:47,140 --> 00:10:50,020
 nice to see that everyone is
 like developing the same things
 164
-00:10:50,020 --> 00:10:52,020
 in parallel.
 Like, that's kind of a weird thing
 165
-00:10:52,060 --> 00:10:57,460
 to say, but when I look, I kind of
 came when I started working on this,
 166
-00:10:57,500 --> 00:11:00,820
 these prototypes on GitHub,
 which is where I just kind of
 167
-00:11:00,860 --> 00:11:04,860
 share very freely and loosely,
 uh, ideas and, you know,
 168
-00:11:04,900 --> 00:11:10,140
 first iterations on, on concepts,
 um, and for want of a better word,
 169
-00:11:10,140 --> 00:11:14,020
 I called it like, uh,
 lm post-processing or cleanup or
 170
-00:11:14,260 --> 00:11:18,220
 basically a system prompt that after
 you get back the raw text from
 171
-00:11:18,540 --> 00:11:24,220
 whisper, you run it through a model
 and say, okay, this is crappy text,
 172
-00:11:24,260 --> 00:11:27,260
 like add sentence structure and,
 you know, fix it up.
 173
-00:11:27,700 --> 00:11:32,780
 And, um, now when I'm exploring the
 different tools that are out there
 174
-00:11:32,820 --> 00:11:36,700
 that people have built, I see, uh,
 quite a number of projects have
 175
-00:11:37,300 --> 00:11:41,820
 basically done the same thing,
 um, less that be misconstrued.
 176
-00:11:41,820 --> 00:11:44,490
 I'm not saying for a millisecond
 that I inspired them.
 177
-00:11:44,490 --> 00:11:49,010
 I'm sure this has been a thing that's
 been integrated into tools for a
 178
-00:11:49,050 --> 00:11:52,410
 while, but it's it's the kind of
 thing that when you start using these
 179
-00:11:52,410 --> 00:11:56,850
 tools every day, the need for it
 is almost instantly apparent, uh,
 180
-00:11:56,850 --> 00:12:00,890
 because text that doesn't have any
 punctuation or paragraph spacing
 181
-00:12:00,930 --> 00:12:04,370
 takes a long time to, you know,
 it takes so long to get it into
 182
-00:12:04,370 --> 00:12:09,490
 a presentable email that again,
 it's it's it moves speech tech
 183
-00:12:09,530 --> 00:12:13,050
 into that before that inflection
 point where you're like, no,
 184
-00:12:13,050 --> 00:12:16,370
 it's just not worth it.
 It's like it'll just be quicker
 185
-00:12:16,370 --> 00:12:18,970
 to type this.
 So it's a big it's a little touch.
 186
-00:12:18,970 --> 00:12:24,210
 That actually is a big deal.
 Uh, so I was on whisper and I've
 187
-00:12:24,210 --> 00:12:28,290
 been using whisper and I kind of
 early on found a couple of tools.
 188
-00:12:28,330 --> 00:12:31,050
 I couldn't find what I was
 looking for on Linux, which is,
 189
-00:12:31,490 --> 00:12:35,890
 um, basically just something
 that'll run in the background.
 190
-00:12:35,930 --> 00:12:40,250
 You'll give it an API key and it
 will just transcribe. Um.
 191
-00:12:41,400 --> 00:12:44,120
 with, like, a little key to
 start and stop the dictation.
 192
-00:12:44,720 --> 00:12:49,160
 Uh, and the issues were I discovered
 that, like most people involved in
 193
-00:12:49,160 --> 00:12:54,040
 creating these projects were very
 much focused on local models running
 194
-00:12:54,040 --> 00:12:57,520
 whisper locally, because you can.
 And I tried that a bunch of
 195
-00:12:57,520 --> 00:13:00,960
 times and just never got results
 that were as good as the cloud.
 196
-00:13:01,280 --> 00:13:04,760
 And when I began looking at the
 cost of the speech to text APIs
 197
-00:13:04,760 --> 00:13:08,640
 and what I was spending,
 I just thought there's it's actually,
 198
-00:13:08,840 --> 00:13:13,320
 in my opinion, just one of the better
 deals in API spending and in cloud.
 199
-00:13:13,360 --> 00:13:17,400
 Like it's just not that expensive
 for very, very good models that are
 200
-00:13:17,520 --> 00:13:20,960
 much more, you know, you're going
 to be able to run the full model,
 201
-00:13:21,480 --> 00:13:26,080
 the latest model versus whatever
 you can run on your average GPU.
 202
-00:13:26,120 --> 00:13:29,880
 Unless you want to buy a crazy GPU.
 It doesn't really make sense to me.
 203
-00:13:29,880 --> 00:13:33,600
 Now, privacy is another concern.
 Um, that I know is kind of like a
 204
-00:13:33,640 --> 00:13:37,040
 very much a separate thing that
 people just don't want their voice,
 205
-00:13:37,040 --> 00:13:39,910
 data, and their voice leaving
 their local environment,
 206
-00:13:40,230 --> 00:13:43,950
 maybe for regulatory reasons as well.
 Um, but I'm not in that.
 207
-00:13:44,030 --> 00:13:48,030
 Um, I'm neither really care about
 people listening to my, uh,
 208
-00:13:48,070 --> 00:13:51,310
 grocery list consisting of, uh,
 reminding myself that I need to
 209
-00:13:51,350 --> 00:13:54,910
 buy more beer, Cheetos and hummus,
 which is kind of the three,
 210
-00:13:55,110 --> 00:13:59,430
 three staples of my diet.
 Um, during periods of poor nutrition.
 211
-00:13:59,710 --> 00:14:03,430
 Uh, but the kind of stuff that I
 transcribe, it's just not it's not a,
 212
-00:14:04,110 --> 00:14:09,470
 it's not a privacy thing and that
 sort of sensitive about and, uh,
 213
-00:14:09,470 --> 00:14:13,190
 I don't do anything so,
 you know, sensitive or secure,
 214
-00:14:13,190 --> 00:14:16,710
 that requires air gapping.
 So, um, I looked at the pricing and
 215
-00:14:16,710 --> 00:14:20,390
 especially the kind of older models,
 mini, um, some of them are very,
 216
-00:14:20,390 --> 00:14:23,230
 very affordable.
 And I did a back of the I did a
 217
-00:14:23,230 --> 00:14:27,270
 calculation once with ChatGPT
 and I was like, okay, this is a,
 218
-00:14:27,270 --> 00:14:31,190
 this is the API price for I can't
 remember whatever the model was.
 219
-00:14:31,670 --> 00:14:34,030
 Uh, let's say I just go at it
 like nonstop,
 220
-00:14:34,150 --> 00:14:37,530
 which it rarely happens. Probably.
 I would say on average,
 221
-00:14:37,530 --> 00:14:42,010
 I might dictate 30 to 60 minutes per
 day if I was probably summing up
 222
-00:14:42,010 --> 00:14:48,610
 the emails, documents, outlines,
 um, which is a lot, but it's it's
 223
-00:14:48,610 --> 00:14:50,850
 still a fairly modest amount.
 And I was like, well,
 224
-00:14:50,890 --> 00:14:54,050
 some days I do go on like 1 or 2
 days where I've been.
 225
-00:14:54,570 --> 00:14:58,570
 Usually when I'm like kind of out of
 the house and just have something
 226
-00:14:59,210 --> 00:15:02,370
 like, I have nothing else to do.
 Like if I'm at a hospital with a
 227
-00:15:02,370 --> 00:15:07,090
 newborn, uh, and you're waiting
 for like eight hours and hours
 228
-00:15:07,090 --> 00:15:10,330
 for an appointment, and I would
 probably have listened to podcasts
 229
-00:15:10,610 --> 00:15:14,130
 before becoming a speech fanatic.
 And I'm like, oh, wait,
 230
-00:15:14,170 --> 00:15:16,490
 let me just get down.
 Let me just get these ideas out
 231
-00:15:16,530 --> 00:15:19,730
 of my head.
 And that's when I'll go on my
 232
-00:15:19,770 --> 00:15:21,650
 speech binges.
 But those are like once every
 233
-00:15:21,650 --> 00:15:25,090
 few months, like not frequently.
 But I said, okay, let's just say
 234
-00:15:25,090 --> 00:15:30,770
 if I'm gonna price out.
 Cloud asked if I was like, dedicated
 235
-00:15:30,770 --> 00:15:37,000
 every second of every waking hour to
 transcribing for some odd reason. Um.
 236
-00:15:37,320 --> 00:15:39,800
 I mean, it'd have to, like,
 eat and use the toilet and,
 237
-00:15:39,840 --> 00:15:42,640
 like, you know, there's only so
 many hours I'm awake for.
 238
-00:15:42,640 --> 00:15:44,800
 So, like,
 let's just say a maximum of, like,
 239
-00:15:44,840 --> 00:15:48,800
 40 hours, 45 minutes in the hour.
 Then I said, all right,
 240
-00:15:48,800 --> 00:15:52,720
 let's just say 50. Who knows?
 You're dictating on the toilet.
 241
-00:15:52,760 --> 00:15:54,000
 We do it.
 Uh,
 242
-00:15:54,000 --> 00:15:58,840
 so it could be you could just do 60.
 But whatever I did, and every day,
 243
-00:15:58,880 --> 00:16:02,560
 like, you're going flat out seven
 days a week dictating non-stop.
 244
-00:16:02,600 --> 00:16:06,560
 I was like, what's my monthly API
 bill going to be at this price?
 245
-00:16:06,840 --> 00:16:09,240
 And it came out to like 70 or 80
 bucks.
 246
-00:16:09,240 --> 00:16:14,200
 And I was like, well, that would be
 an extraordinary amount of dictation.
 247
-00:16:14,200 --> 00:16:17,960
 And I would hope that there was
 some compelling reason,
 248
-00:16:18,160 --> 00:16:22,320
 more worth more than $70,
 that I embarked upon that project.
 249
-00:16:22,520 --> 00:16:25,320
 Uh, so given that that's kind of the
 max point for me, I said, that's
 250
-00:16:25,360 --> 00:16:29,120
 actually very, very affordable.
 Um, now you're gonna if you want
 251
-00:16:29,160 --> 00:16:34,200
 to spec out the costs and you want
 to do the post-processing that I
 252
-00:16:34,270 --> 00:16:37,230
 really do feel is valuable.
 Um, that's going to cost some more as
 253
-00:16:37,230 --> 00:16:43,230
 well, unless you're using Gemini,
 which, uh, needless to say, is a
 254
-00:16:43,230 --> 00:16:47,070
 random person sitting in Jerusalem.
 Uh, I have no affiliation,
 255
-00:16:47,070 --> 00:16:51,470
 nor with Google, nor anthropic,
 nor Gemini, nor any major tech vendor
 256
-00:16:51,470 --> 00:16:56,910
 for that matter. Um, I like Gemini.
 Not so much as a everyday model.
 257
-00:16:56,990 --> 00:16:59,950
 Um, it's kind of underwhelmed in
 that respect, I would say.
 258
-00:17:00,350 --> 00:17:03,150
 But for multimodal,
 I think it's got a lot to offer.
 259
-00:17:03,430 --> 00:17:06,990
 And I think that the transcribing
 functionality whereby it can,
 260
-00:17:07,390 --> 00:17:12,270
 um, process audio with a system
 prompt and both give you
 261
-00:17:12,310 --> 00:17:15,510
 transcription that's cleaned up,
 that reduces two steps to one.
 262
-00:17:15,830 --> 00:17:18,750
 And that for me is a very,
 very big deal.
 263
-00:17:18,750 --> 00:17:23,110
 And, uh, I feel like even Google
 has haven't really sort of thought
 264
-00:17:23,110 --> 00:17:27,550
 through how useful the that
 modality is and what kind of use
 265
-00:17:27,550 --> 00:17:30,910
 cases you can achieve with it.
 Because I found in the course of
 266
-00:17:30,910 --> 00:17:36,610
 this year just an endless list
 of really kind of system prompt,
 267
-00:17:36,850 --> 00:17:41,410
 system prompt stuff that I can say,
 okay, I've used it to capture context
 268
-00:17:41,410 --> 00:17:45,690
 data for AI, which is literally I
 might speak for if I wanted to have a
 269
-00:17:45,690 --> 00:17:49,850
 good bank of context data about,
 who knows, my childhood.
 270
-00:17:50,130 --> 00:17:53,570
 Uh, more realistically,
 maybe my career goals, uh,
 271
-00:17:53,570 --> 00:17:56,130
 something that would just be,
 like, really boring to type out.
 272
-00:17:56,250 --> 00:18:01,250
 So I'll just, like, sit in my car
 and record it for ten minutes.
 273
-00:18:01,250 --> 00:18:04,210
 And that ten minutes,
 you get a lot of information in,
 274
-00:18:04,650 --> 00:18:10,210
 um, emails, which is short text.
 Um, just there is a whole bunch.
 275
-00:18:10,210 --> 00:18:13,690
 And all these workflows kind of
 require a little bit of treatment
 276
-00:18:13,690 --> 00:18:17,610
 afterwards and different treatment.
 My context pipeline is kind of like
 277
-00:18:17,610 --> 00:18:21,330
 just extract the bare essentials.
 So you end up with me talking very
 278
-00:18:21,330 --> 00:18:24,370
 loosely about sort of what I've done
 in my career, where I've worked,
 279
-00:18:24,370 --> 00:18:27,730
 where I might like to work,
 and it goes it condenses that
 280
-00:18:27,730 --> 00:18:31,720
 down to very robotic language
 that is easy to chunk, parse,
 281
-00:18:31,720 --> 00:18:36,080
 and maybe put into a vector database.
 Daniel has worked in technology,
 282
-00:18:36,120 --> 00:18:39,760
 Daniel is a has been working in,
 you know, stuff like that.
 283
-00:18:39,760 --> 00:18:43,720
 That's not how you would speak.
 Um, but I figure it's probably easier
 284
-00:18:43,720 --> 00:18:48,240
 to parse for, after all, robots.
 So we've almost got to 20 minutes.
 285
-00:18:48,240 --> 00:18:52,760
 And this is actually a success
 because I wasted 20 minutes of my,
 286
-00:18:52,920 --> 00:18:57,000
 uh, of the evening speaking into
 a microphone, and, uh,
 287
-00:18:57,040 --> 00:19:00,960
 the levels were shot and, uh, it,
 uh, it was clipping and I said,
 288
-00:19:00,960 --> 00:19:03,320
 I can't really do an evaluation.
 I have to be fair.
 289
-00:19:03,320 --> 00:19:07,120
 I have to give the models a
 chance to do their thing.
 290
-00:19:07,640 --> 00:19:09,480
 Uh,
 what am I hoping to achieve in this?
 291
-00:19:09,520 --> 00:19:12,720
 Okay, my fine tune was a dud,
 as mentioned Deepgram SVT.
 292
-00:19:12,760 --> 00:19:15,640
 I'm really, really hopeful that
 this prototype will work.
 293
-00:19:15,920 --> 00:19:19,080
 And it's a built in public open
 source, so anyone is welcome to
 294
-00:19:19,120 --> 00:19:23,040
 use it if I make anything good.
 Um, but that was really exciting for
 295
-00:19:23,040 --> 00:19:27,520
 me last night when after hours of,
 um, trying my own prototype,
 296
-00:19:27,520 --> 00:19:31,350
 seeing someone just made
 something that works like that.
 297
-00:19:31,390 --> 00:19:32,790
 You know,
 you're not going to have to build a
 298
-00:19:32,790 --> 00:19:38,350
 custom conda environment and image.
 I have AMD GPU, which makes
 299
-00:19:38,350 --> 00:19:42,430
 things much more complicated.
 I didn't find it and I was about
 300
-00:19:42,430 --> 00:19:44,110
 to give up and I said,
 all right, let me just give deep
 301
-00:19:44,110 --> 00:19:48,870
 grams Linux thing a shot.
 And if this doesn't work, um,
 302
-00:19:48,870 --> 00:19:51,270
 I'm just going to go back to
 trying to code something myself.
 303
-00:19:51,630 --> 00:19:56,310
 And when I ran the script,
 I was using cloud code to do the
 304
-00:19:56,310 --> 00:20:00,150
 installation process.
 It ran the script and oh my gosh,
 305
-00:20:00,190 --> 00:20:05,470
 it works just like that.
 Uh, the tricky thing for all those
 306
-00:20:05,470 --> 00:20:10,430
 who wants to know all the nitty
 gritty, nitty gritty details, um, was
 307
-00:20:10,430 --> 00:20:13,870
 that I don't think it was actually
 struggling with transcription, but
 308
-00:20:13,870 --> 00:20:18,670
 pasting Wayland makes life very hard,
 and I think there was something not
 309
-00:20:18,670 --> 00:20:21,990
 running in the right time anyway.
 Deepgram I looked at how they
 310
-00:20:21,990 --> 00:20:24,830
 actually handle that because it
 worked out of the box when other
 311
-00:20:24,830 --> 00:20:29,260
 stuff didn't, and it was quite a
 clever little mechanism,
 312
-00:20:29,580 --> 00:20:32,220
 and but more so than that,
 the accuracy was brilliant.
 313
-00:20:32,260 --> 00:20:35,140
 Now, what am I doing here?
 This is going to be a 20 minute
 314
-00:20:35,380 --> 00:20:43,100
 audio sample, and I'm I think
 I've done 1 or 2 of these before,
 315
-00:20:43,100 --> 00:20:49,300
 but I did it with short, snappy voice
 notes. This is kind of long form.
 316
-00:20:49,580 --> 00:20:51,860
 This actually might be a better
 approximation for what's useful
 317
-00:20:51,860 --> 00:20:56,220
 to me than voice memos.
 Like I need to buy three liters
 318
-00:20:56,220 --> 00:20:59,300
 of milk tomorrow, and pita bread,
 which is probably how like half
 319
-00:20:59,300 --> 00:21:02,940
 my voice voice notes sound like
 if anyone were to, I don't know,
 320
-00:21:02,980 --> 00:21:04,700
 like find my phone,
 they'd be like, this is the most
 321
-00:21:04,700 --> 00:21:07,540
 boring person in the world.
 Although actually there are some
 322
-00:21:07,580 --> 00:21:09,820
 like kind of, uh,
 journaling thoughts as well.
 323
-00:21:09,820 --> 00:21:13,820
 But it's a lot of content like that.
 And the probably for the evaluation,
 324
-00:21:13,820 --> 00:21:20,780
 the most useful thing is slightly
 obscure tech GitHub uh, hugging face
 325
-00:21:21,300 --> 00:21:24,780
 not so obscure that it's not going
 to have a chance of knowing it,
 326
-00:21:24,780 --> 00:21:27,760
 but hopefully sufficiently well
 known that the model should get it.
 327
-00:21:28,320 --> 00:21:30,880
 I tried to do a little bit of
 speaking really fast and
 328
-00:21:30,880 --> 00:21:33,320
 speaking very slowly.
 I would say in general,
 329
-00:21:33,320 --> 00:21:37,000
 I've spoken, delivered this at a
 faster pace than I usually would
 330
-00:21:37,040 --> 00:21:40,400
 owing to strong coffee flowing
 through my bloodstream.
 331
-00:21:41,040 --> 00:21:44,320
 And the thing that I'm not going
 to get in this benchmark is
 332
-00:21:44,320 --> 00:21:47,000
 background noise, which in my first
 take that I had to get rid of,
 333
-00:21:47,800 --> 00:21:51,360
 my wife came in with my son and
 for a good night kiss.
 334
-00:21:51,560 --> 00:21:55,240
 And that actually would have
 been super helpful to get in
 335
-00:21:55,240 --> 00:21:59,880
 because it was not diarised.
 Or if we had diarisation a female,
 336
-00:22:00,000 --> 00:22:02,400
 I could say I want the male
 voice and that wasn't intended
 337
-00:22:02,400 --> 00:22:05,400
 for transcription.
 Um, and we're not going to get
 338
-00:22:05,400 --> 00:22:07,080
 background noise like people
 honking their horns,
 339
-00:22:07,080 --> 00:22:11,400
 which is something I've done in my
 main data set where I am trying to
 340
-00:22:11,560 --> 00:22:15,640
 go back to some of my voice notes,
 annotate them, and run a benchmark.
 341
-00:22:15,640 --> 00:22:19,080
 But this is going to be just a
 pure quick test.
 342
-00:22:19,560 --> 00:22:24,000
 And as someone I'm working on a
 voice note idea,
 343
-00:22:24,000 --> 00:22:28,350
 that's my sort of end motivation.
 Besides thinking it's an
 344
-00:22:28,350 --> 00:22:31,710
 absolutely outstanding technology
 that's coming to viability.
 345
-00:22:31,710 --> 00:22:34,790
 And really, I know this sounds
 cheesy can actually have a very
 346
-00:22:34,790 --> 00:22:38,950
 transformative effect.
 Um, it's, you know, voice technology
 347
-00:22:38,990 --> 00:22:45,030
 has been life changing for, uh,
 folks living with, um, disabilities.
 348
-00:22:45,750 --> 00:22:48,670
 And I think there's something
 really nice about the fact that
 349
-00:22:48,670 --> 00:22:52,830
 it can also benefit, you know,
 folks who are able bodied and like,
 350
-00:22:52,870 --> 00:22:59,070
 we can all in different ways, um,
 make this tech as useful as possible,
 351
-00:22:59,110 --> 00:23:01,230
 regardless of the exact way that
 we're using it.
 352
-00:23:01,630 --> 00:23:04,830
 Um, and I think there's something
 very powerful in that, and it can be
 353
-00:23:04,830 --> 00:23:09,030
 very cool. Um, I see use potential.
 What excites me about voice tech?
 354
-00:23:09,870 --> 00:23:13,670
 A lot of things, actually.
 Firstly, the fact that it's cheap
 355
-00:23:13,670 --> 00:23:17,230
 and accurate, as I mentioned at
 the very start of this, um,
 356
-00:23:17,230 --> 00:23:20,910
 and it's getting better and better
 with stuff like accent handling, um,
 357
-00:23:20,910 --> 00:23:24,300
 I'm not sure my, my fine tune will
 actually ever come to fruition in the
 358
-00:23:24,300 --> 00:23:27,980
 sense that I'll use it day to day,
 as I imagine I get like superb,
 359
-00:23:27,980 --> 00:23:33,660
 flawless word error rates because I'm
 just kind of skeptical about local
 360
-00:23:33,660 --> 00:23:38,220
 speech to texts, as I mentioned.
 And I think the pace of innovation
 361
-00:23:38,220 --> 00:23:42,180
 and improvement in the models,
 the main reasons for fine tuning from
 362
-00:23:42,180 --> 00:23:46,460
 what I've seen have been people who
 are something that really blows,
 363
-00:23:46,500 --> 00:23:53,060
 blows my mind about ASR is the idea
 that it's inherently a lingual
 364
-00:23:53,060 --> 00:23:59,220
 or multilingual phonetic based.
 So as folks who use speak very
 365
-00:23:59,260 --> 00:24:02,340
 obscure languages that there may
 be there might be a paucity of
 366
-00:24:02,340 --> 00:24:05,620
 training data or almost none at all,
 and therefore the accuracy is
 367
-00:24:05,620 --> 00:24:10,780
 significantly reduced or folks
 in very critical environments.
 368
-00:24:10,820 --> 00:24:13,500
 I know there are.
 This is used extensively in medical
 369
-00:24:13,500 --> 00:24:18,260
 transcription and dispatcher work as,
 um, you know, the call centers who
 370
-00:24:18,260 --> 00:24:22,610
 send out ambulances, etc., where
 accuracy is absolutely paramount.
 371
-00:24:22,610 --> 00:24:26,170
 And in the case of doctors,
 radiologists, they might be using
 372
-00:24:26,170 --> 00:24:29,730
 very specialized vocab all the time.
 So those are kind of the main
 373
-00:24:29,730 --> 00:24:31,650
 two things.
 And I'm not sure that really just for
 374
-00:24:31,650 --> 00:24:37,410
 trying to make it better on a few
 random tech words with my slightly.
 375
-00:24:37,450 --> 00:24:41,370
 I mean, I have an accent, but like,
 not, you know, an accent that a few
 376
-00:24:41,410 --> 00:24:47,330
 other million people have. Ish.
 I'm not sure that my little fine
 377
-00:24:47,330 --> 00:24:52,370
 tune is going to actually like the
 bump in word error rate reduction.
 378
-00:24:52,370 --> 00:24:54,690
 If I ever actually figure out how
 to do it and get it up to the
 379
-00:24:54,690 --> 00:24:58,730
 cloud by the time I've done that.
 I suspect that the next
 380
-00:24:58,730 --> 00:25:01,530
 generation of ASR will just be
 so good that it will kind of be.
 381
-00:25:02,050 --> 00:25:03,890
 Ah, well,
 that would be cool if it worked out,
 382
-00:25:03,890 --> 00:25:08,850
 but I'll just use this instead.
 So that's going to be it for today's
 383
-00:25:08,850 --> 00:25:14,250
 episode of, uh, voice training data.
 Single long shot evaluation.
 384
-00:25:14,530 --> 00:25:17,450
 Who am I going to compare?
 Whisper is always good as a
 385
-00:25:17,450 --> 00:25:20,720
 benchmark, but I'm more
 interested in seeing Whisperer
 386
-00:25:20,720 --> 00:25:25,200
 head to head with two things,
 really. One is whisper variance.
 387
-00:25:25,200 --> 00:25:30,000
 So you've got these projects like
 faster Whisper, Still whisper.
 388
-00:25:30,000 --> 00:25:31,760
 It's a bit confusing.
 There's a whole bunch of them
 389
-00:25:32,040 --> 00:25:34,920
 and the emerging acers,
 which are also a thing.
 390
-00:25:35,320 --> 00:25:37,800
 My intention for this is I'm not
 sure I'm going to have the time
 391
-00:25:37,800 --> 00:25:41,760
 in any point in the foreseeable
 future to go back through this whole
 392
-00:25:41,760 --> 00:25:46,680
 episode and create a proper source,
 truth or a fix.
 393
-00:25:47,440 --> 00:25:51,800
 Everything might do it if I can
 get one transcription that
 394
-00:25:51,800 --> 00:25:56,840
 sufficiently close to perfection.
 But what I would actually love
 395
-00:25:56,840 --> 00:25:59,920
 to do on Hugging Face I think
 would be a great.
 396
-00:25:59,920 --> 00:26:03,680
 Probably how I might visualize this
 is having the audio waveform play,
 397
-00:26:04,160 --> 00:26:09,920
 and then have the transcript for each
 model below it, and maybe even a,
 398
-00:26:10,600 --> 00:26:15,240
 um, like, you know, two scale and
 maybe even a local one as well,
 399
-00:26:15,280 --> 00:26:21,820
 like local whisper versus open
 AI API, Etc. and, um, I can then
 400
-00:26:21,820 --> 00:26:24,500
 actually listen back to segments
 or anyone who wants to can listen
 401
-00:26:24,500 --> 00:26:29,540
 back to segments of this recording
 and see where a particular model
 402
-00:26:29,580 --> 00:26:33,060
 struggled and others didn't, as well
 as the sort of headline finding
 403
-00:26:33,100 --> 00:26:36,900
 of which had the best, uh, wer.
 But that would require the source
 404
-00:26:36,900 --> 00:26:40,140
 of truth. Okay. That's it.
 Hope this was, I don't know,
 405
-00:26:40,300 --> 00:26:43,580
 maybe useful for other folks
 interested in stuff you want to see.
 406
-00:26:44,060 --> 00:26:48,220
 I always feel think I've just said
 something I didn't intend to say.
 407
-00:26:48,780 --> 00:26:51,140
 I said for those, listen carefully.
 Including, hopefully,
 408
-00:26:51,140 --> 00:26:54,180
 the models themselves.
 This has been myself,
 409
-00:26:54,220 --> 00:26:58,020
 Daniel Rosehill, for more, um,
 jumbled repositories about my,
 410
-00:26:58,060 --> 00:27:00,940
 uh, roving interest in AI,
 but particularly Agentic,
 411
-00:27:01,300 --> 00:27:05,460
 MCP and voice tech.
 Uh, you can find me on GitHub.
 412
-00:27:05,940 --> 00:27:11,260
 Hugging face. Where else?
 Daniel, which is my personal website,
 413
-00:27:11,260 --> 00:27:15,380
 as well as this podcast whose
 name I sadly cannot remember.
 414
-00:27:15,820 --> 00:27:17,540
 Until next time.
 Thanks for listening.

 1
+00:00:00,000 --> 00:00:06,400
 Hello and welcome to a audio data
 set consisting of one single
 2
+00:00:06,400 --> 00:00:12,000
 episode of a non-existent podcast.
 Or it, uh, I may append this to a
 3
+00:00:12,000 --> 00:00:16,520
 podcast that I set up recently.
 Um, regarding my, uh,
 4
+00:00:16,560 --> 00:00:21,840
 with my thoughts on speech,
 tech and AI in particular,
 5
+00:00:22,120 --> 00:00:27,840
 more AI and generative AI, I would,
 uh, I would say, but in any event,
 6
+00:00:27,840 --> 00:00:32,360
 the purpose of this, um,
 voice recording is actually to create
 7
+00:00:32,560 --> 00:00:37,440
 a lengthy voice sample for a quick
 evaluation, a back of the envelope
 8
+00:00:37,440 --> 00:00:41,040
 evaluation, as they might say,
 for different speech to text models.
 9
+00:00:41,040 --> 00:00:43,680
 And I'm doing this because I,
 uh, I thought I'd made a great
 10
+00:00:43,680 --> 00:00:48,200
 breakthrough in my journey with
 speech tech, and that was succeeding
 11
+00:00:48,200 --> 00:00:52,600
 in the elusive task of fine tuning.
 Whisper, whisper is.
 12
+00:00:52,720 --> 00:00:56,840
 And I'm going to just talk.
 I'm trying to mix up, uh,
 13
+00:00:56,840 --> 00:01:00,350
 I'm going to try a few different
 styles of speaking.
 14
+00:01:00,350 --> 00:01:02,510
 I might whisper something at
 some point as well,
 15
+00:01:03,070 --> 00:01:07,030
 and I'll go back to speaking loud in,
 uh, in different parts.
 16
+00:01:07,030 --> 00:01:09,590
 I'm going to sound really like a
 crazy person, because I'm also
 17
+00:01:09,590 --> 00:01:15,750
 going to try to speak at different
 pitches and cadences in order to
 18
+00:01:15,790 --> 00:01:20,510
 really try to put a speech to
 text model through its paces,
 19
+00:01:20,510 --> 00:01:25,750
 which is trying to make sense of,
 is this guy just on incoherently in
 20
+00:01:25,750 --> 00:01:34,230
 one long sentence, or are these just
 actually a series of step standalone,
 21
+00:01:34,230 --> 00:01:37,390
 standalone, standalone sentences?
 And how is it going to handle
 22
+00:01:37,390 --> 00:01:40,630
 step alone? That's not a word.
 Uh, what happens when you use
 23
+00:01:40,630 --> 00:01:43,910
 speech to text and you use a fake
 word and then you're like, wait,
 24
+00:01:43,910 --> 00:01:48,230
 that's not actually that word doesn't
 exist. How does AI handle that?
 25
+00:01:48,270 --> 00:01:53,790
 And, uh, these and more are all
 the questions that I'm seeking
 26
+00:01:53,790 --> 00:01:57,230
 to answer in this training data.
 Now, why did why was it trying
 27
+00:01:57,230 --> 00:01:59,620
 to fine tune a whisper?
 And what is whisper?
 28
+00:01:59,660 --> 00:02:03,420
 As I said, I'm gonna try to, uh,
 record this at a couple of different
 29
+00:02:03,420 --> 00:02:08,940
 levels of technicality for folks who
 are, uh, you know, in the normal, uh,
 30
+00:02:08,940 --> 00:02:13,340
 world and not totally stuck down
 the rabbit hole of AI, uh, which I
 31
+00:02:13,340 --> 00:02:17,340
 have to say is a really wonderful,
 uh, rabbit hole to be to be down.
 32
+00:02:17,460 --> 00:02:21,580
 Um, it's a really interesting area.
 And speech and voice tech is is
 33
+00:02:21,820 --> 00:02:24,860
 the aspect of it that I find
 actually most.
 34
+00:02:25,060 --> 00:02:28,220
 I'm not sure I would say the most
 interesting, because there's just
 35
+00:02:28,220 --> 00:02:32,580
 so much that is fascinating in AI.
 Uh, but the most that I find the
 36
+00:02:32,580 --> 00:02:36,100
 most personally transformative
 in terms of the impact that it's
 37
+00:02:36,100 --> 00:02:41,540
 had on my daily work life and
 productivity and how I sort of work.
 38
+00:02:41,820 --> 00:02:47,900
 And I'm persevering hard with the
 task of trying to guess a good
 39
+00:02:47,900 --> 00:02:51,580
 solution working for Linux, which if
 anyone actually does listen to this,
 40
+00:02:51,580 --> 00:02:54,980
 not just for the training data
 and for the actual content, uh,
 41
+00:02:55,020 --> 00:02:59,480
 this is this is has sparked I had
 besides the fine tune not working.
 42
+00:02:59,480 --> 00:03:05,440
 Well, that was the failure.
 Um, I used clod code because one
 43
+00:03:05,440 --> 00:03:10,040
 thinks these days that there is
 nothing short of solving,
 44
+00:03:10,920 --> 00:03:14,560
 you know, the, uh,
 the reason of life or something.
 45
+00:03:14,960 --> 00:03:19,440
 Uh, that clod and agentic AI can't
 do, uh, which is not really the case.
 46
+00:03:19,480 --> 00:03:23,480
 Uh, it does seem that way sometimes,
 but it fails a lot as well.
 47
+00:03:23,480 --> 00:03:26,840
 And this is one of those, uh,
 instances where last week I put
 48
+00:03:26,840 --> 00:03:31,280
 together an hour of voice training
 data, basically speaking just
 49
+00:03:31,280 --> 00:03:34,920
 random things for three minutes.
 And, um,
 50
+00:03:35,600 --> 00:03:38,400
 it was actually kind of tedious
 because the texts were really weird.
 51
+00:03:38,400 --> 00:03:42,000
 Some of them were it was like it
 was AI generated.
 52
+00:03:42,200 --> 00:03:44,800
 Um, I tried before to read
 Sherlock Holmes for an hour and
 53
+00:03:44,800 --> 00:03:46,880
 I just couldn't.
 I was so bored, uh,
 54
+00:03:46,920 --> 00:03:50,680
 after ten minutes that I was like,
 okay, now I'm just gonna have to
 55
+00:03:50,680 --> 00:03:56,350
 find something else to read.
 So I used a created with AI
 56
+00:03:56,390 --> 00:04:00,030
 studio vibe coded.
 A synthetic text generator.
 57
+00:04:00,270 --> 00:04:03,870
 Um, which actually I thought was
 probably a better way of doing it
 58
+00:04:03,870 --> 00:04:08,750
 because it would give me more short
 samples with more varied content.
 59
+00:04:08,750 --> 00:04:13,190
 So I was like, okay, give me a voice
 note, like I'm recording an email,
 60
+00:04:13,190 --> 00:04:17,990
 give me a short story to read,
 give me prose, um, to read.
 61
+00:04:17,990 --> 00:04:21,190
 So I came up with all these
 different things, and I added a
 62
+00:04:21,190 --> 00:04:24,630
 little timer to it so I could
 see how close I was to one hour.
 63
+00:04:24,870 --> 00:04:29,710
 Um, and, uh, I spent like an hour one
 afternoon or probably two hours by
 64
+00:04:29,710 --> 00:04:34,070
 the time you, um, you do retakes
 or whatever because you want to.
 65
+00:04:34,870 --> 00:04:39,070
 It gave me a source of truth,
 which I'm not sure if that's the
 66
+00:04:39,070 --> 00:04:43,430
 scientific way to approach this topic
 of gathering, uh, training data,
 67
+00:04:43,430 --> 00:04:47,950
 but I thought it made sense.
 Um, I have a lot of audio data
 68
+00:04:47,950 --> 00:04:51,950
 from recording voice notes,
 which I've also kind of used, um,
 69
+00:04:51,950 --> 00:04:55,660
 been experimenting with using for
 a different purpose, slightly
 70
+00:04:55,660 --> 00:05:00,700
 different annotating task types.
 It's more text classification
 71
+00:05:00,700 --> 00:05:03,620
 experiment or uh, well,
 it's more than that, actually.
 72
+00:05:03,620 --> 00:05:07,980
 I'm working on a voice app,
 so it's a prototype I guess is
 73
+00:05:07,980 --> 00:05:12,660
 really more accurate.
 Um, but you can do that and you
 74
+00:05:12,660 --> 00:05:14,100
 can work backwards.
 You're like,
 75
+00:05:14,140 --> 00:05:18,500
 you listen back to a voice note
 and you painfully go through one
 76
+00:05:18,500 --> 00:05:21,860
 of those transcribing, you know,
 where you start and stop and scrub
 77
+00:05:21,860 --> 00:05:23,980
 around it and you fix the errors.
 But it's really,
 78
+00:05:23,980 --> 00:05:27,100
 really boring to do that.
 So I thought it would be less
 79
+00:05:27,100 --> 00:05:31,740
 tedious in the long term if I just
 recorded The Source of truth.
 80
+00:05:32,060 --> 00:05:34,180
 So it gave me these three minute
 snippets.
 81
+00:05:34,180 --> 00:05:38,660
 I recorded them and saved an MP3
 and a txt in the same folder,
 82
+00:05:38,660 --> 00:05:43,700
 and I created an hour of that data.
 Uh, so I was very hopeful, quietly,
 83
+00:05:43,740 --> 00:05:46,260
 you know, a little bit hopeful
 that I would be able that I could
 84
+00:05:46,260 --> 00:05:49,580
 actually fine tune, whisper.
 Um, I want to fine tune whisper
 85
+00:05:49,580 --> 00:05:54,720
 because when I got into voice tech
 last November, my wife was in
 86
+00:05:54,720 --> 00:05:59,480
 the US and I was alone at home.
 And you know, when crazy people
 87
+00:05:59,480 --> 00:06:03,640
 like me do really wild things like
 use voice to tech, uh, technology.
 88
+00:06:03,640 --> 00:06:06,400
 That was basically, um,
 when I started doing it,
 89
+00:06:06,400 --> 00:06:10,160
 I didn't feel like a crazy person
 speaking to myself, and my
 90
+00:06:10,160 --> 00:06:16,000
 expectations weren't that high.
 Uh, I used speech tech now and again.
 91
+00:06:16,080 --> 00:06:18,360
 Um, tried it out.
 I was like, it'd be really cool
 92
+00:06:18,360 --> 00:06:20,400
 if you could just, like,
 speak into your computer.
 93
+00:06:20,760 --> 00:06:24,600
 And whatever I tried out that
 had Linux support was just.
 94
+00:06:25,320 --> 00:06:28,520
 It was not good, basically.
 Um, and this blew me away from
 95
+00:06:28,520 --> 00:06:31,920
 the first go.
 I mean, it wasn't 100% accurate
 96
+00:06:31,960 --> 00:06:35,040
 out of the box and it took work,
 but it was good enough that there was
 97
+00:06:35,040 --> 00:06:39,600
 a solid foundation and it kind of
 passed that, uh, pivot point that
 98
+00:06:39,600 --> 00:06:42,760
 it's actually worth doing this.
 You know, there's a point where
 99
+00:06:42,760 --> 00:06:46,800
 it's so like the transcript is you
 don't have to get 100% accuracy
 100
+00:06:46,800 --> 00:06:50,510
 for it to be worth your time for
 speech to text to be a worthwhile
 101
+00:06:50,510 --> 00:06:52,950
 addition to your productivity.
 But you do need to get above.
 102
+00:06:52,990 --> 00:06:57,630
 Let's say, I don't know, 85%.
 If it's 60% or 50%,
 103
+00:06:57,630 --> 00:07:00,670
 you inevitably say, screw it.
 I'll just type it because you end up
 104
+00:07:00,670 --> 00:07:04,950
 missing errors in the transcript
 and it becomes actually worse.
 105
+00:07:04,950 --> 00:07:06,710
 You end up in a worse position
 than you started with.
 106
+00:07:06,710 --> 00:07:10,910
 And that's been my experience.
 So, um, I was like, oh,
 107
+00:07:10,950 --> 00:07:13,430
 this is actually really, really good.
 Now how did that happen?
 108
+00:07:13,430 --> 00:07:18,790
 And the answer is ASR whisper
 being open sourced and the
 109
+00:07:18,790 --> 00:07:21,790
 transformer architecture,
 if you want to go back to the,
 110
+00:07:22,390 --> 00:07:26,630
 um, to the underpinnings, which
 really blows my mind and it's on my
 111
+00:07:26,630 --> 00:07:32,310
 list to read through that paper.
 Um, all you need is attention as
 112
+00:07:33,350 --> 00:07:38,350
 attentively as can be done with my
 limited brain because it's super,
 113
+00:07:38,350 --> 00:07:42,190
 super high level stuff.
 Um, super advanced stuff.
 114
+00:07:42,230 --> 00:07:47,950
 I mean, uh, but that I think of all
 the things that are fascinating
 115
+00:07:48,060 --> 00:07:52,700
 about the sudden rise in AI and
 the dramatic capabilities.
 116
+00:07:53,300 --> 00:07:55,580
 I find it fascinating that few
 people are like, hang on,
 117
+00:07:55,740 --> 00:07:59,620
 you've got this thing that can speak
 to you like a chatbot, an LLM,
 118
+00:08:00,300 --> 00:08:05,460
 and then you've got image generation.
 Okay, so firstly, those two things on
 119
+00:08:05,460 --> 00:08:10,740
 the surface have nothing in common.
 Um, so like how are they how did that
 120
+00:08:10,740 --> 00:08:12,980
 just happen all at the same time.
 And then when you extend that
 121
+00:08:12,980 --> 00:08:16,060
 further, um, you're like sooner,
 right?
 122
+00:08:16,060 --> 00:08:21,580
 You can sing a song and AI will like,
 come up with an instrumental and then
 123
+00:08:21,580 --> 00:08:23,740
 you've got whisper and you're like,
 wait a second,
 124
+00:08:23,940 --> 00:08:27,980
 how did all this stuff, like,
 if it's all AI, what's like there
 125
+00:08:27,980 --> 00:08:30,580
 has to be some commonality.
 Otherwise these are four.
 126
+00:08:30,660 --> 00:08:34,660
 These are totally different
 technologies on the surface of it.
 127
+00:08:34,660 --> 00:08:40,100
 And, uh, the transformer architecture
 is, as far as I know, the answer.
 128
+00:08:40,100 --> 00:08:43,740
 And I can't even say can't even
 pretend that I really understand
 129
+00:08:44,020 --> 00:08:47,170
 what the transformer
 architecture means in depth,
 130
+00:08:47,170 --> 00:08:51,690
 but I have scanned it and as I said,
 I want to print it and really kind
 131
+00:08:51,690 --> 00:08:56,650
 of think over it at some point,
 and I'll probably feel bad about
 132
+00:08:56,650 --> 00:08:58,970
 myself, I think,
 because weren't those guys in their
 133
+00:08:59,010 --> 00:09:03,890
 in their 20s like, that's crazy.
 I think I asked ChatGPT once who
 134
+00:09:03,930 --> 00:09:08,250
 were the who wrote that paper
 and how old were they when it
 135
+00:09:08,250 --> 00:09:11,170
 was published in arXiv?
 And I was expecting like,
 136
+00:09:11,410 --> 00:09:13,330
 I don't know,
 what do you what do you imagine?
 137
+00:09:13,330 --> 00:09:14,930
 I personally imagine kind of like,
 you know,
 138
+00:09:14,970 --> 00:09:19,090
 you have these breakthroughs during
 Covid and things like that where
 139
+00:09:19,130 --> 00:09:22,090
 like these kind of really obscure
 scientists who are like in their
 140
+00:09:22,090 --> 00:09:27,130
 50s and they've just kind of been
 laboring in labs and, uh, wearily
 141
+00:09:27,130 --> 00:09:30,530
 and writing in publishing in kind
 of obscure academic publications.
 142
+00:09:30,730 --> 00:09:33,930
 And they finally, like,
 hit a big or win a Nobel Prize and
 143
+00:09:33,930 --> 00:09:37,810
 then their household household names.
 Uh, so that was kind of what I
 144
+00:09:37,810 --> 00:09:39,650
 had in mind.
 That was the mental image I'd
 145
+00:09:39,650 --> 00:09:43,890
 formed of the birth of arXiv.
 Like, I wasn't expecting 20
 146
+00:09:43,930 --> 00:09:47,310
 somethings in San Francisco,
 though I thought that was both very,
 147
+00:09:47,310 --> 00:09:49,870
 very funny, very cool,
 and actually kind of inspiring.
 148
+00:09:50,390 --> 00:09:55,510
 It's nice to think that people who,
 you know, just you might put them
 149
+00:09:55,510 --> 00:10:00,910
 in the kind of milieu or bubble or
 world that you are in or credibly in,
 150
+00:10:00,950 --> 00:10:03,590
 through, you know,
 a series of connections that are
 151
+00:10:03,590 --> 00:10:07,630
 coming up with such literally
 world changing, um, innovations.
 152
+00:10:07,670 --> 00:10:11,430
 Uh, so that was, I thought,
 anyway, that, that that was cool.
 153
+00:10:12,070 --> 00:10:13,950
 Okay. Voice training data.
 How are we doing?
 154
+00:10:13,950 --> 00:10:17,990
 We're about ten minutes, and I'm
 still talking about voice technology.
 155
+00:10:18,190 --> 00:10:22,350
 Um, so whisper was brilliant,
 and I was so excited that I was.
 156
+00:10:22,350 --> 00:10:25,630
 My first instinct was to, like,
 get like, oh, my gosh,
 157
+00:10:25,630 --> 00:10:27,710
 I have to get, like,
 a really good microphone for this.
 158
+00:10:27,950 --> 00:10:31,630
 So, um, I didn't go on a
 spending spree because I said,
 159
+00:10:31,670 --> 00:10:34,470
 I'm gonna have to just wait a
 month and see if I still use this.
 160
+00:10:34,910 --> 00:10:39,990
 And it just kind of became it's
 become really part of my daily
 161
+00:10:39,990 --> 00:10:42,990
 routine.
 Like, if I'm writing an email,
 162
+00:10:42,990 --> 00:10:47,020
 I'll record a voice note.
 And then I've developed and it's
 163
+00:10:47,020 --> 00:10:49,900
 nice to see that everyone is
 like developing the same things
 164
+00:10:49,900 --> 00:10:51,900
 in parallel.
 Like, that's kind of a weird thing
 165
+00:10:51,940 --> 00:10:57,340
 to say, but when I look, I kind of
 came when I started working on this,
 166
+00:10:57,380 --> 00:11:00,700
 these prototypes on GitHub,
 which is where I just kind of
 167
+00:11:00,740 --> 00:11:04,740
 share very freely and loosely,
 uh, ideas and, you know,
 168
+00:11:04,780 --> 00:11:10,020
 first iterations on, on concepts,
 um, and for want of a better word,
 169
+00:11:10,020 --> 00:11:13,900
 I called it like, uh,
 lm post-processing or cleanup or
 170
+00:11:14,140 --> 00:11:18,100
 basically a system prompt that after
 you get back the raw text from
 171
+00:11:18,420 --> 00:11:24,100
 whisper, you run it through a model
 and say, okay, this is crappy text,
 172
+00:11:24,140 --> 00:11:27,140
 like add sentence structure and,
 you know, fix it up.
 173
+00:11:27,580 --> 00:11:32,660
 And, um, now when I'm exploring the
 different tools that are out there
 174
+00:11:32,700 --> 00:11:36,580
 that people have built, I see, uh,
 quite a number of projects have
 175
+00:11:37,180 --> 00:11:41,700
 basically done the same thing,
 um, less that be misconstrued.
 176
+00:11:41,700 --> 00:11:44,370
 I'm not saying for a millisecond
 that I inspired them.
 177
+00:11:44,370 --> 00:11:48,890
 I'm sure this has been a thing that's
 been integrated into tools for a
 178
+00:11:48,930 --> 00:11:52,290
 while, but it's it's the kind of
 thing that when you start using these
 179
+00:11:52,290 --> 00:11:56,730
 tools every day, the need for it
 is almost instantly apparent, uh,
 180
+00:11:56,730 --> 00:12:00,770
 because text that doesn't have any
 punctuation or paragraph spacing
 181
+00:12:00,810 --> 00:12:04,250
 takes a long time to, you know,
 it takes so long to get it into
 182
+00:12:04,250 --> 00:12:09,370
 a presentable email that again,
 it's it's it moves speech tech
 183
+00:12:09,410 --> 00:12:12,930
 into that before that inflection
 point where you're like, no,
 184
+00:12:12,930 --> 00:12:16,250
 it's just not worth it.
 It's like it'll just be quicker
 185
+00:12:16,250 --> 00:12:18,850
 to type this.
 So it's a big it's a little touch.
 186
+00:12:18,850 --> 00:12:24,090
 That actually is a big deal.
 Uh, so I was on whisper and I've
 187
+00:12:24,090 --> 00:12:28,170
 been using whisper and I kind of
 early on found a couple of tools.
 188
+00:12:28,210 --> 00:12:30,930
 I couldn't find what I was
 looking for on Linux, which is,
 189
+00:12:31,370 --> 00:12:35,770
 um, basically just something
 that'll run in the background.
 190
+00:12:35,810 --> 00:12:40,130
 You'll give it an API key and it
 will just transcribe. Um.
 191
+00:12:41,280 --> 00:12:44,000
 with, like, a little key to
 start and stop the dictation.
 192
+00:12:44,600 --> 00:12:49,040
 Uh, and the issues were I discovered
 that, like most people involved in
 193
+00:12:49,040 --> 00:12:53,920
 creating these projects were very
 much focused on local models running
 194
+00:12:53,920 --> 00:12:57,400
 whisper locally, because you can.
 And I tried that a bunch of
 195
+00:12:57,400 --> 00:13:00,840
 times and just never got results
 that were as good as the cloud.
 196
+00:13:01,160 --> 00:13:04,640
 And when I began looking at the
 cost of the speech to text APIs
 197
+00:13:04,640 --> 00:13:08,520
 and what I was spending,
 I just thought there's it's actually,
 198
+00:13:08,720 --> 00:13:13,200
 in my opinion, just one of the better
 deals in API spending and in cloud.
 199
+00:13:13,240 --> 00:13:17,280
 Like it's just not that expensive
 for very, very good models that are
 200
+00:13:17,400 --> 00:13:20,840
 much more, you know, you're going
 to be able to run the full model,
 201
+00:13:21,360 --> 00:13:25,960
 the latest model versus whatever
 you can run on your average GPU.
 202
+00:13:26,000 --> 00:13:29,760
 Unless you want to buy a crazy GPU.
 It doesn't really make sense to me.
 203
+00:13:29,760 --> 00:13:33,480
 Now, privacy is another concern.
 Um, that I know is kind of like a
 204
+00:13:33,520 --> 00:13:36,920
 very much a separate thing that
 people just don't want their voice,
 205
+00:13:36,920 --> 00:13:39,790
 data, and their voice leaving
 their local environment,
 206
+00:13:40,110 --> 00:13:43,830
 maybe for regulatory reasons as well.
 Um, but I'm not in that.
 207
+00:13:43,910 --> 00:13:47,910
 Um, I'm neither really care about
 people listening to my, uh,
 208
+00:13:47,950 --> 00:13:51,190
 grocery list consisting of, uh,
 reminding myself that I need to
 209
+00:13:51,230 --> 00:13:54,790
 buy more beer, Cheetos and hummus,
 which is kind of the three,
 210
+00:13:54,990 --> 00:13:59,310
 three staples of my diet.
 Um, during periods of poor nutrition.
 211
+00:13:59,590 --> 00:14:03,310
 Uh, but the kind of stuff that I
 transcribe, it's just not it's not a,
 212
+00:14:03,990 --> 00:14:09,350
 it's not a privacy thing and that
 sort of sensitive about and, uh,
 213
+00:14:09,350 --> 00:14:13,070
 I don't do anything so,
 you know, sensitive or secure,
 214
+00:14:13,070 --> 00:14:16,590
 that requires air gapping.
 So, um, I looked at the pricing and
 215
+00:14:16,590 --> 00:14:20,270
 especially the kind of older models,
 mini, um, some of them are very,
 216
+00:14:20,270 --> 00:14:23,110
 very affordable.
 And I did a back of the I did a
 217
+00:14:23,110 --> 00:14:27,150
 calculation once with ChatGPT
 and I was like, okay, this is a,
 218
+00:14:27,150 --> 00:14:31,070
 this is the API price for I can't
 remember whatever the model was.
 219
+00:14:31,550 --> 00:14:33,910
 Uh, let's say I just go at it
 like nonstop,
 220
+00:14:34,030 --> 00:14:37,410
 which it rarely happens. Probably.
 I would say on average,
 221
+00:14:37,410 --> 00:14:41,890
 I might dictate 30 to 60 minutes per
 day if I was probably summing up
 222
+00:14:41,890 --> 00:14:48,490
 the emails, documents, outlines,
 um, which is a lot, but it's it's
 223
+00:14:48,490 --> 00:14:50,730
 still a fairly modest amount.
 And I was like, well,
 224
+00:14:50,770 --> 00:14:53,930
 some days I do go on like 1 or 2
 days where I've been.
 225
+00:14:54,450 --> 00:14:58,450
 Usually when I'm like kind of out of
 the house and just have something
 226
+00:14:59,090 --> 00:15:02,250
 like, I have nothing else to do.
 Like if I'm at a hospital with a
 227
+00:15:02,250 --> 00:15:06,970
 newborn, uh, and you're waiting
 for like eight hours and hours
 228
+00:15:06,970 --> 00:15:10,210
 for an appointment, and I would
 probably have listened to podcasts
 229
+00:15:10,490 --> 00:15:14,010
 before becoming a speech fanatic.
 And I'm like, oh, wait,
 230
+00:15:14,050 --> 00:15:16,370
 let me just get down.
 Let me just get these ideas out
 231
+00:15:16,410 --> 00:15:19,610
 of my head.
 And that's when I'll go on my
 232
+00:15:19,650 --> 00:15:21,530
 speech binges.
 But those are like once every
 233
+00:15:21,530 --> 00:15:24,970
 few months, like not frequently.
 But I said, okay, let's just say
 234
+00:15:24,970 --> 00:15:30,650
 if I'm gonna price out.
 Cloud asked if I was like, dedicated
 235
+00:15:30,650 --> 00:15:36,880
 every second of every waking hour to
 transcribing for some odd reason. Um.
 236
+00:15:37,200 --> 00:15:39,680
 I mean, it'd have to, like,
 eat and use the toilet and,
 237
+00:15:39,720 --> 00:15:42,520
 like, you know, there's only so
 many hours I'm awake for.
 238
+00:15:42,520 --> 00:15:44,680
 So, like,
 let's just say a maximum of, like,
 239
+00:15:44,720 --> 00:15:48,680
 40 hours, 45 minutes in the hour.
 Then I said, all right,
 240
+00:15:48,680 --> 00:15:52,600
 let's just say 50. Who knows?
 You're dictating on the toilet.
 241
+00:15:52,640 --> 00:15:53,880
 We do it.
 Uh,
 242
+00:15:53,880 --> 00:15:58,720
 so it could be you could just do 60.
 But whatever I did, and every day,
 243
+00:15:58,760 --> 00:16:02,440
 like, you're going flat out seven
 days a week dictating non-stop.
 244
+00:16:02,480 --> 00:16:06,440
 I was like, what's my monthly API
 bill going to be at this price?
 245
+00:16:06,720 --> 00:16:09,120
 And it came out to like 70 or 80
 bucks.
 246
+00:16:09,120 --> 00:16:14,080
 And I was like, well, that would be
 an extraordinary amount of dictation.
 247
+00:16:14,080 --> 00:16:17,840
 And I would hope that there was
 some compelling reason,
 248
+00:16:18,040 --> 00:16:22,200
 more worth more than $70,
 that I embarked upon that project.
 249
+00:16:22,400 --> 00:16:25,200
 Uh, so given that that's kind of the
 max point for me, I said, that's
 250
+00:16:25,240 --> 00:16:29,000
 actually very, very affordable.
 Um, now you're gonna if you want
 251
+00:16:29,040 --> 00:16:34,080
 to spec out the costs and you want
 to do the post-processing that I
 252
+00:16:34,150 --> 00:16:37,110
 really do feel is valuable.
 Um, that's going to cost some more as
 253
+00:16:37,110 --> 00:16:43,110
 well, unless you're using Gemini,
 which, uh, needless to say, is a
 254
+00:16:43,110 --> 00:16:46,950
 random person sitting in Jerusalem.
 Uh, I have no affiliation,
 255
+00:16:46,950 --> 00:16:51,350
 nor with Google, nor anthropic,
 nor Gemini, nor any major tech vendor
 256
+00:16:51,350 --> 00:16:56,790
 for that matter. Um, I like Gemini.
 Not so much as a everyday model.
 257
+00:16:56,870 --> 00:16:59,830
 Um, it's kind of underwhelmed in
 that respect, I would say.
 258
+00:17:00,230 --> 00:17:03,030
 But for multimodal,
 I think it's got a lot to offer.
 259
+00:17:03,310 --> 00:17:06,870
 And I think that the transcribing
 functionality whereby it can,
 260
+00:17:07,270 --> 00:17:12,150
 um, process audio with a system
 prompt and both give you
 261
+00:17:12,190 --> 00:17:15,390
 transcription that's cleaned up,
 that reduces two steps to one.
 262
+00:17:15,710 --> 00:17:18,630
 And that for me is a very,
 very big deal.
 263
+00:17:18,630 --> 00:17:22,990
 And, uh, I feel like even Google
 has haven't really sort of thought
 264
+00:17:22,990 --> 00:17:27,430
 through how useful the that
 modality is and what kind of use
 265
+00:17:27,430 --> 00:17:30,790
 cases you can achieve with it.
 Because I found in the course of
 266
+00:17:30,790 --> 00:17:36,490
 this year just an endless list
 of really kind of system prompt,
 267
+00:17:36,730 --> 00:17:41,290
 system prompt stuff that I can say,
 okay, I've used it to capture context
 268
+00:17:41,290 --> 00:17:45,570
 data for AI, which is literally I
 might speak for if I wanted to have a
 269
+00:17:45,570 --> 00:17:49,730
 good bank of context data about,
 who knows, my childhood.
 270
+00:17:50,010 --> 00:17:53,450
 Uh, more realistically,
 maybe my career goals, uh,
 271
+00:17:53,450 --> 00:17:56,010
 something that would just be,
 like, really boring to type out.
 272
+00:17:56,130 --> 00:18:01,130
 So I'll just, like, sit in my car
 and record it for ten minutes.
 273
+00:18:01,130 --> 00:18:04,090
 And that ten minutes,
 you get a lot of information in,
 274
+00:18:04,530 --> 00:18:10,090
 um, emails, which is short text.
 Um, just there is a whole bunch.
 275
+00:18:10,090 --> 00:18:13,570
 And all these workflows kind of
 require a little bit of treatment
 276
+00:18:13,570 --> 00:18:17,490
 afterwards and different treatment.
 My context pipeline is kind of like
 277
+00:18:17,490 --> 00:18:21,210
 just extract the bare essentials.
 So you end up with me talking very
 278
+00:18:21,210 --> 00:18:24,250
 loosely about sort of what I've done
 in my career, where I've worked,
 279
+00:18:24,250 --> 00:18:27,610
 where I might like to work,
 and it goes it condenses that
 280
+00:18:27,610 --> 00:18:31,600
 down to very robotic language
 that is easy to chunk, parse,
 281
+00:18:31,600 --> 00:18:35,960
 and maybe put into a vector database.
 Daniel has worked in technology,
 282
+00:18:36,000 --> 00:18:39,640
 Daniel is a has been working in,
 you know, stuff like that.
 283
+00:18:39,640 --> 00:18:43,600
 That's not how you would speak.
 Um, but I figure it's probably easier
 284
+00:18:43,600 --> 00:18:48,120
 to parse for, after all, robots.
 So we've almost got to 20 minutes.
 285
+00:18:48,120 --> 00:18:52,640
 And this is actually a success
 because I wasted 20 minutes of my,
 286
+00:18:52,800 --> 00:18:56,880
 uh, of the evening speaking into
 a microphone, and, uh,
 287
+00:18:56,920 --> 00:19:00,840
 the levels were shot and, uh, it,
 uh, it was clipping and I said,
 288
+00:19:00,840 --> 00:19:03,200
 I can't really do an evaluation.
 I have to be fair.
 289
+00:19:03,200 --> 00:19:07,000
 I have to give the models a
 chance to do their thing.
 290
+00:19:07,520 --> 00:19:09,360
 Uh,
 what am I hoping to achieve in this?
 291
+00:19:09,400 --> 00:19:12,600
 Okay, my fine tune was a dud,
 as mentioned Deepgram SVT.
 292
+00:19:12,640 --> 00:19:15,520
 I'm really, really hopeful that
 this prototype will work.
 293
+00:19:15,800 --> 00:19:18,960
 And it's a built in public open
 source, so anyone is welcome to
 294
+00:19:19,000 --> 00:19:22,920
 use it if I make anything good.
 Um, but that was really exciting for
 295
+00:19:22,920 --> 00:19:27,400
 me last night when after hours of,
 um, trying my own prototype,
 296
+00:19:27,400 --> 00:19:31,230
 seeing someone just made
 something that works like that.
 297
+00:19:31,270 --> 00:19:32,670
 You know,
 you're not going to have to build a
 298
+00:19:32,670 --> 00:19:38,230
 custom conda environment and image.
 I have AMD GPU, which makes
 299
+00:19:38,230 --> 00:19:42,310
 things much more complicated.
 I didn't find it and I was about
 300
+00:19:42,310 --> 00:19:43,990
 to give up and I said,
 all right, let me just give deep
 301
+00:19:43,990 --> 00:19:48,750
 grams Linux thing a shot.
 And if this doesn't work, um,
 302
+00:19:48,750 --> 00:19:51,150
 I'm just going to go back to
 trying to code something myself.
 303
+00:19:51,510 --> 00:19:56,190
 And when I ran the script,
 I was using cloud code to do the
 304
+00:19:56,190 --> 00:20:00,030
 installation process.
 It ran the script and oh my gosh,
 305
+00:20:00,070 --> 00:20:05,350
 it works just like that.
 Uh, the tricky thing for all those
 306
+00:20:05,350 --> 00:20:10,310
 who wants to know all the nitty
 gritty, nitty gritty details, um, was
 307
+00:20:10,310 --> 00:20:13,750
 that I don't think it was actually
 struggling with transcription, but
 308
+00:20:13,750 --> 00:20:18,550
 pasting Wayland makes life very hard,
 and I think there was something not
 309
+00:20:18,550 --> 00:20:21,870
 running in the right time anyway.
 Deepgram I looked at how they
 310
+00:20:21,870 --> 00:20:24,710
 actually handle that because it
 worked out of the box when other
 311
+00:20:24,710 --> 00:20:29,140
 stuff didn't, and it was quite a
 clever little mechanism,
 312
+00:20:29,460 --> 00:20:32,100
 and but more so than that,
 the accuracy was brilliant.
 313
+00:20:32,140 --> 00:20:35,020
 Now, what am I doing here?
 This is going to be a 20 minute
 314
+00:20:35,260 --> 00:20:42,980
 audio sample, and I'm I think
 I've done 1 or 2 of these before,
 315
+00:20:42,980 --> 00:20:49,180
 but I did it with short, snappy voice
 notes. This is kind of long form.
 316
+00:20:49,460 --> 00:20:51,740
 This actually might be a better
 approximation for what's useful
 317
+00:20:51,740 --> 00:20:56,100
 to me than voice memos.
 Like I need to buy three liters
 318
+00:20:56,100 --> 00:20:59,180
 of milk tomorrow, and pita bread,
 which is probably how like half
 319
+00:20:59,180 --> 00:21:02,820
 my voice voice notes sound like
 if anyone were to, I don't know,
 320
+00:21:02,860 --> 00:21:04,580
 like find my phone,
 they'd be like, this is the most
 321
+00:21:04,580 --> 00:21:07,420
 boring person in the world.
 Although actually there are some
 322
+00:21:07,460 --> 00:21:09,700
 like kind of, uh,
 journaling thoughts as well.
 323
+00:21:09,700 --> 00:21:13,700
 But it's a lot of content like that.
 And the probably for the evaluation,
 324
+00:21:13,700 --> 00:21:20,660
 the most useful thing is slightly
 obscure tech GitHub uh, hugging face
 325
+00:21:21,180 --> 00:21:24,660
 not so obscure that it's not going
 to have a chance of knowing it,
 326
+00:21:24,660 --> 00:21:27,640
 but hopefully sufficiently well
 known that the model should get it.
 327
+00:21:28,200 --> 00:21:30,760
 I tried to do a little bit of
 speaking really fast and
 328
+00:21:30,760 --> 00:21:33,200
 speaking very slowly.
 I would say in general,
 329
+00:21:33,200 --> 00:21:36,880
 I've spoken, delivered this at a
 faster pace than I usually would
 330
+00:21:36,920 --> 00:21:40,280
 owing to strong coffee flowing
 through my bloodstream.
 331
+00:21:40,920 --> 00:21:44,200
 And the thing that I'm not going
 to get in this benchmark is
 332
+00:21:44,200 --> 00:21:46,880
 background noise, which in my first
 take that I had to get rid of,
 333
+00:21:47,680 --> 00:21:51,240
 my wife came in with my son and
 for a good night kiss.
 334
+00:21:51,440 --> 00:21:55,120
 And that actually would have
 been super helpful to get in
 335
+00:21:55,120 --> 00:21:59,760
 because it was not diarised.
 Or if we had diarisation a female,
 336
+00:21:59,880 --> 00:22:02,280
 I could say I want the male
 voice and that wasn't intended
 337
+00:22:02,280 --> 00:22:05,280
 for transcription.
 Um, and we're not going to get
 338
+00:22:05,280 --> 00:22:06,960
 background noise like people
 honking their horns,
 339
+00:22:06,960 --> 00:22:11,280
 which is something I've done in my
 main data set where I am trying to
 340
+00:22:11,440 --> 00:22:15,520
 go back to some of my voice notes,
 annotate them, and run a benchmark.
 341
+00:22:15,520 --> 00:22:18,960
 But this is going to be just a
 pure quick test.
 342
+00:22:19,440 --> 00:22:23,880
 And as someone I'm working on a
 voice note idea,
 343
+00:22:23,880 --> 00:22:28,230
 that's my sort of end motivation.
 Besides thinking it's an
 344
+00:22:28,230 --> 00:22:31,590
 absolutely outstanding technology
 that's coming to viability.
 345
+00:22:31,590 --> 00:22:34,670
 And really, I know this sounds
 cheesy can actually have a very
 346
+00:22:34,670 --> 00:22:38,830
 transformative effect.
 Um, it's, you know, voice technology
 347
+00:22:38,870 --> 00:22:44,910
 has been life changing for, uh,
 folks living with, um, disabilities.
 348
+00:22:45,630 --> 00:22:48,550
 And I think there's something
 really nice about the fact that
 349
+00:22:48,550 --> 00:22:52,710
 it can also benefit, you know,
 folks who are able bodied and like,
 350
+00:22:52,750 --> 00:22:58,950
 we can all in different ways, um,
 make this tech as useful as possible,
 351
+00:22:58,990 --> 00:23:01,110
 regardless of the exact way that
 we're using it.
 352
+00:23:01,510 --> 00:23:04,710
 Um, and I think there's something
 very powerful in that, and it can be
 353
+00:23:04,710 --> 00:23:08,910
 very cool. Um, I see use potential.
 What excites me about voice tech?
 354
+00:23:09,750 --> 00:23:13,550
 A lot of things, actually.
 Firstly, the fact that it's cheap
 355
+00:23:13,550 --> 00:23:17,110
 and accurate, as I mentioned at
 the very start of this, um,
 356
+00:23:17,110 --> 00:23:20,790
 and it's getting better and better
 with stuff like accent handling, um,
 357
+00:23:20,790 --> 00:23:24,180
 I'm not sure my, my fine tune will
 actually ever come to fruition in the
 358
+00:23:24,180 --> 00:23:27,860
 sense that I'll use it day to day,
 as I imagine I get like superb,
 359
+00:23:27,860 --> 00:23:33,540
 flawless word error rates because I'm
 just kind of skeptical about local
 360
+00:23:33,540 --> 00:23:38,100
 speech to texts, as I mentioned.
 And I think the pace of innovation
 361
+00:23:38,100 --> 00:23:42,060
 and improvement in the models,
 the main reasons for fine tuning from
 362
+00:23:42,060 --> 00:23:46,340
 what I've seen have been people who
 are something that really blows,
 363
+00:23:46,380 --> 00:23:52,940
 blows my mind about ASR is the idea
 that it's inherently a lingual
 364
+00:23:52,940 --> 00:23:59,100
 or multilingual phonetic based.
 So as folks who use speak very
 365
+00:23:59,140 --> 00:24:02,220
 obscure languages that there may
 be there might be a paucity of
 366
+00:24:02,220 --> 00:24:05,500
 training data or almost none at all,
 and therefore the accuracy is
 367
+00:24:05,500 --> 00:24:10,660
 significantly reduced or folks
 in very critical environments.
 368
+00:24:10,700 --> 00:24:13,380
 I know there are.
 This is used extensively in medical
 369
+00:24:13,380 --> 00:24:18,140
 transcription and dispatcher work as,
 um, you know, the call centers who
 370
+00:24:18,140 --> 00:24:22,490
 send out ambulances, etc., where
 accuracy is absolutely paramount.
 371
+00:24:22,490 --> 00:24:26,050
 And in the case of doctors,
 radiologists, they might be using
 372
+00:24:26,050 --> 00:24:29,610
 very specialized vocab all the time.
 So those are kind of the main
 373
+00:24:29,610 --> 00:24:31,530
 two things.
 And I'm not sure that really just for
 374
+00:24:31,530 --> 00:24:37,290
 trying to make it better on a few
 random tech words with my slightly.
 375
+00:24:37,330 --> 00:24:41,250
 I mean, I have an accent, but like,
 not, you know, an accent that a few
 376
+00:24:41,290 --> 00:24:47,210
 other million people have. Ish.
 I'm not sure that my little fine
 377
+00:24:47,210 --> 00:24:52,250
 tune is going to actually like the
 bump in word error rate reduction.
 378
+00:24:52,250 --> 00:24:54,570
 If I ever actually figure out how
 to do it and get it up to the
 379
+00:24:54,570 --> 00:24:58,610
 cloud by the time I've done that.
 I suspect that the next
 380
+00:24:58,610 --> 00:25:01,410
 generation of ASR will just be
 so good that it will kind of be.
 381
+00:25:01,930 --> 00:25:03,770
 Ah, well,
 that would be cool if it worked out,
 382
+00:25:03,770 --> 00:25:08,730
 but I'll just use this instead.
 So that's going to be it for today's
 383
+00:25:08,730 --> 00:25:14,130
 episode of, uh, voice training data.
 Single long shot evaluation.
 384
+00:25:14,410 --> 00:25:17,330
 Who am I going to compare?
 Whisper is always good as a
 385
+00:25:17,330 --> 00:25:20,600
 benchmark, but I'm more
 interested in seeing Whisperer
 386
+00:25:20,600 --> 00:25:25,080
 head to head with two things,
 really. One is whisper variance.
 387
+00:25:25,080 --> 00:25:29,880
 So you've got these projects like
 faster Whisper, Still whisper.
 388
+00:25:29,880 --> 00:25:31,640
 It's a bit confusing.
 There's a whole bunch of them
 389
+00:25:31,920 --> 00:25:34,800
 and the emerging acers,
 which are also a thing.
 390
+00:25:35,200 --> 00:25:37,680
 My intention for this is I'm not
 sure I'm going to have the time
 391
+00:25:37,680 --> 00:25:41,640
 in any point in the foreseeable
 future to go back through this whole
 392
+00:25:41,640 --> 00:25:46,560
 episode and create a proper source,
 truth or a fix.
 393
+00:25:47,320 --> 00:25:51,680
 Everything might do it if I can
 get one transcription that
 394
+00:25:51,680 --> 00:25:56,720
 sufficiently close to perfection.
 But what I would actually love
 395
+00:25:56,720 --> 00:25:59,800
 to do on Hugging Face I think
 would be a great.
 396
+00:25:59,800 --> 00:26:03,560
 Probably how I might visualize this
 is having the audio waveform play,
 397
+00:26:04,040 --> 00:26:09,800
 and then have the transcript for each
 model below it, and maybe even a,
 398
+00:26:10,480 --> 00:26:15,120
 um, like, you know, two scale and
 maybe even a local one as well,
 399
+00:26:15,160 --> 00:26:21,700
 like local whisper versus open
 AI API, Etc. and, um, I can then
 400
+00:26:21,700 --> 00:26:24,380
 actually listen back to segments
 or anyone who wants to can listen
 401
+00:26:24,380 --> 00:26:29,420
 back to segments of this recording
 and see where a particular model
 402
+00:26:29,460 --> 00:26:32,940
 struggled and others didn't, as well
 as the sort of headline finding
 403
+00:26:32,980 --> 00:26:36,780
 of which had the best, uh, wer.
 But that would require the source
 404
+00:26:36,780 --> 00:26:40,020
 of truth. Okay. That's it.
 Hope this was, I don't know,
 405
+00:26:40,180 --> 00:26:43,460
 maybe useful for other folks
 interested in stuff you want to see.
 406
+00:26:43,940 --> 00:26:48,100
 I always feel think I've just said
 something I didn't intend to say.
 407
+00:26:48,660 --> 00:26:51,020
 I said for those, listen carefully.
 Including, hopefully,
 408
+00:26:51,020 --> 00:26:54,060
 the models themselves.
 This has been myself,
 409
+00:26:54,100 --> 00:26:57,900
 Daniel Rosehill, for more, um,
 jumbled repositories about my,
 410
+00:26:57,940 --> 00:27:00,820
 uh, roving interest in AI,
 but particularly Agentic,
 411
+00:27:01,180 --> 00:27:05,340
 MCP and voice tech.
 Uh, you can find me on GitHub.
 412
+00:27:05,820 --> 00:27:11,140
 Hugging face. Where else?
 Daniel, which is my personal website,
 413
+00:27:11,140 --> 00:27:15,260
 as well as this podcast whose
 name I sadly cannot remember.
 414
+00:27:15,700 --> 00:27:17,420
 Until next time.
 Thanks for listening.