STT-Comparison / srt-out /assembly.srt
danielrosehill's picture
Fix SRT timestamp alignment with ground truth
0aa8adc
1
00:00:00,000 --> 00:00:05,600
Hello and welcome to a audio data set consisting
2
00:00:05,600 --> 00:00:10,560
of one single episode of a non-existent podcast. Or I
3
00:00:10,640 --> 00:00:13,280
may append this to a podcast that I set up
4
00:00:13,520 --> 00:00:19,120
recently regarding my with my thoughts on speech
5
00:00:19,200 --> 00:00:23,920
tech and AI in particular, more AI in generative AI,
6
00:00:24,160 --> 00:00:28,560
I would say. But in any event, the purpose of
7
00:00:28,640 --> 00:00:33,770
this Voice recording is actually to create a lengthy
8
00:00:33,850 --> 00:00:37,050
voice sample for a quick evaluation, a back of the
9
00:00:37,050 --> 00:00:40,570
envelope evaluation, as they might say, for different speech attack
10
00:00:40,810 --> 00:00:43,370
models. And I'm doing this because I thought I had
11
00:00:43,370 --> 00:00:46,730
made a great breakthrough in my journey with speech tech,
12
00:00:47,050 --> 00:00:50,650
and that was succeeding in the elusive task of fine-tuning
13
00:00:50,650 --> 00:00:54,730
Whisper. Whisper is, and I'm going to just talk, I'm
14
00:00:54,810 --> 00:00:58,170
trying to mix up, I'm going to try a few
15
00:00:58,330 --> 00:01:01,450
different styles of speaking. I might whisper something at some
16
00:01:01,530 --> 00:01:04,800
point. As well. And I'll go back to speaking loud
17
00:01:04,880 --> 00:01:08,000
in, in different parts. I'm going to sound really like
18
00:01:08,080 --> 00:01:11,040
a crazy person because I'm also going to try to
19
00:01:11,200 --> 00:01:16,160
speak at different pitches and cadences in order to really
20
00:01:16,480 --> 00:01:20,480
try to put a speech attacks model through its paces,
21
00:01:20,640 --> 00:01:22,960
which is trying to make sense of is this guy
22
00:01:23,120 --> 00:01:27,980
just rambling on incoherently in one long sentence or are
23
00:01:28,380 --> 00:01:34,140
these just actually a series of step, standalone,
24
00:01:34,300 --> 00:01:37,340
step alone, standalone sentences? And how is it gonna handle
25
00:01:37,420 --> 00:01:40,380
step alone? That's not a word. What happens when you
26
00:01:40,460 --> 00:01:42,940
use speech to text and you use a fake word?
27
00:01:43,100 --> 00:01:45,500
And then you're like, wait, that's not actually, that word
28
00:01:45,660 --> 00:01:50,140
doesn't exist. How does AI handle that? And these and
29
00:01:50,380 --> 00:01:54,220
more are all the questions that I'm seeking to answer
30
00:01:54,380 --> 00:01:57,420
in this training data. Now, why was it trying to
31
00:01:57,420 --> 00:02:00,210
fine tune Whisper? And what is Whisper? As I said,
32
00:02:00,290 --> 00:02:02,930
I'm going to try to record this at a couple
33
00:02:03,090 --> 00:02:07,410
of different levels of technicality for folks who are, you
34
00:02:07,410 --> 00:02:11,650
know, in the normal world and not totally stuck down
35
00:02:11,730 --> 00:02:13,730
the rabbit hole of AI, which I have to say
36
00:02:13,890 --> 00:02:18,050
is a really wonderful rabbit hole to be down. It's
37
00:02:18,130 --> 00:02:21,490
a really interesting area and speech and voice tech is
38
00:02:21,890 --> 00:02:24,530
the aspect of it that I find actually the most,
39
00:02:24,930 --> 00:02:27,330
I'm not sure I would say the most interesting because
40
00:02:27,570 --> 00:02:31,290
there's just so much that is fascinating in AI. But
41
00:02:31,450 --> 00:02:34,250
the most that I find the most personally transformative in
42
00:02:34,330 --> 00:02:38,890
terms of the impact that it's had on my daily
43
00:02:38,970 --> 00:02:41,450
work life and productivity and how I sort of work.
44
00:02:42,090 --> 00:02:47,210
And I'm persevering hard with the task of trying
45
00:02:47,210 --> 00:02:50,250
to get a good solution working for Linux, which if
46
00:02:50,250 --> 00:02:52,250
anyone actually does listen to this, not just for the
47
00:02:52,250 --> 00:02:56,410
training data and for the actual content, this is sparked
48
00:02:56,750 --> 00:02:59,950
I had, besides the fine tune not working, well, that
49
00:03:00,030 --> 00:03:05,230
was the failure. Um, I used Claude code because one
50
00:03:05,470 --> 00:03:09,950
thinks these days that there is nothing short of solving,
51
00:03:10,990 --> 00:03:15,390
you know, the, the reason of life or something, that
52
00:03:15,790 --> 00:03:18,990
Claude and agentic AI can't do, which is not really
53
00:03:19,070 --> 00:03:22,190
the case. Uh, it does seem that way sometimes, but
54
00:03:22,350 --> 00:03:24,190
it fails a lot as well. And this is one
55
00:03:24,190 --> 00:03:27,630
of those, instances where last week I put together an
56
00:03:27,710 --> 00:03:32,010
hour of voice training data, basically speaking, just random things
57
00:03:32,250 --> 00:03:37,050
for 3 minutes. And it was actually kind of tedious
58
00:03:37,130 --> 00:03:39,210
because the texts were really weird. Some of them were
59
00:03:39,450 --> 00:03:43,050
it was like it was AI generated. I tried before
60
00:03:43,210 --> 00:03:45,130
to read Sherlock Holmes for an hour and I just
61
00:03:45,130 --> 00:03:48,330
couldn't. I was so bored after 10 minutes that I
62
00:03:48,330 --> 00:03:50,730
was like, okay, no, I'm just going to have to
63
00:03:50,730 --> 00:03:55,290
find something else to read. So I used a created
64
00:03:55,690 --> 00:04:01,280
with AI studio vibe coded a synthetic text generator. Which
65
00:04:01,600 --> 00:04:03,840
actually I thought was probably a better way of doing
66
00:04:03,920 --> 00:04:07,440
it because it would give me more short samples with
67
00:04:07,680 --> 00:04:10,480
more varied content. So I was like, okay, give me
68
00:04:10,880 --> 00:04:13,760
a voice note, like I'm recording an email, give me
69
00:04:14,000 --> 00:04:17,680
a short story to read, give me prose to read.
70
00:04:18,000 --> 00:04:20,400
So I came up with all these different things and
71
00:04:20,560 --> 00:04:22,560
they added a little timer to it so I could
72
00:04:22,720 --> 00:04:26,400
see how close I was to one hour. And I
73
00:04:26,560 --> 00:04:29,600
spent like an hour one afternoon or probably two hours
74
00:04:29,760 --> 00:04:33,330
by the time you you do retakes. And whatever, because
75
00:04:33,410 --> 00:04:36,610
you want to, it gave me a source of truth,
76
00:04:37,330 --> 00:04:40,050
which I'm not sure if that's the scientific way to
77
00:04:40,210 --> 00:04:44,210
approach this topic of gathering, training data, but I thought
78
00:04:44,450 --> 00:04:48,130
made sense. Um, I have a lot of audio data
79
00:04:48,210 --> 00:04:50,770
from recording voice notes, which I've also kind of used,
80
00:04:52,050 --> 00:04:55,810
been experimenting with using for a different purpose, slightly different
81
00:04:56,210 --> 00:05:01,410
annotating task types. It's more a text classification experiment
82
00:05:01,730 --> 00:05:04,160
or, Well, it's more than that actually. I'm working on
83
00:05:04,160 --> 00:05:08,080
a voice app. So it's a prototype, I guess, is
84
00:05:08,240 --> 00:05:12,720
really more accurate. But you can do that and you
85
00:05:12,720 --> 00:05:15,200
can work backwards. You're like, you listen back to a
86
00:05:15,200 --> 00:05:18,720
voice note and you painfully go through one of those
87
00:05:19,040 --> 00:05:21,840
transcribing, you know, where you start and stop and scrub
88
00:05:22,000 --> 00:05:23,920
around it and you fix the errors, but it's really,
89
00:05:24,080 --> 00:05:26,720
really boring to do that. So I thought it would
90
00:05:26,800 --> 00:05:29,040
be less tedious in the long term if I just
91
00:05:30,059 --> 00:05:32,940
recorded the source of truth. So it gave me these
92
00:05:33,020 --> 00:05:36,140
three minute snippets. I recorded them. It saved an MP3
93
00:05:36,380 --> 00:05:39,500
and a TXT in the same folder, and I created
94
00:05:39,580 --> 00:05:42,860
an error with that data. So I was very hopeful,
95
00:05:43,260 --> 00:05:46,860
quietly, a little bit hopeful that I could actually fine
96
00:05:46,940 --> 00:05:50,460
tune Whisper. I want to fine tune Whisper because when
97
00:05:50,540 --> 00:05:54,780
I got into Voicetech last November, my wife was in
98
00:05:54,780 --> 00:05:58,140
the US and I was alone at home. And when
99
00:05:58,600 --> 00:06:01,400
crazy people like me do really wild things like use
100
00:06:01,640 --> 00:06:06,120
voice to tech technology. That was basically when I started
101
00:06:06,200 --> 00:06:08,760
doing it, I didn't feel like a crazy person speaking
102
00:06:08,840 --> 00:06:13,720
to myself. And my expectations weren't that high. I used
103
00:06:14,280 --> 00:06:17,640
speech tech now and again, tried it out. It was
104
00:06:17,640 --> 00:06:19,160
like, it'd be really cool if you could just, like,
105
00:06:19,320 --> 00:06:22,760
speak into your computer. And whatever I tried out that
106
00:06:23,000 --> 00:06:26,590
had Linux support was just. It was not good, basically.
107
00:06:27,230 --> 00:06:29,470
And this blew me away from the first go. I
108
00:06:29,470 --> 00:06:32,750
mean, it wasn't 100% accurate out of the box and
109
00:06:32,830 --> 00:06:34,910
it took work, but it was good enough that there
110
00:06:34,990 --> 00:06:37,470
was a solid foundation and it kind of passed that
111
00:06:38,670 --> 00:06:41,870
pivot point that it's actually worth doing this. You know,
112
00:06:42,030 --> 00:06:44,670
there's a point where it's so like the transcript is
113
00:06:44,910 --> 00:06:47,310
you don't have to get 100% accuracy for it to
114
00:06:47,310 --> 00:06:50,030
be worth your time for speech attacks to be a
115
00:06:50,030 --> 00:06:52,430
worthwhile addition to your productivity, but you do need to
116
00:06:52,430 --> 00:06:55,970
get above, let's say, I don't know, 85%. If it's
117
00:06:56,130 --> 00:06:59,810
60% or 50%, you inevitably say, screw it, I'll just
118
00:06:59,810 --> 00:07:02,770
type it because you end up missing errors in the
119
00:07:02,770 --> 00:07:05,490
transcript and it becomes actually worse. You end up in
120
00:07:05,490 --> 00:07:07,570
a worse position than you started with. That's been my
121
00:07:07,650 --> 00:07:11,970
experience. So I was like, oh, this is actually really,
122
00:07:12,130 --> 00:07:13,970
really good now. How did that happen? And the answer
123
00:07:14,130 --> 00:07:19,410
is ASR whisper being open source and the transformer
124
00:07:19,410 --> 00:07:23,170
architecture. If you want to go back to the to
125
00:07:23,250 --> 00:07:26,370
the underpinnings, which really blows my mind and it's on
126
00:07:26,450 --> 00:07:30,680
my list. To read through that paper. All you need
127
00:07:30,760 --> 00:07:35,960
is attention as attentively as can be done
128
00:07:36,200 --> 00:07:39,320
with my limited brain because it's super, super high level
129
00:07:39,640 --> 00:07:44,520
stuff, super advanced stuff, I mean. But that, I think
130
00:07:44,680 --> 00:07:49,320
of all the things that are fascinating about the sudden
131
00:07:49,640 --> 00:07:53,700
rise in AI and the dramatic capabilities. I find it
132
00:07:53,700 --> 00:07:56,100
fascinating that a few people are like, hang on, you've
133
00:07:56,100 --> 00:07:58,420
got this thing that can speak to you, like a
134
00:07:58,420 --> 00:08:02,980
chatbot, an LLM, and then you've got image generation. Okay,
135
00:08:03,060 --> 00:08:06,580
so firstly, those two things on the surface have nothing
136
00:08:06,900 --> 00:08:10,740
in common. So like, how are they, how did that
137
00:08:10,900 --> 00:08:12,500
just happen all at the same time? And then when
138
00:08:12,500 --> 00:08:16,580
you extend that further, you're like, Suno, right? You can
139
00:08:17,060 --> 00:08:20,030
sing a song and AI will come up with and
140
00:08:20,190 --> 00:08:23,390
instrumental. And then you've got Whisper and you're like, wait
141
00:08:23,390 --> 00:08:25,870
a second, how did all this stuff, like, if it's
142
00:08:25,870 --> 00:08:29,230
all AI, what's like, there has to be some commonality.
143
00:08:29,470 --> 00:08:34,590
Otherwise, these are totally different technologies on the surface of
144
00:08:34,590 --> 00:08:38,830
it. And the Transformer architecture is, as far as I
145
00:08:38,910 --> 00:08:41,550
know, the answer. And I can't even say, can't even
146
00:08:41,630 --> 00:08:46,270
pretend that I really understand what the Transformer architecture means.
147
00:08:46,770 --> 00:08:49,250
In depth, but I have scanned it and as I
148
00:08:49,410 --> 00:08:51,810
said, I want to print it and really kind of
149
00:08:52,210 --> 00:08:56,050
think over it at some point. And I'll probably feel
150
00:08:56,290 --> 00:08:59,250
bad about myself, I think, because weren't those guys in
151
00:08:59,330 --> 00:09:03,410
their 20s? Like, that's crazy. I think I asked ChatGPT
152
00:09:03,490 --> 00:09:07,890
once who wrote that paper and how old were they
153
00:09:08,050 --> 00:09:10,770
when it was published in Arciv? And I was expecting,
154
00:09:11,010 --> 00:09:13,890
like, I don't know, What do you imagine? I personally
155
00:09:13,970 --> 00:09:16,210
imagine kind of like, you know, you have these breakthroughs
156
00:09:16,370 --> 00:09:19,810
during COVID and things like that where like these kind
157
00:09:19,890 --> 00:09:22,770
of really obscure scientists are like in their 50s and
158
00:09:22,770 --> 00:09:27,170
they've just kind of been laboring in labs and wearily
159
00:09:27,170 --> 00:09:30,450
in writing and publishing in kind of obscure academic publications.
160
00:09:30,770 --> 00:09:33,170
And they finally like hit a big or win a
161
00:09:33,170 --> 00:09:37,250
Nobel Prize and then their household names. So that was
162
00:09:37,330 --> 00:09:38,990
kind of what I had in mind. That was the
163
00:09:38,990 --> 00:09:42,990
mental image I'd formed of the birth of Arcsight. Like
164
00:09:42,990 --> 00:09:46,270
I wasn't expecting 20-somethings in San Francisco, though. I thought
165
00:09:46,350 --> 00:09:48,830
that was both very, very funny, very cool, and actually
166
00:09:48,990 --> 00:09:52,510
kind of inspiring. It's nice to think that people who,
167
00:09:53,310 --> 00:09:56,110
you know, just you might put them in the kind
168
00:09:56,190 --> 00:09:59,550
of milieu or bubble or world that you are in
169
00:09:59,630 --> 00:10:03,230
are credibly in through, you know, the series of connections
170
00:10:03,310 --> 00:10:07,390
that are coming up with such literally world changing innovations.
171
00:10:07,870 --> 00:10:11,460
So that was, I thought, anyway. That's that was cool.
172
00:10:11,860 --> 00:10:14,500
Okay, voice training data. How are we doing? We're about
173
00:10:14,500 --> 00:10:18,580
10 minutes and I'm still talking about voice technology. So
174
00:10:18,660 --> 00:10:22,100
Whisper was brilliant and I was so excited that I
175
00:10:22,180 --> 00:10:25,380
was my first instinct was to like guess like, oh
176
00:10:25,380 --> 00:10:26,820
my gosh, I have to get like a really good
177
00:10:26,820 --> 00:10:30,580
microphone for this. So I didn't go on a spending
178
00:10:30,580 --> 00:10:32,740
spree because I said, I'm gonna have to just wait
179
00:10:32,740 --> 00:10:35,140
a month and see if I still use this. And
180
00:10:36,430 --> 00:10:38,910
It just kind of became, it's become really part of
181
00:10:39,070 --> 00:10:43,390
my daily routine. Like if I'm writing an email, I'll
182
00:10:43,470 --> 00:10:46,990
record a voice note. And then I've developed and it's
183
00:10:46,990 --> 00:10:49,070
nice to see that everyone is like developing the same
184
00:10:49,550 --> 00:10:51,950
things in parallel. Like that's my kind of a weird
185
00:10:51,950 --> 00:10:54,510
thing to say, but when I look, I kind of
186
00:10:54,670 --> 00:10:58,990
came, when I started working on this, these prototypes on
187
00:10:59,070 --> 00:11:01,470
GitHub, which is where I just kind of share very
188
00:11:01,710 --> 00:11:06,730
freely and loosely, ideas and first iterations on concepts.
189
00:11:08,490 --> 00:11:10,650
And for want of a better word, I called it
190
00:11:10,730 --> 00:11:15,450
like LLM post-processing or cleanup or basically a system prompt
191
00:11:15,530 --> 00:11:18,890
that after you get back the raw text from Whisper,
192
00:11:19,050 --> 00:11:22,010
you run it through a model and say, okay, this
193
00:11:22,090 --> 00:11:26,970
is crappy text, like add sentence structure and fix it
194
00:11:27,050 --> 00:11:32,250
up. And now when I'm exploring the different tools that
195
00:11:32,330 --> 00:11:35,180
are out there that people have built, I see quite
196
00:11:35,420 --> 00:11:39,100
a number of projects have basically done the same thing,
197
00:11:40,460 --> 00:11:43,180
lest that be misconstrued. I'm not saying for a millisecond
198
00:11:43,260 --> 00:11:46,220
that I inspired them. I'm sure this has been a
199
00:11:46,300 --> 00:11:49,500
thing that's been integrated into tools for a while, but
200
00:11:50,380 --> 00:11:52,300
it's the kind of thing that when you start using
201
00:11:52,300 --> 00:11:54,780
these tools every day, the need for it is almost
202
00:11:54,940 --> 00:11:59,420
instantly apparent because text that doesn't have any punctuation or
203
00:11:59,800 --> 00:12:03,000
Paragraph spacing takes a long time to, you know, it
204
00:12:03,160 --> 00:12:05,400
takes so long to get it into a presentable email
205
00:12:05,560 --> 00:12:09,720
that again, it's, it's, it, it moves speech tech into
206
00:12:09,960 --> 00:12:13,480
that before that inflection point where you're like, no, it's
207
00:12:13,480 --> 00:12:15,960
just not worth it. It's like, it's, it'll just be
208
00:12:16,040 --> 00:12:18,520
quicker to type this. So it's a big, it's a
209
00:12:18,520 --> 00:12:21,560
little touch that actually is a big deal. Uh, so
210
00:12:21,720 --> 00:12:25,640
I was on Whisper and I've been using Whisper and
211
00:12:25,640 --> 00:12:28,110
I kind of, early on found a couple of tools.
212
00:12:28,270 --> 00:12:30,510
I couldn't find what I was looking for on Linux,
213
00:12:30,670 --> 00:12:35,470
which is basically just something that'll run in the background.
214
00:12:35,710 --> 00:12:38,030
It'll give it an API key and it will just
215
00:12:38,190 --> 00:12:42,910
like transcribe with like a little key to start and
216
00:12:42,990 --> 00:12:47,310
stop the dictation. And the issues were I discovered that
217
00:12:47,470 --> 00:12:51,070
like most people involved in creating these projects were very
218
00:12:51,230 --> 00:12:55,070
much focused on local models, running Whisper locally because you
219
00:12:55,150 --> 00:12:57,940
can. And I tried that a bunch of times and
220
00:12:58,020 --> 00:13:00,340
just never got results that were as good as the
221
00:13:00,340 --> 00:13:03,140
cloud. And when I began looking at the cost of
222
00:13:03,220 --> 00:13:05,700
the speech to text APIs and what I was spending,
223
00:13:06,260 --> 00:13:09,460
I just thought there is, it's actually, in my opinion,
224
00:13:09,620 --> 00:13:12,820
just one of the better deals in API spending and
225
00:13:12,820 --> 00:13:15,140
in cloud. Like it's just not that expensive for very,
226
00:13:15,300 --> 00:13:19,300
very good models that are much more, you know, you're
227
00:13:19,300 --> 00:13:21,880
gonna be able to run the full model. The latest
228
00:13:21,880 --> 00:13:25,880
model versus whatever you can run on your average GPU,
229
00:13:26,120 --> 00:13:29,160
unless you want to buy a crazy GPU. It doesn't
230
00:13:29,160 --> 00:13:31,080
really make sense to me. Now, privacy is another concern
231
00:13:32,120 --> 00:13:33,880
that I know is kind of like a very much
232
00:13:33,960 --> 00:13:36,760
a separate thing that people just don't want their voice
233
00:13:37,000 --> 00:13:40,680
data and their voice leaving their local environment, maybe for
234
00:13:40,680 --> 00:13:44,200
regulatory reasons as well. But I'm not in that. I
235
00:13:44,600 --> 00:13:48,840
neither really care about people listening to my grocery list
236
00:13:49,080 --> 00:13:51,720
consisting of reminding myself that I need to buy more
237
00:13:51,800 --> 00:13:55,150
beer, Cheetos, and hummus, which is kind of the three
238
00:13:55,310 --> 00:13:59,870
staples of my diet during periods of poorer nutrition. But
239
00:13:59,950 --> 00:14:02,430
the kind of stuff that I transcribe, it's just not,
240
00:14:03,950 --> 00:14:07,710
it's not a privacy thing I'm that sort of sensitive
241
00:14:07,790 --> 00:14:13,150
about and I don't do anything so sensitive or secure
242
00:14:13,230 --> 00:14:16,430
that requires air gapping. So I looked at the pricing
243
00:14:16,510 --> 00:14:19,790
and especially the kind of older model mini Some of
244
00:14:19,870 --> 00:14:21,950
them are very, very affordable. And I did a back
245
00:14:22,190 --> 00:14:25,870
of the, I did a calculation once with ChatGPT and
246
00:14:25,870 --> 00:14:29,230
I was like, okay, this is the API price for
247
00:14:29,390 --> 00:14:32,270
I can't remember whatever the model was. Let's say I
248
00:14:32,350 --> 00:14:35,230
just go at it like nonstop, which it rarely happens.
249
00:14:35,470 --> 00:14:38,830
Probably, I would say on average, I might dictate 30
250
00:14:38,910 --> 00:14:41,790
to 60 minutes per day if I was probably summing
251
00:14:41,790 --> 00:14:46,990
up the emails, documents, outlines, which
252
00:14:47,230 --> 00:14:49,870
is a lot, but it's still a fairly modest amount.
253
00:14:50,030 --> 00:14:51,940
And I was like, Some days I do go on
254
00:14:52,100 --> 00:14:54,900
like one or two days where I've been usually when
255
00:14:54,900 --> 00:14:56,980
I'm like kind of out of the house and just
256
00:14:57,220 --> 00:15:00,500
have something like I have nothing else to do. Like
257
00:15:00,660 --> 00:15:04,020
if I'm at a hospital, we have a newborn and
258
00:15:04,180 --> 00:15:07,300
you're waiting for like eight hours and hours for an
259
00:15:07,380 --> 00:15:10,820
appointment. And I would probably have listened to podcasts before
260
00:15:11,380 --> 00:15:14,180
becoming a speech fanatic. And I'm like, oh, wait, let
261
00:15:14,340 --> 00:15:16,259
me just get down. Let me just get these ideas
262
00:15:16,420 --> 00:15:18,540
out of my head. And that's when I'll go on
263
00:15:19,260 --> 00:15:21,820
my speech binges. But those are like once every few
264
00:15:21,820 --> 00:15:24,940
months, like not frequently. But I said, okay, let's just
265
00:15:25,020 --> 00:15:29,100
say if I'm gonna price out Cloud SCT, if I
266
00:15:29,180 --> 00:15:33,900
was like dedicated every second of every waking hour to
267
00:15:34,060 --> 00:15:37,900
transcribing for some odd reason, I mean, I'd have to
268
00:15:37,980 --> 00:15:40,780
like eat and use the toilet. Like, you know, there's
269
00:15:40,860 --> 00:15:43,420
only so many hours I'm awake for. So like, let's
270
00:15:43,420 --> 00:15:46,620
just say a maximum of like 40 hour, 45 minutes.
271
00:15:47,210 --> 00:15:49,290
In the hour. Then I said, all right, let's just
272
00:15:49,290 --> 00:15:52,890
say 50. Who knows? You're dictating on the toilet. We
273
00:15:53,050 --> 00:15:55,050
do it. So it could be. You could just do
274
00:15:55,130 --> 00:15:59,290
60. But whatever I did. And every day, like, you're
275
00:15:59,370 --> 00:16:02,730
going flat out seven days a week dictating non-stop I
276
00:16:02,730 --> 00:16:05,850
was like, what's my monthly API bill gonna be at
277
00:16:05,930 --> 00:16:08,570
this price? And it came out to, like, 70 or
278
00:16:08,570 --> 00:16:10,730
80 bucks. And I was like, well, that would be
279
00:16:11,130 --> 00:16:15,700
an extraordinary. Amount of dictation. And I would hope that
280
00:16:16,180 --> 00:16:19,940
there was some compelling reason more worth more than $70
281
00:16:20,260 --> 00:16:23,460
that I embarked upon that project. So given that that's
282
00:16:23,460 --> 00:16:25,460
kind of the max point for me, I said that's
283
00:16:25,540 --> 00:16:29,140
actually very, very affordable. Now you're gonna, if you want
284
00:16:29,220 --> 00:16:31,700
to spec out the costs and you want to do
285
00:16:31,700 --> 00:16:36,260
the post-processing that I really do feel is valuable, that's
286
00:16:36,340 --> 00:16:40,820
gonna cost some more as well, unless you're using Gemini,
287
00:16:41,300 --> 00:16:44,420
which needless to say is a random person sitting in
288
00:16:44,500 --> 00:16:49,060
Jerusalem. I have no affiliation, nor with Google, nor anthropic,
289
00:16:49,140 --> 00:16:52,020
nor Gemini, nor any major tech vendor for that matter.
290
00:16:53,620 --> 00:16:56,820
I like Gemini not so much as a everyday model.
291
00:16:57,300 --> 00:16:59,860
It's kind of underwhelmed in that respect, I would say.
292
00:17:00,260 --> 00:17:02,740
But for multimodal, I think it's got a lot to
293
00:17:02,740 --> 00:17:06,500
offer. And I think that the transcribing functionality whereby it
294
00:17:06,580 --> 00:17:11,900
can process audio with a system prompt and both give
295
00:17:12,060 --> 00:17:15,100
you transcription that's cleaned up that reduces two steps to
296
00:17:15,260 --> 00:17:18,220
one. And that for me is a very, very big
297
00:17:18,380 --> 00:17:21,580
deal. And I feel like even Google has haven't really
298
00:17:21,820 --> 00:17:26,700
sort of thought through how useful the that modality is
299
00:17:26,780 --> 00:17:29,260
and what kind of use cases you can achieve with
300
00:17:29,340 --> 00:17:31,260
it. Because I found in the course of this year,
301
00:17:31,900 --> 00:17:36,540
just an endless list of really kind of system prompt
302
00:17:36,860 --> 00:17:40,220
system prompt stuff that I can say, okay, I've used
303
00:17:40,220 --> 00:17:43,420
it to capture context data for AI, which is literally
304
00:17:43,500 --> 00:17:45,660
I might speak for if I wanted to have a
305
00:17:45,660 --> 00:17:49,740
good bank of context data about who knows my childhood
306
00:17:50,300 --> 00:17:54,220
more realistically, maybe my career goals, something that would just
307
00:17:54,300 --> 00:17:56,700
be like really boring to type out. So I'll just
308
00:17:56,780 --> 00:18:00,780
like sit in my car and record it for 10
309
00:18:00,860 --> 00:18:03,100
minutes. And that 10 minutes you get a lot of
310
00:18:03,260 --> 00:18:08,650
information in. Um, emails, which is short text, just
311
00:18:09,050 --> 00:18:12,250
there is a whole bunch and all these workflows kind
312
00:18:12,410 --> 00:18:14,410
of require a little bit of treatment afterwards and different
313
00:18:14,650 --> 00:18:18,090
treatment. My context pipeline is kind of like just extract
314
00:18:18,170 --> 00:18:20,970
the bare essentials. So you end up with me talking
315
00:18:21,050 --> 00:18:22,970
very loosely about sort of what I've done in my
316
00:18:23,050 --> 00:18:25,370
career, where I've worked, where I might like to work.
317
00:18:25,850 --> 00:18:28,970
And it goes, it condenses that down to very robotic
318
00:18:29,210 --> 00:18:32,490
language that is easy to chunk parse and maybe put
319
00:18:32,570 --> 00:18:36,550
into a vector database. Daniel has worked in technology. Daniel
320
00:18:37,430 --> 00:18:40,150
has been working in, you know, stuff like that. That's
321
00:18:40,150 --> 00:18:43,110
not how you would speak, but I figure it's probably
322
00:18:43,350 --> 00:18:47,350
easier to parse for, after all, robots. So we've almost
323
00:18:47,430 --> 00:18:49,270
got to 20 minutes and this is actually a success
324
00:18:49,750 --> 00:18:55,110
because I wasted 20 minutes of the evening speaking
325
00:18:55,190 --> 00:18:59,910
into a microphone and the levels were shot and it
326
00:18:59,910 --> 00:19:01,590
was clipping and I said, I can't really do an
327
00:19:01,670 --> 00:19:03,990
evaluation. I have to be fair. I have to give
328
00:19:04,560 --> 00:19:07,920
the models a chance to do their thing. What am
329
00:19:07,920 --> 00:19:10,320
I hoping to achieve in this? Okay, my fine tune
330
00:19:10,320 --> 00:19:13,360
was a dud as mentioned. DeepChrom ST, I'm really, really
331
00:19:13,440 --> 00:19:16,480
hopeful that this prototype will work and it's a build
332
00:19:16,720 --> 00:19:19,280
in public open source, so anyone is welcome to use
333
00:19:19,360 --> 00:19:22,320
it if I make anything good. But that was really
334
00:19:22,480 --> 00:19:26,480
exciting for me last night when after hours of trying
335
00:19:26,560 --> 00:19:30,480
my own prototype, seeing someone just made something that works
336
00:19:30,640 --> 00:19:32,400
like that, you know, you're not gonna have to build
337
00:19:32,640 --> 00:19:37,460
a custom conda environment and image. I have AMD GPU,
338
00:19:37,620 --> 00:19:40,980
which makes things much more complicated. I didn't find it.
339
00:19:41,540 --> 00:19:42,980
And I was about to give up and I said,
340
00:19:43,060 --> 00:19:45,460
all right, let me just give Deep Grams Linux thing
341
00:19:45,940 --> 00:19:49,220
a shot. And if this doesn't work, I'm just going
342
00:19:49,220 --> 00:19:50,980
to go back to trying to Vibe code something myself.
343
00:19:51,620 --> 00:19:55,460
And when I ran the script, I was using Claude
344
00:19:55,540 --> 00:19:59,060
code to do the installation process. It ran the script
345
00:19:59,140 --> 00:20:02,020
and oh my gosh, it works just like that. The
346
00:20:02,100 --> 00:20:05,980
tricky thing For all those who want to know all
347
00:20:05,980 --> 00:20:11,260
the nitty gritty details, was that I
348
00:20:11,260 --> 00:20:14,380
don't think it was actually struggling with transcription, but pasting
349
00:20:14,700 --> 00:20:18,140
Wayland makes life very hard. And I think there was
350
00:20:18,220 --> 00:20:21,500
something not running the right time. Anyway, Deepgram, I looked
351
00:20:21,500 --> 00:20:23,820
at how they actually handled that because it worked out
352
00:20:23,900 --> 00:20:26,540
of the box when other stuff didn't. And it was
353
00:20:27,100 --> 00:20:30,570
quite a clever little mechanism. And but more so than
354
00:20:30,650 --> 00:20:33,290
that, the accuracy was brilliant. Now, what am I doing
355
00:20:33,290 --> 00:20:35,930
here? This is going to be a 20 minute audio
356
00:20:36,490 --> 00:20:42,010
sample. And I think I've done one or two
357
00:20:42,170 --> 00:20:46,570
of these before, but I did it with short snappy
358
00:20:46,730 --> 00:20:49,770
voice notes. This is kind of long form. This actually
359
00:20:50,010 --> 00:20:52,170
might be a better approximation for what's useful to me
360
00:20:52,330 --> 00:20:55,890
than voice memos. Like, I need to buy three Bread,
361
00:20:55,970 --> 00:20:58,610
eaters of milk tomorrow and Peter bread, which is probably
362
00:20:58,770 --> 00:21:01,330
how like half my voice notes sound. Like if anyone
363
00:21:01,810 --> 00:21:04,050
were to, I don't know, like find my phone, they'd
364
00:21:04,050 --> 00:21:05,570
be like, this is the most boring person in the
365
00:21:05,570 --> 00:21:09,330
world. Although actually, there are some like kind of journaling
366
00:21:09,330 --> 00:21:11,490
thoughts as well, but it's a lot of content like
367
00:21:11,490 --> 00:21:14,450
that. And the probably for the evaluation, the most useful
368
00:21:14,530 --> 00:21:20,210
thing is slightly obscure tech, GitHub, NeocleNo, hugging
369
00:21:20,290 --> 00:21:22,940
face, Not so obscure that it's not going to have
370
00:21:23,020 --> 00:21:26,460
a chance of knowing it, but hopefully sufficiently well known
371
00:21:26,460 --> 00:21:28,700
that the model should get it. I tried to do
372
00:21:28,780 --> 00:21:31,580
a little bit of speaking really fast and speaking very
373
00:21:31,740 --> 00:21:35,020
slowly. I would say in general, I've spoken, delivered this
374
00:21:35,180 --> 00:21:37,500
at a faster pace than I usually would owing to
375
00:21:37,980 --> 00:21:42,460
strong coffee flowing through my bloodstream. And the thing that
376
00:21:42,460 --> 00:21:44,700
I'm not going to get in this benchmark is background
377
00:21:44,780 --> 00:21:46,460
noise, which in my first take that I had to
378
00:21:46,460 --> 00:21:49,710
get rid of, My wife came in with my son
379
00:21:50,030 --> 00:21:52,350
and for a goodnight kiss. And that actually would have
380
00:21:52,350 --> 00:21:56,510
been super helpful to get in because it was non
381
00:21:56,590 --> 00:22:00,190
diarized or if we had diarization, a female, I could
382
00:22:00,190 --> 00:22:02,430
say, I want the male voice and that wasn't intended
383
00:22:02,430 --> 00:22:05,870
for transcription. And we're not going to get background noise
384
00:22:05,950 --> 00:22:08,270
like people honking their horns, which is something I've done
385
00:22:08,430 --> 00:22:11,150
in my main data set where I am trying to
386
00:22:11,390 --> 00:22:14,340
go back to some of my voice notes. Annotate them
387
00:22:14,580 --> 00:22:16,420
and run a benchmark. But this is going to be
388
00:22:16,420 --> 00:22:21,700
just a pure quick test. And as someone,
389
00:22:22,260 --> 00:22:24,660
I'm working on a voice note idea. That's my sort
390
00:22:24,660 --> 00:22:28,660
of end motivation. Besides thinking it's an ask to the
391
00:22:28,660 --> 00:22:32,340
outstanding technology that's coming to viability. And really, I know
392
00:22:32,420 --> 00:22:35,940
this sounds cheesy, can actually have a very transformative effect.
393
00:22:36,980 --> 00:22:41,130
It's, you know, voice technology has been life changing for
394
00:22:41,930 --> 00:22:46,970
folks living with disabilities. And I think
395
00:22:47,130 --> 00:22:48,970
there's something really nice about the fact that it can
396
00:22:49,130 --> 00:22:52,490
also benefit, you know, folks who are able bodied and
397
00:22:52,650 --> 00:22:57,690
like we can all in different ways make this tech
398
00:22:57,770 --> 00:23:00,410
as useful as possible, regardless of the exact way that
399
00:23:00,410 --> 00:23:03,770
we're using it. And I think there's something very powerful
400
00:23:03,850 --> 00:23:06,440
in that and it can be very cool. I see
401
00:23:06,600 --> 00:23:10,200
huge potential. What excites me about Voicetech? A lot of
402
00:23:10,280 --> 00:23:14,360
things actually. Firstly, the fact that it's cheap and accurate,
403
00:23:14,440 --> 00:23:17,080
as I mentioned at the very start of this. And
404
00:23:17,240 --> 00:23:19,880
it's getting better and better with stuff like accent handling.
405
00:23:20,680 --> 00:23:23,400
I'm not sure my fine-tune will actually ever come to
406
00:23:23,480 --> 00:23:25,320
fruition in the sense that I'll use it day to
407
00:23:25,400 --> 00:23:28,840
day as I imagine. I get like superb flawless words
408
00:23:28,920 --> 00:23:33,340
error rates because I'm just kind of skeptical about Local
409
00:23:33,500 --> 00:23:37,100
speech to text, as I mentioned, and I think the
410
00:23:37,180 --> 00:23:40,700
pace of innovation and improvement in the models, the main
411
00:23:40,860 --> 00:23:44,620
reasons for fine tuning from what I've seen have been
412
00:23:44,780 --> 00:23:47,420
people who are something that really blows my mind about
413
00:23:47,980 --> 00:23:53,100
ASR is the idea that it's inherently a lingual or
414
00:23:53,260 --> 00:23:58,570
multilingual phonetic based. So as folks who use speak
415
00:23:58,890 --> 00:24:02,250
very obscure languages, that there might be a paucity of
416
00:24:02,250 --> 00:24:04,890
training data or almost none at all, and therefore the
417
00:24:04,890 --> 00:24:10,090
accuracy is significantly reduced. Or folks in very critical
418
00:24:10,330 --> 00:24:14,250
environments, I know this is used extensively in medical transcription
419
00:24:14,330 --> 00:24:19,130
and dispatcher work, the call centers who send out ambulances,
420
00:24:19,210 --> 00:24:23,130
et cetera, where accuracy is absolutely paramount. And in the
421
00:24:23,130 --> 00:24:26,860
case of doctors, radiologist, they might be using very specialized
422
00:24:26,860 --> 00:24:29,420
vocab all the time. So those are kind of the
423
00:24:29,500 --> 00:24:31,420
main two things that I'm not sure that really just
424
00:24:31,500 --> 00:24:34,940
for trying to make it better on a few random
425
00:24:34,940 --> 00:24:37,900
tech words with my slightly, I mean, I have an
426
00:24:37,980 --> 00:24:41,020
accent, but like not, you know, an accent that a
427
00:24:41,100 --> 00:24:45,900
few other million people have ish. I'm not sure that
428
00:24:46,380 --> 00:24:50,300
my little fine tune is gonna actually like the bump
429
00:24:50,460 --> 00:24:53,500
in word error reduction, if I ever actually figure out
430
00:24:53,500 --> 00:24:54,620
how to do it and get it up to the
431
00:24:54,700 --> 00:24:57,870
cloud. By the time we've done that, I suspect that
432
00:24:58,190 --> 00:25:00,430
the next generation of ASR will just be so good
433
00:25:00,510 --> 00:25:02,990
that it will kind of be, well, that would have
434
00:25:02,990 --> 00:25:04,670
been cool if it worked out, but I'll just use
435
00:25:04,750 --> 00:25:08,510
this instead. So that's going to be it for today's
436
00:25:08,830 --> 00:25:14,030
episode of voice training data. Single long shot evaluation.
437
00:25:14,350 --> 00:25:17,150
Who am I going to compare? Whisper is always good
438
00:25:17,150 --> 00:25:20,510
as a benchmark, but I'm more interested in seeing Whisper
439
00:25:20,590 --> 00:25:24,510
head to head with two things, really. One is Whisper
440
00:25:24,590 --> 00:25:29,700
variants. So you've got these projects like faster Distill Whisper,
441
00:25:29,780 --> 00:25:31,700
it's a bit confusing, there's a whole bunch of them.
442
00:25:32,020 --> 00:25:35,300
And the emerging ASRs, which are also a thing. My
443
00:25:35,380 --> 00:25:37,220
intention for this is I'm not sure I'm going to
444
00:25:37,220 --> 00:25:39,860
have the time in any point in the foreseeable future
445
00:25:40,180 --> 00:25:44,580
to go back through this whole episode and create a
446
00:25:44,660 --> 00:25:49,700
proper source truth, where I fix everything. Might do
447
00:25:49,780 --> 00:25:52,740
it if I can get one transcriptions that sufficiently close
448
00:25:52,980 --> 00:25:57,040
to perfection. But what I would actually love to do
449
00:25:57,200 --> 00:25:59,920
on Hugging Face, I think would be a great probably
450
00:26:00,240 --> 00:26:02,880
how I might visualize this is having the audio waveform
451
00:26:03,200 --> 00:26:08,160
play and then have the transcript for each model below
452
00:26:08,160 --> 00:26:12,560
it and maybe even a like, you know, to scale
453
00:26:13,120 --> 00:26:15,600
and maybe even a local one as well, like local
454
00:26:15,760 --> 00:26:21,100
whisper versus OpenAI API, et cetera. And, I
455
00:26:21,180 --> 00:26:23,500
can then actually listen back to segments or anyone who
456
00:26:23,500 --> 00:26:25,820
wants to can listen back to segments of this recording
457
00:26:26,140 --> 00:26:30,940
and see where a particular model struggled and others didn't,
458
00:26:31,420 --> 00:26:33,340
as well as the sort of headline finding of which
459
00:26:33,500 --> 00:26:36,860
had the best WER, but that would require the source
460
00:26:36,860 --> 00:26:39,580
of truth. Okay, that's it. I hope this was, I
461
00:26:39,580 --> 00:26:42,540
don't know, maybe useful for other folks interested in STT.
462
00:26:42,860 --> 00:26:45,660
You want to see that I always feel think I've
463
00:26:45,660 --> 00:26:48,870
just said as something I didn't intend to. STT, I
464
00:26:48,870 --> 00:26:52,470
said for those. Listen carefully, including hopefully the models themselves.
465
00:26:53,190 --> 00:26:57,270
This has been myself, Daniel Rosell. For more jumbled repositories
466
00:26:57,350 --> 00:27:01,750
about my roving interests in AI, but particularly agentic, MCP
467
00:27:01,990 --> 00:27:07,029
and Voicetech, you can find me on GitHub, huggingface.com,
468
00:27:10,230 --> 00:27:13,270
which is my personal website, as well as this podcast,
469
00:27:13,510 --> 00:27:16,950
whose name I sadly cannot remember. Until next time, thanks
470
00:27:16,950 --> 00:27:17,510
for listening.