STT-Comparison / srt-out /speechmatics.srt
danielrosehill's picture
Fix SRT timestamp alignment with ground truth
0aa8adc
1
00:00:00,000 --> 00:00:06,400
Hello and welcome to a audio data
set consisting of one single
2
00:00:06,400 --> 00:00:12,000
episode of a non-existent podcast.
Or it, uh, I may append this to a
3
00:00:12,000 --> 00:00:16,520
podcast that I set up recently.
Um, regarding my, uh,
4
00:00:16,560 --> 00:00:21,840
with my thoughts on speech,
tech and AI in particular,
5
00:00:22,120 --> 00:00:27,840
more AI and generative AI, I would,
uh, I would say, but in any event,
6
00:00:27,840 --> 00:00:32,360
the purpose of this, um,
voice recording is actually to create
7
00:00:32,560 --> 00:00:37,440
a lengthy voice sample for a quick
evaluation, a back of the envelope
8
00:00:37,440 --> 00:00:41,040
evaluation, as they might say,
for different speech to text models.
9
00:00:41,040 --> 00:00:43,680
And I'm doing this because I,
uh, I thought I'd made a great
10
00:00:43,680 --> 00:00:48,200
breakthrough in my journey with
speech tech, and that was succeeding
11
00:00:48,200 --> 00:00:52,600
in the elusive task of fine tuning.
Whisper, whisper is.
12
00:00:52,720 --> 00:00:56,840
And I'm going to just talk.
I'm trying to mix up, uh,
13
00:00:56,840 --> 00:01:00,350
I'm going to try a few different
styles of speaking.
14
00:01:00,350 --> 00:01:02,510
I might whisper something at
some point as well,
15
00:01:03,070 --> 00:01:07,030
and I'll go back to speaking loud in,
uh, in different parts.
16
00:01:07,030 --> 00:01:09,590
I'm going to sound really like a
crazy person, because I'm also
17
00:01:09,590 --> 00:01:15,750
going to try to speak at different
pitches and cadences in order to
18
00:01:15,790 --> 00:01:20,510
really try to put a speech to
text model through its paces,
19
00:01:20,510 --> 00:01:25,750
which is trying to make sense of,
is this guy just on incoherently in
20
00:01:25,750 --> 00:01:34,230
one long sentence, or are these just
actually a series of step standalone,
21
00:01:34,230 --> 00:01:37,390
standalone, standalone sentences?
And how is it going to handle
22
00:01:37,390 --> 00:01:40,630
step alone? That's not a word.
Uh, what happens when you use
23
00:01:40,630 --> 00:01:43,910
speech to text and you use a fake
word and then you're like, wait,
24
00:01:43,910 --> 00:01:48,230
that's not actually that word doesn't
exist. How does AI handle that?
25
00:01:48,270 --> 00:01:53,790
And, uh, these and more are all
the questions that I'm seeking
26
00:01:53,790 --> 00:01:57,230
to answer in this training data.
Now, why did why was it trying
27
00:01:57,230 --> 00:01:59,620
to fine tune a whisper?
And what is whisper?
28
00:01:59,660 --> 00:02:03,420
As I said, I'm gonna try to, uh,
record this at a couple of different
29
00:02:03,420 --> 00:02:08,940
levels of technicality for folks who
are, uh, you know, in the normal, uh,
30
00:02:08,940 --> 00:02:13,340
world and not totally stuck down
the rabbit hole of AI, uh, which I
31
00:02:13,340 --> 00:02:17,340
have to say is a really wonderful,
uh, rabbit hole to be to be down.
32
00:02:17,460 --> 00:02:21,580
Um, it's a really interesting area.
And speech and voice tech is is
33
00:02:21,820 --> 00:02:24,860
the aspect of it that I find
actually most.
34
00:02:25,060 --> 00:02:28,220
I'm not sure I would say the most
interesting, because there's just
35
00:02:28,220 --> 00:02:32,580
so much that is fascinating in AI.
Uh, but the most that I find the
36
00:02:32,580 --> 00:02:36,100
most personally transformative
in terms of the impact that it's
37
00:02:36,100 --> 00:02:41,540
had on my daily work life and
productivity and how I sort of work.
38
00:02:41,820 --> 00:02:47,900
And I'm persevering hard with the
task of trying to guess a good
39
00:02:47,900 --> 00:02:51,580
solution working for Linux, which if
anyone actually does listen to this,
40
00:02:51,580 --> 00:02:54,980
not just for the training data
and for the actual content, uh,
41
00:02:55,020 --> 00:02:59,480
this is this is has sparked I had
besides the fine tune not working.
42
00:02:59,480 --> 00:03:05,440
Well, that was the failure.
Um, I used clod code because one
43
00:03:05,440 --> 00:03:10,040
thinks these days that there is
nothing short of solving,
44
00:03:10,920 --> 00:03:14,560
you know, the, uh,
the reason of life or something.
45
00:03:14,960 --> 00:03:19,440
Uh, that clod and agentic AI can't
do, uh, which is not really the case.
46
00:03:19,480 --> 00:03:23,480
Uh, it does seem that way sometimes,
but it fails a lot as well.
47
00:03:23,480 --> 00:03:26,840
And this is one of those, uh,
instances where last week I put
48
00:03:26,840 --> 00:03:31,280
together an hour of voice training
data, basically speaking just
49
00:03:31,280 --> 00:03:34,920
random things for three minutes.
And, um,
50
00:03:35,600 --> 00:03:38,400
it was actually kind of tedious
because the texts were really weird.
51
00:03:38,400 --> 00:03:42,000
Some of them were it was like it
was AI generated.
52
00:03:42,200 --> 00:03:44,800
Um, I tried before to read
Sherlock Holmes for an hour and
53
00:03:44,800 --> 00:03:46,880
I just couldn't.
I was so bored, uh,
54
00:03:46,920 --> 00:03:50,680
after ten minutes that I was like,
okay, now I'm just gonna have to
55
00:03:50,680 --> 00:03:56,350
find something else to read.
So I used a created with AI
56
00:03:56,390 --> 00:04:00,030
studio vibe coded.
A synthetic text generator.
57
00:04:00,270 --> 00:04:03,870
Um, which actually I thought was
probably a better way of doing it
58
00:04:03,870 --> 00:04:08,750
because it would give me more short
samples with more varied content.
59
00:04:08,750 --> 00:04:13,190
So I was like, okay, give me a voice
note, like I'm recording an email,
60
00:04:13,190 --> 00:04:17,990
give me a short story to read,
give me prose, um, to read.
61
00:04:17,990 --> 00:04:21,190
So I came up with all these
different things, and I added a
62
00:04:21,190 --> 00:04:24,630
little timer to it so I could
see how close I was to one hour.
63
00:04:24,870 --> 00:04:29,710
Um, and, uh, I spent like an hour one
afternoon or probably two hours by
64
00:04:29,710 --> 00:04:34,070
the time you, um, you do retakes
or whatever because you want to.
65
00:04:34,870 --> 00:04:39,070
It gave me a source of truth,
which I'm not sure if that's the
66
00:04:39,070 --> 00:04:43,430
scientific way to approach this topic
of gathering, uh, training data,
67
00:04:43,430 --> 00:04:47,950
but I thought it made sense.
Um, I have a lot of audio data
68
00:04:47,950 --> 00:04:51,950
from recording voice notes,
which I've also kind of used, um,
69
00:04:51,950 --> 00:04:55,660
been experimenting with using for
a different purpose, slightly
70
00:04:55,660 --> 00:05:00,700
different annotating task types.
It's more text classification
71
00:05:00,700 --> 00:05:03,620
experiment or uh, well,
it's more than that, actually.
72
00:05:03,620 --> 00:05:07,980
I'm working on a voice app,
so it's a prototype I guess is
73
00:05:07,980 --> 00:05:12,660
really more accurate.
Um, but you can do that and you
74
00:05:12,660 --> 00:05:14,100
can work backwards.
You're like,
75
00:05:14,140 --> 00:05:18,500
you listen back to a voice note
and you painfully go through one
76
00:05:18,500 --> 00:05:21,860
of those transcribing, you know,
where you start and stop and scrub
77
00:05:21,860 --> 00:05:23,980
around it and you fix the errors.
But it's really,
78
00:05:23,980 --> 00:05:27,100
really boring to do that.
So I thought it would be less
79
00:05:27,100 --> 00:05:31,740
tedious in the long term if I just
recorded The Source of truth.
80
00:05:32,060 --> 00:05:34,180
So it gave me these three minute
snippets.
81
00:05:34,180 --> 00:05:38,660
I recorded them and saved an MP3
and a txt in the same folder,
82
00:05:38,660 --> 00:05:43,700
and I created an hour of that data.
Uh, so I was very hopeful, quietly,
83
00:05:43,740 --> 00:05:46,260
you know, a little bit hopeful
that I would be able that I could
84
00:05:46,260 --> 00:05:49,580
actually fine tune, whisper.
Um, I want to fine tune whisper
85
00:05:49,580 --> 00:05:54,720
because when I got into voice tech
last November, my wife was in
86
00:05:54,720 --> 00:05:59,480
the US and I was alone at home.
And you know, when crazy people
87
00:05:59,480 --> 00:06:03,640
like me do really wild things like
use voice to tech, uh, technology.
88
00:06:03,640 --> 00:06:06,400
That was basically, um,
when I started doing it,
89
00:06:06,400 --> 00:06:10,160
I didn't feel like a crazy person
speaking to myself, and my
90
00:06:10,160 --> 00:06:16,000
expectations weren't that high.
Uh, I used speech tech now and again.
91
00:06:16,080 --> 00:06:18,360
Um, tried it out.
I was like, it'd be really cool
92
00:06:18,360 --> 00:06:20,400
if you could just, like,
speak into your computer.
93
00:06:20,760 --> 00:06:24,600
And whatever I tried out that
had Linux support was just.
94
00:06:25,320 --> 00:06:28,520
It was not good, basically.
Um, and this blew me away from
95
00:06:28,520 --> 00:06:31,920
the first go.
I mean, it wasn't 100% accurate
96
00:06:31,960 --> 00:06:35,040
out of the box and it took work,
but it was good enough that there was
97
00:06:35,040 --> 00:06:39,600
a solid foundation and it kind of
passed that, uh, pivot point that
98
00:06:39,600 --> 00:06:42,760
it's actually worth doing this.
You know, there's a point where
99
00:06:42,760 --> 00:06:46,800
it's so like the transcript is you
don't have to get 100% accuracy
100
00:06:46,800 --> 00:06:50,510
for it to be worth your time for
speech to text to be a worthwhile
101
00:06:50,510 --> 00:06:52,950
addition to your productivity.
But you do need to get above.
102
00:06:52,990 --> 00:06:57,630
Let's say, I don't know, 85%.
If it's 60% or 50%,
103
00:06:57,630 --> 00:07:00,670
you inevitably say, screw it.
I'll just type it because you end up
104
00:07:00,670 --> 00:07:04,950
missing errors in the transcript
and it becomes actually worse.
105
00:07:04,950 --> 00:07:06,710
You end up in a worse position
than you started with.
106
00:07:06,710 --> 00:07:10,910
And that's been my experience.
So, um, I was like, oh,
107
00:07:10,950 --> 00:07:13,430
this is actually really, really good.
Now how did that happen?
108
00:07:13,430 --> 00:07:18,790
And the answer is ASR whisper
being open sourced and the
109
00:07:18,790 --> 00:07:21,790
transformer architecture,
if you want to go back to the,
110
00:07:22,390 --> 00:07:26,630
um, to the underpinnings, which
really blows my mind and it's on my
111
00:07:26,630 --> 00:07:32,310
list to read through that paper.
Um, all you need is attention as
112
00:07:33,350 --> 00:07:38,350
attentively as can be done with my
limited brain because it's super,
113
00:07:38,350 --> 00:07:42,190
super high level stuff.
Um, super advanced stuff.
114
00:07:42,230 --> 00:07:47,950
I mean, uh, but that I think of all
the things that are fascinating
115
00:07:48,060 --> 00:07:52,700
about the sudden rise in AI and
the dramatic capabilities.
116
00:07:53,300 --> 00:07:55,580
I find it fascinating that few
people are like, hang on,
117
00:07:55,740 --> 00:07:59,620
you've got this thing that can speak
to you like a chatbot, an LLM,
118
00:08:00,300 --> 00:08:05,460
and then you've got image generation.
Okay, so firstly, those two things on
119
00:08:05,460 --> 00:08:10,740
the surface have nothing in common.
Um, so like how are they how did that
120
00:08:10,740 --> 00:08:12,980
just happen all at the same time.
And then when you extend that
121
00:08:12,980 --> 00:08:16,060
further, um, you're like sooner,
right?
122
00:08:16,060 --> 00:08:21,580
You can sing a song and AI will like,
come up with an instrumental and then
123
00:08:21,580 --> 00:08:23,740
you've got whisper and you're like,
wait a second,
124
00:08:23,940 --> 00:08:27,980
how did all this stuff, like,
if it's all AI, what's like there
125
00:08:27,980 --> 00:08:30,580
has to be some commonality.
Otherwise these are four.
126
00:08:30,660 --> 00:08:34,660
These are totally different
technologies on the surface of it.
127
00:08:34,660 --> 00:08:40,100
And, uh, the transformer architecture
is, as far as I know, the answer.
128
00:08:40,100 --> 00:08:43,740
And I can't even say can't even
pretend that I really understand
129
00:08:44,020 --> 00:08:47,170
what the transformer
architecture means in depth,
130
00:08:47,170 --> 00:08:51,690
but I have scanned it and as I said,
I want to print it and really kind
131
00:08:51,690 --> 00:08:56,650
of think over it at some point,
and I'll probably feel bad about
132
00:08:56,650 --> 00:08:58,970
myself, I think,
because weren't those guys in their
133
00:08:59,010 --> 00:09:03,890
in their 20s like, that's crazy.
I think I asked ChatGPT once who
134
00:09:03,930 --> 00:09:08,250
were the who wrote that paper
and how old were they when it
135
00:09:08,250 --> 00:09:11,170
was published in arXiv?
And I was expecting like,
136
00:09:11,410 --> 00:09:13,330
I don't know,
what do you what do you imagine?
137
00:09:13,330 --> 00:09:14,930
I personally imagine kind of like,
you know,
138
00:09:14,970 --> 00:09:19,090
you have these breakthroughs during
Covid and things like that where
139
00:09:19,130 --> 00:09:22,090
like these kind of really obscure
scientists who are like in their
140
00:09:22,090 --> 00:09:27,130
50s and they've just kind of been
laboring in labs and, uh, wearily
141
00:09:27,130 --> 00:09:30,530
and writing in publishing in kind
of obscure academic publications.
142
00:09:30,730 --> 00:09:33,930
And they finally, like,
hit a big or win a Nobel Prize and
143
00:09:33,930 --> 00:09:37,810
then their household household names.
Uh, so that was kind of what I
144
00:09:37,810 --> 00:09:39,650
had in mind.
That was the mental image I'd
145
00:09:39,650 --> 00:09:43,890
formed of the birth of arXiv.
Like, I wasn't expecting 20
146
00:09:43,930 --> 00:09:47,310
somethings in San Francisco,
though I thought that was both very,
147
00:09:47,310 --> 00:09:49,870
very funny, very cool,
and actually kind of inspiring.
148
00:09:50,390 --> 00:09:55,510
It's nice to think that people who,
you know, just you might put them
149
00:09:55,510 --> 00:10:00,910
in the kind of milieu or bubble or
world that you are in or credibly in,
150
00:10:00,950 --> 00:10:03,590
through, you know,
a series of connections that are
151
00:10:03,590 --> 00:10:07,630
coming up with such literally
world changing, um, innovations.
152
00:10:07,670 --> 00:10:11,430
Uh, so that was, I thought,
anyway, that, that that was cool.
153
00:10:12,070 --> 00:10:13,950
Okay. Voice training data.
How are we doing?
154
00:10:13,950 --> 00:10:17,990
We're about ten minutes, and I'm
still talking about voice technology.
155
00:10:18,190 --> 00:10:22,350
Um, so whisper was brilliant,
and I was so excited that I was.
156
00:10:22,350 --> 00:10:25,630
My first instinct was to, like,
get like, oh, my gosh,
157
00:10:25,630 --> 00:10:27,710
I have to get, like,
a really good microphone for this.
158
00:10:27,950 --> 00:10:31,630
So, um, I didn't go on a
spending spree because I said,
159
00:10:31,670 --> 00:10:34,470
I'm gonna have to just wait a
month and see if I still use this.
160
00:10:34,910 --> 00:10:39,990
And it just kind of became it's
become really part of my daily
161
00:10:39,990 --> 00:10:42,990
routine.
Like, if I'm writing an email,
162
00:10:42,990 --> 00:10:47,020
I'll record a voice note.
And then I've developed and it's
163
00:10:47,020 --> 00:10:49,900
nice to see that everyone is
like developing the same things
164
00:10:49,900 --> 00:10:51,900
in parallel.
Like, that's kind of a weird thing
165
00:10:51,940 --> 00:10:57,340
to say, but when I look, I kind of
came when I started working on this,
166
00:10:57,380 --> 00:11:00,700
these prototypes on GitHub,
which is where I just kind of
167
00:11:00,740 --> 00:11:04,740
share very freely and loosely,
uh, ideas and, you know,
168
00:11:04,780 --> 00:11:10,020
first iterations on, on concepts,
um, and for want of a better word,
169
00:11:10,020 --> 00:11:13,900
I called it like, uh,
lm post-processing or cleanup or
170
00:11:14,140 --> 00:11:18,100
basically a system prompt that after
you get back the raw text from
171
00:11:18,420 --> 00:11:24,100
whisper, you run it through a model
and say, okay, this is crappy text,
172
00:11:24,140 --> 00:11:27,140
like add sentence structure and,
you know, fix it up.
173
00:11:27,580 --> 00:11:32,660
And, um, now when I'm exploring the
different tools that are out there
174
00:11:32,700 --> 00:11:36,580
that people have built, I see, uh,
quite a number of projects have
175
00:11:37,180 --> 00:11:41,700
basically done the same thing,
um, less that be misconstrued.
176
00:11:41,700 --> 00:11:44,370
I'm not saying for a millisecond
that I inspired them.
177
00:11:44,370 --> 00:11:48,890
I'm sure this has been a thing that's
been integrated into tools for a
178
00:11:48,930 --> 00:11:52,290
while, but it's it's the kind of
thing that when you start using these
179
00:11:52,290 --> 00:11:56,730
tools every day, the need for it
is almost instantly apparent, uh,
180
00:11:56,730 --> 00:12:00,770
because text that doesn't have any
punctuation or paragraph spacing
181
00:12:00,810 --> 00:12:04,250
takes a long time to, you know,
it takes so long to get it into
182
00:12:04,250 --> 00:12:09,370
a presentable email that again,
it's it's it moves speech tech
183
00:12:09,410 --> 00:12:12,930
into that before that inflection
point where you're like, no,
184
00:12:12,930 --> 00:12:16,250
it's just not worth it.
It's like it'll just be quicker
185
00:12:16,250 --> 00:12:18,850
to type this.
So it's a big it's a little touch.
186
00:12:18,850 --> 00:12:24,090
That actually is a big deal.
Uh, so I was on whisper and I've
187
00:12:24,090 --> 00:12:28,170
been using whisper and I kind of
early on found a couple of tools.
188
00:12:28,210 --> 00:12:30,930
I couldn't find what I was
looking for on Linux, which is,
189
00:12:31,370 --> 00:12:35,770
um, basically just something
that'll run in the background.
190
00:12:35,810 --> 00:12:40,130
You'll give it an API key and it
will just transcribe. Um.
191
00:12:41,280 --> 00:12:44,000
with, like, a little key to
start and stop the dictation.
192
00:12:44,600 --> 00:12:49,040
Uh, and the issues were I discovered
that, like most people involved in
193
00:12:49,040 --> 00:12:53,920
creating these projects were very
much focused on local models running
194
00:12:53,920 --> 00:12:57,400
whisper locally, because you can.
And I tried that a bunch of
195
00:12:57,400 --> 00:13:00,840
times and just never got results
that were as good as the cloud.
196
00:13:01,160 --> 00:13:04,640
And when I began looking at the
cost of the speech to text APIs
197
00:13:04,640 --> 00:13:08,520
and what I was spending,
I just thought there's it's actually,
198
00:13:08,720 --> 00:13:13,200
in my opinion, just one of the better
deals in API spending and in cloud.
199
00:13:13,240 --> 00:13:17,280
Like it's just not that expensive
for very, very good models that are
200
00:13:17,400 --> 00:13:20,840
much more, you know, you're going
to be able to run the full model,
201
00:13:21,360 --> 00:13:25,960
the latest model versus whatever
you can run on your average GPU.
202
00:13:26,000 --> 00:13:29,760
Unless you want to buy a crazy GPU.
It doesn't really make sense to me.
203
00:13:29,760 --> 00:13:33,480
Now, privacy is another concern.
Um, that I know is kind of like a
204
00:13:33,520 --> 00:13:36,920
very much a separate thing that
people just don't want their voice,
205
00:13:36,920 --> 00:13:39,790
data, and their voice leaving
their local environment,
206
00:13:40,110 --> 00:13:43,830
maybe for regulatory reasons as well.
Um, but I'm not in that.
207
00:13:43,910 --> 00:13:47,910
Um, I'm neither really care about
people listening to my, uh,
208
00:13:47,950 --> 00:13:51,190
grocery list consisting of, uh,
reminding myself that I need to
209
00:13:51,230 --> 00:13:54,790
buy more beer, Cheetos and hummus,
which is kind of the three,
210
00:13:54,990 --> 00:13:59,310
three staples of my diet.
Um, during periods of poor nutrition.
211
00:13:59,590 --> 00:14:03,310
Uh, but the kind of stuff that I
transcribe, it's just not it's not a,
212
00:14:03,990 --> 00:14:09,350
it's not a privacy thing and that
sort of sensitive about and, uh,
213
00:14:09,350 --> 00:14:13,070
I don't do anything so,
you know, sensitive or secure,
214
00:14:13,070 --> 00:14:16,590
that requires air gapping.
So, um, I looked at the pricing and
215
00:14:16,590 --> 00:14:20,270
especially the kind of older models,
mini, um, some of them are very,
216
00:14:20,270 --> 00:14:23,110
very affordable.
And I did a back of the I did a
217
00:14:23,110 --> 00:14:27,150
calculation once with ChatGPT
and I was like, okay, this is a,
218
00:14:27,150 --> 00:14:31,070
this is the API price for I can't
remember whatever the model was.
219
00:14:31,550 --> 00:14:33,910
Uh, let's say I just go at it
like nonstop,
220
00:14:34,030 --> 00:14:37,410
which it rarely happens. Probably.
I would say on average,
221
00:14:37,410 --> 00:14:41,890
I might dictate 30 to 60 minutes per
day if I was probably summing up
222
00:14:41,890 --> 00:14:48,490
the emails, documents, outlines,
um, which is a lot, but it's it's
223
00:14:48,490 --> 00:14:50,730
still a fairly modest amount.
And I was like, well,
224
00:14:50,770 --> 00:14:53,930
some days I do go on like 1 or 2
days where I've been.
225
00:14:54,450 --> 00:14:58,450
Usually when I'm like kind of out of
the house and just have something
226
00:14:59,090 --> 00:15:02,250
like, I have nothing else to do.
Like if I'm at a hospital with a
227
00:15:02,250 --> 00:15:06,970
newborn, uh, and you're waiting
for like eight hours and hours
228
00:15:06,970 --> 00:15:10,210
for an appointment, and I would
probably have listened to podcasts
229
00:15:10,490 --> 00:15:14,010
before becoming a speech fanatic.
And I'm like, oh, wait,
230
00:15:14,050 --> 00:15:16,370
let me just get down.
Let me just get these ideas out
231
00:15:16,410 --> 00:15:19,610
of my head.
And that's when I'll go on my
232
00:15:19,650 --> 00:15:21,530
speech binges.
But those are like once every
233
00:15:21,530 --> 00:15:24,970
few months, like not frequently.
But I said, okay, let's just say
234
00:15:24,970 --> 00:15:30,650
if I'm gonna price out.
Cloud asked if I was like, dedicated
235
00:15:30,650 --> 00:15:36,880
every second of every waking hour to
transcribing for some odd reason. Um.
236
00:15:37,200 --> 00:15:39,680
I mean, it'd have to, like,
eat and use the toilet and,
237
00:15:39,720 --> 00:15:42,520
like, you know, there's only so
many hours I'm awake for.
238
00:15:42,520 --> 00:15:44,680
So, like,
let's just say a maximum of, like,
239
00:15:44,720 --> 00:15:48,680
40 hours, 45 minutes in the hour.
Then I said, all right,
240
00:15:48,680 --> 00:15:52,600
let's just say 50. Who knows?
You're dictating on the toilet.
241
00:15:52,640 --> 00:15:53,880
We do it.
Uh,
242
00:15:53,880 --> 00:15:58,720
so it could be you could just do 60.
But whatever I did, and every day,
243
00:15:58,760 --> 00:16:02,440
like, you're going flat out seven
days a week dictating non-stop.
244
00:16:02,480 --> 00:16:06,440
I was like, what's my monthly API
bill going to be at this price?
245
00:16:06,720 --> 00:16:09,120
And it came out to like 70 or 80
bucks.
246
00:16:09,120 --> 00:16:14,080
And I was like, well, that would be
an extraordinary amount of dictation.
247
00:16:14,080 --> 00:16:17,840
And I would hope that there was
some compelling reason,
248
00:16:18,040 --> 00:16:22,200
more worth more than $70,
that I embarked upon that project.
249
00:16:22,400 --> 00:16:25,200
Uh, so given that that's kind of the
max point for me, I said, that's
250
00:16:25,240 --> 00:16:29,000
actually very, very affordable.
Um, now you're gonna if you want
251
00:16:29,040 --> 00:16:34,080
to spec out the costs and you want
to do the post-processing that I
252
00:16:34,150 --> 00:16:37,110
really do feel is valuable.
Um, that's going to cost some more as
253
00:16:37,110 --> 00:16:43,110
well, unless you're using Gemini,
which, uh, needless to say, is a
254
00:16:43,110 --> 00:16:46,950
random person sitting in Jerusalem.
Uh, I have no affiliation,
255
00:16:46,950 --> 00:16:51,350
nor with Google, nor anthropic,
nor Gemini, nor any major tech vendor
256
00:16:51,350 --> 00:16:56,790
for that matter. Um, I like Gemini.
Not so much as a everyday model.
257
00:16:56,870 --> 00:16:59,830
Um, it's kind of underwhelmed in
that respect, I would say.
258
00:17:00,230 --> 00:17:03,030
But for multimodal,
I think it's got a lot to offer.
259
00:17:03,310 --> 00:17:06,870
And I think that the transcribing
functionality whereby it can,
260
00:17:07,270 --> 00:17:12,150
um, process audio with a system
prompt and both give you
261
00:17:12,190 --> 00:17:15,390
transcription that's cleaned up,
that reduces two steps to one.
262
00:17:15,710 --> 00:17:18,630
And that for me is a very,
very big deal.
263
00:17:18,630 --> 00:17:22,990
And, uh, I feel like even Google
has haven't really sort of thought
264
00:17:22,990 --> 00:17:27,430
through how useful the that
modality is and what kind of use
265
00:17:27,430 --> 00:17:30,790
cases you can achieve with it.
Because I found in the course of
266
00:17:30,790 --> 00:17:36,490
this year just an endless list
of really kind of system prompt,
267
00:17:36,730 --> 00:17:41,290
system prompt stuff that I can say,
okay, I've used it to capture context
268
00:17:41,290 --> 00:17:45,570
data for AI, which is literally I
might speak for if I wanted to have a
269
00:17:45,570 --> 00:17:49,730
good bank of context data about,
who knows, my childhood.
270
00:17:50,010 --> 00:17:53,450
Uh, more realistically,
maybe my career goals, uh,
271
00:17:53,450 --> 00:17:56,010
something that would just be,
like, really boring to type out.
272
00:17:56,130 --> 00:18:01,130
So I'll just, like, sit in my car
and record it for ten minutes.
273
00:18:01,130 --> 00:18:04,090
And that ten minutes,
you get a lot of information in,
274
00:18:04,530 --> 00:18:10,090
um, emails, which is short text.
Um, just there is a whole bunch.
275
00:18:10,090 --> 00:18:13,570
And all these workflows kind of
require a little bit of treatment
276
00:18:13,570 --> 00:18:17,490
afterwards and different treatment.
My context pipeline is kind of like
277
00:18:17,490 --> 00:18:21,210
just extract the bare essentials.
So you end up with me talking very
278
00:18:21,210 --> 00:18:24,250
loosely about sort of what I've done
in my career, where I've worked,
279
00:18:24,250 --> 00:18:27,610
where I might like to work,
and it goes it condenses that
280
00:18:27,610 --> 00:18:31,600
down to very robotic language
that is easy to chunk, parse,
281
00:18:31,600 --> 00:18:35,960
and maybe put into a vector database.
Daniel has worked in technology,
282
00:18:36,000 --> 00:18:39,640
Daniel is a has been working in,
you know, stuff like that.
283
00:18:39,640 --> 00:18:43,600
That's not how you would speak.
Um, but I figure it's probably easier
284
00:18:43,600 --> 00:18:48,120
to parse for, after all, robots.
So we've almost got to 20 minutes.
285
00:18:48,120 --> 00:18:52,640
And this is actually a success
because I wasted 20 minutes of my,
286
00:18:52,800 --> 00:18:56,880
uh, of the evening speaking into
a microphone, and, uh,
287
00:18:56,920 --> 00:19:00,840
the levels were shot and, uh, it,
uh, it was clipping and I said,
288
00:19:00,840 --> 00:19:03,200
I can't really do an evaluation.
I have to be fair.
289
00:19:03,200 --> 00:19:07,000
I have to give the models a
chance to do their thing.
290
00:19:07,520 --> 00:19:09,360
Uh,
what am I hoping to achieve in this?
291
00:19:09,400 --> 00:19:12,600
Okay, my fine tune was a dud,
as mentioned Deepgram SVT.
292
00:19:12,640 --> 00:19:15,520
I'm really, really hopeful that
this prototype will work.
293
00:19:15,800 --> 00:19:18,960
And it's a built in public open
source, so anyone is welcome to
294
00:19:19,000 --> 00:19:22,920
use it if I make anything good.
Um, but that was really exciting for
295
00:19:22,920 --> 00:19:27,400
me last night when after hours of,
um, trying my own prototype,
296
00:19:27,400 --> 00:19:31,230
seeing someone just made
something that works like that.
297
00:19:31,270 --> 00:19:32,670
You know,
you're not going to have to build a
298
00:19:32,670 --> 00:19:38,230
custom conda environment and image.
I have AMD GPU, which makes
299
00:19:38,230 --> 00:19:42,310
things much more complicated.
I didn't find it and I was about
300
00:19:42,310 --> 00:19:43,990
to give up and I said,
all right, let me just give deep
301
00:19:43,990 --> 00:19:48,750
grams Linux thing a shot.
And if this doesn't work, um,
302
00:19:48,750 --> 00:19:51,150
I'm just going to go back to
trying to code something myself.
303
00:19:51,510 --> 00:19:56,190
And when I ran the script,
I was using cloud code to do the
304
00:19:56,190 --> 00:20:00,030
installation process.
It ran the script and oh my gosh,
305
00:20:00,070 --> 00:20:05,350
it works just like that.
Uh, the tricky thing for all those
306
00:20:05,350 --> 00:20:10,310
who wants to know all the nitty
gritty, nitty gritty details, um, was
307
00:20:10,310 --> 00:20:13,750
that I don't think it was actually
struggling with transcription, but
308
00:20:13,750 --> 00:20:18,550
pasting Wayland makes life very hard,
and I think there was something not
309
00:20:18,550 --> 00:20:21,870
running in the right time anyway.
Deepgram I looked at how they
310
00:20:21,870 --> 00:20:24,710
actually handle that because it
worked out of the box when other
311
00:20:24,710 --> 00:20:29,140
stuff didn't, and it was quite a
clever little mechanism,
312
00:20:29,460 --> 00:20:32,100
and but more so than that,
the accuracy was brilliant.
313
00:20:32,140 --> 00:20:35,020
Now, what am I doing here?
This is going to be a 20 minute
314
00:20:35,260 --> 00:20:42,980
audio sample, and I'm I think
I've done 1 or 2 of these before,
315
00:20:42,980 --> 00:20:49,180
but I did it with short, snappy voice
notes. This is kind of long form.
316
00:20:49,460 --> 00:20:51,740
This actually might be a better
approximation for what's useful
317
00:20:51,740 --> 00:20:56,100
to me than voice memos.
Like I need to buy three liters
318
00:20:56,100 --> 00:20:59,180
of milk tomorrow, and pita bread,
which is probably how like half
319
00:20:59,180 --> 00:21:02,820
my voice voice notes sound like
if anyone were to, I don't know,
320
00:21:02,860 --> 00:21:04,580
like find my phone,
they'd be like, this is the most
321
00:21:04,580 --> 00:21:07,420
boring person in the world.
Although actually there are some
322
00:21:07,460 --> 00:21:09,700
like kind of, uh,
journaling thoughts as well.
323
00:21:09,700 --> 00:21:13,700
But it's a lot of content like that.
And the probably for the evaluation,
324
00:21:13,700 --> 00:21:20,660
the most useful thing is slightly
obscure tech GitHub uh, hugging face
325
00:21:21,180 --> 00:21:24,660
not so obscure that it's not going
to have a chance of knowing it,
326
00:21:24,660 --> 00:21:27,640
but hopefully sufficiently well
known that the model should get it.
327
00:21:28,200 --> 00:21:30,760
I tried to do a little bit of
speaking really fast and
328
00:21:30,760 --> 00:21:33,200
speaking very slowly.
I would say in general,
329
00:21:33,200 --> 00:21:36,880
I've spoken, delivered this at a
faster pace than I usually would
330
00:21:36,920 --> 00:21:40,280
owing to strong coffee flowing
through my bloodstream.
331
00:21:40,920 --> 00:21:44,200
And the thing that I'm not going
to get in this benchmark is
332
00:21:44,200 --> 00:21:46,880
background noise, which in my first
take that I had to get rid of,
333
00:21:47,680 --> 00:21:51,240
my wife came in with my son and
for a good night kiss.
334
00:21:51,440 --> 00:21:55,120
And that actually would have
been super helpful to get in
335
00:21:55,120 --> 00:21:59,760
because it was not diarised.
Or if we had diarisation a female,
336
00:21:59,880 --> 00:22:02,280
I could say I want the male
voice and that wasn't intended
337
00:22:02,280 --> 00:22:05,280
for transcription.
Um, and we're not going to get
338
00:22:05,280 --> 00:22:06,960
background noise like people
honking their horns,
339
00:22:06,960 --> 00:22:11,280
which is something I've done in my
main data set where I am trying to
340
00:22:11,440 --> 00:22:15,520
go back to some of my voice notes,
annotate them, and run a benchmark.
341
00:22:15,520 --> 00:22:18,960
But this is going to be just a
pure quick test.
342
00:22:19,440 --> 00:22:23,880
And as someone I'm working on a
voice note idea,
343
00:22:23,880 --> 00:22:28,230
that's my sort of end motivation.
Besides thinking it's an
344
00:22:28,230 --> 00:22:31,590
absolutely outstanding technology
that's coming to viability.
345
00:22:31,590 --> 00:22:34,670
And really, I know this sounds
cheesy can actually have a very
346
00:22:34,670 --> 00:22:38,830
transformative effect.
Um, it's, you know, voice technology
347
00:22:38,870 --> 00:22:44,910
has been life changing for, uh,
folks living with, um, disabilities.
348
00:22:45,630 --> 00:22:48,550
And I think there's something
really nice about the fact that
349
00:22:48,550 --> 00:22:52,710
it can also benefit, you know,
folks who are able bodied and like,
350
00:22:52,750 --> 00:22:58,950
we can all in different ways, um,
make this tech as useful as possible,
351
00:22:58,990 --> 00:23:01,110
regardless of the exact way that
we're using it.
352
00:23:01,510 --> 00:23:04,710
Um, and I think there's something
very powerful in that, and it can be
353
00:23:04,710 --> 00:23:08,910
very cool. Um, I see use potential.
What excites me about voice tech?
354
00:23:09,750 --> 00:23:13,550
A lot of things, actually.
Firstly, the fact that it's cheap
355
00:23:13,550 --> 00:23:17,110
and accurate, as I mentioned at
the very start of this, um,
356
00:23:17,110 --> 00:23:20,790
and it's getting better and better
with stuff like accent handling, um,
357
00:23:20,790 --> 00:23:24,180
I'm not sure my, my fine tune will
actually ever come to fruition in the
358
00:23:24,180 --> 00:23:27,860
sense that I'll use it day to day,
as I imagine I get like superb,
359
00:23:27,860 --> 00:23:33,540
flawless word error rates because I'm
just kind of skeptical about local
360
00:23:33,540 --> 00:23:38,100
speech to texts, as I mentioned.
And I think the pace of innovation
361
00:23:38,100 --> 00:23:42,060
and improvement in the models,
the main reasons for fine tuning from
362
00:23:42,060 --> 00:23:46,340
what I've seen have been people who
are something that really blows,
363
00:23:46,380 --> 00:23:52,940
blows my mind about ASR is the idea
that it's inherently a lingual
364
00:23:52,940 --> 00:23:59,100
or multilingual phonetic based.
So as folks who use speak very
365
00:23:59,140 --> 00:24:02,220
obscure languages that there may
be there might be a paucity of
366
00:24:02,220 --> 00:24:05,500
training data or almost none at all,
and therefore the accuracy is
367
00:24:05,500 --> 00:24:10,660
significantly reduced or folks
in very critical environments.
368
00:24:10,700 --> 00:24:13,380
I know there are.
This is used extensively in medical
369
00:24:13,380 --> 00:24:18,140
transcription and dispatcher work as,
um, you know, the call centers who
370
00:24:18,140 --> 00:24:22,490
send out ambulances, etc., where
accuracy is absolutely paramount.
371
00:24:22,490 --> 00:24:26,050
And in the case of doctors,
radiologists, they might be using
372
00:24:26,050 --> 00:24:29,610
very specialized vocab all the time.
So those are kind of the main
373
00:24:29,610 --> 00:24:31,530
two things.
And I'm not sure that really just for
374
00:24:31,530 --> 00:24:37,290
trying to make it better on a few
random tech words with my slightly.
375
00:24:37,330 --> 00:24:41,250
I mean, I have an accent, but like,
not, you know, an accent that a few
376
00:24:41,290 --> 00:24:47,210
other million people have. Ish.
I'm not sure that my little fine
377
00:24:47,210 --> 00:24:52,250
tune is going to actually like the
bump in word error rate reduction.
378
00:24:52,250 --> 00:24:54,570
If I ever actually figure out how
to do it and get it up to the
379
00:24:54,570 --> 00:24:58,610
cloud by the time I've done that.
I suspect that the next
380
00:24:58,610 --> 00:25:01,410
generation of ASR will just be
so good that it will kind of be.
381
00:25:01,930 --> 00:25:03,770
Ah, well,
that would be cool if it worked out,
382
00:25:03,770 --> 00:25:08,730
but I'll just use this instead.
So that's going to be it for today's
383
00:25:08,730 --> 00:25:14,130
episode of, uh, voice training data.
Single long shot evaluation.
384
00:25:14,410 --> 00:25:17,330
Who am I going to compare?
Whisper is always good as a
385
00:25:17,330 --> 00:25:20,600
benchmark, but I'm more
interested in seeing Whisperer
386
00:25:20,600 --> 00:25:25,080
head to head with two things,
really. One is whisper variance.
387
00:25:25,080 --> 00:25:29,880
So you've got these projects like
faster Whisper, Still whisper.
388
00:25:29,880 --> 00:25:31,640
It's a bit confusing.
There's a whole bunch of them
389
00:25:31,920 --> 00:25:34,800
and the emerging acers,
which are also a thing.
390
00:25:35,200 --> 00:25:37,680
My intention for this is I'm not
sure I'm going to have the time
391
00:25:37,680 --> 00:25:41,640
in any point in the foreseeable
future to go back through this whole
392
00:25:41,640 --> 00:25:46,560
episode and create a proper source,
truth or a fix.
393
00:25:47,320 --> 00:25:51,680
Everything might do it if I can
get one transcription that
394
00:25:51,680 --> 00:25:56,720
sufficiently close to perfection.
But what I would actually love
395
00:25:56,720 --> 00:25:59,800
to do on Hugging Face I think
would be a great.
396
00:25:59,800 --> 00:26:03,560
Probably how I might visualize this
is having the audio waveform play,
397
00:26:04,040 --> 00:26:09,800
and then have the transcript for each
model below it, and maybe even a,
398
00:26:10,480 --> 00:26:15,120
um, like, you know, two scale and
maybe even a local one as well,
399
00:26:15,160 --> 00:26:21,700
like local whisper versus open
AI API, Etc. and, um, I can then
400
00:26:21,700 --> 00:26:24,380
actually listen back to segments
or anyone who wants to can listen
401
00:26:24,380 --> 00:26:29,420
back to segments of this recording
and see where a particular model
402
00:26:29,460 --> 00:26:32,940
struggled and others didn't, as well
as the sort of headline finding
403
00:26:32,980 --> 00:26:36,780
of which had the best, uh, wer.
But that would require the source
404
00:26:36,780 --> 00:26:40,020
of truth. Okay. That's it.
Hope this was, I don't know,
405
00:26:40,180 --> 00:26:43,460
maybe useful for other folks
interested in stuff you want to see.
406
00:26:43,940 --> 00:26:48,100
I always feel think I've just said
something I didn't intend to say.
407
00:26:48,660 --> 00:26:51,020
I said for those, listen carefully.
Including, hopefully,
408
00:26:51,020 --> 00:26:54,060
the models themselves.
This has been myself,
409
00:26:54,100 --> 00:26:57,900
Daniel Rosehill, for more, um,
jumbled repositories about my,
410
00:26:57,940 --> 00:27:00,820
uh, roving interest in AI,
but particularly Agentic,
411
00:27:01,180 --> 00:27:05,340
MCP and voice tech.
Uh, you can find me on GitHub.
412
00:27:05,820 --> 00:27:11,140
Hugging face. Where else?
Daniel, which is my personal website,
413
00:27:11,140 --> 00:27:15,260
as well as this podcast whose
name I sadly cannot remember.
414
00:27:15,700 --> 00:27:17,420
Until next time.
Thanks for listening.