Spaces:

danielrosehill
/

STT-Comparison

Running

App Files Files Community

STT-Comparison / srt-out /speechmatics.srt

danielrosehill

Fix SRT timestamp alignment with ground truth

0aa8adc about 1 month ago

raw

history blame contribute delete

39.9 kB

	1
	00:00:00,000 --> 00:00:06,400
	Hello and welcome to a audio data
	set consisting of one single

	2
	00:00:06,400 --> 00:00:12,000
	episode of a non-existent podcast.
	Or it, uh, I may append this to a

	3
	00:00:12,000 --> 00:00:16,520
	podcast that I set up recently.
	Um, regarding my, uh,

	4
	00:00:16,560 --> 00:00:21,840
	with my thoughts on speech,
	tech and AI in particular,

	5
	00:00:22,120 --> 00:00:27,840
	more AI and generative AI, I would,
	uh, I would say, but in any event,

	6
	00:00:27,840 --> 00:00:32,360
	the purpose of this, um,
	voice recording is actually to create

	7
	00:00:32,560 --> 00:00:37,440
	a lengthy voice sample for a quick
	evaluation, a back of the envelope

	8
	00:00:37,440 --> 00:00:41,040
	evaluation, as they might say,
	for different speech to text models.

	9
	00:00:41,040 --> 00:00:43,680
	And I'm doing this because I,
	uh, I thought I'd made a great

	10
	00:00:43,680 --> 00:00:48,200
	breakthrough in my journey with
	speech tech, and that was succeeding

	11
	00:00:48,200 --> 00:00:52,600
	in the elusive task of fine tuning.
	Whisper, whisper is.

	12
	00:00:52,720 --> 00:00:56,840
	And I'm going to just talk.
	I'm trying to mix up, uh,

	13
	00:00:56,840 --> 00:01:00,350
	I'm going to try a few different
	styles of speaking.

	14
	00:01:00,350 --> 00:01:02,510
	I might whisper something at
	some point as well,

	15
	00:01:03,070 --> 00:01:07,030
	and I'll go back to speaking loud in,
	uh, in different parts.

	16
	00:01:07,030 --> 00:01:09,590
	I'm going to sound really like a
	crazy person, because I'm also

	17
	00:01:09,590 --> 00:01:15,750
	going to try to speak at different
	pitches and cadences in order to

	18
	00:01:15,790 --> 00:01:20,510
	really try to put a speech to
	text model through its paces,

	19
	00:01:20,510 --> 00:01:25,750
	which is trying to make sense of,
	is this guy just on incoherently in

	20
	00:01:25,750 --> 00:01:34,230
	one long sentence, or are these just
	actually a series of step standalone,

	21
	00:01:34,230 --> 00:01:37,390
	standalone, standalone sentences?
	And how is it going to handle

	22
	00:01:37,390 --> 00:01:40,630
	step alone? That's not a word.
	Uh, what happens when you use

	23
	00:01:40,630 --> 00:01:43,910
	speech to text and you use a fake
	word and then you're like, wait,

	24
	00:01:43,910 --> 00:01:48,230
	that's not actually that word doesn't
	exist. How does AI handle that?

	25
	00:01:48,270 --> 00:01:53,790
	And, uh, these and more are all
	the questions that I'm seeking

	26
	00:01:53,790 --> 00:01:57,230
	to answer in this training data.
	Now, why did why was it trying

	27
	00:01:57,230 --> 00:01:59,620
	to fine tune a whisper?
	And what is whisper?

	28
	00:01:59,660 --> 00:02:03,420
	As I said, I'm gonna try to, uh,
	record this at a couple of different

	29
	00:02:03,420 --> 00:02:08,940
	levels of technicality for folks who
	are, uh, you know, in the normal, uh,

	30
	00:02:08,940 --> 00:02:13,340
	world and not totally stuck down
	the rabbit hole of AI, uh, which I

	31
	00:02:13,340 --> 00:02:17,340
	have to say is a really wonderful,
	uh, rabbit hole to be to be down.

	32
	00:02:17,460 --> 00:02:21,580
	Um, it's a really interesting area.
	And speech and voice tech is is

	33
	00:02:21,820 --> 00:02:24,860
	the aspect of it that I find
	actually most.

	34
	00:02:25,060 --> 00:02:28,220
	I'm not sure I would say the most
	interesting, because there's just

	35
	00:02:28,220 --> 00:02:32,580
	so much that is fascinating in AI.
	Uh, but the most that I find the

	36
	00:02:32,580 --> 00:02:36,100
	most personally transformative
	in terms of the impact that it's

	37
	00:02:36,100 --> 00:02:41,540
	had on my daily work life and
	productivity and how I sort of work.

	38
	00:02:41,820 --> 00:02:47,900
	And I'm persevering hard with the
	task of trying to guess a good

	39
	00:02:47,900 --> 00:02:51,580
	solution working for Linux, which if
	anyone actually does listen to this,

	40
	00:02:51,580 --> 00:02:54,980
	not just for the training data
	and for the actual content, uh,

	41
	00:02:55,020 --> 00:02:59,480
	this is this is has sparked I had
	besides the fine tune not working.

	42
	00:02:59,480 --> 00:03:05,440
	Well, that was the failure.
	Um, I used clod code because one

	43
	00:03:05,440 --> 00:03:10,040
	thinks these days that there is
	nothing short of solving,

	44
	00:03:10,920 --> 00:03:14,560
	you know, the, uh,
	the reason of life or something.

	45
	00:03:14,960 --> 00:03:19,440
	Uh, that clod and agentic AI can't
	do, uh, which is not really the case.

	46
	00:03:19,480 --> 00:03:23,480
	Uh, it does seem that way sometimes,
	but it fails a lot as well.

	47
	00:03:23,480 --> 00:03:26,840
	And this is one of those, uh,
	instances where last week I put

	48
	00:03:26,840 --> 00:03:31,280
	together an hour of voice training
	data, basically speaking just

	49
	00:03:31,280 --> 00:03:34,920
	random things for three minutes.
	And, um,

	50
	00:03:35,600 --> 00:03:38,400
	it was actually kind of tedious
	because the texts were really weird.

	51
	00:03:38,400 --> 00:03:42,000
	Some of them were it was like it
	was AI generated.

	52
	00:03:42,200 --> 00:03:44,800
	Um, I tried before to read
	Sherlock Holmes for an hour and

	53
	00:03:44,800 --> 00:03:46,880
	I just couldn't.
	I was so bored, uh,

	54
	00:03:46,920 --> 00:03:50,680
	after ten minutes that I was like,
	okay, now I'm just gonna have to

	55
	00:03:50,680 --> 00:03:56,350
	find something else to read.
	So I used a created with AI

	56
	00:03:56,390 --> 00:04:00,030
	studio vibe coded.
	A synthetic text generator.

	57
	00:04:00,270 --> 00:04:03,870
	Um, which actually I thought was
	probably a better way of doing it

	58
	00:04:03,870 --> 00:04:08,750
	because it would give me more short
	samples with more varied content.

	59
	00:04:08,750 --> 00:04:13,190
	So I was like, okay, give me a voice
	note, like I'm recording an email,

	60
	00:04:13,190 --> 00:04:17,990
	give me a short story to read,
	give me prose, um, to read.

	61
	00:04:17,990 --> 00:04:21,190
	So I came up with all these
	different things, and I added a

	62
	00:04:21,190 --> 00:04:24,630
	little timer to it so I could
	see how close I was to one hour.

	63
	00:04:24,870 --> 00:04:29,710
	Um, and, uh, I spent like an hour one
	afternoon or probably two hours by

	64
	00:04:29,710 --> 00:04:34,070
	the time you, um, you do retakes
	or whatever because you want to.

	65
	00:04:34,870 --> 00:04:39,070
	It gave me a source of truth,
	which I'm not sure if that's the

	66
	00:04:39,070 --> 00:04:43,430
	scientific way to approach this topic
	of gathering, uh, training data,

	67
	00:04:43,430 --> 00:04:47,950
	but I thought it made sense.
	Um, I have a lot of audio data

	68
	00:04:47,950 --> 00:04:51,950
	from recording voice notes,
	which I've also kind of used, um,

	69
	00:04:51,950 --> 00:04:55,660
	been experimenting with using for
	a different purpose, slightly

	70
	00:04:55,660 --> 00:05:00,700
	different annotating task types.
	It's more text classification

	71
	00:05:00,700 --> 00:05:03,620
	experiment or uh, well,
	it's more than that, actually.

	72
	00:05:03,620 --> 00:05:07,980
	I'm working on a voice app,
	so it's a prototype I guess is

	73
	00:05:07,980 --> 00:05:12,660
	really more accurate.
	Um, but you can do that and you

	74
	00:05:12,660 --> 00:05:14,100
	can work backwards.
	You're like,

	75
	00:05:14,140 --> 00:05:18,500
	you listen back to a voice note
	and you painfully go through one

	76
	00:05:18,500 --> 00:05:21,860
	of those transcribing, you know,
	where you start and stop and scrub

	77
	00:05:21,860 --> 00:05:23,980
	around it and you fix the errors.
	But it's really,

	78
	00:05:23,980 --> 00:05:27,100
	really boring to do that.
	So I thought it would be less

	79
	00:05:27,100 --> 00:05:31,740
	tedious in the long term if I just
	recorded The Source of truth.

	80
	00:05:32,060 --> 00:05:34,180
	So it gave me these three minute
	snippets.

	81
	00:05:34,180 --> 00:05:38,660
	I recorded them and saved an MP3
	and a txt in the same folder,

	82
	00:05:38,660 --> 00:05:43,700
	and I created an hour of that data.
	Uh, so I was very hopeful, quietly,

	83
	00:05:43,740 --> 00:05:46,260
	you know, a little bit hopeful
	that I would be able that I could

	84
	00:05:46,260 --> 00:05:49,580
	actually fine tune, whisper.
	Um, I want to fine tune whisper

	85
	00:05:49,580 --> 00:05:54,720
	because when I got into voice tech
	last November, my wife was in

	86
	00:05:54,720 --> 00:05:59,480
	the US and I was alone at home.
	And you know, when crazy people

	87
	00:05:59,480 --> 00:06:03,640
	like me do really wild things like
	use voice to tech, uh, technology.

	88
	00:06:03,640 --> 00:06:06,400
	That was basically, um,
	when I started doing it,

	89
	00:06:06,400 --> 00:06:10,160
	I didn't feel like a crazy person
	speaking to myself, and my

	90
	00:06:10,160 --> 00:06:16,000
	expectations weren't that high.
	Uh, I used speech tech now and again.

	91
	00:06:16,080 --> 00:06:18,360
	Um, tried it out.
	I was like, it'd be really cool

	92
	00:06:18,360 --> 00:06:20,400
	if you could just, like,
	speak into your computer.

	93
	00:06:20,760 --> 00:06:24,600
	And whatever I tried out that
	had Linux support was just.

	94
	00:06:25,320 --> 00:06:28,520
	It was not good, basically.
	Um, and this blew me away from

	95
	00:06:28,520 --> 00:06:31,920
	the first go.
	I mean, it wasn't 100% accurate

	96
	00:06:31,960 --> 00:06:35,040
	out of the box and it took work,
	but it was good enough that there was

	97
	00:06:35,040 --> 00:06:39,600
	a solid foundation and it kind of
	passed that, uh, pivot point that

	98
	00:06:39,600 --> 00:06:42,760
	it's actually worth doing this.
	You know, there's a point where

	99
	00:06:42,760 --> 00:06:46,800
	it's so like the transcript is you
	don't have to get 100% accuracy

	100
	00:06:46,800 --> 00:06:50,510
	for it to be worth your time for
	speech to text to be a worthwhile

	101
	00:06:50,510 --> 00:06:52,950
	addition to your productivity.
	But you do need to get above.

	102
	00:06:52,990 --> 00:06:57,630
	Let's say, I don't know, 85%.
	If it's 60% or 50%,

	103
	00:06:57,630 --> 00:07:00,670
	you inevitably say, screw it.
	I'll just type it because you end up

	104
	00:07:00,670 --> 00:07:04,950
	missing errors in the transcript
	and it becomes actually worse.

	105
	00:07:04,950 --> 00:07:06,710
	You end up in a worse position
	than you started with.

	106
	00:07:06,710 --> 00:07:10,910
	And that's been my experience.
	So, um, I was like, oh,

	107
	00:07:10,950 --> 00:07:13,430
	this is actually really, really good.
	Now how did that happen?

	108
	00:07:13,430 --> 00:07:18,790
	And the answer is ASR whisper
	being open sourced and the

	109
	00:07:18,790 --> 00:07:21,790
	transformer architecture,
	if you want to go back to the,

	110
	00:07:22,390 --> 00:07:26,630
	um, to the underpinnings, which
	really blows my mind and it's on my

	111
	00:07:26,630 --> 00:07:32,310
	list to read through that paper.
	Um, all you need is attention as

	112
	00:07:33,350 --> 00:07:38,350
	attentively as can be done with my
	limited brain because it's super,

	113
	00:07:38,350 --> 00:07:42,190
	super high level stuff.
	Um, super advanced stuff.

	114
	00:07:42,230 --> 00:07:47,950
	I mean, uh, but that I think of all
	the things that are fascinating

	115
	00:07:48,060 --> 00:07:52,700
	about the sudden rise in AI and
	the dramatic capabilities.

	116
	00:07:53,300 --> 00:07:55,580
	I find it fascinating that few
	people are like, hang on,

	117
	00:07:55,740 --> 00:07:59,620
	you've got this thing that can speak
	to you like a chatbot, an LLM,

	118
	00:08:00,300 --> 00:08:05,460
	and then you've got image generation.
	Okay, so firstly, those two things on

	119
	00:08:05,460 --> 00:08:10,740
	the surface have nothing in common.
	Um, so like how are they how did that

	120
	00:08:10,740 --> 00:08:12,980
	just happen all at the same time.
	And then when you extend that

	121
	00:08:12,980 --> 00:08:16,060
	further, um, you're like sooner,
	right?

	122
	00:08:16,060 --> 00:08:21,580
	You can sing a song and AI will like,
	come up with an instrumental and then

	123
	00:08:21,580 --> 00:08:23,740
	you've got whisper and you're like,
	wait a second,

	124
	00:08:23,940 --> 00:08:27,980
	how did all this stuff, like,
	if it's all AI, what's like there

	125
	00:08:27,980 --> 00:08:30,580
	has to be some commonality.
	Otherwise these are four.

	126
	00:08:30,660 --> 00:08:34,660
	These are totally different
	technologies on the surface of it.

	127
	00:08:34,660 --> 00:08:40,100
	And, uh, the transformer architecture
	is, as far as I know, the answer.

	128
	00:08:40,100 --> 00:08:43,740
	And I can't even say can't even
	pretend that I really understand

	129
	00:08:44,020 --> 00:08:47,170
	what the transformer
	architecture means in depth,

	130
	00:08:47,170 --> 00:08:51,690
	but I have scanned it and as I said,
	I want to print it and really kind

	131
	00:08:51,690 --> 00:08:56,650
	of think over it at some point,
	and I'll probably feel bad about

	132
	00:08:56,650 --> 00:08:58,970
	myself, I think,
	because weren't those guys in their

	133
	00:08:59,010 --> 00:09:03,890
	in their 20s like, that's crazy.
	I think I asked ChatGPT once who

	134
	00:09:03,930 --> 00:09:08,250
	were the who wrote that paper
	and how old were they when it

	135
	00:09:08,250 --> 00:09:11,170
	was published in arXiv?
	And I was expecting like,

	136
	00:09:11,410 --> 00:09:13,330
	I don't know,
	what do you what do you imagine?

	137
	00:09:13,330 --> 00:09:14,930
	I personally imagine kind of like,
	you know,

	138
	00:09:14,970 --> 00:09:19,090
	you have these breakthroughs during
	Covid and things like that where

	139
	00:09:19,130 --> 00:09:22,090
	like these kind of really obscure
	scientists who are like in their

	140
	00:09:22,090 --> 00:09:27,130
	50s and they've just kind of been
	laboring in labs and, uh, wearily

	141
	00:09:27,130 --> 00:09:30,530
	and writing in publishing in kind
	of obscure academic publications.

	142
	00:09:30,730 --> 00:09:33,930
	And they finally, like,
	hit a big or win a Nobel Prize and

	143
	00:09:33,930 --> 00:09:37,810
	then their household household names.
	Uh, so that was kind of what I

	144
	00:09:37,810 --> 00:09:39,650
	had in mind.
	That was the mental image I'd

	145
	00:09:39,650 --> 00:09:43,890
	formed of the birth of arXiv.
	Like, I wasn't expecting 20

	146
	00:09:43,930 --> 00:09:47,310
	somethings in San Francisco,
	though I thought that was both very,

	147
	00:09:47,310 --> 00:09:49,870
	very funny, very cool,
	and actually kind of inspiring.

	148
	00:09:50,390 --> 00:09:55,510
	It's nice to think that people who,
	you know, just you might put them

	149
	00:09:55,510 --> 00:10:00,910
	in the kind of milieu or bubble or
	world that you are in or credibly in,

	150
	00:10:00,950 --> 00:10:03,590
	through, you know,
	a series of connections that are

	151
	00:10:03,590 --> 00:10:07,630
	coming up with such literally
	world changing, um, innovations.

	152
	00:10:07,670 --> 00:10:11,430
	Uh, so that was, I thought,
	anyway, that, that that was cool.

	153
	00:10:12,070 --> 00:10:13,950
	Okay. Voice training data.
	How are we doing?

	154
	00:10:13,950 --> 00:10:17,990
	We're about ten minutes, and I'm
	still talking about voice technology.

	155
	00:10:18,190 --> 00:10:22,350
	Um, so whisper was brilliant,
	and I was so excited that I was.

	156
	00:10:22,350 --> 00:10:25,630
	My first instinct was to, like,
	get like, oh, my gosh,

	157
	00:10:25,630 --> 00:10:27,710
	I have to get, like,
	a really good microphone for this.

	158
	00:10:27,950 --> 00:10:31,630
	So, um, I didn't go on a
	spending spree because I said,

	159
	00:10:31,670 --> 00:10:34,470
	I'm gonna have to just wait a
	month and see if I still use this.

	160
	00:10:34,910 --> 00:10:39,990
	And it just kind of became it's
	become really part of my daily

	161
	00:10:39,990 --> 00:10:42,990
	routine.
	Like, if I'm writing an email,

	162
	00:10:42,990 --> 00:10:47,020
	I'll record a voice note.
	And then I've developed and it's

	163
	00:10:47,020 --> 00:10:49,900
	nice to see that everyone is
	like developing the same things

	164
	00:10:49,900 --> 00:10:51,900
	in parallel.
	Like, that's kind of a weird thing

	165
	00:10:51,940 --> 00:10:57,340
	to say, but when I look, I kind of
	came when I started working on this,

	166
	00:10:57,380 --> 00:11:00,700
	these prototypes on GitHub,
	which is where I just kind of

	167
	00:11:00,740 --> 00:11:04,740
	share very freely and loosely,
	uh, ideas and, you know,

	168
	00:11:04,780 --> 00:11:10,020
	first iterations on, on concepts,
	um, and for want of a better word,

	169
	00:11:10,020 --> 00:11:13,900
	I called it like, uh,
	lm post-processing or cleanup or

	170
	00:11:14,140 --> 00:11:18,100
	basically a system prompt that after
	you get back the raw text from

	171
	00:11:18,420 --> 00:11:24,100
	whisper, you run it through a model
	and say, okay, this is crappy text,

	172
	00:11:24,140 --> 00:11:27,140
	like add sentence structure and,
	you know, fix it up.

	173
	00:11:27,580 --> 00:11:32,660
	And, um, now when I'm exploring the
	different tools that are out there

	174
	00:11:32,700 --> 00:11:36,580
	that people have built, I see, uh,
	quite a number of projects have

	175
	00:11:37,180 --> 00:11:41,700
	basically done the same thing,
	um, less that be misconstrued.

	176
	00:11:41,700 --> 00:11:44,370
	I'm not saying for a millisecond
	that I inspired them.

	177
	00:11:44,370 --> 00:11:48,890
	I'm sure this has been a thing that's
	been integrated into tools for a

	178
	00:11:48,930 --> 00:11:52,290
	while, but it's it's the kind of
	thing that when you start using these

	179
	00:11:52,290 --> 00:11:56,730
	tools every day, the need for it
	is almost instantly apparent, uh,

	180
	00:11:56,730 --> 00:12:00,770
	because text that doesn't have any
	punctuation or paragraph spacing

	181
	00:12:00,810 --> 00:12:04,250
	takes a long time to, you know,
	it takes so long to get it into

	182
	00:12:04,250 --> 00:12:09,370
	a presentable email that again,
	it's it's it moves speech tech

	183
	00:12:09,410 --> 00:12:12,930
	into that before that inflection
	point where you're like, no,

	184
	00:12:12,930 --> 00:12:16,250
	it's just not worth it.
	It's like it'll just be quicker

	185
	00:12:16,250 --> 00:12:18,850
	to type this.
	So it's a big it's a little touch.

	186
	00:12:18,850 --> 00:12:24,090
	That actually is a big deal.
	Uh, so I was on whisper and I've

	187
	00:12:24,090 --> 00:12:28,170
	been using whisper and I kind of
	early on found a couple of tools.

	188
	00:12:28,210 --> 00:12:30,930
	I couldn't find what I was
	looking for on Linux, which is,

	189
	00:12:31,370 --> 00:12:35,770
	um, basically just something
	that'll run in the background.

	190
	00:12:35,810 --> 00:12:40,130
	You'll give it an API key and it
	will just transcribe. Um.

	191
	00:12:41,280 --> 00:12:44,000
	with, like, a little key to
	start and stop the dictation.

	192
	00:12:44,600 --> 00:12:49,040
	Uh, and the issues were I discovered
	that, like most people involved in

	193
	00:12:49,040 --> 00:12:53,920
	creating these projects were very
	much focused on local models running

	194
	00:12:53,920 --> 00:12:57,400
	whisper locally, because you can.
	And I tried that a bunch of

	195
	00:12:57,400 --> 00:13:00,840
	times and just never got results
	that were as good as the cloud.

	196
	00:13:01,160 --> 00:13:04,640
	And when I began looking at the
	cost of the speech to text APIs

	197
	00:13:04,640 --> 00:13:08,520
	and what I was spending,
	I just thought there's it's actually,

	198
	00:13:08,720 --> 00:13:13,200
	in my opinion, just one of the better
	deals in API spending and in cloud.

	199
	00:13:13,240 --> 00:13:17,280
	Like it's just not that expensive
	for very, very good models that are

	200
	00:13:17,400 --> 00:13:20,840
	much more, you know, you're going
	to be able to run the full model,

	201
	00:13:21,360 --> 00:13:25,960
	the latest model versus whatever
	you can run on your average GPU.

	202
	00:13:26,000 --> 00:13:29,760
	Unless you want to buy a crazy GPU.
	It doesn't really make sense to me.

	203
	00:13:29,760 --> 00:13:33,480
	Now, privacy is another concern.
	Um, that I know is kind of like a

	204
	00:13:33,520 --> 00:13:36,920
	very much a separate thing that
	people just don't want their voice,

	205
	00:13:36,920 --> 00:13:39,790
	data, and their voice leaving
	their local environment,

	206
	00:13:40,110 --> 00:13:43,830
	maybe for regulatory reasons as well.
	Um, but I'm not in that.

	207
	00:13:43,910 --> 00:13:47,910
	Um, I'm neither really care about
	people listening to my, uh,

	208
	00:13:47,950 --> 00:13:51,190
	grocery list consisting of, uh,
	reminding myself that I need to

	209
	00:13:51,230 --> 00:13:54,790
	buy more beer, Cheetos and hummus,
	which is kind of the three,

	210
	00:13:54,990 --> 00:13:59,310
	three staples of my diet.
	Um, during periods of poor nutrition.

	211
	00:13:59,590 --> 00:14:03,310
	Uh, but the kind of stuff that I
	transcribe, it's just not it's not a,

	212
	00:14:03,990 --> 00:14:09,350
	it's not a privacy thing and that
	sort of sensitive about and, uh,

	213
	00:14:09,350 --> 00:14:13,070
	I don't do anything so,
	you know, sensitive or secure,

	214
	00:14:13,070 --> 00:14:16,590
	that requires air gapping.
	So, um, I looked at the pricing and

	215
	00:14:16,590 --> 00:14:20,270
	especially the kind of older models,
	mini, um, some of them are very,

	216
	00:14:20,270 --> 00:14:23,110
	very affordable.
	And I did a back of the I did a

	217
	00:14:23,110 --> 00:14:27,150
	calculation once with ChatGPT
	and I was like, okay, this is a,

	218
	00:14:27,150 --> 00:14:31,070
	this is the API price for I can't
	remember whatever the model was.

	219
	00:14:31,550 --> 00:14:33,910
	Uh, let's say I just go at it
	like nonstop,

	220
	00:14:34,030 --> 00:14:37,410
	which it rarely happens. Probably.
	I would say on average,

	221
	00:14:37,410 --> 00:14:41,890
	I might dictate 30 to 60 minutes per
	day if I was probably summing up

	222
	00:14:41,890 --> 00:14:48,490
	the emails, documents, outlines,
	um, which is a lot, but it's it's

	223
	00:14:48,490 --> 00:14:50,730
	still a fairly modest amount.
	And I was like, well,

	224
	00:14:50,770 --> 00:14:53,930
	some days I do go on like 1 or 2
	days where I've been.

	225
	00:14:54,450 --> 00:14:58,450
	Usually when I'm like kind of out of
	the house and just have something

	226
	00:14:59,090 --> 00:15:02,250
	like, I have nothing else to do.
	Like if I'm at a hospital with a

	227
	00:15:02,250 --> 00:15:06,970
	newborn, uh, and you're waiting
	for like eight hours and hours

	228
	00:15:06,970 --> 00:15:10,210
	for an appointment, and I would
	probably have listened to podcasts

	229
	00:15:10,490 --> 00:15:14,010
	before becoming a speech fanatic.
	And I'm like, oh, wait,

	230
	00:15:14,050 --> 00:15:16,370
	let me just get down.
	Let me just get these ideas out

	231
	00:15:16,410 --> 00:15:19,610
	of my head.
	And that's when I'll go on my

	232
	00:15:19,650 --> 00:15:21,530
	speech binges.
	But those are like once every

	233
	00:15:21,530 --> 00:15:24,970
	few months, like not frequently.
	But I said, okay, let's just say

	234
	00:15:24,970 --> 00:15:30,650
	if I'm gonna price out.
	Cloud asked if I was like, dedicated

	235
	00:15:30,650 --> 00:15:36,880
	every second of every waking hour to
	transcribing for some odd reason. Um.

	236
	00:15:37,200 --> 00:15:39,680
	I mean, it'd have to, like,
	eat and use the toilet and,

	237
	00:15:39,720 --> 00:15:42,520
	like, you know, there's only so
	many hours I'm awake for.

	238
	00:15:42,520 --> 00:15:44,680
	So, like,
	let's just say a maximum of, like,

	239
	00:15:44,720 --> 00:15:48,680
	40 hours, 45 minutes in the hour.
	Then I said, all right,

	240
	00:15:48,680 --> 00:15:52,600
	let's just say 50. Who knows?
	You're dictating on the toilet.

	241
	00:15:52,640 --> 00:15:53,880
	We do it.
	Uh,

	242
	00:15:53,880 --> 00:15:58,720
	so it could be you could just do 60.
	But whatever I did, and every day,

	243
	00:15:58,760 --> 00:16:02,440
	like, you're going flat out seven
	days a week dictating non-stop.

	244
	00:16:02,480 --> 00:16:06,440
	I was like, what's my monthly API
	bill going to be at this price?

	245
	00:16:06,720 --> 00:16:09,120
	And it came out to like 70 or 80
	bucks.

	246
	00:16:09,120 --> 00:16:14,080
	And I was like, well, that would be
	an extraordinary amount of dictation.

	247
	00:16:14,080 --> 00:16:17,840
	And I would hope that there was
	some compelling reason,

	248
	00:16:18,040 --> 00:16:22,200
	more worth more than $70,
	that I embarked upon that project.

	249
	00:16:22,400 --> 00:16:25,200
	Uh, so given that that's kind of the
	max point for me, I said, that's

	250
	00:16:25,240 --> 00:16:29,000
	actually very, very affordable.
	Um, now you're gonna if you want

	251
	00:16:29,040 --> 00:16:34,080
	to spec out the costs and you want
	to do the post-processing that I

	252
	00:16:34,150 --> 00:16:37,110
	really do feel is valuable.
	Um, that's going to cost some more as

	253
	00:16:37,110 --> 00:16:43,110
	well, unless you're using Gemini,
	which, uh, needless to say, is a

	254
	00:16:43,110 --> 00:16:46,950
	random person sitting in Jerusalem.
	Uh, I have no affiliation,

	255
	00:16:46,950 --> 00:16:51,350
	nor with Google, nor anthropic,
	nor Gemini, nor any major tech vendor

	256
	00:16:51,350 --> 00:16:56,790
	for that matter. Um, I like Gemini.
	Not so much as a everyday model.

	257
	00:16:56,870 --> 00:16:59,830
	Um, it's kind of underwhelmed in
	that respect, I would say.

	258
	00:17:00,230 --> 00:17:03,030
	But for multimodal,
	I think it's got a lot to offer.

	259
	00:17:03,310 --> 00:17:06,870
	And I think that the transcribing
	functionality whereby it can,

	260
	00:17:07,270 --> 00:17:12,150
	um, process audio with a system
	prompt and both give you

	261
	00:17:12,190 --> 00:17:15,390
	transcription that's cleaned up,
	that reduces two steps to one.

	262
	00:17:15,710 --> 00:17:18,630
	And that for me is a very,
	very big deal.

	263
	00:17:18,630 --> 00:17:22,990
	And, uh, I feel like even Google
	has haven't really sort of thought

	264
	00:17:22,990 --> 00:17:27,430
	through how useful the that
	modality is and what kind of use

	265
	00:17:27,430 --> 00:17:30,790
	cases you can achieve with it.
	Because I found in the course of

	266
	00:17:30,790 --> 00:17:36,490
	this year just an endless list
	of really kind of system prompt,

	267
	00:17:36,730 --> 00:17:41,290
	system prompt stuff that I can say,
	okay, I've used it to capture context

	268
	00:17:41,290 --> 00:17:45,570
	data for AI, which is literally I
	might speak for if I wanted to have a

	269
	00:17:45,570 --> 00:17:49,730
	good bank of context data about,
	who knows, my childhood.

	270
	00:17:50,010 --> 00:17:53,450
	Uh, more realistically,
	maybe my career goals, uh,

	271
	00:17:53,450 --> 00:17:56,010
	something that would just be,
	like, really boring to type out.

	272
	00:17:56,130 --> 00:18:01,130
	So I'll just, like, sit in my car
	and record it for ten minutes.

	273
	00:18:01,130 --> 00:18:04,090
	And that ten minutes,
	you get a lot of information in,

	274
	00:18:04,530 --> 00:18:10,090
	um, emails, which is short text.
	Um, just there is a whole bunch.

	275
	00:18:10,090 --> 00:18:13,570
	And all these workflows kind of
	require a little bit of treatment

	276
	00:18:13,570 --> 00:18:17,490
	afterwards and different treatment.
	My context pipeline is kind of like

	277
	00:18:17,490 --> 00:18:21,210
	just extract the bare essentials.
	So you end up with me talking very

	278
	00:18:21,210 --> 00:18:24,250
	loosely about sort of what I've done
	in my career, where I've worked,

	279
	00:18:24,250 --> 00:18:27,610
	where I might like to work,
	and it goes it condenses that

	280
	00:18:27,610 --> 00:18:31,600
	down to very robotic language
	that is easy to chunk, parse,

	281
	00:18:31,600 --> 00:18:35,960
	and maybe put into a vector database.
	Daniel has worked in technology,

	282
	00:18:36,000 --> 00:18:39,640
	Daniel is a has been working in,
	you know, stuff like that.

	283
	00:18:39,640 --> 00:18:43,600
	That's not how you would speak.
	Um, but I figure it's probably easier

	284
	00:18:43,600 --> 00:18:48,120
	to parse for, after all, robots.
	So we've almost got to 20 minutes.

	285
	00:18:48,120 --> 00:18:52,640
	And this is actually a success
	because I wasted 20 minutes of my,

	286
	00:18:52,800 --> 00:18:56,880
	uh, of the evening speaking into
	a microphone, and, uh,

	287
	00:18:56,920 --> 00:19:00,840
	the levels were shot and, uh, it,
	uh, it was clipping and I said,

	288
	00:19:00,840 --> 00:19:03,200
	I can't really do an evaluation.
	I have to be fair.

	289
	00:19:03,200 --> 00:19:07,000
	I have to give the models a
	chance to do their thing.

	290
	00:19:07,520 --> 00:19:09,360
	Uh,
	what am I hoping to achieve in this?

	291
	00:19:09,400 --> 00:19:12,600
	Okay, my fine tune was a dud,
	as mentioned Deepgram SVT.

	292
	00:19:12,640 --> 00:19:15,520
	I'm really, really hopeful that
	this prototype will work.

	293
	00:19:15,800 --> 00:19:18,960
	And it's a built in public open
	source, so anyone is welcome to

	294
	00:19:19,000 --> 00:19:22,920
	use it if I make anything good.
	Um, but that was really exciting for

	295
	00:19:22,920 --> 00:19:27,400
	me last night when after hours of,
	um, trying my own prototype,

	296
	00:19:27,400 --> 00:19:31,230
	seeing someone just made
	something that works like that.

	297
	00:19:31,270 --> 00:19:32,670
	You know,
	you're not going to have to build a

	298
	00:19:32,670 --> 00:19:38,230
	custom conda environment and image.
	I have AMD GPU, which makes

	299
	00:19:38,230 --> 00:19:42,310
	things much more complicated.
	I didn't find it and I was about

	300
	00:19:42,310 --> 00:19:43,990
	to give up and I said,
	all right, let me just give deep

	301
	00:19:43,990 --> 00:19:48,750
	grams Linux thing a shot.
	And if this doesn't work, um,

	302
	00:19:48,750 --> 00:19:51,150
	I'm just going to go back to
	trying to code something myself.

	303
	00:19:51,510 --> 00:19:56,190
	And when I ran the script,
	I was using cloud code to do the

	304
	00:19:56,190 --> 00:20:00,030
	installation process.
	It ran the script and oh my gosh,

	305
	00:20:00,070 --> 00:20:05,350
	it works just like that.
	Uh, the tricky thing for all those

	306
	00:20:05,350 --> 00:20:10,310
	who wants to know all the nitty
	gritty, nitty gritty details, um, was

	307
	00:20:10,310 --> 00:20:13,750
	that I don't think it was actually
	struggling with transcription, but

	308
	00:20:13,750 --> 00:20:18,550
	pasting Wayland makes life very hard,
	and I think there was something not

	309
	00:20:18,550 --> 00:20:21,870
	running in the right time anyway.
	Deepgram I looked at how they

	310
	00:20:21,870 --> 00:20:24,710
	actually handle that because it
	worked out of the box when other

	311
	00:20:24,710 --> 00:20:29,140
	stuff didn't, and it was quite a
	clever little mechanism,

	312
	00:20:29,460 --> 00:20:32,100
	and but more so than that,
	the accuracy was brilliant.

	313
	00:20:32,140 --> 00:20:35,020
	Now, what am I doing here?
	This is going to be a 20 minute

	314
	00:20:35,260 --> 00:20:42,980
	audio sample, and I'm I think
	I've done 1 or 2 of these before,

	315
	00:20:42,980 --> 00:20:49,180
	but I did it with short, snappy voice
	notes. This is kind of long form.

	316
	00:20:49,460 --> 00:20:51,740
	This actually might be a better
	approximation for what's useful

	317
	00:20:51,740 --> 00:20:56,100
	to me than voice memos.
	Like I need to buy three liters

	318
	00:20:56,100 --> 00:20:59,180
	of milk tomorrow, and pita bread,
	which is probably how like half

	319
	00:20:59,180 --> 00:21:02,820
	my voice voice notes sound like
	if anyone were to, I don't know,

	320
	00:21:02,860 --> 00:21:04,580
	like find my phone,
	they'd be like, this is the most

	321
	00:21:04,580 --> 00:21:07,420
	boring person in the world.
	Although actually there are some

	322
	00:21:07,460 --> 00:21:09,700
	like kind of, uh,
	journaling thoughts as well.

	323
	00:21:09,700 --> 00:21:13,700
	But it's a lot of content like that.
	And the probably for the evaluation,

	324
	00:21:13,700 --> 00:21:20,660
	the most useful thing is slightly
	obscure tech GitHub uh, hugging face

	325
	00:21:21,180 --> 00:21:24,660
	not so obscure that it's not going
	to have a chance of knowing it,

	326
	00:21:24,660 --> 00:21:27,640
	but hopefully sufficiently well
	known that the model should get it.

	327
	00:21:28,200 --> 00:21:30,760
	I tried to do a little bit of
	speaking really fast and

	328
	00:21:30,760 --> 00:21:33,200
	speaking very slowly.
	I would say in general,

	329
	00:21:33,200 --> 00:21:36,880
	I've spoken, delivered this at a
	faster pace than I usually would

	330
	00:21:36,920 --> 00:21:40,280
	owing to strong coffee flowing
	through my bloodstream.

	331
	00:21:40,920 --> 00:21:44,200
	And the thing that I'm not going
	to get in this benchmark is

	332
	00:21:44,200 --> 00:21:46,880
	background noise, which in my first
	take that I had to get rid of,

	333
	00:21:47,680 --> 00:21:51,240
	my wife came in with my son and
	for a good night kiss.

	334
	00:21:51,440 --> 00:21:55,120
	And that actually would have
	been super helpful to get in

	335
	00:21:55,120 --> 00:21:59,760
	because it was not diarised.
	Or if we had diarisation a female,

	336
	00:21:59,880 --> 00:22:02,280
	I could say I want the male
	voice and that wasn't intended

	337
	00:22:02,280 --> 00:22:05,280
	for transcription.
	Um, and we're not going to get

	338
	00:22:05,280 --> 00:22:06,960
	background noise like people
	honking their horns,

	339
	00:22:06,960 --> 00:22:11,280
	which is something I've done in my
	main data set where I am trying to

	340
	00:22:11,440 --> 00:22:15,520
	go back to some of my voice notes,
	annotate them, and run a benchmark.

	341
	00:22:15,520 --> 00:22:18,960
	But this is going to be just a
	pure quick test.

	342
	00:22:19,440 --> 00:22:23,880
	And as someone I'm working on a
	voice note idea,

	343
	00:22:23,880 --> 00:22:28,230
	that's my sort of end motivation.
	Besides thinking it's an

	344
	00:22:28,230 --> 00:22:31,590
	absolutely outstanding technology
	that's coming to viability.

	345
	00:22:31,590 --> 00:22:34,670
	And really, I know this sounds
	cheesy can actually have a very

	346
	00:22:34,670 --> 00:22:38,830
	transformative effect.
	Um, it's, you know, voice technology

	347
	00:22:38,870 --> 00:22:44,910
	has been life changing for, uh,
	folks living with, um, disabilities.

	348
	00:22:45,630 --> 00:22:48,550
	And I think there's something
	really nice about the fact that

	349
	00:22:48,550 --> 00:22:52,710
	it can also benefit, you know,
	folks who are able bodied and like,

	350
	00:22:52,750 --> 00:22:58,950
	we can all in different ways, um,
	make this tech as useful as possible,

	351
	00:22:58,990 --> 00:23:01,110
	regardless of the exact way that
	we're using it.

	352
	00:23:01,510 --> 00:23:04,710
	Um, and I think there's something
	very powerful in that, and it can be

	353
	00:23:04,710 --> 00:23:08,910
	very cool. Um, I see use potential.
	What excites me about voice tech?

	354
	00:23:09,750 --> 00:23:13,550
	A lot of things, actually.
	Firstly, the fact that it's cheap

	355
	00:23:13,550 --> 00:23:17,110
	and accurate, as I mentioned at
	the very start of this, um,

	356
	00:23:17,110 --> 00:23:20,790
	and it's getting better and better
	with stuff like accent handling, um,

	357
	00:23:20,790 --> 00:23:24,180
	I'm not sure my, my fine tune will
	actually ever come to fruition in the

	358
	00:23:24,180 --> 00:23:27,860
	sense that I'll use it day to day,
	as I imagine I get like superb,

	359
	00:23:27,860 --> 00:23:33,540
	flawless word error rates because I'm
	just kind of skeptical about local

	360
	00:23:33,540 --> 00:23:38,100
	speech to texts, as I mentioned.
	And I think the pace of innovation

	361
	00:23:38,100 --> 00:23:42,060
	and improvement in the models,
	the main reasons for fine tuning from

	362
	00:23:42,060 --> 00:23:46,340
	what I've seen have been people who
	are something that really blows,

	363
	00:23:46,380 --> 00:23:52,940
	blows my mind about ASR is the idea
	that it's inherently a lingual

	364
	00:23:52,940 --> 00:23:59,100
	or multilingual phonetic based.
	So as folks who use speak very

	365
	00:23:59,140 --> 00:24:02,220
	obscure languages that there may
	be there might be a paucity of

	366
	00:24:02,220 --> 00:24:05,500
	training data or almost none at all,
	and therefore the accuracy is

	367
	00:24:05,500 --> 00:24:10,660
	significantly reduced or folks
	in very critical environments.

	368
	00:24:10,700 --> 00:24:13,380
	I know there are.
	This is used extensively in medical

	369
	00:24:13,380 --> 00:24:18,140
	transcription and dispatcher work as,
	um, you know, the call centers who

	370
	00:24:18,140 --> 00:24:22,490
	send out ambulances, etc., where
	accuracy is absolutely paramount.

	371
	00:24:22,490 --> 00:24:26,050
	And in the case of doctors,
	radiologists, they might be using

	372
	00:24:26,050 --> 00:24:29,610
	very specialized vocab all the time.
	So those are kind of the main

	373
	00:24:29,610 --> 00:24:31,530
	two things.
	And I'm not sure that really just for

	374
	00:24:31,530 --> 00:24:37,290
	trying to make it better on a few
	random tech words with my slightly.

	375
	00:24:37,330 --> 00:24:41,250
	I mean, I have an accent, but like,
	not, you know, an accent that a few

	376
	00:24:41,290 --> 00:24:47,210
	other million people have. Ish.
	I'm not sure that my little fine

	377
	00:24:47,210 --> 00:24:52,250
	tune is going to actually like the
	bump in word error rate reduction.

	378
	00:24:52,250 --> 00:24:54,570
	If I ever actually figure out how
	to do it and get it up to the

	379
	00:24:54,570 --> 00:24:58,610
	cloud by the time I've done that.
	I suspect that the next

	380
	00:24:58,610 --> 00:25:01,410
	generation of ASR will just be
	so good that it will kind of be.

	381
	00:25:01,930 --> 00:25:03,770
	Ah, well,
	that would be cool if it worked out,

	382
	00:25:03,770 --> 00:25:08,730
	but I'll just use this instead.
	So that's going to be it for today's

	383
	00:25:08,730 --> 00:25:14,130
	episode of, uh, voice training data.
	Single long shot evaluation.

	384
	00:25:14,410 --> 00:25:17,330
	Who am I going to compare?
	Whisper is always good as a

	385
	00:25:17,330 --> 00:25:20,600
	benchmark, but I'm more
	interested in seeing Whisperer

	386
	00:25:20,600 --> 00:25:25,080
	head to head with two things,
	really. One is whisper variance.

	387
	00:25:25,080 --> 00:25:29,880
	So you've got these projects like
	faster Whisper, Still whisper.

	388
	00:25:29,880 --> 00:25:31,640
	It's a bit confusing.
	There's a whole bunch of them

	389
	00:25:31,920 --> 00:25:34,800
	and the emerging acers,
	which are also a thing.

	390
	00:25:35,200 --> 00:25:37,680
	My intention for this is I'm not
	sure I'm going to have the time

	391
	00:25:37,680 --> 00:25:41,640
	in any point in the foreseeable
	future to go back through this whole

	392
	00:25:41,640 --> 00:25:46,560
	episode and create a proper source,
	truth or a fix.

	393
	00:25:47,320 --> 00:25:51,680
	Everything might do it if I can
	get one transcription that

	394
	00:25:51,680 --> 00:25:56,720
	sufficiently close to perfection.
	But what I would actually love

	395
	00:25:56,720 --> 00:25:59,800
	to do on Hugging Face I think
	would be a great.

	396
	00:25:59,800 --> 00:26:03,560
	Probably how I might visualize this
	is having the audio waveform play,

	397
	00:26:04,040 --> 00:26:09,800
	and then have the transcript for each
	model below it, and maybe even a,

	398
	00:26:10,480 --> 00:26:15,120
	um, like, you know, two scale and
	maybe even a local one as well,

	399
	00:26:15,160 --> 00:26:21,700
	like local whisper versus open
	AI API, Etc. and, um, I can then

	400
	00:26:21,700 --> 00:26:24,380
	actually listen back to segments
	or anyone who wants to can listen

	401
	00:26:24,380 --> 00:26:29,420
	back to segments of this recording
	and see where a particular model

	402
	00:26:29,460 --> 00:26:32,940
	struggled and others didn't, as well
	as the sort of headline finding

	403
	00:26:32,980 --> 00:26:36,780
	of which had the best, uh, wer.
	But that would require the source

	404
	00:26:36,780 --> 00:26:40,020
	of truth. Okay. That's it.
	Hope this was, I don't know,

	405
	00:26:40,180 --> 00:26:43,460
	maybe useful for other folks
	interested in stuff you want to see.

	406
	00:26:43,940 --> 00:26:48,100
	I always feel think I've just said
	something I didn't intend to say.

	407
	00:26:48,660 --> 00:26:51,020
	I said for those, listen carefully.
	Including, hopefully,

	408
	00:26:51,020 --> 00:26:54,060
	the models themselves.
	This has been myself,

	409
	00:26:54,100 --> 00:26:57,900
	Daniel Rosehill, for more, um,
	jumbled repositories about my,

	410
	00:26:57,940 --> 00:27:00,820
	uh, roving interest in AI,
	but particularly Agentic,

	411
	00:27:01,180 --> 00:27:05,340
	MCP and voice tech.
	Uh, you can find me on GitHub.

	412
	00:27:05,820 --> 00:27:11,140
	Hugging face. Where else?
	Daniel, which is my personal website,

	413
	00:27:11,140 --> 00:27:15,260
	as well as this podcast whose
	name I sadly cannot remember.

	414
	00:27:15,700 --> 00:27:17,420
	Until next time.
	Thanks for listening.