Spaces:

danielrosehill
/

STT-Comparison

Running

App Files Files Community

STT-Comparison / srt-out /assembly.srt

danielrosehill

Fix SRT timestamp alignment with ground truth

0aa8adc about 1 month ago

raw

history blame contribute delete

41.1 kB

	1
	00:00:00,000 --> 00:00:05,600
	Hello and welcome to a audio data set consisting

	2
	00:00:05,600 --> 00:00:10,560
	of one single episode of a non-existent podcast. Or I

	3
	00:00:10,640 --> 00:00:13,280
	may append this to a podcast that I set up

	4
	00:00:13,520 --> 00:00:19,120
	recently regarding my with my thoughts on speech

	5
	00:00:19,200 --> 00:00:23,920
	tech and AI in particular, more AI in generative AI,

	6
	00:00:24,160 --> 00:00:28,560
	I would say. But in any event, the purpose of

	7
	00:00:28,640 --> 00:00:33,770
	this Voice recording is actually to create a lengthy

	8
	00:00:33,850 --> 00:00:37,050
	voice sample for a quick evaluation, a back of the

	9
	00:00:37,050 --> 00:00:40,570
	envelope evaluation, as they might say, for different speech attack

	10
	00:00:40,810 --> 00:00:43,370
	models. And I'm doing this because I thought I had

	11
	00:00:43,370 --> 00:00:46,730
	made a great breakthrough in my journey with speech tech,

	12
	00:00:47,050 --> 00:00:50,650
	and that was succeeding in the elusive task of fine-tuning

	13
	00:00:50,650 --> 00:00:54,730
	Whisper. Whisper is, and I'm going to just talk, I'm

	14
	00:00:54,810 --> 00:00:58,170
	trying to mix up, I'm going to try a few

	15
	00:00:58,330 --> 00:01:01,450
	different styles of speaking. I might whisper something at some

	16
	00:01:01,530 --> 00:01:04,800
	point. As well. And I'll go back to speaking loud

	17
	00:01:04,880 --> 00:01:08,000
	in, in different parts. I'm going to sound really like

	18
	00:01:08,080 --> 00:01:11,040
	a crazy person because I'm also going to try to

	19
	00:01:11,200 --> 00:01:16,160
	speak at different pitches and cadences in order to really

	20
	00:01:16,480 --> 00:01:20,480
	try to put a speech attacks model through its paces,

	21
	00:01:20,640 --> 00:01:22,960
	which is trying to make sense of is this guy

	22
	00:01:23,120 --> 00:01:27,980
	just rambling on incoherently in one long sentence or are

	23
	00:01:28,380 --> 00:01:34,140
	these just actually a series of step, standalone,

	24
	00:01:34,300 --> 00:01:37,340
	step alone, standalone sentences? And how is it gonna handle

	25
	00:01:37,420 --> 00:01:40,380
	step alone? That's not a word. What happens when you

	26
	00:01:40,460 --> 00:01:42,940
	use speech to text and you use a fake word?

	27
	00:01:43,100 --> 00:01:45,500
	And then you're like, wait, that's not actually, that word

	28
	00:01:45,660 --> 00:01:50,140
	doesn't exist. How does AI handle that? And these and

	29
	00:01:50,380 --> 00:01:54,220
	more are all the questions that I'm seeking to answer

	30
	00:01:54,380 --> 00:01:57,420
	in this training data. Now, why was it trying to

	31
	00:01:57,420 --> 00:02:00,210
	fine tune Whisper? And what is Whisper? As I said,

	32
	00:02:00,290 --> 00:02:02,930
	I'm going to try to record this at a couple

	33
	00:02:03,090 --> 00:02:07,410
	of different levels of technicality for folks who are, you

	34
	00:02:07,410 --> 00:02:11,650
	know, in the normal world and not totally stuck down

	35
	00:02:11,730 --> 00:02:13,730
	the rabbit hole of AI, which I have to say

	36
	00:02:13,890 --> 00:02:18,050
	is a really wonderful rabbit hole to be down. It's

	37
	00:02:18,130 --> 00:02:21,490
	a really interesting area and speech and voice tech is

	38
	00:02:21,890 --> 00:02:24,530
	the aspect of it that I find actually the most,

	39
	00:02:24,930 --> 00:02:27,330
	I'm not sure I would say the most interesting because

	40
	00:02:27,570 --> 00:02:31,290
	there's just so much that is fascinating in AI. But

	41
	00:02:31,450 --> 00:02:34,250
	the most that I find the most personally transformative in

	42
	00:02:34,330 --> 00:02:38,890
	terms of the impact that it's had on my daily

	43
	00:02:38,970 --> 00:02:41,450
	work life and productivity and how I sort of work.

	44
	00:02:42,090 --> 00:02:47,210
	And I'm persevering hard with the task of trying

	45
	00:02:47,210 --> 00:02:50,250
	to get a good solution working for Linux, which if

	46
	00:02:50,250 --> 00:02:52,250
	anyone actually does listen to this, not just for the

	47
	00:02:52,250 --> 00:02:56,410
	training data and for the actual content, this is sparked

	48
	00:02:56,750 --> 00:02:59,950
	I had, besides the fine tune not working, well, that

	49
	00:03:00,030 --> 00:03:05,230
	was the failure. Um, I used Claude code because one

	50
	00:03:05,470 --> 00:03:09,950
	thinks these days that there is nothing short of solving,

	51
	00:03:10,990 --> 00:03:15,390
	you know, the, the reason of life or something, that

	52
	00:03:15,790 --> 00:03:18,990
	Claude and agentic AI can't do, which is not really

	53
	00:03:19,070 --> 00:03:22,190
	the case. Uh, it does seem that way sometimes, but

	54
	00:03:22,350 --> 00:03:24,190
	it fails a lot as well. And this is one

	55
	00:03:24,190 --> 00:03:27,630
	of those, instances where last week I put together an

	56
	00:03:27,710 --> 00:03:32,010
	hour of voice training data, basically speaking, just random things

	57
	00:03:32,250 --> 00:03:37,050
	for 3 minutes. And it was actually kind of tedious

	58
	00:03:37,130 --> 00:03:39,210
	because the texts were really weird. Some of them were

	59
	00:03:39,450 --> 00:03:43,050
	it was like it was AI generated. I tried before

	60
	00:03:43,210 --> 00:03:45,130
	to read Sherlock Holmes for an hour and I just

	61
	00:03:45,130 --> 00:03:48,330
	couldn't. I was so bored after 10 minutes that I

	62
	00:03:48,330 --> 00:03:50,730
	was like, okay, no, I'm just going to have to

	63
	00:03:50,730 --> 00:03:55,290
	find something else to read. So I used a created

	64
	00:03:55,690 --> 00:04:01,280
	with AI studio vibe coded a synthetic text generator. Which

	65
	00:04:01,600 --> 00:04:03,840
	actually I thought was probably a better way of doing

	66
	00:04:03,920 --> 00:04:07,440
	it because it would give me more short samples with

	67
	00:04:07,680 --> 00:04:10,480
	more varied content. So I was like, okay, give me

	68
	00:04:10,880 --> 00:04:13,760
	a voice note, like I'm recording an email, give me

	69
	00:04:14,000 --> 00:04:17,680
	a short story to read, give me prose to read.

	70
	00:04:18,000 --> 00:04:20,400
	So I came up with all these different things and

	71
	00:04:20,560 --> 00:04:22,560
	they added a little timer to it so I could

	72
	00:04:22,720 --> 00:04:26,400
	see how close I was to one hour. And I

	73
	00:04:26,560 --> 00:04:29,600
	spent like an hour one afternoon or probably two hours

	74
	00:04:29,760 --> 00:04:33,330
	by the time you you do retakes. And whatever, because

	75
	00:04:33,410 --> 00:04:36,610
	you want to, it gave me a source of truth,

	76
	00:04:37,330 --> 00:04:40,050
	which I'm not sure if that's the scientific way to

	77
	00:04:40,210 --> 00:04:44,210
	approach this topic of gathering, training data, but I thought

	78
	00:04:44,450 --> 00:04:48,130
	made sense. Um, I have a lot of audio data

	79
	00:04:48,210 --> 00:04:50,770
	from recording voice notes, which I've also kind of used,

	80
	00:04:52,050 --> 00:04:55,810
	been experimenting with using for a different purpose, slightly different

	81
	00:04:56,210 --> 00:05:01,410
	annotating task types. It's more a text classification experiment

	82
	00:05:01,730 --> 00:05:04,160
	or, Well, it's more than that actually. I'm working on

	83
	00:05:04,160 --> 00:05:08,080
	a voice app. So it's a prototype, I guess, is

	84
	00:05:08,240 --> 00:05:12,720
	really more accurate. But you can do that and you

	85
	00:05:12,720 --> 00:05:15,200
	can work backwards. You're like, you listen back to a

	86
	00:05:15,200 --> 00:05:18,720
	voice note and you painfully go through one of those

	87
	00:05:19,040 --> 00:05:21,840
	transcribing, you know, where you start and stop and scrub

	88
	00:05:22,000 --> 00:05:23,920
	around it and you fix the errors, but it's really,

	89
	00:05:24,080 --> 00:05:26,720
	really boring to do that. So I thought it would

	90
	00:05:26,800 --> 00:05:29,040
	be less tedious in the long term if I just

	91
	00:05:30,059 --> 00:05:32,940
	recorded the source of truth. So it gave me these

	92
	00:05:33,020 --> 00:05:36,140
	three minute snippets. I recorded them. It saved an MP3

	93
	00:05:36,380 --> 00:05:39,500
	and a TXT in the same folder, and I created

	94
	00:05:39,580 --> 00:05:42,860
	an error with that data. So I was very hopeful,

	95
	00:05:43,260 --> 00:05:46,860
	quietly, a little bit hopeful that I could actually fine

	96
	00:05:46,940 --> 00:05:50,460
	tune Whisper. I want to fine tune Whisper because when

	97
	00:05:50,540 --> 00:05:54,780
	I got into Voicetech last November, my wife was in

	98
	00:05:54,780 --> 00:05:58,140
	the US and I was alone at home. And when

	99
	00:05:58,600 --> 00:06:01,400
	crazy people like me do really wild things like use

	100
	00:06:01,640 --> 00:06:06,120
	voice to tech technology. That was basically when I started

	101
	00:06:06,200 --> 00:06:08,760
	doing it, I didn't feel like a crazy person speaking

	102
	00:06:08,840 --> 00:06:13,720
	to myself. And my expectations weren't that high. I used

	103
	00:06:14,280 --> 00:06:17,640
	speech tech now and again, tried it out. It was

	104
	00:06:17,640 --> 00:06:19,160
	like, it'd be really cool if you could just, like,

	105
	00:06:19,320 --> 00:06:22,760
	speak into your computer. And whatever I tried out that

	106
	00:06:23,000 --> 00:06:26,590
	had Linux support was just. It was not good, basically.

	107
	00:06:27,230 --> 00:06:29,470
	And this blew me away from the first go. I

	108
	00:06:29,470 --> 00:06:32,750
	mean, it wasn't 100% accurate out of the box and

	109
	00:06:32,830 --> 00:06:34,910
	it took work, but it was good enough that there

	110
	00:06:34,990 --> 00:06:37,470
	was a solid foundation and it kind of passed that

	111
	00:06:38,670 --> 00:06:41,870
	pivot point that it's actually worth doing this. You know,

	112
	00:06:42,030 --> 00:06:44,670
	there's a point where it's so like the transcript is

	113
	00:06:44,910 --> 00:06:47,310
	you don't have to get 100% accuracy for it to

	114
	00:06:47,310 --> 00:06:50,030
	be worth your time for speech attacks to be a

	115
	00:06:50,030 --> 00:06:52,430
	worthwhile addition to your productivity, but you do need to

	116
	00:06:52,430 --> 00:06:55,970
	get above, let's say, I don't know, 85%. If it's

	117
	00:06:56,130 --> 00:06:59,810
	60% or 50%, you inevitably say, screw it, I'll just

	118
	00:06:59,810 --> 00:07:02,770
	type it because you end up missing errors in the

	119
	00:07:02,770 --> 00:07:05,490
	transcript and it becomes actually worse. You end up in

	120
	00:07:05,490 --> 00:07:07,570
	a worse position than you started with. That's been my

	121
	00:07:07,650 --> 00:07:11,970
	experience. So I was like, oh, this is actually really,

	122
	00:07:12,130 --> 00:07:13,970
	really good now. How did that happen? And the answer

	123
	00:07:14,130 --> 00:07:19,410
	is ASR whisper being open source and the transformer

	124
	00:07:19,410 --> 00:07:23,170
	architecture. If you want to go back to the to

	125
	00:07:23,250 --> 00:07:26,370
	the underpinnings, which really blows my mind and it's on

	126
	00:07:26,450 --> 00:07:30,680
	my list. To read through that paper. All you need

	127
	00:07:30,760 --> 00:07:35,960
	is attention as attentively as can be done

	128
	00:07:36,200 --> 00:07:39,320
	with my limited brain because it's super, super high level

	129
	00:07:39,640 --> 00:07:44,520
	stuff, super advanced stuff, I mean. But that, I think

	130
	00:07:44,680 --> 00:07:49,320
	of all the things that are fascinating about the sudden

	131
	00:07:49,640 --> 00:07:53,700
	rise in AI and the dramatic capabilities. I find it

	132
	00:07:53,700 --> 00:07:56,100
	fascinating that a few people are like, hang on, you've

	133
	00:07:56,100 --> 00:07:58,420
	got this thing that can speak to you, like a

	134
	00:07:58,420 --> 00:08:02,980
	chatbot, an LLM, and then you've got image generation. Okay,

	135
	00:08:03,060 --> 00:08:06,580
	so firstly, those two things on the surface have nothing

	136
	00:08:06,900 --> 00:08:10,740
	in common. So like, how are they, how did that

	137
	00:08:10,900 --> 00:08:12,500
	just happen all at the same time? And then when

	138
	00:08:12,500 --> 00:08:16,580
	you extend that further, you're like, Suno, right? You can

	139
	00:08:17,060 --> 00:08:20,030
	sing a song and AI will come up with and

	140
	00:08:20,190 --> 00:08:23,390
	instrumental. And then you've got Whisper and you're like, wait

	141
	00:08:23,390 --> 00:08:25,870
	a second, how did all this stuff, like, if it's

	142
	00:08:25,870 --> 00:08:29,230
	all AI, what's like, there has to be some commonality.

	143
	00:08:29,470 --> 00:08:34,590
	Otherwise, these are totally different technologies on the surface of

	144
	00:08:34,590 --> 00:08:38,830
	it. And the Transformer architecture is, as far as I

	145
	00:08:38,910 --> 00:08:41,550
	know, the answer. And I can't even say, can't even

	146
	00:08:41,630 --> 00:08:46,270
	pretend that I really understand what the Transformer architecture means.

	147
	00:08:46,770 --> 00:08:49,250
	In depth, but I have scanned it and as I

	148
	00:08:49,410 --> 00:08:51,810
	said, I want to print it and really kind of

	149
	00:08:52,210 --> 00:08:56,050
	think over it at some point. And I'll probably feel

	150
	00:08:56,290 --> 00:08:59,250
	bad about myself, I think, because weren't those guys in

	151
	00:08:59,330 --> 00:09:03,410
	their 20s? Like, that's crazy. I think I asked ChatGPT

	152
	00:09:03,490 --> 00:09:07,890
	once who wrote that paper and how old were they

	153
	00:09:08,050 --> 00:09:10,770
	when it was published in Arciv? And I was expecting,

	154
	00:09:11,010 --> 00:09:13,890
	like, I don't know, What do you imagine? I personally

	155
	00:09:13,970 --> 00:09:16,210
	imagine kind of like, you know, you have these breakthroughs

	156
	00:09:16,370 --> 00:09:19,810
	during COVID and things like that where like these kind

	157
	00:09:19,890 --> 00:09:22,770
	of really obscure scientists are like in their 50s and

	158
	00:09:22,770 --> 00:09:27,170
	they've just kind of been laboring in labs and wearily

	159
	00:09:27,170 --> 00:09:30,450
	in writing and publishing in kind of obscure academic publications.

	160
	00:09:30,770 --> 00:09:33,170
	And they finally like hit a big or win a

	161
	00:09:33,170 --> 00:09:37,250
	Nobel Prize and then their household names. So that was

	162
	00:09:37,330 --> 00:09:38,990
	kind of what I had in mind. That was the

	163
	00:09:38,990 --> 00:09:42,990
	mental image I'd formed of the birth of Arcsight. Like

	164
	00:09:42,990 --> 00:09:46,270
	I wasn't expecting 20-somethings in San Francisco, though. I thought

	165
	00:09:46,350 --> 00:09:48,830
	that was both very, very funny, very cool, and actually

	166
	00:09:48,990 --> 00:09:52,510
	kind of inspiring. It's nice to think that people who,

	167
	00:09:53,310 --> 00:09:56,110
	you know, just you might put them in the kind

	168
	00:09:56,190 --> 00:09:59,550
	of milieu or bubble or world that you are in

	169
	00:09:59,630 --> 00:10:03,230
	are credibly in through, you know, the series of connections

	170
	00:10:03,310 --> 00:10:07,390
	that are coming up with such literally world changing innovations.

	171
	00:10:07,870 --> 00:10:11,460
	So that was, I thought, anyway. That's that was cool.

	172
	00:10:11,860 --> 00:10:14,500
	Okay, voice training data. How are we doing? We're about

	173
	00:10:14,500 --> 00:10:18,580
	10 minutes and I'm still talking about voice technology. So

	174
	00:10:18,660 --> 00:10:22,100
	Whisper was brilliant and I was so excited that I

	175
	00:10:22,180 --> 00:10:25,380
	was my first instinct was to like guess like, oh

	176
	00:10:25,380 --> 00:10:26,820
	my gosh, I have to get like a really good

	177
	00:10:26,820 --> 00:10:30,580
	microphone for this. So I didn't go on a spending

	178
	00:10:30,580 --> 00:10:32,740
	spree because I said, I'm gonna have to just wait

	179
	00:10:32,740 --> 00:10:35,140
	a month and see if I still use this. And

	180
	00:10:36,430 --> 00:10:38,910
	It just kind of became, it's become really part of

	181
	00:10:39,070 --> 00:10:43,390
	my daily routine. Like if I'm writing an email, I'll

	182
	00:10:43,470 --> 00:10:46,990
	record a voice note. And then I've developed and it's

	183
	00:10:46,990 --> 00:10:49,070
	nice to see that everyone is like developing the same

	184
	00:10:49,550 --> 00:10:51,950
	things in parallel. Like that's my kind of a weird

	185
	00:10:51,950 --> 00:10:54,510
	thing to say, but when I look, I kind of

	186
	00:10:54,670 --> 00:10:58,990
	came, when I started working on this, these prototypes on

	187
	00:10:59,070 --> 00:11:01,470
	GitHub, which is where I just kind of share very

	188
	00:11:01,710 --> 00:11:06,730
	freely and loosely, ideas and first iterations on concepts.

	189
	00:11:08,490 --> 00:11:10,650
	And for want of a better word, I called it

	190
	00:11:10,730 --> 00:11:15,450
	like LLM post-processing or cleanup or basically a system prompt

	191
	00:11:15,530 --> 00:11:18,890
	that after you get back the raw text from Whisper,

	192
	00:11:19,050 --> 00:11:22,010
	you run it through a model and say, okay, this

	193
	00:11:22,090 --> 00:11:26,970
	is crappy text, like add sentence structure and fix it

	194
	00:11:27,050 --> 00:11:32,250
	up. And now when I'm exploring the different tools that

	195
	00:11:32,330 --> 00:11:35,180
	are out there that people have built, I see quite

	196
	00:11:35,420 --> 00:11:39,100
	a number of projects have basically done the same thing,

	197
	00:11:40,460 --> 00:11:43,180
	lest that be misconstrued. I'm not saying for a millisecond

	198
	00:11:43,260 --> 00:11:46,220
	that I inspired them. I'm sure this has been a

	199
	00:11:46,300 --> 00:11:49,500
	thing that's been integrated into tools for a while, but

	200
	00:11:50,380 --> 00:11:52,300
	it's the kind of thing that when you start using

	201
	00:11:52,300 --> 00:11:54,780
	these tools every day, the need for it is almost

	202
	00:11:54,940 --> 00:11:59,420
	instantly apparent because text that doesn't have any punctuation or

	203
	00:11:59,800 --> 00:12:03,000
	Paragraph spacing takes a long time to, you know, it

	204
	00:12:03,160 --> 00:12:05,400
	takes so long to get it into a presentable email

	205
	00:12:05,560 --> 00:12:09,720
	that again, it's, it's, it, it moves speech tech into

	206
	00:12:09,960 --> 00:12:13,480
	that before that inflection point where you're like, no, it's

	207
	00:12:13,480 --> 00:12:15,960
	just not worth it. It's like, it's, it'll just be

	208
	00:12:16,040 --> 00:12:18,520
	quicker to type this. So it's a big, it's a

	209
	00:12:18,520 --> 00:12:21,560
	little touch that actually is a big deal. Uh, so

	210
	00:12:21,720 --> 00:12:25,640
	I was on Whisper and I've been using Whisper and

	211
	00:12:25,640 --> 00:12:28,110
	I kind of, early on found a couple of tools.

	212
	00:12:28,270 --> 00:12:30,510
	I couldn't find what I was looking for on Linux,

	213
	00:12:30,670 --> 00:12:35,470
	which is basically just something that'll run in the background.

	214
	00:12:35,710 --> 00:12:38,030
	It'll give it an API key and it will just

	215
	00:12:38,190 --> 00:12:42,910
	like transcribe with like a little key to start and

	216
	00:12:42,990 --> 00:12:47,310
	stop the dictation. And the issues were I discovered that

	217
	00:12:47,470 --> 00:12:51,070
	like most people involved in creating these projects were very

	218
	00:12:51,230 --> 00:12:55,070
	much focused on local models, running Whisper locally because you

	219
	00:12:55,150 --> 00:12:57,940
	can. And I tried that a bunch of times and

	220
	00:12:58,020 --> 00:13:00,340
	just never got results that were as good as the

	221
	00:13:00,340 --> 00:13:03,140
	cloud. And when I began looking at the cost of

	222
	00:13:03,220 --> 00:13:05,700
	the speech to text APIs and what I was spending,

	223
	00:13:06,260 --> 00:13:09,460
	I just thought there is, it's actually, in my opinion,

	224
	00:13:09,620 --> 00:13:12,820
	just one of the better deals in API spending and

	225
	00:13:12,820 --> 00:13:15,140
	in cloud. Like it's just not that expensive for very,

	226
	00:13:15,300 --> 00:13:19,300
	very good models that are much more, you know, you're

	227
	00:13:19,300 --> 00:13:21,880
	gonna be able to run the full model. The latest

	228
	00:13:21,880 --> 00:13:25,880
	model versus whatever you can run on your average GPU,

	229
	00:13:26,120 --> 00:13:29,160
	unless you want to buy a crazy GPU. It doesn't

	230
	00:13:29,160 --> 00:13:31,080
	really make sense to me. Now, privacy is another concern

	231
	00:13:32,120 --> 00:13:33,880
	that I know is kind of like a very much

	232
	00:13:33,960 --> 00:13:36,760
	a separate thing that people just don't want their voice

	233
	00:13:37,000 --> 00:13:40,680
	data and their voice leaving their local environment, maybe for

	234
	00:13:40,680 --> 00:13:44,200
	regulatory reasons as well. But I'm not in that. I

	235
	00:13:44,600 --> 00:13:48,840
	neither really care about people listening to my grocery list

	236
	00:13:49,080 --> 00:13:51,720
	consisting of reminding myself that I need to buy more

	237
	00:13:51,800 --> 00:13:55,150
	beer, Cheetos, and hummus, which is kind of the three

	238
	00:13:55,310 --> 00:13:59,870
	staples of my diet during periods of poorer nutrition. But

	239
	00:13:59,950 --> 00:14:02,430
	the kind of stuff that I transcribe, it's just not,

	240
	00:14:03,950 --> 00:14:07,710
	it's not a privacy thing I'm that sort of sensitive

	241
	00:14:07,790 --> 00:14:13,150
	about and I don't do anything so sensitive or secure

	242
	00:14:13,230 --> 00:14:16,430
	that requires air gapping. So I looked at the pricing

	243
	00:14:16,510 --> 00:14:19,790
	and especially the kind of older model mini Some of

	244
	00:14:19,870 --> 00:14:21,950
	them are very, very affordable. And I did a back

	245
	00:14:22,190 --> 00:14:25,870
	of the, I did a calculation once with ChatGPT and

	246
	00:14:25,870 --> 00:14:29,230
	I was like, okay, this is the API price for

	247
	00:14:29,390 --> 00:14:32,270
	I can't remember whatever the model was. Let's say I

	248
	00:14:32,350 --> 00:14:35,230
	just go at it like nonstop, which it rarely happens.

	249
	00:14:35,470 --> 00:14:38,830
	Probably, I would say on average, I might dictate 30

	250
	00:14:38,910 --> 00:14:41,790
	to 60 minutes per day if I was probably summing

	251
	00:14:41,790 --> 00:14:46,990
	up the emails, documents, outlines, which

	252
	00:14:47,230 --> 00:14:49,870
	is a lot, but it's still a fairly modest amount.

	253
	00:14:50,030 --> 00:14:51,940
	And I was like, Some days I do go on

	254
	00:14:52,100 --> 00:14:54,900
	like one or two days where I've been usually when

	255
	00:14:54,900 --> 00:14:56,980
	I'm like kind of out of the house and just

	256
	00:14:57,220 --> 00:15:00,500
	have something like I have nothing else to do. Like

	257
	00:15:00,660 --> 00:15:04,020
	if I'm at a hospital, we have a newborn and

	258
	00:15:04,180 --> 00:15:07,300
	you're waiting for like eight hours and hours for an

	259
	00:15:07,380 --> 00:15:10,820
	appointment. And I would probably have listened to podcasts before

	260
	00:15:11,380 --> 00:15:14,180
	becoming a speech fanatic. And I'm like, oh, wait, let

	261
	00:15:14,340 --> 00:15:16,259
	me just get down. Let me just get these ideas

	262
	00:15:16,420 --> 00:15:18,540
	out of my head. And that's when I'll go on

	263
	00:15:19,260 --> 00:15:21,820
	my speech binges. But those are like once every few

	264
	00:15:21,820 --> 00:15:24,940
	months, like not frequently. But I said, okay, let's just

	265
	00:15:25,020 --> 00:15:29,100
	say if I'm gonna price out Cloud SCT, if I

	266
	00:15:29,180 --> 00:15:33,900
	was like dedicated every second of every waking hour to

	267
	00:15:34,060 --> 00:15:37,900
	transcribing for some odd reason, I mean, I'd have to

	268
	00:15:37,980 --> 00:15:40,780
	like eat and use the toilet. Like, you know, there's

	269
	00:15:40,860 --> 00:15:43,420
	only so many hours I'm awake for. So like, let's

	270
	00:15:43,420 --> 00:15:46,620
	just say a maximum of like 40 hour, 45 minutes.

	271
	00:15:47,210 --> 00:15:49,290
	In the hour. Then I said, all right, let's just

	272
	00:15:49,290 --> 00:15:52,890
	say 50. Who knows? You're dictating on the toilet. We

	273
	00:15:53,050 --> 00:15:55,050
	do it. So it could be. You could just do

	274
	00:15:55,130 --> 00:15:59,290
	60. But whatever I did. And every day, like, you're

	275
	00:15:59,370 --> 00:16:02,730
	going flat out seven days a week dictating non-stop I

	276
	00:16:02,730 --> 00:16:05,850
	was like, what's my monthly API bill gonna be at

	277
	00:16:05,930 --> 00:16:08,570
	this price? And it came out to, like, 70 or

	278
	00:16:08,570 --> 00:16:10,730
	80 bucks. And I was like, well, that would be

	279
	00:16:11,130 --> 00:16:15,700
	an extraordinary. Amount of dictation. And I would hope that

	280
	00:16:16,180 --> 00:16:19,940
	there was some compelling reason more worth more than $70

	281
	00:16:20,260 --> 00:16:23,460
	that I embarked upon that project. So given that that's

	282
	00:16:23,460 --> 00:16:25,460
	kind of the max point for me, I said that's

	283
	00:16:25,540 --> 00:16:29,140
	actually very, very affordable. Now you're gonna, if you want

	284
	00:16:29,220 --> 00:16:31,700
	to spec out the costs and you want to do

	285
	00:16:31,700 --> 00:16:36,260
	the post-processing that I really do feel is valuable, that's

	286
	00:16:36,340 --> 00:16:40,820
	gonna cost some more as well, unless you're using Gemini,

	287
	00:16:41,300 --> 00:16:44,420
	which needless to say is a random person sitting in

	288
	00:16:44,500 --> 00:16:49,060
	Jerusalem. I have no affiliation, nor with Google, nor anthropic,

	289
	00:16:49,140 --> 00:16:52,020
	nor Gemini, nor any major tech vendor for that matter.

	290
	00:16:53,620 --> 00:16:56,820
	I like Gemini not so much as a everyday model.

	291
	00:16:57,300 --> 00:16:59,860
	It's kind of underwhelmed in that respect, I would say.

	292
	00:17:00,260 --> 00:17:02,740
	But for multimodal, I think it's got a lot to

	293
	00:17:02,740 --> 00:17:06,500
	offer. And I think that the transcribing functionality whereby it

	294
	00:17:06,580 --> 00:17:11,900
	can process audio with a system prompt and both give

	295
	00:17:12,060 --> 00:17:15,100
	you transcription that's cleaned up that reduces two steps to

	296
	00:17:15,260 --> 00:17:18,220
	one. And that for me is a very, very big

	297
	00:17:18,380 --> 00:17:21,580
	deal. And I feel like even Google has haven't really

	298
	00:17:21,820 --> 00:17:26,700
	sort of thought through how useful the that modality is

	299
	00:17:26,780 --> 00:17:29,260
	and what kind of use cases you can achieve with

	300
	00:17:29,340 --> 00:17:31,260
	it. Because I found in the course of this year,

	301
	00:17:31,900 --> 00:17:36,540
	just an endless list of really kind of system prompt

	302
	00:17:36,860 --> 00:17:40,220
	system prompt stuff that I can say, okay, I've used

	303
	00:17:40,220 --> 00:17:43,420
	it to capture context data for AI, which is literally

	304
	00:17:43,500 --> 00:17:45,660
	I might speak for if I wanted to have a

	305
	00:17:45,660 --> 00:17:49,740
	good bank of context data about who knows my childhood

	306
	00:17:50,300 --> 00:17:54,220
	more realistically, maybe my career goals, something that would just

	307
	00:17:54,300 --> 00:17:56,700
	be like really boring to type out. So I'll just

	308
	00:17:56,780 --> 00:18:00,780
	like sit in my car and record it for 10

	309
	00:18:00,860 --> 00:18:03,100
	minutes. And that 10 minutes you get a lot of

	310
	00:18:03,260 --> 00:18:08,650
	information in. Um, emails, which is short text, just

	311
	00:18:09,050 --> 00:18:12,250
	there is a whole bunch and all these workflows kind

	312
	00:18:12,410 --> 00:18:14,410
	of require a little bit of treatment afterwards and different

	313
	00:18:14,650 --> 00:18:18,090
	treatment. My context pipeline is kind of like just extract

	314
	00:18:18,170 --> 00:18:20,970
	the bare essentials. So you end up with me talking

	315
	00:18:21,050 --> 00:18:22,970
	very loosely about sort of what I've done in my

	316
	00:18:23,050 --> 00:18:25,370
	career, where I've worked, where I might like to work.

	317
	00:18:25,850 --> 00:18:28,970
	And it goes, it condenses that down to very robotic

	318
	00:18:29,210 --> 00:18:32,490
	language that is easy to chunk parse and maybe put

	319
	00:18:32,570 --> 00:18:36,550
	into a vector database. Daniel has worked in technology. Daniel

	320
	00:18:37,430 --> 00:18:40,150
	has been working in, you know, stuff like that. That's

	321
	00:18:40,150 --> 00:18:43,110
	not how you would speak, but I figure it's probably

	322
	00:18:43,350 --> 00:18:47,350
	easier to parse for, after all, robots. So we've almost

	323
	00:18:47,430 --> 00:18:49,270
	got to 20 minutes and this is actually a success

	324
	00:18:49,750 --> 00:18:55,110
	because I wasted 20 minutes of the evening speaking

	325
	00:18:55,190 --> 00:18:59,910
	into a microphone and the levels were shot and it

	326
	00:18:59,910 --> 00:19:01,590
	was clipping and I said, I can't really do an

	327
	00:19:01,670 --> 00:19:03,990
	evaluation. I have to be fair. I have to give

	328
	00:19:04,560 --> 00:19:07,920
	the models a chance to do their thing. What am

	329
	00:19:07,920 --> 00:19:10,320
	I hoping to achieve in this? Okay, my fine tune

	330
	00:19:10,320 --> 00:19:13,360
	was a dud as mentioned. DeepChrom ST, I'm really, really

	331
	00:19:13,440 --> 00:19:16,480
	hopeful that this prototype will work and it's a build

	332
	00:19:16,720 --> 00:19:19,280
	in public open source, so anyone is welcome to use

	333
	00:19:19,360 --> 00:19:22,320
	it if I make anything good. But that was really

	334
	00:19:22,480 --> 00:19:26,480
	exciting for me last night when after hours of trying

	335
	00:19:26,560 --> 00:19:30,480
	my own prototype, seeing someone just made something that works

	336
	00:19:30,640 --> 00:19:32,400
	like that, you know, you're not gonna have to build

	337
	00:19:32,640 --> 00:19:37,460
	a custom conda environment and image. I have AMD GPU,

	338
	00:19:37,620 --> 00:19:40,980
	which makes things much more complicated. I didn't find it.

	339
	00:19:41,540 --> 00:19:42,980
	And I was about to give up and I said,

	340
	00:19:43,060 --> 00:19:45,460
	all right, let me just give Deep Grams Linux thing

	341
	00:19:45,940 --> 00:19:49,220
	a shot. And if this doesn't work, I'm just going

	342
	00:19:49,220 --> 00:19:50,980
	to go back to trying to Vibe code something myself.

	343
	00:19:51,620 --> 00:19:55,460
	And when I ran the script, I was using Claude

	344
	00:19:55,540 --> 00:19:59,060
	code to do the installation process. It ran the script

	345
	00:19:59,140 --> 00:20:02,020
	and oh my gosh, it works just like that. The

	346
	00:20:02,100 --> 00:20:05,980
	tricky thing For all those who want to know all

	347
	00:20:05,980 --> 00:20:11,260
	the nitty gritty details, was that I

	348
	00:20:11,260 --> 00:20:14,380
	don't think it was actually struggling with transcription, but pasting

	349
	00:20:14,700 --> 00:20:18,140
	Wayland makes life very hard. And I think there was

	350
	00:20:18,220 --> 00:20:21,500
	something not running the right time. Anyway, Deepgram, I looked

	351
	00:20:21,500 --> 00:20:23,820
	at how they actually handled that because it worked out

	352
	00:20:23,900 --> 00:20:26,540
	of the box when other stuff didn't. And it was

	353
	00:20:27,100 --> 00:20:30,570
	quite a clever little mechanism. And but more so than

	354
	00:20:30,650 --> 00:20:33,290
	that, the accuracy was brilliant. Now, what am I doing

	355
	00:20:33,290 --> 00:20:35,930
	here? This is going to be a 20 minute audio

	356
	00:20:36,490 --> 00:20:42,010
	sample. And I think I've done one or two

	357
	00:20:42,170 --> 00:20:46,570
	of these before, but I did it with short snappy

	358
	00:20:46,730 --> 00:20:49,770
	voice notes. This is kind of long form. This actually

	359
	00:20:50,010 --> 00:20:52,170
	might be a better approximation for what's useful to me

	360
	00:20:52,330 --> 00:20:55,890
	than voice memos. Like, I need to buy three Bread,

	361
	00:20:55,970 --> 00:20:58,610
	eaters of milk tomorrow and Peter bread, which is probably

	362
	00:20:58,770 --> 00:21:01,330
	how like half my voice notes sound. Like if anyone

	363
	00:21:01,810 --> 00:21:04,050
	were to, I don't know, like find my phone, they'd

	364
	00:21:04,050 --> 00:21:05,570
	be like, this is the most boring person in the

	365
	00:21:05,570 --> 00:21:09,330
	world. Although actually, there are some like kind of journaling

	366
	00:21:09,330 --> 00:21:11,490
	thoughts as well, but it's a lot of content like

	367
	00:21:11,490 --> 00:21:14,450
	that. And the probably for the evaluation, the most useful

	368
	00:21:14,530 --> 00:21:20,210
	thing is slightly obscure tech, GitHub, NeocleNo, hugging

	369
	00:21:20,290 --> 00:21:22,940
	face, Not so obscure that it's not going to have

	370
	00:21:23,020 --> 00:21:26,460
	a chance of knowing it, but hopefully sufficiently well known

	371
	00:21:26,460 --> 00:21:28,700
	that the model should get it. I tried to do

	372
	00:21:28,780 --> 00:21:31,580
	a little bit of speaking really fast and speaking very

	373
	00:21:31,740 --> 00:21:35,020
	slowly. I would say in general, I've spoken, delivered this

	374
	00:21:35,180 --> 00:21:37,500
	at a faster pace than I usually would owing to

	375
	00:21:37,980 --> 00:21:42,460
	strong coffee flowing through my bloodstream. And the thing that

	376
	00:21:42,460 --> 00:21:44,700
	I'm not going to get in this benchmark is background

	377
	00:21:44,780 --> 00:21:46,460
	noise, which in my first take that I had to

	378
	00:21:46,460 --> 00:21:49,710
	get rid of, My wife came in with my son

	379
	00:21:50,030 --> 00:21:52,350
	and for a goodnight kiss. And that actually would have

	380
	00:21:52,350 --> 00:21:56,510
	been super helpful to get in because it was non

	381
	00:21:56,590 --> 00:22:00,190
	diarized or if we had diarization, a female, I could

	382
	00:22:00,190 --> 00:22:02,430
	say, I want the male voice and that wasn't intended

	383
	00:22:02,430 --> 00:22:05,870
	for transcription. And we're not going to get background noise

	384
	00:22:05,950 --> 00:22:08,270
	like people honking their horns, which is something I've done

	385
	00:22:08,430 --> 00:22:11,150
	in my main data set where I am trying to

	386
	00:22:11,390 --> 00:22:14,340
	go back to some of my voice notes. Annotate them

	387
	00:22:14,580 --> 00:22:16,420
	and run a benchmark. But this is going to be

	388
	00:22:16,420 --> 00:22:21,700
	just a pure quick test. And as someone,

	389
	00:22:22,260 --> 00:22:24,660
	I'm working on a voice note idea. That's my sort

	390
	00:22:24,660 --> 00:22:28,660
	of end motivation. Besides thinking it's an ask to the

	391
	00:22:28,660 --> 00:22:32,340
	outstanding technology that's coming to viability. And really, I know

	392
	00:22:32,420 --> 00:22:35,940
	this sounds cheesy, can actually have a very transformative effect.

	393
	00:22:36,980 --> 00:22:41,130
	It's, you know, voice technology has been life changing for

	394
	00:22:41,930 --> 00:22:46,970
	folks living with disabilities. And I think

	395
	00:22:47,130 --> 00:22:48,970
	there's something really nice about the fact that it can

	396
	00:22:49,130 --> 00:22:52,490
	also benefit, you know, folks who are able bodied and

	397
	00:22:52,650 --> 00:22:57,690
	like we can all in different ways make this tech

	398
	00:22:57,770 --> 00:23:00,410
	as useful as possible, regardless of the exact way that

	399
	00:23:00,410 --> 00:23:03,770
	we're using it. And I think there's something very powerful

	400
	00:23:03,850 --> 00:23:06,440
	in that and it can be very cool. I see

	401
	00:23:06,600 --> 00:23:10,200
	huge potential. What excites me about Voicetech? A lot of

	402
	00:23:10,280 --> 00:23:14,360
	things actually. Firstly, the fact that it's cheap and accurate,

	403
	00:23:14,440 --> 00:23:17,080
	as I mentioned at the very start of this. And

	404
	00:23:17,240 --> 00:23:19,880
	it's getting better and better with stuff like accent handling.

	405
	00:23:20,680 --> 00:23:23,400
	I'm not sure my fine-tune will actually ever come to

	406
	00:23:23,480 --> 00:23:25,320
	fruition in the sense that I'll use it day to

	407
	00:23:25,400 --> 00:23:28,840
	day as I imagine. I get like superb flawless words

	408
	00:23:28,920 --> 00:23:33,340
	error rates because I'm just kind of skeptical about Local

	409
	00:23:33,500 --> 00:23:37,100
	speech to text, as I mentioned, and I think the

	410
	00:23:37,180 --> 00:23:40,700
	pace of innovation and improvement in the models, the main

	411
	00:23:40,860 --> 00:23:44,620
	reasons for fine tuning from what I've seen have been

	412
	00:23:44,780 --> 00:23:47,420
	people who are something that really blows my mind about

	413
	00:23:47,980 --> 00:23:53,100
	ASR is the idea that it's inherently a lingual or

	414
	00:23:53,260 --> 00:23:58,570
	multilingual phonetic based. So as folks who use speak

	415
	00:23:58,890 --> 00:24:02,250
	very obscure languages, that there might be a paucity of

	416
	00:24:02,250 --> 00:24:04,890
	training data or almost none at all, and therefore the

	417
	00:24:04,890 --> 00:24:10,090
	accuracy is significantly reduced. Or folks in very critical

	418
	00:24:10,330 --> 00:24:14,250
	environments, I know this is used extensively in medical transcription

	419
	00:24:14,330 --> 00:24:19,130
	and dispatcher work, the call centers who send out ambulances,

	420
	00:24:19,210 --> 00:24:23,130
	et cetera, where accuracy is absolutely paramount. And in the

	421
	00:24:23,130 --> 00:24:26,860
	case of doctors, radiologist, they might be using very specialized

	422
	00:24:26,860 --> 00:24:29,420
	vocab all the time. So those are kind of the

	423
	00:24:29,500 --> 00:24:31,420
	main two things that I'm not sure that really just

	424
	00:24:31,500 --> 00:24:34,940
	for trying to make it better on a few random

	425
	00:24:34,940 --> 00:24:37,900
	tech words with my slightly, I mean, I have an

	426
	00:24:37,980 --> 00:24:41,020
	accent, but like not, you know, an accent that a

	427
	00:24:41,100 --> 00:24:45,900
	few other million people have ish. I'm not sure that

	428
	00:24:46,380 --> 00:24:50,300
	my little fine tune is gonna actually like the bump

	429
	00:24:50,460 --> 00:24:53,500
	in word error reduction, if I ever actually figure out

	430
	00:24:53,500 --> 00:24:54,620
	how to do it and get it up to the

	431
	00:24:54,700 --> 00:24:57,870
	cloud. By the time we've done that, I suspect that

	432
	00:24:58,190 --> 00:25:00,430
	the next generation of ASR will just be so good

	433
	00:25:00,510 --> 00:25:02,990
	that it will kind of be, well, that would have

	434
	00:25:02,990 --> 00:25:04,670
	been cool if it worked out, but I'll just use

	435
	00:25:04,750 --> 00:25:08,510
	this instead. So that's going to be it for today's

	436
	00:25:08,830 --> 00:25:14,030
	episode of voice training data. Single long shot evaluation.

	437
	00:25:14,350 --> 00:25:17,150
	Who am I going to compare? Whisper is always good

	438
	00:25:17,150 --> 00:25:20,510
	as a benchmark, but I'm more interested in seeing Whisper

	439
	00:25:20,590 --> 00:25:24,510
	head to head with two things, really. One is Whisper

	440
	00:25:24,590 --> 00:25:29,700
	variants. So you've got these projects like faster Distill Whisper,

	441
	00:25:29,780 --> 00:25:31,700
	it's a bit confusing, there's a whole bunch of them.

	442
	00:25:32,020 --> 00:25:35,300
	And the emerging ASRs, which are also a thing. My

	443
	00:25:35,380 --> 00:25:37,220
	intention for this is I'm not sure I'm going to

	444
	00:25:37,220 --> 00:25:39,860
	have the time in any point in the foreseeable future

	445
	00:25:40,180 --> 00:25:44,580
	to go back through this whole episode and create a

	446
	00:25:44,660 --> 00:25:49,700
	proper source truth, where I fix everything. Might do

	447
	00:25:49,780 --> 00:25:52,740
	it if I can get one transcriptions that sufficiently close

	448
	00:25:52,980 --> 00:25:57,040
	to perfection. But what I would actually love to do

	449
	00:25:57,200 --> 00:25:59,920
	on Hugging Face, I think would be a great probably

	450
	00:26:00,240 --> 00:26:02,880
	how I might visualize this is having the audio waveform

	451
	00:26:03,200 --> 00:26:08,160
	play and then have the transcript for each model below

	452
	00:26:08,160 --> 00:26:12,560
	it and maybe even a like, you know, to scale

	453
	00:26:13,120 --> 00:26:15,600
	and maybe even a local one as well, like local

	454
	00:26:15,760 --> 00:26:21,100
	whisper versus OpenAI API, et cetera. And, I

	455
	00:26:21,180 --> 00:26:23,500
	can then actually listen back to segments or anyone who

	456
	00:26:23,500 --> 00:26:25,820
	wants to can listen back to segments of this recording

	457
	00:26:26,140 --> 00:26:30,940
	and see where a particular model struggled and others didn't,

	458
	00:26:31,420 --> 00:26:33,340
	as well as the sort of headline finding of which

	459
	00:26:33,500 --> 00:26:36,860
	had the best WER, but that would require the source

	460
	00:26:36,860 --> 00:26:39,580
	of truth. Okay, that's it. I hope this was, I

	461
	00:26:39,580 --> 00:26:42,540
	don't know, maybe useful for other folks interested in STT.

	462
	00:26:42,860 --> 00:26:45,660
	You want to see that I always feel think I've

	463
	00:26:45,660 --> 00:26:48,870
	just said as something I didn't intend to. STT, I

	464
	00:26:48,870 --> 00:26:52,470
	said for those. Listen carefully, including hopefully the models themselves.

	465
	00:26:53,190 --> 00:26:57,270
	This has been myself, Daniel Rosell. For more jumbled repositories

	466
	00:26:57,350 --> 00:27:01,750
	about my roving interests in AI, but particularly agentic, MCP

	467
	00:27:01,990 --> 00:27:07,029
	and Voicetech, you can find me on GitHub, huggingface.com,

	468
	00:27:10,230 --> 00:27:13,270
	which is my personal website, as well as this podcast,

	469
	00:27:13,510 --> 00:27:16,950
	whose name I sadly cannot remember. Until next time, thanks

	470
	00:27:16,950 --> 00:27:17,510
	for listening.