1
00:00:00,000 --> 00:00:05,600
Hello and welcome to an audio dataset consisting

2
00:00:05,600 --> 00:00:10,560
of one single episode of a non-existent podcast. Or I

3
00:00:10,640 --> 00:00:13,280
may append this to a podcast that I set up

4
00:00:13,520 --> 00:00:19,120
recently regarding my, with my thoughts on, speech

5
00:00:19,200 --> 00:00:23,920
tech and AI in particular, more AI and generative AI,

6
00:00:24,160 --> 00:00:28,560
I would say. But in any event, the purpose of

7
00:00:28,640 --> 00:00:33,770
this voice recording is actually to create a lengthy

8
00:00:33,850 --> 00:00:37,050
voice sample for a quick evaluation, a back of the

9
00:00:37,050 --> 00:00:40,570
envelope evaluation, as they might say, for different speech-to-text

10
00:00:40,810 --> 00:00:43,370
models. And I'm doing this because I thought I had

11
00:00:43,370 --> 00:00:46,730
made a great breakthrough in my journey with speech tech,

12
00:00:47,050 --> 00:00:50,650
and that was succeeding in the elusive task of fine-tuning

13
00:00:50,650 --> 00:00:54,730
Whisper. Whisper is, and I'm going to just talk, I'm

14
00:00:54,810 --> 00:00:58,170
trying to mix up, I'm going to try a few

15
00:00:58,330 --> 00:01:01,450
different styles of speaking. I might whisper something at some

16
00:01:01,530 --> 00:01:04,800
point, as well. And I'll go back to speaking loud

17
00:01:04,880 --> 00:01:08,000
in different parts. I'm going to sound really like

18
00:01:08,080 --> 00:01:11,040
a crazy person because I'm also going to try to

19
00:01:11,200 --> 00:01:16,160
speak at different pitches and cadences in order to really

20
00:01:16,480 --> 00:01:20,480
try to put a speech-to-text model through its paces,

21
00:01:20,640 --> 00:01:22,960
which is trying to make sense of: is this guy

22
00:01:23,120 --> 00:01:27,980
just rambling on incoherently in one long sentence, or are

23
00:01:28,380 --> 00:01:34,140
these just actually a series of step, standalone,

24
00:01:34,300 --> 00:01:37,340
step alone, standalone sentences? And how is it gonna handle

25
00:01:37,420 --> 00:01:40,380
step alone? That's not a word. What happens when you

26
00:01:40,460 --> 00:01:42,940
use speech-to-text and you use a fake word?

27
00:01:43,100 --> 00:01:45,500
And then you're like, wait, that's not actually, that word

28
00:01:45,660 --> 00:01:50,140
doesn't exist. How does AI handle that? And these and

29
00:01:50,380 --> 00:01:54,220
more are all the questions that I'm seeking to answer

30
00:01:54,380 --> 00:01:57,420
in this training data. Now, why was I trying to

31
00:01:57,420 --> 00:02:00,210
fine-tune Whisper? And what is Whisper? As I said,

32
00:02:00,290 --> 00:02:02,930
I'm going to try to record this at a couple

33
00:02:03,090 --> 00:02:07,410
of different levels of technicality for folks who are, you

34
00:02:07,410 --> 00:02:11,650
know, in the normal world and not totally stuck down

35
00:02:11,730 --> 00:02:13,730
the rabbit hole of AI, which I have to say

36
00:02:13,890 --> 00:02:18,050
is a really wonderful rabbit hole to be down. It's

37
00:02:18,130 --> 00:02:21,490
a really interesting area, and speech and voice tech is

38
00:02:21,890 --> 00:02:24,530
the aspect of it that I find actually the most,

39
00:02:24,930 --> 00:02:27,330
I'm not sure I would say the most interesting, because

40
00:02:27,570 --> 00:02:31,290
there's just so much that is fascinating in AI, but

41
00:02:31,450 --> 00:02:34,250
the one that I find the most personally transformative in

42
00:02:34,330 --> 00:02:38,890
terms of the impact that it's had on my daily

43
00:02:38,970 --> 00:02:41,450
work life and productivity and how I sort of work.

44
00:02:42,090 --> 00:02:47,210
And I'm persevering hard with the task of trying

45
00:02:47,210 --> 00:02:50,250
to get a good solution working for Linux, which, if

46
00:02:50,250 --> 00:02:52,250
anyone actually does listen to this, not just for the

47
00:02:52,250 --> 00:02:56,410
training data but for the actual content, this was sparked,

48
00:02:56,750 --> 00:02:59,950
I had, besides the fine-tune not working, well, that

49
00:03:00,030 --> 00:03:05,230
was the failure. Um, I used Claude Code, because one

50
00:03:05,470 --> 00:03:09,950
thinks these days that there is nothing short of solving,

51
00:03:10,990 --> 00:03:15,390
you know, the, the meaning of life or something, that

52
00:03:15,790 --> 00:03:18,990
Claude and agentic AI can't do, which is not really

53
00:03:19,070 --> 00:03:22,190
the case. Uh, it does seem that way sometimes, but

54
00:03:22,350 --> 00:03:24,190
it fails a lot as well. And this is one

55
00:03:24,190 --> 00:03:27,630
of those instances where last week I put together an

56
00:03:27,710 --> 00:03:32,010
hour of voice training data, basically speaking just random things

57
00:03:32,250 --> 00:03:37,050
for 3 minutes. And it was actually kind of tedious,

58
00:03:37,130 --> 00:03:39,210
because the texts were really weird. Some of them were,

59
00:03:39,450 --> 00:03:43,050
it was like, it was AI-generated. I tried before

60
00:03:43,210 --> 00:03:45,130
to read Sherlock Holmes for an hour and I just

61
00:03:45,130 --> 00:03:48,330
couldn't. I was so bored after 10 minutes that I

62
00:03:48,330 --> 00:03:50,730
was like, okay, no, I'm just going to have to

63
00:03:50,730 --> 00:03:55,290
find something else to read. So I created,

64
00:03:55,690 --> 00:04:01,280
with AI Studio, vibe coded, a synthetic text generator, which

65
00:04:01,600 --> 00:04:03,840
actually I thought was probably a better way of doing

66
00:04:03,920 --> 00:04:07,440
it, because it would give me more short samples with

67
00:04:07,680 --> 00:04:10,480
more varied content. So I was like, okay, give me

68
00:04:10,880 --> 00:04:13,760
a voice note, like I'm recording an email, give me

69
00:04:14,000 --> 00:04:17,680
a short story to read, give me prose to read.

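A minimal, hypothetical sketch of that kind of vibe-coded prompt generator, with the actual LLM text generation stubbed out; the prompt styles, take length, and one-hour target are illustrative:

```python
import random

# Hypothetical prompt styles; a real version would ask an LLM to
# generate the actual text to read for each style.
PROMPT_STYLES = [
    "a voice note, as if dictating an email",
    "a short story",
    "a passage of prose",
]

TARGET_SECONDS = 60 * 60  # one hour of recorded audio
TAKE_SECONDS = 180        # each reading is roughly three minutes

def next_prompt() -> str:
    return f"Please read {random.choice(PROMPT_STYLES)}."

recorded = 0
while recorded < TARGET_SECONDS:
    print(next_prompt())
    input("Press Enter when the take is done...")
    recorded += TAKE_SECONDS
    print(f"Progress: {recorded / TARGET_SECONDS:.0%} of the hour")
```
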
70
00:04:18,000 --> 00:04:20,400
So I came up with all these different things and

71
00:04:20,560 --> 00:04:22,560
then added a little timer to it so I could

72
00:04:22,720 --> 00:04:26,400
see how close I was to one hour. And I

73
00:04:26,560 --> 00:04:29,600
spent like an hour one afternoon, or probably two hours

74
00:04:29,760 --> 00:04:33,330
by the time you do retakes and whatever, because

75
00:04:33,410 --> 00:04:36,610
you want to, it gave me a source of truth,

76
00:04:37,330 --> 00:04:40,050
which, I'm not sure if that's the scientific way to

77
00:04:40,210 --> 00:04:44,210
approach this topic of gathering training data, but I thought

78
00:04:44,450 --> 00:04:48,130
it made sense. Um, I have a lot of audio data

79
00:04:48,210 --> 00:04:50,770
from recording voice notes, which I've also kind of used,

80
00:04:52,050 --> 00:04:55,810
been experimenting with using, for a different purpose, slightly different

81
00:04:56,210 --> 00:05:01,410
annotation task types. It's more a text classification experiment

82
00:05:01,730 --> 00:05:04,160
or, well, it's more than that actually. I'm working on

83
00:05:04,160 --> 00:05:08,080
a voice app. So it's a prototype, I guess, is

84
00:05:08,240 --> 00:05:12,720
really more accurate. But you can do that and you

85
00:05:12,720 --> 00:05:15,200
can work backwards. You're like, you listen back to a

86
00:05:15,200 --> 00:05:18,720
voice note and you painfully go through one of those

87
00:05:19,040 --> 00:05:21,840
transcribing, you know, where you start and stop and scrub

88
00:05:22,000 --> 00:05:23,920
around it and you fix the errors, but it's really,

89
00:05:24,080 --> 00:05:26,720
really boring to do that. So I thought it would

90
00:05:26,800 --> 00:05:29,040
be less tedious in the long term if I just

91
00:05:30,059 --> 00:05:32,940
recorded the source of truth. So it gave me these

92
00:05:33,020 --> 00:05:36,140
three-minute snippets. I recorded them. It saved an MP3

93
00:05:36,380 --> 00:05:39,500
and a TXT in the same folder, and I created

94
00:05:39,580 --> 00:05:42,860
an hour with that data. So I was very hopeful,

95
00:05:43,260 --> 00:05:46,860
quietly, a little bit hopeful, that I could actually fine-

96
00:05:46,940 --> 00:05:50,460
tune Whisper. I want to fine-tune Whisper because, when

97
00:05:50,540 --> 00:05:54,780
I got into voice tech last November, my wife was in

98
00:05:54,780 --> 00:05:58,140
the US and I was alone at home. And when

99
00:05:58,600 --> 00:06:01,400
crazy people like me do really wild things like use

100
00:06:01,640 --> 00:06:06,120
voice-to-text technology. That was basically when I started

101
00:06:06,200 --> 00:06:08,760
doing it, I didn't feel like a crazy person speaking

102
00:06:08,840 --> 00:06:13,720
to myself. And my expectations weren't that high. I used

103
00:06:14,280 --> 00:06:17,640
speech tech now and again, tried it out. I was

104
00:06:17,640 --> 00:06:19,160
like, it'd be really cool if you could just, like,

105
00:06:19,320 --> 00:06:22,760
speak into your computer. And whatever I tried out that

106
00:06:23,000 --> 00:06:26,590
had Linux support was just, it was not good, basically.

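One way to put a number on "not good" versus "good enough" is word error rate (WER). A minimal sketch, assuming the third-party jiwer package, and borrowing a grocery-list garble that shows up later in this recording:

```python
# pip install jiwer
from jiwer import wer

reference = "i need to buy three liters of milk tomorrow and pita bread"
hypothesis = "i need to buy three bread eaters of milk tomorrow and peter bread"

# WER counts substitutions, insertions, and deletions over reference words:
# here 2 substitutions + 1 insertion over 12 words = 0.25.
print(f"WER: {wer(reference, hypothesis):.2f}")
```
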
107
00:06:27,230 --> 00:06:29,470
And this blew me away from the first go. I

108
00:06:29,470 --> 00:06:32,750
mean, it wasn't 100% accurate out of the box and

109
00:06:32,830 --> 00:06:34,910
it took work, but it was good enough that there

110
00:06:34,990 --> 00:06:37,470
was a solid foundation, and it kind of passed that

111
00:06:38,670 --> 00:06:41,870
pivot point where it's actually worth doing this. You know,

112
00:06:42,030 --> 00:06:44,670
there's a point where, so, like, the transcript,

113
00:06:44,910 --> 00:06:47,310
you don't have to get 100% accuracy for it to

114
00:06:47,310 --> 00:06:50,030
be worth your time, for speech-to-text to be a

115
00:06:50,030 --> 00:06:52,430
worthwhile addition to your productivity, but you do need to

116
00:06:52,430 --> 00:06:55,970
get above, let's say, I don't know, 85%. If it's

117
00:06:56,130 --> 00:06:59,810
60% or 50%, you inevitably say, screw it, I'll just

118
00:06:59,810 --> 00:07:02,770
type it, because you end up missing errors in the

119
00:07:02,770 --> 00:07:05,490
transcript and it becomes actually worse. You end up in

120
00:07:05,490 --> 00:07:07,570
a worse position than you started with. That's been my

121
00:07:07,650 --> 00:07:11,970
experience. So I was like, oh, this is actually really,

122
00:07:12,130 --> 00:07:13,970
really good now. How did that happen? And the answer

123
00:07:14,130 --> 00:07:19,410
is ASR, Whisper being open source, and the Transformer

124
00:07:19,410 --> 00:07:23,170
architecture, if you want to go back to the, to

125
00:07:23,250 --> 00:07:26,370
the underpinnings, which really blows my mind, and it's on

126
00:07:26,450 --> 00:07:30,680
my list to read through that paper, Attention Is

127
00:07:30,760 --> 00:07:35,960
All You Need, as attentively as can be done

128
00:07:36,200 --> 00:07:39,320
with my limited brain, because it's super, super high-level

129
00:07:39,640 --> 00:07:44,520
stuff, super advanced stuff, I mean. But, I think,

130
00:07:44,680 --> 00:07:49,320
of all the things that are fascinating about the sudden

131
00:07:49,640 --> 00:07:53,700
rise in AI and the dramatic capabilities, I find it

132
00:07:53,700 --> 00:07:56,100
fascinating that a few people are like, hang on, you've

133
00:07:56,100 --> 00:07:58,420
got this thing that can speak to you, like a

134
00:07:58,420 --> 00:08:02,980
chatbot, an LLM, and then you've got image generation. Okay,

135
00:08:03,060 --> 00:08:06,580
so firstly, those two things on the surface have nothing

136
00:08:06,900 --> 00:08:10,740
in common. So, like, how are they, how did that

137
00:08:10,900 --> 00:08:12,500
just happen all at the same time? And then when

138
00:08:12,500 --> 00:08:16,580
you extend that further, you're like, Suno, right? You can

139
00:08:17,060 --> 00:08:20,030
sing a song and AI will come up with an

140
00:08:20,190 --> 00:08:23,390
instrumental. And then you've got Whisper and you're like, wait

141
00:08:23,390 --> 00:08:25,870
a second, how did all this stuff, like, if it's

142
00:08:25,870 --> 00:08:29,230
all AI, what's, like, there has to be some commonality.

143
00:08:29,470 --> 00:08:34,590
Otherwise, these are totally different technologies on the surface of

144
00:08:34,590 --> 00:08:38,830
it. And the Transformer architecture is, as far as I

145
00:08:38,910 --> 00:08:41,550
know, the answer. And I can't even say, can't even

146
00:08:41,630 --> 00:08:46,270
pretend that I really understand what the Transformer architecture means

147
00:08:46,770 --> 00:08:49,250
in depth, but I have scanned it and, as I

148
00:08:49,410 --> 00:08:51,810
said, I want to print it and really kind of

149
00:08:52,210 --> 00:08:56,050
think over it at some point. And I'll probably feel

150
00:08:56,290 --> 00:08:59,250
bad about myself, I think, because weren't those guys in

151
00:08:59,330 --> 00:09:03,410
their 20s? Like, that's crazy. I think I asked ChatGPT

152
00:09:03,490 --> 00:09:07,890
once who wrote that paper and how old they were

153
00:09:08,050 --> 00:09:10,770
when it was published on arXiv. And I was expecting,

154
00:09:11,010 --> 00:09:13,890
like, I don't know, what do you imagine? I personally

155
00:09:13,970 --> 00:09:16,210
imagine kind of like, you know, you have these breakthroughs

156
00:09:16,370 --> 00:09:19,810
during COVID and things like that, where, like, these kind

157
00:09:19,890 --> 00:09:22,770
of really obscure scientists are, like, in their 50s and

158
00:09:22,770 --> 00:09:27,170
they've just kind of been laboring in labs and wearily

159
00:09:27,170 --> 00:09:30,450
writing and publishing in kind of obscure academic publications.

160
00:09:30,770 --> 00:09:33,170
And they finally, like, hit it big or win a

161
00:09:33,170 --> 00:09:37,250
Nobel Prize and then they're household names. So that was

162
00:09:37,330 --> 00:09:38,990
kind of what I had in mind. That was the

| 00:09:38,990 --> 00:09:42,990 | |
| mental image I'd formed of the birth of Arcsight. Like | |
| 164 | |
| 00:09:42,990 --> 00:09:46,270 | |
| I wasn't expecting 20-somethings in San Francisco, though. I thought | |
| 165 | |
| 00:09:46,350 --> 00:09:48,830 | |
| that was both very, very funny, very cool, and actually | |
| 166 | |
| 00:09:48,990 --> 00:09:52,510 | |
| kind of inspiring. It's nice to think that people who, | |
| 167 | |
| 00:09:53,310 --> 00:09:56,110 | |
| you know, just you might put them in the kind | |
| 168 | |
| 00:09:56,190 --> 00:09:59,550 | |
| of milieu or bubble or world that you are in | |
| 169 | |
| 00:09:59,630 --> 00:10:03,230 | |
| are credibly in through, you know, the series of connections | |
| 170 | |
| 00:10:03,310 --> 00:10:07,390 | |
| that are coming up with such literally world changing innovations. | |
| 171 | |
| 00:10:07,870 --> 00:10:11,460 | |
| So that was, I thought, anyway. That's that was cool. | |
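For reference, the core equation of that paper, scaled dot-product attention, with queries Q, keys K, values V, and key dimension d_k:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```
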
172
00:10:11,860 --> 00:10:14,500
Okay, voice training data. How are we doing? We're about

173
00:10:14,500 --> 00:10:18,580
10 minutes in and I'm still talking about voice technology. So

174
00:10:18,660 --> 00:10:22,100
Whisper was brilliant, and I was so excited that my

175
00:10:22,180 --> 00:10:25,380
first instinct was to go, like, oh

176
00:10:25,380 --> 00:10:26,820
my gosh, I have to get like a really good

177
00:10:26,820 --> 00:10:30,580
microphone for this. But I didn't go on a spending

178
00:10:30,580 --> 00:10:32,740
spree, because I said, I'm gonna have to just wait

179
00:10:32,740 --> 00:10:35,140
a month and see if I still use this. And

180
00:10:36,430 --> 00:10:38,910
it just kind of became, it's become really part of

181
00:10:39,070 --> 00:10:43,390
my daily routine. Like, if I'm writing an email, I'll

182
00:10:43,470 --> 00:10:46,990
record a voice note. And then I've developed, and it's

183
00:10:46,990 --> 00:10:49,070
nice to see that everyone is, like, developing the same

184
00:10:49,550 --> 00:10:51,950
things in parallel. Like, that's kind of a weird

185
00:10:51,950 --> 00:10:54,510
thing to say, but when I look, I kind of

186
00:10:54,670 --> 00:10:58,990
came, when I started working on this, these prototypes, on

187
00:10:59,070 --> 00:11:01,470
GitHub, which is where I just kind of share, very

188
00:11:01,710 --> 00:11:06,730
freely and loosely, ideas and first iterations on concepts.

189
00:11:08,490 --> 00:11:10,650
And for want of a better word, I called it

190
00:11:10,730 --> 00:11:15,450
like LLM post-processing, or cleanup, or basically a system prompt

191
00:11:15,530 --> 00:11:18,890
that, after you get back the raw text from Whisper,

192
00:11:19,050 --> 00:11:22,010
you run it through a model and say, okay, this

193
00:11:22,090 --> 00:11:26,970
is crappy text, like, add sentence structure and fix it

194
00:11:27,050 --> 00:11:32,250
up. And now, when I'm exploring the different tools that

195
00:11:32,330 --> 00:11:35,180
are out there that people have built, I see quite

196
00:11:35,420 --> 00:11:39,100
a number of projects have basically done the same thing.

197
00:11:40,460 --> 00:11:43,180
Lest that be misconstrued, I'm not saying for a millisecond

198
00:11:43,260 --> 00:11:46,220
that I inspired them. I'm sure this has been a

199
00:11:46,300 --> 00:11:49,500
thing that's been integrated into tools for a while, but

200
00:11:50,380 --> 00:11:52,300
it's the kind of thing that, when you start using

201
00:11:52,300 --> 00:11:54,780
these tools every day, the need for it is almost

202
00:11:54,940 --> 00:11:59,420
instantly apparent, because text that doesn't have any punctuation or

203
00:11:59,800 --> 00:12:03,000
paragraph spacing takes a long time to, you know, it

204
00:12:03,160 --> 00:12:05,400
takes so long to get it into a presentable email

205
00:12:05,560 --> 00:12:09,720
that, again, it, it moves speech tech to

206
00:12:09,960 --> 00:12:13,480
before that inflection point where you're like, no, it's

207
00:12:13,480 --> 00:12:15,960
just not worth it. It's, like, it'll just be

208
00:12:16,040 --> 00:12:18,520
quicker to type this. So it's a big, it's a

209
00:12:18,520 --> 00:12:21,560
little touch that actually is a big deal. Uh, so

210
00:12:21,720 --> 00:12:25,640
I was on Whisper, and I've been using Whisper, and

211
00:12:25,640 --> 00:12:28,110
I kind of, early on, found a couple of tools.

212
00:12:28,270 --> 00:12:30,510
I couldn't find what I was looking for on Linux,

213
00:12:30,670 --> 00:12:35,470
which is basically just something that'll run in the background.

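To make the post-processing idea above concrete, a minimal sketch assuming the official openai Python client; the model name is illustrative, not a recommendation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You reformat raw speech-to-text output. Add punctuation, sentence "
    "structure, and paragraph spacing. Fix obvious transcription errors. "
    "Do not add or remove content."
)

def clean_transcript(raw_text: str) -> str:
    # Second pass: raw Whisper output in, presentable text out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content
```
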
214
00:12:35,710 --> 00:12:38,030
I'll give it an API key and it will just,

215
00:12:38,190 --> 00:12:42,910
like, transcribe, with, like, a little key to start and

216
00:12:42,990 --> 00:12:47,310
stop the dictation. And the issue was, I discovered, that,

217
00:12:47,470 --> 00:12:51,070
like, most people involved in creating these projects were very

218
00:12:51,230 --> 00:12:55,070
much focused on local models, running Whisper locally, because you

219
00:12:55,150 --> 00:12:57,940
can. And I tried that a bunch of times and

220
00:12:58,020 --> 00:13:00,340
just never got results that were as good as the

221
00:13:00,340 --> 00:13:03,140
cloud. And when I began looking at the cost of

222
00:13:03,220 --> 00:13:05,700
the speech-to-text APIs and what I was spending,

223
00:13:06,260 --> 00:13:09,460
I just thought, it's actually, in my opinion,

224
00:13:09,620 --> 00:13:12,820
just one of the better deals in API spending and

225
00:13:12,820 --> 00:13:15,140
in cloud. Like, it's just not that expensive for very,

226
00:13:15,300 --> 00:13:19,300
very good models that are much more, you know, you're

227
00:13:19,300 --> 00:13:21,880
gonna be able to run the full model, the latest

228
00:13:21,880 --> 00:13:25,880
model, versus whatever you can run on your average GPU,

229
00:13:26,120 --> 00:13:29,160
unless you want to buy a crazy GPU. It doesn't

230
00:13:29,160 --> 00:13:31,080
really make sense to me. Now, privacy is another concern

231
00:13:32,120 --> 00:13:33,880
that I know is kind of, like, very much

232
00:13:33,960 --> 00:13:36,760
a separate thing, that people just don't want their voice

233
00:13:37,000 --> 00:13:40,680
data and their voice leaving their local environment, maybe for

234
00:13:40,680 --> 00:13:44,200
regulatory reasons as well. But I'm not in that camp. I

235
00:13:44,600 --> 00:13:48,840
neither really care about people listening to my grocery list,

236
00:13:49,080 --> 00:13:51,720
consisting of reminding myself that I need to buy more

237
00:13:51,800 --> 00:13:55,150
beer, Cheetos, and hummus, which are kind of the three

238
00:13:55,310 --> 00:13:59,870
staples of my diet during periods of poorer nutrition. But

239
00:13:59,950 --> 00:14:02,430
the kind of stuff that I transcribe, it's just not,

240
00:14:03,950 --> 00:14:07,710
it's not a privacy thing I'm that sort of sensitive

241
00:14:07,790 --> 00:14:13,150
about, and I don't do anything so sensitive or secure

242
00:14:13,230 --> 00:14:16,430
that requires air gapping. So I looked at the pricing,

243
00:14:16,510 --> 00:14:19,790
and especially the kind of older, mini models. Some of

244
00:14:19,870 --> 00:14:21,950
them are very, very affordable. And I did a back

245
00:14:22,190 --> 00:14:25,870
of the, I did a calculation once with ChatGPT, and

246
00:14:25,870 --> 00:14:29,230
I was like, okay, this is the API price for,

247
00:14:29,390 --> 00:14:32,270
I can't remember, whatever the model was. Let's say I

248
00:14:32,350 --> 00:14:35,230
just go at it, like, nonstop, which rarely happens.

249
00:14:35,470 --> 00:14:38,830
Probably, I would say, on average, I might dictate 30

250
00:14:38,910 --> 00:14:41,790
to 60 minutes per day, if I was probably summing

251
00:14:41,790 --> 00:14:46,990
up the emails, documents, outlines, which

252
00:14:47,230 --> 00:14:49,870
is a lot, but it's still a fairly modest amount.

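That back-of-the-envelope arithmetic is easy to script; the per-minute price below is an assumption for illustration, not a quoted rate, so substitute your provider's current pricing:

```python
PRICE_PER_MINUTE = 0.006  # dollars per audio minute (assumed rate)

def monthly_cost(minutes_per_day: float, days: int = 30) -> float:
    return minutes_per_day * days * PRICE_PER_MINUTE

print(f"30 min/day: ${monthly_cost(30):.2f}/month")
print(f"60 min/day: ${monthly_cost(60):.2f}/month")
# "Flat out": say 45 dictated minutes per waking hour, 16 waking hours.
print(f"Flat out:   ${monthly_cost(45 * 16):.2f}/month")
```
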
253
00:14:50,030 --> 00:14:51,940
And I was like, some days I do go on,

254
00:14:52,100 --> 00:14:54,900
like one or two days, usually when

255
00:14:54,900 --> 00:14:56,980
I'm, like, kind of out of the house and just

256
00:14:57,220 --> 00:15:00,500
have something, like, I have nothing else to do. Like,

257
00:15:00,660 --> 00:15:04,020
if I'm at a hospital, we have a newborn, and

258
00:15:04,180 --> 00:15:07,300
you're waiting, like, eight hours, hours and hours, for an

259
00:15:07,380 --> 00:15:10,820
appointment. And I would probably have listened to podcasts before

260
00:15:11,380 --> 00:15:14,180
becoming a speech fanatic. And I'm like, oh, wait, let

261
00:15:14,340 --> 00:15:16,259
me just get this down. Let me just get these ideas

262
00:15:16,420 --> 00:15:18,540
out of my head. And that's when I'll go on

263
00:15:19,260 --> 00:15:21,820
my speech binges. But those are, like, once every few

264
00:15:21,820 --> 00:15:24,940
months, like, not frequently. But I said, okay, let's just

265
00:15:25,020 --> 00:15:29,100
say, if I'm gonna price out cloud STT, if I

266
00:15:29,180 --> 00:15:33,900
was, like, dedicated every second of every waking hour to

267
00:15:34,060 --> 00:15:37,900
transcribing for some odd reason, I mean, I'd have to,

268
00:15:37,980 --> 00:15:40,780
like, eat and use the toilet. Like, you know, there's

269
00:15:40,860 --> 00:15:43,420
only so many hours I'm awake for. So, like, let's

270
00:15:43,420 --> 00:15:46,620
just say a maximum of, like, 40, 45 minutes

271
00:15:47,210 --> 00:15:49,290
in the hour. Then I said, all right, let's just

272
00:15:49,290 --> 00:15:52,890
say 50. Who knows? You're dictating on the toilet. We

273
00:15:53,050 --> 00:15:55,050
do it. So it could be, you could just do

274
00:15:55,130 --> 00:15:59,290
60. But whatever I did, and every day, like, you're

275
00:15:59,370 --> 00:16:02,730
going flat out seven days a week dictating non-stop, I

276
00:16:02,730 --> 00:16:05,850
was like, what's my monthly API bill gonna be at

277
00:16:05,930 --> 00:16:08,570
this price? And it came out to, like, 70 or

278
00:16:08,570 --> 00:16:10,730
80 bucks. And I was like, well, that would be

279
00:16:11,130 --> 00:16:15,700
an extraordinary amount of dictation. And I would hope that

280
00:16:16,180 --> 00:16:19,940
there was some compelling reason worth more than $70

281
00:16:20,260 --> 00:16:23,460
that I embarked upon that project. So given that that's

282
00:16:23,460 --> 00:16:25,460
kind of the max point for me, I said that's

283
00:16:25,540 --> 00:16:29,140
actually very, very affordable. Now, you're gonna, if you want

284
00:16:29,220 --> 00:16:31,700
to spec out the costs and you want to do

285
00:16:31,700 --> 00:16:36,260
the post-processing that I really do feel is valuable, that's

286
00:16:36,340 --> 00:16:40,820
gonna cost some more as well, unless you're using Gemini,

287
00:16:41,300 --> 00:16:44,420
which, needless to say, as a random person sitting in

288
00:16:44,500 --> 00:16:49,060
Jerusalem, I have no affiliation with, not with Google, nor Anthropic,

289
00:16:49,140 --> 00:16:52,020
nor Gemini, nor any major tech vendor for that matter.

290
00:16:53,620 --> 00:16:56,820
I like Gemini, not so much as an everyday model.

291
00:16:57,300 --> 00:16:59,860
It's kind of underwhelmed in that respect, I would say.

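A sketch of the one-step transcribe-and-clean flow described next, assuming the google-generativeai client; the model name and file path are placeholders:

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Upload the audio, then send it together with a system-prompt-style
# instruction, collapsing transcription and cleanup into one call.
audio = genai.upload_file("voice_note.mp3")  # placeholder file
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model

response = model.generate_content([
    "Transcribe this audio, then clean it up: add punctuation, sentence "
    "structure, and paragraph breaks. Return only the cleaned transcript.",
    audio,
])
print(response.text)
```
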
292
00:17:00,260 --> 00:17:02,740
But for multimodal, I think it's got a lot to

293
00:17:02,740 --> 00:17:06,500
offer. And I think that the transcribing functionality, whereby it

294
00:17:06,580 --> 00:17:11,900
can process audio with a system prompt and give

295
00:17:12,060 --> 00:17:15,100
you a transcription that's cleaned up, reduces two steps to

296
00:17:15,260 --> 00:17:18,220
one. And that, for me, is a very, very big

297
00:17:18,380 --> 00:17:21,580
deal. And I feel like even Google hasn't really

298
00:17:21,820 --> 00:17:26,700
sort of thought through how useful that modality is

299
00:17:26,780 --> 00:17:29,260
and what kind of use cases you can achieve with

300
00:17:29,340 --> 00:17:31,260
it. Because I've found, in the course of this year,

301
00:17:31,900 --> 00:17:36,540
just an endless list of really kind of system

302
00:17:36,860 --> 00:17:40,220
prompt stuff that I can say, okay, I've used

303
00:17:40,220 --> 00:17:43,420
it to capture context data for AI, which is, literally,

304
00:17:43,500 --> 00:17:45,660
I might speak, if I wanted to have a

305
00:17:45,660 --> 00:17:49,740
good bank of context data about, who knows, my childhood,

306
00:17:50,300 --> 00:17:54,220
more realistically maybe my career goals, something that would just

307
00:17:54,300 --> 00:17:56,700
be, like, really boring to type out. So I'll just,

308
00:17:56,780 --> 00:18:00,780
like, sit in my car and record it for 10

309
00:18:00,860 --> 00:18:03,100
minutes. And in that 10 minutes you get a lot of

310
00:18:03,260 --> 00:18:08,650
information in. Um, emails, which are short texts, just,

311
00:18:09,050 --> 00:18:12,250
there's a whole bunch, and all these workflows kind

312
00:18:12,410 --> 00:18:14,410
of require a little bit of treatment afterwards, and different

313
00:18:14,650 --> 00:18:18,090
treatment. My context pipeline is kind of, like, just extract

314
00:18:18,170 --> 00:18:20,970
the bare essentials. So you end up with me talking

315
00:18:21,050 --> 00:18:22,970
very loosely about sort of what I've done in my

316
00:18:23,050 --> 00:18:25,370
career, where I've worked, where I might like to work.

317
00:18:25,850 --> 00:18:28,970
And it goes, it condenses that down to very robotic

318
00:18:29,210 --> 00:18:32,490
language that is easy to chunk, parse, and maybe put

319
00:18:32,570 --> 00:18:36,550
into a vector database. Daniel has worked in technology, Daniel

320
00:18:37,430 --> 00:18:40,150
has been working in, you know, stuff like that. That's

321
00:18:40,150 --> 00:18:43,110
not how you would speak, but I figure it's probably

322
00:18:43,350 --> 00:18:47,350
easier to parse for, after all, robots. So we've almost

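A sketch of that condensing step; the prompt wording is illustrative, and the chunking is simply one statement per line:

```python
CONDENSE_PROMPT = (
    "Extract the bare essentials from this monologue as short, standalone, "
    "third-person factual statements, one per line, e.g. "
    "'Daniel has worked in technology.' No commentary."
)

def to_chunks(condensed: str) -> list[str]:
    # One terse statement per line -> one chunk each, ready to embed
    # and store in a vector database.
    return [line.strip() for line in condensed.splitlines() if line.strip()]
```
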
323
00:18:47,430 --> 00:18:49,270
got to 20 minutes, and this is actually a success,

324
00:18:49,750 --> 00:18:55,110
because I'd wasted 20 minutes of the evening speaking

325
00:18:55,190 --> 00:18:59,910
into a microphone and the levels were shot and it

326
00:18:59,910 --> 00:19:01,590
was clipping, and I said, I can't really do an

327
00:19:01,670 --> 00:19:03,990
evaluation. I have to be fair. I have to give

328
00:19:04,560 --> 00:19:07,920
the models a chance to do their thing. What am

329
00:19:07,920 --> 00:19:10,320
I hoping to achieve in this? Okay, my fine-tune

330
00:19:10,320 --> 00:19:13,360
was a dud, as mentioned. Deepgram STT, I'm really, really

331
00:19:13,440 --> 00:19:16,480
hopeful that this prototype will work, and it's build-

332
00:19:16,720 --> 00:19:19,280
in-public, open source, so anyone is welcome to use

333
00:19:19,360 --> 00:19:22,320
it if I make anything good. But that was really

334
00:19:22,480 --> 00:19:26,480
exciting for me last night when, after hours of trying

335
00:19:26,560 --> 00:19:30,480
my own prototype, seeing someone had just made something that works

336
00:19:30,640 --> 00:19:32,400
like that, you know, you're not gonna have to build

337
00:19:32,640 --> 00:19:37,460
a custom conda environment and image. I have an AMD GPU,

338
00:19:37,620 --> 00:19:40,980
which makes things much more complicated. I didn't find it.

339
00:19:41,540 --> 00:19:42,980
And I was about to give up, and I said,

340
00:19:43,060 --> 00:19:45,460
all right, let me just give Deepgram's Linux thing

341
00:19:45,940 --> 00:19:49,220
a shot. And if this doesn't work, I'm just going

342
00:19:49,220 --> 00:19:50,980
to go back to trying to vibe code something myself.

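For reference, a pre-recorded transcription call against Deepgram's documented /v1/listen REST endpoint looks roughly like this; the model parameter is illustrative:

```python
import os
import requests

def transcribe(path: str) -> str:
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true",
            headers={
                "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
                "Content-Type": "audio/mpeg",  # match your file's format
            },
            data=f,
        )
    resp.raise_for_status()
    # Documented response shape: results -> channels -> alternatives.
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
```
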
343
00:19:51,620 --> 00:19:55,460
And when I ran the script, I was using Claude

344
00:19:55,540 --> 00:19:59,060
Code to do the installation process. It ran the script

345
00:19:59,140 --> 00:20:02,020
and, oh my gosh, it works just like that. The

346
00:20:02,100 --> 00:20:05,980
tricky thing, for all those who want to know all

347
00:20:05,980 --> 00:20:11,260
the nitty-gritty details, was that I

348
00:20:11,260 --> 00:20:14,380
don't think it was actually struggling with transcription, but pasting:

349
00:20:14,700 --> 00:20:18,140
Wayland makes life very hard. And I think there was

350
00:20:18,220 --> 00:20:21,500
something not running at the right time. Anyway, Deepgram, I looked

351
00:20:21,500 --> 00:20:23,820
at how they actually handled that, because it worked out

352
00:20:23,900 --> 00:20:26,540
of the box when other stuff didn't. And it was

353
00:20:27,100 --> 00:20:30,570
quite a clever little mechanism. But more so than

354
00:20:30,650 --> 00:20:33,290
that, the accuracy was brilliant. Now, what am I doing

355
00:20:33,290 --> 00:20:35,930
here? This is going to be a 20-minute audio

356
00:20:36,490 --> 00:20:42,010
sample. And I think I've done one or two

357
00:20:42,170 --> 00:20:46,570
of these before, but I did it with short, snappy

358
00:20:46,730 --> 00:20:49,770
voice notes. This is kind of long-form. This actually

359
00:20:50,010 --> 00:20:52,170
might be a better approximation for what's useful to me

360
00:20:52,330 --> 00:20:55,890
than voice memos. Like, I need to buy bread, three

361
00:20:55,970 --> 00:20:58,610
liters of milk tomorrow and pita bread, which is probably

362
00:20:58,770 --> 00:21:01,330
how, like, half my voice notes sound. Like, if anyone

363
00:21:01,810 --> 00:21:04,050
were to, I don't know, like, find my phone, they'd

364
00:21:04,050 --> 00:21:05,570
be like, this is the most boring person in the

365
00:21:05,570 --> 00:21:09,330
world. Although actually, there are some, like, kind of journaling

366
00:21:09,330 --> 00:21:11,490
thoughts as well, but it's a lot of content like

367
00:21:11,490 --> 00:21:14,450
that. And probably, for the evaluation, the most useful

368
00:21:14,530 --> 00:21:20,210
thing is slightly obscure tech: GitHub, NeocleNo, Hugging

369
00:21:20,290 --> 00:21:22,940
Face, not so obscure that it's not going to have

370
00:21:23,020 --> 00:21:26,460
a chance of knowing it, but hopefully sufficiently well known

371
00:21:26,460 --> 00:21:28,700
that the model should get it. I tried to do

372
00:21:28,780 --> 00:21:31,580
a little bit of speaking really fast and speaking very

373
00:21:31,740 --> 00:21:35,020
slowly. I would say, in general, I've spoken, delivered this

374
00:21:35,180 --> 00:21:37,500
at a faster pace than I usually would, owing to

375
00:21:37,980 --> 00:21:42,460
strong coffee flowing through my bloodstream. And the thing that

376
00:21:42,460 --> 00:21:44,700
I'm not going to get in this benchmark is background

377
00:21:44,780 --> 00:21:46,460
noise, which, in my first take, that I had to

378
00:21:46,460 --> 00:21:49,710
get rid of, my wife came in with my son

379
00:21:50,030 --> 00:21:52,350
for a goodnight kiss. And that actually would have

380
00:21:52,350 --> 00:21:56,510
been super helpful to get in, because it was non-

381
00:21:56,590 --> 00:22:00,190
diarized, or if we had diarization, a female voice, I could

382
00:22:00,190 --> 00:22:02,430
say, I want the male voice, and that wasn't intended

383
00:22:02,430 --> 00:22:05,870
for transcription. And we're not going to get background noise

384
00:22:05,950 --> 00:22:08,270
like people honking their horns, which is something I've done

385
00:22:08,430 --> 00:22:11,150
in my main data set, where I am trying to

386
00:22:11,390 --> 00:22:14,340
go back to some of my voice notes, annotate them,

387
00:22:14,580 --> 00:22:16,420
and run a benchmark. But this is going to be

388
00:22:16,420 --> 00:22:21,700
just a pure, quick test. And, as someone,

389
00:22:22,260 --> 00:22:24,660
I'm working on a voice note idea. That's my sort

390
00:22:24,660 --> 00:22:28,660
of end motivation, besides thinking it's an ode to the

391
00:22:28,660 --> 00:22:32,340
outstanding technology that's coming to viability, and really, I know

392
00:22:32,420 --> 00:22:35,940
this sounds cheesy, it can actually have a very transformative effect.

393
00:22:36,980 --> 00:22:41,130
It's, you know, voice technology has been life-changing for

394
00:22:41,930 --> 00:22:46,970
folks living with disabilities. And I think

395
00:22:47,130 --> 00:22:48,970
there's something really nice about the fact that it can

396
00:22:49,130 --> 00:22:52,490
also benefit, you know, folks who are able-bodied, and,

397
00:22:52,650 --> 00:22:57,690
like, we can all, in different ways, make this tech

398
00:22:57,770 --> 00:23:00,410
as useful as possible, regardless of the exact way that

399
00:23:00,410 --> 00:23:03,770
we're using it. And I think there's something very powerful

400
00:23:03,850 --> 00:23:06,440
in that, and it can be very cool. I see

401
00:23:06,600 --> 00:23:10,200
huge potential. What excites me about voice tech? A lot of

402
00:23:10,280 --> 00:23:14,360
things, actually. Firstly, the fact that it's cheap and accurate,

403
00:23:14,440 --> 00:23:17,080
as I mentioned at the very start of this. And

404
00:23:17,240 --> 00:23:19,880
it's getting better and better with stuff like accent handling.

405
00:23:20,680 --> 00:23:23,400
I'm not sure my fine-tune will actually ever come to

406
00:23:23,480 --> 00:23:25,320
fruition, in the sense that I'll use it day to

407
00:23:25,400 --> 00:23:28,840
day as I imagined and get, like, superb, flawless word

408
00:23:28,920 --> 00:23:33,340
error rates, because I'm just kind of skeptical about local

409
00:23:33,500 --> 00:23:37,100
speech-to-text, as I mentioned, and I think the

410
00:23:37,180 --> 00:23:40,700
pace of innovation and improvement in the models. The main

411
00:23:40,860 --> 00:23:44,620
reasons for fine-tuning, from what I've seen, have been

412
00:23:44,780 --> 00:23:47,420
people who, and something that really blows my mind about

413
00:23:47,980 --> 00:23:53,100
ASR is the idea that it's inherently alingual or

414
00:23:53,260 --> 00:23:58,570
multilingual, phonetic-based, so, folks who speak

415
00:23:58,890 --> 00:24:02,250
very obscure languages, where there might be a paucity of

416
00:24:02,250 --> 00:24:04,890
training data or almost none at all, and therefore the

417
00:24:04,890 --> 00:24:10,090
accuracy is significantly reduced. Or folks in very critical

418
00:24:10,330 --> 00:24:14,250
environments, I know this is used extensively in medical transcription

419
00:24:14,330 --> 00:24:19,130
and dispatcher work, the call centers who send out ambulances,

420
00:24:19,210 --> 00:24:23,130
et cetera, where accuracy is absolutely paramount. And in the

421
00:24:23,130 --> 00:24:26,860
case of doctors, radiologists, they might be using very specialized

422
00:24:26,860 --> 00:24:29,420
vocab all the time. So those are kind of the

423
00:24:29,500 --> 00:24:31,420
main two things. And I'm not sure that, really, just

424
00:24:31,500 --> 00:24:34,940
for trying to make it better on a few random

425
00:24:34,940 --> 00:24:37,900
tech words with my slightly, I mean, I have an

426
00:24:37,980 --> 00:24:41,020
accent, but, like, not, you know, an accent that a

427
00:24:41,100 --> 00:24:45,900
few million other people have, ish. I'm not sure that

428
00:24:46,380 --> 00:24:50,300
my little fine-tune is gonna actually, like, deliver the bump

429
00:24:50,460 --> 00:24:53,500
in word error reduction, if I ever actually figure out

430
00:24:53,500 --> 00:24:54,620
how to do it and get it up to the

431
00:24:54,700 --> 00:24:57,870
cloud. By the time we've done that, I suspect that

432
00:24:58,190 --> 00:25:00,430
the next generation of ASR will just be so good

433
00:25:00,510 --> 00:25:02,990
that it will kind of be, well, that would have

434
00:25:02,990 --> 00:25:04,670
been cool if it worked out, but I'll just use

435
00:25:04,750 --> 00:25:08,510
this instead. So that's going to be it for today's

436
00:25:08,830 --> 00:25:14,030
episode of voice training data: single long-shot evaluation.

437
00:25:14,350 --> 00:25:17,150
Who am I going to compare? Whisper is always good

438
00:25:17,150 --> 00:25:20,510
as a benchmark, but I'm more interested in seeing Whisper

439
00:25:20,590 --> 00:25:24,510
head-to-head with two things, really. One is Whisper

440
00:25:24,590 --> 00:25:29,700
variants. So you've got these projects like faster-whisper and Distil-Whisper;

441
00:25:29,780 --> 00:25:31,700
it's a bit confusing, there's a whole bunch of them.

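A sketch of that head-to-head scoring, again assuming the jiwer package; the model names and transcript file paths are hypothetical:

```python
from pathlib import Path
from jiwer import wer

# The hand-corrected reference transcript (the "source of truth").
truth = Path("source_of_truth.txt").read_text()

# Hypothetical per-model transcript files produced beforehand.
candidates = {
    "whisper": "whisper.txt",
    "faster-whisper": "faster_whisper.txt",
    "distil-whisper": "distil_whisper.txt",
    "deepgram-nova-2": "deepgram.txt",
}

scores = {name: wer(truth, Path(p).read_text()) for name, p in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} WER {score:.3f}")
```
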
442
00:25:32,020 --> 00:25:35,300
And the emerging ASRs, which are also a thing. My

443
00:25:35,380 --> 00:25:37,220
intention for this is, I'm not sure I'm going to

444
00:25:37,220 --> 00:25:39,860
have the time at any point in the foreseeable future

445
00:25:40,180 --> 00:25:44,580
to go back through this whole episode and create a

446
00:25:44,660 --> 00:25:49,700
proper source of truth, where I fix everything. I might do

447
00:25:49,780 --> 00:25:52,740
it if I can get one transcription that's sufficiently close

448
00:25:52,980 --> 00:25:57,040
to perfection. But what I would actually love to do

449
00:25:57,200 --> 00:25:59,920
on Hugging Face, I think it would be great, probably

450
00:26:00,240 --> 00:26:02,880
how I might visualize this, is having the audio waveform

451
00:26:03,200 --> 00:26:08,160
play and then have the transcript for each model below

452
00:26:08,160 --> 00:26:12,560
it, and maybe even, like, you know, to scale,

453
00:26:13,120 --> 00:26:15,600
and maybe even a local one as well, like local

454
00:26:15,760 --> 00:26:21,100
Whisper versus the OpenAI API, et cetera. And I

455
00:26:21,180 --> 00:26:23,500
can then actually listen back to segments, or anyone who

456
00:26:23,500 --> 00:26:25,820
wants to can listen back to segments of this recording,

457
00:26:26,140 --> 00:26:30,940
and see where a particular model struggled and others didn't,

458
00:26:31,420 --> 00:26:33,340
as well as the sort of headline finding of which

459
00:26:33,500 --> 00:26:36,860
had the best WER, but that would require the source

460
00:26:36,860 --> 00:26:39,580
of truth. Okay, that's it. I hope this was, I

461
00:26:39,580 --> 00:26:42,540
don't know, maybe useful for other folks interested in STT.

462
00:26:42,860 --> 00:26:45,660
You know when you always feel like you've

463
00:26:45,660 --> 00:26:48,870
just said something you didn't intend to? STT, I

464
00:26:48,870 --> 00:26:52,470
said, for those listening carefully, including hopefully the models themselves.

465
00:26:53,190 --> 00:26:57,270
This has been myself, Daniel Rosell. For more jumbled repositories

466
00:26:57,350 --> 00:27:01,750
about my roving interests in AI, but particularly agentic AI, MCP,

467
00:27:01,990 --> 00:27:07,029
and voice tech, you can find me on GitHub, huggingface.com,

468
00:27:10,230 --> 00:27:13,270
which is my personal website, as well as this podcast,

469
00:27:13,510 --> 00:27:16,950
whose name I sadly cannot remember. Until next time, thanks

470
00:27:16,950 --> 00:27:17,510
for listening.