Spaces:
Running
Running
| 1 | |
| 00:00:00,000 --> 00:00:06,400 | |
| Hello and welcome to a audio data | |
| set consisting of one single | |
| 2 | |
| 00:00:06,400 --> 00:00:12,000 | |
| episode of a non-existent podcast. | |
| Or it, uh, I may append this to a | |
| 3 | |
| 00:00:12,000 --> 00:00:16,520 | |
| podcast that I set up recently. | |
| Um, regarding my, uh, | |
| 4 | |
| 00:00:16,560 --> 00:00:21,840 | |
| with my thoughts on speech, | |
| tech and AI in particular, | |
| 5 | |
| 00:00:22,120 --> 00:00:27,840 | |
| more AI and generative AI, I would, | |
| uh, I would say, but in any event, | |
| 6 | |
| 00:00:27,840 --> 00:00:32,360 | |
| the purpose of this, um, | |
| voice recording is actually to create | |
| 7 | |
| 00:00:32,560 --> 00:00:37,440 | |
| a lengthy voice sample for a quick | |
| evaluation, a back of the envelope | |
| 8 | |
| 00:00:37,440 --> 00:00:41,040 | |
| evaluation, as they might say, | |
| for different speech to text models. | |
| 9 | |
| 00:00:41,040 --> 00:00:43,680 | |
| And I'm doing this because I, | |
| uh, I thought I'd made a great | |
| 10 | |
| 00:00:43,680 --> 00:00:48,200 | |
| breakthrough in my journey with | |
| speech tech, and that was succeeding | |
| 11 | |
| 00:00:48,200 --> 00:00:52,600 | |
| in the elusive task of fine tuning. | |
| Whisper, whisper is. | |
| 12 | |
| 00:00:52,720 --> 00:00:56,840 | |
| And I'm going to just talk. | |
| I'm trying to mix up, uh, | |
| 13 | |
| 00:00:56,840 --> 00:01:00,350 | |
| I'm going to try a few different | |
| styles of speaking. | |
| 14 | |
| 00:01:00,350 --> 00:01:02,510 | |
| I might whisper something at | |
| some point as well, | |
| 15 | |
| 00:01:03,070 --> 00:01:07,030 | |
| and I'll go back to speaking loud in, | |
| uh, in different parts. | |
| 16 | |
| 00:01:07,030 --> 00:01:09,590 | |
| I'm going to sound really like a | |
| crazy person, because I'm also | |
| 17 | |
| 00:01:09,590 --> 00:01:15,750 | |
| going to try to speak at different | |
| pitches and cadences in order to | |
| 18 | |
| 00:01:15,790 --> 00:01:20,510 | |
| really try to put a speech to | |
| text model through its paces, | |
| 19 | |
| 00:01:20,510 --> 00:01:25,750 | |
| which is trying to make sense of, | |
| is this guy just on incoherently in | |
| 20 | |
| 00:01:25,750 --> 00:01:34,230 | |
| one long sentence, or are these just | |
| actually a series of step standalone, | |
| 21 | |
| 00:01:34,230 --> 00:01:37,390 | |
| standalone, standalone sentences? | |
| And how is it going to handle | |
| 22 | |
| 00:01:37,390 --> 00:01:40,630 | |
| step alone? That's not a word. | |
| Uh, what happens when you use | |
| 23 | |
| 00:01:40,630 --> 00:01:43,910 | |
| speech to text and you use a fake | |
| word and then you're like, wait, | |
| 24 | |
| 00:01:43,910 --> 00:01:48,230 | |
| that's not actually that word doesn't | |
| exist. How does AI handle that? | |
| 25 | |
| 00:01:48,270 --> 00:01:53,790 | |
| And, uh, these and more are all | |
| the questions that I'm seeking | |
| 26 | |
| 00:01:53,790 --> 00:01:57,230 | |
| to answer in this training data. | |
| Now, why did why was it trying | |
| 27 | |
| 00:01:57,230 --> 00:01:59,620 | |
| to fine tune a whisper? | |
| And what is whisper? | |
| 28 | |
| 00:01:59,660 --> 00:02:03,420 | |
| As I said, I'm gonna try to, uh, | |
| record this at a couple of different | |
| 29 | |
| 00:02:03,420 --> 00:02:08,940 | |
| levels of technicality for folks who | |
| are, uh, you know, in the normal, uh, | |
| 30 | |
| 00:02:08,940 --> 00:02:13,340 | |
| world and not totally stuck down | |
| the rabbit hole of AI, uh, which I | |
| 31 | |
| 00:02:13,340 --> 00:02:17,340 | |
| have to say is a really wonderful, | |
| uh, rabbit hole to be to be down. | |
| 32 | |
| 00:02:17,460 --> 00:02:21,580 | |
| Um, it's a really interesting area. | |
| And speech and voice tech is is | |
| 33 | |
| 00:02:21,820 --> 00:02:24,860 | |
| the aspect of it that I find | |
| actually most. | |
| 34 | |
| 00:02:25,060 --> 00:02:28,220 | |
| I'm not sure I would say the most | |
| interesting, because there's just | |
| 35 | |
| 00:02:28,220 --> 00:02:32,580 | |
| so much that is fascinating in AI. | |
| Uh, but the most that I find the | |
| 36 | |
| 00:02:32,580 --> 00:02:36,100 | |
| most personally transformative | |
| in terms of the impact that it's | |
| 37 | |
| 00:02:36,100 --> 00:02:41,540 | |
| had on my daily work life and | |
| productivity and how I sort of work. | |
| 38 | |
| 00:02:41,820 --> 00:02:47,900 | |
| And I'm persevering hard with the | |
| task of trying to guess a good | |
| 39 | |
| 00:02:47,900 --> 00:02:51,580 | |
| solution working for Linux, which if | |
| anyone actually does listen to this, | |
| 40 | |
| 00:02:51,580 --> 00:02:54,980 | |
| not just for the training data | |
| and for the actual content, uh, | |
| 41 | |
| 00:02:55,020 --> 00:02:59,480 | |
| this is this is has sparked I had | |
| besides the fine tune not working. | |
| 42 | |
| 00:02:59,480 --> 00:03:05,440 | |
| Well, that was the failure. | |
| Um, I used clod code because one | |
| 43 | |
| 00:03:05,440 --> 00:03:10,040 | |
| thinks these days that there is | |
| nothing short of solving, | |
| 44 | |
| 00:03:10,920 --> 00:03:14,560 | |
| you know, the, uh, | |
| the reason of life or something. | |
| 45 | |
| 00:03:14,960 --> 00:03:19,440 | |
| Uh, that clod and agentic AI can't | |
| do, uh, which is not really the case. | |
| 46 | |
| 00:03:19,480 --> 00:03:23,480 | |
| Uh, it does seem that way sometimes, | |
| but it fails a lot as well. | |
| 47 | |
| 00:03:23,480 --> 00:03:26,840 | |
| And this is one of those, uh, | |
| instances where last week I put | |
| 48 | |
| 00:03:26,840 --> 00:03:31,280 | |
| together an hour of voice training | |
| data, basically speaking just | |
| 49 | |
| 00:03:31,280 --> 00:03:34,920 | |
| random things for three minutes. | |
| And, um, | |
| 50 | |
| 00:03:35,600 --> 00:03:38,400 | |
| it was actually kind of tedious | |
| because the texts were really weird. | |
| 51 | |
| 00:03:38,400 --> 00:03:42,000 | |
| Some of them were it was like it | |
| was AI generated. | |
| 52 | |
| 00:03:42,200 --> 00:03:44,800 | |
| Um, I tried before to read | |
| Sherlock Holmes for an hour and | |
| 53 | |
| 00:03:44,800 --> 00:03:46,880 | |
| I just couldn't. | |
| I was so bored, uh, | |
| 54 | |
| 00:03:46,920 --> 00:03:50,680 | |
| after ten minutes that I was like, | |
| okay, now I'm just gonna have to | |
| 55 | |
| 00:03:50,680 --> 00:03:56,350 | |
| find something else to read. | |
| So I used a created with AI | |
| 56 | |
| 00:03:56,390 --> 00:04:00,030 | |
| studio vibe coded. | |
| A synthetic text generator. | |
| 57 | |
| 00:04:00,270 --> 00:04:03,870 | |
| Um, which actually I thought was | |
| probably a better way of doing it | |
| 58 | |
| 00:04:03,870 --> 00:04:08,750 | |
| because it would give me more short | |
| samples with more varied content. | |
| 59 | |
| 00:04:08,750 --> 00:04:13,190 | |
| So I was like, okay, give me a voice | |
| note, like I'm recording an email, | |
| 60 | |
| 00:04:13,190 --> 00:04:17,990 | |
| give me a short story to read, | |
| give me prose, um, to read. | |
| 61 | |
| 00:04:17,990 --> 00:04:21,190 | |
| So I came up with all these | |
| different things, and I added a | |
| 62 | |
| 00:04:21,190 --> 00:04:24,630 | |
| little timer to it so I could | |
| see how close I was to one hour. | |
| 63 | |
| 00:04:24,870 --> 00:04:29,710 | |
| Um, and, uh, I spent like an hour one | |
| afternoon or probably two hours by | |
| 64 | |
| 00:04:29,710 --> 00:04:34,070 | |
| the time you, um, you do retakes | |
| or whatever because you want to. | |
| 65 | |
| 00:04:34,870 --> 00:04:39,070 | |
| It gave me a source of truth, | |
| which I'm not sure if that's the | |
| 66 | |
| 00:04:39,070 --> 00:04:43,430 | |
| scientific way to approach this topic | |
| of gathering, uh, training data, | |
| 67 | |
| 00:04:43,430 --> 00:04:47,950 | |
| but I thought it made sense. | |
| Um, I have a lot of audio data | |
| 68 | |
| 00:04:47,950 --> 00:04:51,950 | |
| from recording voice notes, | |
| which I've also kind of used, um, | |
| 69 | |
| 00:04:51,950 --> 00:04:55,660 | |
| been experimenting with using for | |
| a different purpose, slightly | |
| 70 | |
| 00:04:55,660 --> 00:05:00,700 | |
| different annotating task types. | |
| It's more text classification | |
| 71 | |
| 00:05:00,700 --> 00:05:03,620 | |
| experiment or uh, well, | |
| it's more than that, actually. | |
| 72 | |
| 00:05:03,620 --> 00:05:07,980 | |
| I'm working on a voice app, | |
| so it's a prototype I guess is | |
| 73 | |
| 00:05:07,980 --> 00:05:12,660 | |
| really more accurate. | |
| Um, but you can do that and you | |
| 74 | |
| 00:05:12,660 --> 00:05:14,100 | |
| can work backwards. | |
| You're like, | |
| 75 | |
| 00:05:14,140 --> 00:05:18,500 | |
| you listen back to a voice note | |
| and you painfully go through one | |
| 76 | |
| 00:05:18,500 --> 00:05:21,860 | |
| of those transcribing, you know, | |
| where you start and stop and scrub | |
| 77 | |
| 00:05:21,860 --> 00:05:23,980 | |
| around it and you fix the errors. | |
| But it's really, | |
| 78 | |
| 00:05:23,980 --> 00:05:27,100 | |
| really boring to do that. | |
| So I thought it would be less | |
| 79 | |
| 00:05:27,100 --> 00:05:31,740 | |
| tedious in the long term if I just | |
| recorded The Source of truth. | |
| 80 | |
| 00:05:32,060 --> 00:05:34,180 | |
| So it gave me these three minute | |
| snippets. | |
| 81 | |
| 00:05:34,180 --> 00:05:38,660 | |
| I recorded them and saved an MP3 | |
| and a txt in the same folder, | |
| 82 | |
| 00:05:38,660 --> 00:05:43,700 | |
| and I created an hour of that data. | |
| Uh, so I was very hopeful, quietly, | |
| 83 | |
| 00:05:43,740 --> 00:05:46,260 | |
| you know, a little bit hopeful | |
| that I would be able that I could | |
| 84 | |
| 00:05:46,260 --> 00:05:49,580 | |
| actually fine tune, whisper. | |
| Um, I want to fine tune whisper | |
| 85 | |
| 00:05:49,580 --> 00:05:54,720 | |
| because when I got into voice tech | |
| last November, my wife was in | |
| 86 | |
| 00:05:54,720 --> 00:05:59,480 | |
| the US and I was alone at home. | |
| And you know, when crazy people | |
| 87 | |
| 00:05:59,480 --> 00:06:03,640 | |
| like me do really wild things like | |
| use voice to tech, uh, technology. | |
| 88 | |
| 00:06:03,640 --> 00:06:06,400 | |
| That was basically, um, | |
| when I started doing it, | |
| 89 | |
| 00:06:06,400 --> 00:06:10,160 | |
| I didn't feel like a crazy person | |
| speaking to myself, and my | |
| 90 | |
| 00:06:10,160 --> 00:06:16,000 | |
| expectations weren't that high. | |
| Uh, I used speech tech now and again. | |
| 91 | |
| 00:06:16,080 --> 00:06:18,360 | |
| Um, tried it out. | |
| I was like, it'd be really cool | |
| 92 | |
| 00:06:18,360 --> 00:06:20,400 | |
| if you could just, like, | |
| speak into your computer. | |
| 93 | |
| 00:06:20,760 --> 00:06:24,600 | |
| And whatever I tried out that | |
| had Linux support was just. | |
| 94 | |
| 00:06:25,320 --> 00:06:28,520 | |
| It was not good, basically. | |
| Um, and this blew me away from | |
| 95 | |
| 00:06:28,520 --> 00:06:31,920 | |
| the first go. | |
| I mean, it wasn't 100% accurate | |
| 96 | |
| 00:06:31,960 --> 00:06:35,040 | |
| out of the box and it took work, | |
| but it was good enough that there was | |
| 97 | |
| 00:06:35,040 --> 00:06:39,600 | |
| a solid foundation and it kind of | |
| passed that, uh, pivot point that | |
| 98 | |
| 00:06:39,600 --> 00:06:42,760 | |
| it's actually worth doing this. | |
| You know, there's a point where | |
| 99 | |
| 00:06:42,760 --> 00:06:46,800 | |
| it's so like the transcript is you | |
| don't have to get 100% accuracy | |
| 100 | |
| 00:06:46,800 --> 00:06:50,510 | |
| for it to be worth your time for | |
| speech to text to be a worthwhile | |
| 101 | |
| 00:06:50,510 --> 00:06:52,950 | |
| addition to your productivity. | |
| But you do need to get above. | |
| 102 | |
| 00:06:52,990 --> 00:06:57,630 | |
| Let's say, I don't know, 85%. | |
| If it's 60% or 50%, | |
| 103 | |
| 00:06:57,630 --> 00:07:00,670 | |
| you inevitably say, screw it. | |
| I'll just type it because you end up | |
| 104 | |
| 00:07:00,670 --> 00:07:04,950 | |
| missing errors in the transcript | |
| and it becomes actually worse. | |
| 105 | |
| 00:07:04,950 --> 00:07:06,710 | |
| You end up in a worse position | |
| than you started with. | |
| 106 | |
| 00:07:06,710 --> 00:07:10,910 | |
| And that's been my experience. | |
| So, um, I was like, oh, | |
| 107 | |
| 00:07:10,950 --> 00:07:13,430 | |
| this is actually really, really good. | |
| Now how did that happen? | |
| 108 | |
| 00:07:13,430 --> 00:07:18,790 | |
| And the answer is ASR whisper | |
| being open sourced and the | |
| 109 | |
| 00:07:18,790 --> 00:07:21,790 | |
| transformer architecture, | |
| if you want to go back to the, | |
| 110 | |
| 00:07:22,390 --> 00:07:26,630 | |
| um, to the underpinnings, which | |
| really blows my mind and it's on my | |
| 111 | |
| 00:07:26,630 --> 00:07:32,310 | |
| list to read through that paper. | |
| Um, all you need is attention as | |
| 112 | |
| 00:07:33,350 --> 00:07:38,350 | |
| attentively as can be done with my | |
| limited brain because it's super, | |
| 113 | |
| 00:07:38,350 --> 00:07:42,190 | |
| super high level stuff. | |
| Um, super advanced stuff. | |
| 114 | |
| 00:07:42,230 --> 00:07:47,950 | |
| I mean, uh, but that I think of all | |
| the things that are fascinating | |
| 115 | |
| 00:07:48,060 --> 00:07:52,700 | |
| about the sudden rise in AI and | |
| the dramatic capabilities. | |
| 116 | |
| 00:07:53,300 --> 00:07:55,580 | |
| I find it fascinating that few | |
| people are like, hang on, | |
| 117 | |
| 00:07:55,740 --> 00:07:59,620 | |
| you've got this thing that can speak | |
| to you like a chatbot, an LLM, | |
| 118 | |
| 00:08:00,300 --> 00:08:05,460 | |
| and then you've got image generation. | |
| Okay, so firstly, those two things on | |
| 119 | |
| 00:08:05,460 --> 00:08:10,740 | |
| the surface have nothing in common. | |
| Um, so like how are they how did that | |
| 120 | |
| 00:08:10,740 --> 00:08:12,980 | |
| just happen all at the same time. | |
| And then when you extend that | |
| 121 | |
| 00:08:12,980 --> 00:08:16,060 | |
| further, um, you're like sooner, | |
| right? | |
| 122 | |
| 00:08:16,060 --> 00:08:21,580 | |
| You can sing a song and AI will like, | |
| come up with an instrumental and then | |
| 123 | |
| 00:08:21,580 --> 00:08:23,740 | |
| you've got whisper and you're like, | |
| wait a second, | |
| 124 | |
| 00:08:23,940 --> 00:08:27,980 | |
| how did all this stuff, like, | |
| if it's all AI, what's like there | |
| 125 | |
| 00:08:27,980 --> 00:08:30,580 | |
| has to be some commonality. | |
| Otherwise these are four. | |
| 126 | |
| 00:08:30,660 --> 00:08:34,660 | |
| These are totally different | |
| technologies on the surface of it. | |
| 127 | |
| 00:08:34,660 --> 00:08:40,100 | |
| And, uh, the transformer architecture | |
| is, as far as I know, the answer. | |
| 128 | |
| 00:08:40,100 --> 00:08:43,740 | |
| And I can't even say can't even | |
| pretend that I really understand | |
| 129 | |
| 00:08:44,020 --> 00:08:47,170 | |
| what the transformer | |
| architecture means in depth, | |
| 130 | |
| 00:08:47,170 --> 00:08:51,690 | |
| but I have scanned it and as I said, | |
| I want to print it and really kind | |
| 131 | |
| 00:08:51,690 --> 00:08:56,650 | |
| of think over it at some point, | |
| and I'll probably feel bad about | |
| 132 | |
| 00:08:56,650 --> 00:08:58,970 | |
| myself, I think, | |
| because weren't those guys in their | |
| 133 | |
| 00:08:59,010 --> 00:09:03,890 | |
| in their 20s like, that's crazy. | |
| I think I asked ChatGPT once who | |
| 134 | |
| 00:09:03,930 --> 00:09:08,250 | |
| were the who wrote that paper | |
| and how old were they when it | |
| 135 | |
| 00:09:08,250 --> 00:09:11,170 | |
| was published in arXiv? | |
| And I was expecting like, | |
| 136 | |
| 00:09:11,410 --> 00:09:13,330 | |
| I don't know, | |
| what do you what do you imagine? | |
| 137 | |
| 00:09:13,330 --> 00:09:14,930 | |
| I personally imagine kind of like, | |
| you know, | |
| 138 | |
| 00:09:14,970 --> 00:09:19,090 | |
| you have these breakthroughs during | |
| Covid and things like that where | |
| 139 | |
| 00:09:19,130 --> 00:09:22,090 | |
| like these kind of really obscure | |
| scientists who are like in their | |
| 140 | |
| 00:09:22,090 --> 00:09:27,130 | |
| 50s and they've just kind of been | |
| laboring in labs and, uh, wearily | |
| 141 | |
| 00:09:27,130 --> 00:09:30,530 | |
| and writing in publishing in kind | |
| of obscure academic publications. | |
| 142 | |
| 00:09:30,730 --> 00:09:33,930 | |
| And they finally, like, | |
| hit a big or win a Nobel Prize and | |
| 143 | |
| 00:09:33,930 --> 00:09:37,810 | |
| then their household household names. | |
| Uh, so that was kind of what I | |
| 144 | |
| 00:09:37,810 --> 00:09:39,650 | |
| had in mind. | |
| That was the mental image I'd | |
| 145 | |
| 00:09:39,650 --> 00:09:43,890 | |
| formed of the birth of arXiv. | |
| Like, I wasn't expecting 20 | |
| 146 | |
| 00:09:43,930 --> 00:09:47,310 | |
| somethings in San Francisco, | |
| though I thought that was both very, | |
| 147 | |
| 00:09:47,310 --> 00:09:49,870 | |
| very funny, very cool, | |
| and actually kind of inspiring. | |
| 148 | |
| 00:09:50,390 --> 00:09:55,510 | |
| It's nice to think that people who, | |
| you know, just you might put them | |
| 149 | |
| 00:09:55,510 --> 00:10:00,910 | |
| in the kind of milieu or bubble or | |
| world that you are in or credibly in, | |
| 150 | |
| 00:10:00,950 --> 00:10:03,590 | |
| through, you know, | |
| a series of connections that are | |
| 151 | |
| 00:10:03,590 --> 00:10:07,630 | |
| coming up with such literally | |
| world changing, um, innovations. | |
| 152 | |
| 00:10:07,670 --> 00:10:11,430 | |
| Uh, so that was, I thought, | |
| anyway, that, that that was cool. | |
| 153 | |
| 00:10:12,070 --> 00:10:13,950 | |
| Okay. Voice training data. | |
| How are we doing? | |
| 154 | |
| 00:10:13,950 --> 00:10:17,990 | |
| We're about ten minutes, and I'm | |
| still talking about voice technology. | |
| 155 | |
| 00:10:18,190 --> 00:10:22,350 | |
| Um, so whisper was brilliant, | |
| and I was so excited that I was. | |
| 156 | |
| 00:10:22,350 --> 00:10:25,630 | |
| My first instinct was to, like, | |
| get like, oh, my gosh, | |
| 157 | |
| 00:10:25,630 --> 00:10:27,710 | |
| I have to get, like, | |
| a really good microphone for this. | |
| 158 | |
| 00:10:27,950 --> 00:10:31,630 | |
| So, um, I didn't go on a | |
| spending spree because I said, | |
| 159 | |
| 00:10:31,670 --> 00:10:34,470 | |
| I'm gonna have to just wait a | |
| month and see if I still use this. | |
| 160 | |
| 00:10:34,910 --> 00:10:39,990 | |
| And it just kind of became it's | |
| become really part of my daily | |
| 161 | |
| 00:10:39,990 --> 00:10:42,990 | |
| routine. | |
| Like, if I'm writing an email, | |
| 162 | |
| 00:10:42,990 --> 00:10:47,020 | |
| I'll record a voice note. | |
| And then I've developed and it's | |
| 163 | |
| 00:10:47,020 --> 00:10:49,900 | |
| nice to see that everyone is | |
| like developing the same things | |
| 164 | |
| 00:10:49,900 --> 00:10:51,900 | |
| in parallel. | |
| Like, that's kind of a weird thing | |
| 165 | |
| 00:10:51,940 --> 00:10:57,340 | |
| to say, but when I look, I kind of | |
| came when I started working on this, | |
| 166 | |
| 00:10:57,380 --> 00:11:00,700 | |
| these prototypes on GitHub, | |
| which is where I just kind of | |
| 167 | |
| 00:11:00,740 --> 00:11:04,740 | |
| share very freely and loosely, | |
| uh, ideas and, you know, | |
| 168 | |
| 00:11:04,780 --> 00:11:10,020 | |
| first iterations on, on concepts, | |
| um, and for want of a better word, | |
| 169 | |
| 00:11:10,020 --> 00:11:13,900 | |
| I called it like, uh, | |
| lm post-processing or cleanup or | |
| 170 | |
| 00:11:14,140 --> 00:11:18,100 | |
| basically a system prompt that after | |
| you get back the raw text from | |
| 171 | |
| 00:11:18,420 --> 00:11:24,100 | |
| whisper, you run it through a model | |
| and say, okay, this is crappy text, | |
| 172 | |
| 00:11:24,140 --> 00:11:27,140 | |
| like add sentence structure and, | |
| you know, fix it up. | |
| 173 | |
| 00:11:27,580 --> 00:11:32,660 | |
| And, um, now when I'm exploring the | |
| different tools that are out there | |
| 174 | |
| 00:11:32,700 --> 00:11:36,580 | |
| that people have built, I see, uh, | |
| quite a number of projects have | |
| 175 | |
| 00:11:37,180 --> 00:11:41,700 | |
| basically done the same thing, | |
| um, less that be misconstrued. | |
| 176 | |
| 00:11:41,700 --> 00:11:44,370 | |
| I'm not saying for a millisecond | |
| that I inspired them. | |
| 177 | |
| 00:11:44,370 --> 00:11:48,890 | |
| I'm sure this has been a thing that's | |
| been integrated into tools for a | |
| 178 | |
| 00:11:48,930 --> 00:11:52,290 | |
| while, but it's it's the kind of | |
| thing that when you start using these | |
| 179 | |
| 00:11:52,290 --> 00:11:56,730 | |
| tools every day, the need for it | |
| is almost instantly apparent, uh, | |
| 180 | |
| 00:11:56,730 --> 00:12:00,770 | |
| because text that doesn't have any | |
| punctuation or paragraph spacing | |
| 181 | |
| 00:12:00,810 --> 00:12:04,250 | |
| takes a long time to, you know, | |
| it takes so long to get it into | |
| 182 | |
| 00:12:04,250 --> 00:12:09,370 | |
| a presentable email that again, | |
| it's it's it moves speech tech | |
| 183 | |
| 00:12:09,410 --> 00:12:12,930 | |
| into that before that inflection | |
| point where you're like, no, | |
| 184 | |
| 00:12:12,930 --> 00:12:16,250 | |
| it's just not worth it. | |
| It's like it'll just be quicker | |
| 185 | |
| 00:12:16,250 --> 00:12:18,850 | |
| to type this. | |
| So it's a big it's a little touch. | |
| 186 | |
| 00:12:18,850 --> 00:12:24,090 | |
| That actually is a big deal. | |
| Uh, so I was on whisper and I've | |
| 187 | |
| 00:12:24,090 --> 00:12:28,170 | |
| been using whisper and I kind of | |
| early on found a couple of tools. | |
| 188 | |
| 00:12:28,210 --> 00:12:30,930 | |
| I couldn't find what I was | |
| looking for on Linux, which is, | |
| 189 | |
| 00:12:31,370 --> 00:12:35,770 | |
| um, basically just something | |
| that'll run in the background. | |
| 190 | |
| 00:12:35,810 --> 00:12:40,130 | |
| You'll give it an API key and it | |
| will just transcribe. Um. | |
| 191 | |
| 00:12:41,280 --> 00:12:44,000 | |
| with, like, a little key to | |
| start and stop the dictation. | |
| 192 | |
| 00:12:44,600 --> 00:12:49,040 | |
| Uh, and the issues were I discovered | |
| that, like most people involved in | |
| 193 | |
| 00:12:49,040 --> 00:12:53,920 | |
| creating these projects were very | |
| much focused on local models running | |
| 194 | |
| 00:12:53,920 --> 00:12:57,400 | |
| whisper locally, because you can. | |
| And I tried that a bunch of | |
| 195 | |
| 00:12:57,400 --> 00:13:00,840 | |
| times and just never got results | |
| that were as good as the cloud. | |
| 196 | |
| 00:13:01,160 --> 00:13:04,640 | |
| And when I began looking at the | |
| cost of the speech to text APIs | |
| 197 | |
| 00:13:04,640 --> 00:13:08,520 | |
| and what I was spending, | |
| I just thought there's it's actually, | |
| 198 | |
| 00:13:08,720 --> 00:13:13,200 | |
| in my opinion, just one of the better | |
| deals in API spending and in cloud. | |
| 199 | |
| 00:13:13,240 --> 00:13:17,280 | |
| Like it's just not that expensive | |
| for very, very good models that are | |
| 200 | |
| 00:13:17,400 --> 00:13:20,840 | |
| much more, you know, you're going | |
| to be able to run the full model, | |
| 201 | |
| 00:13:21,360 --> 00:13:25,960 | |
| the latest model versus whatever | |
| you can run on your average GPU. | |
| 202 | |
| 00:13:26,000 --> 00:13:29,760 | |
| Unless you want to buy a crazy GPU. | |
| It doesn't really make sense to me. | |
| 203 | |
| 00:13:29,760 --> 00:13:33,480 | |
| Now, privacy is another concern. | |
| Um, that I know is kind of like a | |
| 204 | |
| 00:13:33,520 --> 00:13:36,920 | |
| very much a separate thing that | |
| people just don't want their voice, | |
| 205 | |
| 00:13:36,920 --> 00:13:39,790 | |
| data, and their voice leaving | |
| their local environment, | |
| 206 | |
| 00:13:40,110 --> 00:13:43,830 | |
| maybe for regulatory reasons as well. | |
| Um, but I'm not in that. | |
| 207 | |
| 00:13:43,910 --> 00:13:47,910 | |
| Um, I'm neither really care about | |
| people listening to my, uh, | |
| 208 | |
| 00:13:47,950 --> 00:13:51,190 | |
| grocery list consisting of, uh, | |
| reminding myself that I need to | |
| 209 | |
| 00:13:51,230 --> 00:13:54,790 | |
| buy more beer, Cheetos and hummus, | |
| which is kind of the three, | |
| 210 | |
| 00:13:54,990 --> 00:13:59,310 | |
| three staples of my diet. | |
| Um, during periods of poor nutrition. | |
| 211 | |
| 00:13:59,590 --> 00:14:03,310 | |
| Uh, but the kind of stuff that I | |
| transcribe, it's just not it's not a, | |
| 212 | |
| 00:14:03,990 --> 00:14:09,350 | |
| it's not a privacy thing and that | |
| sort of sensitive about and, uh, | |
| 213 | |
| 00:14:09,350 --> 00:14:13,070 | |
| I don't do anything so, | |
| you know, sensitive or secure, | |
| 214 | |
| 00:14:13,070 --> 00:14:16,590 | |
| that requires air gapping. | |
| So, um, I looked at the pricing and | |
| 215 | |
| 00:14:16,590 --> 00:14:20,270 | |
| especially the kind of older models, | |
| mini, um, some of them are very, | |
| 216 | |
| 00:14:20,270 --> 00:14:23,110 | |
| very affordable. | |
| And I did a back of the I did a | |
| 217 | |
| 00:14:23,110 --> 00:14:27,150 | |
| calculation once with ChatGPT | |
| and I was like, okay, this is a, | |
| 218 | |
| 00:14:27,150 --> 00:14:31,070 | |
| this is the API price for I can't | |
| remember whatever the model was. | |
| 219 | |
| 00:14:31,550 --> 00:14:33,910 | |
| Uh, let's say I just go at it | |
| like nonstop, | |
| 220 | |
| 00:14:34,030 --> 00:14:37,410 | |
| which it rarely happens. Probably. | |
| I would say on average, | |
| 221 | |
| 00:14:37,410 --> 00:14:41,890 | |
| I might dictate 30 to 60 minutes per | |
| day if I was probably summing up | |
| 222 | |
| 00:14:41,890 --> 00:14:48,490 | |
| the emails, documents, outlines, | |
| um, which is a lot, but it's it's | |
| 223 | |
| 00:14:48,490 --> 00:14:50,730 | |
| still a fairly modest amount. | |
| And I was like, well, | |
| 224 | |
| 00:14:50,770 --> 00:14:53,930 | |
| some days I do go on like 1 or 2 | |
| days where I've been. | |
| 225 | |
| 00:14:54,450 --> 00:14:58,450 | |
| Usually when I'm like kind of out of | |
| the house and just have something | |
| 226 | |
| 00:14:59,090 --> 00:15:02,250 | |
| like, I have nothing else to do. | |
| Like if I'm at a hospital with a | |
| 227 | |
| 00:15:02,250 --> 00:15:06,970 | |
| newborn, uh, and you're waiting | |
| for like eight hours and hours | |
| 228 | |
| 00:15:06,970 --> 00:15:10,210 | |
| for an appointment, and I would | |
| probably have listened to podcasts | |
| 229 | |
| 00:15:10,490 --> 00:15:14,010 | |
| before becoming a speech fanatic. | |
| And I'm like, oh, wait, | |
| 230 | |
| 00:15:14,050 --> 00:15:16,370 | |
| let me just get down. | |
| Let me just get these ideas out | |
| 231 | |
| 00:15:16,410 --> 00:15:19,610 | |
| of my head. | |
| And that's when I'll go on my | |
| 232 | |
| 00:15:19,650 --> 00:15:21,530 | |
| speech binges. | |
| But those are like once every | |
| 233 | |
| 00:15:21,530 --> 00:15:24,970 | |
| few months, like not frequently. | |
| But I said, okay, let's just say | |
| 234 | |
| 00:15:24,970 --> 00:15:30,650 | |
| if I'm gonna price out. | |
| Cloud asked if I was like, dedicated | |
| 235 | |
| 00:15:30,650 --> 00:15:36,880 | |
| every second of every waking hour to | |
| transcribing for some odd reason. Um. | |
| 236 | |
| 00:15:37,200 --> 00:15:39,680 | |
| I mean, it'd have to, like, | |
| eat and use the toilet and, | |
| 237 | |
| 00:15:39,720 --> 00:15:42,520 | |
| like, you know, there's only so | |
| many hours I'm awake for. | |
| 238 | |
| 00:15:42,520 --> 00:15:44,680 | |
| So, like, | |
| let's just say a maximum of, like, | |
| 239 | |
| 00:15:44,720 --> 00:15:48,680 | |
| 40 hours, 45 minutes in the hour. | |
| Then I said, all right, | |
| 240 | |
| 00:15:48,680 --> 00:15:52,600 | |
| let's just say 50. Who knows? | |
| You're dictating on the toilet. | |
| 241 | |
| 00:15:52,640 --> 00:15:53,880 | |
| We do it. | |
| Uh, | |
| 242 | |
| 00:15:53,880 --> 00:15:58,720 | |
| so it could be you could just do 60. | |
| But whatever I did, and every day, | |
| 243 | |
| 00:15:58,760 --> 00:16:02,440 | |
| like, you're going flat out seven | |
| days a week dictating non-stop. | |
| 244 | |
| 00:16:02,480 --> 00:16:06,440 | |
| I was like, what's my monthly API | |
| bill going to be at this price? | |
| 245 | |
| 00:16:06,720 --> 00:16:09,120 | |
| And it came out to like 70 or 80 | |
| bucks. | |
| 246 | |
| 00:16:09,120 --> 00:16:14,080 | |
| And I was like, well, that would be | |
| an extraordinary amount of dictation. | |
| 247 | |
| 00:16:14,080 --> 00:16:17,840 | |
| And I would hope that there was | |
| some compelling reason, | |
| 248 | |
| 00:16:18,040 --> 00:16:22,200 | |
| more worth more than $70, | |
| that I embarked upon that project. | |
| 249 | |
| 00:16:22,400 --> 00:16:25,200 | |
| Uh, so given that that's kind of the | |
| max point for me, I said, that's | |
| 250 | |
| 00:16:25,240 --> 00:16:29,000 | |
| actually very, very affordable. | |
| Um, now you're gonna if you want | |
| 251 | |
| 00:16:29,040 --> 00:16:34,080 | |
| to spec out the costs and you want | |
| to do the post-processing that I | |
| 252 | |
| 00:16:34,150 --> 00:16:37,110 | |
| really do feel is valuable. | |
| Um, that's going to cost some more as | |
| 253 | |
| 00:16:37,110 --> 00:16:43,110 | |
| well, unless you're using Gemini, | |
| which, uh, needless to say, is a | |
| 254 | |
| 00:16:43,110 --> 00:16:46,950 | |
| random person sitting in Jerusalem. | |
| Uh, I have no affiliation, | |
| 255 | |
| 00:16:46,950 --> 00:16:51,350 | |
| nor with Google, nor anthropic, | |
| nor Gemini, nor any major tech vendor | |
| 256 | |
| 00:16:51,350 --> 00:16:56,790 | |
| for that matter. Um, I like Gemini. | |
| Not so much as a everyday model. | |
| 257 | |
| 00:16:56,870 --> 00:16:59,830 | |
| Um, it's kind of underwhelmed in | |
| that respect, I would say. | |
| 258 | |
| 00:17:00,230 --> 00:17:03,030 | |
| But for multimodal, | |
| I think it's got a lot to offer. | |
| 259 | |
| 00:17:03,310 --> 00:17:06,870 | |
| And I think that the transcribing | |
| functionality whereby it can, | |
| 260 | |
| 00:17:07,270 --> 00:17:12,150 | |
| um, process audio with a system | |
| prompt and both give you | |
| 261 | |
| 00:17:12,190 --> 00:17:15,390 | |
| transcription that's cleaned up, | |
| that reduces two steps to one. | |
| 262 | |
| 00:17:15,710 --> 00:17:18,630 | |
| And that for me is a very, | |
| very big deal. | |
| 263 | |
| 00:17:18,630 --> 00:17:22,990 | |
| And, uh, I feel like even Google | |
| has haven't really sort of thought | |
| 264 | |
| 00:17:22,990 --> 00:17:27,430 | |
| through how useful the that | |
| modality is and what kind of use | |
| 265 | |
| 00:17:27,430 --> 00:17:30,790 | |
| cases you can achieve with it. | |
| Because I found in the course of | |
| 266 | |
| 00:17:30,790 --> 00:17:36,490 | |
| this year just an endless list | |
| of really kind of system prompt, | |
| 267 | |
| 00:17:36,730 --> 00:17:41,290 | |
| system prompt stuff that I can say, | |
| okay, I've used it to capture context | |
| 268 | |
| 00:17:41,290 --> 00:17:45,570 | |
| data for AI, which is literally I | |
| might speak for if I wanted to have a | |
| 269 | |
| 00:17:45,570 --> 00:17:49,730 | |
| good bank of context data about, | |
| who knows, my childhood. | |
| 270 | |
| 00:17:50,010 --> 00:17:53,450 | |
| Uh, more realistically, | |
| maybe my career goals, uh, | |
| 271 | |
| 00:17:53,450 --> 00:17:56,010 | |
| something that would just be, | |
| like, really boring to type out. | |
| 272 | |
| 00:17:56,130 --> 00:18:01,130 | |
| So I'll just, like, sit in my car | |
| and record it for ten minutes. | |
| 273 | |
| 00:18:01,130 --> 00:18:04,090 | |
| And that ten minutes, | |
| you get a lot of information in, | |
| 274 | |
| 00:18:04,530 --> 00:18:10,090 | |
| um, emails, which is short text. | |
| Um, just there is a whole bunch. | |
| 275 | |
| 00:18:10,090 --> 00:18:13,570 | |
| And all these workflows kind of | |
| require a little bit of treatment | |
| 276 | |
| 00:18:13,570 --> 00:18:17,490 | |
| afterwards and different treatment. | |
| My context pipeline is kind of like | |
| 277 | |
| 00:18:17,490 --> 00:18:21,210 | |
| just extract the bare essentials. | |
| So you end up with me talking very | |
| 278 | |
| 00:18:21,210 --> 00:18:24,250 | |
| loosely about sort of what I've done | |
| in my career, where I've worked, | |
| 279 | |
| 00:18:24,250 --> 00:18:27,610 | |
| where I might like to work, | |
| and it goes it condenses that | |
| 280 | |
| 00:18:27,610 --> 00:18:31,600 | |
| down to very robotic language | |
| that is easy to chunk, parse, | |
| 281 | |
| 00:18:31,600 --> 00:18:35,960 | |
| and maybe put into a vector database. | |
| Daniel has worked in technology, | |
| 282 | |
| 00:18:36,000 --> 00:18:39,640 | |
| Daniel is a has been working in, | |
| you know, stuff like that. | |
| 283 | |
| 00:18:39,640 --> 00:18:43,600 | |
| That's not how you would speak. | |
| Um, but I figure it's probably easier | |
| 284 | |
| 00:18:43,600 --> 00:18:48,120 | |
| to parse for, after all, robots. | |
| So we've almost got to 20 minutes. | |
| 285 | |
| 00:18:48,120 --> 00:18:52,640 | |
| And this is actually a success | |
| because I wasted 20 minutes of my, | |
| 286 | |
| 00:18:52,800 --> 00:18:56,880 | |
| uh, of the evening speaking into | |
| a microphone, and, uh, | |
| 287 | |
| 00:18:56,920 --> 00:19:00,840 | |
| the levels were shot and, uh, it, | |
| uh, it was clipping and I said, | |
| 288 | |
| 00:19:00,840 --> 00:19:03,200 | |
| I can't really do an evaluation. | |
| I have to be fair. | |
| 289 | |
| 00:19:03,200 --> 00:19:07,000 | |
| I have to give the models a | |
| chance to do their thing. | |
| 290 | |
| 00:19:07,520 --> 00:19:09,360 | |
| Uh, | |
| what am I hoping to achieve in this? | |
| 291 | |
| 00:19:09,400 --> 00:19:12,600 | |
| Okay, my fine tune was a dud, | |
| as mentioned Deepgram SVT. | |
| 292 | |
| 00:19:12,640 --> 00:19:15,520 | |
| I'm really, really hopeful that | |
| this prototype will work. | |
| 293 | |
| 00:19:15,800 --> 00:19:18,960 | |
| And it's a built in public open | |
| source, so anyone is welcome to | |
| 294 | |
| 00:19:19,000 --> 00:19:22,920 | |
| use it if I make anything good. | |
| Um, but that was really exciting for | |
| 295 | |
| 00:19:22,920 --> 00:19:27,400 | |
| me last night when after hours of, | |
| um, trying my own prototype, | |
| 296 | |
| 00:19:27,400 --> 00:19:31,230 | |
| seeing someone just made | |
| something that works like that. | |
| 297 | |
| 00:19:31,270 --> 00:19:32,670 | |
| You know, | |
| you're not going to have to build a | |
| 298 | |
| 00:19:32,670 --> 00:19:38,230 | |
| custom conda environment and image. | |
| I have AMD GPU, which makes | |
| 299 | |
| 00:19:38,230 --> 00:19:42,310 | |
| things much more complicated. | |
| I didn't find it and I was about | |
| 300 | |
| 00:19:42,310 --> 00:19:43,990 | |
| to give up and I said, | |
| all right, let me just give deep | |
| 301 | |
| 00:19:43,990 --> 00:19:48,750 | |
| grams Linux thing a shot. | |
| And if this doesn't work, um, | |
| 302 | |
| 00:19:48,750 --> 00:19:51,150 | |
| I'm just going to go back to | |
| trying to code something myself. | |
| 303 | |
| 00:19:51,510 --> 00:19:56,190 | |
| And when I ran the script, | |
| I was using cloud code to do the | |
| 304 | |
| 00:19:56,190 --> 00:20:00,030 | |
| installation process. | |
| It ran the script and oh my gosh, | |
| 305 | |
| 00:20:00,070 --> 00:20:05,350 | |
| it works just like that. | |
| Uh, the tricky thing for all those | |
| 306 | |
| 00:20:05,350 --> 00:20:10,310 | |
| who wants to know all the nitty | |
| gritty, nitty gritty details, um, was | |
| 307 | |
| 00:20:10,310 --> 00:20:13,750 | |
| that I don't think it was actually | |
| struggling with transcription, but | |
| 308 | |
| 00:20:13,750 --> 00:20:18,550 | |
| pasting Wayland makes life very hard, | |
| and I think there was something not | |
| 309 | |
| 00:20:18,550 --> 00:20:21,870 | |
| running in the right time anyway. | |
| Deepgram I looked at how they | |
| 310 | |
| 00:20:21,870 --> 00:20:24,710 | |
| actually handle that because it | |
| worked out of the box when other | |
| 311 | |
| 00:20:24,710 --> 00:20:29,140 | |
| stuff didn't, and it was quite a | |
| clever little mechanism, | |
| 312 | |
| 00:20:29,460 --> 00:20:32,100 | |
| and but more so than that, | |
| the accuracy was brilliant. | |
| 313 | |
| 00:20:32,140 --> 00:20:35,020 | |
| Now, what am I doing here? | |
| This is going to be a 20 minute | |
| 314 | |
| 00:20:35,260 --> 00:20:42,980 | |
| audio sample, and I'm I think | |
| I've done 1 or 2 of these before, | |
| 315 | |
| 00:20:42,980 --> 00:20:49,180 | |
| but I did it with short, snappy voice | |
| notes. This is kind of long form. | |
| 316 | |
| 00:20:49,460 --> 00:20:51,740 | |
| This actually might be a better | |
| approximation for what's useful | |
| 317 | |
| 00:20:51,740 --> 00:20:56,100 | |
| to me than voice memos. | |
| Like I need to buy three liters | |
| 318 | |
| 00:20:56,100 --> 00:20:59,180 | |
| of milk tomorrow, and pita bread, | |
| which is probably how like half | |
| 319 | |
| 00:20:59,180 --> 00:21:02,820 | |
| my voice voice notes sound like | |
| if anyone were to, I don't know, | |
| 320 | |
| 00:21:02,860 --> 00:21:04,580 | |
| like find my phone, | |
| they'd be like, this is the most | |
| 321 | |
| 00:21:04,580 --> 00:21:07,420 | |
| boring person in the world. | |
| Although actually there are some | |
| 322 | |
| 00:21:07,460 --> 00:21:09,700 | |
| like kind of, uh, | |
| journaling thoughts as well. | |
| 323 | |
| 00:21:09,700 --> 00:21:13,700 | |
| But it's a lot of content like that. | |
| And the probably for the evaluation, | |
| 324 | |
| 00:21:13,700 --> 00:21:20,660 | |
| the most useful thing is slightly | |
| obscure tech GitHub uh, hugging face | |
| 325 | |
| 00:21:21,180 --> 00:21:24,660 | |
| not so obscure that it's not going | |
| to have a chance of knowing it, | |
| 326 | |
| 00:21:24,660 --> 00:21:27,640 | |
| but hopefully sufficiently well | |
| known that the model should get it. | |
| 327 | |
| 00:21:28,200 --> 00:21:30,760 | |
| I tried to do a little bit of | |
| speaking really fast and | |
| 328 | |
| 00:21:30,760 --> 00:21:33,200 | |
| speaking very slowly. | |
| I would say in general, | |
| 329 | |
| 00:21:33,200 --> 00:21:36,880 | |
| I've spoken, delivered this at a | |
| faster pace than I usually would | |
| 330 | |
| 00:21:36,920 --> 00:21:40,280 | |
| owing to strong coffee flowing | |
| through my bloodstream. | |
| 331 | |
| 00:21:40,920 --> 00:21:44,200 | |
| And the thing that I'm not going | |
| to get in this benchmark is | |
| 332 | |
| 00:21:44,200 --> 00:21:46,880 | |
| background noise, which in my first | |
| take that I had to get rid of, | |
| 333 | |
| 00:21:47,680 --> 00:21:51,240 | |
| my wife came in with my son and | |
| for a good night kiss. | |
| 334 | |
| 00:21:51,440 --> 00:21:55,120 | |
| And that actually would have | |
| been super helpful to get in | |
| 335 | |
| 00:21:55,120 --> 00:21:59,760 | |
| because it was not diarised. | |
| Or if we had diarisation a female, | |
| 336 | |
| 00:21:59,880 --> 00:22:02,280 | |
| I could say I want the male | |
| voice and that wasn't intended | |
| 337 | |
| 00:22:02,280 --> 00:22:05,280 | |
| for transcription. | |
| Um, and we're not going to get | |
| 338 | |
| 00:22:05,280 --> 00:22:06,960 | |
| background noise like people | |
| honking their horns, | |
| 339 | |
| 00:22:06,960 --> 00:22:11,280 | |
| which is something I've done in my | |
| main data set where I am trying to | |
| 340 | |
| 00:22:11,440 --> 00:22:15,520 | |
| go back to some of my voice notes, | |
| annotate them, and run a benchmark. | |
| 341 | |
| 00:22:15,520 --> 00:22:18,960 | |
| But this is going to be just a | |
| pure quick test. | |
| 342 | |
| 00:22:19,440 --> 00:22:23,880 | |
| And as someone I'm working on a | |
| voice note idea, | |
| 343 | |
| 00:22:23,880 --> 00:22:28,230 | |
| that's my sort of end motivation. | |
| Besides thinking it's an | |
| 344 | |
| 00:22:28,230 --> 00:22:31,590 | |
| absolutely outstanding technology | |
| that's coming to viability. | |
| 345 | |
| 00:22:31,590 --> 00:22:34,670 | |
| And really, I know this sounds | |
| cheesy can actually have a very | |
| 346 | |
| 00:22:34,670 --> 00:22:38,830 | |
| transformative effect. | |
| Um, it's, you know, voice technology | |
| 347 | |
| 00:22:38,870 --> 00:22:44,910 | |
| has been life changing for, uh, | |
| folks living with, um, disabilities. | |
| 348 | |
| 00:22:45,630 --> 00:22:48,550 | |
| And I think there's something | |
| really nice about the fact that | |
| 349 | |
| 00:22:48,550 --> 00:22:52,710 | |
| it can also benefit, you know, | |
| folks who are able bodied and like, | |
| 350 | |
| 00:22:52,750 --> 00:22:58,950 | |
| we can all in different ways, um, | |
| make this tech as useful as possible, | |
| 351 | |
| 00:22:58,990 --> 00:23:01,110 | |
| regardless of the exact way that | |
| we're using it. | |
| 352 | |
| 00:23:01,510 --> 00:23:04,710 | |
| Um, and I think there's something | |
| very powerful in that, and it can be | |
| 353 | |
| 00:23:04,710 --> 00:23:08,910 | |
| very cool. Um, I see use potential. | |
| What excites me about voice tech? | |
| 354 | |
| 00:23:09,750 --> 00:23:13,550 | |
| A lot of things, actually. | |
| Firstly, the fact that it's cheap | |
| 355 | |
| 00:23:13,550 --> 00:23:17,110 | |
| and accurate, as I mentioned at | |
| the very start of this, um, | |
| 356 | |
| 00:23:17,110 --> 00:23:20,790 | |
| and it's getting better and better | |
| with stuff like accent handling, um, | |
| 357 | |
| 00:23:20,790 --> 00:23:24,180 | |
| I'm not sure my, my fine tune will | |
| actually ever come to fruition in the | |
| 358 | |
| 00:23:24,180 --> 00:23:27,860 | |
| sense that I'll use it day to day, | |
| as I imagine I get like superb, | |
| 359 | |
| 00:23:27,860 --> 00:23:33,540 | |
| flawless word error rates because I'm | |
| just kind of skeptical about local | |
| 360 | |
| 00:23:33,540 --> 00:23:38,100 | |
| speech to texts, as I mentioned. | |
| And I think the pace of innovation | |
| 361 | |
| 00:23:38,100 --> 00:23:42,060 | |
| and improvement in the models, | |
| the main reasons for fine tuning from | |
| 362 | |
| 00:23:42,060 --> 00:23:46,340 | |
| what I've seen have been people who | |
| are something that really blows, | |
| 363 | |
| 00:23:46,380 --> 00:23:52,940 | |
| blows my mind about ASR is the idea | |
| that it's inherently a lingual | |
| 364 | |
| 00:23:52,940 --> 00:23:59,100 | |
| or multilingual phonetic based. | |
| So as folks who use speak very | |
| 365 | |
| 00:23:59,140 --> 00:24:02,220 | |
| obscure languages that there may | |
| be there might be a paucity of | |
| 366 | |
| 00:24:02,220 --> 00:24:05,500 | |
| training data or almost none at all, | |
| and therefore the accuracy is | |
| 367 | |
| 00:24:05,500 --> 00:24:10,660 | |
| significantly reduced or folks | |
| in very critical environments. | |
| 368 | |
| 00:24:10,700 --> 00:24:13,380 | |
| I know there are. | |
| This is used extensively in medical | |
| 369 | |
| 00:24:13,380 --> 00:24:18,140 | |
| transcription and dispatcher work as, | |
| um, you know, the call centers who | |
| 370 | |
| 00:24:18,140 --> 00:24:22,490 | |
| send out ambulances, etc., where | |
| accuracy is absolutely paramount. | |
| 371 | |
| 00:24:22,490 --> 00:24:26,050 | |
| And in the case of doctors, | |
| radiologists, they might be using | |
| 372 | |
| 00:24:26,050 --> 00:24:29,610 | |
| very specialized vocab all the time. | |
| So those are kind of the main | |
| 373 | |
| 00:24:29,610 --> 00:24:31,530 | |
| two things. | |
| And I'm not sure that really just for | |
| 374 | |
| 00:24:31,530 --> 00:24:37,290 | |
| trying to make it better on a few | |
| random tech words with my slightly. | |
| 375 | |
| 00:24:37,330 --> 00:24:41,250 | |
| I mean, I have an accent, but like, | |
| not, you know, an accent that a few | |
| 376 | |
| 00:24:41,290 --> 00:24:47,210 | |
| other million people have. Ish. | |
| I'm not sure that my little fine | |
| 377 | |
| 00:24:47,210 --> 00:24:52,250 | |
| tune is going to actually like the | |
| bump in word error rate reduction. | |
| 378 | |
| 00:24:52,250 --> 00:24:54,570 | |
| If I ever actually figure out how | |
| to do it and get it up to the | |
| 379 | |
| 00:24:54,570 --> 00:24:58,610 | |
| cloud by the time I've done that. | |
| I suspect that the next | |
| 380 | |
| 00:24:58,610 --> 00:25:01,410 | |
| generation of ASR will just be | |
| so good that it will kind of be. | |
| 381 | |
| 00:25:01,930 --> 00:25:03,770 | |
| Ah, well, | |
| that would be cool if it worked out, | |
| 382 | |
| 00:25:03,770 --> 00:25:08,730 | |
| but I'll just use this instead. | |
| So that's going to be it for today's | |
| 383 | |
| 00:25:08,730 --> 00:25:14,130 | |
| episode of, uh, voice training data. | |
| Single long shot evaluation. | |
| 384 | |
| 00:25:14,410 --> 00:25:17,330 | |
| Who am I going to compare? | |
| Whisper is always good as a | |
| 385 | |
| 00:25:17,330 --> 00:25:20,600 | |
| benchmark, but I'm more | |
| interested in seeing Whisperer | |
| 386 | |
| 00:25:20,600 --> 00:25:25,080 | |
| head to head with two things, | |
| really. One is whisper variance. | |
| 387 | |
| 00:25:25,080 --> 00:25:29,880 | |
| So you've got these projects like | |
| faster Whisper, Still whisper. | |
| 388 | |
| 00:25:29,880 --> 00:25:31,640 | |
| It's a bit confusing. | |
| There's a whole bunch of them | |
| 389 | |
| 00:25:31,920 --> 00:25:34,800 | |
| and the emerging acers, | |
| which are also a thing. | |
| 390 | |
| 00:25:35,200 --> 00:25:37,680 | |
| My intention for this is I'm not | |
| sure I'm going to have the time | |
| 391 | |
| 00:25:37,680 --> 00:25:41,640 | |
| in any point in the foreseeable | |
| future to go back through this whole | |
| 392 | |
| 00:25:41,640 --> 00:25:46,560 | |
| episode and create a proper source, | |
| truth or a fix. | |
| 393 | |
| 00:25:47,320 --> 00:25:51,680 | |
| Everything might do it if I can | |
| get one transcription that | |
| 394 | |
| 00:25:51,680 --> 00:25:56,720 | |
| sufficiently close to perfection. | |
| But what I would actually love | |
| 395 | |
| 00:25:56,720 --> 00:25:59,800 | |
| to do on Hugging Face I think | |
| would be a great. | |
| 396 | |
| 00:25:59,800 --> 00:26:03,560 | |
| Probably how I might visualize this | |
| is having the audio waveform play, | |
| 397 | |
| 00:26:04,040 --> 00:26:09,800 | |
| and then have the transcript for each | |
| model below it, and maybe even a, | |
| 398 | |
| 00:26:10,480 --> 00:26:15,120 | |
| um, like, you know, two scale and | |
| maybe even a local one as well, | |
| 399 | |
| 00:26:15,160 --> 00:26:21,700 | |
| like local whisper versus open | |
| AI API, Etc. and, um, I can then | |
| 400 | |
| 00:26:21,700 --> 00:26:24,380 | |
| actually listen back to segments | |
| or anyone who wants to can listen | |
| 401 | |
| 00:26:24,380 --> 00:26:29,420 | |
| back to segments of this recording | |
| and see where a particular model | |
| 402 | |
| 00:26:29,460 --> 00:26:32,940 | |
| struggled and others didn't, as well | |
| as the sort of headline finding | |
| 403 | |
| 00:26:32,980 --> 00:26:36,780 | |
| of which had the best, uh, wer. | |
| But that would require the source | |
| 404 | |
| 00:26:36,780 --> 00:26:40,020 | |
| of truth. Okay. That's it. | |
| Hope this was, I don't know, | |
| 405 | |
| 00:26:40,180 --> 00:26:43,460 | |
| maybe useful for other folks | |
| interested in stuff you want to see. | |
| 406 | |
| 00:26:43,940 --> 00:26:48,100 | |
| I always feel think I've just said | |
| something I didn't intend to say. | |
| 407 | |
| 00:26:48,660 --> 00:26:51,020 | |
| I said for those, listen carefully. | |
| Including, hopefully, | |
| 408 | |
| 00:26:51,020 --> 00:26:54,060 | |
| the models themselves. | |
| This has been myself, | |
| 409 | |
| 00:26:54,100 --> 00:26:57,900 | |
| Daniel Rosehill, for more, um, | |
| jumbled repositories about my, | |
| 410 | |
| 00:26:57,940 --> 00:27:00,820 | |
| uh, roving interest in AI, | |
| but particularly Agentic, | |
| 411 | |
| 00:27:01,180 --> 00:27:05,340 | |
| MCP and voice tech. | |
| Uh, you can find me on GitHub. | |
| 412 | |
| 00:27:05,820 --> 00:27:11,140 | |
| Hugging face. Where else? | |
| Daniel, which is my personal website, | |
| 413 | |
| 00:27:11,140 --> 00:27:15,260 | |
| as well as this podcast whose | |
| name I sadly cannot remember. | |
| 414 | |
| 00:27:15,700 --> 00:27:17,420 | |
| Until next time. | |
| Thanks for listening. | |