| window.TRANSCRIPTS = {"truth": "\ufeff1\n00:00:00,000 --> 00:00:08,640\nHello and welcome to a audio dataset consisting of one single episode of a non-existent podcast.\n\n2\n00:00:08,640 --> 00:00:19,120\nOr, it eh, I may append this to a podcast that I set up recently regarding my with my thoughts on speech\n\n3\n00:00:19,120 --> 00:00:28,720\ntech and AI in particular. More AI and generative AI I would, I would say. But in any event, the purpose of this\n\n4\n00:00:30,080 --> 00:00:37,120\nvoice recording is actually to create a lengthy voice sample for a quick evaluation, a back of the\n\n5\n00:00:37,120 --> 00:00:42,320\nenvelope evaluation as they might say for different speech to text models. And I'm doing this because I\n\n6\n00:00:42,800 --> 00:00:48,560\nI thought I'd made a great breakthrough in my journey with speech tech. And that was succeeding in\n\n7\n00:00:48,560 --> 00:00:55,120\nthe elusive task of fine-tuning Whisper. Whisper is, and I'm going to just talk, I'm trying to\n\n8\n00:00:55,760 --> 00:01:01,600\nmix up, I'm going to try a few different styles of speaking. I might whisper something at some\n\n9\n00:01:01,600 --> 00:01:07,760\npoints as well. And I'll go back to speaking loud in different parts. I'm going to sound really\n\n10\n00:01:07,760 --> 00:01:15,200\nlike a crazy person because I'm also going to try to speak at different pitches and cadences\n\n11\n00:01:15,200 --> 00:01:21,600\nin order to really try to put a speech to text model through its paces, which is trying to make\n\n12\n00:01:21,600 --> 00:01:30,320\nsense of \"is this guy just rambling on incoherently in one long sentence?\" Or \"are these just actually\n\n13\n00:01:30,320 --> 00:01:38,320\na series of step standalone stepalone standalone sentences?\" And how is it going to handle stepalone?! That's not a\n\n14\n00:01:38,320 --> 00:01:43,919\nword! 
What happens when you use speech to text and you use a fake word and then you're like, wait,\n\n15\n00:01:43,919 --> 00:01:51,520\nthat's not actually, that word doesn't exist. How does AI handle that? And these and more are all the\n\n16\n00:01:52,880 --> 00:01:57,359\nquestions that I'm seeking to answer in this training data. Now, why did why was I trying to\n\n17\n00:01:57,360 --> 00:02:01,040\nfine tune whisper? And what is Whisper? As I said, I'm going to try to\n\n18\n00:02:02,080 --> 00:02:04,240\nrecord this at a couple of different levels of\n\n19\n00:02:04,880 --> 00:02:10,320\ntechnicality - for folks who are in the normal world and not totally\n\n20\n00:02:11,360 --> 00:02:16,079\nstuck down the rabbit hole of AI. Which I have to say is a really wonderful rabbit hole to be\n\n21\n00:02:16,720 --> 00:02:23,440\nto be down. It's a really interesting area. And speech and voice tech is the aspect of it that\n\n22\n00:02:23,440 --> 00:02:28,880\nI find actually most - I'm not sure I would say the most interesting because there's just so much\n\n23\n00:02:28,880 --> 00:02:34,560\nthat is fascinating in AI. But the most that I find the most personally transformative in terms of\n\n24\n00:02:34,560 --> 00:02:42,240\nthe impact that it's had on my daily work life and productivity and how I sort of work. And\n\n25\n00:02:42,960 --> 00:02:49,920\nI am persevering hard with the task of trying to get a good solution working for Linux.\n\n26\n00:02:49,920 --> 00:02:53,440\nWhich if anyone actually does listen to this not just for the training data and for the\n\n27\n00:02:53,440 --> 00:03:00,399\nactual content, this has sparked. I had, besides the fine tune not working, well that was\n\n28\n00:03:00,399 --> 00:03:07,679\nthe failure. I used Claude Code. 
Because one thinks these days that there is nothing\n\n29\n00:03:08,560 --> 00:03:16,799\nshort of solving, you know, the reason of life or something that Claude and\n\n30\n00:03:16,800 --> 00:03:22,720\nagentic AI can't do. Which is not really the case. It does seem that way sometimes. But it\n\n31\n00:03:22,720 --> 00:03:28,080\nfails a lot as well. And this is one of those instances where last week I put together an hour\n\n32\n00:03:28,080 --> 00:03:33,600\nof voice training data: basically speaking just random things for three minutes. And\n\n33\n00:03:35,600 --> 00:03:40,160\nit was actually kind of tedious because the texts were really weird. Some of them were it was like,\n\n34\n00:03:40,160 --> 00:03:45,440\nit was AI generated. I tried before to read Sherlock Holmes for an hour and I just couldn't,\n\n35\n00:03:45,440 --> 00:03:51,120\nI was so bored after 10 minutes that I was like, \"okay, no, I'm just gonna have to find something\n\n36\n00:03:51,120 --> 00:03:59,920\nelse to read.\" So I used I created with AI Studio, vibe coded, a synthetic text generator,\n\n37\n00:04:00,800 --> 00:04:05,680\nwhich actually I thought was probably a better way of doing it because it would give me more\n\n38\n00:04:05,680 --> 00:04:12,000\nshort samples with more varied content. So I was like, okay, give me a voice note. Like I'm\n\n39\n00:04:12,000 --> 00:04:18,800\nrecording an email. Give me a short story to read. Give me prose. So I came up with all\n\n40\n00:04:18,800 --> 00:04:24,240\nthese different things and I added a little timer to it so I could see how close I was to one\n\n41\n00:04:24,240 --> 00:04:32,480\nhour. And I spent like an hour one afternoon or probably two hours by the time you do retakes\n\n42\n00:04:32,480 --> 00:04:39,120\nand whatever because you want to. 
It gave me a source of truth which I'm not sure if that's the\n\n43\n00:04:39,120 --> 00:04:45,120\nscientific way to approach this topic of gathering training data but I thought made sense.\n\n44\n00:04:46,560 --> 00:04:50,880\nI have a lot of audio data from recording voice notes which I've also kind of used\n\n45\n00:04:52,000 --> 00:04:56,720\nbeen experimenting with using for a different purpose. It's slightly different - annotating\n\n46\n00:04:57,840 --> 00:05:03,680\ntask types. It's more text classification experiment. Or well it's more than that actually\n\n47\n00:05:03,680 --> 00:05:08,880\nI'm working on a voice app. So it's a prototype I guess is really more accurate.\n\n48\n00:05:11,280 --> 00:05:15,920\nBut you can do that and you can work backwards. You listen back to a voice note and you\n\n49\n00:05:17,520 --> 00:05:22,400\npainfully go through one of those - transcribing where you start and stop and scrub around it and\n\n50\n00:05:22,400 --> 00:05:27,680\nyou fix the errors. But it's really really boring to do that. So I thought it would be less tedious\n\n51\n00:05:27,680 --> 00:05:34,240\nin the long term if I just recorded the source of truth. So it gave me these three minute snippets.\n\n52\n00:05:34,240 --> 00:05:40,480\nI recorded them and saved an MP3 and a TXT in the same folder and I created an hour of that data.\n\n53\n00:05:41,840 --> 00:05:47,280\nSo I was very hopeful - quietly, you know, a little bit hopeful - that I would be able that I could actually fine tune\n\n54\n00:05:47,280 --> 00:05:54,720\nWhisper. I want to fine tune Whisper because when I got into voice tech last November my wife was in\n\n55\n00:05:54,720 --> 00:06:01,920\nthe US and I was alone at home. And when crazy people like me do really wild things like use voice\n\n56\n00:06:01,920 --> 00:06:08,320\nto tech technology that was basically when I started doing it. I didn't feel like a crazy person\n\n57\n00:06:08,320 --> 00:06:15,760\nspeaking to myself. 
And my expectations weren't that high. I used speech tech now and again\n\n58\n00:06:16,960 --> 00:06:21,200\ntried it out. I was like \"it'd be really cool if you could just like speak into your computer.\" And\n\n59\n00:06:21,280 --> 00:06:28,479\nwhatever I tried out that had Linux support was just - it was not good, basically. And this blew me away\n\n60\n00:06:28,479 --> 00:06:34,400\nfrom the first go. I mean it wasn't 100% accurate out of the box. And it took work. But it was good\n\n61\n00:06:34,400 --> 00:06:40,320\nenough that there was a solid foundation. And it kind of passed that pivot point that it's actually\n\n62\n00:06:40,320 --> 00:06:46,320\nworth doing this. You know, there's a point where it's. So like the transcript is you don't have to get 100%\n\n63\n00:06:46,400 --> 00:06:51,040\naccuracy for it to be worth your time for speech to text to be a worthwhile addition to your\n\n64\n00:06:51,040 --> 00:06:58,320\nproductivity. But you do need to get above let's say I don't know 85%. If it's 60% or 50% you inevitably\n\n65\n00:06:58,320 --> 00:07:03,920\nsay \"screw it I'll just type it.\"Because you end up missing errors in the transcript and it becomes\n\n66\n00:07:03,920 --> 00:07:07,840\nactually worse. You end up in a worse position than you started with it. That's been my experience.\n\n67\n00:07:08,400 --> 00:07:14,400\nSo I was like \"oh, this is actually really really good. Now how did that happen?\" The answer is\n\n68\n00:07:14,400 --> 00:07:21,599\nASR, Whisper being open-sourced. and the transformer architecture if you want to go back to the\n\n69\n00:07:23,200 --> 00:07:29,440\nto the underpinnings. Which really blows my mind. And it's on my list to read through that paper\n\n70\n00:07:30,239 --> 00:07:38,400\n'All You Need Is Attention' as attentively as can be done with my limited brain. Because it's super\n\n71\n00:07:38,960 --> 00:07:45,679\nhigh-level stuff - super advanced stuff I mean. 
But that I think of all the things that\n\n72\n00:07:47,280 --> 00:07:54,080\nare fascinating about the sudden rise and AI and the dramatic capabilities I find it fascinating\n\n73\n00:07:54,080 --> 00:07:59,599\nthat few people are like \"hang on, you've got this thing that can speak to you like a chatbot - an LLM.\n\n74\n00:08:00,640 --> 00:08:06,799\nThen you've got image generation. Okay, so firstly those two things on the surface have nothing\n\n75\n00:08:06,800 --> 00:08:12,560\nin common. So like, \"how are they ... how did THAT just happen all at the same time?\" And then when you\n\n76\n00:08:12,560 --> 00:08:19,920\nextend that further you're like Suno right. You can sing a song and AI will like come up with\n\n77\n00:08:19,920 --> 00:08:25,200\nan instrumental. And then you've got Whisper. And then you're like \"wait a second how did all this stuff\n\n78\n00:08:25,200 --> 00:08:30,880\nlike if it's all AI what's like, there has to be some commonality. Otherwise these are four these are\n\n79\n00:08:31,600 --> 00:08:38,640\ntotally different technologies on the surface of it and the transformer architecture is as far as\n\n80\n00:08:38,640 --> 00:08:44,720\nI know the answer. And I can't even say I can't even pretend that I really understand what the\n\n81\n00:08:44,720 --> 00:08:51,200\ntransformer architecture means in depth. But I have scanned it. And as I said I want to print it and\n\n82\n00:08:51,200 --> 00:08:57,760\nreally kind of think over it's at some point. And I'll probably feel bad about myself I think!\n\n83\n00:08:57,760 --> 00:09:03,280\nBecause weren't those guys in their in their 20s like? That's crazy! I think I asked ChatGPT\n\n84\n00:09:03,280 --> 00:09:09,439\nonce \"who were the? Who wrote that paper and how old were they when it was published in Arxiv?\"\n\n85\n00:09:09,439 --> 00:09:14,640\nAnd I was expecting like, I don't know. What do you what do you imagine? 
I personally imagine kind of\n\n86\n00:09:14,640 --> 00:09:19,840\nlike you know you have these breakthroughs during COVID and things like that where like these kind\n\n87\n00:09:19,840 --> 00:09:24,480\nof really obscure scientists who are like in their 50s and they've just kind of been laboring in\n\n88\n00:09:24,640 --> 00:09:31,120\nlabs and wearily writing and publishing in kind of obscure academic publications and they\n\n89\n00:09:31,120 --> 00:09:37,200\nfinally like hit a big or win a Nobel Prize. And then they're household household names. So I that\n\n90\n00:09:37,200 --> 00:09:42,680\nwas kind of what I had in mind. That was the mental image I'd formed of the birth of Arxiv.\n\n91\n00:09:42,680 --> 00:09:47,760\nLike, I wasn't expecting 20-somethings in San Francisco! Though I thought that was both very very\n\n92\n00:09:47,760 --> 00:09:54,160\nfunny, very cool, and actually kind of inspiring. It's nice to think that people who you know just\n\n93\n00:09:54,160 --> 00:10:01,439\nyou might put them in the kind of milieu or bubble or world that you are in or credibly in through\n\n94\n00:10:01,439 --> 00:10:06,079\nyou know the series of connections that are coming up with such literally world changing\n\n95\n00:10:06,880 --> 00:10:13,439\ninnovations. So that was I thought anyway that that was cool. Okay voice training data. How\n\n96\n00:10:13,439 --> 00:10:19,280\nare we doing? We're about 10 minutes. And I'm still talking about voice technology! So Whisper was\n\n97\n00:10:19,280 --> 00:10:25,680\nbrilliant. And I was so excited that I was my first instinct was to like guess it's like \"oh my gosh\n\n98\n00:10:25,680 --> 00:10:31,040\nI have to get like a really good microphone for this.\" So I didn't go on a spending spree because\n\n99\n00:10:31,040 --> 00:10:37,760\nI said I'm gonna have to just wait a month and see if I still use this.\" And it just kind of became\n\n100\n00:10:37,760 --> 00:10:44,800\nit's become really part of my daily routine. 
Like, if I'm writing an email I'll record a voice note\n\n101\n00:10:44,880 --> 00:10:50,079\nand then I'll develop it and it's nice to see that everyone is like developing the same things in\n\n102\n00:10:50,079 --> 00:10:56,319\nparallel. Like, that's maybe kind of a weird thing to say. But when I look, I kind of came when I started\n\n103\n00:10:56,319 --> 00:11:02,640\nworking on this these prototypes on GitHub, which is where I just kind of share very freely and loosely\n\n104\n00:11:03,199 --> 00:11:10,800\nideas and you know first iterations on concepts. And for want of a better word I called it like\n\n105\n00:11:11,439 --> 00:11:17,680\n\"LLM post processing.\" Or cleanup. Or basically a system prompt that after you get back the raw text\n\n106\n00:11:17,680 --> 00:11:25,920\nfrom Whisper, you run it through model and say \"okay this is crappy text like add sentence structure\n\n107\n00:11:25,920 --> 00:11:33,199\nand you know fix it up. \" And now when I'm exploring the different tools that are out there that people\n\n108\n00:11:33,200 --> 00:11:39,040\nhave built, I see quite a number of projects have basically you know done the same thing.\n\n109\n00:11:40,640 --> 00:11:45,040\nLest that be misconstrued, I'm not saying for a millisecond that I inspired them. I'm sure this\n\n110\n00:11:45,040 --> 00:11:51,440\nhas been a thing that's been integrated into tools for a while. But it's, it's the kind of thing that\n\n111\n00:11:51,440 --> 00:11:57,520\nwhen you start using these tools every day the need for it is almost instantly apparent. 
Because text\n\n112\n00:11:57,600 --> 00:12:03,520\nthat doesn't have any punctuation or paragraph spacing takes a long time to you know, it takes so\n\n113\n00:12:03,520 --> 00:12:10,079\nlong to get it into a presentable email that, again, it moves speech tech into that,\n\n114\n00:12:11,280 --> 00:12:16,000\nbefore that inflection point where you're like \"nah it's just not worth.\" It it's like it'll just be\n\n115\n00:12:16,000 --> 00:12:20,800\nquicker to type this. So it's it's a big - it's a little touch that actually is a big deal.\n\n116\n00:12:21,520 --> 00:12:28,319\nSo I was on Whisper and I've been using Whisper and I kind of early on find a couple of tools.\n\n117\n00:12:28,319 --> 00:12:33,680\nI couldn't find what I was looking for on Linux which is basically just something that'll run\n\n118\n00:12:34,800 --> 00:12:39,120\nin the background. You'll give it an API key and it'll just like transcribe.\n\n119\n00:12:41,439 --> 00:12:47,359\nWith like a little key to start and stop the dictation. And the issues were I discovered that\n\n120\n00:12:47,440 --> 00:12:52,720\nlike most people involved in creating these projects were very much focused on local models.\n\n121\n00:12:52,720 --> 00:12:58,400\nAnd running Whisper locally because you can. And I tried that a bunch of times and just never\n\n122\n00:12:58,400 --> 00:13:03,920\ngot results that were as good as the cloud. And when I began looking at the cost of the speech to\n\n123\n00:13:03,920 --> 00:13:10,080\ntext APIs and what I was spending just thought there it's actually in my opinion just one of\n\n124\n00:13:10,080 --> 00:13:15,600\nthe better deals in API spending and in cloud. Like, it's just not that expensive for very, very good\n\n125\n00:13:15,600 --> 00:13:22,240\nmodels that are much more. You know, you're going to be able to run the full model, the latest model\n\n126\n00:13:22,240 --> 00:13:28,960\nversus whatever you can run on your average GPU. Unless you want to buy a crazy GPU. 
It doesn't\n\n127\n00:13:28,960 --> 00:13:34,000\nreally make sense to me. Now, I privacy is another concern that I know is kind of like a very much\n\n128\n00:13:34,000 --> 00:13:38,720\na separate thing. That people just don't want their voice data and their voice leaving their\n\n129\n00:13:38,720 --> 00:13:45,360\nlocal environment. Maybe for regulatory reasons as well. But I'm not in that. I'm don't really really\n\n130\n00:13:45,360 --> 00:13:51,440\ncare about people listening to my grocery list consisting of reminding myself that I need to buy\n\n131\n00:13:51,440 --> 00:13:58,240\nmore beer, Cheetos and hummus. Which is kind of the three three staples of my diet during periods of\n\n132\n00:13:58,240 --> 00:14:04,560\npoor nutrition. But the kind of stuff that I transcribe most it's just not it's not a it's not a\n\n133\n00:14:04,560 --> 00:14:12,640\nprivacy thing. I'm not that sort of sensitive about. And I don't do anything so you know sensitive\n\n134\n00:14:12,640 --> 00:14:17,680\nor secure that requires airgapping. So I looked at the pricing and especially the kind of older\n\n135\n00:14:17,680 --> 00:14:24,400\nmodels mini. Some of them are very very affordable. And I did a back of the, I did a calculation once\n\n136\n00:14:24,400 --> 00:14:30,239\nwith ChatGPT and I was like \"okay, this is the, this is the API price for I can't remember whatever\n\n137\n00:14:30,320 --> 00:14:37,040\nthe model was. Let's say I just go at it like nonstop which rarely happens. Probably I would say an\n\n138\n00:14:37,040 --> 00:14:45,200\naverage I might dictate 30 to 60 minutes per day if I was probably summing up the emails, documents,\n\n139\n00:14:45,200 --> 00:14:51,360\noutlines. Which is a lot. But it's it's still a fairly modest amount. 
And I was like well some days I\n\n140\n00:14:51,360 --> 00:14:56,720\ndo go on like one or two days where I've been usually when I'm like kind of out of the house and\n\n141\n00:14:56,720 --> 00:15:02,800\njust have something like I've nothing else to do. Like if I'm at a hospital. We have a newborn.\n\n142\n00:15:04,000 --> 00:15:09,040\nAnd you're waiting for like hours and hours for an appointment. And I would probably have\n\n143\n00:15:09,040 --> 00:15:15,280\nlistened to podcasts before becoming a speech fanatic. And I'm like \"oh wait let me just get down\n\n144\n00:15:15,280 --> 00:15:20,880\nlet me just get these ideas out of my head.\" And that's when I'll go on my speech binges. But those\n\n145\n00:15:20,880 --> 00:15:26,240\nare like once every few months - like not frequently. But I said okay let's just say if I'm gonna price\n\n146\n00:15:26,240 --> 00:15:35,440\nout cloud STT. If I was like dedicated every second of every waking hour to transcribing for some\n\n147\n00:15:35,440 --> 00:15:41,600\nodd reason. I mean, I'd have to like eat and use the toilet! Like, you know there's only so many hours\n\n148\n00:15:41,600 --> 00:15:48,480\nI'm awake for. So like let's just say a maximum of like 40 hour 45 minutes in the hours and I said\n\n149\n00:15:48,480 --> 00:15:55,360\nall right let's just say 50. Who knows? You're dictating on the toilet! We do it! So you could just do 60.\n\n150\n00:15:55,440 --> 00:16:02,560\nBut whatever I did - and every day. Like you're going flat out, seven days a week dictating nonstop\n\n151\n00:16:02,560 --> 00:16:08,640\nas like \"what's my monthly API bill gonna be at this price?\" And it came out to like 70 or\n\n152\n00:16:08,640 --> 00:16:14,960\n80 bucks. And I was like, well that would be an extraordinary amount of dictation! 
And I would hope\n\n153\n00:16:15,600 --> 00:16:21,680\nthat there was some compelling reason more worth more than 70 dollars that I embarked upon that.\n\n154\n00:16:22,640 --> 00:16:26,959\nSo given that that's kind of the max point for me I said that's actually very very affordable.\n\n155\n00:16:27,920 --> 00:16:32,640\nNow you're gonna if you want to spec out the costs and you want to do the post processing\n\n156\n00:16:33,599 --> 00:16:39,199\nthat I really do feel is valuable that's gonna cost more as well. Unless you're using\n\n157\n00:16:40,160 --> 00:16:47,839\nGemini which needless to say as a random person sitting in Jerusalem I have no affiliation nor with\n\n158\n00:16:47,840 --> 00:16:54,800\nGoogle nor Anthropic nor Gemini nor any major tech vendor for that matter. Um I like Gemini\n\n159\n00:16:54,800 --> 00:17:00,080\nnot so much as a everyday model. Um it's kind of underwhelmed in that respect I would say.\n\n160\n00:17:00,080 --> 00:17:05,920\nBut for multimodal I think it's got a lot to offer. And I think that the transcribing functionality\n\n161\n00:17:05,920 --> 00:17:13,280\nwhereby it can um process audio with the system prompt and both give you a transcription that's\n\n162\n00:17:13,280 --> 00:17:20,079\ncleaned up - that reduces two steps to one. And that for me is a very very big deal. And uh I feel like\n\n163\n00:17:20,079 --> 00:17:27,280\neven Google hasn't really sort of thought through how useful the that modality is and what kind of\n\n164\n00:17:27,280 --> 00:17:33,280\nuse cases uh you can achieve with it. 
Because I found in the course of this year just an endless\n\n165\n00:17:33,280 --> 00:17:40,399\nlist of really kind of system prompt system prompt stuff that I can say \"okay I've used it\n\n166\n00:17:40,560 --> 00:17:45,920\nto capture context data for AI which is literally I might speak for if I wanted to have a good\n\n167\n00:17:45,920 --> 00:17:52,560\nbank of context data about who knows my childhood uh more realistically maybe my career goals\n\n168\n00:17:53,520 --> 00:17:59,520\nsomething that would just be like really boring to type out so I'll just like sit in my car\n\n169\n00:17:59,520 --> 00:18:06,640\nand record it for 10 minutes. And that 10 minutes you get a lot of information in um emails which is\n\n170\n00:18:06,640 --> 00:18:13,200\nshort text uh just there is a whole bunch. And all these workflows kind of require a little bit\n\n171\n00:18:13,200 --> 00:18:18,320\nof treatment afterwards and different treatment. My context pipeline is kind of like just extract the\n\n172\n00:18:18,320 --> 00:18:23,520\nbare essential. So you end up with me talking very loosely about sort of what I've done in my career,\n\n173\n00:18:23,520 --> 00:18:30,000\nwhere I've worked, where I might like to work. And it goes - it condenses that down to very robotic language\n\n174\n00:18:30,000 --> 00:18:36,000\nthat is easy to chunk, parse, and maybe put into a vector database. \"Daniel has worked in technology!\n\n175\n00:18:36,080 --> 00:18:42,400\nDaniel is a has been working in marketing.\" Stuff like that. That's not how you would speak um but I\n\n176\n00:18:42,400 --> 00:18:48,480\nfigure it's probably easier to parse for, after all, robots. So we've almost got to 20 minutes. And this\n\n177\n00:18:48,480 --> 00:18:56,880\nis actually a success because I wasted 20 minutes of my uh of the evening speaking into microphone and\n\n178\n00:18:56,880 --> 00:19:02,720\nthe levels were shot and it uh it was clipping. And I said I can't really do an evaluation. 
I have to\n\n179\n00:19:02,720 --> 00:19:09,440\nbe fair. I have to give the models a chance to do their thing. Uh what am I hoping to achieve in this?\n\n180\n00:19:09,440 --> 00:19:14,960\nOkay my fine tune was a dud as mentioned. Deepgram STT - I'm really really hopeful that this prototype\n\n181\n00:19:14,960 --> 00:19:20,560\nwill work. And it's a build in public open source. So anyone is welcome to use it if I make anything good\n\n182\n00:19:21,600 --> 00:19:28,000\nBut what was really exciting for me last night when after hours of um trying my own prototypes, seeing\n\n183\n00:19:28,080 --> 00:19:33,120\nsomeone just made something that works like that. You know, you're not going to have to build a custom\n\n184\n00:19:34,240 --> 00:19:40,960\nConda environment and image. I have AMD GPU which makes things much more complicated. I didn't find it\n\n185\n00:19:41,840 --> 00:19:46,400\nAnd I was about to give up and I said \"all right. Let me just give Deepgram's Linux thing a shot\n\n186\n00:19:47,040 --> 00:19:50,960\nand if this doesn't work um I'm just going to go back to trying to vibe code something myself.\"\n\n187\n00:19:51,600 --> 00:19:57,360\nAnd when I ran the script - I was using Claude Code to do the installation process -\n\n188\n00:19:58,160 --> 00:20:02,800\nit ran the script and \"oh my gosh, it works!\" Just like that! Uh the tricky thing\n\n189\n00:20:04,480 --> 00:20:12,480\nfor all those ones who want to know all the nitty gritty details um was that I don't think it was actually\n\n190\n00:20:12,480 --> 00:20:18,160\nstruggling with transcription but pasting. Wayland makes life very hard. And I think there was\n\n191\n00:20:18,160 --> 00:20:22,800\nsomething not running at the right time. Anyway, Deepgram - I looked at how they actually handled\n\n192\n00:20:22,960 --> 00:20:28,960\nthat because it worked out of the box when other stuff didn't. 
And it was quite a clever little mechanism\n\n193\n00:20:29,520 --> 00:20:34,560\nand but more so than that the accuracy was brilliant. Now, what am I doing here? This is going to be a 20\n\n194\n00:20:34,560 --> 00:20:44,399\nminute audio uh sample and I'm I think I've done one or two of these before but I did it with\n\n195\n00:20:45,360 --> 00:20:51,120\nshort, snappy voice notes. This is kind of long form. This actually might be a better approximation\n\n196\n00:20:51,120 --> 00:20:55,040\nfor what's useful to me than voice memos like \"I need to buy three\n\n197\n00:20:55,840 --> 00:20:59,840\nliters of milk tomorrow and pita bread.\" Which is probably how like half my voice note\n\n198\n00:20:59,840 --> 00:21:04,399\nvoice notes sound. Like if anyone were to I don't know like find my phone they'd be like \"this is\n\n199\n00:21:04,399 --> 00:21:09,280\nthe most boring person in the world!\" Although actually there are some like kind of uh journaling\n\n200\n00:21:09,280 --> 00:21:14,080\nthoughts as well. But it's a lot of content like that. And the probably for the evaluation the most\n\n201\n00:21:14,080 --> 00:21:22,560\nuseful thing is slightly obscure tech: Github, Nuclino, Hugging Face. Not so obscure that it's not\n\n202\n00:21:22,560 --> 00:21:27,360\ngoing to have a chance of knowing it. But hopefully sufficiently well known that the models should get\n\n203\n00:21:27,360 --> 00:21:32,800\nit. Uh I tried to do a little bit of speaking really fast and speaking very slowly. I would say in\n\n204\n00:21:32,800 --> 00:21:38,960\ngeneral I've spoken delivered this at a faster pace than I usually would owing to strong coffee\n\n205\n00:21:39,120 --> 00:21:44,240\nflowing through my bloodstream. And the thing that I'm not going to get in this benchmark is\n\n206\n00:21:44,240 --> 00:21:49,920\nbackground noise. Which in my first take that I had to get rid of my wife came in with my son\n\n207\n00:21:49,920 --> 00:21:55,680\nfor a good night kiss. 
And that actually would have been super helpful to get in because it was\n\n208\n00:21:56,400 --> 00:22:01,600\nnon-diarised. Or if we had diarisation a female I could say I want the male voice and that\n\n209\n00:22:01,600 --> 00:22:06,240\nwasn't intended for transcription um. And we're not going to get background noise like people\n\n210\n00:22:06,240 --> 00:22:11,840\nhonking their horns. Which is something I've done in my main dataset where I am trying to go back\n\n211\n00:22:11,840 --> 00:22:16,880\nto some of my voice notes, annotate them, and run a benchmark. But this is going to be just a pure\n\n212\n00:22:17,680 --> 00:22:24,960\nquick test. And as someone I'm working on a voice note idea that's my sort of end\n\n213\n00:22:26,560 --> 00:22:30,320\nmotivation besides thinking it's an absolute outstanding technology that's coming to\n\n214\n00:22:30,960 --> 00:22:36,240\nviability and really - I know this sounds cheesy - can actually have a very transformative effect.\n\n215\n00:22:37,120 --> 00:22:42,720\nIt's, you know, voice technology has been life changing for folks living with\n\n216\n00:22:44,000 --> 00:22:49,760\ndisabilities. And I think there's something really nice about the fact that it can also benefit\n\n217\n00:22:50,480 --> 00:22:54,639\nyou know folks who are able-bodied. And like we can all in different ways\n\n218\n00:22:55,120 --> 00:23:02,560\num make this tech as useful as possible regardless of the exact way that we're using it um. And I\n\n219\n00:23:02,560 --> 00:23:07,760\nthink there's something very powerful in that. And it can be very cool um I see huge potential. What\n\n220\n00:23:07,760 --> 00:23:14,480\nexcites me about voice tech - a lot of things actually. Firstly the fact that it's cheap and accurate\n\n221\n00:23:14,480 --> 00:23:19,040\nas I mentioned at the very start of this um. And it's getting better and better with stuff like\n\n222\n00:23:19,040 --> 00:23:24,160\naccent handling um. 
I'm not sure my my fine tune will actually ever come to fruition in the\n\n223\n00:23:24,160 --> 00:23:30,240\nsense that I'll use it day to day as I imagine and get like superb flawless words error rates. Because\n\n224\n00:23:30,240 --> 00:23:37,680\nI'm just kind of skeptical about local speech to tech as I mentioned. And I think the pace of\n\n225\n00:23:37,680 --> 00:23:42,720\ninnovation and improvement in the models. The main reasons for fine tuning from what I've seen\n\n226\n00:23:44,320 --> 00:23:50,480\nhave been people who are something that really blows blows my mind about ASR is the idea that it's\n\n227\n00:23:50,480 --> 00:24:00,080\ninherently a-llingual. Or multilingual. Phonetic-based. So as folks who use speak very obscure languages\n\n228\n00:24:00,080 --> 00:24:04,800\nthat there may be there there might be a paucity of training data or almost none at all. And therefore\n\n229\n00:24:04,800 --> 00:24:11,440\nthe accuracy is significantly reduced. Or folks in very critical environments. I know there are\n\n230\n00:24:11,440 --> 00:24:17,680\nthis is used extensively in medical transcription and dispatcher work as um you know the call\n\n231\n00:24:17,680 --> 00:24:24,000\ncenters who send out ambulances etc where accuracy is absolutely paramount. And in the case of doctors,\n\n232\n00:24:24,560 --> 00:24:29,680\nradiologists they might be using very specialized vocab all the time. So those are kind of the main\n\n233\n00:24:29,680 --> 00:24:35,680\ntwo things. And I'm not sure that really just for trying to make it better on a few random tech words\n\n234\n00:24:35,680 --> 00:24:41,840\nwith my slightly. I mean, I have an accent! But like, not you know an accent that a few other million\n\n235\n00:24:41,840 --> 00:24:50,720\npeople have. Ish. 
I'm not sure that my little fine tune is going to actually like the bump in\n\n236\n00:24:50,720 --> 00:24:55,760\nword error reduction if I ever actually figure out how to do it and get it up to the cloud. By the\n\n237\n00:24:55,760 --> 00:25:00,879\ntime we've done that I suspect that the next generation of ASR will just be so good that it will\n\n238\n00:25:00,879 --> 00:25:07,040\nkind of be \"nah, well, that would be cool if it worked out. But I'll just use this instead.\" So that's going to be\n\n239\n00:25:07,280 --> 00:25:15,040\nit for today's episodes of voice training data single long shot evaluation. Who am I going to\n\n240\n00:25:15,040 --> 00:25:21,200\ncompare? Whisper is always good as a benchmark. But I'm more interested in seeing Whisper head-to-head\n\n241\n00:25:21,200 --> 00:25:27,680\nwith two things really. One is Whisper variants. So you've got these projects like Faster Whisper,\n\n242\n00:25:29,120 --> 00:25:34,000\nDistilled Whisper. It's a bit confusing. There's a whole bunch of them. And the emerging ASRs which\n\n243\n00:25:34,160 --> 00:25:38,960\nare also a thing. My intention for this is I'm not sure I'm going to have the time in any point\n\n244\n00:25:38,960 --> 00:25:46,320\nof the foreseeable future to go back through this whole episode and create a proper source truth or I fix\n\n245\n00:25:47,520 --> 00:25:53,760\neverything. I might do it if I can get one transcription that's sufficiently close to perfection.\n\n246\n00:25:54,960 --> 00:26:00,560\nBut what I would actually love to do on Hugging Face I think would be a great probably how I might\n\n247\n00:26:00,560 --> 00:26:08,080\nvisualize this is having the audio waveform play. And then have the transcript for each model below\n\n248\n00:26:08,080 --> 00:26:16,320\nit. And maybe even a like you know to scale. And maybe even a local one as well like local Whisper\n\n249\n00:26:16,320 --> 00:26:23,919\nversus Open AI API etc. 
And I can then actually listen back to segments. Or anyone who wants to\n\n250\n00:26:24,000 --> 00:26:30,000\ncan listen back to segments of this recording and see where a particular model struggled\n\n251\n00:26:30,000 --> 00:26:35,600\nwhile others didn't, as well as the sort of headline finding of which had the best WER. But that would\n\n252\n00:26:35,600 --> 00:26:41,120\nrequire the source of truth. Okay, that's it. Hope this was, I don't know, maybe useful for other\n\n253\n00:26:41,120 --> 00:26:46,480\nfolks interested in STT. You want to see - that I always feel think I've just said as something I\n\n254\n00:26:46,480 --> 00:26:52,800\ndidn't intend to. STT I said for those listening carefully! Including hopefully the models themselves!\n\n255\n00:26:53,280 --> 00:26:58,960\nThis has been myself Daniel Rosehill. For more um jumbled repositories about my uh roving interests\n\n256\n00:26:58,960 --> 00:27:06,639\nin AI. But particularly agentic AI, MCP, and voice tech, you can find me on Github, Hugging Face.\n\n257\n00:27:08,080 --> 00:27:14,000\nWhere else? DanielRosehilll.com which is my personal website. As well as this podcast whose name\n\n258\n00:27:14,000 --> 00:27:17,280\nI sadly cannot remember! Until next time, thanks for listening!\n\n", "assembly": "1\n00:00:00,080 --> 00:00:05,680\nHello and welcome to a audio data set consisting\n\n2\n00:00:05,680 --> 00:00:10,640\nof one single episode of a non-existent podcast. Or I\n\n3\n00:00:10,720 --> 00:00:13,360\nmay append this to a podcast that I set up\n\n4\n00:00:13,600 --> 00:00:19,200\nrecently regarding my with my thoughts on speech\n\n5\n00:00:19,280 --> 00:00:24,000\ntech and AI in particular, more AI in generative AI,\n\n6\n00:00:24,240 --> 00:00:28,640\nI would say. 
But in any event, the purpose of\n\n7\n00:00:28,720 --> 00:00:33,850\nthis Voice recording is actually to create a lengthy\n\n8\n00:00:33,930 --> 00:00:37,130\nvoice sample for a quick evaluation, a back of the\n\n9\n00:00:37,130 --> 00:00:40,650\nenvelope evaluation, as they might say, for different speech attack\n\n10\n00:00:40,890 --> 00:00:43,450\nmodels. And I'm doing this because I thought I had\n\n11\n00:00:43,450 --> 00:00:46,810\nmade a great breakthrough in my journey with speech tech,\n\n12\n00:00:47,130 --> 00:00:50,730\nand that was succeeding in the elusive task of fine-tuning\n\n13\n00:00:50,730 --> 00:00:54,810\nWhisper. Whisper is, and I'm going to just talk, I'm\n\n14\n00:00:54,890 --> 00:00:58,250\ntrying to mix up, I'm going to try a few\n\n15\n00:00:58,410 --> 00:01:01,530\ndifferent styles of speaking. I might whisper something at some\n\n16\n00:01:01,610 --> 00:01:04,880\npoint. As well. And I'll go back to speaking loud\n\n17\n00:01:04,960 --> 00:01:08,080\nin, in different parts. I'm going to sound really like\n\n18\n00:01:08,160 --> 00:01:11,120\na crazy person because I'm also going to try to\n\n19\n00:01:11,280 --> 00:01:16,240\nspeak at different pitches and cadences in order to really\n\n20\n00:01:16,560 --> 00:01:20,560\ntry to put a speech attacks model through its paces,\n\n21\n00:01:20,720 --> 00:01:23,040\nwhich is trying to make sense of is this guy\n\n22\n00:01:23,200 --> 00:01:28,060\njust rambling on incoherently in one long sentence or are\n\n23\n00:01:28,460 --> 00:01:34,220\nthese just actually a series of step, standalone,\n\n24\n00:01:34,380 --> 00:01:37,420\nstep alone, standalone sentences? And how is it gonna handle\n\n25\n00:01:37,500 --> 00:01:40,460\nstep alone? That's not a word. 
What happens when you\n\n26\n00:01:40,540 --> 00:01:43,020\nuse speech to text and you use a fake word?\n\n27\n00:01:43,180 --> 00:01:45,580\nAnd then you're like, wait, that's not actually, that word\n\n28\n00:01:45,740 --> 00:01:50,220\ndoesn't exist. How does AI handle that? And these and\n\n29\n00:01:50,460 --> 00:01:54,300\nmore are all the questions that I'm seeking to answer\n\n30\n00:01:54,460 --> 00:01:57,500\nin this training data. Now, why was it trying to\n\n31\n00:01:57,500 --> 00:02:00,290\nfine tune Whisper? And what is Whisper? As I said,\n\n32\n00:02:00,370 --> 00:02:03,010\nI'm going to try to record this at a couple\n\n33\n00:02:03,170 --> 00:02:07,490\nof different levels of technicality for folks who are, you\n\n34\n00:02:07,490 --> 00:02:11,730\nknow, in the normal world and not totally stuck down\n\n35\n00:02:11,810 --> 00:02:13,810\nthe rabbit hole of AI, which I have to say\n\n36\n00:02:13,970 --> 00:02:18,130\nis a really wonderful rabbit hole to be down. It's\n\n37\n00:02:18,210 --> 00:02:21,570\na really interesting area and speech and voice tech is\n\n38\n00:02:21,970 --> 00:02:24,610\nthe aspect of it that I find actually the most,\n\n39\n00:02:25,010 --> 00:02:27,410\nI'm not sure I would say the most interesting because\n\n40\n00:02:27,650 --> 00:02:31,370\nthere's just so much that is fascinating in AI. 
But\n\n41\n00:02:31,530 --> 00:02:34,330\nthe most that I find the most personally transformative in\n\n42\n00:02:34,410 --> 00:02:38,970\nterms of the impact that it's had on my daily\n\n43\n00:02:39,050 --> 00:02:41,530\nwork life and productivity and how I sort of work.\n\n44\n00:02:42,170 --> 00:02:47,290\nAnd I'm persevering hard with the task of trying\n\n45\n00:02:47,290 --> 00:02:50,330\nto get a good solution working for Linux, which if\n\n46\n00:02:50,330 --> 00:02:52,330\nanyone actually does listen to this, not just for the\n\n47\n00:02:52,330 --> 00:02:56,490\ntraining data and for the actual content, this is sparked\n\n48\n00:02:56,830 --> 00:03:00,030\nI had, besides the fine tune not working, well, that\n\n49\n00:03:00,110 --> 00:03:05,310\nwas the failure. Um, I used Claude code because one\n\n50\n00:03:05,550 --> 00:03:10,030\nthinks these days that there is nothing short of solving,\n\n51\n00:03:11,070 --> 00:03:15,470\nyou know, the, the reason of life or something, that\n\n52\n00:03:15,870 --> 00:03:19,070\nClaude and agentic AI can't do, which is not really\n\n53\n00:03:19,150 --> 00:03:22,270\nthe case. Uh, it does seem that way sometimes, but\n\n54\n00:03:22,430 --> 00:03:24,270\nit fails a lot as well. And this is one\n\n55\n00:03:24,270 --> 00:03:27,710\nof those, instances where last week I put together an\n\n56\n00:03:27,790 --> 00:03:32,090\nhour of voice training data, basically speaking, just random things\n\n57\n00:03:32,330 --> 00:03:37,130\nfor 3 minutes. And it was actually kind of tedious\n\n58\n00:03:37,210 --> 00:03:39,290\nbecause the texts were really weird. Some of them were\n\n59\n00:03:39,530 --> 00:03:43,130\nit was like it was AI generated. I tried before\n\n60\n00:03:43,290 --> 00:03:45,210\nto read Sherlock Holmes for an hour and I just\n\n61\n00:03:45,210 --> 00:03:48,410\ncouldn't. 
I was so bored after 10 minutes that I\n\n62\n00:03:48,410 --> 00:03:50,810\nwas like, okay, no, I'm just going to have to\n\n63\n00:03:50,810 --> 00:03:55,370\nfind something else to read. So I used a created\n\n64\n00:03:55,770 --> 00:04:01,360\nwith AI studio vibe coded a synthetic text generator. Which\n\n65\n00:04:01,680 --> 00:04:03,920\nactually I thought was probably a better way of doing\n\n66\n00:04:04,000 --> 00:04:07,520\nit because it would give me more short samples with\n\n67\n00:04:07,760 --> 00:04:10,560\nmore varied content. So I was like, okay, give me\n\n68\n00:04:10,960 --> 00:04:13,840\na voice note, like I'm recording an email, give me\n\n69\n00:04:14,080 --> 00:04:17,760\na short story to read, give me prose to read.\n\n70\n00:04:18,080 --> 00:04:20,480\nSo I came up with all these different things and\n\n71\n00:04:20,640 --> 00:04:22,640\nthey added a little timer to it so I could\n\n72\n00:04:22,800 --> 00:04:26,480\nsee how close I was to one hour. And I\n\n73\n00:04:26,640 --> 00:04:29,680\nspent like an hour one afternoon or probably two hours\n\n74\n00:04:29,840 --> 00:04:33,410\nby the time you you do retakes. And whatever, because\n\n75\n00:04:33,490 --> 00:04:36,690\nyou want to, it gave me a source of truth,\n\n76\n00:04:37,410 --> 00:04:40,130\nwhich I'm not sure if that's the scientific way to\n\n77\n00:04:40,290 --> 00:04:44,290\napproach this topic of gathering, training data, but I thought\n\n78\n00:04:44,530 --> 00:04:48,210\nmade sense. Um, I have a lot of audio data\n\n79\n00:04:48,290 --> 00:04:50,850\nfrom recording voice notes, which I've also kind of used,\n\n80\n00:04:52,130 --> 00:04:55,890\nbeen experimenting with using for a different purpose, slightly different\n\n81\n00:04:56,290 --> 00:05:01,490\nannotating task types. It's more a text classification experiment\n\n82\n00:05:01,810 --> 00:05:04,240\nor, Well, it's more than that actually. I'm working on\n\n83\n00:05:04,240 --> 00:05:08,160\na voice app. 
So it's a prototype, I guess, is\n\n84\n00:05:08,320 --> 00:05:12,800\nreally more accurate. But you can do that and you\n\n85\n00:05:12,800 --> 00:05:15,280\ncan work backwards. You're like, you listen back to a\n\n86\n00:05:15,280 --> 00:05:18,800\nvoice note and you painfully go through one of those\n\n87\n00:05:19,120 --> 00:05:21,920\ntranscribing, you know, where you start and stop and scrub\n\n88\n00:05:22,080 --> 00:05:24,000\naround it and you fix the errors, but it's really,\n\n89\n00:05:24,160 --> 00:05:26,800\nreally boring to do that. So I thought it would\n\n90\n00:05:26,880 --> 00:05:29,120\nbe less tedious in the long term if I just\n\n91\n00:05:30,139 --> 00:05:33,020\nrecorded the source of truth. So it gave me these\n\n92\n00:05:33,100 --> 00:05:36,220\nthree minute snippets. I recorded them. It saved an MP3\n\n93\n00:05:36,460 --> 00:05:39,580\nand a TXT in the same folder, and I created\n\n94\n00:05:39,660 --> 00:05:42,940\nan error with that data. So I was very hopeful,\n\n95\n00:05:43,340 --> 00:05:46,940\nquietly, a little bit hopeful that I could actually fine\n\n96\n00:05:47,020 --> 00:05:50,540\ntune Whisper. I want to fine tune Whisper because when\n\n97\n00:05:50,620 --> 00:05:54,860\nI got into Voicetech last November, my wife was in\n\n98\n00:05:54,860 --> 00:05:58,220\nthe US and I was alone at home. And when\n\n99\n00:05:58,680 --> 00:06:01,480\ncrazy people like me do really wild things like use\n\n100\n00:06:01,720 --> 00:06:06,200\nvoice to tech technology. That was basically when I started\n\n101\n00:06:06,280 --> 00:06:08,840\ndoing it, I didn't feel like a crazy person speaking\n\n102\n00:06:08,920 --> 00:06:13,800\nto myself. And my expectations weren't that high. I used\n\n103\n00:06:14,360 --> 00:06:17,720\nspeech tech now and again, tried it out. It was\n\n104\n00:06:17,720 --> 00:06:19,240\nlike, it'd be really cool if you could just, like,\n\n105\n00:06:19,400 --> 00:06:22,840\nspeak into your computer. 
And whatever I tried out that\n\n106\n00:06:23,080 --> 00:06:26,670\nhad Linux support was just. It was not good, basically.\n\n107\n00:06:27,310 --> 00:06:29,550\nAnd this blew me away from the first go. I\n\n108\n00:06:29,550 --> 00:06:32,830\nmean, it wasn't 100% accurate out of the box and\n\n109\n00:06:32,910 --> 00:06:34,990\nit took work, but it was good enough that there\n\n110\n00:06:35,070 --> 00:06:37,550\nwas a solid foundation and it kind of passed that\n\n111\n00:06:38,750 --> 00:06:41,950\npivot point that it's actually worth doing this. You know,\n\n112\n00:06:42,110 --> 00:06:44,750\nthere's a point where it's so like the transcript is\n\n113\n00:06:44,990 --> 00:06:47,390\nyou don't have to get 100% accuracy for it to\n\n114\n00:06:47,390 --> 00:06:50,110\nbe worth your time for speech attacks to be a\n\n115\n00:06:50,110 --> 00:06:52,510\nworthwhile addition to your productivity, but you do need to\n\n116\n00:06:52,510 --> 00:06:56,050\nget above, let's say, I don't know, 85%. If it's\n\n117\n00:06:56,210 --> 00:06:59,890\n60% or 50%, you inevitably say, screw it, I'll just\n\n118\n00:06:59,890 --> 00:07:02,850\ntype it because you end up missing errors in the\n\n119\n00:07:02,850 --> 00:07:05,570\ntranscript and it becomes actually worse. You end up in\n\n120\n00:07:05,570 --> 00:07:07,650\na worse position than you started with. That's been my\n\n121\n00:07:07,730 --> 00:07:12,050\nexperience. So I was like, oh, this is actually really,\n\n122\n00:07:12,210 --> 00:07:14,050\nreally good now. How did that happen? And the answer\n\n123\n00:07:14,210 --> 00:07:19,490\nis ASR whisper being open source and the transformer\n\n124\n00:07:19,490 --> 00:07:23,250\narchitecture. If you want to go back to the to\n\n125\n00:07:23,330 --> 00:07:26,450\nthe underpinnings, which really blows my mind and it's on\n\n126\n00:07:26,530 --> 00:07:30,760\nmy list. To read through that paper. 
All you need\n\n127\n00:07:30,840 --> 00:07:36,040\nis attention as attentively as can be done\n\n128\n00:07:36,280 --> 00:07:39,400\nwith my limited brain because it's super, super high level\n\n129\n00:07:39,720 --> 00:07:44,600\nstuff, super advanced stuff, I mean. But that, I think\n\n130\n00:07:44,760 --> 00:07:49,400\nof all the things that are fascinating about the sudden\n\n131\n00:07:49,720 --> 00:07:53,780\nrise in AI and the dramatic capabilities. I find it\n\n132\n00:07:53,780 --> 00:07:56,180\nfascinating that a few people are like, hang on, you've\n\n133\n00:07:56,180 --> 00:07:58,500\ngot this thing that can speak to you, like a\n\n134\n00:07:58,500 --> 00:08:03,060\nchatbot, an LLM, and then you've got image generation. Okay,\n\n135\n00:08:03,140 --> 00:08:06,660\nso firstly, those two things on the surface have nothing\n\n136\n00:08:06,980 --> 00:08:10,820\nin common. So like, how are they, how did that\n\n137\n00:08:10,980 --> 00:08:12,580\njust happen all at the same time? And then when\n\n138\n00:08:12,580 --> 00:08:16,660\nyou extend that further, you're like, Suno, right? You can\n\n139\n00:08:17,140 --> 00:08:20,110\nsing a song and AI will come up with and\n\n140\n00:08:20,270 --> 00:08:23,470\ninstrumental. And then you've got Whisper and you're like, wait\n\n141\n00:08:23,470 --> 00:08:25,950\na second, how did all this stuff, like, if it's\n\n142\n00:08:25,950 --> 00:08:29,310\nall AI, what's like, there has to be some commonality.\n\n143\n00:08:29,550 --> 00:08:34,670\nOtherwise, these are totally different technologies on the surface of\n\n144\n00:08:34,670 --> 00:08:38,910\nit. And the Transformer architecture is, as far as I\n\n145\n00:08:38,990 --> 00:08:41,630\nknow, the answer. 
And I can't even say, can't even\n\n146\n00:08:41,710 --> 00:08:46,350\npretend that I really understand what the Transformer architecture means.\n\n147\n00:08:46,850 --> 00:08:49,330\nIn depth, but I have scanned it and as I\n\n148\n00:08:49,490 --> 00:08:51,890\nsaid, I want to print it and really kind of\n\n149\n00:08:52,290 --> 00:08:56,130\nthink over it at some point. And I'll probably feel\n\n150\n00:08:56,370 --> 00:08:59,330\nbad about myself, I think, because weren't those guys in\n\n151\n00:08:59,410 --> 00:09:03,490\ntheir 20s? Like, that's crazy. I think I asked ChatGPT\n\n152\n00:09:03,570 --> 00:09:07,970\nonce who wrote that paper and how old were they\n\n153\n00:09:08,130 --> 00:09:10,850\nwhen it was published in Arciv? And I was expecting,\n\n154\n00:09:11,090 --> 00:09:13,970\nlike, I don't know, What do you imagine? I personally\n\n155\n00:09:14,050 --> 00:09:16,290\nimagine kind of like, you know, you have these breakthroughs\n\n156\n00:09:16,450 --> 00:09:19,890\nduring COVID and things like that where like these kind\n\n157\n00:09:19,970 --> 00:09:22,850\nof really obscure scientists are like in their 50s and\n\n158\n00:09:22,850 --> 00:09:27,250\nthey've just kind of been laboring in labs and wearily\n\n159\n00:09:27,250 --> 00:09:30,530\nin writing and publishing in kind of obscure academic publications.\n\n160\n00:09:30,850 --> 00:09:33,250\nAnd they finally like hit a big or win a\n\n161\n00:09:33,250 --> 00:09:37,330\nNobel Prize and then their household names. So that was\n\n162\n00:09:37,410 --> 00:09:39,070\nkind of what I had in mind. That was the\n\n163\n00:09:39,070 --> 00:09:43,070\nmental image I'd formed of the birth of Arcsight. Like\n\n164\n00:09:43,070 --> 00:09:46,350\nI wasn't expecting 20-somethings in San Francisco, though. I thought\n\n165\n00:09:46,430 --> 00:09:48,910\nthat was both very, very funny, very cool, and actually\n\n166\n00:09:49,070 --> 00:09:52,590\nkind of inspiring. 
It's nice to think that people who,\n\n167\n00:09:53,390 --> 00:09:56,190\nyou know, just you might put them in the kind\n\n168\n00:09:56,270 --> 00:09:59,630\nof milieu or bubble or world that you are in\n\n169\n00:09:59,710 --> 00:10:03,310\nare credibly in through, you know, the series of connections\n\n170\n00:10:03,390 --> 00:10:07,470\nthat are coming up with such literally world changing innovations.\n\n171\n00:10:07,950 --> 00:10:11,540\nSo that was, I thought, anyway. That's that was cool.\n\n172\n00:10:11,940 --> 00:10:14,580\nOkay, voice training data. How are we doing? We're about\n\n173\n00:10:14,580 --> 00:10:18,660\n10 minutes and I'm still talking about voice technology. So\n\n174\n00:10:18,740 --> 00:10:22,180\nWhisper was brilliant and I was so excited that I\n\n175\n00:10:22,260 --> 00:10:25,460\nwas my first instinct was to like guess like, oh\n\n176\n00:10:25,460 --> 00:10:26,900\nmy gosh, I have to get like a really good\n\n177\n00:10:26,900 --> 00:10:30,660\nmicrophone for this. So I didn't go on a spending\n\n178\n00:10:30,660 --> 00:10:32,820\nspree because I said, I'm gonna have to just wait\n\n179\n00:10:32,820 --> 00:10:35,220\na month and see if I still use this. And\n\n180\n00:10:36,510 --> 00:10:38,990\nIt just kind of became, it's become really part of\n\n181\n00:10:39,150 --> 00:10:43,470\nmy daily routine. Like if I'm writing an email, I'll\n\n182\n00:10:43,550 --> 00:10:47,070\nrecord a voice note. And then I've developed and it's\n\n183\n00:10:47,070 --> 00:10:49,150\nnice to see that everyone is like developing the same\n\n184\n00:10:49,630 --> 00:10:52,030\nthings in parallel. 
Like that's my kind of a weird\n\n185\n00:10:52,030 --> 00:10:54,590\nthing to say, but when I look, I kind of\n\n186\n00:10:54,750 --> 00:10:59,070\ncame, when I started working on this, these prototypes on\n\n187\n00:10:59,150 --> 00:11:01,550\nGitHub, which is where I just kind of share very\n\n188\n00:11:01,790 --> 00:11:06,810\nfreely and loosely, ideas and first iterations on concepts.\n\n189\n00:11:08,570 --> 00:11:10,730\nAnd for want of a better word, I called it\n\n190\n00:11:10,810 --> 00:11:15,530\nlike LLM post-processing or cleanup or basically a system prompt\n\n191\n00:11:15,610 --> 00:11:18,970\nthat after you get back the raw text from Whisper,\n\n192\n00:11:19,130 --> 00:11:22,090\nyou run it through a model and say, okay, this\n\n193\n00:11:22,170 --> 00:11:27,050\nis crappy text, like add sentence structure and fix it\n\n194\n00:11:27,130 --> 00:11:32,330\nup. And now when I'm exploring the different tools that\n\n195\n00:11:32,410 --> 00:11:35,260\nare out there that people have built, I see quite\n\n196\n00:11:35,500 --> 00:11:39,180\na number of projects have basically done the same thing,\n\n197\n00:11:40,540 --> 00:11:43,260\nlest that be misconstrued. I'm not saying for a millisecond\n\n198\n00:11:43,340 --> 00:11:46,300\nthat I inspired them. 
I'm sure this has been a\n\n199\n00:11:46,380 --> 00:11:49,580\nthing that's been integrated into tools for a while, but\n\n200\n00:11:50,460 --> 00:11:52,380\nit's the kind of thing that when you start using\n\n201\n00:11:52,380 --> 00:11:54,860\nthese tools every day, the need for it is almost\n\n202\n00:11:55,020 --> 00:11:59,500\ninstantly apparent because text that doesn't have any punctuation or\n\n203\n00:11:59,880 --> 00:12:03,080\nParagraph spacing takes a long time to, you know, it\n\n204\n00:12:03,240 --> 00:12:05,480\ntakes so long to get it into a presentable email\n\n205\n00:12:05,640 --> 00:12:09,800\nthat again, it's, it's, it, it moves speech tech into\n\n206\n00:12:10,040 --> 00:12:13,560\nthat before that inflection point where you're like, no, it's\n\n207\n00:12:13,560 --> 00:12:16,040\njust not worth it. It's like, it's, it'll just be\n\n208\n00:12:16,120 --> 00:12:18,600\nquicker to type this. So it's a big, it's a\n\n209\n00:12:18,600 --> 00:12:21,640\nlittle touch that actually is a big deal. Uh, so\n\n210\n00:12:21,800 --> 00:12:25,720\nI was on Whisper and I've been using Whisper and\n\n211\n00:12:25,720 --> 00:12:28,190\nI kind of, early on found a couple of tools.\n\n212\n00:12:28,350 --> 00:12:30,590\nI couldn't find what I was looking for on Linux,\n\n213\n00:12:30,750 --> 00:12:35,550\nwhich is basically just something that'll run in the background.\n\n214\n00:12:35,790 --> 00:12:38,110\nIt'll give it an API key and it will just\n\n215\n00:12:38,270 --> 00:12:42,990\nlike transcribe with like a little key to start and\n\n216\n00:12:43,070 --> 00:12:47,390\nstop the dictation. And the issues were I discovered that\n\n217\n00:12:47,550 --> 00:12:51,150\nlike most people involved in creating these projects were very\n\n218\n00:12:51,310 --> 00:12:55,150\nmuch focused on local models, running Whisper locally because you\n\n219\n00:12:55,230 --> 00:12:58,020\ncan. 
And I tried that a bunch of times and\n\n220\n00:12:58,100 --> 00:13:00,420\njust never got results that were as good as the\n\n221\n00:13:00,420 --> 00:13:03,220\ncloud. And when I began looking at the cost of\n\n222\n00:13:03,300 --> 00:13:05,780\nthe speech to text APIs and what I was spending,\n\n223\n00:13:06,340 --> 00:13:09,540\nI just thought there is, it's actually, in my opinion,\n\n224\n00:13:09,700 --> 00:13:12,900\njust one of the better deals in API spending and\n\n225\n00:13:12,900 --> 00:13:15,220\nin cloud. Like it's just not that expensive for very,\n\n226\n00:13:15,380 --> 00:13:19,380\nvery good models that are much more, you know, you're\n\n227\n00:13:19,380 --> 00:13:21,960\ngonna be able to run the full model. The latest\n\n228\n00:13:21,960 --> 00:13:25,960\nmodel versus whatever you can run on your average GPU,\n\n229\n00:13:26,200 --> 00:13:29,240\nunless you want to buy a crazy GPU. It doesn't\n\n230\n00:13:29,240 --> 00:13:31,160\nreally make sense to me. Now, privacy is another concern\n\n231\n00:13:32,200 --> 00:13:33,960\nthat I know is kind of like a very much\n\n232\n00:13:34,040 --> 00:13:36,840\na separate thing that people just don't want their voice\n\n233\n00:13:37,080 --> 00:13:40,760\ndata and their voice leaving their local environment, maybe for\n\n234\n00:13:40,760 --> 00:13:44,280\nregulatory reasons as well. But I'm not in that. I\n\n235\n00:13:44,680 --> 00:13:48,920\nneither really care about people listening to my grocery list\n\n236\n00:13:49,160 --> 00:13:51,800\nconsisting of reminding myself that I need to buy more\n\n237\n00:13:51,880 --> 00:13:55,230\nbeer, Cheetos, and hummus, which is kind of the three\n\n238\n00:13:55,390 --> 00:13:59,950\nstaples of my diet during periods of poorer nutrition. 
But\n\n239\n00:14:00,030 --> 00:14:02,510\nthe kind of stuff that I transcribe, it's just not,\n\n240\n00:14:04,030 --> 00:14:07,790\nit's not a privacy thing I'm that sort of sensitive\n\n241\n00:14:07,870 --> 00:14:13,230\nabout and I don't do anything so sensitive or secure\n\n242\n00:14:13,310 --> 00:14:16,510\nthat requires air gapping. So I looked at the pricing\n\n243\n00:14:16,590 --> 00:14:19,870\nand especially the kind of older model mini Some of\n\n244\n00:14:19,950 --> 00:14:22,030\nthem are very, very affordable. And I did a back\n\n245\n00:14:22,270 --> 00:14:25,950\nof the, I did a calculation once with ChatGPT and\n\n246\n00:14:25,950 --> 00:14:29,310\nI was like, okay, this is the API price for\n\n247\n00:14:29,470 --> 00:14:32,350\nI can't remember whatever the model was. Let's say I\n\n248\n00:14:32,430 --> 00:14:35,310\njust go at it like nonstop, which it rarely happens.\n\n249\n00:14:35,550 --> 00:14:38,910\nProbably, I would say on average, I might dictate 30\n\n250\n00:14:38,990 --> 00:14:41,870\nto 60 minutes per day if I was probably summing\n\n251\n00:14:41,870 --> 00:14:47,070\nup the emails, documents, outlines, which\n\n252\n00:14:47,310 --> 00:14:49,950\nis a lot, but it's still a fairly modest amount.\n\n253\n00:14:50,110 --> 00:14:52,020\nAnd I was like, Some days I do go on\n\n254\n00:14:52,180 --> 00:14:54,980\nlike one or two days where I've been usually when\n\n255\n00:14:54,980 --> 00:14:57,060\nI'm like kind of out of the house and just\n\n256\n00:14:57,300 --> 00:15:00,580\nhave something like I have nothing else to do. Like\n\n257\n00:15:00,740 --> 00:15:04,100\nif I'm at a hospital, we have a newborn and\n\n258\n00:15:04,260 --> 00:15:07,380\nyou're waiting for like eight hours and hours for an\n\n259\n00:15:07,460 --> 00:15:10,900\nappointment. And I would probably have listened to podcasts before\n\n260\n00:15:11,460 --> 00:15:14,260\nbecoming a speech fanatic. 
And I'm like, oh, wait, let\n\n261\n00:15:14,420 --> 00:15:16,339\nme just get down. Let me just get these ideas\n\n262\n00:15:16,500 --> 00:15:18,620\nout of my head. And that's when I'll go on\n\n263\n00:15:19,340 --> 00:15:21,900\nmy speech binges. But those are like once every few\n\n264\n00:15:21,900 --> 00:15:25,020\nmonths, like not frequently. But I said, okay, let's just\n\n265\n00:15:25,100 --> 00:15:29,180\nsay if I'm gonna price out Cloud SCT, if I\n\n266\n00:15:29,260 --> 00:15:33,980\nwas like dedicated every second of every waking hour to\n\n267\n00:15:34,140 --> 00:15:37,980\ntranscribing for some odd reason, I mean, I'd have to\n\n268\n00:15:38,060 --> 00:15:40,860\nlike eat and use the toilet. Like, you know, there's\n\n269\n00:15:40,940 --> 00:15:43,500\nonly so many hours I'm awake for. So like, let's\n\n270\n00:15:43,500 --> 00:15:46,700\njust say a maximum of like 40 hour, 45 minutes.\n\n271\n00:15:47,290 --> 00:15:49,370\nIn the hour. Then I said, all right, let's just\n\n272\n00:15:49,370 --> 00:15:52,970\nsay 50. Who knows? You're dictating on the toilet. We\n\n273\n00:15:53,130 --> 00:15:55,130\ndo it. So it could be. You could just do\n\n274\n00:15:55,210 --> 00:15:59,370\n60. But whatever I did. And every day, like, you're\n\n275\n00:15:59,450 --> 00:16:02,810\ngoing flat out seven days a week dictating non-stop I\n\n276\n00:16:02,810 --> 00:16:05,930\nwas like, what's my monthly API bill gonna be at\n\n277\n00:16:06,010 --> 00:16:08,650\nthis price? And it came out to, like, 70 or\n\n278\n00:16:08,650 --> 00:16:10,810\n80 bucks. And I was like, well, that would be\n\n279\n00:16:11,210 --> 00:16:15,780\nan extraordinary. Amount of dictation. And I would hope that\n\n280\n00:16:16,260 --> 00:16:20,020\nthere was some compelling reason more worth more than $70\n\n281\n00:16:20,340 --> 00:16:23,540\nthat I embarked upon that project. 
So given that that's\n\n282\n00:16:23,540 --> 00:16:25,540\nkind of the max point for me, I said that's\n\n283\n00:16:25,620 --> 00:16:29,220\nactually very, very affordable. Now you're gonna, if you want\n\n284\n00:16:29,300 --> 00:16:31,780\nto spec out the costs and you want to do\n\n285\n00:16:31,780 --> 00:16:36,340\nthe post-processing that I really do feel is valuable, that's\n\n286\n00:16:36,420 --> 00:16:40,900\ngonna cost some more as well, unless you're using Gemini,\n\n287\n00:16:41,380 --> 00:16:44,500\nwhich needless to say is a random person sitting in\n\n288\n00:16:44,580 --> 00:16:49,140\nJerusalem. I have no affiliation, nor with Google, nor anthropic,\n\n289\n00:16:49,220 --> 00:16:52,100\nnor Gemini, nor any major tech vendor for that matter.\n\n290\n00:16:53,700 --> 00:16:56,900\nI like Gemini not so much as a everyday model.\n\n291\n00:16:57,380 --> 00:16:59,940\nIt's kind of underwhelmed in that respect, I would say.\n\n292\n00:17:00,340 --> 00:17:02,820\nBut for multimodal, I think it's got a lot to\n\n293\n00:17:02,820 --> 00:17:06,580\noffer. And I think that the transcribing functionality whereby it\n\n294\n00:17:06,660 --> 00:17:11,980\ncan process audio with a system prompt and both give\n\n295\n00:17:12,140 --> 00:17:15,180\nyou transcription that's cleaned up that reduces two steps to\n\n296\n00:17:15,340 --> 00:17:18,300\none. And that for me is a very, very big\n\n297\n00:17:18,460 --> 00:17:21,660\ndeal. And I feel like even Google has haven't really\n\n298\n00:17:21,900 --> 00:17:26,780\nsort of thought through how useful the that modality is\n\n299\n00:17:26,860 --> 00:17:29,340\nand what kind of use cases you can achieve with\n\n300\n00:17:29,420 --> 00:17:31,340\nit. 
Because I found in the course of this year,\n\n301\n00:17:31,980 --> 00:17:36,620\njust an endless list of really kind of system prompt\n\n302\n00:17:36,940 --> 00:17:40,300\nsystem prompt stuff that I can say, okay, I've used\n\n303\n00:17:40,300 --> 00:17:43,500\nit to capture context data for AI, which is literally\n\n304\n00:17:43,580 --> 00:17:45,740\nI might speak for if I wanted to have a\n\n305\n00:17:45,740 --> 00:17:49,820\ngood bank of context data about who knows my childhood\n\n306\n00:17:50,380 --> 00:17:54,300\nmore realistically, maybe my career goals, something that would just\n\n307\n00:17:54,380 --> 00:17:56,780\nbe like really boring to type out. So I'll just\n\n308\n00:17:56,860 --> 00:18:00,860\nlike sit in my car and record it for 10\n\n309\n00:18:00,940 --> 00:18:03,180\nminutes. And that 10 minutes you get a lot of\n\n310\n00:18:03,340 --> 00:18:08,730\ninformation in. Um, emails, which is short text, just\n\n311\n00:18:09,130 --> 00:18:12,330\nthere is a whole bunch and all these workflows kind\n\n312\n00:18:12,490 --> 00:18:14,490\nof require a little bit of treatment afterwards and different\n\n313\n00:18:14,730 --> 00:18:18,170\ntreatment. My context pipeline is kind of like just extract\n\n314\n00:18:18,250 --> 00:18:21,050\nthe bare essentials. So you end up with me talking\n\n315\n00:18:21,130 --> 00:18:23,050\nvery loosely about sort of what I've done in my\n\n316\n00:18:23,130 --> 00:18:25,450\ncareer, where I've worked, where I might like to work.\n\n317\n00:18:25,930 --> 00:18:29,050\nAnd it goes, it condenses that down to very robotic\n\n318\n00:18:29,290 --> 00:18:32,570\nlanguage that is easy to chunk parse and maybe put\n\n319\n00:18:32,650 --> 00:18:36,630\ninto a vector database. Daniel has worked in technology. Daniel\n\n320\n00:18:37,510 --> 00:18:40,230\nhas been working in, you know, stuff like that. 
That's\n\n321\n00:18:40,230 --> 00:18:43,190\nnot how you would speak, but I figure it's probably\n\n322\n00:18:43,430 --> 00:18:47,430\neasier to parse for, after all, robots. So we've almost\n\n323\n00:18:47,510 --> 00:18:49,350\ngot to 20 minutes and this is actually a success\n\n324\n00:18:49,830 --> 00:18:55,190\nbecause I wasted 20 minutes of the evening speaking\n\n325\n00:18:55,270 --> 00:18:59,990\ninto a microphone and the levels were shot and it\n\n326\n00:18:59,990 --> 00:19:01,670\nwas clipping and I said, I can't really do an\n\n327\n00:19:01,750 --> 00:19:04,070\nevaluation. I have to be fair. I have to give\n\n328\n00:19:04,640 --> 00:19:08,000\nthe models a chance to do their thing. What am\n\n329\n00:19:08,000 --> 00:19:10,400\nI hoping to achieve in this? Okay, my fine tune\n\n330\n00:19:10,400 --> 00:19:13,440\nwas a dud as mentioned. DeepChrom ST, I'm really, really\n\n331\n00:19:13,520 --> 00:19:16,560\nhopeful that this prototype will work and it's a build\n\n332\n00:19:16,800 --> 00:19:19,360\nin public open source, so anyone is welcome to use\n\n333\n00:19:19,440 --> 00:19:22,400\nit if I make anything good. But that was really\n\n334\n00:19:22,560 --> 00:19:26,560\nexciting for me last night when after hours of trying\n\n335\n00:19:26,640 --> 00:19:30,560\nmy own prototype, seeing someone just made something that works\n\n336\n00:19:30,720 --> 00:19:32,480\nlike that, you know, you're not gonna have to build\n\n337\n00:19:32,720 --> 00:19:37,540\na custom conda environment and image. I have AMD GPU,\n\n338\n00:19:37,700 --> 00:19:41,060\nwhich makes things much more complicated. I didn't find it.\n\n339\n00:19:41,620 --> 00:19:43,060\nAnd I was about to give up and I said,\n\n340\n00:19:43,140 --> 00:19:45,540\nall right, let me just give Deep Grams Linux thing\n\n341\n00:19:46,020 --> 00:19:49,300\na shot. 
And if this doesn't work, I'm just going\n\n342\n00:19:49,300 --> 00:19:51,060\nto go back to trying to Vibe code something myself.\n\n343\n00:19:51,700 --> 00:19:55,540\nAnd when I ran the script, I was using Claude\n\n344\n00:19:55,620 --> 00:19:59,140\ncode to do the installation process. It ran the script\n\n345\n00:19:59,220 --> 00:20:02,100\nand oh my gosh, it works just like that. The\n\n346\n00:20:02,180 --> 00:20:06,060\ntricky thing For all those who want to know all\n\n347\n00:20:06,060 --> 00:20:11,340\nthe nitty gritty details, was that I\n\n348\n00:20:11,340 --> 00:20:14,460\ndon't think it was actually struggling with transcription, but pasting\n\n349\n00:20:14,780 --> 00:20:18,220\nWayland makes life very hard. And I think there was\n\n350\n00:20:18,300 --> 00:20:21,580\nsomething not running the right time. Anyway, Deepgram, I looked\n\n351\n00:20:21,580 --> 00:20:23,900\nat how they actually handled that because it worked out\n\n352\n00:20:23,980 --> 00:20:26,620\nof the box when other stuff didn't. And it was\n\n353\n00:20:27,180 --> 00:20:30,650\nquite a clever little mechanism. And but more so than\n\n354\n00:20:30,730 --> 00:20:33,370\nthat, the accuracy was brilliant. Now, what am I doing\n\n355\n00:20:33,370 --> 00:20:36,010\nhere? This is going to be a 20 minute audio\n\n356\n00:20:36,570 --> 00:20:42,090\nsample. And I think I've done one or two\n\n357\n00:20:42,250 --> 00:20:46,650\nof these before, but I did it with short snappy\n\n358\n00:20:46,810 --> 00:20:49,850\nvoice notes. This is kind of long form. This actually\n\n359\n00:20:50,090 --> 00:20:52,250\nmight be a better approximation for what's useful to me\n\n360\n00:20:52,410 --> 00:20:55,970\nthan voice memos. Like, I need to buy three Bread,\n\n361\n00:20:56,050 --> 00:20:58,690\neaters of milk tomorrow and Peter bread, which is probably\n\n362\n00:20:58,850 --> 00:21:01,410\nhow like half my voice notes sound. 
Like if anyone\n\n363\n00:21:01,890 --> 00:21:04,130\nwere to, I don't know, like find my phone, they'd\n\n364\n00:21:04,130 --> 00:21:05,650\nbe like, this is the most boring person in the\n\n365\n00:21:05,650 --> 00:21:09,410\nworld. Although actually, there are some like kind of journaling\n\n366\n00:21:09,410 --> 00:21:11,570\nthoughts as well, but it's a lot of content like\n\n367\n00:21:11,570 --> 00:21:14,530\nthat. And the probably for the evaluation, the most useful\n\n368\n00:21:14,610 --> 00:21:20,290\nthing is slightly obscure tech, GitHub, NeocleNo, hugging\n\n369\n00:21:20,370 --> 00:21:23,020\nface, Not so obscure that it's not going to have\n\n370\n00:21:23,100 --> 00:21:26,540\na chance of knowing it, but hopefully sufficiently well known\n\n371\n00:21:26,540 --> 00:21:28,780\nthat the model should get it. I tried to do\n\n372\n00:21:28,860 --> 00:21:31,660\na little bit of speaking really fast and speaking very\n\n373\n00:21:31,820 --> 00:21:35,100\nslowly. I would say in general, I've spoken, delivered this\n\n374\n00:21:35,260 --> 00:21:37,580\nat a faster pace than I usually would owing to\n\n375\n00:21:38,060 --> 00:21:42,540\nstrong coffee flowing through my bloodstream. And the thing that\n\n376\n00:21:42,540 --> 00:21:44,780\nI'm not going to get in this benchmark is background\n\n377\n00:21:44,860 --> 00:21:46,540\nnoise, which in my first take that I had to\n\n378\n00:21:46,540 --> 00:21:49,790\nget rid of, My wife came in with my son\n\n379\n00:21:50,110 --> 00:21:52,430\nand for a goodnight kiss. And that actually would have\n\n380\n00:21:52,430 --> 00:21:56,590\nbeen super helpful to get in because it was non\n\n381\n00:21:56,670 --> 00:22:00,270\ndiarized or if we had diarization, a female, I could\n\n382\n00:22:00,270 --> 00:22:02,510\nsay, I want the male voice and that wasn't intended\n\n383\n00:22:02,510 --> 00:22:05,950\nfor transcription. 
And we're not going to get background noise\n\n384\n00:22:06,030 --> 00:22:08,350\nlike people honking their horns, which is something I've done\n\n385\n00:22:08,510 --> 00:22:11,230\nin my main data set where I am trying to\n\n386\n00:22:11,470 --> 00:22:14,420\ngo back to some of my voice notes. Annotate them\n\n387\n00:22:14,660 --> 00:22:16,500\nand run a benchmark. But this is going to be\n\n388\n00:22:16,500 --> 00:22:21,780\njust a pure quick test. And as someone,\n\n389\n00:22:22,340 --> 00:22:24,740\nI'm working on a voice note idea. That's my sort\n\n390\n00:22:24,740 --> 00:22:28,740\nof end motivation. Besides thinking it's an ask to the\n\n391\n00:22:28,740 --> 00:22:32,420\noutstanding technology that's coming to viability. And really, I know\n\n392\n00:22:32,500 --> 00:22:36,020\nthis sounds cheesy, can actually have a very transformative effect.\n\n393\n00:22:37,060 --> 00:22:41,210\nIt's, you know, voice technology has been life changing for\n\n394\n00:22:42,010 --> 00:22:47,050\nfolks living with disabilities. And I think\n\n395\n00:22:47,210 --> 00:22:49,050\nthere's something really nice about the fact that it can\n\n396\n00:22:49,210 --> 00:22:52,570\nalso benefit, you know, folks who are able bodied and\n\n397\n00:22:52,730 --> 00:22:57,770\nlike we can all in different ways make this tech\n\n398\n00:22:57,850 --> 00:23:00,490\nas useful as possible, regardless of the exact way that\n\n399\n00:23:00,490 --> 00:23:03,850\nwe're using it. And I think there's something very powerful\n\n400\n00:23:03,930 --> 00:23:06,520\nin that and it can be very cool. I see\n\n401\n00:23:06,680 --> 00:23:10,280\nhuge potential. What excites me about Voicetech? A lot of\n\n402\n00:23:10,360 --> 00:23:14,440\nthings actually. Firstly, the fact that it's cheap and accurate,\n\n403\n00:23:14,520 --> 00:23:17,160\nas I mentioned at the very start of this. 
And\n\n404\n00:23:17,320 --> 00:23:19,960\nit's getting better and better with stuff like accent handling.\n\n405\n00:23:20,760 --> 00:23:23,480\nI'm not sure my fine-tune will actually ever come to\n\n406\n00:23:23,560 --> 00:23:25,400\nfruition in the sense that I'll use it day to\n\n407\n00:23:25,480 --> 00:23:28,920\nday as I imagine. I get like superb flawless words\n\n408\n00:23:29,000 --> 00:23:33,420\nerror rates because I'm just kind of skeptical about Local\n\n409\n00:23:33,580 --> 00:23:37,180\nspeech to text, as I mentioned, and I think the\n\n410\n00:23:37,260 --> 00:23:40,780\npace of innovation and improvement in the models, the main\n\n411\n00:23:40,940 --> 00:23:44,700\nreasons for fine tuning from what I've seen have been\n\n412\n00:23:44,860 --> 00:23:47,500\npeople who are something that really blows my mind about\n\n413\n00:23:48,060 --> 00:23:53,180\nASR is the idea that it's inherently a lingual or\n\n414\n00:23:53,340 --> 00:23:58,650\nmultilingual phonetic based. So as folks who use speak\n\n415\n00:23:58,970 --> 00:24:02,330\nvery obscure languages, that there might be a paucity of\n\n416\n00:24:02,330 --> 00:24:04,970\ntraining data or almost none at all, and therefore the\n\n417\n00:24:04,970 --> 00:24:10,170\naccuracy is significantly reduced. Or folks in very critical\n\n418\n00:24:10,410 --> 00:24:14,330\nenvironments, I know this is used extensively in medical transcription\n\n419\n00:24:14,410 --> 00:24:19,210\nand dispatcher work, the call centers who send out ambulances,\n\n420\n00:24:19,290 --> 00:24:23,210\net cetera, where accuracy is absolutely paramount. And in the\n\n421\n00:24:23,210 --> 00:24:26,940\ncase of doctors, radiologist, they might be using very specialized\n\n422\n00:24:26,940 --> 00:24:29,500\nvocab all the time. 
So those are kind of the\n\n423\n00:24:29,580 --> 00:24:31,500\nmain two things that I'm not sure that really just\n\n424\n00:24:31,580 --> 00:24:35,020\nfor trying to make it better on a few random\n\n425\n00:24:35,020 --> 00:24:37,980\ntech words with my slightly, I mean, I have an\n\n426\n00:24:38,060 --> 00:24:41,100\naccent, but like not, you know, an accent that a\n\n427\n00:24:41,180 --> 00:24:45,980\nfew other million people have ish. I'm not sure that\n\n428\n00:24:46,460 --> 00:24:50,380\nmy little fine tune is gonna actually like the bump\n\n429\n00:24:50,540 --> 00:24:53,580\nin word error reduction, if I ever actually figure out\n\n430\n00:24:53,580 --> 00:24:54,700\nhow to do it and get it up to the\n\n431\n00:24:54,780 --> 00:24:57,950\ncloud. By the time we've done that, I suspect that\n\n432\n00:24:58,270 --> 00:25:00,510\nthe next generation of ASR will just be so good\n\n433\n00:25:00,590 --> 00:25:03,070\nthat it will kind of be, well, that would have\n\n434\n00:25:03,070 --> 00:25:04,750\nbeen cool if it worked out, but I'll just use\n\n435\n00:25:04,830 --> 00:25:08,590\nthis instead. So that's going to be it for today's\n\n436\n00:25:08,910 --> 00:25:14,110\nepisode of voice training data. Single long shot evaluation.\n\n437\n00:25:14,430 --> 00:25:17,230\nWho am I going to compare? Whisper is always good\n\n438\n00:25:17,230 --> 00:25:20,590\nas a benchmark, but I'm more interested in seeing Whisper\n\n439\n00:25:20,670 --> 00:25:24,590\nhead to head with two things, really. One is Whisper\n\n440\n00:25:24,670 --> 00:25:29,780\nvariants. So you've got these projects like faster Distill Whisper,\n\n441\n00:25:29,860 --> 00:25:31,780\nit's a bit confusing, there's a whole bunch of them.\n\n442\n00:25:32,100 --> 00:25:35,380\nAnd the emerging ASRs, which are also a thing. 
My\n\n443\n00:25:35,460 --> 00:25:37,300\nintention for this is I'm not sure I'm going to\n\n444\n00:25:37,300 --> 00:25:39,940\nhave the time in any point in the foreseeable future\n\n445\n00:25:40,260 --> 00:25:44,660\nto go back through this whole episode and create a\n\n446\n00:25:44,740 --> 00:25:49,780\nproper source truth, where I fix everything. Might do\n\n447\n00:25:49,860 --> 00:25:52,820\nit if I can get one transcriptions that sufficiently close\n\n448\n00:25:53,060 --> 00:25:57,120\nto perfection. But what I would actually love to do\n\n449\n00:25:57,280 --> 00:26:00,000\non Hugging Face, I think would be a great probably\n\n450\n00:26:00,320 --> 00:26:02,960\nhow I might visualize this is having the audio waveform\n\n451\n00:26:03,280 --> 00:26:08,240\nplay and then have the transcript for each model below\n\n452\n00:26:08,240 --> 00:26:12,640\nit and maybe even a like, you know, to scale\n\n453\n00:26:13,200 --> 00:26:15,680\nand maybe even a local one as well, like local\n\n454\n00:26:15,840 --> 00:26:21,180\nwhisper versus OpenAI API, et cetera. And, I\n\n455\n00:26:21,260 --> 00:26:23,580\ncan then actually listen back to segments or anyone who\n\n456\n00:26:23,580 --> 00:26:25,900\nwants to can listen back to segments of this recording\n\n457\n00:26:26,220 --> 00:26:31,020\nand see where a particular model struggled and others didn't,\n\n458\n00:26:31,500 --> 00:26:33,420\nas well as the sort of headline finding of which\n\n459\n00:26:33,580 --> 00:26:36,940\nhad the best WER, but that would require the source\n\n460\n00:26:36,940 --> 00:26:39,660\nof truth. Okay, that's it. I hope this was, I\n\n461\n00:26:39,660 --> 00:26:42,620\ndon't know, maybe useful for other folks interested in STT.\n\n462\n00:26:42,940 --> 00:26:45,740\nYou want to see that I always feel think I've\n\n463\n00:26:45,740 --> 00:26:48,950\njust said as something I didn't intend to. STT, I\n\n464\n00:26:48,950 --> 00:26:52,550\nsaid for those. 
Listen carefully, including hopefully the models themselves.\n\n465\n00:26:53,270 --> 00:26:57,350\nThis has been myself, Daniel Rosell. For more jumbled repositories\n\n466\n00:26:57,430 --> 00:27:01,830\nabout my roving interests in AI, but particularly agentic, MCP\n\n467\n00:27:02,070 --> 00:27:07,109\nand Voicetech, you can find me on GitHub, huggingface.com,\n\n468\n00:27:10,310 --> 00:27:13,350\nwhich is my personal website, as well as this podcast,\n\n469\n00:27:13,590 --> 00:27:17,030\nwhose name I sadly cannot remember. Until next time, thanks\n\n470\n00:27:17,030 --> 00:27:17,590\nfor listening.\n\n", "gladia": "1\n00:00:00.172 --> 00:00:15.108\nHello and welcome to a audio data set consisting of one single episode of a non-existent podcast or it uh i may append this to a podcast that i set up recently um\n\n2\n00:00:15.467 --> 00:00:29.435\nregarding my uh with my thoughts on speech tech and ai in particular more ai and generative ai i would uh i would say but in any event the purpose of this um\n\n3\n00:00:30.219 --> 00:00:36.545\nvoice recording is actually to create a lengthy voice sample for a quick evaluation,\n\n4\n00:00:36.546 --> 00:00:38.088\na back of the envelope evaluation,\n\n5\n00:00:38.390 --> 00:00:39.148\nas they might say,\n\n6\n00:00:39.749 --> 00:00:41.273\nfor different speech to text models.\n\n7\n00:00:41.274 --> 00:00:42.195\nAnd I'm doing this because\n\n8\n00:00:42.975 --> 00:00:46.655\nI thought I'd made a great breakthrough in my journey with speech tech,\n\n9\n00:00:47.234 --> 00:00:50.999\nand that was succeeding in the elusive task of fine tuning Whisper.\n\n10\n00:00:51.780 --> 00:00:52.655\nWhisper is,\n\n11\n00:00:52.920 --> 00:00:58.890\nand I'm going to just talk i'm trying to mix up uh i'm going to try a few different\n\n12\n00:00:59.524 --> 00:01:18.581\nstyles of speaking i might whisper something at some points as well and i'll go back to speaking loud in uh in different parts i'm going to sound really like a 
crazy person because i'm also going to try to speak at different pitches and cadences in order to really try to put a\n\n13\n00:01:18.706 --> 00:01:28.831\nspeech attacks model through its paces which is trying to make sense of is this guy just rambling on incoherently in one long sentence or are these\n\n14\n00:01:29.652 --> 00:01:33.436\njust actually a series of step,\n\n15\n00:01:33.734 --> 00:01:34.355\nstandalone,\n\n16\n00:01:34.415 --> 00:01:34.918\nstep alone,\n\n17\n00:01:35.016 --> 00:01:36.200\nstandalone sentences.\n\n18\n00:01:36.519 --> 00:01:38.040\nAnd how is it going to handle step alone?\n\n19\n00:01:38.078 --> 00:01:38.680\nThat's not a word.\n\n20\n00:01:39.859 --> 00:01:43.343\nWhat happens when you use speech to text and you use a fake word?\n\n21\n00:01:43.367 --> 00:01:43.884\nAnd then you're like,\n\n22\n00:01:43.923 --> 00:01:44.063\nwait,\n\n23\n00:01:44.087 --> 00:01:44.703\nthat's not actually,\n\n24\n00:01:45.468 --> 00:01:46.328\nthat word doesn't exist.\n\n25\n00:01:47.048 --> 00:01:48.266\nHow does AI handle that?\n\n26\n00:01:48.484 --> 00:01:55.359\nAnd these and more are all the questions that I'm seeking to answer in this training data.\n\n27\n00:01:56.001 --> 00:01:56.141\nNow,\n\n28\n00:01:56.359 --> 00:01:56.718\nwhy did,\n\n29\n00:01:56.843 --> 00:01:58.266\nwhy was it trying to fine tune Whisper?\n\n30\n00:01:58.787 --> 00:02:16.968\nwhat is whisper as i said i'm gonna try to uh record this at a couple of different levels of technicality for folks who are uh you know in the normal uh world and not totally stuck down the rabbit hole of ai which i have to say is a really wonderful uh rabbit hole to be to\n\n31\n00:02:16.969 --> 00:02:27.735\nbe down um it's a really interesting area and speech and voice tech is is the aspect of it that i find actually most i'm not sure i would say the most interesting because there's\n\n32\n00:02:28.147 --> 00:02:30.349\nJust so much that is fascinating in AI.\n\n33\n00:02:31.372 --> 
00:02:41.520\nBut the most that I find the most personally transformative in terms of the impact that it's had on my daily work life and productivity and how I sort of work.\n\n34\n00:02:42.082 --> 00:02:42.379\nAnd\n\n35\n00:02:43.183 --> 00:02:47.230\nI'm persevering hard with the task of training,\n\n36\n00:02:47.231 --> 00:02:47.527\nI guess,\n\n37\n00:02:47.730 --> 00:02:49.762\na good solution working for Linux,\n\n38\n00:02:50.122 --> 00:02:51.683\nwhich if anyone actually does listen to this,\n\n39\n00:02:51.777 --> 00:02:54.355\nnot just for the training data and for the actual content,\n\n40\n00:02:55.247 --> 00:02:56.497\nthis is this is sparked.\n\n41\n00:02:56.762 --> 00:02:57.044\nI had\n\n42\n00:02:58.056 --> 00:03:13.229\nbesides the fine-tune not working well that was the failure um i used plod code because one thinks these days that there is nothing short of solving you know the uh the\n\n43\n00:03:13.368 --> 00:03:24.518\nreason of life or something uh that plod and agentic ai can't do uh which is not really the case uh it does seem that way sometimes but it fails a lot as well and this is one of those\n\n44\n00:03:25.304 --> 00:03:29.768\ninstances where last week I put together an hour of voice training data,\n\n45\n00:03:30.528 --> 00:03:31.229\nbasically speaking,\n\n46\n00:03:31.271 --> 00:03:33.174\njust random things for three minutes.\n\n47\n00:03:33.407 --> 00:03:38.618\nAnd it was actually kind of tedious because the texts were really weird.\n\n48\n00:03:38.674 --> 00:03:39.174\nSome of them were,\n\n49\n00:03:39.556 --> 00:03:40.080\nit was like,\n\n50\n00:03:40.361 --> 00:03:40.596\nit was\n\n51\n00:03:41.127 --> 00:03:41.939\nAI generated.\n\n52\n00:03:42.721 --> 00:03:45.518\nI tried before to read Sherlock Holmes for an hour and I just couldn't,\n\n53\n00:03:45.564 --> 00:03:48.893\nI was so bored after 10 minutes that I was like,\n\n54\n00:03:48.894 --> 00:03:49.064\nokay,\n\n55\n00:03:49.066 --> 00:03:51.705\nI know I'm just 
going to have to find something else to read.\n\n56\n00:03:51.752 --> 00:03:51.877\nSo\n\n57\n00:03:52.907 --> 00:03:53.705\nI used...\n\n58\n00:03:54.207 --> 00:04:11.201\na created with AI studio vibe coded a synthetic text generator which actually I thought was probably a better way of doing it because it would give me more short samples with more varied content so I was like okay give me a\n\n59\n00:04:11.248 --> 00:04:22.858\nvoice note like I'm recording an email give me a short story to read give me prose to read so it came up with all these different things and they added a little timer to it so I could see.\n\n60\n00:04:23.295 --> 00:04:50.961\nhow close i was to one hour um and uh i spent like an hour one afternoon or probably two hours by the time you um you do retakes and whatever because you want to it gave me a source of truth which i'm not sure if that's the scientific way to approach this topic of gathering uh training data but i thought made sense um i have a lot of audio data from recording voice notes which I've also kind of used\n\n61\n00:04:52.117 --> 00:04:52.384\nBean.\n\n62\n00:04:52.755 --> 00:05:02.007\nexperimenting with using for a different purpose slightly different annotating task types it's more text classification experiment or\n\n63\n00:05:02.836 --> 00:05:02.956\nWell,\n\n64\n00:05:02.956 --> 00:05:03.497\nit's more than that,\n\n65\n00:05:03.536 --> 00:05:03.776\nactually.\n\n66\n00:05:03.778 --> 00:05:04.857\nI'm working on a voice app.\n\n67\n00:05:04.937 --> 00:05:07.660\nSo it's a prototype,\n\n68\n00:05:07.680 --> 00:05:07.980\nI guess,\n\n69\n00:05:08.019 --> 00:05:09.000\nis really more accurate.\n\n70\n00:05:11.382 --> 00:05:13.805\nBut you can do that and you can work backwards.\n\n71\n00:05:13.843 --> 00:05:14.187\nYou're like,\n\n72\n00:05:14.343 --> 00:05:19.757\nyou listen back to a voice note and you painfully go through one of those transcribing,\n\n73\n00:05:19.992 --> 00:05:20.226\nyou know,\n\n74\n00:05:20.274 
--> 00:05:23.413\nwhere you start and stop and scrub around it and you fix the errors.\n\n75\n00:05:23.415 --> 00:05:24.117\nBut it's really,\n\n76\n00:05:24.180 --> 00:05:25.538\nreally boring to do that.\n\n77\n00:05:26.163 --> 00:05:31.680\nSo I thought it would be less tedious in the long term if I just recorded the source of truth.\n\n78\n00:05:32.247 --> 00:05:34.190\nSo it gave me these three minute snippets.\n\n79\n00:05:34.428 --> 00:05:38.593\nI recorded them and saved an MP3 and a TXT in the same folder.\n\n80\n00:05:38.855 --> 00:05:40.500\nAnd I created an error of that data.\n\n81\n00:05:41.975 --> 00:05:43.038\nSo I was very hopeful,\n\n82\n00:05:43.398 --> 00:05:43.781\nquietly,\n\n83\n00:05:43.898 --> 00:05:44.117\nyou know,\n\n84\n00:05:44.117 --> 00:05:47.725\na little bit hopeful that I would be able that I could actually fine tune Whisper.\n\n85\n00:05:48.586 --> 00:05:53.100\nI want to fine tune Whisper because when I got into voice tech last November,\n\n86\n00:05:54.242 --> 00:05:57.538\nmy wife was in the US and I was alone at home and,\n\n87\n00:05:57.819 --> 00:05:58.053\nyou know,\n\n88\n00:05:58.069 --> 00:05:59.117\nwent crazy.\n\n89\n00:05:59.444 --> 00:06:12.454\npeople like me do really wild things like use voice to tech technology that was basically when I started doing it I didn't feel like a crazy person speaking to myself and my expectations weren't that high\n\n90\n00:06:13.336 --> 00:06:26.509\nI used speech tech now and again tried it out I was like it'd be really cool if you could just like speak into your computer and whatever I tried out that had support was just it was not good basically\n\n91\n00:06:27.500 --> 00:06:29.440\nAnd this blew me away from the first go.\n\n92\n00:06:29.480 --> 00:06:29.701\nI mean,\n\n93\n00:06:29.701 --> 00:06:30.860\nit wasn't 100%\n\n94\n00:06:31.841 --> 00:06:33.360\naccurate out of the box and it took work,\n\n95\n00:06:33.942 --> 00:06:41.302\nbut it was good enough that there was a 
solid foundation and it kind of passed that pivot point that it's actually worth doing this.\n\n96\n00:06:41.942 --> 00:06:42.185\nYou know,\n\n97\n00:06:42.185 --> 00:06:46.418\nthere's a point where it's so like the transcript is you don't have to get 100%\n\n98\n00:06:46.482 --> 00:06:48.262\naccuracy for it to be worth your time,\n\n99\n00:06:49.091 --> 00:06:51.668\nfor a speech to text to be a worthwhile addition to your productivity.\n\n100\n00:06:51.778 --> 00:06:53.043\nBut you do need to get above,\n\n101\n00:06:53.091 --> 00:06:53.418\nlet's say,\n\n102\n00:06:53.528 --> 00:06:53.887\nI don't know,\n\n103\n00:06:53.966 --> 00:06:54.451\n85%.\n\n104\n00:06:54.466 --> 00:06:54.887\npercent.\n\n105\n00:06:55.711 --> 00:06:56.651\nIf it's 60%\n\n106\n00:06:57.031 --> 00:06:57.413\nor 50%,\n\n107\n00:06:57.793 --> 00:06:58.692\nyou inevitably say,\n\n108\n00:06:59.173 --> 00:06:59.512\nscrew it,\n\n109\n00:06:59.514 --> 00:07:05.033\nI'll just type it because you end up missing errors in the transcript and it becomes actually worse.\n\n110\n00:07:05.110 --> 00:07:06.978\nYou end up in a worse position than you started with it.\n\n111\n00:07:06.978 --> 00:07:07.915\nThat's been my experience.\n\n112\n00:07:08.555 --> 00:07:08.673\nSo\n\n113\n00:07:10.572 --> 00:07:10.915\nI was like,\n\n114\n00:07:10.994 --> 00:07:11.134\noh,\n\n115\n00:07:11.158 --> 00:07:12.228\nthis is actually really,\n\n116\n00:07:12.274 --> 00:07:12.838\nreally good now.\n\n117\n00:07:12.930 --> 00:07:13.555\nHow did that happen?\n\n118\n00:07:13.603 --> 00:07:15.040\nAnd the answer is ASR,\n\n119\n00:07:15.680 --> 00:07:20.072\nWhisper being open sourced and the transformer architecture.\n\n120\n00:07:20.072 --> 00:07:21.619\nIf you want to go back to the\n\n121\n00:07:23.319 --> 00:07:24.120\nto the underpinnings,\n\n122\n00:07:24.139 --> 00:07:25.660\nwhich really blows my mind.\n\n123\n00:07:25.920 --> 00:07:29.480\nAnd it's on my list to read through that paper.\n\n124\n00:07:30.422 
--> 00:07:38.444\nAll you need is attention as attentively as can be done with my limited brain because it's super,\n\n125\n00:07:38.500 --> 00:07:39.819\nsuper high level stuff.\n\n126\n00:07:41.461 --> 00:07:42.350\nSuper advanced stuff,\n\n127\n00:07:42.367 --> 00:07:42.678\nI mean.\n\n128\n00:07:43.100 --> 00:07:44.100\nBut that,\n\n129\n00:07:44.507 --> 00:07:52.600\nI think of all the things that are fascinating about the sudden rise in AI and the dramatic capabilities.\n\n130\n00:07:53.507 --> 00:07:55.048\nI find it fascinating that few people are like,\n\n131\n00:07:55.189 --> 00:07:55.490\nhang on,\n\n132\n00:07:56.009 --> 00:07:58.994\nyou've got this thing that can speak to you like a chatbot,\n\n133\n00:07:58.995 --> 00:07:59.634\nan LLM.\n\n134\n00:08:00.576 --> 00:08:02.600\nAnd then you've got image generation.\n\n135\n00:08:02.959 --> 00:08:03.076\nOK,\n\n136\n00:08:03.139 --> 00:08:03.521\nso firstly,\n\n137\n00:08:03.639 --> 00:08:07.341\nthose two things on the surface have nothing in common.\n\n138\n00:08:08.545 --> 00:08:08.826\nSo like,\n\n139\n00:08:08.904 --> 00:08:09.505\nhow are they?\n\n140\n00:08:10.427 --> 00:08:12.286\nHow did that just happen all at the same time?\n\n141\n00:08:12.302 --> 00:08:13.411\nAnd then when you extend that further,\n\n142\n00:08:14.944 --> 00:08:15.630\nyou're like Suno,\n\n143\n00:08:16.036 --> 00:08:16.225\nright?\n\n144\n00:08:16.271 --> 00:08:20.896\nYou can sing a song and AI will like come up with an instrumental.\n\n145\n00:08:21.516 --> 00:08:22.637\nAnd then you've got Whisper.\n\n146\n00:08:22.757 --> 00:08:23.077\nAnd you're like,\n\n147\n00:08:23.079 --> 00:08:23.699\nwait a second.\n\n148\n00:08:24.158 --> 00:08:25.201\nHow did all this stuff,\n\n149\n00:08:25.319 --> 00:08:26.598\nlike if it's all AI,\n\n150\n00:08:27.262 --> 00:08:27.603\nwhat's,\n\n151\n00:08:27.942 --> 00:08:29.384\nlike there has to be some commonality.\n\n152\n00:08:29.543 --> 00:08:30.161\nOtherwise,\n\n153\n00:08:30.865 
--> 00:08:34.707\nthese are totally different technologies on the surface of it.\n\n154\n00:08:34.888 --> 00:08:37.990\nAnd the transformer architecture is,\n\n155\n00:08:38.349 --> 00:08:39.067\nas far as I know,\n\n156\n00:08:39.240 --> 00:08:40.162\nthe answer.\n\n157\n00:08:40.332 --> 00:08:41.192\nAnd I can't even say,\n\n158\n00:08:41.302 --> 00:08:47.287\ncan't even pretend that I really understand what the transformer architecture means in depth.\n\n159\n00:08:47.317 --> 00:08:48.457\nBut I have scanned this.\n\n160\n00:08:48.707 --> 00:08:49.629\nAnd as I said,\n\n161\n00:08:49.707 --> 00:08:50.599\nI want to...\n\n162\n00:08:50.840 --> 00:09:01.552\nprinted and really kind of think over it at some point and I'll probably feel bad about myself I think because weren't those guys in their in their 20s like that's crazy\n\n163\n00:09:02.208 --> 00:09:11.177\nI think I asked chat gpt once who were the who wrote that paper and how old were they when it was published in arcs if and I was expecting like\n\n164\n00:09:11.662 --> 00:09:20.067\nI don't know what do you what do you imagine I personally imagine kind of like you know you have these breakthroughs during covid and things like that where like these kind of\n\n165\n00:09:20.543 --> 00:09:22.184\nreally obscure scientists who are like in their\n\n166\n00:09:22.524 --> 00:09:41.356\n50s and they've just kind of been laboring in labs and uh wearily and writing and publishing in kind of obscure academic publications and they finally like hit a big or win a noble prize and then their household household names uh so that was kind of what i had in mind that was the mental image i'd formed of the\n\n167\n00:09:41.919 --> 00:09:49.809\nbirth of arcs of like i wasn't expecting 20 somethings in san francisco though i i thought that was both very very funny very cool and actually kind of inspiring\n\n168\n00:09:50.580 --> 00:09:52.484\nIt's nice to think that people who,\n\n169\n00:09:53.488 --> 00:09:53.729\nyou 
know,\n\n170\n00:09:53.927 --> 00:09:56.294\njust you might put them in the kind of.\n\n171\n00:09:56.966 --> 00:10:12.508\nmilieu or bubble or world that you are in or credibly in through you know the series of connections that are coming up with such literally world-changing um innovations uh so that was i thought anyway that's that that was cool okay\n\n172\n00:10:12.570 --> 00:10:24.687\nvoice training data how are we doing we're about 10 minutes and i'm still talking about voice technology um so whisper was brilliant and i was so excited that i was my first instinct was to like guess\n\n173\n00:10:25.066 --> 00:10:25.326\nIt's like,\n\n174\n00:10:25.326 --> 00:10:25.807\noh my gosh,\n\n175\n00:10:25.826 --> 00:10:27.609\nI have to get like a really good microphone for this.\n\n176\n00:10:28.169 --> 00:10:28.288\nSo\n\n177\n00:10:29.370 --> 00:10:31.471\nI didn't go on a spending spree because I said,\n\n178\n00:10:31.592 --> 00:10:34.432\nI'm going to have to just wait a month and see if I still use this.\n\n179\n00:10:35.198 --> 00:10:37.596\nAnd it just kind of became,\n\n180\n00:10:38.019 --> 00:10:40.823\nit's become really part of my daily routine.\n\n181\n00:10:41.863 --> 00:10:43.003\nLike if I'm writing an email,\n\n182\n00:10:43.269 --> 00:10:44.503\nI'll record a voice note.\n\n183\n00:10:45.049 --> 00:10:46.284\nAnd then I've developed.\n\n184\n00:10:46.784 --> 00:10:50.534\nAnd it's nice to see that everyone is like developing the same things in parallel.\n\n185\n00:10:50.566 --> 00:10:52.409\nLike that's kind of a weird thing to say.\n\n186\n00:10:52.488 --> 00:10:53.549\nBut when I look,\n\n187\n00:10:53.659 --> 00:10:53.769\nI...\n\n188\n00:10:54.298 --> 00:11:11.754\nkind of came when i started working on this uh these prototypes on github which is where i just kind of share very freely and loosely uh ideas and you know first iterations on on concepts um and for want of a better word i called it like uh\n\n189\n00:11:11.754 --> 
00:11:21.441\nllm post-processing or cleanup or basically a system prompt that after you get back the raw text from whisper you run it through a model and say,\n\n190\n00:11:21.566 --> 00:11:21.738\nokay,\n\n191\n00:11:21.784 --> 00:11:22.909\nthis is crappy.\n\n192\n00:11:23.785 --> 00:11:33.653\ntext like add sentence structure and you know fix it up and now when I'm exploring the different tools that are out there that people have built\n\n193\n00:11:34.216 --> 00:11:49.996\nI see quite a number of projects have basically you know done the same thing lest that be misconstrued I'm not saying for a millisecond that I inspired them I'm sure this has been a thing that's been integrated into tools for a while but it's\n\n194\n00:11:50.710 --> 00:11:53.312\nIt's the kind of thing that when you start using these tools every day,\n\n195\n00:11:53.613 --> 00:12:02.100\nthe need for it is almost instantly apparent because text that doesn't have any punctuation or paragraph spacing takes a long time to,\n\n196\n00:12:02.842 --> 00:12:03.086\nyou know,\n\n197\n00:12:03.163 --> 00:12:06.023\nit takes so long to get it into a presentable email that again,\n\n198\n00:12:06.086 --> 00:12:06.241\nit's,\n\n199\n00:12:06.428 --> 00:12:06.600\nit's,\n\n200\n00:12:06.788 --> 00:12:06.928\nit,\n\n201\n00:12:07.086 --> 00:12:13.006\nit moves speech tech into that before that inflection point where you're like,\n\n202\n00:12:13.008 --> 00:12:13.131\nnah,\n\n203\n00:12:13.133 --> 00:12:13.836\nit's just not worth it.\n\n204\n00:12:13.850 --> 00:12:14.491\nIt's like,\n\n205\n00:12:15.178 --> 00:12:16.898\nit'll just be quicker to type this.\n\n206\n00:12:17.428 --> 00:12:18.336\nSo it's a big,\n\n207\n00:12:18.350 --> 00:12:19.461\nit's a little touch that actually.\n\n208\n00:12:20.289 --> 00:12:20.791\nis a big deal.\n\n209\n00:12:21.672 --> 00:12:22.373\nSo I was on\n\n210\n00:12:22.712 --> 00:12:28.100\nWhisper and I've been using Whisper and I kind of early on found a couple of 
tools.\n\n211\n00:12:28.458 --> 00:12:30.419\nI couldn't find what I was looking for on Linux,\n\n212\n00:12:30.498 --> 00:12:35.725\nwhich is basically just something that'll run in the background.\n\n213\n00:12:36.044 --> 00:12:43.873\nYou'll give it an API key and it will just like transcribe with like a little key to start and stop the dictation.\n\n214\n00:12:45.248 --> 00:12:47.061\nAnd the issues were I discovered\n\n215\n00:12:47.241 --> 00:13:06.619\nthat like most people involved in creating these projects were very much focused on local models running whisper locally because you can and i tried that a bunch of times and just never got results that were as good as the cloud and when i began looking at the cost of the speech to text apis and what i was spending i\n\n216\n00:13:06.682 --> 00:13:16.104\njust thought there is it's actually in my opinion just one of the better deals in api spending and in cloud like it's just not that expensive for very very good models\n\n217\n00:13:16.730 --> 00:13:18.470\nThat are much more,\n\n218\n00:13:19.070 --> 00:13:19.291\nyou know,\n\n219\n00:13:19.292 --> 00:13:20.688\nyou're going to be able to run the full model,\n\n220\n00:13:21.572 --> 00:13:24.916\nthe latest model versus whatever you can run on your average\n\n221\n00:13:25.533 --> 00:13:28.711\nGPU, unless you want to buy a crazy GPU.\n\n222\n00:13:28.751 --> 00:13:29.892\nIt doesn't really make sense to me.\n\n223\n00:13:30.033 --> 00:13:39.619\nPrivacy is another concern that I know is kind of like a very much a separate thing that people just don't want their voice data and their voice leaving their local environment,\n\n224\n00:13:40.352 --> 00:13:42.197\nmaybe for regulatory reasons as well.\n\n225\n00:13:42.916 --> 00:13:43.727\nBut I'm not in that.\n\n226\n00:13:44.291 --> 00:13:45.744\nI'm neither really care.\n\n227\n00:13:46.118 --> 00:13:52.018\nabout people listening to my grocery list consisting of reminding myself that I need to buy more 
beer,\n\n228\n00:13:52.619 --> 00:13:53.721\nCheetos and hummus,\n\n229\n00:13:53.759 --> 00:13:54.716\nwhich is kind of the three,\n\n230\n00:13:55.264 --> 00:13:59.240\nthree staples of my diet during periods of poor nutrition.\n\n231\n00:14:00.020 --> 00:14:01.458\nBut the kind of stuff that I transcribe,\n\n232\n00:14:01.498 --> 00:14:02.153\nit's just not,\n\n233\n00:14:02.154 --> 00:14:03.106\nit's not a,\n\n234\n00:14:04.248 --> 00:14:08.059\nit's not a privacy thing I'm that sort of sensitive about.\n\n235\n00:14:08.356 --> 00:14:08.748\nAnd\n\n236\n00:14:09.606 --> 00:14:10.544\nI don't do anything so,\n\n237\n00:14:11.559 --> 00:14:11.826\nyou know,\n\n238\n00:14:12.356 --> 00:14:14.356\nsensitive or secure that requires air gapping.\n\n239\n00:14:14.403 --> 00:14:14.528\nSo.\n\n240\n00:14:15.770 --> 00:14:18.131\nI looked at the pricing and especially the kind of older models,\n\n241\n00:14:18.273 --> 00:14:18.493\nmini,\n\n242\n00:14:19.714 --> 00:14:20.417\nsome of them are very,\n\n243\n00:14:20.495 --> 00:14:21.174\nvery affordable.\n\n244\n00:14:21.256 --> 00:14:21.475\nAnd\n\n245\n00:14:22.937 --> 00:14:24.721\nI did a calculation once with\n\n246\n00:14:25.361 --> 00:14:26.339\nChatGPT and I was like,\n\n247\n00:14:26.424 --> 00:14:26.542\nOK,\n\n248\n00:14:27.322 --> 00:14:27.783\nthis is the\n\n249\n00:14:28.464 --> 00:14:31.027\nAPI price for I can't remember whatever the model was.\n\n250\n00:14:31.971 --> 00:14:33.861\nLet's say I just go at it like nonstop,\n\n251\n00:14:34.269 --> 00:14:35.408\nwhich it rarely happens.\n\n252\n00:14:35.549 --> 00:14:36.033\nProbably\n\n253\n00:14:36.691 --> 00:14:42.956\nI would say on average I might dictate 30 to 60 minutes per day if I was probably summing up the emails.\n\n254\n00:14:44.114 --> 00:14:44.234\nuh,\n\n255\n00:14:44.635 --> 00:14:45.236\ndocuments,\n\n256\n00:14:45.356 --> 00:14:46.080\noutlines,\n\n257\n00:14:46.760 --> 00:14:47.100\num,\n\n258\n00:14:47.201 --> 00:14:47.763\nwhich is a 
lot,\n\n259\n00:14:47.802 --> 00:14:48.182\nbut it's,\n\n260\n00:14:48.484 --> 00:14:49.889\nit's still a fairly modest amount.\n\n261\n00:14:50.327 --> 00:14:50.730\nAnd I was like,\n\n262\n00:14:50.750 --> 00:14:50.870\nwell,\n\n263\n00:14:50.952 --> 00:14:53.840\nsome days I do go on like one or two days where I've been.\n\n264\n00:14:54.749 --> 00:15:00.255\nUsually when I'm like kind of out of the house and just have something like I have nothing else to do.\n\n265\n00:15:00.354 --> 00:15:01.813\nLike if I'm at a hospital,\n\n266\n00:15:01.856 --> 00:15:07.841\nwe have a newborn and you're waiting for like eight hours and hours for an appointment.\n\n267\n00:15:08.380 --> 00:15:12.865\nAnd I would probably have listened to podcasts before becoming a speech fanatic.\n\n268\n00:15:12.942 --> 00:15:13.475\nAnd I'm like,\n\n269\n00:15:13.520 --> 00:15:13.645\noh,\n\n270\n00:15:13.662 --> 00:15:13.865\nwait,\n\n271\n00:15:14.302 --> 00:15:15.255\nlet me just get down.\n\n272\n00:15:15.427 --> 00:15:16.975\nLet me just get these ideas out of my head.\n\n273\n00:15:17.567 --> 00:15:20.645\nAnd that's when I'll go on my speech binges.\n\n274\n00:15:20.692 --> 00:15:22.067\nBut those are like once every few months,\n\n275\n00:15:22.130 --> 00:15:23.270\nlike not frequently.\n\n276\n00:15:23.832 --> 00:15:24.192\nBut I said,\n\n277\n00:15:24.232 --> 00:15:24.413\nokay,\n\n278\n00:15:24.494 --> 00:15:27.597\nlet's just say if I'm going to price out cloud STT,\n\n279\n00:15:29.038 --> 00:15:36.043\nif I was like dedicated every second of every waking hour to transcribing for some odd reason,\n\n280\n00:15:36.823 --> 00:15:37.129\num,\n\n281\n00:15:37.323 --> 00:15:37.590\nI mean,\n\n282\n00:15:37.591 --> 00:15:39.465\nit'd have to like eat and use the toilet.\n\n283\n00:15:39.823 --> 00:15:40.090\nLike,\n\n284\n00:15:40.527 --> 00:15:40.730\nyou know,\n\n285\n00:15:40.730 --> 00:15:42.527\nthere's only so many hours I'm awake for.\n\n286\n00:15:42.652 --> 00:15:43.090\nSo 
like,\n\n287\n00:15:43.198 --> 00:15:45.495\nlet's just say a maximum of like 40 hours,\n\n288\n00:15:45.620 --> 00:15:48.058\n45 minutes in the hour.\n\n289\n00:15:48.120 --> 00:15:48.573\nThen I said,\n\n290\n00:15:48.590 --> 00:15:48.840\nall right,\n\n291\n00:15:48.855 --> 00:15:49.823\nlet's just say 50.\n\n292\n00:15:50.715 --> 00:15:51.277\nWho knows?\n\n293\n00:15:51.495 --> 00:15:52.573\nYou're dictating on the toilet.\n\n294\n00:15:52.855 --> 00:15:53.323\nWe do it.\n\n295\n00:15:54.144 --> 00:15:55.385\nSo you could just do 60,\n\n296\n00:15:55.524 --> 00:15:58.764\nbut whatever I did and every day,\n\n297\n00:15:58.986 --> 00:16:02.525\nlike you're going flat out seven days a week dictating nonstop.\n\n298\n00:16:02.565 --> 00:16:02.964\nI was like,\n\n299\n00:16:03.104 --> 00:16:06.424\nwhat's my monthly API bill going to be at this price?\n\n300\n00:16:06.947 --> 00:16:09.307\nAnd it came out to like 70 or 80 bucks.\n\n301\n00:16:09.307 --> 00:16:09.745\nAnd I was like,\n\n302\n00:16:09.854 --> 00:16:10.042\nwell,\n\n303\n00:16:10.135 --> 00:16:14.167\nthat would be an extraordinary amount of dictation.\n\n304\n00:16:14.322 --> 00:16:22.104\nAnd I would hope that there was some compelling reason worth more than $70 that I embarked upon that project.\n\n305\n00:16:22.832 --> 00:16:24.716\nSo given that that's kind of the max point for me,\n\n306\n00:16:24.895 --> 00:16:26.116\nI said that's actually very,\n\n307\n00:16:26.296 --> 00:16:26.996\nvery affordable.\n\n308\n00:16:28.099 --> 00:16:28.220\nNow,\n\n309\n00:16:28.278 --> 00:16:35.504\nyou're going to if you want to spec out the costs and you want to do the post-processing that I really do feel is valuable,\n\n310\n00:16:36.207 --> 00:16:37.365\nthat's going to cost some more as well.\n\n311\n00:16:38.091 --> 00:16:39.309\nUnless you're using\n\n312\n00:16:40.309 --> 00:16:42.996\nGemini, which needless to say,\n\n313\n00:16:43.013 --> 00:16:45.091\nis a random person sitting in 
Jerusalem.\n\n314\n00:16:46.013 --> 00:16:46.934\nI have no affiliation,\n\n315\n00:16:47.216 --> 00:16:48.341\nnor with Google,\n\n316\n00:16:48.403 --> 00:16:49.184\nnor Anthropic,\n\n317\n00:16:49.231 --> 00:16:49.903\nnor Gemini,\n\n318\n00:16:49.966 --> 00:16:52.028\nnor any major tech vendor for that matter.\n\n319\n00:16:52.688 --> 00:16:52.908\nUm,\n\n320\n00:16:53.951 --> 00:16:56.770\nI like Gemini not so much as a everyday model.\n\n321\n00:16:57.072 --> 00:16:57.412\nUm,\n\n322\n00:16:57.513 --> 00:16:59.416\nit's kind of underwhelmed in that respect,\n\n323\n00:16:59.434 --> 00:16:59.837\nI would say,\n\n324\n00:17:00.477 --> 00:17:01.653\nbut for multimodal,\n\n325\n00:17:01.716 --> 00:17:02.934\nI think it's got a lot to offer.\n\n326\n00:17:03.576 --> 00:17:06.762\nAnd I think that the transcribing functionality whereby it can,\n\n327\n00:17:07.584 --> 00:17:07.840\num,\n\n328\n00:17:08.059 --> 00:17:13.809\nprocess audio with a system prompt and both give you transcription that's cleaned up,\n\n329\n00:17:13.873 --> 00:17:15.373\nthat reduces two steps to one.\n\n330\n00:17:15.965 --> 00:17:18.012\nAnd that for me is a very,\n\n331\n00:17:18.076 --> 00:17:18.653\nvery big deal.\n\n332\n00:17:18.873 --> 00:17:19.090\nAnd,\n\n333\n00:17:19.840 --> 00:17:19.951\nuh,\n\n334\n00:17:19.951 --> 00:17:22.045\nI feel like even Google has haven't really sort of\n\n335\n00:17:22.669 --> 00:17:39.968\nthought through how useful the that modality is and what kind of use cases you can achieve with it because i found in the course of this year just an endless list of really kind of system prompt system prompt stuff that i can say okay\n\n336\n00:17:40.125 --> 00:17:49.733\ni've used it to capture context data for ai which is literally i might speak for if i wanted to have a good bank of context data about who knows my childhood.\n\n337\n00:17:50.480 --> 00:18:06.348\nmore realistically maybe my career goals something that would just be like really boring to type 
out so I'll just like sit in my car and record it for 10 minutes and that 10 minutes you get a lot of information in emails\n\n338\n00:18:06.458 --> 00:18:15.864\nwhich is short text just there is a whole bunch and all these workflows kind of require a little bit of treatment afterwards and different treatment my context\n\n339\n00:18:16.441 --> 00:18:37.698\npipeline is kind of like just extract the bare essentials so you end up with me talking very loosely about sort of what i've done in my career where i've worked where i might like to work and it goes it condenses that down to very robotic language that is easy to chunk parse and maybe put into a vector database daniel has worked in technology daniel is a has\n\n340\n00:18:37.979 --> 00:18:44.526\nbeen working in martin you know stuff like that that's not how you would speak um but i figure it's probably easier to parse for,\n\n341\n00:18:44.962 --> 00:18:45.432\nafter all,\n\n342\n00:18:45.759 --> 00:18:46.104\nrobots.\n\n343\n00:18:46.930 --> 00:19:02.180\nSo we've almost got to 20 minutes and this is actually a success because I wasted 20 minutes of the evening speaking into a microphone and the levels were shot and it was clipping and I said I can't really do an evaluation.\n\n344\n00:19:02.539 --> 00:19:03.320\nI have to be fair.\n\n345\n00:19:03.398 --> 00:19:06.961\nI have to give the models a chance to do their thing.\n\n346\n00:19:07.852 --> 00:19:09.430\nWhat am I hoping to achieve in this?\n\n347\n00:19:09.586 --> 00:19:09.789\nOkay,\n\n348\n00:19:09.852 --> 00:19:11.352\nmy fine tune was a dud as mentioned.\n\n349\n00:19:11.977 --> 00:19:12.648\nDeepgram SDT,\n\n350\n00:19:12.789 --> 00:19:13.180\nI'm really,\n\n351\n00:19:13.211 --> 00:19:15.477\nreally hopeful that this prototype will work.\n\n352\n00:19:16.060 --> 00:19:17.843\nAnd it's a built in public open source.\n\n353\n00:19:17.844 --> 00:19:20.624\nSo anyone is welcome to use it if I make anything good.\n\n354\n00:19:21.788 --> 
00:19:27.515\nBut that was really exciting for me last night when after hours of trying my own prototype,\n\n355\n00:19:27.593 --> 00:19:31.054\nseeing someone just made something that works like that,\n\n356\n00:19:31.451 --> 00:19:31.654\nyou know,\n\n357\n00:19:31.655 --> 00:19:36.279\nyou're not going to have to build a custom conda environment and image.\n\n358\n00:19:36.468 --> 00:19:37.482\nI have AMD GPU,\n\n359\n00:19:37.546 --> 00:19:39.811\nwhich makes things much more complicated.\n\n360\n00:19:40.311 --> 00:19:41.029\nI didn't find it.\n\n361\n00:19:42.093 --> 00:19:42.843\nAnd I was about to give up.\n\n362\n00:19:42.844 --> 00:19:43.140\nAnd I said,\n\n363\n00:19:43.171 --> 00:19:43.421\nall right,\n\n364\n00:19:43.422 --> 00:19:45.468\nlet me just give Deepgram's Linux thing.\n\n365\n00:19:46.178 --> 00:19:48.265\nshot and if it doesn't work,\n\n366\n00:19:49.027 --> 00:19:53.621\nI'm just gonna go back to trying to vibe code something myself and when I ran the script\n\n367\n00:19:54.367 --> 00:19:57.450\nI was using cloud code to do the installation process.\n\n368\n00:19:58.271 --> 00:20:00.114\nIt ran the script and oh my gosh,\n\n369\n00:20:00.192 --> 00:20:01.195\nit works just like that.\n\n370\n00:20:01.977 --> 00:20:10.789\nThe tricky thing for all those who wants to know all the nitty gritty details was that\n\n371\n00:20:11.398 --> 00:20:13.648\nI don't think it was actually struggling with transcription,\n\n372\n00:20:13.680 --> 00:20:14.352\nbut pasting,\n\n373\n00:20:14.884 --> 00:20:17.509\nWayland makes life very hard.\n\n374\n00:20:17.617 --> 00:20:19.634\nAnd I think there was something not running at the right time.\n\n375\n00:20:19.695 --> 00:20:19.977\nAnyway,\n\n376\n00:20:20.617 --> 00:20:21.117\nDeepgram,\n\n377\n00:20:21.273 --> 00:20:24.134\nI looked at how they actually handled that because it worked out of the...\n\n378\n00:20:24.203 --> 00:20:40.180\nbox when other stuff didn't and it was quite a clever little mechanism 
and but more so than that the accuracy was brilliant now what am i doing here this is going to be a 20 minute audio sample and i'm i\n\n379\n00:20:40.181 --> 00:20:52.413\nthink i've done one or two of these before but i did it with short snappy voice notes this is kind of long form this actually might be a better approximation for what's useful to me then\n\n380\n00:20:53.144 --> 00:21:09.383\nvoice memos like i need to buy three liters of milk tomorrow and peter bread which is probably how like half my voice note voice notes sound like if anyone were to i don't know like find my phone they'd be like this is the most boring person in the world although actually there are some like kind of uh journaling\n\n381\n00:21:09.398 --> 00:21:21.586\nthoughts as well but it's a lot of content like that and the probably for the evaluation the most useful thing is slightly obscure tech github nucleano uh hugging face not\n\n382\n00:21:21.743 --> 00:21:38.417\nso obscure that it's not going to have a chance of knowing it but hopefully sufficiently well known that the model should get it i tried to do a little bit of speaking really fast and speaking very slowly i would say in general i've spoken delivered this at a faster pace than i usually would owing to strong\n\n383\n00:21:38.542 --> 00:21:51.214\ncoffee flowing through my bloodstream and the thing that i'm not going to get in this benchmark is background noise which in my first take that i had to get rid of my wife came in with my son and for a good night kiss\n\n384\n00:21:51.675 --> 00:21:58.541\nAnd that actually would have been super helpful to get in because it was non-diarized or if we had diarization,\n\n385\n00:21:59.502 --> 00:21:59.968\na female,\n\n386\n00:22:00.007 --> 00:22:00.443\nI could say,\n\n387\n00:22:00.607 --> 00:22:03.171\nI want the male voice and that wasn't intended for transcription.\n\n388\n00:22:04.724 --> 00:22:07.029\nAnd we're not going to get background noise like people honking their 
horns,\n\n389\n00:22:07.146 --> 00:22:13.099\nwhich is something I've done in my main data set where I am trying to go back to some of my voice notes,\n\n390\n00:22:13.818 --> 00:22:15.740\nannotate them and run a benchmark.\n\n391\n00:22:15.741 --> 00:22:17.007\nBut this is going to be just a pure,\n\n392\n00:22:17.788 --> 00:22:20.007\nquick test and\n\n393\n00:22:21.152 --> 00:22:24.012\nAs someone working on a voice note idea,\n\n394\n00:22:24.071 --> 00:22:27.272\nthat's my sort of end motivation,\n\n395\n00:22:27.332 --> 00:22:31.694\nbesides thinking it's an absolutely outstanding technology that's coming to viability.\n\n396\n00:22:31.772 --> 00:22:32.172\nAnd really,\n\n397\n00:22:32.211 --> 00:22:33.094\nI know this sounds cheesy,\n\n398\n00:22:33.633 --> 00:22:36.336\ncan actually have a very transformative effect.\n\n399\n00:22:37.272 --> 00:22:37.429\nIt's,\n\n400\n00:22:37.836 --> 00:22:38.069\nyou know,\n\n401\n00:22:38.101 --> 00:22:44.897\nvoice technology has been life changing for folks living with disabilities.\n\n402\n00:22:45.851 --> 00:22:46.258\nAnd\n\n403\n00:22:47.054 --> 00:22:49.851\nI think there's something really nice about the fact that it can also benefit.\n\n404\n00:22:50.619 --> 00:22:50.859\nyou know,\n\n405\n00:22:51.019 --> 00:22:58.787\nfolks who are able-bodied and like we can all in different ways make this tech as useful as possible,\n\n406\n00:22:59.231 --> 00:23:01.051\nregardless of the exact way that we're using it.\n\n407\n00:23:02.490 --> 00:23:05.294\nAnd I think there's something very powerful in that and it can be very cool.\n\n408\n00:23:06.395 --> 00:23:07.451\nI see huge potential.\n\n409\n00:23:07.715 --> 00:23:08.934\nWhat excites me about voice tech?\n\n410\n00:23:09.903 --> 00:23:10.512\nA lot of things,\n\n411\n00:23:10.576 --> 00:23:10.872\nactually.\n\n412\n00:23:12.294 --> 00:23:12.622\nFirstly,\n\n413\n00:23:13.028 --> 00:23:14.278\nthe fact that it's cheap and accurate,\n\n414\n00:23:14.715 --> 
00:23:16.122\nas I mentioned at the very start of this,\n\n415\n00:23:17.372 --> 00:23:19.809\nand it's getting better and better with stuff like accent handling.\n\n416\n00:23:21.053 --> 00:23:25.577\nI'm not sure my fine tune will actually ever come to fruition in the sense that I'll use it day to day,\n\n417\n00:23:25.675 --> 00:23:26.878\nas I imagine.\n\n418\n00:23:26.880 --> 00:23:27.878\nI get like superb,\n\n419\n00:23:28.000 --> 00:23:28.942\nflawless words,\n\n420\n00:23:29.058 --> 00:23:29.582\nerror rates,\n\n421\n00:23:29.597 --> 00:23:34.489\nbecause I'm just kind of skeptical about local speech to text,\n\n422\n00:23:34.847 --> 00:23:35.503\nas I mentioned.\n\n423\n00:23:36.105 --> 00:23:36.371\nAnd\n\n424\n00:23:36.792 --> 00:23:40.386\nI think the pace of innovation and improvement in the models,\n\n425\n00:23:40.574 --> 00:23:47.511\nthe main reasons for fine tuning from what I've seen have been people who are something that really blows my mind about\n\n426\n00:23:48.199 --> 00:23:49.278\nASR is\n\n427\n00:23:49.531 --> 00:24:04.644\nthe idea that it's inherently alingual or multilingual phonetic based so as folks who use speak very obscure languages that there may be very there might be a paucity of training data or almost none at all and\n\n428\n00:24:04.644 --> 00:24:15.738\ntherefore the accuracy is significantly reduced or folks in very critical environments i know there you this is used extensively in medical transcription and dispatcher your work as,\n\n429\n00:24:15.955 --> 00:24:16.894\num,\n\n430\n00:24:17.195 --> 00:24:17.435\nyou know,\n\n431\n00:24:17.455 --> 00:24:19.137\nthe call centers who send out ambulances,\n\n432\n00:24:19.199 --> 00:24:19.618\net cetera,\n\n433\n00:24:20.397 --> 00:24:22.441\nwhere accuracy is absolutely paramount.\n\n434\n00:24:22.660 --> 00:24:24.125\nAnd in the case of doctors,\n\n435\n00:24:24.721 --> 00:24:25.461\nradiologists,\n\n436\n00:24:25.461 --> 00:24:28.008\nthey might be using very specialized 
vocab all the time.\n\n437\n00:24:28.827 --> 00:24:30.147\nSo those are kind of the main two things.\n\n438\n00:24:30.148 --> 00:24:37.093\nAnd I'm not sure that really just for trying to make it better on a few random tech words with my slightly,\n\n439\n00:24:37.530 --> 00:24:37.750\nI mean,\n\n440\n00:24:37.750 --> 00:24:38.358\nI have an accent,\n\n441\n00:24:38.436 --> 00:24:39.218\nbut like not,\n\n442\n00:24:39.530 --> 00:24:39.797\nyou know,\n\n443\n00:24:40.233 --> 00:24:43.936\nan accent that a few other million people have it.\n\n444\n00:24:44.922 --> 00:24:46.172\nI'm not sure that.\n\n445\n00:24:46.579 --> 00:24:56.540\nmy little fine tune is going to actually like the bump in word error reduction if I ever actually figure out how to do it and get it up to the cloud by the time I've done that\n\n446\n00:24:57.029 --> 00:25:01.308\nI suspect that the next generation of ASR will just be so good that it will kind of be,\n\n447\n00:25:02.051 --> 00:25:02.173\nno,\n\n448\n00:25:02.430 --> 00:25:02.630\nwell,\n\n449\n00:25:02.808 --> 00:25:03.833\nthat would have been cool if it worked out,\n\n450\n00:25:03.872 --> 00:25:05.192\nbut I'll just use this instead.\n\n451\n00:25:05.972 --> 00:25:11.294\nSo that's going to be it for today's episode of voice training data.\n\n452\n00:25:12.011 --> 00:25:12.333\nSingle,\n\n453\n00:25:12.933 --> 00:25:14.028\nlong shot evaluation.\n\n454\n00:25:14.636 --> 00:25:15.450\nWho am I going to compare?\n\n455\n00:25:16.622 --> 00:25:17.855\nWhisper is always good as a benchmark,\n\n456\n00:25:17.886 --> 00:25:22.278\nbut I'm more interested in seeing Whisper head-to-head with two things,\n\n457\n00:25:22.308 --> 00:25:22.511\nreally.\n\n458\n00:25:23.450 --> 00:25:25.169\nOne is Whisper variants.\n\n459\n00:25:25.200 --> 00:25:25.950\nSo you've got these...\n\n460\n00:25:26.178 --> 00:25:44.617\nprojects like faster whisper uh distill whisper it's a bit confusing there's a whole bunch of them and the emerging asrs which are 
also a thing my intention for this is i'm not sure i'm going to have the time in any point in the foreseeable future to go back through this whole episode and create\n\n461\n00:25:44.618 --> 00:25:55.430\na proper source truth where i fix everything might do it if i can get one transcriptions as sufficiently close to perfection but\n\n462\n00:25:55.942 --> 00:25:57.241\nWhat I would actually love to do on\n\n463\n00:25:58.102 --> 00:25:58.903\nHugging Face,\n\n464\n00:25:59.021 --> 00:25:59.800\nI think would be a great,\n\n465\n00:25:59.984 --> 00:26:08.324\nprobably how I might visualize this is having the audio waveform play and then have the transcript for each model below it.\n\n466\n00:26:08.824 --> 00:26:09.722\nAnd maybe even a,\n\n467\n00:26:11.144 --> 00:26:11.364\nlike,\n\n468\n00:26:11.489 --> 00:26:11.722\nyou know,\n\n469\n00:26:11.871 --> 00:26:15.105\ntwo scale and maybe even a local one as well,\n\n470\n00:26:15.371 --> 00:26:17.903\nlike Local Whisper versus OpenAI API,\n\n471\n00:26:18.903 --> 00:26:19.449\net cetera.\n\n472\n00:26:19.746 --> 00:26:20.105\nAnd...\n\n473\n00:26:21.238 --> 00:26:30.903\nI can then actually listen back to segments or anyone who wants to can listen back to segments of this recording and see where a particular model struggled and others didn't,\n\n474\n00:26:31.606 --> 00:26:34.090\nas well as the sort of headline finding of which had the best\n\n475\n00:26:34.731 --> 00:26:37.372\nWER, but that would require the source of truth.\n\n476\n00:26:37.919 --> 00:26:38.090\nOkay,\n\n477\n00:26:38.137 --> 00:26:38.434\nthat's it.\n\n478\n00:26:38.637 --> 00:26:39.372\nI hope this was,\n\n479\n00:26:39.622 --> 00:26:39.997\nI don't know,\n\n480\n00:26:40.419 --> 00:26:42.403\nmaybe useful for other folks interested in STT.\n\n481\n00:26:43.106 --> 00:26:43.762\nYou want to see that\n\n482\n00:26:44.137 --> 00:26:44.919\nI always feel,\n\n483\n00:26:45.434 --> 00:26:47.247\nthink I've just said as something I didn't intend 
to.\n\n484\n00:26:48.044 --> 00:26:48.481\nSTT,\n\n485\n00:26:48.872 --> 00:26:49.528\nI said for those.\n\n486\n00:26:49.817 --> 00:26:50.378\nlisten carefully,\n\n487\n00:26:50.419 --> 00:26:52.902\nincluding hopefully the models themselves.\n\n488\n00:26:53.441 --> 00:26:54.163\nThis has been myself,\n\n489\n00:26:54.304 --> 00:26:54.902\nDaniel Rosehill.\n\n490\n00:26:55.022 --> 00:26:59.404\nFor more jumbled repositories about my roving interest in AI,\n\n491\n00:26:59.507 --> 00:27:00.765\nbut particularly agentic,\n\n492\n00:27:01.451 --> 00:27:03.015\nMCP and voice tech,\n\n493\n00:27:03.413 --> 00:27:04.335\nyou can find me on\n\n494\n00:27:04.990 --> 00:27:06.749\nGitHub, Hugging Face,\n\n495\n00:27:08.279 --> 00:27:08.811\nwhere else?\n\n496\n00:27:09.140 --> 00:27:10.154\nDanielrosehill.com,\n\n497\n00:27:10.171 --> 00:27:11.296\nwhich is my personal website,\n\n498\n00:27:11.374 --> 00:27:13.483\nas well as this podcast,\n\n499\n00:27:13.624 --> 00:27:15.186\nwhose name I sadly cannot remember.\n\n500\n00:27:15.936 --> 00:27:16.499\nUntil next time,\n\n501\n00:27:16.826 --> 00:27:17.343\nthanks for listening.\n", "nova3": "1\n00:00:00,080 --> 00:00:06,240\nHello and welcome to a audio dataset consisting of one\n\n2\n00:00:06,240 --> 00:00:08,400\nsingle episode of a nonexistent podcast.\n\n3\n00:00:08,800 --> 00:00:12,880\nOr it I may append this to a podcast that\n\n4\n00:00:12,880 --> 00:00:18,814\nI set up recently regarding my with my thoughts on\n\n5\n00:00:18,815 --> 00:00:20,815\nspeech tech and A.\n\n6\n00:00:20,815 --> 00:00:21,214\nI.\n\n7\n00:00:21,214 --> 00:00:22,814\nIn particular, more A.\n\n8\n00:00:22,814 --> 00:00:23,054\nI.\n\n9\n00:00:23,054 --> 00:00:23,935\nAnd generative A.\n\n10\n00:00:23,935 --> 00:00:24,095\nI.\n\n11\n00:00:24,095 --> 00:00:26,494\nI would I would say.\n\n12\n00:00:26,814 --> 00:00:30,869\nBut in any event, the purpose of this voice recording\n\n13\n00:00:30,869 --> 00:00:35,590\nis actually to create a 
lengthy voice sample for a\n\n14\n00:00:35,590 --> 00:00:38,950\nquick evaluation, a back of the envelope evaluation, they might\n\n15\n00:00:38,950 --> 00:00:41,429\nsay, for different speech attacks models.\n\n16\n00:00:41,429 --> 00:00:43,945\nI'm doing this because I thought I'd made a great\n\n17\n00:00:43,945 --> 00:00:47,784\nbreakthrough in my journey with speech tech and that was\n\n18\n00:00:47,784 --> 00:00:51,385\nsucceeding in the elusive task of fine tuning whisper.\n\n19\n00:00:51,704 --> 00:00:56,424\nWhisper is, and I'm to just talk, I'm trying to\n\n20\n00:00:55,829 --> 00:00:56,789\nmix up.\n\n21\n00:00:56,869 --> 00:01:00,390\nI'm going to try a few different styles of speaking\n\n22\n00:01:00,390 --> 00:01:02,869\nwhisper something at some points as well.\n\n23\n00:01:03,350 --> 00:01:06,790\nAnd I'll go back to speaking loud in in different\n\n24\n00:01:06,790 --> 00:01:09,030\nparts are going to sound really like a crazy person\n\n25\n00:01:09,030 --> 00:01:12,424\nbecause I'm also going to try to speak at different\n\n26\n00:01:12,984 --> 00:01:18,025\npitches and cadences in order to really try to push\n\n27\n00:01:18,344 --> 00:01:21,145\na speech to text model through its paces, which is\n\n28\n00:01:21,145 --> 00:01:24,609\ntrying to make sense of is this guy just rambling\n\n29\n00:01:24,609 --> 00:01:30,049\non incoherently in one long sentence or are these just\n\n30\n00:01:30,049 --> 00:01:36,450\nactually a series of step standalone, standalone, standalone sentences?\n\n31\n00:01:36,450 --> 00:01:38,130\nAnd how is it going to handle step alone?\n\n32\n00:01:38,130 --> 00:01:38,770\nThat's not a word.\n\n33\n00:01:39,704 --> 00:01:42,025\nWhat happens when you use speech to text and you\n\n34\n00:01:42,025 --> 00:01:43,384\nuse a fake word?\n\n35\n00:01:43,384 --> 00:01:45,784\nAnd then you're like, wait, that's not actually that word\n\n36\n00:01:45,784 --> 00:01:46,665\ndoesn't exist.\n\n37\n00:01:46,984 --> 00:01:48,584\nHow does 
AI handle that?\n\n38\n00:01:48,584 --> 00:01:53,750\nAnd these and more are all the questions that I'm\n\n39\n00:01:53,750 --> 00:01:55,750\nseeking to answer in this training data.\n\n40\n00:01:55,829 --> 00:01:58,549\nNow, why was I trying to fine tune Whisper?\n\n41\n00:01:58,549 --> 00:01:59,750\nAnd what is Whisper?\n\n42\n00:01:59,750 --> 00:02:02,710\nAs I said, I'm going to try to record this\n\n43\n00:02:02,710 --> 00:02:06,644\nat a couple of different levels of technicality for folks\n\n44\n00:02:06,644 --> 00:02:11,764\nwho are in the normal world and not totally stuck\n\n45\n00:02:11,764 --> 00:02:13,764\ndown the rabbit hole of AI, which you have to\n\n46\n00:02:13,764 --> 00:02:17,685\nsay is a really wonderful rabbit hole to be done.\n\n47\n00:02:17,844 --> 00:02:20,919\nIt's a really interesting area and speech and voice tech\n\n48\n00:02:20,919 --> 00:02:24,359\nis is the aspect of it that I find actually\n\n49\n00:02:24,359 --> 00:02:27,239\nmost I'm not sure I would say the most interesting\n\n50\n00:02:27,239 --> 00:02:30,759\nbecause there's just so much that is fascinating in AI.\n\n51\n00:02:31,400 --> 00:02:34,134\nBut the most that I find the most personally transformative\n\n52\n00:02:34,134 --> 00:02:38,534\nin terms of the impact that it's had on my\n\n53\n00:02:38,534 --> 00:02:41,254\ndaily work life and productivity and how I sort of\n\n54\n00:02:41,254 --> 00:02:41,895\nwork.\n\n55\n00:02:42,935 --> 00:02:47,500\nI'm persevering hard with the task of trying to get\n\n56\n00:02:47,500 --> 00:02:50,939\na good solution working for Linux, which if anyone actually\n\n57\n00:02:50,939 --> 00:02:52,939\ndoes listen to this, not just for the training data\n\n58\n00:02:52,939 --> 00:02:56,700\nand for the actual content, is sparked.\n\n59\n00:02:56,700 --> 00:02:59,980\nI had, besides the fine tune not working, well that\n\n60\n00:02:59,980 --> 00:03:01,385\nwas the failure.\n\n61\n00:03:02,504 --> 00:03:06,745\nI used Claude code because 
one thinks these days that\n\n62\n00:03:06,745 --> 00:03:13,280\nthere is nothing short of solving, you know, the the\n\n63\n00:03:13,280 --> 00:03:17,599\nreason of life or something that clause and agentic AI\n\n64\n00:03:17,599 --> 00:03:19,680\ncan't do, which is not really the case.\n\n65\n00:03:19,680 --> 00:03:23,199\nIt does seem that way sometimes, but it fails a\n\n66\n00:03:23,199 --> 00:03:23,759\nlot as well.\n\n67\n00:03:23,759 --> 00:03:26,639\nAnd this is one of those instances where last week\n\n68\n00:03:26,639 --> 00:03:30,824\nI put together an hour of voice training data, basically\n\n69\n00:03:30,824 --> 00:03:33,465\nspeaking just random things for three minutes.\n\n70\n00:03:35,465 --> 00:03:38,104\nIt was actually kind of tedious because the texts were\n\n71\n00:03:38,104 --> 00:03:38,664\nreally weird.\n\n72\n00:03:38,664 --> 00:03:41,370\nSome of them were, it was like it was AI\n\n73\n00:03:41,370 --> 00:03:42,250\ngenerated.\n\n74\n00:03:42,569 --> 00:03:44,889\nI tried before to read Sherlock Holmes for an hour\n\n75\n00:03:44,889 --> 00:03:47,689\nand I just couldn't, I was so bored after ten\n\n76\n00:03:47,689 --> 00:03:50,569\nminutes that I was like, okay, no, I'm just gonna\n\n77\n00:03:50,569 --> 00:03:51,930\nhave to find something else to read.\n\n78\n00:03:51,930 --> 00:03:58,284\nSo I used a created with AI Studio, VibeCoded, a\n\n79\n00:03:58,284 --> 00:04:03,164\nsynthetic text generator which actually I thought was probably a\n\n80\n00:04:03,164 --> 00:04:05,245\nbetter way of doing it because it would give me\n\n81\n00:04:05,245 --> 00:04:09,069\nmore short samples with more varied content.\n\n82\n00:04:09,069 --> 00:04:11,710\nSo I was like, okay, give me a voice note\n\n83\n00:04:11,710 --> 00:04:14,909\nlike I'm recording an email, give me a short story\n\n84\n00:04:14,909 --> 00:04:18,189\nto read, give me prose to read.\n\n85\n00:04:18,189 --> 00:04:20,634\nSo I came up with all these different things 
and\n\n86\n00:04:20,634 --> 00:04:22,714\nthey added a little timer to it so I could\n\n87\n00:04:22,714 --> 00:04:24,955\nsee how close I was to one hour.\n\n88\n00:04:25,915 --> 00:04:29,115\nAnd I spent like an hour one afternoon or probably\n\n89\n00:04:29,115 --> 00:04:33,115\ntwo hours by the time you do retakes and whatever\n\n90\n00:04:33,115 --> 00:04:36,169\nbecause you want to it gave me a source of\n\n91\n00:04:36,169 --> 00:04:40,009\ntruth which I'm not sure if that's the scientific way\n\n92\n00:04:40,009 --> 00:04:44,169\nto approach this topic of gathering training data but I\n\n93\n00:04:44,169 --> 00:04:45,449\nthought made sense.\n\n94\n00:04:46,490 --> 00:04:49,464\nI have a lot of audio data from recording voice\n\n95\n00:04:49,464 --> 00:04:53,544\nnotes which I've also kind of used, been experimenting with\n\n96\n00:04:53,544 --> 00:04:55,064\nusing for a different purpose.\n\n97\n00:04:55,384 --> 00:04:58,745\nSlightly different annotating task types.\n\n98\n00:04:58,745 --> 00:05:03,250\nIt's more a text classification experiment or Well, it's more\n\n99\n00:05:03,250 --> 00:05:03,810\nthan that actually.\n\n100\n00:05:03,810 --> 00:05:05,009\nI'm working on a voice app.\n\n101\n00:05:05,009 --> 00:05:09,329\nSo it's a prototype, I guess, is really more accurate.\n\n102\n00:05:11,409 --> 00:05:13,969\nBut you can do that and you can work backwards.\n\n103\n00:05:13,969 --> 00:05:18,354\nListen back to a voice note and you painfully go\n\n104\n00:05:18,354 --> 00:05:21,474\nthrough one of those transcribing, where you start and stop\n\n105\n00:05:21,474 --> 00:05:23,634\nand scrub around it and you fix the errors, but\n\n106\n00:05:23,634 --> 00:05:25,875\nit's really, really pouring to do that.\n\n107\n00:05:26,115 --> 00:05:28,034\nSo I thought it would be less tedious in the\n\n108\n00:05:28,034 --> 00:05:31,714\nlong term if I just recorded the source of truth.\n\n109\n00:05:32,069 --> 00:05:34,389\nSo it gave me these three minutes 
snippets.\n\n110\n00:05:34,389 --> 00:05:37,509\nI recorded them and saved an MP3 and a TXT\n\n111\n00:05:37,750 --> 00:05:40,310\nin the same folder and I created an error that\n\n112\n00:05:40,310 --> 00:05:40,949\ndata.\n\n113\n00:05:41,990 --> 00:05:44,870\nSo I was very hopeful, quietly, a little bit hopeful\n\n114\n00:05:44,870 --> 00:05:47,029\nthat I would be able, that I could actually fine\n\n115\n00:05:47,029 --> 00:05:47,750\ntune Whisper.\n\n116\n00:05:48,365 --> 00:05:51,085\nI want to fine tune Whisper because when I got\n\n117\n00:05:51,085 --> 00:05:55,004\ninto voice tech last November, my wife was in the\n\n118\n00:05:55,004 --> 00:05:57,245\nUS and I was alone at home.\n\n119\n00:05:57,324 --> 00:06:01,004\nAnd when crazy people like me do really wild things\n\n120\n00:06:01,004 --> 00:06:03,980\nlike use voice to tech technology.\n\n121\n00:06:03,980 --> 00:06:06,939\nThat was basically when I started doing it, I didn't\n\n122\n00:06:06,939 --> 00:06:09,580\nfeel like a crazy person speaking to myself.\n\n123\n00:06:09,980 --> 00:06:12,780\nAnd my expectations weren't that high.\n\n124\n00:06:13,180 --> 00:06:17,685\nI'd used speech tech now and again, tried it out.\n\n125\n00:06:17,685 --> 00:06:18,884\nI was like, it'd be really cool if you could\n\n126\n00:06:18,884 --> 00:06:22,404\njust like speak into your computer and whatever I tried\n\n127\n00:06:22,404 --> 00:06:25,925\nout that had Linux support was just, it was not\n\n128\n00:06:25,925 --> 00:06:26,805\ngood basically.\n\n129\n00:06:27,365 --> 00:06:29,524\nAnd this blew me away from the first go.\n\n130\n00:06:29,524 --> 00:06:32,339\nI mean, it wasn't one hundred percent accurate out of\n\n131\n00:06:32,339 --> 00:06:34,500\nthe box and it took work, but it was good\n\n132\n00:06:34,500 --> 00:06:36,819\nenough that there was a solid foundation and it kind\n\n133\n00:06:36,819 --> 00:06:41,139\nof passed that pivot point that it's actually worth doing\n\n134\n00:06:41,139 --> 
00:06:41,620\nthis.\n\n135\n00:06:41,939 --> 00:06:43,939\nYou know, there's a point where it's so like, the\n\n136\n00:06:43,939 --> 00:06:46,485\ntranscript is you don't have to get one hundred percent\n\n137\n00:06:46,485 --> 00:06:49,525\naccuracy for it to be worth your time for speech\n\n138\n00:06:49,525 --> 00:06:51,925\nto text to be a worthwhile addition to your productivity.\n\n139\n00:06:51,925 --> 00:06:53,685\nBut you do need to get above, let's say, I\n\n140\n00:06:53,685 --> 00:06:55,125\ndon't know, eighty five percent.\n\n141\n00:06:55,605 --> 00:06:58,805\nIf it's sixty percent or fifty percent, you inevitably say,\n\n142\n00:06:59,040 --> 00:07:00,319\nScrew it, I'll just type it.\n\n143\n00:07:00,319 --> 00:07:03,680\nBecause you end up missing errors in the transcript and\n\n144\n00:07:03,680 --> 00:07:05,040\nit becomes actually worse.\n\n145\n00:07:05,040 --> 00:07:06,720\nYou end up in a worse position than you started\n\n146\n00:07:06,720 --> 00:07:07,040\nwith it.\n\n147\n00:07:07,040 --> 00:07:08,240\nThat's been my experience.\n\n148\n00:07:08,560 --> 00:07:12,480\nSo I was like, Oh, this is actually really, really\n\n149\n00:07:12,480 --> 00:07:12,960\ngood now.\n\n150\n00:07:12,960 --> 00:07:13,680\nHow did that happen?\n\n151\n00:07:13,680 --> 00:07:17,995\nAnd the answer is ASR, Whisper being open sourced and\n\n152\n00:07:18,714 --> 00:07:21,594\nthe transformer architecture, if you want to go back to\n\n153\n00:07:21,594 --> 00:07:26,394\nthe underpinnings, which really blows my mind and it's on\n\n154\n00:07:26,394 --> 00:07:29,830\nmy list to read through that paper.\n\n155\n00:07:30,389 --> 00:07:35,990\nAll you need is attention as attentively as can be\n\n156\n00:07:35,990 --> 00:07:39,350\ndone with my limited brain because it's super super high\n\n157\n00:07:39,350 --> 00:07:43,045\nlevel stuff, super advanced stuff, mean.\n\n158\n00:07:43,285 --> 00:07:48,084\nThat I think of all the things that are 
fascinating\n\n159\n00:07:48,084 --> 00:07:52,564\nabout the sudden rise in AI and the dramatic capabilities,\n\n160\n00:07:53,339 --> 00:07:55,419\nI find it fascinating that few people are like, hang\n\n161\n00:07:55,419 --> 00:07:58,300\non, you've got this thing that can speak to you\n\n162\n00:07:58,300 --> 00:08:00,060\nlike a chatbot, an LLM.\n\n163\n00:08:00,620 --> 00:08:02,860\nAnd then you've got image generation.\n\n164\n00:08:02,860 --> 00:08:03,180\nOkay.\n\n165\n00:08:03,180 --> 00:08:07,100\nSo firstly, two things on the surface have nothing in\n\n166\n00:08:07,100 --> 00:08:07,419\ncommon.\n\n167\n00:08:08,365 --> 00:08:12,044\nSo how did that just happen all at the same\n\n168\n00:08:12,044 --> 00:08:12,285\ntime?\n\n169\n00:08:12,285 --> 00:08:15,964\nAnd then when you extend that further, you're like, Suno.\n\n170\n00:08:15,964 --> 00:08:19,485\nYou can sing a song and AI will come up\n\n171\n00:08:19,485 --> 00:08:21,165\nwith an instrumental.\n\n172\n00:08:21,485 --> 00:08:23,485\nAnd then you've got Whisper and you're like, Wait a\n\n173\n00:08:23,485 --> 00:08:23,725\nsecond.\n\n174\n00:08:24,100 --> 00:08:28,180\nHow did all this stuff If it's all AI, there\n\n175\n00:08:28,180 --> 00:08:29,540\nhas to be some commonality.\n\n176\n00:08:29,540 --> 00:08:35,139\nOtherwise, are totally different technologies on the surface of it.\n\n177\n00:08:35,220 --> 00:08:39,384\nAnd the transformer architecture is, as far as I know,\n\n178\n00:08:39,384 --> 00:08:40,264\nthe answer.\n\n179\n00:08:40,264 --> 00:08:42,985\nAnd I can't even say, can't even pretend that I\n\n180\n00:08:42,985 --> 00:08:47,384\nreally understand what the transformer architecture means in-depth.\n\n181\n00:08:47,384 --> 00:08:49,865\nBut I have scanned this and as I said, I\n\n182\n00:08:49,865 --> 00:08:52,879\nwant to print it and really kind of think over\n\n183\n00:08:52,879 --> 00:08:54,160\nit at some point.\n\n184\n00:08:54,879 --> 00:08:58,080\nAnd I'll probably feel 
bad about myself, I think, because\n\n185\n00:08:58,080 --> 00:08:59,679\nweren't those guys in twenties?\n\n186\n00:09:00,320 --> 00:09:01,840\nLike, that's crazy.\n\n187\n00:09:02,160 --> 00:09:06,160\nI think I asked ChatGPT once who wrote that paper\n\n188\n00:09:06,545 --> 00:09:09,264\nand how old were they when it was published in\n\n189\n00:09:09,264 --> 00:09:09,825\nArcSiv?\n\n190\n00:09:09,825 --> 00:09:13,105\nAnd I was expecting like, I don't know, what do\n\n191\n00:09:13,105 --> 00:09:13,585\nyou imagine?\n\n192\n00:09:13,585 --> 00:09:15,665\nI personally imagine kind of like, you you have these\n\n193\n00:09:15,665 --> 00:09:19,745\nbreakthroughs during COVID and things like that, where like these\n\n194\n00:09:19,745 --> 00:09:22,629\nkind of really obscure scientists who are in their 50s\n\n195\n00:09:22,629 --> 00:09:26,870\nand they've just kind of been laboring in labs and\n\n196\n00:09:26,870 --> 00:09:29,830\nwearily in writing and publishing in kind of obscure academic\n\n197\n00:09:29,830 --> 00:09:30,710\npublications.\n\n198\n00:09:30,870 --> 00:09:33,669\nAnd they finally hit a big or win a Nobel\n\n199\n00:09:33,669 --> 00:09:36,235\nPrize and then their household names.\n\n200\n00:09:36,634 --> 00:09:38,634\nSo that was kind of what I had in mind.\n\n201\n00:09:38,634 --> 00:09:42,154\nThat was the mental image I'd formed of the birth\n\n202\n00:09:42,154 --> 00:09:42,955\nof ArcSim.\n\n203\n00:09:42,955 --> 00:09:45,595\nLike I wasn't expecting twenty somethings in San Francisco.\n\n204\n00:09:45,595 --> 00:09:48,794\nI thought that was both very funny, very cool, and\n\n205\n00:09:48,794 --> 00:09:50,075\nactually kind of inspiring.\n\n206\n00:09:50,554 --> 00:09:55,230\nIt's nice to think that people who just you might\n\n207\n00:09:55,230 --> 00:09:58,509\nput them in the kind of milieu or bubble or\n\n208\n00:09:58,509 --> 00:10:02,669\nworld that you are in incredibly in through a series\n\n209\n00:10:02,669 --> 00:10:05,835\nof 
connections that are coming up with such literally world\n\n210\n00:10:05,835 --> 00:10:07,835\nchanging innovations.\n\n211\n00:10:07,914 --> 00:10:11,274\nSo that was I thought anyway, that's that that was\n\n212\n00:10:11,274 --> 00:10:11,835\ncool.\n\n213\n00:10:12,235 --> 00:10:12,554\nOkay.\n\n214\n00:10:12,554 --> 00:10:13,434\nVoice training data.\n\n215\n00:10:13,434 --> 00:10:14,154\nHow are we doing?\n\n216\n00:10:14,154 --> 00:10:17,355\nWe're about ten minutes, and I'm still talking about voice\n\n217\n00:10:17,355 --> 00:10:18,235\ntechnology.\n\n218\n00:10:18,634 --> 00:10:22,179\nSo Whisper was brilliant, and I was so excited that\n\n219\n00:10:22,179 --> 00:10:25,860\nmy first instinct was to guess, like, Oh my gosh,\n\n220\n00:10:25,860 --> 00:10:28,019\nI have to get a really good microphone for this.\n\n221\n00:10:28,179 --> 00:10:31,379\nSo I didn't go on a spending spree because I\n\n222\n00:10:31,379 --> 00:10:33,299\nsaid, I'm gonna have to just wait a month and\n\n223\n00:10:33,299 --> 00:10:34,740\nsee if I still use this.\n\n224\n00:10:35,220 --> 00:10:38,875\nAnd it just kind of became it's become really part\n\n225\n00:10:38,875 --> 00:10:40,955\nof my daily routine.\n\n226\n00:10:41,754 --> 00:10:44,315\nLike if I'm writing an email, I'll record a voice\n\n227\n00:10:44,315 --> 00:10:47,595\nnote and then I've developed and it's nice to see\n\n228\n00:10:47,595 --> 00:10:50,759\nthat everyone is like developing the same things in parallel.\n\n229\n00:10:50,759 --> 00:10:53,399\nThat's kind of a weird thing to say, when I\n\n230\n00:10:53,399 --> 00:11:00,279\nstarted working on these prototypes on GitHub, which is where\n\n231\n00:11:00,279 --> 00:11:04,039\nI just kind of share very freely and loosely ideas\n\n232\n00:11:04,039 --> 00:11:06,945\nand first iterations on concepts.\n\n233\n00:11:09,024 --> 00:11:10,704\nAnd for want of a better word, I called it\n\n234\n00:11:10,704 --> 00:11:14,945\nlike LLM post processing or clean up 
or basically a\n\n235\n00:11:14,945 --> 00:11:17,745\nsystem prompt that after you get back the raw text\n\n236\n00:11:17,745 --> 00:11:21,620\nfrom Whisper, you run it through a model and say,\n\n237\n00:11:21,620 --> 00:11:26,339\nokay, this is crappy text like add sentence structure and,\n\n238\n00:11:26,339 --> 00:11:27,459\nyou know, fix it up.\n\n239\n00:11:27,860 --> 00:11:32,579\nAnd now when I'm exploring the different tools that are\n\n240\n00:11:32,579 --> 00:11:35,634\nout there that people have built, I see quite a\n\n241\n00:11:35,634 --> 00:11:39,475\nnumber of projects have basically done the same thing.\n\n242\n00:11:40,754 --> 00:11:43,235\nLest that be misconstrued, I'm not saying for a millisecond\n\n243\n00:11:43,235 --> 00:11:44,595\nthat I inspired them.\n\n244\n00:11:44,595 --> 00:11:48,034\nI'm sure this has been a thing that's been integrated\n\n245\n00:11:48,034 --> 00:11:51,290\ninto tools for a while, but it's the kind of\n\n246\n00:11:51,290 --> 00:11:53,690\nthing that when you start using these tools every day,\n\n247\n00:11:53,690 --> 00:11:57,610\nthe need for it is almost instantly apparent because text\n\n248\n00:11:57,610 --> 00:12:01,529\nthat doesn't have any punctuation or paragraph spacing takes a\n\n249\n00:12:01,529 --> 00:12:03,965\nlong time to, you know, it takes so long to\n\n250\n00:12:03,965 --> 00:12:09,004\nget it into a presentable email that again, moves speech\n\n251\n00:12:09,004 --> 00:12:13,085\ntech into that before that inflection point where you're like,\n\n252\n00:12:13,085 --> 00:12:13,965\nnah, it's just not worth it.\n\n253\n00:12:13,965 --> 00:12:16,924\nIt's like, it'll just be quicker to type this.\n\n254\n00:12:17,279 --> 00:12:19,840\nSo it's a big, it's a little touch that actually\n\n255\n00:12:20,080 --> 00:12:21,200\nis a big deal.\n\n256\n00:12:21,519 --> 00:12:25,440\nSo I was on Whisper and I've been using Whisper\n\n257\n00:12:25,440 --> 00:12:27,759\nand I kind of early on found a couple 
of\n\n258\n00:12:27,759 --> 00:12:28,399\ntools.\n\n259\n00:12:28,399 --> 00:12:30,639\nI couldn't find what I was looking for on Linux,\n\n260\n00:12:30,639 --> 00:12:35,924\nwhich is basically just something that'll run-in the background.\n\n261\n00:12:35,924 --> 00:12:38,245\nYou'll give it an API key and it will just\n\n262\n00:12:38,245 --> 00:12:43,044\nlike transcribe with like a little key to start and\n\n263\n00:12:43,044 --> 00:12:43,845\nstop the dictation.\n\n264\n00:12:45,080 --> 00:12:48,440\nAnd the issues where I discovered that like most people\n\n265\n00:12:48,440 --> 00:12:52,040\ninvolved in creating these projects were very much focused on\n\n266\n00:12:52,040 --> 00:12:55,800\nlocal models, running Whisper locally because you can.\n\n267\n00:12:56,279 --> 00:12:58,200\nAnd I tried that a bunch of times and just\n\n268\n00:12:58,200 --> 00:13:01,054\nnever got results that were as good as the cloud.\n\n269\n00:13:01,455 --> 00:13:03,615\nAnd when I began looking at the cost of the\n\n270\n00:13:03,615 --> 00:13:06,654\nspeech to text APIs and what I was spending, I\n\n271\n00:13:06,654 --> 00:13:09,855\njust thought there is it's actually, in my opinion, just\n\n272\n00:13:09,855 --> 00:13:13,160\none of the better deals in API spending in the\n\n273\n00:13:13,160 --> 00:13:13,480\ncloud.\n\n274\n00:13:13,480 --> 00:13:15,720\nLike, it's just not that expensive for very, very good\n\n275\n00:13:15,720 --> 00:13:19,639\nmodels that are much more, you know, you're gonna be\n\n276\n00:13:19,639 --> 00:13:22,759\nable to run the full model, the latest model versus\n\n277\n00:13:22,759 --> 00:13:26,605\nwhatever you can run on your average GPU unless you\n\n278\n00:13:26,605 --> 00:13:28,845\nwant to buy a crazy GPU.\n\n279\n00:13:28,845 --> 00:13:30,044\nIt doesn't really make sense to me.\n\n280\n00:13:30,044 --> 00:13:33,164\nPrivacy is another concern that I know is kind of\n\n281\n00:13:33,164 --> 00:13:35,325\nlike a very much a separate thing 
that people just\n\n282\n00:13:35,325 --> 00:13:38,845\ndon't want their voice data and their voice leaving their\n\n283\n00:13:38,845 --> 00:13:42,460\nlocal environment maybe for regulatory reasons as well.\n\n284\n00:13:42,700 --> 00:13:43,980\nBut I'm not in that.\n\n285\n00:13:44,220 --> 00:13:48,540\nI neither really care about people listening to my, grocery\n\n286\n00:13:48,540 --> 00:13:51,580\nlist, consisting of, reminding myself that I need to buy\n\n287\n00:13:51,580 --> 00:13:54,779\nmore beer, Cheetos, and hummus, which is kind of the\n\n288\n00:13:55,334 --> 00:13:59,574\nthree staples of my diet during periods of poor nutrition.\n\n289\n00:13:59,894 --> 00:14:02,375\nBut the kind of stuff that I transcribe, it's just\n\n290\n00:14:02,375 --> 00:14:02,694\nnot.\n\n291\n00:14:02,694 --> 00:14:07,814\nIt's not a privacy thing I'm that sort of sensitive\n\n292\n00:14:07,814 --> 00:14:13,269\nabout and I don't do anything so sensitive or secure\n\n293\n00:14:13,269 --> 00:14:14,790\nthat requires air capping.\n\n294\n00:14:15,670 --> 00:14:17,590\nI looked at the pricing and especially the kind of\n\n295\n00:14:17,590 --> 00:14:18,950\nolder model mini.\n\n296\n00:14:19,590 --> 00:14:21,910\nSome of them are very, very affordable and I did\n\n297\n00:14:21,910 --> 00:14:26,764\na calculation once with ChatGPT and I was like, okay,\n\n298\n00:14:26,764 --> 00:14:30,365\nthis is the API price for I can't remember whatever\n\n299\n00:14:30,365 --> 00:14:31,404\nthe model was.\n\n300\n00:14:31,804 --> 00:14:34,445\nLet's say I just go at it like nonstop, which\n\n301\n00:14:34,445 --> 00:14:35,565\nrarely happens.\n\n302\n00:14:35,644 --> 00:14:38,959\nProbably, I would say on average I might dictate thirty\n\n303\n00:14:38,959 --> 00:14:41,759\nto sixty minutes per day if I was probably summing\n\n304\n00:14:41,759 --> 00:14:48,000\nup the emails, documents, outlines, which is a lot, but\n\n305\n00:14:48,000 --> 00:14:50,159\nit's it's still a fairly 
modest amount.\n\n306\n00:14:50,159 --> 00:14:51,839\nAnd I was like, well, some days I do go\n\n307\n00:14:51,839 --> 00:14:54,934\non like one or two days where I've been usually\n\n308\n00:14:54,934 --> 00:14:56,855\nwhen I'm like kind of out of the house and\n\n309\n00:14:56,855 --> 00:15:00,535\njust have something like I have nothing else to do.\n\n310\n00:15:00,535 --> 00:15:03,175\nLike if I'm at a hospital, we have a newborn\n\n311\n00:15:03,575 --> 00:15:07,299\nand you're waiting for like eight hours and hours for\n\n312\n00:15:07,299 --> 00:15:08,100\nan appointment.\n\n313\n00:15:08,179 --> 00:15:12,019\nAnd I would probably have listened to podcasts before becoming\n\n314\n00:15:12,019 --> 00:15:12,980\na speech fanatic.\n\n315\n00:15:12,980 --> 00:15:15,379\nAnd I'm like, Oh, wait, let me just get down.\n\n316\n00:15:15,379 --> 00:15:17,379\nLet me just get these ideas out of my head.\n\n317\n00:15:17,540 --> 00:15:20,745\nAnd that's when I'll go on my speech binges.\n\n318\n00:15:20,745 --> 00:15:22,664\nBut those are like once every few months, like not\n\n319\n00:15:22,664 --> 00:15:23,544\nfrequently.\n\n320\n00:15:23,784 --> 00:15:25,784\nBut I said, okay, let's just say if I'm going\n\n321\n00:15:25,784 --> 00:15:28,184\nto price out cloud STT.\n\n322\n00:15:28,985 --> 00:15:33,500\nIf I was like dedicated every second of every waking\n\n323\n00:15:33,500 --> 00:15:37,820\nhour to transcribing for some odd reason, I mean I'd\n\n324\n00:15:37,820 --> 00:15:39,820\nhave to eat and use the toilet.\n\n325\n00:15:40,540 --> 00:15:42,700\nThere's only so many hours I'm awake for.\n\n326\n00:15:42,700 --> 00:15:47,019\nSo let's just say a maximum of forty five minutes\n\n327\n00:15:47,205 --> 00:15:49,205\nin the hour, then I said, All right, let's just\n\n328\n00:15:49,205 --> 00:15:50,165\nsay fifty.\n\n329\n00:15:50,644 --> 00:15:51,365\nWho knows?\n\n330\n00:15:51,365 --> 00:15:52,804\nYou're dictating on the toilet.\n\n331\n00:15:52,804 --> 
00:15:53,605\nWe do it.\n\n332\n00:15:53,924 --> 00:15:56,884\nSo you could just do sixty, but whatever I did\n\n333\n00:15:57,125 --> 00:16:01,179\nand every day, like you're going flat out seven days\n\n334\n00:16:01,179 --> 00:16:02,620\na week dictating nonstop.\n\n335\n00:16:02,620 --> 00:16:05,579\nI was like, What's my monthly API bill going to\n\n336\n00:16:05,579 --> 00:16:06,700\nbe at this price?\n\n337\n00:16:06,779 --> 00:16:09,339\nAnd it came out to like seventy or eighty bucks.\n\n338\n00:16:09,339 --> 00:16:12,620\nAnd I was like, Well, that would be an extraordinary\n\n339\n00:16:12,940 --> 00:16:14,379\namount of dictation.\n\n340\n00:16:14,379 --> 00:16:18,105\nAnd I would hope that there was some compelling reason\n\n341\n00:16:18,745 --> 00:16:21,784\nworth more than seventy dollars that I embarked upon that\n\n342\n00:16:21,784 --> 00:16:22,424\nproject.\n\n343\n00:16:22,664 --> 00:16:24,585\nSo given that that's kind of the max point for\n\n344\n00:16:24,585 --> 00:16:27,304\nme I said that's actually very very affordable.\n\n345\n00:16:28,024 --> 00:16:30,504\nNow you're gonna if you want to spec out the\n\n346\n00:16:30,504 --> 00:16:33,909\ncosts and you want to do the post processing that\n\n347\n00:16:33,909 --> 00:16:36,789\nI really do feel is valuable, that's going to cost\n\n348\n00:16:36,789 --> 00:16:37,750\nsome more as well.\n\n349\n00:16:38,070 --> 00:16:43,269\nUnless you're using Gemini, which needless to say is a\n\n350\n00:16:43,269 --> 00:16:45,190\nrandom person sitting in Jerusalem.\n\n351\n00:16:45,855 --> 00:16:49,455\nI have no affiliation nor with Google nor Anthropic nor\n\n352\n00:16:49,455 --> 00:16:52,414\nGemini nor any major tech vendor for that matter.\n\n353\n00:16:53,855 --> 00:16:57,215\nI like Gemini not so much as a everyday model.\n\n354\n00:16:57,455 --> 00:16:59,934\nIt's kind of underwhelmed in that respect, I would say.\n\n355\n00:17:00,379 --> 00:17:02,779\nBut for multimodal, I think it's got a lot 
to\n\n356\n00:17:02,779 --> 00:17:03,339\noffer.\n\n357\n00:17:03,659 --> 00:17:07,179\nAnd I think that the transcribing functionality whereby it can,\n\n358\n00:17:08,059 --> 00:17:12,380\nprocess audio with a system prompt and both give you\n\n359\n00:17:12,380 --> 00:17:13,900\ntranscription that's cleaned up.\n\n360\n00:17:13,900 --> 00:17:15,339\nThat reduces two steps to one.\n\n361\n00:17:15,835 --> 00:17:18,954\nAnd that for me is a very, very big deal.\n\n362\n00:17:18,955 --> 00:17:22,474\nAnd I feel like even Google hasn't really sort of\n\n363\n00:17:22,555 --> 00:17:27,195\nthought through how useful the that modality is and what\n\n364\n00:17:27,195 --> 00:17:29,700\nkind of use cases you can achieve with it.\n\n365\n00:17:29,700 --> 00:17:32,339\nBecause I found in the course of this year just\n\n366\n00:17:32,339 --> 00:17:38,019\nan endless list of really kind of system prompt stuff\n\n367\n00:17:38,019 --> 00:17:40,900\nthat I can say, okay, I've used it to capture\n\n368\n00:17:40,900 --> 00:17:44,115\ncontext data for AI, which is literally I might speak\n\n369\n00:17:44,115 --> 00:17:46,755\nfor if I wanted to have a good bank of\n\n370\n00:17:46,755 --> 00:17:50,035\ncontext data about who knows my childhood.\n\n371\n00:17:50,434 --> 00:17:54,355\nMore realistically, maybe my career goals, something that would just\n\n372\n00:17:54,355 --> 00:17:56,195\nbe like really boring to type out.\n\n373\n00:17:56,195 --> 00:18:00,500\nSo I'll just like sit in my car and record\n\n374\n00:18:00,500 --> 00:18:01,460\nit for ten minutes.\n\n375\n00:18:01,460 --> 00:18:03,779\nAnd that ten minutes you get a lot of information\n\n376\n00:18:03,779 --> 00:18:04,419\nin.\n\n377\n00:18:05,619 --> 00:18:07,700\nEmails, which is short text.\n\n378\n00:18:08,660 --> 00:18:10,419\nJust there is a whole bunch.\n\n379\n00:18:10,420 --> 00:18:13,375\nAnd all these workflows kind of require a little bit\n\n380\n00:18:13,375 --> 00:18:15,134\nof treatment afterwards 
and different treatment.\n\n381\n00:18:15,134 --> 00:18:18,414\nMy context pipeline is kind of like just extract the\n\n382\n00:18:18,414 --> 00:18:19,295\nbare essentials.\n\n383\n00:18:19,295 --> 00:18:22,174\nYou end up with me talking very loosely about sort\n\n384\n00:18:22,174 --> 00:18:24,494\nof what I've done in my career, where I've worked,\n\n385\n00:18:24,494 --> 00:18:25,454\nwhere I might like to work.\n\n386\n00:18:26,000 --> 00:18:29,119\nAnd it goes, it condenses that down to very robotic\n\n387\n00:18:29,119 --> 00:18:32,720\nlanguage that is easy to chunk parse and maybe put\n\n388\n00:18:32,720 --> 00:18:34,000\ninto a vector database.\n\n389\n00:18:34,000 --> 00:18:36,240\nDaniel has worked in technology.\n\n390\n00:18:36,240 --> 00:18:39,840\nDaniel has been working in, know, stuff like that.\n\n391\n00:18:39,840 --> 00:18:43,055\nThat's not how you would speak, but I figure it's\n\n392\n00:18:43,055 --> 00:18:46,494\nprobably easier to parse for, after all, robots.\n\n393\n00:18:46,815 --> 00:18:48,734\nSo we've almost got to twenty minutes and this is\n\n394\n00:18:48,734 --> 00:18:53,134\nactually a success because I wasted twenty minutes of my\n\n395\n00:18:53,535 --> 00:18:57,200\nof the evening speaking into you in microphone and the\n\n396\n00:18:57,200 --> 00:19:01,119\nlevels were shot and was clipping and I said I\n\n397\n00:19:01,119 --> 00:19:02,400\ncan't really do an evaluation.\n\n398\n00:19:02,400 --> 00:19:03,440\nI have to be fair.\n\n399\n00:19:03,440 --> 00:19:06,400\nI have to give the models a chance to do\n\n400\n00:19:06,400 --> 00:19:06,960\ntheir thing.\n\n401\n00:19:07,505 --> 00:19:09,585\nWhat am I hoping to achieve in this?\n\n402\n00:19:09,585 --> 00:19:11,664\nOkay, my fine tune was a dud as mentioned.\n\n403\n00:19:11,745 --> 00:19:15,265\nDeepgram STT, I'm really, really hopeful that this prototype will\n\n404\n00:19:15,265 --> 00:19:18,065\nwork and it's a build in public open source so\n\n405\n00:19:18,065 
--> 00:19:20,384\nanyone is welcome to use it if I make anything\n\n406\n00:19:20,384 --> 00:19:20,705\ngood.\n\n407\n00:19:21,640 --> 00:19:23,880\nBut that was really exciting for me last night when\n\n408\n00:19:23,880 --> 00:19:28,920\nafter hours of trying my own prototype, seeing someone just\n\n409\n00:19:28,920 --> 00:19:32,119\nmade something that works like that, you you're not gonna\n\n410\n00:19:32,119 --> 00:19:36,454\nhave to build a custom conda environment and image.\n\n411\n00:19:36,454 --> 00:19:40,054\nI have AMD GPU which makes things much more complicated.\n\n412\n00:19:40,294 --> 00:19:42,694\nI didn't find it and I was about to give\n\n413\n00:19:42,694 --> 00:19:43,974\nup and I said, All right, let me just give\n\n414\n00:19:43,974 --> 00:19:46,535\nDeepgram's Linux thing a shot.\n\n415\n00:19:47,109 --> 00:19:49,669\nAnd if this doesn't work, I'm just gonna go back\n\n416\n00:19:49,669 --> 00:19:51,429\nto trying to vibe code something myself.\n\n417\n00:19:51,750 --> 00:19:55,589\nAnd when I ran the script, I was using Cloud\n\n418\n00:19:55,589 --> 00:19:59,109\nCode to do the installation process, it ran the script\n\n419\n00:19:59,109 --> 00:20:01,269\nand, oh my gosh, it works just like that.\n\n420\n00:20:01,904 --> 00:20:06,065\nThe tricky thing for all those who wants to know\n\n421\n00:20:06,065 --> 00:20:11,505\nall the nitty, ditty, nitty gritty details was that I\n\n422\n00:20:11,505 --> 00:20:14,704\ndon't think it was actually struggling with transcription, but pasting\n\n423\n00:20:14,785 --> 00:20:17,619\nWeyland makes life very hard.\n\n424\n00:20:17,619 --> 00:20:19,220\nAnd I think there was something not running at the\n\n425\n00:20:19,220 --> 00:20:19,779\nright time.\n\n426\n00:20:19,779 --> 00:20:23,059\nAnyway, Deepgram, I looked at how they actually handle that\n\n427\n00:20:23,059 --> 00:20:25,220\nbecause it worked out of the box when other stuff\n\n428\n00:20:25,220 --> 00:20:25,859\ndidn't.\n\n429\n00:20:26,180 
--> 00:20:28,980\nAnd it was quite a clever little mechanism.\n\n430\n00:20:29,575 --> 00:20:32,215\nAnd but more so than that, the accuracy was brilliant.\n\n431\n00:20:32,215 --> 00:20:33,654\nNow what am I what am I doing here?\n\n432\n00:20:33,654 --> 00:20:37,255\nThis is gonna be a twenty minute audio sample.\n\n433\n00:20:38,455 --> 00:20:42,490\nAnd I'm I think I've done one or two of\n\n434\n00:20:42,490 --> 00:20:47,210\nthese before, but I did it with short, snappy voice\n\n435\n00:20:47,210 --> 00:20:47,690\nnotes.\n\n436\n00:20:47,690 --> 00:20:49,450\nThis is kind of long form.\n\n437\n00:20:49,529 --> 00:20:52,009\nThis actually might be a better approximation for what's useful\n\n438\n00:20:52,009 --> 00:20:53,929\nto me than voice memos.\n\n439\n00:20:53,929 --> 00:20:56,974\nLike, I need to buy three liters of milk tomorrow\n\n440\n00:20:56,974 --> 00:21:00,255\nand peter bread, which is probably how half my voice\n\n441\n00:21:00,255 --> 00:21:00,815\nnotes sound.\n\n442\n00:21:00,815 --> 00:21:04,174\nLike if anyone were to find my phone they'd be\n\n443\n00:21:04,174 --> 00:21:06,014\nlike this is the most boring person in the world.\n\n444\n00:21:06,095 --> 00:21:10,130\nAlthough actually there are some journaling thoughts as well, but\n\n445\n00:21:10,130 --> 00:21:11,890\nit's a lot of content like that.\n\n446\n00:21:11,890 --> 00:21:14,690\nAnd the probably for the evaluation, the most useful thing\n\n447\n00:21:14,690 --> 00:21:21,914\nis slightly obscure tech, GitHub, Nucleano, hugging face, not so\n\n448\n00:21:21,914 --> 00:21:24,554\nobscure that it's not gonna have a chance of knowing\n\n449\n00:21:24,554 --> 00:21:27,274\nit, but hopefully sufficiently well known that the model should\n\n450\n00:21:27,274 --> 00:21:27,914\nget it.\n\n451\n00:21:27,994 --> 00:21:30,075\nI tried to do a little bit of speaking really\n\n452\n00:21:30,075 --> 00:21:32,474\nfast and speaking very slowly.\n\n453\n00:21:32,474 --> 00:21:35,609\nWould say 
in general, I've spoken, delivered this at a\n\n454\n00:21:35,609 --> 00:21:39,210\nfaster pace than I usually would owing to strong coffee\n\n455\n00:21:39,210 --> 00:21:40,650\nflowing through my bloodstream.\n\n456\n00:21:41,210 --> 00:21:43,609\nAnd the thing that I'm not gonna get in this\n\n457\n00:21:43,609 --> 00:21:46,170\nbenchmark is background noise, which in my first take that\n\n458\n00:21:46,170 --> 00:21:48,535\nI had to get rid of, my wife came in\n\n459\n00:21:48,535 --> 00:21:51,575\nwith my son and for a good night kiss.\n\n460\n00:21:51,654 --> 00:21:55,174\nAnd that actually would have been super helpful to get\n\n461\n00:21:55,174 --> 00:21:57,894\nin because it was non diarized or if we had\n\n462\n00:21:57,894 --> 00:21:58,775\ndiarization.\n\n463\n00:21:59,414 --> 00:22:01,494\nA female, I could say, I want the male voice\n\n464\n00:22:01,494 --> 00:22:03,174\nand that wasn't intended for transcription.\n\n465\n00:22:04,589 --> 00:22:06,349\nAnd we're not going to get background noise like people\n\n466\n00:22:06,349 --> 00:22:09,069\nhonking their horns, which is something I've done in my\n\n467\n00:22:09,230 --> 00:22:11,950\nmain data set where I am trying to go back\n\n468\n00:22:11,950 --> 00:22:15,150\nto some of my voice notes, annotate them and run\n\n469\n00:22:15,150 --> 00:22:15,789\na benchmark.\n\n470\n00:22:15,789 --> 00:22:18,345\nBut this is going to be just a pure quick\n\n471\n00:22:18,345 --> 00:22:19,144\ntest.\n\n472\n00:22:19,865 --> 00:22:24,105\nAnd as someone I'm working on a voice note idea.\n\n473\n00:22:24,105 --> 00:22:28,265\nThat's my sort of end motivation besides thinking it's an\n\n474\n00:22:28,265 --> 00:22:31,865\nabsolutely outstanding technology that's coming to viability.\n\n475\n00:22:31,865 --> 00:22:34,480\nAnd really, I know this sounds cheesy, can actually have\n\n476\n00:22:34,480 --> 00:22:36,559\na very transformative effect.\n\n477\n00:22:38,000 --> 00:22:43,200\nVoice technology has been 
life changing for folks living with\n\n478\n00:22:44,079 --> 00:22:45,119\ndisabilities.\n\n479\n00:22:46,000 --> 00:22:48,625\nAnd I think there's something really nice about the fact\n\n480\n00:22:48,625 --> 00:22:52,625\nthat it can also benefit folks who are able-bodied and\n\n481\n00:22:52,625 --> 00:22:57,984\nwe can all in different ways make this tech as\n\n482\n00:22:57,984 --> 00:23:00,785\nuseful as possible regardless of the exact way that we're\n\n483\n00:23:00,785 --> 00:23:01,105\nusing it.\n\n484\n00:23:02,279 --> 00:23:04,519\nAnd I think there's something very powerful in that, and\n\n485\n00:23:04,519 --> 00:23:05,639\nit can be very cool.\n\n486\n00:23:06,200 --> 00:23:07,639\nI see huge potential.\n\n487\n00:23:07,639 --> 00:23:09,399\nWhat excites me about voice tech?\n\n488\n00:23:09,799 --> 00:23:11,239\nA lot of things actually.\n\n489\n00:23:12,200 --> 00:23:14,919\nFirstly, the fact that it's cheap and accurate, as I\n\n490\n00:23:14,919 --> 00:23:17,865\nmentioned at the very start of this, and it's getting\n\n491\n00:23:17,865 --> 00:23:20,184\nbetter and better with stuff like accent handling.\n\n492\n00:23:20,825 --> 00:23:23,384\nI'm not sure my fine tune will actually ever come\n\n493\n00:23:23,384 --> 00:23:25,305\nto fruition in the sense that I'll use it day\n\n494\n00:23:25,305 --> 00:23:26,664\nto day as I imagine.\n\n495\n00:23:26,744 --> 00:23:30,585\nI get like superb, flawless words error rates because I'm\n\n496\n00:23:30,585 --> 00:23:35,029\njust kind of skeptical about local speech to text, as\n\n497\n00:23:35,029 --> 00:23:35,750\nI mentioned.\n\n498\n00:23:36,150 --> 00:23:39,910\nAnd I think the pace of innovation and improvement in\n\n499\n00:23:39,910 --> 00:23:42,390\nthe models, the main reasons for fine tuning from what\n\n500\n00:23:42,390 --> 00:23:46,230\nI've seen have been people who are something that really\n\n501\n00:23:46,230 --> 00:23:50,455\nblows blows my mind about ASR is the idea 
that\n\n502\n00:23:50,455 --> 00:23:55,654\nit's inherently ailingual or multilingual, phonetic based.\n\n503\n00:23:56,375 --> 00:24:00,455\nSo as folks who use speak very obscure languages that\n\n504\n00:24:00,455 --> 00:24:03,174\nthere may be very there might be a paucity of\n\n505\n00:24:02,309 --> 00:24:05,110\ntraining data or almost none at all, and therefore the\n\n506\n00:24:05,110 --> 00:24:06,870\naccuracy is significantly reduced.\n\n507\n00:24:06,870 --> 00:24:11,430\nOr folks in very critical environments, I know there are\n\n508\n00:24:11,590 --> 00:24:15,430\nthis is used extensively in medical transcription and dispatcher work\n\n509\n00:24:15,430 --> 00:24:19,144\nas, you know the call centers who send out ambulances\n\n510\n00:24:19,144 --> 00:24:19,944\netc.\n\n511\n00:24:20,345 --> 00:24:23,625\nWhere accuracy is absolutely paramount and in the case of\n\n512\n00:24:23,625 --> 00:24:27,625\ndoctors radiologists they might be using very specialized vocab all\n\n513\n00:24:27,625 --> 00:24:27,945\nthe time.\n\n514\n00:24:28,710 --> 00:24:30,309\nSo those are kind of the main two things, and\n\n515\n00:24:30,309 --> 00:24:32,230\nI'm not sure that really just for trying to make\n\n516\n00:24:32,230 --> 00:24:36,470\nit better on a few random tech words with my\n\n517\n00:24:36,470 --> 00:24:39,509\nslightly I mean, I have an accent, but, like, not,\n\n518\n00:24:39,509 --> 00:24:42,549\nyou know, an accent that a few other million people\n\n519\n00:24:42,950 --> 00:24:43,990\nhave ish.\n\n520\n00:24:44,765 --> 00:24:48,045\nI'm not sure that my little fine tune is gonna\n\n521\n00:24:48,045 --> 00:24:52,684\nactually like, the bump in word error reduction, if I\n\n522\n00:24:52,684 --> 00:24:54,285\never actually figure out how to do it and get\n\n523\n00:24:54,285 --> 00:24:56,445\nit up to the cloud, by the time we've done\n\n524\n00:24:56,445 --> 00:25:00,039\nthat, I suspect that the next generation of ASR will\n\n525\n00:25:00,039 --> 
00:25:01,799\njust be so good that it will kind of be,\n\n526\n00:25:02,039 --> 00:25:04,039\nwell, that would have been cool if it worked out,\n\n527\n00:25:04,039 --> 00:25:05,559\nbut I'll just use this instead.\n\n528\n00:25:05,799 --> 00:25:10,759\nSo that's gonna be it for today's episode of voice\n\n529\n00:25:10,759 --> 00:25:11,720\ntraining data.\n\n530\n00:25:11,960 --> 00:25:14,335\nSingle, long shot evaluation.\n\n531\n00:25:14,575 --> 00:25:15,774\nWho am I gonna compare?\n\n532\n00:25:16,494 --> 00:25:18,654\nWhisper is always good as a benchmark, but I'm more\n\n533\n00:25:18,654 --> 00:25:22,255\ninterested in seeing Whisper head to head with two things\n\n534\n00:25:22,255 --> 00:25:22,974\nreally.\n\n535\n00:25:23,375 --> 00:25:25,214\nOne is Whisper variants.\n\n536\n00:25:25,214 --> 00:25:27,775\nSo you've got these projects like Faster Whisper.\n\n537\n00:25:29,190 --> 00:25:30,069\nDistill Whisper.\n\n538\n00:25:30,069 --> 00:25:30,789\nIt's a bit confusing.\n\n539\n00:25:30,789 --> 00:25:31,989\nThere's a whole bunch of them.\n\n540\n00:25:32,230 --> 00:25:35,190\nAnd the emerging ASRs, which are also a thing.\n\n541\n00:25:35,349 --> 00:25:37,190\nMy intention for this is I'm not sure I'm gonna\n\n542\n00:25:37,190 --> 00:25:39,990\nhave the time in any point in the foreseeable future\n\n543\n00:25:39,990 --> 00:25:44,855\nto go back to this whole episode and create a\n\n544\n00:25:44,855 --> 00:25:48,374\nproper source truth where I fix everything.\n\n545\n00:25:49,335 --> 00:25:51,974\nMight do it if I can get one transcription that's\n\n546\n00:25:51,974 --> 00:25:54,214\nsufficiently close to perfection.\n\n547\n00:25:55,014 --> 00:25:58,480\nBut what I would actually love to do on Hugging\n\n548\n00:25:58,480 --> 00:26:00,559\nFace, I think would be a great probably how I\n\n549\n00:26:00,559 --> 00:26:04,480\nmight visualize this is having the audio waveform play and\n\n550\n00:26:04,480 --> 00:26:08,960\nthen have the transcript for 
each model below it and\n\n551\n00:26:08,960 --> 00:26:13,845\nmaybe even a, like, you know, to scale and maybe\n\n552\n00:26:13,845 --> 00:26:16,724\neven a local one as well, like local whisper versus\n\n553\n00:26:16,724 --> 00:26:19,764\nOpenAI API, etcetera.\n\n554\n00:26:19,845 --> 00:26:23,204\nAnd I can then actually listen back to segments or\n\n555\n00:26:23,204 --> 00:26:25,365\nanyone who wants to can listen back to segments of\n\n556\n00:26:25,365 --> 00:26:30,299\nthis recording and see where a particular model struggled and\n\n557\n00:26:30,299 --> 00:26:33,179\nothers didn't as well as the sort of headline finding\n\n558\n00:26:33,179 --> 00:26:35,659\nof which had the best W E R but that\n\n559\n00:26:35,659 --> 00:26:37,739\nwould require the source of truth.\n\n560\n00:26:37,740 --> 00:26:38,539\nOkay, that's it.\n\n561\n00:26:38,505 --> 00:26:41,065\nI hope this was, I don't know, maybe useful for\n\n562\n00:26:41,065 --> 00:26:42,984\nother folks interested in STT.\n\n563\n00:26:43,065 --> 00:26:46,025\nYou want to see I always think I've just said\n\n564\n00:26:46,025 --> 00:26:47,704\nit as something I didn't intend to.\n\n565\n00:26:47,944 --> 00:26:49,704\nSTT, I said for those.\n\n566\n00:26:49,704 --> 00:26:53,129\nListen carefully, including hopefully the models themselves.\n\n567\n00:26:53,369 --> 00:26:55,129\nThis has been myself, Daniel Rosol.\n\n568\n00:26:55,129 --> 00:26:59,450\nFor more jumbled repositories about my roving interest in AI\n\n569\n00:26:59,450 --> 00:27:04,089\nbut particularly AgenTic, MCP and VoiceTech you can find me\n\n570\n00:27:04,089 --> 00:27:05,769\non GitHub.\n\n571\n00:27:06,009 --> 00:27:06,730\nHugging Face.\n\n572\n00:27:08,125 --> 00:27:09,004\nWhere else?\n\n573\n00:27:09,005 --> 00:27:11,805\nDanielRosel dot com, which is my personal website, as well\n\n574\n00:27:11,805 --> 00:27:15,565\nas this podcast whose name I sadly cannot remember.\n\n575\n00:27:15,724 --> 00:27:16,765\nUntil next 
time.\n\n576\n00:27:16,765 --> 00:27:17,404\nThanks for listening.\n\n", "speechmatics": "1\n00:00:00,120 --> 00:00:06,520\nHello and welcome to a audio data\nset consisting of one single\n\n2\n00:00:06,520 --> 00:00:12,120\nepisode of a non-existent podcast.\nOr it, uh, I may append this to a\n\n3\n00:00:12,120 --> 00:00:16,640\npodcast that I set up recently.\nUm, regarding my, uh,\n\n4\n00:00:16,680 --> 00:00:21,960\nwith my thoughts on speech,\ntech and AI in particular,\n\n5\n00:00:22,240 --> 00:00:27,960\nmore AI and generative AI, I would,\nuh, I would say, but in any event,\n\n6\n00:00:27,960 --> 00:00:32,480\nthe purpose of this, um,\nvoice recording is actually to create\n\n7\n00:00:32,680 --> 00:00:37,560\na lengthy voice sample for a quick\nevaluation, a back of the envelope\n\n8\n00:00:37,560 --> 00:00:41,160\nevaluation, as they might say,\nfor different speech to text models.\n\n9\n00:00:41,160 --> 00:00:43,800\nAnd I'm doing this because I,\nuh, I thought I'd made a great\n\n10\n00:00:43,800 --> 00:00:48,320\nbreakthrough in my journey with\nspeech tech, and that was succeeding\n\n11\n00:00:48,320 --> 00:00:52,720\nin the elusive task of fine tuning.\nWhisper, whisper is.\n\n12\n00:00:52,840 --> 00:00:56,960\nAnd I'm going to just talk.\nI'm trying to mix up, uh,\n\n13\n00:00:56,960 --> 00:01:00,470\nI'm going to try a few different\nstyles of speaking.\n\n14\n00:01:00,470 --> 00:01:02,630\nI might whisper something at\nsome point as well,\n\n15\n00:01:03,190 --> 00:01:07,150\nand I'll go back to speaking loud in,\nuh, in different parts.\n\n16\n00:01:07,150 --> 00:01:09,710\nI'm going to sound really like a\ncrazy person, because I'm also\n\n17\n00:01:09,710 --> 00:01:15,870\ngoing to try to speak at different\npitches and cadences in order to\n\n18\n00:01:15,910 --> 00:01:20,630\nreally try to put a speech to\ntext model through its paces,\n\n19\n00:01:20,630 --> 00:01:25,870\nwhich is trying to make sense of,\nis this guy just on incoherently 
in\n\n20\n00:01:25,870 --> 00:01:34,350\none long sentence, or are these just\nactually a series of step standalone,\n\n21\n00:01:34,350 --> 00:01:37,510\nstandalone, standalone sentences?\nAnd how is it going to handle\n\n22\n00:01:37,510 --> 00:01:40,750\nstep alone? That's not a word.\nUh, what happens when you use\n\n23\n00:01:40,750 --> 00:01:44,030\nspeech to text and you use a fake\nword and then you're like, wait,\n\n24\n00:01:44,030 --> 00:01:48,350\nthat's not actually that word doesn't\nexist. How does AI handle that?\n\n25\n00:01:48,390 --> 00:01:53,910\nAnd, uh, these and more are all\nthe questions that I'm seeking\n\n26\n00:01:53,910 --> 00:01:57,350\nto answer in this training data.\nNow, why did why was it trying\n\n27\n00:01:57,350 --> 00:01:59,740\nto fine tune a whisper?\nAnd what is whisper?\n\n28\n00:01:59,780 --> 00:02:03,540\nAs I said, I'm gonna try to, uh,\nrecord this at a couple of different\n\n29\n00:02:03,540 --> 00:02:09,060\nlevels of technicality for folks who\nare, uh, you know, in the normal, uh,\n\n30\n00:02:09,060 --> 00:02:13,460\nworld and not totally stuck down\nthe rabbit hole of AI, uh, which I\n\n31\n00:02:13,460 --> 00:02:17,460\nhave to say is a really wonderful,\nuh, rabbit hole to be to be down.\n\n32\n00:02:17,580 --> 00:02:21,700\nUm, it's a really interesting area.\nAnd speech and voice tech is is\n\n33\n00:02:21,940 --> 00:02:24,980\nthe aspect of it that I find\nactually most.\n\n34\n00:02:25,180 --> 00:02:28,340\nI'm not sure I would say the most\ninteresting, because there's just\n\n35\n00:02:28,340 --> 00:02:32,700\nso much that is fascinating in AI.\nUh, but the most that I find the\n\n36\n00:02:32,700 --> 00:02:36,220\nmost personally transformative\nin terms of the impact that it's\n\n37\n00:02:36,220 --> 00:02:41,660\nhad on my daily work life and\nproductivity and how I sort of work.\n\n38\n00:02:41,940 --> 00:02:48,020\nAnd I'm persevering hard with the\ntask of trying to guess a good\n\n39\n00:02:48,020 
--> 00:02:51,700\nsolution working for Linux, which if\nanyone actually does listen to this,\n\n40\n00:02:51,700 --> 00:02:55,100\nnot just for the training data\nand for the actual content, uh,\n\n41\n00:02:55,140 --> 00:02:59,600\nthis is this is has sparked I had\nbesides the fine tune not working.\n\n42\n00:02:59,600 --> 00:03:05,560\nWell, that was the failure.\nUm, I used clod code because one\n\n43\n00:03:05,560 --> 00:03:10,160\nthinks these days that there is\nnothing short of solving,\n\n44\n00:03:11,040 --> 00:03:14,680\nyou know, the, uh,\nthe reason of life or something.\n\n45\n00:03:15,080 --> 00:03:19,560\nUh, that clod and agentic AI can't\ndo, uh, which is not really the case.\n\n46\n00:03:19,600 --> 00:03:23,600\nUh, it does seem that way sometimes,\nbut it fails a lot as well.\n\n47\n00:03:23,600 --> 00:03:26,960\nAnd this is one of those, uh,\ninstances where last week I put\n\n48\n00:03:26,960 --> 00:03:31,400\ntogether an hour of voice training\ndata, basically speaking just\n\n49\n00:03:31,400 --> 00:03:35,040\nrandom things for three minutes.\nAnd, um,\n\n50\n00:03:35,720 --> 00:03:38,520\nit was actually kind of tedious\nbecause the texts were really weird.\n\n51\n00:03:38,520 --> 00:03:42,120\nSome of them were it was like it\nwas AI generated.\n\n52\n00:03:42,320 --> 00:03:44,920\nUm, I tried before to read\nSherlock Holmes for an hour and\n\n53\n00:03:44,920 --> 00:03:47,000\nI just couldn't.\nI was so bored, uh,\n\n54\n00:03:47,040 --> 00:03:50,800\nafter ten minutes that I was like,\nokay, now I'm just gonna have to\n\n55\n00:03:50,800 --> 00:03:56,470\nfind something else to read.\nSo I used a created with AI\n\n56\n00:03:56,510 --> 00:04:00,150\nstudio vibe coded.\nA synthetic text generator.\n\n57\n00:04:00,390 --> 00:04:03,990\nUm, which actually I thought was\nprobably a better way of doing it\n\n58\n00:04:03,990 --> 00:04:08,870\nbecause it would give me more short\nsamples with more varied content.\n\n59\n00:04:08,870 --> 
00:04:13,310\nSo I was like, okay, give me a voice\nnote, like I'm recording an email,\n\n60\n00:04:13,310 --> 00:04:18,110\ngive me a short story to read,\ngive me prose, um, to read.\n\n61\n00:04:18,110 --> 00:04:21,310\nSo I came up with all these\ndifferent things, and I added a\n\n62\n00:04:21,310 --> 00:04:24,750\nlittle timer to it so I could\nsee how close I was to one hour.\n\n63\n00:04:24,990 --> 00:04:29,830\nUm, and, uh, I spent like an hour one\nafternoon or probably two hours by\n\n64\n00:04:29,830 --> 00:04:34,190\nthe time you, um, you do retakes\nor whatever because you want to.\n\n65\n00:04:34,990 --> 00:04:39,190\nIt gave me a source of truth,\nwhich I'm not sure if that's the\n\n66\n00:04:39,190 --> 00:04:43,550\nscientific way to approach this topic\nof gathering, uh, training data,\n\n67\n00:04:43,550 --> 00:04:48,070\nbut I thought it made sense.\nUm, I have a lot of audio data\n\n68\n00:04:48,070 --> 00:04:52,070\nfrom recording voice notes,\nwhich I've also kind of used, um,\n\n69\n00:04:52,070 --> 00:04:55,780\nbeen experimenting with using for\na different purpose, slightly\n\n70\n00:04:55,780 --> 00:05:00,820\ndifferent annotating task types.\nIt's more text classification\n\n71\n00:05:00,820 --> 00:05:03,740\nexperiment or uh, well,\nit's more than that, actually.\n\n72\n00:05:03,740 --> 00:05:08,100\nI'm working on a voice app,\nso it's a prototype I guess is\n\n73\n00:05:08,100 --> 00:05:12,780\nreally more accurate.\nUm, but you can do that and you\n\n74\n00:05:12,780 --> 00:05:14,220\ncan work backwards.\nYou're like,\n\n75\n00:05:14,260 --> 00:05:18,620\nyou listen back to a voice note\nand you painfully go through one\n\n76\n00:05:18,620 --> 00:05:21,980\nof those transcribing, you know,\nwhere you start and stop and scrub\n\n77\n00:05:21,980 --> 00:05:24,100\naround it and you fix the errors.\nBut it's really,\n\n78\n00:05:24,100 --> 00:05:27,220\nreally boring to do that.\nSo I thought it would be less\n\n79\n00:05:27,220 --> 
00:05:31,860\ntedious in the long term if I just\nrecorded The Source of truth.\n\n80\n00:05:32,180 --> 00:05:34,300\nSo it gave me these three minute\nsnippets.\n\n81\n00:05:34,300 --> 00:05:38,780\nI recorded them and saved an MP3\nand a txt in the same folder,\n\n82\n00:05:38,780 --> 00:05:43,820\nand I created an hour of that data.\nUh, so I was very hopeful, quietly,\n\n83\n00:05:43,860 --> 00:05:46,380\nyou know, a little bit hopeful\nthat I would be able that I could\n\n84\n00:05:46,380 --> 00:05:49,700\nactually fine tune, whisper.\nUm, I want to fine tune whisper\n\n85\n00:05:49,700 --> 00:05:54,840\nbecause when I got into voice tech\nlast November, my wife was in\n\n86\n00:05:54,840 --> 00:05:59,600\nthe US and I was alone at home.\nAnd you know, when crazy people\n\n87\n00:05:59,600 --> 00:06:03,760\nlike me do really wild things like\nuse voice to tech, uh, technology.\n\n88\n00:06:03,760 --> 00:06:06,520\nThat was basically, um,\nwhen I started doing it,\n\n89\n00:06:06,520 --> 00:06:10,280\nI didn't feel like a crazy person\nspeaking to myself, and my\n\n90\n00:06:10,280 --> 00:06:16,120\nexpectations weren't that high.\nUh, I used speech tech now and again.\n\n91\n00:06:16,200 --> 00:06:18,480\nUm, tried it out.\nI was like, it'd be really cool\n\n92\n00:06:18,480 --> 00:06:20,520\nif you could just, like,\nspeak into your computer.\n\n93\n00:06:20,880 --> 00:06:24,720\nAnd whatever I tried out that\nhad Linux support was just.\n\n94\n00:06:25,440 --> 00:06:28,640\nIt was not good, basically.\nUm, and this blew me away from\n\n95\n00:06:28,640 --> 00:06:32,040\nthe first go.\nI mean, it wasn't 100% accurate\n\n96\n00:06:32,080 --> 00:06:35,160\nout of the box and it took work,\nbut it was good enough that there was\n\n97\n00:06:35,160 --> 00:06:39,720\na solid foundation and it kind of\npassed that, uh, pivot point that\n\n98\n00:06:39,720 --> 00:06:42,880\nit's actually worth doing this.\nYou know, there's a point where\n\n99\n00:06:42,880 --> 
00:06:46,920\nit's so like the transcript is you\ndon't have to get 100% accuracy\n\n100\n00:06:46,920 --> 00:06:50,630\nfor it to be worth your time for\nspeech to text to be a worthwhile\n\n101\n00:06:50,630 --> 00:06:53,070\naddition to your productivity.\nBut you do need to get above.\n\n102\n00:06:53,110 --> 00:06:57,750\nLet's say, I don't know, 85%.\nIf it's 60% or 50%,\n\n103\n00:06:57,750 --> 00:07:00,790\nyou inevitably say, screw it.\nI'll just type it because you end up\n\n104\n00:07:00,790 --> 00:07:05,070\nmissing errors in the transcript\nand it becomes actually worse.\n\n105\n00:07:05,070 --> 00:07:06,830\nYou end up in a worse position\nthan you started with.\n\n106\n00:07:06,830 --> 00:07:11,030\nAnd that's been my experience.\nSo, um, I was like, oh,\n\n107\n00:07:11,070 --> 00:07:13,550\nthis is actually really, really good.\nNow how did that happen?\n\n108\n00:07:13,550 --> 00:07:18,910\nAnd the answer is ASR whisper\nbeing open sourced and the\n\n109\n00:07:18,910 --> 00:07:21,910\ntransformer architecture,\nif you want to go back to the,\n\n110\n00:07:22,510 --> 00:07:26,750\num, to the underpinnings, which\nreally blows my mind and it's on my\n\n111\n00:07:26,750 --> 00:07:32,430\nlist to read through that paper.\nUm, all you need is attention as\n\n112\n00:07:33,470 --> 00:07:38,470\nattentively as can be done with my\nlimited brain because it's super,\n\n113\n00:07:38,470 --> 00:07:42,310\nsuper high level stuff.\nUm, super advanced stuff.\n\n114\n00:07:42,350 --> 00:07:48,070\nI mean, uh, but that I think of all\nthe things that are fascinating\n\n115\n00:07:48,180 --> 00:07:52,820\nabout the sudden rise in AI and\nthe dramatic capabilities.\n\n116\n00:07:53,420 --> 00:07:55,700\nI find it fascinating that few\npeople are like, hang on,\n\n117\n00:07:55,860 --> 00:07:59,740\nyou've got this thing that can speak\nto you like a chatbot, an LLM,\n\n118\n00:08:00,420 --> 00:08:05,580\nand then you've got image generation.\nOkay, so firstly, 
those two things on\n\n119\n00:08:05,580 --> 00:08:10,860\nthe surface have nothing in common.\nUm, so like how are they how did that\n\n120\n00:08:10,860 --> 00:08:13,100\njust happen all at the same time.\nAnd then when you extend that\n\n121\n00:08:13,100 --> 00:08:16,180\nfurther, um, you're like sooner,\nright?\n\n122\n00:08:16,180 --> 00:08:21,700\nYou can sing a song and AI will like,\ncome up with an instrumental and then\n\n123\n00:08:21,700 --> 00:08:23,860\nyou've got whisper and you're like,\nwait a second,\n\n124\n00:08:24,060 --> 00:08:28,100\nhow did all this stuff, like,\nif it's all AI, what's like there\n\n125\n00:08:28,100 --> 00:08:30,700\nhas to be some commonality.\nOtherwise these are four.\n\n126\n00:08:30,780 --> 00:08:34,780\nThese are totally different\ntechnologies on the surface of it.\n\n127\n00:08:34,780 --> 00:08:40,220\nAnd, uh, the transformer architecture\nis, as far as I know, the answer.\n\n128\n00:08:40,220 --> 00:08:43,860\nAnd I can't even say can't even\npretend that I really understand\n\n129\n00:08:44,140 --> 00:08:47,290\nwhat the transformer\narchitecture means in depth,\n\n130\n00:08:47,290 --> 00:08:51,810\nbut I have scanned it and as I said,\nI want to print it and really kind\n\n131\n00:08:51,810 --> 00:08:56,770\nof think over it at some point,\nand I'll probably feel bad about\n\n132\n00:08:56,770 --> 00:08:59,090\nmyself, I think,\nbecause weren't those guys in their\n\n133\n00:08:59,130 --> 00:09:04,010\nin their 20s like, that's crazy.\nI think I asked ChatGPT once who\n\n134\n00:09:04,050 --> 00:09:08,370\nwere the who wrote that paper\nand how old were they when it\n\n135\n00:09:08,370 --> 00:09:11,290\nwas published in arXiv?\nAnd I was expecting like,\n\n136\n00:09:11,530 --> 00:09:13,450\nI don't know,\nwhat do you what do you imagine?\n\n137\n00:09:13,450 --> 00:09:15,050\nI personally imagine kind of like,\nyou know,\n\n138\n00:09:15,090 --> 00:09:19,210\nyou have these breakthroughs during\nCovid and 
things like that where\n\n139\n00:09:19,250 --> 00:09:22,210\nlike these kind of really obscure\nscientists who are like in their\n\n140\n00:09:22,210 --> 00:09:27,250\n50s and they've just kind of been\nlaboring in labs and, uh, wearily\n\n141\n00:09:27,250 --> 00:09:30,650\nand writing in publishing in kind\nof obscure academic publications.\n\n142\n00:09:30,850 --> 00:09:34,050\nAnd they finally, like,\nhit a big or win a Nobel Prize and\n\n143\n00:09:34,050 --> 00:09:37,930\nthen their household household names.\nUh, so that was kind of what I\n\n144\n00:09:37,930 --> 00:09:39,770\nhad in mind.\nThat was the mental image I'd\n\n145\n00:09:39,770 --> 00:09:44,010\nformed of the birth of arXiv.\nLike, I wasn't expecting 20\n\n146\n00:09:44,050 --> 00:09:47,430\nsomethings in San Francisco,\nthough I thought that was both very,\n\n147\n00:09:47,430 --> 00:09:49,990\nvery funny, very cool,\nand actually kind of inspiring.\n\n148\n00:09:50,510 --> 00:09:55,630\nIt's nice to think that people who,\nyou know, just you might put them\n\n149\n00:09:55,630 --> 00:10:01,030\nin the kind of milieu or bubble or\nworld that you are in or credibly in,\n\n150\n00:10:01,070 --> 00:10:03,710\nthrough, you know,\na series of connections that are\n\n151\n00:10:03,710 --> 00:10:07,750\ncoming up with such literally\nworld changing, um, innovations.\n\n152\n00:10:07,790 --> 00:10:11,550\nUh, so that was, I thought,\nanyway, that, that that was cool.\n\n153\n00:10:12,190 --> 00:10:14,070\nOkay. 
Voice training data.\nHow are we doing?\n\n154\n00:10:14,070 --> 00:10:18,110\nWe're about ten minutes, and I'm\nstill talking about voice technology.\n\n155\n00:10:18,310 --> 00:10:22,470\nUm, so whisper was brilliant,\nand I was so excited that I was.\n\n156\n00:10:22,470 --> 00:10:25,750\nMy first instinct was to, like,\nget like, oh, my gosh,\n\n157\n00:10:25,750 --> 00:10:27,830\nI have to get, like,\na really good microphone for this.\n\n158\n00:10:28,070 --> 00:10:31,750\nSo, um, I didn't go on a\nspending spree because I said,\n\n159\n00:10:31,790 --> 00:10:34,590\nI'm gonna have to just wait a\nmonth and see if I still use this.\n\n160\n00:10:35,030 --> 00:10:40,110\nAnd it just kind of became it's\nbecome really part of my daily\n\n161\n00:10:40,110 --> 00:10:43,110\nroutine.\nLike, if I'm writing an email,\n\n162\n00:10:43,110 --> 00:10:47,140\nI'll record a voice note.\nAnd then I've developed and it's\n\n163\n00:10:47,140 --> 00:10:50,020\nnice to see that everyone is\nlike developing the same things\n\n164\n00:10:50,020 --> 00:10:52,020\nin parallel.\nLike, that's kind of a weird thing\n\n165\n00:10:52,060 --> 00:10:57,460\nto say, but when I look, I kind of\ncame when I started working on this,\n\n166\n00:10:57,500 --> 00:11:00,820\nthese prototypes on GitHub,\nwhich is where I just kind of\n\n167\n00:11:00,860 --> 00:11:04,860\nshare very freely and loosely,\nuh, ideas and, you know,\n\n168\n00:11:04,900 --> 00:11:10,140\nfirst iterations on, on concepts,\num, and for want of a better word,\n\n169\n00:11:10,140 --> 00:11:14,020\nI called it like, uh,\nlm post-processing or cleanup or\n\n170\n00:11:14,260 --> 00:11:18,220\nbasically a system prompt that after\nyou get back the raw text from\n\n171\n00:11:18,540 --> 00:11:24,220\nwhisper, you run it through a model\nand say, okay, this is crappy text,\n\n172\n00:11:24,260 --> 00:11:27,260\nlike add sentence structure and,\nyou know, fix it up.\n\n173\n00:11:27,700 --> 00:11:32,780\nAnd, um, now when 
I'm exploring the\ndifferent tools that are out there\n\n174\n00:11:32,820 --> 00:11:36,700\nthat people have built, I see, uh,\nquite a number of projects have\n\n175\n00:11:37,300 --> 00:11:41,820\nbasically done the same thing,\num, less that be misconstrued.\n\n176\n00:11:41,820 --> 00:11:44,490\nI'm not saying for a millisecond\nthat I inspired them.\n\n177\n00:11:44,490 --> 00:11:49,010\nI'm sure this has been a thing that's\nbeen integrated into tools for a\n\n178\n00:11:49,050 --> 00:11:52,410\nwhile, but it's it's the kind of\nthing that when you start using these\n\n179\n00:11:52,410 --> 00:11:56,850\ntools every day, the need for it\nis almost instantly apparent, uh,\n\n180\n00:11:56,850 --> 00:12:00,890\nbecause text that doesn't have any\npunctuation or paragraph spacing\n\n181\n00:12:00,930 --> 00:12:04,370\ntakes a long time to, you know,\nit takes so long to get it into\n\n182\n00:12:04,370 --> 00:12:09,490\na presentable email that again,\nit's it's it moves speech tech\n\n183\n00:12:09,530 --> 00:12:13,050\ninto that before that inflection\npoint where you're like, no,\n\n184\n00:12:13,050 --> 00:12:16,370\nit's just not worth it.\nIt's like it'll just be quicker\n\n185\n00:12:16,370 --> 00:12:18,970\nto type this.\nSo it's a big it's a little touch.\n\n186\n00:12:18,970 --> 00:12:24,210\nThat actually is a big deal.\nUh, so I was on whisper and I've\n\n187\n00:12:24,210 --> 00:12:28,290\nbeen using whisper and I kind of\nearly on found a couple of tools.\n\n188\n00:12:28,330 --> 00:12:31,050\nI couldn't find what I was\nlooking for on Linux, which is,\n\n189\n00:12:31,490 --> 00:12:35,890\num, basically just something\nthat'll run in the background.\n\n190\n00:12:35,930 --> 00:12:40,250\nYou'll give it an API key and it\nwill just transcribe. 
Um.\n\n191\n00:12:41,400 --> 00:12:44,120\nwith, like, a little key to\nstart and stop the dictation.\n\n192\n00:12:44,720 --> 00:12:49,160\nUh, and the issues were I discovered\nthat, like most people involved in\n\n193\n00:12:49,160 --> 00:12:54,040\ncreating these projects were very\nmuch focused on local models running\n\n194\n00:12:54,040 --> 00:12:57,520\nwhisper locally, because you can.\nAnd I tried that a bunch of\n\n195\n00:12:57,520 --> 00:13:00,960\ntimes and just never got results\nthat were as good as the cloud.\n\n196\n00:13:01,280 --> 00:13:04,760\nAnd when I began looking at the\ncost of the speech to text APIs\n\n197\n00:13:04,760 --> 00:13:08,640\nand what I was spending,\nI just thought there's it's actually,\n\n198\n00:13:08,840 --> 00:13:13,320\nin my opinion, just one of the better\ndeals in API spending and in cloud.\n\n199\n00:13:13,360 --> 00:13:17,400\nLike it's just not that expensive\nfor very, very good models that are\n\n200\n00:13:17,520 --> 00:13:20,960\nmuch more, you know, you're going\nto be able to run the full model,\n\n201\n00:13:21,480 --> 00:13:26,080\nthe latest model versus whatever\nyou can run on your average GPU.\n\n202\n00:13:26,120 --> 00:13:29,880\nUnless you want to buy a crazy GPU.\nIt doesn't really make sense to me.\n\n203\n00:13:29,880 --> 00:13:33,600\nNow, privacy is another concern.\nUm, that I know is kind of like a\n\n204\n00:13:33,640 --> 00:13:37,040\nvery much a separate thing that\npeople just don't want their voice,\n\n205\n00:13:37,040 --> 00:13:39,910\ndata, and their voice leaving\ntheir local environment,\n\n206\n00:13:40,230 --> 00:13:43,950\nmaybe for regulatory reasons as well.\nUm, but I'm not in that.\n\n207\n00:13:44,030 --> 00:13:48,030\nUm, I'm neither really care about\npeople listening to my, uh,\n\n208\n00:13:48,070 --> 00:13:51,310\ngrocery list consisting of, uh,\nreminding myself that I need to\n\n209\n00:13:51,350 --> 00:13:54,910\nbuy more beer, Cheetos and hummus,\nwhich is kind of 
the three,\n\n210\n00:13:55,110 --> 00:13:59,430\nthree staples of my diet.\nUm, during periods of poor nutrition.\n\n211\n00:13:59,710 --> 00:14:03,430\nUh, but the kind of stuff that I\ntranscribe, it's just not it's not a,\n\n212\n00:14:04,110 --> 00:14:09,470\nit's not a privacy thing and that\nsort of sensitive about and, uh,\n\n213\n00:14:09,470 --> 00:14:13,190\nI don't do anything so,\nyou know, sensitive or secure,\n\n214\n00:14:13,190 --> 00:14:16,710\nthat requires air gapping.\nSo, um, I looked at the pricing and\n\n215\n00:14:16,710 --> 00:14:20,390\nespecially the kind of older models,\nmini, um, some of them are very,\n\n216\n00:14:20,390 --> 00:14:23,230\nvery affordable.\nAnd I did a back of the I did a\n\n217\n00:14:23,230 --> 00:14:27,270\ncalculation once with ChatGPT\nand I was like, okay, this is a,\n\n218\n00:14:27,270 --> 00:14:31,190\nthis is the API price for I can't\nremember whatever the model was.\n\n219\n00:14:31,670 --> 00:14:34,030\nUh, let's say I just go at it\nlike nonstop,\n\n220\n00:14:34,150 --> 00:14:37,530\nwhich it rarely happens. 
Probably.\nI would say on average,\n\n221\n00:14:37,530 --> 00:14:42,010\nI might dictate 30 to 60 minutes per\nday if I was probably summing up\n\n222\n00:14:42,010 --> 00:14:48,610\nthe emails, documents, outlines,\num, which is a lot, but it's it's\n\n223\n00:14:48,610 --> 00:14:50,850\nstill a fairly modest amount.\nAnd I was like, well,\n\n224\n00:14:50,890 --> 00:14:54,050\nsome days I do go on like 1 or 2\ndays where I've been.\n\n225\n00:14:54,570 --> 00:14:58,570\nUsually when I'm like kind of out of\nthe house and just have something\n\n226\n00:14:59,210 --> 00:15:02,370\nlike, I have nothing else to do.\nLike if I'm at a hospital with a\n\n227\n00:15:02,370 --> 00:15:07,090\nnewborn, uh, and you're waiting\nfor like eight hours and hours\n\n228\n00:15:07,090 --> 00:15:10,330\nfor an appointment, and I would\nprobably have listened to podcasts\n\n229\n00:15:10,610 --> 00:15:14,130\nbefore becoming a speech fanatic.\nAnd I'm like, oh, wait,\n\n230\n00:15:14,170 --> 00:15:16,490\nlet me just get down.\nLet me just get these ideas out\n\n231\n00:15:16,530 --> 00:15:19,730\nof my head.\nAnd that's when I'll go on my\n\n232\n00:15:19,770 --> 00:15:21,650\nspeech binges.\nBut those are like once every\n\n233\n00:15:21,650 --> 00:15:25,090\nfew months, like not frequently.\nBut I said, okay, let's just say\n\n234\n00:15:25,090 --> 00:15:30,770\nif I'm gonna price out.\nCloud asked if I was like, dedicated\n\n235\n00:15:30,770 --> 00:15:37,000\nevery second of every waking hour to\ntranscribing for some odd reason. Um.\n\n236\n00:15:37,320 --> 00:15:39,800\nI mean, it'd have to, like,\neat and use the toilet and,\n\n237\n00:15:39,840 --> 00:15:42,640\nlike, you know, there's only so\nmany hours I'm awake for.\n\n238\n00:15:42,640 --> 00:15:44,800\nSo, like,\nlet's just say a maximum of, like,\n\n239\n00:15:44,840 --> 00:15:48,800\n40 hours, 45 minutes in the hour.\nThen I said, all right,\n\n240\n00:15:48,800 --> 00:15:52,720\nlet's just say 50. 
Who knows?\nYou're dictating on the toilet.\n\n241\n00:15:52,760 --> 00:15:54,000\nWe do it.\nUh,\n\n242\n00:15:54,000 --> 00:15:58,840\nso it could be you could just do 60.\nBut whatever I did, and every day,\n\n243\n00:15:58,880 --> 00:16:02,560\nlike, you're going flat out seven\ndays a week dictating non-stop.\n\n244\n00:16:02,600 --> 00:16:06,560\nI was like, what's my monthly API\nbill going to be at this price?\n\n245\n00:16:06,840 --> 00:16:09,240\nAnd it came out to like 70 or 80\nbucks.\n\n246\n00:16:09,240 --> 00:16:14,200\nAnd I was like, well, that would be\nan extraordinary amount of dictation.\n\n247\n00:16:14,200 --> 00:16:17,960\nAnd I would hope that there was\nsome compelling reason,\n\n248\n00:16:18,160 --> 00:16:22,320\nmore worth more than $70,\nthat I embarked upon that project.\n\n249\n00:16:22,520 --> 00:16:25,320\nUh, so given that that's kind of the\nmax point for me, I said, that's\n\n250\n00:16:25,360 --> 00:16:29,120\nactually very, very affordable.\nUm, now you're gonna if you want\n\n251\n00:16:29,160 --> 00:16:34,200\nto spec out the costs and you want\nto do the post-processing that I\n\n252\n00:16:34,270 --> 00:16:37,230\nreally do feel is valuable.\nUm, that's going to cost some more as\n\n253\n00:16:37,230 --> 00:16:43,230\nwell, unless you're using Gemini,\nwhich, uh, needless to say, is a\n\n254\n00:16:43,230 --> 00:16:47,070\nrandom person sitting in Jerusalem.\nUh, I have no affiliation,\n\n255\n00:16:47,070 --> 00:16:51,470\nnor with Google, nor anthropic,\nnor Gemini, nor any major tech vendor\n\n256\n00:16:51,470 --> 00:16:56,910\nfor that matter. 
Um, I like Gemini.\nNot so much as a everyday model.\n\n257\n00:16:56,990 --> 00:16:59,950\nUm, it's kind of underwhelmed in\nthat respect, I would say.\n\n258\n00:17:00,350 --> 00:17:03,150\nBut for multimodal,\nI think it's got a lot to offer.\n\n259\n00:17:03,430 --> 00:17:06,990\nAnd I think that the transcribing\nfunctionality whereby it can,\n\n260\n00:17:07,390 --> 00:17:12,270\num, process audio with a system\nprompt and both give you\n\n261\n00:17:12,310 --> 00:17:15,510\ntranscription that's cleaned up,\nthat reduces two steps to one.\n\n262\n00:17:15,830 --> 00:17:18,750\nAnd that for me is a very,\nvery big deal.\n\n263\n00:17:18,750 --> 00:17:23,110\nAnd, uh, I feel like even Google\nhas haven't really sort of thought\n\n264\n00:17:23,110 --> 00:17:27,550\nthrough how useful the that\nmodality is and what kind of use\n\n265\n00:17:27,550 --> 00:17:30,910\ncases you can achieve with it.\nBecause I found in the course of\n\n266\n00:17:30,910 --> 00:17:36,610\nthis year just an endless list\nof really kind of system prompt,\n\n267\n00:17:36,850 --> 00:17:41,410\nsystem prompt stuff that I can say,\nokay, I've used it to capture context\n\n268\n00:17:41,410 --> 00:17:45,690\ndata for AI, which is literally I\nmight speak for if I wanted to have a\n\n269\n00:17:45,690 --> 00:17:49,850\ngood bank of context data about,\nwho knows, my childhood.\n\n270\n00:17:50,130 --> 00:17:53,570\nUh, more realistically,\nmaybe my career goals, uh,\n\n271\n00:17:53,570 --> 00:17:56,130\nsomething that would just be,\nlike, really boring to type out.\n\n272\n00:17:56,250 --> 00:18:01,250\nSo I'll just, like, sit in my car\nand record it for ten minutes.\n\n273\n00:18:01,250 --> 00:18:04,210\nAnd that ten minutes,\nyou get a lot of information in,\n\n274\n00:18:04,650 --> 00:18:10,210\num, emails, which is short text.\nUm, just there is a whole bunch.\n\n275\n00:18:10,210 --> 00:18:13,690\nAnd all these workflows kind of\nrequire a little bit of 
treatment\n\n276\n00:18:13,690 --> 00:18:17,610\nafterwards and different treatment.\nMy context pipeline is kind of like\n\n277\n00:18:17,610 --> 00:18:21,330\njust extract the bare essentials.\nSo you end up with me talking very\n\n278\n00:18:21,330 --> 00:18:24,370\nloosely about sort of what I've done\nin my career, where I've worked,\n\n279\n00:18:24,370 --> 00:18:27,730\nwhere I might like to work,\nand it goes it condenses that\n\n280\n00:18:27,730 --> 00:18:31,720\ndown to very robotic language\nthat is easy to chunk, parse,\n\n281\n00:18:31,720 --> 00:18:36,080\nand maybe put into a vector database.\nDaniel has worked in technology,\n\n282\n00:18:36,120 --> 00:18:39,760\nDaniel is a has been working in,\nyou know, stuff like that.\n\n283\n00:18:39,760 --> 00:18:43,720\nThat's not how you would speak.\nUm, but I figure it's probably easier\n\n284\n00:18:43,720 --> 00:18:48,240\nto parse for, after all, robots.\nSo we've almost got to 20 minutes.\n\n285\n00:18:48,240 --> 00:18:52,760\nAnd this is actually a success\nbecause I wasted 20 minutes of my,\n\n286\n00:18:52,920 --> 00:18:57,000\nuh, of the evening speaking into\na microphone, and, uh,\n\n287\n00:18:57,040 --> 00:19:00,960\nthe levels were shot and, uh, it,\nuh, it was clipping and I said,\n\n288\n00:19:00,960 --> 00:19:03,320\nI can't really do an evaluation.\nI have to be fair.\n\n289\n00:19:03,320 --> 00:19:07,120\nI have to give the models a\nchance to do their thing.\n\n290\n00:19:07,640 --> 00:19:09,480\nUh,\nwhat am I hoping to achieve in this?\n\n291\n00:19:09,520 --> 00:19:12,720\nOkay, my fine tune was a dud,\nas mentioned Deepgram SVT.\n\n292\n00:19:12,760 --> 00:19:15,640\nI'm really, really hopeful that\nthis prototype will work.\n\n293\n00:19:15,920 --> 00:19:19,080\nAnd it's a built in public open\nsource, so anyone is welcome to\n\n294\n00:19:19,120 --> 00:19:23,040\nuse it if I make anything good.\nUm, but that was really exciting for\n\n295\n00:19:23,040 --> 00:19:27,520\nme last 
night when after hours of,\num, trying my own prototype,\n\n296\n00:19:27,520 --> 00:19:31,350\nseeing someone just made\nsomething that works like that.\n\n297\n00:19:31,390 --> 00:19:32,790\nYou know,\nyou're not going to have to build a\n\n298\n00:19:32,790 --> 00:19:38,350\ncustom conda environment and image.\nI have AMD GPU, which makes\n\n299\n00:19:38,350 --> 00:19:42,430\nthings much more complicated.\nI didn't find it and I was about\n\n300\n00:19:42,430 --> 00:19:44,110\nto give up and I said,\nall right, let me just give deep\n\n301\n00:19:44,110 --> 00:19:48,870\ngrams Linux thing a shot.\nAnd if this doesn't work, um,\n\n302\n00:19:48,870 --> 00:19:51,270\nI'm just going to go back to\ntrying to code something myself.\n\n303\n00:19:51,630 --> 00:19:56,310\nAnd when I ran the script,\nI was using cloud code to do the\n\n304\n00:19:56,310 --> 00:20:00,150\ninstallation process.\nIt ran the script and oh my gosh,\n\n305\n00:20:00,190 --> 00:20:05,470\nit works just like that.\nUh, the tricky thing for all those\n\n306\n00:20:05,470 --> 00:20:10,430\nwho wants to know all the nitty\ngritty, nitty gritty details, um, was\n\n307\n00:20:10,430 --> 00:20:13,870\nthat I don't think it was actually\nstruggling with transcription, but\n\n308\n00:20:13,870 --> 00:20:18,670\npasting Wayland makes life very hard,\nand I think there was something not\n\n309\n00:20:18,670 --> 00:20:21,990\nrunning in the right time anyway.\nDeepgram I looked at how they\n\n310\n00:20:21,990 --> 00:20:24,830\nactually handle that because it\nworked out of the box when other\n\n311\n00:20:24,830 --> 00:20:29,260\nstuff didn't, and it was quite a\nclever little mechanism,\n\n312\n00:20:29,580 --> 00:20:32,220\nand but more so than that,\nthe accuracy was brilliant.\n\n313\n00:20:32,260 --> 00:20:35,140\nNow, what am I doing here?\nThis is going to be a 20 minute\n\n314\n00:20:35,380 --> 00:20:43,100\naudio sample, and I'm I think\nI've done 1 or 2 of these before,\n\n315\n00:20:43,100 
--> 00:20:49,300\nbut I did it with short, snappy voice\nnotes. This is kind of long form.\n\n316\n00:20:49,580 --> 00:20:51,860\nThis actually might be a better\napproximation for what's useful\n\n317\n00:20:51,860 --> 00:20:56,220\nto me than voice memos.\nLike I need to buy three liters\n\n318\n00:20:56,220 --> 00:20:59,300\nof milk tomorrow, and pita bread,\nwhich is probably how like half\n\n319\n00:20:59,300 --> 00:21:02,940\nmy voice voice notes sound like\nif anyone were to, I don't know,\n\n320\n00:21:02,980 --> 00:21:04,700\nlike find my phone,\nthey'd be like, this is the most\n\n321\n00:21:04,700 --> 00:21:07,540\nboring person in the world.\nAlthough actually there are some\n\n322\n00:21:07,580 --> 00:21:09,820\nlike kind of, uh,\njournaling thoughts as well.\n\n323\n00:21:09,820 --> 00:21:13,820\nBut it's a lot of content like that.\nAnd the probably for the evaluation,\n\n324\n00:21:13,820 --> 00:21:20,780\nthe most useful thing is slightly\nobscure tech GitHub uh, hugging face\n\n325\n00:21:21,300 --> 00:21:24,780\nnot so obscure that it's not going\nto have a chance of knowing it,\n\n326\n00:21:24,780 --> 00:21:27,760\nbut hopefully sufficiently well\nknown that the model should get it.\n\n327\n00:21:28,320 --> 00:21:30,880\nI tried to do a little bit of\nspeaking really fast and\n\n328\n00:21:30,880 --> 00:21:33,320\nspeaking very slowly.\nI would say in general,\n\n329\n00:21:33,320 --> 00:21:37,000\nI've spoken, delivered this at a\nfaster pace than I usually would\n\n330\n00:21:37,040 --> 00:21:40,400\nowing to strong coffee flowing\nthrough my bloodstream.\n\n331\n00:21:41,040 --> 00:21:44,320\nAnd the thing that I'm not going\nto get in this benchmark is\n\n332\n00:21:44,320 --> 00:21:47,000\nbackground noise, which in my first\ntake that I had to get rid of,\n\n333\n00:21:47,800 --> 00:21:51,360\nmy wife came in with my son and\nfor a good night kiss.\n\n334\n00:21:51,560 --> 00:21:55,240\nAnd that actually would have\nbeen super helpful to 
get in\n\n335\n00:21:55,240 --> 00:21:59,880\nbecause it was not diarised.\nOr if we had diarisation a female,\n\n336\n00:22:00,000 --> 00:22:02,400\nI could say I want the male\nvoice and that wasn't intended\n\n337\n00:22:02,400 --> 00:22:05,400\nfor transcription.\nUm, and we're not going to get\n\n338\n00:22:05,400 --> 00:22:07,080\nbackground noise like people\nhonking their horns,\n\n339\n00:22:07,080 --> 00:22:11,400\nwhich is something I've done in my\nmain data set where I am trying to\n\n340\n00:22:11,560 --> 00:22:15,640\ngo back to some of my voice notes,\nannotate them, and run a benchmark.\n\n341\n00:22:15,640 --> 00:22:19,080\nBut this is going to be just a\npure quick test.\n\n342\n00:22:19,560 --> 00:22:24,000\nAnd as someone I'm working on a\nvoice note idea,\n\n343\n00:22:24,000 --> 00:22:28,350\nthat's my sort of end motivation.\nBesides thinking it's an\n\n344\n00:22:28,350 --> 00:22:31,710\nabsolutely outstanding technology\nthat's coming to viability.\n\n345\n00:22:31,710 --> 00:22:34,790\nAnd really, I know this sounds\ncheesy can actually have a very\n\n346\n00:22:34,790 --> 00:22:38,950\ntransformative effect.\nUm, it's, you know, voice technology\n\n347\n00:22:38,990 --> 00:22:45,030\nhas been life changing for, uh,\nfolks living with, um, disabilities.\n\n348\n00:22:45,750 --> 00:22:48,670\nAnd I think there's something\nreally nice about the fact that\n\n349\n00:22:48,670 --> 00:22:52,830\nit can also benefit, you know,\nfolks who are able bodied and like,\n\n350\n00:22:52,870 --> 00:22:59,070\nwe can all in different ways, um,\nmake this tech as useful as possible,\n\n351\n00:22:59,110 --> 00:23:01,230\nregardless of the exact way that\nwe're using it.\n\n352\n00:23:01,630 --> 00:23:04,830\nUm, and I think there's something\nvery powerful in that, and it can be\n\n353\n00:23:04,830 --> 00:23:09,030\nvery cool. 
Um, I see use potential.\nWhat excites me about voice tech?\n\n354\n00:23:09,870 --> 00:23:13,670\nA lot of things, actually.\nFirstly, the fact that it's cheap\n\n355\n00:23:13,670 --> 00:23:17,230\nand accurate, as I mentioned at\nthe very start of this, um,\n\n356\n00:23:17,230 --> 00:23:20,910\nand it's getting better and better\nwith stuff like accent handling, um,\n\n357\n00:23:20,910 --> 00:23:24,300\nI'm not sure my, my fine tune will\nactually ever come to fruition in the\n\n358\n00:23:24,300 --> 00:23:27,980\nsense that I'll use it day to day,\nas I imagine I get like superb,\n\n359\n00:23:27,980 --> 00:23:33,660\nflawless word error rates because I'm\njust kind of skeptical about local\n\n360\n00:23:33,660 --> 00:23:38,220\nspeech to texts, as I mentioned.\nAnd I think the pace of innovation\n\n361\n00:23:38,220 --> 00:23:42,180\nand improvement in the models,\nthe main reasons for fine tuning from\n\n362\n00:23:42,180 --> 00:23:46,460\nwhat I've seen have been people who\nare something that really blows,\n\n363\n00:23:46,500 --> 00:23:53,060\nblows my mind about ASR is the idea\nthat it's inherently a lingual\n\n364\n00:23:53,060 --> 00:23:59,220\nor multilingual phonetic based.\nSo as folks who use speak very\n\n365\n00:23:59,260 --> 00:24:02,340\nobscure languages that there may\nbe there might be a paucity of\n\n366\n00:24:02,340 --> 00:24:05,620\ntraining data or almost none at all,\nand therefore the accuracy is\n\n367\n00:24:05,620 --> 00:24:10,780\nsignificantly reduced or folks\nin very critical environments.\n\n368\n00:24:10,820 --> 00:24:13,500\nI know there are.\nThis is used extensively in medical\n\n369\n00:24:13,500 --> 00:24:18,260\ntranscription and dispatcher work as,\num, you know, the call centers who\n\n370\n00:24:18,260 --> 00:24:22,610\nsend out ambulances, etc., where\naccuracy is absolutely paramount.\n\n371\n00:24:22,610 --> 00:24:26,170\nAnd in the case of doctors,\nradiologists, they might be using\n\n372\n00:24:26,170 --> 
00:24:29,730\nvery specialized vocab all the time.\nSo those are kind of the main\n\n373\n00:24:29,730 --> 00:24:31,650\ntwo things.\nAnd I'm not sure that really just for\n\n374\n00:24:31,650 --> 00:24:37,410\ntrying to make it better on a few\nrandom tech words with my slightly.\n\n375\n00:24:37,450 --> 00:24:41,370\nI mean, I have an accent, but like,\nnot, you know, an accent that a few\n\n376\n00:24:41,410 --> 00:24:47,330\nother million people have. Ish.\nI'm not sure that my little fine\n\n377\n00:24:47,330 --> 00:24:52,370\ntune is going to actually like the\nbump in word error rate reduction.\n\n378\n00:24:52,370 --> 00:24:54,690\nIf I ever actually figure out how\nto do it and get it up to the\n\n379\n00:24:54,690 --> 00:24:58,730\ncloud by the time I've done that.\nI suspect that the next\n\n380\n00:24:58,730 --> 00:25:01,530\ngeneration of ASR will just be\nso good that it will kind of be.\n\n381\n00:25:02,050 --> 00:25:03,890\nAh, well,\nthat would be cool if it worked out,\n\n382\n00:25:03,890 --> 00:25:08,850\nbut I'll just use this instead.\nSo that's going to be it for today's\n\n383\n00:25:08,850 --> 00:25:14,250\nepisode of, uh, voice training data.\nSingle long shot evaluation.\n\n384\n00:25:14,530 --> 00:25:17,450\nWho am I going to compare?\nWhisper is always good as a\n\n385\n00:25:17,450 --> 00:25:20,720\nbenchmark, but I'm more\ninterested in seeing Whisperer\n\n386\n00:25:20,720 --> 00:25:25,200\nhead to head with two things,\nreally. 
One is whisper variance.\n\n387\n00:25:25,200 --> 00:25:30,000\nSo you've got these projects like\nfaster Whisper, Still whisper.\n\n388\n00:25:30,000 --> 00:25:31,760\nIt's a bit confusing.\nThere's a whole bunch of them\n\n389\n00:25:32,040 --> 00:25:34,920\nand the emerging acers,\nwhich are also a thing.\n\n390\n00:25:35,320 --> 00:25:37,800\nMy intention for this is I'm not\nsure I'm going to have the time\n\n391\n00:25:37,800 --> 00:25:41,760\nin any point in the foreseeable\nfuture to go back through this whole\n\n392\n00:25:41,760 --> 00:25:46,680\nepisode and create a proper source,\ntruth or a fix.\n\n393\n00:25:47,440 --> 00:25:51,800\nEverything might do it if I can\nget one transcription that\n\n394\n00:25:51,800 --> 00:25:56,840\nsufficiently close to perfection.\nBut what I would actually love\n\n395\n00:25:56,840 --> 00:25:59,920\nto do on Hugging Face I think\nwould be a great.\n\n396\n00:25:59,920 --> 00:26:03,680\nProbably how I might visualize this\nis having the audio waveform play,\n\n397\n00:26:04,160 --> 00:26:09,920\nand then have the transcript for each\nmodel below it, and maybe even a,\n\n398\n00:26:10,600 --> 00:26:15,240\num, like, you know, two scale and\nmaybe even a local one as well,\n\n399\n00:26:15,280 --> 00:26:21,820\nlike local whisper versus open\nAI API, Etc. and, um, I can then\n\n400\n00:26:21,820 --> 00:26:24,500\nactually listen back to segments\nor anyone who wants to can listen\n\n401\n00:26:24,500 --> 00:26:29,540\nback to segments of this recording\nand see where a particular model\n\n402\n00:26:29,580 --> 00:26:33,060\nstruggled and others didn't, as well\nas the sort of headline finding\n\n403\n00:26:33,100 --> 00:26:36,900\nof which had the best, uh, wer.\nBut that would require the source\n\n404\n00:26:36,900 --> 00:26:40,140\nof truth. Okay. 
That's it.\nHope this was, I don't know,\n\n405\n00:26:40,300 --> 00:26:43,580\nmaybe useful for other folks\ninterested in stuff you want to see.\n\n406\n00:26:44,060 --> 00:26:48,220\nI always feel think I've just said\nsomething I didn't intend to say.\n\n407\n00:26:48,780 --> 00:26:51,140\nI said for those, listen carefully.\nIncluding, hopefully,\n\n408\n00:26:51,140 --> 00:26:54,180\nthe models themselves.\nThis has been myself,\n\n409\n00:26:54,220 --> 00:26:58,020\nDaniel Rosehill, for more, um,\njumbled repositories about my,\n\n410\n00:26:58,060 --> 00:27:00,940\nuh, roving interest in AI,\nbut particularly Agentic,\n\n411\n00:27:01,300 --> 00:27:05,460\nMCP and voice tech.\nUh, you can find me on GitHub.\n\n412\n00:27:05,940 --> 00:27:11,260\nHugging face. Where else?\nDaniel, which is my personal website,\n\n413\n00:27:11,260 --> 00:27:15,380\nas well as this podcast whose\nname I sadly cannot remember.\n\n414\n00:27:15,820 --> 00:27:17,540\nUntil next time.\nThanks for listening.\n"}; | |