3D rendering

3D rendering is implemented with flutter_filament, which lets me write all the code for lighting, model loading, camera, and animation triggers in Dart, compile it to WASM, and render to an HTML5 canvas.

This isn't a Flutter web app! Despite the name, the flutter_filament package is no longer tied to Flutter and can be used with any JS/WASM runtime. The UI components are actually generated with Jaspr, a Dart framework with Flutter-esque syntax for creating reactive HTML/JavaScript apps. Flutter Web would need a ~5 MB download for the framework alone, and I felt this was overkill just to display a few text boxes.

Avatar & Audio

I used Avaturn to create the avatar; it generates a reasonable (albeit not perfect) 3D model from a few phone camera headshots. Speech recognition is implemented with sherpa-onnx (compiled to WASM), and I cloned my voice with Cartesia/Sonic, which I've found to be the best available zero-shot TTS model.

The pitch and timbre of the Cartesia voice are practically identical to my real voice, though to my ears, it does give me a noticeably stronger Australian accent than I have in real life.

Dialog & Animations

This obviously isn't an open-ended dialog system; when the user passes some input, we embed it, perform a distance-based lookup against a predetermined set of embeddings, then return the response associated with the closest match.
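A minimal Python sketch of that kind of lookup, for illustration only. The response IDs, the tiny 3-dimensional vectors, the choice of cosine distance, and the `max_distance` fallback threshold are all assumptions of mine; the site's actual embedding model, distance metric, and script are not described here.

```python
import math

# Hypothetical closed script: each entry pairs a response ID with a
# precomputed embedding. The 3-dimensional vectors are made up; a real
# embedding model would produce far more dimensions.
SCRIPT = [
    ("greeting", [0.9, 0.1, 0.0]),
    ("work_history", [0.1, 0.9, 0.1]),
    ("contact", [0.0, 0.2, 0.9]),
]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as orthogonal to everything
    return 1.0 - dot / (norm_a * norm_b)


def lookup_response(query_embedding, script=SCRIPT, max_distance=0.5):
    """Return (response_id, distance) for the nearest scripted entry.

    If nothing is within max_distance (an assumed threshold), return a
    "fallback" ID so the avatar can say it has no answer.
    """
    best_id, best_dist = None, float("inf")
    for response_id, embedding in script:
        dist = cosine_distance(query_embedding, embedding)
        if dist < best_dist:
            best_id, best_dist = response_id, dist
    if best_dist > max_distance:
        return "fallback", best_dist
    return best_id, best_dist
```

Because the candidate set is small and fixed, a linear scan like this is enough; an approximate nearest-neighbour index would only matter for a much larger script.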

Why didn't I connect it to an LLM? Two reasons. First, hallucinations and invented responses are still a problem. It didn't feel right to have my avatar talking about things I never actually did! To be fair, a more comprehensive script/database would go a long way to addressing this problem, so that may well be the next step.

Second, the lip animation (FaceDiffuser) and gestures (DiffSHEG) are generated directly from the audio, and all three (audio, lip animation, and gestures) are generated offline, since the set of candidate responses is closed. A "real" dialog system would need to generate all three at runtime; that in itself is straightforward, but doing it with low latency is not, so that's a larger project for another day.
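To make the offline approach concrete, here is a rough Python sketch of what a precomputed asset manifest could look like. Everything in it is hypothetical: the class, the response ID, and the file names are placeholders, not the site's actual data layout. The point is only that playback at runtime reduces to a dictionary lookup with no model inference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseAssets:
    """The three artifacts generated offline for one scripted response."""
    audio: str          # synthesized speech clip
    lip_animation: str  # lip animation derived from the audio
    gestures: str       # gesture animation derived from the audio


# Hypothetical manifest; file names are placeholders for illustration.
MANIFEST = {
    "greeting": ResponseAssets(
        audio="greeting.wav",
        lip_animation="greeting_lips.json",
        gestures="greeting_gestures.json",
    ),
}


def assets_for(response_id, manifest=MANIFEST):
    """Return the precomputed assets for a response, or None if unknown."""
    return manifest.get(response_id)
```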