I Built an AI Avatar for My Portfolio — Here's Every Decision I Made

The Premise

Most portfolio sites have a contact form. Mine has a real-time AI avatar that answers questions about my career in my voice.

I built it partly because I thought it would be genuinely useful for recruiters who want a fast read on what I've done — and partly because if you're going to call yourself an AI-integrated frontend engineer, you should probably have something to show for it. A portfolio is a demo. Make the demo count.

This is a full walkthrough of how it works, every piece of the stack, and what actually gave me trouble along the way.

The Stack at a Glance

- Simli — real-time AI avatar via WebRTC (the talking face)

- Claude Haiku — the language model powering the chat (streaming responses)

- OpenAI text-embedding-3-small — turns career data into vectors

- Supabase with pgvector — vector database and similarity search

- OpenAI tts-1 — converts text to speech in PCM format

- React Router v7 — the frontend framework

- My own Instagram data export — yes, really. More on that.

Starting with Simli

Simli handles the hardest part: making a face talk in real time without looking like a disaster. It uses WebRTC under the hood, which means the latency is low enough to feel live — you're not waiting for a video to render, you're streaming frames directly to a canvas element synchronized to audio.

The integration is straightforward once you understand what Simli actually needs: raw PCM audio at 16kHz, 16-bit mono. That's it. You push audio data, it drives the face. The face ID is a specific avatar tied to your Simli account — you set that up in their dashboard by uploading a photo or video. The API key gates access to that session.

On the frontend, the SimliClient initializes a WebRTC peer connection, starts receiving video frames, and exposes a sendAudioData() method. The catch is that Simli expects audio at exactly 16kHz, while OpenAI's TTS API produces PCM at 24kHz. That mismatch required a custom downsampling function.

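// Downsample 16-bit mono PCM from 24kHz to 16kHz: 3 input samples become 2 output samples.
function downsample24to16(pcm24: ArrayBuffer): Uint8Array {
  // View the raw bytes as signed 16-bit samples (byteLength must be even).
  const input = new Int16Array(pcm24);
  // Only whole groups of 3 are converted, so no output slot is left as a stray zero.
  const outLen = Math.floor(input.length / 3) * 2;
  const output = new Int16Array(outLen);
  for (let i = 0, j = 0; i + 2 < input.length; i += 3, j += 2) {
    // Blend neighboring samples with 2:1 weights.
    output[j] = Math.round((input[i] * 2 + input[i + 1]) / 3);
    output[j + 1] = Math.round((input[i + 1] + input[i + 2] * 2) / 3);
  }
  // Return the samples as raw bytes.
  return new Uint8Array(output.buffer);
}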

Simple 3-to-2 decimation with linear interpolation: every three input samples become two output samples. Not perfect audio science, but the quality loss is imperceptible in a TTS context, and it's fast.
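
For comparison, here is a minimal sketch of an evenly spaced variant. It is not the function used on the site: the name resample24to16 is made up for this example, and it assumes the input is already an Int16Array of mono samples. Output sample k reads input position 1.5 × k, so even outputs copy a sample and odd outputs average two neighbors. Like the version above, it applies no low-pass filter first, which a stricter resampler would.

```typescript
// Hypothetical variant: evenly spaced linear interpolation from 24kHz to 16kHz.
function resample24to16(input: Int16Array): Int16Array {
  const outLen = Math.floor((input.length * 2) / 3);
  const output = new Int16Array(outLen);
  for (let k = 0; k < outLen; k++) {
    const pos = k * 1.5;        // where this output sample falls, in input samples
    const i = Math.floor(pos);
    const frac = pos - i;       // always 0 or 0.5 for a 3:2 ratio
    const a = input[i];
    const b = i + 1 < input.length ? input[i + 1] : a;
    output[k] = Math.round(a + (b - a) * frac);
  }
  return output;
}
```

One second of 24kHz audio (24,000 samples) comes out as 16,000 samples either way; the difference is only in how each output sample is weighted.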

The RAG Layer: Supabase + pgvector

The chat isn't just a generic AI — it answers questions specifically about my career, using real data I curated. That's Retrieval-Augmented Generation, and the implementation has three steps.

Step 1: Create the knowledge base.

I wrote all my career data into a structured TypeScript file — chunks categorized as experience, skills, projects, personality, interview prep, and more. Each chunk is a self-contained paragraph of context, not just a list of bullet points. The model needs to read these like context, not parse them like a spreadsheet.
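
A minimal sketch of what one entry in that file might look like. The id, content, and category fields are the ones the embedding script stores; the CareerChunk name, the sample entry, and the validateChunks helper are illustrative assumptions, not the real file.

```typescript
// Hypothetical shape for one knowledge-base entry.
interface CareerChunk {
  id: string;       // stable key, reused on every upsert
  category: string; // e.g. "experience", "skills", "projects", "personality"
  content: string;  // a self-contained paragraph, written as prose
}

// Placeholder entry for illustration only.
const careerChunks: CareerChunk[] = [
  {
    id: "projects-portfolio-avatar",
    category: "projects",
    content:
      "I built a real-time AI avatar for my portfolio site. It answers questions " +
      "about my career using retrieval over curated chunks like this one.",
  },
];

// Flag duplicate ids (a later upsert would overwrite the earlier chunk) and empty text.
function validateChunks(chunks: CareerChunk[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const chunk of chunks) {
    if (seen.has(chunk.id)) problems.push(`duplicate id: ${chunk.id}`);
    seen.add(chunk.id);
    if (chunk.content.trim().length === 0) problems.push(`empty content: ${chunk.id}`);
  }
  return problems;
}
```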

Step 2: Embed the chunks.

A local script runs through every chunk, sends the text to OpenAI's text-embedding-3-small model, and stores the resulting 1,536-dimensional vector in Supabase alongside the original text.

const embeddingResponse = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunk.content,
});
await supabase.from('career_chunks').upsert({
  id: chunk.id,
  content: chunk.content,
  category: chunk.category,
  embedding: embeddingResponse.data[0].embedding,
});

Step 3: Query at runtime.

When a user asks a question, the server embeds the question using the same model, then runs a cosine similarity search against the stored vectors via a Supabase RPC function backed by pgvector. The top 5 matching chunks get injected into Claude's system prompt as context.

const { data: chunks } = await supabase.rpc('match_career_chunks', {
  query_embedding: queryEmbedding,
  match_threshold: 0.4,
  match_count: 5,
});
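
The SQL behind match_career_chunks isn't shown here, so the following is only an in-memory sketch of what such a function is assumed to compute: score every stored chunk by cosine similarity, keep those above the threshold, and return the best few. The StoredChunk type and the matchChunks name are invented for the example.

```typescript
interface StoredChunk {
  id: string;
  content: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Assumed semantics: similarity must exceed the threshold; highest scores come first.
function matchChunks(query: number[], chunks: StoredChunk[], threshold: number, count: number) {
  return chunks
    .map((chunk) => ({ chunk, similarity: cosineSimilarity(query, chunk.embedding) }))
    .filter((match) => match.similarity > threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, count);
}
```

With real data, pgvector does this inside Postgres, so the 1,536-dimensional vectors never leave the database.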

The result is a model that can accurately answer "What did you build at Guild Mortgage?" or "How do you approach microfrontend architecture?" — because it has the actual context, not just a generic persona.

The Instagram Hack

This is the part I didn't expect to work as well as it did.

The generic career chunks were good for professional questions. But they didn't capture who I actually am as a person — the kind of detail that makes a conversation feel like you're actually talking to someone, not reading a resume out loud. A recruiter who asks "what do you do outside of work?" deserved a real answer, not a deflection.

Instagram lets you export your account data as JSON. That export includes your liked posts, complete with captions, hashtags, and account names. I ran a script against 8,721 liked posts and discovered what my own engagement patterns revealed about me:

- Golf is not a casual hobby. The hashtag #golf appeared in 390 liked posts. I follow PGA Tour pros, golf coaches, instructional accounts — and I interact with all of them. I play competitively and keep score.

- I'm a Lakers fan, not a Clippers fan, despite the algorithm's best efforts to suggest otherwise. One look at my actual comments confirmed it — I was throwing shade at the Clippers in other people's threads.

- The comments file was even more revealing. My own comment history showed my actual voice: how I engage with film discussions, sports debates, tech content, Black culture and media. I correct factual errors on Kill Bill. I have strong opinions on who the greatest women's basketball player of all time is (Cheryl Miller, and it's not close). I call out weak takes on Raiders games while still showing up for every one.

I turned all of this into additional RAG chunks — personality, sports, culture, voice — so when someone asks the AI a casual question, it has something real to draw from. The AI now knows I take my daughter golfing, that I'm a De La Soul listener, and that I attended San Diego Comic Con. That's not in my resume. It's better than my resume.

The TTS Pipeline: Making It Fast

The biggest UX problem with AI voice chat is latency. Claude streams the response token by token. OpenAI's TTS converts text to audio. Simli drives the avatar. Run those three steps strictly in sequence and it's too slow: the user waits until Claude has finished generating before hearing a single word.

The solution was to pipeline them at the sentence level.

While Claude is still streaming, I buffer the output and split on sentence boundaries as they arrive:

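// Split the buffer into complete sentences plus the unfinished tail.
function drainSentences(buf: string): [sentences: string[], remaining: string] {
  // Break after ., !, or ? when followed by whitespace.
  const parts = buf.split(/(?<=[.!?])\s+/);
  // The last part may still be mid-sentence, so it is held back.
  // Note: the filter discards sentences of 10 characters or fewer.
  const sentences = parts.slice(0, -1).filter((s) => s.trim().length > 10);
  return [sentences, parts[parts.length - 1]];
}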
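
One thing to watch in the function above: the length filter removes short sentences outright, so a reply that opens with "Yes." would lose that word in the spoken version. A possible alternative, sketched here under the assumption that short fragments should be spoken rather than skipped, is to glue them onto the next sentence. The name drainSentencesKeepShort and the minLen parameter are made up for this example.

```typescript
// Hypothetical variant: short fragments are merged into the next sentence
// instead of being discarded.
function drainSentencesKeepShort(buf: string, minLen = 10): [string[], string] {
  const parts = buf.split(/(?<=[.!?])\s+/);
  const remaining = parts.pop() ?? ""; // the unfinished tail
  const sentences: string[] = [];
  let carry = "";
  for (const part of parts) {
    const candidate = carry ? `${carry} ${part}` : part;
    if (candidate.trim().length > minLen) {
      sentences.push(candidate);
      carry = "";
    } else {
      carry = candidate; // still too short, keep accumulating
    }
  }
  // Anything still carried goes back in front of the tail for the next call.
  return [sentences, carry ? `${carry} ${remaining}` : remaining];
}
```

Either version needs a final flush when the stream ends, since the last sentence never gets trailing whitespace to split on.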

Every time a complete sentence lands, I fire a TTS request immediately — in parallel with Claude still generating the rest of the response. By the time Claude finishes, several TTS requests are already resolved or nearly done. The avatar starts speaking the first sentence while the model is still writing the last one.

for (const s of sentences) ttsQueue.push(fetchTTSBuffer(s));
// ...
await simliRef.current.speak(ttsQueue); // plays promises in order as they resolve

The speak() method on the avatar component accepts an array of promises and plays them in order as each resolves, so sentence 1 plays as soon as it's ready, sentence 2 follows once sentence 1 has finished and its own audio has arrived, and so on. Playback is sequential; fetching is parallel.
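
The ordering logic can be sketched without the avatar at all. This is a stand-in, not the component's real speak() method: playInOrder and its play callback are hypothetical names, and the callback represents whatever actually pushes audio to Simli.

```typescript
// Hypothetical stand-in for speak(): consume audio promises in queue order,
// regardless of which request resolves first.
async function playInOrder<T>(
  queue: Promise<T>[],
  play: (audio: T) => Promise<void>,
): Promise<void> {
  for (const pending of queue) {
    const audio = await pending; // waits for this sentence even if later ones are ready
    await play(audio);           // finish this sentence before starting the next
  }
}
```

Because every request was started before the loop runs, awaiting them one at a time costs nothing extra: later sentences keep downloading while earlier ones play. A fuller version would also decide what to do when one of the requests fails.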

The voice is OpenAI's onyx — deep, clear, deliberate. Fits the portfolio context.
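
For reference, a sketch of the request a server-side TTS helper could send. It assumes the server calls OpenAI's REST speech endpoint directly; the internals of fetchTTSBuffer aren't shown here, and buildSpeechRequest is a name invented for this example. The pcm response format is what yields headerless 24kHz 16-bit samples.

```typescript
interface SpeechRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the request only, so it can be inspected without a network call.
function buildSpeechRequest(text: string, apiKey: string): SpeechRequest {
  return {
    url: "https://api.openai.com/v1/audio/speech",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "tts-1",
        voice: "onyx",
        input: text,
        response_format: "pcm", // raw samples, no WAV header
      }),
    },
  };
}

// Server-side usage (the key should never reach the browser):
//   const { url, init } = buildSpeechRequest(sentence, apiKey);
//   const pcm24 = await (await fetch(url, init)).arrayBuffer();
```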

The Greeting Race Condition

When chat opens, the avatar should greet the user immediately. Simple in concept. Less simple in practice, because there are two async processes that both need to be ready: the chat panel needs to be open, and the Simli WebRTC connection needs to be established.

Either one can happen first. If the user opens chat before Simli finishes connecting, the greeting fires into the void. If Simli connects before the user opens chat, the ready event has already fired by the time you need it.

The fix was a ref mirror pattern — maintain a ref that tracks the current value of the chat state, updated by a useEffect, so that the onSimliReady callback can read the current value without capturing a stale closure:

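const chatOpenRef = useRef(false);   // mirrors chatOpen for use inside callbacks
const simliReadyRef = useRef(false); // set once the Simli connection is ready
useEffect(() => { chatOpenRef.current = chatOpen; }, [chatOpen]);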

const onSimliReady = useCallback(() => {
  simliReadyRef.current = true;
  if (chatOpenRef.current) fireGreeting();
}, [fireGreeting]);

useEffect(() => {
  if (chatOpen && simliReadyRef.current) fireGreeting();
}, [chatOpen, fireGreeting]);

Both paths covered. Chat opens first → onSimliReady fires the greeting when Simli connects. Simli connects first → the useEffect fires the greeting when chat opens. One greetedRef guard prevents it from firing twice.
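
Stripped of React, the handshake comes down to two flags and a one-shot guard. The sketch below models that logic only; createGreetingGate and its method names are invented for illustration, and the greeted flag plays the role of greetedRef.

```typescript
// Framework-free model of the two-flag handshake.
function createGreetingGate(fireGreeting: () => void) {
  let chatOpen = false;
  let simliReady = false;
  let greeted = false; // one-shot guard

  const tryGreet = () => {
    if (chatOpen && simliReady && !greeted) {
      greeted = true;
      fireGreeting();
    }
  };

  return {
    setChatOpen(open: boolean) {
      chatOpen = open;
      tryGreet();
    },
    markSimliReady() {
      simliReady = true;
      tryGreet();
    },
  };
}
```

Whichever event arrives second triggers the greeting, and reopening the chat later does not repeat it.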

Suggestion Pre-Caching

The suggested questions at the bottom of the chat panel feel instantaneous when you click them. That's by design: the answers are already in flight the moment the chat opens.

When the chat panel opens, it silently fires all four suggested questions against the API in parallel and caches the responses. By the time the user clicks one, the answer is usually already there.

useEffect(() => {
  if (!chatOpen) return;
  SUGGESTED.forEach(async (q) => {
    if (suggestionCache.current[q.text]) return;
    const res = await fetch('/api/chat', { method: 'POST', ... });
    suggestionCache.current[q.text] = await res.text();
  });
}, [chatOpen]);

The cost is four extra API calls on open. The payoff is that the first interaction feels instant. For a portfolio where you want the first impression to land well, that tradeoff is obvious.
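
One gap in the snippet above: the cache is filled only after a response has fully arrived, so a click that lands before then presumably misses it and starts a second request. A possible refinement, sketched here with invented names (createPromiseCache, getOrStart), is to cache the in-flight promise instead of the finished text.

```typescript
// Hypothetical cache that stores the in-flight promise, so a click during the
// prefetch awaits the same request instead of starting a new one.
function createPromiseCache<T>() {
  const cache = new Map<string, Promise<T>>();
  return {
    getOrStart(key: string, start: () => Promise<T>): Promise<T> {
      let pending = cache.get(key);
      if (!pending) {
        pending = start().catch((err) => {
          cache.delete(key); // forget failures so a later click can retry
          throw err;
        });
        cache.set(key, pending);
      }
      return pending;
    },
  };
}
```

The prefetch effect and the click handler would both call getOrStart with the question text. The prefetch call should attach its own catch handler so a failed background request doesn't surface as an unhandled rejection.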

Performance: Lazy Loading the SDK

Simli's client library is ~500KB. If it's part of the main bundle, every visitor to the site pays that cost before they ever open the chat. Most visitors never open the chat.

The fix is React's lazy():

const SimliAvatar = lazy(() => import('../components/SimliAvatar'));

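The SDK now loads as a separate chunk only when the component first renders, which requires a Suspense boundary somewhere above it to show a fallback while the chunk downloads. The initial page bundle stays lean. The WebRTC connection starts when it's actually needed.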

Alongside that: converting all images from PNG to WebP dropped the two main assets from 4.7MB and 3.7MB down to 128KB and 256KB respectively — a combined reduction of over 8MB. LCP on mobile went from 15.9 seconds to well under 3. That's not a performance win. That's a performance rescue.

What I'd Do Differently

A few things I'd approach differently if I were starting over:

Skip the resampling, negotiate a matching format. The 24kHz-to-16kHz downsampling works fine, but it's an extra step that wouldn't be necessary if the TTS output and Simli's input agreed on a sample rate. Worth checking whether Simli adds format flexibility over time.

Add streaming to the TTS endpoint. Right now each TTS request waits for the full audio buffer before returning. Streaming the PCM bytes as they're generated from OpenAI would reduce per-sentence latency further — especially on longer sentences.
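
A streamed response brings one wrinkle: network chunks arrive at arbitrary byte lengths, while a 3-to-2 downsampler wants whole groups of three 16-bit samples (6 bytes). A rough sketch of a helper that would sit between the stream reader and the downsampler follows; createPcmAligner is a made-up name, and reading chunks via response.body.getReader() is an assumption about how the streaming side would be wired.

```typescript
// Hypothetical helper: returns only whole groups of bytes and carries the
// leftover bytes into the next call.
function createPcmAligner(groupBytes = 6) {
  let carry = new Uint8Array(0);
  return function align(chunk: Uint8Array): Uint8Array {
    const joined = new Uint8Array(carry.length + chunk.length);
    joined.set(carry, 0);
    joined.set(chunk, carry.length);
    const usable = joined.length - (joined.length % groupBytes);
    carry = joined.slice(usable);   // held back until more bytes arrive
    return joined.slice(0, usable); // safe to downsample now
  };
}
```

Each aligned block can be downsampled and sent on its own. At the end of the stream, fewer than six bytes may remain in the carry, which can be dropped or padded.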

Version the career chunks. The RAG system works well, but right now re-embedding everything is an all-or-nothing operation. A diff-aware embedding update that only re-embeds changed chunks would be cleaner for ongoing maintenance.
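
Because the table already stores each chunk's content next to its embedding, one way to get there is to compare the local text against the stored text and re-embed only what differs. This is a sketch of that comparison, not something the current script does; ChunkText, EmbeddingPlan, and planEmbeddingUpdate are hypothetical names.

```typescript
interface ChunkText {
  id: string;
  content: string;
}

interface EmbeddingPlan {
  toEmbed: ChunkText[]; // new or edited chunks that need a fresh embedding
  toDelete: string[];   // ids still in the database but gone from the source file
}

// Compare the local chunks with the rows currently stored in the database.
function planEmbeddingUpdate(current: ChunkText[], stored: ChunkText[]): EmbeddingPlan {
  const storedById = new Map(stored.map((row) => [row.id, row.content] as const));
  const currentIds = new Set(current.map((chunk) => chunk.id));
  return {
    // A missing id returns undefined, which never equals the content, so new chunks are included.
    toEmbed: current.filter((chunk) => storedById.get(chunk.id) !== chunk.content),
    toDelete: stored.filter((row) => !currentIds.has(row.id)).map((row) => row.id),
  };
}
```

The stored side would come from selecting id and content from career_chunks; only the toEmbed list then goes through the embedding call and upsert.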

The Result

The avatar greets you when you open the chat. It answers questions about my career from real context, not hallucinations. It sounds consistent because the voice model, the persona prompt, and the underlying data are all aligned. It starts talking before it's finished thinking. And it knows I keep score on the golf course.

That's the bar I set when I started this. I think the avatar clears it.

If you want to ask it something, the chat's open.