Interviewer: Please design a system to moderate offensive language on a video streaming platform like Twitch or Instagram Live. To keep things simple, just have the system alert streamers when they’ve said a banned word.
Got it. So if the streamer says something like, “Oh shit, chat! My shirt’s on fire!!” we would want to send them a notification to watch their language?
Interviewer: Haha, that’s right. Maybe we can think about layering in context awareness later on, but simple banned word detection is a good start.
Oh, and this is probably obvious, but we want this as real-time as possible.
Ok, interesting. Interesting. Yeah, I’ll have to think through that a bit, but maybe I’ll just get started and see where that takes us.
I’m seeing three main steps. First, we need to pick up the audio stream, process it, and transcribe it into English words. Then, we’ll compare the transcript against our banned list. If a word matches, we send a notification back to the device to alert the user, or we ban them, or whatever.
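Just to make the shape concrete, here’s a minimal sketch of that loop; transcribe_chunk and send_alert are placeholders I’m assuming for whatever speech-to-text and notification pieces we end up picking, not real APIs.

```python
# Minimal sketch of the three-step loop. transcribe_chunk() and send_alert()
# are assumed placeholders for the speech-to-text and notification services.
BANNED_WORDS = {"shit"}  # in practice, loaded from a configurable banned list

def moderate_chunk(audio_chunk, transcribe_chunk, send_alert):
    # Step 1: turn the audio chunk into text.
    transcript = transcribe_chunk(audio_chunk)
    # Step 2: compare the transcript against the banned list (exact match, no fuzzy search).
    matches = BANNED_WORDS.intersection(word.lower() for word in transcript.split())
    # Step 3: alert the streamer if anything matched.
    if matches:
        send_alert("Watch your language: " + ", ".join(sorted(matches)))
    return matches
```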
Interviewer: That’s a great start. Go ahead and fill in some more details.
Sure. Just—let me write some notes real quick—I’m struggling a bit with the real-time aspect. I’m ballparking this at a P90 of 500 ms round-trip.
Interviewer: 500 ms is not so bad! Why don’t you walk me through the timing?
Okay, yeah, let’s talk this out.
So the naive approach is to pick up the sound via the client’s onboard microphone and process it server-side, probably via our hosting service’s toolchain. Sound waves travel pretty fast, so the time from articulation to microphone is only about 1 ms, but digitizing the signal, a network hop, and the transcription probably add about 250 ms of continuous overhead.
(I’m also ignoring for the moment that it takes about 100 ms for the streamer to actually say a single-syllable word like “shit”.)
The lookup is fast enough at about 10 ms—we won’t use fuzzy search—and then the alert will take about 150 ms to be sent over the network, received, and displayed.
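Tallying those up, roughly; every figure here is a ballpark estimate from what I just said, not a measurement:

```python
# Ballpark latency budget for the naive server-side pipeline.
# All values are rough estimates from the discussion, not measurements.
pipeline_ms = {
    "articulation_to_microphone": 1,
    "digitization_network_hop_and_transcription": 250,
    "banned_list_lookup": 10,           # exact match, no fuzzy search
    "send_receive_and_display_alert": 150,
}
speaking_the_word_ms = 100  # ~100 ms to say a single-syllable word (ignored above)

print(sum(pipeline_ms.values()))                         # ~411 ms system-side
print(sum(pipeline_ms.values()) + speaking_the_word_ms)  # ~511 ms, close to the ~500 ms P90 ballpark
```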
If the streamer is paying attention, and literate, they’ll take about a half second to read a ten-word message. I don’t have a good sense of the user base, so I can’t estimate how long it would take them to modify their behavior.
Interviewer: Haha, got it. Love the focus on the user; most people ignore that part. I think that works as a start. How about we make the content detection more nuanced?
Actually, I’ve got some ideas on improving the latency.
Interviewer: Oh, I’m comfortable with a half-second delay.
Huh. I mean a half-second is obviously not real-time.
Interviewer: Okay, go ahead.
All I was going to say was that we can save the network hop if we do all the voice processing onboard the device and download the offensive-language list to the client. That could save us around 200 ms.
And, honestly, now that I’m thinking about it, we could avoid the speech-to-text piece altogether and just pass the audio through a client-side Wav2Vec transformer and determine matches directly. It’s slower than the lookup itself, and we’d have to do some model tuning, but it probably nets out about 50 ms faster per evaluation.
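As a rough sketch, and only a sketch: I’m assuming the open facebook/wav2vec2-base checkpoint as the on-device encoder, with a small frame-level head we’d still have to fine-tune to flag banned-word acoustics. The head below is untrained and purely illustrative.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed setup: the open wav2vec2 base checkpoint as the on-device acoustic encoder.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Hypothetical frame-level head; this is the part we'd fine-tune to score
# "banned word being said" directly, skipping full speech-to-text.
banned_word_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def contains_banned_word(waveform, sample_rate=16_000, threshold=0.5):
    # waveform: 1-D float array of raw audio samples at 16 kHz
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = encoder(inputs.input_values).last_hidden_state   # (1, frames, hidden)
        scores = torch.sigmoid(banned_word_head(frames)).squeeze(-1)
    return bool((scores > threshold).any())
```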
Interviewer: Okay, great. Good ideas. Trade off a bit more overhead on the client-side development, but about twice as fast.
Just trying to meet the requirements.
Interviewer: Appreciate that. So, about the context-aware functionality—
Sorry, excuse me. I just—now the latency on the signal generation side is killing me.
We’re just sitting there waiting nearly 100 ms for the user to finish saying “shit”. That’s, like, human muscle speed. Meat speed. It would be way faster if we knew when they were starting to say it.
Interviewer: This isn’t Minority Report. We don’t—
No, no, I’m not talking about pre-cog anything. It’s all post-cog. The brain decides to speak, then it encodes the pattern, then sends it to the mouth. Nearly 300 ms elapses between Broca’s area activity and the /t/ phoneme.
If we wanted to pre-cog it, we’d target the prefrontal cortex, not Broca’s area.
The best part is, we can totally get rid of the microphone. We can ship the streamers a welcome package with a few electrodes—
Interviewer: Let me stop you there. Love the dedication to performance, but I’m not sure this is the best rabbit hole to go down.
Sorry—look, we don’t have to dwell. It’s basically: Welcome package. Microelectrode array. Onboarding video. Fiber-optic USB-C cable. RNNs.
Boom. 30 ms from the intention to say “shit” to detection, versus the 350 ms we were sleeping through earlier.
Interviewer: That’s disgusting.
Well, we could do EEG, but that’s disgusting in its own way. I just don’t see your influencers spraying electrode gel in their hair before going on a stream. Noisier signal, too.
Interviewer: We could never—
Oh! Oh, this is perfect. Because I was really getting hung up on the fact that, look, we can drive detection latency down like crazy, but the gaping hole this whole time is—what if the user just doesn’t read the notification?
We can get latency as low as we want, but it doesn’t matter unless we close that loop. It could be ages, literally EONS, or INFINITY, before our system’s work is ever complete.
But now, but NOW, we have the tools we need to solve that problem.
Interviewer: That’s out of scope.
Just hear me out.
During onboarding, we get the user to plant ANOTHER electrode, this time in the anterior cingulate cortex. Bam. We’ve closed the reward loop. Detect-stimulate. Detect-stimulate. We can be CERTAIN our system has delivered the punishment appropriately, rather than inferring success from the display of a notification and an indefinite stretch of behavior monitoring.
So now we’re talking, let’s see, maybe 50 ms total for a round-trip. Much closer to our target since we’re kicking things off just as motor activity is beginning.
This setup is VERY CLOSE to stopping “shit” at the source.
Interviewer: Are you done?
We’ll have logging on all this by the way. Haven’t mentioned that.
Interviewer: You’ve totally missed the point.
I have?
Interviewer: Yes. This is totally detached from reality—
WAIT. I see where you’re going with this now. I see it!
Real. Time. Data.
It’s a trick question!
Interviewer: No!
YES!
You SAID “as real-time as possible,” but EVERYONE knows there’s no fixed definition of time, and that it’s all from the point of view of THE OBSERVER. That was, like, Einstein’s whole thing! So it’s FOOLISH to try and push it lower and lower, because from THE SYSTEM’S PERSPECTIVE, we can always get faster and faster, until we hit fundamental hardware constraints and bottom out in uncertainty, so that EVEN THEN, we couldn’t assert we are operating in REAL TIME because, again, TIME IS SUBJECTIVE.
BUT THAT’S THE WHOLE POINT YOU WERE TRYING TO MAKE!
Interviewer: That wasn’t—
It’s brilliant!
From MY PERSPECTIVE, A must follow B. But FROM THE USER’S PERSPECTIVE, if A and B occur within the same neural window of stimulation, then A does NOT follow B. A and B co-occur!
AS A USER, to say “shit” is to be punished; to be punished is to say, “shit!”
THAT'S what you were getting at!
Interviewer: I’m—
HEBBIAN LEARNING! NEURONS THAT FIRE TOGETHER, WIRE TOGETHER!
At sub-50 ms, A and B induce long-term potentiation, so A and B are not only PERCEIVED the same, they BECOME the same! A IS B!
INSIDE THE SYSTEM, TIME GOES TO ZERO!
TIME BECOMES A NON-FACTOR!
TIME IS NO LONGER REAL!
BOOM!
Boom!
Boom, whew. Whew, great question! What a ride.
Was NOT expecting a twist like that.
Hello?
Can you hear—am I frozen? Hello?
Ah, shit.
This is hilarious!
Love this haha