Lessons Learned Building a GenAI App with 10k users
After basically two and a half years of being ~100% billable on client work (I am very tired), I've recently rotated onto an internal tool team to lead their application rebuild. It's an interesting project: building a full-stack application that uses GenAI tech to automate large portions of internal workflows that were previously done by hand – a pretty common GenAI use case, if you believe the hype. Interestingly, we have a lot of uptake: 10k users within the firm and growing, with interest from clients and partners alike.
Of all the projects I've worked on, this one has provoked the most interesting conversations: what will happen to the people who do this work by hand today? How much more of our internal work can be automated? If we can automate 60% of the grunt-work of the firm... where does that leave us? Not to mention the AI questions – some people on the team are firmly "yeah we'll see about this whole GenAI thing[1]" and others are firmly "the AI is smarter than all of us and we need to accept it."
I like to think about philosophical questions, and I love the fact that real, genuine political and philosophical questions are openly discussed on the team. But as an engineer, ever practical, at the end of the day my biggest question is: What does this GenAI tech change about the hands-on experience of building software, and running software projects? What about this tech changes my real, actual life? Here are a few thoughts:
GenAI Is Actually Useful
Doubtlessly the more AI bullish among you will find this insight banal – that's fine with me. Personally, after seeing the number of Crypto and Web3 people who switched to being GenAI people, I was skeptical of the tech even though projects like AI Dungeon were quite impressive. But I have come to believe that GenAI is innovative and impressive technology, even if there are plenty of problems left to solve.
Setting aside the fact that the entire team uses Copilot Enterprise and ChatGPT Enterprise many times daily to do their work (with mixed results on occasion[2]), there have been many times that GenAI has surprised me with the ability to solve practical problems in the real world. One example: I wanted to classify some images generated by our internal workflow according to a fixed taxonomy (this is an example of image type "news story," this is an example of image type "selfie," etc.) Previously, I suppose, we would have had to train a custom image classifier for that, and a brief perusal of open source tools indicated I would need to gather and hand-classify a few hundred examples in order to do so.
But on the advice of our fearless leader Jordan ("Zero shot that shit!") I instead made a PDF slide deck of the different kinds of image I wanted to classify, labeling each set of about 5 examples. I converted that PDF to an ordered series of images, then submitted them all as system prompts to the LLM API with a setup prompt something like "I'm going to give you some examples of images and their classifications, and then a new image that it is your job to classify" along with some role-playing info. To my great shock, it seems to get the classification right over 90% of the time!
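Mechanically, this amounts to interleaving labeled example images with one final unlabeled image in a single chat-style request. Here's a minimal sketch of how such a payload could be assembled – the function names, model name, and exact message shape are my assumptions (modeled on OpenAI-style chat APIs with base64 image parts), not our actual production code:

```python
# Sketch: build a few-shot image-classification request for a chat-style
# vision LLM. Message shape mirrors OpenAI-style APIs; names are illustrative.

SETUP_PROMPT = (
    "I'm going to give you some examples of images and their "
    "classifications, and then a new image that it is your job to classify. "
    "Answer with the label only."
)

def image_part(b64_png: str) -> dict:
    """Wrap a base64-encoded PNG as an image content part."""
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64_png}"}}

def build_messages(examples: list[tuple[str, str]], target_b64: str) -> list[dict]:
    """examples: (label, base64_image) pairs; target_b64: the image to classify."""
    content = []
    for label, b64 in examples:
        content.append({"type": "text", "text": f"Example of type '{label}':"})
        content.append(image_part(b64))
    content.append({"type": "text", "text": "Now classify this image:"})
    content.append(image_part(target_b64))
    return [{"role": "system", "content": SETUP_PROMPT},
            {"role": "user", "content": content}]

# The actual call would then be something like:
# response = client.chat.completions.create(model="gpt-4o", messages=msgs)
```

The nice property of this shape is that adding a new category to the taxonomy is just appending a few more (label, image) pairs – no retraining pipeline involved.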
While it is doubtlessly going to be wrong on occasion, especially for edge cases and for parts of the taxonomy with fewer examples, this works for our purposes. Furthermore, GenAI solving that problem allowed us to build an all-new feature allowing some user personalization that is critical to our summer release. Perhaps a more traditional image classifier would be more accurate more often – but at 90%+ perceived accuracy and under an hour of work to build, I don't really care!

GenAI Will Randomly Fail and Users Will Blame You
Programmers have long sought to move errors from run time to compile time, and then to build rules into programming languages themselves to protect against them ever happening at all. Furthermore, programming – at least, most programming – generally assumes that systems will respond deterministically. These are the bedrocks upon which debugging, operations strategies, system monitoring, and many other rituals of our profession are built[3].
GenAI will change all that! Consider our industry disrupted[4].
Aside from the usual external system dependencies (they can have an outage!), there are also new failure modes: the GenAI system could suddenly output gibberish at every request, or, potentially worse, could hallucinate information and lie to users confidently. Users already have to deal with the increasingly tall house of cards that is modern application infrastructure; another critical cloud dependency isn't helping. There is a certain class of... shall we say, executive? lawyer? that is quick to point out in fine print that of course you shouldn't rely exclusively on GenAI and should fact-check everything it outputs. But realistically, users do not care. If you have a box that gives information, and the information is wrong, users will blame you. If you have a box that solves their problem, and the box breaks, they will blame you. No amount of explanation will undo the user tendency to take every capability of your app for granted, instantly, and without hesitation or remorse.
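The practical upshot is that every GenAI call should be treated as a dependency that can return garbage, not just one that can time out. A sketch of the kind of defensive wrapper that implies – `call_llm` and `looks_valid` here are stand-ins for whatever your app actually uses, not anything from our codebase:

```python
# Sketch: wrap a nondeterministic LLM call with output validation, retries,
# and an explicit failure instead of passing gibberish through to the user.

class LLMOutputError(Exception):
    """Raised when the model never produced a usable response."""

def guarded_completion(call_llm, looks_valid, prompt: str, retries: int = 2) -> str:
    """Call the model, re-asking a few times if the output fails validation.

    call_llm:    callable taking a prompt string, returning the model's text
    looks_valid: callable taking that text, returning True if it's usable
    """
    last = None
    for _ in range(retries + 1):
        last = call_llm(prompt)
        if looks_valid(last):
            return last
    # Surface the failure honestly rather than showing the user garbage.
    raise LLMOutputError(f"No valid output after {retries + 1} attempts: {last!r}")
```

In my experience the validator matters far more than the retry count: "parses as the format we asked for" catches the gibberish case cheaply, but nothing cheap catches a confident hallucination – which is exactly the problem.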
An example: We have a freeform input that allows users to work together with the LLM to create a visual output. That output contains a lot of information and text. Typically that text is drafted by a person, and the contents edited with GenAI. To our surprise, one day a user asked us "when I use the LLM to convert currencies, from what source does it get the current exchange rates?" Turns out, this person was prompting the LLM with very specific information, including financial data, and asking for currency conversion in the output. We had never thought of this, and were quite horrified! As it turns out, the LLM we're using does appear to have a currency conversion "mode," but the source of its exchange rates, and how often they were updated, was unclear.
If this person's output from our app had been placed in front of a stakeholder to whom the specific financial information mattered, we could have been in quite an uncomfortable position. It wasn't our application that did that conversion, but it's impossible to imagine an executive or senior leader saying "hey I know it's not your fault we got the USD to GBP conversion wrong, haha, this just happens with AI sometimes." Users, leaders, and stakeholders will hold you accountable for not preventing that input[5].
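One mitigation we could have had in place is a cheap pre-flight check that flags prompts wandering into territory the app was never meant to cover, like live financial data. A toy sketch – the phrase list and the idea of keyword matching are purely illustrative; a production version might use a small classifier instead:

```python
import re

# Sketch: flag user prompts that ask for data the LLM cannot reliably supply,
# such as live exchange rates. The pattern list is illustrative, not exhaustive.

OUT_OF_SCOPE_PATTERNS = [
    r"\b(convert|conversion)\b.*\b(usd|gbp|eur|currency|currencies)\b",
    r"\bexchange rates?\b",
    r"\b(current|live|today'?s) (price|rate)s?\b",
]

def flag_out_of_scope(prompt: str) -> bool:
    """Return True if the prompt requests live data we should warn about."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in OUT_OF_SCOPE_PATTERNS)
```

Flagged prompts can then get a visible "this app does not have access to live exchange rates" warning attached to the output – which at least moves the fine print to where the user will actually see it.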
Garbage In, Garbage Out, Garbage All The Way Down
One interesting thing about our application is that 99% of it is a traditional webapp backed by a cloud provider. A random full stack dev off the street could easily work on our application, provided they had a sense of humor. In some ways, that fact of our architecture is great: it's been easy to staff members of the project team because many people have the experience necessary. In other ways, that's a warning: GenAI be damned, the difficult parts of our project are cloud infrastructure, politics, and code quality/hygiene.
I am fortunate that in my time at BCG I have done a wide variety of work – from IoT backend builds, to quick POCs in NextJS, full-stack webapps, and even data platform builds. It is the last category about which I am most bullish in the next few years. GenAI tech basically assumes you have your data ready for analysis, and the worse off your data is, the farther from correct your output will be. How many companies, regardless of size, follow data engineering best practices? How many have a single source of truth for their business? How many think that Salesforce (may its name be a curse!) is that for them, and are 5 years from realizing it isn't?
GenAI tech, at the margin, can seriously enhance certain applications with minimal integration difficulty. But your data and your team must be ready to harness it. If your org or team can't ship good quality software, then GenAI can't save you. Innovative technology is no substitute for professionalism and focus.

What's Next in GenAI?
So where do we go from here? Right now GenAI tech isn't capable of building the full application by itself, and plays only a marginal role in the featureset, even if it powers the most interesting features of the app we're building. I'm going to go out on a limb and make a few predictions for how things evolve from here:
GenAI Goes To Court
Oh yeah, we should probably find out if this tech is legal? I mean, there's the question of whether or not specific instances of "building the torment nexus" were legal. But more broadly, there are plenty of serious copyright concerns about LLMs and their training data. Many people who, uh, invest in GenAI companies argue that 1. it isn't illegal to scrape copyrighted material for training data and that 2. even if it is, it shouldn't be. But we'll see! Google and OpenAI have both paid Reddit for their data to train with. If the data is free for everyone to use simply by being on the internet, why pay for it? Could it be... to avoid disputes over ownership? Oops!
LLMs Get Smarter... Kinda
Improvements to LLMs aren't going to slow down per se, especially given the bubble of rapid investment in the tech. But I do think we will rapidly run out of training data, and that proposals of "idk we'll just make it up" aren't really realistic. If the AI were good enough to generate its own training data, what would you need to train it for? That said, LLM-powered enhancements to things like image generation, voice synthesis, audio transcription, etc. will likely continue for a while. I certainly know our application has tons of headroom to grow without hitting the limit of the LLM. Speaking of LLMs improving consumer experiences...
Voice Assistants Are So Back
Remember when we thought Alexa was going to be the next big thing? I had a guy from a previous job try to label himself an "Alexa developer." What would that even have meant? And to whom was that marketing effective? But now, it turns out, the thing at least some people wanted from virtual assistants was computers that kinda sounded like people. Maybe even, uh, specific people. Will Alexa, Siri, or even Cortana suddenly become actually useful? Probably not for a while. Surely the tech backing the Rabbit R1 has to become real before that's possible. But now, finally, the voice assistants can sound like real people, and so I expect to see them come back in a big way. Maybe this is Cortana's big chance!
Footnotes:
[1]: Cards on the table: I have been and remain an AI skeptic. I think it's cool that "spicy autocomplete" has proven to be able to do so much, but the idea that GenAI has reasoning, thought, or logic capabilities is pretty laughable to me still today. Of course, it is useful even in its limited state. If you can break a problem down into "spicy autocomplete" territory, it can be surprisingly great at solving it.
[2]: The amount of incorrect logical leaps made by Copilot in its current form is pretty astonishing. Maybe this will get solved when they figure out how to performantly send our entire repo in the context window/to an assistant, though I wonder about the legal issues there. ChatGPT has proven way, way better at coding help than Copilot for meaningful tasks, but is a bit more of a pain to use. I'm bullish on GenAI intellisense long term though.
[3]: Gross oversimplifications, but they're stylistically correct. It's my blog!
[4]: I used to hate the term disruption but now I love it – so many of the 2010s "disruptors" like Uber or Doordash are now pretty universally reviled for creating a worse landscape than before they existed. "Disruption" indeed!
[5]: Or, potentially, not flagging that the output is wrong.