The three of us have been intimately involved in creating and improving Birdbrain, of which Duolingo recently launched its second version. We see our work at Duolingo as furthering the company’s overall mission to “develop the best education in the world and make it universally available.” The AI systems we continue to refine are necessary to scale the learning experience beyond the more than 50 million active learners who currently complete about 1 billion exercises per day on the platform.
Although Duolingo is known as a language-learning app, the company’s ambitions go further. We recently launched apps covering childhood literacy and third-grade mathematics, and these expansions are just the beginning. We hope that anyone who wants help with academic learning will one day be able to turn to the friendly green owl in their pocket who hoots at them, “Ready for your daily lesson?”
The origins of Duolingo
Back in 1984, educational psychologist Benjamin Bloom identified what has come to be called Bloom’s 2-sigma problem. Bloom found that average students who were individually tutored performed two standard deviations better than they would have in a classroom. That’s enough to raise a person’s test scores from the 50th percentile to the 98th.
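The percentile arithmetic behind that claim can be checked in a few lines. This is only a sketch: it assumes test scores follow a normal distribution, and the helper name `percentile_after_shift` is ours, for illustration.

```python
from statistics import NormalDist

# Assumption: test scores are normally distributed, so we can work in
# units of standard deviations on a standard normal curve.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def percentile_after_shift(start_percentile, sigmas):
    """Percentile reached after moving `sigmas` standard deviations upward."""
    z_start = standard_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * standard_normal.cdf(z_start + sigmas)

# A median student who improves by two standard deviations.
print(round(percentile_after_shift(50.0, 2.0), 1))  # 97.7
```

Rounded to the nearest whole number, 97.7 is the 98th percentile quoted above.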
When Duolingo was launched in 2012 by Luis von Ahn and Severin Hacker out of a Carnegie Mellon University research project, the goal was to make an easy-to-use online language tutor that could approximate that supercharging effect. The founders weren’t trying to replace great teachers. But as immigrants themselves (from Guatemala and Switzerland, respectively), they recognized that not everyone has access to great teachers. Over the ensuing years, the growing Duolingo team continued to think about how to automate three key attributes of good tutors: They know the material well, they keep students engaged, and they track what each student currently knows, so they can present material that’s neither too easy nor too hard.
Duolingo uses machine learning and other cutting-edge technologies to mimic these three qualities of a good tutor. First, to ensure expertise, we employ natural-language-processing tools to assist our content developers in auditing and improving our 100-odd courses in more than 40 different languages. These tools analyze the vocabulary and grammar content of lessons and help create a range of possible translations (so the app will accept learners’ responses when there are multiple correct ways to say something). Second, to keep learners engaged, we’ve gamified the experience with points and levels, used text-to-speech tech to create custom voices for each of the characters that populate the Duolingo world, and fine-tuned our notification systems. As for getting inside learners’ heads and giving them just the right lesson, that’s where Birdbrain comes in.
Birdbrain is crucial because learner engagement and lesson difficulty are related. When students are given material that’s too difficult, they often get frustrated and quit. Material that feels easy might keep them engaged, but it doesn’t challenge them as much. Duolingo uses AI to keep its learners squarely in the zone where they remain engaged but are still learning at the edge of their abilities.
One of us (Settles) joined the company just six months after it was founded, helped establish various research functions, and then led Duolingo’s AI and machine-learning efforts until earlier this year. Early on, there weren’t many organizations doing large-scale online interactive learning. The closest analogues to what Duolingo was trying to do were programs that took a “mastery learning” approach, notably for math tutoring. Those programs offered up problems around a similar concept (often called a “knowledge component”) until the learner demonstrated sufficient mastery, and only then moved on to the next unit, section, or concept. But that approach wasn’t necessarily the best fit for language, where a single exercise can involve many different concepts that interact in complex ways (such as vocabulary, tenses, and grammatical gender), and where there are different ways in which a learner can respond (such as translating a sentence, transcribing an audio snippet, and filling in missing words).
The early machine-learning work at Duolingo tackled fairly simple problems, like how often to return to a particular vocabulary word or concept (which drew on educational research on spaced repetition). We also analyzed learners’ errors to identify pain points in the curriculum and then reorganized the order in which we presented the material.
Duolingo then doubled down on building personalized systems. Around 2017, the company started to make a more focused investment in machine learning, and that’s when coauthors Brust and Bicknell joined the team. In 2020, we launched the first version of Birdbrain.
How we built Birdbrain
Before Birdbrain, Duolingo had made some non-AI attempts to keep learners engaged at the right level, including estimating the difficulty of exercises based on heuristics such as the number of words or characters in a sentence. But the company often found that it was dealing with trade-offs between how much people were actually learning and how engaged they were. The goal with Birdbrain was to strike the right balance.
The question we started with was this: For any learner and any given exercise, can we predict how likely the learner is to get that exercise correct? Making that prediction requires Birdbrain to estimate both the difficulty of the exercise and the current proficiency of the learner. Every time a learner completes an exercise, the system updates both estimates. And Duolingo uses the resulting predictions in its session-generator algorithm to dynamically select new exercises for the next lesson.
When we were building the first version of Birdbrain, we knew it needed to be simple and scalable, because we’d be applying it to hundreds of millions of exercises. It needed to be fast and require little computation. We decided to use a flavor of logistic regression inspired by item response theory from the psychometrics literature. This approach models the probability of a person giving a correct response as a function of two variables, which can be interpreted as the difficulty of the exercise and the ability of the learner. We estimate the difficulty of each exercise by summing up the difficulty of its component features like the type of exercise, its vocabulary words, and so on.
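To make the shape of such a model concrete, here is a minimal sketch of an item-response-style logistic model. It is an illustration, not Birdbrain’s actual code: the feature names, the weights, and the exact form (ability minus summed feature difficulty, passed through a logistic function) are all assumptions.

```python
import math

# Hypothetical per-feature difficulty weights (made-up values).
feature_difficulty = {
    "type:translate": 0.4,
    "word:gato": -0.3,
    "word:comer": 0.2,
}

def exercise_difficulty(features):
    """Sum the difficulty of an exercise's component features."""
    return sum(feature_difficulty.get(name, 0.0) for name in features)

def p_correct(ability, features):
    """Logistic probability that the learner answers correctly.

    Assumed form: sigmoid(ability - difficulty), as in a one-parameter
    item response model.
    """
    return 1.0 / (1.0 + math.exp(-(ability - exercise_difficulty(features))))

exercise = ["type:translate", "word:gato", "word:comer"]
for ability in (-1.0, 0.3, 2.0):
    print(ability, round(p_correct(ability, exercise), 3))
```

When ability equals the summed difficulty, the predicted probability is one-half; stronger learners or easier exercises push it toward 1.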
The second ingredient in the original version of Birdbrain was the ability to perform computationally simple updates on these difficulty and ability parameters. We implement this by performing one step of stochastic gradient descent on the relevant parameters every time a learner completes an exercise. This turns out to be a generalization of the Elo rating system, which is used to rank players in chess and other games. In chess, when a player wins a game, their ability estimate goes up and their opponent’s goes down. In Duolingo, when a learner gets an exercise wrong, this system lowers the estimate of their ability and raises the estimate of the exercise’s difficulty. Just like in chess, the size of these changes depends on the pairing: If a novice chess player wins against an expert player, the expert’s Elo score will be substantially lowered, and the novice’s score will be substantially raised. Similarly, here, if a beginner gets a hard exercise correct, the ability and difficulty parameters can shift dramatically, but if the model already expects the learner to be correct, neither parameter changes much.
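The update described above can be sketched as a single gradient step on the log loss of a logistic model. This is a simplified illustration under our own assumptions: it uses one difficulty number per exercise rather than a sum over features, and the `learning_rate` value is arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update(ability, difficulty, correct, learning_rate=0.1):
    """One stochastic-gradient step after a single exercise attempt.

    The step size is proportional to the "surprise": the gap between the
    observed outcome (1 or 0) and the predicted probability of success.
    This is the same shape as an Elo rating update.
    """
    predicted = sigmoid(ability - difficulty)
    surprise = (1.0 if correct else 0.0) - predicted
    new_ability = ability + learning_rate * surprise
    new_difficulty = difficulty - learning_rate * surprise
    return new_ability, new_difficulty

# A beginner unexpectedly gets a hard exercise right: large shift.
print(update(ability=-2.0, difficulty=2.0, correct=True))
# A strong learner gets an easy exercise right: almost no change.
print(update(ability=2.0, difficulty=-2.0, correct=True))
# A wrong answer lowers ability and raises difficulty.
print(update(ability=0.0, difficulty=0.0, correct=False))
```

Because the step scales with the surprise, outcomes the model already expected barely move either parameter.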
To test Birdbrain’s performance, we first ran it in “shadow mode,” meaning that it made predictions that were merely logged for analysis and not yet used by the Session Generator to personalize lessons. Over time, as learners completed exercises and got answers right or wrong, we saw whether Birdbrain’s predictions of their success matched reality, and if they didn’t, we made improvements.
Once we were satisfied with Birdbrain’s performance, we started running controlled tests: We enabled Birdbrain-based personalization for a fraction of learners (the experimental group) and compared their learning outcomes with those who still used the older heuristic system (the control group). We wanted to see how Birdbrain would affect learner engagement (measured by time spent on tasks in the app) as well as learning (measured by how quickly learners advanced to more difficult material). We wondered whether we’d see trade-offs, as we had so often before when we tried to make improvements using more conventional product-development or software-engineering techniques. To our delight, Birdbrain consistently caused both engagement and learning measures to increase.
Scaling up Duolingo’s AI systems
From the beginning, we were challenged by the sheer scale of the data we needed to process. Dealing with around a billion exercises every day required a lot of inventive engineering.
One early problem with the first version of Birdbrain was fitting the model into memory. During nightly training, we needed access to several variables per learner, including their current ability estimate. Because new learners were signing up every day, and because we didn’t want to throw out estimates for inactive learners in case they came back, the amount of memory grew every night. After a few months, this situation became unsustainable: We couldn’t fit all the variables into memory. We needed to update parameters every night without fitting everything into memory at once.
Our solution was to change the way we stored both each day’s lesson data and the model. Originally, we stored all the parameters for a given course’s model in a single file, loaded that file into memory, and sequentially processed the day’s data to update the course parameters. Our new strategy was to break up the model: One piece represented all exercise-difficulty parameters (which didn’t grow very large), while several chunks represented the learner-ability estimates. We also chunked the day’s learning data into separate files according to which learners were involved and, critically, used the same chunking function across learners for both the course model and learner data. This allowed us to load only the course parameters relevant to a given chunk of learners while we processed the corresponding data about those learners.
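The key idea, one chunking function shared by the model and the data, can be illustrated with a small sketch. Everything here is hypothetical: the chunk count, the CRC32-based function, and the in-memory lists standing in for what would be separate files on disk.

```python
import zlib

NUM_CHUNKS = 4  # hypothetical chunk count

def chunk_of(learner_id):
    """Stable chunk index; the same function is used for model and data."""
    return zlib.crc32(learner_id.encode("utf-8")) % NUM_CHUNKS

def split_by_learner(items, learner_of):
    """Group items into chunks according to the learner each belongs to."""
    chunks = [[] for _ in range(NUM_CHUNKS)]
    for item in items:
        chunks[chunk_of(learner_of(item))].append(item)
    return chunks

# Toy stand-ins for stored ability estimates and one day's lesson data.
abilities = {"ana": 0.2, "ben": -0.5, "chen": 1.1, "dita": 0.0}
events = [("ana", "ex1", True), ("chen", "ex2", False), ("ana", "ex3", True)]

ability_chunks = split_by_learner(abilities.items(), lambda pair: pair[0])
event_chunks = split_by_learner(events, lambda event: event[0])

for index in range(NUM_CHUNKS):
    loaded = dict(ability_chunks[index])  # only this chunk is "in memory"
    for learner_id, exercise_id, correct in event_chunks[index]:
        # The shared chunking function guarantees the learner's
        # parameters are in the chunk that is currently loaded.
        assert learner_id in loaded
```

Because both sides are partitioned by the same function, peak memory is bounded by the largest learner chunk plus the exercise-difficulty parameters, rather than by the total number of learners.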
One weakness of this first version of Birdbrain was that the app waited until a learner finished a lesson before it reported to our servers which exercises the user got right and what mistakes they made. The problem with that approach is that roughly 20 percent of lessons started on Duolingo aren’t completed, perhaps because the person put down their phone or switched to another app. Each time that happened, Birdbrain lost the relevant data, which was potentially very interesting data! We were pretty sure that people weren’t quitting at random; in many cases, they likely quit once they hit material that was especially challenging or daunting for them. So when we upgraded to Birdbrain version 2, we also began streaming data throughout the lesson in chunks. This gave us critical information about which concepts or exercise types were problematic.
Another issue with the first Birdbrain was that it updated its models only once every 24 hours (during a low point in global app usage, which was nighttime at Duolingo’s headquarters, in Pittsburgh). With Birdbrain V2, we wanted to process all the exercises in real time. The change was desirable because learning operates at both short- and long-term scales; if you study a certain concept now, you’ll likely remember it 5 minutes from now, and with any luck, you’ll also retain some of it next week. To personalize the experience, we needed to update our model for each learner very quickly. Thus, within minutes of a learner completing an exercise, Birdbrain V2 will update its “mental model” of their knowledge state.
In addition to occurring in near real time, these updates also worked differently because Birdbrain V2 has a different architecture and represents a learner’s knowledge state differently. Previously, that property was simply represented as a scalar number, as we needed to keep the first version of Birdbrain as simple as possible. With Birdbrain V2, we had company buy-in to use more computing resources, which meant we could build a much richer model of what each learner knows. In particular, Birdbrain V2 is backed by a recurrent neural-network model (specifically, a long short-term memory, or LSTM, model), which learns to compress a learner’s history of interactions with Duolingo exercises into a set of 40 numbers (or, in the lingo of mathematicians, a 40-dimensional vector). Every time a learner completes another exercise, Birdbrain will update this vector based on its prior state, the exercise that the learner has completed, and whether they got it right. It is this vector, rather than a single value, that now represents a learner’s ability, which the model uses to make predictions about how they will perform on future exercises.
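The mechanics of such a recurrent update can be shown with a textbook LSTM cell. This sketch only illustrates how a fixed-size state vector is revised after each exercise; the weights are random and untrained, the input encoding is hypothetical, and treating the hidden state as the 40-number summary is our assumption rather than a detail of Birdbrain V2.

```python
import math
import random

STATE_SIZE = 40  # from the article: 40 numbers per learner
INPUT_SIZE = 8   # hypothetical size of the encoded (exercise, outcome) input

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class LSTMCell:
    """A standard LSTM cell with untrained, randomly initialized weights."""

    def __init__(self, input_size, state_size, seed=0):
        rng = random.Random(seed)
        width = input_size + state_size
        # One weight matrix and bias vector per gate:
        # i = input, f = forget, o = output, g = candidate values.
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(state_size)]
            for gate in "ifog"
        }
        self.biases = {gate: [0.0] * state_size for gate in "ifog"}

    def _gate(self, name, xh, activation):
        return [activation(sum(w * v for w, v in zip(row, xh)) + b)
                for row, b in zip(self.weights[name], self.biases[name])]

    def step(self, x, hidden, cell):
        xh = x + hidden  # concatenate the new input with the prior state
        i = self._gate("i", xh, sigmoid)
        f = self._gate("f", xh, sigmoid)
        o = self._gate("o", xh, sigmoid)
        g = self._gate("g", xh, math.tanh)
        new_cell = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, cell, i, g)]
        new_hidden = [oj * math.tanh(cj) for oj, cj in zip(o, new_cell)]
        return new_hidden, new_cell

def encode(exercise_features, correct):
    """Hypothetical encoding: exercise features plus a correctness flag."""
    return exercise_features + [1.0 if correct else 0.0]

model = LSTMCell(INPUT_SIZE, STATE_SIZE)
hidden, cell = [0.0] * STATE_SIZE, [0.0] * STATE_SIZE
history = [
    ([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], True),
    ([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], False),
]
for features, correct in history:
    hidden, cell = model.step(encode(features, correct), hidden, cell)
print(len(hidden))  # 40
```

In a trained model, the weights would be learned so that this vector is useful for predicting performance on future exercises.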
The richness of this representation allows the system to capture, for example, that a given learner is great with past-tense exercises but is struggling with the future tense. V2 can begin to discern each person’s learning trajectory, which may vary considerably from the typical trajectory, allowing for much more personalization in the lessons that Duolingo prepares for that individual.
Once we felt assured that Birdbrain V2 was accurate and stable, we conducted controlled tests comparing its personalized learning experience with that of the original Birdbrain. We wanted to be sure not only that we had a better machine-learning model but also that our software provided a better user experience. Happily, these tests showed that Birdbrain V2 consistently caused both engagement and learning measures to increase even further. In May 2022, we turned off the first version of Birdbrain and switched over entirely to the new and improved system.
What’s next for Duolingo’s AI
Much of what we’re doing with Birdbrain and related technologies applies outside of language learning. In principle, the core of the model is very general and can also be applied to our company’s new math and literacy apps, or to whatever Duolingo comes up with next.
Birdbrain has given us a great start in optimizing learning and making the curriculum more adaptive and efficient. How far we can go with personalization is an open question. We’d like to create adaptive systems that respond to learners based not only on what they know but also on the teaching approaches that work best for them. What types of exercises does a learner really pay attention to? What exercises seem to make concepts click for them?
Those are the kinds of questions that great teachers might wrestle with as they consider various struggling students in their classes. We don’t believe that you can replace a great teacher with an app, but we do hope to get better at emulating some of their qualities, and at reaching more potential learners around the world through technology.