Attention. Moneyball. Sarcasm.

Attention, one of the mathematical cornerstones fueling the large-language-model revolution, may seem complex and intricate, but at its core it is remarkably simple. Let’s dissect its essence. Envision a top-flight soccer team stacked with superstars and hefty contracts, yet lacking any semblance of chemistry. Game after game, they disappoint, burdened by ego clashes, credit-seeking mentalities, and a dearth of teamwork. The result: a soccer field that doubles as a stage for live embarrassment.

Enter the maverick coach. You. Your mission, should you choose to accept it, is to figure out which players work best together, so the team can become more than the sum of its players and evolve into a soccer superpower. But you are no ordinary coach. You are a data-driven coach, and you have a unique little plan.

You’ve imprinted three secret vectors on each player’s jersey, visible only to you through your magic glasses. The first vector, Q, gauges a player’s passing proficiency in a given position; the second, K, measures their overall game sense and decision-making prowess; the third, V, represents their comprehensive skill level.

With this information, you analyze a midfielder, examining their Q vector and matching it against the K vectors of all the other players. The highest matches indicate complementary pairs: where Di Maria’s passing echoes Mbappe’s game understanding, chemistry thrives! Armed with these player pairs and each player’s overall value V, you select the best combination of eleven players to face any opposition.

That, in essence, is attention. A collection of vectors (Q, K, V) for each word, learned during training, captures how strongly certain words work together. Attention thus becomes a tool for grasping meaning within a context window, be it a sentence, a paragraph, or an entire chapter.
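For readers who want to see the arithmetic behind the analogy, here is a minimal sketch of that Q/K/V matching, written in NumPy. The shapes, the random toy “players,” and the function name are illustrative assumptions on my part, not any particular library’s implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Blend each position's V with the Vs of positions whose K best matches its Q."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # how well each query matches each key
    scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                      # weighted blend of the values

# Toy example: 4 tokens ("players"), each with an 8-dimensional Q, K, and V vector.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (4, 8)
```

Each output row is a weighted blend of the V vectors, with the weights decided by how well that row’s Q matches every K.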

After extensive computation and training on copious amounts of internet data to predict the next word, magic unfolds. A program emerges that can craft prose akin to Shakespeare and write speculative sentences like Hemingway.

Technically, the vectors representing each atomic unit, such as a word, can exist independently of one another. However, to expedite learning, one can impose certain constraints based on domain expertise. For instance, consider generating text for a business email, where each upcoming word depends only on the preceding words within the context window. To illustrate: “Rob rides his bike and goes down a _ (trail or galaxy or slide).” Given the past words, the most likely completion is “trail,” not “galaxy.” This constraint helps hasten training, as sketched below.
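The “only look at preceding words” rule is commonly realized as a causal mask: scores for future positions are pushed to negative infinity before the softmax, so they receive zero weight. A rough sketch, reusing the toy shapes from above (the names are again my own):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Attention where each position may only look at itself and earlier positions."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal = future tokens
    scores = np.where(future, -np.inf, scores)           # forbid attending to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

Row i of the output now depends only on positions 0 through i, which is exactly the setup used when training a model to predict the next word.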

Similarly, when tackling the comprehension of sarcasm, the meaning of earlier words might depend on the words that come after them. For instance: “I love your attempts at showing care whilst you stare at your screen when having a friendly conversation.” Here, the word “care” harbors the sarcasm, and altering it alters the entire meaning.
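In that bidirectional case the fix is simply to drop the mask, so every position can attend to every other, including words that come later. A small illustrative sketch with the same toy vectors (random stand-ins, not a real sentence encoding):

```python
import numpy as np

# Same toy setup as before, but with no causal mask: every position attends to
# every other, so the vector for "care" can be shaped by later words like "stare".
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))

scores = Q @ K.T / np.sqrt(Q.shape[-1])                  # no mask applied
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ V   # each row now mixes information from past *and* future positions
```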

Applying this setup to soccer, the super-coach can impose similar constraints on players. For example, two players may always be required to hang back, regardless of how fierce the team’s onslaught is, which calls for a slightly more restrained computation of their Q, K, and V values. Such constraints help optimize player combinations and team dynamics.

Overall, attention is a general concept that can be applied to many problem domains where unstructured data is plentiful. We have just scratched the surface with language, videos, and images. I’d bet it will be put to work in sports analytics, finance, construction planning, and many other fields in the years to come.

Post Script:

Here, I have only discussed self-attention, not other forms of attention such as cross-attention.