Attention All The Way Down

Recently on Twitter, Anna Gat asked the question, “What was the last Big Idea?” and there were many great responses, pointing to ideas that have been world-changing.

For me, two ideas immediately came to mind: "Attention Is All You Need," the 2017 paper from Google researchers that has transformed the world of AI, and Dr. Iain McGilchrist’s Divided Brain theory, introduced in his 2009 book “The Master and His Emissary.”

A brief intro to the two ideas (errors in summarizing are mine, of course):

  1. Attention Is All You Need: This paper introduced a new way of understanding and generating language via the transformer. Previous models processed text sequentially, predicting word by word, and had trouble with longer inputs and with keeping track of context. The transformer instead reads the entire sentence or input as a whole and then uses its understanding of the context to focus its attention on the most important parts (see the sketch after this list). This not only greatly improves text comprehension and generation but also enhances speed and efficiency, since the whole input can be processed in parallel.

  2. McGilchrist’s “Divided Brain” theory: McGilchrist wasn’t the first to introduce the concept of the two hemispheres of the brain being better at different tasks, but he popularized the theory that they focus attention in different ways. The divided brain has a left hemisphere (LH) that is more focused on details and good at tasks like language, math, and logic, processing information step-by-step. The right hemisphere (RH), however, is better at understanding the big picture, appreciating art and music, and processing information all at once.

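To make that concrete, here is a minimal sketch of the core mechanism, scaled dot-product self-attention, in Python/NumPy. It is deliberately simplified (no learned projection matrices, no positional encodings, no training), so treat it as an illustration of the idea rather than the paper’s actual architecture:

```python
import numpy as np

def self_attention(X):
    """Simplified self-attention: every token attends to the whole input.

    X: (seq_len, d) array of token embeddings. The real transformer first
    projects X into separate query/key/value matrices; this sketch skips
    that and uses X directly for all three roles.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # similarity of every token to every other token
    # Softmax turns each row of scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X  # each output is a context-weighted blend of the whole input

# Toy input: 4 "tokens", each an 8-dimensional embedding
X = np.random.default_rng(0).normal(size=(4, 8))
print(self_attention(X).shape)  # (4, 8): every token now carries whole-input context
```

The point to notice: each row of weights is computed against the entire input at once, not word by word.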
Considering these revolutionary ideas together for the first time, I realized that they both involve the same basic function: the role of attention. With LLMs, attention means the algorithmic weights given to words or parts of words – a machine version of attention. With the divided brain, it means the more traditional human understanding of attention. This post is an attempt to connect the two (with the help of ChatGPT in italics).

The main similarity is that although the two ideas handle attention in different ways, both rely on multiple modes of attention operating simultaneously to achieve the best results:

  • For the transformer, the two attention mechanisms involved are called self-attention and multi-head attention. Self-attention weighs the importance of different words in a sentence relative to other words and to the text as a whole. Multi-head attention runs several variations of this self-attention computation in parallel, each “head” learning to concentrate on different details, and then merges their outputs. So attention is calculated in multiple ways, and those calculations run simultaneously (see the sketch after these bullets).

  • For the Divided Brain theory, the LH focuses on the details of something, and the RH focuses on how it all fits together. The LH is analytical and logical, focused on re-presenting things in our minds (making virtual representations and draining objects of their liveliness), whereas the RH is holistic, focused on presenting things as a living whole. Both types of attention are crucial, one for the intricacies and one for the whole picture, and both happen simultaneously.

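Here is the parallel-heads idea from the first bullet as a toy sketch, continuing the simplified NumPy setup above. The per-head projections are random here rather than learned, and real implementations fuse the heads into batched matrix multiplies, so again this is an illustration, not a reference implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads=2, seed=0):
    """Toy multi-head attention: several self-attention "heads" run in
    parallel over the same input, then their outputs are merged."""
    _, d = X.shape
    d_head = d // num_heads
    rng = np.random.default_rng(seed)
    head_outputs = []
    for _ in range(num_heads):
        # Per-head query/key/value projections (learned in practice, random here)
        Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # Each head computes its own attention pattern, so different heads
        # can specialize in different details of the same input
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(weights @ V)
    W_out = rng.normal(size=(d, d))
    return np.concatenate(head_outputs, axis=-1) @ W_out  # merge the heads

X = np.random.default_rng(1).normal(size=(4, 8))
print(multi_head_attention(X).shape)  # (4, 8)
```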
Connecting them together:
  • Self-attention is similar to the RH of the brain since it takes into account the context of the integrative whole, whereas multi-head attention is similar to the LH since it’s better at focusing on details and specific parts.
  • From ChatGPT: Each head in the Transformer’s multi-head attention mimics the left hemisphere’s detailed focus by computing attention weights for each word relative to every other word. These heads operate in parallel, much like how different aspects of a task can be processed simultaneously in the brain.

The second similarity is that in both ideas attention is dynamic, adjusting in real time to the information under consideration. Again, a summary directly from ChatGPT (with a toy code illustration after these bullets):

  • Both human and Transformer attention are dynamic, capable of adjusting focus based on context. This reflects the ability to re-read or re-evaluate certain parts of the input as new information emerges. This ongoing integration of new information with existing knowledge allows for a coherent and updated understanding.

  • Attention weights are recalibrated at each layer, allowing the model (or brain) to dynamically shift focus and update understanding.

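As a toy illustration of that recalibration (my own sketch, reusing the simplified projection-free attention from above): stacking two attention passes shows the weights being recomputed from the updated representations, so the focus shifts from layer to layer.

```python
import numpy as np

def attention_weights(X):
    """Attention weights for the simplified (projection-free) setup."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = np.random.default_rng(2).normal(size=(4, 8))

w1 = attention_weights(X)   # layer 1: weights computed from the raw input
X = w1 @ X                  # ...which updates every token's representation
w2 = attention_weights(X)   # layer 2: weights recomputed from the updated input

print(np.allclose(w1, w2))  # False: the focus has shifted between layers
```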
Thus, both the transformer and the Divided Brain theory suggest that attention is handled in different yet simultaneous ways, and both also dynamically adapt based on new information.

Dr. McGilchrist expanded upon his Divided Brain theory in his 2021 book “The Matter With Things” and concluded that society has been placing more and more emphasis on LH processes (language, logic, analysis) instead of balancing them with RH processes (beauty, wholeness). He argues that this imbalance is the root cause of many of society’s current problems: we fail to understand each other because we don’t see our commonalities, technology distracts us constantly, and attention-related disorders are sharply on the rise. Society seems out of whack.

But we have been fighting back and trying to regain our hemispheric balance, and the two examples I’ll end with also involve attention.

The first is the rise of mindfulness practice. One of the most popular forms is “focused attention meditation,” where you direct all of your attention to a single input (like your breathing) and try to keep it there; when other thoughts try to steal your attention, you simply bring it back to that singular input. Another form is open monitoring: letting thoughts and feelings come to you and observing them (paying attention to them) without judging or evaluating what they mean.

The second potential rebalancing is following your curiosity, or whatever you feel naturally attentive to. Following these curiosities often feels like play instead of work, like you’re uncovering or following things that resonate with you.

Thanks for your attention :)