{"id":9081,"date":"2025-12-24T22:09:19","date_gmt":"2025-12-24T22:09:19","guid":{"rendered":"https:\/\/uplatz.com\/blog\/?p=9081"},"modified":"2025-12-24T22:09:19","modified_gmt":"2025-12-24T22:09:19","slug":"mechanistic-interpretability-reverse-engineering-the-neural-code","status":"publish","type":"post","link":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/","title":{"rendered":"Mechanistic Interpretability: Reverse Engineering the Neural Code"},"content":{"rendered":"<h2><b>1. Introduction: The Black Box Crisis and the Mechanistic Turn<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">The ascendance of deep learning, particularly through the proliferation of Large Language Models (LLMs) based on the Transformer architecture, has precipitated a fundamental epistemological crisis in artificial intelligence. We have succeeded in constructing systems that exhibit emergent reasoning, complex language generation, and sophisticated problem-solving capabilities, yet the internal causal mechanisms driving these behaviors remain profoundly opaque. The prevailing paradigm has been one of alchemy\u2014mixing architectures, data, and compute to achieve empirical performance\u2014without a corresponding chemical theory to explain the underlying interactions. Mechanistic Interpretability (MI) has emerged as the necessary scientific response to this opacity, representing a rigorous, engineering-driven discipline aimed at reverse-engineering the opaque matrices of neural networks into human-understandable algorithms.<\/span><span style=\"font-weight: 400;\">1<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The distinction between Mechanistic Interpretability and traditional Explainable AI (XAI) is not merely semantic; it is foundational. 
Traditional interpretability often relies on post-hoc rationalizations or feature attribution methods\u2014such as saliency maps, SHAP, or LIME\u2014that identify <\/span><i><span style=\"font-weight: 400;\">where<\/span><\/i><span style=\"font-weight: 400;\"> a model attends or which inputs correlate with an output.<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\"> While useful for debugging local errors, these methods treat the model as a black box, offering correlational insights that crumble under adversarial pressure or distributional shifts. They tell us which pixels in an image of a dog were important, but they do not explain the algorithm the model used to recognize the &#8220;dogness&#8221; or how that concept is represented in the high-dimensional vector space of the network.<\/span><span style=\"font-weight: 400;\">3<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Mechanistic interpretability, by contrast, operates on the &#8220;Linear Representation Hypothesis&#8221; and the &#8220;Computational Graph View.&#8221; It posits that neural networks are not inscrutable statistical soups but rather sophisticated computational graphs composed of distinct, functionally specialized sub-networks, or &#8220;circuits&#8221;.<\/span><span style=\"font-weight: 400;\">5<\/span><span style=\"font-weight: 400;\"> The goal is to &#8220;decompile&#8221; the weights and activations\u2014analogous to binary machine code\u2014into a higher-level pseudo-code or logic that humans can comprehend, audit, and verify.<\/span><span style=\"font-weight: 400;\">4<\/span><span style=\"font-weight: 400;\"> This approach seeks to identify the specific attention heads that perform information routing, the Multilayer Perceptron (MLP) neurons that serve as key-value memories for factual recall, and the geometric structures that allow models to compress vast amounts of knowledge into limited dimensions.<\/span><span style=\"font-weight: 
400;\">6<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The stakes of this endeavor extend far beyond academic curiosity. As AI systems are increasingly integrated into critical infrastructure, decision-making pipelines, and scientific discovery, the inability to distinguish between robust, causal reasoning and deceptive, brittle memorization poses severe risks.<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\"> Mechanistic interpretability serves as a cornerstone for AI safety. By identifying the specific circuits responsible for behaviors, researchers aim to develop techniques for detecting &#8220;misaligned&#8221; cognition\u2014such as deception, power-seeking, or sycophancy\u2014before they manifest in external behavior.<\/span><span style=\"font-weight: 400;\">3<\/span><span style=\"font-weight: 400;\"> Furthermore, this understanding enables &#8220;Mechanistic Unlearning&#8221; and &#8220;Representation Engineering,&#8221; allowing us to surgically excise hazardous knowledge or steer models toward honesty and harmlessness with mathematical precision, rather than relying on the &#8220;whack-a-mole&#8221; approach of Reinforcement Learning from Human Feedback (RLHF).<\/span><span style=\"font-weight: 400;\">9<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This report provides an exhaustive analysis of the current state of mechanistic interpretability. 
We will traverse the theoretical foundations of the field, exploring the counter-intuitive geometry of high-dimensional activation spaces and the phenomenon of &#8220;Superposition&#8221;.<\/span><span style=\"font-weight: 400;\">11<\/span><span style=\"font-weight: 400;\"> We will detail the discovery and anatomy of specific algorithmic circuits, including the Induction Heads that drive in-context learning <\/span><span style=\"font-weight: 400;\">12<\/span><span style=\"font-weight: 400;\"> and the complex routing systems that perform indirect object identification.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> We will examine the methodological toolkit\u2014from Activation Patching to Causal Scrubbing\u2014that allows researchers to establish causality.<\/span><span style=\"font-weight: 400;\">14<\/span><span style=\"font-weight: 400;\"> Finally, we will explore the recent breakthroughs in Sparse Autoencoders (SAEs) that promise to resolve the polysemanticity of neurons <\/span><span style=\"font-weight: 400;\">16<\/span><span style=\"font-weight: 400;\">, and the emerging field of Representation Engineering that offers top-down control over model cognition.<\/span><span style=\"font-weight: 400;\">10<\/span><\/p>\n<h2><b>2. The Geometry of Representation: Decomposing the High-Dimensional Mind<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">To reverse-engineer a neural network, one must first understand the fundamental data structures it uses to think. Unlike classical software, where variables have clear names and types, neural networks operate on continuous vectors in high-dimensional spaces. 
A central tenet of mechanistic interpretability is that these networks represent features\u2014interpretable properties of the input\u2014as directions in this activation space.<\/span><span style=\"font-weight: 400;\">17<\/span><\/p>\n<h3><b>2.1 The Linear Representation Hypothesis<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The Linear Representation Hypothesis suggests that concepts\u2014whether simple features like &#8220;blue&#8221; or &#8220;curve,&#8221; or complex abstractions like &#8220;past tense&#8221; or &#8220;irony&#8221;\u2014are encoded as linear combinations of neuron activations. If a feature corresponds to a direction vector $\\mathbf{v}$, the activation of that feature for a given input $\\mathbf{x}$ is the projection of the activation vector onto $\\mathbf{v}$ (i.e., the dot product $\\mathbf{x} \\cdot \\mathbf{v}$). This linearity is crucial because it implies that features can be manipulated via vector arithmetic, a property empirically verified through techniques like Representation Engineering.<\/span><span style=\"font-weight: 400;\">10<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The intuition behind this hypothesis stems from the architecture of deep learning itself. Neural networks are composed of alternating layers of linear transformations (matrix multiplications) and non-linear activation functions. The linear layers allow the model to rotate and scale the representation space, effectively &#8220;reading&#8221; and &#8220;writing&#8221; features to different subspaces. If features were encoded in a highly non-linear manifold, the linear layers would struggle to manipulate them efficiently. 
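<\/span><\/p>
<p><span style=\"font-weight: 400;\">To make the features-as-directions picture concrete, the following minimal NumPy sketch reads and writes a hypothetical feature via projections; the &#8220;past tense&#8221; direction and the activation vectors are invented for illustration, not taken from any real model.<\/span><\/p>

```python
import numpy as np

# A hypothetical 'past tense' feature direction (unit norm) in a
# 4-dimensional activation space, plus two invented activations.
v_past = np.array([0.5, 0.5, 0.5, 0.5])     # feature direction v
x_walked = np.array([1.0, 0.8, 1.2, 1.0])   # activation for 'walked'
x_walk = np.array([0.1, -0.2, 0.0, 0.1])    # activation for 'walk'

# Under the hypothesis, feature strength is the projection x . v.
print(x_walked @ v_past)   # ~2.0 -> feature strongly present
print(x_walk @ v_past)     # ~0.0 -> feature absent

# Linearity means the feature can be written in by vector arithmetic:
x_steered = x_walk + 2.0 * v_past
print(x_steered @ v_past)  # ~2.0 -> the feature is now present
```

<p><span style=\"font-weight: 400;\">Steering interventions on real models work the same way in spirit: a scaled feature direction is added to the residual-stream activations.<\/span><\/p>
<p><span style=\"font-weight: 400;\">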
Thus, the pressure to compute efficiently encourages the model to &#8220;unfold&#8221; features into linear directions.<\/span><span style=\"font-weight: 400;\">6<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, a significant hurdle complicates this elegant picture: the number of interpretable features a network learns often vastly exceeds the number of neurons available to represent them. In a layer with 512 neurons, a model might need to represent thousands of distinct concepts\u2014from specific vocabulary words to grammatical nuances and factual knowledge. This resource constraint leads to the phenomenon of <\/span><i><span style=\"font-weight: 400;\">superposition<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3><b>2.2 Superposition: Compression in High Dimensions<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Superposition occurs when a model represents more than $n$ features in an $n$-dimensional activation space.<\/span><span style=\"font-weight: 400;\">11<\/span><span style=\"font-weight: 400;\"> In this regime, features are not assigned to individual neurons in a one-to-one mapping (a &#8220;privileged basis&#8221;) but are instead stored as linear combinations of neurons that interfere with one another. This results in the confusing phenomenon of <\/span><i><span style=\"font-weight: 400;\">polysemantic neurons<\/span><\/i><span style=\"font-weight: 400;\">\u2014neurons that activate for multiple, seemingly unrelated concepts.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For example, a single neuron in a vision model might fire strongly for &#8220;cat faces,&#8221; &#8220;car hoods,&#8221; and &#8220;text about philosophy.&#8221; In a monosemantic framework, this would imply a bizarre causal link between cats and philosophy. In the context of superposition, however, the neuron is simply a shared resource. 
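<\/span><\/p>
<p><span style=\"font-weight: 400;\">The shared-resource arithmetic can be sketched directly; the two feature directions below are invented stand-ins rather than directions recovered from a trained network.<\/span><\/p>

```python
import numpy as np

# Two invented features sharing the same two neurons:
# 'cat' reads Neuron A + Neuron B, 'philosophy' reads Neuron A - Neuron B.
cat = np.array([1.0, 1.0]) / np.sqrt(2)
philosophy = np.array([1.0, -1.0]) / np.sqrt(2)

x = 3.0 * cat  # an activation in which only 'cat' is present

# Neuron A (index 0) fires, so in isolation it looks polysemantic...
print(x[0])            # ~2.12

# ...but projecting onto the feature *directions* disambiguates:
print(x @ cat)         # ~3.0 -> 'cat' is active
print(x @ philosophy)  # ~0.0 -> 'philosophy' is not
```

<p><span style=\"font-weight: 400;\">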
The vector for &#8220;cat&#8221; might involve Neuron A + Neuron B, while the vector for &#8220;philosophy&#8221; involves Neuron A &#8211; Neuron B. The model can distinguish them by looking at the specific <\/span><i><span style=\"font-weight: 400;\">combination<\/span><\/i><span style=\"font-weight: 400;\"> (direction), even though Neuron A fires for both.<\/span><span style=\"font-weight: 400;\">11<\/span><\/p>\n<h4><b>2.2.1 The Role of Sparsity<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Why does the model tolerate this interference? Research using &#8220;Toy Models&#8221;\u2014small ReLU networks trained on synthetic data\u2014has revealed that <\/span><i><span style=\"font-weight: 400;\">sparsity<\/span><\/i><span style=\"font-weight: 400;\"> is the key driver of superposition.<\/span><span style=\"font-weight: 400;\">11<\/span><span style=\"font-weight: 400;\"> Most features in the real world are sparse; they are not present in every input. The concept of &#8220;The Eiffel Tower&#8221; is only relevant in a tiny fraction of all text.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The model learns that it can safely store multiple sparse features in the same subspace (almost orthogonal to each other) because the probability of them being active simultaneously is low. If &#8220;Feature A&#8221; and &#8220;Feature B&#8221; never appear together, the model can use the same dimensions to represent both without destructive interference. The &#8220;interference noise&#8221; only occurs when both are active, which is rare. This allows the model to achieve &#8220;super-linear compression&#8221;\u2014storing far more features than it has dimensions.<\/span><span style=\"font-weight: 400;\">17<\/span><\/p>\n<h4><b>2.2.2 Geometric Polytopes and &#8220;Energy Levels&#8221;<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">The geometry of superposition is not random. 
When features are stored in superposition, they self-organize into specific geometric structures to minimize interference. This is often related to the Johnson-Lindenstrauss lemma, which describes how many nearly orthogonal vectors can be packed into a high-dimensional space.<\/span><span style=\"font-weight: 400;\">17<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In toy models, researchers have observed phase transitions where features organize into regular polytopes:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Digons:<\/b><span style=\"font-weight: 400;\"> Two features sharing a dimension in opposite directions.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Triangles\/Tetrahedrons:<\/b><span style=\"font-weight: 400;\"> Three or four features spreading out in 2D or 3D subspaces to maximize the angles between them.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Pentagons:<\/b><span style=\"font-weight: 400;\"> Five features sharing a 2D subspace.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">These configurations represent &#8220;energy levels&#8221; or local minima in the loss landscape. 
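<\/span><\/p>
<p><span style=\"font-weight: 400;\">The pentagon arrangement is easy to verify numerically: five unit directions spaced 72 degrees apart in a 2D subspace keep the worst-case interference (the largest pairwise dot product) at roughly 0.31. A short sketch:<\/span><\/p>

```python
import numpy as np

# Five feature directions arranged as a regular pentagon in 2D.
angles = 2 * np.pi * np.arange(5) / 5
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (5, 2)

# Interference between features i and j is the dot product W[i] . W[j].
gram = W @ W.T
np.fill_diagonal(gram, -np.inf)  # ignore self-similarity
print(gram.max())  # ~0.309 = cos(72 degrees)
```

<p><span style=\"font-weight: 400;\">Five orthogonal features would require five dimensions; the pentagon stores them in two, at the price of this bounded interference.<\/span><\/p>
<p><span style=\"font-weight: 400;\">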
As the sparsity of the data increases (features become rarer), the model undergoes phase transitions, suddenly snapping from representing features in orthogonal dimensions (no superposition) to packing them into these tighter geometric structures (superposition).<\/span><span style=\"font-weight: 400;\">17<\/span><span style=\"font-weight: 400;\"> This creates a &#8220;fractal&#8221; basin of attraction where the model is constantly trading off the &#8220;cleanliness&#8221; of the representation against the capacity to store more information.<\/span><span style=\"font-weight: 400;\">20<\/span><\/p>\n<h4><b>2.2.3 The Privileged Basis Problem<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">A critical distinction in this geometry is whether the basis (the axes defined by individual neurons) is &#8220;privileged.&#8221;<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Non-Privileged Basis:<\/b><span style=\"font-weight: 400;\"> In layers without activation functions (like the residual stream or the output of a linear projection), any rotation of the vector space is mathematically equivalent. There is no reason for a feature to align with &#8220;Neuron 1&#8221; vs. &#8220;0.7 * Neuron 1 + 0.3 * Neuron 2.&#8221;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Privileged Basis:<\/b><span style=\"font-weight: 400;\"> Non-linear activation functions (like ReLU or GELU) operate element-wise on the neurons. $ReLU(x)$ is different from $ReLU(rotated(x))$. This breaks the rotational symmetry and creates a &#8220;privileged basis.&#8221; The model is incentivized to align features with specific neurons (or sparse combinations) because the activation function acts on those specific axes.<\/span><span style=\"font-weight: 400;\">4<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">However, the pressure for superposition often overwhelms the pressure for a privileged basis. 
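<\/span><\/p>
<p><span style=\"font-weight: 400;\">The symmetry-breaking claim above can be checked directly: linear operations commute with a rotation of the activation basis, while the element-wise nonlinearity does not. A minimal sketch using random vectors with a fixed seed:<\/span><\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

x = rng.normal(size=8)                        # a random activation vector
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # a random orthogonal rotation

# Scaling (a linear operation) commutes with the rotation...
print(np.allclose(Q @ (2.0 * x), 2.0 * (Q @ x)))  # True

# ...but the element-wise nonlinearity does not:
print(np.allclose(relu(Q @ x), Q @ relu(x)))      # False for generic Q, x
```

<p><span style=\"font-weight: 400;\">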
The model chooses to store features as &#8220;dense&#8221; combinations (polysemantic neurons) to maximize capacity, even if it means the activation function is less efficient. This is why looking at individual neurons in Large Language Models is often futile; the &#8220;true&#8221; features are directions <\/span><i><span style=\"font-weight: 400;\">skew to<\/span><\/i><span style=\"font-weight: 400;\"> the neuron axes.<\/span><span style=\"font-weight: 400;\">21<\/span><\/p>\n<h3><b>2.3 Implications for Interpretability<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The existence of superposition fundamentally alters the interpretability landscape. It explains why &#8220;neuron-level analysis&#8221; has historically failed to yield scalable insights for engineers.<\/span><span style=\"font-weight: 400;\">24<\/span><span style=\"font-weight: 400;\"> If we look for the &#8220;happiness neuron,&#8221; we will fail, because &#8220;happiness&#8221; is likely a vector distributed across 500 neurons, each of which also codes for &#8220;surface tension,&#8221; &#8220;jazz music,&#8221; and &#8220;financial derivatives.&#8221;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This geometric reality necessitates two primary approaches in mechanistic interpretability:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Decomposition (Dictionary Learning):<\/b><span style=\"font-weight: 400;\"> Using tools like Sparse Autoencoders to mathematically disentangle the polysemantic neurons into monosemantic features.<\/span><span style=\"font-weight: 400;\">16<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Circuit Analysis:<\/b><span style=\"font-weight: 400;\"> Focusing on the <\/span><i><span style=\"font-weight: 400;\">algorithms<\/span><\/i><span style=\"font-weight: 400;\"> (the movement and processing of information) rather than just the static representations. 
Even if we can&#8217;t perfectly define &#8220;happiness,&#8221; we can track how the model moves information from the &#8220;Subject&#8221; position to the &#8220;Verb&#8221; position.<\/span><span style=\"font-weight: 400;\">5<\/span><\/li>\n<\/ol>\n<h2><b>3. The Circuit Landscape: Decompiling the Transformer Algorithms<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">If features are the variables of the neural code, <\/span><i><span style=\"font-weight: 400;\">circuits<\/span><\/i><span style=\"font-weight: 400;\"> are the functions and control flow structures. A circuit is defined as a subgraph of the model&#8217;s computational graph responsible for a specific behavior.<\/span><span style=\"font-weight: 400;\">13<\/span><span style=\"font-weight: 400;\"> Through rigorous reverse-engineering, researchers have demonstrated that Transformers, despite their complexity, implement clean, human-understandable algorithms for tasks ranging from grammatical agreement to arithmetic.<\/span><\/p>\n<h3><b>3.1 The Transformer as a Computational Graph<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">To find circuits, we must view the Transformer not as a monolithic stack of layers, but as a collection of independent mechanisms reading from and writing to a shared communication channel: the &#8220;residual stream&#8221;.<\/span><span style=\"font-weight: 400;\">6<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Residual Stream:<\/b><span style=\"font-weight: 400;\"> This is the central &#8220;memory&#8221; of the model. It is a vector that persists through the layers. Each layer (Attention or MLP) reads from the stream, calculates an update, and <\/span><i><span style=\"font-weight: 400;\">adds<\/span><\/i><span style=\"font-weight: 400;\"> it back to the stream. 
This additive nature means that components can largely operate independently, writing &#8220;messages&#8221; to specific subspaces of the stream.<\/span><span style=\"font-weight: 400;\">6<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Attention Heads:<\/b><span style=\"font-weight: 400;\"> These are the primary &#8220;routing&#8221; mechanisms. They move information between token positions. Crucially, they can be decomposed into two independent circuits:<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The QK Circuit (Query-Key):<\/b><span style=\"font-weight: 400;\"> Determines <\/span><i><span style=\"font-weight: 400;\">where<\/span><\/i><span style=\"font-weight: 400;\"> to move information from (the &#8220;if&#8221; statement). It computes the attention pattern.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The OV Circuit (Output-Value):<\/b><span style=\"font-weight: 400;\"> Determines <\/span><i><span style=\"font-weight: 400;\">what<\/span><\/i><span style=\"font-weight: 400;\"> information to move (the payload). It computes the vector that gets added to the destination token.<\/span><span style=\"font-weight: 400;\">6<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>MLP Layers:<\/b><span style=\"font-weight: 400;\"> These process information at a single token position. 
Mechanistic work suggests they act as &#8220;Key-Value Memories.&#8221; The first layer (projecting up) acts as a &#8220;key&#8221; detector (e.g., &#8220;I see the vector for &#8216;Eiffel Tower&#8217;&#8221;), and the second layer (projecting down) writes the &#8220;value&#8221; (e.g., &#8220;The vector for &#8216;Paris&#8217;&#8221;).<\/span><span style=\"font-weight: 400;\">7<\/span><\/li>\n<\/ul>\n<h3><b>3.2 Induction Heads: The Engine of In-Context Learning<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">One of the most robust and significant discoveries in mechanistic interpretability is the <\/span><b>Induction Head<\/b><span style=\"font-weight: 400;\">. This specific type of attention circuit explains the &#8220;In-Context Learning&#8221; (ICL) capability of LLMs\u2014the ability to learn from examples in the prompt without any weight updates.<\/span><span style=\"font-weight: 400;\">5<\/span><\/p>\n<h4><b>3.2.1 The Algorithm<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">An induction head implements a simple but powerful &#8220;copy-paste&#8221; algorithm based on the heuristic: &#8220;If I see token $A$, and in the past $A$ was followed by $B$, then predict $B$ next.&#8221;<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Mathematically, this corresponds to completing the sequence pattern: $[A][B] \\dots [A] \\rightarrow [B]$.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For the model to execute this, it requires a two-step circuit involving communication between heads in different layers:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Previous Token Head (Layer $L$):<\/b><span style=\"font-weight: 400;\"> A head in an early layer attends to the previous token position ($i-1$) and copies the content of that token to the current position ($i$). 
Now, the token at position $i$ &#8220;knows&#8221; what the previous token was.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Induction Head (Layer $L+k$):<\/b><span style=\"font-weight: 400;\"> A head in a later layer uses this copied information.<\/span><\/li>\n<\/ol>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Query:<\/span><\/i><span style=\"font-weight: 400;\"> &#8220;I am looking for the token that matches the one preceding me (Token $A$).&#8221;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Key:<\/span><\/i><span style=\"font-weight: 400;\"> It scans the context. Since previous tokens also had Previous Token Heads writing to them, the token <\/span><i><span style=\"font-weight: 400;\">after<\/span><\/i><span style=\"font-weight: 400;\"> the previous $A$ knows &#8220;I was preceded by $A$.&#8221;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Attention:<\/span><\/i><span style=\"font-weight: 400;\"> The head attends to the previous instance of $A$ (or the token immediately following it, depending on implementation).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Output:<\/span><\/i><span style=\"font-weight: 400;\"> It copies the <\/span><i><span style=\"font-weight: 400;\">next<\/span><\/i><span style=\"font-weight: 400;\"> token ($B$) to the current residual stream, boosting the probability of predicting $B$.<\/span><span style=\"font-weight: 400;\">12<\/span><\/li>\n<\/ul>\n<h4><b>3.2.2 The Phase Transition and Universality<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">The significance of induction heads is underscored by their training dynamics. Research shows a sharp <\/span><b>phase transition<\/b><span style=\"font-weight: 400;\"> during model training. 
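<\/span><\/p>
<p><span style=\"font-weight: 400;\">Stripped of its attention-head implementation, the heuristic the two heads jointly compute can be written at the token level; the function name and example tokens below are invented:<\/span><\/p>

```python
def induction_predict(tokens):
    '''Predict the next token via the induction heuristic: find an
    earlier occurrence of the current token and propose whatever
    followed it.'''
    current = tokens[-1]
    # The QK circuit's job: locate a previous instance of `current`.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # The OV circuit's job: copy the token that followed it.
            return tokens[i + 1]
    return None  # no earlier occurrence: the heuristic is silent

# Pattern [A][B] ... [A] -> predict [B]
print(induction_predict(['Mr', 'Dursley', 'said', 'Mr']))  # Dursley
```

<p><span style=\"font-weight: 400;\">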
In a very short window of training steps, the model suddenly acquires induction heads, and simultaneously, its validation loss drops significantly, and its ability to perform few-shot learning emerges.<\/span><span style=\"font-weight: 400;\">12<\/span><span style=\"font-weight: 400;\"> This suggests that induction heads are not just one of many features, but the <\/span><i><span style=\"font-weight: 400;\">primary mechanism<\/span><\/i><span style=\"font-weight: 400;\"> for general-purpose in-context learning.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Furthermore, these heads appear to be universal. They have been found in models ranging from tiny 2-layer toy transformers to massive frontier models, and even in architectures trained on different data modalities (e.g., code). They also exhibit &#8220;fuzzy&#8221; behavior in larger models, matching conceptually similar tokens (e.g., &#8220;king&#8221; and &#8220;queen&#8221;) rather than just identical ones, which likely supports more abstract reasoning capabilities.<\/span><span style=\"font-weight: 400;\">27<\/span><\/p>\n<h3><b>3.3 Indirect Object Identification (IOI): A Complex Routing Circuit<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">While induction heads explain general pattern matching, how do models solve specific logical tasks? The Indirect Object Identification (IOI) task involves completing sentences like &#8220;When Mary and John went to the store, John gave a drink to&#8230;&#8221; with the correct name (&#8220;Mary&#8221;). 
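<\/span><\/p>
<p><span style=\"font-weight: 400;\">At the behavioural level, the task reduces to naming the participant that is not repeated; a toy stand-in (the helper name is invented):<\/span><\/p>

```python
from collections import Counter

def ioi_answer(names):
    '''Return the non-repeated name: the indirect object.'''
    counts = Counter(names)
    # The duplicated name is the subject; the singleton is the answer.
    return next(n for n in names if counts[n] == 1)

# 'When Mary and John went to the store, John gave a drink to...'
print(ioi_answer(['Mary', 'John', 'John']))  # Mary
```

<p><span style=\"font-weight: 400;\">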
The model must identify that &#8220;John&#8221; is the subject (repeated) and &#8220;Mary&#8221; is the indirect object.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Researchers completely reverse-engineered the circuit in GPT-2 Small responsible for this, identifying a subgraph of <\/span><b>26 attention heads<\/b><span style=\"font-weight: 400;\"> grouped into 7 functional categories.<\/span><span style=\"font-weight: 400;\">13<\/span><\/p>\n<h4><b>3.3.1 The Algorithmic Steps<\/b><\/h4>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Duplicate Identification:<\/b><span style=\"font-weight: 400;\"> The model must first know which name is repeated.<\/span><\/li>\n<\/ol>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Duplicate Token Heads:<\/span><\/i><span style=\"font-weight: 400;\"> These heads attend from the second &#8220;John&#8221; (S2) back to the first &#8220;John&#8221; (S1), identifying the repetition.<\/span><\/li>\n<\/ul>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>S-Inhibition:<\/b><span style=\"font-weight: 400;\"> Once the duplicate is found, the model must suppress it.<\/span><\/li>\n<\/ol>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">S-Inhibition Heads:<\/span><\/i><span style=\"font-weight: 400;\"> These heads move the &#8220;duplicate&#8221; signal to the final token (&#8220;to&#8221;). 
Crucially, they write a <\/span><i><span style=\"font-weight: 400;\">negative<\/span><\/i><span style=\"font-weight: 400;\"> vector for &#8220;John&#8221; into the residual stream, effectively saying &#8220;Do not predict John.&#8221;<\/span><\/li>\n<\/ul>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Name Moving:<\/b><span style=\"font-weight: 400;\"> The model must find the remaining name.<\/span><\/li>\n<\/ol>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Name Mover Heads:<\/span><\/i><span style=\"font-weight: 400;\"> These heads attend to all names in the context. However, because &#8220;John&#8221; has been inhibited (or marked as duplicate), their net effect is to preferentially copy the vector for &#8220;Mary&#8221; to the output.<\/span><\/li>\n<\/ul>\n<h4><b>3.3.2 The Role of Negative Heads and Redundancy<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">A fascinating discovery in the IOI circuit was the existence of Negative Name Mover Heads.<\/span><span style=\"font-weight: 400;\">26<\/span><span style=\"font-weight: 400;\"> These heads act as &#8220;brakes.&#8221; They attend to the correct answer (&#8220;Mary&#8221;) but write a negative prediction for it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Why would the model oppose its own correct answer? Ablation studies reveal this is a calibration\/hedging mechanism. If the positive Name Movers are manually ablated (turned off), the Negative Name Movers also reduce their activity to compensate. The model maintains a balance to ensure it doesn&#8217;t become &#8220;overconfident&#8221; or unstable. 
This redundancy (also seen in &#8220;Backup Name Mover Heads&#8221; that only activate if primary heads fail) highlights the robust, self-correcting nature of neural circuits, likely a byproduct of the dropout applied during training.<\/span><span style=\"font-weight: 400;\">26<\/span><\/p>\n<h3><b>3.4 Arithmetic Circuits: Logic and Carry Propagation<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Until recently, it was debated whether LLMs performed arithmetic by memorizing tables or by learning algorithms. Work in 2024 and 2025 has definitively shown that small Transformers trained on arithmetic converge to robust, human-understandable algorithms, specifically implementing modular addition and carry propagation logic.<\/span><span style=\"font-weight: 400;\">28<\/span><\/p>\n<h4><b>3.4.1 The &#8220;TriCase&#8221; Logic<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">The breakdown of the arithmetic circuit reveals that the model does not process the sum as a single retrieval. Instead, it parallelizes the operation into digit-specific streams.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Base Add:<\/b><span style=\"font-weight: 400;\"> For a position $i$, the model computes $A_i + B_i \\pmod{10}$.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Carry Calculation:<\/b><span style=\"font-weight: 400;\"> The most complex part is handling carries, especially cascading carries (e.g., $999 + 1$). The model learns a specific <\/span><b>&#8220;TriCase&#8221;<\/b><span style=\"font-weight: 400;\"> logic. It classifies every digit pair $(A_i, B_i)$ into three states:<\/span><\/li>\n<\/ul>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Always Carry:<\/b><span style=\"font-weight: 400;\"> Sum $&gt; 9$ (e.g., $5+6$). A carry is generated regardless of the previous position.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Never Carry:<\/b><span style=\"font-weight: 400;\"> Sum $&lt; 9$ (e.g., $2+3$). 
No carry is generated.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>Maybe Carry:<\/b><span style=\"font-weight: 400;\"> Sum $= 9$ (e.g., $4+5$). A carry is generated <\/span><i><span style=\"font-weight: 400;\">only if<\/span><\/i><span style=\"font-weight: 400;\"> the previous position generates a carry.<\/span><span style=\"font-weight: 400;\">28<\/span><\/li>\n<\/ol>\n<h4><b>3.4.2 Sum Validation and Cascading<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">To resolve the &#8220;Maybe Carry&#8221; states, the model implements a <\/span><b>&#8220;Sum Validation&#8221;<\/b><span style=\"font-weight: 400;\"> circuit. This circuit looks at the previous positions. If position $i$ is a &#8220;Maybe Carry&#8221; and position $i-1$ is &#8220;Always Carry,&#8221; then position $i$ becomes &#8220;Always Carry.&#8221; This allows the carry bit to cascade up the chain of digits, effectively implementing the exact same logic as a ripple-carry adder in digital electronics.<\/span><span style=\"font-weight: 400;\">29<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This finding is pivotal because it demonstrates that Transformers can learn exact, discrete logic gates and sequential dependencies, refuting the notion that they rely solely on fuzzy pattern matching. The circuit is so precise that researchers could identify the specific heads responsible for the &#8220;TriCase&#8221; classification and the MLP neurons that encoded the modulo addition tables.<\/span><span style=\"font-weight: 400;\">28<\/span><\/p>\n<h2><b>4. The Methodological Toolkit: From Observation to Causality<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Reverse-engineering these circuits is not done by simply staring at attention patterns. It requires a sophisticated suite of <\/span><i><span style=\"font-weight: 400;\">interventional<\/span><\/i><span style=\"font-weight: 400;\"> tools that allow researchers to establish causal necessity and sufficiency. 
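<\/span><\/p>
<p><span style=\"font-weight: 400;\">Before turning to that toolkit, the TriCase states and Sum Validation cascade of Section 3.4 can be written out as the ripple-carry algorithm they implement; the helper names below are invented for illustration:<\/span><\/p>

```python
def tricase(a, b):
    '''Classify a digit pair into the three carry states.'''
    s = a + b
    if s > 9:
        return 'always'  # carry regardless of lower digits
    if s == 9:
        return 'maybe'   # carry only if the position below carries
    return 'never'       # no carry can be generated here

def add_via_tricase(A, B):
    '''Add equal-length digit lists (most significant digit first).'''
    states = [tricase(a, b) for a, b in zip(A, B)]
    carry, digits = 0, []
    # Sum Validation: resolve 'maybe' states by cascading the carry
    # upward from the least significant digit, as in a ripple-carry adder.
    for (a, b), st in zip(reversed(list(zip(A, B))), reversed(states)):
        digits.append((a + b + carry) % 10)
        carry = 1 if st == 'always' or (st == 'maybe' and carry) else 0
    return ([1] if carry else []) + digits[::-1]

# The cascading case 999 + 1:
print(add_via_tricase([9, 9, 9], [0, 0, 1]))  # [1, 0, 0, 0]
```

<p><span style=\"font-weight: 400;\">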
The field has moved from observational methods (like looking at attention maps) to rigorous &#8220;surgical&#8221; interventions on the model&#8217;s internals.<\/span><\/p>\n<h3><b>4.1 Activation Patching (Causal Tracing)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Activation patching (or Causal Tracing) is the current gold standard for localizing model behavior.<\/span><span style=\"font-weight: 400;\">14<\/span><span style=\"font-weight: 400;\"> The core idea is to isolate the causal effect of a specific component by swapping its activations between two different model runs.<\/span><\/p>\n<h4><b>4.1.1 The Procedure<\/b><\/h4>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Clean Run:<\/b><span style=\"font-weight: 400;\"> Run the model on an input where it performs correctly (e.g., &#8220;The Eiffel Tower is in [Paris]&#8221;). Cache all internal activations.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Corrupted Run:<\/b><span style=\"font-weight: 400;\"> Run the model on an input where the information is missing or different (e.g., &#8220;The Colosseum is in&#8221;).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Patch:<\/b><span style=\"font-weight: 400;\"> Surgically replace a specific activation (e.g., the output of Head 7 in Layer 4) in the <\/span><i><span style=\"font-weight: 400;\">Corrupted Run<\/span><\/i><span style=\"font-weight: 400;\"> with the corresponding activation from the <\/span><i><span style=\"font-weight: 400;\">Clean Run<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Measurement:<\/b><span style=\"font-weight: 400;\"> Check the output. If the model now predicts &#8220;Paris&#8221; (despite the input being &#8220;Colosseum&#8221;), then Head 7, Layer 4 is causally responsible for transmitting the location information.<\/span><span style=\"font-weight: 400;\">15<\/span><\/li>\n<\/ol>\n<h4><b>4.1.2 Denoising vs. 
Noising<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">There are two distinct modes of patching that provide different insights:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Denoising (Clean $\\to$ Corrupt):<\/b><span style=\"font-weight: 400;\"> This tests for <\/span><b>sufficiency<\/b><span style=\"font-weight: 400;\">. By putting the &#8220;clean&#8221; activation into the &#8220;corrupted&#8221; run, we ask: &#8220;Is this component alone sufficient to restore the correct behavior?&#8221; If yes, we have found a critical pathway.<\/span><span style=\"font-weight: 400;\">15<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Noising (Corrupt $\\to$ Clean):<\/b><span style=\"font-weight: 400;\"> This tests for <\/span><b>necessity<\/b><span style=\"font-weight: 400;\">. By putting a &#8220;corrupted&#8221; (or random) activation into a &#8220;clean&#8221; run, we ask: &#8220;Does breaking this component break the model&#8217;s performance?&#8221; If the model still works, the component is redundant or irrelevant.<\/span><span style=\"font-weight: 400;\">15<\/span><\/li>\n<\/ul>\n<h3><b>4.2 Path Patching and Causal Scrubbing<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">While Activation Patching identifies <\/span><i><span style=\"font-weight: 400;\">nodes<\/span><\/i><span style=\"font-weight: 400;\"> (neurons or heads), <\/span><b>Path Patching<\/b><span style=\"font-weight: 400;\"> identifies the <\/span><i><span style=\"font-weight: 400;\">edges<\/span><\/i><span style=\"font-weight: 400;\"> (connections) between them.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Path Patching:<\/b><span style=\"font-weight: 400;\"> Instead of patching an activation universally, we patch it only as it is <\/span><i><span style=\"font-weight: 400;\">read<\/span><\/i><span style=\"font-weight: 400;\"> by a specific downstream component. 
For example, we can patch the output of the &#8220;Duplicate Token Head&#8221; <\/span><i><span style=\"font-weight: 400;\">only<\/span><\/i><span style=\"font-weight: 400;\"> into the input of the &#8220;S-Inhibition Head,&#8221; while leaving its connection to the rest of the model untouched. This allows researchers to map the precise Directed Acyclic Graph (DAG) of the circuit.<\/span><span style=\"font-weight: 400;\">15<\/span><\/li>\n<\/ul>\n<p><b>Causal Scrubbing<\/b><span style=\"font-weight: 400;\"> takes this rigor to the extreme. It is a method for hypothesis testing.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Concept:<\/b><span style=\"font-weight: 400;\"> If we have a hypothesis (e.g., &#8220;The model only uses the gender of the name to pick the pronoun&#8221;), we can replace all activations in the model with random values <\/span><i><span style=\"font-weight: 400;\">constrained<\/span><\/i><span style=\"font-weight: 400;\"> by that hypothesis (e.g., replacing &#8220;Mary&#8221; with &#8220;Alice&#8221; because they share the gender, but not with &#8220;John&#8221;).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Test:<\/b><span style=\"font-weight: 400;\"> If the model&#8217;s performance remains unchanged under this &#8220;scrubbing,&#8221; our hypothesis is verified\u2014the model really was only using the gender information. If performance drops, our hypothesis was incomplete (the model was using some other feature we didn&#8217;t account for).<\/span><span style=\"font-weight: 400;\">14<\/span><\/li>\n<\/ul>\n<h3><b>4.3 Attribution Patching<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">A major limitation of standard Activation Patching is scalability. 
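<\/span><\/p>
<p><span style=\"font-weight: 400;\">The gradient-based shortcut formalized below estimates every patch&#8217;s effect at once from a single backward pass. A minimal sketch, using a toy linear &#8220;model&#8221; (an illustrative assumption) for which the first-order estimate happens to be exact:<\/span><\/p>
```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))   # toy layer producing the activation we would patch
v = rng.normal(size=4)         # toy readout: metric m(a) = v . a (a logit, say)

x_clean, x_corrupt = rng.normal(size=4), rng.normal(size=4)
a_clean, a_corrupt = W1 @ x_clean, W1 @ x_corrupt

# Exact patching effect: rerun the corrupted pass with the clean activation.
exact = v @ a_clean - v @ a_corrupt

# Attribution patching: (clean - corrupt) activation times the gradient of the
# metric w.r.t. the activation. Here grad = v by hand; autograd in practice.
estimate = (a_clean - a_corrupt) @ v

print(exact, estimate)   # matches (up to floating point) for this linear toy
```
<p><span style=\"font-weight: 400;\">Downstream nonlinearities in a real network turn this identity into a first-order approximation, which is exactly the trade-off described below.<\/span><\/p>
<p><span style=\"font-weight: 400;\">The cost argument for this approximation is simple: 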
Patching every head in every layer requires a separate forward pass for each component\u2014computationally prohibitive for large models.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Attribution Patching offers a fast, gradient-based approximation. It uses a first-order Taylor expansion to estimate the effect of a patch.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">$$\\text{Effect} \\approx (\\text{Clean Activation} - \\text{Corrupt Activation}) \\times \\nabla_{\\text{Activation}} \\text{Logits}$$<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">This allows researchers to estimate the causal importance of every component in the network in a single backward pass. While less accurate than exact patching, it serves as a powerful heuristic to identify &#8220;hotspots&#8221; for more detailed investigation.<\/span><span style=\"font-weight: 400;\">32<\/span><\/p>\n<h2><b>5. Resolving Superposition: The Era of Sparse Autoencoders<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">While circuits explain the <\/span><i><span style=\"font-weight: 400;\">wiring<\/span><\/i><span style=\"font-weight: 400;\"> of the model, the <\/span><i><span style=\"font-weight: 400;\">nodes<\/span><\/i><span style=\"font-weight: 400;\">\u2014the MLP neurons\u2014remained a mystery due to polysemanticity. The years 2024 and 2025 marked a breakthrough with the application of <\/span><b>Sparse Autoencoders (SAEs)<\/b><span style=\"font-weight: 400;\"> to mathematically resolve this issue.<\/span><span style=\"font-weight: 400;\">16<\/span><\/p>\n<h3><b>5.1 Dictionary Learning for Feature Extraction<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The core insight of SAEs is to treat the activations of a neural network layer not as the fundamental features, but as a compressed &#8220;ciphertext&#8221; resulting from superposition. 
The goal is to &#8220;decrypt&#8221; this into the original, sparse features.<\/span><\/p>\n<h4><b>5.1.1 The SAE Architecture<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">Researchers train a separate autoencoder on the activations of the target LLM (e.g., GPT-4&#8217;s residual stream).<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Expansion:<\/b><span style=\"font-weight: 400;\"> The autoencoder maps the model&#8217;s activation vector (dimension $d_{model}$) to a much larger latent space (dimension $d_{SAE}$), often 16x to 256x larger.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Sparsity:<\/b><span style=\"font-weight: 400;\"> A strong sparsity penalty (such as L1 regularization or a Top-K activation function) forces the autoencoder to represent any given input using only a handful of active latent units.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Reconstruction:<\/b><span style=\"font-weight: 400;\"> The decoder attempts to reconstruct the original model activation from this sparse code.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">If the reconstruction is accurate and the code is sparse, the latent units of the SAE correspond to the &#8220;true&#8221; monosemantic features of the model.<\/span><span style=\"font-weight: 400;\">21<\/span><\/p>\n<h3><b>5.2 From Polysemantic to Monosemantic<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The results of this approach have been transformative. While individual neurons in the LLM are polysemantic nightmares, the features discovered by SAEs are remarkably interpretable and <\/span><b>monosemantic<\/b><span style=\"font-weight: 400;\">.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Specific Features:<\/b><span style=\"font-weight: 400;\"> Anthropic and other labs have extracted thousands of crisp features. 
Examples include:<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The &#8220;Arabic Script&#8221; Feature:<\/b><span style=\"font-weight: 400;\"> Activates solely on Arabic text.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The &#8220;DNA&#8221; Feature:<\/b><span style=\"font-weight: 400;\"> Activates on genetic sequences.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The &#8220;Base64&#8221; Feature:<\/b><span style=\"font-weight: 400;\"> Activates on Base64 encoded strings.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><b>The &#8220;Code Error&#8221; Feature:<\/b><span style=\"font-weight: 400;\"> Activates specifically on syntax errors in Python code.<\/span><span style=\"font-weight: 400;\">16<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Steering:<\/b><span style=\"font-weight: 400;\"> Crucially, these features are causal. If we manually clamp the &#8220;Golden Gate Bridge&#8221; feature to a high value, the model\u2014regardless of the prompt\u2014will start hallucinating references to the bridge. This proves the feature is a fundamental unit of the model&#8217;s cognition.<\/span><span style=\"font-weight: 400;\">34<\/span><\/li>\n<\/ul>\n<h4><b>5.2.2 Feature Splitting and Universality<\/b><\/h4>\n<p><span style=\"font-weight: 400;\">SAEs also reveal the fractal nature of concepts. As researchers increase the size of the SAE (the expansion factor), features &#8220;split.&#8221; A feature that broadly represented &#8220;sadness&#8221; in a small dictionary might split into distinct features for &#8220;grief,&#8221; &#8220;melancholy,&#8221; &#8220;frustration,&#8221; and &#8220;somberness&#8221; in a larger dictionary. 
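<\/span><\/p>
<p><span style=\"font-weight: 400;\">The encode-sparsify-decode recipe from Section 5.1.1 is compact enough to sketch directly. The dimensions, the Top-K sparsity rule, and the random weights below are illustrative assumptions; a trained SAE learns its weights by minimizing reconstruction error plus the sparsity penalty:<\/span><\/p>
```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, k = 16, 256, 4     # 16x expansion; keep only the 4 largest latents
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)

def sae(activation):
    """Encode to an overcomplete latent code, enforce Top-K sparsity, decode."""
    z = np.maximum(W_enc @ activation, 0.0)   # ReLU encoder
    z[np.argsort(z)[:-k]] = 0.0               # zero all but the k largest latents
    return W_dec @ z, z

x = rng.normal(size=d_model)       # stand-in for one residual-stream activation
x_hat, code = sae(x)
print("active latents:", int((code > 0).sum()), "of", d_sae)
```
<p><span style=\"font-weight: 400;\">Each surviving latent is a candidate monosemantic feature, and its decoder column is the direction that feature writes into the residual stream.<\/span><\/p>
<p><span style=\"font-weight: 400;\">Feature splitting, seen through this lens, is what happens as the dictionary size grows. 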
This suggests that LLMs have a hierarchical ontology of concepts that we can inspect at varying resolutions.<\/span><span style=\"font-weight: 400;\">16<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Furthermore, there is evidence of <\/span><b>Universality<\/b><span style=\"font-weight: 400;\">. Features found in one model often map 1:1 to features found in completely different models trained on similar data. This hints at a &#8220;Platonic&#8221; space of features that any intelligent system learning from human text must discover.<\/span><span style=\"font-weight: 400;\">16<\/span><\/p>\n<h3><b>5.3 Application to Vision-Language Models (VLMs)<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The SAE methodology extends beyond text. In Vision-Language Models like CLIP, SAEs have been used to decompose visual representations. Researchers introduced a <\/span><b>MonoSemanticity Score (MS)<\/b><span style=\"font-weight: 400;\"> to quantify how specific a neuron is to a visual concept.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Findings:<\/b><span style=\"font-weight: 400;\"> SAEs revealed features for specific objects (e.g., &#8220;pencil,&#8221; &#8220;blue jay&#8221;) and abstract visual motifs (e.g., &#8220;checkerboard patterns&#8221;).<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Control:<\/b><span style=\"font-weight: 400;\"> This allows for unsupervised &#8220;concept steering&#8221; in images. By suppressing the &#8220;pencil&#8221; feature in the vision encoder, researchers could force the multimodal LLM (like LLaVA) to describe an image of a pencil without ever using the word or recognizing the object, proving the causal link between the visual feature and the linguistic output.<\/span><span style=\"font-weight: 400;\">25<\/span><\/li>\n<\/ul>\n<h2><b>6. 
Representation Engineering: Top-Down Control and Safety<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">While Mechanistic Interpretability often builds understanding &#8220;bottom-up&#8221; (from neurons to circuits), a complementary approach known as <\/span><b>Representation Engineering (RepE)<\/b><span style=\"font-weight: 400;\"> has emerged, focusing on &#8220;top-down&#8221; control.<\/span><span style=\"font-weight: 400;\">10<\/span><span style=\"font-weight: 400;\"> RepE does not necessarily identify the exact circuit wiring; instead, it identifies the <\/span><i><span style=\"font-weight: 400;\">direction<\/span><\/i><span style=\"font-weight: 400;\"> in activation space that corresponds to high-level traits like honesty, morality, or harmlessness, and then manipulates it.<\/span><\/p>\n<h3><b>6.1 Linear Artificial Tomography (LAT) and Control Vectors<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">RepE treats the model&#8217;s internal state as a transparent medium that can be scanned and adjusted.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>LAT Scans:<\/b><span style=\"font-weight: 400;\"> By running the model on pairs of contrasting prompts (e.g., &#8220;Answer honestly&#8221; vs. &#8220;Answer deceptively&#8221;), researchers can perform a &#8220;Linear Artificial Tomography&#8221; scan. 
This involves taking the difference in activations between the two conditions to compute a <\/span><b>Control Vector<\/b><span style=\"font-weight: 400;\">\u2014the direction in space that represents &#8220;Honesty&#8221;.<\/span><span style=\"font-weight: 400;\">10<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Intervention:<\/b><span style=\"font-weight: 400;\"> Once this vector is known, it can be added or subtracted from the residual stream during inference.<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Honesty Steering:<\/span><\/i><span style=\"font-weight: 400;\"> Researchers found that adding the &#8220;Honesty&#8221; vector could force a model to correct &#8220;imitative falsehoods.&#8221; For example, when asked what &#8220;WIKI&#8221; stands for, models often hallucinate &#8220;What I Know Is.&#8221; With the honesty vector applied, the model accessed its latent knowledge and correctly answered &#8220;Wikiwiki&#8221; (Hawaiian for fast).<\/span><span style=\"font-weight: 400;\">10<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">Power-Seeking:<\/span><\/i><span style=\"font-weight: 400;\"> Similarly, vectors for &#8220;power-seeking&#8221; or &#8220;sycophancy&#8221; can be subtracted to make the model safer and more robust to jailbreaks.<\/span><span style=\"font-weight: 400;\">36<\/span><\/li>\n<\/ul>\n<h3><b>6.2 Circuit Breakers and Mechanistic Unlearning<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">A critical application of this understanding is the creation of <\/span><b>Circuit Breakers<\/b><span style=\"font-weight: 400;\">\u2014mechanisms that dynamically interrupt harmful thought processes.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Representation Rerouting:<\/b><span style=\"font-weight: 400;\"> Instead of just training the model to say &#8220;I cannot answer that&#8221; 
(which acts as a surface-level mask), researchers identify the <\/span><i><span style=\"font-weight: 400;\">trajectory<\/span><\/i><span style=\"font-weight: 400;\"> of activations that leads to a harmful output. They then install a &#8220;circuit breaker&#8221; that projects the activation away from this harmful subspace the moment it is detected. This provides a robust defense against &#8220;jailbreaking&#8221; attacks that try to bypass the model&#8217;s safety filters.<\/span><span style=\"font-weight: 400;\">10<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Targeting the Fact Lookup Unit (FLU):<\/b><span style=\"font-weight: 400;\"> In the domain of <\/span><b>Mechanistic Unlearning<\/b><span style=\"font-weight: 400;\"> (making a model &#8220;forget&#8221; hazardous knowledge, like bio-weapon recipes), RepE has shown that standard methods (Output Tracing) are insufficient because they only suppress the final token. The model still &#8220;knows&#8221; the fact internally.<\/span><\/li>\n<\/ul>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><i><span style=\"font-weight: 400;\">The FLU Solution:<\/span><\/i><span style=\"font-weight: 400;\"> Research indicates that the specific MLP layers acting as &#8220;Fact Lookup Units&#8221; must be targeted. By identifying and editing the weights in these specific layers (the &#8220;key-value&#8221; memories), researchers can achieve robust unlearning that resists &#8220;relearning&#8221; attacks. If the fact is erased from the lookup table, the model cannot recover it, even if prompted with paraphrases or adversarial cues.<\/span><span style=\"font-weight: 400;\">7<\/span><\/li>\n<\/ul>\n<h2><b>7. 
Future Frontiers and Challenges<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Despite the immense progress, the field of mechanistic interpretability faces significant hurdles on the path to fully &#8220;white-box&#8221; AI.<\/span><\/p>\n<h3><b>7.1 Scalability and the &#8220;Hydra&#8221; Effect<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The primary critique of MI is scalability. Reverse-engineering the IOI circuit in GPT-2 Small took months of human effort. Frontier models are orders of magnitude larger and more complex.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>The Hydra Effect:<\/b><span style=\"font-weight: 400;\"> As models scale, they do not just get better; they learn <\/span><i><span style=\"font-weight: 400;\">more<\/span><\/i><span style=\"font-weight: 400;\"> features and <\/span><i><span style=\"font-weight: 400;\">more<\/span><\/i><span style=\"font-weight: 400;\"> circuits. SAEs on large models reveal hundreds of thousands of features. Interpreting them one by one is impossible.<\/span><span style=\"font-weight: 400;\">1<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Dark Matter:<\/b><span style=\"font-weight: 400;\"> Critics argue that current research focuses on &#8220;cherry-picked&#8221; circuits (like IOI or Arithmetic) that happen to be clean and algorithmic. However, a vast portion of the network (the &#8220;dark matter&#8221;) may operate on messy, distributed heuristics that defy clean circuit logic.<\/span><span style=\"font-weight: 400;\">39<\/span><\/li>\n<\/ul>\n<h3><b>7.2 Automation: The Only Way Forward<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">The solution to the scalability crisis is <\/span><b>Automated Circuit Discovery (ACD)<\/b><span style=\"font-weight: 400;\">. 
We need AI to interpret AI.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Automated Algorithms:<\/b><span style=\"font-weight: 400;\"> Recent work in late 2024 and 2025 has introduced algorithms that can search the computational graph for circuits with <\/span><b>provable guarantees<\/b><span style=\"font-weight: 400;\"> of robustness and minimality. These algorithms use formal verification techniques to ensure that the discovered circuit faithfully replicates the model&#8217;s behavior across the entire input domain, not just on a few test examples.<\/span><span style=\"font-weight: 400;\">40<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Auto-Interpretability:<\/b><span style=\"font-weight: 400;\"> Researchers are using advanced LLMs (like GPT-4) to interpret the features found by SAEs. By feeding the SAE feature activations and the corresponding text to GPT-4, the &#8220;interpreter&#8221; model can generate a natural language description of what the feature represents (e.g., &#8220;This feature detects references to 19th-century literature&#8221;). This creates a scalable pipeline where machines map the minds of other machines.<\/span><span style=\"font-weight: 400;\">1<\/span><\/li>\n<\/ul>\n<h3><b>7.3 Beyond Text: Biology and Science<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">Mechanistic interpretability is expanding beyond language. In <\/span><b>Protein Language Models<\/b><span style=\"font-weight: 400;\"> (like ESM-2), SAEs are being used to discover biological features.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><b>Biological Circuits:<\/b><span style=\"font-weight: 400;\"> Researchers have found SAE features that correspond to specific protein secondary structures (alpha helices), binding motifs, and even evolutionary lineages. 
This suggests that MI could be a powerful tool for scientific discovery, allowing us to decode the &#8220;language of biology&#8221; learned by these models.<\/span><span style=\"font-weight: 400;\">33<\/span><\/li>\n<\/ul>\n<h2><b>8. Conclusion<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Mechanistic Interpretability has matured from a niche subfield into a rigorous scientific discipline essential for the future of AI. We have moved beyond the &#8220;alchemy&#8221; of training black boxes and are beginning to construct a &#8220;periodic table&#8221; of neural elements\u2014from the Induction Heads that power learning to the geometric polytopes that enable memory.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The breakthroughs in Sparse Autoencoders and Representation Engineering demonstrate that the &#8220;black box&#8221; is not impenetrable. It is a complex, high-dimensional machine, but one that is ultimately governed by discoverable logic and geometry. The ability to decompose this logic\u2014to distinguish between a model that <\/span><i><span style=\"font-weight: 400;\">reasons<\/span><\/i><span style=\"font-weight: 400;\"> and a model that <\/span><i><span style=\"font-weight: 400;\">mimics<\/span><\/i><span style=\"font-weight: 400;\">\u2014is the critical enablement for safe, aligned, and trustworthy Artificial Intelligence. 
As models continue to scale, the microscope of interpretability must scale with them, evolving from manual inspection to automated, mathematically guaranteed reverse engineering.<\/span><\/p>\n<h3><b>Table 1: Key Algorithmic Circuits Identified in Transformers<\/b><\/h3>\n<p>&nbsp;<\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Circuit Name<\/b><\/td>\n<td><b>Primary Function<\/b><\/td>\n<td><b>Mechanism Summary<\/b><\/td>\n<td><b>Key Discovery Paper<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Induction Heads<\/b><\/td>\n<td><span style=\"font-weight: 400;\">In-Context Learning (Few-Shot)<\/span><\/td>\n<td><b>Step 1:<\/b><span style=\"font-weight: 400;\"> Previous Token Head copies token $A$ to position $i$.<\/span><\/p>\n<p><b>Step 2:<\/b><span style=\"font-weight: 400;\"> Induction Head uses $A$ to query context for previous $A$, then copies subsequent token $B$.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Olsson et al. (Anthropic) <\/span><span style=\"font-weight: 400;\">12<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>IOI Circuit<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Indirect Object Identification<\/span><\/td>\n<td><b>Routing:<\/b><span style=\"font-weight: 400;\"> Duplicate Heads (Find $S$) $\\to$ S-Inhibition Heads (Suppress $S$) $\\to$ Name Mover Heads (Output $IO$).<\/span><\/p>\n<p><b>Hedging:<\/b><span style=\"font-weight: 400;\"> Negative Name Movers oppose the output to calibrate confidence.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Wang et al. (Redwood) <\/span><span style=\"font-weight: 400;\">13<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Arithmetic Circuit<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Multi-digit Addition\/Subtraction<\/span><\/td>\n<td><b>Logic:<\/b><span style=\"font-weight: 400;\"> Decomposes sum into digit streams. Uses &#8220;TriCase&#8221; logic (Always\/Never\/Maybe Carry) and &#8220;Sum Validation&#8221; to cascade carries.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Quirke et al. 
<\/span><span style=\"font-weight: 400;\">28<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Fact Lookup Unit<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Factual Recall<\/span><\/td>\n<td><b>Memory:<\/b><span style=\"font-weight: 400;\"> MLP layers act as Key-Value stores. First layer detects the subject (Key), second layer outputs the attribute (Value).<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Meng et al. <\/span><span style=\"font-weight: 400;\">7<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><b>Table 2: The Toolkit of Mechanistic Interpretability<\/b><\/h3>\n<table>\n<tbody>\n<tr>\n<td><b>Method<\/b><\/td>\n<td><b>Type<\/b><\/td>\n<td><b>Purpose<\/b><\/td>\n<td><b>How It Works<\/b><\/td>\n<\/tr>\n<tr>\n<td><b>Activation Patching<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Causal Intervention<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Localize <\/span><i><span style=\"font-weight: 400;\">nodes<\/span><\/i><span style=\"font-weight: 400;\"> (heads\/neurons).<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Swap activation from &#8220;Clean&#8221; run to &#8220;Corrupted&#8221; run. Tests if component is sufficient to restore behavior.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Path Patching<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Causal Intervention<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Localize <\/span><i><span style=\"font-weight: 400;\">edges<\/span><\/i><span style=\"font-weight: 400;\"> (connections).<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Patch an activation <\/span><i><span style=\"font-weight: 400;\">only<\/span><\/i><span style=\"font-weight: 400;\"> as it enters a specific downstream component. 
Maps the wiring diagram.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Causal Scrubbing<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Hypothesis Testing<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Verify <\/span><i><span style=\"font-weight: 400;\">circuits<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Replace all activations with random values constrained by a hypothesis. If performance holds, hypothesis is true.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Sparse Autoencoders<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Dictionary Learning<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Resolve <\/span><i><span style=\"font-weight: 400;\">Superposition<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Train a sparse autoencoder on activations to disentangle polysemantic neurons into monosemantic features.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Attribution Patching<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Gradient Approximation<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Scalable <\/span><i><span style=\"font-weight: 400;\">Heuristic<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Use gradients to approximate the effect of patching every component in one pass. 
Good for scanning large models.<\/span><\/td>\n<\/tr>\n<tr>\n<td><b>Control Vectors (RepE)<\/b><\/td>\n<td><span style=\"font-weight: 400;\">Top-Down Steering<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Control <\/span><i><span style=\"font-weight: 400;\">Behavior<\/span><\/i><span style=\"font-weight: 400;\">.<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Identify vector direction of a trait (e.g., Honesty) and add\/subtract it from residual stream to steer output.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h4><b>Works cited<\/b><\/h4>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Open Problems in Mechanistic Interpretability &#8211; arXiv, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arxiv.org\/html\/2501.16496v1\"><span style=\"font-weight: 400;\">https:\/\/arxiv.org\/html\/2501.16496v1<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Survey on the Role of Mechanistic Interpretability in Generative AI &#8211; MDPI, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.mdpi.com\/2504-2289\/9\/8\/193\"><span style=\"font-weight: 400;\">https:\/\/www.mdpi.com\/2504-2289\/9\/8\/193<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Understanding Mechanistic Interpretability in AI Models &#8211; IntuitionLabs, accessed on December 22, 2025, <\/span><a href=\"https:\/\/intuitionlabs.ai\/articles\/mechanistic-interpretability-ai-llms\"><span style=\"font-weight: 400;\">https:\/\/intuitionlabs.ai\/articles\/mechanistic-interpretability-ai-llms<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.transformer-circuits.pub\/2022\/mech-interp-essay\"><span style=\"font-weight: 
400;\">https:\/\/www.transformer-circuits.pub\/2022\/mech-interp-essay<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Circuits in Transformers Mechanistic Interpretability 2 &#8211; Rohan Hitchcock, accessed on December 22, 2025, <\/span><a href=\"https:\/\/rohanhitchcock.com\/notes\/2023-6-slt-alignment-talk-mech-interp.pdf\"><span style=\"font-weight: 400;\">https:\/\/rohanhitchcock.com\/notes\/2023-6-slt-alignment-talk-mech-interp.pdf<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A Mathematical Framework for Transformer Circuits, accessed on December 22, 2025, <\/span><a href=\"https:\/\/transformer-circuits.pub\/2021\/framework\/index.html\"><span style=\"font-weight: 400;\">https:\/\/transformer-circuits.pub\/2021\/framework\/index.html<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Robust Knowledge Unlearning via Mechanistic Localizations &#8211; OpenReview, accessed on December 22, 2025, <\/span><a href=\"https:\/\/openreview.net\/pdf?id=06pNzrEjnH\"><span style=\"font-weight: 400;\">https:\/\/openreview.net\/pdf?id=06pNzrEjnH<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Neel Nanda on Mechanistic Interpretability: Progress, Limits, and Paths to Safer AI, accessed on December 22, 2025, <\/span><a href=\"https:\/\/forum.effectivealtruism.org\/posts\/za2oHe8HBtcYNnN7C\/neel-nanda-mechanistic-interpretability\"><span style=\"font-weight: 400;\">https:\/\/forum.effectivealtruism.org\/posts\/za2oHe8HBtcYNnN7C\/neel-nanda-mechanistic-interpretability<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Robust Knowledge Unlearning and Editing via Mechanistic Localization &#8211; ChatPaper, accessed on December 22, 2025, <\/span><a href=\"https:\/\/chatpaper.com\/paper\/165135\"><span 
style=\"font-weight: 400;\">https:\/\/chatpaper.com\/paper\/165135<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">CMU CSD PhD Blog &#8211; From Representation Engineering to Circuit &#8230;, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.cs.cmu.edu\/~csd-phd-blog\/2025\/representation-engineering\/\"><span style=\"font-weight: 400;\">https:\/\/www.cs.cmu.edu\/~csd-phd-blog\/2025\/representation-engineering\/<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Toy Models of Superposition &#8211; Anthropic, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.anthropic.com\/research\/toy-models-of-superposition\"><span style=\"font-weight: 400;\">https:\/\/www.anthropic.com\/research\/toy-models-of-superposition<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">In-context Learning and Induction Heads &#8211; Transformer Circuits Thread, accessed on December 22, 2025, <\/span><a href=\"https:\/\/transformer-circuits.pub\/2022\/in-context-learning-and-induction-heads\/index.html\"><span style=\"font-weight: 400;\">https:\/\/transformer-circuits.pub\/2022\/in-context-learning-and-induction-heads\/index.html<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">INTERPRETABILITY IN THE WILD: A CIRCUIT FOR INDIRECT OBJECT IDENTIFICATION IN GPT-2 SMALL &#8211; OpenReview, accessed on December 22, 2025, <\/span><a href=\"https:\/\/openreview.net\/pdf?id=NpsVSN6o4ul\"><span style=\"font-weight: 400;\">https:\/\/openreview.net\/pdf?id=NpsVSN6o4ul<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Mechanistic Interpretability Techniques &#8211; Emergent Mind, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.emergentmind.com\/topics\/mechanistic-interpretability-techniques\"><span 
style=\"font-weight: 400;\">https:\/\/www.emergentmind.com\/topics\/mechanistic-interpretability-techniques<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">How to use and interpret activation patching \u2014 LessWrong, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.lesswrong.com\/posts\/FhryNAFknqKAdDcYy\/how-to-use-and-interpret-activation-patching\"><span style=\"font-weight: 400;\">https:\/\/www.lesswrong.com\/posts\/FhryNAFknqKAdDcYy\/how-to-use-and-interpret-activation-patching<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Towards Monosemanticity: Decomposing Language Models With &#8230;, accessed on December 22, 2025, <\/span><a href=\"https:\/\/transformer-circuits.pub\/2023\/monosemantic-features\"><span style=\"font-weight: 400;\">https:\/\/transformer-circuits.pub\/2023\/monosemantic-features<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Toy Models of Superposition &#8211; Transformer Circuits Thread, accessed on December 22, 2025, <\/span><a href=\"https:\/\/transformer-circuits.pub\/2022\/toy_model\/index.html\"><span style=\"font-weight: 400;\">https:\/\/transformer-circuits.pub\/2022\/toy_model\/index.html<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arxiv.org\/html\/2407.02646v2\"><span style=\"font-weight: 400;\">https:\/\/arxiv.org\/html\/2407.02646v2<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Representation Engineering Mistral-7B an Acid Trip &#8211; Theia Vogel, accessed on December 22, 2025, <\/span><a href=\"https:\/\/vgel.me\/posts\/representation-engineering\/\"><span style=\"font-weight: 
400;\">https:\/\/vgel.me\/posts\/representation-engineering\/<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">[1.3.1] Toy Models of Superposition &amp; Sparse Autoencoders &#8211; Transformer Interpretability, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arena-chapter1-transformer-interp.streamlit.app\/[1.3.1]_Toy_Models_of_Superposition_&amp;_SAEs\"><span style=\"font-weight: 400;\">https:\/\/arena-chapter1-transformer-interp.streamlit.app\/[1.3.1]_Toy_Models_of_Superposition_&amp;_SAEs<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Monosemanticity: How Anthropic Made AI 70% More Interpretable | Galileo, accessed on December 22, 2025, <\/span><a href=\"https:\/\/galileo.ai\/blog\/anthropic-ai-interpretability-breakthrough\"><span style=\"font-weight: 400;\">https:\/\/galileo.ai\/blog\/anthropic-ai-interpretability-breakthrough<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Mechanistic Interpretability in Brains and Machines and Category Theory | by Farshad Noravesh | Medium, accessed on December 22, 2025, <\/span><a href=\"https:\/\/medium.com\/@noraveshfarshad\/mechanistic-interpretability-in-brains-and-machines-37981e6e7ffc\"><span style=\"font-weight: 400;\">https:\/\/medium.com\/@noraveshfarshad\/mechanistic-interpretability-in-brains-and-machines-37981e6e7ffc<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">A Walkthrough of Toy Models of Superposition w\/ Jess Smith &#8211; YouTube, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.youtube.com\/watch?v=R3nbXgMnVqQ\"><span style=\"font-weight: 400;\">https:\/\/www.youtube.com\/watch?v=R3nbXgMnVqQ<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Mechanistic Interpretability for Engineers | by Zaina Haider | Dec, 
2025 &#8211; Medium, accessed on December 22, 2025, <\/span><a href=\"https:\/\/medium.com\/@thekzgroupllc\/mechanistic-interpretability-for-engineers-41ee86f9d53f\"><span style=\"font-weight: 400;\">https:\/\/medium.com\/@thekzgroupllc\/mechanistic-interpretability-for-engineers-41ee86f9d53f<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models, accessed on December 22, 2025, <\/span><a href=\"https:\/\/openreview.net\/forum?id=DaNnkQJSQf\"><span style=\"font-weight: 400;\">https:\/\/openreview.net\/forum?id=DaNnkQJSQf<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">[1.3] Indirect Object Identification &#8211; Streamlit, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arena-ch1-transformers.streamlit.app\/[1.3]_Indirect_Object_Identification\"><span style=\"font-weight: 400;\">https:\/\/arena-ch1-transformers.streamlit.app\/[1.3]_Indirect_Object_Identification<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Induction Heads as an Essential Mechanism for Pattern Matching in In-context Learning &#8211; ACL Anthology, accessed on December 22, 2025, <\/span><a href=\"https:\/\/aclanthology.org\/2025.findings-naacl.283.pdf\"><span style=\"font-weight: 400;\">https:\/\/aclanthology.org\/2025.findings-naacl.283.pdf<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Arithmetic in Transformers Explained &#8211; arXiv, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arxiv.org\/html\/2402.02619v9\"><span style=\"font-weight: 400;\">https:\/\/arxiv.org\/html\/2402.02619v9<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Understanding Addition and Subtraction in Transformers &#8211; arXiv, accessed on December 22, 2025, 
<\/span><a href=\"https:\/\/arxiv.org\/html\/2402.02619v10\"><span style=\"font-weight: 400;\">https:\/\/arxiv.org\/html\/2402.02619v10<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">How to Think About Activation Patching &#8211; AI Alignment Forum, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.alignmentforum.org\/posts\/xh85KbTFhbCz7taD4\/how-to-think-about-activation-patching\"><span style=\"font-weight: 400;\">https:\/\/www.alignmentforum.org\/posts\/xh85KbTFhbCz7taD4\/how-to-think-about-activation-patching<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">How to use and interpret activation patching &#8211; arXiv, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arxiv.org\/pdf\/2404.15255\"><span style=\"font-weight: 400;\">https:\/\/arxiv.org\/pdf\/2404.15255<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Attribution Patching: Activation Patching At Industrial Scale &#8211; Neel Nanda, accessed on December 22, 2025, <\/span><a href=\"https:\/\/www.neelnanda.io\/mechanistic-interpretability\/attribution-patching\"><span style=\"font-weight: 400;\">https:\/\/www.neelnanda.io\/mechanistic-interpretability\/attribution-patching<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models &#8211; PMC &#8211; PubMed Central, accessed on December 22, 2025, <\/span><a href=\"https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC11839115\/\"><span style=\"font-weight: 400;\">https:\/\/pmc.ncbi.nlm.nih.gov\/articles\/PMC11839115\/<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Neel Nanda on the race to read AI minds (part 1) | 80,000 Hours, accessed on 
December 22, 2025, <\/span><a href=\"https:\/\/80000hours.org\/podcast\/episodes\/neel-nanda-mechanistic-interpretability\/\"><span style=\"font-weight: 400;\">https:\/\/80000hours.org\/podcast\/episodes\/neel-nanda-mechanistic-interpretability\/<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">ExplainableML\/sae-for-vlm: [NeurIPS 2025] Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models &#8211; GitHub, accessed on December 22, 2025, <\/span><a href=\"https:\/\/github.com\/ExplainableML\/sae-for-vlm\"><span style=\"font-weight: 400;\">https:\/\/github.com\/ExplainableML\/sae-for-vlm<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Representation Engineering for Al Alignment &#8211; Satvik Golechha, accessed on December 22, 2025, <\/span><a href=\"https:\/\/7vik.io\/2023\/10\/10\/engineering-representations-for-al-alignment\/\"><span style=\"font-weight: 400;\">https:\/\/7vik.io\/2023\/10\/10\/engineering-representations-for-al-alignment\/<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Robust Knowledge Unlearning and Editing via Mechanistic Localization &#8211; OpenReview, accessed on December 22, 2025, <\/span><a href=\"https:\/\/openreview.net\/forum?id=vsU2veUpiR\"><span style=\"font-weight: 400;\">https:\/\/openreview.net\/forum?id=vsU2veUpiR<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Mechanistic Interpretability for AI Safety A Review &#8211; arXiv, accessed on December 22, 2025, <\/span><a href=\"https:\/\/arxiv.org\/html\/2404.14082v1\"><span style=\"font-weight: 400;\">https:\/\/arxiv.org\/html\/2404.14082v1<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety, accessed on December 22, 2025, 
<\/span><a href=\"https:\/\/www.alignmentforum.org\/posts\/wt7HXaCWzuKQipqz3\/eis-vi-critiques-of-mechanistic-interpretability-work-in-ai\"><span style=\"font-weight: 400;\">https:\/\/www.alignmentforum.org\/posts\/wt7HXaCWzuKQipqz3\/eis-vi-critiques-of-mechanistic-interpretability-work-in-ai<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Provable Guarantees for Automated Circuit Discovery in Mechanistic &#8230;, accessed on December 22, 2025, <\/span><a href=\"https:\/\/openreview.net\/forum?id=Timsb74vIY\"><span style=\"font-weight: 400;\">https:\/\/openreview.net\/forum?id=Timsb74vIY<\/span><\/a><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Circuits Updates &#8211; July 2025, accessed on December 22, 2025, <\/span><a href=\"https:\/\/transformer-circuits.pub\/2025\/july-update\/index.html\"><span style=\"font-weight: 400;\">https:\/\/transformer-circuits.pub\/2025\/july-update\/index.html<\/span><\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>1. 
Introduction: The Black Box Crisis and the Mechanistic Turn The ascendance of deep learning, particularly through the proliferation of Large Language Models (LLMs) based on the Transformer architecture, has <span class=\"readmore\"><a href=\"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/\">Read More &#8230;<\/a><\/span><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2374],"tags":[],"class_list":["post-9081","post","type-post","status-publish","format-standard","hentry","category-deep-research"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Mechanistic Interpretability: Reverse Engineering the Neural Code | Uplatz Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Mechanistic Interpretability: Reverse Engineering the Neural Code | Uplatz Blog\" \/>\n<meta property=\"og:description\" content=\"1. 
Introduction: The Black Box Crisis and the Mechanistic Turn The ascendance of deep learning, particularly through the proliferation of Large Language Models (LLMs) based on the Transformer architecture, has Read More ...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/\" \/>\n<meta property=\"og:site_name\" content=\"Uplatz Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/Uplatz-1077816825610769\/\" \/>\n<meta property=\"article:published_time\" content=\"2025-12-24T22:09:19+00:00\" \/>\n<meta name=\"author\" content=\"uplatzblog\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@uplatz_global\" \/>\n<meta name=\"twitter:site\" content=\"@uplatz_global\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"uplatzblog\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"27 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/\"},\"author\":{\"name\":\"uplatzblog\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/person\\\/8ecae69a21d0757bdb2f776e67d2645e\"},\"headline\":\"Mechanistic Interpretability: Reverse Engineering the Neural Code\",\"datePublished\":\"2025-12-24T22:09:19+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/\"},\"wordCount\":6071,\"publisher\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#organization\"},\"articleSection\":[\"Deep Research\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/\",\"name\":\"Mechanistic Interpretability: Reverse Engineering the Neural Code | Uplatz 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#website\"},\"datePublished\":\"2025-12-24T22:09:19+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/mechanistic-interpretability-reverse-engineering-the-neural-code\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Mechanistic Interpretability: Reverse Engineering the Neural Code\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/\",\"name\":\"Uplatz Blog\",\"description\":\"Uplatz is a global IT Training &amp; Consulting 
company\",\"publisher\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#organization\",\"name\":\"uplatz.com\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/Uplatz-Logo-Copy-2.png\",\"contentUrl\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/wp-content\\\/uploads\\\/2016\\\/11\\\/Uplatz-Logo-Copy-2.png\",\"width\":1280,\"height\":800,\"caption\":\"uplatz.com\"},\"image\":{\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.facebook.com\\\/Uplatz-1077816825610769\\\/\",\"https:\\\/\\\/x.com\\\/uplatz_global\",\"https:\\\/\\\/www.instagram.com\\\/\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/7956715?trk=tyah&amp;amp;amp;amp;trkInfo=clickedVertical:company,clickedEntityId:7956715,idx:1-1-1,tarId:1464353969447,tas:uplatz\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/uplatz.com\\\/blog\\\/#\\\/schema\\\/person\\\/8ecae69a21d0757bdb2f776e67d2645e\",\"name\":\"uplatzblog\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/7f814c72279199f59ded4
418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g\",\"caption\":\"uplatzblog\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Mechanistic Interpretability: Reverse Engineering the Neural Code | Uplatz Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/","og_locale":"en_US","og_type":"article","og_title":"Mechanistic Interpretability: Reverse Engineering the Neural Code | Uplatz Blog","og_description":"1. Introduction: The Black Box Crisis and the Mechanistic Turn The ascendance of deep learning, particularly through the proliferation of Large Language Models (LLMs) based on the Transformer architecture, has Read More ...","og_url":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/","og_site_name":"Uplatz Blog","article_publisher":"https:\/\/www.facebook.com\/Uplatz-1077816825610769\/","article_published_time":"2025-12-24T22:09:19+00:00","author":"uplatzblog","twitter_card":"summary_large_image","twitter_creator":"@uplatz_global","twitter_site":"@uplatz_global","twitter_misc":{"Written by":"uplatzblog","Est. 
reading time":"27 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/#article","isPartOf":{"@id":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/"},"author":{"name":"uplatzblog","@id":"https:\/\/uplatz.com\/blog\/#\/schema\/person\/8ecae69a21d0757bdb2f776e67d2645e"},"headline":"Mechanistic Interpretability: Reverse Engineering the Neural Code","datePublished":"2025-12-24T22:09:19+00:00","mainEntityOfPage":{"@id":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/"},"wordCount":6071,"publisher":{"@id":"https:\/\/uplatz.com\/blog\/#organization"},"articleSection":["Deep Research"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/","url":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/","name":"Mechanistic Interpretability: Reverse Engineering the Neural Code | Uplatz Blog","isPartOf":{"@id":"https:\/\/uplatz.com\/blog\/#website"},"datePublished":"2025-12-24T22:09:19+00:00","breadcrumb":{"@id":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/uplatz.com\/blog\/mechanistic-interpretability-reverse-engineering-the-neural-code\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/uplatz.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Mechanistic Interpretability: Reverse Engineering the Neural 
Code"}]},{"@type":"WebSite","@id":"https:\/\/uplatz.com\/blog\/#website","url":"https:\/\/uplatz.com\/blog\/","name":"Uplatz Blog","description":"Uplatz is a global IT Training &amp; Consulting company","publisher":{"@id":"https:\/\/uplatz.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/uplatz.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/uplatz.com\/blog\/#organization","name":"uplatz.com","url":"https:\/\/uplatz.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/uplatz.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2016\/11\/Uplatz-Logo-Copy-2.png","contentUrl":"https:\/\/uplatz.com\/blog\/wp-content\/uploads\/2016\/11\/Uplatz-Logo-Copy-2.png","width":1280,"height":800,"caption":"uplatz.com"},"image":{"@id":"https:\/\/uplatz.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/Uplatz-1077816825610769\/","https:\/\/x.com\/uplatz_global","https:\/\/www.instagram.com\/","https:\/\/www.linkedin.com\/company\/7956715?trk=tyah&amp;amp;amp;amp;trkInfo=clickedVertical:company,clickedEntityId:7956715,idx:1-1-1,tarId:1464353969447,tas:uplatz"]},{"@type":"Person","@id":"https:\/\/uplatz.com\/blog\/#\/schema\/person\/8ecae69a21d0757bdb2f776e67d2645e","name":"uplatzblog","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/7f814c72279199f59ded4418a8653ad15f5f8904ac75e025a4e2abe24d58fa5d?s=96&d=mm&r=g","caption":"uplatzblog"}}]}},"_links":{"self
":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts\/9081","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/comments?post=9081"}],"version-history":[{"count":1,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts\/9081\/revisions"}],"predecessor-version":[{"id":9082,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/posts\/9081\/revisions\/9082"}],"wp:attachment":[{"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/media?parent=9081"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/categories?post=9081"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uplatz.com\/blog\/wp-json\/wp\/v2\/tags?post=9081"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}