<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://ramesh-arvind.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://ramesh-arvind.github.io/" rel="alternate" type="text/html" /><updated>2026-05-05T10:56:41+00:00</updated><id>https://ramesh-arvind.github.io/feed.xml</id><title type="html">Ramesh Arvind Naagarajan</title><subtitle>Personal website of Ramesh Arvind Naagarajan, PhD researcher at the Chair of Automatic Control and System Dynamics, TU Chemnitz, working on explainable AI for control systems, mechanistic interpretability, and large language models.</subtitle><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><entry><title type="html">Our ICML 2026 paper</title><link href="https://ramesh-arvind.github.io/blog/2026/icml-2026-acceptance/" rel="alternate" type="text/html" title="Our ICML 2026 paper" /><published>2026-05-05T00:00:00+00:00</published><updated>2026-05-05T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2026/icml-2026-acceptance</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2026/icml-2026-acceptance/"><![CDATA[<p>Quick note that our paper, Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control, has been accepted at ICML 2026. The conference is in Seoul, at COEX, from the 6th to the 11th of July 2026. The paper is joint work with Zühal Wagner and my supervisor, Prof. Dr. Stefan Streif, at the Professorship of Automatic Control and System Dynamics, TU Chemnitz.</p>

<p><img src="/assets/img/hca-icml2026.png" alt="Hierarchical Causal Abduction: a one-figure summary of the framework, the three evidence streams, and the headline results." /></p>

<p>The figure above is the one-page version of the paper. The black-box MPC controller on the top left is the problem, the operators next to it are the audience, and the three panels in the middle are the framework. Panel one is the physics and domain knowledge encoded as a graph. Panel two is the optimiser’s own KKT structure used as evidence about which constraints actually drove the action. Panel three is the temporal causal graph learned from recent data with PCMCI. The hierarchical reasoning engine in the centre is the part that combines the three streams into a single auditable chain. The bottom row is what comes out the other side, a 53 percent improvement over LIME, validation across three domains, and an expert clarity score of 4.3 out of 5. The before-and-after on the right is the part that matters most to me, an operator going from “why did heating activate” to a forward-looking explanation about preventing a constraint violation two hours from now.</p>

<p>ICML 2026 received 23,918 submissions and accepted 6,352 of them, which puts the acceptance rate around 26.6 percent. I am genuinely thankful that this work made it through. The community of reviewers around explainable control at top ML venues is small and rigorous, and getting useful feedback there is itself a privilege.</p>

<p>The short version of the paper. We propose a framework that builds explanations for model predictive control by combining three things, a physics knowledge graph that captures what the operator already knows, the KKT structure of the controller’s own optimisation as evidence about which constraints actually drove the action, and a temporal causal graph learned from recent operating data. A hierarchical causal abduction procedure puts the three together and produces explanations that are auditable at three different levels, the binding constraint, the physical mechanism, and the data-grounded check. We test it on greenhouse climate, building HVAC, and chemical process control, and we improve explanation accuracy by 53 percent over LIME on our benchmark.</p>

<p>There is a longer companion post on this blog that walks through the framework section by section, including the failure modes the design was meant to block and the limitations we are still working on. If you came here for the technical version, that one is the right next click.</p>

<p>I will be at ICML in Seoul. If you work on explainable control, causality, mechanistic interpretability of domain-adapted models, or trustworthy AI for safety-critical systems, please come and find me. Conferences are mostly hallway conversations for me, and the hallway is where the most useful arguments tend to happen.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="ICML" /><category term="Explainable AI" /><category term="Announcement" /><summary type="html"><![CDATA[A short note on the acceptance, what the paper is about, and what comes next.]]></summary></entry><entry><title type="html">Reading constraints inside a controller’s language model</title><link href="https://ramesh-arvind.github.io/blog/2026/mechinterp-reading-constraints/" rel="alternate" type="text/html" title="Reading constraints inside a controller’s language model" /><published>2026-04-15T00:00:00+00:00</published><updated>2026-04-15T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2026/mechinterp-reading-constraints</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2026/mechinterp-reading-constraints/"><![CDATA[<p>The earlier post on this blog, the mechanistic interpretability primer, ended on a promise. I said the techniques transfer cleanly to any transformer, including ones that have been adapted to reason about constrained control problems, and that I would write up what we actually find when we run the toolkit on such a model. This is that write-up. It is deliberately concrete and a little narrow, because the most useful thing I can offer here is a worked example, not a survey.</p>

<p>The setup, briefly. We take an open-weights base language model, fine-tune it on a curated corpus of control-theory papers, MPC textbooks, and worked examples of constrained optimisation, and use the resulting model as a reasoning interface inside the explainer described in our ICML 2026 paper. The model never replaces the optimiser. It reads the optimiser’s artefacts, like the active constraints and the multipliers, and turns them into language an operator can interrogate. The question of this post is what is actually inside that model when it does that job.</p>

<h2 id="three-questions-worth-asking">Three questions worth asking</h2>

<p>The mechinterp toolkit is most useful when the question you bring to it is precise. Vague questions like “is the model interpretable” rarely produce useful answers. The three questions I keep coming back to in our setting are these.</p>

<p>Where does the model represent the notion of a constraint being binding. This is a binary distinction the optimiser already produces, and we want to know whether the language model has internalised it as a recognisable structure or whether it is reasoning about active constraints case by case. If there is a clean binding-or-not feature, we can use it as a probe everywhere we need to. If there is not, we are doing something more like prompt engineering than interpretation.</p>

<p>Where does the model represent the prediction horizon. MPC is fundamentally about acting now to prevent something later, and a faithful explanation of an MPC action almost always reaches across time. We want to know whether the language model carries any internal sense of “this is a stage-k consideration” or whether everything collapses into one undifferentiated “future.”</p>

<p>Where does the model store the difference between an objective term and a constraint. From the operator’s point of view, these are the same kind of thing, a number you push around. From the optimiser’s point of view, they are very different. A faithful explainer needs to keep them distinct, and we want to know whether the language model is doing that automatically or only when explicitly prompted.</p>

<h2 id="what-we-find-in-plain-terms">What we find, in plain terms</h2>

<p>For the first question, the answer is mostly yes. Sparse autoencoders trained on the residual stream at middle layers surface features that fire on tokens describing binding constraints across very different surface forms. The same feature lights up on phrases like “the cooling capacity limit is hit,” “the ventilation valve is saturated,” and the more textbook-style “the inequality multiplier is positive.” Ablating this feature degrades the explanations in a specific way, the model still names the constraint correctly but loses the language of it being binding. That pattern of degradation is the kind of evidence the field treats as compelling.</p>
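
<p>For readers who want the mechanics, the ablation above is a small amount of code once you have the SAE weights. The sketch below is deliberately generic: the weights are random stand-ins for a trained autoencoder and the feature index is a placeholder, but the subtract-one-feature’s-contribution operation has exactly this shape.</p>

<pre><code class="language-python">import torch

def sae_features(resid, W_enc, b_enc):
    # Feature activations of a standard ReLU sparse autoencoder.
    return torch.relu(resid @ W_enc + b_enc)                      # [seq, d_sae]

def ablate_feature(resid, W_enc, b_enc, W_dec, feature_id):
    # Remove one feature's contribution from the residual stream.
    acts = sae_features(resid, W_enc, b_enc)                      # [seq, d_sae]
    contribution = acts[:, feature_id:feature_id + 1] @ W_dec[feature_id:feature_id + 1, :]
    return resid - contribution                                   # [seq, d_model]

# Toy shapes; in practice these come from the trained SAE and the hooked model.
d_model, d_sae, seq = 512, 4096, 32
resid = torch.randn(seq, d_model)
W_enc = 0.02 * torch.randn(d_model, d_sae)
b_enc = torch.zeros(d_sae)
W_dec = 0.02 * torch.randn(d_sae, d_model)

patched = ablate_feature(resid, W_enc, b_enc, W_dec, feature_id=123)
print(patched.shape)   # torch.Size([32, 512]); feed this back in and regenerate
</code></pre>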

<p>For the second question, the answer is a soft yes. There is a cluster of features that correlate with stage indices in the prediction horizon, but the cluster is smeared and the features are not as monosemantic as the binding-constraint feature. The model has a notion of “now versus later,” but it does not seem to have crisp internal coordinates for stage-1, stage-2, stage-3, and so on. This is roughly what I expected, and it is consistent with the model having seen many descriptions of horizons but few examples of explicit per-stage reasoning.</p>

<p>For the third question, the answer is the most interesting one, because the answer is “kind of, and the failures are diagnostic.” There are features that distinguish objective contributions from constraint contributions, but they are layer-specific. In early layers the distinction is clean. In middle layers it gets confused, especially on prompts where the user has phrased a soft constraint as a cost penalty. In late layers it cleans up again, but only after the model has used the surrounding context to disambiguate. That layered trajectory is itself a finding. It tells us where in the network the real disambiguation happens, which tells us where to look first when an explanation goes wrong.</p>

<h2 id="what-this-is-good-for">What this is good for</h2>

<p>None of the above is meant to be a self-contained scientific result. The point is methodological. If you treat a domain-adapted language model as a black box that “either explains things well or it does not,” you have nothing to do when it fails except retrain it. If you treat it as a network with features and circuits that you can name, probe, and ablate, every failure becomes a localisable bug. The binding-constraint feature is robust. The horizon features are smeared. The objective-versus-constraint distinction is layer-dependent. Each of those statements is actionable in a way “the model sometimes hallucinates” is not.</p>

<p>The other thing this work is good for, and the reason it is on this blog rather than in a paper yet, is that it gives a concrete answer to a question I get a lot. The question is whether mechanistic interpretability is “ready for engineering.” My honest answer has been “yes for transformers, with caveats, and only if you ask precise questions.” The work above is the version of that answer with receipts. Sparse autoencoders, residual-stream probes, and targeted ablations are not magic, but they are tractable, and on a model that has been trained on a domain you understand, they tell you things you can use.</p>

<h2 id="what-is-next">What is next</h2>

<p>The next step is to use the binding-constraint feature as a hard signal inside the explainer itself, rather than as a diagnostic on the side. If the language model has a clean internal indicator of when a constraint is binding, we should be able to lift that indicator out and use it as a verification check on the explanations the model produces, in addition to the KKT-based check the optimiser already gives us. Two independent signals on the same property is the kind of redundancy that turns a research prototype into something a practitioner can trust.</p>
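
<p>To make the idea concrete, here is the shape of that redundancy check, with every number a placeholder. The optimiser’s side of the test is the usual complementary-slackness reading of the multiplier and the slack; the model’s side is whether the binding-constraint feature fires above a threshold. Agreement is cheap to verify, and disagreement is exactly the case you want surfaced.</p>

<pre><code class="language-python">def optimiser_says_binding(multiplier, slack, tol=1e-6):
    # KKT evidence: positive multiplier and (numerically) zero slack.
    return multiplier &gt; tol and abs(slack) &lt; tol

def model_says_binding(feature_activation, threshold=0.5):
    # Internal evidence: the SAE binding-constraint feature fires strongly.
    return feature_activation &gt; threshold

def cross_check(feature_activation, multiplier, slack):
    if model_says_binding(feature_activation) == optimiser_says_binding(multiplier, slack):
        return "agree"
    return "flag for review"   # the disagreement is the auditable, interesting case

print(cross_check(feature_activation=0.9, multiplier=0.4, slack=0.0))   # agree
</code></pre>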

<p>I will write that up when we have the numbers. The mechinterp primer remains the right starting point if you are coming to this fresh, and the <a href="/blog/2026/hierarchical-causal-abduction-walkthrough/">paper-walkthrough post</a> is the right next read if you want to see how this language-model interpretability work fits into the larger framework.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Interpretability" /><category term="LLM" /><category term="Control" /><category term="Mechanistic Interpretability" /><summary type="html"><![CDATA[A follow-up to the mechanistic interpretability primer, focused on what the standard toolkit actually finds when applied to a domain-adapted LLM that reasons about constrained control problems.]]></summary></entry><entry><title type="html">Evaluating explanations: why LIME and SHAP are not enough</title><link href="https://ramesh-arvind.github.io/blog/2025/evaluating-explanations-beyond-shap/" rel="alternate" type="text/html" title="Evaluating explanations: why LIME and SHAP are not enough" /><published>2025-12-19T00:00:00+00:00</published><updated>2025-12-19T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/evaluating-explanations-beyond-shap</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/evaluating-explanations-beyond-shap/"><![CDATA[<p>LIME and SHAP have done the field a great service. They made
explainability operational. Before them, “the model is interpretable”
was a vibe. After them, it was a number you could put in a paper.
That was real progress.</p>

<p>The cost of that progress is that we now treat the number as the
goal. A SHAP value is a measure of marginal contribution under a
specific game-theoretic assumption. It is one slice of what an
explanation could mean. Used as the only slice, it misleads.</p>

<p>Three axes are missing from the SHAP-shaped picture, and any serious
evaluation of an explanation system has to address all three.</p>

<p><strong>Faithfulness.</strong> Does the explanation describe what the model
actually did, or does it describe what a simpler surrogate model
would have done in its place? LIME explicitly fits a local linear
surrogate, which is honest but limits the faithfulness ceiling. SHAP
estimates contributions under feature-coalition reasoning, which
makes specific assumptions about feature independence that are
routinely violated. For a deployed control system the right
faithfulness test is operational, not statistical. Take the
explanation, perturb the input in the way the explanation says
matters, and check that the model’s output changes the way the
explanation predicts. If it does not, the explanation is not
faithful, regardless of what its SHAP values say.</p>
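
<p>A minimal sketch of that operational test, assuming the explanation can be reduced to “this feature matters, in this direction.” The toy model and the feature indices are illustrative; the point is that the test is a perturbation and a sign check, nothing more.</p>

<pre><code class="language-python">import numpy as np

def faithfulness_check(model, x, feature_idx, expected_sign, delta=0.1):
    # Perturb the feature the explanation names; check the output moves
    # in the direction the explanation predicts.
    x_perturbed = x.copy()
    x_perturbed[feature_idx] += delta
    change = model(x_perturbed) - model(x)
    return np.sign(change) == expected_sign

# Toy model: the output depends on feature 0 and ignores feature 1.
toy_model = lambda z: 2.0 * z[0] + 0.0 * z[1]
x = np.array([1.0, 5.0])

print(faithfulness_check(toy_model, x, feature_idx=0, expected_sign=1))   # True
print(faithfulness_check(toy_model, x, feature_idx=1, expected_sign=1))   # False
</code></pre>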

<p><strong>Stability.</strong> Does a small change in input produce a small change
in explanation? In safety-critical settings, an explanation that
flips between contradictory stories on neighbouring inputs is
dangerous. It trains operators to ignore the explanation entirely.
Stability is testable. Sample neighbouring inputs, generate
explanations, and measure the distance between explanations under a
sensible metric. If the distance is high while the model output
barely moved, the explanation system is unstable, and you have a
problem.</p>
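
<p>And the corresponding sketch for stability, again with stand-ins for the real model and attribution method. The diagnostic is the ratio of the two numbers it returns: large explanation movement against small output movement is the warning sign.</p>

<pre><code class="language-python">import numpy as np

def stability_probe(model, explain, x, n_neighbours=20, eps=0.01, seed=0):
    rng = np.random.default_rng(seed)
    base_expl = explain(model, x)
    expl_dist, out_dist = [], []
    for _ in range(n_neighbours):
        x_nb = x + eps * rng.standard_normal(x.shape)
        expl_dist.append(np.linalg.norm(explain(model, x_nb) - base_expl))
        out_dist.append(abs(model(x_nb) - model(x)))
    return float(np.mean(expl_dist)), float(np.mean(out_dist))

# Toy linear model with a gradient-times-input style attribution.
toy_model = lambda z: float(z @ np.array([1.0, -2.0]))
toy_explain = lambda m, z: z * np.array([1.0, -2.0])

print(stability_probe(toy_model, toy_explain, np.array([0.5, 0.3])))
</code></pre>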

<p><strong>Operator utility.</strong> Does the explanation actually help a human do
their job better, or does it just feel informative? This is the test
that gets skipped most often, because it is the most expensive. It
needs a study with real operators, real tasks, and a measurable
outcome. Decision time. Override rate. Detection of induced faults.
The literature on this is thin and mostly comes from medical AI.
Control systems need their own version of this work, and not enough
of it exists yet.</p>

<p>The methods we developed in our 2025 papers were tested on the first
two axes during paper review. The third axis, operator utility, is
where I want the next chunk of work to live. It is harder, slower,
and less publishable per unit time. It is also the only axis that
matters when the system is actually running in a glasshouse with a
real grower making real decisions.</p>

<p>A short note on tooling. SHAP is not the enemy. I still use it as one
diagnostic among several, and I would still default to it for tabular
models in low-stakes settings. The mistake is to treat it as the only
diagnostic, the way too many papers do. The right stance is closer to
how a control engineer thinks about Bode plots: useful, well
understood, decisive in some questions, silent on others. You would
not certify a controller on a Bode plot alone. You should not certify
an explanation system on SHAP alone.</p>

<p>What I would like to see in the next year of explainability papers,
in roughly priority order. More operator-in-the-loop studies. More
stability analyses, especially in time-series and control settings.
More work on explanation methods that are faithful by construction,
like our optimiser-grounded approach, instead of faithful by
post-hoc approximation. And less benchmarking against MNIST.</p>

<p>The benchmark for an explanation is whether it changes a human
decision in the right direction. Everything else is a proxy.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Explainability" /><category term="Evaluation" /><category term="SHAP" /><summary type="html"><![CDATA[Faithfulness, stability, and operator utility, three axes the standard tools do not measure.]]></summary></entry><entry><title type="html">Symbolic constraints, optimisation, and what LLMs miss</title><link href="https://ramesh-arvind.github.io/blog/2025/symbolic-constraints-llms-miss/" rel="alternate" type="text/html" title="Symbolic constraints, optimisation, and what LLMs miss" /><published>2025-11-28T00:00:00+00:00</published><updated>2025-11-28T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/symbolic-constraints-llms-miss</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/symbolic-constraints-llms-miss/"><![CDATA[<p>Ask a modern frontier model to state the Karush-Kuhn-Tucker
conditions for a constrained optimisation problem. It will give you
a clean answer. Stationarity, primal feasibility, dual feasibility,
complementary slackness. It can recite the textbook. Ask it to
identify the active set in a small numerical example, and it
sometimes gets that right too.</p>

<p>Now embed the same problem inside a control loop, give the model the
sensor readings and the cost weights and the constraint bounds, and
ask it to predict which constraint will become binding at the next
sample period. The accuracy collapses.</p>

<p>This is not a knowledge gap. The model knows the math. It is a
reasoning gap, and it is structural.</p>

<p>Optimisation reasoning has a particular shape that does not match how
language models compute. Three patterns make this concrete.</p>

<p>The first pattern is global feasibility. To know whether a candidate
solution is feasible, you have to evaluate every constraint, not just the
ones that look relevant. Language models are very good at attending to
the most relevant tokens, which is the wrong attention pattern for
feasibility checking. They will quietly skip over a constraint that
looks numerically uninteresting and miss exactly the one that
matters.</p>
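
<p>In code, the correct behaviour is almost embarrassingly simple, which is part of the point: feasibility is a conjunction over all constraints, not a ranking of the interesting-looking ones. The constraint functions below are placeholders, written in the standard g(x) ≤ 0 form.</p>

<pre><code class="language-python">def feasible(x, constraint_fns, tol=1e-8):
    # Every g_i(x) must satisfy g_i(x) &lt;= tol; there is no "relevant subset".
    return all(g(x) &lt;= tol for g in constraint_fns)

# Toy constraints on a two-dimensional decision.
constraint_fns = [
    lambda x: x[0] + x[1] - 4.0,    # shared capacity
    lambda x: -x[0],                # x0 nonnegative
    lambda x: -x[1],                # x1 nonnegative
]

print(feasible([2.5, 1.5], constraint_fns))   # True
print(feasible([3.0, 2.0], constraint_fns))   # False: the capacity limit is violated
</code></pre>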

<p>The second pattern is the active set. In a constrained optimum, only
some constraints are tight. Identifying the active set is the central
combinatorial step in QP and NLP solvers, and there are mature
algorithms for it. Asking an LLM to do this implicitly, by reasoning
through it in natural language, is asking it to simulate a solver. It
can do this for very small problems. It does not scale. The error
mode is interesting: the model picks a plausible-looking active set,
then writes a confident justification for it, regardless of whether
the active set is actually correct.</p>

<p>The third pattern is the duality argument. KKT logic flows in both
directions. From the primal you can reason about the dual, and the
dual gives you the shadow prices that explain the primal. Language
models tend to flatten this into a single direction. They will
explain a primal decision in primal terms (we did X because the cost
of X was lowest) and skip the dual reasoning (we did X because the
shadow price on the constraint that would have ruled out Y was
higher than the shadow price on the constraint that would have ruled
out X). The dual story is often the more useful one for an operator,
and it is the one most likely to be lost.</p>

<p>These three patterns are not unique to LLMs. They show up in any
system that tries to reason about optimisation without actually
solving the optimisation. The difference is that a numerical solver
will tell you when it cannot find a feasible point. An LLM will
generate a fluent paragraph that sounds like it found one.</p>

<p>The engineering response, in our work, is to never ask the LLM to do
the optimisation reasoning by itself. The optimiser does the
optimisation. The LLM reads the optimiser’s output, the active set,
the dual variables, the slack values, and translates that into a
human-readable explanation. The LLM is a translator, not a solver.</p>
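
<p>Here is what that split looks like in miniature, using cvxpy on a toy quadratic programme. The problem itself is meaningless; the extraction pattern is the point. The solver produces the active set, the multipliers, and the slacks, and those are the artefacts the language layer is given to translate.</p>

<pre><code class="language-python">import cvxpy as cp
import numpy as np

x = cp.Variable(2)
cost = cp.sum_squares(x - np.array([3.0, 2.0]))   # toy objective: get close to (3, 2)

# Constraints in g_i(x) &lt;= 0 form, so slack_i = -g_i(x*) at the solution.
g = [x[0] + x[1] - 4.0,    # shared capacity (this one will bind)
     -x[0],                # x0 nonnegative
     -x[1]]                # x1 nonnegative
constraints = [gi &lt;= 0 for gi in g]
prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()

for name, gi, con in zip(["capacity", "x0_bound", "x1_bound"], g, constraints):
    slack = -float(gi.value)
    dual = float(con.dual_value)
    binding = slack &lt; 1e-6 and dual &gt; 1e-6     # complementary slackness, numerically
    print(f"{name}: slack={slack:.3f}, multiplier={dual:.3f}, binding={binding}")
# Expected: the capacity constraint binds with multiplier 1.0; the bounds do not.
</code></pre>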

<p>Once you accept that split, a lot of the disappointment with LLMs in
optimisation contexts goes away. The model is being used inside its
competence, on the linguistic and compositional side. The numerical
heavy lifting stays where it has always been good, in the solver.</p>

<p>The interesting research question that remains is the one in the
middle. Can a language model, given access to a solver as a tool,
reliably decide when to call the solver, what problem to pose to it,
and how to interpret its output? That is a non-trivial reasoning
problem in its own right, and it is closer to where the field is
going than the “just prompt the model harder” line.</p>

<p>I am cautiously optimistic about that direction. I am not optimistic
about LLMs as standalone optimisers, and I do not think any amount of
scaling alone fixes the three patterns above.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Optimisation" /><category term="LLM" /><category term="Control" /><summary type="html"><![CDATA[Why a model that can quote the textbook on KKT conditions still cannot reliably reason about them.]]></summary></entry><entry><title type="html">From causal discovery to causal reasoning</title><link href="https://ramesh-arvind.github.io/blog/2025/causal-discovery-to-causal-reasoning/" rel="alternate" type="text/html" title="From causal discovery to causal reasoning" /><published>2025-11-07T00:00:00+00:00</published><updated>2025-11-07T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/causal-discovery-to-causal-reasoning</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/causal-discovery-to-causal-reasoning/"><![CDATA[<p>Causal discovery is having a moment. Constraint-based methods like PC
and FCI, score-based methods like NOTEARS, and time-series methods like
PCMCI and Granger-style approaches are all in active use. Given
enough data and a tolerable set of assumptions, you can recover a
plausible directed graph over your variables. The literature treats
that graph as the output.</p>

<p>In a control setting, the graph is the input. The output is a
decision an operator will live with.</p>

<p>That gap, between having a causal graph and using it to reason, is
where most of the engineering effort actually goes, and it is
underexposed in the literature.</p>

<p>Three problems show up the moment you try to deploy a discovered
graph in a working system.</p>

<p>The first problem is that the graph is wrong. Not catastrophically
wrong, just wrong in the way real models are wrong. An edge points
the wrong direction. Two variables that should be connected are not.
A latent confounder shows up as a spurious direct link. If
you take the graph at face value and feed it to a downstream
reasoning system, the system will produce confidently wrong answers.
The honest fix is to never let the graph stand alone. It needs a
domain expert in the loop, and the system has to make it cheap for
that expert to inspect, edit, and version the graph.</p>

<p>The second problem is that the graph is static and the world is not.
Greenhouse dynamics in summer are not the same as in winter. The
right causal structure for tomato in flowering is not the same as in
fruiting. A single discovered graph collapses time-varying causal
structure into one frozen picture. You can address this with regime
detection, with windowed discovery, with hierarchical graphs that
distinguish slow-varying structure from fast-varying parameters. None
of these fixes are free. They all add complexity and they all need
their own validation story.</p>

<p>The third problem, the deepest one, is that even a correct,
up-to-date graph does not tell you what to do. It tells you how
variables relate. The leap from “ventilation flap causes humidity”
to “should I open the ventilation flap right now” is a planning
problem, not a discovery problem. The graph is a constraint on the
planner, not a substitute for it.</p>

<p>Our 2025 <em>Frontiers in Agronomy</em> paper sits in this gap. We use
constraint-based causal discovery on greenhouse sensor streams to
recover a graph over climate, plant, and control variables. Then we
expose that graph to an LLM as a structured reference. The LLM does
not do causal discovery. It reads the discovered graph and uses it as
a scaffold for plain-language recommendations that a grower can
follow.</p>
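
<p>The hand-off between the two components is mundane in the best sense. A minimal sketch, with made-up edges rather than actual PCMCI output: the discovered graph is kept as explicit lagged triples and serialised into a reference block the prompt includes, so the model reads relations rather than inventing them.</p>

<pre><code class="language-python"># Illustrative lagged edges; a real run would produce these from data.
discovered_edges = [
    ("ventilation_opening", "humidity", 1),          # lag in sample periods
    ("heating_power", "air_temperature", 1),
    ("air_temperature", "transpiration", 2),
    ("transpiration", "humidity", 1),
]

def graph_as_reference(edges):
    lines = ["Causal relations discovered from recent operating data:"]
    for cause, effect, lag in edges:
        lines.append(f"- {cause} affects {effect} after {lag} sample period(s)")
    return "\n".join(lines)

print(graph_as_reference(discovered_edges))
</code></pre>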

<p>The split matters. The causal-discovery algorithm is good at finding
relations from data. It is not good at deciding what to do with them.
The LLM is good at composing readable, contextual recommendations.
It is not good at separating correlation from causation. Putting them
in series, with the graph as the bridge, lets each component do what
it is actually good at.</p>

<p>There is an interesting open question hiding in this setup. How
should the LLM handle disagreement with the graph? In some queries
the model’s pretraining tells it one thing and the discovered graph
tells it another. The conservative answer is “always defer to the
graph.” The more useful answer is probably “flag the disagreement,
explain both views, let the operator decide.” The right policy is
not obvious and we are still learning it.</p>

<p>The bigger picture. Causal discovery alone is not the goal. It is
one tool in a longer pipeline that ends with a human making a
decision. The papers that move the field forward in the next few
years will, I think, be the ones that take that pipeline seriously
end to end.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Causality" /><category term="LLM" /><category term="Reasoning" /><summary type="html"><![CDATA[Discovering a graph is the easy half. Reasoning over it is the rest of the problem.]]></summary></entry><entry><title type="html">Domain knowledge graphs as scaffolds for LLM reasoning</title><link href="https://ramesh-arvind.github.io/blog/2025/knowledge-graphs-llm-scaffolds/" rel="alternate" type="text/html" title="Domain knowledge graphs as scaffolds for LLM reasoning" /><published>2025-10-17T00:00:00+00:00</published><updated>2025-10-17T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/knowledge-graphs-llm-scaffolds</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/knowledge-graphs-llm-scaffolds/"><![CDATA[<p>Retrieval-augmented generation is the default answer to “how do I make
an LLM stop hallucinating.” Index your documents, retrieve the top-k
chunks, stuff them into the context window, and let the model
generate. It works surprisingly well on broad domains, customer
support, legal search, internal wikis. It works much less well on
narrow, technical, control-heavy domains. There is a structural
reason for this, and it points to a different design.</p>

<p>Retrieval over a corpus assumes that the right answer is somewhere in
the corpus, expressed in roughly the right words. In a narrow domain
like greenhouse climate control, the right answer is almost never
expressed in the corpus. The corpus has fragments. It has a paper on
vapour-pressure deficit, a manual on a specific climate computer, a
PhD thesis on tomato transpiration, an FAQ on dehumidification. The
operator’s actual question, “why did the controller open the
ventilation flap right now,” is a composition of those fragments, and
the composition is the hard part.</p>

<p>Knowledge graphs are good at exactly that compositional layer. A
graph is a set of entities and a set of typed relations between them.
For a greenhouse, the entities are sensors, actuators, climate
variables, plant physiology states, and constraints. The relations
are things like “ventilation flap actuator influences humidity
variable,” “humidity variable affects fungal-disease risk,”
“fungal-disease risk is constrained below threshold X for cultivar
Y.” That is a small graph, a few hundred nodes, a few thousand edges.
You can build it by hand with a domain expert in two afternoons.</p>
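
<p>To make the scale concrete, here is a toy fragment of such a graph in networkx. The nodes and relation names below are illustrative, not our actual ontology, but the whole thing really is this mundane: typed nodes, typed edges, nothing clever.</p>

<pre><code class="language-python">import networkx as nx

kg = nx.MultiDiGraph()
kg.add_node("ventilation_flap", kind="actuator")
kg.add_node("humidity", kind="climate_variable")
kg.add_node("fungal_disease_risk", kind="plant_state")
kg.add_node("humidity_upper_bound", kind="constraint")

kg.add_edge("ventilation_flap", "humidity", relation="influences")
kg.add_edge("humidity", "fungal_disease_risk", relation="affects")
kg.add_edge("humidity_upper_bound", "humidity", relation="constrains")

for u, v, data in kg.edges(data=True):
    print(u, data["relation"], v)
</code></pre>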

<p>The interesting move is what you do with the graph at inference time.</p>

<p>Naive use is bad. If you simply dump the entire graph into the
context window, you have just made the LLM read a long, structured
document, and you are back to the corpus-retrieval problem with extra
steps.</p>

<p>The right use is constrained. The graph becomes a vocabulary that the
LLM is allowed to talk about. When the model generates an
explanation, it is required to express the explanation in terms of
graph nodes and edges. Anything that does not reduce to the graph is
flagged as ungrounded. The graph is not extra context. It is a
contract about what the model is allowed to claim.</p>

<p>This is the move our 2025 <em>Smart Agricultural Technology</em> paper
makes. We pair a model predictive controller with an LLM, and the LLM
is forced to stay inside the domain knowledge graph when it explains
a control action. The controller decides what to do. The graph
decides what the explanation is allowed to say. The LLM does the
linguistic gluing.</p>

<p>Three things become easier once you do this.</p>

<p><strong>Verification.</strong> You can check that every entity and every relation
in the explanation actually exists in the graph. If the model invents
a new variable, the verifier catches it. This eliminates a whole
class of confident-sounding hallucinations.</p>
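
<p>Reusing the toy graph from the sketch above, the verifier itself is a few lines. The claim triples stand in for whatever structured output the model is prompted to emit alongside its prose; anything that does not resolve to a node and an edge comes back as ungrounded.</p>

<pre><code class="language-python">def has_relation(kg, subj, relation, obj):
    data = kg.get_edge_data(subj, obj) or {}
    return any(attrs.get("relation") == relation for attrs in data.values())

def ungrounded_claims(claims, kg):
    # claims: (subject, relation, object) triples extracted from the explanation.
    bad = []
    for subj, rel, obj in claims:
        if subj not in kg or obj not in kg or not has_relation(kg, subj, rel, obj):
            bad.append((subj, rel, obj))
    return bad

claims = [("ventilation_flap", "influences", "humidity"),       # in the graph
          ("ventilation_flap", "influences", "leaf_wetness")]   # not in the graph
print(ungrounded_claims(claims, kg))   # [('ventilation_flap', 'influences', 'leaf_wetness')]
</code></pre>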

<p><strong>Editing.</strong> When the domain expert disagrees with an explanation,
they can change the graph. They cannot easily change a 70-billion-parameter
language model. The graph gives the human a steering wheel that the
model cannot ignore.</p>

<p><strong>Cross-domain reuse.</strong> The LLM stays the same. The graph swaps. Move
from greenhouse to building HVAC and you swap the entities and
relations, you do not retrain anything.</p>

<p>The cost is real. Building the graph is the unglamorous part of the
work. It needs domain interviews, careful ontology decisions, and
upkeep as the underlying plant changes. It also caps the system’s
expressivity at whatever the graph contains. If the graph does not
have a node for “leaf wetness,” the system cannot explain in terms of
leaf wetness, even if the underlying physics involves it. That is a
feature, not a bug, in safety-critical contexts. The system fails
visibly, in the graph, where a human can see it, rather than
invisibly, inside a transformer, where they cannot.</p>

<p>The pattern generalises beyond control. Any domain where the corpus
is sparse, the variables are well-defined, and the cost of
hallucination is high, is a domain where knowledge graphs as
scaffolds beat retrieval over text. Medicine, manufacturing, energy
systems, all of them fit. The trick is having the patience to build
the graph.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Knowledge Graphs" /><category term="LLM" /><category term="Grounding" /><summary type="html"><![CDATA[Why a small, hand-built graph beats a large, retrieved corpus in narrow domains.]]></summary></entry><entry><title type="html">Mechanistic interpretability for non-NLP people, a primer</title><link href="https://ramesh-arvind.github.io/blog/2025/mechinterp-for-non-nlp/" rel="alternate" type="text/html" title="Mechanistic interpretability for non-NLP people, a primer" /><published>2025-09-26T00:00:00+00:00</published><updated>2025-09-26T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/mechinterp-for-non-nlp</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/mechinterp-for-non-nlp/"><![CDATA[<p>If you work in control, robotics, or any other field where neural
networks are used as components rather than as the whole product, the
mechanistic interpretability literature looks intimidating. There is
a vocabulary problem. Circuits, features, superposition, induction
heads, monosemanticity, sparse autoencoders, probing, patching. Each
of these is a real concept with a real reason to exist, but the way
the field talks about them assumes you have read the last three years
of papers on transformer internals.</p>

<p>This post is a translation. It is the version I wish had existed when
I started reading this literature seriously.</p>

<p>Start with the basic question. Mechanistic interpretability is not
asking “which input features mattered for this output.” That is the
SHAP question, and it has a different shape. Mechanistic
interpretability asks “which internal computations did the model
actually run to get from input to output.” It treats the network as a
program and tries to reverse-engineer the program.</p>

<p>The unit of analysis is the circuit. A circuit is a small subset of
neurons, attention heads, and connections that together implement a
recognisable computation. The classic example is the induction head,
a two-attention-head circuit in transformers that implements pattern
completion of the form “A B … A -&gt; B.” Once you know the circuit
exists, you can find it, ablate it, and watch the model fail at the
task. That is the kind of evidence the field treats as compelling.</p>

<p>Three concepts do most of the work.</p>

<p><strong>Features.</strong> A feature is a direction in activation space that
corresponds to a human-interpretable concept. “This input mentions
the Eiffel Tower.” “This token is the first one of a list.”
“Temperature is rising.” Features are the alphabet of the network’s
internal language.</p>

<p><strong>Superposition.</strong> Networks have far more features than neurons. They
solve this by storing features in overlapping linear combinations.
This is why you cannot just look at one neuron and read off “this
neuron is the temperature neuron.” It almost never works that way.
The temperature feature is spread across many neurons, and many other
features share those same neurons.</p>

<p><strong>Sparse autoencoders.</strong> The current best tool for getting around
superposition. You train a wide, sparse autoencoder on the model’s
activations and let it discover an over-complete basis. Each basis
direction is a candidate feature. With enough scale, many of those
features turn out to be human-interpretable.</p>
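
<p>The idea fits in a few lines of PyTorch. This is the generic recipe, not any particular published implementation: an over-complete ReLU dictionary trained to reconstruct activations under an L1 sparsity penalty, with shapes and coefficients that are placeholders rather than tuned values.</p>

<pre><code class="language-python">import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)   # over-complete: d_sae is several times d_model
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, acts):
        features = torch.relu(self.enc(acts))  # candidate features, mostly zero
        return self.dec(features), features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    recon_loss = (recon - acts).pow(2).mean()  # reconstruct the activation
    sparsity = features.abs().mean()           # push most feature activations to zero
    return recon_loss + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=512, d_sae=8192)
acts = torch.randn(64, 512)                    # a batch of residual-stream activations
recon, feats = sae(acts)
print(sae_loss(acts, recon, feats).item())
</code></pre>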

<p>Now the part that matters for non-NLP people.</p>

<p>Almost every mechanistic interpretability technique developed for
language models transfers to any transformer. If you have a
transformer-based controller, a transformer-based world model, a
transformer-based perception module, the same toolkit applies. You
can probe activations, find features, identify circuits, ablate them,
and check whether your causal story holds.</p>
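
<p>The cheapest entry point is the linear probe, and it is worth seeing how little machinery it needs. The sketch below uses synthetic “activations” with a property planted along one direction; with a real model you would swap in cached activations and a label you actually care about.</p>

<pre><code class="language-python">import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n = 256, 2000
direction = rng.standard_normal(d_model)              # a planted "feature direction"
X = rng.standard_normal((n, d_model))                 # stand-in for cached activations
y = (X @ direction &gt; 0).astype(int)                   # property, linearly readable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(probe.score(X_te, y_te))   # near 1.0 here; on a real model this is the finding
</code></pre>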

<p>What does not transfer cleanly is the intuition. Language models have
a token vocabulary and a discrete, compositional structure that makes
features feel natural. A controller that reads continuous sensor
inputs does not have tokens. The “alphabet” is harder to even define.
That does not mean features are absent. It means we have to do more
work to find them.</p>

<p>This is where my current work goes. I take a domain-adapted language
model that has been trained to reason about constrained control
problems, and I ask the standard mechinterp questions of it. Which
features encode the notion of a constraint being binding? Which
circuits handle the prediction horizon? Where does the model store
the difference between an objective and a constraint? Some of these
questions have promising preliminary answers. The full version is in
review and I will write about it once it is out.</p>

<p>The takeaway for now is not “go read 40 papers.” The takeaway is this.
Mechanistic interpretability is a tractable engineering discipline,
not a philosophy. It has its own tools, its own evidence standards,
and its own failure modes. If you are deploying a neural network in a
loop with humans or hardware, knowing what is inside it is a
reasonable engineering expectation, not a research luxury.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Interpretability" /><category term="LLM" /><category term="Primer" /><summary type="html"><![CDATA[What the interpretability community has actually figured out, translated for engineers.]]></summary></entry><entry><title type="html">Why LLMs need to explain themselves: a control-systems perspective</title><link href="https://ramesh-arvind.github.io/blog/2025/llms-explain-themselves-control/" rel="alternate" type="text/html" title="Why LLMs need to explain themselves: a control-systems perspective" /><published>2025-09-05T00:00:00+00:00</published><updated>2025-09-05T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/llms-explain-themselves-control</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/llms-explain-themselves-control/"><![CDATA[<p>There is a quiet assumption in most modern AI deployments. The model
makes a decision, the operator accepts the decision, and the
explanation, if anyone bothers, is generated afterwards by a separate
post-hoc tool. SHAP, LIME, attention maps, the usual suspects. The
explanation is treated as a kind of receipt. Optional. Cosmetic.</p>

<p>In a control system this assumption falls apart immediately.</p>

<p>A controller is not a one-shot predictor. It runs in a loop. Every
sample period it picks an action, the plant moves, sensors update, the
controller picks the next action. The grower, the operator, the plant
manager, are not external auditors who occasionally check on it. They
are inside the loop. They override it, they retune it, they switch it
off when something feels wrong. The explanation is the channel through
which the human and the controller stay in sync. If the channel is
broken, the loop is broken.</p>

<p>That single observation reframes the whole explainability problem.
Explanations stop being a UX nicety. They become a control-loop
requirement, with the same status as observability or robustness
margins.</p>

<p>A few consequences follow.</p>

<p>First, latency matters. A 30-second explanation is useless when the
sample period is one minute. The explanation has to live on the same
timescale as the decision, otherwise the operator falls back to the
old habit of trusting their gut and ignoring the controller.</p>

<p>Second, faithfulness matters more than fluency. A confident,
plausible-sounding explanation that does not actually reflect what the
optimiser did is worse than no explanation at all. It teaches the
operator a wrong mental model of the plant. Every time the LLM smooths
over a constraint that was actually binding, it widens the gap between
the human’s mental model and the real system. That gap is where
accidents happen.</p>

<p>Third, the explanation has to be grounded in something. Free-form text
out of a base LLM is not grounded in anything except its training
corpus. In a control loop the natural grounding is the optimiser
itself: the constraints, the cost terms, the active set, the
prediction horizon, the reference signal. The LLM should be reading
from those, not from a vibe.</p>

<p>Fourth, the explanation should be testable. If a controller says “I
turned the heater down because the predicted humidity was about to hit
the upper bound,” that statement is either true or false in the
optimisation problem. We can check it. Explanations that are
not checkable are not engineering artefacts, they are decoration.</p>
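
<p>Checkable means something like this. The claim in the example above names a constraint, a predicted variable, and an action, so the test is a comparison against the optimiser’s own predicted trajectory and bounds. The numbers below are placeholders, but the shape of the check is the whole point.</p>

<pre><code class="language-python">import numpy as np

def claim_holds(predicted_humidity, humidity_upper_bound, heater_now, heater_prev,
                margin=0.02):
    near_bound = np.max(predicted_humidity) &gt;= humidity_upper_bound - margin
    heater_reduced = heater_now &lt; heater_prev
    return near_bound and heater_reduced

predicted_humidity = np.array([0.78, 0.82, 0.86, 0.89])   # relative humidity over the horizon
print(claim_holds(predicted_humidity, humidity_upper_bound=0.90,
                  heater_now=0.3, heater_prev=0.6))        # True: the claim is consistent
</code></pre>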

<p>When you stack those four requirements, latency, faithfulness,
grounding, and testability, you end up in a very specific design
corner. Post-hoc methods like SHAP cannot satisfy faithfulness because
they approximate a different model. Pure prompt-engineered LLMs cannot
satisfy grounding because they do not have privileged access to the
optimiser. Attention maps cannot satisfy testability because there is
no map from “this attention head lit up” to “this constraint was
binding.”</p>

<p>What does fit is a tighter coupling: an LLM that reads structured
state from the optimiser, a domain knowledge graph that constrains
which entities and relations the LLM is allowed to talk about, and a
verifier that checks every generated explanation back against the
optimisation problem before it is shown to a human.</p>

<p>That stack is what our 2025 <em>Smart Agricultural Technology</em> paper
prototypes. The greenhouse is incidental. The same stack would apply
to any plant where decisions are made by an optimiser and consumed by
a human. Building HVAC, district heating, autonomous trains,
fuel-cell stacks, all of them have the same shape.</p>

<p>The deeper claim is this. We have spent ten years arguing about
whether neural networks should be allowed near safety-critical
control. The right question turns out to be different. The question
is whether they can stay in the loop with humans, not whether they
can replace them. And that question is an explainability question
first.</p>

<p>Most of what comes next on this blog is about the engineering of that
loop.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Explainability" /><category term="Control" /><category term="LLM" /><summary type="html"><![CDATA[Explanations are not a UX feature. They are a control-loop requirement.]]></summary></entry><entry><title type="html">How we made greenhouse controllers explain themselves</title><link href="https://ramesh-arvind.github.io/blog/2025/explainable-greenhouse-control/" rel="alternate" type="text/html" title="How we made greenhouse controllers explain themselves" /><published>2025-08-15T00:00:00+00:00</published><updated>2025-08-15T00:00:00+00:00</updated><id>https://ramesh-arvind.github.io/blog/2025/explainable-greenhouse-control</id><content type="html" xml:base="https://ramesh-arvind.github.io/blog/2025/explainable-greenhouse-control/"><![CDATA[<p>If you have ever stood in front of an industrial controller and asked
“why did you just do that?”, you already know the problem this post is about.</p>

<p>Modern greenhouses run on model predictive control. The controller looks a few
hours into the future, predicts what the plants and the building will do under
different ventilation, lighting, irrigation, and CO₂ choices, and picks the
sequence of actions that minimizes a cost function. It is genuinely good at
its job. The trouble is that the cost function does not speak English, and the
grower does.</p>

<p>This is the gap our two 2025 papers were trying to close, from two different
sides.</p>

<h2 id="paper-one-read-the-setpoint-before-you-criticize-the-controller">Paper one: read the setpoint before you criticize the controller</h2>

<p>Before you can explain a controller’s actions, you have to understand what it
was being asked to do. In greenhouses, that “ask” is a reference trajectory
for temperature, humidity, and CO₂ that shifts across the day and the season.
These trajectories are not flat. They have ramps, dwell periods, plateaus,
diurnal cycles, slow seasonal drifts, and the occasional anomaly when someone
opened a vent at the wrong time.</p>

<p>Our <em>Frontiers in Agronomy</em> paper, <em>Automated analysis of reference signals</em>,
is essentially a piece of plumbing nobody had built carefully: an automated
pipeline that takes a real reference trajectory and decomposes it into the
components a human grower would describe if you asked them to. Diurnal pattern
here, weekly trend there, this hour is the ramp into night setback, that hour
is a recovery from a humidity excursion. The output is a structured,
operator-readable description of the setpoint, not a black-box model of it.</p>
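
<p>To give a flavour of that first decomposition step, here is a deliberately simplified stand-in using a classical seasonal decomposition on a synthetic setpoint. It is not the paper’s pipeline, but it shows the kind of separation the pipeline starts from: diurnal pattern, slow drift, and a residual where the excursions end up.</p>

<pre><code class="language-python">import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

samples_per_day, days = 96, 14        # 15-minute samples over two weeks
t = np.arange(samples_per_day * days)
setpoint = (20.0
            + 3.0 * np.sin(2 * np.pi * t / samples_per_day)   # diurnal cycle
            + 0.05 * t / samples_per_day                       # slow drift
            + np.where(t == 500, 2.5, 0.0))                    # one anomalous excursion

parts = seasonal_decompose(pd.Series(setpoint), model="additive", period=samples_per_day)

# parts.seasonal, parts.trend and parts.resid are the operator-readable components;
# the excursion around t = 500 lands in the residual, not in the diurnal pattern.
print(float(parts.resid.dropna().abs().max()))
</code></pre>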

<p>The reason this matters is that any explanation of controller behaviour is
only as good as the description of what the controller was being asked to
track. Without this layer, you end up explaining noise. With it, you can have
a real conversation about whether the setpoint design itself was reasonable.</p>

<p><a href="https://doi.org/10.3389/fagro.2025.1536998">Read the paper.</a></p>

<h2 id="paper-two-let-the-grower-talk-to-the-controller">Paper two: let the grower talk to the controller</h2>

<p>Our <em>Smart Agricultural Technology</em> paper, <em>Enhancing greenhouse management
with interpretable AI</em>, is the second half of the story. Once you can describe
what the controller is being asked to do, you can build something that lets a
grower ask, in plain English, why the controller is doing what it is doing.</p>

<p>The naive way to do this is to hand the question to a large language model
and hope. That does not work for two reasons. First, the model has no idea
what is actually happening inside the optimizer. Second, even when it sounds
confident, its answers are not grounded in the controller’s own reasoning,
which means they cannot be audited.</p>

<p>What we built is a thin language layer that does three things in sequence.
It maps the operator’s natural-language question onto the structured
artefacts the optimizer already produces, things like the active constraints,
the multipliers, and the per-step contributions to the cost. It uses a domain
knowledge graph, encoding what plants, vents, lights, and humidity actually
do to each other, to constrain which explanations are even physically
plausible. And it returns the answer in the operator’s own language, with
the underlying evidence inline so a sceptical grower can dig into the numbers.</p>
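
<p>Compressed into a few lines of Python, the sequence looks roughly like this. Every name below is an illustrative placeholder rather than the paper’s actual interface; the point is the order of operations and what each stage is allowed to touch.</p>

<pre><code class="language-python">def answer_operator_question(question, optimizer_state, knowledge_graph, llm):
    # 1. Map the question onto artefacts the optimizer already produces.
    evidence = {
        "active_constraints": optimizer_state["active_constraints"],
        "multipliers": optimizer_state["multipliers"],
        "cost_terms": optimizer_state["cost_terms"],
    }
    # 2. Restrict the explanation vocabulary to the domain knowledge graph.
    allowed_entities = set(knowledge_graph.nodes)
    # 3. Let the language model phrase the answer, with the evidence attached inline.
    draft = llm(question=question, evidence=evidence, vocabulary=allowed_entities)
    return {"answer": draft, "evidence": evidence}
</code></pre>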

<p>The headline result is that the system produces explanations that match
expert annotations far better than off-the-shelf attribution methods. But the
honest reason we are happy with it is more boring: when we sat with growers
and watched them use it, they trusted it. They argued with it. They
occasionally won the argument. That is what an interpretable system should
feel like.</p>

<p><a href="https://doi.org/10.1016/j.atech.2025.101041">Read the paper.</a></p>

<h2 id="what-ties-them-together">What ties them together</h2>

<p>These are two papers, but they are really one thesis. If you want a controller
to explain itself in language a domain expert will accept, you need both an
honest description of the goal it was given and a faithful translation of the
reasoning it actually used. Either half on its own is a demo. Together, they
start to look like a real tool.</p>

<h2 id="whats-next">What’s next</h2>

<p>I am extending these ideas in two directions. One is mechanistic
interpretability for the language layer itself, treating the explainer as an
object of study, not just a wrapper. The other is moving beyond greenhouses
to controlled-environment systems with stricter safety requirements, where
the cost of an opaque decision is much higher.</p>

<p>If any of this resonates, or if you are working on something nearby, I would
genuinely like to hear from you. Email is the fastest way.</p>]]></content><author><name>Ramesh Arvind Naagarajan</name><email>ramesh.naagarajan@etit.tu-chemnitz.de</email></author><category term="Explainable AI" /><category term="Model Predictive Control" /><category term="LLM" /><category term="Greenhouse" /><summary type="html"><![CDATA[If you have ever stood in front of an industrial controller and asked “why did you just do that?”, you already know the problem this post is about.]]></summary></entry></feed>