Ghosts Who Code: Do LLMs Truly “Understand” the Code They Write?

Arzu Kirici · 6 min read
[Illustration: a scientist analyzing a large language model represented as a syntax-based engine, with visual cues contrasting syntax and semantics.]

Inspired by Ersin Koç, with an additional point raised by Tyler Burns, PhD, about how not every scientist needs or wants to become a programmer.

Large Language Models (LLMs) have quietly slipped into the center of scientific and computational work. Whether you run genomic pipelines, automate data cleanup, or draft quick analysis scripts, tools like ChatGPT, Claude, and Gemini now feel like routine lab equipment. You ask them for a function, they produce it. You paste an error, they fix it. You describe an architecture, they outline it.

And most of the time, the result runs well enough that it’s tempting to believe the model “understands” what it is doing.

It doesn’t.

Behind every correct-looking function is not a mind, not reasoning, not a conceptual model, but a massive statistical mechanism calculating the most probable next token. The illusion of intelligence is strong because the simulation is high-resolution. But it remains a simulation.

This distinction is essential for scientists, especially those working with biological data where correctness and grounding matter.

The Comfortable Illusion of Understanding

When an LLM writes something like:

def calculate_area(radius):
    return 3.14 * radius ** 2


it has not recognized a geometry problem.
It has not reconstructed a mental model of a circle.
It has not reasoned through the concept of area.

It has simply observed that in its training data, this pattern is statistically common.

Biologists already know a version of this phenomenon:
a convolutional network can segment nuclei without “understanding” cells;
AlphaFold can fold proteins without “knowing” chemistry.

LLMs succeed through pattern reproduction, not comprehension.

Syntax Without Semantics: A Familiar Scientific Distinction

A clear way to explain LLMs to scientists is to borrow language from biology:

  • Syntax is the sequence: ACTGACTG (or tokens in text).
  • Semantics is the function: transcription, regulation, phenotype.

LLMs operate entirely at the level of syntax.

They know how code should look.
They do not know what code means.

For an LLM:

  • price = 100 and
  • weight_grams = 100

have no real-world interpretation.
They are just vectors in a statistical space.

This is why LLMs produce syntactically correct scripts that can be logically or scientifically wrong, a failure mode biologists also recognize in overfitted models.
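To see what this looks like in practice, here is a minimal sketch built on the two variables above. Every line is valid Python and runs without complaint, yet the result is physically meaningless; neither the interpreter nor the model holds any concept of dollars or grams. (The variable nonsense is added here purely for illustration.)

# Both assignments and the sum are syntactically valid; only the meaning is broken.
price = 100          # intended as dollars
weight_grams = 100   # intended as grams

nonsense = price + weight_grams  # illustrative variable: runs fine, prints 200, means nothing
print(nonsense)

A statistical model of token sequences has no way to flag this. Only a human with domain grounding, or an explicit validation step, can.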

Why It Feels So Intelligent: The High-Resolution Simulation

Ersin Koç described LLM behavior using the “stochastic parrot” framing: a system that imitates expert language without grasping meaning. It is an accurate description.

The simulation is so refined that humans naturally project understanding onto it. We fill in gaps the model itself cannot see.

This is similar to how:

  • structure models mimic folding without thermodynamics
  • expression models mimic cell identity without homeostasis
  • denoising algorithms mimic clarity without microscopy

The output matches expert work, but the mechanism behind it is nothing like expert reasoning.

The Chinese Room, Revisited for Coders

Philosopher John Searle’s “Chinese Room” captures this dynamic perfectly.

Imagine:

  • you don’t speak Chinese
  • you have a massive rulebook mapping symbols to symbols
  • someone outside sends you a sentence
  • you look up the rules and send back the appropriate symbols

From the outside, it appears you “understand” Chinese.
Inside the room, you know you don’t.

LLMs are this system scaled to billions of parameters.

They do not know that a line of code interacts with a database, moves a robot, operates on patient metadata, or touches real biology. They only know that certain token patterns frequently follow others.

They perform symbol transformation, not conceptual reasoning.
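A deliberately crude caricature of this can be written in a few lines. The rulebook below is a toy example, not how an LLM is actually implemented: real models interpolate over learned statistics rather than doing exact lookups, but the absence of meaning is the same in kind.

# A toy "Chinese Room": a fixed rulebook mapping input symbols to output symbols.
RULEBOOK = {
    "area of a circle?": "def calculate_area(radius): return 3.14 * radius ** 2",
    "reverse a list?": "my_list[::-1]",
}

def chinese_room(message):
    # Pure symbol transformation: no geometry, no lists, no concepts behind the lookup.
    return RULEBOOK.get(message.strip().lower(), "no rule for that symbol sequence")

print(chinese_room("Area of a circle?"))

From the outside, the function "answers" coding questions. Inside, there is only a lookup.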

Why This Matters for Scientists

In experimental science, we validate everything:

  • Western blots need controls.
  • RNA-seq pipelines need QC.
  • Structural predictions need benchmarking.
  • Machine-learning models need ground truth.

Yet researchers often accept LLM-generated code at face value.
This is dangerous because:

  • The model can hallucinate functions.
  • It can propose APIs that don’t exist.
  • It can mix plausible syntax with broken logic.
  • It can silently violate constraints specific to biomedical data.

The risk is not that LLMs are weak. It’s that they appear stronger than they truly are.
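One practical habit is to treat LLM-generated code the way we treat any unvalidated assay: check it against a value you can derive independently before trusting it. Here is a minimal sketch, reusing the calculate_area function from earlier; a tight tolerance immediately flags the truncated constant.

import math

# Suppose an LLM handed us this function.
def calculate_area(radius):
    return 3.14 * radius ** 2

# Compare against an independently derived value, with a tight tolerance.
expected = math.pi * 2.0 ** 2
got = calculate_area(2.0)

if math.isclose(got, expected, rel_tol=1e-6):
    print("check passed")
else:
    print(f"check failed: got {got}, expected {expected}")

The check is trivial, but it is the difference between a plausible answer and a verified one.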

So Do LLMs Understand Code?

No.
Not in any scientific or cognitive sense.

They do not have:

  • grounded concepts
  • causal models
  • awareness of consequences
  • domain understanding
  • intentionality

They have probability distributions.

Their “intelligence” is surface-level: a statistically optimized continuation of patterns that humans mistake for reasoning.

The model is not collaborating with you.
It is simulating collaboration.

And the simulation is good enough that you forget it is a simulation.

Final Thought

For me, the real issue is not whether LLMs are useful. They clearly are. The problem starts when people treat their output as if it comes from understanding rather than pattern prediction. The moment we confuse fluency with reasoning, or correctness with comprehension, we create space for silent errors that scientists cannot afford.

LLMs speed up work, reduce technical barriers, and help with routine coding. But they remain ungrounded systems. They don’t know biology, don’t grasp consequences, and don’t “reason” through a pipeline the way a scientist does. Their strength is scale, not insight.

So I use them, but the same way I treat any model that has no mechanistic grounding: with validation, skepticism, and the awareness that a convincing answer is not the same thing as a true one.
