The 'Physics-Based Model' Sales Pitch vs. Data-Driven:
What 95 Responses Revealed About the Real Debate
Our recent article on ‘physics-based model’ marketing claims drew 95 responses on LinkedIn from industry experts spanning reservoir engineering, process modeling, chemical engineering, digital twins, and industrial AI. The discussion revealed something more important than who was ‘right’ — the entire physics vs. data-driven framing is fundamentally flawed.
First, the common ground: Whether physics-based, data-driven, or hybrid, all approaches share the same goal — the correct representation (model) of a physical system. The debate isn’t about the objective. It’s about how we achieve it, what information we use, and what works in practice.
Here’s what the conversation actually taught us.
The Framing Problem
Many responses framed the discussion as physics vs. data-driven vs. hybrid. But even “hybrid” still implies that “physics” and “data-driven” are separate things being combined. The reality is more fundamental — and the points that follow explain why.
Many Industrial Systems Have No Theoretical Equations
The physics exists. The equations don’t.
This is the critical point that “physics-based” marketing consistently ignores: a significant portion of industrial modeling involves systems for which no theoretical equations exist.
The underlying physics exists — the system is physical — but the mathematical description doesn’t. And in many cases, it may never.
Product Quality and Sensory Properties:
- Wet burst strength of paper products — “companies have spent millions on R&D, but there’s no equation”
- Smoothness of aged whiskey
- Flavor profiles — even with gas chromatography providing a complete component list, no equation maps composition to the sensory output
- Textures, aromas, and appearance
Complex Material Properties:
- Needle coke electrical properties
- Polymer behavior under real manufacturing conditions
- Non-Newtonian emulsions
Emergent System Behaviors:
- Non-ideal multiphase flow in heterogeneous media
- Biological and fermentation processes
- Fouling, waxing, and corrosion progression
For these systems, data-driven models learn the empirical function directly. If wet burst strength = f(fiber length, moisture content, basis weight, chemical additives, process parameters…), but no theoretical equation for f exists, we determine f empirically from measured input-output data.
“The IP is in the process knowledge and data, not the physics.”
This isn’t a niche problem. It represents enormous economic value across consumer products, specialty materials, pharmaceuticals, biotechnology, and chemical processing. When your key performance indicator has no theoretical basis, “physics-based modeling” simply isn’t available as an option. The physics exists. The equations don’t.
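Determining f empirically can be as simple as a least-squares fit to measured input-output pairs. A minimal sketch below uses hypothetical wet-burst-strength data; every variable name, range, and coefficient is illustrative, not a real measurement:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical process measurements (illustrative only): each row is one
# production run with [fiber_length_mm, moisture_pct, basis_weight_gsm].
X = rng.uniform([1.0, 4.0, 60.0], [3.0, 8.0, 120.0], size=(200, 3))

# No theoretical equation maps these inputs to wet burst strength, so we
# posit an unknown empirical f and generate synthetic "measurements" from it.
true_coef = np.array([12.0, -3.5, 0.8])
y = X @ true_coef + 5.0 + rng.normal(0.0, 1.0, size=200)

# Learn f empirically: fit y ~ X.b + c by ordinary least squares.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)  # learned [b1, b2, b3, intercept], close to the generating values
```

Real applications would use richer model families and the validation practices discussed below, but the principle is the same: f comes from the data, not from theory.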
When Theoretical Equations Exist, They Still Require Data
Even where physics equations exist, they require data-driven calibration to represent real systems accurately.
“Process engineers don’t use PV=nRT. We use NRTL, Peng-Robinson, SRK, Pitzer — semi-empirical models blending thermodynamic rigor with fitted parameters.”
The structure comes from theory. The parameters come from data. This is not “physics-based” in the pure sense that marketing implies.
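The pattern is easy to see with the Antoine vapor-pressure equation, a stand-in here for the semi-empirical models quoted above (it is not named in the discussion): theory supplies the form log10(P) = A - B/(C + T), while A, B, C are fitted constants. The sketch below generates synthetic data from commonly quoted water constants and recovers them by fitting; since the model is linear in A and B for fixed C, it scans C and solves least squares at each step:

```python
import numpy as np

# Assumed "true" Antoine constants (water, mmHg, deg C), used here only
# to generate synthetic vapor-pressure measurements for this sketch.
A_true, B_true, C_true = 8.07131, 1730.63, 233.426

T = np.linspace(20.0, 90.0, 30)                            # temperatures, deg C
logP = A_true - B_true / (C_true + T)                      # structure from theory
logP += np.random.default_rng(1).normal(0, 0.0005, T.size) # measurement noise

# Parameters from data: for each candidate C, solve the linear least-squares
# problem for A and B; keep the candidate with the smallest residual.
best = None
for C in np.arange(150.0, 300.0, 0.25):
    M = np.column_stack([np.ones_like(T), -1.0 / (C + T)])
    sol, *_ = np.linalg.lstsq(M, logP, rcond=None)
    sse = float(np.sum((M @ sol - logP) ** 2))
    if best is None or sse < best[0]:
        best = (sse, sol[0], sol[1], C)

_, A_fit, B_fit, C_fit = best
print(A_fit, B_fit, C_fit)  # close to the assumed constants
```

The fitted constants carry no theoretical derivation of their own; swap in a different fluid's data and the same structure yields different parameters.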
The Vogel equation was raised as a specific example. Although presented as a physics-based model for well performance, it doesn’t hold in some real-world cases because the assumptions underlying the equation are not met. In some cases, the Vogel equation described the exact opposite of the observed process dynamics.
“Even theoretical models must be field fit, so in a sense they are data driven too.”
Sometimes physics-based simulation is used alongside data-driven models rather than as the model itself. In one example, thermal diffusivity modeling for steam injection into a sand bed served as a theoretical component feeding inputs to the data-driven system. Structure from theory. Accuracy from data.
“Theory takes you part way, hand tweaking and data takes you across the finish line.”
When “rules of thumb” or “first principles” from experienced engineers were tested against actual operating data, roughly half were correct, about a quarter were conditional, and a quarter were not supported by the data at all.
Data-Driven Models Represent Physical Systems
When you collect temperature, pressure, flow rate, and concentration measurements from an operating system, you are measuring a physical system operating under physical laws — including laws we haven’t derived theoretically, interactions too complex to model from first principles, real boundary conditions rather than idealized assumptions, and dynamic effects such as aging, fouling, and environmental variation.
“Reality doesn’t violate physics; it violates our simplifications and boundary conditions.”
“Data tells what really happens, regardless of theory.”
“The data doesn’t violate physics — it IS the physics, just measured rather than theorized, mindful of noise and sampling errors.”
Data Must Be Handled Properly
Several responses rightly raised concerns about data quality and model construction:
Noise and bias
"Measurements are contaminated by noise, observer effects, and system bias."
Overfitting
Models with excessive degrees of freedom "have sufficient flexibility to fit not only the underlying signal, but also noise and bias present in the data." Von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."
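Von Neumann's elephant is easy to reproduce: polynomial fits of increasing degree always reduce training error, but error on held-back points does not keep falling. A minimal sketch on synthetic data (the signal, noise level, and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: a smooth underlying signal plus measurement noise.
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)

# Interleaved holdback split: every 4th point is reserved for testing.
hold = np.arange(x.size) % 4 == 0
train = ~hold

errs = {}
for deg in (1, 3, 9, 15):
    # Polynomial.fit does least squares on a rescaled domain (well conditioned).
    p = np.polynomial.Polynomial.fit(x[train], y[train], deg)
    errs[deg] = (
        np.sqrt(np.mean((p(x[train]) - y[train]) ** 2)),  # training RMSE
        np.sqrt(np.mean((p(x[hold]) - y[hold]) ** 2)),    # holdback RMSE
    )
    print(deg, errs[deg])
# Training RMSE falls monotonically with degree; holdback RMSE does not.
```

A degree-15 polynomial has enough freedom to chase the noise between training points, which is exactly why holdback data, discussed below, is non-negotiable.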
Non-stationarity
Systems change over time through aging, fouling, and shifting operating regimes. Models that aren't updated, whether theoretical or data-driven, will degrade.
Causality vs. correlation
"Almost all [data-based] methods are correlation-based." Causal analysis is a rich topic that deserves a discussion of its own.
Latent variables
Unmeasured factors affect system behavior. Their effects often manifest within observed variables — pressures and temperatures reflecting fouling, waxing, corrosion, hydrate formation.
Asynchronous data
Industrial data is not synchronous: variable readings arrive at different times rather than simultaneously, so a uniform 'time step' is not a valid concept.
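In practice, asynchronous readings are typically reconciled by resampling each tag onto a common timeline before modeling. A minimal sketch using linear interpolation over the overlapping time window (tag names, timestamps, and values are hypothetical):

```python
import numpy as np

# Two tags sampled at different, irregular times (hypothetical readings).
t_temp = np.array([0.0, 3.1, 7.4, 12.0, 18.5])   # seconds
temp   = np.array([80.0, 80.6, 81.5, 82.1, 83.0])

t_flow = np.array([1.2, 5.0, 9.9, 15.3])
flow   = np.array([4.0, 4.2, 4.1, 4.4])

# Only the overlapping window is safe to interpolate: outside either tag's
# range we would be extrapolating, not aligning.
t0 = max(t_temp[0], t_flow[0])
t1 = min(t_temp[-1], t_flow[-1])
grid = np.linspace(t0, t1, 8)

temp_on_grid = np.interp(grid, t_temp, temp)
flow_on_grid = np.interp(grid, t_flow, flow)

rows = np.column_stack([grid, temp_on_grid, flow_on_grid])
print(rows)  # synchronized (time, temp, flow) rows ready for modeling
```

The choice of resampling method (linear, zero-order hold, kernel averaging) is itself a modeling decision, and the wrong one can smear dynamics or invent them.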
Data source integrity
In some cases, reported production figures don't correspond to metered results. Which data you use matters.
Rigorous Methodology Addresses These
Input selection — algorithms and domain knowledge to identify true causal relationships.
Model structure optimization — appropriate complexity for available data, not maximum complexity. Models should be as simple as possible and no simpler.
Multiple dataset validation — regression on training data, validation on multiple independent datasets.
Holdback data — true out-of-sample performance testing.
Auto-calibration — as new data arrives to handle non-stationarity.
Continuous performance monitoring — ongoing validation as systems evolve.
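Auto-calibration against non-stationarity can be sketched as recursive least squares with a forgetting factor: each new sample updates the parameter estimate, and old data is discounted exponentially so the model tracks drift. A minimal sketch on a synthetic system whose gain slowly drifts (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

lam = 0.95                 # forgetting factor: < 1 discounts stale data
theta = np.zeros(2)        # current estimate of [gain, offset]
P = np.eye(2) * 1000.0     # covariance-like matrix (large = very uncertain)

for k in range(400):
    gain = 2.0 + 0.005 * k                      # the true system slowly drifts
    u = rng.uniform(0.0, 1.0)
    y = gain * u + 1.0 + rng.normal(0.0, 0.05)  # new measurement arrives

    # Standard recursive-least-squares update with forgetting factor lam.
    phi = np.array([u, 1.0])
    K = P @ phi / (lam + phi @ P @ phi)
    theta = theta + K * (y - phi @ theta)
    P = (P - np.outer(K, phi) @ P) / lam

print(theta)  # tracks the drifted gain (near 4) and the offset (near 1)
```

A fixed model calibrated on day one would still report a gain of 2; the adaptive update follows the system as it ages.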
On model size specifically: "LLMs should NOT be used for predictive modeling. Smaller, focused models should be used, with as few weights as possible and the optimal combination of inputs to service the need."
Experts ARE Data-Driven Models
One exchange deserves examination. An expert described analyzing a single independent well, applying theoretical equations, and adjusting parameters based on geological knowledge, concluding: “Single well. Single sand. No database for a data-driven solution.”
But consider what actually happened: decades of observing where equations work and where they fail, learning which adjustments apply in which conditions, building functional relationships between geological indicators and prediction accuracy. That IS a database — stored in a human instead of silicon. It is rarely transferable and often lost at retirement.
When experts calibrate theoretical equations from experience, they are performing data-driven learning implicitly.
Additional Perspectives From the Responses
Conservation Laws
Mass, energy, and momentum balances provide hard constraints that shouldn't be violated. "What physics provides: Conservation law guarantees, Extrapolation capability, Interpretable failure modes."
Low-Data Regimes
"In the very low-data regime, any physical prior, constraint, or boundary condition can materially help optimization converge. That's especially valuable at the beginning of a project, when you're just starting to collect data."
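One concrete way a physical prior helps in the low-data regime is regularization toward prior parameter values: with a handful of samples the fit stays anchored to the physics-derived guess, and as data accumulates the data term takes over. A minimal sketch (the prior, true parameters, and sample counts are all illustrative):

```python
import numpy as np

def fit_with_prior(X, y, theta_prior, alpha):
    """Least squares pulled toward a prior: min ||X.t - y||^2 + alpha*||t - t_prior||^2."""
    n = X.shape[1]
    A = X.T @ X + alpha * np.eye(n)
    b = X.T @ y + alpha * theta_prior
    return np.linalg.solve(A, b)

rng = np.random.default_rng(4)
theta_true = np.array([3.0, -1.0])
theta_prior = np.array([2.5, -0.8])   # rough physics-derived guess

# Very low-data regime: only 3 noisy samples; the prior dominates the fit.
X = rng.normal(size=(3, 2))
y = X @ theta_true + rng.normal(0.0, 0.5, 3)
few = fit_with_prior(X, y, theta_prior, alpha=5.0)

# More data: 300 samples; the same alpha now barely matters.
X2 = rng.normal(size=(300, 2))
y2 = X2 @ theta_true + rng.normal(0.0, 0.5, 300)
many = fit_with_prior(X2, y2, theta_prior, alpha=5.0)

print(few, many)  # 'few' stays near the prior; 'many' converges to the truth
```

This mirrors the transition described in the quote: the prior carries the project at the start and fades into irrelevance as measured data accumulates.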
Design Phase
Before a system exists, physics-based simulation is the only tool available. "Don't expect to design a data-based space mission."
Adaptive Models
"The strength of data-based methods is that they are not 'one and done.' They can track changes which inevitably happen — due to aging of the system — by using adaptive ML algorithms. Whereas physics-based models stay unchanged from status quo ante."
Chaotic Behavior
"Even under textbook assumptions, physics-based doesn't guarantee predictability or model quality. Smooth, deterministic systems can already exhibit chaotic behavior."
Feedback Loops
Systems with feedback can become unstable if not properly damped — relevant to both modeling and process control.
Expensive Experimentation
When real-world experimentation is expensive or risky, using physics-based simulation to generate data for optimization makes sense.
Assumptions Matter More Than Method
"Any data-based model with good assumptions will beat one with bad assumptions. Prior knowledge, like physics, is simply a source of modeling assumptions which from experience are usually productive."
Three Types of Practitioners
Those who reject data that contradicts their understanding of physics. Those who reject physics that contradicts what the data says. Those who don't care and simply continue their work. All three warrant caution.
What the Discussion Actually Revealed
What Determines Your Approach

| Situation | Approach | Typical examples |
| --- | --- | --- |
| No theoretical equations exist | Learn the empirical function directly from measured input-output data. | Product quality, sensory properties, complex materials |
| Equations exist but require calibration | Use theoretical structure; fit parameters from data. | Thermodynamic systems, well performance, multiphase flow |
| No operational data exists | Physics-based simulation. Do the work. Capture the data. | New designs, safety studies, greenfield projects, space missions |
| Low-data regime | Physics constraints aid convergence; transition to data-calibrated models as data accumulates. | Early-stage operations, initial project phases |

In all cases: validate rigorously on independent data, use appropriate model complexity, and adapt as systems change over time.
One response from reservoir engineering captured the reality precisely: "This distinction between physics models and data driven models simply does not exist. We build reservoir models and solve the differential equations. But the PDE is controlled by measured values. So it is already a mix."
The question is not "physics vs. data-driven." The question is: what information is available, what form should the model take, and how do we validate it properly?
The Refined Understanding
The 95 responses from 35 experts revealed the complete picture:
The goal is the same across all approaches — correct representation of physical systems.
A massive portion of industrial systems have no theoretical equations. The physics exists; the mathematical description doesn’t. Data-driven models learn the empirical function directly from measured input-output relationships.
Where theoretical equations exist, they require data-driven calibration — making them partially data-driven by nature.
Data from physical systems embodies actual physics — measured rather than theorized, including effects that theoretical models simplify away.
Expert judgment is implicit data-driven learning — stored in human experience rather than explicit models. Rarely transferable and often lost at retirement.
Context determines the appropriate approach. Conservation laws, low-data regimes, design phases, and expensive experimentation all have specific needs that shape the right methodology.
Methodology matters. Proper validation, appropriate model complexity, input selection, and continuous adaptation are essential regardless of approach.
As George Box put it: “All models are wrong, some are useful.”
The goal isn’t to champion theory or data. It’s to build accurate, reliable representations of physical systems using whatever information is available.
For over 30 years, we’ve been building systems that enhance your understanding of your products and processes, and that predict and optimize them, across oil & gas, consumer products, specialty materials, and process industries.
If you’re evaluating how to model your systems — or wondering why your current approach isn’t delivering — let’s talk.
