Prediction – I do not think it means what you think it means.

In the fields of “data science,” “data analysis,” “machine learning,” and “artificial intelligence,” novice practitioners frequently use the term “prediction,” often incorrectly. I want to help tighten up the nomenclature and enhance the “street cred” of the new people joining the field. So here I don my swashbuckling outfit and reply, “You keep using that word. I do not think it means what you think it means.”

“Prediction” is not just a casual term; it is well defined, with significant historical and professional context. The word comes from the Latin “prae-” meaning “before” and “dicere” meaning “to say.” Pre-Diction. That means you are saying what will happen in the future. In modeling, a prediction uses current and past information to create an output for a future time. It is also known as “Forecasting” (“Fore”-“Casting”).

The “future” is defined with respect to what is known at the time of each row of the input variables, whether that row of data is in the past or now. If you model Y(t) = Function(X(t)…), you are not predicting. If you model Y(t+n) = Function(X(t)…), with n greater than zero, you are predicting: you have deliberately time-shifted your “dependent” variable into the future. The first is usually much easier than the second. If you say you are predicting when you are not, you are claiming an accomplishment more difficult than what you are actually doing. Discerning customers and other practitioners may notice, and your “rep” may take a “ding.”
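To make the difference concrete, here is a minimal sketch in plain Python. The helper name `make_pairs` and the toy numbers are mine, invented purely for illustration; the only idea taken from the text above is the shift n between the input row and the target.

```python
def make_pairs(x, y, n=0):
    """Pair each input x[t] with the target y[t + n].

    n = 0 gives estimation pairs (same time basis).
    n > 0 gives prediction pairs (target shifted n rows into the future).
    """
    if n < 0:
        raise ValueError("n must be >= 0")
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    # The final n rows have no known future target, so they are dropped.
    last = len(x) - n
    return [(x[t], y[t + n]) for t in range(last)]


# Hypothetical toy series, one value per row.
x = [10, 11, 12, 13, 14]
y = [100, 110, 120, 130, 140]

estimation_pairs = make_pairs(x, y, n=0)  # Y(t)   paired with X(t)
prediction_pairs = make_pairs(x, y, n=2)  # Y(t+2) paired with X(t)

print(estimation_pairs)  # [(10, 100), (11, 110), (12, 120), (13, 130), (14, 140)]
print(prediction_pairs)  # [(10, 120), (11, 130), (12, 140)]
```

Notice that the prediction set is shorter: the last n rows have inputs but no target yet, because their future has not happened.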

The correct term for when your dependent variable is on the same time basis as your inputs (or has no time basis at all) is “Estimation.” You are estimating Y based on all your X’s. “Estimation” uses current and past information to create a value for the current time (with respect to the time of the inputs).

While we are on the topic of nomenclature, here are a couple of terms that are useful for time series prediction:

Bucket: The time interval that each row of data represents.

Horizon: The time interval into the future for which predictions are made.
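One way to connect the two terms, offered as a sketch rather than a formal definition: if every row is one bucket, then the horizon divided by the bucket gives the number of rows to shift the target. The function name `horizon_in_buckets` and the requirement that the horizon be a whole number of buckets are my assumptions.

```python
from datetime import timedelta


def horizon_in_buckets(horizon, bucket):
    """Return how many rows ahead the target sits for a given horizon."""
    if bucket <= timedelta(0):
        raise ValueError("bucket must be a positive time interval")
    if horizon < timedelta(0):
        raise ValueError("horizon must not point into the past")
    steps, remainder = divmod(horizon, bucket)
    if remainder != timedelta(0):
        # Assumption: we only accept horizons that land exactly on a row.
        raise ValueError("horizon must be a whole number of buckets")
    return steps


# Hypothetical setup: 15-minute buckets, predicting 6 hours ahead.
n = horizon_in_buckets(timedelta(hours=6), timedelta(minutes=15))
print(n)  # 24
```

A horizon of zero buckets brings you back to estimation.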

If you want additional definitions around prediction, estimation, and forecasting, consult the dictionaries of APICS (once called the American Production and Inventory Control Society) and ASCM (Association for Supply Chain Management), which have been doing prediction science for a long time, but not quite as long as the Egyptians.

If you think you have a REALLY GOOD model that makes super accurate predictions, I’d like to introduce you to another important concept in this field: the nefarious “future leak.” This occurs when future data is inadvertently used as input to a model. “Future leaks” can happen in very sneaky ways, and they can lead to misleadingly high performance in predictions. More on this, perhaps another time.
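As one hypothetical example of a sneaky leak (my illustration, not an exhaustive list), consider smoothing an input with a moving average. A trailing window only looks backward. A centered window quietly reaches into rows that have not happened yet. The function names and numbers below are invented for this sketch.

```python
def trailing_mean(values, t, window):
    """Mean of the `window` rows ending at row t. Uses only rows <= t."""
    start = max(0, t - window + 1)
    chunk = values[start:t + 1]
    return sum(chunk) / len(chunk)


def centered_mean(values, t, half_width):
    """Mean of rows t - half_width through t + half_width.

    Reaches into rows > t, so as a model input at time t it leaks the future.
    """
    start = max(0, t - half_width)
    chunk = values[start:t + half_width + 1]
    return sum(chunk) / len(chunk)


values = [1.0, 2.0, 3.0, 4.0, 100.0]
t = 3  # "now" is row 3; row 4 is still in the future

safe = trailing_mean(values, t, window=3)       # rows 1..3
leaky = centered_mean(values, t, half_width=1)  # rows 2..4

# Change only the future row and recompute both features.
changed = values[:t + 1] + [-50.0]
safe_after = trailing_mean(changed, t, window=3)
leaky_after = centered_mean(changed, t, half_width=1)

print(safe == safe_after)    # True: the future did not touch this input
print(leaky == leaky_after)  # False: this input moved with the future
```

A quick test along these lines, altering future rows and checking whether any input at time t changes, is one way to hunt for leaks.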

Understanding and correctly applying terms is vital in our field. Clarity in language reflects clarity in thought, leading to more accurate and reliable models in data science and its allied disciplines. Using the right terms for your work will raise your credibility and professionalism.

– – –

Carl Cook is the founder and president of BioComp Systems, IntelliDynamics (R), and has been practicing prediction and forecasting in industry since the late 1970s. He can be reached at cmcook@intellidynamics.net
