Machine learning through the prism of econometrics

“Although we can predict house prices accurately, we cannot use such ML models to answer questions such as whether more dining rooms are needed.”

Artificial intelligence has been a force of nature in many areas. Whether it is driving advances in health and education or powering voice recognition and machine translation, artificial intelligence is becoming more essential to our lives every day. Sendhil Mullainathan, professor at the Booth School of Business at the University of Chicago, and Jann Spiess, assistant professor at the Stanford Graduate School of Business, observed that machine learning, particularly supervised machine learning, is more empirical than procedural. For example, facial recognition algorithms do not apply rigid rules to particular pixel patterns. Rather, these algorithms use large datasets of photographs to learn what a face looks like. In other words, the machine uses the images to estimate a function f(x) that predicts the presence (y) of a face from pixels (x).
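A toy sketch of this idea, on entirely synthetic data: rather than hand-writing rules, we fit a function f from labeled examples and check how well it predicts held-out cases. The "images" and "pixels" here are random stand-ins, not a real face dataset.

```python
# Supervised learning in miniature: estimate f(x) that predicts a label y
# from features x, purely from examples rather than hand-written rules.
# All data is synthetic; this is a sketch, not a real face detector.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 1000, 64                      # 1000 "images", 64 "pixels" each
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(int)     # the hidden rule the learner must recover

# Fit on the first 800 examples, evaluate on the held-out 200
f = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])
print("held-out accuracy:", f.score(X[800:], y[800:]))
```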


Another discipline that relies heavily on such approaches is econometrics. Econometrics is the application of statistical procedures to economic data to provide empirical analysis of economic relationships. With machine learning being used on data for purposes like forecasting, can empirical economists use ML tools in their work?

New methods for new data

Today, we are seeing a dramatic shift in what constitutes the data people can work with. Machine learning allows statisticians and analysts to work with data considered too high-dimensional for standard estimation methods, such as online posts and reviews, images, and linguistic information. Statisticians could hardly use such data in procedures such as regression. In a 2016 study, however, researchers used images from Google Street View to measure block-level income in New York and Boston. Likewise, a 2013 study developed a model that used online reviews to predict the outcome of hygiene inspections. So machine learning can augment the way we do research today. Let’s look at this in more detail.

Traditional estimation methods, such as ordinary least squares (OLS), are already used to make predictions. So how does ML fit in? To see this, we return to Mullainathan and Spiess’ paper, written in 2017, when the former was teaching and the latter was a doctoral student at Harvard University. The paper worked through an example, predicting house prices, for which the authors randomly selected ten thousand owner-occupied houses from the 2011 American Housing Survey metropolitan sample. They included 150 variables on each house and its location, such as the number of bedrooms. They then used several tools (OLS and ML) to predict log house values on a separate set of 41,808 dwellings, for out-of-sample testing.
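The shape of that exercise can be sketched with scikit-learn on synthetic data (the AHS sample itself is not reproduced here, and the variables are made up): fit OLS and an ML method on a training set, then compare their R² on held-out "houses".

```python
# Sketch of an OLS-vs-ML out-of-sample comparison on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 5000, 20
X = rng.normal(size=(n, p))
# "log house value": a linear part plus an interaction that a
# main-effects-only OLS regression cannot capture
y = X[:, 0] + 0.5 * X[:, 1] + X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
ols = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("OLS    out-of-sample R^2:", round(ols.score(X_te, y_te), 3))
print("Forest out-of-sample R^2:", round(rf.score(X_te, y_te), 3))
```

Because the data-generating process includes an interaction, the forest should beat the main-effects OLS out of sample; on purely linear data the gap would disappear.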

Applying OLS here requires deliberate choices about which variables to include in the regression. Adding all the interactions between variables (e.g., between floor area and the number of bedrooms) is not feasible, because it would produce more regressors than data points. ML, however, searches for such interactions automatically. In regression trees, for example, the prediction function takes the form of a tree that splits at each node on one of the variables. Such methods allow researchers to fit a rich class of interactive functions.
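To see how a tree encodes interactions, consider this small example: each root-to-leaf path combines the variables split on along the way, so the tree effectively builds interactions on its own. The feature names are illustrative, not taken from the AHS data.

```python
# A small regression tree: each path from root to leaf is effectively an
# interaction of the variables it splits on.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
n = 2000
bedrooms = rng.integers(1, 6, size=n)
area = rng.uniform(50, 300, size=n)
# synthetic price that depends on an area x bedrooms interaction
log_price = 0.01 * area * bedrooms + rng.normal(scale=0.2, size=n)

X = np.column_stack([bedrooms, area])
tree = DecisionTreeRegressor(max_depth=2).fit(X, log_price)
print(export_text(tree, feature_names=["bedrooms", "area"]))
```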

One problem here is that a tree with this many interactions would overfit: it would fit the idiosyncrasies of the training sample so closely that it would predict poorly on new data. This problem is addressed by what is called regularization. In the case of a regression tree, a tree of a certain depth is chosen according to the trade-off between a worse in-sample fit and less overfitting. This level of regularization is selected by tuning the ML algorithm empirically, creating an out-of-sample experiment within the original sample.
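That "out-of-sample experiment within the sample" is, in practice, cross-validation. A minimal sketch on synthetic data, with an illustrative grid of depths: shallow trees underfit, fully grown trees overfit, and the tuned depth sits in between.

```python
# Choosing tree depth by cross-validation: an out-of-sample experiment
# carried out inside the training sample. Data and depth grid are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = X[:, 0] * (X[:, 1] > 0) + rng.normal(scale=0.3, size=1000)

# mean 5-fold cross-validated R^2 for each candidate depth
scores = {d: cross_val_score(DecisionTreeRegressor(max_depth=d), X, y, cv=5).mean()
          for d in [1, 2, 3, 5, 8, 12, None]}   # None = fully grown tree
best = max(scores, key=scores.get)
print("CV R^2 by depth:", scores, "-> chosen depth:", best)
```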

Thus, choosing an ML-based prediction function involves two steps: picking the function that best minimizes a loss, and finding the optimal level of complexity by tuning it empirically. Trees and their splits are just one example; Mullainathan and Spiess note that the technique carries over to other ML tools such as neural networks. On their data, they tested various other ML methods, including random forests and LASSO, and found that these outperformed OLS (depth-tuned trees, however, were no more effective than traditional OLS). The best prediction performance came from an ensemble that combined several separate algorithms (the paper ran LASSO, a tree, and a forest). Econometrics can thus guide design choices that improve the quality of predictions.
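An ensemble of that kind can be sketched with scikit-learn's VotingRegressor, which averages the predictions of its members. The data and hyperparameters here are illustrative, not the paper's.

```python
# Averaging LASSO, a tree, and a forest into one predictor.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 10))
y = X[:, 0] + np.maximum(X[:, 1], 0) * X[:, 2] + rng.normal(scale=0.5, size=3000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingRegressor([
    ("lasso", LassoCV()),                                # linear, regularized
    ("tree", DecisionTreeRegressor(max_depth=5)),        # shallow tree
    ("forest", RandomForestRegressor(n_estimators=100, random_state=0)),
]).fit(X_tr, y_tr)
print("ensemble out-of-sample R^2:", round(ensemble.score(X_te, y_te), 3))
```

Averaging tends to help because the members make different kinds of errors: the linear model misses interactions, while the trees are noisier on the linear part.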

There are, of course, a few issues with using ML here. The first is the absence of standard errors on the coefficients in ML approaches. Let’s see how this can be a problem. The Mullainathan-Spiess study randomly divided the sample of dwellings into ten equal partitions and re-estimated the LASSO predictor on each (with the regularizer held fixed). The results revealed a serious problem: a variable used by the LASSO model in one partition might go unused in another. There were very few stable patterns across the partitions.
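The instability exercise can be reproduced in miniature on synthetic, correlated regressors: split the sample into ten partitions, refit LASSO on each with the penalty held fixed, and compare which variables receive nonzero coefficients.

```python
# Which variables does LASSO "use"? Refit on ten partitions and compare.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 500, 30
base = rng.normal(size=(n, p))
X = base + 0.9 * base[:, [0]]        # make the regressors mutually correlated
y = X[:, :5].sum(axis=1) + rng.normal(scale=1.0, size=n)

selected = []
for part in np.array_split(np.arange(n), 10):
    lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X[part], y[part])
    selected.append(frozenset(np.flatnonzero(lasso.coef_)))

print("distinct selected-variable sets across 10 partitions:", len(set(selected)))
```

If selection were stable, the ten sets would coincide; with correlated regressors and small partitions they generally do not.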

This instability does not hurt prediction accuracy much, and it arises because highly correlated variables are interchangeable for prediction. In traditional estimation methods, such correlations show up as large standard errors. For this reason, although we can predict house prices accurately, we cannot conclude that a variable, for example the number of dining rooms, is unimportant simply because the LASSO regression did not use it. Regularization also poses problems of its own: it favors less complex but potentially misspecified models, which raises concerns about omitted-variable bias.
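A contrived but direct illustration of why a zeroed coefficient is not evidence of irrelevance: make the second regressor an exact duplicate of the first, so both "matter" equally, and observe that LASSO keeps one and drops the other.

```python
# Two identical regressors both drive y, yet LASSO zeroes one of them.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n = 1000
x1 = rng.normal(size=n)
x2 = x1.copy()                        # exact duplicate: equally "important"
y = x1 + x2 + rng.normal(scale=0.5, size=n)

coef = Lasso(alpha=0.05).fit(np.column_stack([x1, x2]), y).coef_
print("coefficients for the two identical variables:", coef)
```

One coefficient absorbs essentially all the signal while the other is (near) zero; which one survives is an artifact of the fitting procedure, not a statement about the economics.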

Finally, it is essential to understand the type of problem ML solves. ML revolves around predicting an outcome y from variables x. However, many economic applications revolve around estimating a parameter β that underlies the relationship between x and y. ML algorithms are not built for this purpose. The danger is taking an algorithm constructed to make ŷ close to y and assuming that its outputs have the properties we associate with parameter estimates.
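The ŷ-versus-β̂ distinction can be made concrete with a one-variable example: LASSO predicts perfectly well here, but its coefficient is shrunk toward zero and so is a biased estimate of the true β = 2.

```python
# Good prediction, biased coefficient: LASSO shrinks beta-hat toward zero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(7)
x = rng.normal(size=(2000, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=1.0, size=2000)   # true beta = 2.0

beta_ols = LinearRegression().fit(x, y).coef_[0]
beta_lasso = Lasso(alpha=0.5).fit(x, y).coef_[0]
print("OLS beta-hat:", round(beta_ols, 2), "| LASSO beta-hat:", round(beta_lasso, 2))
```

OLS recovers β ≈ 2 on average, while the LASSO coefficient is noticeably smaller by construction: the shrinkage that protects prediction against overfitting is exactly what invalidates it as a parameter estimate.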

Still, ML improves prediction, so empirical economists might benefit from seeking out problems where improved predictions carry immense applied value.

One such category involves the new data types (language, images) mentioned above, where analysis involves prediction as a preprocessing step. This is particularly relevant when data on economic outcomes are missing. For example, a 2016 study trained a neural network to predict local economic outcomes from satellite data in five African countries. Economists can also use such ML methods in policy applications. An example from the Mullainathan and Spiess article is deciding which teacher to hire: this involves a prediction task (estimating the teacher’s value added) that can inform the decision. These tools show clearly that AI and ML should not go unnoticed in today’s world.




About Clara Barnard
