Machine Learning technologies have proven to perform well at automating daily tasks and handling sensitive needs. These include recommendation systems, biometric authentication, fraud detection, and image classification. The tools have become trustworthy enough that even the medical domain can now benefit from them. But while the adoption of ML technologies may create more value in products, in some situations the expectations of what ML can deliver exceed the actual capabilities of the technology.
Just to clarify, Machine Learning is the science that enables computers to learn concepts from data without being explicitly programmed. ML applications are part of your everyday activities: they let you buy products personalized for your specific needs, filter spam emails so that you read only the ones with relevant content, and automatically categorize your images based on their content.
Below I will present some of the most common misconceptions about ML technologies. They come from contexts where I was required to employ ML technologies but, after proper analysis, it turned out that these technologies were not actually a good fit.
A big volume of data enables predictions
Not all use cases are fit for a Machine Learning approach. Having a large corpus of documents with hundreds of pages whose content barely changes over time is not big data. Employing an ML strategy is not the best approach when you have a lot of data that lacks veracity and velocity, the other dimensions required for big data. Such data cannot be used to build a prediction model: the model cannot generalize from it, and you find yourself developing a model that overfits. A more suitable approach for such a situation could be a pattern-based one that tracks the small changes occurring in your documents and flags them.
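As a rough illustration of tracking small changes between document versions without any learning, here is a minimal sketch using Python's standard difflib module. The function name changed_lines and the two sample clauses are hypothetical, made up for this example; a real project would need its own definition of what counts as a relevant change.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the lines removed ('- ') or added ('+ ') between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical sample data: two versions of the same short document.
old_version = "Clause 1: payment is due in 30 days.\nClause 2: delivery within 5 days."
new_version = "Clause 1: payment is due in 45 days.\nClause 2: delivery within 5 days."

changes = changed_lines(old_version, new_version)
for line in changes:
    print(line)
```

The unchanged clause is ignored and only the edited one is reported, which is often all that is needed for a corpus that barely changes over time.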
An initial assessment can be done with a dozen data points
Take the example of the medical field. One of the most sensitive issues with medical data is privacy, which is why the amount of data available for initial research purposes is typically limited. We easily enter a vicious circle here: not having access to a real and satisfactory amount of data makes it difficult to assess a medical problem from an ML perspective. I was once required to make proposals for a medical project that needed an ML component in order to receive funding. The data that I received for the assessment consisted of 9 data points with at most 15 features. On top of that, some feature values were missing, although this is to be expected when working with medical data.
With such a limited amount of data, it becomes difficult to get an overview of the topic you are required to tackle or to understand the problem that needs solving. While I understand the data is confidential, sensitive and difficult to acquire, if it cannot be provided, maybe an ML-based approach is not the best fit at the moment.
Pre-trained image models are the one-size-fits-all solution to every image-related task
Since pre-trained image models are popular right now, it has become common to overgeneralize their usage and believe that such models might be the solution for any type of problem or data. What I think should be mentioned in this case is that these pre-trained models are not built to be universal. Employing such a model requires fine-tuning and a clear definition of the problem. Even then, such models are usually limited to solving problems similar to the use case they were originally designed for. Additionally, the input and output require extra processing so that the new data fits the model’s design. Just take these models, when they exist, as the starting point for customizing your own models.
A binary classification model can be trained solely on positive examples
With the easy implementation of pre-trained image classification models came the demand to employ this technique whenever images were available. I was once working on a project whose purpose was to identify a region of interest in scanned documents. The general flow of the implementation employed a pre-trained image classification model that was fit on the images of interest, and the goal was to predict whether an image included the pattern of interest. The model was trained with data that included solely images that had the pattern, while all other images were discarded.
The problem was that all the images were classified as containing the pattern, whether this was true or not. The strategy of using only positive samples contradicts the purpose of binary classification, where we expect the training data to contain both positive and negative samples. If only positive examples are used to train a model, it can only predict yes for any new data, as it has never seen the other class and cannot differentiate between the two.
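The effect can be reproduced with a toy example. The sketch below is not the pre-trained image model from the project; it swaps in a tiny nearest-neighbour classifier written in plain Python, and the two-number feature vectors and label names are hypothetical stand-ins for image features. It only shows that a classifier can never output a label it did not see during training.

```python
def nearest_neighbour_predict(train, query):
    """Return the label of the training example closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda item: squared_distance(item[0], query))
    return label


# Hypothetical features: (feature_1, feature_2) per image.
positives_only = [((0.9, 0.8), "pattern"), ((0.85, 0.95), "pattern")]
with_negatives = positives_only + [
    ((0.1, 0.0), "no pattern"),
    ((0.05, 0.2), "no pattern"),
]

# One query resembles the positives, the other clearly does not.
queries = [(0.9, 0.9), (0.0, 0.1)]

preds_one_class = [nearest_neighbour_predict(positives_only, q) for q in queries]
preds_two_class = [nearest_neighbour_predict(with_negatives, q) for q in queries]

print(preds_one_class)  # every query gets the only label seen in training
print(preds_two_class)  # the second query can now be rejected
```

With only positive examples, even the query that looks nothing like the pattern is labeled as containing it. Adding a few negative examples is what makes the second answer possible at all.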
Stock market movements can be easily predicted based on historical open-high-low-close indicators
Some of the most used examples of Machine Learning applications, especially in tutorials, are forecasting and fraud detection. Yet the question I received quite often about what ML can do concerned stock market prediction.
People find it difficult to understand that learning a prediction model requires more than 3 features: index, timestamp, and value. These features are what people point to whenever they are not satisfied with the answer that, no, you cannot predict the stock market with the data you scrape from a stock market website.
While my expertise does not include stock market analysis, I am confident that the factors that stock market analysts consider when making a prediction are not magically included in the historical data. And I believe that several external sources of information are needed in order to make an accurate forecast.
The ML algorithms identify my business problems
To solve a business problem you are required to define the problem, understand it, and have access to the right data. In other words, we expect a client to show a minimum conceptualization of the outcome. Machine Learning is not a one-size-fits-all technology, as there are preprocessing steps and learning algorithms specific to each data type, data volume and problem.
Expectations such as "you just throw the data into the Machine Learning algorithm, we see what we get, and we start from there" are not realistic. There are no ML algorithms that you can feed any type of data, in any format, and expect them to offer you solutions. This type of expectation may lead to long-term research that does not follow the actual needs of the client.
An ML-based approach is better than any traditional approach
I find it quite common that simple tasks that require simple decision-making rules are converted into very complex problems. Identifying crowded locations, very active users or popular products does not require any machine learning component. Such a task requires having access to the data, cleaning it, and grouping it, followed by a pretty visualization built with chart generation tools. The need for something more complex comes when you are interested in predicting which location will be crowded at 5 pm, or which users are more likely to leave your platform.
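To make the point concrete, finding the most crowded locations is a counting exercise. The sketch below uses only Python's standard library; the check_ins records and their location and user names are hypothetical sample data.

```python
from collections import Counter

# Hypothetical check-in log: (location, user_id) pairs.
check_ins = [
    ("cafe", "u1"),
    ("park", "u2"),
    ("cafe", "u3"),
    ("gym", "u1"),
    ("park", "u4"),
    ("cafe", "u2"),
]

# Group by location and count the visits: no model, no training.
visits_per_location = Counter(location for location, _ in check_ins)
top_locations = visits_per_location.most_common(2)

print(top_locations)
```

The same grouping works for very active users or popular products by counting on a different field. The result can be handed directly to a charting tool.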
This tendency also appears when people assume that a simple text-to-speech feature requires a learning-based solution. Let me give you an example: you want your laptop to notify you in a gentle voice that your battery is running out, that you’ve got a new email, or that you forgot to save your work. This does not mean you need any kind of learning. You simply define pairs of action and text and use a text-to-speech app of your choice to voice the notification.
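A minimal sketch of such action and text pairs is shown here. The event names, the NOTIFICATIONS table and the notify function are hypothetical, and speak is a placeholder for whatever text-to-speech app or library you choose; it defaults to print so the example runs on its own.

```python
# Hypothetical mapping from an event (the action) to the text to be voiced.
NOTIFICATIONS = {
    "battery_low": "Your battery is running out.",
    "new_email": "You have a new email.",
    "unsaved_work": "You forgot to save your work.",
}


def notify(event, speak=print):
    """Look up the text for `event` and pass it to the `speak` callable."""
    if event not in NOTIFICATIONS:
        raise KeyError(f"No notification text defined for event: {event}")
    text = NOTIFICATIONS[event]
    speak(text)  # placeholder: replace with a call to your text-to-speech tool
    return text


notify("new_email")
```

Everything here is a fixed lookup. Nothing is learned from data, which is exactly why no ML component is needed.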
While I believe that your product would be more visible and attract more users when including state-of-the-art technologies, I also believe that the easier solution is better and faster, and that it saves money and resources.
There are two potential approaches to including new technologies in a business: create a custom strategy for the problem at hand, or try to find a problem to which existing solutions can be applied. I believe that the first brings more benefits and better results. Unfortunately, nowadays the latter is more frequent because of the need and eagerness to employ ML in every business. This frequently leads to unrealistic expectations and potential failures.
When a business decides it wants to employ state-of-the-art technologies, my recommendation is that it consult a Data Scientist or another specialist. This way, the business not only gets a better understanding of the capabilities of the technologies but is also assisted in finding the use cases that would benefit it most. At Tapptitude, we can help you both avoid over-complicating things around data and put the right smart data strategy in place.
Ioana Bărbănțan
Machine Learning & Data Scientist
Ioana Bărbănțan is a Machine Learning Engineer @Tapptitude. She has a passion for data, structure, and visual representations. Ioana got her Ph.D. in Computer Science and specialised in Machine Learning and Natural Language Processing. Her work @Tapptitude focuses on helping clients automate and optimize their data and processes and make their products smart.