The AutoML Dilemma. An Infrastructure Engineer’s… | by Haifeng Jin | Sep, 2023

We learned where we are now and where we are going with AutoML. The question is how we are getting there. We summarize the problems we face today into three categories. When these problems are solved, AutoML will reach mass adoption.

Problem 1: Lack of business incentives

Modeling is trivial compared with developing a usable machine learning solution, which may include, but is not limited to, data cleaning, verification, and monitoring. For any company that can afford to hire people to do all these steps, the cost overhead of hiring machine learning experts to do the modeling is trivial. When a company can build a team of experts without much cost overhead, it does not bother experimenting with new techniques like AutoML.

So, people will only start to use AutoML once the costs of all the other steps have been driven down to a minimum. That is when the cost of hiring people for modeling becomes significant. Now, let’s look at our roadmap towards this.

Many steps can be automated. We should be optimistic that as the cloud services evolve, many steps in developing a machine learning solution could be automated, like data verification, monitoring, and serving. However, there is one crucial step that can never be automated, which is data labeling. Unless machines can teach themselves, humans will always need to prepare the data for machines to learn.

Data labeling may end up being the main cost of developing an ML solution. If we can reduce the cost of data labeling, companies will have a business incentive to use AutoML to remove the modeling cost, which would then be the only remaining cost of developing an ML solution.

The long-term solution: Unfortunately, the ultimate solution to reduce the cost of data labeling does not exist today. We will rely on future research breakthroughs on “learning with small data”. One possible path is to invest in transfer learning.

However, people are not interested in working on transfer learning because it is hard to publish on this topic. For more details, you can watch the video “Why most machine learning research is useless”.

The short-term solution: In the short term, we can simply fine-tune large pretrained models with small data, which is a simple form of transfer learning and of learning with small data.
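To make the idea concrete, here is a deliberately tiny sketch of fine-tuning in plain Python. Everything in it is an assumption made for illustration: `pretrained_features` is a hypothetical stand-in for a frozen pretrained backbone, and the six labeled points stand in for a small labeled dataset. Only the small head on top of the frozen features is trained, which is why so little labeled data is needed.

```python
import math

def pretrained_features(x):
    """Hypothetical frozen backbone: its 'weights' are never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only a logistic-regression head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = pretrained_features(x)
            z = w[0] * f[0] + w[1] * f[1] + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of class 1
            g = p - y                        # gradient of the log loss w.r.t. z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = pretrained_features(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0

# A tiny labeled dataset: six examples are enough to fit the head.
small_x = [(1.0, 0.5), (0.8, 0.9), (-1.0, -0.4), (-0.7, -0.9), (0.6, 0.2), (-0.3, -0.8)]
small_y = [1, 1, 0, 0, 1, 0]

w, b = train_head(small_x, small_y)
print([predict(w, b, x) for x in small_x])
```

In a real system the backbone would be a large pretrained network and the head a few trainable layers, but the division of labor is the same: the expensive part is reused, and only a small part is fitted to the new labels.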

In summary, once most of the steps in developing an ML solution are automated by cloud services, and AutoML can use pretrained models to learn from smaller datasets and so reduce the data labeling cost, companies will have a business incentive to apply AutoML to reduce their ML modeling cost.

Problem 2: Lack of maintainability

Deep learning models are not fully reliable. Their behavior is sometimes unpredictable, and it is hard to understand why a model gives specific outputs.

Engineers maintain the models. Today, we need an engineer to diagnose and fix the model when problems occur. The company goes to the engineers for anything it wants to change in the deep learning model.

The AutoML system is much harder to interact with than an engineer. Today, you can only use it as a one-shot method to create the deep learning model by giving the AutoML system a series of objectives clearly defined in math in advance. If you encounter any problem using the model in practice, it will not help you fix it.

The long-term solution: We need more research in HCI (Human-Computer Interaction). We need a more intuitive way to define the objectives so that the models created by AutoML are more reliable. We also need better ways to interact with the AutoML system to update the model to meet new requirements or fix any problems without spending too many resources searching through all the different models again.

The short-term solution: Support more objective types, like FLOPs and the number of parameters to limit the model size and inference time, and a weighted confusion matrix to deal with imbalanced data. When a problem occurs in the model, people can add a relevant objective to the AutoML system and let it generate a new model.
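As a rough sketch of what such objectives could look like, the snippet below picks a model under parameter and FLOPs budgets and ranks the feasible ones by a weighted confusion-matrix cost. The `Candidate` class, the function names, and all the numbers are hypothetical and made up for illustration; they are not the API of any existing AutoML system.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    num_params: int
    flops: int
    confusion: list  # confusion[i][j]: count of true class i predicted as class j

def weighted_confusion_cost(confusion, cost):
    """Average per-sample cost, weighting each confusion cell by cost[i][j]."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total

def select_model(candidates, cost, max_params, max_flops):
    """Keep candidates within both budgets, then return the cheapest by weighted cost."""
    feasible = [
        c for c in candidates
        if c.num_params <= max_params and c.flops <= max_flops
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: weighted_confusion_cost(c.confusion, cost))

# Imbalanced binary task: class 1 is rare, and missing it costs 10x a false alarm.
cost = [[0, 1], [10, 0]]
candidates = [
    Candidate("large", 50_000_000, 8_000_000_000, [[940, 10], [5, 45]]),
    Candidate("medium", 5_000_000, 600_000_000, [[900, 50], [10, 40]]),
    Candidate("small", 1_000_000, 100_000_000, [[945, 5], [30, 20]]),
]

best = select_model(candidates, cost, max_params=10_000_000, max_flops=1_000_000_000)
print(best.name)
```

Here the "large" candidate is ruled out by the budgets. The "small" candidate has higher plain accuracy than "medium" (965 versus 940 correct out of 1000), but it misses far more of the rare class, so the weighted objective prefers "medium". Adding or changing an objective changes which model comes out, without anyone touching the model code.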

Problem 3: Lack of infrastructure support

When developing an AutoML system, we found some features we need from the deep learning frameworks that just do not exist today. Without these features, the power of the AutoML system is limited. They are summarized as follows.

First, state-of-the-art models with flexible unified APIs. To build an effective AutoML system, we need a large pool of state-of-the-art models to assemble the final solution. The model pool needs to be updated regularly and well-maintained. Moreover, the APIs to call the models need to be highly flexible and unified so we can call them programmatically from the AutoML system. They are used as building blocks to construct an end-to-end ML solution.

To solve this problem, we developed KerasCV and KerasNLP, domain-specific libraries for computer vision and language processing tasks built upon Keras. They wrap the state-of-the-art models into simple, clean, yet flexible APIs, which meet the requirements of an AutoML system.

Second, automatic placement of the models. The AutoML system may need to build and train large models distributed across multiple GPUs on multiple machines. An AutoML system should be runnable on any given amount of computing resources, which requires it to dynamically decide how to distribute the model (model parallelism) or the training data (data parallelism) for the given hardware.
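The kind of decision involved can be illustrated with a toy planner. This is only a sketch under strong assumptions: `plan_placement` is a hypothetical helper, it looks at parameter memory alone, and it ignores activations, optimizer state, and communication cost, all of which a real placement system would have to consider.

```python
def plan_placement(model_bytes, gpu_memory_bytes, num_gpus):
    """Choose a parallelism strategy from model size and available GPUs.

    Toy heuristic: split the model over as few GPUs as it needs (shards),
    then use the remaining GPUs for extra copies of the model (replicas).
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    # Ceiling division: the minimum number of GPUs that can hold one model copy.
    shards = -(-model_bytes // gpu_memory_bytes)
    if shards > num_gpus:
        raise ValueError("model does not fit on the available GPUs")
    replicas = num_gpus // shards
    if shards == 1:
        strategy = "data_parallel" if replicas > 1 else "single_device"
    else:
        strategy = "hybrid" if replicas > 1 else "model_parallel"
    return {"strategy": strategy, "shards": shards, "replicas": replicas}

GIB = 1024 ** 3
print(plan_placement(4 * GIB, 16 * GIB, 8))   # small model, many GPUs
print(plan_placement(40 * GIB, 16 * GIB, 8))  # model too big for one GPU
print(plan_placement(40 * GIB, 16 * GIB, 4))  # same model, fewer GPUs
```

The point of the sketch is that the plan is a function of the hardware: the same model definition gets a different plan when the number of GPUs changes, and the caller never hard-codes a device for any tensor.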

Surprisingly and unfortunately, none of the deep learning frameworks today can automatically distribute a model on multiple GPUs. You will have to explicitly specify the GPU allocation for each tensor. When the hardware environment changes, for example, when the number of GPUs is reduced, your model code may no longer work.

I do not see a clear solution for this problem yet. We must allow some time for the deep learning frameworks to evolve. Some day, the model definition code will be independent from the code for tensor hardware placement.

Third, the ease of deployment of the models. Any model produced by the AutoML system may need to be deployed downstream to cloud services, end devices, and so on. Suppose you still need to hire an engineer to reimplement the model for specific hardware before deployment, which is most likely the case today. Then why not have the same engineer implement the model in the first place instead of using an AutoML system?

People are working on this deployment problem today. For example, Modular created a unified format for all models and integrated all the major hardware and deep learning frameworks into this representation. When a model is implemented with a deep learning framework, it can be exported to this format and become deployable to the hardware supporting it.

Despite all the problems we discussed, I am still confident in AutoML in the long run. I believe these problems will be solved eventually because automation and efficiency are the future of deep learning development. Although AutoML has not been widely adopted today, it will be, as long as the ML revolution continues.
