Companies across industries are adopting Artificial Intelligence to scale up and improve their business operations. Advances in Deep Learning are helping to drive business success in fields from e-commerce to national security. Data is the most important ingredient in the recipe for a successful AI model. Unlike traditional software, where the behaviour is written out explicitly in code, the outcome of an AI algorithm depends heavily on the data used to train it, because it infers results from what it has been trained on.
It’s quite similar to teaching a young child. When a toddler sees an Alaskan husky, her parents help her to identify it as a “dog”. Now the toddler has a word for the four-legged furry thing, which she can use to make sense of its movement and its behaviour. But what happens when the toddler comes across a cat? She may very well assume it to be a dog too. Here, the parents will help her to understand that a cat, while four-legged and furry, behaves quite differently from the concept of “dog”. This feedback mechanism helps the toddler build up a recognition framework. There may still be edge cases, for example where a very furry small dog is mistaken for a cat, until it makes a sound. The sound is an additional feature, extracted from the data, that helps tell the two apart.
In supervised learning, machines learn from labelled examples. In Computer Vision, the machine is taught to identify everyday objects like chairs, tables, and pillars in a room, or cars, pedestrians, and pavement on the road. Each training sample needs an “ideal answer”, also known as “ground truth”, associated with it, so that the machine can build a feedback loop and improve its answers. Associating the ground truth with the data is called labelling, and it relies on human specialists. The input these specialists provide is called human judgement. The concept also applies to other types of data. For Natural Language Processing, machines need to be taught the difference between “That chicken burger was so bad” and “I want a chicken burger so bad”. Though both sentences share several words, they mean totally different things. Hence machines need to be trained on a large volume of meticulously labelled data. This is where humans step in to parent the machine-learning model.
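To make the chicken burger example concrete, here is a minimal Python sketch. It measures how many words the two sentences share, and shows that only the human-supplied label tells them apart. The label names ("negative", "craving") are hypothetical, chosen for illustration, and word overlap here is a simple Jaccard score rather than anything a production NLP system would rely on.

```python
# Two labelled training samples. The labels are hypothetical examples of
# the "ground truth" a human labeller would attach to each sentence.
samples = [
    ("That chicken burger was so bad", "negative"),
    ("I want a chicken burger so bad", "craving"),
]

def words(sentence):
    """Lower-case a sentence and return its set of distinct words."""
    return set(sentence.lower().split())

first, second = (words(text) for text, _ in samples)

shared = first & second                       # words present in both sentences
overlap = len(shared) / len(first | second)   # Jaccard similarity, 0.0 to 1.0

print(sorted(shared))
print(round(overlap, 2))
print([label for _, label in samples])
```

Roughly half the vocabulary is shared, yet the labels differ. A model that looked only at which words appear would struggle here, which is why the labelled ground truth matters.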
To a machine, an image is simply a series of pixels. Labelled images show the machine that certain collections of pixels correspond to certain semantic objects (like a lamppost or a truck). The images are labelled by data experts, or “Humans in the Loop”. Labelling experts perform semantic segmentation on hundreds of street images every day. They assign the elements in each image to predetermined classes of objects, ultimately dividing the image into semantically meaningful parts. Similarly, in NLP, humans in the loop perform named entity recognition, sentiment analysis, and speech-to-text validation to help bolster the machine learning.
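As a rough illustration of what a semantic segmentation label looks like, the Python sketch below pairs a tiny made-up image with a label mask of the same size. The class list, pixel values, and the helper `class_pixel_counts` are all assumptions for the sake of the example; real projects define their own classes and use dedicated labelling tools.

```python
# Hypothetical class list; a real project defines its own label taxonomy.
CLASSES = ["road", "truck", "lamppost"]

# A tiny 3x4 "image": to the machine, just rows of (R, G, B) pixel values.
# The values are invented for illustration.
image = [
    [(90, 90, 90), (90, 90, 90), (200, 30, 30), (200, 30, 30)],
    [(90, 90, 90), (90, 90, 90), (200, 30, 30), (200, 30, 30)],
    [(90, 90, 90), (40, 40, 40), (90, 90, 90), (90, 90, 90)],
]

# The human-drawn mask has the same height and width as the image.
# Each entry is an index into CLASSES, one label per pixel.
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 0, 0],
]

def class_pixel_counts(mask, classes):
    """Count how many pixels the labeller assigned to each class."""
    counts = {name: 0 for name in classes}
    for row in mask:
        for class_index in row:
            counts[classes[class_index]] += 1
    return counts

print(class_pixel_counts(mask, CLASSES))
```

The image alone carries no meaning; the mask is what tells the model that the red block of pixels is a truck and the single dark pixel is a lamppost.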
Without human judgement, such data is opaque and cannot be used to train machine-learning algorithms. Humans also audit the results of an algorithm, to ensure it isn’t going off-track. Human nuance combines with machine scale to create a machine learning solution. This reliance on humans is a lesser-known aspect of machine learning and can come as a surprise to new practitioners.
Data labelling is an increasingly specialised service. In the past, machine learning efforts relied on the data scientists themselves, or a few interns, to perform the labelling. Today, companies must plan for scalable and secure data pipelines that ensure consistent, high-quality labels for millions of data points. Scientists must be able to iterate rapidly on training experiments, adding or removing features to get better results. More and more nuanced categories of data need to be labelled. Diversity in the labelling workforce can also help create a more rounded input data set in very subjective scenarios.
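One common way to work towards consistent labels is to have several labellers annotate the same item and keep the majority answer, sending low-agreement items back for review. The Python sketch below shows the idea under stated assumptions: the item IDs, the votes, the function names, and the 0.6 agreement threshold are all hypothetical, not a description of any particular vendor's pipeline.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label and the share of votes it received."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def consolidate(annotations, min_agreement=0.6):
    """Split items into accepted labels and items needing human review.

    min_agreement is an assumed threshold; tune it to the task.
    """
    accepted, needs_review = {}, []
    for item_id, votes in annotations.items():
        label, agreement = majority_label(votes)
        if agreement >= min_agreement:
            accepted[item_id] = label
        else:
            needs_review.append(item_id)
    return accepted, needs_review

# Invented votes from three labellers per item.
annotations = {
    "review_1": ["negative", "negative", "negative"],
    "review_2": ["positive", "negative", "positive"],
    "review_3": ["positive", "negative", "neutral"],
}

accepted, needs_review = consolidate(annotations)
print(accepted)
print(needs_review)
```

Items where labellers disagree, like `review_3` here, are often the subjective cases where a diverse workforce and clearer instructions make the most difference.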
To successfully choose, pilot and implement machine learning within your company, you have to ask some key questions before you deploy a highly paid Machine Learning team. First, where is the data? Do you have proprietary data or are you going to use public datasets? Will your choice create enough accuracy and differentiation in the problem you set out to solve? Next, how will you pilot and scale your data labelling and auditing efforts? Do you have a reliable vendor who can grow with your needs? Today’s algorithms can deliver increasingly high accuracy when trained on larger and larger data sets. Do you have the necessary budget set aside to handle data labelling at scale, including version management and tool integration? Do you require domain expertise, or can you work with labellers who are trained using instructions from you? How will you manage change as your labelling requirements evolve? Larger companies are now defining data pipeline managers whose role is to consolidate and streamline external data labelling efforts for various data teams within the organisation. This is a sign that the discipline is being addressed with the seriousness it requires. Make friends with your training data. It will repay you in spades.