Applying Machine Learning Models in a Media Company

First, let’s take a look at what is important for our use cases:

The business questions raised in the media sector call for solutions that address the context and semantic understanding of the data rather than purely perceptual recognition. In other words, the intuitive message or interpretation of media content is often more important for the business than the reliable detection of objects, people or actions. Detecting cars and people on a street can describe a harmless scene in a video; detecting that an accident happened on that street does not. Holding a knife while cutting tomatoes in a cooking show is harmless, while using a knife as a weapon is not. Identifying the topic of a video as “automotive” can seem like useful information for placing a contextual advertisement. However, knowing whether the video is about an accident is what matters when you select which ad fits better.

Therefore, the context and emotional message of a scene make the difference in many media applications. However, solutions that combine video modalities in a way that yields more semantic meaning are still at an early stage. In addition, the required semantics are usually already lost at the level of labelling or annotation: it is much more likely that a scene is annotated with “luggage”, “people” and “plane” than with “travelling”, for example. Higher-level concepts such as “arguing”, “threatening” or “celebrating” are hardly ever represented in labels and are therefore generally not learned by models.

Besides the lack of context and higher-level concepts in existing AI multimedia solutions, there is another challenge: the data these solutions are designed for differs from our own. Speech recognition solutions, for instance, work quite well for news shows. But as soon as you try them on a TV show with street interviews, plenty of background noise and unprepared speech, or on a Germany’s Next Topmodel episode where models talk over each other, you start noticing the limitations. Most NLP (Natural Language Processing) approaches are designed for long-format or social media texts rather than for transcripts produced by ASR (Automatic Speech Recognition) systems for the German language. ImageNet is an amazing dataset for training algorithms, and there are already very good networks trained on it, but many of its classes are useful only in very limited contexts and for very specific businesses. Similarly, the datasets used for action recognition are far from the actions appearing in our media videos (e.g., multiple simultaneous actions, longer timespans).

For all the reasons mentioned above, when developing our solutions we focus on bridging the gap between the classical solutions that exist for multimedia content mining and the way the information is needed by our businesses. This part is as difficult as you can imagine.

So, how do we do all this? We sometimes focus on adding value on top of classical image processing solutions applied to our video content (as described in the previous article on automated tagging). Other times, we combine different modalities (text, audio, image, speech) to infer context, or we develop and train our own models using one or more modalities.
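
To make the modality-combination idea concrete, here is a minimal sketch of one common pattern, late fusion: per-modality classifiers each emit scores over the same topics, and the scores are merged by a weighted average. The function name, weights and numbers are purely illustrative, not our production setup.

```python
# Hypothetical late-fusion sketch: merge per-modality topic scores.
# All names, weights and numbers here are illustrative.
import numpy as np

def fuse_modalities(scores: dict[str, np.ndarray],
                    weights: dict[str, float]) -> np.ndarray:
    """Weighted average of per-modality probability vectors (same topic order)."""
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * scores[m] for m in scores) / total

# e.g., each modality model outputs probabilities over the same three topics
fused = fuse_modalities(
    scores={"speech": np.array([0.7, 0.1, 0.2]),
            "image":  np.array([0.4, 0.5, 0.1])},
    weights={"speech": 0.6, "image": 0.4},
)
# fused -> array([0.58, 0.26, 0.16])
```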

When we develop our own models, we also need to consider that our data is unstructured and heterogeneous. Developing generic solutions that fit all content is nearly impossible, but we try to generalize as much as possible.

To give you a deeper insight into how we develop in-house solutions, I will describe our approach to topic detection in videos. First, getting this type of information from a video relies a lot on context. The spoken information in a video carries far more semantic meaning than just the objects you can identify in an image. For this reason, we decided to employ an ASR system for the German language and build our models on transcripts of speech. A first and crucial step is labelling the data. Together with our business colleagues, we defined a set of topics (e.g., healthy living, travel, events and attractions) with clear definitions of when a video gets labelled with one or more of them. Yes, we set our problem in the space of multi-class, multi-label approaches.
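
As a concrete illustration of the multi-class, multi-label setup: each clip can carry any subset of the defined topics, which is typically encoded as a binary indicator matrix. A minimal sketch with scikit-learn (the topic names come from the examples above; the clips are made up):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up clips, each labelled with one or more of the defined topics.
clip_topics = [
    {"travel"},
    {"healthy living", "events and attractions"},
    {"travel", "events and attractions"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(clip_topics)
# mlb.classes_ -> ['events and attractions', 'healthy living', 'travel']
# Y: one row per clip, one column per topic, 1 where the label applies.
```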

Now, for building our model, we started with preprocessing the transcripts and choosing the feature representation, followed by training the models. Working with transcripts is challenging since they do not respect the norms of written text: there is no punctuation, there are errors in the transcribed words, and they are structured not in sentences but in breath groups (i.e., words pronounced between breath intakes). For this reason, we needed to do some preprocessing such as stopword removal (e.g., also, auf, das, er, es) and lemmatization (i.e., reducing words to their base form). For the feature representation, we chose to train our own word2vec embeddings. The sketches below illustrate what these two steps can look like.

Moving on to the models: since we have to solve a multi-class, multi-label problem, we evaluated what works best for us: multi-label approaches like PowerSet (i.e., treating each label combination as a unique label) and BinaryRelevance (i.e., transforming the multi-label learning task into several independent binary learning tasks, one per class label), using weighted SVMs (Support Vector Machines) as the underlying ML algorithm. Our dataset is on the order of thousands of labelled video clips. We kept a held-out test set of around 10% and used cross-validation with grid search to train and validate the final models. Our objective function combines precision (i.e., how many of our predictions are correct) and recall (i.e., how many of the labels we were supposed to find we actually predicted), with more importance given to precision: our business colleagues want to be sure that when we say a video is about food, it really is, while missing a couple of videos that contain food is not as problematic. A third sketch below shows one way such a setup can be wired together.

Many things are considered in building and combining the models for the end solution, but to keep this short, I will not go into more detail. I would, however, like to mention one of the main challenges we had to face, one that I think occurs quite often in the industry: dealing with a highly imbalanced dataset. To tackle this, we focused on ML techniques tailored to imbalanced classes, as well as on improving the labelling process to target content for which we needed more samples by designing a semi-automatic labelling solution.
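
First, the preprocessing step. Here is a minimal sketch using spaCy's German pipeline; the article text does not name a specific toolchain, so the library choice is an assumption on my part.

```python
# Minimal transcript preprocessing sketch: German stopword removal + lemmatization.
# spaCy is an assumed toolchain; requires: python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")

def preprocess(transcript: str) -> list[str]:
    """Lemmatize an ASR transcript and drop stopwords and punctuation."""
    doc = nlp(transcript)
    return [tok.lemma_.lower()
            for tok in doc
            if not tok.is_stop and not tok.is_punct]

tokens = preprocess("er hat auch das Auto auf der Straße gesehen")
# -> roughly ['auto', 'straße', 'sehen']
```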
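
Second, the feature representation. Training custom word2vec embeddings could look like the following gensim sketch; the hyperparameters and the averaging of word vectors into a clip-level feature are illustrative assumptions, not the tuned production setup.

```python
# Sketch: train word2vec on preprocessed transcripts and build clip features.
# Hyperparameters and the averaging scheme are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec

# One token list per clip (output of the preprocessing step above; made up here).
corpus = [
    ["auto", "straße", "sehen"],
    ["tomate", "schneiden", "kochen"],
]

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)

def clip_vector(tokens: list[str]) -> np.ndarray:
    """Average the word vectors of a clip's tokens into one feature vector."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```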
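
Third, a hedged sketch of how the BinaryRelevance variant with weighted SVMs and a precision-leaning objective can be wired together in scikit-learn. OneVsRestClassifier plays the role of BinaryRelevance (one binary SVM per topic), class_weight="balanced" gives the weighted SVMs, and an F0.5 score weighs precision more heavily than recall. The synthetic data, parameter grid and averaging choice are assumptions, not the exact production configuration.

```python
# Sketch: binary relevance (one weighted SVM per topic) tuned by grid search
# against an F0.5 score, which favors precision over recall (beta < 1).
# Data, grid and averaging choices are illustrative assumptions.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 100))            # e.g., 60 clips x 100-dim features
Y = (rng.random((60, 3)) > 0.6).astype(int)   # 3 topics, multi-label targets

clf = OneVsRestClassifier(SVC(kernel="linear", class_weight="balanced"))
scorer = make_scorer(fbeta_score, beta=0.5, average="micro")

grid = GridSearchCV(clf,
                    param_grid={"estimator__C": [0.1, 1, 10]},
                    scoring=scorer, cv=5)
grid.fit(X, Y)
print(grid.best_params_, grid.best_score_)
```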

I hope this article gave you an idea of the way we develop ML solutions. We have the advantage of building AI solutions for multimedia content with the data right at hand. Not only that, we also have the experts right by our side who handle this data on a daily basis and who have built up, over years of experience, an amount of know-how that AI solutions cannot compete with. But AI can help. We can understand the pain points and build models that make the daily work of our businesses better.
