Anomaly detector based on a multimodal neural network

Modern computer vision evolves quickly. Neural networks that can recognise and learn visual concepts without direct human supervision are at the cutting edge of the field.

One of the most efficient and widely used networks is CLIP (Contrastive Language–Image Pretraining). Our expert, Kirill Starkov, Senior Machine Learning Engineer, explains the peculiarities of the CLIP model and shares his experience with it.

Why do we use CLIP?

CLIP is a unique neural network: it was trained on hundreds of millions of image–text pairs collected from the Internet. It is a multimodal model because it represents the same concept in two modalities: as an image and as text (a description). ‘CLIP can be instructed in natural language to perform a wide variety of classification tasks without being optimised directly for any of them. In computer vision this is very useful, because we can specify the object to detect without developing a separate detector.’
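To illustrate this zero-shot classification idea, here is a minimal sketch of the matching step. The prompts, toy 2-D vectors, and logit scale stand in for real CLIP embeddings (which would come from a CLIP text/image encoder); they are illustrative assumptions, not values from the project.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Pick the prompt whose text embedding is most similar to the image embedding.

    In a real pipeline both embeddings come from a CLIP encoder; here we use
    plain unit-normalised vectors so the matching logic is self-contained.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb            # cosine similarity per prompt
    logits = 100 * sims                     # CLIP's learned logit scale is ~100
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy 2-D embeddings standing in for CLIP encoder outputs (hypothetical values)
prompts = ["a photo of fire", "a photo of smoke", "a photo of a street"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
image_emb = np.array([0.9, 0.1])            # "resembles" the fire prompt

idx, probs = zero_shot_classify(image_emb, text_embs)
print(prompts[idx])  # → a photo of fire
```

Adding a new class is just adding a new prompt string and its text embedding; no retraining is involved.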

How to work with CLIP?

Deploying the CLIP model in computer vision requires preparing the right prompts. ‘Specific prompts such as “fire”, “smoke”, “a lying person” or “a man in red clothes” are used to generate text embeddings. The visual embeddings produced by CLIP can then be compared with the prompt embeddings to determine whether an image matches a given description.’ The CLIP model is therefore quite scalable and universal: it can be adapted to rare detection tasks with prompts and visual embeddings alone, which saves clients money, compute costs, and time.
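The comparison step described above can be sketched as a simple similarity threshold over the anomaly prompts. The 4-D placeholder embeddings and the 0.25 threshold are illustrative assumptions (real embeddings would come from a CLIP encoder, and the threshold is tuned per deployment):

```python
import numpy as np

# Anomaly prompts from the article; their CLIP text embeddings would be
# computed once with the text encoder and cached.
ANOMALY_PROMPTS = ["fire", "smoke", "a lying person", "a man in red clothes"]

def is_anomaly(image_emb, prompt_embs, threshold=0.25):
    """Flag a frame whose image embedding matches any anomaly prompt closely enough."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = prompt_embs @ image_emb                  # cosine similarity per prompt
    best = int(np.argmax(sims))
    return bool(sims[best] >= threshold), ANOMALY_PROMPTS[best], float(sims[best])

# Hypothetical 4-D embeddings standing in for real CLIP outputs
prompt_embs = np.eye(4)
frame_emb = np.array([0.10, 0.95, 0.00, 0.05])      # a frame resembling "smoke"

flag, label, score = is_anomaly(frame_emb, prompt_embs)
print(flag, label)  # → True smoke
```

Because only the prompt list and the threshold change between use cases, the same pipeline can serve very different detection tasks.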

CLIP model for CV

‘We decided to use a CLIP model to detect fire and smoke in images and to improve detection results in both day and night vision modes. Our client needed a dedicated fire and smoke detector, and these events are very hard to recognise in the dark. We faced some complications caused by inconvenient detector placement, but still achieved a significant improvement: deploying the CLIP anomaly detector with suitable preprocessing added 15% to accuracy.’

Conclusions

CLIP is a great option for companies with limited budgets, because adding a new prompt costs nothing: any client can write one and the system adapts to it. Compared with models trained directly for a specific task, CLIP-based models still leave room for improvement. Their main disadvantage is bias, because CLIP was trained on unfiltered web data, and they cannot yet compete with purpose-trained detectors and surveillance models. But they do enable bespoke, niche surveillance use cases for which no well-tailored models or datasets exist.