Search This Blog

Powered by Blogger.

Blog Archive

Labels

Showing posts with label RETVec. Show all posts

Google Introduces RETVec: Gmail’s New Defense to Identify Spams


Google has recently introduced a new multilingual text vectorizer called RETVec (an acronym for Resilient and Efficient Text Vectorizer), to aid identification of potentially malicious content like spam and fraudulent emails in Gmail. 

While massive platforms like YouTube and Gmail use text classification models to identify frauds, offensive remarks, and phishing attempts, threat actors are known to create counter-strategies to get around these security mechanisms. 

The project description on GitHub reads, "RETVec is trained to be resilient against character-level manipulations including insertion, deletion, typos, homoglyphs, LEET substitution, and more."

"The RETVec model is trained on top of a novel character encoder which can encode all UTF-8 characters and words efficiently."

The Google-sponsored platforms reveal that they have been using Adversarial text manipulations, such as the usage of homoglyphs, keyword stuffing, and invisible characters. 

With its out-of-the-box support for over 100 languages, RETVec seeks to contribute to developing more robust and computationally affordable server-side and on-device text classifiers that are more durable and effective. 

In natural language processing (NLP), vectorization is a technique that maps words or phrases from a lexicon to a matching numerical representation for use in sentiment analysis, text classification, and named entity recognition, among other analyses. 

Google’s anti-abuse researchers Elie Bursztein and Marina Zhang note in the Google Security blog that, “due to its novel architecture, RETVec works out-of-the-box on every language and all UTF-8 characters without the need for text preprocessing, making it the ideal candidate for on-device, web, and large-scale text classification deployments." 

Google further notes that incorporating vectorizer into Gmail has really helped in detecting spam, with the detection rate escalating over the baseline by 38%. Also, the false positive rate has declined by 19.4%. 

Moreover, vectorization has also reduced the model's Tensor Processing Unit (TPU) usage by 83%. 

"Models trained with RETVec exhibit faster inference speed due to its compact representation. Having smaller models reduces computational costs and decreases latency, which is critical for large-scale applications and on-device models," Bursztein and Zhang added. 

Spams are the most popular attack vector in the virtual space, used by almost every cybercriminal. The popularity comes with its convenience of being omnipresent, cheap, and efficient, enabling cybercriminals to transfer malware and access sensitive data from targeted systems.