According to Google, a unit of Alphabet Inc (NASDAQ:GOOG), natural language processing remains hard in part because syntactic nuances can change how a phrase or sentence is interpreted. For example, ‘Flights from New Delhi to Chennai’ and ‘Flights to Chennai from New Delhi’ are paraphrases: they mean the same thing. ‘Flights to New Delhi from Chennai’ uses exactly the same words but means something different, and even recent algorithms have struggled to tell such a snippet apart from the first two.
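To see why word order matters here, consider a minimal Python sketch of a bag-of-words comparison, which counts words and ignores their order. This is an illustration only, not a model Google evaluated; the helper names `bag_of_words` and `jaccard_overlap` are hypothetical and introduced just for this example.

```python
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence, split on whitespace, and count tokens.

    Word order is discarded, which is the point of the illustration.
    """
    return Counter(sentence.lower().split())


def jaccard_overlap(a, b):
    """Share of distinct words that the two sentences have in common."""
    words_a, words_b = set(bag_of_words(a)), set(bag_of_words(b))
    return len(words_a & words_b) / len(words_a | words_b)


s1 = "Flights from New Delhi to Chennai"
s2 = "Flights to Chennai from New Delhi"  # paraphrase of s1
s3 = "Flights to New Delhi from Chennai"  # not a paraphrase of s1

print(bag_of_words(s1) == bag_of_words(s2))  # True
print(bag_of_words(s1) == bag_of_words(s3))  # True
print(jaccard_overlap(s1, s3))               # 1.0
```

All three sentences look identical to an order-blind comparison, so any system that leans mostly on word overlap has no way to separate the paraphrase from the non-paraphrase.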
Paraphrase Adversaries from Word Scrambling
To address these difficulties, Google has unveiled PAWS (Paraphrase Adversaries from Word Scrambling), a data set in English. It has also introduced an extension, PAWS-X, covering six further languages: Japanese, German, Spanish, Korean, French, and Chinese.
Google said both PAWS and PAWS-X consist of well-formed sentence pairs, some of which are paraphrases and some of which are not. According to the company, training on the new data sets raises models’ accuracy at capturing word order and structure from below 50% to 89%.
The PAWS data set contains 108,463 human-labeled pairs in English, drawn from Wikipedia pages and the Quora Question Pairs corpus. The extension, PAWS-X, comprises 296,406 machine-translated training pairs and 23,659 human-translated PAWS evaluation pairs.
The PAWS sentence pairs are built to share most of their words. Google passed source sentences through a language model that generates word-swapped variants, which may or may not be paraphrases of the original. Human raters first judged whether each variant was grammatical, and a separate group of raters then judged whether the two sentences in a pair were paraphrases of each other. Because word swapping alone tends to yield non-paraphrases, Google also used back-translation, translating a sentence into another language and back again, to add pairs that keep the original meaning.
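The word-swapping step can be pictured with a toy Python sketch. It is a simplification under stated assumptions, not Google's pipeline: the article describes a language model choosing the swaps, whereas here the two phrases to swap are supplied by hand, and the function name `swap_phrases` is hypothetical.

```python
def swap_phrases(sentence, first, second):
    """Return the sentence with the first occurrences of two phrases swapped.

    Toy stand-in for model-driven word swapping: the caller picks the
    phrases, and the phrases must not overlap in the sentence.
    """
    i, j = sentence.find(first), sentence.find(second)
    if i == -1 or j == -1:
        raise ValueError("both phrases must occur in the sentence")
    if i > j:
        # Normalise so that `first` is the phrase that appears earlier.
        i, j = j, i
        first, second = second, first
    if i + len(first) > j:
        raise ValueError("phrases must not overlap")
    return (
        sentence[:i]
        + second
        + sentence[i + len(first):j]
        + first
        + sentence[j + len(second):]
    )


source = "Flights from New Delhi to Chennai"
variant = swap_phrases(source, "New Delhi", "Chennai")
print(variant)  # Flights from Chennai to New Delhi

# The label is left unset: in the process described above, human raters
# decide whether a pair is a paraphrase.
candidate_pair = {"sentence1": source, "sentence2": variant, "label": None}
```

The variant keeps every word of the source sentence but reverses the direction of travel, which is the kind of high-overlap, different-meaning pair the raters are then asked to label.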
Two models – BERT and DIIN
To check the impact of the new corpora on natural language processing accuracy, Google’s researchers trained two existing models, DIIN and BERT, on the data. BERT’s accuracy rose from 33.5% to 83.1%, and both models improved markedly over the baseline.
Yuan Zhang, a research scientist at Google, said he hopes the new data sets will help researchers make further progress on multilingual models. The aim is for models to cope with small perturbations in word order and still recover the correct meaning of a sentence.