NLP Approaches to Solving Text Mining Analysis Challenges

Nicolas Sacchetti 2023-11-06

Researchers are delving into new methodologies to extract data from the vast repository of information available on the web.

At the P4IE Measuring Metrics that Matter Conference hosted by 4POINT0, Pietro Cruciata, a PhD candidate in mathematics and statistics, presented his research on May 10, 2022.

The study, conducted in collaboration with Davide Pulizzotto, a research associate, and Catherine Beaudry, the director of research (all three are affiliated with Polytechnique Montréal), focused on developing innovation indicators from unstructured web-based text.

The team began by reviewing state-of-the-art studies on innovation. Previous researchers had successfully constructed innovation indicators from web pages using text mining, mainly keyword searches combined with various weighting schemes. The team therefore turned its attention to refining that methodology.
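To make the keyword-based baseline concrete, here is a minimal sketch in Python. The article does not say which keywords or weighting schemes the earlier studies used, so the keyword list and the simple term-frequency weight below are assumptions chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword list for an "innovation" indicator (an assumption,
# not the list used in the studies mentioned above).
KEYWORDS = {"innovation", "patent", "prototype"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_score(text, keywords=KEYWORDS):
    """Share of the page's tokens that are keywords (a term-frequency weight)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[k] for k in keywords)
    return hits / len(tokens)

page = "Our prototype led to a patent. Innovation drives our patent strategy."
score = keyword_score(page)  # 4 keyword hits out of 11 tokens
```

A score of this kind only counts surface forms, which is why it cannot tell apart two senses of the same word or detect a concept that is never named.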

Cruciata and his colleagues identified two significant limitations inherent in word-level analysis of web pages. The first is polysemy: words with multiple meanings, e.g., ‘bank’ as in a river bank vs. a financial institution. The second is semantic ellipsis: the omission of words that are conceptually implied, as when ‘join forces,’ ‘work with,’ or ‘joint venture’ all suggest collaboration without naming it.

To address these linguistic challenges, they explored two natural language processing (NLP) approaches: Word Sense Disambiguation (WSD), using the Lesk algorithm to tackle polysemy, and Information Retrieval (IR) to handle semantic ellipsis.
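The idea behind the Lesk algorithm can be sketched as follows: pick the sense whose dictionary definition shares the most words with the sentence around the ambiguous word. This is a simplified variant written for illustration, not the team's implementation; the two glosses for ‘bank’ and the stopword list are hand-written assumptions, where a real system would draw definitions from a dictionary such as WordNet.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "to", "and", "in", "on", "that",
             "or", "for", "at", "by", "is"}

# Toy sense inventory with hand-written glosses (an assumption).
SENSES = {
    "bank": {
        "river_bank": "sloping land beside a river or stream of water",
        "financial_bank": "institution that accepts deposits and lends money to customers",
    }
}

def content_words(text):
    """Return the set of lowercase tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(word, sentence, senses=SENSES):
    """Choose the sense whose gloss overlaps most with the sentence context.

    Ties (including zero overlap everywhere) go to the first sense listed.
    """
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

s1 = simplified_lesk("bank", "She deposits money at the bank every week")
s2 = simplified_lesk("bank", "They sat on the bank of the river watching the water")
# s1 is "financial_bank" (overlap: deposits, money)
# s2 is "river_bank" (overlap: river, water)
```

Because the context is handled as a set, shuffling the words of the sentence gives exactly the same answer, and a short or vague gloss leaves little to overlap with.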

The findings revealed that IR with a single-word query outperformed keyword searches alone. "It is a superior method for identifying concepts within sentences," remarked the PhD candidate. Notably, when IR with a single-word query was paired with keyword searches, the results improved significantly. This suggests that IR, used as a complement to keyword searches, can partially mitigate the problem of semantic ellipsis in text mining of web pages.
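The article does not describe the retrieval model or how the two signals were combined, so the following Python sketch rests on several assumptions: sentences are compared with a single query word through cosine similarity over word vectors, a sentence counts as a match above a hypothetical threshold of 0.8, and the combination is a simple logical OR with the keyword match. The two-dimensional vectors are invented for the example.

```python
import math
import re

# Toy 2-d word vectors, invented for illustration only
# (axis 0 ~ "partnership", axis 1 ~ "finance").
VECTORS = {
    "collaboration": (1.0, 0.0),
    "join": (0.9, 0.1),
    "forces": (0.8, 0.2),
    "partner": (0.9, 0.0),
    "bank": (0.0, 1.0),
    "loan": (0.1, 0.9),
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def sentence_vector(tokens):
    """Average the vectors of known tokens; None if no token is known."""
    known = [VECTORS[t] for t in tokens if t in VECTORS]
    if not known:
        return None
    return tuple(sum(axis) / len(known) for axis in zip(*known))

def cosine(u, v):
    """Cosine similarity between two 2-d vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0

def ir_match(query_word, sentence, threshold=0.8):
    """True if the sentence is close enough to the single-word query."""
    vec = sentence_vector(tokenize(sentence))
    return vec is not None and cosine(VECTORS[query_word], vec) >= threshold

def keyword_match(keywords, sentence):
    """True if any keyword appears literally in the sentence."""
    return any(t in keywords for t in tokenize(sentence))

def combined_match(query_word, keywords, sentence):
    """Keyword search complemented by the IR match (logical OR, an assumption)."""
    return keyword_match(keywords, sentence) or ir_match(query_word, sentence)

sentences = [
    "We join forces with a local university.",
    "Our collaboration agreement was signed.",
    "The bank approved the loan.",
]
keyword_hits = [keyword_match({"collaboration"}, s) for s in sentences]
combined_hits = [combined_match("collaboration", {"collaboration"}, s) for s in sentences]
# keyword_hits  -> [False, True, False]
# combined_hits -> [True, True, False]
```

In this toy setup the keyword search misses the first sentence, which implies collaboration without naming it, while the combined match recovers it and still rejects the unrelated third sentence.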

In their final remarks, the researchers acknowledged that the Lesk algorithm underperformed, owing to inadequate disambiguation in definitions and to its disregard for word order within the set of words. Nevertheless, they believe that "this strategy warrants further investigation to develop more reliable innovation indicators derived from unstructured text."

This content was last updated on 2023-11-06 at 7:41 p.m.