Web Mining to Collect Data

by Nicolas Sacchetti

Mikael Héroux-Vaillancourt, who obtained his Ph.D. earlier this year in Technology and Innovation Management within the CDC-Innov Chair at Polytechnique Montréal, emphasizes the potential of data science methods for research on firms' innovation.

His presentation was part of the May 9th P4IE Pre-Conference on Measuring Metrics that Matter, hosted by 4POINT0.

Héroux-Vaillancourt outlines the drawbacks of the traditional data collection methods used in company studies, such as surveys and databases from national statistics offices (NSOs) like StatCan or from agencies such as the USPTO. He highlights the inherent problems with surveys: low response rates and numerous methodological biases, including demand characteristics, extreme responding, social desirability, and selection and non-response biases. Accessing data from NSOs is also often difficult.

Web mining emerges as a potent alternative. Some of the information that researchers seek may be found in companies' web content, which is freely accessible, abundant, mostly current, and available at any time. Nonetheless, Héroux-Vaillancourt identifies a significant limitation: web data is unstructured. He points out that content varies widely from one website to another, and that because company websites are marketing tools, what firms report about themselves there tends to be heavily biased. « Web-mining indicators could unearth new insights that traditional methodologies have missed, potentially complementing existing data sources, » suggests Mikael Héroux-Vaillancourt, Data Scientist.
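To make the "unstructured" limitation concrete, here is a minimal sketch of a first step a web-mining pipeline might take: pulling the visible text out of a page's HTML. It is an illustration only, not the method used by Héroux-Vaillancourt. The class name, the function name, and the sample page (a fictional "Acme Robotics") are assumptions made for this example, and only Python's standard library is used.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes while skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html):
    """Return the human-readable text of an HTML document as one string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Fictional page used only for illustration.
sample_html = """
<html><head><title>Acme Robotics</title><style>body{color:red}</style></head>
<body><h1>About us</h1><p>We invest in <b>research</b> and development.</p>
<script>track();</script></body></html>
"""
print(extract_visible_text(sample_html))
```

Even after this step, the result is free-form marketing prose rather than a tidy table of variables, which is the difficulty described above.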

In the 2020 study « Using Web Content Analysis to Create Innovation Indicators – What Do We Really Measure? », co-authored with Catherine Beaudry and Constant Rietsch, Héroux-Vaillancourt and his colleagues propose that:

« Indicators derived from textual data on websites could pave the way for novel metrics. This could extend to exploring other website elements, such as the use of colors, images, and illustrations; audio and video content; web design choices; adoption of cutting-edge web technologies; update frequencies; and visitor interactions via calls to action—along with many other untapped data potentials. »

— (Héroux-Vaillancourt et al., 2020)
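As a rough illustration of what an indicator derived from textual data could look like, the sketch below counts how often a set of keywords appears per 1,000 words of website text. This is an assumed, deliberately simple formulation, not the indicator built in the 2020 study. The keyword list, the function name, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword list, chosen for illustration only.
RD_KEYWORDS = {"research", "development", "patent", "prototype", "laboratory"}


def keyword_indicator(text, keywords):
    """Return the number of keyword occurrences per 1,000 words of text."""
    # Lowercase the text and keep runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    hits = sum(counts[keyword] for keyword in keywords)
    return 1000 * hits / len(words)


# Fictional website text used only for illustration.
sample_text = (
    "Our laboratory turns research into products. "
    "Research drives every prototype we build."
)
score = keyword_indicator(sample_text, RD_KEYWORDS)
print(round(score, 1))
```

Normalizing by text length keeps large and small websites comparable. Because websites are written to promote the firm, a score like this is better read as a measure of how much a company talks about a topic than as direct evidence of what it does.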

This content was updated on 2023-11-19 at 3:21 p.m.