A joint effort between Princeton University’s Empirical Study of Conflicts Project and the Carnegie Endowment for International Peace’s Partnership for Countering Influence Operations sought to evaluate the need for large-scale research infrastructure and the feasibility of overcoming critical barriers. Over the course of a year, the team conducted interviews and meetings with more than 240 researchers and commissioned 13 exploratory studies with 20 partners from 17 institutions.
These studies:
- examined the research process to identify the kinds of infrastructure that could speed discovery;
- reviewed the design space on research administration and funding models; and
- analyzed how analogous institutions handle privacy and ethical considerations.
Collectively, the studies provide a rich evidence base for understanding how best to move forward.
Overall, most papers study a single social media platform, focus on text, and examine the US or the EU.
We analyzed 3,923 academic papers on the information environment published from 2017 to 2021 in the top ten journals by impact factor in each of five academic fields (Communications, Computer Science, Economics, Political Science, and Sociology), plus the top six general interest science journals. Of these, only 169 utilized social media data. We found that:
- Twitter was the most studied platform at 59%, followed by Facebook at 26% and Reddit at 7%; 46% of the papers solely used Twitter data.
- 65% of papers analyzed a Western democracy (US, EU countries, the UK, Australia, or New Zealand), 35% of papers analyzed exclusively users/posts from the United States, and 60% of papers exclusively used English-language data.
- Only 12% of the papers investigated information flow across different platforms.
- Only 13% of the papers scrutinized images or videos, while 43% examined text. The remaining papers analyzed either direct interactions with posts (e.g., reactions and comments), indirect interactions with posts (e.g., shares), post metadata, or content moderation.
- 53% used simple econometric methods (e.g., multiple regression), and only 23% used machine learning (ML); of the papers using ML, half used supervised algorithms. The remainder used descriptive or qualitative analysis.
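As a minimal sketch of how shares like those above can be tallied from a coded corpus, the snippet below counts platform usage over a small set of hypothetical paper records. The records and field names are illustrative assumptions, not the study's actual data or coding scheme.

```python
from collections import Counter

# Hypothetical records: each paper tagged with the platforms it studies.
# (Illustrative data only -- not the study's actual corpus.)
papers = [
    {"id": 1, "platforms": ["Twitter"]},
    {"id": 2, "platforms": ["Twitter", "Reddit"]},
    {"id": 3, "platforms": ["Facebook"]},
    {"id": 4, "platforms": ["Twitter"]},
]

total = len(papers)

# Share of papers mentioning each platform (a paper can count toward several).
counts = Counter(p for paper in papers for p in set(paper["platforms"]))
shares = {platform: n / total for platform, n in counts.items()}

# Share of papers that used Twitter data exclusively.
twitter_only = sum(
    1 for paper in papers if set(paper["platforms"]) == {"Twitter"}
) / total

print(shares)        # Twitter appears in 3 of 4 papers, i.e. 0.75
print(twitter_only)  # 0.5
```

The same pattern (count per tag, divide by corpus size) extends to the geographic, language, and method breakdowns reported above.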
Interviews and meetings with researchers revealed three main reasons for these research shortfalls:
- Data access is either expensive or difficult to obtain;
- Unlike text analysis, which can draw on mature natural language processing methods, analyzing images, videos, and cross-platform data requires multidisciplinary skills and techniques that are not yet widespread; and
- Recruiting and retaining skilled personnel is difficult because of academic hiring structures and intense competition for these professionals; data engineers and data scientists are especially expensive. Additionally, researchers often rely on Ph.D. students or postdoctoral fellows to build pipelines and conduct analyses, but these positions are temporary and qualified candidates are also in high demand.