iConference 2021 workshop

Machine Learning and Artificial Intelligence for Science of Science and Computational Discovery: Principles, Applications, and Future Opportunities

Daniel E. Acuna1, Tong Zeng2, Han Zhuang1, Lizhen Liang1
1School of Information Studies, Syracuse University, Syracuse, NY, USA
2School of Information Science, Nanjing University, China

Background

With the development of the Internet, scientific literature has been transformed into digital formats that are indexed, linked, and readily available. Together with other large-scale datasets produced by the scientific process, these form the “big scholarly data”. Recently, there has been an unprecedented release of these digital artifacts for researchers to study, including the PubMed Open Access full-text dataset, the Microsoft Academic Graph citation dataset, the Crossref metadata dataset, and the Federal ExPORTER funding dataset. These datasets offer tremendous opportunities to find relationships between various entities (e.g., funding agencies, institutions, researchers, citizens) and activities (e.g., grant applications, research workforce, publication).

To fully exploit these newly available scientific datasets, we need modern Machine Learning (ML) and Artificial Intelligence (AI) techniques to discover and predict latent patterns and to forecast future trends. ML/AI aims to develop algorithms that allow computers to learn from data without being explicitly programmed. These techniques can learn patterns in text, images, video, and audio, so they are well suited to analyzing the large datasets used in the Science of Science (SciSci). They can also help scientists discover new ideas, predict future innovations, and validate results. However, ML and AI techniques and applications remain largely unfamiliar to a portion of the researchers attending the iConference. This workshop aims to help close part of that gap by raising awareness of ML and AI.

Purpose and Intended Audience

The Science of Science (SciSci) studies science itself with the scientific method. It investigates various aspects of the scientific process using quantitative methods to understand the organization, mechanism, evolution, impact, and improvement of scientific activities. Many of the guiding ideas of SciSci research can be traced back to the 1930s, taking inspiration from other fields such as Meta-Science, Meta-Knowledge, and Bibliometrics. The distinctive feature of SciSci is its use of large, heterogeneous datasets about the doing of science, including large citation networks, full-text articles, mentorship networks, and success measures (Fortunato et al., 2018; Acuna et al., 2012).

Similarly, advancements in computational techniques and datasets about science have allowed researchers to develop methods for Computational Discovery (CD): the partial automation of processes traditionally done by scientists, such as knowledge discovery, evaluation of ideas, and validation of results (Evans and Rzhetsky, 2010; more recently Tshitoyan et al., 2019).

This half-day workshop aims to narrow the gap in ML and AI awareness among iConference researchers by teaching principles and techniques to a broad set of attendees. We will pay special attention to including historically under-represented disciplinary and demographic audiences. By the end of the workshop, attendees will have a good understanding of SciSci and CD, and will also grasp their limitations and the opportunities for future research.

The purpose of this workshop is to:

  • Introduce researchers to the Science of Science (SciSci) and Computational Discovery (CD) research communities
  • Demonstrate Machine Learning and Artificial Intelligence techniques and help interested researchers get started with them
  • Give practitioners of SciSci and CD multiple opportunities to interact and network with the organizers and their peers

Intended Audience:

  • Researchers from all areas who work on critical information issues that affect contemporary society.
  • These researchers include Information Scientists, Network Scientists, Data Scientists, Computer Scientists, and Librarians.
  • Programming experience is preferred but not required.

Workshop Schedule

10:00 AM - 10:10 AM: Welcoming: goals, format, speakers, and schedule of the workshop

10:10 AM - 10:40 AM: Introduction to Science of Science: A broad overview of the scale and growth of science vs. scientists; biases, novelty, and problems in peer review; and issues of false results and non-reproducible science

10:40 AM - 11:00 AM: Talks about applications of AI and ML to SciSci and CD

11:00 AM - 11:30 AM: Introduction to ML and AI

11:30 AM - 12:00 PM: Packages and Frameworks:

  • Spark: Data transformation, ML pipeline
  • PyTorch: Basics, Pretrained models

12:10 PM - 1:10 PM: Hands-on experiments:

  • A curated dataset
  • Exploratory data analysis
  • Modeling with linear, tree-based, and deep learning models

1:10 PM - 2:00 PM: Flash talks of 3 or 5 minutes each, in which audience members share their insights and experiences. Talks may cover any topic, including:

  • Interests and benefits;
  • Challenges and Constraints;
  • Opportunities and Future directions;

Date

TBD

References

  • Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., … & Vespignani, A. (2018). Science of science. Science, 359(6379).
  • Evans, J., & Rzhetsky, A. (2010). Machine science. Science, 329(5990), 399-400.
  • Tshitoyan, V., et al. (2019). Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763), 95-98.

More Reading

  • Hu, H., Deng, S., Lu, H., & Wang, D. (2020, March). A Comparative Study on the Classification Performance of Machine Learning Models for Academic Full Texts. In International Conference on Information (pp. 713-737). Springer, Cham.
  • Wang, R., Zhang, C., Zhang, Y., & Zhang, J. (2020, March). Extracting Methodological Sentences from Unstructured Abstracts of Academic Articles. In International Conference on Information (pp. 790-798). Springer, Cham.
  • Zeng, T., Shema, A., & Acuna, D. E. (2019, March). Dead science: Most resources linked in biomedical articles disappear in eight years. In International Conference on Information (pp. 170-176). Springer, Cham.
  • Acuna, D. E., Allesina, S., & Kording, K. P. (2012). Predicting scientific success. Nature, 489(7415), 201-202.
  • Liénard, J. F., Achakulvisut, T., Acuna, D. E., & David, S. V. (2018). Intellectual synthesis in mentorship determines success in academic careers. Nature Communications.
  • Zeng, T., & Acuna, D. E. (2020). Modeling citation worthiness by using attention-based Bidirectional Long Short-Term Memory networks and interpretable models. Scientometrics, 124(1), 399–428.
  • Zeng, T., & Acuna, D. E. (2020). Dataset mention extraction in scientific articles using a BiLSTM-CRF model. Chapter 11 in J. I. Lane, I. Mulvany, & P. Nathan (Eds.), Rich Search and Discovery for Research Datasets: Building the next generation of scholarly infrastructure. New York.