When working with large-scale publication datasets, author name disambiguation is one of the trickiest challenges. There’s no perfect solution, and different datasets handle it differently. Here I’ve collected notes on how various data sources approach this problem.
APS
APS does not provide disambiguated author data. However, the authors of this paper performed author disambiguation through a hierarchical procedure based on metadata in the APS dataset. The procedure is as follows:
- Consider each author of each publication to be a unique individual.
- Treat two authors identified in the previous step as the same individual if all of the following conditions are fulfilled:
  i) The last names of the two authors are identical.
  ii) The initials of the first names and, when available, given names are the same. When full first names and given names are present for both authors, they have to be identical.
  iii) At least one of the following is true:
    - the two authors cited each other at least once
    - the two authors share at least one co-author
    - the two authors share at least one similar affiliation (measured by cosine similarity of TF-IDF vectors)
According to the authors, the algorithm was evaluated on 200 random pairs of papers that it attributed to the same author and 200 random pairs that it attributed to different authors. The false positive rate was 2% and the false negative rate was 12%. I couldn’t find the source code for this method.
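Since no source code seems to be available, here is a minimal sketch of how the matching rule described above could be implemented. It is not the authors' code. The record fields (`paper`, `last`, `given`, `coauthors`, `cites`, `affiliation`), the 0.5 similarity threshold, the token-by-token name comparison, and the use of union-find to merge matches transitively in a single pass are all my assumptions; the source does not specify them.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Build simple TF-IDF vectors (as dicts) for a list of strings."""
    docs = [Counter(re.findall(r"\w+", t.lower())) for t in texts]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)
    # Smoothed IDF, so tokens found in every document keep a nonzero weight.
    return [
        {tok: tf * (math.log((1 + n) / (1 + df[tok])) + 1) for tok, tf in d.items()}
        for d in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def names_compatible(a, b):
    """Conditions i) and ii): same last name, matching initials or full names."""
    if a["last"].lower() != b["last"].lower():
        return False
    ta = a["given"].replace(".", " ").lower().split()
    tb = b["given"].replace(".", " ").lower().split()
    if not ta or not tb:
        return False
    # Assumption: only name parts present in both records are compared.
    for x, y in zip(ta, tb):
        if len(x) > 1 and len(y) > 1:
            if x != y:  # both spelled out: must be identical
                return False
        elif x[0] != y[0]:  # at least one is an initial: compare initials
            return False
    return True


def disambiguate(mentions, affil_threshold=0.5):
    """Return one cluster label per author mention."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    vecs = tfidf_vectors([m["affiliation"] for m in mentions])
    for i, j in combinations(range(len(mentions)), 2):
        a, b = mentions[i], mentions[j]
        if not names_compatible(a, b):
            continue
        # Condition iii): any one of the three signals is enough.
        cited = b["paper"] in a["cites"] or a["paper"] in b["cites"]
        shared = bool(a["coauthors"] & b["coauthors"])
        similar = cosine(vecs[i], vecs[j]) >= affil_threshold
        if cited or shared or similar:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(mentions))]


# Hypothetical toy records, invented for illustration only.
mentions = [
    {"paper": "p1", "last": "Smith", "given": "John", "coauthors": {"A. Lee"},
     "cites": set(), "affiliation": "Department of Physics, Example University"},
    {"paper": "p2", "last": "Smith", "given": "J.", "coauthors": {"A. Lee"},
     "cites": set(), "affiliation": "Institute of Chemistry, Sample College"},
    {"paper": "p3", "last": "Smith", "given": "Jane", "coauthors": {"B. Kim"},
     "cites": {"p1"}, "affiliation": "School of Biology, Another Place"},
]
labels = disambiguate(mentions)
```

In this toy example the first two mentions end up with the same label because their names are compatible and they share a co-author. The third stays separate, since "Jane" conflicts with "John" and nothing links it to the "J." record. Because "J." is compatible with both full names, transitive merging can chain distinct people together on real data. That is one reason to treat the union-find step as a simplification.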
This paper also disambiguated author names with a similar method.
OpenAlex
OpenAlex provides its own author identifiers. It appears that OpenAlex generates identifiers using the most comprehensive metadata available, including concepts, institutions, citations, coauthor data, and ORCID data when available. According to the description here, OpenAlex uses XGBoost to compare these features across authors of each work. It seems they trained the model using the ORCID dataset. OpenAlex also provides the ORCID of authors for each work if available. They mentioned that they will release a document with more details in the future.
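To make the per-work identifiers concrete, here is a small sketch that pulls the OpenAlex author ID and the ORCID (when present) out of a work record. The field names (`authorships`, `author.id`, `author.display_name`, `author.orcid`) reflect my understanding of the OpenAlex work object and should be checked against the current API documentation. The record itself is a hand-written placeholder, not real data.

```python
def author_ids(work):
    """Return (display_name, OpenAlex author ID, ORCID or None) for each authorship."""
    rows = []
    for authorship in work.get("authorships", []):
        author = authorship.get("author") or {}
        rows.append(
            (author.get("display_name"), author.get("id"), author.get("orcid"))
        )
    return rows


# Hypothetical record mimicking the shape of an OpenAlex work; all IDs are placeholders.
work = {
    "id": "https://openalex.org/W0000000000",
    "authorships": [
        {"author": {"id": "https://openalex.org/A0000000001",
                    "display_name": "Jane Doe",
                    "orcid": "https://orcid.org/0000-0000-0000-0001"}},
        {"author": {"id": "https://openalex.org/A0000000002",
                    "display_name": "John Roe",
                    "orcid": None}},
    ],
}
rows = author_ids(work)
```

Authors without an ORCID still get an OpenAlex author ID, so the ID is the field to group on. The ORCID is better suited to spot-checking the disambiguation where it exists.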
Web of Science
This paper on authors’ mobility in the space of scientific periodicals uses the Web of Science (WoS) dataset. However, for the articles collected from WoS, the authors relied on the disambiguated author data from MAG (Microsoft Academic Graph) rather than disambiguating the names themselves.
SciSciNet
The SciSciNet dataset provided by this paper also uses the MAG author disambiguation method. The authors acknowledge that challenges in author disambiguation remain.
Other sources
Faculty dataset
If you’re not interested in the general publication trajectories of scientists but instead need data on a specific group, such as faculty members, you can use the dataset from this paper.