The all-singing, all-dancing Big Data doesn’t provide the huge chunks of unadulterated data that some like to think it does. Various sources of error, some particular to each database and data format, compromise data integrity and consequently reduce usable sample sizes, making them very much less than the ‘n=all’ mantra that has emerged from ignorance in data science. Here we take a look at Google Scholar, making use of a large publications list, to see what types of mistake might be present and how they contribute to inaccuracies when counting. A few other reviews of the system are included in the Further Reading section at the end of this post.
Thus the sum of things is ever being reviewed, and mortals dependent one upon another. Some nations increase, others diminish, and in a short space the generations of living creatures are changed and like runners pass on the torch of life.
His prerogative, of course, and he does have a long-term track record of publications in high-profile, ‘media-magnetic’ topics: drugs, sex, and other addictions and fetishisms. According to his blog pieces on recommendations towards improving academic careers, one important aspect is to boost citation counts, including strategic self-citation, wherever justified, most impressively illustrated by his exemplary article on … self-citation.
Mirror, Mirror, On the Wall
Griffiths’ own research output includes publications on internet addictions, including narcissism in social media use. It is ironic to think of his running record of publications and media appearances as narcissistic traits in his own social media behaviour, but that is what he seems to be suggesting, albeit perhaps tongue-in-cheek: “So what can the experienced and obsessive self-citation expert get up to in the course of a single article?” Indeed, he may be among the best qualified to make that assessment: it would be like holding up a mirror to himself, holding up a mirror to himself, ad infinitum. Interestingly, Griffiths makes a connection between narcissism and egomania, although, he says, Narcissistic Personality Disorder “is often linked more with megalomania”. Ah, that’s a relief … Eh? What! Anyway, the answer to his rhetorical question is that about 31 self-citations is what an “experienced and obsessive self-citation expert [can] get up to in the course of a single article”.
Nottingham Trent University (NTU) are very proud of their several awards (below) and high ranking in the university league table, so no doubt the NTU Vice-Chancellor thinks that publicising the University’s achievements is an important part of maintaining visibility of the NTU brand profile. Such is the approach to modern academia: “Bold and brave – We are a great University, we have a lot to shout about. Let’s tell / show the world how good we are”. So, Griffiths using his Twitter account essentially as a conduit for information, to update anybody interested in his career, to maintain a record of his activity for reasons of documentation and transparency, and to advertise the successes of NTU, is all understandable. Even admirable for the modern career academic, not only surviving but thriving in the REF (Research Excellence Framework) performance-driven higher education job market.
Where we stop seeing eye-to-eye is in NTU’s refusal to take responsibility for Sutton’s public professional activities, including spreading misinformation and abusing other academics who try to get him to follow academic protocol for challenges to published work. I have had one informative response from NTU in over two years, while several others whom I have come to know through looking at Sutton’s activities have made similar official complaints but have received no reply at all. Griffiths chose to ignore my contact from the outset. It would seem that success must bring with it aspects of unethical academic practice, and sociopathy towards outsiders. Or is it that, as long as someone maintains airs and graces of greatness, perhaps they might get away with it by fooling enough of the people who count?
Who’s the Greatest of Them All?
Reflecting the mores of social media, social academia has followers and comments and media sharing, and updates, and all the other carrots and sticks that play with self-esteem and force your hand in actively maintaining an online presence.
On Twitter and Facebook, Snapchat and, less so, Instagram, the indoctrination instils in the account holder an urge to be consistent in the frequency and quality of content, and to reciprocally acknowledge content from others in your social network neighbourhood. Each has its own currency for success at social intercourse, “fitness” in selective terms: scores, likes, streaks, etc. The academic equivalents centre on publication performance, mostly taking as their measure of fitness the number of times a publication is cited.
In an ideal scientific, evidence-led world, citations should be dictated by necessity, through relevancy to the subject at hand. Alas, it seems even that is an illusion. Academic community networks are just as prone to the evolution of our psychologies, and the stereotypes of personality, as are groups of social media users (see Crosier, Webster & Dillon 2012, for a more general reading). Where one has LIKES and FAVOURITES and RETWEETING, the other has reciprocal citing:
“There is a ridiculously strong relationship between the number of citations a paper receives and its number of references,” Gregory Webster, the psychologist at the University of Florida in Gainesville who conducted the research, told Nature. “If you want to get more cited, the answer could be to cite more people.”
Zoë Corbyn 2010 An easy way to boost a paper’s citations: An analysis of over 50,000 Science papers suggests that it could pay to include more references. Nature doi:10.1038/news.2010.406.
Grandstanding On the Shoulders of Giants
Your own perception of how good you are as an academic may be very different to how you perform against measures of achievement. The most enduring of these has been Jorge Hirsch’s h-index. This uses publication quantity and citations as the only factors, notably ignoring the impact factors of the publishing journals.
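As a rough illustration of how that measure works, here is a minimal Python sketch of the standard h-index definition: the largest h such that h of an author's papers each have at least h citations. The citation counts are made up for illustration and are not taken from any author discussed here.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # counts only get smaller from here on
    return h


# Hypothetical citation counts for five papers (illustrative only).
example = [10, 8, 5, 4, 3]
print(h_index(example))  # 4: four papers have at least 4 citations each
```

Note that the function has no notion of where a citation came from or why it was made, so a self-citation or a critical citation counts exactly the same as any other.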
Sutton must have mixed feelings about this friend who is, “playing the game”, generating such a huge h-index. He makes them sound quite underhand.
Measures of performance are just another illustration of how fragile the academic system is. The presumption is that a terrible paper will never get published, because its shortcomings ought to be caught during peer review. But as we have seen with Sutton, there are ways to abuse the system and sneak through fallacious nonsense. What then if that paper attracts attention, and there is an outcry of criticism? It will result in awarding that badly performing academic an elevated h-index, which will then make them seem to be a well performing academic, giving them leverage by which to force through some more poorly executed research. It’s not ideal.
More noticeable than Griffiths’ h-index is his quite fantastic citation total: 28,842 in the 4-5 years since 2013 works out at roughly 15 citations a day, so not impossible for a prolific author. Except, scanning down the column of citations, one might think that either Griffiths has had an extraordinarily diverse career, or this is the publication record of more than one M. Griffiths.
So, the first thing we can note is that Google Scholar does not use a unique identifier to avoid assigning the same entries to people with the same name. M Griffiths listed as an author for the middle reference is also a Mark Griffiths. We’ll come back to this after a quick look at how databases can avoid the problem.
Are You the One?
In accounting, the double entry system is a failsafe: it keeps track of the same amounts of finance from different perspectives, so that one account’s gain is another’s liability (obligation to settle the balance), etc. The analogy with a database, whose data structure is spread across different tables, is that it is important to maintain strong links between associated data, for example, columns containing the same data with different meanings (e.g., credit and debit, 1-to-1), and columns containing different data with the same meaning (e.g., publications for the same initials and surname combinations: 1-to-many).
A database uses primary keys to act as unique identifiers for the records in each table. A primary key is one or more of the columns, each containing a value relevant to the row of data, or record, in that table. For example, if the record is the personal account information for an individual, then a combination of two columns storing their first and second names might not be enough to be unique, because some names are more common than others. Surname and home address is a possibility, unless you expect to be storing the data of more than one family member.
A generated unique key, or ‘Globally Unique Identifier’ (GUID), is often the answer. This can then be used confidently, knowing that each row will be represented by a unique code. Including that column in each table, as the key on which to join the data stored therein, then enables different categories of data to be stored in distinct structures, personal details in one, banking in another, and so on, but linked to the other data relevant to that record. Running a query on the database can then pull related data from each table, using the GUID to know which row to interrogate.
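To make the idea concrete, here is a small sketch using Python's built-in sqlite3 and uuid modules. The table names, column names, affiliations and titles are all hypothetical, invented for this example; it does not describe how SCOPUS or Google Scholar store their data. It shows two authors who share a name but have different generated keys, and how counting by name lumps their papers together while counting by key keeps them apart.

```python
import sqlite3
import uuid

# In-memory database with a hypothetical two-table schema (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE author (author_id TEXT PRIMARY KEY, name TEXT, affiliation TEXT)"
)
conn.execute(
    "CREATE TABLE publication ("
    "pub_id TEXT PRIMARY KEY, "
    "author_id TEXT REFERENCES author(author_id), "
    "title TEXT)"
)


def add_author(name, affiliation):
    """Insert an author under a freshly generated unique key and return that key."""
    author_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO author VALUES (?, ?, ?)", (author_id, name, affiliation)
    )
    return author_id


def add_publication(author_id, title):
    """Insert a publication linked to its author by key, not by name."""
    conn.execute(
        "INSERT INTO publication VALUES (?, ?, ?)",
        (str(uuid.uuid4()), author_id, title),
    )


def count_by_name(name):
    """Count publications for everyone who happens to share this name."""
    row = conn.execute(
        "SELECT COUNT(*) FROM publication p "
        "JOIN author a ON a.author_id = p.author_id WHERE a.name = ?",
        (name,),
    ).fetchone()
    return row[0]


def count_by_id(author_id):
    """Count publications for exactly one author record."""
    row = conn.execute(
        "SELECT COUNT(*) FROM publication WHERE author_id = ?", (author_id,)
    ).fetchone()
    return row[0]


# Two different people with the same name (made-up affiliations and titles).
first = add_author("M. Griffiths", "Hypothetical University A")
second = add_author("M. Griffiths", "Hypothetical University B")
add_publication(first, "Paper one")
add_publication(first, "Paper two")
add_publication(second, "Paper three")

print(count_by_name("M. Griffiths"))  # 3: both namesakes pooled together
print(count_by_id(first))             # 2: only the first author's papers
```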
Alas, Google Scholar and some other citation databases are not so sophisticated. It seems obvious that there are going to be many researchers with the same surname and initials, or even surname and first name combinations, so why not design it with better data integrity from the outset?
The Wrong Crowd
The upshot is that we know Griffiths is a prolific author, so it would be surprising if he wasn’t also Sutton’s mystery pole greaser, because the suggestion is that Griffiths is playing the game to drive up his citations tally, and his h-index as a consequence. It doesn’t seem necessary, given his rate of output.
However, the bulking together of different authors with similar names is a potentially significant source of inaccuracy when estimating performance metrics. To clear that data noise, it is a simple, but tedious, task to rake through and check what portion of an author’s total publications record belongs to other authors who happen to be their namesakes, and to catch any other accounting mishaps. First, a quick inventory of his Google Scholar entries (n=1748, raw data here), compared with a more reliable citations index, SCOPUS,
which is hand curated, so not simply the result of a Google
SCOPUS also profiles individual authors in its Author Identifier, allocating a unique number and noting affiliations as an additional factor in discriminating between records that are closely matched purely by author name. When there are multiple entries for the same person, as here for Author last name “griffiths”, Author first name “mark”, Affiliation “nottingham”, you can request that they are merged. Griffiths’ Author Identifiers are: Griffiths, M.D.#7201549643; Griffiths, M.D.#35519823300; and Griffiths Prof., M.D.#7201549643.
Six other authors share the same surname and initials (Author last name “griffiths”, Author first name “m.d.”), while a total of 22 authors with 19 affiliations, and six unaffiliated, appear as close variants, for example, “Griffiths, D.M.L.”.
||Aberystwyth University (Ceredigion, UK)
||Brock University (St Catharines, Canada)
||Children’s Hospital of Austin (Austin, USA)
||Defence Science and Technology Laboratory (Salisbury, UK)
||Lehigh University (Bethlehem, USA)
||University of Minnesota Twin Cities (Minneapolis, USA)
||Nottingham Trent University (Nottingham, UK)
||N.S.W. Department of Education (Bathurst, Australia)
||University of Nottingham (Nottingham, UK)
||University of Oxford (Oxford, UK)
||Pak-Poy & Kneebone Pty Ltd. (Sydney, Australia)
||Royal Hobart Hospital (Hobart, Australia)
||Rubber & Plastics Research Assoc. of GB (Shrewsbury, UK)
||St Bartholomew’s Hospital (London, UK)
||University of Salford (Salford, UK)
||Southampton University Hospitals NHS Trust (UK)
||University of Southern California (Los Angeles, USA)
||The Aromatherapy Research Group TARG (Australia)
||University of Wales (Cardiff, UK)
||Western University (London, Canada)
||total net Nottingham Trent University
Between them they contribute a further 207 publications, 175 of which have been cited a total of 3268 times (the publications date back to 1953, whilst the citations prior to 2014 are pooled into the one category). It might be interesting to see if they appear in the Google results, and what that means for the citation totals. However, as they account for only a fraction of Griffiths’ annual tally (% of Griffiths’ Google Scholar citations, below), it seems that the majority of authors mistakenly included by Google Scholar in Griffiths’ citation totals are not included in SCOPUS. Alternatively, if Sutton’s claim is correct that the majority of an h-index can be generated through self-citations, then the influence from mistakenly included authors may be negligible.
||Citations for other authors
Google Scholar citations
Never Give Up Scope
Turning to Griffiths’ record in SCOPUS, specifically using his Author Identifiers to ensure that it is him, we can see that he has 559 publications recorded in this database, with 558 receiving 16322 citations from 7427 documents (the so far citation-less item is Griffiths’ and Sutton’s latest co-production, mentioned above, and rightly so: it is seriously flawed).
Remembering Sutton’s revelation that a well known professorial friend is working the numbers, the citations count for each year seems to be keeping in step with the document tally (the Spearman correlation is 0.98; Spearman is better suited to ordinal datasets as it only requires monotonicity and not normality, as needed for the Pearson product-moment correlation), but this is as likely a product of increasing publications and a growing corpus, accumulated over the two decades shown. To pin it on self-citations will require some direct evidence.
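For readers unfamiliar with the statistic, the following is a minimal sketch of a Spearman rank correlation in plain Python: rank both series (averaging tied ranks), then take the ordinary Pearson correlation of the ranks. The yearly figures are invented for illustration and are not the SCOPUS or Google Scholar data analysed in this post.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical yearly documents and citations (illustrative values only).
documents = [2, 5, 9, 14, 20, 31]
citations = [1, 10, 40, 200, 900, 5000]

print(spearman(documents, citations))  # monotonic relationship, so close to 1
print(pearson(documents, citations))   # lower, because the growth is not linear
```

The made-up series rises monotonically but far from linearly, which is the situation where the two coefficients part company.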
Interestingly, the equivalent record of items and citations from Google Scholar shows the obfuscating or masking effect (Spearman 0.57) of including other authors’ documents and not filtering publications for academic standards. Compared to the SCOPUS output, this even greater lack of direct causal connection and looser housekeeping, allowing excessive spam, is why it is difficult to detect academics abusing Google Scholar.
The lack of direct dependence is, however, insufficient to entirely swamp dynamic responses. The response in Google Scholar citations is surprisingly sensitive to the spikes in document number: while there is some trace of that spiky dynamic inherent in the document data for both databases (i.e., there is good agreement despite the spikes, Spearman 0.91), this dominates the Google Scholar citation data, but is buffered by the SCOPUS mechanism for citations (Spearman 0.41).
Another way of looking at this dynamic in the relationship is to plot one (citations) against the other (documents), essentially the correlations reported above. (R² is the coefficient of determination, the squared correlation between the observed values and the model, i.e., the citations data and a linear regression.)
For this we do need normally distributed data: it was obvious that the citations data was heavily skewed (A), so a log transformation (D) also helped spread out clustering around the origin (B,C). The linear regression appears curvilinear on this log plot, below; the square root, R, equals 0.98 and 0.58 for SCOPUS and Google Scholar, respectively, which are comparable with the 0.98 and 0.57 for the correlations, above.
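The transformation and fit described above can be sketched in a few lines of plain Python: log-transform the skewed citation counts, fit an ordinary least-squares line against the document counts, and report R². The numbers are invented, and the choice of base-10 logs is an assumption, since the post does not say which base was used.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Coefficient of determination for the least-squares line through (x, y)."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical yearly documents and citations (illustrative values only).
documents = [2, 5, 9, 14, 20, 31]
citations = [3, 12, 30, 95, 260, 1100]

# Assumed base-10 log; all counts here are positive, so no offset is needed.
log_citations = [math.log10(c) for c in citations]

r2 = r_squared(documents, log_citations)
print(r2, math.sqrt(r2))  # R squared, and its square root R
```

For a straight-line fit with a single predictor, the square root of R² is the absolute value of the Pearson correlation, which is why R can be compared with the correlation coefficients quoted earlier.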
History Repeats Itself
That low correlation suggests two factors at play in the disconnect between Google Scholar documents and citations: (i) the noise in the ‘dirty data’, from misallocation to other authors, inclusion of nonacademic and irrelevant items, and (ii) a lag in the database synchronising outputs (new publications), with inputs (citations, accrued across all titles for an author).
Google is a globally distributed system that employs regionality to present results more likely to be pertinent for users’ locations. But when you want to search further afield it is easy to do so, which means Google is shipping vast volumes of pre-indexed search terms to allow you rapid, realtime searching, regardless of the origin of the information. This is speculative, but, in addition to updates to the Google algorithm, which I do know occur a couple of times a year, and for which there is an update log, spikes in the data might be caused by what must be huge dumps of search data hitting local systems and updating citation counts.
The reason all this is important to both the ambitious career academic and the researcher looking for on-topic references is that the sequence in which resources are returned to you is entirely dictated by their citation count. The obvious consequence is that highly visible, actively publicised papers will get more notice and attract more citations, maintaining their place high in the search results (aka the Matthew Effect), regardless of academic merit. To counter this, some weighting is given to older articles (ibid.), which works in their favour, and makes the most of the citations accumulated since they were published. But an old article is rarely what the cutting-edge researcher seeks. They more often want the latest developments, or a recent review, from which they can trace the history, as required.
To see if it was possible to elicit any meaning from those spikes, I treated each of the datasets for documents and citations as autocorrelated time series, in order to carry out a spectral analysis and generate some periodograms. This is a useful tool in detecting recurring patterns.
The documents data on the left show the expected clustering of signals at high frequencies, within a year, and every year. This comes from the comparatively constant updating of Google as it trawls the main publishing sites. On the right, there are the same high frequency signals, but also a clear annual and biannual event. If one of the Google algorithm updates tends to have a larger effect on variation, and it generally falls in the same half of the year, then this signal could be the result. Similarly, the same stronger, but varying, event once a year will spread the effect into successive years. The larger signal at 4-5 years is less easy to interpret, but is possibly an edge effect in the data: the data run from 1989 to 2018, which is 29 years, which might be switching the signal between multiples of 4 (28 years) and 5 (30 years). It would be interesting to ask if this is the reason for choosing 5 years as the alternative period (‘All’ and ‘Since 2013’) in the status box.
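As a hedged illustration of what a periodogram does, the sketch below computes a raw one from a discrete Fourier transform in plain Python. It is not the analysis behind the figures described above: the series is synthetic, with a known 4-year cycle built in, and no windowing or detrending beyond removing the mean is applied.

```python
import cmath
import math


def periodogram(series):
    """Return (frequency, power) pairs for the positive Fourier frequencies.

    Frequencies are in cycles per observation; power is |DFT|^2 / n
    after subtracting the mean of the series.
    """
    n = len(series)
    mean = sum(series) / n
    centred = [value - mean for value in series]
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(centred)
        )
        result.append((k / n, abs(coeff) ** 2 / n))
    return result


# Synthetic yearly series: 24 points with a 4-year cycle (illustration only).
series = [10 + 3 * math.cos(2 * math.pi * t / 4) for t in range(24)]

peak_freq, peak_power = max(periodogram(series), key=lambda pair: pair[1])
print(1 / peak_freq)  # dominant period in years: 4.0
```

One relevant property of this kind of estimate: a series of n points can only resolve periods of n/k observations, so with 29 yearly values the nearest available periods are about 4.1, 4.8 and 5.8 years. A true 4- or 5-year cycle would fall between those bins and smear across them, which is consistent with, though not proof of, the edge effect suggested above.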
Submitting a tailored search request, we can use SCOPUS to retrieve data to allow a year-to-year comparison of documents over the last five years, to match recent publication histories reported by Google Scholar (i.e., “since 2013”), “AUTHOR-NAME ( m AND griffiths ) AND ( LIMIT-TO ( PREFNAMEAUID , “Griffiths, M.D.#7201549643 ” ) OR LIMIT-TO ( PREFNAMEAUID , “Griffiths, M.D.#35519823300 ” ) OR LIMIT-TO ( PREFNAMEAUID , “Griffiths Prof., M.D.#7201549643 ” ) OR LIMIT-TO ( PREFNAMEAUID , “Griffiths, M.D.#57201688964 ” ) ) AND ( LIMIT-TO ( PUBYEAR , 2018 ) OR LIMIT-TO ( PUBYEAR , 2017 ) OR LIMIT-TO ( PUBYEAR , 2016 ) OR LIMIT-TO ( PUBYEAR , 2015 ) OR LIMIT-TO ( PUBYEAR , 2014 ) OR LIMIT-TO ( PUBYEAR , 2013 ) )”.
(Chart and table above) Griffiths’ publications in the last 5 years reported by Google Scholar and SCOPUS
Remembering our motivation for all this, with so much noise in Google Scholar, the only way to see if the increase in citations is purely from self-citation is to weed out all the other authors’ documents. We come to that next (below). Meanwhile, using the less noisy SCOPUS data, we can omit any self-citations with a simple click on a tick-box, revealing that a consistent 19±2.5% (s.d.) of the citations originate from Griffiths’ own publications.
It’s worth noting that the stricter requirements for inclusion in the SCOPUS database mean that this is likely an underestimate of the real proportion of self-citations that Griffiths accumulates; correspondingly, Sutton has already noted that his friend has been self-citing, “in journals that are not even peer reviewed”.
The table below shows the comparative citation metrics reported for Griffiths from 1963 (Google Scholar) and 1989 (SCOPUS) until the current day (10-Jun 2018). Bracketed numbers for SCOPUS exclude self-citations.
||All||Since 2013||Before 2013
The ratio of publications between the databases is consistent across years (SCOPUS at 48.6 ± 6.7%, s.d., of Google Scholar), indicating that Google Scholar inflates the number by ~100% compared to SCOPUS. Whether this is due to inclusion of other authors’ works, or because Google Scholar does not apply any academic threshold for what is included, clearly there’s some doubt regarding the publications reported for an author by Google Scholar, compared to SCOPUS.
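The step from a ratio of about 48.6% to an inflation of about 100% is simple arithmetic, sketched here in Python. The function name is my own; the 48.6% figure and the whole-record totals of 599 (SCOPUS) and 1748 (Google Scholar) are the ones quoted in this post.

```python
def inflation_percent(ratio):
    """Percentage by which the larger count exceeds the smaller.

    `ratio` is smaller / larger, e.g. SCOPUS documents / Google Scholar documents.
    """
    return (1 / ratio - 1) * 100


# Recent years: SCOPUS holds about 48.6% of what Google Scholar reports.
print(round(inflation_percent(0.486)))       # 106, i.e. roughly double

# Whole record: 599 SCOPUS documents against 1748 Google Scholar items.
print(round(inflation_percent(599 / 1748)))  # 192, i.e. nearly triple
```

On these figures the whole-record inflation is noticeably larger than that for recent years alone.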
Let’s Not Tilly-Tally
The total given for all of Griffiths’ publications stored in SCOPUS is 599 documents (papers, books and chapters, conference proceedings and letters). This is in comparison to Google Scholar’s 1748 items. So what are those 1149 other items? To find out you’d need to delve into a more detailed breakdown of the Google Scholar results. Here is the separation by name, with links to the documents list. I was going to pare these down more, but I’ve spent too long on all this already. I would have looked closer at the 413, to see how many to add to the 1254, and then checked that tally. Feel free to do so (names link to data).
As expected, it looks like a good turn out for team “MD Griffiths & Co.”, but there is still a large discrepancy between databases, and that is down to Google’s Big Data. As for the unethical practices of certain individuals regarding the true tally of their publication propagation, it’s only one of several systems being abused in academia, but we’ll leave it to Sutton to cast aspersions about his friends.
When Google Scholar first appeared as a contender to other established citation indices, it was soon realised that the wide-reaching universality of Google might actually work against the aims of a specialised information resource. For example, here are some issues found in 2008 for Google Scholar when it was compared with known reliable, professional citation databases (the source has been updated frequently, most recently in 2017):
- Google Scholar includes some non-scholarly citations
- Not all scholarly journals are indexed in Google Scholar
- Google Scholar coverage might be uneven across different fields of study
- Google Scholar does not perform as well for older publications
- Google Scholar automatic processing creates occasional nonsensical results
- Names with diacritics or apostrophes are problematic
- Names with ligatures are problematic
Most studies have concluded the same: specialist indices offer consistency and accuracy, while Google offers a further reach, but unreliably:
Scopus covers a wider journal range, of help both in keyword searching and citation analysis, but it is currently limited to recent articles (published after 1995) compared with Web of Science. Google Scholar, as for the Web in general, can help in the retrieval of even the most obscure information but its use is marred by inadequate, less often updated, citation information.
Falagas ME, Pitsouni EI, Malietzis GA, Pappas G. (2008) Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. FASEB J. 22(2):338-42 DOI: 10.1096/fj.07-9492LSF
How that extra information is then useful is questionable. It is enticing, highly seductive, to think you have the world at your fingertips, but the reality is somewhat less sexy, and more bloated than voluptuous. Just look at the pig’s ear that Sutton has made of having all that data, what he considers to be “all” data, made easily accessible. Duplicate copies of volumes are not such a hurdle for him, but the way many books have been concatenated, and labelled as a single volume, is definitely a major source of his confusion:
It is a common practice of G-S to display identically named links with identical URLs which does not reduce the infoglut. Such redundancy adds to the searchers’ confusion. Still, at least these entries are not misinforming the searchers. They just discombobulate them.
Peter Jacso (2005) As we may search—comparison of major features of the Web of Science, Scopus, and Google Scholar citation-based and citation-enhanced databases. Current Science 89(9), 1537-1547
The automatic processing means a hands-off approach that in turn also rewards with near-immediate updates to the database: estimated as every other day by linguist Dingemanse using his clever metric of Prof. Et Al. Experimenting with that ‘user account’ and one for ‘A. Author’, he concluded:
- Google Scholar is inclusive
- GOOD: It finds scholarly works of many types and indexes material from scholarly journals, books, conference proceedings, and preprint servers.
- BAD: It will count anything that remotely looks like an article.
- Its citation analysis is automated
- GOOD: Citations are updated continuously, and with Google indexing even the more obscure academic websites, keeping track of the influence of scholarly work has become easier than ever.
- BAD: There are no humans pushing buttons, making decisions and filtering stuff. This means rigorous quality control is impossible.
- Its profiles are done by scholars
- GOOD: No sane person wants to disambiguate the hundreds of scholars named Smith or clean up the mess of papers without named authors, titles or journals. Somebody at Google Scholar had the brilliant idea that this work can be farmed out to people who have a stake in it: individual scholars who want to make sure their contributions are presented correctly and comprehensively.
- BAD: Scholars have incentives to appear influential. H-indexes and citation counts play a role in how their work is evaluated and enter into funding and hiring decisions. Publications and co-authors can be added to Google Scholar manually without any constraints or control mechanism, an opportunity for gaming the system that some may find hard to resist.
And there is the crunch: if you are an ethical academic who has no interest in artificial rewards, working systems to your advantage, and using shades of deceit to gain advantage over your peers (something considered a form of professional misconduct by Nottingham Trent University, btw), then your Google Scholar h-index is not for you.
Fiddler On The Roof
Sutton calls matters of pride “bragging rights”, for which one may “trumpet your findings from the rooftops and be dammned”. He seems to take great pride in the support and inspiration he gets from colleague, friend and collaborator, Griffiths, but their preference for Google Scholar is fraught with issues, contrary to LSE’s suggestion that it is the preferred measure for the social sciences, having filtered out spurious entries. A quick scan down his list of references reveals the vacuity of his source of pleasure: there are no fewer than seven blog posts in his publications list; nine papers in his own online journal, the most recent of which he had “peer reviewed” by his retiring friend and an ex-student, neither specialists in the field; the two editions of his book listed separately; and four versions of the same paper (“E-Mails with Unintended Consequences”, cowritten with Griffiths), two of which are identical duplicates. Out of his 85 total documents, there are perhaps a dozen publications with doubts over whether they ought to be included in any metric, contributing a total of 87 citations. It’s too fiddly to work out whether these contribute to his h-index, but unless you’re going to clean up your publication record, the more honest strategy is going to be sticking to a less error-prone database.
In conclusion, if you are truly interested in your academic performance and wish to keep track with a view to self-improvement over self-promotion, the best advice would be to work and read more, and harder. If you think citation metrics will help you in that aim, then an accurate one is the only kind that is going to prove useful, if used honestly. Despite the judgemental tones that he uses to discuss his “friend”, presumably Griffiths, Sutton is as guilty of gaming the greasy pole system as anyone, submitting blog posts and claiming peer reviewed work. He may have conned his colleagues and friends, but it looks like the SCOPUS personnel have a far more discerning eye for fraudsters.
Full Time Score
Lastly, as with so much in life, reality often deals a slap in the face with a wet kipper. Once again, even thinking he has an admirable publication record is fatuous and delusional. Home Office, Schmome Office. It takes more than a government report to qualify as a valid piece of scholarship, which is why his record automatically recognises his publications start in 2006. Seriously, without malice; having looked at Sutton’s publications, his h-index of 1 does seem about right.
Further Reading
- 4 reasons why Google Scholar isn’t as great as you think it is
- PubMed, Web of Science, or Google Scholar?
- Google Scholar Metrics
- Prof Et Al
- Some things you need to know about Google Scholar
- Four great reasons to stop caring so much about the h-index