The Fake News of Big Data


As engineers, developers, administrators, and other IT professionals, we claim to govern our work by the scientific method. Or do we? What happens when we don’t? What are the implications if that happens often?

Read on to see the trouble that big data brings when trying to adhere to actual, reproducible science…

Mo Data Mo Problems

“What gets us into trouble is not what we don’t know. It’s what we know for sure that just ain’t so.”

– Mark Twain

An article recently published by the ACM, “An Inability to Reproduce” by Samuel Greengard, begins the exploration. In it, the author argues that findings drawn from big data and our analytical efforts are often not reproducible.

Science has always hinged on the idea that researchers must be able to prove and reproduce the results of their research…this is what makes science – science

– Samuel Greengard – ACM “An Inability to Reproduce”

This is a key principle of the scientific method, which anyone who reads this blog should follow. Our work is not subjective like that of artists or craftsmen; it is more like mathematics, where something is provable or not. Computers are deterministic machines.

The era of big data coupled with our powerful analytics has ushered in a new set of problems. Data sets have grown. Even the common person has access to machines that can quickly and cheaply analyze tons of data.

A key point of science is that researchers must be able to reproduce results from other researchers.

However, we are all human and thus succumb to various irrational and illogical hazards. Meaning is provided by context, and context is judged through the eyes of the researcher. Lawyers do this all of the time; it is called framing. We all look at the world through our relatively weak senses and impulses and try to make sense of things. The problem here is that it undermines the credibility of science, which has advanced our lives generation after generation.

An Inability To Reproduce

As the article says, “In many cases, researchers cannot replicate existing findings and cannot produce the same conclusions”. Why is that? How does it happen?

Why We Cannot Reproduce

Researchers are finding patterns in data that have no relationship to the real world

– Samuel Greengard – ACM “An Inability to Reproduce”

A few factors contribute to this:

  • Experimental errors
  • Publication bias
  • Improper use of statistical methods
  • Subpar machine learning techniques
  • Various cognitive biases
  • Self-serving interests

How Did the Increase of Unreproducible Results Occur?

When presented with large data sets, researchers often start with no leading hypothesis and grasp at whatever correlates.

If the data universe is large enough – and this is frequently the case – there are reasonably good odds that by sheer chance, a valid p-value will appear.

– Samuel Greengard – ACM “An Inability to Reproduce”
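To make the quote concrete, here is a minimal Python sketch (not from Greengard’s article) that tests many hypotheses on pure noise. The function names, the sample sizes, and the choice of a simple z-test with known unit variance are all assumptions made for illustration.

```python
import math
import random


def z_test_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means, assuming unit variance."""
    n_a, n_b = len(group_a), len(group_b)
    mean_diff = sum(group_a) / n_a - sum(group_b) / n_b
    standard_error = math.sqrt(1 / n_a + 1 / n_b)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def count_chance_findings(num_hypotheses, sample_size, alpha, seed):
    """Count 'significant' results when every hypothesis is actually false."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_hypotheses):
        # Both groups come from the same distribution: no real effect exists.
        group_a = [rng.gauss(0, 1) for _ in range(sample_size)]
        group_b = [rng.gauss(0, 1) for _ in range(sample_size)]
        if z_test_p_value(group_a, group_b) < alpha:
            hits += 1
    return hits


def chance_of_at_least_one_false_positive(alpha, num_tests):
    """Probability that at least one of num_tests independent null tests passes."""
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    hits = count_chance_findings(num_hypotheses=1000, sample_size=30,
                                 alpha=0.05, seed=42)
    print(f"'Significant' results found in pure noise: {hits} of 1000")
    print("Chance of at least one false positive in 100 tests:",
          round(chance_of_at_least_one_false_positive(0.05, 100), 3))
```

In expectation, about alpha times the number of hypotheses will pass the threshold even though nothing real is there, which is why an unguided search through a large data set almost always turns up something.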

Correlation is not causation. A square is always a rectangle, but a rectangle is not always a square. It is vital to understand and apply the differences. You can lie with statistics. Junk science and pseudoscience are flourishing because of this.
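One way to see the gap between correlation and causation is a hidden common cause. The sketch below is an illustration under assumed numbers (a made-up hidden driver with a weight of 2 and Gaussian noise), not an analysis of any real data set: two variables never influence each other, yet they correlate strongly because both depend on the same unseen factor.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_confounder(num_samples, seed):
    """Return (raw correlation, correlation after removing the hidden driver)."""
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(num_samples)]
    # x and y each depend on the hidden factor, never on each other.
    x = [2 * h + rng.gauss(0, 1) for h in hidden]
    y = [2 * h + rng.gauss(0, 1) for h in hidden]
    raw = pearson_r(x, y)
    # Subtract the hidden factor's contribution (known here only because
    # this is a simulation) and correlate what is left.
    x_rest = [xi - 2 * h for xi, h in zip(x, hidden)]
    y_rest = [yi - 2 * h for yi, h in zip(y, hidden)]
    adjusted = pearson_r(x_rest, y_rest)
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate_confounder(num_samples=5000, seed=1)
    print(f"Correlation of x and y:            {raw:.2f}")
    print(f"After accounting for hidden cause: {adjusted:.2f}")
```

In real research the hidden factor is usually not measured, so the strong raw correlation is all the researcher sees.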

Anti-scientific sentiment and junk science are leading us down a bad path. “If results cannot be trusted, then the entire nature of research and science comes into question.” Everyone thinks the problem is other people, but many who profess science abandon it in practice.

Why Most Published Findings are False

Unfortunately, we are finding that research is more likely to be wrong than right. John Ioannidis, a professor of health research and public policy at Stanford, published an academic paper called “Why Most Published Research Findings Are False” in the journal PLOS Medicine. It looks at bias and reproducibility issues in the scientific community.

Simulations show that for most study designs and settings, it is more likely for a research paper to be false than true

– John Ioannidis
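The reasoning behind that claim can be sketched with the positive predictive value (PPV) used in Ioannidis’s paper: the probability that a “significant” finding is actually true, given the pre-study odds R that a tested relationship is real, the significance level α, and the statistical power 1 − β. The scenario values below are hypothetical and chosen only to show how the number moves.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.8):
    """PPV = (1 - beta) * R / (R - beta * R + alpha), with beta = 1 - power."""
    beta = 1 - power
    return (power * prior_odds) / (prior_odds - beta * prior_odds + alpha)


if __name__ == "__main__":
    # Hypothetical scenarios: (label, pre-study odds R, power).
    scenarios = [
        ("Well-motivated hypothesis, good power", 1.0, 0.8),
        ("Long shot, good power", 0.1, 0.8),
        ("Long shot, weak power", 0.1, 0.2),
        ("Data dredging, good power", 0.01, 0.8),
    ]
    for label, prior_odds, power in scenarios:
        ppv = positive_predictive_value(prior_odds, power=power)
        print(f"{label}: PPV = {ppv:.2f}")
```

A finding is more likely true than false only when (1 − β)R exceeds α. Exploratory searches through big data tend to have very low R, which pushes the PPV well below one half before any bias is considered.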

In 2011, another researcher found more evidence that studies were more likely to be false than true. C. Glenn Begley, at that time the head of the oncology division at the biopharmaceutical firm Amgen, decided to reproduce the results of 53 foundational papers in oncology that appeared between 2001 and 2011, a 10-year period.

Once his study was complete, he found he could replicate the results of only 6 papers (about 11%), despite using datasets identical to the originals!

We have also seen a surge in dubious studies since the advent of social media. An article published by Forbes, “Fake News: How Big Data And AI Can Help”, shows the effect of intentional and unintentional misdirection on Facebook and across contemporary media. With big data, it becomes a question of who controls the algorithms. Agenda setting of various kinds is coloring the context.

Case Example – Grievance Studies Affair

In 2017 and 2018, a group of 3 academics carried out a hoax to show how poor the scholarship in some academic fields can be. More than one of the papers they wrote was not only published but also received awards.

James A. Lindsay, Peter Boghossian, and Helen Pluckrose proved their point in grandiose fashion in what was called the Grievance Studies Affair. They wrote 20 articles that promoted intentionally absurd ideas and submitted them to peer-reviewed academic journals.

I encourage you to visit the authors’ website to learn more about this phenomenal experiment.

Also, if you want to hear a 2-hour interview in which the authors answer open and honest questions about their “research”, check out the Joe Rogan podcast.

It’s Hopeless! What Can We Do?

To start, we can speak out against pseudoscience. Demand reproduction of consistent results – this is what separates evidence from hearsay. Also, more data is not always the answer. Sometimes that can make things worse.
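A small simulation shows why demanding reproduction works. In this sketch (the function names, feature counts, and sample sizes are assumptions for illustration), we search hundreds of random, meaningless features for the one that best correlates with a random target, then check that same “discovery” on fresh data.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dredge_then_replicate(num_features, discovery_size, replication_size, seed):
    """Pick the best-looking noise feature, then re-measure it on new data."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(discovery_size)]
    best_r = 0.0
    for _ in range(num_features):
        feature = [rng.gauss(0, 1) for _ in range(discovery_size)]
        r = pearson_r(feature, target)
        if abs(r) > abs(best_r):
            best_r = r
    # The winning feature has no real link to the target, so a fresh sample
    # of it is simply more independent noise.
    new_target = [rng.gauss(0, 1) for _ in range(replication_size)]
    new_feature = [rng.gauss(0, 1) for _ in range(replication_size)]
    replication_r = pearson_r(new_feature, new_target)
    return best_r, replication_r


if __name__ == "__main__":
    found, replicated = dredge_then_replicate(
        num_features=500, discovery_size=30, replication_size=300, seed=7)
    print(f"Best correlation found by searching: {found:+.2f}")
    print(f"Same 'finding' on fresh data:        {replicated:+.2f}")
```

The searched-for correlation looks impressive only because it was selected from many candidates. More features to search makes the illusion stronger, not weaker, and only an independent replication exposes it.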

Begley laid out 6 key questions which can be used to evaluate sound research:

  • Were the studies blinded?
  • Were all results shown?
  • Were experiments repeated?
  • Were positive and negative controls shown?
  • Were reagents validated?
  • Were the statistical tests appropriate?

Conducting this sort of due diligence can be a safeguard against junk science. It is just as critical to guard against our own cognitive biases.

Because if two supposed “experts” say wildly disparate things about the same idea, all the while having the same facts, then I must inform one of them that they are fake news.

Thanks for reading!

