What if "big data" just…isn't worth very much?
I'm not saying the emperor has no clothes. I am saying his clothes are cheap, tacky, don't work and are seriously overrated.
“Data is the new oil” has been a tech catchphrase for so long that we tend to forget who it was who actually said the words. It wasn’t an American tech billionaire, it turns out, but British entrepreneur Clive Humby – the man who invented the Tesco Clubcard.
His comment, which might have seemed bold in 2006, strikes us today as just obviously true. The six biggest public companies in the world are all tech giants, and all but one seem to owe much of their value to data.1 Alphabet and Meta take the vast majority of their revenue from targeted advertising. Apple, Microsoft and Amazon also take billions in ad revenue, but the latter two also provide much of the infrastructure upon which the data economy relies.
The world’s biggest companies got their dominant position through data. We’ve also spent most of the last two decades hearing about the dangers of all of this data – micro-targeting is undermining the foundations of our democracy, data monopolies mean it is impossible to compete with big tech, and big data allows surveillance on a scale that we’ve never imagined.
As I’ve said before: Big data is the commodity of the 21st century, but this time it’s us – or the aggregate of our lives and experiences – that is in the pipelines. Except…what if it’s not? What if it’s much less important than we’ve all made out?
Big tech is more powerful than ever, and is powered by data. AI – the tech we’re told is going to revolutionise everything – relies on data. The idea that data itself is overhyped seems mad. But hear me out. (And if you don’t subscribe yet, please think about doing so…)
How powerful is data? Let’s look at online ads
Advertising – selling our attention to big companies who want to sell us stuff – is the business model of the internet, and it’s all driven by the data of big tech companies. Many of us believe online adverts have an uncanny ability to target us: a majority think (falsely) that sites like Facebook listen in on our conversations using our phones to target their ads.
Most of the media discussion around online adverts imagines them as a disturbingly compelling, brilliantly targeted vehicle of persuasion against which we as a public are left essentially defenceless. The actual experience of browsing the internet, though, is very different – most of us would complain that we are bombarded by low-quality, repetitive, and outright spammy adverts, if not ads for outright scams.
Just step back and think for a moment: when is the last time you got a good online advert? How often have you marvelled at how brilliant the selection of ads in your social media feed is? How targeted are they really – are you getting ads beautifully tailored for your interests, or are you seeing the same low-grade slop and drop-shipping ads as everyone else?
The goal of online advertising is to serve consumers relevant ads for products and services they might actually buy – this will drive high clickthrough and conversion rates. If data was making this process better, then we should see clickthrough rates on adverts improving each year, as the targeting improves, and we could also expect to see fewer adverts in total, as revenue from the ultra-successful targeted ads would obviate the need for numerous irrelevant ones.
The reality of the modern internet is the absolute opposite: clickthrough rates for online display advertising are incredibly low, at just 0.05% to 0.1% or lower, and that is down rather than up over the last decade. These low click rates have led to sites running ever more ad slots, with more intrusive formats forcing users to click past (or ads that expand as the user scrolls past them) – leading to widespread public dissatisfaction: 72% of the public say intrusive ads damage their perception of brands, 71% say it makes them less likely to purchase, and 86% say they feel overwhelmed by ad overload. Moreover, in this supposedly ultra-tracked and targeted industry, ad fraud rates are estimated at around 22% of all spend.
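To put those clickthrough figures in perspective, here is a rough back-of-the-envelope sketch. It uses only the rates quoted above (0.05% to 0.1% clickthrough, roughly 22% of spend lost to fraud); the £100 budget and the function names are my own illustrative assumptions, not anything from an ad platform.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the text.
# The budget figure and helper names are illustrative assumptions only.

CTR_LOW = 0.0005    # 0.05% clickthrough rate
CTR_HIGH = 0.001    # 0.1% clickthrough rate
FRAUD_SHARE = 0.22  # estimated share of ad spend lost to fraud


def impressions_per_click(ctr: float) -> float:
    """Average number of ad impressions served for every single click."""
    return 1 / ctr


def spend_after_fraud(budget: float, fraud_share: float = FRAUD_SHARE) -> float:
    """Portion of a budget left once the estimated fraud share is removed."""
    return budget * (1 - fraud_share)


for ctr in (CTR_LOW, CTR_HIGH):
    print(f"CTR {ctr:.2%}: about {impressions_per_click(ctr):,.0f} impressions per click")

# Hypothetical £100 budget, purely for illustration.
print(f"Of every £100 spent, roughly £{spend_after_fraud(100):.0f} escapes fraud")
```

In other words, at these rates an advert has to be shown somewhere between 1,000 and 2,000 times to earn one click, and that is before discounting the share of the budget that never reaches a real person.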
In short, every trend in online ads is moving in the opposite direction to the one we would expect in a world in which ever-better data is improving their targeting and relevance year-on-year. Companies can’t target ads, no-one can track who’s being shown what, and the public hate what we see. If online brand ads are the torchbearer for the value of big data, then…big data doesn’t have that much of a case.
Micro-targeted political ads put Trump into the Oval Office, though, didn’t they?
A big claim in the years after the Brexit referendum was that “micro-targeting” or online manipulation was behind both the UK’s vote to leave the European Union and the victory of Donald Trump in 2016.
One reason people made this case was that it was easier to face up to the idea that voters had been tricked than the idea that voters wanted Brexit or Donald Trump. If you were a British or American liberal, the idea that people had been hoodwinked required less soul-searching and less reflection.
With almost a decade of hindsight, the premise that clever online manipulation led to either phenomenon should look much shakier. Did the Brexit campaign use millions of ultra-targeted slogans, or just one very good one: Take Back Control? The campaign certainly broke some spending rules (though Remain spent more on Facebook ads than Leave did), but it also just…ran a better campaign, albeit for a terrible policy.
From the vantage point of 2025, the idea that Trump only won because of online manipulation looks almost like wishful thinking. Donald Trump won both the popular vote and the electoral college in the 2024 election, in which voters had every bit of information they could possibly wish for on Trump – including his criminal conviction, civil ruling for sex crimes, and his role in the Jan 6th insurrection. To reach for the comfort that voters only choose candidates like Trump because they were tricked is foolish.
Much was made of Cambridge Analytica in particular for both races, though in reality an investigation by the UK’s Information Commissioner found it did not work for Vote Leave on the Brexit race, and in the 2016 US race it was used far more by Ted Cruz’s campaign than Donald Trump’s.2
It is not hard to find slide decks or other presentations in which companies like Cambridge Analytica hype up the power of their tools – because they are snake oil salesmen, trying to find gullible politicians to pay for their work. Scare stories are some of the best marketing they can find.
In practice, the sophisticated psychology-based ad approach of Cambridge Analytica was quickly junked by almost everyone, for the simple reason that it proved less effective than a much simpler targeting mechanism: lookalike audiences. Brands upload a list of their existing customers and ad networks target people who are broadly similar to them, based on estimated income and geography.3
This is still a data-driven approach, but it’s a much simpler one. Similarly, targeting works best at its simplest: the reason you see sofa adverts for weeks after buying a sofa is that having been on the purchase page for one is a better predictor you might buy a sofa soon than any micro-targeted data could ever be – meaning it’s worth sending ads to people who’ve already bought one, just to get to the people who haven’t.
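For the curious, the lookalike idea is simple enough to sketch in a few lines. This is a toy illustration under heavy assumptions, not how any real ad network works: the `Person` fields, the made-up figures and the "distance from the average customer" scoring are all hypothetical, chosen only to show that "find people broadly like my existing customers" needs very little data.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass
class Person:
    name: str
    income_k: float        # estimated annual income, in thousands (hypothetical)
    km_from_centre: float  # crude stand-in for geography (hypothetical)


def average_customer(customers: list[Person]) -> tuple[float, float]:
    """Average income and distance across the uploaded customer list."""
    return (mean(c.income_k for c in customers),
            mean(c.km_from_centre for c in customers))


def dissimilarity(person: Person, centre: tuple[float, float]) -> float:
    """Distance from the average customer, scaled so the units are comparable.

    Assumes both averages are positive, which holds for this toy data.
    """
    d_income = (person.income_k - centre[0]) / centre[0]
    d_km = (person.km_from_centre - centre[1]) / centre[1]
    return sqrt(d_income ** 2 + d_km ** 2)


def lookalikes(customers: list[Person], prospects: list[Person], top_n: int) -> list[Person]:
    """Return the top_n prospects who most resemble the existing customers."""
    centre = average_customer(customers)
    return sorted(prospects, key=lambda p: dissimilarity(p, centre))[:top_n]


# Entirely made-up data for illustration.
customers = [Person("c1", 45, 5), Person("c2", 55, 7), Person("c3", 50, 6)]
prospects = [
    Person("prospect_a", 52, 6.5),
    Person("prospect_b", 20, 40),
    Person("prospect_c", 48, 5),
    Person("prospect_d", 90, 2),
]

targets = lookalikes(customers, prospects, top_n=2)
print([p.name for p in targets])
```

Real systems draw on far more signals than two columns, but the shape of the exercise is the same: a customer list, a handful of basic attributes, and a ranking.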
That means regular-sized data – knowing who your customers are and their basic demographics – has serious value for businesses, just like having the list of likely voters and donors has value for political parties and consultants. But that value is intrinsic and obvious – there’s not much mystery in it. The promises that “big data” could change the game have, even after several decades, amounted to very little.
Has big data revolutionised offline retail?
Let’s come back to Clive Humby and his invention of the Tesco Clubcard – one of the first supermarket loyalty cards in the world. These things are now ubiquitous, and we’ve all spent years hearing about how, in exchange for a small discount or points to collect, we’ve given over huge quantities of data about our shopping habits.
The power of loyalty scheme data reached such mythical status that one of the best-known stories on big data comes from it – the tale of Target sending an advert for maternity wear and baby clothing to a teenage girl, before she even knew she was pregnant. Despite its wide retelling, and its origin in the New York Times, the tale is likely apocryphal,4 but it has spread undiminished nonetheless. The big data promise of loyalty schemes is that they provide such good consumer insight that they allow for targeted promotion and price discrimination, as well as serving as a discovery mechanism on consumer preferences.
But all of this ignores that they operate on a much simpler level, too: they give customers a reason to pick out one supermarket over another very similar rival. Once you have points in a certain rewards system, it makes sense to maximise them – meaning customers have something to tip them towards, say, Tesco over Asda.
The way in which Tesco uses Clubcard in 2025 illustrates this second use case. Instead of targeted, individualised discounts, Tesco now uses “Clubcard prices” – blanket discounts (often large and advertisable ones) on products for all Clubcard customers. These included heavy discounts on Christmas veg and Baileys for December 2024, and routine offers throughout the year, usually backed up with heavy promotion in TV and news media.
These large, universal discounts are particularly revealing, because they actually diminish the value of Clubcard purchase data by distorting users’ usual purchase patterns (though they may reveal the effectiveness of different promotions). The value of the discount, and the saving from limiting it to the price-sensitive consumers who use the card, may outweigh the value of the data collected.
At the simplest level, the question as to whether Clubcard and its rivals were worth billions to the supermarkets that used them can be answered by one simple number: what have they done to supermarket profit margins? Twenty years ago, UK supermarkets ran on a net profit margin of around 5%-6%, meaning that for every £1 you spent in the store, they kept about 5p (making them among the most competitive in the world). Today, they run on a margin of about…3%. There are, of course, endless factors to explain that, but the simple truth is that grocery retail is less profitable today than it was when Clubcard was rolled out. Big data hasn’t made it more lucrative.
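The margin arithmetic is worth spelling out, because the fall is bigger than it sounds. This is a minimal sketch using only the rough margins quoted above (5%-6% then, about 3% now); the function names are mine, and the figures are the article's approximations rather than audited accounts.

```python
# Rough margin arithmetic using the approximate figures quoted in the text.

MARGIN_THEN_LOW = 0.05   # lower end of the margin twenty years ago
MARGIN_THEN_HIGH = 0.06  # upper end of the margin twenty years ago
MARGIN_NOW = 0.03        # approximate margin today


def pence_kept_per_pound(margin: float) -> float:
    """Pence of net profit a supermarket keeps from each £1 of sales."""
    return margin * 100


def relative_change(old: float, new: float) -> float:
    """Proportional change from old to new (negative means a fall)."""
    return (new - old) / old


for then in (MARGIN_THEN_LOW, MARGIN_THEN_HIGH):
    fall = -relative_change(then, MARGIN_NOW)
    print(
        f"{pence_kept_per_pound(then):.0f}p per £1 then, "
        f"{pence_kept_per_pound(MARGIN_NOW):.0f}p now: a fall of about {fall:.0%}"
    )
```

On these rough numbers, the profit kept from each pound of shopping has dropped by somewhere between two-fifths and a half over the period in which loyalty data was supposed to be transforming the business.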
Even the industries that really seem to need big data actually only need pretty basic information to function. Airline loyalty programmes are among the most longstanding and successful in the world, with whole online communities dedicated to discussing their intricacies.
But these require almost no data to be effective: their strength lies in persuading frequent fliers to argue with their booking services to fly on your airline rather than one of your rivals, or in persuading a passenger to take a flight at a less convenient time (or via a worse route) in the interest of preserving their status. The direct behaviour change is enough to generate huge value for the airline – just as credit card loyalty programmes like Amex points are enough to make customers turn a blind eye to punishing interest rates or annual fees.
Data is a compelling secret sauce to investors and the media, but the operation of such schemes in practice doesn’t rely on it: if airlines know how full different flights are at different price points, and how effective different upselling tactics are, there’s almost no extra tailored information on their customers they need to maximise their profits.
Data as a ‘moat’
Until quite recently, there was a great argument to be made about the value of big data, and it was that the huge stockpiles of data big tech companies had amassed made them almost impossible to replace.
This is most obvious with social media, where the “social graph” – the connections between our friends and our followers – makes it hard to move to another site, because we can’t take those relationships with us. To move to a new social network is to start from scratch, so once a site is dominant, it stays dominant.
That social graph is still an immensely powerful force, but it doesn’t seem as strong now as once it did. It is hard to argue that people won’t leave their social graph when many of us…obviously have. Elon Musk bought Twitter, rebranded it to X, and has spent all the time since pretending he hasn’t noticed the mass exodus that followed.
Elsewhere, Facebook still has billions of users, but is almost irrelevant in Western discourse, while it is increasingly hard to argue that new social networks can’t break through while we’re also being alarmed by the rise of TikTok.
Google once looked like it had an unassailable moat, thanks to big data. Its core business is serving up adverts based on what someone is searching for – the better the ads, the better the revenue. And simultaneously, the better the search results, the better things work out – so that users don’t leave and go elsewhere.
Google sees not only what you search for but what you do with the results once you get them, and where you are when you do it, and because so many of us use it, it does this billions of times every day. Until quite recently, many of us thought that made it all but impossible to compete with Google – it could continually improve its product, improve its ads, and rake in money, thanks to its insurmountable data advantage.
And then, two things happened. The first was that, for whatever reason,5 Google results started getting worse. The second was that generative AI came along – and lots of power users switched to using it (close to half of people have knowingly used a chatbot, a sizeable minority of them regularly – an extremely fast rollout for a new tech, whatever you think of its virtues).
Google seemed less able to use its data advantage to improve its product, just as a rival product that doesn’t rely on that data emerged. The idea that the value of big data lies in scaring off potential competition seems much weaker than it used to.
Hang on a sec: if it’s not data, why are tech giants worth so much?
To argue that the huge volumes of data generated by Alphabet and Meta aren’t integral to the operation of the ad market is to invite an obvious question. Alphabet makes around $250 billion a year in ad revenue, and Meta a little over $150 billion – what is generating that incredible income if it’s not the data?
The simple answer is attention. Meta has around three billion users across its social networks: Facebook, Instagram, Threads and WhatsApp. A typical US Instagram user spends 33 minutes a day on the network, while an active Facebook user is on the site for 30 minutes a day. Alphabet can boast around 2.5 billion users of YouTube, and its active US users spend 49 minutes a day there. It also has a near-monopoly on search, where it can deliver adverts based on search queries.
Search ads are simple: you don’t need to know much about a user to serve the most relevant ad on a search saying “buy flowers” – and brands wanting to preserve market share are all but forced to advertise. Alphabet and Meta have the attention of billions of users across the world, and often only need the most basic of information to be able to serve them adverts – that attention alone can explain their revenue, without any need for their data to generate value.
Questioning the value of “big data” as a commodity, then, doesn’t intrinsically shoot down the valuations of big tech. They command so much of our attention, and they do so on a regular and reliable basis. That alone is worth a fortune, and it only really needs minimal data to make it monetisable.
That’s why when I wrote that subheading about the emperor’s new clothes all those thousands of words ago, I didn’t try to say the emperor was naked. Big tech has clearly still got something hugely valuable. It’s just not necessarily the thing we thought it was.
Hang on another sec: what about AI?
If any companies should validate the data-as-the-engine-of-the-global-economy hypothesis, it should surely be the ones creating next-generation artificial intelligence.
OpenAI, for example, is still technically a not-for-profit, but its valuation has increased from around $14 billion in 2021 to $157 billion by the end of 2024. The data it used to train its models wasn’t its own, and wasn’t proprietary. The company has stopped disclosing what training data it uses, but early versions were trained on datasets of unpublished books, Reddit posts, Wikipedia, and hundreds of gigabytes of data from the open web.
ChatGPT and its rivals have clearly generated immense value in the eyes of investors, and they relied upon huge volumes of data to be trained – which on one level feels like a great argument in favour of data as a generator of value.
In reality, the argument is more complex than that. OpenAI used and discarded the data on which GPT was trained – it is not kept as a reference library for the finished model. Crucially, that data was not paid for, either. It was a necessary thing for the development of the model, but it’s the model that is OpenAI’s secret sauce, not the data upon which it draws (even if that data use is the subject of several ongoing lawsuits).6
The sketchy way in which AI companies have trained their models has caused huge (and deserved) uproar and has the potential to cost AI companies billions in damages.7 But as the development of LLMs continues, it is getting easier and cheaper to train ever-more powerful AIs on less data, relying less on pirated material. The Chinese model DeepSeek was trained faster and more cheaply than many of its US counterparts.
To say AI shows the value of “big data” is a bit like suggesting that pigs show the value of “big food waste”. Pigs will eat pretty much any old scraps, and turn them into a sellable product – themselves – in much the same way that electricity can be generated from household waste.
There is some value in food scraps and household waste as an input, which is more-or-less the role big data plays in the AI process. It’s worth something. It’s just not the miracle commodity we believed it all to be.
What does all of this mean, then?
Around fifteen miles from San Fernando – the second city of Trinidad and Tobago – there is a lake unlike most others on the planet, in which an unmistakable black hydrocarbon bubbles up to the surface, endlessly refilling itself from underground.
Pitch Lake is not full of oil, though, but (as its name suggests) pitch, or tar. Anyone mistaking this stuff for oil would be sadly mistaken. Though it is made up of the same components, and is formed in the same way, tar in the wrong place will do nothing more than gum up essential machinery and make a huge mess. Confusing the two would be a costly error indeed, even though tar itself is both useful and holds some value, if it’s used in the right way.
Perhaps we need to ask ourselves whether data is truly the new oil, or whether it instead resembles something more like tar. Big business needed a new story to explain why it was different this time, and there was something new that could power a new era of economic growth. Data made for a better story than the connection that the internet brought – Google and Facebook’s value coming from taking the public’s attention away from TV and newspapers is a less glamorous story than them having unprecedented insight into our lives thanks to a world-changing technology.
We essentially forced data to become the central character in a story we needed to tell ourselves about the world – and before long there was too much investment, too many jobs, and far too many reputations staked on continuing that narrative. There is, of course, huge value to be derived from data: what sells when, how full different travel routes are at different prices, and the like – but this data is generally aggregated, used in predictable ways, and doesn’t feel like magic. That useful data is now bogged down in huge volumes of additional personal data of limited, or perhaps even no, value.
We shouldn’t allow this premise to continue unchallenged. The power of big tech is very real, and the dominance of a handful of companies over the global economy is indisputable. What’s being questioned here is the basis of that power – because to regulate tech effectively, we need to know the real driver of its success.
Challenging the data hegemony could be the start of finding real accountability, if only we’re brave enough to say that the new tech-emperors are, if not naked, perhaps more scantily clad than we may have been led to believe. And I can only apologise for leaving you with that mental image.
Postscript: I hope you’ve found some of the above interesting – I’m trying to work out a few ideas here a bit less formally than in my fully-reported features, and this is something that’s been on my mind for months.
If you enjoyed this feature, please do subscribe: a free subscription lets me know people appreciate the content, and lets you see when I post more (I’m currently aiming for 2-3 posts a month), and paid subscriptions are what make it viable for me to spend time posting here – so please do support if you’re able. Thanks!
1. The sixth is NVIDIA, whose valuation lies in the usefulness of its chips to AI training and development – and so indirectly relies upon data.
2. I did a long analysis of Cambridge Analytica and Brexit here, for those who aren’t convinced. Yes, it’s published in the right-leaning Spectator, but I am also political editor of the pro-EU magazine The New World, and I voted Remain.
3. This is also why you see ads for products you bought offline: https://www.vox.com/recode/2019/12/19/21011527/retail-tracking-apps-wifi-bluetooth-facebook-ads
4. A good summary of the reasons is given here: https://medium.com/@colin.fraser/target-didnt-figure-out-a-teen-girl-was-pregnant-before-her-father-did-a6be13b973a5
5. Ed Zitron has a compelling theory here: https://www.wheresyoured.at/the-men-who-killed-google/
6. Most notably from the New York Times: https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
7. Conveniently I’ve just written on exactly this for transformer.ai