Cointime

The Rise of Web3 Data

From Paragraph by yassinelanda.eth

In 2023, Web3 data grew by more than 400% year-on-year, as approximated by Filecoin growth, one of the leading decentralized data storage solutions. The NFT craze in 2021 arguably kickstarted the “on-chain media” space (consumer media applications with blockchains as their backend), and we are living in a data revolution. Crypto adoption has broader implications for the media and entertainment industry, where more power is shifting to users vs platforms, and this doesn’t get as much attention as the financial implications yet.

My “onchain” Data Science journey

I started doing data science on blockchain data in 2020. It began as a weekend hobby but quickly became an obsession, so much so that I quit my AI job at AWS and started doing it full-time as the first ML hire at Chainlink Labs. I learned much during the 2021-2022 period about blockchain data structure, depth, and potential, especially in the financial markets. Just a few years ago, doing a literature review of AI papers focusing on blockchain data took only a few hours, with maybe a dozen research papers worth looking at. Nowadays there are dedicated conferences for data wizards as well as multiple academic gatherings and publications for blockchain analysis and decentralized AI.

Fast forward to the beginning of 2023, right after a few decentralized social protocols like Lens and Farcaster started getting initial adoption, I realized that a new type of data was being put on the blockchain beyond financial transactions. It felt like the beginning of something bigger for the blockchains, a universal, immutable database. The diversity of data being stored went up and most importantly, it started looking like internet data (with information on consumer web and mobile interactions on different types of media). 

As a data practitioner from the traditional web (“web2”), I found the potential of aggregating so much Web3 data obvious: in the Big Data age it is well known that “data has gravity,” and its growth becomes unstoppable once you “break the data silos”. Once an organization implements a data lake strategy, the next thing you know, all the teams in that organization start creating 2x, 50x, and 100x more data than before (proximity to existing data makes it easier to generate ideas about new insights that can be extracted, much like questions begging new questions). Even with this intuition in mind, I find myself surprised by what has been happening under the application and smart contract layers. For the past year, I have been building custom AI models for Web3 on this new data with a few data scientist friends. We now feel compelled to share more about these exciting trends.

It is estimated that 64.6 EiB are uploaded to the internet daily. This puts decentralized storage at about 3% of that traffic, and makes it a top destination for data that is growing exponentially.
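As a rough check on the 3% figure, the sketch below sets the 1.8674 EiB reported as stored on Filecoin (quoted in the next section) against the 64.6 EiB daily estimate. It assumes this is how the percentage was derived, which the article does not state; the variable names are illustrative only.

```python
# Back-of-the-envelope check using only figures quoted in this article.
# Assumption: the ~3% compares the total stored on Filecoin with the
# estimated daily internet upload volume (the article does not say so).

PIB_PER_EIB = 1024  # 1 EiB = 1024 PiB (binary prefixes)

internet_daily_eib = 64.6      # estimated daily internet uploads
filecoin_stored_eib = 1.8674   # total stored on Filecoin, February 2024
filecoin_daily_pib = 5.8       # daily Filecoin uploads, June 2023

# Stored total relative to one day of internet uploads.
stored_share_pct = 100 * filecoin_stored_eib / internet_daily_eib

# Daily Filecoin uploads relative to one day of internet uploads.
daily_share_pct = 100 * (filecoin_daily_pib / PIB_PER_EIB) / internet_daily_eib

print(f"Stored total vs. daily internet uploads: {stored_share_pct:.1f}%")
print(f"Daily uploads vs. daily internet uploads: {daily_share_pct:.4f}%")
```

The first ratio lands close to 3%, which suggests the figure compares a stored total with a daily flow. The day-to-day share computed on the second line is much smaller.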

Decentralized Storage Datasets

As of February 2024, there are 1.8674 EiB stored on Filecoin and another 155 TB on Arweave, the two leading decentralized storage solutions. The growth of this data is astonishing: more data is uploaded daily to Filecoin (5.8 PiB daily in June 2023) and Arweave (250 TB daily on average) than is uploaded to Facebook in a day. That is also more than what is uploaded daily to Twitter, Instagram, and TikTok combined! *

* estimating daily data uploaded to these based on these assumptions:

Filecoin vs Arweave data composition

It is important to understand what type of data has been uploaded in detail to these decentralized storage platforms, as that tells a story about the different Web3 use cases developers or “archivers” had in mind.

Filecoin, which leverages the composability of the IPFS system, has seen wider adoption across a large range of institutions by running programs like Filecoin Plus, an incentivization program to attract large internet datasets. Only a proportion of those datasets are Web3 / crypto related, and they are focused on blockchain archiving and NFT.storage (500 terabytes in total). The rest is a mix of large research datasets in Life Sciences, Healthcare, Environment, and Internet (see composition here and Messari report here).

proportion of Web3 / Crypto projects in Filecoin Plus program

Arweave, on the other hand, with an innovative fee structure (pay once for permanent storage) and better scaling mechanisms pioneered by the likes of Irys (previously Bundlr), has seen more adoption among “DApp” (decentralized application) developers. In particular, it became the storage location for “onchain media”, supporting developers who build new consumer applications powered by Web3 ownership (login with your wallet, collecting posts, articles, podcasts, etc.). We estimate that 100 TB (~65% of Arweave) is related to “Web3 Social”.

A great example of a decentralized social protocol that adopted the blockchain as an immutable database to guarantee “switching powers” to its users is Lens Protocol. Its team pushed the envelope of onchain data scaling by building its own L3 on Arweave, called Momoka.

Data from consumer crypto applications stored on Arweave

Multi-modality social datasets 

Monthly average data size uploads to Arweave by media modality

Over the last 2 years, we’ve seen the emergence of more and more applications that leverage the “onchain history” that users constructed by collecting NFTs into their wallets. This is becoming a powerful trend for two reasons:

  • The items that a crypto user can collect are not limited to basic profile NFTs (“PFPs”) anymore but can now span any media type imaginable. An NFT nowadays can be a social media post on Lens, a podcast on Pods, a song on Sound, a video on Odysee, or a ticket to a real event like a POAP.
  • This onchain history is also linked to other crypto users, effectively creating large “social graphs” without silos. These graphs grow as more and more applications get built on top of crypto rails, either directly, like on Lens (where a follow leaves a trace because it interacts with a smart contract) or Ethereum Follow Protocol, or indirectly, like in the case of Warpcast (every user gets a new wallet behind the scenes).
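To make the “social graph” idea concrete, here is a minimal sketch of how follow and collect events could be turned into links between wallets. The event format, wallet names, and item ids are hypothetical and are not taken from Lens, Farcaster, or any other protocol.

```python
from collections import defaultdict

# Hypothetical onchain events: (wallet, action, target).
# "follow" targets another wallet; "collect" targets a media item (NFT).
events = [
    ("alice.eth", "follow", "bob.eth"),
    ("alice.eth", "collect", "post:1"),
    ("bob.eth", "collect", "post:1"),
    ("bob.eth", "collect", "song:7"),
    ("carol.eth", "collect", "song:7"),
]

follows = defaultdict(set)     # wallet -> wallets it follows
collectors = defaultdict(set)  # item -> wallets that collected it

for wallet, action, target in events:
    if action == "follow":
        follows[wallet].add(target)
    elif action == "collect":
        collectors[target].add(wallet)


def co_collectors(wallet):
    """Wallets that collected at least one item this wallet also collected."""
    linked = set()
    for owners in collectors.values():
        if wallet in owners:
            linked |= owners
    linked.discard(wallet)
    return linked


print(sorted(co_collectors("bob.eth")))
```

Explicit follows and shared collections give two kinds of edges over the same set of wallets, which is what lets a graph span several applications at once.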

This trend has been underpinning the growth in data on Arweave, where at the peak of the cycle in 2022, 7.2 TB of NFTs and 1.25 TB of video were uploaded per month.

Archiving of Web2 Social protocols

It is worth noting that incentives exist for different actors to start bringing web2 data into these storage solutions for archiving purposes. We have already seen more than 1 TB of Web2 data from Weibo, Reddit, Twitter, Nostr, YouTube, and TikTok being uploaded to Arweave.

Data volumes started picking up again

Putting the data science hat back on and following the data, we can see an uptick of video uploads on Arweave this January (👀). And with protocols like Lens opening up to any user, we expect the social media posts portion to grow even more, with a focus on game and event streaming.

Historical monthly data upload size to Arweave by media modality, courtesy of the DataOS team.

It is amazing to see that Filecoin and Arweave have amassed an open dataset the size of the Wikipedia media dataset in only a few years, with strong guarantees around its preservation! If the amount of data grows 4x over the next 2 years, which is on the conservative side given the underlying developments, we should see a 500 TB Web3 social dataset on Arweave specifically, eclipsing Wikipedia by some margin. That is enough to train a model like ChatGPT on text alone (or one 10x bigger if using all modalities; see how much data is needed to train ChatGPT).
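The 4x projection can be sketched with simple compounding. The baseline behind the 500 TB figure is not stated, so the sketch below tries the two Arweave figures quoted earlier in this article (100 TB of “Web3 Social” data and 155 TB in total) as assumed starting points; the function names are illustrative.

```python
def project(size_tb, multiplier):
    """Dataset size after growing by the given overall multiplier."""
    return size_tb * multiplier


def annual_factor(multiplier, years):
    """Constant yearly growth factor that compounds to `multiplier`."""
    return multiplier ** (1 / years)


social_tb = 100   # "Web3 Social" estimate quoted in this article
arweave_tb = 155  # total Arweave figure quoted in this article

# Projected sizes after 4x growth from each assumed baseline.
print(project(social_tb, 4), project(arweave_tb, 4))

# Yearly growth factor implied by 4x over 2 years.
print(annual_factor(4, 2))
```

Under these assumptions the projection ranges from 400 TB to 620 TB, which brackets the 500 TB figure, and 4x over two years corresponds to doubling each year.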

The future for Web3 Social Data

Long-form and short-form video formats

Zooming out, it seems we are in the early innings of a mega trend where (1) smart contract innovation drives data growth in ecosystems like Arweave, and (2) one new cultural phenomenon (NFTs) has provided a way to tokenize media and has thereby driven the proliferation of new applications.

Following in the footsteps of Web2, I believe the next wave will be driven by video-enabled apps (decentralized “YouTubes” and “TikToks”), with many category leaders already emerging, like Odysee, which boasts some 5 million monthly active users despite all the headwinds that LBRY, the underlying blockchain supporting it, has faced.

In fact, many famous YouTubers with millions of followers, foreseeing the censorship risks of closed-off platforms, have started to actively build their audiences on Odysee as a hedge, in some cases already reaching five- to six-figure follower counts.

AI for content moderation, personalization, and generation

As the Web3 data size grows, some unique challenges and opportunities arise, especially when looked at from a data scientist's perspective.

Content moderation

First, the monetary aspects of crypto do attract bad actors, spammers, and low-quality users (“airdrop farmers”), all of which can dilute the value of the data created. Fortunately, AI techniques are showing good results in filtering out bot-generated content, for example by using network and semantic analysis. One caveat is that a good enough ground truth dataset needs to be collected first.
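As a toy illustration of content-based filtering, the sketch below flags near-identical posts published by different accounts. It is not the approach used at mbd: word-set overlap (Jaccard similarity) stands in here for real semantic features, and the post data and the 0.8 threshold are hypothetical.

```python
def tokens(text):
    """Lowercased word set; a crude lexical stand-in for semantic features."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_near_duplicates(posts, threshold=0.8):
    """Return pairs of post ids from different authors with very similar text."""
    flagged = []
    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            p, q = posts[i], posts[j]
            if p["author"] == q["author"]:
                continue  # only compare posts across accounts
            if jaccard(tokens(p["text"]), tokens(q["text"])) >= threshold:
                flagged.append((p["id"], q["id"]))
    return flagged


# Hypothetical posts for illustration.
posts = [
    {"id": 1, "author": "a", "text": "gm claim your free airdrop now"},
    {"id": 2, "author": "b", "text": "gm claim your free airdrop now"},
    {"id": 3, "author": "c", "text": "thoughts on the new storage upgrade"},
]
print(flag_near_duplicates(posts))
```

A threshold like this would need to be calibrated against labeled examples, which is where the ground truth dataset comes in.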

At mbd, we have been running LLM-like models fine-tuned on 100M X/Twitter tweets to analyze the Farcaster ecosystem. Our takeaway is that Web2 AI models need to be adapted to Web3, as the cultural norms are different.

Example of TweetNLP model (academic model trained on 100M tweets) top results for the “offensive” label, largely detecting “shitposting” and misclassifying it (among other false positives)

Content Personalization

Second, as the data size grows, discoverability becomes a problem, especially because decentralized storage solutions are hard to index and were not designed for “personalized read”. Luckily, here again, social media companies have pioneered many recommendation system techniques over the past 20 years that make mining this data efficient.

AI can be used to understand this vast lake of data and serve as an assistant in exploring it. This has the potential to surface many of the great conversations happening in Web3 now that the majority of people don't know about.

AI used to surface results based on a natural language query

Our early results leveraging the wider crypto association between users as they collect NFTs, posts, and articles are promising. Custom models fine-tuned on Farcaster data can predict what people will like, share, or reply to with high accuracy, making algorithmic feeds that generate engagement a reality for Web3 developers for the first time.

Example of AI feed builders of Web3 data. Feed comparator between two methods of ranking: “sharing” vs “replying” for the same user.

The challenge here is to avoid the pitfalls of using machine learning blindly, and instead to empower developers and users to explore the “algorithmic feed design space” and offer different discovery mechanisms that align with their values or their community's values.
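One way to picture this design space is as a choice of weights over predicted actions. The sketch below ranks the same candidate posts under a “sharing” objective, a “replying” objective, and a blend of both. The scores, field names, and weights are hypothetical and are not the output of any real model.

```python
# Hypothetical per-post probabilities, as a model might predict for one user.
candidates = [
    {"id": "p1", "p_like": 0.60, "p_share": 0.10, "p_reply": 0.30},
    {"id": "p2", "p_like": 0.40, "p_share": 0.35, "p_reply": 0.05},
    {"id": "p3", "p_like": 0.20, "p_share": 0.20, "p_reply": 0.50},
]


def rank_feed(posts, weights):
    """Order post ids by a weighted sum of predicted action probabilities."""
    def score(post):
        return sum(w * post[key] for key, w in weights.items())
    return [p["id"] for p in sorted(posts, key=score, reverse=True)]


# Same user, same candidates, different objectives.
sharing_feed = rank_feed(candidates, {"p_share": 1.0})
replying_feed = rank_feed(candidates, {"p_reply": 1.0})
blended_feed = rank_feed(
    candidates, {"p_like": 0.2, "p_share": 0.4, "p_reply": 0.4}
)

print(sharing_feed, replying_feed, blended_feed)
```

Exposing the weights, rather than hard-coding a single objective, is one simple way to let a developer or community choose the behavior their feed optimizes for.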

Content Generation

Last, we live in a data-hungry world where the competitive advantage between AI models does not lie in their size or the amount of compute available to companies (after a certain hurdle is passed) but in the quality and uniqueness of the datasets they are trained on.

Building content generation models tailored to the Web3 audience, as well as complementing more mainstream ones with this data to help brands appeal to a growing audience, is an awesome opportunity (that I am excited to be working on!). This is especially true when you couple AI training with incentive mechanisms to keep solving the long-tail AI problem, concepts that Web3 developers have pioneered and excelled at.
