Author: Lesley, Shelly, Footprint Analytics Researchers
Key Takeaways:
- The evolution of large language model (LLM) technology has increased the focus on the convergence of AI and Web3, ushering in new application paradigms. In this report, we explore how AI can improve the user experience and productivity of blockchain data.
- Due to the industry’s early stage and blockchain technology’s unique characteristics, Web3 data faces several challenges including data sources, update frequency, and anonymity attributes. Solving these challenges through AI has become a central point of interest.
- LLMs, with their advantages of scalability, adaptability, efficiency improvement, task decomposition, accessibility, and ease of use compared to traditional AI, open up avenues for improving the experience and productivity of blockchain data.
- A large amount of high-quality data is essential for training LLMs. With its rich vertical knowledge and openly available data, blockchain is a valuable source for LLM learning materials.
- In addition, LLMs also play a role in facilitating the production and increasing the value of blockchain data, which includes tasks such as data cleansing, labeling, and structured data generation.
- It’s important to note that LLMs are not a silver bullet. They must be applied based on specific business needs. Striking a balance between leveraging the high efficiency of LLMs and ensuring the accuracy of the results is critical.
- The Evolution and Convergence of AI and Web3
- History of AI: Artificial intelligence (AI) can be traced back to the 1950s. Since 1956, researchers have focused on AI, gradually developing early expert systems designed to address challenges in specialized domains. Subsequently, the advent of machine learning broadened the application scope of AI, enabling its widespread use across multiple industries. Fast forward to today, the explosion of deep learning and generative AI has opened up limitless possibilities, with each step marked by continuous challenges and innovations in the quest for higher levels of intelligence and broader applications.
On November 30, 2022, the debut of ChatGPT showcased for the first time the potential for AI to interact with humans in a user-friendly and efficient way. ChatGPT sparked a broader conversation about artificial intelligence, reshaping how we interact with AI to make it more efficient, intuitive, and human-centric. This shift also increased interest in various generative AI models, including those from Anthropic (backed by Amazon), Google DeepMind, Meta’s Llama, and others that have since gained prominence. At the same time, professionals in various industries have begun actively exploring how AI can drive advances in their respective fields. Some are combining AI technologies to differentiate themselves within their industries, further accelerating the convergence of AI across domains.
- The Convergence of AI and Web3
Web3’s vision begins with transforming the financial system to empower users and even drive the change of modern politics and culture. Blockchain technology serves as a robust technology to achieve this goal, not only by reshaping the transmission of value and incentives but also by facilitating resource allocation and decentralization.
As early as 2020, blockchain investment firm Fourth Revolution Capital (4RC) foresaw integrating blockchain technology with AI, envisioning a decentralized transformation of global sectors such as finance, healthcare, e-commerce, and entertainment.
The convergence of AI and Web3 revolves around two key aspects:
- Leveraging AI to improve productivity and user experience.
- By harnessing the transparent, secure, decentralized storage, traceable, and verifiable features of blockchain technology, combined with the decentralized production relations of Web3, we can address the challenges that traditional approaches have been unable to solve. This integration not only incentivizes community participation but also enhances production efficiency.
Some of the following directions for exploring the combination of AI and Web3 are currently available on the market:
- Data: Blockchain technology is being used to store model data, provide encrypted records to protect privacy, document the origin and use of data used by the model, and verify data authenticity. AI can extract valuable insights for model training and optimization by accessing and verifying data stored on the blockchain. At the same time, AI can act as a data production tool, increasing the efficiency of Web3 data processing.
- Algorithms: Web3’s algorithms provide AI with a more secure, reliable, and autonomously controlled computing environment. They provide cryptographic protection for the AI system, embedding security safeguards within model parameters to thwart misuse or malicious activity. AI can interact with Web3’s algorithms, perform tasks, validate data, and make decisions through smart contracts. At the same time, AI’s algorithms contribute to Web3 by providing more intelligent and more efficient decisions and services.
- Computing Power: Web3’s distributed computing resources provide AI with high-performance computing capabilities. AI uses these resources for model training, data analysis, and prediction. By distributing computational tasks across multiple network nodes, AI accelerates computational speed and manages larger data sets.
In this article, we focus on using AI technology to improve the processing productivity and user experience of Web3 data.
- Web3 Data Landscape
- Comparison of Web2 & Web3 Data
As the fundamental element of AI, Web3 and Web2 data have significant differences. These differences are mainly due to the application architectures of Web2 and Web3, resulting in different features of the data they generate.
- Comparison of Web2 & Web3 Application Architectures
In the Web2 architecture, web pages or application control is typically centralized within a single entity, often a company. This entity has absolute authority over the content it develops, determines access privileges to server content and logic, defines user rights, and dictates the lifespan of online content. There are many instances where Internet companies have the power to change platform rules or terminate services without allowing users to retain the value they’ve contributed.
In contrast, the Web3 architecture leverages the concept of the Universal State Layer, which places some or all of the content and logic on a public blockchain, allowing users to control that content and logic directly. Unlike Web2, Web3 users do not need authorized accounts or privileged API keys to interact with what’s on the blockchain, except for certain administrative operations.
- Comparison of Web2 and Web3 Data Features
Web2 data is typically characterized by its closed and highly restricted nature, with complex permission controls, a high level of sophistication, multiple data formats, strict adherence to industry standards, and intricate business logic abstractions. Although this data is vast, its interoperability is relatively limited. It is typically stored on centralized servers, lacks privacy awareness, and often involves non-anonymous interactions.
In contrast, Web3 data is more open and widely accessible, albeit at a lower level of maturity. It consists primarily of unstructured data, with little standardization and relatively simplified business logic abstractions. While Web3 data is smaller in scale than Web2 data, it offers high interoperability, such as EVM compatibility. Data can be stored in a decentralized or centralized manner, with a strong emphasis on user privacy. Users often interact anonymously on blockchains.
- Web3 Data Industry Status, Outlook, and Challenges
In the Web2 era, data was as precious as “oil reserves,” and accessing and acquiring large amounts of data has always been a significant challenge. In Web3, the openness and sharing of data have made it seem like “oil is everywhere,” allowing easier access to more training data for AI models, which is critical to improving model performance and intelligence. However, challenges remain in processing this Web3 “new oil,” primarily as follows:
- Data Source: On-chain data is complicated and scattered, resulting in significant data processing costs.
Processing on-chain data involves time-consuming indexing processes that require significant effort from developers and analysts to adapt to the data differences across blockchains and projects. The Web3 data industry lacks harmonized production and processing standards, except for blockchain ledger entries. Individual projects primarily define and produce data like events, logs, and traces. This complexity makes it difficult for non-professional traders to identify accurate and trustworthy data, complicating on-chain trading and investment decisions. For example, decentralized exchanges such as Uniswap and PancakeSwap may have different data processing methods and definitions, and the process of verifying and standardizing these definitions adds to the complexity of data processing.
- Data Update: The large volume and high update frequency of on-chain data make timely processing into structured data challenging.
The dynamic nature of blockchain, with updates occurring in seconds or even milliseconds, underscores the importance of automated processing. However, the Web3 data industry is still in its infancy. The proliferation of new contracts and iterative updates, coupled with a lack of standards and diverse data formats, adds to the complexity of data processing.
- Data Analysis: The anonymous nature of on-chain data presents challenges in distinguishing data identities.
On-chain data often lacks sufficient information to uniquely identify each address, making it difficult to correlate on-chain data with off-chain economic, social, or legal movements. Nevertheless, understanding how on-chain activity correlates with specific individuals or entities in the real world remains critical for specific scenarios.
With the discussion of productivity changes brought about by LLMs, the ability to leverage AI to address these challenges has become one of the central focuses in the Web3 industry.
- Results When AI Collides with Web3 Data
- Differences between Traditional AI and LLM
When it comes to model training, traditional AI models are typically modest in size. The number of parameters ranges from tens of thousands to millions. However, ensuring the accuracy of output results requires a significant amount of manually labeled data. Much of an LLM’s formidable strength lies in its use of massive corpora to train parameters numbering in the tens or hundreds of billions. This greatly enhances its understanding of natural language, but at the same time requires far more training data, resulting in particularly high training costs.
In terms of capabilities and modes of operation, traditional AI excels at tasks within specific domains, providing relatively precise and specialized answers. LLMs, on the other hand, are better suited to general tasks but are prone to hallucinations. This means that in certain scenarios, their answers may lack the required precision or specialization, or even be completely wrong. Consequently, for results that require objectivity, reliability, and traceability, multiple checks, repeated training, or the introduction of additional error correction mechanisms and frameworks may be necessary.
- Traditional AI Practices in the Web3 Data Industry
Traditional AI has proven its importance in the blockchain data industry, bringing greater innovation and efficiency to the field. For example, the 0xScope team uses AI techniques to develop a graph-based cluster analysis algorithm. This algorithm accurately identifies related addresses among users by assigning weights to various rules. The application of deep learning algorithms improves the accuracy of address clustering, providing a more precise tool for data analysis. Also, Nansen uses AI for NFT price prediction, providing insights into NFT market trends through data analysis and natural language processing. Trusta Labs utilizes a machine learning approach based on asset graph exploration and user behavior sequence analysis to strengthen its Sybil detection solution’s reliability and stability, contributing to the blockchain network’s overall security. Goplus strategically integrates traditional AI into its operations to improve the security and efficiency of decentralized applications (dApps). Their approach involves collecting and analyzing security information from dApps to provide rapid risk alerts, thereby mitigating risk exposure for these platforms. This includes identifying risks in dApp host contracts by assessing factors such as open source status and potential malicious behavior. In addition, Goplus compiles detailed audit information, including audit firm credentials, audit times, and links to audit reports. Footprint Analytics uses AI to generate code that produces structured data, facilitating the analysis of NFT transactions, wash trading activity, and bot account screening.
However, traditional AI is constrained by its limited information and focuses on performing predefined tasks using predefined algorithms and rules. In contrast, large language models (LLMs) capture and generate natural language by learning from rich natural language data, making them better suited for processing complex and large textual data. With the remarkable progress of LLMs, new considerations and explorations have emerged regarding integrating AI with Web3 data.
- Advantages of LLMs
LLMs boast several advantages compared to traditional AI:
- Scalability: LLMs excel at handling large amounts of data.
LLMs demonstrate remarkable scalability, efficiently managing large volumes of data and user interactions. This capability makes them exceptionally well-suited for tasks that require extensive information processing, such as text analysis or large-scale data cleansing. Their robust data processing capabilities offer the blockchain data industry tremendous analytical and practical potential.
- Adaptability: LLMs learn and adapt to requirements from diverse domains.
With outstanding adaptability, an LLM can be fine-tuned for specific tasks or integrated into industry-specific or private databases. This feature allows it to quickly learn and adapt to the subtle differences between different domains. An LLM is an ideal choice for addressing diverse, multi-purpose challenges and providing comprehensive support for blockchain applications.
- Increased Efficiency: LLMs automate tasks to increase productivity.
The high efficiency of LLMs significantly streamlines operations within the blockchain data industry. They automate tasks that traditionally require significant manual effort and resources, increasing productivity and reducing costs. An LLM can generate large amounts of text, analyze massive datasets, or perform various repetitive tasks within seconds, minimizing wait and processing times and improving the overall efficiency of blockchain data processing.
- Task Decomposition: Create specific plans for specific tasks, breaking them down into manageable steps.
LLM agents can generate detailed plans for specific tasks, breaking down complex tasks into manageable steps. This feature proves to be highly beneficial when working with extensive blockchain data and performing complex data analysis tasks. By breaking down large jobs into smaller tasks, an LLM can skillfully manage data processing flows and ensure the delivery of high-quality analytics.
- Accessibility and usability: LLM enables user-friendly natural language interactions.
LLM’s accessibility simplifies interactions between users and data, promoting a more user-friendly experience. By leveraging natural language, LLM facilitates easier access and interaction with data and systems, eliminating the need for users to understand complex technical terms or specific commands such as SQL, R, Python, etc. for data acquisition and analysis. This feature broadens the user base of blockchain applications, allowing more people, regardless of technical sophistication, to access and use Web3 applications and services. As a result, it promotes the development and widespread adoption of the blockchain data industry.
- Convergence of LLM with Web3 Data
Training LLMs relies on large amounts of data, with patterns within the data serving as the model’s foundation. The interaction and behavioral patterns embedded in blockchain data serve as the driving force for LLM learning. The quantity and quality of data also directly impact the effectiveness of the LLM.
Data isn’t just a resource that an LLM consumes. LLMs also contribute to data production and can even provide feedback. For example, LLMs can assist data analysts by contributing to data preprocessing, such as data cleansing and labelling, or by generating structured data that removes noise and highlights valuable information.
- Technologies that Enhance LLMs
ChatGPT not only demonstrates the strong problem-solving capabilities of LLMs but has also sparked global exploration of how to integrate external capabilities into their general capabilities. This includes enhancing generic capabilities (such as context length, complex reasoning, mathematics, code, multimodality, etc.) and extending external capabilities (handling unstructured data, using more advanced tools, interacting with the physical world, etc.). Integrating domain-specific knowledge from the crypto field and personalized private data into the general capabilities of large models is a key technical challenge for the commercial application of LLMs in this domain.
Currently, most applications focus on Retrieval-Augmented Generation (RAG), using prompt engineering and embedding techniques. Existing agent tools primarily aim to improve the efficiency and accuracy of RAG. The main architectures on the market for application stacks based on LLM technologies include the following:
- Prompt Engineering
Most practitioners use basic prompt engineering solutions when building applications. In this method, specific prompts are designed to quickly modify the model’s inputs to meet the needs of a particular application. However, basic prompt engineering has limitations, such as delayed database updates, content redundancy, context length constraints, and limits on multiple rounds of conversations.
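To make this concrete, below is a minimal sketch of the prompt-engineering pattern for on-chain data questions. The template, the example context rows, and the `llm_complete` placeholder are illustrative assumptions rather than any specific product’s implementation.

```python
# A minimal sketch of prompt engineering for on-chain data questions.
# llm_complete() is a placeholder for whichever chat/completion API is
# actually used; the prompt template is the part being illustrated.

CONTEXT_TEMPLATE = """You are an analyst of EVM blockchain data.
Answer using ONLY the context below. If the context is insufficient, say so.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, context_rows: list[str]) -> str:
    # Inject freshly queried, structured on-chain records as context,
    # which mitigates the stale-knowledge problem of a frozen base model.
    return CONTEXT_TEMPLATE.format(context="\n".join(context_rows), question=question)

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

if __name__ == "__main__":
    rows = [
        "2024-01-02 | Uniswap v3 | ETH/USDC | volume_usd=1,234,567",
        "2024-01-03 | Uniswap v3 | ETH/USDC | volume_usd=987,654",
    ]
    prompt = build_prompt("What was the ETH/USDC volume trend?", rows)
    print(prompt)  # print(llm_complete(prompt)) once a provider is wired in
```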
As a result, the industry is exploring more advanced solutions, including embedding and fine-tuning.
- Embedding
Embedding is a widely used mathematical representation of a set of data points in a lower-dimensional space that efficiently captures their underlying relationships and patterns. By mapping object attributes to vectors, embedding can quickly identify the most likely correct answer by analyzing the relationships between vectors. Embeddings can be built on top of an LLM to exploit the rich linguistic knowledge gained from large corpora. Embedding techniques introduce task- or domain-specific information into a large pre-trained model, making it more specialized and adaptable to a particular task while retaining the generality of the underlying model.
In simple terms, embedding is like giving a fully trained student a reference book of knowledge relevant to a specific task and allowing them to consult the book as needed to solve specific problems.
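As an illustration, the sketch below shows the retrieval step that embeddings enable: documents and the user’s question are mapped to vectors, and the closest documents are pulled into the prompt. The `embed` function is a placeholder for whatever embedding model is used, and numpy cosine similarity stands in for a real vector database.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in practice this would call an embedding model
    # (e.g., a sentence-transformer or a hosted embeddings API).
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Return the documents most semantically similar to the query."""
    q = embed(query)
    scored = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return scored[:top_k]

# The retrieved snippets are then placed into the prompt (as in the
# prompt-engineering sketch above) so the LLM answers from domain data
# rather than from its general pre-training alone.
```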
- Fine-tuning
Unlike embedding, fine-tuning adapts a pre-trained language model to a specific task by adjusting its parameters and internal representations. This approach allows the model to exhibit improved performance in a particular task while maintaining its generality. The core idea of fine-tuning is to adjust model parameters to capture specific patterns and relationships relevant to the target task. However, the upper limit of model generalizability through fine-tuning is still constrained by the base model itself.
In simpler terms, fine-tuning is like providing a broadly educated college student with specialized knowledge courses that allow them to acquire specialized knowledge in addition to broad skills and solve problems in specialized domains independently.
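For illustration, here is a minimal parameter-efficient fine-tuning sketch using LoRA via the Hugging Face `transformers` and `peft` libraries. The base model name, target modules, and the omitted training loop are assumptions; a production setup would add a tokenized domain dataset and a proper trainer.

```python
# A minimal LoRA fine-tuning sketch. Model name, target modules, and the
# dataset are placeholders; only the adapter setup is shown.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                   # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are tuned
# ...train on domain-specific examples (e.g., labeled on-chain Q&A) here...
```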
- Retraining the LLM
Although current LLMs are powerful, they do not meet all the requirements. Retraining the LLM is a highly customized solution that introduces new datasets and adjusts model weights to improve adaptability to specific tasks, needs, or domains. However, this approach requires significant computational and data resources, and managing and maintaining the retrained model is also challenging.
- The Agent Model
The Agent Model is an approach to building intelligent agents that uses LLM as the core controller. This system includes several key components to provide more comprehensive intelligence.
- Planning: Breaking down large tasks into smaller ones for easier completion.
- Memory: Improving future plans by reflecting on past behavior.
- Tools: Agents can invoke external tools, such as search engines, calculators, etc., to obtain more information.
The AI agent model has robust language comprehension and generation capabilities, enabling it to address generic problems, perform task decomposition, and engage in self-reflection. This gives it broad potential in various applications. However, agent models also face limitations such as context length constraints, challenges in long-term planning and task decomposition, and unstable reliability of output content. Addressing these limitations requires continuous research and innovation to further expand the application of agent models in various domains.
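A minimal sketch of such an agent loop is shown below, wiring together the planning, memory, and tools components described above. The `llm` function and the two stub tools are placeholders rather than any specific agent framework’s API.

```python
# A minimal sketch of an LLM-driven agent loop: plan, act with tools,
# store observations in memory. Not a specific framework's API.
from typing import Callable

def llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"search results for {q!r}",         # stub tool
    "query_chain": lambda q: f"on-chain records for {q!r}",   # stub tool
}

def run_agent(goal: str, max_steps: int = 5) -> list[str]:
    memory: list[str] = []  # reflections on past steps inform later ones
    plan = llm(f"Break this goal into numbered steps: {goal}")  # planning
    for step in plan.splitlines()[:max_steps]:
        decision = llm(
            f"Goal: {goal}\nMemory: {memory}\nStep: {step}\n"
            f"Pick a tool from {list(TOOLS)} and its input, as 'tool: input'."
        )
        tool_name, _, tool_input = decision.partition(":")
        observation = TOOLS.get(tool_name.strip(), lambda x: "no such tool")(tool_input.strip())
        memory.append(f"{step} -> {observation}")
    return memory
```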
The various techniques mentioned above are not mutually exclusive. They can be used together in the process of training and refining the same model. Developers can fully exploit the potential of existing LLMs and experiment with different approaches to meet the needs of increasingly complex applications. This integrated approach not only improves model performance but also drives rapid innovation and advances in Web3.
However, we believe that while existing LLMs have played a critical role in the rapid development of Web3, it is prudent to fully explore these existing models (e.g., OpenAI’s models, Llama 2, and other open-source LLMs) first, starting with shallow RAG strategies such as prompt engineering and embedding, before considering fine-tuning and retraining the base model.
- How LLMs Streamline Different Stages of Blockchain Data Production
- Standard workflow for blockchain data processing
In today’s blockchain landscape, developers are increasingly realizing the value of data. This value spans multiple domains, such as operational monitoring, predictive modeling, recommender systems, and data-driven applications. Despite this growing awareness, the critical role of data processing — as the indispensable bridge from data collection to application — is often underestimated.
- Transforming Unstructured Raw Data into Structured Raw Data
Each transaction on the blockchain generates events or logs, typically in an unstructured format. While this step serves as the initial gateway to the data, further processing is required to extract valuable insights and form structured raw data. This involves organizing the data, handling exceptions, and transforming it into a standardized format.
- Transforming Structured Raw Data into Meaningful Business Abstract Tables
Once structured raw data is produced, additional steps are required for business abstraction. This involves mapping the data to business entities and metrics, such as transaction volume and number of users, ultimately transforming raw data into information relevant to business operations and decision-making.
- Calculating and Extracting Business Metrics from Abstracted Tables
With abstracted business data, subsequent calculations can produce important derived metrics, such as the monthly growth rate of total transactions and the user retention rate. These metrics, implemented using tools such as SQL and Python, are critical in monitoring business health, understanding user behavior, and identifying trends, supporting decision-making and strategic planning.
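To make the three stages concrete, the toy sketch below walks a single simplified ERC-20 Transfer log through decoding, business abstraction, and metric derivation with pandas. The field names, the 18-decimal assumption, and the one-row dataset are illustrative only.

```python
# A toy illustration of the three stages above, using a simplified ERC-20
# Transfer event. Real pipelines decode ABI-encoded logs and run at scale.
import pandas as pd

# Stage 1: unstructured raw log -> structured raw row (decode and normalize)
raw_log = {"address": "0xToken", "topics": ["Transfer", "0xAlice", "0xBob"],
           "data": "0x0de0b6b3a7640000", "block_time": "2024-01-02"}
structured = {
    "token": raw_log["address"],
    "sender": raw_log["topics"][1],
    "recipient": raw_log["topics"][2],
    "amount": int(raw_log["data"], 16) / 1e18,   # assume 18 decimals
    "block_time": raw_log["block_time"],
}

# Stage 2: structured rows -> business abstraction (daily transfer volume)
df = pd.DataFrame([structured])
daily_volume = df.groupby("block_time")["amount"].sum().rename("daily_volume")

# Stage 3: abstraction -> derived metric (e.g., day-over-day growth rate)
growth = daily_volume.pct_change()
print(daily_volume, growth, sep="\n")
```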
- Optimizing the Blockchain Data Generation Process with LLM
LLM addresses multiple blockchain data processing challenges, including but not limited to:
Process unstructured data:
- Gain information from transactions and events: An LLM can analyze blockchain transaction logs and events, extracting key information such as transaction amounts, counterparty addresses, and timestamps. This transforms unstructured data into meaningful business insights that are easier to analyze and understand.
- Data cleansing and anomaly detection: LLM can automatically identify and cleanse inconsistent or anomalous data, ensuring data accuracy and consistency and improving overall data quality.
Business Abstraction:
- Mapping raw on-chain data to business entities: LLM seamlessly maps raw blockchain data to business entities, such as linking blockchain addresses to actual users or assets. This intuitive process streamlines business operations for greater efficiency.
- Tagging the On-Chain Unstructured Content: An LLM digs into unstructured data, such as sentiment analysis results from Twitter, and categorizes it as positive, negative, or neutral. This helps users gain a better understanding of social media sentiment trends.
Interpret Data by Natural Language:
- Calculate key metrics: Using business abstractions, LLMs can help calculate key business metrics, including user transaction volume, asset value, and market share. This gives users a deeper understanding of their business performance.
- Query data: Through AIGC, an LLM interprets user intent and generates SQL queries, allowing users to express queries in natural language without the need for complex SQL statements. This greatly improves the accessibility of database queries (see the sketch after this list).
- Metric selection, sorting, and correlation analysis: LLM helps users select, sort, and analyze multiple metrics, providing a comprehensive view of their relationships and correlations. This support facilitates in-depth data analysis and informed decision-making.
- Generate natural language descriptions for business abstractions: LLM generates natural language summaries or explanations based on factual data. This helps users understand business abstractions and data metrics, promoting interpretability and rational decision-making.
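As a sketch of the natural-language querying described above, the snippet below turns a user question into SQL against an illustrative `dex_trades` schema. The schema, the `llm` placeholder, and the example output are assumptions; generated SQL should be validated before it is executed against a real database.

```python
# A minimal text-to-SQL sketch. The table schema is illustrative and llm()
# is a provider placeholder; generated SQL still needs validation.
SCHEMA = """Table dex_trades(block_time TIMESTAMP, project TEXT,
                             token_pair TEXT, amount_usd NUMERIC)"""

def llm(prompt: str) -> str:
    raise NotImplementedError("call your LLM provider here")

def question_to_sql(question: str) -> str:
    prompt = (f"Given this schema:\n{SCHEMA}\n"
              f"Write a single SQL query answering: {question}\n"
              "Return only the SQL.")
    return llm(prompt)

# Example: question_to_sql("What was Uniswap's daily USD volume last week?")
# might yield something like:
#   SELECT date_trunc('day', block_time) AS day, SUM(amount_usd)
#   FROM dex_trades
#   WHERE project = 'uniswap' AND block_time >= now() - interval '7 days'
#   GROUP BY 1 ORDER BY 1;
```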
- Current applications
Leveraging their technological and product-experience advantages, LLMs find versatile applications in various on-chain data scenarios, which can be broadly classified into four categories based on technical complexity:
- Data Transformation: This includes operations such as data augmentation and reconstruction, as well as tasks such as text summarization, classification, and information extraction. While these applications are rapidly evolving, they are better suited for generic scenarios and may not be optimal for simple, large-scale batch data processing.
- Natural Language Interface: Connecting LLMs to knowledge bases or tools facilitates the automation of conversations and basic tool usage. This capability can be used to build professional chatbots, with its actual value depending on factors such as the quality of the connected knowledge base.
- Workflow Automation: Using an LLM to standardize and automate business processes, this capability is proving beneficial for complicated blockchain data handling processes. Examples include deconstructing smart contract execution processes and identifying potential risks.
- Assistant Bots and Assistive Systems: Building on natural language interfaces, assistive systems integrate additional data sources and functionalities. This augmentation significantly increases user productivity.
- Limitations of LLMs
- Web3 Data Industry Landscape: Established Applications, Ongoing Challenges, and Unresolved Issues
Despite significant progress in the area of Web3 data, several challenges remain.
Established applications:
- Using LLMs for information processing: AI technologies, including LLMs, have proven effective in generating text summaries, abstracts, and explanations. They help users distill key information from lengthy articles and technical reports, improving readability and comprehension.
- Using AI to address development challenges: LLMs have found an application in solving development-related problems, such as serving as a replacement for platforms like StackOverflow or traditional search engines. They provide developers with answers to questions and programming assistance.
Ongoing research and challenges:
- Code Generation with LLM: The industry is actively exploring the application of LLMs in translating natural language into SQL query language to improve the automation and comprehensibility of database queries. However, this process faces several challenges. In specific contexts, the generated code must be highly accurate, requiring flawless syntax to ensure error-free execution and accurate results. Other challenges include consistently producing correct responses and demonstrating a deep understanding of the business context.
- Complexity of data labeling: Data labeling is critical for training machine learning and deep learning models. However, data labeling complexities are particularly high in the Web3 data field, especially when dealing with anonymized blockchain data.
- Accuracy and hallucination issues: The occurrence of hallucinations in AI models can be influenced by various factors, such as biased or insufficient training data, overfitting, limited contextual understanding, lack of domain knowledge, adversarial attacks, and model architecture. Researchers and developers must continually refine model training and calibration methods to increase the credibility and accuracy of generated text.
- Leverage data for business analysis and content creation: Effective use of data for business analysis and content creation remains an ongoing challenge. The nuances of the problem, the need for well-crafted prompts, and considerations such as data quality, data volume, and strategies to mitigate hallucination issues all require attention.
- Automated indexing of smart contract data for domain-based data abstraction: The challenge of automatically indexing smart contract data from different business domains for data abstraction remains unresolved. This requires a holistic approach that considers the unique characteristics of different business domains, as well as the diversity and complexity of the data.
- Handling of temporal data, tabular document data, and other complex modalities: Advanced multimodal models such as DALL·E 2 excel at generating images and other common modalities from text. In the blockchain and financial industry, handling temporal data requires a nuanced approach beyond simple text vectorization. Integrating temporal data with text, performing joint cross-modal training, and similar approaches are critical research directions for achieving intelligent data analytics and applications.
- Why LLMs Are Not Enough to Perfectly Address Blockchain Data Industry Challenges
As a language model, LLM is best suited for scenarios that require high fluency, but achieving precision may require further model adjustments. The following framework provides valuable insights when applying LLM to the blockchain data industry.
When evaluating the applicability of LLMs in various scenarios, the focus on fluency and accuracy becomes paramount. Fluency measures the naturalness and coherence of the model’s output, while accuracy determines the precision of its responses. These dimensions have different requirements in different application contexts.
For tasks that emphasize fluency, such as natural language generation and creative writing, LLM typically excels due to its robust natural language processing performance, which enables the generation of fluent text.
Blockchain data presents multifaceted challenges involving data parsing, processing, and application. LLM’s exceptional language understanding and reasoning capabilities make it an ideal tool for interacting with, organizing, and summarizing blockchain data. However, LLM does not comprehensively solve all blockchain data problems.
Regarding data processing, LLMs are better suited for fast, iterative, and exploratory processing of on-chain data, where continuous experimentation with new processing methods is essential. However, LLMs are of limited use for tasks such as exact matching in a production environment. Also, unstable answers to prompts affect downstream tasks, leading to unstable success rates and inefficiency when performing high-volume tasks.
The processing of content by an LLM may cause hallucination problems. The estimated probability of hallucination in ChatGPT is approximately 15% to 20%, and the opaque nature of its processing makes many errors challenging to detect. Therefore, establishing a robust framework coupled with expert knowledge is crucial. In addition, combining LLM with on-chain data presents numerous challenges:
- Similar to other industries, the variety and diversity of on-chain data entities require further research and exploration of the most effective ways to feed information into LLM.
- On-chain data includes both structured and unstructured data. Most industry data solutions currently rely on an understanding of business data. Analyzing on-chain data involves using ETL to filter, cleanse, augment, and reconstruct business logic to transform unstructured data into structured data. This process increases the efficiency of subsequent analysis for various business scenarios. For example, structured DEX trades, NFT marketplace transactions, wallet address portfolios, etc., have characteristics such as high quality, high value, accuracy, and authenticity, complementing LLM in general applications.
- Myth about LLM
- Can LLMs handle unstructured data directly, making structured data obsolete?
LLMs are typically pre-trained on large amounts of textual data, making them naturally adept at processing diverse unstructured textual information. However, various industries already possess substantial amounts of structured data, particularly in the Web3 field where data has been parsed. Effectively leveraging this data to improve LLM has become a hot industry research topic.
For LLMs, structured data continues to offer several advantages:
- Massive data: Large amounts of data, especially private data, are stored in databases and various standard formats behind applications. Each company and industry also has a wealth of proprietary data that LLM has yet to leverage for pre-training.
- Already accessible: This data does not need to be reproduced, with minimal input costs; the only challenge is to unlock its potential.
- High quality and value: The long-term accumulation of domain expertise is embedded in structured data widely used in industry, academia, and research. Structured data quality is paramount to its usability and includes completeness, consistency, accuracy, uniqueness, and factuality.
- Efficiency: Structured data, stored in spreadsheets, databases, or other standardized formats, has predefined schemas that are consistent across the entire dataset. This provides predictability and control over data formats, types, and relationships, making data analysis and querying easier and more reliable. In addition, mature ETL tools and various industry-standard data processing and management tools enhance efficiency and ease of use. LLMs can also leverage this data through APIs.
- Accuracy and factual grounding: LLM output relies on token probabilities and does not yet consistently provide accurate answers, which presents a fundamental challenge for LLMs to address: hallucination. This is especially critical in industries such as healthcare and finance, where security and reliability are concerns. Structured data is emerging as a direction to support and address these challenges in LLMs.
- Embrace relationship graphs and business logic: Different structured data types entered in specific organizational formats (e.g., relational databases, graph databases) address different domain challenges. Using standardized query languages such as SQL increases the efficiency and accuracy of complex data queries and analysis. Knowledge Graphs provide a clearer representation of entity relationships, facilitating correlation queries.
- Cost-effectiveness: Leveraging structured data eliminates the need to retrain the entire base model from scratch; it works seamlessly with agents and LLM APIs to provide faster and more cost-effective access to LLMs.
Some overly optimistic views in the current market suggest that LLMs have exceptional capabilities in handling both textual and unstructured information. According to this perspective, achieving the desired result is as simple as importing raw, unstructured data into an LLM.
This notion is similar to expecting a general-purpose LLM to solve mathematical problems: without a specialized model of mathematical skills, most LLMs might make mistakes when tackling basic elementary school addition and subtraction problems. On the contrary, building a vertical, crypto-specific LLM, similar to specialized models for mathematical abilities and image generation, proves to be a more practical approach to applying LLMs in the crypto world.
- Since an LLM can infer content from textual sources such as news and tweets, is on-chain data analysis no longer required to get insights?
While an LLM can extract information from textual sources such as news and social media, insights derived directly from on-chain data remain critical for the following reasons:
- On-chain data represents raw, first-hand information, whereas news and social media information can be misleading. Analyzing on-chain data directly reduces information bias. Although using LLM for text analysis carries the risk of comprehension bias, analyzing on-chain data directly minimizes misinterpretation.
- On-chain data includes a comprehensive history of interactions and transactions, allowing for the identification of long-term trends and patterns. In addition, on-chain data provides a holistic view of the entire ecosystem, including the flow of funds and relationships between parties. These macro insights contribute to a deeper understanding of the situation, whereas news and social media information tends to be more fragmented and short-term.
- A key advantage of on-chain data is its openness. Anyone can verify the results of the analysis, preventing information asymmetry. In contrast, news and social media may not always be completely truthful. Textual information and on-chain data can be verified for each other, providing a more comprehensive and accurate assessment when combined.
The analysis of on-chain data remains essential, and LLM is a complementary tool for extracting information from text. However, it cannot replace the direct analysis of on-chain data. Optimal results are achieved by leveraging the strengths of both approaches.
- Is it easy to build blockchain data solutions on top of LLM using tools like LangChain, LlamaIndex, or other AI tools?
Tools such as LangChain and LlamaIndex provide a convenient way to build custom and simple LLM applications, enabling rapid development. However, successfully deploying these tools in the real world presents additional challenges. Building an LLM application with sustained high quality and efficiency is a complex task that requires a deep understanding of both blockchain technology and how AI tools work, as well as the effective integration of the two. This is proving to be a significant yet challenging undertaking for the blockchain data industry.
Throughout this process, recognizing the unique characteristics of blockchain data is critical. It requires a high level of accuracy and verifiability through repeatable checks. Once data is processed and analyzed through LLM, users have high expectations for its accuracy and reliability. However, there is a potential conflict between these expectations and the fuzzy fault tolerance of LLM. Therefore, when constructing blockchain data solutions, it is necessary to carefully balance these two aspects in order to meet users’ expectations.
In the current market, despite the availability of some basic tools, the field continues to evolve rapidly. The landscape is constantly changing, much as the Web2 world evolved from the early days of PHP to more mature and scalable solutions such as Java, Ruby, Python, JavaScript, Node.js, and emerging technologies such as Go and Rust. Similarly, AI tools are changing dynamically, with emerging GPT frameworks such as AutoGPT, Microsoft AutoGen, and OpenAI’s recently announced GPT-4 Turbo representing only a fraction of the possibilities. This underscores the fact that there is ample room for growth in both the blockchain data industry and AI technology, which requires continuous efforts and innovation.
When applying LLM, there are two pitfalls that require special attention:
- Unrealistic Expectations: Many individuals believe that LLMs can solve all problems, but in reality, LLMs have clear limitations. They demand substantial computational resources, involve high training costs, and their training process may be unstable. It is crucial to have realistic expectations of LLMs’ capabilities, recognizing their excellence in certain scenarios, such as natural language processing and text generation, while acknowledging their limitations in other domains.
- Neglecting Business Needs: Another pitfall is the forceful application of LLM technology without adequately considering business requirements. Prior to LLM implementation, it is essential to identify specific business needs clearly. An evaluation of whether LLM is the optimal technology choice is necessary, along with a thorough assessment and control of associated risks. Emphasizing the effective application of LLM calls for thoughtful consideration on a case-by-case basis to avoid misuse.
While LLM holds immense potential in various domains, developers and researchers must exercise caution and maintain an open-minded exploration approach when applying an LLM. This approach ensures the discovery of more suitable application scenarios and maximizes the advantages of LLMs.
This article is jointly published by Footprint Analytics, Future3 Campus, and HashKey Capital.