Understanding Data Storage on Blockchain

April 17, 2022 8 min read

After experimenting with blockchain for about a month and creating a sample notes project, it was time for me to delve deeper into building a dApp for the Web3 space.

Based on my initial understanding of how things function in the Web3 space, I had deduced that all the data generated and needed would be stored on the blockchain itself. However, once I came across the term “decentralized databases”, my understanding changed completely.

Why do we need decentralized databases?

Storing data on a blockchain is known to be expensive, and few individuals would find it feasible to pay large sums to host their data there. This is especially true when Web2 serves the same purpose at a considerably lower price.

If all our data were stored on the blockchain, users would have to start paying for basic functionality as well. For example, you would have to pay a fee each time you liked or commented on someone’s post. Hardly anyone would want to pay transaction fees for something they can already do for free. Platforms such as Facebook have a massive user base in part because they are free to use.

Storing data on blockchain can increase costs and the data management tech on blockchain is still under development

But as we proceed on this journey of building dApps, it becomes evident that storing all the data on the blockchain is not efficient. Furthermore, the technology required to fetch and search through data stored on a blockchain is still under development. My understanding, therefore, is that blockchains will store only the most important 1-2% of the data in Web3, and the rest will have to be stored elsewhere. A lot of dApps currently store this remaining data in centralized databases. Developers thus face a trade-off between paying very high prices for on-chain storage and accepting the weaker data security of a centralized database.

Even though NEAR Protocol has greatly reduced gas fees, it is unlikely that people will start paying for services that are available for free in Web2. Organizations that choose to cover this cost on behalf of their users would eventually have to find other ways to monetize. What are the other ways to make money from an application that is free to use? Ads and selling sensitive data. Does this ring a bell? Boom! We are back to Web 2.0. Thus, we have to look for some other way, and this other road leads us to decentralized databases.

What are decentralized databases?

A decentralized database or ledger stores information across a network of distributed computers, as opposed to a single centralized server. In a decentralized database, each file is replicated across several storage nodes worldwide, which can lower storage costs and keeps the data available even if some of the nodes are down.
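To make the availability claim concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that every node goes down independently with the same probability; real networks are messier than that, and the function name and the 10% figure are my own inventions.

```python
def availability(replicas: int, p_node_down: float) -> float:
    """Probability that at least one replica is reachable.

    Assumption (hypothetical model): each node is down independently
    with probability `p_node_down`.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    if not 0.0 <= p_node_down <= 1.0:
        raise ValueError("p_node_down must be between 0 and 1")
    # The file is unavailable only if every single replica is down.
    return 1.0 - p_node_down ** replicas


if __name__ == "__main__":
    for replicas in (1, 3, 5):
        print(replicas, f"{availability(replicas, 0.10):.5f}")
```

With the assumed 10% failure rate, one replica gives an availability of 0.9, three replicas give 0.999, and five give 0.99999.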

Decentralized databases distribute data storage across all the computers on the network

Some of the notable features of decentralized databases include:

  • Unmatched Privacy
  • Amazing Reliability
  • High Scalability
  • High Data Immutability
  • Better Performance Speed – Since the files and data are stored across several nodes worldwide, you gain access to the data within seconds. Besides, the underlying technology is designed to adapt and adjust the location and number of nodes to deliver faster speeds.

The NEAR Protocol documentation suggests several decentralized storage systems as viable alternative storage solutions. One such solution is IPFS.

IPFS: The InterPlanetary File System (IPFS) is a peer-to-peer network protocol for storing and sharing data in a distributed file system, with addresses based on content, not location.

Identification and retrieval of Data

One of the most important differences between the centralized web and the decentralized web is the way we identify and retrieve data on each. Let's use a simple example to illustrate:

Two of your friends, Naina and Sonia, recommend the same book, but they describe it to you in very different ways:

Sonia

"Go to the Bahrisons bookstore at 123 Khan Market in New Delhi, take the stairs to the 2nd floor, find the 3rd bookcase on the right in the Fiction section, and get the book that's 16 inches from the left on the top shelf."

Naina

"Check out Most Adventures ever Kittens Ever by Ruskin Bond. Its ISBN-13 number is 9781626972168."

If your goal is to get a copy of the book, which of these descriptors do you find most helpful? Which gives you the most options for how to acquire the book? In each case, once you've followed the instructions, how confident will you be that you've found the book your friend intended? Let us delve one step deeper into locating data.

Location addressing and content addressing

One of your friends identified the book by its location, and the other by its content. Location addressing points us to the place where data is stored by a specific entity. Sonia pointed us to a specific bookshelf controlled by Bahrisons, where she knows they've previously kept this book, and she assumes that they continue to offer it there. This is how we identify data on the centralized web.

Content addressing identifies data by what it is, while location addressing identifies data by where it is stored

In contrast, content addressing provides a unique, content-derived identifier for the data, which we can use to retrieve the data from a variety of sources. We could use the ISBN provided by Naina to verify that we'd found the right book at our local library, our neighbor's house, or the book fair. This is how we identify data on the decentralized web.

The URLs we have been using all our lives are a very good example of location-based addressing. It's a workable approach, but it has some drawbacks.

It is very easy for 50,000 people to store exactly the same photo of a beautiful bridge, all on different domains and with different filenames, leading to a lot of redundancy. Even on our own laptops, most of us have accidentally saved the same document as download.pdf and download(01).pdf without realizing it, or saved iterations of the same term paper over and over again with v1 or 2018-12-18 added to the title. The present-day web is a confusing mess of data that's saved multiple times at different URLs, and there's no easy way to tell which items are identical to each other.
There must be a better way!
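
As a taste of that better way, here is a small Python sketch of spotting identical files by hashing their contents instead of trusting their names. The file names and contents are made up for illustration, and the files are kept in memory so the example stays self-contained.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        # Identical bytes always produce the same SHA-256 digest,
        # no matter what the file is called or where it lives.
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical files, standing in for what might sit in a downloads folder.
files = {
    "download.pdf": b"term paper",
    "download(01).pdf": b"term paper",
    "paper_v1.pdf": b"term paper, first draft",
}

duplicates = find_duplicates(files)
```

Here `duplicates` holds a single group containing download(01).pdf and download.pdf, while paper_v1.pdf is left out because its bytes differ.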

Content based addressing

On the decentralized web, we can all host each other's data, with a different type of linking that's more secure, making it easy to trust our neighbors. This is generally achieved through cryptographic hashes.
Cryptographic hashes can be derived from the content of the data itself, meaning that anyone using the same algorithm on the same data will arrive at the same hash. If Naina and Sonia are both using the same decentralized web protocol, such as IPFS, to share the exact same photo of a dog, both images will have exactly the same hash. By comparing those hashes and confirming that they're the same, we can guarantee that every single pixel of those two photos is identical.

Cryptographic hashes are unique for each piece of data, making it possible to verify that the data remains unaltered and original

Cryptographic hashes are, for all practical purposes, unique. If Sonia uses Photoshop to remove a single whisker from that dog, the updated image will have a new hash. Simply by looking at that hash, even without access to the file itself, it will be easy to tell that the file now contains different data.
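
The dog photo example can be sketched in a few lines of Python. This is only an illustration: the byte strings stand in for real image files, and a plain SHA-256 hex digest stands in for an IPFS CID, which in reality is encoded differently.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Hypothetical helper: a SHA-256 hex digest used as a stand-in for a CID."""
    return hashlib.sha256(data).hexdigest()


# Made-up bytes standing in for the photo of the dog.
original = b"photo-of-a-dog: whiskers=12"
naina_copy = bytes(original)                   # the exact same photo
sonia_edit = b"photo-of-a-dog: whiskers=11"    # one whisker removed

# Same bytes, same hash, regardless of who computes it.
assert content_id(original) == content_id(naina_copy)

# A tiny change in the content yields a different hash.
assert content_id(original) != content_id(sonia_edit)
```

Neither comparison needs access to the other person's file, only to its hash.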

On the centralized web, we've learned to trust certain authorities and not others. We do our best with the clues we have from URLs, but there are some malicious actors who use the shortcomings of location addressing to trick us. For example, netflix.com is far more trustworthy than movies123.xyz.

On the decentralized web, though, we all pitch in and host each other's data, and content addressing enables us to trust the information that's shared. We may not know much about the peers who are hosting data, but hashes can prevent malicious actors from deceiving us about the content of files. That's what makes cryptographic hashing so important to the decentralized web. Since we use hashes to request data on the decentralized web, we can think of a hash as a link, not just a name.
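
Here is a rough Python sketch of why a hash works as a trustworthy link. The peers are plain functions that I made up for illustration (this is not how IPFS peers are actually contacted), and a SHA-256 hex digest again stands in for a real CID.

```python
import hashlib
from typing import Callable, Iterable, Optional

# Hypothetical peer interface: given a content ID, return bytes or None.
Peer = Callable[[str], Optional[bytes]]


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the SHA-256 hex digest of the content."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(cid: str, peers: Iterable[Peer]) -> bytes:
    """Return the first response whose hash matches the requested ID."""
    for peer in peers:
        data = peer(cid)
        # Accept the data only if it hashes to the ID we asked for.
        if data is not None and content_id(data) == cid:
            return data
    raise LookupError(f"no peer returned content matching {cid}")


book = b"the book Naina recommended"
wanted = content_id(book)


def offline_peer(cid: str) -> Optional[bytes]:
    return None  # does not have the content


def malicious_peer(cid: str) -> Optional[bytes]:
    return b"a different book entirely"  # tries to pass off other data


def honest_peer(cid: str) -> Optional[bytes]:
    return book if cid == wanted else None


result = fetch_verified(wanted, [offline_peer, malicious_peer, honest_peer])
```

The malicious peer's answer is rejected because its hash does not match, so `result` ends up being the honest peer's copy, even though we know nothing else about any of the peers.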

How do you look up data on IPFS?

  • When other nodes look up your file, they ask their peer nodes who's storing the content referenced by the file's CID. When they view or download your file, they cache a copy — and become another provider of your content until their cache is cleared.
  • A node can pin content in order to keep (and provide) it forever, or discard content it hasn't used in a while to save space. This means each node in the network stores only content it is interested in, plus some indexing information that helps figure out which node is storing what.
  • If you add a new version of your file to IPFS, its cryptographic hash is different, and so it gets a new CID. This means files stored on IPFS are resistant to tampering and censorship — any changes to a file don't overwrite the original, and common chunks across files can be reused in order to minimize storage costs.
  • However, this doesn't mean you need to remember a long string of CIDs — IPFS can find the latest version of your file using the IPNS decentralized naming system, and DNSLink can be used to map CIDs to human-readable DNS names.
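
The caching, pinning, and versioning behaviour described in the points above can be modelled with a toy in-memory node. To be clear, `ToyNode` is a hypothetical class of my own and not the IPFS API, and it uses a SHA-256 hex digest as a stand-in for a real CID.

```python
import hashlib
from typing import Optional


class ToyNode:
    """Hypothetical in-memory model of one node's storage; not the IPFS API."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}
        self.pinned: set[str] = set()

    def add(self, data: bytes, pin: bool = False) -> str:
        cid = hashlib.sha256(data).hexdigest()  # stand-in for a real CID
        self.blocks[cid] = data
        if pin:
            self.pinned.add(cid)
        return cid

    def get(self, cid: str) -> Optional[bytes]:
        return self.blocks.get(cid)

    def garbage_collect(self) -> int:
        """Drop every unpinned block and return how many were removed."""
        unpinned = [cid for cid in self.blocks if cid not in self.pinned]
        for cid in unpinned:
            del self.blocks[cid]
        return len(unpinned)


node = ToyNode()
v1 = node.add(b"my notes, version 1", pin=True)
v2 = node.add(b"my notes, version 2")       # new content gets a new identifier
cached = node.add(b"someone else's file")   # cached after viewing, not pinned

removed = node.garbage_collect()
```

After garbage collection, only the pinned first version is still on the node; the unpinned second version and the cached file are gone from this node, and they survive on the network only if some other node still holds them.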

Limitations

IPFS requires large-scale adoption to achieve its full potential, because the time it takes to fetch data depends directly on the number of clusters that exist. Working with it also means getting to grips with IPFS clusters, pinning (which saves content from garbage collection), CIDs, and DAGs (directed acyclic graphs), and for pinning we would have to use a service such as Pinata.

It also lacks one very important feature: permanence. Content on the IPFS network can disappear. If no one hosts the data, it could be lost forever. To overcome this, we would need the Arweave blockchain on top of IPFS. I also briefly researched Ceramic. Most of these technologies lack in-depth documentation on how to carry out even basic tasks, so understanding them takes a lot of time. Here is the gist of what I could understand.

One limitation of IPFS is the lack of permanence: data must be continually hosted on the network to remain available

Ceramic enables one to store data attached to a global user identity on IPFS, using custom data model schemas. It relies on an additional layer: you connect to a Ceramic node (which anyone can run), which communicates via libp2p (which came from IPFS) and is responsible for execution. The easiest way to see Ceramic in action is to create a profile on https://self.id/ by connecting your crypto wallet, and then go to https://dns.xyz to see your profile information already populated there. I could not find any repositories demonstrating basic CRUD functionality with data models, and the Discord was not of much help. Furthermore, since the NEAR Protocol documentation did not mention it, I did not go ahead with it.

Concluding Notes

The decentralized future continues to amaze me. This article has highlighted the importance of, and need for, decentralized databases. Several questions still need to be answered, such as “Why and how would they work?” The answer for now is that there are only a limited number of players in this domain, and each of them has limitations that are still being worked on. As people begin to realize the true potential of these databases, and of data ownership in general, I am sure that large-scale adoption of this technology is inevitable. It is also geared to attract developers and eventually users, which would then help solve the performance issues that these projects currently face. The future of decentralized databases looks promising, and over time it could completely overhaul present-day data storage and management infrastructure.
