Personally Identifiable Information And Distributed Ledgers

In the context of blockchain and distributed ledger technologies, the following aspects of securing personally identifiable information (PII) are of great importance:

- Protecting PII of an information system’s users;
- Protecting the data that can be considered personal and resides within the ledger;
- Protecting the data that can be deemed personal and resides outside the ledger, but is referenced in ledger entries;
- Protecting the information about transactions being performed with ledger entries.

Most modern approaches to handling personally identifiable information, as laid out in international legislation, revolve around an operator that is responsible for processing this data in compliance with legal provisions. This perspective stems directly from the centralized architecture of existing information systems.

The principles conventionally underlying blockchain (distributed ledger) technology echo Timothy May’s well-known Crypto Anarchist Manifesto. The core idea is the possibility of creating a self-governed society that does not depend on supervisory authorities. In theory, such a society would rely on cryptographic mechanisms and anonymization, that is, the separation of an individual’s digital avatar from their real-world identity. This idea was at the core of Bitcoin, a cryptocurrency built on a publicly available ledger, the so-called blockchain, which stores all data about the system’s state (completed transactions) and allows every user to verify the correctness of these transactions. On the one hand, this sort of architecture does not presuppose the personal data operator mentioned above. On the other hand, it is not even meant to process PII, because it lacks the user identification mechanisms that traditional systems rely on.

Research conducted over the last few years has shown that the architecture of Bitcoin, as well as of other cryptocurrencies, including those originally claiming to be fully anonymous, does not fully protect user data. Moreover, given the openness of the transaction ledger, this poses serious risks both to PII security and to privacy in general.

Meanwhile, businesses planning to adopt blockchain technology are interested in processing personal data and identifying users in order to comply with national and international legislation. This intention in itself contradicts the original concept of the technology to a certain extent. These tensions push the crypto community to rethink its approaches to data processing, compared not only with classic systems but also with first-generation blockchain-based systems such as Bitcoin and Ethereum.

Let’s look at one of the most frequently mentioned blockchain applications: a system for remote user identification that relies on an operator that performed the primary identification. If the data is stored in the shared ledger itself, it must be encrypted. This raises serious concerns about storing and managing keys, for instance the limited period during which digital signature keys are valid. Another challenge is controlling how users access the data. It is worth noting that, because the ledger may be publicly accessible, key compromise is an extremely serious threat to the security of personally identifiable information.

Another approach is for PII operators to store the personal data themselves and add to the blockchain only an identifier that makes it possible to check that such information exists. When verifying a customer’s personally identifiable information, an operator derives the corresponding identifier from it and then searches the ledger for the identifier added by the operator that performed the customer’s primary identification. Traditionally, this kind of solution adds the hash value of the stored user data to the ledger; together with the properties of cryptographic hash functions, this allows the data to be matched with probability approaching one.

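The hash-based identifier scheme described above can be sketched roughly as follows. This is a minimal illustration and not a description of any particular product: the field names, the normalization rules, the choice of SHA-256, and the use of an in-memory set in place of a real ledger are all assumptions made for the example.

```python
import hashlib
import unicodedata


def canonicalize(record: dict) -> bytes:
    """Serialize a PII record in a fixed field order and a normalized form.

    The field names and normalization rules are illustrative assumptions;
    operators would have to agree on one rigid format for hashes to match.
    """
    fields = ("first_name", "last_name", "birth_date", "national_id")
    parts = []
    for name in fields:
        value = unicodedata.normalize("NFKC", str(record[name]))
        # Collapse whitespace and ignore letter case.
        parts.append(" ".join(value.split()).upper())
    return "|".join(parts).encode("utf-8")


def derive_identifier(record: dict) -> str:
    """Return the SHA-256 hex digest of the canonical record."""
    return hashlib.sha256(canonicalize(record)).hexdigest()


# Hypothetical stand-in for the ledger: a set of published identifiers.
ledger = set()

# Operator A performs primary identification and publishes the identifier.
enrolled = {"first_name": "Alice", "last_name": "Example",
            "birth_date": "1990-04-01", "national_id": "123-45-6789"}
ledger.add(derive_identifier(enrolled))

# Operator B later receives the same data, formatted slightly differently,
# derives the identifier again, and looks it up in the ledger.
presented = {"first_name": " alice ", "last_name": "EXAMPLE",
             "birth_date": "1990-04-01", "national_id": "123-45-6789"}
is_known = derive_identifier(presented) in ledger
```

The lookup only succeeds because both operators canonicalize the record in exactly the same way, which is the rigid structure the scheme depends on.
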
However, applying such a simple mechanism to sensitive information is unacceptable. Let’s examine why. By its nature, the personal data used in a given information system is limited to a small set of fields (first name, last name, address, SSN, and so on), chosen according to the objectives of the system, and each field takes values from a limited range. Importantly, the format of these entries has to be rigidly structured for accurate comparison to be feasible. This allows a threat actor to recover someone’s personally identifiable information with a relatively small number of guesses, checking each guess against the hash values stored in the ledger. Candidate values can be harvested from publicly available databases, social networks, and the like. The smaller the upper bound on the ledger’s size (as could be the case with a municipal database), the more effective the attack, because the set of admissible values is smaller. This is why, under the European Union’s GDPR (General Data Protection Regulation), hashing is not regarded as a reliable method of anonymizing personally identifiable information.

It is telling that the depersonalization requirements in this example collide with one of the defining properties of the blockchain: the presence of unambiguous links between different objects (blocks, transactions, and data).

The situation can be even more complex in other domains of blockchain application. For example, blockchain-based electronic voting systems, which a number of experts endorse, need in-depth research to assess whether their data anonymization mechanisms are adequate. Such analysis is imperative to make sure these techniques comply with vote secrecy requirements as well as other local and international regulations.

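The guessing attack on hashed identifiers discussed earlier in this section can be illustrated with a short sketch. Everything here is assumed for the sake of the example: the record format (`FIRST|LAST|YYYY-MM-DD`), the unsalted SHA-256 hash, the tiny name lists, and the set standing in for the ledger.

```python
import hashlib
from itertools import product


def identifier(first: str, last: str, birth_date: str) -> str:
    """Unsalted SHA-256 over a rigid 'FIRST|LAST|YYYY-MM-DD' format (assumed)."""
    return hashlib.sha256(f"{first}|{last}|{birth_date}".encode("utf-8")).hexdigest()


def enumerate_candidates(first_names, last_names, years):
    """Yield every (first, last, date) combination in the guessed value space."""
    for first, last, year in product(first_names, last_names, years):
        for month in range(1, 13):
            # Days 1..31 over-approximate month lengths; extra guesses are harmless.
            for day in range(1, 32):
                yield first, last, f"{year:04d}-{month:02d}-{day:02d}"


def recover(ledger_hashes, first_names, last_names, years):
    """Return the candidate records whose hash appears in the ledger."""
    hits = []
    for candidate in enumerate_candidates(first_names, last_names, years):
        if identifier(*candidate) in ledger_hashes:
            hits.append(candidate)
    return hits


# Hypothetical ledger holding one hashed record.
ledger_hashes = {identifier("ALICE", "EXAMPLE", "1990-04-01")}

# Name lists like these could come from public directories or social networks.
hits = recover(ledger_hashes,
               first_names=["ALICE", "BOB", "CAROL"],
               last_names=["EXAMPLE", "SAMPLE"],
               years=range(1985, 1996))
```

With these toy inputs the search space is only a few tens of thousands of candidates, so the original record is recovered almost instantly. Real value spaces are larger, but they stay bounded for the same reason the text gives: the fields are few and rigidly formatted.
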
All of the above calls for complex architectures in blockchain systems that process personally identifiable data. It also presupposes the mandatory use of specific cryptographic mechanisms, such as bit-oriented protocols, zero-knowledge proof protocols, control procedures, etc. In this context, it is worth mentioning that the architectures of several recently developed blockchain systems that process, for instance, user identity data (such as Sovrin and uPort) mostly involve intermediary operators in their operation chain. These operators work with personally identifiable information or its identifiers, which can potentially nullify the declared benefits of decentralization.

Another nontrivial matter is deleting data from the blockchain at the request of a court, a requirement associated with “the right to be forgotten.” Even though a number of variants of the so-called editable blockchain have been proposed, in most systems such a requirement implies rewriting the entire ledger from the entry indicated by the court onward. Although an operation of that sort should not cause particular complications from a technical perspective, it would require close cooperation among system users to update all local copies of the ledger. This kind of cooperation is problematic for open blockchains whose users fall under different jurisdictions.

In summary, the issue of ledger openness deserves to be pointed out once again. The research to date, as well as a number of existing commercial products, demonstrates that users can be de-anonymized even in anonymous or pseudonymous blockchains. Given the numerous incidents in which cryptocurrency users were robbed, sometimes at gunpoint, this poses a serious threat to people’s privacy and also raises questions about how the security of existing and in-development systems is assessed.
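
As a supplementary illustration of the earlier point about court-ordered deletion, the sketch below shows why removing one entry forces every later entry to be recomputed in a hash-linked ledger. It is a toy model under stated assumptions: blocks are plain dictionaries, each block hash is SHA-256 over the previous hash and the payload, and there is no consensus, signing, or networking.

```python
import hashlib

GENESIS = "0" * 64  # placeholder "previous hash" for the first block (assumed)


def block_hash(prev_hash: str, payload: str) -> str:
    """Hash a block's payload together with the hash of its predecessor."""
    return hashlib.sha256(f"{prev_hash}|{payload}".encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Build a list of hash-linked blocks from the given payloads."""
    chain, prev = [], GENESIS
    for payload in payloads:
        current = block_hash(prev, payload)
        chain.append({"prev": prev, "payload": payload, "hash": current})
        prev = current
    return chain


def first_invalid(chain):
    """Return the index of the first block that fails verification, or None."""
    prev = GENESIS
    for index, block in enumerate(chain):
        expected = block_hash(block["prev"], block["payload"])
        if block["prev"] != prev or expected != block["hash"]:
            return index
        prev = block["hash"]
    return None


def redact(chain, index, replacement="[REDACTED]"):
    """Replace one payload and rebuild the chain so that it verifies again."""
    payloads = [block["payload"] for block in chain]
    payloads[index] = replacement
    return build_chain(payloads)


original = build_chain(["tx-0", "tx-1 (contains PII)", "tx-2", "tx-3"])

# Editing the payload in place breaks verification at the edited block.
tampered = [dict(block) for block in original]
tampered[1]["payload"] = "[REDACTED]"
broken_at = first_invalid(tampered)

# A valid redaction changes the hash of the edited block and of every later one.
rewritten = redact(original, 1)
changed = [i for i, (old, new) in enumerate(zip(original, rewritten))
           if old["hash"] != new["hash"]]
```

In this model the recomputation itself is cheap; the difficult part, as the text notes, is getting every holder of a local copy to accept the rewritten chain.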