Overcoming Data Availability Attacks in Blockchain Systems: LDPC Code Design for Coded Merkle Tree

In blockchain systems, full nodes store the entire blockchain ledger and validate all transactions in the system by operating on the entire ledger. However, for better scalability and decentralization of the system, blockchains also run light nodes that only store a small portion of the ledger. In blockchain systems having a majority of malicious full nodes, light nodes are vulnerable to a data availability (DA) attack. In this attack, a malicious node makes the light nodes accept an invalid block by hiding the invalid portion of the block from the nodes in the system. Recently, a technique based on LDPC codes called Coded Merkle Tree (CMT) was proposed by Yu et al. that enables light nodes to detect a DA attack by randomly requesting/sampling portions of the block from the malicious node. However, light nodes fail to detect a DA attack with high probability if a malicious node hides a small stopping set of the LDPC code. To mitigate this problem, Yu et al. used well-studied techniques to design random LDPC codes with high minimum stopping set size. Although effective, these codes are not necessarily optimal for this application. In this paper, we demonstrate that a suitable co-design of specialized LDPC codes and the light node sampling strategy can improve the probability of detection of DA attacks. We consider different adversary models based on their computational capabilities of finding stopping sets in LDPC codes. For a weak adversary model, we devise a new LDPC code construction termed the entropy-constrained PEG (EC-PEG) algorithm, which concentrates stopping sets to a small group of variable nodes. We demonstrate that the EC-PEG algorithm coupled with a greedy sampling strategy improves the probability of detection of DA attacks. For stronger adversary models, we provide a co-design of a sampling strategy called linear-programming-sampling (LP-sampling) and an LDPC code construction called the linear-programming-constrained PEG (LC-PEG) algorithm.
The new co-design demonstrates a higher probability of detection of DA attacks compared to approaches proposed in earlier literature.


I. INTRODUCTION
Blockchains are tamper-proof ledgers of transaction data maintained by a network of nodes in a decentralized manner. Blockchains were initially proposed in the field of finance with cryptocurrencies like Bitcoin [4] and Ethereum [5]. However, the decentralized nature of blockchains avoids the need for trusted third parties, leading to the application of blockchains in fields such as supply chains [6], the Internet of Things [7], and healthcare [8].
A blockchain is a collection of transaction blocks pieced together in the form of a hashchain. Full nodes in the blockchain network store the entire blockchain ledger and validate the transactions in each block by operating on the entire ledger. However, storing the entire ledger requires a significant storage overhead: currently, the Bitcoin and Ethereum ledgers are around 360GB [9] and 910GB [10] in size, respectively. The large ledger size prevents resource-limited nodes from joining the blockchain system, which in turn affects the scalability and decentralization of the network. To alleviate this problem, some blockchain systems also run light nodes [4], [11]. These are nodes that only store a small fraction of the block and do not validate the transactions present in the block. In particular, light nodes only store the headers corresponding to each block of the blockchain, which contain compressed information about the block. The header for each block contains a field called a Merkle root which is constructed from the block transactions [4]. Using the Merkle root, light nodes can verify the inclusion of a given transaction in a block via a technique called a Merkle proof. However, since light nodes do not store the entire block, they cannot verify the correctness of the transactions present in the block. Assuming that the system has a majority of honest full nodes, light nodes simply accept headers that are part of the longest header chain, since honest full nodes will not mine blocks on chains containing fraudulent transactions (i.e., a longest chain consensus protocol as proposed in [4] is used). However, due to the decentralized nature of a blockchain system, users have a personal incentive to collude and dominate a system with a majority of malicious full nodes. Hence, it is not inconceivable that a dishonest majority can exist in the system, thus making the longest chain protocol insecure for systems with light nodes. As such, researchers were prompted to find methods to provide security even under a dishonest majority of full nodes.
One such research endeavor was studied by the authors in [1], who provided protocols for honest full nodes to generate and broadcast verifiable fraud proofs of invalid transactions. The mechanism allows light nodes, even in the presence of a majority of malicious full nodes, to reject headers of invalid blocks on receiving (and thereafter verifying) fraud proofs of the block invalidity from an honest full node. However, in the presence of a majority of malicious full nodes, the light nodes are still susceptible to data availability (DA) attacks, which have been studied in [2] and [1]. In this attack, as illustrated in the Fig. 1 left panel, a malicious full node generates a block with invalid transactions, publishes the header of the invalid block to the light nodes, and hides the invalid portion of the block from the full nodes. Honest full nodes cannot validate the missing portion of the block and hence are unable to generate fraud proofs to be sent to the light nodes. Since the absence of a fraud proof is also consistent with the block being valid, light nodes accept the header of the invalid block on not receiving fraud proofs. Note that in this system, there is no way of verifying honest alarm messages (messages with no verifiable proof) sent out by full nodes about portions of the block being missing [2] (also see the note in [12]). Thus, honest full nodes (that are in the minority) cannot be incentivized to send alarm messages.
To prevent a DA attack, it is thus imperative for light nodes, on receiving a header from a full node that generates the block, to ensure that the block that the header corresponds to is available to the system. This condition ensures that some honest full node in the system will be able to generate fraud proofs if the block is invalid. Light nodes can independently detect a DA attack if an anonymous request for a portion of the block is rejected by the full node that generates the block. As such, as illustrated in the Fig. 1 right panel, light nodes randomly sample the block, i.e., randomly request different portions of the block transactions, and accept the header only if all the requested portions are returned. In this paper, we are interested in reducing the probability of failure for a light node to detect a DA attack for a given sample size, thus improving the security of the system. Detection of a DA attack using random sampling becomes increasingly unlikely for the light nodes as the block size increases, since a malicious node can hide a very small section of the block. To alleviate this problem, the authors in [1] proposed coding the block using erasure codes. When the block is erasure coded, to make the invalid portion of the block unavailable, the malicious block producer must prevent honest full nodes from decoding back the original block (thus preventing them from generating fraud proofs) by either 1) hiding a larger portion of the coded block (more than the erasure correcting capability of the code), which can then be detected with a high probability by the light nodes using random sampling; or 2) incorrectly generating the coded data, in which case honest full nodes can generate and broadcast verifiable incorrect-coding proofs [1], [2], allowing light nodes to reject the header. To keep the incorrect-coding proof size small, the authors in [1] used 2D Reed-Solomon codes, which results in an incorrect-coding proof size that scales as O(√b log b), where b is the size of the block. Work in [2] extends the idea into a technique called Coded Merkle Tree (CMT) that allows detection of a DA attack on any layer of the Merkle tree. A CMT uses an erasure code to encode each layer of the Merkle tree. Although any erasure code can be used, similar to [2], we focus on Low-Density Parity-Check (LDPC) codes for encoding the Merkle tree since they provide the following benefits: 1) small check node (CN) degrees in the LDPC codes reduce the size of the incorrect-coding proof to O(log b), as shown in [2]; 2) LDPC codes enable the use of a linear time peeling decoder [13] to decode the coded symbols, reducing the decoding complexity compared to the Reed-Solomon codes used in [1]. Despite these benefits, an LDPC code with a peeling decoder admits certain problematic objects, called stopping sets [13], which allow malicious nodes to successfully hide a smaller portion of the block compared to Reed-Solomon codes. We address this issue in our paper through an informed design of LDPC codes.
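To see concretely why random sampling alone struggles on large blocks, consider a light node that samples s of the n coded symbols uniformly at random without replacement while the adversary withholds a portion of size ω. A minimal sketch (the function name, parameters, and numbers are illustrative, not from the paper):

```python
from math import comb

def prob_missed_detection(n: int, hidden: int, s: int) -> float:
    """Probability that s uniform samples, drawn without replacement from
    n coded symbols, all miss the `hidden` symbols withheld by the adversary."""
    if s > n - hidden:
        return 0.0  # more samples than unhidden symbols: detection is certain
    return comb(n - hidden, s) / comb(n, s)

# Hiding a small part of a large block is hard to catch: e.g., 10 hidden
# symbols out of 2048, with 30 samples, is missed roughly 86% of the time
# by a single light node.
print(prob_missed_detection(2048, 10, 30))
```

The probability of a miss grows toward 1 as n increases with the hidden portion held fixed, which is the scaling problem that erasure coding is meant to address.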
A stopping set of an LDPC code is a set of variable nodes (VNs) that, if erased, prevents a peeling decoder from fully decoding the original block. If a malicious node hides the coded symbols corresponding to a stopping set of the LDPC code used to encode a particular layer of the CMT, full nodes will not be able to decode the layer. Since the malicious node can hide the smallest stopping set, the probability of failure (at a particular layer of the CMT) for the light nodes to detect a DA attack using random sampling depends on the size of the smallest stopping set in the LDPC code. Thus, to reduce the probability of failure using random sampling, the best code design strategy is to construct deterministic LDPC codes that have a large minimum stopping set size, which is considered a hard problem in the literature [14], [15]. In this paper, we show that the probability of light node failure to detect a DA attack can be reduced by a suitable co-design of the specialized LDPC codes used to construct the CMT and the light node sampling strategy. For various adversary models, we provide sampling strategies and coupled LDPC code constructions that reduce the probability of failure compared to techniques previously used in the literature.
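The defining condition of a stopping set (every check node neighboring the set is connected to it at least twice, as formalized in Section II-C) can be checked directly on a parity check matrix. A minimal sketch, with an illustrative toy matrix that is not from the paper:

```python
import numpy as np

def is_stopping_set(H: np.ndarray, vns: set) -> bool:
    """True iff every CN (row of H) with a neighbor in `vns` has at least
    two neighbors in `vns` -- the stopping set condition of [13]."""
    restricted = H[:, sorted(vns)]      # columns of the candidate VNs
    cn_deg = restricted.sum(axis=1)     # CN degrees within the candidate set
    return bool(np.all((cn_deg == 0) | (cn_deg >= 2)))

# Toy parity check matrix with 3 CNs and 4 VNs:
H = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 1]])
print(is_stopping_set(H, {0, 1, 2}))  # True: every touched CN is touched twice
print(is_stopping_set(H, {0, 3}))     # False: the first CN is touched only once
```

If the symbols of a stopping set are erased, no parity check has exactly one unknown neighbor, so the peeling decoder can make no progress.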
We can broadly categorize all possible adversaries into three types based on their computational capabilities. The computational complexity is based on how hard it is for a malicious node to find the minimum stopping set in the LDPC code (which is known to be an NP-hard problem [16]). Note that the light node sampling strategy is known by all entities in the system. The first adversary type is termed a weak adversary, where the malicious node does not have the resources to find a large number of stopping sets, settles for hiding a random one it finds, and is unable to take advantage of the light node sampling strategy. The second type is a medium adversary, where the malicious node, using more computational resources, can find all stopping sets up to a certain size and select the stopping set that performs the worst under the posted light node sampling strategy. While the medium adversary has more computational capability than a weak adversary, a medium adversary represents a malicious node with bounded resources that can only find stopping sets up to a certain size within a reasonable time frame. The final type is a strong adversary, which we assume has unlimited resources and can find all stopping sets (of any size) and hide the one among them that performs the worst. These three models represent how many resources we assume an adversary possesses to disrupt our system. As such, our modeling encompasses everything from a single hacker with a standard computer, to a small group of hackers with a cluster of computers, to a large organization with unlimited resources.

A. Contributions
Our main contributions in this paper are co-design techniques for LDPC code construction and coupled light node sampling strategies that result in a low probability of failure under the different adversary models described above. The contributions are listed as follows: 1) For the weak adversary, we demonstrate that concentrating stopping sets in LDPC codes to a small set of VNs and then greedily sampling this small set of VNs results in a low probability of light node failure. We then provide a specialized LDPC code construction technique called the entropy-constrained Progressive Edge Growth (EC-PEG) algorithm (which is based on the PEG algorithm in [17]) that is able to concentrate the stopping sets of an LDPC code to a small set of VNs.
We provide a greedy sampling strategy, based on the cycle distributions of the code, for the light nodes to sample this small set of VNs. We demonstrate that, for a weak adversary, LDPC codes constructed using the EC-PEG algorithm along with the greedy sampling strategy result in a significantly lower probability of failure compared to techniques used in earlier literature.
2) To secure the light nodes against a medium and a strong adversary, we provide a co-design of a light node sampling strategy called linear-programming-sampling (LP-sampling) and an LDPC code construction called the linear-programming-constrained PEG (LC-PEG) algorithm. LP-sampling is tailor-made for the particular LDPC codes used to construct each layer of the CMT and is designed by solving a linear program (LP) based on the knowledge of the small stopping sets in the LDPC codes. The LC-PEG algorithm is designed to minimize the probability of failure when the light nodes use LP-sampling. We demonstrate that, for a medium and a strong adversary, LDPC codes designed by the LC-PEG algorithm coupled with LP-sampling result in a lower probability of failure compared to techniques used in earlier literature.

B. Previous Work
Channel coding was first proposed to solve DA attacks in [1], where the authors used 2D Reed-Solomon codes to encode each transaction block. 2D Reed-Solomon codes were used instead of 1D Reed-Solomon codes since the former have an incorrect-coding proof size of O(√b log b) compared to O(b) for the latter, as shown in [1]. In [2], the authors proposed the CMT and demonstrated that encoding each layer of the CMT using LDPC codes results in a small incorrect-coding proof size of O(log b). The CMT can be constructed using any LDPC code. However, to obtain a low probability of light node failure, the authors in [2] used codes from a well-studied random LDPC ensemble from [18] that guarantees a certain stopping ratio (the smallest stopping set size divided by the codeword length [2]) with high probability. Despite the high stopping ratios guaranteed by these LDPC ensembles, they were originally designed for other types of channels (e.g., the BSC), and we show that they are not the best choice for this specific application.
In particular, we demonstrate that the co-design techniques presented in this work can result in a lower probability of failure compared to using codes from a random LDPC ensemble with random sampling. Moreover, constructing the CMT using codes from a random LDPC ensemble leads to the possibility of using codes with a smaller stopping ratio (bad codes) than guaranteed by the ensemble. This situation requires the broadcast of bad-code proofs [2], which trigger all nodes in the system to use a newly sampled code from the ensemble. This mechanism of modifying the LDPC codes (when triggered by a bad-code proof) leads to additional overheads in the system (e.g., the communication cost of broadcasting a bad-code proof) and also undermines the security of the system until a bad code is detected (e.g., a header whose corresponding block is unavailable may be accepted by a light node when a bad LDPC code is used). To alleviate this problem, in this paper, we provide deterministic LDPC code design algorithms. In [19], the authors provide a protocol called CoVer based on the CMT, which allows a group of light nodes to collectively validate blocks without relying on full nodes. However, to mitigate DA attacks, [19] still uses random sampling and random LDPC ensembles to construct the CMT, similar to [2].
In [1], the authors proposed methods such as proof-of-computation (e.g., zk-STARKs [20]) and proof-of-proximity (using the local decodability of multi-dimensional Reed-Solomon codes) to eliminate the need for incorrect-coding proofs at the cost of an increase in either computational complexity or storage, but the actual trade-off is not known. Hence, in this paper, we focus on reducing the probability of failure of systems that mitigate incorrect-coding attacks using incorrect-coding proofs instead of eliminating them.
DA attacks are possible in other blockchain systems as well. Sharded blockchains, where each node is only responsible for storing and verifying a fraction of the entire block while still storing the block headers, are vulnerable to DA attacks [21] that can be solved using the CMT (as described in [21]). The DA attack in the context of sharded blockchain systems is very similar to what we describe in this paper in the context of light nodes and full nodes, and hence the LDPC co-design techniques described in this paper are also applicable to sharded blockchains.
Side blockchains [22] that improve the throughput of block transactions are also vulnerable to DA attacks. The vulnerability is mitigated in [22] by introducing a data availability oracle that uses the CMT. A co-design idea similar to that of this paper, i.e., constructing specialized LDPC codes to improve various performance metrics of the data availability oracle, was demonstrated in [23].
The rest of this paper is organized as follows. In Section II, we provide the preliminaries and system model and describe the adversary models considered in this paper. In Section III, we present our approach for the LDPC code and sampling strategy co-design to overcome data availability attacks against the weak adversary, where we describe the greedy sampling strategy and the EC-PEG algorithm. In Section IV, we present our co-design approach for the medium and strong adversary, where we describe the LP-sampling strategy and the LC-PEG algorithm. We provide simulation results to demonstrate the benefits of our techniques in Section V. Finally, we provide concluding remarks in Section VI.

II. PRELIMINARIES AND SYSTEM MODEL
A blockchain is made up of data blocks, each carrying a batch of system transactions. Each block carries in its header the root of the Coded Merkle Tree (CMT) built using its transactions as leaf nodes [2]. In this section, we first look at the construction of the CMT, how Merkle proofs are created for the coded symbols in the CMT, and how the CMT is decoded. We then look at the different types of nodes in the system and how they use the CMT to prevent DA attacks.
A. Coded Merkle Tree (CMT)

Fig. 2 (caption): The block is partitioned into k data symbols, and a rate R systematic LDPC code is applied to generate n coded symbols. These n coded symbols form the base layer of the CMT and are hashed using a hashing function; the hashes of every q coded symbols are concatenated to form one data symbol of the parent layer. The data symbols of this layer are again coded using a rate R systematic LDPC code, and the coded symbols are further hashed and concatenated to form the data symbols of its parent layer. This iterative process continues until there are only t (t > 1) hashes in a layer, which form the CMT root. Left panel: a CMT with n = 16, q = 4, R = 0.5, and t = 4; the circled symbols in L_1 and L_2 are the Merkle proof of the circled symbol in L_3. Right panel: a DA attack on the CMT.

1) CMT construction:
A CMT is constructed by encoding each layer of the Merkle tree [4] with an LDPC code and then hashing the layer to generate its parent layer. A simplified description of the CMT construction is shown in the Fig. 2 left panel, where the coded symbols of a layer are interleaved into the data symbols of the parent layer. In this paper, we adopt the interleaving technique introduced in [22]. Let the CMT have l layers (excluding the root), L_1, L_2, ..., L_l, where L_l is the base layer. The root of the CMT is referred to as L_0 and consists of t hashes. Let the LDPC code used in L_j have a parity check matrix H_j, 1 ≤ j ≤ l.
For 1 ≤ j ≤ l, let L_j have n_j coded symbols, where n_l = n. Let N_j[i], 1 ≤ i ≤ n_j, be the i-th symbol of the j-th layer L_j. Also, let D_j[i], 1 ≤ i ≤ Rn_j, and P_j[i], Rn_j + 1 ≤ i ≤ n_j, be the systematic (data) and parity symbols of L_j, respectively. The coded symbols P_j[i], Rn_j + 1 ≤ i ≤ n_j, are obtained from D_j[i], 1 ≤ i ≤ Rn_j, using the rate R systematic LDPC code with parity check matrix H_j. In the above CMT, the hashes of every q coded symbols of a layer are concatenated together to form a data symbol of its parent layer. Hence, the n_j's satisfy n_{j-1} = n_j/(qR), 1 < j ≤ l. Let the number of systematic and parity symbols in L_j be denoted by s_j = Rn_j and p_j = (1 − R)n_j, respectively. Also, define x mod p := (x)_p. The data symbols of L_{j-1} are formed from the coded symbols of L_j (for 1 < j ≤ l) by concatenating, using the string concatenation function concat, the outputs of the hash function Hash on q coded symbols of L_j, where the interleaving of [22] places the hashes of qR systematic symbols and q(1 − R) parity symbols of L_j in each data symbol of L_{j-1}. The CMT root consists of t = n_1 hashes.
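The relation between consecutive layer sizes and the hash-and-concatenate step described above can be sketched as follows. This is a simplification: the recursion n_{j-1} = n_j/(qR) follows from batching q hashes per parent data symbol at rate R, the interleaving of systematic and parity hashes from [22] is omitted, and SHA-256 stands in for the unspecified hash function; all names are illustrative.

```python
import hashlib

def layer_sizes(n: int, q: int, R: float, t: int) -> list:
    """Coded-symbol counts per CMT layer, base layer first, using
    n_{j-1} = n_j / (qR); the recursion stops at the layer of t symbols
    whose hashes form the root."""
    sizes = [n]
    while sizes[-1] > t:
        parent = sizes[-1] / (q * R)
        assert parent.is_integer(), "n, q, R, t must divide evenly"
        sizes.append(int(parent))
    return sizes

def parent_data_symbols(coded: list, q: int) -> list:
    """Hash each coded symbol and concatenate every q hashes into one
    data symbol of the parent layer (interleaving omitted)."""
    hashes = [hashlib.sha256(c).digest() for c in coded]
    return [b"".join(hashes[i:i + q]) for i in range(0, len(hashes), q)]

# The Fig. 2 parameters: n = 16, q = 4, R = 0.5, t = 4.
print(layer_sizes(16, 4, 0.5, 4))   # [16, 8, 4]
```

With the Fig. 2 parameters, the base layer of 16 coded symbols shrinks to 8 and then 4 coded symbols, whose t = 4 hashes form the root.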
2) Merkle Proof for each base layer symbol: For the above CMT construction, the Merkle proof of a symbol in L_j consists of a data symbol and a parity symbol from each intermediate layer of the tree that is above L_j [22]. An illustration of a Merkle proof is shown in the Fig. 2 left panel. In particular, the Merkle proof of a symbol N_j[i] contains one data symbol and one parity symbol from each layer L_j', 1 ≤ j' < j. Given the CMT root, the Merkle proof of a symbol can be used to check the inclusion of the symbol (and other symbols in the Merkle proof) in the tree, in a manner similar to checking the proofs for regular Merkle trees in [4]. Various properties satisfied by the symbols in a Merkle proof can be found in [22].
3) Hash-Aware Peeling decoder to detect and prove incorrect coding: Using the CMT root and the coded symbols of each layer (some of which may be unavailable) of the CMT, the original block (the data symbols of the base layer of the CMT) can be decoded using the hash-aware peeling decoder described in [2]. The hash-aware peeling decoder decodes each layer of the CMT (from top to bottom) like a conventional peeling decoder [13]. However, after decoding a symbol in layer j, the decoder matches its hash with the corresponding hash present in layer j − 1. Matching the hashes allows the decoder to detect incorrect-coding attacks and generate incorrect-coding proofs as described in [2]. The size of an incorrect-coding proof for a CMT is proportional to the number of coded symbols involved in each parity check equation. LDPC codes, due to their sparse parity check equations, have a small incorrect-coding proof size, which is proportional to the degree of the CNs in the LDPC code. For the hash-aware peeling decoder to successfully generate incorrect-coding proofs for incorrect coding at any layer of the CMT, it should be able to decode all layers of the CMT. Details regarding the decoder and the incorrect-coding proof generation mechanism can be found in [2].
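The hash-aware peeling step can be sketched as follows. This is a toy version under stated assumptions: symbols are single small integers with XOR parities, SHA-256 stands in for the hash function, and the returned tuple stands in for a full incorrect-coding proof; all names are illustrative.

```python
import hashlib
import numpy as np

def hash_aware_peel(H, symbols, expected_hashes):
    """Peeling decoder over erasures (None = hidden symbol) with a hash
    check in the spirit of [2]: each newly peeled symbol must match the
    hash committed in the parent layer. Returns (symbols, proof), where
    proof is None or the index of the offending parity check."""
    def h(x):
        return hashlib.sha256(bytes([x])).hexdigest()

    progress = True
    while progress:
        progress = False
        for r, row in enumerate(H):
            missing = [i for i in np.flatnonzero(row) if symbols[i] is None]
            if len(missing) == 1:                    # degree-one CN: peel it
                i = missing[0]
                val = 0
                for j in np.flatnonzero(row):
                    if j != i:
                        val ^= symbols[j]            # XOR parity recovery
                if h(val) != expected_hashes[i]:     # hash mismatch: bad coding
                    return symbols, ("incorrect-coding proof", r)
                symbols[i] = val
                progress = True
    return symbols, None

H = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1]])
codeword = [1, 0, 1, 1]                              # satisfies both XOR checks
hashes = [hashlib.sha256(bytes([x])).hexdigest() for x in codeword]
decoded, proof = hash_aware_peel(H, [1, None, 1, 1], hashes)
print(decoded, proof)                                # [1, 0, 1, 1] None
```

If the committed hash of the hidden symbol is tampered with, the same call returns the index of the violated parity check instead of completing the decoding.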
Remark 1. While any erasure code can be used to encode the CMT, in this paper we focus on LDPC codes similar to [2] and provide new designs for the LDPC codes to reduce the probability of failure. Since we only modify the structure of the LDPC codes, we do not compromise on the other performance metrics and their order-optimal solutions in [2]. In particular: i) the CMT root has a fixed size t which does not grow with the blocklength; ii) the hash-aware peeling decoder has a decoding complexity that is linear in the blocklength; iii) we empirically show that the incorrect-coding proof size for our codes is similar to [2].

B. Type of Nodes in the System
We consider a blockchain system similar to [2] and [1] that has full nodes and light nodes.
Similar to [2], we consider that a network can have a dishonest majority of full nodes, but we assume that each light node is connected to at least one honest full node. The detailed actions performed by these nodes are described below.
1) Full nodes (see Fig. 1) can produce (mine) new blocks. On producing a new block, full nodes encode the block to construct its CMT and then broadcast all the coded symbols in the CMT (including the root) to all other full nodes, and the root of the CMT to the light nodes.
On receiving a sampling request from the light nodes for certain symbols of the base layer of the CMT, full nodes return the requested symbols along with their Merkle proofs. Full nodes also download and decode each layer of the CMT using a hash-aware peeling decoder, as explained in Section II-A. After decoding the base layer of the CMT, which contains the transaction data, full nodes verify all the transactions. Full nodes store a local copy of every block (i.e., its CMT) that they verify to be valid (i.e., having no fraudulent transactions and no incorrect coding at any layer). If full nodes find a certain block to be invalid, either due to fraudulent transactions or incorrect coding on some layer of the CMT, they broadcast a fraud proof or an incorrect-coding proof (provided by the hash-aware peeling decoder) for other nodes to reject the block. If full nodes find a certain layer of the CMT to be unavailable (i.e., having coded chunks missing that prevent decoding), they reject the block. In the case of the block being unavailable, full nodes do not broadcast any alarm message due to the lack of a reward mechanism for sending correct alarms [2], [12]. A malicious full node need not follow the above protocol and can act arbitrarily.
2) Light nodes (see Fig. 1) are storage constrained and only store the CMT root corresponding to each block. Light nodes can download only a small portion of the block and perform tasks like fraud proof checks and incorrect-coding proof checks. Additionally, light nodes check the availability of each layer of the CMT by making sampling requests for coded symbols of the CMT base layer from the block producer and performing Merkle proof checks on the returned symbols. Upon receiving all the requested symbols and verifying their Merkle proofs, light nodes accept the block as available and store the block header. Light nodes broadcast all the symbols (and their Merkle proofs) returned by the block producer to other nodes in the system.
On receiving fraud proofs or incorrect-coding proofs sent out by a full node, light nodes verify the proof and reject the header if the proof is correct. We assume that each light node is honest.

C. Stopping sets and LDPC notation
A stopping set of an LDPC code is a set of variable nodes (VNs) such that every check node (CN) connected to this set is connected to it at least twice [13]. A stopping set is said to be hidden (made unavailable) by a malicious node if all VNs present in it are hidden. The hash-aware peeling decoder fails to decode layer j of the CMT if the coded symbols corresponding to a stopping set of H_j are unavailable. Let the VNs of the parity check matrix H_j be denoted by {v_1^(j), ..., v_{n_j}^(j)}. Let the Tanner graph (TG) representation of H_j be denoted by G_j, where v_i^(j) is also referred to as the i-th VN in G_j and the rows of H_j are referred to as the CNs in G_j. For a parity check matrix H_j, let H_j[v_i^(j)] denote the column of the parity check matrix corresponding to VN v_i^(j). A cycle of length g is called a g-cycle. For a set S, let |S| denote its cardinality. For a cycle (stopping set) in the TG G, we say that a VN v touches the cycle (stopping set) iff v is part of the cycle (stopping set). We define the weight of a stopping set as the number of VNs touching it. Let ω_min^(j) denote the minimum stopping set size of H_j, 1 ≤ j ≤ l. Throughout this paper, we refer to individual stopping sets by the symbol ψ (with subscripts and superscripts added wherever necessary). The girth of a TG is defined as the length of the smallest cycle present in the graph. For p = (p_1, p_2, ..., p_t) such that p_i ≥ 0 and Σ_{i=1}^t p_i = 1, we use the entropy function H(p) = −Σ_{i=1}^t p_i log(p_i) (with the convention 0 log 0 = 0). For a vector a, let max(a) (min(a)) denote the value of the largest (smallest) entry of the vector and let a_i denote its i-th element. Similarly, for a matrix M of size c × d, let M_{ki} denote the element of M in the k-th row and i-th column.
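The entropy function just defined can be computed as follows (a minimal sketch; the natural logarithm is assumed here since the base is left unspecified). Intuitively, the more the stopping sets are concentrated on a small group of VNs, as the EC-PEG algorithm aims for, the lower the entropy of the induced distribution:

```python
import numpy as np

def entropy(p) -> float:
    """H(p) = -sum_i p_i log(p_i), with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
    nz = p[p > 0]                       # drop zero entries: 0*log(0) = 0
    return float(-(nz * np.log(nz)).sum())

print(entropy([0.5, 0.5, 0.0]))  # ~0.693: mass spread over two entries
print(entropy([1.0, 0.0, 0.0]))  # 0.0: fully concentrated distribution
```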

D. Threat Model
We consider an adversary that is interested in conducting a DA attack by hiding certain coded symbols of the CMT (which may belong to any layer) such that an honest full node is unable to recover the layer from which the coded symbols are hidden. An illustration of a DA attack is shown in the Fig. 2 right panel. On receiving sampling requests from the light nodes, the adversary only returns coded symbols of the CMT that it has not hidden and ignores the other requests. Note that, using Merkle proofs, light nodes can verify the inclusion of the returned samples in the original block with respect to the Merkle root, preventing the adversary from returning incorrect data (which is not consistent with the Merkle root). Since the light nodes also broadcast the returned samples, the adversary cannot return the actual hidden portion which prevented honest full nodes from decoding the layer whose coded symbols were hidden. The adversary conducts a DA attack at layer j of the CMT by 1) generating coded symbols of layer j, each of which satisfies the Merkle proof, so that the light nodes accept these coded symbols as valid, and 2) hiding a small portion of the coded symbols of layer j, corresponding to a stopping set of H_j, such that honest full nodes are not able to decode the layer successfully using a hash-aware peeling decoder. A DA attack at layer j thus prevents an honest full node from generating a fraud proof of fraudulent transactions (if j = l) or an incorrect-coding proof for incorrect coding at layer j. Since incorrect coding can occur at any layer, for the full nodes to be able to send incorrect-coding proofs, light nodes must detect a DA attack at any layer j, 1 ≤ j ≤ l, that the adversary may perform. Light nodes detect a DA attack at any layer j by sampling a few base layer coded symbols. For each intermediate layer j, 1 ≤ j < l, the symbols of layer j collected as part of the Merkle proofs of the base layer samples are used to check the availability of layer j, and there is no need for additional sampling of the intermediate layers to check their availability.
Light nodes fail to detect a DA attack if none of the requested base layer samples or the symbols in their Merkle proofs are hidden. Let P_f^(j)(s), 1 ≤ j ≤ l, be the probability of failure of detecting a DA attack at layer j by the light nodes when they sample s base layer coded symbols. Also, let J_max = argmax_{1 ≤ j ≤ l} P_f^(j)(s). To maximize the probability of failure of the light nodes, we assume that the adversary is able to perform a DA attack at layer J_max.
We now provide a precise mathematical definition of the three adversary models discussed in Section I based on their computational capabilities: 1) Weak Adversary: for each layer j, 1 ≤ j ≤ l, they hide stopping sets of size < μ_j of the parity check matrix H_j. Moreover, they do not exhaustively find all stopping sets of a particular size of a given parity check matrix or perform a tailored search for stopping sets. Instead, we assume that, to conduct a DA attack at layer j, they choose one of the stopping sets of H_j of a particular size uniformly at random and hide it.
2) Medium Adversary: for each layer j, 1 ≤ j ≤ l, they hide stopping sets of size < μ_j of the parity check matrix H_j. However, they use the knowledge of the sampling strategy employed by the light nodes to hide the worst case stopping set, i.e., the one that has the lowest probability of being sampled by the light nodes. Let S_j be the set of all stopping sets of H_j of size < μ_j. Also, let P_f^(j)(s) = max_{ψ ∈ S_j} P_f^(j)(s; ψ), where P_f^(j)(s; ψ) is the probability of failure for the light nodes to detect a DA attack at layer j under the light node sampling strategy when the adversary hides the stopping set ψ of H_j. For J_max = argmax_{1 ≤ j ≤ l} P_f^(j)(s), the medium adversary conducts a DA attack at layer J_max by hiding a stopping set ψ from S_{J_max} with the highest P_f^(J_max)(s; ψ).
3) Strong Adversary: it can find the worst case stopping sets of any size of $H_j$, 1 ≤ j ≤ l.
Let $S^{\infty}_j$ be the set of all stopping sets of $H_j$. Similar to the medium adversary, define $P_f^{(j)}(s; \psi)$. The strong adversary conducts a DA attack at layer $J_{\max}$ by hiding a stopping set $\psi$ from $S^{\infty}_{J_{\max}}$ with the highest $P_f^{(J_{\max})}(s; \psi)$.
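The medium adversary's choice can be made concrete with a short sketch. The code below (a hypothetical toy instance; `worst_stopping_set` is our own illustrative helper, not from the paper) picks the stopping set with the lowest probability of being touched by s independent sample requests drawn from a distribution x:

```python
def worst_stopping_set(stopping_sets, x, s):
    """Return the stopping set a medium adversary would hide: the one
    minimizing the chance that any of the s i.i.d. sample requests
    (request i drawn with probability x[i]) lands inside it."""
    def p_fail(psi):
        hit = sum(x[i] for i in psi)  # per-request probability of touching psi
        return (1 - hit) ** s
    return max(stopping_sets, key=p_fail)

# Toy example with hypothetical stopping sets over 6 VNs.
x = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]      # light-node sampling distribution
sets = [{0, 1}, {2, 3}, {4, 5}]          # candidate stopping sets of H
psi = worst_stopping_set(sets, x, s=10)  # a set of lightly-sampled VNs
```

The strong adversary behaves the same way, except that it searches over all stopping sets of any size rather than only those of size below $\mu_j$.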
In our co-design to mitigate a DA attack by a medium and a strong adversary (using LP-sampling and the LC-PEG algorithm), we assume that a blockchain system designer decides the values of $\mu_j$, 1 ≤ j ≤ l, and is able to find all stopping sets of $H_j$ of size $< \mu_j$, which are then used to design LP-sampling. Note that the choice of $\mu_j$, 1 ≤ j ≤ l, that the designer makes (to design LP-sampling) is not publicly released; only the final design output, i.e., the LP-sampling strategy that the light nodes use, is publicly available. The designer decides the LDPC code and sampling strategy co-design to be employed in the system. The co-design choice depends on the extent to which they want to secure the light nodes in the system against a DA attack conducted by adversaries of various types. Since finding the smallest stopping set size of an LDPC code is NP-hard [16], the authors in [2] consider a realistically weak adversary which randomly hides CMT coded symbols. The weak adversary defined above is stronger (since it hides a stopping set) but is assumed not to conduct a tailored search for stopping sets in the LDPC code in order to hide a stopping set having a low probability of being sampled by the light nodes. Instead, it hides a stopping set of a particular size at random. The co-design that we provide to mitigate DA attacks by weak adversaries, i.e., the EC-PEG algorithm and the greedy sampling strategy, has the advantage of being computationally cheap and does not involve finding stopping sets. To mitigate DA attacks by a medium and a strong adversary, we provide LP-sampling, which uses stopping sets of size $< \mu_j$ from layer j of the CMT. It is more computationally expensive and is an overkill for the weak adversary, which can be mitigated using cheaper techniques.
Given the above adversary models, we wish to provide LDPC code constructions and sampling strategies that minimize the probability of failure for the light nodes to detect DA attacks. In the next section, we discuss techniques to mitigate DA attacks conducted by a weak adversary.

III. LDPC CODE AND SAMPLING CO-DESIGN FOR WEAK ADVERSARY
In this section, we present our novel design idea of concentrating stopping sets in LDPC codes to reduce the probability that light nodes fail to detect a DA attack conducted by a weak adversary. The authors in [24] showed that in LDPC codes with no degree-one VNs, all stopping sets are made up of cycles. Since working with stopping sets directly is computationally difficult, we focus on concentrating cycles in order to indirectly concentrate stopping sets. It is also well known that codes with irregular VN degree distributions are prone to small stopping sets. Thus, we consider VN-degree-regular LDPC codes of VN degree $d_v \ge 3$ in this paper. Investigating irregular degree distributions that allow for efficient sampling strategies to reduce the probability of failure is beyond the scope of this paper and is a topic for future research. In the following, we first consider only the base layer of the CMT and look at the effect of the light node sampling strategy on the probability of failure when a DA attack occurs on the base layer (illustrated in Fig. 2 right panel). This will motivate the LDPC code construction for the base layer. Later, we will demonstrate how the LDPC code construction strategy for the base layer can be used in all layers by aligning the columns of the parity check matrices before constructing the CMT.
For simplicity of notation, we also denote the base layer parity check matrix $H_l$ by $H$, having $n$ VNs. The following lemma demonstrates that LDPC codes with concentrated stopping set distribution $ss_\kappa$ result in a smaller probability of light node failure when a weak adversary conducts a DA attack (on the base layer) by randomly hiding a stopping set of size $\kappa$ and the light nodes sample s base layer coded symbols.
The above lemma suggests that for a fixed sample size s, the lowest probability of failure is $1 - \tau(S^{opt}_\kappa, \kappa)$ and is achieved when the light nodes sample the set $S^{opt}_\kappa$. Now, $\tau(S^{opt}_\kappa, \kappa)$ is large if a majority of the stopping sets of weight $\kappa$ are touched by the same small subset of VNs. This goal is achieved if the distributions $ss_\kappa$ are concentrated (i.e., have high stopping set fractions) towards a small set of VNs. Thus, designing LDPC codes with concentrated $ss_\kappa$ increases the value of $\tau(S^{opt}_\kappa, \kappa)$ and reduces the probability of failure. Later, in Section III-B, we design the EC-PEG algorithm, which achieves concentrated stopping set distributions.
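The quantity $\tau(S, \kappa)$ can be illustrated with a toy computation. In the sketch below (hypothetical stopping sets; it assumes the weak adversary hides one of the listed sets uniformly at random), a concentrated family of stopping sets is mostly covered by a two-VN sample set:

```python
def tau(sample_set, stopping_sets):
    """Fraction of the given stopping sets touched by (intersecting) the
    sampled VN set; 1 - tau is the failure probability when the adversary
    hides one of these stopping sets uniformly at random."""
    S = set(sample_set)
    touched = sum(1 for psi in stopping_sets if S & set(psi))
    return touched / len(stopping_sets)

# Concentrated stopping sets: most share VN 0, so sampling {0, 1}
# already touches 3 of the 4 sets.
sets = [{0, 2}, {0, 3}, {0, 4}, {5, 6}]
frac = tau({0, 1}, sets)
```

With a uniform (non-concentrated) family, the same two samples would touch far fewer sets, which is the intuition behind concentrating $ss_\kappa$.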
Algorithm 1 Light node sampling strategy for weak adversary: greedy-set(G, $g_{min}$, $g_{max}$, s)
1: Inputs: TG G, $g_{min}$, $g_{max}$, s; Output: $S_{greedy}$; Initialize: $S_{greedy} = \emptyset$, $g = g_{min}$, $\tilde{G} = G$
2: while $|S_{greedy}| < s$ do
3:   $V_s$ = set of VNs that touch the maximum number of $g$-cycles in $\tilde{G}$
     …
6:   if $\tilde{G}$ has no $g$-cycles then $g = g + 2$
7:   if $g \ge g_{max}$ then
8:     $V_r$ = randomly select $s - |S_{greedy}|$ VNs from $\tilde{G}$ (ordered arbitrarily)
9:     $S_{greedy} = S_{greedy} \cup V_r$

We are unaware of an efficient method to find $S^{opt}_\kappa$. Instead, we use the greedy algorithm provided in Algorithm 1 to find the set of base layer samples that the light node will request.
Algorithm 1 takes as input the TG G, its girth $g_{min}$, an upper bound on the cycle length $g_{max}$, and the sample size s, and outputs a set of VNs $S_{greedy}$ that the light nodes will sample. Note that the same s samples are used irrespective of the stopping set size $\kappa$. The probability of failure using this strategy when a weak adversary randomly hides a stopping set of size $\kappa$ from the base layer is $1 - \tau(S_{greedy}, \kappa)$ (by Lemma 1). We refer to the VNs provided by greedy-set(G, $g_{min}$, $g_{max}$, s) in Algorithm 1 as greedy samples. At the end of this section, we will provide empirical evidence that concentrating the cycle distributions $\zeta_g$ for different cycle lengths also concentrates the stopping set distributions. Thus, the LDPC code construction we provide aims to concentrate the cycle distributions in order to improve the probability of failure.
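A minimal sketch of the greedy idea behind Algorithm 1 (not the exact algorithm: here the cycles are passed in precomputed as VN sets, and the hypothetical helper works through one cycle length at a time before padding with random VNs):

```python
import random

def greedy_samples(cycles_by_len, n_vns, s):
    """Greedy sketch of Algorithm 1: repeatedly pick the VN touching the
    most cycles of the current length, remove the covered cycles, move to
    longer cycles when the current length is exhausted, and fall back to
    random VNs at the end. `cycles_by_len` maps length g to VN sets."""
    chosen = []
    remaining = {g: [set(c) for c in cs] for g, cs in cycles_by_len.items()}
    for g in sorted(remaining):
        while remaining[g] and len(chosen) < s:
            counts = {}
            for cyc in remaining[g]:
                for v in cyc:
                    counts[v] = counts.get(v, 0) + 1
            best = max(counts, key=counts.get)
            chosen.append(best)
            remaining[g] = [c for c in remaining[g] if best not in c]
    # pad with random unchosen VNs once all tracked cycles are covered
    pool = [v for v in range(n_vns) if v not in chosen]
    while len(chosen) < s and pool:
        chosen.append(pool.pop(random.randrange(len(pool))))
    return chosen

cycles = {6: [{0, 1, 2}, {0, 3, 4}], 8: [{5, 6, 7}]}
picks = greedy_samples(cycles, n_vns=10, s=3)  # VN 0 covers both 6-cycles first
```

VN 0 is chosen first because it touches both 6-cycles, mirroring the "maximum number of g-cycles" criterion in line 3 of Algorithm 1.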

A. Aligning the parity check matrices of the CMT
In the above discussion, we demonstrated how to mitigate a DA attack conducted by a weak adversary on the base layer of the CMT using greedy sampling. Now, we want to extend the idea of greedy sampling to the intermediate layers. Since the intermediate layers are sampled via the Merkle proofs of the base layer samples, we need to align the base layer and intermediate layer symbols such that the intermediate layers are also sampled greedily. We achieve this condition by aligning (permuting) the columns of the parity check matrices used in the different layers of the CMT such that the samples of an intermediate layer j collected as part of the Merkle proofs of the base layer samples coincide with those provided by the greedy sampling strategy for layer j, i.e., they coincide with the greedy samples in greedy-set($G_j$, $g^{(j)}_{min}$, $g^{(j)}_{max}$, s), where $g^{(j)}_{min}$ is the girth of $G_j$ and $g^{(j)}_{max}$ is the upper cycle length for layer j. We assume that the greedy set $S_{greedy}$ output by Algorithm 1 is ordered according to the order in which the VNs were added to $S_{greedy}$. Let $S^{(j)}_{ordered}$ = greedy-set($G_j$, $g^{(j)}_{min}$, $g^{(j)}_{max}$, $n_j$), 1 ≤ j ≤ l. The VNs in $S^{(j)}_{ordered}$ are all the VNs of $H_j$ ordered (permuted) according to the order in which they were added to $S^{(j)}_{ordered}$. Hence, we denote by $S^{(j)}_{ordered}[i]$ the $i$th VN in this ordered list of VNs. The procedure to align the columns of the parity check matrices of the different layers of the CMT is provided in Algorithm 2. In the algorithm, we first permute the columns of the base layer parity check matrix $H_l$ (to obtain $\bar{H}_l$) such that the VNs in $S^{(l)}_{ordered}$ appear as columns 1, 2, . . ., $n_l$ in $\bar{H}_l$ (line 3). After this permutation to obtain $\bar{H}_l$, the sampling strategy for the light nodes (for a sample size s) becomes sampling the first s VNs or symbols of the base layer. Recall that when the $i$th base layer symbol is sampled, then for every intermediate layer j the symbols with indices $\{1 + (i-1)s_j,\ 1 + s_j + (i-1)p_j\}$ get sampled. We assign to the columns of $\bar{H}_j$ at these indices (starting from i = 1) the columns of $H_j$ corresponding to the greedy samples in $S^{(j)}_{ordered}$ from start to end (lines 5-8). We continue this process until all columns of $\bar{H}_j$ have been assigned. Finally, the parity check matrices $\bar{H}_j$, 1 ≤ j ≤ l, are used for constructing the CMT.
Remark 2. Recall that a CMT is built using systematic LDPC codes. Under the assumption that the parity check matrices $\bar{H}_j$, 1 ≤ j ≤ l, are full rank, the corresponding generator matrices can be easily constructed in systematic form, and these are then used to construct the CMT.
Algorithm 2 Aligning parity check matrices of the CMT for greedy sampling
1: Inputs: $H_j$, $S^{(j)}_{ordered}$, 1 ≤ j ≤ l; Outputs: $\bar{H}_j$, 1 ≤ j ≤ l
2: Initialize: $\bar{H}_j$: matrix with unassigned columns, 1 ≤ j ≤ l; counter = 1
   …
   for i = 1, 2, . . ., $n_l$ do
6:   if all columns of $\bar{H}_j$ have been assigned then break the i for loop
   …

For a CMT built using $\bar{H}_j$, 1 ≤ j ≤ l, provided by Algorithm 2, greedy sampling of the base layer of the CMT according to Algorithm 1 ensures that all intermediate layers of the CMT are also greedily sampled according to Algorithm 1 through the Merkle proofs of the base layer samples.
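The base-layer part of the alignment reduces to a column permutation. The sketch below (toy matrix; `greedy_order` is a hypothetical greedy VN ordering) shows the operation that makes "sample the first s columns" coincide with greedy sampling:

```python
def align_columns(H, greedy_order):
    """Permute the columns of a parity-check matrix (given as a list of
    rows) so that the VNs appear in the order a greedy sampler would
    request them, as in the base-layer step of Algorithm 2."""
    return [[row[i] for i in greedy_order] for row in H]

H = [[1, 0, 1, 1],
     [0, 1, 1, 0]]
order = [2, 0, 3, 1]             # hypothetical greedy VN ordering
H_bar = align_columns(H, order)  # first column is now the top greedy VN
```

Since permuting columns of a parity check matrix only relabels the VNs, the code (and its stopping sets, up to relabeling) is unchanged by the alignment.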
In the next subsection, we provide a design strategy to construct LDPC codes with concentrated stopping set distributions that result in a low probability of failure under greedy sampling.

B. Entropy-Constrained PEG (EC-PEG) Algorithm
In this subsection, we provide a heuristic method to construct LDPC codes with concentrated stopping set distributions. Our method is based on minimizing the entropy of the cycle distributions $\zeta_g$. The intuition behind our algorithm is that uniform distributions have high entropy, while concentrated distributions have low entropy. Thus, using entropy as a measure, we construct LDPC codes iteratively using the PEG algorithm [17] by making CN selections that minimize the entropy of the cycle distributions. Algorithm 3 presents the full EC-PEG algorithm for constructing a TG G with n VNs, m CNs, and VN degree $d_v$ that concentrates the cycle distributions $\zeta_g$ for all $g$-cycles, $g < g_c$. The choice of $g_c$ is a complexity constraint that determines how many cycle lengths we keep track of in the algorithm. All ties in the algorithm are broken randomly.
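The entropy measure driving the CN selection can be sketched as follows (illustrative counts only; a concentrated cycle distribution scores lower than a uniform one):

```python
import math

def cycle_entropy(cycle_counts):
    """Entropy (in bits) of the normalized cycle distribution zeta_g.
    Low entropy means the g-cycles are concentrated on few VNs, which is
    what the EC-PEG CN-selection criterion tries to achieve."""
    total = sum(cycle_counts)
    probs = [c / total for c in cycle_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Per-VN cycle counts: concentrated vs. uniform over six VNs.
concentrated = [6, 1, 1, 0, 0, 0]
uniform = [2, 2, 2, 2, 0, 0]
h_conc = cycle_entropy(concentrated)
h_unif = cycle_entropy(uniform)
```

In the EC-PEG algorithm, a candidate CN is scored by the entropy of the cycle distribution that would result from connecting it, and the minimizer is selected.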

Algorithm 3 EC-PEG Algorithm
  $c_{sel}$ = select a CN from K with the minimum degree under the current TG setting $\tilde{G}$
8: else ($g$-cycles, $g < g_c$, are created)
   for each c in K do
10:  $L^c_{cycles}$ = new $g$-cycles formed in $\tilde{G}$ due to the addition of an edge between c and $v_j$
   …

The PEG algorithm is an iterative algorithm that builds a TG in an edge-by-edge manner by iterating over the set of VNs and, for each VN $v_j$, establishing $d_v$ edges to it. For establishing the $k$th edge to VN $v_j$, the PEG algorithm encounters two situations: i) addition of the edge is possible without creating cycles; ii) addition of the edge creates cycles. In both situations, the PEG algorithm finds a set of candidate CNs that it proposes to connect to $v_j$ in order to maximize the girth. We abstract out the steps followed in [17] to find the set of candidate CNs by a procedure PEG($\tilde{G}$, $v_j$). It returns the set of candidate CNs K for establishing a new edge to VN $v_j$ under the TG setting $\tilde{G}$ according to the PEG algorithm in [17]. For situation ii), the procedure also returns the cycle length $g$ of the smallest cycles formed when an edge is established between any CN in K and $v_j$. For situation i), it returns $g = \infty$. The output K is the set of all CNs in $\tilde{G}$ that result in these smallest cycles.

Additionally, the distributions $\zeta_6$ and $\zeta_8$ are concentrated towards the same set of VNs, since the same VNs that have a high 6-cycle fraction also have a high 8-cycle fraction. Fig. 3 middle and right panels show the corresponding stopping set distributions $ss_\kappa$ for the LDPC codes designed using the original PEG and EC-PEG algorithms. We see that for the EC-PEG algorithm, the VNs towards the left (right) on the x-axis have a high (low) stopping set fraction. Thus, concentrating the cycle distributions concentrates the stopping set distributions towards the same set of VNs as the cycles. In Section V, we demonstrate that such concentrated stopping set distributions result in a low probability of failure when the greedy sampling strategy in Algorithm 1 is used.

IV. LDPC CODE AND SAMPLING CO-DESIGN FOR MEDIUM AND STRONG ADVERSARY
While the EC-PEG algorithm and greedy sampling work well for weak adversaries, they rely on the assumption that the weak adversary can only find a random small stopping set. For the medium and strong adversaries discussed in Section II-D, the EC-PEG algorithm and greedy sampling are insufficient to properly secure the system, and a stronger code and sampling co-design is required. In this section, we focus on overcoming these stronger adversary models, which perform an exhaustive search for stopping sets in the LDPC codes to hide the worst case stopping set, i.e., the one with the lowest probability of being sampled by the light nodes. Similar to the previous section, we first consider a medium and a strong adversary that conduct a DA attack on the base layer of the CMT and propose a sampling strategy for the light nodes to sample the base layer so as to minimize the probability of failure. This will motivate the construction of LDPC codes for the base layer of the CMT, designed to lower the probability of failure under the new sampling strategy. Finally, we will generalize the sampling strategy and the LDPC code construction to the situation where the adversary conducts a DA attack at any intermediate layer of the CMT.
Recall that for each layer j, 1 ≤ j ≤ l, the medium adversary hides stopping sets of $H_j$ of size $< \mu_j$. Let $S_j = \{\psi_1, \ldots, \psi_{|S_j|}\}$ be the set of all stopping sets of $H_j$ of size $< \mu_j$, 1 ≤ j ≤ l. For $S_j$, let $\Pi^{(j)}$ denote the VN-to-stopping-set adjacency matrix of size $|S_j| \times n_j$, where $\Pi^{(j)}_{ki} = 1$ if the $i$th VN of $H_j$ is part of the $k$th stopping set in $S_j$, and $\Pi^{(j)}_{ki} = 0$ otherwise. A sampling (with replacement) strategy is specified by a pair $(x, \beta^{(l)})$, where $x_i$ is the probability that a light node requests the $i$th base layer symbol in each sample request and $\beta^{(l)}$ is a non-negative real number. Here, $(x, \beta^{(l)})$ satisfy $\beta^{(l)} \le x_i \le 1$ and $\sum_{i=1}^{n_l} x_i = 1$; $\beta^{(l)}$ controls the minimum probability of requesting a given symbol from the CMT base layer.
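The matrix $\Pi^{(j)}$ can be built directly from an enumerated stopping set list. In the toy sketch below (hypothetical sets), the row-wise dot product with x gives the per-request probability of touching each stopping set:

```python
def vn_to_ss_adjacency(stopping_sets, n):
    """Build the |S| x n VN-to-stopping-set adjacency matrix Pi:
    Pi[k][i] = 1 iff VN i belongs to the k-th stopping set."""
    return [[1 if i in psi else 0 for i in range(n)] for psi in stopping_sets]

# With Pi, the per-request probability that sampling distribution x
# touches stopping set k is the dot product of row k with x.
sets = [{0, 2}, {1, 3}]
Pi = vn_to_ss_adjacency(sets, n=4)
x = [0.4, 0.1, 0.4, 0.1]
hit = [sum(p * xi for p, xi in zip(row, x)) for row in Pi]
```

Under this toy x, the first stopping set is touched with probability 0.8 per request and the second with only 0.2, which is exactly the kind of imbalance the medium adversary exploits.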
Our goal is to compute a sampling strategy $(x, \beta^{(l)})$ that results in a low probability of failure when a medium or a strong adversary conducts a DA attack. For a sampling strategy $(x, \beta^{(l)})$, the probability of failure against a medium adversary for a DA attack on the base layer can be calculated as
$$P^{(l)}_{f,med}(s) = \max_{\psi \in S_l} \Big(1 - \sum_{i \in \psi} x_i\Big)^{s}. \qquad (1)$$
Similarly, we can upper bound the probability of failure against a strong adversary $P^{(l)}_{f,str}(s)$ that conducts a DA attack on the base layer of the CMT as follows (recall that $S^{\infty}_j$ is the set of all stopping sets of $H_j$):
$$P^{(l)}_{f,str}(s) \le \max\Big(P^{(l)}_{f,med}(s),\ \big(1 - \beta^{(l)}\mu_l\big)^{s}\Big), \qquad (2)$$
where the first term in the maximum is the same as the probability of failure for a medium adversary, and the second term is due to the fact that $\max_{\psi \in S^{\infty}_l,\, |\psi| \ge \mu_l} P^{(l)}_f(s; \psi) \le (1 - \beta^{(l)}\mu_l)^{s}$, since each sample request touches a stopping set of size at least $\mu_l$ with probability at least $\beta^{(l)}\mu_l$. In the rest of the paper, we assume $P^{(l)}_{f,str}(s)$ is equal to the upper bound provided in Eqn. (2). We find the light node sampling strategy by formulating a linear program (LP) in $(x, \beta^{(l)})$ based on the above probabilities such that we get a lower probability of failure against the medium and strong adversary compared to random sampling. The optimization problem (which can be easily converted into an LP by introducing additional variables) is provided below:
$$\min_{x,\, \beta^{(l)}} \ \max\Big(\theta\, P^{(l)}_{f,med}(s),\ (1-\theta)\big(1 - \beta^{(l)}\mu_l\big)^{s}\Big) \ \ \text{s.t.}\ \beta^{(l)} \le x_i \le 1,\ \sum_{i=1}^{n_l} x_i = 1, \qquad (3)$$
where θ, 0 ≤ θ ≤ 1, is a parameter that controls the trade-off between the probability of failure against the medium and the strong adversary. In the next subsection, we generalize LP (3) to take into account DA attacks conducted by the adversary on any intermediate layer of the CMT.
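The effect of problem (3) can be illustrated without an LP solver by evaluating the objective for two candidate strategies. The sketch below (hypothetical stopping sets and parameter values; a hand-tilted strategy stands in for the LP optimum) shows that up-weighting a VN covering the enumerated small stopping sets, while keeping a floor β on every symbol, beats uniform random sampling:

```python
def objective(x, beta, stopping_sets, mu, s, theta):
    """Sketch of the objective of problem (3): trade-off between the
    medium-adversary failure probability over the enumerated stopping
    sets and the strong-adversary bound (1 - beta*mu)^s."""
    p_med = max((1 - sum(x[i] for i in psi)) ** s for psi in stopping_sets)
    p_str_bound = (1 - beta * mu) ** s
    return max(theta * p_med, (1 - theta) * p_str_bound)

n, s, mu, theta = 8, 20, 3, 0.9
sets = [{0, 1}, {0, 2}]          # hypothetical small stopping sets of H
uniform = [1 / n] * n            # random sampling baseline, beta = 1/n
tilted = [0.1] * n               # floor beta = 0.1 on every symbol...
tilted[0] += 1 - sum(tilted)     # ...with the spare mass on VN 0
j_uni = objective(uniform, 1 / n, sets, mu, s, theta)
j_tilt = objective(tilted, 0.1, sets, mu, s, theta)
```

Minimizing the per-request miss probabilities via an epigraph variable is what makes the problem LP-representable, since $(\cdot)^s$ is monotone for a fixed sample size s.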

A. Linear-programming-sampling (LP-sampling) for DA attacks on any layer of the CMT
In this subsection, we modify LP (3) to account for a DA attack conducted on any layer of the CMT and derive the sampling strategy based on the modified LP. We first align the columns of the parity check matrices of the different layers of the CMT using the procedure described in Section III-A. The stopping sets and VNs mentioned in the following are based on the aligned parity check matrices.
A sampling strategy $(x, \beta^{(l)})$ specifies the probability that different base layer symbols are sampled by a light node for each base layer sample request. This strategy induces a certain probability of sampling the symbols of the intermediate layers of the CMT via the Merkle proofs of the base layer samples (however, sampling different symbols is no longer a set of disjoint events in the intermediate layers). To calculate the probability that each intermediate layer symbol is sampled, we define for each j, 1 ≤ j ≤ l − 1, a matrix $A^{(j)}$ of size $n_j \times n_l$ whose entries are as follows: $A^{(j)}_{ki} = 1$ if the $k$th symbol of layer j is part of the Merkle proof of the $i$th base layer symbol, and $A^{(j)}_{ki} = 0$ for all other cases. For 1 ≤ j ≤ l − 1, the $i$th column of $A^{(j)}$ corresponds to the $i$th base layer symbol, and the non-zero positions in the $i$th column (two per column) correspond to the symbols of layer j that are part of the Merkle proof of the $i$th base layer symbol. For notational simplicity, assume that $A^{(l)}$ is an $n_l \times n_l$ identity matrix. For a sampling strategy $(x, \beta^{(l)})$, let $x^{(j)} = A^{(j)}x$, 1 ≤ j ≤ l − 1. Then it is easy to see that $x^{(j)}_k$ is the probability that the $k$th symbol of layer j is sampled by the sampling strategy $(x, \beta^{(l)})$. Consider a stopping set ψ that belongs to an intermediate layer j. Note that the Merkle proof for a base layer sample contains a single data symbol and a single parity symbol from layer j and is deterministic given the base layer sample. If both of these symbols (VNs) are in ψ, it is possible for a single base layer symbol to sample ψ at two VNs. To avoid over-counting, we define for 1 ≤ j ≤ l − 1 the matrices $\Delta^{(j)} = \min(\Pi^{(j)}A^{(j)}, 1)$, where the minimum is element-wise. $\Delta^{(j)}$ has the property that $\Delta^{(j)}_{ki}$ is 1 if the $i$th base layer symbol samples, via its Merkle proof from layer j, the $k$th stopping set of $S_j$, and zero otherwise. Using the above matrices, we calculate the probabilities of failure $P^{(j)}_{f,med}(s)$ and $P^{(j)}_{f,str}(s)$ when a medium and a strong adversary conduct a DA attack on intermediate layer j. First, we consider the following
definition. Definition 3. A sampling (with replacement) strategy $(x, \beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(l)})$ is a sampling strategy $(x, \beta^{(l)})$ such that, for each j, 1 ≤ j ≤ l − 1, $x^{(j)}_k \ge \beta^{(j)}$ for all k, where $\beta^{(j)}$ is a non-negative real number that controls the minimum probability of requesting a given symbol from layer j of the CMT.
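The induced intermediate-layer distribution $x^{(j)} = A^{(j)}x$ is a single matrix-vector product. A toy sketch (with one proof symbol per base symbol, instead of the two per column used in the CMT, to keep the example small):

```python
def induced_layer_probs(A, x):
    """Compute x_j = A @ x: the probability that each layer-j symbol is
    picked up via the Merkle proof of a base-layer sample drawn from x.
    A[k][i] = 1 iff layer-j symbol k is in the proof of base symbol i."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Toy layer with 2 symbols above 4 base symbols.
A = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
x = [0.4, 0.1, 0.4, 0.1]
x_j = induced_layer_probs(A, x)
```

Because each base symbol's proof covers a fixed set of layer-j symbols, lower bounds on the entries of x translate directly into lower bounds $\beta^{(j)}$ on the entries of $x^{(j)}$.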
For a sampling strategy $(x, \beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(l)})$, it is not difficult to see that
$$P^{(j)}_{f,med}(s) = \max_{1 \le k \le |S_j|} \big(1 - (\Delta^{(j)}x)_k\big)^{s}, \quad 1 \le j \le l.$$
Now, let us consider the strong adversary. Since a Merkle proof contains one data and one parity symbol from every intermediate layer, all data (parity) symbols are sampled disjointly. As such, we can lower bound the per-request probability of sampling a stopping set ψ of size $\ge \mu_j$, 1 ≤ j ≤ l − 1, both by $\sum_{i \in \psi:\, i \text{ data}} x^{(j)}_i$ and by $\sum_{i \in \psi:\, i \text{ parity}} x^{(j)}_i$. Summing the two inequalities and dividing by 2 yields a per-request sampling probability of at least $\frac{1}{2}\beta^{(j)}\mu_j$, which we use as a bound for the strong adversary. We define $P^{(j)}_{f,str\text{-}bound}(s) := \big(1 - \frac{1}{2}\beta^{(j)}\mu_j\big)^{s}$ and $P^{(j)}_{f,str}(s) := \max\big(P^{(j)}_{f,med}(s), P^{(j)}_{f,str\text{-}bound}(s)\big)$. We find the light node sampling strategy by formulating an LP in the variables $(x, \beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(l)})$ based on the above probabilities (although it is not written in the form of an LP, there exists an equivalent LP representation of problem (4)):
$$\min_{x,\, \beta^{(1)}, \ldots, \beta^{(l)}} \ \max_{1 \le j \le l} \max\Big(\theta^{(j)} P^{(j)}_{f,med}(s),\ \big(1-\theta^{(j)}\big)\big(1 - \xi^{(j)}\beta^{(j)}\mu_j\big)^{s}\Big), \qquad (4)$$
where $\xi^{(j)} = \frac{1}{2}$ for 1 ≤ j < l and $\xi^{(l)} = 1$. The first and second terms in the outer maximum above correspond to the probability of failure against the medium and the strong adversary, respectively, for a DA attack on the different layers of the CMT. The $\theta^{(j)}$'s are trade-off parameters that control the importance given to a strong adversary on layer j of the CMT compared to a medium adversary.
The sampling strategy $(x, \beta^{(1)}, \beta^{(2)}, \ldots, \beta^{(l)})$ obtained as the optimal solution of LP (4) is called LP-sampling. To reduce the probability of failure against a medium and a strong adversary under LP-sampling, we next design LDPC codes aimed at providing a small probability of failure under LP-sampling by optimizing the probability for each layer separately.

B. Linear-programming-Constrained PEG (LC-PEG) Algorithm
In this section, we demonstrate how to design LDPC codes that perform well under LP-sampling. We design such LDPC codes by modifying the CN selection procedure in the PEG algorithm. We call our code construction the linear-programming-constrained PEG (LC-PEG) algorithm, since it seeks to minimize the optimal objective value of an LP. Our key idea is to optimize cycles in the graph instead of stopping sets, similar to the EC-PEG algorithm. The motivation for focusing on cycles is the following: for lists C and S of cycles and stopping sets, respectively, such that for every ψ ∈ S there exists an O ∈ C which is part of ψ, we have
$$\max_{\psi \in S} \Big(1 - \sum_{i \in \psi} x_i\Big)^{s} \le \max_{O \in C} \Big(1 - \sum_{i \in O} x_i\Big)^{s}.$$
Thus, the optimal objective value of LP (3) can be upper bounded by the optimal objective value of a modified version of LP (3) which is based on cycles. We select CNs in the PEG algorithm depending on the optimal objective value they produce on the modified LP. Algorithm 4 presents our LC-PEG algorithm for constructing a TG G with n VNs, m CNs, and VN degree $d_v$. All ties in the algorithm are broken randomly.
In the LC-PEG algorithm, we use the concept of the extrinsic message degree (EMD) of a set of VNs, which allows us to rank the harm a cycle may cause in creating stopping sets. The EMD of a set of VNs is the number of CN neighbors singly connected to the set [25] and is calculated using the method in [26]. The EMD of a cycle is the EMD of the VNs involved in the cycle. Low-EMD cycles are more likely to form stopping sets; we term cycles with EMD below a threshold $T_{th}$ as bad cycles and use them to form the modified linear program, LP (5), obtained from LP (3) by replacing the enumerated stopping sets with the list of bad cycles. The LC-PEG algorithm uses LP (5) via the procedure LP-objective($\tilde{L}$, $\tilde{G}$), which outputs its optimal objective value. The procedure takes as inputs a list $\tilde{L} = \{O_1, \ldots, O_{|\tilde{L}|}\}$ of cycles and a TG $\tilde{G}$. Let $\tilde{G}$ have n VNs $\{v_1, \ldots, v_n\}$. Here, C is a matrix of size $|\tilde{L}| \times n$ such that $C_{ki} = 1$ if VN $v_i$ is part of cycle $O_k$, and $C_{ki} = 0$ otherwise. In the LC-PEG algorithm, we use the procedure PEG() defined in Section III-B for the EC-PEG algorithm. The LC-PEG algorithm proceeds exactly as the EC-PEG algorithm when the PEG() procedure returns cycle length $g \ge g_c$. When the PEG() procedure returns cycle length $g < g_c$, we select a CN from the set of candidate CNs K such that the resulting LDPC code has a low optimal objective value of LP (3). We explain the CN selection procedure as follows.
While progressing through the LC-PEG algorithm, we maintain a list $\tilde{L}$ which contains cycles of length $g < g_c$ that had EMD less than or equal to the threshold $T_{th}$ when they were formed.
Cycles in $\tilde{L}$ are considered bad cycles, and we base our CN selection procedure on these cycles.
When the PEG() procedure returns candidate CNs K, we first select the set of CNs $K_{mindeg}$ that have the minimum degree under the current TG setting $\tilde{G}$ (line 6). Of the CNs in $K_{mindeg}$, we select the set of CNs $K_{mincycles}$ that form the minimum number of new g-cycles if an edge is established between the CN and $v_j$ (line 9). Now, for every CN c in $K_{mincycles}$, we find the list $L^c_{cycles}$ of new g-cycles formed due to the addition of an edge between c and $v_j$ (line 11) and compute LP-objective($\tilde{L} \cup L^c_{cycles}$, $\tilde{G}$) to get cost[c] (line 12). Our modified CN selection procedure is to select a CN in $K_{mincycles}$ that has the minimum cost[c] (line 13). After selecting $c_{sel}$ using the above criterion, we update $\tilde{L}$ as follows: let $L_{sel}$ be the list of g-cycles in $L^{c_{sel}}_{cycles}$ that have EMD ≤ $T_{th}$. We add $L_{sel}$ to $\tilde{L}$ (line 14). Finally, we update the TG $\tilde{G}$ (line 15). Remark 4. We empirically observed that reducing the number of cycles in the TG (and hence the number of stopping sets) reduces the probability of failure against the medium and strong adversary when LP-sampling is employed, even if the size of the smallest stopping set remains unchanged. This is in contrast to random sampling, where the probability of failure depends only on the size of the smallest stopping set and is agnostic to the number of small stopping sets present in the code. Based on this observation, we have added line 9 in our LC-PEG algorithm, which selects the CNs $K_{mincycles}$ that form the minimum number of cycles when a new edge is established. However, we further make an informed choice among the CNs in $K_{mincycles}$ by selecting a CN that has the minimum optimal objective value of LP (5).
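The EMD computation used to classify bad cycles is straightforward given the parity check matrix. A sketch (toy matrix; a VN set forming a fully absorbed cycle gets EMD 0, flagging it as harmful):

```python
def emd(vn_set, H):
    """Extrinsic message degree of a VN set: the number of check nodes
    connected to exactly one VN in the set (singly-connected CNs).
    H is a parity-check matrix given as a list of rows."""
    return sum(1 for row in H if sum(row[i] for i in vn_set) == 1)

H = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 1, 1],
     [0, 0, 0, 1]]
# VNs {0, 1}: CN0 sees both (doubly connected), CN1 and CN2 see one each.
e_pair = emd({0, 1}, H)
# VNs {0, 1, 2} form a 6-cycle with no singly-connected CN: EMD 0,
# i.e., a "bad" cycle likely to participate in a stopping set.
e_cycle = emd({0, 1, 2}, H)
```

A cycle whose EMD falls below $T_{th}$ is appended to $\tilde{L}$ and thereafter penalized through the LP-objective cost of every candidate CN.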

V. SIMULATION RESULTS
In this section, we present the performance of the EC-PEG algorithm and the greedy sampling (GS) strategy in mitigating DA attacks conducted by a weak adversary, and the performance of the LC-PEG algorithm and LP-sampling (LS) in mitigating DA attacks conducted by a medium and a strong adversary. We compare the performance achieved by our methods with that of codes designed by the original PEG algorithm and the performance achieved by [2] using random LDPC codes and random sampling (RS). The different CMTs used for simulation are represented by the 4-tuple of parameters T = ($n_l$, R, q, l), where the individual parameters were defined in Section II-A. For a CMT T, in order to compare the performance of the different PEG based codes, we choose $\mu_j = \omega^{(j),PEG}_{\min} + \gamma$, 1 ≤ j ≤ l, for the various adversary models described in Section II-D.
Here, $\omega^{(j),PEG}_{\min}$ is the minimum stopping set size of the LDPC code constructed using the original PEG algorithm for layer j of the CMT T, and γ is a parameter. We calculate the probability of failure when the light nodes request s base layer samples using random sampling for various scenarios as follows: for the base layer, when the adversary hides a stopping set of size ω, we substitute $x = \frac{1}{n_l}\mathbf{1}_{n_l}$ in the probability of failure expressions provided in Section IV-A, where $\mathbf{1}_{n_l}$ is a vector of ones of length $n_l$; for an LDPC code with stopping ratio ν* (the smallest stopping set size divided by the code length), we calculate the probability of failure at the base layer under random sampling as $(1 - \nu^*)^s$. $P^{(4)}_{f,\omega}(s)$ quickly becomes zero for ω = 9 and ω = 10 using greedy sampling as s increases; hence, we have not included these stopping set sizes in Fig. 4 left panel. The figure demonstrates three benefits of our co-design. The first benefit is due to the use of deterministic constructions to design finite-length codes that provide larger stopping set sizes than random ensembles, as can be seen by comparing the black and green curves. The second benefit comes from using the greedy sampling strategy as opposed to random sampling, which can be observed from the reduction in $P^{(4)}_{f,\omega}(s)$ between the green and red curves. The final benefit is provided by using the concentrated LDPC codes designed by the EC-PEG algorithm, as can be seen by comparing the red and blue curves. All of these benefits combine to significantly reduce $P^{(4)}_{f,\omega}(s)$ compared to the black curve, which corresponds to the approach proposed in earlier literature.
In Fig. 4 right panel, we plot the probability of failure $P^{(j)}_f(s)$, j = 1, 2, 3, 4, where $S^{(j)}_{greedy}$ denotes the samples of layer j collected, 1 ≤ j ≤ 4, (after the alignment process) when the light nodes request s greedy samples from the base layer of CMT $T_1$. The black curve is achieved using random sampling and a random LDPC code as in [2] with stopping ratio ν* = 0.064353. The value ν* is the best stopping ratio obtained for a rate 0.5 code following the method in [2, Section 5.3] using parameters (c, d) = (8, 16). The lines in green represent $P^{(j)}_f(s)$, j = 1, 2, 3, 4, for the EC-PEG algorithm (solid) and the original PEG algorithm (dotted) when greedy sampling is used. We observe that the base layer of the CMT ($L_4$) has a larger probability of failure than the other layers, and that the probability of failure for the intermediate layers quickly becomes very small. This is due to the alignment of the columns of the parity check matrices, which ensures that each intermediate layer is greedily sampled. We next observe that the EC-PEG algorithm with greedy sampling results in a lower $P^{(j)}_f(s)$ than the original PEG algorithm for all layers of the CMT. Moreover, for the base layer, $P^{(4)}_f(s)$ (for both EC-PEG and original PEG coupled with greedy sampling) is lower than the probability of failure using random sampling for ω = 14 and the probability of failure achieved by random LDPC codes and random sampling (black curve). Thus, in combination, the co-design of concentrated LDPC codes and a greedy sampling strategy results in a significantly lower $P^{(4)}_{f,\omega}(s)$ compared to the methods proposed in [2]. In Figs. 5, 6 and Table I, we demonstrate the performance of the LC-PEG algorithm and LP-sampling (LS). Figs. 5 and 6 correspond to CMT $T_1$ = (128, 0.5, 4, 4), where we have used the following parameters. For the adversary model, we have used γ = 4, thus $\mu_j = \omega^{(j),PEG}_{\min} + 4$, where the $\omega^{(j),PEG}_{\min}$ for different j are listed in Table II. The $\mu_j$'s are also parameters for LP-sampling.
Additionally, for LP-sampling, we have used $\theta^{(4)}$ = 0.993 and $\theta^{(j)}$ = 1, j = 1, 2, 3 (the same trade-off parameters are used in the LC-PEG construction). We first look at the improvements provided by LP-sampling over random sampling. Fig. 5 shows the performance of LP-sampling under a DA attack at different layers of the CMT constructed using the PEG, LC-PEG, and MC-PEG algorithms. We see that, while the probability of failure for some layers worsens in comparison to random sampling, for the worst layer, which is the base layer, the probability of failure improves for both the strong and the medium adversary. It is understandable that the base layer is the worst layer, since each symbol in the intermediate layers is part of the Merkle proof of multiple base layer symbols and hence has a higher probability of getting sampled than the base layer symbols. We generally find that the base layer is the worst layer, so we focus on the base layer in the subsequent simulations.
To compare the performance of the PEG, MC-PEG, and LC-PEG algorithms using LP-sampling, we plot $P^{(4)}_f(s = 30)$ for the strong and medium adversary as a function of $\theta^{(4)}$ for $\theta^{(j)}$ = 1, j = 1, 2, 3. The green curve plots $P^{(4)}_f(s = 0.25 n_l)$ for a DA attack on the base layer of the CMT for various CMT parameters, coding schemes, and sampling strategies. The parameters used for the different CMTs are listed in Table II. For R = 0.4, we get stopping ratio ν* = 0.085059 following the method in [2, Section 5.3] using parameters (c, d) = (6, 10).
The third improvement comes from utilizing the key characteristic of the MC-PEG algorithm of reducing the number of small cycles, as discussed in Remark 4, to produce codes that improve the probability of failure under LP-sampling. The final improvement comes from utilizing the informed check node selection in the LC-PEG algorithm to create codes tailored for LP-sampling, as seen by comparing the dark and light blue curves. Overall, the co-design of the LC-PEG algorithm and LP-sampling results in a lower probability of failure. In Fig. 6 right panel, we plot $P^{(4)}_f(s = 30)$ as a function of the parameter $\theta^{(4)}$ for the original PEG, MC-PEG, and LC-PEG algorithms using LP-sampling. From Fig. 6 right panel, we see that $\theta^{(4)}$ controls the trade-off between the probabilities of failure for the medium and the strong adversary. Thus, $\theta^{(4)}$ can be chosen as a hyper-parameter based on the system specifications. We also see from Fig. 6 right panel that for all values of $\theta^{(4)}$, the LC-PEG algorithm outperforms the PEG and MC-PEG algorithms for both the medium and the strong adversary. For completeness, we provide further examples of how our novel code constructions improve the probability of failure for different CMT parameters. In Table I, we list the probability of failure at the base layer when the light nodes sample 25% of the base layer coded symbols and compare various sampling strategies and LDPC code constructions. Similar to Fig. 6 left panel, from Table I we see that the novel co-design of the LC-PEG algorithm and LP-sampling results in the lowest probability of failure for the different CMT parameters.
In Table III, we compare the maximum CN degree of the LDPC codes used in the different layers of the CMT for various construction techniques. We see that the PEG-based constructions have maximum CN degrees similar to those of the ensemble LDPC codes used in [2]. Since the incorrect-coding proof size is proportional to the maximum CN degree, we conclude that the improved probability of failure offered by the new LDPC code constructions does not come at the cost of a significantly larger incorrect-coding proof size.

VI. CONCLUSION
In this paper, we considered the problem of data availability attacks pertinent to blockchain systems that have light nodes and a majority of malicious full nodes. For various strengths of the malicious nodes, we demonstrated that a suitable co-design of specialized LDPC codes and the light node sampling strategy can result in a much lower probability of failure for the light nodes to detect data availability attacks compared to schemes proposed in prior literature. As a future extension of this work, we are investigating whether other code families, such as Reed-Muller codes and polar codes, outperform LDPC codes in this application.

Fig. 1: Left: Data Availability (DA) attack; Right: detection of a DA attack via light node sampling.

Fig. 2: Left panel: construction process of a CMT. A block of size b is partitioned into k data chunks (data symbols), each of size b/k, and a rate-R systematic LDPC code is applied to generate n coded symbols. These n coded symbols form the base layer of the CMT. The n coded symbols are then hashed using a hash function, and the hashes of every q coded symbols are concatenated to form one data symbol of the parent layer. The data symbols of this layer are again coded using a rate-R systematic LDPC code, and the coded symbols are further hashed and concatenated to form the data symbols of its parent layer. This iterative process continues until there are only t (t > 1) hashes in a layer, which form the CMT root. The left panel shows a CMT with n = 16, q = 4, R = 0.5, and t = 4. The circled symbols in L_1 and L_2 are the Merkle proof of the circled symbol in L_3. Right panel: DA attack on the CMT.
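The layer-construction process described in the caption can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the systematic LDPC encoder is stubbed out (the "parity" symbols are placeholders derived from hashes), and all names are ours.

```python
import hashlib

def hash_symbol(sym: bytes) -> bytes:
    return hashlib.sha256(sym).digest()

def encode_systematic(data_symbols, rate):
    # Placeholder for a rate-R systematic LDPC encoder: the k data symbols
    # are kept and n - k dummy "parity" symbols are appended. A real CMT
    # would compute true LDPC parity symbols here.
    k = len(data_symbols)
    n = int(k / rate)
    parity = [hash_symbol(b"".join(data_symbols) + bytes([i])) for i in range(n - k)]
    return list(data_symbols) + parity

def build_cmt(block: bytes, k: int, rate: float, q: int, t: int):
    """Return the list of CMT layers, base layer first; the hashes of the
    final layer (of size t) form the CMT root."""
    chunk = len(block) // k
    symbols = [block[i * chunk:(i + 1) * chunk] for i in range(k)]
    coded = encode_systematic(symbols, rate)
    layers = [coded]
    while len(coded) > t:
        hashes = [hash_symbol(c) for c in coded]
        # concatenate every q hashes into one data symbol of the parent layer
        parents = [b"".join(hashes[i:i + q]) for i in range(0, len(hashes), q)]
        coded = encode_systematic(parents, rate)
        layers.append(coded)
    return layers
```

With the caption's parameters (k = 8, R = 0.5, q = 4, t = 4), the layer sizes come out as 16, 8, 4, matching the example CMT in the left panel.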

Lemma 1. Let SS_κ denote the set of all weight-κ stopping sets of H. For a weak adversary that randomly hides a stopping set from SS_κ, the probability of failure at the base layer, P_f^{(l)}(s), when the light nodes use s samples and any sampling strategy, satisfies P_f^{(l)}(s) ≥ 1 − max_{S⊆V, |S|=s} τ(S, κ), where τ(S, κ) is the fraction of stopping sets of weight κ touched by the subset of VNs S of H. Let S_κ^opt = argmax_{S⊆V, |S|=s} τ(S, κ). The lower bound in the above equation is achieved when the light nodes sample, with probability one, the set S_κ^opt.

Proof. The proof is straightforward and omitted for brevity. The key idea is as follows. For SS_κ = {ψ_1, ..., ψ_{|SS_κ|}}, where ψ_i is a stopping set of H of weight κ, and any S ⊆ V with |S| = s, the probability of failure for the weak adversary when the light nodes sample the VNs in S is P_f^{(l)}(s) = 1 − τ(S, κ), since the attack goes undetected exactly when the hidden stopping set, chosen uniformly from SS_κ, is not touched by S.
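For toy codes, the quantity τ(S, κ) in Lemma 1 can be computed by brute force. A minimal sketch, assuming H is given as a 0/1 list of rows; the function names are ours:

```python
from itertools import combinations

def is_stopping_set(H, S):
    # S is a stopping set if no CN (row of H) has exactly one neighbor in S
    return all(sum(row[v] for v in S) != 1 for row in H)

def stopping_sets(H, kappa):
    # all weight-kappa stopping sets of H (brute force, toy codes only)
    n = len(H[0])
    return [set(S) for S in combinations(range(n), kappa) if is_stopping_set(H, S)]

def tau(S, sets):
    # fraction of the given stopping sets touched by the sampled VN set S
    return sum(1 for psi in sets if psi & S) / len(sets)

# Toy H: SS_2 = {{0,1}, {2,3}}. Sampling S = {0} touches half of them, so a
# weak adversary hiding a uniformly chosen set escapes with probability 1/2.
H = [[1, 1, 0, 0],
     [1, 1, 1, 1]]
sets = stopping_sets(H, 2)
```

Here Lemma 1's bound is tight: with s = 2 samples, the set {0, 2} touches both stopping sets, so τ = 1 and the weak adversary is always caught.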

Algorithm 1 (greedy sampling), excerpt:
4: v = VN selected from V_s uniformly at random
5: S_greedy = S_greedy ∪ {v}; purge v and all its incident edges from G
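The purge step above suggests a greedy cover of the stopping sets. Below is a sketch under our assumed representation: a plain list of stopping sets stands in for the bipartite graph G, and ties are broken by lowest index, whereas line 4 of Algorithm 1 breaks ties uniformly at random.

```python
def greedy_sample(stopping_sets, n, s):
    """Pick s VNs; each round selects a VN touching the most not-yet-purged
    stopping sets, then purges that VN and every stopping set it touches."""
    remaining = [set(psi) for psi in stopping_sets]
    S_greedy = set()
    for _ in range(s):
        counts = [0] * n
        for psi in remaining:
            for v in psi:
                counts[v] += 1
        v = max(range(n), key=lambda i: counts[i])  # ties -> lowest index
        S_greedy.add(v)
        remaining = [psi for psi in remaining if v not in psi]  # purge
    return S_greedy

# With SS = {{0,1}, {2,3}} and s = 2 samples, the greedy choice {0, 2}
# touches every stopping set, so the weak adversary is always detected.
```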

Fig. 3: Results for LDPC codes with R = 0.5, d_v = 4, n = 128 using different PEG algorithms. The x-axis in all the plots is the VN indices v_i arranged in decreasing order of the 6-cycle fractions ζ^6_i (for the respective codes); Left panel: 6-cycle and 8-cycle distributions ζ^6 and ζ^8 generated using the EC-PEG and original PEG algorithms; Middle panel: stopping set distribution ss^13; Right panel: stopping set distribution ss^14. The lines in the middle and right panels are the best-fit lines for ss^κ, indicating the graph slope.

Algorithm 4 LC-PEG Algorithm
1: Inputs: n, m, d_v, g_c, T_th, θ, μ; Outputs: G, g_min
2: Initialize G to n VNs, m CNs, and no edges; L = ∅
3: for j = 1 to n do
4:   for k = 1 to d_v do
5:     [K, g] = PEG(G, v_j)
6:     K_mindeg = CNs in K with the minimum degree under the current TG setting G
7:     if g ≥ g_c then c_sel = a CN selected uniformly at random from K_mindeg
8:     else (g-cycles, g < g_c, are created)
9:       K_mincycles = CNs in K_mindeg that result in the minimum number of new g-cycles due to the addition of an edge between the CN and v_j
10:      for each c in K_mincycles do
11:        L_c^cycles = new g-cycles formed in G due to the addition of an edge between c and v_j
12:        cost[c] = LP-objective(L ∪ L_c^cycles, G)
13:      c_sel = CN in K_mincycles with minimum cost[c]

For DA attacks on the different layers, we calculate the probability of failure for the medium and strong adversary by substituting the uniform sampling distribution x = (1/n_l) 1_{n_l}; under random sampling of a code with stopping ratio ν*, the probability of failure is P_f(s) = (1 − ν*)^s. The LDPC codes used to construct the different layers of the CMTs are aligned using Algorithm 2, for which we use the parameters g_max = g_c (where g_c determines the maximum observed cycle length in the code constructions) and g_min set to the girth of the respective codes.
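The expression P_f(s) = (1 − ν*)^s also gives a quick way to size the number of samples. A small sketch, assuming samples are drawn independently; `samples_needed` is our helper, not a quantity from the paper:

```python
import math

def random_sampling_failure(nu_star, s):
    # Each of the s independent samples misses a hidden stopping set of
    # relative size nu_star with probability 1 - nu_star, so the light node
    # fails to detect the attack with probability (1 - nu_star)^s.
    return (1.0 - nu_star) ** s

def samples_needed(nu_star, target):
    # smallest s with (1 - nu_star)^s <= target
    return math.ceil(math.log(target) / math.log(1.0 - nu_star))
```

Plugging in a stopping ratio such as the ν* values quoted in this section shows how quickly the failure probability decays with s under random sampling.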

Fig. 4 demonstrates the performance of the EC-PEG algorithm and the greedy sampling strategy for CMT T_1 = (128, 0.5, 4, 4). For the EC-PEG code construction, we have used the parameters d_v = 4 for all layers, R = 0.5 (specified by the CMT parameter), g_c^{(4)} = 10 and

Fig. 4: The probability of light node failure for various coding schemes and sampling strategies for CMT T_1 = (128, 0.5, 4, 4); Left panel: probability of failure for a DA attack on the base layer for different stopping set sizes. The black curve is achieved using random sampling and a random LDPC code as in [2] with stopping ratio ν* = 0.064353. The value ν* is the best stopping ratio obtained for a rate 0.5 code following the method in [2, Section 5.3] using parameters (c, d) = (8, 16). The lines in green represent P_{f,ω}^{(4)}(s) obtained using random sampling for different ω. The curves in red and blue demonstrate the use of the greedy sampling strategy described in Algorithm 1 for codes designed by the original PEG and EC-PEG algorithms, respectively, where P_{f,ω}^{(4)}(s) is calculated with S = S_greedy and S_greedy is the output of Algorithm 1; Right panel: probability of failure across the different layers of the CMT. The top black and green curves are the same as in the left panel. The rest of the curves correspond to P_{f,ω}^{(j)}(s) at the different layers j.

Fig. 5: For CMT T_1 = (128, 0.5, 4, 4), the probability of failure for a DA attack at different layers when the base layer is randomly sampled (circle marked), and the probability of failure when a strong (Str-bound, triangle marked) and a medium (Med, square marked) adversary conduct a DA attack on different layers and the base layer is sampled using LP-sampling. For all layers, we plot P^{(j)}_{f,str-bound}(s) (Str-bound) instead of P^{(j)}_{f,str}(s). Left panel: original PEG; Middle panel: LC-PEG; Right panel: MC-PEG.

Fig. 6: The probability of light node failure for a DA attack on the base layer of CMT T_1 = (128, 0.5, 4, 4); Left panel: comparison of different coding schemes and sampling strategies. The black curve is achieved when the base layer uses random LDPC ensembles and is randomly sampled. The magenta curve is achieved when the base layer uses the PEG code (ω^{(4)}_min = 9) and is randomly sampled. The rest of the plots are the probability of failure for the medium (solid lines) and strong (dotted lines) adversary using LP-sampling (LS) and different PEG codes; Right panel: variation in P_f^{(4)}(s) with the parameter θ^{(4)}.

Consider a parity check matrix H with VN set V = {v_1, ..., v_n} and TG G. Consider the following definition.

Definition 1. For a parity check matrix H, let ss^κ = (ss^κ_1, ss^κ_2, ..., ss^κ_n) denote the VN-to-stopping-set distribution of weight κ, where ss^κ_i is the fraction of stopping sets of H of weight κ touched by v_i. Similarly, for a parity check matrix H, let ζ^g = (ζ^g_1, ζ^g_2, ..., ζ^g_n) be the VN-to-g-cycle distribution, where ζ^g_i is the fraction of g-cycles of H touched by v_i. We informally say that the distribution ss^κ (ζ^g) is concentrated if a small set of VNs has high corresponding stopping set (g-cycle) fractions ss^κ_i (ζ^g_i).
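For g = 4, the cycle distribution ζ^g of Definition 1 can be computed directly on toy matrices, since every 4-cycle in the Tanner graph corresponds to a pair of VNs sharing a pair of CNs. A brute-force sketch; the function name is ours:

```python
from itertools import combinations

def four_cycle_distribution(H):
    """VN-to-4-cycle distribution zeta^4 (Definition 1): one 4-cycle per
    (VN pair, CN pair) with all four incidences present. Brute force, for
    toy matrices only; assumes H contains at least one 4-cycle."""
    m, n = len(H), len(H[0])
    cycles = []
    for v1, v2 in combinations(range(n), 2):
        shared = [c for c in range(m) if H[c][v1] and H[c][v2]]
        # each unordered pair of shared CNs closes one 4-cycle through v1, v2
        cycles += [(v1, v2)] * (len(shared) * (len(shared) - 1) // 2)
    return [sum(1 for pair in cycles if i in pair) / len(cycles) for i in range(n)]

# Toy H: the only 4-cycle uses v0 and v1 (they share CNs c0 and c1), so
# zeta^4 is concentrated on those two VNs.
H = [[1, 1, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 1]]
```

In the terminology of Definition 1, this ζ^4 is concentrated: the fractions are 1 for v_0 and v_1 and 0 for the remaining VNs.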

TABLE II: Parameters used for LP-sampling and LC-PEG code construction for the various CMTs in Table I. For all LDPC codes we use d_v = 4, g_c ... Under each variable that depends on the layer, we enumerate the layer numbers.

TABLE III: Maximum CN degree for the LDPC codes used in different layers of the CMT. Under each algorithm, we enumerate the layer numbers and specify the maximum CN degree for that layer.