
Payload-Byte: A Tool for Extracting and Labeling Packet Capture Files of Modern Network Intrusion Detection Datasets
  • Yasir Ali Farrukh
  • Irfan Khan
  • Syed Wali
  • David Bierbrauer
  • Nathaniel Bastian
  • John Pavlik
Irfan Khan
Texas A&M University

Corresponding Author: [email protected]

Adapting modern approaches to network intrusion detection is becoming critical, given the rapid pace of technological advancement and the rate of adversarial attacks. Packet-based methods that use payload data are therefore gaining popularity because of their effectiveness in detecting certain attacks. However, packet-based approaches suffer from a lack of standardization, which leads to comparability and reproducibility issues. Unlike the flow-based case, no standard labeled packet-based dataset exists, forcing researchers to build bespoke labeling pipelines for each approach. Without a standardized baseline, proposed approaches cannot be compared or evaluated against each other, and one cannot gauge whether a proposed approach is a methodological advancement or merely benefits from a proprietary interpretation of the dataset. To address these comparability and reproducibility issues, we introduce Payload-Byte, an open-source tool for extracting and labeling network packets. Payload-Byte uses metadata information to label the raw traffic captures of modern intrusion detection datasets in a generalized manner. We then transform the labeled data into byte-wise feature vectors that can be used to train machine learning models. The whole cycle of processing and labeling is explicitly stated in this work. Furthermore, the source code and processed data are made publicly available so that they may serve as a standardized baseline for future research. Lastly, we present a brief comparative analysis of machine learning models trained on packet-based and flow-based data.

UNSW-NB15 and CIC-IDS2017.
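To make the two steps named in the abstract more concrete (labeling packets from dataset metadata, then turning each payload into a byte-wise feature vector), the following is a minimal sketch and not the Payload-Byte implementation. The vector length `MAX_LEN`, the `FlowRecord` fields, the 5-tuple plus time-window matching rule, and the `"unlabeled"` default are all assumptions made for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass

# Assumed fixed vector length; the abstract does not state the value used.
MAX_LEN = 1500


def payload_to_vector(payload: bytes, length: int = MAX_LEN) -> list:
    """Truncate or zero-pad a payload so each packet yields `length` byte values (0-255)."""
    clipped = payload[:length]
    return list(clipped) + [0] * (length - len(clipped))


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical row of a dataset's labeled flow metadata."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str
    start: float  # flow start time, seconds
    end: float    # flow end time, seconds
    label: str


def label_packet(five_tuple, timestamp, flows, default="unlabeled"):
    """Return the label of the first flow whose 5-tuple and time window cover the packet.

    This matching rule is an assumption; a real pipeline may also need to
    handle reversed directions, clock offsets, and overlapping flows.
    """
    for flow in flows:
        key = (flow.src_ip, flow.dst_ip, flow.src_port, flow.dst_port, flow.protocol)
        if key == five_tuple and flow.start <= timestamp <= flow.end:
            return flow.label
    return default
```

A packet whose 5-tuple and timestamp fall inside a metadata record inherits that record's label, and its payload becomes one fixed-length row of integers that a machine learning model can consume directly. A linear scan over the flow records is used only for readability; a dictionary keyed by 5-tuple would be the natural choice for captures of realistic size.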