
TokenFree: A Tokenization-Free Generative Linguistic Steganographic Approach with Enhanced Imperceptibility
  • Ruiyi Yan,
  • Tianjun Song,
  • Yating Yang
Ruiyi Yan
School of Cyberspace Science and Technology

Corresponding Author: [email protected]


Abstract

Since tokenization serves as a fundamental preprocessing step in numerous language models, tokens naturally constitute the basic embedding units for generative linguistic steganography. However, token-based methods encounter challenges including limited embedding capacity and possible segmentation ambiguity. Although character-level linguistic steganographic approaches exist, they neglect the problem of generating unknown or out-of-vocabulary words, potentially compromising steganographic imperceptibility. In this letter, we address both embedding capacity and imperceptibility in a tokenization-free linguistic steganographic approach. First, we argue that unknown words mainly stem from low-entropy distributions and rigid coding rules within candidate pools, and we therefore propose an entropy-based selection approach to flexibly construct candidate pools. Further, we present a lexical emphasis approach, prioritizing characters within candidate pools that can form in-vocabulary words. Experiments show that, across a range of high embedding rates, our approaches achieve considerably higher imperceptibility and about 17% higher anti-steganalysis capacity than the baseline method without our approaches.
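The two ideas summarized above, entropy-based candidate-pool selection and lexical emphasis, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the `entropy_threshold` parameter, the power-of-two pool sizing, and the toy prefix-based vocabulary check are all assumptions made for illustration.

```python
import math

def build_candidate_pool(probs, entropy_threshold=1.0):
    """Illustrative entropy-based candidate-pool selection (assumed scheme).

    `probs` maps candidate characters to next-character probabilities from a
    character-level language model. When the distribution's entropy falls
    below `entropy_threshold`, the pool collapses to the single most likely
    character (no secret bits embedded); otherwise the top-ranked characters
    form a power-of-two-sized pool suitable for bit embedding.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    ranked = sorted(probs, key=probs.get, reverse=True)
    if entropy < entropy_threshold:
        return [ranked[0]]  # low entropy: emit the top character directly
    # largest power-of-two pool size not exceeding the number of candidates
    size = 1 << (len(ranked).bit_length() - 1)
    return ranked[:size]

def emphasize_lexical(pool, prefix, vocabulary):
    """Illustrative lexical emphasis (assumed scheme).

    Reorders a candidate pool so that characters which keep the current
    partial word `prefix` extendable to some in-vocabulary word come first,
    reducing the chance of generating out-of-vocabulary words.
    """
    def extends_word(ch):
        return any(w.startswith(prefix + ch) for w in vocabulary)
    # stable sort: in-vocabulary-extending characters first, original order kept
    return sorted(pool, key=lambda ch: not extends_word(ch))
```

For example, under a sharply peaked distribution the pool degenerates to the single most likely character, while `emphasize_lexical(['x', 'n', 'y'], 'the', {'then', 'they'})` moves `'n'` and `'y'` ahead of `'x'` because they extend `'the'` toward in-vocabulary words.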