
PDF investigation with parser differentials and ontology
  • Denley Lam,
  • Letitia Li,
  • Cory Anderson
Denley Lam
BAE Systems

Corresponding Author: [email protected]



This paper describes the Verifiable Automatic Language Analysis and Recognition for Inputs (VALARIN) system, which processes, evaluates, and flags unsafe PDFs. The system extracts error features from a collection of PDF parsers and organizes the different types of error messages by how they affect file safety. An ontology was designed to describe the relationships between parsers, error messages, safety, and PDF properties in support of PDF-based malware classification efforts. Our domain-specific PDF ontology shows that PDF parsers exhibit mutual biases when recovering from specification ambiguities. Consensus among parsers on the extracted error features was directly related to the safety of the PDF. The PDF OWL ontology gives information security and forensics efforts a shareable way to highlight discrepancies and aid understanding by standardizing and describing the hierarchical relationships among diverse parsers, PDF structure, and validity.
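
The idea of measuring consensus on error features across parsers can be illustrated with a minimal Python sketch. It is not the VALARIN implementation: the parser names, the error feature labels, and the 0.5 threshold are all hypothetical, chosen only to show how agreement among several parsers on the same file might be computed. How that agreement maps to a safety verdict is left to the paper itself.

```python
from collections import Counter


def error_consensus(reports: dict[str, set[str]]) -> dict[str, float]:
    """Return, for each error feature, the fraction of parsers that reported it.

    `reports` maps a (hypothetical) parser name to the set of error
    feature labels that parser emitted for one PDF file.
    """
    if not reports:
        return {}
    # Count how many parsers reported each feature (sets prevent double counting).
    counts = Counter(
        feature for features in reports.values() for feature in features
    )
    total = len(reports)
    return {feature: n / total for feature, n in counts.items()}


def widely_reported(reports: dict[str, set[str]], threshold: float = 0.5) -> list[str]:
    """List features reported by at least `threshold` of the parsers.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    consensus = error_consensus(reports)
    return sorted(f for f, share in consensus.items() if share >= threshold)


# Hypothetical error features extracted from four parsers for a single file.
reports = {
    "parser_a": {"xref_error", "missing_eof"},
    "parser_b": {"xref_error"},
    "parser_c": set(),
    "parser_d": {"xref_error", "bad_stream_length"},
}

consensus = error_consensus(reports)
agreed = widely_reported(reports)
```

Here `consensus` holds one agreement score per error feature, and `agreed` keeps only the features that at least half of the parsers reported. Features seen by a single parser point to a parser differential rather than to agreement.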