PJScan: a command-line utility that uses a learning algorithm to detect PDF files with JavaScript-related malware

PJScan is a command-line utility that uses a learning algorithm to detect PDF files with JavaScript-related malware (i.e., malicious PDF files). The name PJScan is an acronym for “PDF and JavaScript Scanner”.

The learning algorithm

PJScan utilizes a machine learning algorithm called a One-class Support Vector Machine (One-class SVM) to learn a model of malicious PDF files and then uses this model to classify previously unseen, suspicious PDF files. This is accomplished in a two-step process:

Learning a model of malicious files.

This step consists of applying PJScan’s learning algorithm to a collection of malicious PDF files. PJScan analyzes these files, extracts their JavaScript scripts (using libpdfjs), and applies a JavaScript tokenizer (pjscan-js, a modified version of Mozilla SpiderMonkey) to obtain the lexical properties of the scripts. The token sequences are converted by libstem into input for the machine learning algorithm (a One-class SVM implementation called libsvm_oc, based on libsvm), which outputs a model of known malicious PDF files. This model, saved as a file, is the input to the second step.
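To make the tokenization step more concrete, here is a minimal Python sketch. It is not PJScan's code: the regex-based `tokenize` function is a toy stand-in for pjscan-js, the token classes are invented for illustration, and the n-gram counting in `ngram_counts` is only an assumption about how token sequences might be turned into features (the text above does not say what libstem does internally).

```python
import re
from collections import Counter

# Hypothetical, heavily simplified token classes. The real pjscan-js
# tokenizer is based on SpiderMonkey and distinguishes far more.
TOKEN_SPEC = [
    ("STRING", r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\''),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("PUNCT", r"[{}()\[\];,.]"),
    ("OP", r"[+\-*/%=<>!&|^~?:]+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
KEYWORDS = {"var", "function", "return", "if", "else", "for", "while", "new"}


def tokenize(source):
    """Turn JavaScript source into a list of lexical token labels.

    Identifiers, strings and numbers are replaced by their class name, so
    the result describes the shape of the code rather than its exact text.
    Characters that match no pattern are silently skipped.
    """
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "NAME" and text in KEYWORDS:
            tokens.append("KW_" + text.upper())
        elif kind in ("PUNCT", "OP"):
            tokens.append(text)
        else:
            tokens.append(kind)
    return tokens


def ngram_counts(tokens, n=2):
    """Count every run of n consecutive tokens (assumed feature format)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sample = 'var s = unescape("%u9090");'
tokens = tokenize(sample)
print(tokens)
# ['KW_VAR', 'NAME', '=', 'NAME', '(', 'STRING', ')', ';']
print(ngram_counts(tokens, n=2).most_common(3))
```

Replacing concrete names and string contents with class labels is the point of a lexical view: two scripts that differ only in variable names or payload bytes end up with the same token sequence.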

Classification of previously unseen files.

After a model of PDF files that are known to be malicious has been learned, it’s used for the classification of previously unseen PDF files. Every PDF file to be classified has its JavaScript scripts extracted, tokenized and converted for use with the learning algorithm. Finally, the learning algorithm compares this information with the learned model and classifies the file as malicious or benign.
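The idea of one-class learning (train only on malicious samples, then ask whether a new sample looks like them) can be illustrated with a short Python sketch. Note that this is deliberately not a One-class SVM: it swaps in a much simpler centroid-and-radius detector, and the feature names, the `learn_model` and `classify` functions, and the model layout are all hypothetical.

```python
import math


def normalize(features):
    """Scale a sparse feature dict to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in features.values()))
    return {k: v / norm for k, v in features.items()} if norm else {}


def distance(a, b):
    """Euclidean distance between two sparse feature dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def learn_model(malicious_samples):
    """Step 1: learn a model from known malicious samples only.

    The model is the centroid of the normalized samples plus the largest
    distance of any training sample from that centroid (the radius).
    """
    vectors = [normalize(sample) for sample in malicious_samples]
    keys = set().union(*vectors)
    centroid = {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}
    radius = max(distance(v, centroid) for v in vectors)
    return {"centroid": centroid, "radius": radius}


def classify(model, features):
    """Step 2: a sample inside the learned region counts as malicious."""
    d = distance(normalize(features), model["centroid"])
    return "malicious" if d <= model["radius"] else "benign"


# Made-up feature counts standing in for converted token sequences.
training = [
    {"unescape": 3, "eval": 1},
    {"unescape": 2, "eval": 2},
    {"unescape": 4, "eval": 1, "substr": 1},
]
model = learn_model(training)
print(classify(model, {"unescape": 3, "eval": 1}))
print(classify(model, {"getField": 2, "print": 1}))
```

A real One-class SVM learns a more flexible boundary than a single sphere around the mean, but the workflow is the same: the model is built without any benign examples, and anything that falls outside the learned region is treated as benign.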

Other uses

In addition to learning and classification, PJScan also features some useful diagnostic tools:

  • Dumping all JavaScript scripts from a PDF file.

You can use this tool to extract the source code of all JavaScript scripts from a certain PDF file for further analysis. The scripts are saved as UTF-8-encoded text files with a .js extension in a directory.

  • Analysis of machine learning features.

The top N machine learning features are extracted from a PDF file and printed alongside the corresponding features in a previously learned model. This is useful for analyzing how individual features of the JavaScript code influence the classification result.
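As a rough illustration of the feature analysis described above, the following Python sketch ranks a file's features and prints each one next to the value stored for it in a model. How PJScan actually ranks features and what its model file contains are not stated here, so the ranking by count, the `top_features` function, and the `model_weights` mapping are assumptions made for this example.

```python
def top_features(file_features, model_weights, n=5):
    """Return the n most frequent features of a file with their model value.

    Each result is (feature, count_in_file, weight_in_model). Features that
    the model has never seen get a weight of 0.0. Ties in the count are
    broken alphabetically so the output is stable.
    """
    ranked = sorted(file_features.items(), key=lambda item: (-item[1], item[0]))
    return [(name, count, model_weights.get(name, 0.0)) for name, count in ranked[:n]]


# Made-up numbers, for illustration only.
file_features = {"unescape": 7, "eval": 2, "substr": 4, "getField": 1}
model_weights = {"unescape": 0.61, "eval": 0.35, "substr": 0.12}

print(f"{'feature':<12}{'file':>6}{'model':>8}")
for name, count, weight in top_features(file_features, model_weights, n=3):
    print(f"{name:<12}{count:>6}{weight:>8.2f}")
```

Seeing the two columns side by side shows which features of a file the model considers typical of malicious scripts and which ones it has never encountered.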

System support: Linux

Download: pjscan.tgz (mirror at http://sourceforge.net)


