LNU-Phish

Welcome to the website related to the paper “Mitigating Adversarial Gray-box Attacks against Phishing Detectors”, accepted to IEEE Transactions on Dependable and Secure Computing (Sept. 2022).

This website contains all the resources we publicly release to the research community, as an additional contribution of our paper. Specifically, we provide:

the LNU-Phish dataset, containing over 20k websites (provided with their URL, HTML, DNS record, and screenshot) which we created and used for our evaluation;
our custom-built feature extractor, which we created to obtain the feature representation of each sample in LNU-Phish (and then used as the basis for our analyses);
our implementation of the Protective Operation Chains (POC) algorithm, which is proposed and evaluated in our paper.

If you use any of these resources, we kindly ask you to cite our work with the following BibTeX entry:

  @article{apruzzese2022mitigating,
  author={Apruzzese, Giovanni and Subrahmanian, V. S.},
  journal={IEEE Transactions on Dependable and Secure Computing}, 
  title={Mitigating Adversarial Gray-Box Attacks Against Phishing Detectors}, 
  year={2022},
  volume={},
  number={},
  pages={1-19},
  doi={10.1109/TDSC.2022.3210029},
  publisher={IEEE}}

Paper

Our paper (a preprint is here) tackles the problem of adversarial attacks against Phishing Detectors (PD) relying on Machine Learning (ML). At a high level, we make three contributions:

We carry out a large evaluation of “Gray-Box” adversarial attacks against ML-based PD.
- We consider 4 datasets, 13 classifiers, and 10 different families of Gray-Box attacks, in which we vary the portion of the features used by the PD that is known to the attacker.
We propose a new defensive mechanism, the Protective Operation Chains (POC) algorithm.
- We show that PD using POC are more robust than the baselines against our attacks, and that POC does not degrade baseline performance while (potentially) confusing the attacker.
We create a new dataset for phishing website detection, LNU-Phish.
- Our dataset overcomes a problem of most existing datasets for PD: it provides the full information of each sample (the raw HTML, URL, DNS records, and screenshot, as well as its feature representation; few datasets provide all of these), collected at the time of creating the dataset (phishing websites get taken down quickly, so such samples cannot be collected later).

We publicly release our dataset, our implementation of POC (together with an example of its application), and our feature extractor.

Our LNU-Phish dataset

Our dataset, LNU-Phish (short for “Liechtenstein and Northwestern University-Phishing”), contains information about benign and phishing websites, and can be used for a variety of purposes, including, but not limited to, the detection of phishing websites via machine learning (ML). Since LNU-Phish contains samples taken “from the wild (web)”, our dataset can be used by researchers and practitioners alike.

Overview

The data in LNU-Phish spans a total of 23’634 websites: 7’861 are phishing websites, whereas 15’773 are legitimate websites.

Samples and Information

Each sample in LNU-Phish is a website, which can be either benign or malicious. Regardless of its nature, each sample in LNU-Phish is provided with the following information:

its URL (which typically corresponds to the “homepage” of the website)
the (raw) HTML of the landing webpage of the URL
the screenshot showing the rendered webpage
the DNS records, such as whois info, SSL certificate, and pagerank.

Creation and Lists

To create LNU-Phish, we developed a data-collection script that visited each URL in a given list. If the URL was valid, we then created a new “sample”, thereby saving the URL, as well as the entire (raw) HTML and a screenshot of the (rendered) landing webpage (we used Dryscrape for this). Moreover, we also used the URL to retrieve the DNS record of the webpage.

The benign websites were picked from the Alexa Top 1 Million list (as of March 2019). To create a balanced and diverse corpus containing both “well-known” websites and “less-known” (but still reputable) websites, we divided this list into three partitions: the “top” partition includes websites ranked from 1 to 10’000; the “middle” partition includes websites ranked from 10’001 to 100’000; the “bottom” partition includes all websites ranked from 100’001 onward. We extracted ~5k URLs from each partition and used them as input to our data-collection script. Ultimately, LNU-Phish contains 5’354, 3’824, and 6’595 samples from the “bottom”, “middle”, and “top” partitions, respectively.
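The rank-based partitioning described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build LNU-Phish: the function names, the `(rank, domain)` input format, the uniform random sampling, and the reading of the “bottom” partition as every rank above 100’000 are all assumptions.

```python
import random


def partition_by_rank(ranked_domains):
    """Split (rank, domain) pairs into "top", "middle", and "bottom" lists.

    Assumed boundaries: top = ranks 1..10'000, middle = 10'001..100'000,
    bottom = every rank above 100'000.
    """
    partitions = {"top": [], "middle": [], "bottom": []}
    for rank, domain in ranked_domains:
        if rank <= 10_000:
            partitions["top"].append(domain)
        elif rank <= 100_000:
            partitions["middle"].append(domain)
        else:
            partitions["bottom"].append(domain)
    return partitions


def sample_partitions(partitions, per_partition=5_000, seed=0):
    """Draw up to `per_partition` domains from each partition.

    Uniform sampling without replacement is an assumption; the page only
    states that roughly 5k URLs were extracted from each partition.
    """
    rng = random.Random(seed)
    return {
        name: rng.sample(domains, min(per_partition, len(domains)))
        for name, domains in partitions.items()
    }
```

The sampled domains would then be handed to a data-collection step; fewer samples than requested may survive, since not every URL is valid when visited.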

The phishing websites were taken by crawling two well-known repositories of phishing websites: PhishTank and OpenPhish. Specifically, we devised a crawler that iteratively monitored such repositories for several days between March and April 2019. At the end of each “crawled day”, we used the collected URLs as input to our data-collection script. Ultimately, LNU-Phish contains 5’399 websites from PhishTank and 2’462 websites from OpenPhish (all distinct).
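One way to keep the collected phishing URLs distinct across several crawled days and two repositories is to track every URL already seen. The sketch below is an assumption about how such bookkeeping could look, not the crawler used for LNU-Phish; the function name and the per-day dictionary format are hypothetical.

```python
def collect_new_urls(daily_feeds):
    """Yield, for each crawled day, the (source, url) pairs not seen before.

    `daily_feeds` is an iterable with one dict per day, mapping a source
    name (e.g. "PhishTank") to the list of URLs gathered on that day.
    """
    seen = set()
    for feeds in daily_feeds:
        new_today = []
        for source, urls in feeds.items():
            for url in urls:
                key = url.strip()
                if key and key not in seen:
                    seen.add(key)
                    new_today.append((source, key))
        # Only the new URLs of the day are passed on for collection.
        yield new_today
```

With this scheme, a URL reported by both repositories is attributed to whichever source lists it first.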

Upon completing all such operations, we created a feature extractor based on the guidelines provided in the UCI Phishing Website dataset (originally proposed in the works by Mohammad et al. [1, 2, 3] and still used even recently, e.g., [Jain 2018, Sharma 2020, Hannousse 2021]). We release our feature extractor as an additional contribution of our paper.

Formats

We provide our LNU-Phish dataset in 3(+1) diverse formats:

Feature-only (1.5MB). This format contains only the features generated via our custom feature extractor. It is the “ML-ready” format of our dataset.
- Access: Link [SHA256]
No-screenshots (550MB). This format contains all information for each sample (i.e., the raw URL, HTML and DNS information, as well as the feature representation) aside from the screenshots. It is useful to generate new features, or craft adversarial perturbations in the problem space (as done, e.g., in SpacePhish).
- Access: Link (OneDrive) [SHA256]
Full (34GB). This format contains everything, including the screenshot. It is the most complete format of LNU-Phish, but also the largest one.
- Access: (via explicit request – instructions below) [SHA256]
Full-snippet (50MB). This format contains a subset of 100 samples (60 benign and 40 phishing) taken from the “Full” format of LNU-Phish (this snippet was provided to the referees of TDSC during the peer-review process of our paper).
- Access: Link [SHA256]

(all archives are protected with a password: “dsail”, without quotes and lowercase)
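Since each format is listed with a SHA256 checksum, a downloaded archive can be checked before it is extracted. The snippet below is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and the expected digest must be copied from the corresponding [SHA256] link.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks.

    Chunked reading keeps memory usage low even for very large archives.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's digest with the published one (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If `matches_checksum` returns `False`, the download is likely incomplete or corrupted and should be repeated.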

Source Code (and Access)

Alongside our dataset, we also disclose our source code implementing POC, together with a notebook containing an example of its application to harden an ML-based classifier on our LNU-Phish dataset. We also provide the code of our feature extractor (which still works today), alongside a notebook showcasing its application to the “snippet” version of LNU-Phish.

All such resources are included in a private GitHub repository. We will provide (read) access to this repository to any interested researchers or enthusiasts.

(We also provide the supplementary material mentioned in the paper, available here.)

Access

To obtain access to our resources, you can either:

send an email to Giovanni Apruzzese (be sure to put the term “LNU-Phish” in the subject of the email!);
contact Giovanni Apruzzese by any other means (useful if, e.g., your mail is blocked);
fill out the form available at the following link: Google Form

In any case, please include your affiliation when requesting access to our resources.

List of institutions that used the dataset

University of Padua, Italy
Chulalongkorn University, Thailand
National University of Singapore
Harbin Institute of Technology, China
Southeast University, China
University of Texas San Antonio, TX, USA
