«PhD-FSTC-2015-30 Ecole Doctorale IAEM Lorraine Faculté des Sciences, de la Technologie et de la Communication DISSERTATION Defense held on ...»
General Introduction
2 Issues and Challenges in Phishing
For more than ten years now, many solutions, ranging from hardened authentication methods to techniques for identifying phishing websites, have been developed to fight phishing. However, the ever-increasing number of phishing attacks and the monetary damage they cause show that there is still room for improvement before protection techniques can reverse this trend. The fight against phishing is a challenging task, and to develop efficient protection methods one must consider several factors:
• The main challenge in tailoring efficient phishing protection techniques is that phishing cannot be treated like other security issues. Malware infections or network intrusions, for instance, rely on an attacker exploiting technical security breaches that result from flaws in the implementation of programs or network protocols. Phishing, however, targets the most vulnerable part of any system: the user. It mostly relies on social engineering tricks, and the technical sophistication of phishing attacks is low [HCNK+ 14]. Hence, the technical analysis of flaws exploited by phishing attacks and the adoption of technical countermeasures are not efficient ways to cope with the problem. In fact, phishing exploits one flaw of current electronic communications: the lack of authentication between users. While several strong authentication techniques exist, they are not mandatory and are not understood by most users. Most people are unable to authenticate the identity of the entity they communicate with. Phishing protection techniques must help people assess the legitimacy of that entity easily, in order to prevent crooks from impersonating legitimate entities.
• A second challenge lies in the difficulty of identifying phishs. Since phishers mimic the behaviour of legitimate entities by sending emails or creating websites that copy the original ones, differentiating phishs from legitimate communications is difficult. Many features are common to legitimate communications and phishs, and only a few differ. Identifying these discriminating features is the main challenge in building reliable phishing protection techniques. This reliability is paramount in order to block illegitimate communications while allowing legitimate ones. The adoption and usage of a protection technique by users depend on these features, since users are generally not motivated to use protection techniques [DT05] and ignore them when they are not reliable [ECH08].
• The third challenge is to develop techniques that can cope with the many phishing vectors. Phishing detection techniques usually focus on a few categories of phishing attacks, such as fake website identification or phishing email detection. Other techniques are even more limited and target only specific phishing attacks, such as browser window spoofing [YS02, DT05] or tabnabbing [DRNDJ13]. Overly specific techniques do not provide good protection against the large range of phishing attacks. To obtain wide protection coverage, case-specific protection techniques must therefore be accumulated. Operating these accumulated techniques requires a large computation time, which delays the identification of phishs. A long delay can impair the usability of a protection technique that is intended for real-time use.
about the usage of off-line phishing detection methods in this context. Efficient phishing protection techniques must focus more on on-the-fly identification of phishs to limit the impact of an attack. However, such methods must operate in the context of everyday electronic communication, such as exchanging instant messages or web surfing. Hence, the proposed method must not degrade the user experience and must not introduce a large delay that would prevent its use.
3 Organization of Contributions
Given the characteristics required of an efficient phishing protection method in terms of speed, coverage, reliability and ease of use, we propose in this manuscript new techniques that meet these requirements. We exploit the fact that phishing attacks are a kind of modern swindle. Phishers are crooks who use their persuasive power to convince victims to act for the phishers' benefit. They employ carefully chosen words in their communications to establish an atmosphere of trust and delude victims. Based on this fact, we propose to analyse the meaning and semantics of the words used by phishers in order to detect the messages they produce. To cover a large set of phishing attacks, we analyse the semantics of URLs and domain names. These resource locators are used in a large range of phishing attacks to misdirect users to malicious content. Identifying phishing URLs makes it possible to cope with several phishing vectors, which is why it is currently used as a phishing protection method in reactive URL blacklists [goob, mic]. However, to avoid the slow crowd verification process used by blacklists, we instead perform a real-time analysis of URLs and exploit the semantics of the words embedded in them. Observing the increasing usage of malicious domain names to perform phishing attacks, as presented in Figure 1, we also apply semantic analysis to domain names and explore the possibility of predicting domain names used for phishing by analysing the composition and semantics of phishing domains.
This document is structured around two main research directions, the detection of phishing URLs and domain names and the prediction of phishing domains:
Part I: State of the Art and Background. This part gives the background necessary to position the contributions of this document within the working context of phishing and domain name analysis. Chapter 1 defines the concept of phishing attacks and presents some of the most used phishing vectors. We provide an overview of the harmful impact of phishing and list the requirements for developing efficient phishing protection methods. The existing techniques developed to cope with phishing are presented, and we identify their weaknesses and their ability to meet the formulated requirements. Chapter 2 presents the organization and functioning of the Domain Name System. An overview of the different uses of DNS monitoring techniques is presented, and we discuss the relevance of using DNS monitoring to identify phishs.
Part II: Phishing Domain Names and URLs Detection. This part presents the first contributions of this document: techniques to identify domain names and URLs used in phishing attacks. Chapter 3 introduces a domain name clustering technique based on passively captured DNS data. The method is able to group domain names according to their activity and to discriminate phishing from legitimate domain names. This is further used in Chapter 4 as a pre-processing step to group domain names. Chapter 4 introduces a technique to infer the legitimacy or maliciousness of a set of domain names using semantic analysis. Metrics quantifying the semantic similarity between two sets of words are introduced and used to compare words extracted from legitimate and phishing domain names. These metrics make it possible to differentiate phishing from legitimate sets of domain names. Chapter 5 introduces a phishing URL detection technique relying on the analysis of intra-URL relatedness. Search engine query data is used to quantify the relatedness between the registered domain name of a URL and the remainder of the URL. It is shown that legitimate URLs present more intra-relatedness than phishing URLs. The proposed technique, which relies on a machine learning algorithm, is able to identify phishing URLs with an accuracy of 95% and a processing time of less than a second thanks to a distributed processing architecture.
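To make the idea of intra-URL relatedness more concrete, the following Python sketch splits a URL into the words of its registered domain and the words of the rest of the URL, then scores their overlap. It is only an illustration under stated assumptions: the function names `split_url_words` and `jaccard` are hypothetical, the registered domain is naively approximated by the last two hostname labels, and a plain Jaccard index stands in for the search-engine-based relatedness measure described above. The example URLs are made up.

```python
import re
from urllib.parse import urlsplit


def split_url_words(url):
    """Return (domain_words, rest_words) for a URL.

    Assumption: the registered domain is approximated by the last two
    hostname labels, which is wrong for suffixes such as "co.uk".
    """
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    labels = host.split(".")
    # Label directly left of the (assumed one-label) public suffix.
    registered_label = labels[-2] if len(labels) >= 2 else labels[0]
    subdomains = labels[:-2]
    domain_words = set(re.findall(r"[a-z]+", registered_label))
    # Everything else: subdomain labels, path and query string.
    rest = " ".join(subdomains + [parts.path, parts.query]).lower()
    rest_words = set(re.findall(r"[a-z]+", rest))
    return domain_words, rest_words


def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    for url in (
        "http://paypal.com.secure-login.example.net/paypal/signin?account=update",
        "https://www.paypal.com/paypal/help",
    ):
        domain_words, rest_words = split_url_words(url)
        print(url, jaccard(domain_words, rest_words))
```

With these two hypothetical URLs, the first (a brand name placed in the subdomains of an unrelated registered domain) scores 0.0, while the second scores 1/3. Exact word overlap is a much cruder signal than semantic relatedness, so this only shows the shape of the computation.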
Part III: Semantic Based Phishing Domain Names Prediction. This part explores the possibility of predicting domain names that will be used by phishers. The predictable character of domain names is explored in Chapter 6. We present a technique that relies on finding semantically related words in order to discover the different subdomains of a domain name. Based on a set of known subdomains, this technique is able to discover new subdomains and outperforms existing state-of-the-art techniques, showing the validity of using semantically related words to predict domain names. A similar technique is used in Chapter 7 to generate a predictive phishing blacklist. A domain name generator relying on a Markov chain model with semantic extension is introduced. Learning from a set of existing phishing domains, the generator is able to produce domain names that will be used for phishing activities, even long before they are used. This work shows that phishing domain names follow specific composition schemes and use words restricted to a limited vocabulary, which makes them predictable.
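As a rough illustration of how a Markov chain can generate candidate domain names from known ones, the Python sketch below learns first-order word transitions and samples new labels. It is a simplified sketch, not the generator of Chapter 7: the names `train` and `generate` are hypothetical, labels are assumed to be already split into words by hyphens, the semantic extension step is omitted, and the training labels are invented.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # artificial boundary states


def train(labels):
    """Count first-order word transitions in hyphen-separated labels."""
    counts = defaultdict(lambda: defaultdict(int))
    for label in labels:
        words = [START] + label.split("-") + [END]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, rng, max_words=4):
    """Sample one label by walking the chain from START."""
    words, state = [], START
    while len(words) < max_words:
        successors = counts[state]
        if not successors:  # untrained model or dead end
            break
        choices = list(successors)
        weights = [successors[w] for w in choices]
        # Pick the next word proportionally to its observed frequency.
        state = rng.choices(choices, weights=weights, k=1)[0]
        if state == END:
            break
        words.append(state)
    return "-".join(words)


if __name__ == "__main__":
    # Hypothetical training labels, for illustration only.
    model = train(["secure-paypal-login", "paypal-account-update", "secure-bank-login"])
    rng = random.Random(0)
    print(sorted({generate(model, rng) for _ in range(20)}))
```

Because the chain only recombines transitions it has seen, it can emit labels absent from the training set, such as a recombination of two observed names, which is the property a predictive blacklist would rely on.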
The dissertation concludes that lexical and semantic analysis performed on domain names and URLs is relevant to building phishing protection methods. Combined with other data sources such as DNS information, this analysis shows good results in the identification and prevention of phishs. It meets three essential requirements for phishing protection: speed, coverage and reliability.
Introduction
The increasing usage of e-services (e.g. e-banking and e-commerce) over the last decades has been accompanied by the emergence of new threats associated with these services. The valuable information handled by these services has attracted miscreants seeking to steal this data and use it for profit. One example of such cybercrime activities is phishing. The first appearance of this term dates back to 1996 and refers to the attack perpetrated against America Online (AOL), in which scammers posing as AOL employees sent messages asking customers for confidential information. Although this was the first recorded occurrence of a phishing attack, phishing became commonly known to ordinary people only ten years later. Now, twenty years after its appearance, phishing has become one of the most lucrative cybercrime activities, causing billions of dollars in losses every year [gar07, str10, rsa14]. Although several techniques have been developed to cope with phishing in previous years, its economic impact is still increasing over time [rsa14]. Methods to perpetrate phishing evolve at the same pace as protection techniques, making it a continual threat.
Chapter 1. Phishing and Protection Techniques
Phishing is a criminal mechanism employing technical subterfuges and social engineering to abuse the credulity of uninformed users. The technique usually consists in masquerading as a trustworthy entity in order to convince an individual to perform an action that they would only perform if asked by the impersonated entity. In most cases, this action consists in providing credentials for e-services, providing credit card information, downloading and installing malware, etc. The fight against phishing is difficult because phishing targets the most vulnerable part of the system: the user. As described in [DT05], in phishing both system designers and attackers battle in the user interface space to guide (or misguide) the users. Hence, the problem cannot be tackled as a traditional system or network security issue but must heavily consider the human factor. Most phishing attacks can be detected by experienced users, but for basic Internet users security is a secondary concern, and they are neither motivated nor skilled enough to properly identify phishs. Phishing protection methods must consider the human factor, and more precisely the limited skills of users and the "unmotivated user" property [WT99]. It is shown in [SHK+ 10] that half of inexperienced users fall for phishs and that, even after being trained, almost a third of the studied users are still tricked by phishing attacks. Thus, developing efficient protection techniques is challenging and raises several requirements such as ease of use, speed and performance.
Some automated and easy-to-use methods have been proposed to protect users from phishing. Phishing email filtering techniques [FST07, RW12, AKS14], security toolbars [CLTM04, GPGL11] and Web browser phishing warnings [goob, mic] are examples of such techniques. These methods can be classified into two categories: phishing prevention methods, which help prevent exposure to phishing, for instance by enforcing authentication, and phishing detection methods, which analyse a given email or web page in order to assess its legitimacy. The scope of some techniques is however limited to specific phishing attacks, and the identification delay is significant. The variety of means used to perform phishing and the short lifetime of phishing attacks make these solutions often inefficient or easy to bypass, so new solutions are required. Despite more than ten years of fighting phishing, its harmful impact is still growing.
We start this chapter by defining in Section 1.1 what phishing is, and we present the means used to perpetrate it as well as an overview of the economic impact and evolution of phishing activities. Based on these observations, we define the requirements to meet for efficient phishing protection. Section 1.2 presents three techniques to prevent phishing attacks and gives some examples of implemented solutions. Methods for phishing detection are presented in Section 1.3, and we identify the strengths and weaknesses of each state-of-the-art technique.
1.1 Phishing: an Online Con Game
Phishing has been a continual threat for almost 20 years. The range of malicious activities and attacks categorized as phishing is wide, and some have few similarities with each other. We first give a definition of phishing that includes the different aspects of the activities it covers and provide a short history of phishing. Then, we present some phishing attacks and vectors leveraging technical subterfuges and social engineering. We provide an analysis of the evolution of phishing in terms of economic impact, attacks performed and techniques used over the years. Finally, we present the challenges in developing efficient phishing protection solutions.