SIMILARITY DETECTION TECHNIQUES FOR MOBILE PLATFORM ARTIFACTS
BY AMRUTA GOKHALE
A dissertation submitted to the
Graduate School—New Brunswick
Rutgers, The State University of New Jersey
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
Graduate Program in Computer Science
Written under the direction of
Dr. Vinod Ganapathy
and approved by
New Brunswick, New Jersey
October, 2015
© 2015 Amruta Gokhale
ALL RIGHTS RESERVED
ABSTRACT OF THE DISSERTATION
Similarity Detection Techniques for Mobile Platform Artifacts
by Amruta Gokhale
Dissertation Director: Dr. Vinod Ganapathy
There has been a tremendous increase in the availability and use of mobile apps in recent years.
The two dominant mobile platform providers, Apple and Google, each have over 1.4 million applications (apps) in their app markets as of May 2015. Such a huge number of available apps gives rise to problems in managing app repositories. In addition, the large community of mobile app developers faces its own peculiar set of challenges.
This dissertation describes novel solutions to two problems in the space of mobile apps. The first problem is the fragmentation of mobile app development across multiple platforms. To accelerate the native development of the same app across multiple platforms, mobile app developers need tools to better navigate across different platforms. The second problem is keeping app repositories free of plagiarized apps. It is in the best interest of users, developers and app repository owners to rid the app markets of plagiarized apps, which often carry malicious software and thus pose a threat.
Every mobile app interacts with the platform, taking advantage of the functionality exposed by the platform. The behavior of an app can be described at a high level in terms of its interaction with the underlying platform. By monitoring the app and collecting these interactions in the form of API method invocations, we can record its high-level behavior. We use this observation to build two new systems. The first system provides assistance in cross-platform mobile app development. The second system detects plagiarized mobile apps in app repositories. The proposed techniques intercept the interactions between mobile apps and the underlying platform and use them to infer likely resemblances between two mobile platform APIs or between mobile apps.
Acknowledgements
As I embarked on this journey of earning the doctoral degree, there were many people who played roles in various capacities and helped me get to the final destination. I express my heartfelt gratitude towards them.
My advisor, Dr. Vinod Ganapathy, has helped shape many ideas behind the research projects I worked on. His excitement about solving new problems was infectious. His insights about various possible solutions to problems proved very helpful. I would like to thank him for being an excellent research advisor. I would also like to thank Dr. Liviu Iftode for helping me understand the big-picture view of the research problems I was working on. His comments on presentation slides helped improve my conference talks.
I would like to thank my committee members, Dr. Alex Borgida, Dr. Santosh Nagarkatte and Dr. Pratyusa Manadhata, for providing me feedback to improve this dissertation. I would also like to thank my collaborators, Dr. Ulrich Kremer, Daeyoung Kim, Yogesh Padmanaban and John McCabe from Rutgers and Abhinav Srivastava from AT&T, for their contributions to the joint research projects.
I had the fortune of making some wonderful friends during my graduate school period.
Chathra Hendahewa, Priya Govindan and Rezwana Karim formed the group with whom I could share the highs and lows of graduate school. I would like to thank past and present members of disco-lab, especially Arati Baliga, Lu Han, Liu Yang, Mohan Dhawan, Shakeel Butt, Rezwana Karim, Daeyoung Kim, Hai Nguyen and Nader Boushehrinejadmoradi. They provided valuable feedback on my presentations and talks. I had interesting technical as well as non-technical discussions with them.
I owe a debt of gratitude to my parents – aai and baba. When I was growing up, my parents sowed the seeds in me to perform to the best of my ability in anything I do. They provided me with the best possible resources, for which they often had to sacrifice some part of their own lives. For aai, my mother, the first priority was always her kids; everything else was secondary. I can never pay back the debt I owe to her. I have happy and fond memories of spending my younger years with my sister, Ruta, and my brother, Chaitanya. Conversations with them, as well as with my cheerful niece, Maitreyee, are something I always look forward to.
Their love and support give me the needed strength. Last, but certainly not least, I attribute my success to my husband – Ajay – who stood behind me like a powerful force when needed.
His encouragement, patience and understanding over all the years were instrumental in driving me from the beginning of this journey till the very end.
3.1. Comparison of different obfuscators in terms of their transformation capabilities.
3.2. Size statistics of apps chosen for similarity measurement experiment.
2.6. Studying the impact of various combinations of factors.
2.7. Design of the system to identify API mappings using a static approach.
3.4. API birthmark results on pairwise comparisons for 300 apps when Java API method calls were present in the traces.
3.5. Distribution of similarity measures reported by API Birthmark when two different obfuscators were used.
3.6. Similarity scores reported by Androguard and API Birthmark.
This dissertation describes novel solutions to two similarity detection problems in the space of mobile apps. The first problem is inferring similarities between API methods of two mobile platforms. The second problem is finding similarities between mobile apps of a single platform with application to detecting plagiarism in them.
Over the last few years, we have witnessed the gradual decline of desktop computers and the rapid rise of mobile devices for accessing the majority of digital content. According to recent statistics, mobile traffic to websites will overtake desktop traffic by March 2017. Fueling this growth in the adoption of mobile devices is the huge number of mobile apps being made available on mobile platforms. The two most popular mobile platforms, iOS and Android, have more than one million apps each in their app markets.
The landscape of mobile app development differs from that of desktop software development in many ways. The differences are visible in the affiliation and skill set of developers, the size of the developer community, the kind of repositories to which software is uploaded, and the functionality provided by the platform. To give an example, Apple announced in June 2015 that the iOS App Store had 1.5 million apps available, with 100 billion app downloads from the store, and that $30 billion had been paid out to developers to date. As per the data released by app figures in 2014, Google Play has over 350,000 registered developers, whereas the iOS App Store attracted over 250,000 developers. This scale is unprecedented in terms of the size of the developer community, the number of apps available in the app market and the total number of software downloads. Let us look at these differences in greater detail.
Until a few years ago, the task of developing applications was largely confined to teams of software engineers, either in the open-source community or at IT companies. Most of these engineers were professional programmers employed by IT firms, with the exception of a small number of developers of open-source software. Almost all commercial software development was done in-house. For example, the software applications for Microsoft’s Windows operating system were written by employees and contractors employed by Microsoft and could never be done by third-party developers.
In contrast, the development of mobile apps is usually driven by small teams consisting of one or two developers, as noted in . Many of these developers are either individual contributors or employees of small IT firms. These developers write apps for a platform that is owned and released by a company different from the one they are affiliated with. For example, it is not uncommon for individual contributors or programmers employed by small IT firms to write apps for Microsoft’s Windows Phone platform, even though their parent IT firm does not own the platform. Such teams (or individuals) may lack the expertise and experience of a large team of in-house developers. Mobile app developers therefore tend to rely heavily on the functionality provided by the underlying mobile platform. It is imperative for mobile platforms to come equipped with a rich set of APIs so as to make development easier for third-party app developers.
The concept of app stores was non-existent until the arrival of mobile devices. A company that made commercial software would distribute it via its own distribution channel.
Since the software was developed in-house, and was usually closed to third-party developers, the company had complete control over the distribution channel. As a result, managing the distribution channel or the software repository owned by the company was relatively easy. For example, although plagiarism of desktop software was rampant, such plagiarized software could never gain entry to the software repository owned by the maker of the original software. In-house desktop software development also limited the scale at which software applications were uploaded and distributed. This made the management of software repositories easier. In the case of mobile platforms, the humongous number of apps being deployed to the app stores gave birth to various issues associated with the management of large software repositories. Among these issues, we will look at detecting plagiarized mobile apps.
This dissertation is divided into two parts. In the first part, we focus on mobile app development and the challenges faced by the developers of mobile apps. In the second part, we look at how to detect plagiarized mobile apps.
One challenge in mobile app development is the fragmentation of the app development process across multiple mobile platforms. Development on different mobile platforms differs on many fronts, such as the use of different programming languages, support for specific development environments and different Software Development Kits (SDKs). Fragmentation also exists within the same platform in the form of different screen sizes, various screen resolutions and different flavors or versions of the same operating system. This requires maintaining different code bases, thus leading to increased maintenance overhead. Mobile app developers also need better support for monitoring, analysis and testing of mobile apps. Joorabchi et al. conducted an exploratory study to understand the current practices and difficulties in native mobile app development. Among all the different challenges expressed by developers, 76% of survey participants expressed concerns over app development for multiple mobile platforms.
As per the survey, “One type of challenge mentioned by many developers is learning more languages and APIs for the various platforms”. We have attempted to address this challenge by developing a tool that would help developers to familiarize themselves with the API of a newer platform.
Every mobile platform is shipped with APIs which the programmers can use to perform common tasks in the implementation of the business logic of the app. For example, in Android, drawRect() can be used to draw a rectangle on the screen. setColor() can be invoked to set the color of the palette. These methods provide abstractions for common tasks, thus freeing the programmers from writing low level code. Programmers therefore find it easy and convenient to use the platform libraries to build apps on top of the respective platform. Hence, it is reasonable to expect that almost all mobile apps will interact with the underlying platform during their execution in the form of invocations to the library methods.
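The idea of describing an app by its platform API invocations can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: TracingPlatform is a hypothetical stand-in that merges a few drawing calls into one object and logs their names, the two toy "apps" are plain functions, and the Jaccard measure over sets of invoked method names is a simple substitute chosen here, not the similarity measure used in this dissertation.

```python
from typing import Callable, List


class TracingPlatform:
    """Hypothetical stand-in for a platform API that logs each method it is asked to run."""

    def __init__(self) -> None:
        self.trace: List[str] = []

    def setColor(self, color: str) -> None:
        self.trace.append("setColor")

    def drawRect(self, left: int, top: int, right: int, bottom: int) -> None:
        self.trace.append("drawRect")

    def drawCircle(self, cx: int, cy: int, radius: int) -> None:
        self.trace.append("drawCircle")


def record_trace(app: Callable[[TracingPlatform], None]) -> List[str]:
    """Run a toy app against the tracing platform and return the ordered API call names."""
    platform = TracingPlatform()
    app(platform)
    return platform.trace


def jaccard_similarity(trace_a: List[str], trace_b: List[str]) -> float:
    """Illustrative measure only: overlap between the sets of invoked API methods."""
    set_a, set_b = set(trace_a), set(trace_b)
    if not set_a and not set_b:
        return 1.0  # two empty traces are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def app_one(platform: TracingPlatform) -> None:
    platform.setColor("red")
    platform.drawRect(0, 0, 10, 10)


def app_two(platform: TracingPlatform) -> None:
    platform.setColor("blue")
    platform.drawRect(5, 5, 20, 20)
    platform.drawCircle(3, 3, 1)


trace_one = record_trace(app_one)
trace_two = record_trace(app_two)
score = jaccard_similarity(trace_one, trace_two)
```

The two toy apps pass different arguments but share two of the three API methods they invoke, so their traces overlap substantially. A set-based measure ignores call order and frequency; a realistic comparison would likely need to account for both.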
The second problem is how to detect plagiarized mobile apps that have been produced by fraudulently copying other apps. Next, we describe the two problems in more detail and the different techniques that we have employed to generate and use the interaction of the app with the platform.
1.2 Detecting Similarities or Mappings between Mobile Platform APIs
Today’s mobile market is dominated by three major platforms, namely iOS, Android and Windows Phone. Developers often wish to make their apps available across as many of these platforms as possible so as to increase their user base. These developers face the challenge of developing apps separately for each new platform. Small teams of developers cannot afford to allocate separate resources to develop apps for each individual platform. Thus, we need tools to help such developers who are looking to develop apps across multiple platforms.