
File Type Identification and E-mail Spam Filtering

R. Dhanalakshmi, C. Chellappan

Abstract


The widespread use of email has given malicious users an easy way to distribute harmful content to an internal network. Hackers can circumvent the protection offered by a firewall by tunneling through the email protocol, since the firewall does not analyze email content. Organizations often fail to acknowledge the substantial risk of crucial data being stolen from within the company. Identifying the true type of a computer file is a difficult and important problem, because hackers and malicious users either use non-standard file formats or change file extensions when storing or transmitting files over a network, and so evade firewall filtering. This makes recovering data from such files difficult and allows confidential data to be sent out of an organization disguised in permitted file formats. Previous methods of file type recognition include fixed file extensions, fixed “magic numbers” stored with the files, and proprietary descriptive file wrappers. All of these methods have significant limitations. Hence a content-based approach is proposed: generate “fingerprints” of file types from a set of known input files, then use the fingerprints to recognize the true type of unknown files based on their content rather than the metadata associated with them.
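The abstract does not say how a fingerprint is computed, so the following is only a minimal sketch of the general idea, assuming one common choice: a fingerprint is the average byte-frequency histogram of known files of a type, and an unknown file is assigned to the type whose fingerprint is closest by cosine similarity. The function names (`byte_histogram`, `build_fingerprint`, `identify`) and the toy training data are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def byte_histogram(data):
    """Return the relative frequency of each of the 256 possible byte values."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero on empty input
    return [counts.get(b, 0) / total for b in range(256)]


def build_fingerprint(samples):
    """Average the byte histograms of known files of one type (assumed scheme)."""
    hists = [byte_histogram(s) for s in samples]
    return [sum(column) / len(hists) for column in zip(*hists)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(data, fingerprints):
    """Return the type whose fingerprint best matches the content of `data`."""
    hist = byte_histogram(data)
    return max(fingerprints, key=lambda t: cosine(hist, fingerprints[t]))


# Toy training set; a real system would read many files per type from disk.
fingerprints = {
    "text": build_fingerprint([
        b"hello world, this is plain text\n",
        b"another line of ordinary prose\n",
    ]),
    "binary": build_fingerprint([
        bytes(range(256)),
        bytes(range(0, 256, 2)) * 2,
    ]),
}

# The file name or extension is never consulted, only the content.
print(identify(b"some plain words here\n", fingerprints))
```

Because the decision depends only on content, renaming a file to an allowed extension does not change the result, which is the property the abstract relies on.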

E-mail spam has become an epidemic problem that can negatively affect the usability of electronic mail as a means of communication. Besides wasting the time and effort users spend scanning and deleting the massive volume of junk e-mail they receive, it consumes network bandwidth and storage space, slows down e-mail servers, and provides a medium for distributing harmful and/or offensive content. Inspired by the success of fuzzy similarity in text classification and document retrieval, this work investigates its effectiveness in filtering spam based on the textual content of e-mail messages.
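The abstract does not define the fuzzy similarity measure it uses. As a hedged illustration only, the sketch below assumes a fuzzy Jaccard measure (sum of minimum memberships over sum of maximum memberships), with each term's membership taken as its count scaled by the largest count in the collection. The helper names (`memberships`, `fuzzy_similarity`, `classify`) and the tiny training sets are hypothetical.

```python
import re
from collections import Counter


def memberships(texts):
    """Map each term to a membership degree in (0, 1]: count / largest count."""
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z']+", text.lower())
    )
    peak = max(counts.values(), default=1)
    return {word: c / peak for word, c in counts.items()}


def fuzzy_similarity(a, b):
    """Fuzzy Jaccard similarity between two term-membership maps (assumed measure)."""
    terms = set(a) | set(b)
    num = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    den = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return num / den if den else 0.0


def classify(message, spam_profile, ham_profile):
    """Label a message by the category profile it is more similar to."""
    m = memberships([message])
    spam_score = fuzzy_similarity(m, spam_profile)
    ham_score = fuzzy_similarity(m, ham_profile)
    return "spam" if spam_score > ham_score else "ham"


# Hypothetical training messages for each category.
spam_profile = memberships(["win free money now", "free prize claim your money"])
ham_profile = memberships(["meeting agenda for monday", "please review the project report"])

print(classify("claim your free money", spam_profile, ham_profile))
print(classify("project meeting on monday", spam_profile, ham_profile))
```

Ties are resolved in favor of "ham" here, a conservative choice that avoids discarding legitimate mail when the evidence is equal.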


Keywords


Fingerprint (File Print), Spam, Signature.


