Data-Deduplication in Linux Kernel File-System

Amit Savyanavar, Sachin Katarnaware, Pritam Bankar, Prashant Jadhav, Nikhil Bagde

Abstract


Data deduplication is a compression technique that eliminates redundant data from a hard disk or other storage so that the space is used efficiently. In every operating system, the file system manages storage space and is responsible for placing data on secondary storage. We therefore modify the file system so that it eliminates redundant data blocks before they are written to secondary storage, an approach known as inline data deduplication. Ext4 is the latest file system used in Linux and offers many new features; we modify Ext4 to add data deduplication as one more feature [5]. In our inline method, we create a table that stores each hash key together with the number of the block that contains the data for that key. The hash key is generated using the SHA-1 algorithm. Whenever new data arrives, it is passed to SHA-1 before any blocks are allocated for it, and a key is generated. This key is then compared with the keys already stored in the table. If the key is already present, only the counter associated with that key is incremented; this counter tracks the number of pointers that refer to the block on the physical device. If the key is not present, the key is stored and control passes to the superblock, which allocates free blocks from the list it maintains and returns the allocated block numbers to the table, where they are stored against their key and the counter is incremented. This method eliminates redundant allocation of data blocks, which saves space and makes more efficient use of storage. Enterprises and large organizations, whose data is growing exponentially, can save space in this way. Because the method eliminates duplicates at block level, it also achieves a good elimination ratio and a good saving in storage space.
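The lookup-then-allocate flow described in the abstract can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the authors' Ext4 code: the names `DedupTable`, `Entry`, `write_block` and `BLOCK_SIZE`, the 4096-byte block size, and the in-memory dictionaries standing in for the on-disk table and the superblock's free list are all hypothetical.

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size; the abstract does not state one


@dataclass
class Entry:
    block_no: int  # physical block that holds the data for this key
    refcount: int  # number of pointers that refer to that block


class DedupTable:
    """Toy in-memory model of the hash-key table (hypothetical names)."""

    def __init__(self, total_blocks):
        # Stand-in for the free-block list kept by the superblock.
        self.free_blocks = list(range(total_blocks))
        self.table = {}  # SHA-1 digest -> Entry
        self.disk = {}   # block number -> data, simulating the device

    def write_block(self, data):
        # Hash the data before any block is allocated for it.
        key = hashlib.sha1(data).digest()
        entry = self.table.get(key)
        if entry is not None:
            # Key already present: only the counter is incremented.
            entry.refcount += 1
            return entry.block_no
        # Key not present: take a free block, then record key and block.
        if not self.free_blocks:
            raise OSError("no free blocks left")
        block_no = self.free_blocks.pop(0)
        self.disk[block_no] = data
        self.table[key] = Entry(block_no=block_no, refcount=1)
        return block_no


# Usage: the third write repeats the first block's contents.
fs = DedupTable(total_blocks=8)
first = fs.write_block(b"A" * BLOCK_SIZE)
second = fs.write_block(b"B" * BLOCK_SIZE)
repeat = fs.write_block(b"A" * BLOCK_SIZE)
```

In this model the repeated write returns the block number of the first write and consumes no additional free block. The sketch treats equal SHA-1 keys as equal data, as the abstract does; whether the real implementation also compares block contents is not stated.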

Keywords


Linux, Block allocation, Operating system

References


Linux Kernel Development by Robert Love

Understanding the Linux Kernel by Daniel P. Bovet and Marco Cesati

The Design of the UNIX Operating System - M. Bach (1986)

Beginning Linux Programming, 4th Edition

http://india.emc.com/collateral/analyst-reports/idc-20090519-data-deduplication.pdf

http://india.emc.com/collateral/hardware/white-papers/h6065-achieve-storage-effficiency-celerra-dedup-wp.pdf

http://india.emc.com/collateral/analyst-reports/010208-esg-emc-centera-ease-of-use.pdf

http://india.emc.com/collateral/campaign/global/dedupe-roadshow/roi-of-backup-redesign.pdf

http://ieeexplore.ieee.org/search/srchabstract.jsp?tp=&arnumber=5423123&queryText%3DDATA+DEDUPLICATION%26openedRefinements%3D*%26searchField%3DSearch+All

http://ieeexplore.ieee.org/search/srchabstract.jsp?tp=&arnumber=4812468&queryText%3DLarge-Scale+Deduplication+with+Constraints+using%26openedRefinements%3D*%26searchField%3DSearch+All

http://www.usenix.org/event/lsf07/tech/cao_m.pdf

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5541078

CPUSETS. Linux kernel documentation: kernel/Documentation/cpusets.txt.

EtherCAT Technical Introduction and Overview. http://www.packagingdigest.com/contents/pdf/EtherCAT_Introduction_en.pdf.

Linux Real Time Patch Review - Vanilla vs. RT patch comparison. http://www.captain.at/howto-linux-real-time-patch.php

Real-Time Linux Wiki. Project site: http://rt.wiki.kernel.org.

RT-mutex subsystem with PI support. Linux kernel documentation: kernel/Documentation/rt-mutex.txt.

RTLinuxPro CPU Reservation Technology. http://www.linuxdevices.com/articles/AT7665542109.html.

J. F. Gantz, C. Chute, A. Manfrediz, S. Minton, D. Reinsel, W. Schlichting, and A. Toncheva, “The diverse and exploding digital universe: An updated forecast of worldwide information growth through 2011,” IDC, An IDC White Paper - sponsored by EMC, March 2008.

“104th Congress, United States of America. Public Law 104-191: Health Insurance Portability and Accountability Act (HIPAA),” August 1996.

“107th Congress, United States of America. Public Law 107-204: Sarbanes-Oxley Act of 2002,” July 2002.

H. Biggar, “Experiencing Data De-Duplication: Improving Efficiency and Reducing Capacity Requirements,” The Enterprise Strategy Group, Feb. 2007.

G. Forman, K. Eshghi, and S. Chiocchetti, “Finding similar files in large document repositories,” in KDD ’05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 394–400.

A. Muthitacharoen, B. Chen, and D. Mazières, “A low-bandwidth network file system,” in Proceedings of the ACM Symposium on Operating Systems Principles (SOSP ’01), 2001, pp. 174–187.

S. Quinlan and S. Dorward, “Venti: A new approach to archival storage,” in Proceedings of the First USENIX Conference on File and Storage Technologies (FAST), 2002, pp. 89–101.

L. L. You and C. Karamanolis, “Evaluation of efficient archival storage techniques,” in Proceedings of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, Apr. 2004.



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.