Evidence Data Preprocessing for Forensic and Legal Analytics
Sundar Krishnan, Narasimha Shashidhar, Cihan Varol, ABM Rezbaul Islam
Pages - 24 - 34     |    Revised - 31-05-2021     |    Published - 30-06-2021
Volume - 12   Issue - 2    |    Publication Date - June 2021
eDiscovery, Electronic Stored Information, Digital Evidence, Digital Forensics, Digital Forensic Analytics, Legal Analytics, Machine Learning, Preprocessing, Natural Language Processing.
Electronic evidential data pertaining to a legal case or a digital forensic investigation can be enormous, given the extensive electronic data generation mechanisms of companies and users coupled with cheap storage alternatives. Working with such volumes of data can be taxing, sometimes requiring mature analytical processes and a degree of automation. Once electronic data is collected after an eDiscovery hold or a forensic acquisition, it can be framed into datasets for analytical research. This paper focuses on data preprocessing of such evidentiary datasets, outlining best practices and potential pitfalls prior to undertaking analytical experiments.
Mr. Sundar Krishnan
Department of Computer Science, Sam Houston State University, Huntsville, TX - United States of America
Dr. Narasimha Shashidhar
Department of Computer Science, Sam Houston State University, Huntsville, TX - United States of America
Dr. Cihan Varol
Department of Computer Science, Sam Houston State University, Huntsville, TX - United States of America
Dr. ABM Rezbaul Islam
Department of Computer Science, Sam Houston State University, Huntsville, TX - United States of America
