Application of Data Mining to Spam Email Classification

김기태

Korean Abstract

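This study presents a solution to the problem of classifying spam email using data mining. As internet access and computerization have spread, more individuals and organizations use email, which has greatly improved everyday convenience and business efficiency. At the same time, this change has caused many problems through the receipt of spam. Spam email is email delivered to unwilling recipients for advertising or malicious purposes. For individuals it causes confusion and can harm their computers; for businesses it disrupts important work. Methods for routing such email to a spam folder have been studied extensively, but the need for an effective method remains high. Many approaches have been studied to address this problem, and data mining has shown strong results. However, when the available data are incomplete, data mining techniques are not easy to apply. In particular, when class information is available for only one class or is uncertain, it is difficult for general data mining techniques to find a classification model. This paper uses PU learning to address this, with a Support Vector Machine (SVM) as the base data mining technique. The experimental results show that the proposed methodology can provide a good classification model for spam email classification.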

English Abstract

This paper proposes a classification model for spam email using data mining. The use of personal and business email has increased along with the growth of the internet population and computerization. This change has made daily life more convenient and improved business efficiency. On the other hand, spam emails cause many problems. Spam email is email sent to recipients who do not want it, and it brings annoyance, computer viruses, or interruption of business processes. Although many methods have been proposed to protect against spam email, an efficient classification method is still needed. Data mining is a prominent way to classify spam emails. However, if the data do not contain sufficient information, traditional data mining methods may not be applicable. Therefore, we suggest a PU learning algorithm to handle classification with insufficient data, where only positive-class examples and unlabeled data are available. A support vector machine (SVM) is used as the base data mining method. Experimental results show the viability of the proposed classification model.

Keywords

data mining, spam email, PU learning, Support Vector Machine
