Social feature-based enterprise email classification without examining email contents



Journal of Network and Computer Applications 35 (2012) 770–777

Contents lists available at SciVerse ScienceDirect

Journal of Network and Computer Applications journal homepage: www.elsevier.com/locate/jnca

Min-Feng Wang*, Meng-Feng Tsai, Sie-Long Jheng, Cheng-Hsien Tang

Department of Computer Science and Information Engineering, National Central University, No. 300, Jhongda Road, Jhongli 32001, Taiwan, ROC

Article info

Article history: Received 24 June 2011; received in revised form 19 September 2011; accepted 10 November 2011; available online 23 November 2011

Keywords: Social network analysis; Enterprise email classification; Machine learning

Abstract

Without imposing restrictions, many enterprises find nonwork-related content consuming network resources. Business communication over email thus incurs undesired delays that damage the business, which explains why many enterprises are concerned about contention for email services. Enterprises should therefore prioritize business emails over personal ones in their email service. Previous works present content-based classification methods to categorize enterprise emails into business or personal correspondence. The accuracy of these methods is largely determined by their ability to survey as much information as possible. However, in addition to decreasing the performance of these methods, monitoring the details of email contents may violate legally protected privacy rights, so accurate classification of enterprise emails must be carefully balanced against the protection of privacy. The proposed email classification method is thus based on social features rather than a survey of email contents. Social-based metrics are designed to characterize emails as social features; the obtained features are used as input to machine learning-based classifiers for email classification. Experimental results demonstrate the high accuracy of the proposed method in classifying emails. In contrast with content-based methods that examine email contents, the emphasis on social features in the proposed method is a promising alternative for solving similar email classification problems. © 2011 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +886 958 716 358. E-mail address: [email protected] (M.-F. Wang).

1084-8045/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.jnca.2011.11.010

1. Introduction

The pervasiveness of the Internet has led to the exponential growth of email. According to IDC statistics (ILTA, 2009), more than 60% of business-critical information is stored in email and other electronic messaging systems. Even for a start-up or a small company, email is essential for contacting customers or partners, and can be even more important than a company website. Email communication encompasses everything from product quotations and contracts to customer service and marketing advertisements. However, the potential threats that outgoing emails pose to enterprise success have seldom been addressed (Vandermeer, 2006). For instance, employees often send many nonwork-related emails during work hours, causing network managers to spend considerable effort processing a tremendous amount of work-unrelated messages while filtering and archiving emails. Moreover, the popularity of multimedia technologies and social networks has increased the frequency of email attachments containing large multimedia files, which are then sent to multiple recipients. In addition to straining the resources of email servers, such behavior also increases internal communication costs in an enterprise. Enterprises must thus focus on preventing outgoing personal emails from overburdening the email system in order to utilize company resources more efficiently.

Out of concern for email system stability, employee productivity, and the confidentiality of competitive data, many enterprises institute email policies to define personal usage restrictions and unacceptable email behavior (Seow et al., 2005). Enterprises incorporate the email usage policy into the existing employee manual as an enforcement policy (Pacilla, 2008). The ePolicy Institute also provides email guides to instruct employers on how to craft an email policy for employees, such as "educate employees to understand the email rules and email policies" and "address ownership issues and privacy expectations" (Flynn, 2009; Flynn and Flynn, 2003; Flynn and Kahn, 2003). Moreover, to ensure the success of the email policy, the enterprise must educate employees regarding the legal aspects of email usage and email etiquette. However, most enterprises fail to educate employees on email risks, rules, and responsibilities. According to AMA's 2003 survey (AMA Press Room, 2003), 75% of all organizations have written email policies, but only 48% offer email policy education to employees, and only 27% train employees on email retention and deletion policies. This report also indicates that, although they have set up an email policy to alleviate the risks associated with email, most enterprises pay little attention to educating their staff and leave the staff to interpret the policy independently (Seow et al., 2005).

To orient employees on the do's and don'ts of email communication in the workplace, certain enterprises state clear reasons, such as recent court cases and network bandwidth statistics, to persuade employees to refrain from sending illegal or improper email. Additionally, adopting an effective written policy that directs employees to write email professionally and respectfully can help an enterprise to maintain a harassment-free workplace and increase employee productivity. Moreover, most organizations allow limited personal email if employees adhere to certain company policies (Pacilla, 2008), such as:

• Personal use must not be commercially oriented or for any personal financial gain.
• Personal email/Internet usage must not interfere with work performance.

American Media Inc. has also released an email education training video to help employees use email properly (AMI, 2001). The major points of the video are summarized as follows:

• Email messages should be viewed as public, not private, information.
• Email messages should be viewed as a permanent form of communication, even though users delete messages.
• Email messages should avoid discussing sensitive issues to avoid embarrassing situations.

From the above description and statistics, we recognize that comprehensive email education and an email policy consume considerable enterprise resources. However, many enterprises prefer to devote resources to market expansion and strategic planning rather than to training and educating employees on email usage behavior. This work designs an effective email classification method based on social network analysis for enterprises to handle a large amount of outgoing emails.

According to statistics reported by the legal scholar Jay Kesan (2002), 40 million American employees sent 60 billion emails in 2000. According to the Email Statistics Report published by the Radicati Group (2011), the number of worldwide corporate email accounts is forecast to increase from over 780 million in 2011 to over 1.1 billion in 2015; in addition, a typical corporate user sends and receives approximately 105 emails daily, with this number rising to nearly 125 before the end of 2015. Moreover, the Radicati Group forecasts that corporate email accounts will increase at a faster pace than consumer ones. These statistics reflect the tremendous burden that work-related and work-unrelated emails impose on the outgoing message queue of an email system. Therefore, most enterprises use different email classification tools to collect and monitor email contents in order to prevent data leakage and reduce overloading of servers. According to statistics from the 2007 Electronic Monitoring & Surveillance Survey (AMA Press Room, 2007), of the 43% of the companies surveyed that monitor email, 73% use technology tools to automatically monitor email while 40% assign staff to manually read and review email. However, public concern has grown over potential privacy violations in monitoring employee emails.
Although most enterprises claim the right to monitor such emails, peeking into email contents remains a contentious issue owing to loopholes in the legal system or ambiguous employment contracts (Sipior and Ward, 1995; Weisband and Reinig, 1995). Consequently, enterprises aim to strike a balance between employer assets and employee privacy. Therefore, this work develops a novel enterprise email classification method that does not breach privacy rights. The proposed method helps to classify enterprise emails as either business or personal emails while considering employee privacy and protecting enterprise assets from the unintentional or malicious activities of employees.

This work makes the following contributions to e-business, privacy protection, and social media analysis:

(1) An email classification method is developed based on social network analysis. In contrast with previous works that concentrated on protecting enterprises from threats of incoming emails such as spam, viruses, and phishing scams, the proposed method classifies outgoing emails as business or personal emails to reduce the likelihood of overburdening email servers and to avoid delays of business emails. Analytical results demonstrate the high classification accuracy of the proposed method.
(2) Five social-based metrics are developed to characterize how business and personal emails differ in terms of the communication behavior of senders. Experimental results indicate that the metrics can help to discriminate between emails with high accuracy and remain stable across email datasets of different sizes.
(3) A noncontent-based email classification method is devised that does not inspect the actual contents of emails, protecting both employee privacy and employer assets.
(4) An email scheduling simulation model is constructed to prevent a mass of work-unrelated emails from delaying the sending of work-related ones.

To our knowledge, this work presents the first method based on social network analysis to address how to classify enterprise emails into work related and work unrelated.

The rest of this paper is organized as follows. Section 2 reviews pertinent literature, addressing both empirical and theoretical aspects of the role of social network analysis in email classification. Section 3 then introduces the proposed framework, which largely consists of the system architecture, problem formulation, and characteristics of an email social network. Next, Section 4 discusses the social features and metrics of the proposed method. Additionally, Section 5 presents the proposed classification approach, scheduling algorithm, and experimental results. Conclusions are finally drawn in Section 6, along with recommendations for future research.

2. Related work

Characteristics of email and the most effective methods for classifying it have received considerable interest in recent decades. However, this topic has seldom been explored in social network analysis. Tyler et al. (2005) extracted social information from the headers of email logs to organize the email social network of an enterprise, and used a betweenness centrality algorithm to identify communities within the enterprise. Yelupula and Ramaswamy (2008) also extracted relevant features of emails from the Enron dataset (Shetty and Adibi, 2004) to reproduce and predict the organizational structure of Enron. Although they used social network methodologies to identify the community structure of an enterprise, these works did not address email classification.

Twining et al. (2004) analyzed the email sent history to forecast future email behavior in order to prioritize the delivery of business email to the desktop rather than preventing junk email from reaching the desktop. However, this work failed to consider the personal social network when analyzing previous behavior involving sent email to evaluate whether an email is spam or not. Chirita et al. (2005) developed a ranking and classification approach, MailRank, to detect spam according to the addresses of email senders. In that work, a power-iteration algorithm was implemented to calculate a global reputation score and a personalized trust value for each known email address in order to classify emails. While attempting to reduce the number of spam emails and develop a spam filter that satisfies three critical characteristics, i.e. attack resilience, personalization, and user-friendliness, Li and Shen (2011) utilized the social relationships among email users to detect spam. That work also designed a spam filter by applying the constructed social network to evaluate and infer the closeness centrality, trust score, and personal interests among users, and then integrated the obtained features into a Bayesian filter.

Stolfo et al. (2003, 2006) adopted different user behavior models, based on volume and velocity statistics of a user's email rate and an analysis of the user's social cliques, to analyze email behavior. They then created the Email Mining Toolkit (EMT) to compute these models for detecting the onset of viral propagations. To detect potential insider threats to an organization, Okolica et al. (2008) used an extension of the probabilistic latent semantic indexing (PLSI) method to discern user interests. That work also incorporated the obtained user interests into the implicit and explicit email social networks, which are constructed from the corresponding external and internal email communication, to identify individuals who are alienated from the organization or interested in a sensitive topic.

Microsoft developed and implemented a social network and relationship finder (SNARF) prototype for email triage (Fisher et al., 2007; Neustaedter et al., 2005a,b). They aggregated social metadata to estimate the social metrics for each collected email and then sorted user emails based on the corresponding social metrics. Additionally, several studies have suggested the merits of social features in prioritizing emails. For instance, Yang et al. (2010) and Yoo et al. (2009) obtained social features from each email to represent its social importance and developed a transductive learning algorithm for personalized email prioritization. Moreover, since the structure of social networks is diverse and changes over time, Tseng et al. (2007) and Tseng and Chen (2009) proposed a progressive update scheme that feeds back the most important features of the email social network for efficient spam classification.

Our survey of email classification research shows that some works have developed innovative and interesting methods for spam email filtering based on social network analysis. However, research is still lacking on how to classify the outgoing emails of enterprises based on social network analysis. Therefore, this work integrates a social network method with machine learning-based classifiers to classify emails without monitoring the details of email contents.

3. Overall framework

3.1. Structure of an email message

This work attempts to avoid privacy violations by extracting only the sender field and recipient fields of emails as social features for email classification. In this section, we briefly introduce the basic structure of an email. An email message mainly consists of a message header and, optionally, a body. The message header contains control information such as the sender address (From), recipient addresses (To, Cc, and Bcc), and content type. The body contains the actual contents of an email message, which may include text, images, and even video. Fig. 1 illustrates the email message structure.

Fig. 1. Email message structure.
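As a minimal sketch of this header-only extraction, the sender and the distinct recipients can be read from the From, To, Cc, and Bcc fields without touching the body. The use of Python's standard `email` package, the helper name `extract_contacts`, and the sample addresses are illustrative assumptions; the paper does not describe its parsing tool.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_contacts(raw_message: str):
    """Return (sender, sorted distinct recipients) using header fields only."""
    msg = message_from_string(raw_message)
    # getaddresses turns a list of header values into (display name, address) pairs.
    senders = getaddresses(msg.get_all("From", []))
    sender = senders[0][1].lower() if senders else ""
    recipient_headers = []
    for field in ("To", "Cc", "Bcc"):
        recipient_headers.extend(msg.get_all(field, []))
    # A set keeps each recipient once, even if listed in several fields.
    recipients = {addr.lower() for _, addr in getaddresses(recipient_headers) if addr}
    return sender, sorted(recipients)


# Hypothetical message; the body is parsed past but never inspected.
raw = (
    "From: Alice <alice@corp.example>\n"
    "To: bob@corp.example, Carol <carol@partner.example>\n"
    "Cc: bob@corp.example\n"
    "Subject: quotation\n"
    "\n"
    "Body text that is never inspected.\n"
)
sender, recipients = extract_contacts(raw)
```

Lower-casing the addresses is a design choice of this sketch so that the same mailbox written with different capitalization counts as one contact.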

3.2. System architecture

Fig. 2 briefly reviews the proposed system architecture. The proposed system integrates a preprocessing model, a social network analysis model, a classification model, and a prioritization model, which together automatically classify and prioritize emails based on their social features. The function of each component is described as follows.

In the preprocessing model, contact and domain name information is parsed and extracted from emails to construct both business and personal social networks and to characterize social features. In addition to constructing business and personal social networks based on the contact information of emails, the social network analysis model evaluates the proposed social-based metrics to characterize the nodes, i.e. the email senders or email recipients that exist in the social networks. The classification model adopts four machine learning-based classifiers, i.e. MultilayerPerceptron, Decorate, LibSVM, and NaiveBayes, to classify emails into business and personal emails. The prioritization model dispatches the emails classified by the classification model into queues of different priorities. The HI queue sends emails with a high priority and is allocated a higher bandwidth and more resources, while the LO queue is not.

Fig. 2. System architecture of email classification in an enterprise.

3.3. Problem formulation

This work describes how to construct social networks in order to explain the social-based metrics. Several symbols are defined as follows. Let $M = \{m_1, m_2, \ldots, m_n\}$ be a set of outgoing emails from employees collected from an enterprise email server, where $n$ denotes the total number of outgoing emails from employees. First, $M$ is divided into two email sets by domain experts: $M_b$ denotes the business email set, and $M_p$ denotes the personal email set. Next, the sender set $S = \{s_1, s_2, \ldots, s_n\}$ is extracted from the sender field (From:) of each corresponding email $m_i$ of $M$. Additionally, a recipient set $r_i = \{r_{i1}, r_{i2}, \ldots, r_{im}\}$ is formed by all distinct recipients extracted from the recipient fields (To:, Cc:, and Bcc:) of the corresponding email $m_i$ of $M$. Finally, let $R = \{r_1, r_2, \ldots, r_n\}$ represent the set formed by the recipient sets $r_i$. Moreover, a meta-pattern is introduced to specify the recipient set $r_i$. The meta-pattern is written as follows:

\[ r_i : \langle ME_i \rangle \]

where $ME_i$ can be viewed as a recipient set whose members are classified by whether a recipient is an employee of the enterprise. A more detailed definition of $ME_i$ is provided below.

Definition 1. From an enterprise perspective, a recipient set $ME_i$ comprises two subsets, i.e. an employee subset $E_i$ and a nonemployee subset $NE_i$, such that $ME_i = E_i \cup NE_i$, where $E_i \cap NE_i = \emptyset$ and $|ME_i| = |r_i| = |E_i| + |NE_i| = m$.

3.4. Characteristics of an email social network

This work also constructs both business and personal social networks from the corresponding business email set $M_b$ and personal email set $M_p$ for email classification. These two social networks are constructed for the following reasons:

• According to our observation, the email social network of an enterprise consists of work-related and work-unrelated social relationships of employees during working hours. Owing to enterprise policies aimed at, for example, increasing employee productivity, many enterprises restrict access to social websites such as Twitter and Facebook to prevent employees from surfing the web.
• As employees are legitimate users of an email system, classifying emails from the limited contact information in email headers, without peeking into the details of email contents, is extremely difficult.
• Email contents change over time, and content-based classification methods must expend considerable processing time to analyze each email.

Next, the formal definition of the social networks is provided below.

Definition 2. Business and personal email social networks are represented using two directed graphs $G_b = (V_b, E_b)$ and $G_p = (V_p, E_p)$, respectively, where $V_b$ and $V_p$ denote nonempty sets of vertices that consist of email senders $s_i$ and email recipients $r_{ij}$. Additionally, $V_b$ and $V_p$ are extracted from the corresponding business email set $M_b$ and personal email set $M_p$. Moreover, $E_b$ and $E_p$ are two sets of edges consisting of ordered pairs of elements in the corresponding $V_b$ and $V_p$. Each element of $E_b$ and $E_p$ represents an edge that corresponds to an email sending event in which $m_i$ is sent from $s_i$ to $r_{ij}$.

4. Social features and metrics

Five social-based metrics are designed to distinguish between business and personal social networks in terms of differences in the social activities of a node, as follows.

4.1. Domain name divergence

As is widely assumed (Chirita et al., 2005), if A sends emails to B, then A can be regarded as trusting B; likewise, if A receives emails from B, then B can be regarded as trusting A. However, in this work, this assumption alone is insufficient to distinguish business emails from personal ones, since employees are trusted and legitimate users of an enterprise email system. To overcome this limitation, this work also considers the domain names of email recipient addresses to support email classification, and designs the following metric to estimate the divergence of domain names in an outgoing email:

\[ \mathrm{DomainNameDiv}(m_i) = \frac{\mathrm{domain}(m_i)}{\mathrm{recipient}(m_i)} \]

where the function $\mathrm{domain}(m_i)$ returns the number of distinct domain names among the email recipient addresses, and the function $\mathrm{recipient}(m_i)$ returns the number of distinct recipients of the email.

4.2. In-degree centrality of nonemployee email recipients

Our observations suggest that employees may make friends with business partners and send emails to them, which may be either work related or work unrelated. To differentiate the two cases, in addition to focusing on the number of receiving events of each nonemployee recipient, this work designs a metric to evaluate how business and personal social networks differ in terms of in-degree centrality for each email recipient:

\[ \mathrm{InDegreeCent}(m_i) = \log \frac{\sum_{\forall r_{ij} \in (NE_i \wedge V_b)} \deg^{-}(r_{ij}) + 1}{\sum_{\forall r_{ij} \in (NE_i \wedge V_p)} \deg^{-}(r_{ij}) + 1} \]

where the function $\deg^{-}(r_{ij})$ calculates the in-degree of a vertex $r_{ij}$ in a social network.

4.3. Occurrence ratio of email recipients

The following metric is designed to compare the numbers of email recipients that appear in the business and personal social networks:

\[ \mathrm{OR}_{\mathrm{recipient}}(m_i) = \log \frac{|V_b \cap r_i| + 1}{|V_p \cap r_i| + 1} \]
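The graph construction of Definition 2 and the first three metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names and sample addresses are hypothetical, every sending event is assumed to add one incoming edge to the recipient, the numerator and denominator of the in-degree metric are assumed to use in-degrees from the business and personal networks respectively, and the natural logarithm is assumed because the paper does not state a base.

```python
import math
from collections import Counter


def build_network(emails):
    """Build one social network from (sender, recipients) pairs.

    Returns the vertex set and a Counter of in-degrees, where each
    sending event s_i -> r_ij adds one incoming edge to r_ij (assumption).
    """
    vertices, in_degree = set(), Counter()
    for sender, recipients in emails:
        vertices.add(sender)
        for r in set(recipients):
            vertices.add(r)
            in_degree[r] += 1
    return vertices, in_degree


def domain_name_div(recipients):
    """Distinct recipient domains divided by distinct recipients (non-empty input)."""
    distinct = set(recipients)
    domains = {r.rsplit("@", 1)[-1].lower() for r in distinct}
    return len(domains) / len(distinct)


def in_degree_cent(nonemployees, business, personal):
    """Log ratio of summed in-degrees of nonemployee recipients, with +1 smoothing."""
    v_b, deg_b = business
    v_p, deg_p = personal
    num = sum(deg_b[r] for r in nonemployees if r in v_b) + 1
    den = sum(deg_p[r] for r in nonemployees if r in v_p) + 1
    return math.log(num / den)


def or_recipient(recipients, v_b, v_p):
    """Log ratio of recipients found in the business vs. personal vertex sets."""
    r_i = set(recipients)
    return math.log((len(v_b & r_i) + 1) / (len(v_p & r_i) + 1))


# Hypothetical labeled history used to build the two networks.
business = build_network([
    ("alice@corp.example", ["bob@corp.example", "carol@partner.example"]),
    ("bob@corp.example", ["carol@partner.example"]),
])
personal = build_network([("alice@corp.example", ["dave@mail.example"])])

# A new outgoing email whose recipients are all nonemployees.
recipients = ["carol@partner.example", "erin@partner.example", "dave@mail.example"]
dd = domain_name_div(recipients)
ic = in_degree_cent(set(recipients), business, personal)
orr = or_recipient(recipients, business[0], personal[0])
```

In this toy case a positive `ic` indicates that the nonemployee recipients received more email in the business network than in the personal one, while `orr` is zero because one recipient appears in each network.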

4.4. Occurrence ratio of nonemployee recipients

An outgoing email with more nonemployee recipients is generally more likely to be regarded as a personal email, but this tendency alone is insufficient to determine whether or not it is a personal email. To overcome this limitation, this work focuses not only on the occurrence ratio of all email recipients in the social networks but also on that of nonemployee email recipients:

\[ \mathrm{OR}_{\mathrm{nonemployee}}(m_i) = \log \frac{|V_b \cap NE_i| + 1}{|V_p \cap NE_i| + 1} \]
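A short sketch of this metric follows, combined with the employee/nonemployee split of Definition 1. The paper does not say how employees are identified, so the rule used here (an address is an employee if its domain belongs to a configured set of enterprise domains) is an assumption, as are the helper names, the sample addresses, and the use of the natural logarithm.

```python
import math


def split_recipients(recipients, enterprise_domains):
    """Split distinct recipients into (employees E_i, nonemployees NE_i).

    Assumption: membership is decided purely by the address domain.
    """
    distinct = set(recipients)
    employees = {
        r for r in distinct
        if r.rsplit("@", 1)[-1].lower() in enterprise_domains
    }
    return employees, distinct - employees


def or_nonemployee(recipients, v_b, v_p, enterprise_domains):
    """Log ratio of nonemployee recipients seen in the business vs. personal network."""
    _, nonemployees = split_recipients(recipients, enterprise_domains)
    return math.log((len(v_b & nonemployees) + 1) / (len(v_p & nonemployees) + 1))


# Hypothetical vertex sets of the two networks and one outgoing email.
v_b = {"alice@corp.example", "carol@partner.example", "erin@partner.example"}
v_p = {"alice@corp.example", "dave@mail.example"}
recipients = ["bob@corp.example", "carol@partner.example", "erin@partner.example"]

employees, nonemployees = split_recipients(recipients, {"corp.example"})
on = or_nonemployee(recipients, v_b, v_p, {"corp.example"})
```

The +1 terms keep the ratio defined when a recipient set has no overlap with one of the networks, as happens with the personal network in this example.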

4.5. Cohesion of an email sender

The cohesion describes the connectivity among neighbors of an email sender:

\[ \mathrm{Cohesion}(m_i) = \log \frac{\sum_{\forall r_{ij} \in (V_b \wedge NE_i)} \deg^{-}(r_{ij})}{C^{Nbr_b(s_i)}_{2} + Nbr_b(s_i) \times |NE_i \cap V_b|} - \log \frac{\sum_{\forall r_{ij} \in (V_p \wedge NE_i)} \deg^{-}(r_{ij})}{C^{Nbr_p(s_i)}_{2} + Nbr_p(s_i) \times |NE_i \cap V_p|} \]

where the function $Nbr_b(s_i)$ returns the number of neighbors of the email sender in the business social network, and the function $Nbr_p(s_i)$ returns the number of neighbors of the email sender in the personal social network.


5. Experiments

5.1. Email dataset collection and preprocessing

Outgoing enterprise emails spanning ten days are collected from the outgoing email server of a Taiwanese semiconductor company. In total, 8491 emails are collected and then labeled by domain experts for the experiments: 2897 emails are labeled as business emails, and the remaining 5594 emails are labeled as personal. After the emails are collected and labeled, the sender field (From:) and three recipient fields (To:, Cc:, and Bcc:) are extracted from the first 4000 emails, sorted in ascending order of sending time, to construct both business and personal social networks. Additionally, the domain names of each recipient's email address are parsed to characterize the domain divergence of each email sending event.

5.2. Feature groups

The characteristics of social features are examined based on the proposed social-based metrics and the constructed business and personal social networks. The obtained social features and their different combinations are then applied as input to machine learning-based classifiers for discriminating business emails from personal emails. For readability, the proposed social features are abbreviated as follows:

• DD: domain name divergence of email recipient addresses.
• IC: in-degree centrality of nonemployee email recipients.
• ON: occurrence ratio of nonemployee recipients of email.
• OR: occurrence ratio of email recipients.
• CS: cohesion of an email sender.
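To make the idea of feature combinations concrete, the sketch below enumerates feature groups and builds the vector for one group. The paper does not list which combinations were evaluated, so enumerating every non-empty subset of the five abbreviations is an assumption; the function names and the metric values are hypothetical.

```python
from itertools import combinations

FEATURES = ("DD", "IC", "ON", "OR", "CS")


def feature_groups(features=FEATURES):
    """Enumerate every non-empty combination of features (assumed search space)."""
    groups = []
    for k in range(1, len(features) + 1):
        groups.extend(combinations(features, k))
    return groups


def to_vector(metric_values, group):
    """Build the input vector for one feature group, one dimension per feature."""
    return [metric_values[name] for name in group]


groups = feature_groups()

# Made-up metric values for a single outgoing email.
values = {"DD": 0.5, "IC": 0.41, "ON": 0.0, "OR": 0.69, "CS": -0.2}
vector = to_vector(values, ("IC", "OR"))
```

Each group yields a vector whose length equals the number of features in the group, which is what a classifier would receive for every email.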

Section 4 defines in detail these features. After the representation of each social feature is abbreviated, an n-dimensional vector is used to represent the different combinations of the proposed social features, e.g., DDþICþ ONþORþCS, ICþ OR, and IC, for each outgoing email, where n denotes the total number of features. Additionally, the vector is taken as an input of the machine learning-based classifiers for email classification. 5.3. Approaches 5.3.1. Email classification Effectiveness of the proposed social-based metrics is examined by using four machine learning-based classifiers provided by the WEKA workbench (EL-Manzalawy and Honavar, 2005; Hall et al., 2009) to determine whether an email is business-related or not. The classifiers are NaiveBayes, LibSVM, MultilayerPerception, and Decorate, respectively. In practice, because the email dataset is insufficiently large, 10fold cross-validation is performed to increase the reliability of email classification and achieve statistically precise results. 10-fold crossvalidation involves using all observations for both training and validation. Moreover, each observation is made for validation exactly once. Finally, more objective experimental results are obtained. 5.3.2. Email prioritization To demonstrate the importance of classifying outgoing emails for an enterprise, this work adopts some portions of a scheduling model developed by Twining et al. (2004) to simulate how the original email sending process differs from the modified one. The original email sending model contains only one outgoing queue for queuing outgoing emails and then sends them sequentially. Compared with the original model, the modified model comprises two outgoing email queues: HI (high priority) queue and LO (low priority) queue. Additionally, business emails are placed into HI queue, whereas

personal emails are placed into LO queue. Next, a scheduler is implemented using the absolute priority scheduling algorithm (Nakamura, 2000–2003). Notably, the scheduler always monitors HI queue, obtains and sends emails that are existed in HI queue. While the HI queue is empty, the scheduler switches to service the LO queue. 5.4. Results All experiments in this work are performed on a machine with Intel Pentium IV 2.8 GHz CPU, 1 GB RAM, and run on Microsoft Windows XP Professional. Four sizes of experimental datasets are extracted from the collected outgoing emails, i.e. 9D9¼600, 9D9¼1200, 9D9¼2400, and 9D9¼3000, respectively, where 9D9 denotes the size of the experimental dataset. 5.4.1. Experimental results of email classification Classification accuracy results are computed and compared with different classifiers and the different combinations of the proposed social features, as shown in Figs. 3–6. According to those results, the accuracy rates of all of the proposed social features evaluated by all classifiers in different sizes of datasets exceed 0.9. According to Figs. 3–6, the IC has the highest accuracy rate among other individual features when it is evaluated by all classifiers. The F-measure calculation results shown in Figs. 3–6 also indicate that the discriminate power of using the IC as input of classifiers is higher than that of other individual features. Moreover, with the growing size of dataset, experimental results indicate that the proposed features still enable all classifiers to distinguish business emails from personal ones. According to Figs. 3–6, for most of the classifiers, the accuracy rate and F-measure scores approximate to 0.95. This observation reveals that the email classification relying on noncontent-based features should be able to classify emails effectively. To obtain more precise results, this work also runs four classifiers to evaluate the effectiveness of the proposed method. Experimental results of Figs. 
3–6 indicate that the NaiveBayes classifier has a lower accuracy rate and F-measure score for email classification, while the performance of other classifiers yields a higher accuracy rate and F-measure score. However, the classification accuracy of each classifier still exceeds 0.9 in different sized experimental datasets. Moreover, the performance of using the IC as an input of the classifiers is superior to that of many composite features. This observation implies that in-degree centrality of nonemployee email recipients may be an excellent discriminative feature for enterprise email classification. Additionally, according to our results, using some types of individual feature alone may be insufficient to classify emails, such as ON and OR, despite the fact that IC is a significant individual social feature. However, after some combinations of different individual features are adopted as composite features, the results are superior to those of an individual feature. Overall, the excellent results of a high accuracy rate and F-measure score facilitate the classification of emails by applying social features; in addition, the experimental results are effective as well. Based on our observations, we believe that social-based features should be a promising alternative to solve similar email classification problems without infringing upon email contents. 5.5. Experimental results of email prioritization The original and the modified email scheduling models are compared with respect to the average maximum waiting time of emails sent. The scheduling algorithm scans the outgoing email queues at different time schedule intervals and, then, sends emails in different bandwidths. Fig. 7 provides the list of four bandwidths and twelve schedule time intervals used for evaluating the delays of sent emails. According

M.-F. Wang et al. / Journal of Network and Computer Applications 35 (2012) 770–777

775

Accuracy Rate 1 0.98 0.96 0.94 0.92 0.9 0.88 0.86 0.84 0.82 0.8

1 0.98 0.96 0.94 0.92 0.9 0.88 0.86 0.84 0.82 0.8

Decorate

LibSVM

Multilayer Perception

Naive Bayes

Decorate

LibSVM

F-Measure Multilayer Perception Naive Bayes

Fig. 3. Performance of email classification when 9D9 ¼600. (a) Comparison of accuracy rates for different classifiers and combinations of features when 9D9¼ 600. (b) Comparison of F-measures for different classifiers and combinations of features when 9D9¼ 600.

[Fig. 4 panels: (a) Accuracy Rate and (b) F-Measure, vertical axis from 0.8 to 1; series: Decorate, LibSVM, Multilayer Perceptron, Naive Bayes.]

Fig. 4. Performance of email classification when |D| = 1200. (a) Comparison of accuracy rates for different classifiers and combinations of features when |D| = 1200. (b) Comparison of F-measures for different classifiers and combinations of features when |D| = 1200.


[Fig. 5 panels: (a) Accuracy Rate and (b) F-Measure, vertical axis from 0.8 to 1; series: Decorate, LibSVM, Multilayer Perceptron, Naive Bayes.]

Fig. 5. Performance of email classification when |D| = 2400. (a) Comparison of accuracy rates for different classifiers and combinations of features when |D| = 2400. (b) Comparison of F-measures for different classifiers and combinations of features when |D| = 2400.

[Fig. 6 panels: (a) Accuracy Rate and (b) F-Measure, vertical axis from 0.8 to 1; series: Decorate, LibSVM, Multilayer Perceptron, Naive Bayes.]

Fig. 6. Performance of email classification when |D| = 3000. (a) Comparison of accuracy rates for different classifiers and combinations of features when |D| = 3000. (b) Comparison of F-measures for different classifiers and combinations of features when |D| = 3000.


Fig. 7. Performance of email prioritization among different schedule time intervals and bandwidths.

to this figure, the average maximum waiting times of the modified scheduling model are shorter than those of the original scheduling model. Moreover, the difference between the two models becomes more pronounced as the schedule time interval increases. The simulation results demonstrate that, by using the email classification results, the proposed method can avoid undesirable delays of business emails.
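The comparison described in this subsection can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation model: it assumes that the original model sends queued emails in arrival order, that the modified model sends emails labeled as business first, and that each scan may send at most `bandwidth * interval` bytes. The names `Email`, `simulate`, and `workload`, as well as the toy numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Email:
    arrival: float     # time (s) at which the email enters the outgoing queue
    size: int          # message size in bytes
    is_business: bool  # classifier label, assumed to be given


def simulate(emails: List[Email], interval: float, bandwidth: float,
             prioritize: bool) -> Dict[str, float]:
    """Return the maximum waiting time (s) of business and personal emails."""
    budget_per_scan = bandwidth * interval
    if any(e.size > budget_per_scan for e in emails):
        raise ValueError("an email exceeds the byte budget of one scan")

    pending = sorted(emails, key=lambda e: e.arrival)  # stable: ties keep input order
    queue: List[Email] = []
    max_wait = {True: 0.0, False: 0.0}
    next_idx, scan = 0, 0
    while next_idx < len(pending) or queue:
        scan += 1
        now = scan * interval
        # Move every email that has arrived by `now` into the outgoing queue.
        while next_idx < len(pending) and pending[next_idx].arrival <= now:
            queue.append(pending[next_idx])
            next_idx += 1
        if prioritize:
            # Modified model (assumed): business emails first, FIFO within a class.
            queue.sort(key=lambda e: (not e.is_business, e.arrival))
        budget = budget_per_scan
        # Send from the head of the queue until the next email no longer fits.
        while queue and queue[0].size <= budget:
            sent = queue.pop(0)
            budget -= sent.size
            wait = now - sent.arrival
            max_wait[sent.is_business] = max(max_wait[sent.is_business], wait)
    return {"business": max_wait[True], "personal": max_wait[False]}


# Toy workload: four personal emails queued ahead of one business email.
workload = [Email(0.0, 100, False)] * 4 + [Email(0.0, 100, True)]
original = simulate(workload, interval=10.0, bandwidth=10.0, prioritize=False)
modified = simulate(workload, interval=10.0, bandwidth=10.0, prioritize=True)
print(original, modified)
```

In this toy workload only one email fits into each scan, so the business email waits 50 s under the arrival-order policy and 10 s when business emails are sent first, while the longest personal wait grows from 40 s to 50 s. Averaging such maxima over many runs, bandwidths, and intervals would be one way to approximate the metric reported in Fig. 7.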

6. Conclusions

This paper presents a novel email classification method for enterprises that analyzes the social behaviors of email users. To our knowledge, this work is the first to develop a noncontent-based method that respects the privacy rights of employees by not inspecting email contents. In addition to protecting an enterprise from unnecessary disputes between employers and employees, the proposed method can reduce the effort that email managers spend processing needless messages. The social-based features of the proposed method help to classify emails according to whether they reflect business activities or personal interests. Moreover, to compare the waiting times of business emails under the modified and original email frameworks, a simulation model is constructed for email scheduling and prioritization. Experimental results indicate that the proposed method benefits enterprise business communication by preventing an email system from becoming overburdened by dispensable personal social activities. The proposed method can also increase the utilization rate of network bandwidth. We recommend that future research integrate state-of-the-art email prioritization algorithms with the proposed classification method to balance the load of an email server. Moreover, future research should design a progressive framework to cope with the evolution of email social networks.

Acknowledgments

The authors would like to thank the National Science Council, Taiwan, for financially supporting this research under Contract no. NSC 99-2219-E-002-021. This article is an extended version of the paper that will be published in ASONAM 2011 (Wang et al., 2011).

References

AMA Press Room. E-mail rules, policies and practices survey. Available from: http://www.epolicyinstitute.com/survey/survey.pdf; May 14, 2003.


AMA Press Room. Electronic monitoring & surveillance survey. American Management Association and The ePolicy Institute; 2007. [February 2008].
AMI. No privacy: legal issues in e-mail. American Media Incorporated; 2001.
Chirita P-A, Diederich J, Nejdl W. MailRank: using ranking for spam detection. In: Proceedings of the ACM CIKM conference on information and knowledge management; 2005. p. 373–80.
EL-Manzalawy Y, Honavar V. WLSVM: integrating LibSVM into Weka environment. Available from: http://www.cs.iastate.edu/~yasser/wlsvm; 2005.
Fisher D, Brush AJ, Hogan B, Smith M, Jacobs A. Using social metadata in email triage: lessons from the field. In: Proceedings of the 2007 conference on human–computer interaction; 2007. p. 13–22.
Flynn N. The e-policy handbook: rules and best practices to safely manage your company’s e-mail, blogs, social networking, and other electronic communication tools. 2nd ed. New York: Amacom Pr; 2009.
Flynn N, Flynn T. Writing effective e-mail: improving your electronic communication. New York: Course Technology Ptr Pr; 2003.
Flynn N, Kahn R. E-mail rules: a business guide to managing policies, security, and legal issues for e-mail and digital communication. New York: Amacom Pr; 2003.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009;11(1):10–8.
ILTA. A purist RM perspective: e-mail needs maintaining, not archiving. White paper. Coan TJ, Ostrander AM, International Legal Technology Association, Austin, TX; July 2009.
Kesan JP. Cyber-working or cyber-shirking?: a first principles examination on electronic privacy in the workspace. Florida Law Review 2002;54:289–332.
Li Z, Shen H. SOAP: a social network aided personalized and effective spam filter to clean your e-mail box. In: Proceedings of the IEEE international conference on computer communications; 2011. p. 1835–43.
Nakamura G. The GNU C library: absolute priority. Available from: http://www.imodulo.com/gnu/glibc/Absolute-Priority.html; 2000–2003.
Neustaedter C, Brush AJ, Smith M. Beyond “from” and “received”: exploring the dynamics of email triage. In: Proceedings of the 2005 conference on human factors in computing systems; 2005a. p. 1977–80.
Neustaedter C, Brush AJ, Smith M, Fisher D. The social network and relationship finder: social sorting for email triage. In: Proceedings of the second conference on email and anti-spam; 2005b.
Okolica JS, Peterson GL, Mills RF. Using PLSI-U to detect insider threats by datamining email. International Journal of Security and Networks 2008;3(2):114–21.
Pacilla FJ. Craft an e-mail policy for your employees. Available from: http://www.physiciansnews.com/law/308pacilla.html; 2008.
Radicati Group. Email statistics report, 2011–2015. Available from: http://www.radicati.com/wp/wp-content/uploads/2011/05/Email-Statistics-Report-2011-2015-Executive-Summary.pdf; 2011.
Seow BB, Chennupati KR, Foo S. Management of emails as official records in Singapore: a case study. Records Management Journal 2005;15(1):43–57.
Shetty J, Adibi J. The Enron dataset database schema and brief statistical report. Available from: http://www.isi.edu/~adibi/Enron/Enron_Dataset_Report.pdf [retrieved November 4, 2004].
Sipior JC, Ward BT. The ethical and legal quandary of email privacy. Communications of the ACM 1995;38(12):48–54.
Stolfo SJ, Hershkop S, Hu C-W, Li W-J, Nimeskern O, Wang K. Behavior-based modeling and its application to email analysis. ACM Transactions on Internet Technology 2006;6(2).
Stolfo SJ, Hu C-W, Li W-J. Combining behavior models to secure email systems. Columbia University Technical report; April 2003.
Tseng C-Y, Chen M-S. Incremental SVM model for spam detection on dynamic email social networks. In: Proceedings of the IEEE international conference on computational science and engineering; 2009.
Tseng C-Y, Huang J-W, Chen M-S. ProMail: using progressive email social network for spam detection. In: Proceedings of the Pan-Asia conference on knowledge discovery and data mining; 2007. p. 833–40.
Twining D, Williamson MM, Mowbray M, Rahmouni M. Email prioritization: reducing delays on legitimate mail caused by junk mail. In: Proceedings of the USENIX annual technical conference; 2004. p. 45–58.
Tyler JR, Wilkinson DM, Huberman BA. Email as spectroscopy: automated discovery of community structure within organizations. The Information Society 2005;21(2):133–41.
Vandermeer J. Seven highly successful habits of enterprise email managers: ensuring that your employees’ email usage is not putting your company at risk. Information Security Journal: A Global Perspective 2006;15(6):64–75.
Wang M-F, Tsai M-F, Jheng S-L, Tang C-H. Enterprise email classification based on social network features. In: Proceedings of the international conference on advances in social networks analysis and mining; 2011. p. 532–6.
Weisband SP, Reinig BA. Managing user perceptions of email privacy. Communications of the ACM 1995;38(12):40–7.
Yang Y, Yoo S, Lin F, Moon I-C. Personalized email prioritization based on content and social network analysis. IEEE Intelligent Systems 2010;25(4):12–8.
Yelupula K, Ramaswamy S. Social network analysis for email classification. In: Proceedings of the 46th annual southeast regional conference; 2008. p. 469–74.
Yoo S, Yang Y, Lin F, Moon I-C. Mining social networks for personalized email prioritization. In: Proceedings of the 15th conference on knowledge discovery and data mining; 2009. p. 967–76.