Association Rules

Suppose that a large database exists which consists of the SKUs of those items purchased in customer visits to grocery stores.  Each transaction may consist of only at most a few dozen SKUs out of a possible set of hundreds of thousands or more.  We may ask if the purchase of certain products is predictive of other purchases.  For example, we may be able to determine that the purchase of coffee (in this case, we lump all of the SKUs for different brands and packages into one) may be predictive of the purchase of cream.  Such a problem is frequently termed “basket analysis.”

The analysis of data for the creation of association rules applies to many business and scientific problems and has been the subject of lively research since the early 1990s.  One successful application of the association rule mining algorithm that I performed for is the discovery of patterns of job titles as found within a large corpus of resumes.  As might be expected, there are many “career paths” that are shared where one job may lead to others more or less predictably.  Discovering and summarizing such patterns from the data led to both well known career paths and a number of surprises.

There are several efficient algorithms (and many variations) that have found broad application.  The association rule discovery problem may be formalized as follows.  An association rule is an expression XY , where X and Y are sets of items (“itemsets”).  The meaning of such rules is interpreted as: Given a database D of transactions, where each transaction TD is an itemset, XY  expresses that whenever T contains X, that T “probably” contains Y.  The rule probability or “confidence” expresses the percentage of transactions containing Y that also contain X as compared to all transactions containing X.  Thus, we may say, that a customer who buys products x1 and x2 may also buy product y with confidence P.

In addition to confidence, we also speak of the “support” for a particular set or for a rule in a database.  The support of a set is the fraction of transactions in D containing that set.  The support of a rule XY is defined as the support of XY.  We can see that the confidence for a rule may be expressed in terms of support:  conf(XY)  = supp(XY) / supp(X).

Differences in algorithmic approaches for mining association rules are largely based on differing strategies for traversing the search space of sets of items.  One of the most popular algorithms (known as “apriori”) traverses the space in a breadth-first fashion, generating candidate itemsets from known itemsets with at least some minimum support, and then performing counts of those candidate itemsets from the database.  The generation of candidates is based on the simple observation that if a given itemset X has support S, then any additional items added to X will never increase the support.

In words, the algorithm works as follows. There are two primary stages: the discovery of all itemsets in D that have some minimum support (“frequent itemsets”), then the analysis of those itemsets for rules that have some minimum confidence. The bulk of the processing and complexity lies in the first stage. The second stage for rule discovery is straightforward – for each frequent itemset x, each subset a is considered for a rule of the form a ⇒ (xa) . If the rule has minimum confidence then it is output. Variations of this strategy exist for efficiency.

Finding frequent itemsets is based on the creation of candidate itemsets which may be tested for minimum support by counting the number of occurrences within D. The key trick for efficiency is to keep all item identifiers in lexicographic sorted order.  Then we can create candidate itemsets that have one more element than those of a previous breadth-first pass by joining existing previous pass frequent itemsets with the condition that a new candidate set x has its first k-2 elements in common with joined itemsets y and z, its k-1 element is the last element of y, its kth element is any element from z, and x must be in lexicographic order. Finally, each subset of length k-1 of the new candidate must be a frequent itemset.  This has the effect of pruning the candidates to minimize testing time (say, in the form of database scans) to identifiy all of the k element itemsets.

Refinements to the association rule discovery problem include methods for creating hierarchical organizations of rules and the creation of rules based on cross-classifications.  For example, taxonomies of job titles and job skills can be used as the basis for mining both career trajectories and qualifications through associations and predictive modelling using enhanced hierarchical association rule mining. Refinements to the algorithms focus on alternative search and counting strategies and on the storage of the itemsets.