Hoyle paper 019-31
SUGI 31
Text Mining SAS-L Topics
Larry Hoyle, Policy Research Institute, University of Kansas

SAS-L topics
• Read each weekly topic list from
http://www.listserv.uga.edu/archives/sas-l.html
• Parse topic, HTMLdecode
• Strip “Re: ”
/* strip variations of re: (case-insensitive) */
topicRE = prxparse('/^ *re *: *(.*)/i');
if prxmatch(topicRE, topic) then do;
   topic = prxposn(topicRE, 1, topic);
end;
• Proc SQL to aggregate topic counts across weeks
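
The steps above can be sketched end to end as follows. This is a minimal illustration, not the code used for the talk: the dataset names (rawtopics, topics, threads), the variable names (week, topic, nmsgs), and the three sample rows are all assumptions made up for the example, including the idea that each weekly topic line carries a message count.

```sas
/* Hypothetical input: one row per topic line per weekly page.    */
/* Names and sample rows are assumptions, not from the slides.    */
data rawtopics;
   infile datalines dsd truncover;
   length week $5 topic $200 nmsgs 8;
   input week $ topic $ nmsgs;
   datalines;
0501a,"Re: PROC SQL &amp; macro variables",4
0501b,"RE : PROC SQL &amp; macro variables",3
0501b,"Hash objects in the data step",2
;
run;

data topics;
   set rawtopics;
   topic = htmldecode(topic);                 /* decode HTML entities    */
   topicRE = prxparse('/^ *re *: *(.*)/i');   /* constant: compiled once */
   if prxmatch(topicRE, topic) then
      topic = prxposn(topicRE, 1, topic);     /* keep text after "Re:"   */
   drop topicRE;
run;

/* Merge the same topic across weeks and total its messages. */
proc sql;
   create table threads as
   select topic,
          count(distinct week) as nweeks,
          sum(nmsgs)           as messages
   from topics
   group by topic
   order by messages desc;
quit;
```

With these sample rows, the two "Re:" variants should collapse into a single thread spanning both weeks.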

SAS-L 2005
• 35324 thread/topic lines in the html files
• 7081 threads after merging across weeks and a little cleaning

SAS-L Top Threads in Number of Messages

Text Miner on the SAS-L topics

Largest clusters

Smaller Clusters

Message Content

Web scraping with tmfilter
options noxwait;
%macro aweek(week=0501a);
   x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week";
   x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredposts\&week";
   libname sugi31 'C:\ddrive\projects\sugs\sugi31\SASLBOF\datasets';
   %tmfilter(dataset=sugi31.SL&week.,
             dir=C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week,
             destdir=C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredPosts\&week,
             URL=http://listserv.uga.edu/cgi-bin/wa?A1=ind&week.%NRSTR(&L=sas-l),
             depth=1,
             links=sugi31.SL&week.L,
             norestrict=' ',
             numchars=2000)
%mend aweek;
%aweek(week=0501a);
%aweek(week=0501b);
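
Rather than writing one %aweek call per week by hand, the calls could be generated from a list. The sketch below assumes %aweek is already defined as above; the macro name allweeks and its weeks= parameter are hypothetical helpers introduced only for this illustration.

```sas
/* Hypothetical helper: call %aweek once for each blank-separated week id. */
%macro allweeks(weeks=);
   %local i wk;
   %let i = 1;
   %let wk = %scan(&weeks, &i, %str( ));
   %do %while(%length(&wk) > 0);
      %aweek(week=&wk)
      %let i = %eval(&i + 1);
      %let wk = %scan(&weeks, &i, %str( ));
   %end;
%mend allweeks;

/* Same two weeks as the explicit calls above. */
%allweeks(weeks=0501a 0501b)
```

Each pass scrapes one weekly index page, so extending the list is the only change needed to cover more of the archive.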

Parse date and sender

Using a 10% sample of message text
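
The slides do not say how the 10% sample was drawn. One possible way, shown purely as an assumption, is a simple random sample with PROC SURVEYSELECT (which requires SAS/STAT). The dataset work.posts, its variables, and the seed are placeholders invented for this sketch.

```sas
/* Stand-in for the parsed messages: 1,000 dummy rows (hypothetical). */
data work.posts;
   length msgtext $40;
   do msgid = 1 to 1000;
      msgtext = cats('message body ', msgid);
      output;
   end;
run;

/* Simple random sample of about a tenth of the messages. */
proc surveyselect data=work.posts out=work.posts10
                  method=srs samprate=0.10 seed=31;
run;
```

Fixing the seed makes the sample repeatable, which helps when comparing Text Miner runs on the same subset.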

Filter out too common terms, listserv