Learning to read urls
Finding the word boundaries in multi-word domain names with python and sklearn.
Calvin Giles
Who am I?
Data Scientist at Adthena
PyData Co-Organiser
Physicist
Like to solve problems pragmatically
The Problem
Given a domain name:
'powerwasherchicago.com' 'catholiccommentaryonsacredscripture.com'
Find the concatenated sentence:
'power washer chicago (.com)' 'catholic commentary on sacred scripture (.com)'
Why is this useful?
How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'?
How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'?
Domains resolved into words can be compared on a semantic level, not simply as strings.
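To make this concrete, here is a minimal sketch (my illustration, not the talk's code; 'pressurewashernyc' is an invented comparison domain): a token-set Jaccard similarity is zero on the raw domain strings, but positive once the words are exposed.

```python
# Illustration (assumed example, not from the talk): once domains are split
# into words, a simple token-set Jaccard similarity captures shared meaning
# that comparing the raw strings misses.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Raw strings share no token at all...
raw = jaccard(['powerwasherchicago'], ['pressurewashernyc'])
# ...but the resolved sentences share the word 'washer'.
split = jaccard('power washer chicago'.split(), 'pressure washer nyc'.split())
print(raw, split)  # 0.0 vs a positive word-level score
```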
Primary use case
Given 500 domains in a market, what are the themes?
Scope of project
As part of Adthena Labs, our internal idea incubation, this approach was developed during a one-day hack to determine whether it could be useful to the business.
Adthena's Data
> 10 million unique domains
> 50 million unique search terms
3rd Party Data
Project Gutenberg (https://www.gutenberg.org/)
Google ngram viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
Process
1. Learn some words
2. Find where words occur in a domain name
3. Choose the most likely set of words
1. Learn some words
Build a dictionary using suitable documents.
Documents: search terms
In [2]:
import pandas, os
search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
search_terms = search_terms['SearchTerm'].dropna().str.lower()
search_terms.iloc[1000000::2000000]
Out[2]:
1000000    new 2014 mercedes benz b200 cdi
3000000    weight watchers in glynneath
5000000    property for rent in batlow nsw
7000000    us plug adaptor for uk
9000000    which features mobile is best for purchase
Name: SearchTerm, dtype: object
In [125]:
from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(corpus, min_df=0):
    vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # Require 2+ characters
    vec.fit(corpus)
    return set(vec.get_feature_names())
In [126]:
st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
dictionary_size = len(st_dictionary)
print('{} words found'.format(num_fmt(dictionary_size)))
sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]
Out[126]:
21.4k words found
['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']
We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:
In [127]:
dictionary = st_dictionary
for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
    if not fname.endswith('.txt'):
        continue
    with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
        book = pandas.Series(f.readlines())
    book = book.str.strip()
    book = book[book != '']
    book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear in at least 2 lines
    dictionary_size = len(book_dictionary)
    print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
    dictionary |= book_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))
2.11k words found in a_christmas_carol.txt
1.65k words found in alice_in_wonderland.txt
3.71k words found in huckleberry_finn.txt
4.09k words found in pride_and_predudice.txt
4.52k words found in sherlock_holmes.txt
26.4k words in dictionary
Actually, scrap that...
...and use the Google ngram viewer datasets:
In [212]:
dictionary = set()
ngram_files = [fn for fn in os.listdir(ngram_data_directory)
               if 'googlebooks' in fn and fn.endswith('_processed.csv')]
for fname in ngram_files:
    ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
    ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                    | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
    ngrams = ngrams.ngram
    ngrams = ngrams.str.lower()
    ngrams = ngrams[ngrams != '']
    ngrams_dictionary = set(ngrams)
    dictionary_size = len(ngrams_dictionary)
    print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
    dictionary |= ngrams_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))
2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"
That takes us to ~1M words!
We even get some good two-letter words to work with:
In [130]:
print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
print(sorted({w for w in dictionary if len(w) == 2}))
142 2-letter words
['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be', 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he', 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no', 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi', 'we', 'ye']
In [144]: choice(list(dictionary), size=40)
Out[144]: array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley', 'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin', 'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey', 'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer', 'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra', 'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts', 'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu', 'conoció'], dtype='<U81')
2. Find where words occur in a domain name
Find all substrings of a domain that are in our dictionary, along with their start and end indices.
In [149]:
def find_words_in_string(string, dictionary, longest_word=None):
    if longest_word is None:
        longest_word = max(len(word) for word in dictionary)
    substring_indicies = ((start, start + length)
                          for start in range(len(string))
                          for length in range(1, longest_word + 1))
    for start, end in substring_indicies:
        substring = string[start:end]
        if substring in dictionary:
            # use len(substring) in case we sliced beyond the end
            yield substring, start, start + len(substring)
In [234]:
domain = 'powerwasherchicago'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)
39
['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag', 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica', 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher', 'was', 'wash', 'washe', 'washer', 'we', 'wer']
In [235]:
domain = 'catholiccommentaryonsacredscripture'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)
101
['acr', 'acre', 'acred', 'ary', 'aryo', 'at', 'ath', 'atho', 'athol', 'atholic', 'cat', 'cath', 'catho', 'cathol', 'catholi', 'catholic', 'cco', 'ccom', 'co', 'com', 'comm', 'comme', 'commen', 'comment', 'commenta', 'commentar', 'commentary', 'cre', 'cred', 'creds', 'cri', 'crip', 'cript', 'dsc', 'dscr', 'ed', 'eds', 'en', 'ent', 'enta', 'entar', 'entary', 'hol', 'holi', 'holic', 'icc', 'icco', 'ipt', 'lic', 'me', 'men', 'ment', 'menta', 'mentar', 'mentary', 'mm', 'mme', 'mment', 'nsa', 'nsac', 'nta', 'ntar', 'ntary', 'oli', 'olic', 'omm', 'omme', 'ommen', 'omment', 'on', 'ons', 'ptu', 'pture', 're', 'red', 'reds', 'rip', 'ript', 'ryo', 'ryon', 'ryons', 'sac', 'sacr', 'sacre', 'sacred', 'scr', 'scri', 'scrip', 'script', 'scriptur', 'scripture', 'tar', 'tary', 'tho', 'thol', 'tholic', 'tur', 'ture', 'ure', 'yon', 'yons']
3. Choose the most likely set of words
A simple approach to do this:
1. Find all subsets of the set of words found
2. Determine if that subset is non-overlapping
3. Decide how likely the domain is given a particular subset
4. Decide how likely it is that the subset would occur overall
5. Determine the best subset
argmax_s P(s|d) = argmax_s P(d|s) P(s)
We need some domain name data for the next part...
In [153]:
domains = pandas.read_csv(os.path.join(data_directory, 'domains.csv'))
domains = domains['Domain'].str.lower()
domains = domains[domains.str.endswith(".com")]
domains = domains.str.replace(r"\.com$", "")
domains = domains.str.replace(r"^https?\:\/\/", "")
domains = domains.str.replace(r"^www\d?\.", "")
num_fmt(len(domains))
Out[153]: '3.8M'
In [224]: choice(domains, size=20)
Out[224]: array(['1topchannel', 'scales-chords', 'marcusmajestic', 'mylyfestart', 'bluediamondturlock', 'bedfordvisionclinic', 'justinmccain', 'miniot-online', 'chelseabarracksbarracks', 'zeroeasy', 'newlookupholstery', 'radcliffehealth', 'embracingthemundane', 'immunityassist', 'simplynostretchmarks', 'teachmetoswim', 'thetford-europe', 'charlesallenford', 'china-chargermanufacturer', 'coolbabykid'], dtype=object)
1. Find all subsets of the set of words found
There are 2^n different sentences that can be constructed from n substrings, including the empty sentence. We can get an idea how bad that will be with a sample of the data.
In [53]:
longest_word = max(len(word) for word in dictionary)  # speeds up search
def find_n_words_in_string(domain):
    return len(set(find_words_in_string(domain, dictionary, longest_word)))
In [56]:
import numpy
n_words = domains.tail(1000).apply(find_n_words_in_string)
n_words.describe().apply(num_fmt)
Out[56]:
count    1k
mean     28.3
std      15.8
min      1
25%      17
50%      26
75%      38
max      93
Name: Domain, dtype: object
In [227]: num_fmt(2**28), 2**93
Out[227]: ('268M', 9903520314283042199192993792)
So the worst case in a sample of 1000 domains is 2^93 permutations to test!
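To see how much the non-overlap constraint can prune, here is a small self-contained sketch (my illustration with an invented toy dictionary, not the talk's code): brute-force all 2^n subsets of the found words and count how many are actually non-overlapping.

```python
from itertools import combinations

# Toy illustration (assumed dictionary, not the talk's code): of the 2**n
# possible subsets of found words, only a small fraction have pairwise
# non-overlapping spans, so generating only those is a big win.
def find_words(string, dictionary):
    return [(string[i:j], i, j)
            for i in range(len(string))
            for j in range(i + 1, len(string) + 1)
            if string[i:j] in dictionary]

def non_overlapping(subset):
    spans = sorted((start, end) for _, start, end in subset)
    # each word must start at or after the previous word's end
    return all(e1 <= s2 for (_, e1), (s2, _) in zip(spans, spans[1:]))

words = find_words('powerwasher',
                   {'pow', 'power', 'owe', 'was', 'wash', 'washer', 'her', 'er'})
n = len(words)
valid = sum(1 for r in range(1, n + 1)
            for subset in combinations(words, r)
            if non_overlapping(subset))
print(2 ** n, valid)  # the valid count is far smaller than 2**n
```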
Combine steps 1 and 2
1. Find all subsets of the set of words found
2. Determine if that subset is non-overlapping
becomes:
1. Find all subsets with non-overlapping words
2. Do nothing :-)
3.1 Find all subsets with non-overlapping words
Build a tree of subsets of non-overlapping words by sorting the words by their start index.
...and only return the "best" few cases anyway
It seems intuitive that sentences that match more of the domain are better. This is not infallible, but we can achieve a significant speed-up if we only consider sentences at least half as long as the best match.
In practice, this does not appear to have any impact on the results but prevents an explosion of sentences with particularly long domains.
A little more code...

In [147]:
def find_sentences(string, words, part_sentence, sentences, threshold=0.0,
                   current_idx=0, current_score=0, best_score=0):
    """
    Return sentences made of words that are common substrings of `string`.
    `words` MUST be ordered by start index or the results will be wrong!
    """
    current_threshold = int(best_score * threshold)
    if ((current_idx >= len(string))
            or current_score + len(string) - current_idx < current_threshold):
        return sentences, best_score
    for i, (word, start_idx, end_idx) in enumerate(words):
        if current_idx > start_idx:
            continue
        new_score = current_score + len(word)
        best_score = max(best_score, new_score)
        new_part_sentence = part_sentence + [word]
        if new_score + len(string) - end_idx >= current_threshold:
            sentences.append((new_part_sentence, new_score))
            sentences, best_score = find_sentences(string=string,
                                                   words=words[i+1:],
                                                   part_sentence=new_part_sentence,
                                                   sentences=sentences,
                                                   threshold=threshold,
                                                   current_idx=end_idx,
                                                   current_score=new_score,
                                                   best_score=best_score)
    return sentences, best_score
Add a wrapper
In [148]:
def get_sentences(domain, thresh=0.95):
    words = set(find_words_in_string(domain, dictionary, longest_word))
    words = sorted(words, key=lambda x: (x[1], -x[2], x[0]))
    sentences, best_score = find_sentences(domain, words, [], [], thresh)
    return [sentence for sentence, score in sentences
            if score >= int(best_score * thresh)]
In [64]:
sentences = get_sentences('powerwasherchicago')
print(len(sentences))
choice(sentences, size=15)
Out[64]:
245
array([['pow', 'erw', 'as', 'her', 'chicago'], ['pow', 'erw', 'ashe', 'chica', 'go'], ['power', 'was', 'her', 'chica', 'go'], ['power', 'was', 'he', 'rch', 'cago'], ['power', 'was', 'her', 'chicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'ash', 'erc', 'hicago'], ['ower', 'wash', 'erc', 'hicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'was', 'her', 'chi', 'cago'], ['power', 'was', 'her', 'chic', 'ago'], ['power', 'as', 'he', 'rch', 'ica', 'go'], ['ower', 'washer', 'chicago'], ['owe', 'rwas', 'he', 'rch', 'ica', 'go'], ['power', 'washer', 'chic', 'go']], dtype=object)
In [65]:
sentences = get_sentences('catholiccommentaryonsacredscripture')
print(len(sentences))
choice(sentences, size=15)
Out[65]:
540428
array([['cat', 'holi', 'ccom', 'me', 'nta', 'ryon', 'sacr', 'ed', 'scrip', 're'], ['catholic', 'co', 'mm', 'en', 'aryo', 'nsac', 'ed', 'scri', 'pture'], ['catholic', 'omm', 'enta', 'ryon', 'sacr', 'eds', 'crip', 'tur'], ['cathol', 'icc', 'ommen', 'tar', 'on', 'sacr', 'ed', 'script', 'ure'], ['at', 'holic', 'omme', 'ntary', 'ons', 'acred', 'scri', 'pture'], ['cathol', 'icc', 'omm', 'ntar', 'yons', 'creds', 'crip', 'ture'], ['cat', 'hol', 'icc', 'omm', 'entary', 'ons', 'acr', 'eds', 'cri', 'ptu', 're'], ['cath', 'lic', 'com', 'me', 'ntar', 'yon', 'sac', 're', 'dsc', 'ript', 'ure'], ['cathol', 'icco', 'mm', 'ntary', 'on', 'sac', 're', 'dsc', 'rip', 'ture'], ['catholic', 'co', 'mm', 'enta', 'ryon', 'sac', 're', 'dsc', 'rip', 'tur'], ['cat', 'holic', 'com', 'me', 'ntar', 'yon', 'sac', 'reds', 'cript', 're'], ['cat', 'holic', 'com', 'menta', 'ryon', 'acr', 'ed', 'cript', 'ure'], ['cat', 'oli', 'ccom', 'mentary', 'nsac', 'red', 'scri', 'pture'], ['cathol', 'icc', 'ommen', 'tary', 'on', 'sacr', 'ed', 'cri', 'ture'], ['cat', 'hol', 'ccom', 'me', 'ntar', 'on', 'sac', 'red', 'scripture']], dtype=object)
In [71]: tail_sentences = domains.tail(1000).apply(get_sentences).apply(len)
In [155]: tail_sentences.describe().apply(int).apply(num_fmt)
Out[155]:
count    1k
mean     1.18k
std      10.7k
min      1
25%      12
50%      39
75%      145
max      280k
Name: Domain, dtype: object
In [73]: domains.tail(1000)[tail_sentences <= 1].values
Out[73]: array(['cizerl', 'sahoko', 'pes-llc', 'mp3fil', 'wyzli', 'buypsa', 'ylqhjt', 'sblgnt', 'axbet', 'eirnyc', 'wsl', 'kms88', 'paknic', 'mrojp', 'irozho', 'bienve'], dtype=object)
In [74]: domains.tail(1000)[tail_sentences > 10000].values
Out[74]: array(['studentdebtreductioncenter', 'inspiredholisticwellness', 'forensicaccountingexpert', 'medicalintuitivetraining', 'lavidamassagesandyspringsga', 'thirdgenerationshootingsupply', 'commercialrefrigerationrepairmiami', 'athenatrainingacademy', 'business-leadership-qualities', 'casaquetzalsanmigueldeallende', 'landscapedesignimagingsoftware', 'southcaliforniauniversity', 'replacementtractorpartsforsale', 'reinventinghealthcareinfo', 'shoppingforpowerinvertersnow', 'cambriaheightschristianacademy', 'californiaconstructionjobs', 'margaritavilleislandhotel', 'whatstoressellgarciniacambogia'], dtype=object)
In [75]: [' '.join(sentence) for sentence in get_sentences('replacementtractorpartsforsale ')[:10]]
Out[75]: ['replacement tractor parts forsale', 'replacement tractor parts forsa', 'replacement tractor parts forsa le', 'replacement tractor parts fors ale', 'replacement tractor parts fors al', 'replacement tractor parts fors le', 'replacement tractor parts for sale', 'replacement tractor parts for sal', 'replacement tractor parts for ale', 'replacement tractor parts for al']
3.2 Decide how likely the domain is given a particular subset
A first approach would be to say that the probability decreases as each letter in the domain is omitted from the sentence. We could model this in an unnormalised way by counting the sentence length.
To sort by this probability P(d|s), we can therefore use the following:
In [77]:
def score_d_given_s(sentence, domain):
    domain_length = len(domain)
    sentence_length = sum(len(word) for word in sentence)
    return sentence_length / domain_length, 1.0 / (1 + len(sentence))
In [78]:
domain = 'powerwasherchicago'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[::-1][:15]
Out[78]: [['power', 'washer', 'chicago'], ['pow', 'erw', 'asher', 'chicago'], ['powe', 'rwa', 'sher', 'chicago'], ['powe', 'rwas', 'her', 'chicago'], ['powe', 'rwas', 'herc', 'hicago'], ['power', 'was', 'her', 'chicago'], ['power', 'was', 'herc', 'hicago'], ['power', 'wash', 'erc', 'hicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'washe', 'rch', 'icago'], ['power', 'washer', 'chi', 'cago'], ['power', 'washer', 'chic', 'ago'], ['power', 'washer', 'chica', 'go'], ['pow', 'erw', 'as', 'her', 'chicago'], ['pow', 'erw', 'as', 'herc', 'hicago']]
In [79]:
domain = 'catholiccommentaryonsacredscripture'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[:-15:-1]
Out[79]: [['catholic', 'commenta', 'ryon', 'sacred', 'scripture'], ['catholic', 'commenta', 'ryons', 'acred', 'scripture'], ['catholic', 'commentar', 'yon', 'sacred', 'scripture'], ['catholic', 'commentar', 'yons', 'acred', 'scripture'], ['catholic', 'commentary', 'on', 'sacred', 'scripture'], ['catholic', 'commentary', 'ons', 'acred', 'scripture'], ['cat', 'holic', 'commenta', 'ryon', 'sacred', 'scripture'], ['cat', 'holic', 'commenta', 'ryons', 'acred', 'scripture'], ['cat', 'holic', 'commentar', 'yon', 'sacred', 'scripture'], ['cat', 'holic', 'commentar', 'yons', 'acred', 'scripture'], ['cat', 'holic', 'commentary', 'on', 'sacred', 'scripture'], ['cat', 'holic', 'commentary', 'ons', 'acred', 'scripture'], ['cath', 'olic', 'commenta', 'ryon', 'sacred', 'scripture'], ['cath', 'olic', 'commenta', 'ryons', 'acred', 'scripture']]
Let's see the top guesses for a selection of domains:
In [105]:
import re

def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)
    return full_sentence
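As a quick sanity check on its behaviour, here is a standalone sketch (the function is repeated from the cell above so the snippet runs on its own; the example inputs are mine):

```python
import re

# Repeated from the cell above so this snippet is self-contained:
# re-insert the unmatched stretches of the domain between matched words.
def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])  # unmatched prefix
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)  # unmatched tail
    return full_sentence

print(flesh_out_sentence(['power', 'washer'], 'powerwasherchicago'))
# → ['power', 'washer', 'chicago']  (the unmatched tail is re-attached)
print(flesh_out_sentence(['hedge', 'update'], 'hedgefundupdate'))
# → ['hedge', 'fund', 'update']  (the unmatched middle is re-attached)
```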
In [ ]:
def guess(d, n_guesses=25):
    guesses = []
    sentences = get_sentences(d)
    sentences = sorted(sentences, key=lambda s: score_d_given_s(s, d))[::-1]
    i = 0
    for i, s in enumerate(sentences[:n_guesses]):
        s = flesh_out_sentence(s, d)
        guesses.append(' '.join(s))
    for _ in range(i + 1, n_guesses):
        guesses.append('')
    return pandas.Series(guesses)
In [238]:
subset = domains.iloc[len(domains)//200::len(domains)//100]
df = pandas.DataFrame(subset.apply(guess).values, index=(subset+'.com').values)
# df.to_csv(os.path.join(data_directory, 'predictions.csv'))
df = df.iloc[:10, :3]
df['correct'] = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0]  # Correct guess for first 10 domains or -1
df[['correct'] + list(range(3))]
Out[238]:
                               correct  0                          1                          2
hedgefundupdate.com                  0  hedge fundupdate           hedge fundupdate           he dge fundupdate
traveldailynews.com                  3  traveldailynews            tra veldailynews           trav eldailynews
miriamkhalladi.com                  -1  miria mkhalladi            miriam khalladi            mir iam khalladi
poolheatpumpstore.com                0  pool heatpump store        pool heatpumps tore        poo lhe at pumpstore
blogorganization.com                 0  blogorganization           blo gorganization          blo gorganization
smallcapvoice.com                    2  smallcap voice             smal lcap voice            small cap voice
cefcorp.com                          0  cef corp                   c efc orp                  cef c orp
lightandmotionphotography.com        3  lightandmotionphotography  lightandmotionphotography  ligh tandmotionphotography
uggbootrepairs.com                   0  ugg bootrepairs            ugg boo trepairs           ugg boo trepairs
abundancesecrets.com                 0  abundancesecrets           abun dancesecrets          abund ancesecrets
In [239]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
correct = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0, 4, 1, 0, 4, 0, 0, -1, 0, 0, -1, 1, 8, 0, 0, 0, 0, 8, 0, -1, -1, 0, -1, 0, 3, 16, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, -1, 0, 2, 4, 13, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1]
In [240]:
pandas.Series(correct).hist(bins=range(-5, 25), normed=True, figsize=(12, 5))
plt.xlabel('correct guess no. or -1 if incorrect');
In a test of 100 samples, the first guess was correct 65 times, and one of the first 25 guesses was correct 87 times.
Is this good enough?
Primary use case: given 500 domains in a market, what are the themes?
Expect ~325 domains in theme clusters and ~175 distributed randomly.
This will probably still require human sanity checks.
What can be done?
So far, we only consider the likelihood of a domain given a sentence.
But how likely is the sentence?
The next hack day is to develop a model for the sentence likelihood P(s).
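One possible shape for such a model is a simple unigram sentence scorer. This is a hedged sketch of my own, not the talk's plan: the word counts below are invented for illustration, and real counts would come from a corpus such as the ngram data.

```python
import math

# Illustrative unigram model (an assumption, not the talk's code):
# log P(s) is the sum of per-word log-frequencies, with add-one smoothing
# for unseen words. The toy counts are invented for the example.
word_counts = {'power': 500, 'washer': 80, 'chicago': 300,
               'was': 900, 'her': 1200, 'chic': 40, 'ago': 250}
total = sum(word_counts.values())

def log_p_sentence(sentence):
    return sum(math.log((word_counts.get(w, 0) + 1) / (total + len(word_counts)))
               for w in sentence)

# A sentence of a few frequent words beats one of many marginal words,
# even when both cover the whole domain.
candidates = [['power', 'washer', 'chicago'],
              ['power', 'was', 'her', 'chic', 'ago']]
best = max(candidates, key=log_p_sentence)
print(best)  # → ['power', 'washer', 'chicago']
```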
Determine the best sentence

From Bayes:

    P(s|d) = P(d|s) P(s) / P(d)

Since P(d) is the same for all sentences, it can be ignored when finding the argmax:

    argmax_s P(s|d) = argmax_s P(d|s) P(s)
What was done
Trained a dictionary using Google ngram viewer data
Found word substrings in each domain
Built sentences from the words, applying crude cuts
Ordered predictions with a crude score function
Measured performance on 100 labelled domains
What I used
Inspiration:
Peter Norvig's spell-correct (http://norvig.com/spell-correct.html)
Libraries:
pandas, numpy, re
sklearn.feature_extraction.text.CountVectorizer
Functions:
build_dictionary(corpus, min_df=0)
find_words_in_string(string, dictionary, longest_word=None)
find_sentences(string, words, part_sentence, sentences, threshold=0.0)
get_sentences(domain, thresh=0.95)
score_d_given_s(sentence, domain)
guess(d, n_guesses=25)
After training, it can be used like this:
In [211]: guess('powerwasherchicago')[0]
Out[211]: 'power washer chicago'
What still needs to be done
...for performance
Performance needs to be tested against a larger labelled dataset, including robust train-develop-test splits.
Sentences need to be compared based on the likelihood of that sentence construction, i.e. P(s).
Additional words need to be incorporated into the dictionary.
Threshold hyper-parameters need tuning.
...and to make it usable
Replace custom code with library functions where possible
Extend remaining code to support array and dataframe inputs
Make compatible with sklearn pipeline
Improve .com, .co.uk etc. handling so it can be used on a wider set of domains
Optimise substring search
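One possible shape for the sklearn-pipeline item is a thin transformer wrapper. This is a sketch under assumptions: the class name is invented, and `guess` stands in for the guessing function defined earlier.

```python
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical sketch (not part of the talk): wrap the guessing function
# as a transformer mapping each raw domain string to its top guess, so it
# can sit at the front of an sklearn Pipeline before a text vectoriser.
class DomainWordSplitter(BaseEstimator, TransformerMixin):
    def __init__(self, guess_fn, n_guesses=1):
        self.guess_fn = guess_fn  # e.g. the guess() function above
        self.n_guesses = n_guesses

    def fit(self, X, y=None):
        return self  # stateless: the dictionary is trained elsewhere

    def transform(self, X):
        return [self.guess_fn(domain)[0] for domain in X]

# e.g. Pipeline([('split', DomainWordSplitter(guess)),
#                ('vec', CountVectorizer()), ...])
```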
Think you can do better?
Get in touch:
[email protected]
@calvingiles
In [122]:
import math

def num_fmt(num):
    i_offset = 12  # change this if you extend the symbols!!!
    prec = 3
    fmt = '.{p}g'.format(p=prec)
    symbols = [  # 'Y', 'Z', 'E', 'P',
        'T', 'G', 'M', 'k', '', 'm', 'u', 'n']
    try:
        e = math.log10(abs(num))
    except ValueError:
        return repr(num)
    if e >= i_offset + 3:
        return '{:{fmt}}'.format(num, fmt=fmt)
    for i, sym in enumerate(symbols):
        e_thresh = i_offset - 3 * i
        if e >= e_thresh:
            return '{:{fmt}}{sym}'.format(num/10.**e_thresh, fmt=fmt, sym=sym)
    return '{:{fmt}}'.format(num, fmt=fmt)