Learning to read urls
Finding the word boundaries in multi-word domain names with python and sklearn.
Calvin Giles
Who am I?
Data Scientist at Adthena
PyData Co-Organiser
Physicist
Like to solve problems pragmatically
The Problem
Given a domain name:
'powerwasherchicago.com' 'catholiccommentaryonsacredscripture.com'
Find the concatenated sentence:
'power washer chicago (.com)' 'catholic commentary on sacred scripture (.com)'
Why is this useful?
How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'?
How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'?
Domains resolved into words can be compared on a semantic level, not simply as strings.
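To make this concrete, here is a minimal sketch (my illustration, not the talk's code; 'pressurewashernyc' is an invented comparison domain): a token-set Jaccard similarity is zero on the raw domain strings, but positive once the words are exposed.

```python
# Illustration (assumed example, not from the talk): once domains are split
# into words, a simple token-set Jaccard similarity captures shared meaning
# that comparing the raw strings misses.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Raw strings share no token at all...
raw = jaccard(['powerwasherchicago'], ['pressurewashernyc'])
# ...but the resolved sentences share the word 'washer'.
split = jaccard('power washer chicago'.split(), 'pressure washer nyc'.split())
print(raw, split)  # 0.0 vs a positive word-level score
```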
Primary use case
Given 500 domains in a market, what are the themes?
Scope of project
As part of Adthena Labs, our internal idea incubation, this approach was developed during a one-day hack to determine whether it could be useful to the business.
Adthena's Data
> 10 million unique domains
> 50 million unique search terms
3rd Party Data
Project Gutenberg (https://www.gutenberg.org/)
Google ngram viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
Process
1. Learn some words
2. Find where words occur in a domain name
3. Choose the most likely set of words
1. Learn some words
Build a dictionary using suitable documents.
Documents: search terms
In [2]:
import pandas, os
search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
search_terms = search_terms['SearchTerm'].dropna().str.lower()
search_terms.iloc[1000000::2000000]
Out[2]:
1000000    new 2014 mercedes benz b200 cdi
3000000    weight watchers in glynneath
5000000    property for rent in batlow nsw
7000000    us plug adaptor for uk
9000000    which features mobile is best for purchase
Name: SearchTerm, dtype: object
In [125]:
from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(corpus, min_df=0):
    vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # Require 2+ characters
    vec.fit(corpus)
    return set(vec.get_feature_names())
In [126]:
st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
dictionary_size = len(st_dictionary)
print('{} words found'.format(num_fmt(dictionary_size)))
sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]
Out[126]:
21.4k words found
['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']
We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:
In [127]:
dictionary = st_dictionary
for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
    if not fname.endswith('.txt'):
        continue
    with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
        book = pandas.Series(f.readlines())
    book = book.str.strip()
    book = book[book != '']
    book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear in at least 2 lines
    dictionary_size = len(book_dictionary)
    print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
    dictionary |= book_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))
2.11k words found in a_christmas_carol.txt
1.65k words found in alice_in_wonderland.txt
3.71k words found in huckleberry_finn.txt
4.09k words found in pride_and_predudice.txt
4.52k words found in sherlock_holmes.txt
26.4k words in dictionary
Actually, scrap that...
...and use the Google ngram viewer datasets:
In [212]:
dictionary = set()
ngram_files = [fn for fn in os.listdir(ngram_data_directory)
               if 'googlebooks' in fn and fn.endswith('_processed.csv')]
for fname in ngram_files:
    ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
    ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                    | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
    ngrams = ngrams.ngram
    ngrams = ngrams.str.lower()
    ngrams = ngrams[ngrams != '']
    ngrams_dictionary = set(ngrams)
    dictionary_size = len(ngrams_dictionary)
    print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
    dictionary |= ngrams_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))
2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"
That takes us to ~1M words!
We even get some good two-letter words to work with:
In [130]:
print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
print(sorted({w for w in dictionary if len(w) == 2}))
142 2-letter words
['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be', 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he', 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no', 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi', 'we', 'ye']
In [144]: choice(list(dictionary), size=40)
Out[144]: array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley', 'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin', 'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey', 'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer', 'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra', 'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts', 'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu', 'conoció'], dtype='<U81')
2. Find where words occur in a domain name
Find all substrings of a domain that are in our dictionary, along with their start and end indices.
In [149]:
def find_words_in_string(string, dictionary, longest_word=None):
    if longest_word is None:
        longest_word = max(len(word) for word in dictionary)
    substring_indicies = ((start, start + length)
                          for start in range(len(string))
                          for length in range(1, longest_word + 1))
    for start, end in substring_indicies:
        substring = string[start:end]
        if substring in dictionary:
            # use len(substring) in case we sliced beyond the end
            yield substring, start, start + len(substring)
In [234]:
domain = 'powerwasherchicago'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)
39
['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag', 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica', 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher', 'was', 'wash', 'washe', 'washer', 'we', 'wer']
In [235]:
domain = 'catholiccommentaryonsacredscripture'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)
101
['acr', 'acre', 'acred', 'ary', 'aryo', 'at', 'ath', 'atho', 'athol', 'atholic', 'cat', 'cath', 'catho', 'cathol', 'catholi', 'catholic', 'cco', 'ccom', 'co', 'com', 'comm', 'comme', 'commen', 'comment', 'commenta', 'commentar', 'commentary', 'cre', 'cred', 'creds', 'cri', 'crip', 'cript', 'dsc', 'dscr', 'ed', 'eds', 'en', 'ent', 'enta', 'entar', 'entary', 'hol', 'holi', 'holic', 'icc', 'icco', 'ipt', 'lic', 'me', 'men', 'ment', 'menta', 'mentar', 'mentary', 'mm', 'mme', 'mment', 'nsa', 'nsac', 'nta', 'ntar', 'ntary', 'oli', 'olic', 'omm', 'omme', 'ommen', 'omment', 'on', 'ons', 'ptu', 'pture', 're', 'red', 'reds', 'rip', 'ript', 'ryo', 'ryon', 'ryons', 'sac', 'sacr', 'sacre', 'sacred', 'scr', 'scri', 'scrip', 'script', 'scriptur', 'scripture', 'tar', 'tary', 'tho', 'thol', 'tholic', 'tur', 'ture', 'ure', 'yon', 'yons']
3. Choose the most likely set of words
A simple approach to do this:
1. Find all subsets of the set of words found
2. Determine if that subset is non-overlapping
3. Decide how likely the domain is given a particular subset
4. Decide how likely it is that the subset would occur overall
5. Determine the best subset
argmax_s P(s|d) = argmax_s P(d|s) P(s)
We need some domain name data for the next part...
In [153]:
domains = pandas.read_csv(os.path.join(data_directory, 'domains.csv'))
domains = domains['Domain'].str.lower()
domains = domains[domains.str.endswith(".com")]
domains = domains.str.replace(r"\.com$", "")
domains = domains.str.replace(r"^https?\:\/\/", "")
domains = domains.str.replace(r"^www\d?\.", "")
num_fmt(len(domains))
Out[153]: '3.8M'
In [224]: choice(domains, size=20)
Out[224]: array(['1topchannel', 'scales-chords', 'marcusmajestic', 'mylyfestart', 'bluediamondturlock', 'bedfordvisionclinic', 'justinmccain', 'miniot-online', 'chelseabarracksbarracks', 'zeroeasy', 'newlookupholstery', 'radcliffehealth', 'embracingthemundane', 'immunityassist', 'simplynostretchmarks', 'teachmetoswim', 'thetford-europe', 'charlesallenford', 'china-chargermanufacturer', 'coolbabykid'], dtype=object)
1. Find all subsets of the set of words found
There are 2^n different sentences that can be constructed from n substrings, including the empty sentence. We can get an idea how bad that will be with a sample of the data.
In [53]:
longest_word = max(len(word) for word in dictionary)  # speeds up search
def find_n_words_in_string(domain):
    return len(set(find_words_in_string(domain, dictionary, longest_word)))
In [56]:
import numpy
n_words = domains.tail(1000).apply(find_n_words_in_string)
n_words.describe().apply(num_fmt)
Out[56]:
count    1k
mean     28.3
std      15.8
min      1
25%      17
50%      26
75%      38
max      93
Name: Domain, dtype: object
In [227]: num_fmt(2**28), 2**93
Out[227]: ('268M', 9903520314283042199192993792)
So the worst case in a sample of 1000 domains is 2^93 permutations to test!
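To see how much the non-overlap constraint can prune, here is a small self-contained sketch (my illustration with an invented toy dictionary, not the talk's code): brute-force all 2^n subsets of the found words and count how many are actually non-overlapping.

```python
from itertools import combinations

# Toy illustration (assumed dictionary, not the talk's code): of the 2**n
# possible subsets of found words, only a small fraction have pairwise
# non-overlapping spans, so generating only those is a big win.
def find_words(string, dictionary):
    return [(string[i:j], i, j)
            for i in range(len(string))
            for j in range(i + 1, len(string) + 1)
            if string[i:j] in dictionary]

def non_overlapping(subset):
    spans = sorted((start, end) for _, start, end in subset)
    # each word must start at or after the previous word's end
    return all(e1 <= s2 for (_, e1), (s2, _) in zip(spans, spans[1:]))

words = find_words('powerwasher',
                   {'pow', 'power', 'owe', 'was', 'wash', 'washer', 'her', 'er'})
n = len(words)
valid = sum(1 for r in range(1, n + 1)
            for subset in combinations(words, r)
            if non_overlapping(subset))
print(2 ** n, valid)  # the valid count is far smaller than 2**n
```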
Combine steps 1 and 2
1. Find all subsets of the set of words found
2. Determine if that subset is non-overlapping
becomes:
1. Find all subsets with non-overlapping words
2. Do nothing :-)
3.1 Find all subsets with non-overlapping words
Build a tree of subsets of non-overlapping words by sorting the words by their start index.
...and only return the "best" few cases anyway
It seems intuitive that sentences that match more of the domain are better. This is not infallible, but we can achieve a significant speed-up if we only consider sentences at least half as long as the best match.
In practice, this does not appear to have any impact on the results but prevents an explosion of sentences with particularly long domains.
A little more code...

In [147]:
def find_sentences(string, words, part_sentence, sentences, threshold=0.0,
                   current_idx=0, current_score=0, best_score=0):
    """
    Return sentences made of words that are common substrings of `string`.
    `words` MUST be ordered by start index or the results will be wrong!
    """
    current_threshold = int(best_score * threshold)
    if ((current_idx >= len(string))
            or current_score + len(string) - current_idx < current_threshold):
        return sentences, best_score
    for i, (word, start_idx, end_idx) in enumerate(words):
        if current_idx > start_idx:
            continue
        new_score = current_score + len(word)
        best_score = max(best_score, new_score)
        new_part_sentence = part_sentence + [word]
        if new_score + len(string) - end_idx >= current_threshold:
            sentences.append((new_part_sentence, new_score))
            sentences, best_score = find_sentences(string=string,
                                                   words=words[i+1:],
                                                   part_sentence=new_part_sentence,
                                                   sentences=sentences,
                                                   threshold=threshold,
                                                   current_idx=end_idx,
                                                   current_score=new_score,
                                                   best_score=best_score)
    return sentences, best_score
Add a wrapper
In [148]:
def get_sentences(domain, thresh=0.95):
    words = set(find_words_in_string(domain, dictionary, longest_word))
    words = sorted(words, key=lambda x: (x[1], -x[2], x[0]))
    sentences, best_score = find_sentences(domain, words, [], [], thresh)
    return [sentence for sentence, score in sentences
            if score >= int(best_score * thresh)]
In [64]:
sentences = get_sentences('powerwasherchicago')
print(len(sentences))
choice(sentences, size=15)
Out[64]:
245
array([['pow', 'erw', 'as', 'her', 'chicago'], ['pow', 'erw', 'ashe', 'chica', 'go'], ['power', 'was', 'her', 'chica', 'go'], ['power', 'was', 'he', 'rch', 'cago'], ['power', 'was', 'her', 'chicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'ash', 'erc', 'hicago'], ['ower', 'wash', 'erc', 'hicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'was', 'her', 'chi', 'cago'], ['power', 'was', 'her', 'chic', 'ago'], ['power', 'as', 'he', 'rch', 'ica', 'go'], ['ower', 'washer', 'chicago'], ['owe', 'rwas', 'he', 'rch', 'ica', 'go'], ['power', 'washer', 'chic', 'go']], dtype=object)
In [65]:
sentences = get_sentences('catholiccommentaryonsacredscripture')
print(len(sentences))
choice(sentences, size=15)
Out[65]:
540428
array([['cat', 'holi', 'ccom', 'me', 'nta', 'ryon', 'sacr', 'ed', 'scrip', 're'], ['catholic', 'co', 'mm', 'en', 'aryo', 'nsac', 'ed', 'scri', 'pture'], ['catholic', 'omm', 'enta', 'ryon', 'sacr', 'eds', 'crip', 'tur'], ['cathol', 'icc', 'ommen', 'tar', 'on', 'sacr', 'ed', 'script', 'ure'], ['at', 'holic', 'omme', 'ntary', 'ons', 'acred', 'scri', 'pture'], ['cathol', 'icc', 'omm', 'ntar', 'yons', 'creds', 'crip', 'ture'], ['cat', 'hol', 'icc', 'omm', 'entary', 'ons', 'acr', 'eds', 'cri', 'ptu', 're'], ['cath', 'lic', 'com', 'me', 'ntar', 'yon', 'sac', 're', 'dsc', 'ript', 'ure'], ['cathol', 'icco', 'mm', 'ntary', 'on', 'sac', 're', 'dsc', 'rip', 'ture'], ['catholic', 'co', 'mm', 'enta', 'ryon', 'sac', 're', 'dsc', 'rip', 'tur'], ['cat', 'holic', 'com', 'me', 'ntar', 'yon', 'sac', 'reds', 'cript', 're'], ['cat', 'holic', 'com', 'menta', 'ryon', 'acr', 'ed', 'cript', 'ure'], ['cat', 'oli', 'ccom', 'mentary', 'nsac', 'red', 'scri', 'pture'], ['cathol', 'icc', 'ommen', 'tary', 'on', 'sacr', 'ed', 'cri', 'ture'], ['cat', 'hol', 'ccom', 'me', 'ntar', 'on', 'sac', 'red', 'scripture']], dtype=object)
In [71]: tail_sentences = domains.tail(1000).apply(get_sentences).apply(len)
In [155]: tail_sentences.describe().apply(int).apply(num_fmt)
Out[155]:
count    1k
mean     1.18k
std      10.7k
min      1
25%      12
50%      39
75%      145
max      280k
Name: Domain, dtype: object
In [73]: domains.tail(1000)[tail_sentences <= 1].values
Out[73]: array(['cizerl', 'sahoko', 'pes-llc', 'mp3fil', 'wyzli', 'buypsa', 'ylqhjt', 'sblgnt', 'axbet', 'eirnyc', 'wsl', 'kms88', 'paknic', 'mrojp', 'irozho', 'bienve'], dtype=object)
In [74]: domains.tail(1000)[tail_sentences > 10000].values
Out[74]: array(['studentdebtreductioncenter', 'inspiredholisticwellness', 'forensicaccountingexpert', 'medicalintuitivetraining', 'lavidamassagesandyspringsga', 'thirdgenerationshootingsupply', 'commercialrefrigerationrepairmiami', 'athenatrainingacademy', 'business-leadership-qualities', 'casaquetzalsanmigueldeallende', 'landscapedesignimagingsoftware', 'southcaliforniauniversity', 'replacementtractorpartsforsale', 'reinventinghealthcareinfo', 'shoppingforpowerinvertersnow', 'cambriaheightschristianacademy', 'californiaconstructionjobs', 'margaritavilleislandhotel', 'whatstoressellgarciniacambogia'], dtype=object)
In [75]: [' '.join(sentence) for sentence in get_sentences('replacementtractorpartsforsale ')[:10]]
Out[75]: ['replacement tractor parts forsale', 'replacement tractor parts forsa', 'replacement tractor parts forsa le', 'replacement tractor parts fors ale', 'replacement tractor parts fors al', 'replacement tractor parts fors le', 'replacement tractor parts for sale', 'replacement tractor parts for sal', 'replacement tractor parts for ale', 'replacement tractor parts for al']
3.2 Decide how likely the domain is given a particular subset
A first approach would be to say that the probability decreases as each letter in the domain is omitted from the sentence. We could model this in an unnormalised way by counting the sentence length.
To sort by this probability P(d|s), we can therefore use the following:
In [77]:
def score_d_given_s(sentence, domain):
    domain_length = len(domain)
    sentence_length = sum(len(word) for word in sentence)
    return sentence_length / domain_length, 1.0 / (1 + len(sentence))
In [78]:
domain = 'powerwasherchicago'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[::-1][:15]
Out[78]: [['power', 'washer', 'chicago'], ['pow', 'erw', 'asher', 'chicago'], ['powe', 'rwa', 'sher', 'chicago'], ['powe', 'rwas', 'her', 'chicago'], ['powe', 'rwas', 'herc', 'hicago'], ['power', 'was', 'her', 'chicago'], ['power', 'was', 'herc', 'hicago'], ['power', 'wash', 'erc', 'hicago'], ['power', 'wash', 'erch', 'icago'], ['power', 'washe', 'rch', 'icago'], ['power', 'washer', 'chi', 'cago'], ['power', 'washer', 'chic', 'ago'], ['power', 'washer', 'chica', 'go'], ['pow', 'erw', 'as', 'her', 'chicago'], ['pow', 'erw', 'as', 'herc', 'hicago']]
In [79]:
domain = 'catholiccommentaryonsacredscripture'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[:-15:-1]
Out[79]: [['catholic', 'commenta', 'ryon', 'sacred', 'scripture'], ['catholic', 'commenta', 'ryons', 'acred', 'scripture'], ['catholic', 'commentar', 'yon', 'sacred', 'scripture'], ['catholic', 'commentar', 'yons', 'acred', 'scripture'], ['catholic', 'commentary', 'on', 'sacred', 'scripture'], ['catholic', 'commentary', 'ons', 'acred', 'scripture'], ['cat', 'holic', 'commenta', 'ryon', 'sacred', 'scripture'], ['cat', 'holic', 'commenta', 'ryons', 'acred', 'scripture'], ['cat', 'holic', 'commentar', 'yon', 'sacred', 'scripture'], ['cat', 'holic', 'commentar', 'yons', 'acred', 'scripture'], ['cat', 'holic', 'commentary', 'on', 'sacred', 'scripture'], ['cat', 'holic', 'commentary', 'ons', 'acred', 'scripture'], ['cath', 'olic', 'commenta', 'ryon', 'sacred', 'scripture'], ['cath', 'olic', 'commenta', 'ryons', 'acred', 'scripture']]
Let's see the top guesses for a selection of domains:
In [105]:
import re

def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)
    return full_sentence
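As a quick sanity check on its behaviour, here is a standalone sketch (the function is repeated from the cell above so the snippet runs on its own; the example inputs are mine):

```python
import re

# Repeated from the cell above so this snippet is self-contained:
# re-insert the unmatched stretches of the domain between matched words.
def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])  # unmatched prefix
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)  # unmatched tail
    return full_sentence

print(flesh_out_sentence(['power', 'washer'], 'powerwasherchicago'))
# → ['power', 'washer', 'chicago']  (the unmatched tail is re-attached)
print(flesh_out_sentence(['hedge', 'update'], 'hedgefundupdate'))
# → ['hedge', 'fund', 'update']  (the unmatched middle is re-attached)
```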
In [ ]:
def guess(d, n_guesses=25):
    guesses = []
    sentences = get_sentences(d)
    sentences = sorted(sentences, key=lambda s: score_d_given_s(s, d))[::-1]
    i = 0
    for i, s in enumerate(sentences[:n_guesses]):
        s = flesh_out_sentence(s, d)
        guesses.append(' '.join(s))
    for _ in range(i + 1, n_guesses):
        guesses.append('')
    return pandas.Series(guesses)
In [238]:
subset = domains.iloc[len(domains)//200::len(domains)//100]
df = pandas.DataFrame(subset.apply(guess).values, index=(subset+'.com').values)
# df.to_csv(os.path.join(data_directory, 'predictions.csv'))
df = df.iloc[:10, :3]
df['correct'] = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0]  # Correct guess for first 10 domains or -1
df[['correct'] + list(range(3))]
Out[238]:
                               correct  0                          1                          2
hedgefundupdate.com                  0  hedge fundupdate           hedge fundupdate           he dge fundupdate
traveldailynews.com                  3  traveldailynews            tra veldailynews           trav eldailynews
miriamkhalladi.com                  -1  miria mkhalladi            miriam khalladi            mir iam khalladi
poolheatpumpstore.com                0  pool heatpump store        pool heatpumps tore        poo lhe at pumpstore
blogorganization.com                 0  blogorganization           blo gorganization          blo gorganization
smallcapvoice.com                    2  smallcap voice             smal lcap voice            small cap voice
cefcorp.com                          0  cef corp                   c efc orp                  cef c orp
lightandmotionphotography.com        3  lightandmotionphotography  lightandmotionphotography  ligh tandmotionphotography
uggbootrepairs.com                   0  ugg bootrepairs            ugg boo trepairs           ugg boo trepairs
abundancesecrets.com                 0  abundancesecrets           abun dancesecrets          abund ancesecrets
In [239]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn
correct = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0, 4, 1, 0, 4, 0, 0, -1, 0, 0, -1, 1, 8, 0, 0, 0, 0, 8, 0, -1, -1, 0, -1, 0, 3, 16, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, -1, 0, 0, -1, 0, 2, 4, 13, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1]
In [240]:
pandas.Series(correct).hist(bins=range(-5, 25), normed=True, figsize=(12, 5))
plt.xlabel('correct guess no. or -1 if incorrect');
In a test of 100 samples, the first guess was correct 65 times, and one of the first 25 guesses was correct 87 times.
Is this good enough?
Primary use case: given 500 domains in a market, what are the themes?
Expect ~325 domains in theme clusters and ~175 distributed randomly.
This will probably still require human sanity checks.
What can be done?
So far, we only consider the likelihood of a domain given a sentence.
But how likely is the sentence?
The next hack day is to develop a model for the sentence likelihood P(s).
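One possible shape for such a model is a simple unigram sentence scorer. This is a hedged sketch of my own, not the talk's plan: the word counts below are invented for illustration, and real counts would come from a corpus such as the ngram data.

```python
import math

# Illustrative unigram model (an assumption, not the talk's code):
# log P(s) is the sum of per-word log-frequencies, with add-one smoothing
# for unseen words. The toy counts are invented for the example.
word_counts = {'power': 500, 'washer': 80, 'chicago': 300,
               'was': 900, 'her': 1200, 'chic': 40, 'ago': 250}
total = sum(word_counts.values())

def log_p_sentence(sentence):
    return sum(math.log((word_counts.get(w, 0) + 1) / (total + len(word_counts)))
               for w in sentence)

# A sentence of a few frequent words beats one of many marginal words,
# even when both cover the whole domain.
candidates = [['power', 'washer', 'chicago'],
              ['power', 'was', 'her', 'chic', 'ago']]
best = max(candidates, key=log_p_sentence)
print(best)  # → ['power', 'washer', 'chicago']
```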
Determine the best sentence

From Bayes:

    P(s|d) = P(d|s) P(s) / P(d)

Since P(d) is the same for all sentences, it can be ignored when finding the argmax:

    argmax_s P(s|d) = argmax_s P(d|s) P(s)
What was done
Trained a dictionary using Google ngram viewer data
Found word substrings in each domain
Built sentences from the words, applying crude cuts
Ordered predictions with a crude score function
Measured performance on 100 labelled domains
What I used
Inspiration:
Peter Norvig's spell-correct (http://norvig.com/spell-correct.html)
Libraries:
pandas, numpy, re
sklearn.feature_extraction.text.CountVectorizer
Functions:
build_dictionary(corpus, min_df=0)
find_words_in_string(string, dictionary, longest_word=None)
find_sentences(string, words, part_sentence, sentences, threshold=0.0)
get_sentences(domain, thresh=0.95)
score_d_given_s(sentence, domain)
guess(d, n_guesses=25)
After training, it can be used like this:
In [211]: guess('powerwasherchicago')[0]
Out[211]: 'power washer chicago'
What still needs to be done
...for performance
Performance needs to be tested against a larger labelled dataset, including robust train-develop-test splits.
Sentences need to be compared based on the likelihood of that sentence construction, i.e. P(s).
Additional words need to be incorporated into the dictionary.
Threshold hyper-parameters need tuning.
...and to make it usable
Replace custom code with library functions where possible
Extend remaining code to support array and dataframe inputs
Make compatible with sklearn pipeline
Improve .com, .co.uk etc. handling so it can be used on a wider set of domains
Optimise substring search
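One possible shape for the sklearn-pipeline item is a thin transformer wrapper. This is a sketch under assumptions: the class name is invented, and `guess` stands in for the guessing function defined earlier.

```python
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical sketch (not part of the talk): wrap the guessing function
# as a transformer mapping each raw domain string to its top guess, so it
# can sit at the front of an sklearn Pipeline before a text vectoriser.
class DomainWordSplitter(BaseEstimator, TransformerMixin):
    def __init__(self, guess_fn, n_guesses=1):
        self.guess_fn = guess_fn  # e.g. the guess() function above
        self.n_guesses = n_guesses

    def fit(self, X, y=None):
        return self  # stateless: the dictionary is trained elsewhere

    def transform(self, X):
        return [self.guess_fn(domain)[0] for domain in X]

# e.g. Pipeline([('split', DomainWordSplitter(guess)),
#                ('vec', CountVectorizer()), ...])
```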
Think you can do better?
Get in touch:
[email protected]
@calvingiles
In [122]:
import math

def num_fmt(num):
    i_offset = 12  # change this if you extend the symbols!!!
    prec = 3
    fmt = '.{p}g'.format(p=prec)
    symbols = [  # 'Y', 'Z', 'E', 'P',
        'T', 'G', 'M', 'k', '', 'm', 'u', 'n']
    try:
        e = math.log10(abs(num))
    except ValueError:
        return repr(num)
    if e >= i_offset + 3:
        return '{:{fmt}}'.format(num, fmt=fmt)
    for i, sym in enumerate(symbols):
        e_thresh = i_offset - 3 * i
        if e >= e_thresh:
            return '{:{fmt}}{sym}'.format(num/10.**e_thresh, fmt=fmt, sym=sym)
    return '{:{fmt}}'.format(num, fmt=fmt)