Subwords

Overhaul of a former (sketch of a) project, Subwords, now ported to Python.

I now have a tool that dissects long words into smaller ones, with a focus on ‘perfect’ decompositions: no letter of the long word remains unaccounted for (the program only saves solutions for e.g. a 10-letter word containing two 5-letter ones, or a 13-letter word containing one 5-letter word and two 4-letter ones, etc.). It works as a script that one can invoke from the command line.

The program uses two recursive functions and regular expressions:

first, going through the first list of subwords, it checks with RegEx whether there is a match with the superword at hand;
if there is, the function positions() recursively calculates all the possible distributions of the subword (which is a subsequence) within the superword, and returns those as a list of lists;
the program then creates a string ‘remainder’ without those letters, and recurses, this time using the second list of subwords;
when at the bottom of the recursion, the program checks if the final remainder is in the appropriate dictionary (a Dawg, for fast look-ups).
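
The steps above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the body of positions() is my guess at what such a function could look like, the helper names is_subsequence(), remainder() and decompose() are hypothetical, and a plain Python set stands in for the Dawg.

```python
import re


def is_subsequence(sub, sup):
    """RegEx check: do the letters of `sub` appear in order within `sup`?"""
    pattern = ".*".join(re.escape(letter) for letter in sub)
    return re.search(pattern, sup) is not None


def positions(sub, sup, start=0):
    """Recursively list every way of placing `sub` inside `sup`.

    Each placement is a list of indices into `sup`, one per letter of `sub`.
    """
    if not sub:
        return [[]]
    found = []
    for i in range(start, len(sup)):
        if sup[i] == sub[0]:
            for rest in positions(sub[1:], sup, i + 1):
                found.append([i] + rest)
    return found


def remainder(sup, pos):
    """Return `sup` without the letters at the indices in `pos`."""
    taken = set(pos)
    return "".join(c for i, c in enumerate(sup) if i not in taken)


def decompose(sup, subword_lists, final_words):
    """Peel one subword per list off `sup`; the last remainder must be a word.

    `final_words` is a plain set here (an assumption replacing the Dawg).
    The same solution may be returned more than once if several placements
    leave the same remainder.
    """
    if not subword_lists:
        return [[sup]] if sup in final_words else []
    solutions = []
    for sub in subword_lists[0]:
        if not is_subsequence(sub, sup):
            continue
        for pos in positions(sub, sup):
            rest = remainder(sup, pos)
            for tail in decompose(rest, subword_lists[1:], final_words):
                solutions.append([sub] + tail)
    return solutions
```

With these definitions, decompose("weaver", [["war"]], {"eve"}) finds the single placement of ‘war’ at indices 0, 2 and 5, which leaves ‘eve’ as the remainder.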

I added the constraint that the subwords should always have at least one inner division (none of the subwords can appear ‘as it is’, fully concatenated, somewhere within the superword). It was also interesting to find a way of avoiding repetitions in the results (‘presentations, penis, raton, set’ could also be ‘presentations, raton, penis, set’, etc.). Using sets for this, and then reconstructing the results in the right order (using the lengths the user had given as arguments), was the way to go.
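
A minimal sketch of these two ideas, under a few assumptions of mine: the ‘inner division’ rule is read as ‘the chosen letter positions are not all adjacent’, duplicates are collapsed by storing sorted tuples in a set (so that a subword occurring twice is not lost), and the function names has_inner_division() and dedupe() are hypothetical.

```python
def has_inner_division(pos):
    """True if at least one gap separates two consecutive letter positions."""
    return any(b - a > 1 for a, b in zip(pos, pos[1:]))


def dedupe(solutions, lengths):
    """Collapse reordered duplicates, then restore the requested order.

    `solutions` is a list of lists of subwords; `lengths` is the sequence
    of subword lengths given by the user. Words of equal length end up in
    alphabetical order. Every solution is assumed to contain exactly one
    word per requested length.
    """
    seen = {tuple(sorted(solution)) for solution in solutions}
    ordered = []
    for combo in sorted(seen):
        pool = list(combo)
        row = []
        for n in lengths:
            word = next(w for w in pool if len(w) == n)
            pool.remove(word)
            row.append(word)
        ordered.append(row)
    return ordered
```

For instance, the two orderings of ‘penis, raton, set’ quoted above collapse into a single entry, which is then laid out according to the lengths passed in (5, 5, 3 or 3, 5, 5, and so on).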

As mentioned in my update post on Wordsquares, the input dictionary is also quite crucial here (especially for short words, where irrelevant abbreviations abound).

The result can be remarkably ‘unreadable’, or at least cryptic. A few examples from the GitHub Readme:

w a  r
 e ve

va in  
  g  al

 a g in  g
b r a  in 

b r  in  g
 a ga  in

 upp   er
s   ort  s

  e the r
sw e   a t

  N er    o
      va i n
co s    t

   l    ion
de  ar
  c   at   s

 i s     i s
m   i n   e
  s  o ar

fa cin    g
  s   atin

 error s
t     i ts

      M   at
  s    ent
in t         o
    ru      i n

 u n        i
v   e   il 
  l   ab      s
     r    it e

Still to be done: a graphical representation that I can have printed.

It is not clear that I want to keep this level of constraint for this one. Some results are somewhat salient, but it might prove more fruitful, for instance, to drop the ‘perfection’ constraint and write a version of the program that allows for ‘undecomposed’ letters, present only in the superword and in none of the subwords. The number of results would explode, and it is likely that I would have to renounce the idea of a ‘total’ corpus of solutions, but it might be best for my artistic purpose: I might have to include a way of searching for results word after word, perhaps, with more auctorial input throughout. The method I am using, which consists in building a ‘total’ set of results before I search through it (with tools I am yet to develop), comes with quite a lot of issues, perhaps the most prickly of which is the feeling that the constraint is alienating me, forcing me to look at ‘literary’ objects of no interest, and reducing my output to logology, the trivial pastime of scrabblers…