Sorting your music library with Python.

Irene Naya
9 min read · Jul 20, 2018

I have been hoarding music forever. I have multiple backups of music folders, and lots of folders coming from different origins.

Recently I decided to see if I could come up with a comprehensive list of all the music I have stored on my hard drives, and I found that the task can be quite daunting. Why did I do this? Well, on the one hand, of course, just because. There’s another reason at the end of this post, but it’s not related to programming at all, so I’ll leave it for later.

Of course, the big issue is that we hardly ever name files and folders in any regular fashion. Different people have different conventions and, as I discovered, I have had very different conventions from time to time. Even when what I basically wanted was a simple list containing just album, artist and year, retrieving those three items was quite a challenge. Unless you have a very organised library, you’re bound to end up like me, having to manually curate the list at the end of the process. But that’s orders of magnitude better than having to type the whole thing manually!

Here’s what I learned in the process. Hopefully, there are a few helpful tips in here for you.

Organising your data

The first thing, of course, was to get myself organised so that I could have the data I obtained the way I wanted. Since I wanted the 3 fields I mentioned before (artist, album, year), I made a rather simple Python class to hold them:

class Music:
    def __init__(self, date, artist, album):
        self.date = date
        self.artist = artist
        self.album = album

    def __eq__(self, other):
        return ((self.date == other.date) and (self.artist == other.artist) and (self.album == other.album))

    def __hash__(self):
        return hash((self.date, self.artist, self.album))

Other than the data, all the class has are the definitions for __eq__ and __hash__, which I needed for using a set. The set was the data structure where I stored all the music objects, and I used a set to avoid repetitions as much as possible.
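To see why __eq__ and __hash__ matter here, the following is a minimal sketch of how a set treats such objects. The class is repeated so the snippet runs on its own (with an added isinstance check, which is my assumption and not part of the class above), and the sample albums are made-up illustration data.

```python
class Music:
    def __init__(self, date, artist, album):
        self.date = date
        self.artist = artist
        self.album = album

    def __eq__(self, other):
        # Assumption: guard against comparing with unrelated types
        if not isinstance(other, Music):
            return NotImplemented
        return (self.date, self.artist, self.album) == (other.date, other.artist, other.album)

    def __hash__(self):
        # Equal objects must produce equal hashes, so hash the same three fields
        return hash((self.date, self.artist, self.album))


music_set = set()
music_set.add(Music("1977", "Television", "Marquee Moon"))
music_set.add(Music("1977", "Television", "Marquee Moon"))  # exact duplicate: ignored by the set
music_set.add(Music("1977", "television", "Marquee Moon"))  # different spelling: kept as a new entry
print(len(music_set))
```

The set only removes exact duplicates. Any difference in spelling or capitalisation counts as a different album, which is why some repetitions survive until the manual clean-up.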

The other things I used were:

  • A list of strings for the paths where the music was (not needed if you have all your music in only one location)
  • A list of strings to store excluded folders (I had videos and books in some subfolders — yes, I’m a bit messy)
  • A list for “unwanted” strings in the folder names (things such as “mp3”, “flac”, “320kbps”, etc.)
  • A set, as I mentioned above, to store all music objects

There are other things I used, but they are related to each different way to obtain the data, so we’ll deal with them as we go along.

The simple case: You actually named your folders properly

So yes, this case did actually happen: many of my album folders already had all the information I needed in their names, so that’s the first thing I checked for. In that case, all you need to do is retrieve the information from the folder name.

One of the reasons I chose Python was how ridiculously easy it is to traverse the directory structure and get information from your files. Enter os.walk().

os.walk() lets you easily iterate over your directories by traversing them recursively from top to bottom (or from bottom to top, if that’s what you want). At every step it hands you all the information you need, which makes your life very easy.
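As a rough illustration of the top-down traversal, here is a small self-contained sketch that builds a throwaway directory tree and walks it while skipping an excluded folder. The function name list_dirs and the folder names are hypothetical, chosen only for this example.

```python
import os
import tempfile


def list_dirs(top, excluded):
    """Collect every directory below top, skipping excluded folder names."""
    found = []
    for root, dirs, files in os.walk(top):  # top-down is the default
        # Editing dirs in place tells os.walk() not to descend into the removed folders
        dirs[:] = [d for d in dirs if d not in excluded]
        for d in dirs:
            found.append(os.path.join(root, d))
    return sorted(found)


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "Artist", "Album"))
    os.makedirs(os.path.join(top, "Videos", "Concert"))
    result = list_dirs(top, excluded={"Videos"})
    relative = [os.path.relpath(p, top) for p in result]
    print(relative)
```

Neither "Videos" nor anything inside it shows up in the result, because the pruning happens before os.walk() gets a chance to enter that folder.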

For this part, the only extra item I needed was a regex to find the years in the folder name:

import re

years = r'((?:19|20)\d\d)'
year = re.compile(years)
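A quick way to get a feel for what this pattern catches is to try it on a few folder names. This is only a sketch: find_year is a hypothetical helper and the folder names are invented.

```python
import re

year = re.compile(r'((?:19|20)\d\d)')


def find_year(folder_name):
    """Return the first 19xx/20xx number found in the name, or None."""
    match = year.search(folder_name)
    return match.group() if match else None


print(find_year("1977 - Marquee Moon"))
print(find_year("Marquee Moon [FLAC]"))
print(find_year("Live 320kbps 2005"))
print(find_year("Prince - 1999 (1982)"))
```

The last case shows a limitation worth knowing about: search() returns the first match, so an album title that itself looks like a year ("1999") wins over the actual release year. Entries like that are among the ones that need fixing by hand later.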

After that, I was ready to start traversing the directory. Everything that comes from now on happens inside the os.walk() loop; I am splitting it up here for ease of reading. The first thing we check is whether we have a folder that has a year in its name:

import os

# musicPaths: list of all the paths I wanted to traverse
for currPath in musicPaths:
    for root, dirs, files in os.walk(currPath):
        # search subdirectories, going backwards because we delete as we go
        for i in range(len(dirs) - 1, -1, -1):
            if dirs[i] in excluded:  # excluded is the list of unwanted directories
                del dirs[i]  # removing it from dirs stops os.walk() from descending into it
                continue
            match = year.search(dirs[i])
            if match is not None:
                # the parent folder name is taken to be the artist
                artist = os.path.basename(root)
                musicSet.add(Music(match.group(), artist, dirs[i]))
                del dirs[i]  # year found: no need to explore this folder further

Python’s documentation is fairly clear, but basically each step of os.walk() yields three things: root, the directory currently being visited; dirs, a list with the names of the subdirectories directly inside root; and files, a list with the names of the files directly inside root.
I traverse dirs backwards because I am deleting entries as I go: both the directories that had a year in them (no need to explore further) and the ones we weren’t interested in. Deleting a name from dirs in place also tells os.walk() not to descend into that directory.

Don’t make the silly mistake I made initially of forgetting that if you’re deleting from a list by index you should never iterate forward: If you delete the element at index 0, after deleting you will now have a new element at index 0 (the one that was at index 1), but now you have moved to index 1 and you end up missing one element.
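The difference between the two directions is easy to reproduce on a plain list. The sketch below uses two hypothetical helper functions and made-up folder names; it is only meant to show the skipped element.

```python
def prune_forward(names, unwanted):
    """Buggy version: deleting while iterating forward skips elements."""
    names = list(names)
    for i, name in enumerate(names):
        if name in unwanted:
            del names[i]  # everything after i shifts left, so the next item is never checked
    return names


def prune_backward(names, unwanted):
    """Safe version: indices not yet visited are unaffected by a deletion."""
    names = list(names)
    for i in range(len(names) - 1, -1, -1):
        if names[i] in unwanted:
            del names[i]
    return names


folders = ["mp3", "flac", "Album"]
print(prune_forward(folders, {"mp3", "flac"}))
print(prune_backward(folders, {"mp3", "flac"}))
```

In the forward version "flac" slides into index 0 right after "mp3" is deleted, and the loop has already moved on to index 1, so it survives.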

This part runs blazingly fast but, of course, it was nowhere near enough.

Getting help from libraries

When you can’t get the information from the directory name, the next best option is to try to find the metadata. Here’s the second main reason for me to have used Python: you can find libraries for pretty much everything, and that’s a great thing.

Scraping mp3 metadata:

One thing I found out quickly enough was that macOS has a pretty handy command: “mdls”. Think of it as a counterpart to the well-known Unix command “ls” that, instead of listing files, lists the metadata attributes of a file. What I also found was that someone had already taken the trouble of writing a very easy-to-use wrapper for it, and you can find it here: https://github.com/fractaledmind/metadata

The usage is pretty simple: once you install it (the instructions in the repo are pretty clear), you just import metadata and call the list function on your file. For the file path, I took the root value returned by os.walk() and appended the current file name. From the data returned by list, you access the elements by passing the corresponding strings: “recording_year”, “authors”, “album”. I recommend trying it from the command line first, to see the output of the list() method. Also, keep in mind that the mp3 tags themselves can be incomplete, so I had to use a counter and a bool to make sure that I had found the complete metadata:

# (still inside the os.walk() loop)
for f in files:
    found = False  # bool to store whether we found the metadata
    curr = os.path.basename(root)  # name of the current folder
    # search mp3. If metadata returns all three fields, we set the bool to True and there's no need to check other files
    if f.endswith(".mp3"):
        try:
            file_data = metadata.list(os.path.join(root, f))
        except Exception:
            # here you can do what you want. I wrote to a separate file for debugging. If data retrieval failed, try the next file
            continue
        count = 0  # counter to make sure we get all 3 elements
        if "recording_year" in file_data:
            date = str(file_data["recording_year"])
            count += 1
        if "authors" in file_data:
            artist = str(file_data["authors"][0])
            count += 1
        if "album" in file_data:
            album = file_data["album"]
            count += 1
        if count == 3:
            found = True
            musicSet.add(Music(date, artist, album))
            break  # if we're here, we have our data, no need to continue iterating files in this directory
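The "all three fields or nothing" check can also be written without a counter. The sketch below is one possible alternative, not the code used above: extract_tags is a hypothetical helper, and the plain dicts stand in for whatever metadata.list() returns, assuming it behaves like a dictionary with the keys used in the snippet above.

```python
def extract_tags(file_data):
    """Return (date, artist, album), or None if any of the three is missing."""
    try:
        date = str(file_data["recording_year"])
        artist = str(file_data["authors"][0])
        album = file_data["album"]
    except (KeyError, IndexError):
        # A missing key or an empty authors list means the tags are incomplete
        return None
    return date, artist, album


complete = {"recording_year": 1977, "authors": ["Television"], "album": "Marquee Moon"}
incomplete = {"authors": ["Television"], "album": "Marquee Moon"}
print(extract_tags(complete))
print(extract_tags(incomplete))
```

A None result plays the same role as count being less than 3: the file is skipped and the next one in the folder gets its chance.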

What about FLAC files?

FLAC metadata is not stored the way mp3 metadata is stored, so it’s not accessible via mdls and, therefore, we can’t use the library above to retrieve it. But, of course, there’s a way to retrieve it and, of course, there’s a Python library to help us with that. It’s called “pyflacmeta” and you can also find it on GitHub: https://github.com/isaaczafuta/pyflacmeta

Again, this is extremely easy to use, and the logic was the same as above, except that I ran into the issue that the tags could have any combination of upper and lower case characters. So instead of accessing them directly, I iterated over the keys returned by flac_data.keys() and did something like this:

count = 0
for key in flac_data.keys():
    if key.lower() == "date":
        date = flac_data[key]
        count += 1
    # the rest is the same logic as for mp3, but with this basic syntax
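Another way to deal with the mixed-case keys is to normalise them once and then look things up as usual. This is only a sketch: lowercase_keys is a hypothetical helper, and a plain dict stands in for the object returned by pyflacmeta, assuming it supports keys() and indexing as in the loop above.

```python
def lowercase_keys(tags):
    """Copy the tags into a dict whose keys are all lower case."""
    return {key.lower(): tags[key] for key in tags.keys()}


# Stand-in for real FLAC tags, with deliberately inconsistent capitalisation
flac_data = {"DATE": "1977", "Artist": "Television", "album": "Marquee Moon"}

tags = lowercase_keys(flac_data)
has_all = all(name in tags for name in ("date", "artist", "album"))
print(has_all, tags["date"], tags["artist"], tags["album"])
```

If two keys differ only in case, the later one overwrites the earlier one in the copy, which is usually harmless for this purpose.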

So, by now, we can trust that every file that has proper tags or was properly named will be correctly placed into our set. There are two problems, though: you may have music files in other formats, like .wav, .ogg, etc., and there are bound to be files that were never tagged properly.

If all else fails…

What to do in case none of the options above worked? Well, there are internet databases, of course. After reading and researching a bit, I ended up choosing Discogs (https://www.discogs.com/). You need to register for an account and get a token to be able to connect to the API, but both are free and fairly easy to do. Once you have done that and installed the Python client, you just import discogs_client and put the authentication information in your code:

app = "MyMusicList/0.1"  # placeholder: string containing your app name
token = "YOUR_DISCOGS_TOKEN"  # placeholder: string containing your token

Then, you start your client:

discog = discogs_client.Client(app, user_token=token)

And then you’re ready to use it. I had to create a function to prepare the path so that I could extract the album name + artist in a form that could be found in the database. That was when the “unwanted” list came in handy, because it allowed me to remove the most frequent strings that could confuse the search. The rest of the preprocessing of the path is basic string manipulation, and it’s way too dependent on how your directories and files are set up to post it here.
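Just to give an idea of the kind of clean-up involved, here is a minimal sketch under strong assumptions: it supposes a layout of artist folder / album folder, and prepare_path, UNWANTED and the sample path are all hypothetical. A real library will almost certainly need different rules.

```python
import os
import re

# Hypothetical list of strings that tend to confuse the search
UNWANTED = ["mp3", "flac", "320kbps"]


def prepare_path(root, unwanted=UNWANTED):
    """Build a search string from an '.../artist/album' path (assumed layout)."""
    artist = os.path.basename(os.path.dirname(root))
    album = os.path.basename(root)
    query = f"{artist} {album}"
    for word in unwanted:
        # Remove the unwanted string wherever it appears, ignoring case
        query = re.sub(re.escape(word), " ", query, flags=re.IGNORECASE)
    # Turn brackets, underscores and dashes into spaces
    query = re.sub(r"[\[\]\(\)_\-]+", " ", query)
    # Collapse repeated whitespace
    return " ".join(query.split())


print(prepare_path("/music/Television/Marquee_Moon [FLAC]"))
```

Note that the unwanted strings are removed even when they sit inside a longer word, so a list like this has to be chosen with some care.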

The way in which I actually retrieved the data was less intuitive than in the previous cases, and there is probably a better way. What I did was, basically: first search by name, which returns a list of results. From these, you get the id of the first one, and fetch the album by means of discog.release(). This gives you the year, artist and album.

# (still inside the "for f in files" loop)
if not found:  # this was the bool we set up at the beginning of the loop. If we're here, neither library above worked as expected.
    album = preparePath(root)
    # first search by name. Get the id from the first result, then fetch the release by id to get the year
    try:
        results = discog.search(album, type="release")
        # if there are no results, count will be 0
        if results.count < 1:
            break
        release_id = results[0].id
        release = discog.release(release_id)
        musicSet.add(Music(str(release.year), release.artists[0].name, release.title))
    except Exception:
        continue
    break

Discogs is brilliant, and chances are that if something exists, it will find it. There are two problems, though:

  1. Of course, because it has to connect via the Internet, it’s much slower than any of the previous methods. Keep that in mind if you have a large collection.
  2. It’s extremely sensitive to misspellings, so it may not find things if they are not spelled correctly. But what’s even worse, their algorithms may decide that the most likely match to your badly spelled search was something other than what you thought.
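One way to soften the speed problem is to avoid asking the same question twice and to pause between real requests. The sketch below is an assumption-laden illustration, not part of the original script: make_cached_lookup is a hypothetical helper, the one-second default pause is an arbitrary choice, and fake_lookup stands in for a function that would query Discogs.

```python
import functools
import time


def make_cached_lookup(lookup, min_interval=1.0):
    """Wrap lookup so repeated queries hit a cache and real calls are spaced out."""
    last_call = None

    @functools.lru_cache(maxsize=None)
    def cached(query):
        nonlocal last_call
        if last_call is not None:
            # Wait until at least min_interval seconds have passed since the last real call
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        return lookup(query)

    return cached


calls = []


def fake_lookup(query):
    # Stand-in for a network search; records how often it is really called
    calls.append(query)
    return ("1977", "Television", "Marquee Moon")


cached_lookup = make_cached_lookup(fake_lookup, min_interval=0.0)
first = cached_lookup("Television Marquee Moon")
second = cached_lookup("Television Marquee Moon")
print(len(calls))
```

With several backups of the same folders, the same search string comes up again and again, so a cache like this could save a good share of the network round trips.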

Because of all this, for the most part you are bound to have to go through the list yourself even after all this. I can’t think of a way out of that. It’s the curse of unreliable input, of course.

As an aside, at some point I did try other methods, like using the MusicBrainz Picard app. The problem is that those kinds of apps require constant feedback on your part if you want the results to be perfect, so for large libraries the amount of manual work is pretty much the same.

I ended up with a list of about 3500 albums, with some repetitions, mostly due to typos and different spellings. The source was about 1 TB of music spread out across several folders on 3 hard drives. The whole thing was done in a matter of minutes, with the vast majority of the time being spent querying Discogs, so it’s pretty fast overall. As I said, there was a lot of manual work after that, but it saved me a lot of time.
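The repetitions caused by typos can at least be flagged automatically before the manual pass. The following is a sketch using difflib from the standard library; it was not part of the process described above, and near_duplicates, the 0.85 threshold and the sample titles are all assumptions made for illustration.

```python
import difflib
import itertools


def near_duplicates(titles, threshold=0.85):
    """Return pairs of titles whose similarity ratio reaches the threshold."""
    pairs = []
    for a, b in itertools.combinations(sorted(titles), 2):
        ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


titles = ["Marquee Moon", "Marque Moon", "Unknown Pleasures"]
pairs = near_duplicates(titles)
print(pairs)
```

Comparing every pair grows quadratically with the number of albums, which is still manageable for a few thousand titles, and the flagged pairs are only suggestions: deciding which spelling is the right one remains a manual job.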

At any rate, I hope this will help someone who wants to sort their music collection. As I said, other than “just because”, the reason I did this was that I decided to give my whole collection a listen in chronological order (I know, I’ll be there a while!). I am writing about it, so if you haven’t had enough of my ramblings (and, of course, if you like music), you can read about it here: https://mymusicintime.blogspot.com.ar/
