Remove Invalid Characters From Filenames
########################################

The sed regular expression comes from this Server Fault question: http://serverfault.com/questions/348482/how-to-remove-invalid-characters-from-filenames

I have another article on converting file names between encodings that might fit your needs: converting between encodings

Intro

Imagine a file with this name:  009_-_%86ndringshåndtering.html

Or suppose you need to convert a full path into a single filename, for example /data/folder/file.txt into _data_folder_file.txt. This is useful in a script, especially one that operates on folders and should record the folder path in its log file name. Say you write your own defrag tool and run it on a folder called /data/folder; the log file can then be called defrag_log_data_folder.log. This matters because a filename cannot contain a / character. Try creating a file called defrag_log_/data/folder.log and it fails, because / is always treated as a directory separator.

How do you rename it to something made only of safe ASCII characters, so that it can exist as a file?

Use sed: tell sed to replace every character outside the safe set (anything that is not A-Z, a-z, 0-9, a dot, an underscore or a dash) with an underscore.

Answer:  mv 'file' $(echo 'file' | sed -e 's/[^A-Za-z0-9._-]/_/g')

If the variable filename is set to the filename:
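A minimal sketch of that case, assuming the variable is literally called filename. The sample name is the one from the intro, and the scratch directory is only there so the example is safe to run. The LC_ALL=C prefix is my own addition, not part of the original one-liner: it makes sed treat the name as raw bytes, so A-Z and a-z mean plain ASCII and a multi-byte character such as å turns into one underscore per byte.

```shell
# Work in a scratch directory so nothing real gets renamed.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

filename='009_-_%86ndringshåndtering.html'
touch -- "$filename"

# Replace every byte outside A-Z, a-z, 0-9, dot, underscore and dash.
newname=$(printf '%s\n' "$filename" | LC_ALL=C sed -e 's/[^A-Za-z0-9._-]/_/g')

# Only rename when the name actually changes.
if [ "$filename" != "$newname" ]; then
    mv -- "$filename" "$newname"
fi
printf '%s\n' "$newname"
```

The -- after mv stops a filename that starts with a dash from being read as an option.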

Here is how you operate on full paths like I mentioned earlier:
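Nothing new is needed for this, because / is not in the allowed set and so gets replaced like any other character. A small sketch, where the variable names fullpath and flatname are just placeholders I picked:

```shell
fullpath='/data/folder/file.txt'

# Every / falls outside the allowed set, so each one becomes an underscore.
flatname=$(printf '%s\n' "$fullpath" | sed -e 's/[^A-Za-z0-9._-]/_/g')

printf '%s\n' "$flatname"
# prints: _data_folder_file.txt
```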

This is also useful for converting a full path into a string that can be appended to a file name, as in the defrag log example from the intro.
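Here is roughly how the defrag log name from the intro could be built. The variable names target, suffix and logfile are made up for this sketch:

```shell
target='/data/folder'

# Flatten the path into a string that is safe inside a filename.
suffix=$(printf '%s\n' "$target" | sed -e 's/[^A-Za-z0-9._-]/_/g')

# The leading / of the path supplies the underscore after "defrag_log".
logfile="defrag_log${suffix}.log"

printf '%s\n' "$logfile"
# prints: defrag_log_data_folder.log
```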

How to convert many bad file and folder names to good ones:

First download rename.pl and edit the program script so that its output is easier to read, like I have done here: beautifying rename.pl script

SIDENOTE: here is an interesting article showing how you can use rename.pl script to rename Movie folders

Now go to your folder:

Now run this command. It will not harm anything (because of the -n dry-run option), but it will find all of the files and folders (in this folder, not recursively) with questionable characters:  rename -n 's/[^A-Za-z0-9._-]/_/g' *

What do we have here? We are telling it to look through all files and hypothetically replace every character that is not an upper or lower case letter, a number, a dot, an underscore or a hyphen with an underscore. Hypothetically means it won't commit the change, because of the -n dry-run option. Here we are converting everything to an underscore, but you can convert to a hyphen (it will look like this:  rename -n 's/[^A-Za-z0-9._-]/-/g' * ) or to a space (it will look like this:  rename -n 's/[^A-Za-z0-9._-]/ /g' * ).

Don’t remove the -n and run it, unless you want all of those changes to commit.
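If Perl rename is not installed, a similar dry run can be approximated with a loop around the same sed expression. This is only a sketch and not how rename works internally: the function name preview_renames is invented for this example, LC_ALL=C is my addition so that names are treated as raw bytes, and the function only prints proposed renames without touching any file.

```shell
# preview_renames DIR: list what would be renamed in DIR (not recursive).
preview_renames() {
    for path in "$1"/*; do
        [ -e "$path" ] || continue   # empty directory: the glob matched nothing
        old=${path##*/}              # strip the directory part
        new=$(printf '%s\n' "$old" | LC_ALL=C sed -e 's/[^A-Za-z0-9._-]/_/g')
        if [ "$old" != "$new" ]; then
            printf '%s -> %s\n' "$old" "$new"
        fi
    done
}

# Demo in a scratch directory with one bad and one good name.
demo=$(mktemp -d)
touch "$demo/bad name(1).txt" "$demo/good_name.txt"
preview_renames "$demo"
# prints: bad name(1).txt -> bad_name_1_.txt
```

Like the dry run above, it skips dotfiles, because the * glob does not match them.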

Study the output and find a file that you want to rename. Then put it in like this

NOTE: Windows will show the above file as A~2+34GB but Linux will show it

Notice that without the -n option it will commit the change, so it may be a good idea to log everything.

Instead, I would keep track of all of my changes like this (tee -a will append to the renames.log file; if renames.log doesn't exist, it will create it. tee is just like > and tee -a is just like >>, except that you also get screen output):
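The logging idea does not depend on which tool does the renaming. Here is a sketch with plain mv instead of rename, assuming the log is called renames.log and lives in the current directory as described above. The helper name rename_and_log is invented for this example, and it expects a bare filename in the current directory (a path would have its slashes replaced too).

```shell
rename_and_log() {
    old=$1
    new=$(printf '%s\n' "$old" | LC_ALL=C sed -e 's/[^A-Za-z0-9._-]/_/g')
    if [ "$old" = "$new" ]; then
        return 0                     # nothing to fix, nothing to log
    fi
    # tee -a appends to the log and also echoes the line to the screen.
    mv -- "$old" "$new" &&
        printf '%s -> %s\n' "$old" "$new" | tee -a renames.log
}

# Demo in a scratch directory.
cd "$(mktemp -d)" || exit 1
touch 'a|b.txt' 'c:d.txt'
rename_and_log 'a|b.txt'
rename_and_log 'c:d.txt'
cat renames.log
```

Because of -a, running it again later adds to renames.log instead of overwriting the earlier entries.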

NOTE: The “rename” tool I'm using is the Perl rename tool, sometimes called “prename”, “rename.pl”, or simply rename. (Sidenote: there is another, similar tool also called “rename”, from the util-linux package, which isn't as good as the one we work with. Google “rename util-linux”; I have more info at the link that follows.) More information on Perl rename and the not-as-good util-linux rename: http://ram.kossboss.com/renameperl/

Maybe your files only have one or two offending characters.

Suppose all of your files contain the | character and you want to convert it to an underscore _.
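A sketch of the single-character case with sed, using a made-up sample name. In sed's default (basic) regular expressions a bare | is an ordinary character, so it needs no escaping. In a Perl expression, such as the ones passed to rename, | means alternation and would have to be written as \| instead.

```shell
name='report|final|v2.txt'

# Replace every pipe with an underscore; everything else is left alone.
fixed=$(printf '%s\n' "$name" | sed -e 's/|/_/g')

printf '%s\n' "$fixed"
# prints: report_final_v2.txt
```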

If you have a couple of trouble characters, say the pipe | and also the colon :, you can work on both characters at once (the slow method is to rerun the above command for the other character; the fast way is to have sed look for | or : and convert both to _).
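Both approaches can be sketched side by side. The sample name and the variable names slow and fast are placeholders of mine; the bracket expression [|:] matches either character in a single pass.

```shell
name='backup|2020:01:31.tar'

# Slow method: one substitution per character.
slow=$(printf '%s\n' "$name" | sed -e 's/|/_/g' -e 's/:/_/g')

# Fast method: one bracket expression covering both characters.
fast=$(printf '%s\n' "$name" | sed -e 's/[|:]/_/g')

printf '%s\n%s\n' "$slow" "$fast"
# prints backup_2020_01_31.tar twice
```

More characters can be added inside the brackets as needed, as long as a dash goes last so it is not read as a range.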

The End
