
Image file extension should not be part of the name
Open, Low, Public

Description

Currently all images include an extension that specifies the format of the image (.jpg, .png, .gif, .svg, etc.). Ideally the image name should not include this information, since it doesn't matter to those who *use* the image whether it's a JPEG or a PNG. For example, it would be much better to be able to say [[File:Map of Europe]] than to have to say [[File:Map of Europe.png]]. The author of the article shouldn't have to know (or care) what format the image is in.

Additionally, if a new version of the image is uploaded which is in a different format, it must be uploaded under a different image name. Then all the pages that use the image have to be changed, and the history of the old image is lost (T25255). This is a lot of unnecessary hassle.

Fixing this can also standardize lower-case image extensions, so that we don't have images named Bleck.PNG or Mleko.JPG (T34660).

The only major problem I see with fixing this is in the conversion process, when something has to be done about images whose names are the same except for the extension.

See also https://www.mediawiki.org/wiki/Requests_for_comment/Extensionless_files

Details

Reference
bz4421

Related Objects

Event Timeline


Respond to "Duncan Harris": Tell me ONE good reason why we want confusing
filenames/extensions such as foo.jpg and foo.jpeg. We do not want files with
names like Paris.jpg, Paris.jPg, and Paris.jpeg. The intention is to make it so
that it is not backwards compatible, since the point is resolving the problem,
not forking it.

As you point out, the two files in question are different, and both of them
should have been given a more descriptive filename to avoid a conflict in the
first place. The upload should not have been allowed; the uploader should have
been given an error asking for a better name.

There was a check to see how many images were conflicting and the number was a
thousand something some time ago. All of those 1000+ images need to be renamed
to something that is actually descriptive.

The point is to separate the image description page from the actual image,
allowing moves and other manipulations. It is a complete waste of system
resources and people's time to move images as we are doing now.

I strongly support this request. The "image description pages" have
fundamentally wrong URLs. For example, I use a little add-on in Firefox that
pops up as soon as I try to open a PDF file, asking me what to do with
that file (save, view). It works just fine everywhere but on Commons, where a
non-PDF file has .pdf as its ending: the description page. If you link to
Commons from outside Wikimedia (say, from a blog), people see a URL ending
in JPG and might try a direct download ("right-click" and "save as"). Thus, they
will download a description page. This whole behaviour of Commons is different
from what users would rightly expect from the URL.

ui2t5v002 wrote:

I furiously agree with this proposal. The encoding of an image is completely irrelevant to the way it is used in an article and the way the image's description page is accessed. The only place it is important is when fed to an end-user, where the extension would, of course, still be used. For the title of the image description page and inclusion in articles, the extension should not be present at all.

This is a perfect example of the problems it would help solve:

http://commons.wikimedia.org/wiki/Commons_talk:Deletion_requests/Superseded

If a PNG image is updated by uploading another PNG image with the same name, the older version is kept in the "image history" on the image description page for licensing and practical reasons. (Sometimes the replacement is inferior and is reverted back to the old version, for instance.) If an SVG image is uploaded to replace the PNG, however, it must exist at a new name, breaking the link to the old, and inspiring people to delete the old PNG so that it will not continue to be used. Images in the "image history" cannot be used in articles, but images that exist at other names still can be.

"I support this; metadata shouldn't be part of the filename."

It shouldn't even *be* a "filename". It should be an "image title".

"This is silly. [[image:foo.jpg]], [[image:foo.jpeg]] are different."

That's why we need Bug 709 (Cannot rename/move images and other media files.) In the meantime, the .jpeg will continue to be part of the image's title to keep them distinct. But in the future, it should be possible to rename, redirect, and merge images just like articles. Then bots can strip the extensions from image titles that don't conflict, and flag the rest for human disambiguation.
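A rough sketch of what such a bot pass might look like, written in Python purely for illustration. The function name `plan_renames` is hypothetical, and the sketch assumes every title ends in a real file extension (a title such as "St. Peter" would be mis-split).

```python
from collections import defaultdict


def plan_renames(titles):
    """Split titles into safe renames and conflicts needing a human.

    Assumption: every title ends in ".<extension>"; this is a sketch,
    not a MediaWiki API.
    """
    by_stem = defaultdict(list)
    for title in titles:
        stem, dot, _ext = title.rpartition(".")
        by_stem[stem if dot else title].append(title)

    renames = {}    # old title -> extensionless title
    conflicts = {}  # stem -> titles that would collide
    for stem, group in by_stem.items():
        if len(group) == 1:
            if group[0] != stem:
                renames[group[0]] = stem
        else:
            conflicts[stem] = sorted(group)
    return renames, conflicts
```

Given "Paris.jpg", "Paris.jpeg" and "Map of Europe.png", only the map would be renamed; the two Paris titles would be flagged under the shared stem "Paris".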

"I strongly support this request. The "image description pages" have fundamentally wrong URLs."

Another good reason. The image description page has a URL that looks like an image, but is actually a web page.

ayg wrote:

We should store the MIME type as we do now, discard the image extension on upload (or at least remove it by default), and generate an extension automatically for actual media links and thumbnails based on the stored MIME type. I can't imagine any code depending on images ending with a suffix, given that they're namespaced and all, so I don't think this would be excessively difficult to fix, either.
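To make the proposal concrete, here is a minimal Python sketch (not MediaWiki code). The `StoredFile` class and the `PREFERRED_EXT` table are hypothetical names introduced for illustration: the extension is dropped when the upload is recorded and regenerated from the stored MIME type when a media link is built.

```python
from dataclasses import dataclass

# Hypothetical MIME type -> extension table, for illustration only.
PREFERRED_EXT = {
    "image/png": "png",
    "image/jpeg": "jpg",
    "image/gif": "gif",
    "image/svg+xml": "svg",
}


@dataclass
class StoredFile:
    title: str  # extensionless title used in wikitext
    mime: str   # MIME type recorded at upload time

    @classmethod
    def from_upload(cls, filename: str, mime: str) -> "StoredFile":
        # Discard whatever extension the uploader supplied.
        stem, dot, _ext = filename.rpartition(".")
        return cls(title=stem if dot else filename, mime=mime)

    def media_name(self) -> str:
        # Regenerate an extension for media links and thumbnails.
        return f"{self.title}.{PREFERRED_EXT[self.mime]}"
```

Under these assumptions, an upload named "Map of Europe.PNG" would be referenced as [[File:Map of Europe]] while its media link ends in ".png".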

Existing images should stay where they are until we have image redirects, at which point they can be mass-moved to extensionless forms (with conflicts being ignored and left for manual cleanup). In the meantime, people who are deleting images merely for being obsolete under the assumption that they aren't being used on non-Wikimedia parts of the Internet should be smacked.

*** Bug 19874 has been marked as a duplicate of this bug. ***

Clarified summary, added dep for bug 20971.

I started looking at this problem to see if I could work up the first baby step
patch toward allowing arbitrary file names, which is to allow moving an image
to an arbitrary location.

I first just tried removing checkExtensionCompatibility from
Title::isValidMoveOperation(). That mostly seems to work, but the hitch comes
when you try to reference the bare image URL, because that's served directly
from the webserver, which typically relies on the file extension to get the
media type. So, regardless of how MediaWiki refers to the file, it still needs
to be stuffed onto disk with a valid extension intact unless you're hosting on
some ninja psychic webserver that just knows what the media type of the file is
(or perhaps one that's cracking open files to see what's in them).

The strategy I cooked up next was to modify FileRepo::getNameFromTitle(). Here
was the plan:

  • Check if the media type from the database matches up with the media type derived from the file name from the title
  • If they match, do nothing different
  • If they don't match, tack on the extension to the filename based on the media type
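The plan above can be sketched roughly as follows. This is Python for illustration only, not the actual FileRepo code; `name_from_title` and both lookup tables are hypothetical, and the stored MIME type is simply passed in as an argument.

```python
# Hypothetical lookup tables, for illustration only.
EXT_TO_MIME = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "gif": "image/gif",
}
PREFERRED_EXT = {"image/png": "png", "image/jpeg": "jpg", "image/gif": "gif"}


def name_from_title(title: str, stored_mime: str) -> str:
    """Return the on-disk name for a title, given the stored MIME type."""
    _stem, dot, ext = title.rpartition(".")
    title_mime = EXT_TO_MIME.get(ext.lower()) if dot else None
    if title_mime == stored_mime:
        return title  # extension already matches: do nothing different
    # Mismatch or no extension: tack on one derived from the media type.
    return f"{title}.{PREFERRED_EXT[stored_mime]}"
```

Note that a mismatched title such as "Foo.gif" holding a JPEG would end up on disk as "Foo.gif.jpg" in this sketch.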

I'm not even remotely sure if getNameFromTitle is the right place to insert this
sort of thing (I suspect it isn't, actually). I was looking for a place where
I could muck with the file name without also mucking with other uses of the name.
It seems safer to go elsewhere, like File::getRel and File::getUrlRel. I may
play around more with this once I get unstuck.

The wall I hit was figuring out the right way to pull the MIME type out of the
database, because I think I accidentally engaged in a mutually recursive death
spiral calling LocalFile::GetMimeType() from there. If anyone has any tips on
the right way to pull the MIME type for a given Title out, that'd be most
helpful.

I may putter around with this a bit more this weekend. This isn't a problem I'm
planning to bulldog until it's fixed, but it's something I'll post a patch to if
I manage to muddle my way into something that seems to work.

One hacky workaround that could work ok is to allow arbitrary file names in the
upload screen, but instead of rejecting the upload if the name doesn't match a
valid MIME type, simply tack the extension onto the end, then put a redirect from
the given name to the extended name. That's probably a little too hacky to have
the fully desired effect, but it does at least make it a little easier to refer to
the file in an extension-agnostic manner.

(In reply to comment #27)
<snip>

I'm not even remotely sure if getNameFromTitle is the right place to insert this
sort of thing (I suspect it isn't, actually). I was looking for a place where
I could muck with the file name without also mucking with other uses of the name.
It seems safer to go elsewhere, like File::getRel and File::getUrlRel. I may
play around more with this once I get unstuck.

I agree getNameFromTitle is bad. I'd suggest it would be better to add a new accessor File::getNameOnDisk as an alternative to File::getName, and then change the URL constructors to use the former rather than the latter.

The wall I hit was figuring out the right way to pull the MIME type out of the
database, because I think I accidently engaged in a mutually recursive death
spiral calling LocalFile::GetMimeType() from there. If anyone has any tips on
the right way to pull the mime type for a given Title out, that'd be most
helpful.

<snip>

Assuming your goal was to use MIME type to determine the appropriate extension, then that won't work. Because we have allowed capitalization variations, e.g. Foo.JPG != Foo.jpg != Foo.jpeg, there is no way to uniquely determine what the extension should be from the type. Almost certainly we will need to add an extension field to the Image and Oldimage tables and simply look up the extension. An advantage of this is that one could set all existing files to have a null extension, meaning that nothing needs to be added to the file name as already exists.
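A small Python illustration of why the type alone is not enough. The helper `extensions_by_mime` and its lookup table are hypothetical. It collects the distinct extension spellings that share one MIME type, which is exactly the information a MIME-to-extension function would have to guess.

```python
from collections import defaultdict

# Hypothetical extension -> MIME table, for illustration only.
EXT_TO_MIME = {"jpg": "image/jpeg", "jpeg": "image/jpeg", "png": "image/png"}


def extensions_by_mime(names):
    """Map each MIME type to the set of extension spellings seen for it."""
    seen = defaultdict(set)
    for name in names:
        _stem, dot, ext = name.rpartition(".")
        mime = EXT_TO_MIME.get(ext.lower()) if dot else None
        if mime is not None:
            seen[mime].add(ext)  # keep the original capitalization
    return dict(seen)
```

For "Foo.JPG", "Foo.jpg" and "Foo.jpeg" this yields three spellings for image/jpeg, so the original name on disk cannot be recovered from the type.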

ayg wrote:

(In reply to comment #27)

I first just tried removing checkExtensionCompatibility from
Title::isValidMoveOperation(). That mostly seems to work, but the hitch comes
when you try to reference the bare image URL, because that's served directly
from the webserver, which typically relies on the file extension to get the
media type. So, regardless of how MediaWiki refers to the file, it still needs
to be stuffed onto disk with a valid extension intact unless you're hosting on
some ninja psychic webserver that just knows what the media type of the file is
(or perhaps one that's cracking open files to see what's in them).

The simplest way to handle this from our perspective is to just give all the on-disk files a name ending in, say, .png. This will typically cause an incorrect Content-Type to be served -- except for PNG files, of course -- but browsers will display the pictures fine anyway, as long as it's served as some recognized image type. See http://tools.ietf.org/html/draft-abarth-mime-sniff-03. In fact, it should work fine in many cases even if a non-image MIME type is served.

Arguably, relying on this MIME type sniffing is incorrect and confusing. But it's a possibility, for simplicity's sake. It's certainly reliable.
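For readers unfamiliar with sniffing: it means identifying a file from its leading bytes rather than from its name or declared Content-Type. The Python sketch below shows only the basic idea using a few well-known file signatures; it is not the algorithm from the linked draft, and `sniff` is a hypothetical helper.

```python
# Well-known leading-byte signatures for a handful of formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff(head: bytes):
    """Guess a MIME type from the first bytes of a file, or return None."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```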

(In reply to comment #29)

The simplest way to handle this from our perspective is to just give all the
on-disk files a name ending in, say, .png. This will typically cause an
incorrect Content-Type to be served -- except for PNG files, of course -- but
browsers will display the pictures fine anyway, as long as it's served as some
recognized image type. See
http://tools.ietf.org/html/draft-abarth-mime-sniff-03. In fact, it should
work fine in many cases even if a non-image MIME type is served.

Arguably, relying on this MIME type sniffing is incorrect and confusing. But
it's a possibility, for simplicity's sake. It's certainly reliable.

Though I haven't tested it systematically, I'll assume that some fraction of browsers will happily process some fraction of file types without extensions. However this feels like a terrible hack, and I worry that not enough browsers would process enough file types.

In particular, if we want our approach to work for MediaWiki installs in general, then we can't simply assume that we are only talking about image files. Does it work for PDFs, for Word documents, for spreadsheets, etc.? Also, if the file has a .png extension and a person saves it to their local hard drive, then I strongly suspect that Windows users will have a hard time reading the file in most apps without manually changing the extension (not sure about Mac / Linux).

I think it makes much more sense to provide the user with an appropriate extension, even if that information is unnecessary in some cases.

ayg wrote:

(In reply to comment #30)

Though I haven't tested it systematically, I'll assume that some fraction of
browsers will happily process some fraction of file types without extensions.
However this feels like a terrible hack, and I worry that not enough browsers
would process enough file types.

In particular, if we want our approach to work for Mediawiki installs in
general, then we can't simply assume that we are only talking about image
files. Does it work for PDFs, for Word Documents, for spreadsheets, etc.?
Also if the file has a .png extension and a person saves it to their local hard
drive, then I strongly suspect that Windows users will have a hard time reading
the file in most apps without manually changing the extension (not sure about
Mac / Linux).

I think it makes much more sense to provide the user with an appropriate
extension, even if that information is unnecessary in some cases.

Reasonable points, especially about things like PDFs and saving files locally. I retract the suggestion.

Created attachment 6680
Incomplete attempt to allow image moves to arbitrary names

Patch attached for an initial incomplete implementation. It seems to work with my very limited testing. Note: I added a new config setting ($wgCheckFileExtensions) which needs to be set to "false" in order to use this (default is "true").

In reply to comment #28 (great feedback, btw):

I agree getNameFromTitle is bad. I'd suggest it would be better to add a new
accessor File::getNameOnDisk as an alternative to File::getName, and then
change the URL constructors to use the former rather than the latter.

Okee doke. I added File::getFilename, and changed a few calls to point to that. I had to add a corresponding FileRepo::getFilenameFromTitle.

Assuming your goal was to use MIME type to determine the appropriate extension,
then that won't work. Because we have allowed capitalization variations, e.g.
Foo.JPG != Foo.jpg != Foo.jpeg, there is no way to uniquely determine what the
extension should be from the type.

It's possible, and the attached patch does it in a pretty reasonable way (adding a new "getPreferredExtensionForType" that leverages some existing normalization code).

However, I concur that this isn't the best solution. The problem with it is that an innocent reconfiguration could render the files inaccessible.

Almost certainly we will need to add an
extension field to the Image and Oldimage tables and simply look up the
extension. An advantage of this is that one could set all existing files to
have a null extension, meaning that nothing needs to be added to the file name
as already exists.

I ran out of time before I could implement this, but that would seem to be the next logical step in all of this. I still think we'll need the logic for generating an extension from a MIME type: in the event that the initial uploaded file name doesn't match the MIME type, we'll need to get an extension from somewhere.

attachment bug4421-robla-v1.patch ignored as obsolete

Created attachment 6859
v2 patch - handles upload, still buggy

New version of a patch. Still some testing, known bugs and cleanup to do, but looking for last minute feedback before I finish this off. The big change is that there's a new field in the image table (img_file_ext) along with corresponding changes in oldimage and filearchive. It appears as though the check on upload was already mostly coded, and there was even a $wgCheckFileExtensions variable that I didn't notice in my first version (looks like it's an antique, too).

Known bug: uploading an image without an extension will cause the DB to end up in incorrect state.

Here's my test plan:

  • Image renaming:
    • Upload Foo.jpg
    • Rename Foo.jpg to Foo
    • Rename Foo to Foo.jpeg
    • Rename Foo.jpeg to Foo.gif
    • Upload Bar (GIF file)
    • Rename Bar to Bar.gif
  • Set $wgSaveDeletedFiles=true
  • Set $wgFileStore['deleted']['directory'] to valid directory
  • Delete, then undelete an image
  • Upload a new version of an image
    • With no extension
    • with proper extension
  • Change configuration of default extension from "jpg" to "jpeg". Deal with images from before transition
  • Install MW 1.15, set wgCheckFileExtensions=false, upload images (with/without matching extensions) then upgrade to new version and check images
  • Fresh install of MediaWiki uploading both images with/without matching extension in title

attachment bug4421-robla-v2.patch ignored as obsolete

Created attachment 6885
bug4421-robla-v3-svn59811.patch

It's mostly working, though working through all of the edge cases is a bit of a game of whack-a-mole. Most of the complexity comes from needing to store the files on the filesystem with appropriate file extensions, since these get served directly from the filesystem by Apache. Thus, there's a lot of convoluted logic for tacking on the file extension in the appropriate spots.

An example of something that isn't working that I need advice on is this: with my modified version, it's possible to upload a jpeg to a location without an extension, then upload a png to that same location. The problem comes in LocalFile::publish(). Here's the call it makes from that function:
$status = $this->repo->publish( $srcPath, $dstRel, $archiveRel, $flags );

This causes two things to happen:

  1. copy $dstRel to $archiveRel
  2. copy $srcPath to $dstRel.

The problem here is with uploading a png over the top of a jpg. For example, if the title name is "File:Foo", then the filename for the first version of the file will be "Foo.jpg", and the replacement will be "Foo.png". So, if we pass "Foo.jpg" to $dstRel, then step 1 works, but step 2 fails. If we pass "Foo.png", then the opposite problem occurs.

Thoughts on dealing with this problem? It would seem that modifying FileRepo::publish() (or adding a new method with more parameters) is the only solution here.
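One way to picture the "more parameters" variant is a Python sketch over plain files (not the real FileRepo code). The function `publish` below and its four arguments are hypothetical. The point is only that the path of the current version and the path of the new version are passed separately, so Foo.jpg can be archived while Foo.png is installed.

```python
import os
import shutil


def _ensure_parent(path):
    # Create the parent directory of `path` if it has one.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)


def publish(src_path, old_dst, new_dst, archive_path):
    """Archive the current version (if any), then install the upload.

    old_dst and new_dst may differ when the format changes, e.g.
    Foo.jpg -> Foo.png for the title "File:Foo".
    """
    if old_dst is not None and os.path.exists(old_dst):
        _ensure_parent(archive_path)
        # Move rather than copy, so no stale Foo.jpg is left behind
        # when the new version lands at a different name.
        shutil.move(old_dst, archive_path)
    _ensure_parent(new_dst)
    shutil.copyfile(src_path, new_dst)
```

When the format does not change, old_dst and new_dst are simply the same path and the behaviour matches the two-step description above.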

attachment bug4421-robla-v3-svn59811.patch ignored as obsolete

Created attachment 6926
bug4421-robla-v4all-svn60601.patch

Yet another version of the patch. Still needs testing, but otherwise I think this one is ready for primetime.

attachment bug4421-robla-v4all-svn60601.patch ignored as obsolete

Created attachment 6927
bug4421-robla-v4staged-svn60601.tar.gz

bug4421-robla-v4staged-svn60601.tar.gz is a tarball containing the same patch as bug4421-robla-v4all-svn60601.patch, only broken up into several stages worth of patches. I broke it up both in hopes that it might be easier to digest in smaller parts, and as a way to review my own code.


I've now tested this as much as I'm going to for now. Anyone care to try this out?

Such a major change to the file repo code needs a review by Tim for security, scalability, etc.

I'm not really interested in making major changes to the trunk at the moment, due to the need to stabilise for a 1.16 release branch. But feel free to commit it to a development branch.

(In reply to comment #39)

I'm not really interested in making major changes to the trunk at the moment,
due to the need to stabilise for a 1.16 release branch. But feel free to commit
it to a development branch.

Granted, and I certainly don't suggest trying to get this in before 1.16 branches and releases, just a general note that I'd like a thorough review before this does (eventually) go into trunk :)

The individual patches are checked into a branch now:
http://www.mediawiki.org/w/index.php?title=Special:Code/MediaWiki/path&path=/branches/extensionless-files/

I checked in the important patches (stage 1 through stage 3) first, then the optional ones after that (and then the one file I forgot to add...oops). svn revs 60770-60773 and 60779 are the important ones; svn revs 60774-60778 are minutiae that can be evaluated independently.

I am the owner of bug 20971, which is closely linked to this. Could someone kindly inform me about the progress here? Thank you.

Hi Mattia, the code is still sitting in the extensionless-files branch. I can conceivably take a crack at bringing it up-to-date with the trunk and merging it in. However, it won't make it into 1.16, and it requires a database upgrade, so it probably needs more review than it's gotten so far.

If you're eager to accelerate progress on this, my recommendation would be to raise this on the mediawiki-l mailing list, making a case for why this is needed sooner rather than later.

In the meantime, I'll work on getting the easier portions of this patch incorporated into trunk, so as to hopefully make it easier to incorporate the rest when the time comes.

(In reply to comment #43)

If you're eager to accelerate progress on this, my recommendation would be to
raise this on the mediawiki-l mailing list, making a case for why this is
needed sooner rather than later.

wikitech-l would probably be more appropriate.

I'm not so sure about this.

Personally, my scripts often rely on the assumption that image.img_name is the same as page.page_title when page_namespace = 6. I use this assumption to generate reports of files without file description pages, file description pages without files, and comparing enwiki_p.page.page_title to commonswiki_p.image.img_name. I imagine there are other scripts on the Toolserver and elsewhere that rely on a similar assumption. This change would likely break these scripts.

I'm also concerned about naming conflicts. In bug 20971#c4, Brion suggests that only non-conflicted image names would be stripped. Deliberate inconsistency here doesn't seem like an ideal situation for editors or anyone else. Though I suppose page text will be inconsistent for the rest of time if this change is implemented anyway.

On an emotional level, stripping the file extensions feels wrong. A JPG simply isn't the same as a GIF or a PNG or an SVG. Even users who are only adding the file inclusion code to pages need to understand and appreciate that.

wikipedia wrote:

Your concerns are, IMO, trivial compared to the troubles we have from including extensions thus far.

You're right that JPEG, GIF, PNG, & SVG simply are not the same, which is just another reason why they don't need filename extensions anywhere, and certainly not at their File:Name locales.

mike.lifeguard+bugs wrote:

(In reply to comment #45)

stripping the file extensions feels wrong. A JPG simply
isn't the same as a GIF or a PNG or an SVG. Even users who are only adding the
file inclusion code to pages need to understand and appreciate that.

I agree completely here. We often upload different formats of the same image for differing purposes, and change only the file extension. The reason for that is it *does* matter which one you use!

(In reply to comment #5)

Okay, well, the main reason I proposed it is that it basically prevents a
replacement image in a different format from being uploaded.

That's a good thing. The replacement image with a different format is a different image. Any good Commoner knows that an image of a given format can never be superseded by an image in another format on that basis alone. This is a feature, not a bug.

ayg wrote:

I don't think any of the objections from comment 45 or comment 47 are very compelling. But I'm wondering what happens with non-image files. Should it be impossible to tell videos from images from PDFs based on the names? You would have no idea from looking at the wikitext source whether [[File:Foo]] is including image or video or audio or maybe something else entirely.

mike.lifeguard+bugs wrote:

(In reply to comment #46)

the troubles we have from including extensions thus far.

Remind me what those troubles are? I cannot think of a single one, while I can think of contraindications.

ayg wrote:

"The file extension is stored in a new 'img_file_ext' field in the 'image' table (and similar fields to oldimage and filearchive). This field defaults to null. When it is set to null, the file name and the page title are the same."

Could we instead just key off the MIME type here? That seems simpler and less redundant. What would img_file_ext='gif' but img_minor_mime='jpeg' mean? That's denormalized.

Hi Aryeh: the reason I chose to store the file extension in addition to the MIME type is that img_file_ext='jpg' and img_file_ext='jpeg' are both valid values when img_minor_mime is 'jpeg'. While one might be able to infer what the extension would be based on the preferred extension given the MIME type, it's potentially a booby trap for devs and sysadmins down the road, who might unintentionally corrupt a wiki by changing the preferred file extension from one to the other. What may seem like a harmless switch from "jpg" to "jpeg" as the preferred extension would suddenly cause a lot of existing images, archive images, and thumbnails to break. By storing this in the DB, changing the preferred extension in the configuration/code is safe, with only future updates taking on the new preferred extension.
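The booby trap can be shown with a toy Python model (not MediaWiki code; the function names and the two config dicts are hypothetical). A name derived from configuration stops matching the file on disk once the preferred extension changes, whereas a name built from the stored extension keeps working, and a null stored extension means the title already is the on-disk name.

```python
def derived_name(title, mime, preferred):
    # Extension inferred from the current configuration.
    return f"{title}.{preferred[mime]}"


def stored_name(title, stored_ext):
    # Extension read from a per-file column; None means "title is the name".
    return title if stored_ext is None else f"{title}.{stored_ext}"


on_disk = {"Foo.jpg", "Bar.JPG"}     # files written under the old config
old_config = {"image/jpeg": "jpg"}
new_config = {"image/jpeg": "jpeg"}  # the "harmless" switch

derived_before = derived_name("Foo", "image/jpeg", old_config)
derived_after = derived_name("Foo", "image/jpeg", new_config)
```

Here `derived_before` still points at a real file, `derived_after` does not, and `stored_name("Foo", "jpg")` is unaffected by the configuration change.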

(In reply to comment #48)

I don't think any of the objections from comment 45 or comment 47 are very
compelling.

I disagree, I think they are rather convincing, and I pretty much agree with everything Mike.lifeguard and MZMcBride have said.

But I'm wondering what happens with non-image files. Should it be
impossible to tell videos from images from PDFs based on the names? You would
have no idea from looking at the wikitext source whether [[File:Foo]] is
including image or video or audio or maybe something else entirely.

I think that's a rather compelling argument against it right there.

(In reply to comment #46)

You're right that JPEG, GIF, PNG, & SVG simply are not the same, which is just
another reason why they don't need filename extensions anywhere, and certainly
not at their File:Name locales.

That doesn't make any sense. I want to know what type of media I'm using. The extension helps convey that.

I do see the annoyances of JPG vs jpg, but that could very well be fixed by normalizing extensions to lowercase on upload, regardless of what happens here. I'm still not convinced of the overall usefulness of this though.
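Lowercasing on upload is a small change in isolation. A minimal Python sketch, where `normalize_extension` is a hypothetical helper and not an existing MediaWiki function:

```python
def normalize_extension(name: str) -> str:
    """Lowercase only the extension, leaving the rest of the name alone."""
    stem, dot, ext = name.rpartition(".")
    if not dot:
        return name  # no extension to normalize
    return f"{stem}.{ext.lower()}"
```

This would turn "Bleck.PNG" into "Bleck.png", but it does nothing about "jpg" versus "jpeg", which are distinct spellings rather than case variants.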

But I'm wondering what happens with non-image files. Should it be
impossible to tell videos from images from PDFs based on the names? You would
have no idea from looking at the wikitext source whether [[File:Foo]] is
including image or video or audio or maybe something else entirely.

How does one know the difference between a GIF and an animated GIF based on file extension? How does one know the difference between a Flash file (.swf) that just has static vector art versus video? How does one know the difference between a static .svg and one that includes a <video> element?

The arguments in the "URI Opacity" section of the W3C's Architecture group apply to this conversation too:
http://www.w3.org/TR/webarch/#uri-opacity

(In reply to comment #54)

How does one know the difference between a GIF and an animated GIF based on
file extension? How does one know the difference between a Flash file (.swf)
that just has static vector art versus video? How does one know the difference
between a static .svg and one that includes a <video> element?

Sure, a file's extension is not a magic bullet that unambiguously tells you everything you want to know about a file. But it *helps* a great deal.

wikipedia wrote:

Helps with what? Realizing you have to upload a replacement image to another name because this bug isn't closed? For what other reason would it matter what format the file is? (even though you'd be able to tell regardless)

nw.wikipedia wrote:

So what would this do for cases like http://commons.wikimedia.org/wiki/File:Banana.JPG and http://commons.wikimedia.org/wiki/File:Banana.png? The two are of completely different images.

So what would this do for cases like
http://commons.wikimedia.org/wiki/File:Banana.JPG and
http://commons.wikimedia.org/wiki/File:Banana.png? The two are of completely
different images.

Since they have two different page titles, they'd be treated as two different images. For that matter, http://commons.wikimedia.org/wiki/File:Banana.jpeg and http://commons.wikimedia.org/wiki/File:Banana.jpg would still be treated as two different images.

The only thing this feature does (if enabled) is *allow* for the creation of "http://commons.wikimedia.org/wiki/File:Banana", and decouple the MIME type from the page title extension. It does not automatically strip off the extension from existing page titles or create automatic redirects of any sort.

The parenthetical "(if enabled)" bit is important here, too. There's nothing forcing anyone (including Wikimedia Foundation) to actually use this feature just by virtue of MediaWiki supporting the functionality.

(In reply to comment #56)

Helps with what? Realizing you have to upload a replacement image to another
name because this bug isn't closed? For what other reason would it matter what
format the file is? (even though you'd be able to tell regardless)

You still haven't explained how we're supposed to know what [[File:Name]] is when looking at the syntax.

(In reply to comment #58)

The only thing this feature does (if enabled) is *allow* for the creation of
"http://commons.wikimedia.org/wiki/File:Banana", and decouple the MIME type
from the page title extension. It does not automatically strip off the
extension from existing page titles or create automatic redirects of any sort.

The question is not "will it break existing images"; it's that when you strip the extension, the name becomes meaningless. If I'm trying to include a picture of a banana, I want a JPG or PNG, not an MPG. What type of media am I using here: [[File:Banana]]? By keeping the extension, it's not (as) ambiguous. Like Roan said above, it's not a magic bullet, but it certainly helps.

The parenthetical "(if enabled)" bit is important here, too. There's nothing
forcing anyone (including Wikimedia Foundation) to actually use this feature
just by virtue of MediaWiki supporting the functionality.

Yes, but if it's not a good feature (which we seem to disagree on), we shouldn't support it at all. If we implemented every idea someone had, we'd have a lot fewer WONTFIXes. I think the outstanding questions need answering before this moves forward any more.

You still haven't explained how we're supposed to know what [[File:Name]] is
when looking at the syntax.

You're not, as explained here:
http://www.w3.org/TR/webarch/#uri-opacity

(In reply to comment #60)

You still haven't explained how we're supposed to know what [[File:Name]] is
when looking at the syntax.

You're not, as explained here:
http://www.w3.org/TR/webarch/#uri-opacity

Agents aren't supposed to infer anything. From the spec:

The example URI used in the travel scenario ("http://weather.example.com/oaxaca")
suggests to a human reader that the identified resource has something to do
with the weather in Oaxaca.

Of course it's not guaranteed to be correct (as the spec goes on to say), but it certainly does help. This is about human readability, not whether the file extension really matters.

Of course it's not guaranteed to be correct (as the spec goes on to say), but
it certainly does help. This is about human readability, not whether the file
extension really matters.

If that's the primary concern, then the right thing to do is to set up "Image:", "Audio:" and "Video:" namespaces to distinguish between different file types, rather than lumping them all in to "File:". Expecting non-technical users to understand that ".svg" usually means a vector diagram hardly serves the goal of readability.

(In reply to comment #62)

Of course it's not guaranteed to be correct (as the spec goes on to say), but
it certainly does help. This is about human readability, not whether the file
extension really matters.

If that's the primary concern, then the right thing to do is to set up
"Image:", "Audio:" and "Video:" namespaces to distinguish between different
file types, rather than lumping them all in to "File:". Expecting
non-technical users to understand that ".svg" usually means a vector diagram
hardly serves the goal of readability.

There may be a case for adding "Audio" and "Video" prefixes as aliases for "File", though it would probably cause conflicts with a fair number of installations that have already created separate namespaces with these prefixes.

Implementing this feature (extensionless files) as a configurable option with the default off might be an option, though the required schema changes make it unlikely that many people would utilize it, I think.

In general, it seems like removing the extensions causes far more problems than it solves.

ayg wrote:

Okay, so:

  1. Problem with the current system: Cannot upload a new version of a file in a different format while preserving history.
  2. Problem with the current system (not mentioned for a while): Google apparently doesn't index image pages properly on non-Wikimedia MW installs, because it assumes anything ending in .png/.jpeg/etc. is an image page, not an HTML page.
  3. Problem with the proposed system: Files are possible that have no extension, or a completely misleading extension, so it's not clear what general type of file they are (although sometimes this is unclear anyway).

There are several possible solutions I can think of. The status quo solves (3) but not (1) or (2). The proposal solves (1) and (2) but not (3). I don't see any reason why we wouldn't want to allow the proposed changes as an option; some wiki admins will surely prefer the option, although others may not. It could be disabled by default.

Another possibility is to require extensions as now, but allow upload of a new file to an existing filename of a different type. This would automatically rename the file to the new appropriate extension, and would only work if that's possible. Reverting to an earlier file of a different type would also change the name. This solves (1) and (3) but not (2). It would be a bit messy, but I think strictly better than the status quo.
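A minimal sketch of that second option, assuming a hypothetical `FileHistory` model and a made-up `PREFERRED_EXT` mapping (neither is MediaWiki code): the title keeps its base name, and the extension always follows the MIME type of the current version, including after a revert.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from MIME type to the extension used in titles.
PREFERRED_EXT = {
    "image/png": "png",
    "image/jpeg": "jpg",
    "image/svg+xml": "svg",
}


@dataclass
class FileHistory:
    """Hypothetical model: one history shared by versions of different types."""
    base_name: str
    versions: list = field(default_factory=list)  # MIME types, oldest first

    def upload(self, mime: str) -> None:
        # Only allow the rename when an extension is known for the new type.
        if mime not in PREFERRED_EXT:
            raise ValueError(f"no known extension for {mime}")
        self.versions.append(mime)

    def revert_to(self, index: int) -> None:
        # A revert is recorded as a new version with the old type.
        self.versions.append(self.versions[index])

    @property
    def title(self) -> str:
        # The visible title follows the type of the current version.
        return f"{self.base_name}.{PREFERRED_EXT[self.versions[-1]]}"


history = FileHistory("Map of Europe")
history.upload("image/png")      # title is "Map of Europe.png"
history.upload("image/svg+xml")  # title becomes "Map of Europe.svg"
history.revert_to(0)             # title is "Map of Europe.png" again
```

The history survives the format change, but every page linking to the old title would still need a redirect or an update, which is the "bit messy" part.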

(In reply to comment #52)

Hi Aryeh: the reason I chose to store the file extension in addition to MIME is
that both img_file_ext='jpg' or 'jpeg' are both valid values when
img_minor_mime is 'jpeg'. While one might be able to infer what the extension
would be based on the preferred extension given the MIME type, it's potentially
a booby trap for devs and sysadmins down the road, who might unintentionally
corrupt a wiki by changing the preferred file extension from one to the other.
What may seem like a harmless switch from "jpg" to "jpeg" as the preferred
extension would suddenly cause a lot of existing images, archive images, and
thumbnails to break. By storing this in the DB, changing the preferred
extension in the configuration/code is safe, with only future updates taking on
the new preferred extension.

Why not just hardcode "jpg" as the preferred version, and never change it? That seems a lot simpler and less error-prone than keeping track of it.

Why not just hardcode "jpg" as the preferred version, and never change it?
That seems a lot simpler and less error-prone than keeping track of it

There would need to be all sorts of red flags and warnings around the part of the configuration/code that specifies that mapping, and if there's ever a legitimate need to remap any extension, fixing it becomes pretty fragile. The current mapping of image/jpeg->".jpeg" as preferred extension is in the mime.types file, which looks roughly compatible with the Apache mime.types file. Someone may naively copy an Apache file over and screw up their wiki if the ordering isn't the same.

Mind you, it's not just JPEG that has multiple choices for filename, it's most media types. Changing it on an existing wiki seems like it'd really screw things up, and it's pretty easy to imagine someone trying it.
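To make the failure mode concrete, here is a rough sketch comparing the two approaches. The row layout borrows the `img_minor_mime` and `img_file_ext` names from the discussion above; the helper functions and the `preferred_ext` dictionary are assumptions for illustration only, not the actual implementation.

```python
def derived_name(row, preferred_ext):
    """Derive the on-disk name from the MIME type via a configurable mapping."""
    mime = f"{row['img_major_mime']}/{row['img_minor_mime']}"
    return f"{row['img_name']}.{preferred_ext[mime]}"


def stored_name(row):
    """Use the extension recorded in the row at upload time."""
    return f"{row['img_name']}.{row['img_file_ext']}"


# One existing file, uploaded while "jpg" was the preferred extension.
files_on_disk = {"Paris.jpg"}
row = {
    "img_name": "Paris",
    "img_major_mime": "image",
    "img_minor_mime": "jpeg",
    "img_file_ext": "jpg",
}

preferred_ext = {"image/jpeg": "jpg"}
found_before = derived_name(row, preferred_ext) in files_on_disk  # True

# The seemingly harmless switch of the preferred extension.
preferred_ext = {"image/jpeg": "jpeg"}
found_after = derived_name(row, preferred_ext) in files_on_disk   # False
still_found = stored_name(row) in files_on_disk                   # True
```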

That said, I'm not dug in on this approach. I can definitely see the benefit of not touching the database; in fact, it was my original strategy. Part of the reason why I went with the database approach was the recommendation in comment #28, the wisdom of which was borne out after I spent a fair amount of time trying to make the no-database-changes approach work. I understand the code better now, so I'd probably be more successful if I tried again - though I'm a little nervous I might just rediscover another reason why the database change was needed. As I recall, I think what tipped me over was taking a good look at how mime.types are configured.

Regardless, the job of trying a different approach would be made easier by getting some variant of r60772 checked in (as well as some of the other fixes and tweaks on that branch), since I'm a little worried that there's more code that's being checked in that glibly assumes article title==filename. The sooner those bits are checked in, the easier it would be to maintain a branch that implements the actual feature.

ayg wrote:

(In reply to comment #65)

There would need to be all sorts of red flags and warnings around the part of
the configuration/code that specifies that mapping, and if there's ever a
legitimate need to remap any extension, fixing it becomes pretty fragile.

Hardcode it, not configurable. Add a comment if you're worried, saying "Do not change this or else existing files will become inaccessible". Even if you think developers will ignore the comment *and* no one else will notice in code review or testing, which seems excessively pessimistic, it will still be noticed immediately upon deployment, and fixed with minimal damage.

If an end-user modifies the source code without knowing what they're doing, on the other hand, they deserve whatever happens to them. There are much more destructive things they can do to their wiki.

Mind you, it's not just JPEG that has multiple choices for filename, it's most
media types. Changing it on an existing wiki seems like it'd really screw
things up, and it's pretty easy to imagine someone trying it.

It's extremely hard for me to see why anyone would decide they prefer .jpeg to .jpg (or vice versa) so much that they'd look through the source code, find the code that has the mapping, *and* ignore the comment warning them not to change it. Even if they do something so pathologically stupid, it will be caught quickly and isn't that hard to fix manually.

Regardless, the job of trying a different approach would be made easier by
getting some variant of r60772 checked in (as well as some of the other fixes
and tweaks on that branch), since I'm a little worried that there's more code
that's being checked in that glibly assumes article title==filename. The
sooner those bits are checked in, the easier it would be to maintain a branch
that implements the actual feature.

No objection to checking in a preliminary version, but it wouldn't make any sense to do a schema change only to decide we actually don't need it.

(In reply to comment #64)

  2. Problem with the current system (not mentioned for a while): Google
apparently doesn't index image pages properly on non-Wikimedia MW installs,
because it assumes anything ending in .png/.jpeg/etc. is an image page, not an
HTML page.

I wrote [[mw:Extension:FilePageMasking]] which transparently rewrites ".xxx" to "_xxx" for image description pages. This solves the Google problem by masking out the extension.

It's extremely hard for me to see why anyone would decide they prefer .jpeg to
.jpg (or vice versa) so much that they'd look through the source code, find the
code that has the mapping, *and* ignore the comment warning them not to change
it. Even if they do something so pathologically stupid, it will be caught
quickly and isn't that hard to fix manually.

I think you may be missing my point, and I also think you need to take a closer look at how things are currently done.

Look here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/MimeMagic.php?view=markup
(38 mime types, 9 with multiple file extensions)

...and here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/mime.types?view=markup
(137 mime types, 36 with multiple file extensions)

...and here:
https://svn.apache.org/repos/asf/httpd/httpd/branches/2.2.x/docs/conf/mime.types
(629 mime types, 86 with multiple file extensions)

All current and future media types with multiple choices for file extension would need to be hardcoded to specify the immutable preferred version. Granted, not all or even most of these really matter, but even accounting for that, it still leaves a lot of management headache ensuring things stay "right".

No objection to checking in a preliminary version, but it wouldn't make any
sense to do a schema change only to decide we actually don't need it.

r60772 isn't a preliminary version. It's a necessary portion of a complete final version that would be needed regardless of whether storing extensions in the database or using hardcoded extensions is the choice (or any other scheme, for that matter). There are no database changes in r60772.

ayg wrote:

(In reply to comment #68)

I think you may be missing my point, and I also think you need to take a closer
look at how things are currently done.

Look here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/MimeMagic.php?view=markup
(38 mime types, 9 with multiple file extensions)

...and here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/mime.types?view=markup
(137 mime types, 36 with multiple file extensions)

...and here:
https://svn.apache.org/repos/asf/httpd/httpd/branches/2.2.x/docs/conf/mime.types
(629 mime types, 86 with multiple file extensions)

All current and future media types with multiple choices for file extension
would need to be hardcoded to specify the immutable preferred version.
Granted, not all or even most of these really matter, but even accounting for
that, it still leaves a lot of management headache ensuring things stay
"right".

Hmm. You might be right, but denormalizing to this extent still doesn't seem like the best solution to me. If anything had to be in the database, we should be able to have a single 1:1 table mapping (img_major_mime, img_minor_mime) -> extension, not the same extension duplicated in millions of image rows.
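A rough sketch of that normalized layout, using an in-memory SQLite database. The `mime_extension` table and the simplified `image` table are hypothetical; only the `img_major_mime` and `img_minor_mime` column names come from the discussion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical lookup table: exactly one extension per MIME type.
CREATE TABLE mime_extension (
    major_mime TEXT NOT NULL,
    minor_mime TEXT NOT NULL,
    extension  TEXT NOT NULL,
    PRIMARY KEY (major_mime, minor_mime)
);
-- Simplified stand-in for the image table: no per-row extension column.
CREATE TABLE image (
    img_name       TEXT PRIMARY KEY,
    img_major_mime TEXT NOT NULL,
    img_minor_mime TEXT NOT NULL
);
INSERT INTO mime_extension VALUES ('image', 'jpeg', 'jpg');
INSERT INTO mime_extension VALUES ('image', 'png', 'png');
INSERT INTO image VALUES ('Paris', 'image', 'jpeg');
INSERT INTO image VALUES ('Banana', 'image', 'png');
""")


def backing_filename(conn, img_name):
    """Join the image row to the lookup table to build the file name."""
    row = conn.execute(
        """
        SELECT i.img_name || '.' || m.extension
        FROM image AS i
        JOIN mime_extension AS m
          ON m.major_mime = i.img_major_mime
         AND m.minor_mime = i.img_minor_mime
        WHERE i.img_name = ?
        """,
        (img_name,),
    ).fetchone()
    return row[0] if row else None
```

The extension is stored once per type rather than once per image row. The catch is that updating a row in the lookup table has the same effect on existing files as changing the preferred extension in configuration.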

r60772 isn't a preliminary version. It's a necessary portion of a complete
final version that would be needed regardless of whether storing extensions in
the database or using hardcoded extensions is the choice (or any other scheme,
for that matter). There are no database changes in r60772.

No objection from me, then. It's true that I haven't looked closely at this -- I just don't have the time right now, so I only read the RFC.

Bryan.TongMinh wrote:

As of r81601 thumbnailing of files without extension should work. Of course you can't upload files without extension, so this is not useful currently, but it is a step in the proper direction.

svenmanguard wrote:

Sorry to bring up ancient history, but I was told this is the bug to do it at. Please see http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28proposals%29#Several_changes_to_file_naming - a proposal I put forth (not knowing about this) to fix certain consistent issues with file naming. Most relevant there are points #2 and #3, as I have been told that the first one is impossible. As it stands, the three points are:

  1. Case sensitivity in image names: As it stands, three separate users could upload three separate images of three separate subjects called File:TestImage.jpg, File:TeStImAgE.jpg, and File:Testimage.jpg. There is no reason why file names should be case sensitive.
  2. Multiple filetype extensions for the same filetype: As it stands, two separate users could upload two separate images of two separate subjects as File:TestImage.jpg and File:TestImage.jpeg. There is no reason for this.
  3. Case sensitivity in filetype extensions: As it stands, and as I have seen at least twice recently, two separate images can be uploaded as File:TestImage.jpg and File:TestImage.JPG. This has the potential to cause even more problems than the above situations. There is no reason why filetype extensions should be case sensitive.

If we can handle #2 and #3 that would be wonderful.

wikipedia wrote:

(In reply to comment #71)

Sorry to bring up ancient history…

Closing this bug would effectively nullify 2 & 3.

1 isn't relevant to this bug. (It could be done, but shouldn't be, IMO. Like it or not we have given upper and lower case letters distinction from one another in this world. We need not limit our files thus.)

svenmanguard wrote:

Alright. If we can, at the very least, knock off 2 and 3, that'd be an improvement. Any word from any devs? Can this be put into motion? There's a ton of support at the thread linked in 71.

ayg wrote:

We know there's tons of support for this, and we all want to see it happen too. It hasn't happened yet because it will take a bunch of work that has yet to be done.

neilk wrote:

I have a proposal here for eliminating the file ending and also ending most of the other restrictions on filenames for Commons and uploads in MediaWiki generally.

However, I am now spending all my time on other things for the foreseeable future. Maybe someone else will find those ideas useful.

http://www.mediawiki.org/wiki/User:NeilK/Multimedia2011/Titles

*** Bug 20971 has been marked as a duplicate of this bug. ***

svenmanguard wrote:

Okay, you know what, I've had enough of this nonsense. Will someone with more knowledge of Bugzilla split my proposal, (Comment 71), off from this?

There are two proposals on this page. One is to remove filetype extensions entirely, and has gotten a whole lot of shrieks of horror over the past five years. The other is my proposal, which really shouldn't have been placed here. I did what I was told to, but my proposal is entirely different from the one made in 2006.

john wrote:

(In reply to comment #77)

Okay, you know what, I've had enough of this nonsense. Will someone with more
knowledge of Bugzilla split my proposal, (Comment 71), off from this?

There are two proposals on this page. One is to remove filetype extensions
entirely, and has gotten a whole lot of shrieks of horror over the past five
years. The other is my proposal, which really shouldn't have been placed here.
I did what I was told to, but my proposal is entirely different from the one
made in 2006.

Why don't you do it yourself? (In fact, there's probably already a bug submitted for that. Search for it.)

rd232 wrote:

(In reply to comment #77)

Okay, you know what, I've had enough of this nonsense. Will someone with more
knowledge of Bugzilla split my proposal, (Comment 71), off from this?

There are two proposals on this page. One is to remove filetype extensions
entirely, and has gotten a whole lot of shrieks of horror over the past five
years. The other is my proposal, which really shouldn't have been placed here.
I did what I was told to, but my proposal is entirely different from the one
made in 2006.

Split done for your points 2 and 3. Point 1 is probably more controversial and if you want to pursue it, should be separate.

Bug 32660 - File extensions for the same file type should not allow variations of a file name (File:X.jpg, File:X.jpeg, File:X.JPG should all refer to the same file)
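As a rough illustration of what Bug 32660 asks for, the following sketch folds extension variants into one canonical form before a title is looked up. The alias table and the function name are assumptions made for this example, not existing MediaWiki behaviour.

```python
# Hypothetical alias table: variant extension -> canonical extension.
EXTENSION_ALIASES = {"jpeg": "jpg", "jpe": "jpg", "tif": "tiff"}


def canonical_title(title: str) -> str:
    """Lower-case the extension and fold known aliases into one spelling."""
    base, dot, ext = title.rpartition(".")
    if not dot:
        return title  # no extension present, nothing to normalize
    ext = ext.lower()
    return f"{base}.{EXTENSION_ALIASES.get(ext, ext)}"


titles = ["File:X.jpg", "File:X.jpeg", "File:X.JPG"]
unique_targets = {canonical_title(t) for t in titles}  # {"File:X.jpg"}
```

Only the extension is folded here; the base name stays case sensitive, which matches the split between points 2 and 3 and the more controversial point 1.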

sumanah wrote:

Comment on attachment 6926
bug4421-robla-v4all-svn60601.patch

Patch no longer applies cleanly to trunk per Rusty Burchfield's automated testing https://docs.google.com/spreadsheet/ccc?key=0Ah_71HHl7qa7dGtvSms3TGpHQU9NU2Y1VmNzUEUteWc .

johnnymrninja wrote:

Bug 32660 was broken off of here, and I'm breaking another bug off of that, Bug 40479 "File extensions should be automatically decided by MIME type at upload". It won't fix this bug, but it would be a step in the right direction.
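One way to picture Bug 40479 is the sketch below, which picks the extension from the leading bytes of the upload instead of trusting the name the uploader typed. The signature table covers only three formats and the function name is hypothetical; a real implementation would rely on proper MIME detection.

```python
# Well-known file signatures mapped to the extension to use in the title.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]
# Extensions that are stripped from the requested name if present.
KNOWN_EXTS = {"png", "jpg", "jpeg", "gif"}


def title_for_upload(requested_name: str, data: bytes) -> str:
    """Return the requested base name with an extension chosen from the content."""
    base, dot, ext = requested_name.rpartition(".")
    if not dot or ext.lower() not in KNOWN_EXTS:
        base = requested_name  # nothing recognizable to strip
    for signature, detected_ext in SIGNATURES:
        if data.startswith(signature):
            return f"{base}.{detected_ext}"
    raise ValueError("unrecognized file type")
```

With this approach an upload named Bleck.PNG that actually contains JPEG data would end up as Bleck.jpg.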

Reviving this per new filing duped to T25255.

Note that the actual backing file needs an extension or our life gets much harder, but we have no particular need for pages in the file: namespace to be locked to filesystem file names.

Decoupling them would probably require schema changes. The image and oldimage tables currently reference the file name including extension, and don't have a separate extension listed.

This may also want to tie in with the bigger restructuring of the image/oldimage tables (T589), or with restructuring the file storage backends' file names (T66214).

For instance, if the backing files are named with uuids/hashes plus an extension, then we can use the image table or its replacement to map from page title to backing file.
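A minimal sketch of that idea, assuming a hypothetical in-memory `FileStore` in place of the image table and the storage backend: backing files are named by content hash plus extension, and the page title is only a key pointing at them.

```python
import hashlib


class FileStore:
    """Hypothetical store that decouples page titles from backing file names."""

    def __init__(self):
        self.blobs = {}             # backing file name -> file contents
        self.title_to_backing = {}  # page title -> backing names, oldest first

    def upload(self, title: str, data: bytes, ext: str) -> str:
        # The backing file keeps an extension; the page title does not need one.
        backing = f"{hashlib.sha1(data).hexdigest()}.{ext}"
        self.blobs[backing] = data
        self.title_to_backing.setdefault(title, []).append(backing)
        return backing

    def current(self, title: str) -> str:
        return self.title_to_backing[title][-1]

    def history(self, title: str) -> list:
        return list(self.title_to_backing[title])


store = FileStore()
store.upload("Map of Europe", b"fake png bytes", "png")
store.upload("Map of Europe", b"<svg/>", "svg")
# store.current("Map of Europe") now names the .svg backing file,
# while store.history("Map of Europe") still lists the earlier .png.
```

Under this layout a new version in a different format is just another entry in the same history, so [[File:Map of Europe]] keeps working without a rename.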

This request would require T126408, which was declined in 2016!?
Anyway, this request seems illogical and unsolvable to me, as I see no real meaning in this task. I guess it would require renaming thousands of files (maybe a million) to make all names unique. So I strongly vote to decline this task. (The other subtask, T34660, is still open.)

I oppose this because I have a use case where it is useful to know, from Wikidata's commonsMedia reference, the file type/media type of the referenced file without having to call any API. At present I can determine this just by looking at the file extension of the commonsMedia reference; otherwise I would have to call an API. When processing a whole Wikidata dump, that could mean a lot of API calls.
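For illustration, the kind of offline check described here might look like the sketch below. The extension table is a small assumed sample rather than the full list of types Commons accepts, and the function name is made up.

```python
# Assumed sample of extensions and the broad media type each implies.
MEDIA_TYPE_BY_EXT = {
    "jpg": "image", "jpeg": "image", "png": "image",
    "gif": "image", "svg": "image", "tiff": "image",
    "oga": "audio", "wav": "audio", "mp3": "audio",
    "ogv": "video", "webm": "video",
    "pdf": "document", "djvu": "document",
}


def media_type(commons_media_value: str):
    """Guess the broad media type from a commonsMedia file name, or None."""
    _base, dot, ext = commons_media_value.rpartition(".")
    if not dot:
        return None  # an extensionless name would force an API lookup
    return MEDIA_TYPE_BY_EXT.get(ext.lower())
```

With extensionless names the `None` branch would apply to every value, which is where the extra API calls would come from.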