• MLA looks like a tool that makes work easier. (And more difficult, you know: when you first taste something good, you see that you know nothing :D).

    Metadata import is very easy when using images. I use ExifTool and a CSV file. It is a very powerful way to work: list the images to a file, manually add captions, descriptions, keywords etc., then add the metadata to the images with ExifTool and import the images. Now I have hundreds of images on my website and I can search them easily and powerfully.

    OK, but here is the question:
    Mime types other than image can also contain metadata. As a test I edited one PDF file with a PDF editor and added keywords, captions etc.

    How can I import PDF files so that MLA also reads their metadata? I edited the file and added metadata; I also tried manually adding a metadata header (as in images). But no: after import all fields are empty. No tags, no captions etc.

    So, is there any special setting in MLA? It looks like the problem is “I forgot to tick some box in the settings”.

Viewing 7 replies - 1 through 7 (of 7 total)
  • Plugin Author David Lingren

    (@dglingren)

    Good to hear from you again, and thanks for your questions.

    MLA can extract metadata from PDF files in addition to image files. However, the metadata is stored in somewhat different ways, e.g., often in XMP structures.

    If you post a link to one or more of your PDF documents I can do some investigation and give you more specific guidance on how to extract metadata for your application. Any additional details or examples you can provide will be helpful.

    Thread Starter Jukka Kähkönen

    (@elkesan)

    Hi again.

    This week I have been studying “how to manage metadata in mime types other than image”. Now I am familiar with EXIF, IPTC and XMP. So, the answer is: “MLA is just the right tool for import, but only for mime types with exact metadata: images. Not for others.” It looks like MLA needs only a little adjusting to read metadata from PDFs and use it as it does with images.

    The problem is that “metadata is not standard”. E.g. a PDF file contains all the same metadata as an image, BUT not e.g. “category”. OK, image: the keywords are “Mountain, Finland, Childrens” and I can search with these keywords. In my work the keywords are “Manufacturername, outdoor, vandalproof” and the CATEGORY is “productphoto” or “installedexamplesystem”. Just traditional category and keyword use. E.g. “make a photo gallery, use taxonomy: keyword vandalproof, category productphoto”, and the gallery shows the product photos tagged vandalproof. The category “product photo” tells e.g. “this is a studio photo”, and the “installedexamplesystem” category tells “this is from real life”. Sorry for the long text, but I am trying to say that “IN PHOTOS category and keywords are so natural”.

    IN OTHER MIME TYPES, what stays the same is keywords. But in a PDF this “category” is a custom tag. A keyword is e.g. “manufacturername” and a category is “user manual”, “install manual”, “leaflet” etc. IN WORDPRESS and MLA this is very easy to add manually: IMPORT the PDF with MLA and (bulk) edit all this information.

    OK. We import the PDF with MLA and after that we can add this information manually. And, with a little programming, MLA reads this information from the PDF, from IPTC/EXIF/XMP.

    BUT, before this: there is no clever way to add this metadata to a PDF file. The metadata fields are quite standard, excluding Category. Depending on the editor, adding metadata is difficult, very difficult or impossible. Corel Draw PDF export: no metadata fields. Word: some metadata fields. Etc. etc., the list is very long.

    Because metadata support in PDF tools is so unreliable, the cleverest way to add IPTC/EXIF/XMP metadata to a PDF is afterwards, using a PDF editor OR ExifTool (as with images). With images this process is natural. With PDFs: 1. create the PDF, 2. make a CSV file, 3. write the metadata into the PDF using ExifTool.
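
    The three steps above can be sketched in Python. This is only an illustration under stated assumptions: the file name, the metadata values and the helper names build_csv and build_exiftool_command are hypothetical, and the column names are assumed to follow ExifTool's PDF Info tag names (Title, Subject, Keywords). The sketch builds the ExifTool command line but does not run it; the SourceFile value has to match the path ExifTool sees.

```python
import csv
import io

# Hypothetical file name and values, for illustration only.
rows = [
    {
        "SourceFile": "manual123.pdf",
        "Title": "Example user manual",
        "Subject": "Short description of the manual",
        "Keywords": "manufacturername, outdoor, vandalproof",
    },
]


def build_csv(rows):
    """Return CSV text with a SourceFile column followed by the tag columns."""
    fieldnames = ["SourceFile", "Title", "Subject", "Keywords"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def build_exiftool_command(csv_path, pdf_dir):
    """Build (but do not run) an ExifTool call that imports tags from the CSV."""
    # -csv=FILE imports tag values from a CSV file; -ext pdf limits processing to PDFs.
    return ["exiftool", f"-csv={csv_path}", "-ext", "pdf", pdf_dir]


csv_text = build_csv(rows)
command = build_exiftool_command("metadata.csv", ".")
print(csv_text)
print(" ".join(command))
```

    Saving csv_text as metadata.csv and running the printed command in the folder that holds the PDFs would be the "step 3" part; try it on copies of the files first.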

    That was the prologue, now the questions:

    1. How do I read metadata from a PDF? In the MLA settings there is an IPTC/EXIF tab, and there are my IMAGE settings, e.g. “Caption” = 2#120: Caption or Abstract etc. I have imported hundreds and hundreds of JPGs with MLA, and MLA reads all the standard metadata like a dream. How can I add the same for PDFs? That is, “use this rule only for images, but use that rule for PDFs”?

    2. Among the ten or so programs I use to make PDFs, many have no method for adding metadata inside the editor (e.g. Corel Draw). So I must add it later (e.g. with ExifTool or Adobe's PDF editor). Can I make some kind of metadata file? “MLA import PDF”, then “MLA import metadata for PDF”. E.g. “manual123.pdf” and “manual123.metadata”?

    3. Any other way?

    Plugin Author David Lingren

    (@dglingren)

    Thanks for your update and for all the research you’ve done!

    The PDF standard defines two ways to add metadata to a document.

    First, there are some “standard” PDF metadata values like Title, Author, Subject, Keywords. These are stored in the “Document Information Dictionary”, which has a fixed format and content. MLA provides access to these values with the pdf: prefix.

    Second, the standard defines “Metadata Streams”, which allow any kind of metadata format and content stored in an XML format using the Adobe XMP standard. MLA provides access to these values with the xmp: prefix.

    More information can be found in the “Field-level metadata in PDF documents” section of the Settings/Media Library Assistant Documentation tab. As I wrote in my first response, MLA can extract and use any metadata in the document once you know its location and name.

    As you have discovered, adding metadata to the document is more difficult than accessing it. Adobe Acrobat Pro has good support, but most other tools like Microsoft Word provide very limited features for working with PDF metadata.

    Regarding your questions:

    1. You can use a Content Template to populate a field such as Caption from different sources depending on item type and content. In the IPTC/EXIF mapping rules you would leave the IPTC value set to “none” and code a template in the EXIF/Template text box. For your example, you could code something like:

    template:([+pdf:Title+]|[+iptc:2#120+])

    This simple example will take the value from the PDF Title field or, if that is not defined or is empty, from the IPTC Caption or Abstract field. Since image files do not contain any PDF fields the template will access the IPTC field instead. You can extend the Content Template approach to additional data sources.
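
    The fallback behaviour described above can be modelled in a few lines of Python. This is a sketch of the "first non-empty source wins" idea only, not MLA's actual implementation; the function names and the sample items are hypothetical.

```python
def first_non_empty(*candidates):
    """Return the first candidate that is defined and not blank, else ""."""
    for value in candidates:
        if value is not None and value.strip():
            return value.strip()
    return ""


def caption_for(item):
    """Mimic template:([+pdf:Title+]|[+iptc:2#120+]) for one media item."""
    return first_non_empty(item.get("pdf:Title"), item.get("iptc:2#120"))


# Hypothetical items: a PDF has no IPTC fields, an image has no PDF fields.
pdf_item = {"pdf:Title": "User manual"}
image_item = {"iptc:2#120": "Mountain view in Finland"}

print(caption_for(pdf_item))
print(caption_for(image_item))
```

    Adding a further data source corresponds to passing one more candidate to first_non_empty.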

    2. I regret that the current MLA version does not have any features for reading CSV files and using them to update Media Library items. The idea of updating item values from an external file such as a CSV file has come up before:

    Batch upload of media (not bulk upload)

    As described in the earlier topics, a custom plugin could be developed to provide some form of CSV support. Since I wrote those earlier posts my circumstances have changed and I would have very little time to work on such a plugin.

    3. I do not know of any other approach that would serve your purpose. The best solution would be to find a good tool such as Adobe Acrobat Pro to add the required values to the documents before uploading them to WordPress.

    I am marking this topic resolved because I have given you the best information I have, but please update it if you have any further questions regarding how to find the metadata in your PDF documents. Thanks for your understanding and your continued interest in the plugin.

    Thread Starter Jukka Kähkönen

    (@elkesan)

    Dear David,

    thanks for your patience.

    Only one additional question:
    In the MLA settings:
    https://drive.google.com/file/d/1YCRwjj7J0Me28BzPbDdda_ae8KL9WHLn/view?usp=sharing

    These are all basic MLA settings. I adjusted this data and image import works as it should: MLA imports the image metadata. Everything meets the standards. I add metadata to the images using ExifTool, but IPTC/EXIF is standard, so I simply add fully standard metadata to the images.

    ———-

    I want confirmation: is this not possible with PDF? Please note that IMAGE import can read metadata and needs only a little adjusting, as you see in my screen capture.

    https://www.adobe.com/devnet/pdf/pdf_reference.html
    https://wwwimages2.adobe.com/content/dam/acom/en/devnet/pdf/PDF32000_2008.pdf
    (Page 550)
    This also confirms it:
    https://sno.phy.queensu.ca/~phil/exiftool/TagNames/PDF.html section “PDF Info Tags”.

    ALL these tags, “Author”, “Subject” etc., are BASIC metadata inside a PDF. Almost all software can write them. Here is one public PDF document, a user manual written by me:
    https://drive.google.com/file/d/1yOa4_-gp_2K-gMpPKb8_Cg9jqECP6utK/view?usp=sharing

    As you see, there is some metadata inside, BASIC metadata.

    So, my question is: MLA imports image metadata and it works. But PDF metadata is not so easy? All these IPTC/EXIF rules do not understand PDF metadata. “Add New Custom Field Rule” -> Name?? + IPTC value?? …??? I only need confirmation: “MLA cannot read PDF metadata, there is no possible way to make a rule for it”??

    Thread Starter Jukka Kähkönen

    (@elkesan)

    Solved! template:([+pdf:Title+]|[+iptc:2#120+]), as you said. I also found what the problem was. PDF is not as standard as images. This template approach is just what I was searching for.

    The problem was that I started testing with Description. Field names are not as standard as in images: depending on the source, this Description is called Description or Subject (field name). Of course, the brain says “Subject” is “Caption”, but no…

    So, now this is OK; it only needs some juggling with field names…

    Thread Starter Jukka Kähkönen

    (@elkesan)

    Of course I must continue with questions. I keep working on this more and more and studying. I find good solutions and immediately afterwards more questions.

    The main problem with PDF is its extremely limited set of metadata fields. Of course it is possible to make your own fields, but the standard fields are very limited.

    This is my solution:
    Otsikko/Title template:([+pdf:Title+]|[+iptc:2#005+])
    Alt-teksti/alt-text template:([+pdf:Title+]|[+iptc:2#105+])
    Kuvateksti/picture text template:([+pdf:Subject+]|[+iptc:2#105+])
    Kuvaus/Description template:([+pdf:Subject+]|[+iptc:2#120+])
    Tags template:([+pdf:Keywords+]|[+iptc:2#025+])
    Categories ???

    I am sorry: I have an English WordPress now, but some field names are still in Finnish in the MLA settings IPTC/EXIF tab. So the only way is to translate the Finnish terms directly into English. E.g. the direct translation of “kuvateksti” is “picture text”. I really hope I can switch WordPress fully to English, but it looks impossible.

    OKAY, BUT NOW THE QUESTIONS!
    The PDF standard fields are Title, Author, Subject, Keywords. Title = Title, Subject = Description, Keywords = Keywords.

    1. In my mind “Description” is a long fairy tale and “Caption” is a shortened version of it. Because the PDF standard does not contain “Caption”, I must use Subject as the Caption. So, is this possible?
    a: template:([+pdf:Subject+ <get only 100 characters>]|[+iptc:2#105+])
    or,
    b: template:([+pdf:Subject+ <get until any delimiter is found, e.g. “\o/”>]|[+iptc:2#105+])
    c: ?? Another way?
    The idea is to have a very long Description and a short Caption.

    2. PDF supports Keywords but not Category.
    a. Any idea how to add Categories?
    b. A template which uses Keywords:
    Keywords: tag1, tag2, tag3, CC-Category1CC-Category2
    or
    Keywords: tag1, tag2, tag3 LIMITERWORD Category1, Category2
    It looks weird, but:
    template-keywords: “Read Keywords and STOP when LIMITERWORD is found”
    template-category: “find LIMITERWORD and start reading categories after it”…

    or any other way?

    So, my questions are simply:
    1. “I want to use all three, Description, Title and Caption, with PDF, but PDF does not support Caption. If the start of the Subject field in the PDF contains the caption, can I read up to 100 characters or, better, up to a limiter mark?”
    2. “Any idea how I can also import Categories from a PDF? The PDF standard does not support Categories.”
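
    The LIMITERWORD idea from question 2 can be illustrated outside MLA with a short Python sketch. It is not MLA template syntax; the function name split_keywords is hypothetical and the sample string is the one from the question.

```python
def split_keywords(raw, limiter="LIMITERWORD"):
    """Split a Keywords string into (tags, categories) at the limiter word."""
    # partition() returns the text before the limiter, the limiter, and the rest;
    # if the limiter is missing, everything counts as tags.
    tags_part, _, categories_part = raw.partition(limiter)

    def to_list(text):
        return [term.strip() for term in text.split(",") if term.strip()]

    return to_list(tags_part), to_list(categories_part)


tags, categories = split_keywords(
    "tag1, tag2, tag3 LIMITERWORD Category1, Category2"
)
print(tags)
print(categories)
```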

    Plugin Author David Lingren

    (@dglingren)

    Thanks for your questions and the extensive details. Thanks as well for the link to one of your documents; very helpful.

    I examined your example document with Adobe Acrobat’s “Document Properties” dialog boxes (File Menu, Properties…). I saw all the standard properties you mentioned and I also saw a “Custom” property, “Categories” with a value of “Ohje, Käyttöohje”.

    I uploaded the document to my test system to see how MLA would decode the properties using the logging technique described in this earlier topic:

    Possible to convert ASCII

    I was able to map these values to the Att. Categories taxonomy with a template: template:([+xmp:Categories+]).

    This alternate template also worked for me: template:([+pdf:Categories+]).

    The above templates answer your second question regarding categories. I believe you can add additional custom properties in the same way for fields such as “Caption”, which would answer your first question.

    A more direct answer to your first question would use MLA’s “Regular Expression Features” to divide the value in the Subject field using a “limiter mark” as you proposed. I think the custom property technique is better, but you can find more information on the Regular Expression Features in the Settings/Media Library Assistant Documentation tab if you want to try the “limiter mark” idea.
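
    The “limiter mark” idea can be tried out in plain Python before writing the real rule. This sketch only models the logic: MLA’s Regular Expression Features have their own syntax, described in the Documentation tab, and the function name split_subject, the “\o/” mark and the 100-character fallback are assumptions taken from the question above.

```python
import re

# Literal limiter mark "\o/" proposed in the question; re.escape makes it safe to use as a pattern.
LIMITER = re.compile(re.escape("\\o/"))


def split_subject(subject, max_caption=100):
    """Return (caption, description) taken from a PDF Subject value."""
    parts = LIMITER.split(subject, maxsplit=1)
    if len(parts) == 2:
        # Mark found: text before it is the caption, text after it the description.
        return parts[0].strip(), parts[1].strip()
    # No mark: fall back to the first max_caption characters as the caption.
    text = subject.strip()
    return text[:max_caption].rstrip(), text


caption, description = split_subject("Short caption \\o/ Long description text")
print(caption)
print(description)
```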

  • The topic ‘Metadata other other filetypes than image (pdf)’ is closed to new replies.