Evaluate format selector before download in Python #9973

emphoeller · 2024-05-20T01:17:05Z

From within a Python script, I would like to determine which format a specific format selector will select before starting the download. Concretely, I would like to check filesize and filesize_approx of bestvideo* and bestaudio before downloading them. Is there a good way to do this?

There already is the issue [Question] How to extract filesize before download?, but that only discusses template strings. It also brings up the key requested_formats, which doesn’t seem to work here:

from yt_dlp import YoutubeDL

with YoutubeDL({'format': 'bestvideo*,bestaudio'}) as ydl:
    info = ydl.extract_info(some_url, download=False)
    f1, f2 = info['requested_formats']  # KeyError: 'requested_formats'

I’ve also already noticed that info has the keys filesize and filesize_approx, but which size do they refer to? The video, the audio, the merged format? (I didn’t even request a merge…) And why does it have both, with different values?

bashonly · 2024-05-20T03:34:07Z

The info dict will only have a requested_formats field if you request formats to be merged with the + operator, e.g. bv+ba.
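
A minimal sketch of how a script might handle both cases, assuming `info` is the dict returned by `extract_info(..., download=False)`. The helper name `selected_streams` and the sample dicts are hypothetical and hand-written for illustration; only the `requested_formats` key and its dependence on `+` come from the comment above.

```python
def selected_streams(info):
    """Return the per-stream format dicts that a selector resolved to.

    Assumption: when formats were merged with '+', info carries a
    'requested_formats' list; otherwise the top-level dict itself
    describes the single selected format.
    """
    return info.get('requested_formats') or [info]


# Hypothetical, hand-written info dicts (not real extractor output).
merged = {
    'format_id': '244+251',
    'requested_formats': [
        {'format_id': '244', 'filesize': 1000},
        {'format_id': '251', 'filesize': 200},
    ],
}
single = {'format_id': '251', 'filesize': 200}

merged_ids = [f['format_id'] for f in selected_streams(merged)]
single_ids = [f['format_id'] for f in selected_streams(single)]
print(merged_ids)
print(single_ids)
```

Falling back to `[info]` keeps the calling code uniform: it can always iterate over a list, whether or not a merge was requested.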

WillFa · 2024-05-20T09:16:29Z

The "bestvideo" and "bestaudio" are typically the defaults, and YouTube pushes that information into the info.json in the format_id field. All the filesize_approx, vcodec, acodec, resolution, ext, format_note etc. fields at the root of the JSON object describe this combination of formats.

All the filesize(_approx) values are in the .formats branch. There are about 20 of them, and you could loop through them, building a dictionary that maps each format_id to its .filesize or .filesize_approx. The .formats branch is what builds the table you see with the -F option passed to yt-dlp.
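
The loop described above could look roughly like this. It is a sketch that assumes each entry of `info['formats']` is a dict with a `format_id` and optional `filesize` / `filesize_approx` keys. The function name `sizes_by_format_id` and the sample list are hypothetical.

```python
def sizes_by_format_id(formats):
    """Map format_id -> size in bytes (exact if known, else approximate, else None)."""
    sizes = {}
    for f in formats:
        # 'or' also covers keys that are present but set to None.
        sizes[f['format_id']] = f.get('filesize') or f.get('filesize_approx')
    return sizes


# Hypothetical, hand-written sample; real lists come from info['formats'].
sample_formats = [
    {'format_id': '139', 'filesize': 50_000},
    {'format_id': '244', 'filesize': None, 'filesize_approx': 1_200_000},
    {'format_id': 'sb0'},  # neither size is known
]

sizes = sizes_by_format_id(sample_formats)
print(sizes)
```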

emphoeller · 2024-05-20T20:18:09Z

From what you’ve said and my own experimentation, I’ve figured out the following:

If I request a merged format in format, I can see what the individual format selectors resolve to in format_id. E.g. format='bestvideo*+bestaudio' might become info['format_id'] == '244+251'.
If I request multiple formats without a merge in format (,), I see the resolved format for one of them in format_id (always the last?).
The fields filesize, filesize_approx, ext etc. at the top level of info refer to the format in format_id.

So with this knowledge, I can do the following:

def get_filesize(format_id, formats):
    for f in formats:
        if f['format_id'] == format_id:
            # Prefer the exact size; fall back to the estimate if it is missing or None
            return f.get('filesize') or f.get('filesize_approx')

with YoutubeDL({'format': 'bestvideo*+bestaudio'}) as ydl:
    info = ydl.extract_info(some_url, download=False)
    video_format_id, audio_format_id = info['format_id'].split('+')
    video_size = get_filesize(video_format_id, info['formats'])
    audio_size = get_filesize(audio_format_id, info['formats'])
    # Or just look them up in info['requested_formats']

# ... Do logic based on video_size and audio_size ...

with YoutubeDL({'format': 'bestvideo*,bestaudio'}) as ydl:
    ydl.download([some_url])

Notice how I need to create another YoutubeDL object if I don’t want to merge the formats. I am also downloading the metadata for the video twice. Can this be done better?

I was hoping for a way to evaluate any format selector given info. In the above code, the evaluation happens somewhere inside extract_info(), but doing it that way has the following downsides:

The result for multiple selectors is only easily accessible if they are combined with a merge (+).
You are downloading the metadata again each time you check another format selector.

There’s also still my question about why filesize and filesize_approx are sometimes both populated, but with different values.

bashonly · 2024-05-22T14:53:17Z

> I am also downloading the metadata for the video twice. Can this be done better?

With this pattern:

# extract info only
with YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(URL, download=False)

# do things with info dict here

# process and download
with YoutubeDL(ydl_opts) as ydl:
    processed_info = ydl.process_ie_result(info)

> There’s also still my question about why filesize and filesize_approx are sometimes both populated, but with different values.

Can you give an example?

emphoeller · 2024-05-22T22:54:39Z

<aside>

For anyone with the same questions as me, I just realized the following: It is easy to manually evaluate any format selector because the formats are already sorted from worst* to best*. (I haven’t found that documented anywhere; I only realized it upon remembering this.) For example, to evaluate bestvideo*[ext=mp4], one could simply:

def get_best_mp4_video(formats):
    for f in reversed(formats):
        if f.get('vcodec') != 'none' and f.get('ext') == 'mp4':
            return f['format_id']

I have observed acodec not being present – for an audio-only format! This appears to mean that the audio codec is unknown but audio is (possibly?) present. Compare that to a value of 'none', which must mean that the format is known to have no audio. Hence f.get('vcodec') != 'none' in the above code, and not f['vcodec'] != 'none' or f.get('vcodec', 'none') != 'none'.

</aside>
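
The idea in the aside can be generalized into a small helper. This is only a sketch: it relies on the observation above that `formats` is ordered from worst to best, and it approximates rather than reproduces yt-dlp's own selector logic. The names `pick_best`, `has_video`, and `has_audio`, and the sample data, are hypothetical.

```python
def pick_best(formats, predicate):
    """Return the last (assumed best) format satisfying predicate, or None."""
    for f in reversed(formats):
        if predicate(f):
            return f
    return None


def has_video(f):
    # A missing 'vcodec' is treated as "unknown, possibly present".
    return f.get('vcodec') != 'none'


def has_audio(f):
    # A missing 'acodec' is treated as "unknown, possibly present".
    return f.get('acodec') != 'none'


# Hypothetical, hand-written sample ordered from worst to best.
sample = [
    {'format_id': '139', 'ext': 'm4a', 'vcodec': 'none', 'acodec': 'mp4a'},
    {'format_id': '251', 'ext': 'webm', 'vcodec': 'none'},  # acodec unknown
    {'format_id': '134', 'ext': 'mp4', 'vcodec': 'avc1', 'acodec': 'none'},
    {'format_id': '244', 'ext': 'webm', 'vcodec': 'vp9', 'acodec': 'none'},
]

best_mp4_video = pick_best(sample, lambda f: has_video(f) and f.get('ext') == 'mp4')
best_audio_only = pick_best(sample, lambda f: has_audio(f) and f.get('vcodec') == 'none')
print(best_mp4_video['format_id'], best_audio_only['format_id'])
```

Passing the condition as a predicate means one loop serves any filter, and the chosen dict's `filesize` or `filesize_approx` can then be inspected before anything is downloaded.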

> > There’s also still my question about why filesize and filesize_approx are sometimes both populated, but with different values.
>
> Can you give an example?

yt-dlp -O "%(filesize)s, %(filesize_approx)s" -f bv,ba https://youtu.be/BaW_jenozKc

669625, 669624
142527, 142525

I didn’t even know that -O prints for each comma-separated format until now.

> > I am also downloading the metadata for the video twice. Can this be done better?
>
> With this pattern:
>
> …

Great, this works for my current particular case where I know which format I want to download beforehand. What would I do if I wanted to look at all the formats and then choose one?

bashonly · 2024-05-23T14:36:16Z

> What would I do if I wanted to look at all the formats and then choose one?

Select the format(s) by adding the format param to your ydl_opts before instantiating the YoutubeDL that you'll use to call process_ie_result.
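
One way the "choose, then set the format param" step might be sketched is shown here. The selection rule (largest format that fits a size budget) is purely an illustrative assumption, as are the helper name `opts_with_chosen_format` and the sample data. The only part taken from the thread is that the chosen format ID goes into the `format` option.

```python
def opts_with_chosen_format(base_opts, formats, max_bytes):
    """Return a copy of base_opts whose 'format' is the largest format
    within max_bytes, or None if no format with a known size fits."""
    def size(f):
        return f.get('filesize') or f.get('filesize_approx')

    candidates = [f for f in formats if size(f) is not None and size(f) <= max_bytes]
    if not candidates:
        return None
    chosen = max(candidates, key=size)
    # Copy instead of mutating, so base_opts can be reused for other runs.
    return {**base_opts, 'format': chosen['format_id']}


# Hypothetical, hand-written sample; real lists come from info['formats'].
sample_formats = [
    {'format_id': '18', 'filesize': 500},
    {'format_id': '22', 'filesize': None, 'filesize_approx': 1500},
    {'format_id': '137', 'filesize': 3000},
]
base_opts = {'quiet': True}

ydl_opts = opts_with_chosen_format(base_opts, sample_formats, max_bytes=2000)
print(ydl_opts)
```

The resulting dict would then be passed to the second `YoutubeDL(...)` instance, the one on which `process_ie_result(info)` is called in the pattern shown earlier in the thread.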

emphoeller added the question Question label May 20, 2024

bashonly closed this as completed May 25, 2024
