Evaluate format selector before download in Python #9973

Closed
6 of 9 tasks
emphoeller opened this issue May 20, 2024 · 6 comments
Labels
question

Comments

@emphoeller


From within a Python script, I would like to determine which format a specific format selector will select before starting the download. Concretely, I would like to check filesize and filesize_approx of bestvideo* and bestaudio before downloading them. Is there a good way to do this?

There already is the issue [Question] How to extract filesize before download?, but that only discusses template strings. It also brings up the key requested_formats, which doesn’t seem to work here:

from yt_dlp import YoutubeDL

with YoutubeDL({'format': 'bestvideo*,bestaudio'}) as ydl:
    info = ydl.extract_info(some_url, download=False)
    f1, f2 = info['requested_formats'] # KeyError: 'requested_formats'

I’ve also already noticed that info has the keys filesize and filesize_approx, but which size do they refer to? The video, the audio, the merged format? (I didn’t even request a merge…) And why does it have both, with different values?

emphoeller added the question label May 20, 2024
@bashonly
Member

The info dict will only have a requested_formats field if you request formats to be merged with the + operator, e.g. bv+ba
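
A minimal sketch of how a script might handle both cases, assuming (as described in this thread) that a merged selection exposes its components under requested_formats and that an unmerged selection leaves the chosen format's fields at the top level of the info dict. The helper name selected_formats is hypothetical and not part of yt-dlp.

```python
def selected_formats(info):
    """Return the format dicts that the selector resolved to.

    Hypothetical helper, not a yt-dlp API. It relies only on the
    behaviour described in this thread.
    """
    # Merged selection such as 'bv*+ba': one entry per component format.
    requested = info.get('requested_formats')
    if requested:
        return list(requested)
    # Unmerged selection: the top-level fields describe the chosen format.
    return [info]
```

With this, `[f.get('filesize') for f in selected_formats(info)]` works whether or not the selector used `+`.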

@WillFa
WillFa commented May 20, 2024

The "bestvideo" and "bestaudio" selectors are typically the defaults, and yt-dlp records the resolved selection in the format_id field of the info.json. All the filesize_approx, vcodec, acodec, resolution, ext, format_note etc. fields at the root of the JSON object describe this combination of formats.

All the per-format filesize(_approx) values are in the .formats branch. There are about 20 entries, and you could loop through them, building a dictionary that maps each format_id to its .filesize or .filesize_approx. The .formats branch is what builds the table you see when you pass the -F option to yt-dlp.
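
The loop described above could look roughly like this. It is a sketch that assumes each entry in info['formats'] is a dict with a format_id and optional filesize / filesize_approx keys; the function name sizes_by_format_id is made up for illustration.

```python
def sizes_by_format_id(formats):
    """Map each format_id to its exact size, else its approximate size, else None."""
    sizes = {}
    for f in formats:
        size = f.get('filesize')
        if size is None:
            # Fall back to the estimate when the exact size is unknown.
            size = f.get('filesize_approx')
        sizes[f['format_id']] = size
    return sizes
```

Called as `sizes_by_format_id(info['formats'])`, this gives a lookup table that can be indexed with the IDs found in info['format_id'].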

@emphoeller
Author

From what you’ve said and my own experimentation, I’ve figured out the following:

  • If I request a merged format in format, I can see what the individual format selectors resolve to in format_id. E.g. format='bestvideo*+bestaudio' might become info['format_id'] == '244+251'.
  • If I request multiple formats without a merge in format (,), I see the resolved format for one of them in format_id (always the last?).
  • The fields filesize, filesize_approx, ext etc. at the top level of info refer to the format in format_id.

So with this knowledge, I can do the following:

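from yt_dlp import YoutubeDL

def get_filesize(format_id, formats):
    # Prefer the exact size; fall back to the estimate when it is missing or None
    for f in formats:
        if f['format_id'] == format_id:
            return f.get('filesize') or f.get('filesize_approx')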

with YoutubeDL({'format': 'bestvideo*+bestaudio'}) as ydl:
    info = ydl.extract_info(some_url, download=False)
    video_format_id, audio_format_id = info['format_id'].split('+')
    video_size = get_filesize(video_format_id, info['formats'])
    audio_size = get_filesize(audio_format_id, info['formats'])
    # Or just look them up in info['requested_formats']

# ... Do logic based on video_size and audio_size ...

with YoutubeDL({'format': 'bestvideo*,bestaudio'}) as ydl:
    ydl.download([some_url])

Notice how I need to create another YoutubeDL object if I don’t want to merge the formats. I am also downloading the metadata for the video twice. Can this be done better?

I was hoping for a way to evaluate any format selector given info. In the above code, the evaluation happens somewhere inside extract_info(), but doing it that way has the following downsides:

  • The result for multiple selectors is only easily accessible if they are combined with a merge (+).
  • You are downloading the metadata again each time you check another format selector.

There’s also still my question about why filesize and filesize_approx are sometimes both populated, but with different values.

@bashonly
Member

> I am also downloading the metadata for the video twice. Can this be done better?

With this pattern:

# extract info only
with YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(URL, download=False)

# do things with info dict here

# process and download
with YoutubeDL(ydl_opts) as ydl:
    processed_info = ydl.process_ie_result(info)
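
As a sketch of how the size check from the original question might slot into this pattern: the function below extracts once, inspects the top-level size, and only then hands the same info dict to process_ie_result. It assumes ydl_opts selects a single, unmerged format, so that the top-level filesize fields describe it. The name download_if_small, the max_bytes parameter, and the ydl_cls hook (present only so the logic can be exercised without network access) are all hypothetical.

```python
def download_if_small(url, ydl_opts, max_bytes, ydl_cls=None):
    """Extract metadata once, then download only if the size fits max_bytes."""
    if ydl_cls is None:
        # Imported lazily so a stand-in class can be supplied instead.
        from yt_dlp import YoutubeDL
        ydl_cls = YoutubeDL

    # Step 1: extract info only.
    with ydl_cls(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)

    # Step 2: inspect the info dict. Unknown sizes are treated as "too big",
    # which is a policy choice of this sketch.
    size = info.get('filesize') or info.get('filesize_approx')
    if size is None or size > max_bytes:
        return None

    # Step 3: process and download, reusing the already extracted info.
    with ydl_cls(ydl_opts) as ydl:
        return ydl.process_ie_result(info)
```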

> There’s also still my question about why filesize and filesize_approx are sometimes both populated, but with different values.

Can you give an example?

@emphoeller
Author

<aside>

For anyone with the same questions as me, I just realized the following: It is easy to manually evaluate any format selector because the formats are already sorted from worst* to best*. (I haven’t found that documented anywhere; I only realized it upon remembering this.) For example, to evaluate bestvideo*[ext=mp4], one could simply:

def get_best_mp4_video(formats):
    for f in reversed(formats):
        if f.get('vcodec') != 'none' and f.get('ext') == 'mp4':
            return f['format_id']

I have observed acodec not being present – for an audio-only format! This appears to mean that the audio codec is unknown but audio is (possibly?) present. Compare that to a value of 'none', which must mean that the format is known to have no audio. Assuming vcodec behaves the same way, a missing vcodec should count as possible video. Hence f.get('vcodec') != 'none' in the above code, and not f['vcodec'] != 'none' or f.get('vcodec', 'none') != 'none'.

</aside>
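
The same idea can be generalized with a predicate. This is only a sketch built on the observation above that formats is ordered from worst to best. It is not yt-dlp's own selector logic, and the names pick_best, has_video, and is_audio_only are made up for illustration.

```python
def pick_best(formats, predicate):
    """Return the best-ranked format dict matching predicate, or None."""
    # Walk from best to worst, relying on the worst-to-best ordering.
    for f in reversed(formats):
        if predicate(f):
            return f
    return None


def has_video(f):
    # A missing vcodec is treated as "possibly has video".
    return f.get('vcodec') != 'none'


def is_audio_only(f):
    # Known to have no video; a missing acodec is treated as "possibly has audio".
    return f.get('vcodec') == 'none' and f.get('acodec') != 'none'
```

For example, `pick_best(info['formats'], lambda f: has_video(f) and f.get('ext') == 'mp4')` approximates bestvideo*[ext=mp4], and `pick_best(info['formats'], is_audio_only)` approximates bestaudio.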

> > There’s also still my question about why filesize and filesize_approx are sometimes both populated, but with different values.
>
> can you give an example

yt-dlp -O "%(filesize)s, %(filesize_approx)s" -f bv,ba https://youtu.be/BaW_jenozKc
669625, 669624
142527, 142525

I didn’t even know that -O prints for each comma-separated format until now.

> > I am also downloading the metadata for the video twice. Can this be done better?
>
> with this pattern:

Great, this works for my current particular case where I know which format I want to download beforehand. What would I do if I wanted to look at all the formats and then choose one?

@bashonly
Member

> What would I do if I wanted to look at all the formats and then choose one?

Select the format(s) by adding the format param to your ydl_opts before instantiating the YoutubeDL that you'll use to call process_ie_result.
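
One way this might look in practice, as a sketch: choose a format_id from info['formats'] with your own logic, then build the options for the second YoutubeDL with that ID as the format param. The size-budget rule and the names choose_format_id and opts_for_format are hypothetical, and the ranking again relies on the worst-to-best ordering observed earlier in this thread.

```python
def choose_format_id(formats, max_bytes):
    """Return the format_id of the best-ranked format whose known size fits max_bytes."""
    for f in reversed(formats):
        size = f.get('filesize') or f.get('filesize_approx')
        # Formats with unknown size are skipped in this sketch.
        if size is not None and size <= max_bytes:
            return f['format_id']
    return None


def opts_for_format(base_opts, format_id):
    """Copy base_opts with the format param set to the chosen format_id."""
    return {**base_opts, 'format': format_id}
```

The result of `opts_for_format(ydl_opts, choose_format_id(info['formats'], max_bytes))` would then be passed to the YoutubeDL instance that calls process_ie_result(info), after checking that a format was actually found.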

3 participants