-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add progress bar for to_arrow method #352
Conversation
google/cloud/bigquery/job.py
Outdated
@@ -20,6 +20,9 @@ | |||
import copy | |||
import re | |||
import threading | |||
import tqdm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be an optional import. See:
python-bigquery/google/cloud/bigquery/table.py
Lines 39 to 42 in 20f473b
try: | |
import tqdm | |
except ImportError: # pragma: NO COVER | |
tqdm = None |
google/cloud/bigquery/job.py
Outdated
@@ -3275,6 +3283,27 @@ def result( | |||
rows._preserve_order = _contains_order_by(self.query) | |||
return rows | |||
|
|||
def _get_progress_bar(self, progress_bar_type, description, total, unit): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this is very close to the logic in table.py
python-bigquery/google/cloud/bigquery/table.py
Line 1374 in 20f473b
def _get_progress_bar(self, progress_bar_type): |
I'd suggest creating a _tqdm_helpers.py
module similar to our pandas helpers to hold this logic for both Table and Job logic.
…into bigquery_issue_343
google/cloud/bigquery/job.py
Outdated
@@ -3337,7 +3339,49 @@ def to_arrow( | |||
|
|||
..versionadded:: 1.17.0 | |||
""" | |||
return self.result().to_arrow( | |||
if self.query_plan and progress_bar_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this logic necessary if we are calling to_arrow
? Can't we rely on to_arrow
's progress bar support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't get this point, are you talking about table.to_arrow() progress bar support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was getting this confused with RowIterator
, where to_dataframe
relies on to_arrow
's progress bar support. I do wonder if some of this "waiting for the query to finish" logic could be refactored into a _tqdm_helpers.py
function, since it should be identical between to_dataframe
and to_arrow
google/cloud/bigquery/job.py
Outdated
@@ -3406,7 +3450,46 @@ def to_dataframe( | |||
Raises: | |||
ValueError: If the `pandas` library cannot be imported. | |||
""" | |||
return self.result().to_dataframe( | |||
query_plan = self.query_plan | |||
if query_plan and progress_bar_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
query_plan
can get updated as the job progresses. I'd prefer if this always created a progress bar, but had a generic message when the job is still pending and there isn't a query plan.
google/cloud/bigquery/job.py
Outdated
) | ||
|
||
try: | ||
query_result = self.result(timeout=0.5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make constants for this (PROGRESS_BAR_UPDATE_INTERVAL
, for example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. A few nits regarding naming and indentation.
return None | ||
|
||
|
||
def _query_job_result_helper(query_job, progress_bar_type=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use active verbs in this helper name.
def _query_job_result_helper(query_job, progress_bar_type=None): | |
def wait_for_query(query_job, progress_bar_type=None): |
|
||
def _query_job_result_helper(query_job, progress_bar_type=None): | ||
"""Return query result and display a progress bar while the query running, if tqdm is installed.""" | ||
if progress_bar_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can reduce the level of indentation if we exit early. Also, I think we'll want to pass through some keyword arguments to result()
, correct?
if progress_bar_type: | |
if progress_bar_type is None: | |
return query_job.result() |
Aside (not relevant for this PR): we'll eventually want to pass additional arguments to result()
whenever we implement #296
_PROGRESS_BAR_UPDATE_INTERVAL = 0.5 | ||
|
||
|
||
def _get_progress_bar(progress_bar_type, description, total, unit): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to use these helpers from other modules, let's remove the (redundant because of private module) leading _
.
def _get_progress_bar(progress_bar_type, description, total, unit): | |
def get_progress_bar(progress_bar_type, description, total, unit): |
progress_bar = _get_progress_bar( | ||
progress_bar_type, "Query is running", 1, "query" | ||
) | ||
if query_job.query_plan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be one level deeper. I'd like to see the while True
loop, even when query_plan
is not initially populated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect the while
loop to contain the following steps:
- Call
result()
with a short-ish timeout. - Call
reload()
- Update the progress bar (different cases for
query_plan
present or not)
"""Return query result and display a progress bar while the query running, if tqdm is installed.""" | ||
if progress_bar_type is None: | ||
query_result = query_job.result() | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that you could return query_job.result()
here. The else
statement is then unnecessary, saving us 1 level of indentation.
i += 1 | ||
continue | ||
else: | ||
query_result = query_job.result() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a timeout here. It's not clear to me why the above try
block is even in the if
statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the only difference is the presence of total
, which could be set to a default value in the case where query_plan
is not present.
while True: | ||
if query_job.query_plan: | ||
total = len(query_job.query_plan) | ||
query_job.reload() # Refreshes the state via a GET request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we only want to call reload
after result
times out?
Also, why would we only call reload
inside this if
statement? The job might not have a query_plan
until after reload
is called in many cases.
system test failed not related to changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it!
P.S. I have opened #388 to fix those failing tests.
Fixes #343
PR open for the feedback and suggestions.
Currently implemented for 'QueryJob.to_arrow' method
When the
cacheHit
for result is true at timequery_plan
is blank so can't implement progress bar.